Identification of Probable Real Words: An Entropy-Based Approach

This paper proposes a method for identifying probable real words among out-of-vocabulary (OOV) words in text. Real words are identified using the entropy of character trigram probabilities together with the morphological rules of English. The method also generates possible parts of speech (POS) for the identified real words on the basis of lexical formation rules and word endings. It achieves high precision and high recall. The method is particularly useful for recognizing domain-specific technical terms, and it has been successfully embedded in a glossary extraction system that identifies single-word and multi-word glossary items and builds a domain-specific dictionary.
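To make the trigram idea concrete, here is a minimal sketch of scoring a candidate string with a character trigram model. It is not the formula used in the report, which this abstract does not give. The sketch computes the average surprisal (per-trigram cross-entropy) of a word under a model trained on known words, so lower scores suggest a more word-like string. The function names, the boundary markers `^` and `$`, the add-one smoothing, and the `alphabet_size` value are all assumptions made for illustration.

```python
import math
from collections import Counter


def trigrams(word):
    # Pad with boundary markers (an assumption of this sketch) so that
    # word-initial and word-final letter patterns are modeled too.
    padded = "^^" + word.lower() + "$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(words):
    # Count each trigram and its two-character context.
    tri, ctx = Counter(), Counter()
    for w in words:
        for t in trigrams(w):
            tri[t] += 1
            ctx[t[:2]] += 1
    return tri, ctx


def avg_surprisal(word, tri, ctx, alphabet_size=27):
    # Mean of -log2 P(c3 | c1 c2), with add-one smoothing so unseen
    # trigrams get a small nonzero probability.
    # alphabet_size = 26 letters + the end marker (hypothetical choice).
    grams = trigrams(word)
    total = 0.0
    for t in grams:
        p = (tri[t] + 1) / (ctx[t[:2]] + alphabet_size)
        total -= math.log2(p)
    return total / len(grams)


# Toy vocabulary, for illustration only; a real system would train
# on a large dictionary.
known = ["nation", "station", "relation", "creation", "motion",
         "notion", "action", "fraction", "traction", "rational"]
tri, ctx = train(known)
for candidate in ["ration", "xqzjkw"]:
    print(candidate, round(avg_surprisal(candidate, tri, ctx), 2))
```

With this toy vocabulary, "ration" scores lower than "xqzjkw" because its trigrams were seen in training. A threshold on the score, combined with morphological checks, would be one way to separate probable real words from noise, though the thresholds and rules used in the report are not stated here.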

By: Youngja Park

Published in: RC22635 in 2002

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).
