Recognition of New Words Based on Entropy and Morphological Rules

No lexicon can be expected to contain every possible word of a language, given the dynamic nature of languages and the creativity of human beings. Words unknown to the lexicon cause many problems for natural language processing (NLP) systems that perform tasks dependent on lexical information, such as part-of-speech tagging and terminology identification. Recent advances in technology have accelerated the creation of new words, especially domain-specific ones, so timely and accurate recognition of new words is very important for building reliable NLP systems.

This paper proposes methods for identifying probable real words among out-of-vocabulary (OOV) words in text and for generating possible parts of speech (POS) for the discovered new words. New words are identified using the morphological rules of the language for derived words and the entropy of character trigrams for newly coined words. POS guessing relies on lexical formation rules for derived words and on word endings for newly coined words. The proposed methods show promising results in both precision and recall.
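The abstract does not give the report's exact entropy formulation, so the following is only a minimal sketch of the general idea: train a character trigram model on a known lexicon and score an OOV string by its average per-trigram cross-entropy, where lower values suggest a more word-like string. The class name `TrigramModel`, the boundary markers `^` and `$`, the add-one smoothing, and the toy lexicon are all assumptions made for illustration, not details taken from the report.

```python
import math
from collections import Counter


def trigrams(word):
    # Pad with boundary markers (an assumption of this sketch) so that
    # word-initial and word-final character patterns are also modeled.
    padded = "^^" + word.lower() + "$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class TrigramModel:
    """Hypothetical character trigram model with add-one smoothing."""

    def __init__(self, lexicon):
        self.tri = Counter()      # counts of each character trigram
        self.ctx = Counter()      # counts of each two-character context
        self.alphabet = {"$"}     # characters that can be predicted
        for w in lexicon:
            self.alphabet.update(w.lower())
            for t in trigrams(w):
                self.tri[t] += 1
                self.ctx[t[:2]] += 1

    def prob(self, t):
        # P(third char | first two chars), add-one smoothed so that
        # unseen trigrams still receive a small non-zero probability.
        return (self.tri[t] + 1) / (self.ctx[t[:2]] + len(self.alphabet))

    def cross_entropy(self, word):
        # Average negative log2 probability per trigram, in bits.
        ts = trigrams(word)
        return -sum(math.log2(self.prob(t)) for t in ts) / len(ts)


# Toy lexicon, invented for this example only.
lexicon = ["nation", "station", "relation", "creation",
           "action", "national", "rational"]
model = TrigramModel(lexicon)

wordlike = model.cross_entropy("notation")
noise = model.cross_entropy("xqzjkv")
print(wordlike < noise)
```

In a real system the model would be trained on a full lexicon, and a threshold on the score (chosen on held-out data) would separate probable new words from typos and noise; the threshold itself is not specified in the abstract.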

By: Youngja Park

Published in: RC22978 in 2003

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).
