A word-based Japanese language model

This paper describes a word-based language model for Japanese. In Japanese, word boundaries are unstable, and grammatical units do not necessarily coincide with human intuition. Accurate segmentation therefore requires a vocabulary that covers the units of human utterance. In our word-segmentation method, a word-boundary model is described by morphological parameters (i.e., part of speech), which are learned by comparing the results of human segmentation with those of a Japanese morphological analyzer. Using this model and pseudo-random numbers, we then decide whether each morpheme transition is a word boundary. As a result, a vocabulary and training data for a Japanese language model are obtained automatically. In our experiments on articles from three newspapers and texts posted to network-based forums, about 44,000 words cover 94-98% of all words in the test data, and the average number of words per sentence is 12-19% smaller than the average number of morphemes. The parameters of the word-segmentation model and of the language model differ considerably between newspaper articles and forum texts. However, the difference lies not in the probabilities of common events but in the kinds of events that occur. Therefore, a language model created from both newspaper articles and forum texts gave satisfactory results on both test sets.
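The segmentation procedure summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the boundary model is a table of probabilities indexed by the part-of-speech pair on either side of a morpheme transition, estimated as relative frequencies from human-segmented data; the abstract only states that the parameters are morphological (part of speech). The function names, the part-of-speech labels, and the default probability for unseen pairs are all hypothetical.

```python
import random
from collections import defaultdict


def learn_boundary_probs(samples):
    """Estimate P(word boundary | left POS, right POS) by relative frequency.

    samples: iterable of (morphemes, boundaries), where morphemes is a list
    of (surface, pos) pairs from a morphological analyzer and boundaries is a
    list of booleans, one per morpheme transition, that is True where the
    human segmentation placed a word boundary. (Assumed data layout.)
    """
    counts = defaultdict(lambda: [0, 0])  # POS pair -> [boundaries, transitions]
    for morphemes, boundaries in samples:
        for i, is_boundary in enumerate(boundaries):
            key = (morphemes[i][1], morphemes[i + 1][1])
            counts[key][0] += int(is_boundary)
            counts[key][1] += 1
    return {key: b / n for key, (b, n) in counts.items()}


def segment(morphemes, probs, rng, default=1.0):
    """Merge morphemes into words by sampling each transition.

    A pseudo-random number is drawn for every morpheme transition; the
    transition becomes a word boundary when the number falls below the
    learned probability. Unseen POS pairs use `default` (an assumption:
    treat them as boundaries).
    """
    if not morphemes:
        return []
    words = []
    current = morphemes[0][0]
    for (_, left_pos), (surface, right_pos) in zip(morphemes, morphemes[1:]):
        p = probs.get((left_pos, right_pos), default)
        if rng.random() < p:
            words.append(current)   # boundary: close the current word
            current = surface
        else:
            current += surface      # no boundary: extend the current word
    words.append(current)
    return words


# Toy example with hypothetical POS labels.
samples = [
    (
        [("行き", "verb"), ("まし", "aux"), ("た", "aux"), ("。", "symbol")],
        [False, False, True],
    )
]
probs = learn_boundary_probs(samples)
print(segment(samples[0][0], probs, random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which matters when the segmented output is reused as training data for the language model.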

By: N. Itoh, M. Nishimura, S. Ogino, and K. Yamasaki

Published in: RT0288 in 2002

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

rt0288.pdf
