Efficient Domain-Adaptive Word Segmentation with Larger Context and Co-Training

Word segmentation is important for many natural language processing tasks. In this paper we present a new word segmentation approach that efficiently and effectively combines larger-context features at the word and character levels with the local features typically used in CRF/MaxEnt models. We compare several feature and model combination strategies on multiple word segmentation tasks. In addition, we propose a co-training method that trains domain-specific language models (LMs) from unlabeled in-domain data. Because LM training is much cheaper than CRF/MaxEnt model training, this enables rapid development of domain-adaptive word segmenters with superior performance, matching or exceeding the state of the art on several Chinese word segmentation tasks. When we apply the improved segmenter to statistical machine translation, we observe consistent BLEU improvements on English-Chinese translation test sets from multiple domains.
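
To make the feature combination concrete, below is a minimal, illustrative sketch (not the authors' code) of how local character n-gram features for a CRF-style segmenter might be combined with larger-context features derived from a domain-specific word LM trained on unlabeled in-domain text. The function names, the feature templates, and the lm_logprob scoring interface are assumptions made for illustration only.

```python
# Illustrative sketch: local character features plus hypothetical
# larger-context features from a word-level language model (LM).
from typing import Callable, Dict, List


def local_features(chars: List[str], i: int) -> Dict[str, str]:
    """Typical local CRF features: the current character and its neighbors."""
    pad = lambda j: chars[j] if 0 <= j < len(chars) else "<PAD>"
    return {
        "c0": pad(i),
        "c-1": pad(i - 1),
        "c+1": pad(i + 1),
        "c-1c0": pad(i - 1) + pad(i),
        "c0c+1": pad(i) + pad(i + 1),
    }


def context_features(chars: List[str], i: int,
                     lm_logprob: Callable[[str], float],
                     max_word_len: int = 4) -> Dict[str, str]:
    """Hypothetical larger-context features: bucketed LM scores of candidate
    words ending at position i, scored by a domain-specific word LM."""
    feats: Dict[str, str] = {}
    for length in range(1, max_word_len + 1):
        if i - length + 1 < 0:
            break
        candidate = "".join(chars[i - length + 1: i + 1])
        # Bucket the log-probability so it can serve as a discrete feature.
        feats[f"lm_w{length}"] = str(round(lm_logprob(candidate)))
    return feats


def sentence_features(sentence: str,
                      lm_logprob: Callable[[str], float]) -> List[Dict[str, str]]:
    """Per-character feature dicts combining local and larger-context features."""
    chars = list(sentence)
    return [{**local_features(chars, i),
             **context_features(chars, i, lm_logprob)}
            for i in range(len(chars))]


if __name__ == "__main__":
    # Toy stand-in for a real domain-specific word LM score.
    toy_lm = lambda w: -2.0 * len(w)
    for row in sentence_features("中文分词", toy_lm):
        print(row)
```

In this sketch the LM-derived features can be recomputed from a new in-domain LM without retouching the local feature templates, which mirrors the paper's point that cheap LM retraining enables rapid domain adaptation.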

By: Fei Huang, Abraham Ittycheriah, Salim Roukos

Published in: RC25411 in 2013

rc25411.pdf
