We present a novel approach to domain adaptation for statistical machine translation (SMT) that clusters bilingual data at multiple levels. We first merge bilingual corpora into topic-relevant clusters based on multiple features that capture corpus similarity. After this initial corpus-level clustering, we select the most representative phrase pairs and sentence pairs for each cluster as seed data, then apply a refined sentence-level clustering using that seed data. As each sentence pair is reassigned to its most likely cluster, the seed data for each cluster grows and the models are updated iteratively. At decoding time, for each input sentence we select the top K most relevant clusters and combine their phrase tables with the baseline phrase table using dynamic weights. Experiments show gains of 1.0 to 2.0 BLEU points on various test sets over an English-to-Chinese baseline system built with general models. Similar improvements are observed on a Chinese-to-English MT system.
By: Fei Huang, Bing Xiang
Published in: RC25416 in 2013
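The decoding step described in the abstract (select the top K clusters for an input sentence, then combine their phrase tables with the baseline using dynamic weights) can be illustrated with a minimal sketch. The abstract does not specify the relevance measure or the weighting scheme, so everything here is an assumption made for illustration: relevance is cosine similarity over bag-of-words counts, the combination is linear interpolation, and the names `select_top_k`, `combine_phrase_tables`, and `baseline_weight` are hypothetical rather than taken from the report.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_top_k(sentence, cluster_profiles, k):
    """Return the k clusters most similar to the sentence as (name, score) pairs.

    Assumption: each cluster is summarized by a word-count profile.
    """
    query = Counter(sentence.lower().split())
    scored = [(name, cosine(query, profile))
              for name, profile in cluster_profiles.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def combine_phrase_tables(baseline, cluster_tables, selected, baseline_weight=0.5):
    """Linearly interpolate the baseline table with the selected cluster tables.

    Assumption: the baseline keeps a fixed weight, and the selected clusters
    split the remainder in proportion to their similarity scores, so the
    weights change with every input sentence.
    """
    total = sum(score for _, score in selected)
    if total == 0:
        # No cluster is relevant: fall back to the baseline table unchanged.
        return dict(baseline)
    combined = {pair: baseline_weight * prob for pair, prob in baseline.items()}
    for name, score in selected:
        weight = (1.0 - baseline_weight) * score / total
        for pair, prob in cluster_tables[name].items():
            combined[pair] = combined.get(pair, 0.0) + weight * prob
    return combined


# Toy data (hypothetical): three topic clusters and one ambiguous source word.
profiles = {
    "travel": Counter({"hotel": 3, "flight": 2}),
    "finance": Counter({"bank": 3, "loan": 2}),
    "sports": Counter({"team": 3, "match": 2}),
}
tables = {
    "travel": {("bank", "riverbank"): 0.5, ("bank", "institution"): 0.5},
    "finance": {("bank", "institution"): 1.0},
    "sports": {},
}
baseline_table = {("bank", "institution"): 0.6, ("bank", "riverbank"): 0.4}

selected = select_top_k("book a hotel near the bank", profiles, k=2)
adapted = combine_phrase_tables(baseline_table, tables, selected)
```

Because the cluster weights are recomputed from the similarity scores of each input sentence, a sentence that matches no cluster leaves the baseline table untouched, while a strongly in-domain sentence shifts probability mass toward that cluster's translations.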