Towards Cross-Lingual Sentence Similarity Comparison: A Generalized Non-Negative Matrix Factorization Approach

We describe a novel approach to cross-lingual sentence similarity comparison based on the joint factorization of language-specific document-term matrices. Our approach takes two language-specific document-term matrices representing a parallel corpus and obtains language-specific projections intended to map sentences from their respective languages into a common reduced-rank subspace. These projections are obtained using a joint multiplicative non-negative matrix factorization with sparsity constraints that minimizes the joint divergence. We demonstrate the technique on a cross-lingual SMT/human translation classification task, where it attains 73% accuracy, which is significantly better than the accuracy previously obtained using a long-span n-gram classifier.
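
The general idea of a joint factorization with a shared subspace can be illustrated with a minimal sketch. This is not the report's implementation. It assumes the generalized Kullback-Leibler divergence as the objective, uses the standard Lee-Seung multiplicative updates, shares one document factor W across both languages (which is equivalent to factorizing the column-wise concatenation of the two matrices), and omits the sparsity constraints entirely. The function names (joint_nmf, project, cosine, kl_divergence) and all parameter values are hypothetical.

```python
import numpy as np

EPS = 1e-9  # small constant to avoid division by zero (an assumption of this sketch)


def kl_divergence(X, WH):
    """Generalized KL divergence D(X || WH), summed over all entries."""
    X = np.asarray(X, dtype=float)
    return float(np.sum(X * np.log((X + EPS) / (WH + EPS)) - X + WH))


def joint_nmf(X1, X2, rank, n_iter=200, seed=0):
    """Factorize X1 ~ W @ H1 and X2 ~ W @ H2 with a shared non-negative W.

    X1, X2: document-term matrices (rows are aligned parallel sentences).
    Sharing W is the same as running NMF on the concatenation [X1 X2].
    """
    rng = np.random.default_rng(seed)
    X = np.hstack([X1, X2]).astype(float)
    n_docs, n_terms = X.shape
    W = rng.random((n_docs, rank)) + 0.1
    H = rng.random((rank, n_terms)) + 0.1
    for _ in range(n_iter):
        # Multiplicative update for H under generalized KL divergence.
        R = X / (W @ H + EPS)
        H *= (W.T @ R) / (W.sum(axis=0)[:, None] + EPS)
        # Multiplicative update for W, using the refreshed H.
        R = X / (W @ H + EPS)
        W *= (R @ H.T) / (H.sum(axis=1)[None, :] + EPS)
    n1 = X1.shape[1]
    return W, H[:, :n1], H[:, n1:]


def project(x, H, n_iter=200):
    """Map one term-count vector into the common subspace, keeping H fixed."""
    x = np.asarray(x, dtype=float)[None, :]
    w = np.full((1, H.shape[0]), 1.0 / H.shape[0])
    for _ in range(n_iter):
        w *= ((x / (w @ H + EPS)) @ H.T) / (H.sum(axis=1)[None, :] + EPS)
    return w.ravel()


def cosine(a, b):
    """Cosine similarity between two subspace vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + EPS))
```

Under these assumptions, a sentence in one language and a candidate translation in the other would each be projected with its own language's factor (H1 or H2), and the two resulting vectors compared with cosine similarity.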

By: Juan M. Huerta

Published in: RC25273 in 2012


