Towards Cross-Lingual Sentence Similarity Comparison: A Generalized Non-Negative Matrix Factorization Approach

We describe a novel approach to cross-lingual sentence similarity comparison based on the joint factorization of language-specific document-term matrices. The approach takes two language-specific document-term matrices representing a parallel corpus and obtains language-specific projections that map sentences from their respective languages into a common reduced-rank subspace. These projections are obtained using a joint multiplicative non-negative matrix factorization with sparsity constraints that minimizes the joint divergence. We demonstrate the technique on a cross-lingual SMT/human translation classification task, where it attains 73% accuracy, significantly better than the accuracy previously obtained using a long-span n-gram classifier.
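The abstract does not give the update rules, so the following is only a minimal sketch of one way such a joint factorization could look, not the report's actual algorithm. It assumes the two document-term matrices share their rows (one row per aligned sentence pair), uses the standard multiplicative updates for the generalized Kullback-Leibler divergence, and models the sparsity constraint as an L1 penalty on the shared sentence representation. The names joint_nmf, project, cosine, kl_divergence and the parameter sparsity are hypothetical.

```python
import numpy as np

EPS = 1e-9  # guards against division by zero


def kl_divergence(X, Y):
    """Generalized KL divergence D(X || Y) summed over all entries."""
    return float(np.sum(X * np.log((X + EPS) / (Y + EPS)) - X + Y))


def joint_nmf(X1, X2, rank, n_iter=200, sparsity=0.0, seed=0):
    """Factor X1 ~ H @ W1 and X2 ~ H @ W2 with one shared non-negative H.

    X1, X2: document-term matrices with the same number of rows
            (aligned sentences), assumed non-negative.
    sparsity: assumed L1 weight on H; 0 gives the plain KL updates.
    """
    rng = np.random.default_rng(seed)
    v1 = X1.shape[1]
    # Sharing H is equivalent to factoring the column-wise concatenation.
    X = np.hstack([X1, X2]).astype(float)
    H = rng.random((X.shape[0], rank)) + EPS
    W = rng.random((rank, X.shape[1])) + EPS
    for _ in range(n_iter):
        R = X / (H @ W + EPS)
        H *= (R @ W.T) / (W.sum(axis=1) + sparsity + EPS)
        R = X / (H @ W + EPS)
        W *= (H.T @ R) / (H.sum(axis=0)[:, None] + EPS)
    return H, W[:, :v1], W[:, v1:]


def project(x, W_lang, n_iter=100, sparsity=0.0):
    """Map term-count rows x of one language into the common subspace,
    holding that language's basis W_lang fixed."""
    x = np.atleast_2d(x).astype(float)
    h = np.ones((x.shape[0], W_lang.shape[0]))
    for _ in range(n_iter):
        R = x / (h @ W_lang + EPS)
        h *= (R @ W_lang.T) / (W_lang.sum(axis=1) + sparsity + EPS)
    return h


def cosine(a, b):
    """Cosine similarity between two subspace vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + EPS))
```

Under these assumptions, a sentence in the first language and a candidate translation in the second would be compared by projecting each with its own basis (project(x1, W1) and project(x2, W2)) and taking the cosine of the two resulting vectors. How the report turns such a score into the SMT/human classification decision is not stated in the abstract.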

By: Juan M. Huerta

Published in: RC25273 in 2012

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).
