Linguini: Language Identification for Multilingual Documents

        This paper presents Linguini, a vector-space-based categorizer tailored for high-precision language identification. We show how its accuracy depends on the size of the input document, the set of languages under consideration, and the features used. We found that Linguini could identify, with 100% accuracy, the language of documents as short as 5-10% of the size of average Web documents.
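
As an illustration of what a vector-space language categorizer does, the following is a minimal sketch and not Linguini's actual implementation. It assumes character trigrams as features, raw counts as weights, and cosine similarity as the match score; the abstract does not specify any of these. The names `ngram_vector`, `cosine`, `identify`, and the tiny `TOY_SAMPLES` training texts are hypothetical and exist only for this example.

```python
import math
from collections import Counter


def ngram_vector(text, n=3):
    """Count overlapping character n-grams in whitespace-normalized, lowercased text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy training data: one short sample per language.
# A real system would build each language vector from a large corpus.
TOY_SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat "
               "is in the house with the children",
    "french": "le renard brun saute par dessus le chien paresseux et le chat "
              "est dans la maison avec les enfants",
    "german": "der schnelle braune fuchs springt über den faulen hund und "
              "die katze ist im haus mit den kindern",
}

# One feature vector per language.
PROFILES = {lang: ngram_vector(text) for lang, text in TOY_SAMPLES.items()}


def identify(text):
    """Return (best_language, scores), scoring the text against every language vector."""
    doc = ngram_vector(text)
    scores = {lang: cosine(doc, profile) for lang, profile in PROFILES.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    best, scores = identify("le chat est dans la maison")
    print(best, scores)
```

With so little training text the scores are only indicative; the abstract's point that accuracy depends on input length, language set, and feature choice applies to any such sketch.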

        We also show how to determine whether a document is written in two or more languages, and in what proportions, without incurring any appreciable computational overhead beyond the monolingual analysis. The same approach can be applied to subject-categorization systems: when such a system recommends two or more categories, it distinguishes documents that belong strongly to all of them from documents that belong to none.
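
One way to picture estimating language proportions cheaply is sketched below. It is an assumption made for illustration, not the method the report describes: the document vector is fitted by least squares as a non-negative combination of two unit-length language vectors, which needs only three dot products, two of which a monolingual cosine match would already have computed. The names `dot`, `unit`, and `mixture_proportions` are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(v * b.get(k, 0.0) for k, v in a.items())


def unit(v):
    """Scale a sparse vector to unit length (unchanged copy if it is all zeros)."""
    norm = math.sqrt(dot(v, v))
    return {k: x / norm for k, x in v.items()} if norm else dict(v)


def mixture_proportions(doc, lang1, lang2):
    """Estimate the share of doc explained by each of two language vectors.

    Solves the 2x2 least-squares system for doc ~ x*f1 + y*f2, where f1 and
    f2 are the unit-length language vectors, clips negative coefficients to
    zero, and normalizes the result so the two shares sum to 1.
    """
    f1, f2 = unit(lang1), unit(lang2)
    a, b, c = dot(doc, f1), dot(doc, f2), dot(f1, f2)
    det = 1.0 - c * c
    if det < 1e-12:
        raise ValueError("language vectors are (nearly) identical")
    x = max((a - c * b) / det, 0.0)
    y = max((b - c * a) / det, 0.0)
    total = x + y
    if total == 0.0:
        return 0.0, 0.0  # the document matches neither language vector
    return x / total, y / total


if __name__ == "__main__":
    english = {"the": 1.0}
    french = {"le": 1.0}
    print(mixture_proportions({"the": 3.0, "le": 1.0}, english, french))
```

The shares are relative weights on the normalized language vectors, so they approximate rather than measure the fraction of text in each language.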

By: John M. Prager

Published in: RC21347 in 1998

This Research Report is not available electronically. Please request a copy from the contact listed below. IBM employees should contact ITIRC for a copy.

Questions about this service can be mailed to reports@us.ibm.com .