Performance Prediction for Exponential Language Models

We investigate the task of performance prediction for language models belonging to the exponential family. First, we attempt to empirically discover a formula for predicting test set cross-entropy for n-gram language models. We build models over varying domains, data set sizes, and n-gram orders, and perform linear regression to see whether we can model test set performance as a simple function of training set performance and various model statistics. Remarkably, we discover a very simple relationship that predicts test performance with a correlation of 0.9996. We provide analysis of why this relationship holds, and show how this relationship can be used to motivate two heuristics for improving existing language models. We use the first heuristic to develop a novel class-based language model that outperforms a baseline word trigram model by up to 28% in perplexity and 2.1% absolute in speech recognition word-error rate on Wall Street Journal data. We use the second heuristic to provide a new motivation for minimum discrimination information (MDI) models (Della Pietra et al., 1992), and show how this method outperforms other methods for domain adaptation on a Wall Street Journal data set.

By: Stanley F. Chen

Published in: IBM Research Report RC24671, 2008


This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

