27 June 2006

Beating an N-Gram

Our entry into the NIST MT eval this year has a recapitalization component, currently done by a large language model (500MB, gzipped) together with SRILM's "disambig" tool. I was recently conscripted to try to improve this. After about a week's worth of work, using a bunch of tools, rules, and other machinery (though not 100% tuned), I've eked out a 0.25 increase in BLEU score on the 2003 and 2004 test data for three of our MT systems (phrase-based, syntax-based and Hiero). This could easily be upped to maybe 0.5 with a few more hours of work (there are tokenization differences between my training and test data).
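For the curious, here is a minimal Python sketch of what the recapitalization step amounts to. It is emphatically not our system, and not SRILM's disambig (which is what actually does the work), just the same idea in miniature: a case map from lowercased forms to the cased variants seen in training, plus a Viterbi search over those variants under a small bigram LM. The training sentences, the add-alpha smoothing, and all of the function names are invented for the illustration.

from collections import Counter, defaultdict
import math

def build_truecaser(cased_sentences):
    # Case map: lowercased form -> Counter of cased variants seen in training,
    # plus bigram/unigram counts for a (very small) language model.
    case_map = defaultdict(Counter)
    bigrams, unigrams = Counter(), Counter()
    for sent in cased_sentences:
        toks = ["<s>"] + sent.split()
        for tok in toks[1:]:
            case_map[tok.lower()][tok] += 1
        for a, b in zip(toks, toks[1:]):
            bigrams[(a, b)] += 1
            unigrams[a] += 1
    return case_map, bigrams, unigrams

def logp(a, b, bigrams, unigrams, alpha=0.1):
    # Add-alpha smoothed bigram log probability; a stand-in for the real LM.
    vocab = len(unigrams) + 1
    return math.log((bigrams[(a, b)] + alpha) / (unigrams[a] + alpha * vocab))

def recase(lowercased, case_map, bigrams, unigrams):
    # Viterbi search over the cased variants of each token under the bigram LM.
    columns = []
    beam = {"<s>": (0.0, None)}
    for tok in lowercased.split():
        variants = list(case_map.get(tok, {tok: 1}))
        new_beam = {}
        for cand in variants:
            score, prev = max(
                (s + logp(p, cand, bigrams, unigrams), p)
                for p, (s, _) in beam.items())
            new_beam[cand] = (score, prev)
        columns.append(new_beam)
        beam = new_beam
    out = []
    cand = max(beam, key=lambda c: beam[c][0])
    for column in reversed(columns):
        out.append(cand)
        cand = column[cand][1]
    return " ".join(reversed(out))

train = ["The President spoke in Washington .",
         "Washington is the capital .",
         "The capital hosted the President 's speech ."]
case_map, bigrams, unigrams = build_truecaser(train)
print(recase("the president spoke in washington .", case_map, bigrams, unigrams))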

The story that it's hard to beat an n-gram language model is fairly ubiquitous. In fact, my most useful features for classification are based on the n-best outputs of disambig using the large LM. There are older results that suggest that once you have enough data, using stupid techniques is sufficient. This seems to be more true for NLP than for many other areas I know about, though perhaps this is because there exist NLP problems for which it is even possible to get a "sufficient" amount of data.

While this story is true for English, I'm not sure it would hold for other languages. My guess is that n-gram models work significantly worse in languages with freer word order and (though this usually comes as a package deal) richer morphology. As a counter-argument, though, friends of mine who are in contact with people doing commercial speech recognition in languages other than English tell me that those systems do use vanilla n-gram models. I don't know whether this is for lack of an alternative or just because n-grams remain so good even outside of English.

I don't know the secret sauce to beating an n-gram. They appear to work so well because language (where, by "language" I mean "English") is really not that ambiguous, in a Shannon-experiment sense. Moreover, raw English words seem to be just about right when it comes to information content. Sure, we can get some mileage by looking at suffixes or bigrams, but, compared to, say, German (where one word can correspond to 2-3 English words), English really seems to strike the right balance between the amount of information in a word and sparsity (where "right" = "right for NLP"). I'm sure other languages fall fairly close by, and I recall seeing a study comparing Shannon-style tests in multiple languages (anyone know a pointer?), but, if pressed, this is my guess as to why n-grams work so well. To beat them, it seems we are almost forced to look at problems with less data, or problems which are significantly noisier.

5 comments:

hal said...

I should mention that by "n-gram model" I mean a very well tuned n-gram model with interpolated Kneser-Ney smoothing. Obviously a lot of work has gone into getting smoothing to work out well, and it is possible that this is where the majority of the power is coming from, though, when not scrambling for tiny improvements, I tend to doubt it.
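For concreteness, here is the bigram case of interpolated Kneser-Ney as a tiny Python sketch: bigram counts are absolutely discounted, and the held-out mass is redistributed according to a continuation probability (how many distinct histories a word completes) rather than raw unigram frequency. Real implementations typically estimate the discounts from count-of-count statistics (as in modified Kneser-Ney); the single fixed D=0.75, the toy corpus, and the function names below are just for illustration.

from collections import Counter, defaultdict

def train_kn_bigram(tokens, discount=0.75):
    # Counts needed for interpolated Kneser-Ney: bigram counts, history counts,
    # distinct continuations of each history, and distinct histories of each word.
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    history_counts = Counter(tokens[:-1])
    followers, histories = defaultdict(set), defaultdict(set)
    for v, w in bigram_counts:
        followers[v].add(w)
        histories[w].add(v)
    return bigram_counts, history_counts, followers, histories, len(bigram_counts), discount

def p_kn(w, v, model):
    # P(w | v) = max(c(v,w) - D, 0) / c(v) + D * N1+(v,*) / c(v) * P_continuation(w)
    bigram_counts, history_counts, followers, histories, n_types, D = model
    p_cont = len(histories[w]) / n_types          # continuation probability of w
    c_v = history_counts[v]
    if c_v == 0:                                  # unseen history: back off entirely
        return p_cont
    discounted = max(bigram_counts[(v, w)] - D, 0.0) / c_v
    backoff_mass = D * len(followers[v]) / c_v
    return discounted + backoff_mass * p_cont

text = "the cat sat on the mat and the dog sat on the rug".split()
model = train_kn_bigram(text)
print(p_kn("mat", "the", model), p_kn("rug", "dog", model))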

Anonymous said...

If you look closely at the literature, you'll find a lot of the "improvements" on n-gram models are improvements over lousy baseline models (low order, bad smoothing, small amounts of data, etc.).

We just built a very large character n-gram model for spell checking, trained on 50GB of domain-specific data. It works great as the source model in a noisy-channel spelling corrector. Results from MSN (Brill, Moore, Cucerzan, etc.) show roughly the same thing.
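To make the noisy-channel setup concrete, here is a toy Python sketch of the idea; it is nothing like a production corrector, which would model edit probabilities properly and train on vastly more data. The character LM here is an add-one smoothed 5-gram, the channel model is a flat per-edit penalty, and the corpus, the candidate list, and the penalty value are all invented for the example.

import difflib
import math
from collections import Counter

def train_char_lm(text, n=5):
    # Character n-gram and context counts (add-one smoothing applied at query time).
    padded = " " * (n - 1) + text
    ngrams = Counter(padded[i - n + 1:i + 1] for i in range(n - 1, len(padded)))
    contexts = Counter(padded[i - n + 1:i] for i in range(n - 1, len(padded)))
    return ngrams, contexts, n

def char_logprob(s, model, vocab=128):
    # Log P(s) under the character LM, add-one smoothed at each position.
    ngrams, contexts, n = model
    padded = " " * (n - 1) + s
    return sum(math.log((ngrams[padded[i - n + 1:i + 1]] + 1) /
                        (contexts[padded[i - n + 1:i]] + vocab))
               for i in range(n - 1, len(padded)))

def correct(observed, candidates, model, per_edit_logp=-4.0):
    # Noisy channel: argmax over candidates of source log P(candidate)
    # plus a flat channel penalty for each differing span.
    def score(cand):
        ops = difflib.SequenceMatcher(None, cand, observed).get_opcodes()
        edits = sum(1 for op in ops if op[0] != "equal")
        return char_logprob(cand, model) + per_edit_logp * edits
    return max(candidates, key=score)

corpus = "the patient was prescribed penicillin for a bacterial infection " * 20
model = train_char_lm(corpus)
print(correct("pennicillin", ["penicillin", "pennicillin", "pencil"], model))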

Case restoration as done by IBM's TrueCasing system is just another noisy-channel spelling corrector.

As to free word order languages, most of them are not free *word* order, but free *phrase* order. Much of the power of n-grams comes from the phrasal level. They break down in English when related terms are separated by long appositives, as in "The President, who blah blah blah, signed ...", where you don't have a chance of predicting "signed" on the basis of "President" with an n-gram. In fact, you wind up predicting it based on something it's less related to. Attempts to use parsing to sort these issues out have had mixed success at best. Parsers just don't provide very good language models, probably because most decisions, even a parser's, are local, even in languages with more flexible phrase order.

As to the "English is just right", I think that's a perspective derived from a field where most of the work's in English. Sentence detection, for instance, is a pain because of our overloading of the period -- not a problem in Chinese.

As to scaling, check out my paper from last year's ACL workshop:
Scaling High-Order Character Language Models to Gigabytes. I basically showed that as the data gets up into the 1GB range for 8-grams, the smoothing you use is irrelevant -- Dirichlet, Kneser-Ney, and what we went with, Witten-Bell, were all within three decimal places after that much data. Pretty much in line with the Banko and Brill conclusion that more data's better. N-grams are great because you can just keep piling in more data usefully.

Character LMs are tough to beat for classification. Both Bill Teahan and Ian Witten's work on compression-based models and Fuchun Peng's recent Waterloo thesis have shown this in a wide variety of settings. I just rebuilt Pang and Lee's sentiment analysis experiments, and character n-grams did as well as their SVM-based classifiers.
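The classification recipe itself is simple. Roughly, in sketch form (toy data, add-one smoothing, and a low order, all chosen just for illustration), you train one character LM per class and label a document with the class whose model gives it the highest likelihood:

import math
from collections import Counter

def char_lm(texts, n=4):
    # One character n-gram model per class: n-gram and context counts.
    ngrams, contexts = Counter(), Counter()
    for t in texts:
        padded = " " * (n - 1) + t
        for i in range(n - 1, len(padded)):
            ngrams[padded[i - n + 1:i + 1]] += 1
            contexts[padded[i - n + 1:i]] += 1
    return ngrams, contexts, n

def loglik(text, model, vocab=128):
    # Add-one smoothed log-likelihood of the document under one class model.
    ngrams, contexts, n = model
    padded = " " * (n - 1) + text
    return sum(math.log((ngrams[padded[i - n + 1:i + 1]] + 1) /
                        (contexts[padded[i - n + 1:i]] + vocab))
               for i in range(n - 1, len(padded)))

def classify(text, class_models):
    # Pick the class whose character LM assigns the document the highest likelihood.
    return max(class_models, key=lambda label: loglik(text, class_models[label]))

models = {"pos": char_lm(["a wonderful , moving film",
                          "great acting and a terrific script"]),
          "neg": char_lm(["a dull , lifeless mess",
                          "terrible pacing and wooden acting"])}
print(classify("a terrific and moving film", models))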

Chinese character LMs work very well, too. We used them to do word segmentation for the latest SIGHAN (Teahan did this first in his CL paper of about 2000). And using words rather than tokens worked very well for Chinese named entity recognition. Scores were about the same as for English (86% balanced F measure on person/location/organization).

hal said...

Wow, Bob said a lot :).

I have to admit something of a personal bias: it irks me that vanilla relative frequency works so darn well on so many things, and I'm really curious why this is. This is maybe a separate issue, though.

I think "there's no data like more data" is fairly well accepted, though I would generally add "from the right domain" at the end. The free word order versus free phrase order point is an important distinction, and one I hadn't previously really thought about.

I find it utterly fascinating that character LMs work so well. Word n-grams I can buy, because words are sort of whole semantic units. But I wonder if a high-ish order character n-gram is essentially just mimicking a word n-gram, with a little smoothing over prefixes and suffixes? I.e., if you stem, or just chop words off at four characters, is there a significant difference between the character-based n-gram and the word n-gram? It seems that the character-based n-gram has to be wasting probability mass on things that just don't occur, probably more so than a word n-gram would. But maybe this is okay for the purposes of classification.

In some sense, it is nice that character-based models can go somewhere. This reminds me of attempts to do computer vision by looking only at pixels (a la Yann LeCun).

Are there any similar results for languages with freer word order than English and Chinese? For instance, Russian, Hebrew or Arabic?

Anonymous said...

Character n-grams are not just short word n-grams. Let me explain why.

First, there's not only the prefix/suffix issue Hal brought up, but also the cross-word issue. A character 8-gram often spans several short words: "out of a" is only 8 characters.

Second, the usual application of token n-grams includes (a) case normalization, (b) punctuation removal, and (c) whitespace removal. That's why these had to be reinserted for IBM's "Entropy of English" paper (early 90s CL). I asked Joshua Goodman about his and Stanley Chen's papers, and they never modeled the generation of text.

Doing the same thing before running a character language model *increases* cross-entropy on both NY Times gigaword article bodies and MEDLINE citation abstracts.

Most token LMs are defective in that they assume a fixed vocabulary size. In practice, this means they don't sum to 1.0 over arbitrary sequences of tokens (or over sequences of tokens of the same length). In short, they're *cheating* if you view them as arbitrary text analyzers, and thus they can't be used for tasks such as compression. We back our token models off to character models rather than to a constant (as Kneser-Ney, Dirichlet, Witten-Bell, etc. do when applied to tokens).
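In sketch form (and nothing like an actual implementation), the fix looks something like the following: reserve a little probability mass for unseen tokens and spend it according to a character-level model of the token's spelling, rather than on a single unknown-word constant. The counts, the OOV mass, and the deliberately crude uniform character model are all invented for the illustration.

import math

def uniform_char_logprob(token, stop=0.1, alphabet=27):
    # A deliberately crude "character model": uniform over an alphabet, with a
    # geometric stopping probability, so it sums to one over all strings.
    return len(token) * math.log((1 - stop) / alphabet) + math.log(stop)

def open_vocab_logprob(token, counts, total, oov_mass=0.01):
    # Known tokens get (1 - oov_mass) times their relative frequency.
    # Unknown tokens get oov_mass times the probability of their spelling,
    # so mass is spread over all strings instead of a fixed vocabulary.
    if token in counts:
        return math.log((1 - oov_mass) * counts[token] / total)
    return math.log(oov_mass) + uniform_char_logprob(token)

counts = {"the": 50, "president": 5, "declared": 2}
total = sum(counts.values())
for tok in ["president", "prezident", "overparliamentarization"]:
    print(tok, round(open_vocab_logprob(tok, counts, total), 2))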

Google has found that the one-new-token-per-document rule holds into the billions of documents, so there doesn't seem to be a natural upper bound on vocabulary size. This problem's even worse (in theory) in highly inflected languages like Turkish or Basque. And it's problematic for Chinese, where the best word segmentation systems are only 97% accurate or so.

Character n-grams under all the usual smoothing schemes converge to the ML estimate with enough data. So the more data they see, the less they "waste" on unseen events. And because they typically smooth over n-gram orders, their estimates of unseen tokens are much better (partly the prefix/suffix issue).

hal said...

Wouldn't the fact that character n-grams have a tiny vocabulary (say, around 100 symbols) also contribute significantly to the different smoothing techniques not making much of a difference?
