19 July 2007

What's the Use of a Crummy Translation?

I'm currently visiting Microsoft Research Asia (in Beijing) for two weeks (thanks for having me, guys!). I speak basically no Chinese. I took one half of a semester about 6 years ago. I know much more Japanese; enough so that I can read signs that indicate direction, dates and times, but that's about it... the remainder is too divergent for me to make out at all (perhaps a native Japanese speaker would feel differently, but certainly not a gaijin like me).

My experience here has reminded me of a paper that Ken Church and Ed Hovy wrote almost 15 years ago now, Good Applications for Crummy Machine Translation. I'm not sure how many people have read it recently, but it essentially makes the following point: MT should enter the user's world in small steps, only insofar as it is actually going to work. To say that MT quality has improved significantly in 15 years is probably an understatement, but it is still certainly far from anything that can compare to the translation quality of a human, even in the original training domain.

That said, I think that maybe we are a bit too modest as a community. MT output is actually relatively readable these days, especially for short input sentences. The fact that "real world companies" such as Google and LanguageWeaver seem to anticipate making a profit off of MT shows that at least a few crazies out there believe that it is likely to work well enough to be useful.

At this point, rather than gleefully shouting the glories of MT, I would like to point out the difference between the title of this post and the title of the Church/Hovy paper. I want to know what to do with a crummy translation. They want to know what to do with crummy machine translation. This brings me back to the beginning of this post: my brief experience in Beijing. (Discourse parsers: I challenge you to get that dependency link!)

  • This voucher can not be encashed and can be used for one sitting only.
  • The management reserves the right of explanation.
  • Office snack is forbidden to take away.
  • Fizzwater bottles please recycle.
The first two are at my hotel, which is quite upscale; the last two are here on the fridge at Microsoft. There are so many more examples: in subway stations, on the bus, on tourism brochures, in trains, at the airport... I could go on collecting these forever. The interesting thing is that although two of these use words that aren't even in my vocabulary (encashed and fizzwater), one is grammatical but semantically nonsensical (what are they explaining?) and one is missing an indirect object (but if it had one, it would be semantically meaningless), I still know what they all mean. Yes, they're sometimes amusing and worth a short chuckle, but overall the important points are gotten across: no cash value; you can be kicked out; don't steal snacks; recycle bottles.

The question I have to ask myself is: are these human translations really better than something a machine could produce? My guess is that machine translation outputs would be less entertaining, but I have a hard time imagining that they would be less comprehensible. I guess I want to know: if we're holding ourselves to the standard of some level of human translation, what level is this? Clearly it's not the average translation level that large tourism companies in China hold themselves to. Can we already beat these translations? If so, why don't we relish this fact?

12 July 2007

Multiclass learning as multitask learning

It's bugged me for a little while that when learning multiclass classifiers, the prior on weights is uniform (even when they're learned directly and not through some one-versus-rest or all-versus-all reduction).

Why does this bug me? Consider our favorite task: shallow parsing (aka syntactic chunking). We want to be able to identify base phrases such as NPs in running text. The standard way to do this is to do an encoding of phrase labels into word labels and apply some sequence labeling algorithm. The standard encoding is BIO. A sentence like "The man ate a sandwich ." would appear as "B-NP I-NP B-VP B-NP I-NP O" with "B-X" indicating the beginning of a phrase of type X, and "I-X" indicating being "inside" such a phrase ("O", assigned to "." is "outside" of a chunk).

If we train, eg., a CRF to recognize this, then (typically) it considers B-NP to be completely independent of I-NP; just as independent as it is of "O". Clearly this is a priori a wrong assumption.

One way I have gotten around this problem is to actually explicitly parameterize my models with per-class features. That is, rather than having a feature like "current word is 'the'" and making K copies of this feature (one per output label); I would have explicitly conjoined features such as "current word is 'the' and label is 'B-NP'". This enables me to have features like "word=the and label is B-?" or "word=the and label is ?-NP", which would get shared across different labels. (Shameless plug: megam can do this effortlessly.)
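As a toy sketch of what I mean (the feature-string format here is invented, not megam's actual input format):

```python
# Expand one base feature into explicitly conjoined versions. Besides the
# fully specific conjunction, emit "shared" versions over the B/I component
# and over the phrase-type component of a BIO label, so weight mass can be
# shared across related labels.
def conjoin(base_feature, label):
    feats = [f"{base_feature}&label={label}"]
    if "-" in label:                                     # e.g. "B-NP"
        bi, ptype = label.split("-", 1)
        feats.append(f"{base_feature}&label={bi}-?")     # shared across phrase types
        feats.append(f"{base_feature}&label=?-{ptype}")  # shared across B vs. I
    return feats

print(conjoin("word=the", "B-NP"))
# → ['word=the&label=B-NP', 'word=the&label=B-?', 'word=the&label=?-NP']
```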

But I would rather not have to do this. One thing I could do is to make 2^K versions of each feature (K is still the number of labels), where each version is active for some subset of the labels. But for large-K problems, this could get a bit unwieldy. Pairwise features would be tolerable, but then you couldn't get the "B-?" sort of features I want. There's also no obvious kernel solution here, because these are functions of the output label, not the input.

It seems like the right place for this to happen is in the prior (or the regularizer, if you're anti-probabilistic models). Let's say we have F features and K classes. In a linear model, we'll learn F*K weights (okay, really F*(K-1) for identifiability, but it's easier to think in terms of F*K). Let's say that a priori we know that classes j and k are related. Then we want the prior to favor w(:,j) to be similar to w(:,k). There are a variety of ways to accomplish this: I think that something along the lines of a recent NIPS paper on multitask feature learning is a reasonable way to approach this.
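To give a flavor of what such a prior might look like as a regularizer, here's a sketch. The fixed similarity matrix S is my simplification: the whole point of the multitask-feature-learning style approach is to learn the sharing structure rather than fix it in advance.

```python
import numpy as np

# Penalize (a) weights far from zero (the usual Gaussian prior) and
# (b) weight columns of related classes being far from each other.
def similarity_regularizer(W, S, lam=1.0, mu=0.1):
    """W: F x K weight matrix; S[j, k] >= 0 says how related classes j, k are."""
    K = W.shape[1]
    penalty = mu * np.sum(W ** 2)           # pull all weights toward zero
    for j in range(K):
        for k in range(j + 1, K):           # pull related columns together
            penalty += lam * S[j, k] * np.sum((W[:, j] - W[:, k]) ** 2)
    return penalty
```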

What this approach lacks in general is the notion that if classes j and k "share" some features (i.e., they have similar weights), then they're more likely to "share" other features. You could do something like task clustering to achieve this, but that seems unideal since I'd really like to see a single unified algorithm.

Unfortunately, all attempts (on my part) to come up with a convex regularizer that shares these properties have failed. I actually now think that it is probably impossible. The problem is essentially that there is bound to be some threshold controlling whether the model thinks classes j and k are similar: below this threshold, the regularizer will prefer w(:,j) and w(:,k) independently close to zero; above this threshold, it will prefer w(:,j) and w(:,k) to be close to each other (and also close to zero). This is essentially the root of the non-convexity.

08 July 2007

Collapsed Gibbs

(The contents of this post are largely due to a conversation with Percy Liang at ACL.)

I'm a big fan of Gibbs sampling for Bayesian problems, just because it's so darn easy. The standard setup for Gibbs sampling over a space of variables a,b,c (I'll assume there are no exploitable independences) is:

  1. Draw a conditioned on b,c
  2. Draw b conditioned on a,c
  3. Draw c conditioned on a,b
This is quite a simple story that, in some cases, can be "improved." For instance, it is often possible to jointly draw a and b, yielding:
  1. Draw a,b conditioned on c
  2. Draw c conditioned on a,b
This is the "blocked Gibbs sampler." Another variant, that is commonly used in our community, is when one of the variables (say, b) can be analytically integrated out, yielding:
  1. Draw a conditioned on c
  2. Draw b conditioned on a,c
  3. Draw c conditioned on a,b
This is the "collapsed Gibbs sampler." In fact, we can often collapse b out entirely and, in cases where we don't actually care about it's value, we get:
  1. Draw a conditioned on c
  2. Draw c conditioned on a
To make this concrete, consider Mark Johnson's EMNLP paper on unsupervised part of speech tagging. Here, there are essentially two types of variables. One set are the tags assigned to each word. The second set are the parameters (eg., probability of word given tag). The standard Gibbs sampler would perform the following actions:
  1. For each word token, draw a tag for that word conditioned on the word itself, the tag to the left, and the "probability of word given tag" parameters.
  2. For each tag type (not token), draw a multinomial parameter vector for "probability of word given tag" conditioned on the current assignment of tags to words.
It turns out that if our prior on "p(word|tag)" is Dirichlet, we can collapse out the second step by integrating over these parameters. This yields a one-step collapsed Gibbs sampler:
  1. For each word token, draw a tag for that word conditioned on the word itself, the tag to the left, and all other current assignments of tags to words.
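To make the mechanics concrete, here's a toy sketch of one collapsed resampling step for a bigram HMM tagger with symmetric Dirichlet priors. All the names are mine, and to keep it short I drop the factor involving the tag to the right, which a real implementation would include:

```python
import random
from collections import Counter

# One collapsed Gibbs step for token i: no multinomial parameters appear
# anywhere; the conditional is computed directly from smoothed counts over
# the other tokens' current tag assignments.
def resample_tag(i, words, tags, prev_tag, K, V, emit, emit_tot, trans,
                 alpha=0.1, beta=0.1):
    w, old = words[i], tags[i]
    # remove token i's own counts ("condition on everything else")
    emit[old, w] -= 1; emit_tot[old] -= 1; trans[prev_tag, old] -= 1
    # p(tag = t) ∝ p(w | t, counts) * p(t | prev_tag, counts)
    probs = [(emit[t, w] + beta) / (emit_tot[t] + V * beta)
             * (trans[prev_tag, t] + alpha) for t in range(K)]
    # draw the new tag from the (unnormalized) conditional
    r, new = random.random() * sum(probs), 0
    while r > probs[new]:
        r -= probs[new]; new += 1
    # add token i back in with its new tag
    emit[new, w] += 1; emit_tot[new] += 1; trans[prev_tag, new] += 1
    tags[i] = new
    return new
```

The point to notice is that the "probability of word given tag" parameters never materialize; they have been integrated out into the count-based terms.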
My general M.O. has been: if you can collapse out a variable, you should. This seems intuitively reasonable because you're now sampling over a smaller space and so it should be easier.

The point of this post is to acknowledge that this may not always be the case. In fact, it's sort of obvious in retrospect. There are many models for which auxiliary variables are added just to make the sampling easier. This is, in effect, un-collapsing the sampler. If "always collapse" were a good rule to follow, then people would never add auxiliary variables.

While this is a convincing argument (for me, at least), it's not particularly intuitive. I think that the intuition comes from considering the mixing rate of the Markov chain specified by the standard Gibbs sampler and the collapsed Gibbs sampler. It seems that essentially what's happening by using a collapsed sampler is that the variance of the Markov chain is decreasing. In the tagging example, consider a frequent word. In the collapsed setting, the chance that the tag for a single token of this word will change in a Gibbs step is roughly inversely proportional to its term frequency. This means that the collapsed sampler is going to have a tendency to get stuck (and this is exactly what Mark's results seem to suggest). On the other hand, in the uncollapsed case, it is reasonably plausible that a large number of tags could change for a single word type "simultaneously" due to a slightly different draw of the "p(word|tag)" parameter vector.

(Interestingly, in the case of LDA, the collapsed sampler is the standard approach and my sense is that it is actually somehow not causing serious problems here. But I actually haven't seen experiments that bear on this.)

02 July 2007

ACL and EMNLP 2007 Report

ACL/EMNLP just concluded. Overall, I thought both conferences were a success, though by now I am quite ready to return home. Prague was very nice. I especially enjoyed Tom Mitchell's invited talk on linking fMRI experiments to language. They actually use lexical semantic information to be able to identify what words people are thinking about when they scan their brains. Scary mind-reading stuff going on here. I think this is a very interesting avenue of research---probably not one I'll follow myself, but one that I'm really happy someone is pursuing.

This is a really long post, but I hope people (both who attended and who didn't) find it useful. Here are some highlights, more or less by theme (as usual, there are lots of papers I'm not mentioning, the majority of which because I didn't see them):

Machine Translation:

The overall theme at the Stat-MT workshop was that it's hard to translate out of domain. I didn't see any conclusive evidence that we've figured out how to do this well. Since domain adaptation is a topic I'm interested in, I'd like to have seen something. Probably the most interesting paper I saw here was about using dependency order templates to improve translation, but I'm actually not convinced that this is actually helping much with the domain issues: the plots (eg., Fig 7.1) seem to indicate that the improvement is independent of domain when compared to the treelet system. They do do better than phrases though. There was a cute talk about trying character-based models for MT. Doesn't seem like this does much of interest, and it only really works for related languages, but it was nice to see someone try. This idea was echoed later in EMNLP but none of the talks there really stood out for me. The only other theme that I picked up at Stat-MT (I didn't stay all day) was that a lot of people are doing some form of syntactic MT now. Phrase-based seems to be on its way out (modulo the next paragraph).

There were also a lot of talks using Philipp Koehn's new Moses translation system, both at Stat-MT as well as at ACL and EMNLP. I won't link you to all of them because they all tried very similar things, but Philipp's own paper is probably a good reference. The idea is to do factored translation (ala factored language modeling) by splitting both the input words and the output words into factors (eg., lemma+morphology) and translating each independently. The plus is that most of your algorithms for phrase-based translation remain the same, and you can still use max-Bleu training. These are also the cons. It seems to me (being more on the ML side) that what we need to do is rid ourselves of max-Bleu, and then just switch to a purely discriminative approach with tons of features, rather than a linear combination of simple generative models.

There were also a lot of word-alignment talks. The most conclusive, in my mind, was Alex Fraser's (though I should be upfront about bias: he was my officemate for 5 years). He actually introduced a new generative alignment model (i.e., one that doesn't have "IBM" in the name) that accounts for phrases directly in the model (no more symmetrization, either). And it helps quite a bit. There was also a paper on alignments tuned for syntax by John DeNero and Dan Klein that I liked (I tried something similar previously in summarization, but I think their model makes more sense). (A second bias: I have known John since I was 4 years old.)

The Google folks had a paper on training language models on tera-word corpora. The clever trick here is that if your goal is an LM in a linear model (see below), it doesn't matter if it's normalized or not. This makes the estimation much easier. They also (not too surprisingly) find that when you have lots of words, backoff doesn't matter as much. Now, if only the Google n-gram corpus included low counts. (This paper, as well as many other Google papers both within and without MT, makes a big deal of how to do all this computation in a map-reduce framework. Maybe it's just me, but I'd really appreciate not reading this anymore. As a functional programming languages guy, map-reduce is just map and fold. When I've written my code as a map-fold operation, I don't put it in my papers... should I?)
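Their unnormalized scoring scheme ("stupid backoff," if I remember the name right) is simple enough to sketch; this is a toy version over a flat count table, not their distributed implementation:

```python
from collections import Counter

# Unnormalized n-gram scores: use the raw relative frequency if the n-gram
# was seen, otherwise back off to the shorter history with a constant
# multiplier. The scores don't sum to one, which is fine if the LM is just
# one feature in a linear model.
def stupid_backoff(ngram, counts, lam=0.4):
    """ngram: tuple of words; counts: Counter over word tuples of all orders."""
    if len(ngram) == 1:
        total = sum(c for g, c in counts.items() if len(g) == 1)
        return counts[ngram] / total if total else 0.0
    hist = ngram[:-1]
    if counts[hist] > 0 and counts[ngram] > 0:
        return counts[ngram] / counts[hist]
    return lam * stupid_backoff(ngram[1:], counts, lam)
```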

The talk that probably got the most attention at EMNLP was on WSD improving machine translation by Marine Carpuat and Dekai Wu. I think Marine and Dekai must have predicted a bit of difficulty from the audience, because the talk was put together a bit tongue in cheek, but overall came across very well. The key to getting WSD to help is: (a) integrate in the decoder, (b) do WSD on phrases not just words, and (c) redefine the WSD task :). Okay, (c) is not quite fair. What they do is essentially train a classifier to do phrase prediction, rather than just using a t-table or a phrase-table. (Actually, Berger and the Della Pietras did this back in the 90s, but for word translation.) Daniel Marcu nicely commented that he would go back to LA and tell the group that he saw a very nice talk about training a classifier to predict phrases based on more global information, but that he may not mention that it was called WSD. They actually had a backup slide prepared for this exact question. Personally, I wouldn't have called it WSD if I had written the paper. But I don't think it's necessarily wrong to. I liken it to David Chiang's Hiero system: is it syntactic? If you say yes, I think you have to admit that Marine and Dekai's system uses WSD. Regardless of what you call it, I think this paper may have quite a bit of impact.

(p.s., MT-people, listen up. When you have a model of the form "choose translation by an argmax over a sum over features of a weight times a feature value", please stop refering to it as a log-linear model. It's just a linear model.)

Machine Learning:

Daisuke Okanohara and Jun'ichi Tsujii presented a paper on learning discriminative language models with pseudo-negative samples. Where do they get their negative samples? They're just "sentences" produced by a trigram language model! I find it hard to believe no one has done this before because it's so obvious in retrospect, but I liked it. However, I think they underplay the similarity to the whole-sentence maxent language models from 2001. Essentially, when one trains a WSMELM, one has to do sampling to compute the partition function. The samples are actually generated by a base language model, typically a trigram. If you're willing to interpret the partition function as a collection of negative examples, you end up with something quite similar.
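The recipe is easy to sketch (everything here, the bigram table included, is invented for illustration; they sample from a trigram model and train a fancier classifier):

```python
import random

# Generate a pseudo-negative example: a "sentence" sampled from a base
# n-gram model. Real sentences are positives, these samples are negatives,
# and any binary classifier can then be trained to tell them apart.
def sample_pseudo_negative(next_words, max_len=20):
    """next_words: dict mapping a word to the list of words that may follow it."""
    sent, w = [], "<s>"
    while len(sent) < max_len:
        w = random.choice(next_words[w])
        if w == "</s>":
            break
        sent.append(w)
    return sent
```

A discriminative LM then falls out of whatever classifier you like, trained on real sentences versus these samples.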

There were two papers on an application of the matrix-tree theorem to dependency parsing, one by the MIT crowd, the other by the Smiths. A clever application (by both sets of authors) of the m-t theorem essentially allows you to efficiently (cubic time) compute marginals over dependency links in non-projective trees. I think both papers are good and if this is related to your area, it's worth reading both. My only nit pick is the too-general title of the MIT paper :).
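The heart of the trick is that the total weight (partition function) over all non-projective dependency trees of an n-word sentence is a single n-by-n determinant, from which edge marginals follow via derivatives of log Z. Here's a sketch of the partition function computation; the indexing conventions and names are mine:

```python
import numpy as np

# Matrix-tree partition function over non-projective dependency trees.
# Build the (in-degree) Laplacian over the n words, folding the root
# attachment weights into the diagonal; its determinant sums the product
# of edge weights over all spanning arborescences rooted at the root.
def partition_function(root_w, edge_w):
    """root_w[m]: weight of attaching word m to the root.
    edge_w[h][m]: weight of dependency edge h -> m (entries with h == m ignored)."""
    n = len(root_w)
    L = np.zeros((n, n))
    for m in range(n):
        L[m, m] = root_w[m] + sum(edge_w[h][m] for h in range(n) if h != m)
        for h in range(n):
            if h != m:
                L[h, m] = -edge_w[h][m]
    return float(np.linalg.det(L))   # cubic time, versus exponentially many trees
```

For a two-word sentence with all weights 1, this gives 3, matching the three possible trees by direct enumeration.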

John Blitzer, Mark Drezde and Fernando Pereira had a very nice paper on an application of SCL (a domain adaptation technique) to sentiment classification. Sentiment is definitely a hot topic (fad topic?) right now, but it's cool to see some fancy learning stuff going on there. If you know SCL, you know roughly what they did, but the paper is a good read.

One of my favorite papers at ACL was on dirichlet process models for coreference resolution by Aria Haghighi and Dan Klein. I'd say you should probably read this paper, even if it's not your area.

One of my favorite papers at EMNLP was on bootstrapping for dependency parsing by David Smith and Jason Eisner. They use a clever application of Renyi entropy to obtain a reasonable bootstrapping algorithm. I was not aware of this, but during the question period, it was raised that apparently these entropy-based measures can sometimes do funky things (make you too confident in wrong predictions). But I think this is at least somewhat true for pretty much all semi-supervised or bootstrapping models.

Random other stuff:

I learned about system combination from the BBN talk. The idea here is to get lots of outputs from lots of models and try to combine the outputs in a meaningful way. The high-level approach for translation is to align all the outputs using some alignment technique. Now, choose one as a pivot. For each aligned phrase in the pivot, try replacing it with the corresponding phrase from one of the other outputs. It's kinda crazy that this works, but it helps at least a few bleu points (which I'm told is a lot). On principle I don't like the idea. It seems like just a whole lot of engineering. But if you're in the "get good results" game, it seems like just as good a strategy as anything else. (I'm also curious: although currently quite ad-hoc, this seems a lot like doing an error-correcting output code. Does anyone know if it has been formalized as such? Do you get any gains out of this?)
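As a toy sketch of the pivot-and-replace idea (all details invented; real systems are much more clever about alignment, search, and scoring):

```python
# Greedy system combination over a pivot hypothesis: for each aligned phrase
# slot, try substituting each other system's phrase and keep whichever full
# hypothesis scores best under some model (here just a stand-in function).
def combine(pivot_phrases, alternatives, score):
    """pivot_phrases: list of phrases; alternatives[i]: other systems' phrases
    aligned to slot i; score: maps a full phrase list to a float."""
    current = list(pivot_phrases)
    for i, options in enumerate(alternatives):
        for alt in options:
            candidate = current[:i] + [alt] + current[i + 1:]
            if score(candidate) > score(current):
                current = candidate
    return current
```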

My final blurb is to plug a paper by MSR on single-document summarization. Yes, that's right, single-document. And they beat the baseline. The cool thing about this paper is that they use the "highlights" put up on many CNN news articles as training. Not only are these not extracts, but they're also "out of order." My sense from talking to the authors is that most of the time a single highlight corresponds to one sentence, but is simplified. I actually downloaded a bunch of this data a month ago or so (it's annoying -- CNN says that you can only download 50 per day and you have to do it "manually" -- it's unclear that this is actually enforceable or if it would fall under fair use, but I can understand from Microsoft's perspective it's better safe than sorry). I was waiting to collect a few more months of this data and then release it for people to use, so check back later. (I couldn't quite tell if MSR was going to release their version or not... if so, we should probably talk and make sure that we don't waste effort.)

Wrapping Up:

Since we're ending on a summarization note, here's a challenge: create a document summarization system that will generate the above post from the data in the anthology. (Okay, to be fair, you can exclude the information that's obtained from conversations and questions. But if we start videotaping ACL, then that should be allowable too.)

I put pictures from Prague up on my web site; feel free to email me if you want a high-res version of any of them. Also, if I was talking to you at the conference about sequential Monte Carlo, email me -- I have new info for you, but I can't remember who you are :).