29 July 2008

What Makes a Fair Comparison

This issue seems to come up, implicitly or explicitly, in a large variety of circumstances. The general setup is this. Suppose that for some task, we have the "de facto" approach "A." I come up with a new approach "B" that I want to argue is better than "A." The catch is that, for whatever reason, I have to limit B in some way (perhaps in terms of the amount of data I train it on). So, is the fair comparison to a similarly limited version of A, or to the full-blown A?

Let's instantiate this with two examples. (Yes, these are examples of papers that have been published; you can see if you can figure out which ones they are. I'm just not sufficiently creative to make something up right now.)

  1. Let's say I'm doing syntactic language modeling. In this case, the baseline would be (say) a trigram language model. My model requires parse trees to train, so I train my language model on the Penn Treebank. Now, when I compare to an n-gram model, is the fair comparison to an n-gram model trained on the Treebank, or to one trained on (say) ten years' worth of WSJ text?
  2. Now let's say I'm doing MT. My baseline is Moses and I build some fancy MT system that can only train on sentences of length 10 or less. Should I compare to Moses trained on sentences of length 10 or less, or all sentences?
The obvious response is "report both." But this is just deferring the problem. Instead of you deciding which is the appropriate comparison, you are leaving the decision up to someone else. Ultimately, someone has to decide.

The advantage to comparing to a gimped version of model A is that it tells you: if this is the only data available to me, this is how much better I do than A. Of course, in no real world situation (for many problems, including the two I listed above) will this really be the only data you have. Plus, if you compare to a non-gimped A, you'll almost always lose by a ridiculously large margin.

On the other hand, comparing against a non-gimped A is a bit unfair. Chances are that quite some time has gone into optimizing (say, for speed) the algorithms for solving A. Should I, as a developer of new models, also have to be a hard-core optimizer in order to make things work? I'm thinking back to the introduction of SVMs twenty years ago. Back then, SVM training would take thousands of times longer than naive Bayes. Today, (linear) SVM training really isn't that much slower (maybe a small constant factor longer).

Yet from the perspective of a consumer, there's something fundamentally unpleasant about having to gimp an existing system... you have to ask yourself: why should I care?

I suppose this is where human judgement comes in. If I can reasonably imagine that the new system B might possibly be scaled up and, if so, I think it would continue to do well, then I'm not unhappy with a gimped comparison. For example, I can probably buy the syntactic language modeling example above (and to a lesser degree the MT example). I have a harder time with grammar induction on 10 word sentences because my prior beliefs state that 10 word sentences are syntactically really different from real sentences.

(Incidentally, although this issue didn't come up in my recent reviewing, I suppose that if I were reviewing a paper that had this issue, it probably wouldn't hurt if the authors were to ease me through this imagination process. For instance, in grammar induction, maybe you can show statistics that say that the distribution of context free rules is not so dissimilar between all sentences and short sentences. This almost never happens, but I think it would be useful both during reviewing as well as just for posterity.)

21 July 2008

ICML/UAI/COLT 2008 Retrospective

I know it's a bit delayed, but better late than never, eh? I have to say: I thought ICML/UAI/COLT this year was fantastic. The organizers all did a fantastic job. I can't possibly list everything I liked, but here are some highlights:

The tutorials were incredible. I'm reaching that point where I often skip out on tutorials because there's nothing new enough. This time, there was an embarrassment of riches: I couldn't go to everything I wanted to. In the end, I went to the Smola+Gretton+Fukumizu tutorial on distributional embeddings and the Krause+Guestrin tutorial on submodularity. Both were fantastic. My only complaint is that while all the plenary talks at ICML/UAI/COLT were videotaped, the tutorials were not! Anyway, I will blog separately about both of these tutorials in the near future; also check out their web pages: the slides are probably reasonably self-explanatory.

The first day began with an invited talk by John Winn, who does (at least used to do) probabilistic modeling for computer vision. I'm not sure if this was the intended take-away message, but the thing I liked best about the talk is that he listed about a dozen things that you would have to model, were you to model "vision." He then talked about problems and techniques in terms of which of these you model, which of them you fix, and which of them you treat as "noise." I think this is a great way to think about modeling any kind of complex real world phenomenon (hint hint!).

My favorite morning talk was Beam Sampling for the Infinite Hidden Markov Model by Jurgen Van Gael, Yunus Saatci, Yee Whye Teh, and Zoubin Ghahramani. This is a really clever use of slice sampling, which everyone should know about! The idea is to resample entire sequences of hidden states in one go. The trouble with this in an infinite model is that I don't know how to do forward-backward over an infinite state space. The trick is to slice the probability mass on the lattice and only sample over those state/observation pairs that stick out above the slice. This is effectively a fully Bayesian way to do beam-like methods, and is certainly not limited to infinite HMMs.
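
To make the slicing concrete, here is a minimal sketch of one beam-sampling sweep, written for a plain finite K-state HMM (the iHMM version slices an infinite transition matrix down to a finite set in exactly the same way). All names here are my own illustration, not the authors' code; note that only the transitions get sliced, not the emissions.

```python
import numpy as np

def beam_sample_sweep(pi0, trans, emit, obs, states, rng):
    """One beam-sampling sweep: resample the whole hidden-state sequence.
    pi0[k]      : initial state probabilities
    trans[i, j] : transition probabilities
    emit[k, v]  : emission probabilities
    obs         : observed symbol indices, length T
    states      : current state assignment, length T
    """
    T, K = len(obs), trans.shape[0]

    # 1. Slice variables: u_t ~ Uniform(0, prob of the currently-used transition).
    #    The current path always survives the slice, so nothing below degenerates.
    u = np.empty(T)
    u[0] = rng.uniform(0, pi0[states[0]])
    for t in range(1, T):
        u[t] = rng.uniform(0, trans[states[t - 1], states[t]])

    # 2. Forward filtering, but only over transitions that stick out above the slice.
    alpha = np.zeros((T, K))
    alpha[0] = (pi0 > u[0]) * emit[:, obs[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        above = trans > u[t]                      # K x K mask of surviving transitions
        alpha[t] = (alpha[t - 1] @ above) * emit[:, obs[t]]
        alpha[t] /= alpha[t].sum()

    # 3. Backward sampling of a new state sequence from the truncated lattice.
    new_states = np.empty(T, dtype=int)
    new_states[-1] = rng.choice(K, p=alpha[-1])
    for t in range(T - 2, -1, -1):
        w = alpha[t] * (trans[:, new_states[t + 1]] > u[t + 1])
        new_states[t] = rng.choice(K, p=w / w.sum())
    return new_states
```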

That afternoon, there was the joint award session, with a "ten year" award (great idea!) going to Blum and Mitchell's original co-training paper from COLT 1998. This is clearly a very influential paper, and one I will blog about separately. I particularly liked two of the best paper award recipients:

  1. Percy Liang and Michael Jordan for An Asymptotic Analysis of Generative, Discriminative, and Pseudolikelihood Estimators. The idea here is to look at what various estimators are like in the limit of infinite data. The conclusion (as I read it) is that if your model is correct (i.e., the true distribution is realized by a certain set of parameters in your model family), then you're asymptotically better using generative models. If the model is even a tiny tiny bit incorrect, then you're better using discriminative. The key weakness of this analysis seems to be that, since we're looking only at asymptotics, there's no notion of a "sortof-wrong" model -- the wrongness gets blown up in the limit.
  2. Shai Shalev-Shwartz and Nati Srebro for SVM Optimization: Inverse Dependence on Training Set Size. The idea here is that having lots of data should be a good thing, not something that makes us say "ugh, now my SVM will take forever to run." The key to getting an inverse dependence on the size of the training set is to sort of change the problem. If you have 1,000 training points, you may get 88% accuracy and it may take an hour. If you have 1,000,000 training points, we're not going to ask for 90% accuracy, but rather still aim for 88% accuracy and try to get there faster. The analysis is based on the idea that with lots more data, we can be a bit looser about finding the exact best model parameters, which is what usually takes so long. We can afford not to have exactly the best parameters, since we're still only going for 88% accuracy, not 90%. (A rough sketch of this flavor of algorithm appears just after this list.)
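
To give a feel for why runtime can be decoupled from dataset size, here is a hedged sketch (my own illustration, not the paper's algorithm) of a basic Pegasos-style stochastic subgradient SVM, without the optional projection step: each iteration touches one random example, so the number of iterations is set by the accuracy you're after, and ten times more data doesn't make it any slower.

```python
import numpy as np

def pegasos_svm(X, y, lam=0.01, n_iters=100_000, rng=None):
    """Pegasos-style stochastic subgradient descent for a linear SVM.
    Runtime depends on n_iters (chosen for the target accuracy), not on len(X)."""
    rng = rng or np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    for t in range(1, n_iters + 1):
        i = rng.integers(len(X))              # one random example per iteration
        eta = 1.0 / (lam * t)                 # standard Pegasos step size
        if y[i] * (w @ X[i]) < 1.0:           # hinge-loss margin violation
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:
            w = (1 - eta * lam) * w            # only the regularizer pulls w back
    return w
```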
Although I didn't see the talk, I enjoyed the poster on Training Structural SVMs when Exact Inference is Intractable by Thomas Finley and Thorsten Joachims. They investigate the difference between approximate inference methods that are overgenerating versus undergenerating. The former is exemplified by a conversion from an IP to an LP; the latter by, say, search. I don't think there's a clear-cut answer as to when one is better than the other, but I think this is a really good step toward understanding these models better.

Tuesday morning, there was a great talk on Efficient Projections onto the L1-Ball for Learning in High Dimensions by John Duchi, Shai Shalev-Shwartz, Yoram Singer and Tushar Chandra. Here, they're trying to extend the Pegasos algorithm to do L1-regularized SVMs. The difficult step is, after taking a sub-gradient step, projecting yourself back onto the L1 ball. They have a very clever algorithm for doing this that makes use of red-black trees to maintain a notion of when a dimension was last updated (necessary in order to get complexity that depends on the sparseness of feature vectors). The only worry I have about this approach is that I'm very happy dealing with arrays for everything, but once someone tells me that my algorithm needs to use a tree structure, I start worrying. Not because of big-O issues, but just because trees mean pointers, pointers mean pointer chasing, and pointer chasing (especially out of cache) is a nightmare for actual efficiency.
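
For reference, here is the simple O(n log n) sort-based projection onto the L1 ball for a dense vector; the tree-based machinery in the paper is what makes the sparse, repeatedly-updated case cheap, which this sketch doesn't attempt.

```python
import numpy as np

def project_onto_l1_ball(v, z=1.0):
    """Euclidean projection of v onto the L1 ball {w : ||w||_1 <= z}."""
    if np.abs(v).sum() <= z:
        return v.copy()                            # already feasible
    u = np.sort(np.abs(v))[::-1]                   # magnitudes, descending
    css = np.cumsum(u)
    # largest rho with u[rho] > (css[rho] - z) / (rho + 1)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > (css - z))[0][-1]
    theta = (css[rho] - z) / (rho + 1.0)           # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)
```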

I found all the papers in the Tuesday 10:40am online learning session interesting. Confidence-weighted linear classification (Mark Dredze, Koby Crammer and Fernando Pereira) presents a technique for keeping track of how confident you are about your weights for various features (if you squint, it looks very PAC-Bayes), and they get very good online performance with only one or two passes over the data. Francesco Orabona, Joseph Keshet and Barbara Caputo had a paper on memory-bounded online perceptrons called the Projectron. The idea is simple and effective; however, the big take-away I had from this paper has to do with the baseline they compare against: if you only allow your perceptron to maintain, say, 100 support vectors, just use the first 100 points that the perceptron updates on. This does pretty much as well as most of the past work on this problem! How did this not get noticed before? Finally, Sham Kakade, Shai Shalev-Shwartz and Ambuj Tewari talked about bandit problems; this is interesting work, but still seems far from practical. The idea of naively picking an unknown label with uniform probability over a gigantic set of labels is just not going to work in the real world. But I think it's a really good start.
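
For concreteness, that "stop after the first N updates" baseline fits in a few lines; this is a sketch, with the kernel function supplied by the caller.

```python
def budgeted_kernel_perceptron(stream, kernel, budget=100):
    """Kernel perceptron that keeps at most `budget` support vectors:
    it simply stops adding new ones after the first `budget` mistakes."""
    support = []                                    # list of (x_i, y_i) pairs
    for x, y in stream:                             # labels y in {-1, +1}
        score = sum(yi * kernel(xi, x) for xi, yi in support)
        if y * score <= 0 and len(support) < budget:
            support.append((x, y))                  # mistake: keep as a support vector
    return support
```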

The entire NLP section on the last afternoon was good. David Chen and Ray Mooney talked about how to generate sportscasts from RoboCup trials; Ronan Collobert and Jason Weston talked about using neural nets to solve every NLP problem in the world (more on this in another post); Jason Wolfe, Aria Haghighi and Dan Klein talked about how to parallelize the M step in addition to the E step; and then Percy Liang, Dan Klein, and I talked about whether it's reasonable or not to get rid of features in structured prediction (more on this in another post).

The next day was workshop day. The two workshops I went to were the Prior Knowledge in Text workshop (which I co-organized with Marc Dymetman, Guillaume Bouchard and Yee Whye Teh), and the Nonparametric Bayes workshop (where I talked a tiny bit about HBC). I'll relate the news of our own workshop at a later date; you can see some about the NPBayes workshop here.

Then came UAI and COLT. I must admit by this time I was a bit conferenced out, so I missed a bunch of sessions. However, there were still some papers that stood out for me.

In UAI, David Mimno and Andrew McCallum presented Topic Models Conditioned on Arbitrary Features with Dirichlet-multinomial Regression, essentially a conditional variant of LDA wherein the distribution over hyperparameters is given by a generalized linear model and can use arbitrary features. One audience question that came up (David, if you're reading, maybe you can answer this!) had to do with putting the GLM on the "alpha" hyperparameter. The fact that there was such a big improvement over the baseline suggests that LDA-like models are highly sensitive to the settings of their hyperparameters, which (like the questioner) I found a bit surprising: who knew that tweaking a hyperparameter could have such a big influence! Nevertheless, very cool.
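
As I understand the parameterization (treat this as a sketch from my reading of the paper, not the authors' code): each document gets its own Dirichlet hyperparameters by pushing its feature vector through a log-linear function, so arbitrary document features can shift the prior over topics. In plain LDA, alpha is a single shared vector; here it moves around per document.

```python
import numpy as np

def dmr_alphas(doc_features, lam):
    """Per-document Dirichlet hyperparameters in a DMR-style topic model.
    doc_features : (D, F) matrix of arbitrary document features
    lam          : (F, K) regression weights, one column per topic
    Returns a (D, K) matrix with alpha[d, k] = exp(doc_features[d] . lam[:, k])."""
    return np.exp(doc_features @ lam)
```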

The best paper went to David Sontag, Talya Meltzer, Amir Globerson, Tommi Jaakkola and Yair Weiss for Tightening LP Relaxations for MAP using Message Passing. The idea is to take your marginal polytope initially defined by simple single-node marginal constraints and iteratively add pairwise, 3-wise, etc., constraints. There are some implementation details that allow you to do warm restarts. They managed to scale really amazingly well. This furthers my confidence that LPs are a really great way to do MAP inference in graphical models.

Kuzman Ganchev, Joao Graca, John Blitzer and Ben Taskar talked about Multi-View Learning over Structured and Non-Identical Outputs. I see this as a bit of an extension to the "alignment by agreement" and the "agreement-based learning" work, but now the difference is that the output spaces don't have to "match up."

Marina Meila and Le Bao had a poster on Estimation and clustering with infinite rankings, where they give a nonparametric model over rankings, focusing on properties of the distribution. Kurt Miller, Tom Griffiths and Mike Jordan had an extension to the Indian Buffet Process that intentionally removes exchangeability in order to model cases where we know the data is not exchangeable. Chong Wang, Dave Blei and David Heckerman had a nice result on how to do topic modeling over continuous time (modeled as Brownian motion). The cool thing is that by going continuous, they're able to get a more efficient algorithm than previous work that functioned in discrete time steps.

At COLT, there were also a number of good papers. Shai Ben-David, Tyler Lu and David Pal ask: Does Unlabeled Data Provably Help? Worst-case Analysis of the Sample Complexity of Semi-Supervised Learning. I'll ruin the suspense: no, it does not (at least not without strong assumptions about the label distribution). Nina Balcan, Avrim Blum and Nati Srebro show that learning with similarity functions is even better than we thought.

One paper at COLT that I especially liked because I wasn't aware of much of the previous work they cited was by Liwei Wang, Masashi Sugiyama, Cheng Yang, Zhi-Hua Zhou and Jufu Feng On the Margin Explanation of Boosting Algorithms. The history here is roughly like this. Boosting came along and worked pretty well. People started trying to explain it from the perspective of margin-based analysis. Then, Breiman came along and described an algorithm arc-gv that provably generates a better margin than AdaBoost (and does so in practice as well), yet works worse! This paper attempts to re-analyze boosting from the perspective of an "Equilibrium Margin," which provides sharper bounds and results that actually agree with what is observed empirically.

08 July 2008

ICML Business Meeting

Here are some points, coming to you roughly live.

  1. ICML 2009 will be in Montreal, June 15-19. Colocated with COLT (June 20-23) and Symposium on RL (June 19-21).
  2. ICML 2010 will be in Haifa, Israel (led by IBM folks). No official dates, yet.
  3. The ICML board will be holding elections for new members soon: probably around 10. Requests for nominations will be sent out in the next month or two from IMLS.
  4. Sounds like tracks will be implemented at the SPC level rather than at the reviewing level -- just like *ACL. To me, clearly a good idea. Bidding on 600 papers sucks.
  5. There's discussion about having a combination of an "AAAI-style nectar track" and an "applications track." The idea would be that if you've published a paper at an applications conference (e.g., ACL, CVPR, ISMB, etc.), you could rewrite more than 50% of it, sell it to an ICML audience, and resubmit. The goal would be to get some applications blood into ICML without simultaneously preventing people from submitting their apps papers to the apps conferences.