## 29 May 2007

### Quality vs. quantity in data annotation

I'm in the process of annotating some data (along with some students---thanks guys!). While this data isn't really in the context of an NLP application, the annotation process made me think of the following issue. I can annotate a lot more data if I'm less careful. Okay, so this is obvious. But it's something I hadn't really specifically thought of before.

So here's the issue. I have some fixed amount of time in which to annotate data (or some fixed amount of dollars). In this time, I can annotate N data points with a noise-rate of eta_N. Presumably eta_N approaches one half (for a binary task) as N increases. In other words, as N increases, (1-2 eta_N) approaches zero. A standard result in PAC learning states that a lower bound on the number of examples required to achieve 1-epsilon accuracy with probability 1-delta with a noise rate of eta_N when the VC-dimension is h is (h+log(1/delta))/(epsilon (1-2 eta)^2)).

This gives us some insight into the problem. This says that it is worth labeling more data (with higher noise) only if 1/(1-2 eta_N)^2 increases more slowly than N. So if we can label twice as much data and have the noise of this annotation increase by less than a factor of 0.15, then we're doing well. (Well in the sense that we can keep the bound the same an shrink either \epsilon or delta.)

So how does this hold up in practice? Well, it's hard to tell exactly for real problems because running such experiments would be quite time-consuming. So here's a simulation. We have a binary classification problem with 100 features. The weight vector is random; the first 50 dimensions are Nor(0,0.2); the next 35 are Nor(0,5); the final 15 are Nor(m,1) where m is the weight of the current feature id minus 35 (feature correlation). We vary the number of training examples and the error rate. We always generate equal number of positive and negative points. We train a logistic regression model with hyperparameters tuned on 1024 (noisy) dev points and evaluate on 1024 non-noisy test points. We do this ten times for each setting and average the results. Here's a picture of accuracy as a function of data set size and noise rate:

And here's the table of results:

N\eta      0   0.01   0.02   0.05    0.1    0.2   16  0.341  0.340  0.351  0.366  0.380  0.420   32  0.283  0.276  0.295  0.307  0.327  0.363   64  0.215  0.221  0.227  0.247  0.266  0.324  128  0.141  0.148  0.164  0.194  0.223  0.272  256  0.084  0.099  0.100  0.136  0.165  0.214  512  0.038  0.061  0.065  0.087  0.113  0.164 1024  0.023  0.034  0.044  0.059  0.079  0.123

The general trend here seems to be that if you don't have much data (N<=256), then it's almost always better to get more data at a much higher error rate (0.1 or 0.2 versus 0.0). Once you have a reasonable amount of data, then it starts paying to be more noise-free. Eg., 256 examples with 0.05 noise is just about as good as 1024 examples with 0.2 noise. This roughly concurs with the theorem (at least in terms of the trends).

I think the take-home message that's perhaps worth keeping in mind is the following. If we only have a little time/money for annotation, we should probably annotate more data at a higher noise rate. Once we start getting more money, we should simultaneously be more careful and add more data, but not let one dominate the other.

## 25 May 2007

### Math as a Natural Language

Mathematics (with a capital "M") is typically considered a formal language. While I would agree that it tends to be more formal than, say, English, I often wonder whether it's truly formal (finding a good definition of formal would be useful, but is apparently somewhat difficult). In other words, are there properties that the usual natural languages have that math does not. I would say there are only a few, and they're mostly unimportant. Secondly, note that I bolded the "a" above. This is because I also feel that Mathematics is more like a collection of separate languages (or, dialects if you prefer since none of them has an army -- except perhaps category theory) that a single language.

First regarding the formality. Typically when I think of formal languages, I think of something lacking ambiguity. This is certainly the case for the sort of math that one would type into matlab or a theorem prover or mathematica. But what one types in here is what I would call "formal math" precisely because it is not the language that mathematicians actually speak. This is perhaps one reason why these tools are not more widely used: translating between the math that we speak and the math that these tools speak is somewhat non-trivial. The ambiguity issue is perhaps the most forceful argument that math is not completely formal: operators get overloaded, subscripts and conditionings get dropped, whole subexpressions get elided (typically with a "..."), etc. And this is not just an issue of venue: it happens in formal math journals as well.

It is often also the case that even what gets publishes is not really the mathematics that people speak. Back when I actually interacted with real mathematicians, there was inevitably this delay for "formally" writing up results, essentially breaking apart developed shorthand and presenting things cleanly. But the mathematics that appears in math papers is not really in its natural form -- at least, it's often not the language that mathematicians actually work with.

To enumerate a few points: Math...

• has a recursive structure (obviously)
• is ambiguous
• is self-referential ("see Eq 5" or "the LHS" or simply by defining something and giving it a name)
• has an alphabet and rules for combining symbols
To some degree, it even has a phonology. One of the most painful classes I ever took (the only class I ever dropped) was an "intro to logic for grad CS students who don't know math" class. This class pained me because the pedagogical tool used was for the professor to hand out typed notes and have the students go around and each read a definition/lemma/theorem. After twenty minutes of "alpha superscript tee subscript open parenthesis eye plus one close parethesis" I was ready to kill myself. It has no morphology that I can think of, but neither really does Chinese.

Moving on to the "a" part, I propose a challenge. Go find a statistician (maybe you are one!). Have him/her interpret the following expression: . Next, go find a logician and have him/her interpret . For people in the respective fields, these expressions have a very obvious meaning (I'm guessing that most of the readers here know what the second is). I'm sure that if I had enough background to drag up examples from other fields, we could continue to perplex people. In fact, even though about 8 years ago I was intimately familiar with the first sort of expression, I actually had to look it up to ensure that I got the form right (and to ensure that I didn't make a false statement). This reminded me somewhat of having to look up even very simple words and expressions in Japanese long after having used it (somewhat) regularly. I think that the diversity of meaning of symbols and subexpressions is a large part of the reason why most computer algebra systems handle only a subset of possible fields (some do algebra, some calculus, some logic, etc.). I believe in my heart that it would be virtually impossible to pin down a single grammar (much less a semantics!) for all of math.

So what does this have to do with an NLP blog? Well, IMO Math is a natural language, at least in all the ways that matter. So why don't we study it more? In particular, when I download a paper, what I typically do is first read the abstract, then flip to the back to skim to results, and then hunt for the "main" equation that will explain to me what they're doing. For me, at least, this is much faster than trying to read all of the text, provided that I can somewhat parse the expression (which is only a problem when people define too much notation). So much information, even in ACL-style papers, but more-so in ICML/NIPS-style papers, is contained in the math. I think we should try to exploit it.

## 23 May 2007

### ICML 2007 papers up

See here; there seem to be a lot of good-looking papers (I just printed about 20). It's already indexed on WhatToSee so go ahead and try that out. Here are top words from titles:

• learn (51) -- shocking!
• model (16)
• kernel (16) -- an I thought this was just a fad
• cluster (13)
• algorithm (13)
• structur (10) -- this is both for "structured prediction" as well as "structure learning" as well as other things
• machin (10)
• function (10) -- as in "learning X functions" -- X is similarity or distance or basis or...
• data (10)
• supervis (8)
• process (8) -- both Gaussian and Dirichlet
• linear (8)
• featur (8)
• discrimin (8)
• classif (8)
• analysi (8)

## 15 May 2007

### Whence JCLR?

Journal publication is not too popular for NLPers -- we tend to be a conference driven bunch. While I could care less about some arguments for journals (eg., the folks on tenure committees like them), I do feel that they serve a purpose beyond simply acting as an archive (which things like arxiv.org, the ACL anthology, citeseer, rexa, etc. do anyway). In particular, a journal paper is often a place where you get to really set the stage for your problem, describe your algorithms so that they're actually reimplementable, and go in to serious error analysis. Certainly not every paper that appears in a *ACL should continue on to the journal path, but many times a handful of papers could be merged.

One significant problem is that we're currently really limited in our choice of publication venues. Computational Linguistics (MIT Press) is definitely the place to publish a journal paper if you can. Unfortunately, CL only puts out four issues per year, each with about 4-5 papers. Sure there aren't hundreds of good papers per year, but I have to believe there are more than 16-20. Moreover, I don't feel that CL actually mirrors the *ACL proceedings -- there are many papers published in CL that I don't think match with the general sensitivities of *ACL. In addition to the small number of CL papers, the turn around time is quite slow. I was personally very impressed with my turnaround time two years ago (date of submission -> date of publication was about a year) and I know that Robert Dale (who's editing now) has done a lot to try to improve this. But still, a year is a long time. And I've heard of papers that take several years to get through. Finally, it's not open. I hate pay-for-access journals almost as much as I had pay-for-access conference proceedings. Sure, if you attend an *ACL you get it for "free" and most universities have agreements, but this is more a principle thing than a practical thing.

Things were similar in machine learning land about six years ago (though in fact I think they were worse). The big journal there was Machine Learning (published by Springer). They had roughly the same problems, to the extent that a large fraction of the editorial board resigned to found the Journal of Machine Learning Research (JMLR). JMLR has since become very successful, publishes dozens of papers per year, and has incredibly quick turnaround (I have seen a journal version of a NIPS paper appear in JMLR before NIPS even happens). The creation of JMLR was greatly assisted by the SPARC group, which helps fledgling journal get off the ground.

I would love to see a similar thing happen in the NLP community. I, personally, cannot make this happen (I don't have enough weight to throw around), but in talking to colleagues (the majority of whom also don't have enough weight) this seems to be something that many people would be in favor of. I don't think it has to be a "JCLR is better than CL" sort of thing; I think it's very possible for both to co-exist, essentially serving slightly different purposes for slightly different communities. In particular, aside from fast turnaround and online pubs, some things that I would love to see happen with such a journal are: Strongly encouraged sharing of code/data (if one could build in some sort of copyright protection for private data, this would be even better since it would let more people share); and a built-in board for paper discussion (probably with membership); ability for authors to easily submit addenda.

A while back I went through the SPARC suggestions of how to begin such a thing and it's very non-trivial. But it's doable. And I'd be willing to help. The biggest thing that would be required would be a bunch of people with white hair who are willing to commit body-and-soul to such a move.

## 09 May 2007

### WhatToSee

I've been using a small number of Perl scripts for helping me decide what papers to read and I just put them up on the web for people to play with. See http://hal3.name/WhatToSee. It's currently queued with a few recent years of ACL, NAACL, ICML and NIPS. Yes, it would be more useful if conferences were to publish the proceedings before the conference, but don't complain to me about that. Feel free to submit additional indices if you want (an index has to be an online webpage that contains links that point to online .PDF files -- if no such page exists, you can always create one on your own webpage and point to that). There are probably ways to break it, but hopefully that won't happen frequently.

## 07 May 2007

### Non-linear models in NLP

If you talk to (many) "real" machine learning people, they express profound disbelief that almost everything we do in NLP is based on linear classifiers (maxent/lr, svms, perceptron, etc.). We only rarely use kernels and, while decision tress used to be popular, they seem to have fallen out of favor. Very few people use them for large-scale apps. (Our NYU friends are an exception.)

There are two possible explanations for this. (1) we really only need linear models; (2) we're too lazy to use anything other than linear models (or, alternative, non-linear models don't scale). My experience tells me that for most of our sequence-y problems (parsing, tagging, etc.), there's very little to be gained by moving to, eg., quadratic SVMs. I even tried doing NE tagging with boosted decision trees under Searn, because I really wanted it to work nicely, but it failed. I've also pondered the idea of making small decision trees with perceptrons on the leaves, so as to account for small amounts of non-linearity. Using default DT construction technology (eg., information gain), this doesn't seem to help either. (Ryan McDonald has told me that other people have tried something similar and it hasn't worked for them either.) Perhaps this is because IG is the wrong metric to use (there exist DTs for regression with linear models on the leaves and they are typically learned so as to maximize the "linearness" of the underlying data, but this is computationally too expensive).

One counter-example is the gains that people have gotten by using latent variable models (eg., Koo and Collins), which are essentially non-linearified linear models. In somewhat a similar vein, one could consider "edge" features in CRFs (or any structured prediction technique) to be non-linear features, but this is perhaps stretching it.

Part of this may be because over time we've adapted to using features that don't need to be non-linearified. If we went back and treated each character in a word as a single feature and then required the learning algorithm to recover what important features (like, "word is 'Bush'") then clearly non-linearity would be required. This is essentially what vision people do, and exactly the cases where things like deep belief networks really shine. But so long as we're subjecting our learning algorithms to featuritis (John Langford's term), perhaps there's really not much to gain.

## 03 May 2007

### Future DUC Plans

(Guest post by Lucy Vanderwende -- Thanks, Lucy!)

There was a general sigh of relief upon hearing the proposal for the next DUC to occur in November 2008, giving us all a nice long lead time for system development. The change comes about by co-locating DUC with TREC, which predictably occurs once a year, in November, rather than the erratic schedule that DUC has followed in the past few years. As a result, system submissions will no longer be due at the same time as the deadline for submission to a major NLP conference: another deep sigh of relief. Perhaps we will see more DUC workshop papers extended and published as conference publications, now that there is more time to do system analysis after the DUC results have been made available. Did you know that we should send NIST a reference to every paper published that uses the DUC data?

Co-locating with TREC has another change in store for DUC (though not yet set in stone). DUC will, likely, have two tracks: summarization and question-answering (the QA track will move out of TREC and into DUC). At the DUC2007 workshop, Hoa Trang Dang gave us an overview of the types of questions and question templates that are currently the focus of TREC QA. Workshop participants were very interested, and the opportunities for synergy between the two communities seem plentiful.

For summarization, DUC will continue to have a main task and a pilot task. NIST's proposal is for the 2008 main task to be Update Summarization (producing short, ~100 words, multi-document summaries in the context of a set of earlier articles); everyone at the workshop was excited by this proposal, and several groups have already worked on update summarization since it was the pilot for 2007. NIST's proposal for the 2008 pilot task is summarization of opinions from blogs; this is less certain and there are some remaining questions, for example, whether it would be "single blog" or "multiple blog" summarization, summarization of initial blog post or blog post + comments, and what the summary size ought to be. There was a lot of enthusiasm for this topic, especially because there was a TREC track on blogs in which for the last two years they have tagged blogs as positive or negative wrt opinions, i.e., there's at least some training data.