27 April 2007

EMNLP papers, tales from the trenches

First, EMNLP/CoNLL papers have been posted. I must congratulate the chairs for publishing this so fast -- for too many conferences we have to wait indefinitely for the accepted papers to get online. Playing the now-standard game of looking at top terms, we see:

  1. model (24)
  2. translat (17)
  3. base (16)
  4. learn (13) -- woohoo!
  5. machin (12) -- mostly as in "machin translat"
  6. word (11) -- mostly from WSD
  7. structur (10) -- yay, as in "structured"
  8. disambigu (10) -- take a wild guess
  9. improv (9) -- boring
  10. statist, semant, pars, languag, depend, approach (all 7)
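
For anyone who wants to play the same game: the counts above come from something like the sketch below (not my actual script -- the titles listed are just placeholders for the real accepted-papers list, and the stopword set is arbitrary).

# Rough sketch of the "top terms" game: stem the words in the accepted-paper
# titles and count them. The titles here are placeholders, not the real list.
from collections import Counter
from nltk.stem import PorterStemmer

titles = [
    "Improving Statistical Machine Translation with Structured Models",
    "Word Sense Disambiguation with Dependency Features",
    # ... the rest of the accepted titles would go here ...
]

STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "via", "with"}
stem = PorterStemmer().stem

counts = Counter(
    stem(word)
    for title in titles
    for word in title.lower().replace("-", " ").split()
    if word not in STOPWORDS
)

for term, n in counts.most_common(10):
    print(f"{n:3d}  {term}")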

EMNLP/CoNLL this year was the first time I served as area chair for a conference. It was an interesting and enlightening experience that has drastically changed my view of the conference process, and overall a very good one: Jason Eisner did a fantastic job of keeping things on track.

I want to mention a few things that I noticed about the process:
  1. The fact that reviewers had only a few weeks, rather than a few months, to review didn't seem to matter. Very few people I asked to review declined. (Thanks to all of you who accepted!) It seems that, in general, people are friendly and happy to help out with the process. My sense is that 90% of the reviews get done in the last few days anyway, so having 3 weeks or 10 weeks is irrelevant.

  2. Assigning papers to reviewers is hard, especially when I don't personally know many of the reviewers (this was necessary because my area was broader than my own expertise). Bidding helps a lot here, but the process is not perfect. (People bid differently: some "want" to review only 3 papers, some "want" half...) If done poorly, this can lead to uninformative reviews.

  3. Papers were scored 1-5, with half-points allowed if necessary. None of my papers got a 5; one got a 4.5 from one reviewer. Most got between 2.5 and 3.5, which is highly uninformative. I reviewed for AAAI this year, where you had to give a 1, 2, 3, or 4, which forces you not to abstain. This essentially shifts responsibility from the area chair to the reviewer. I'm not sure which approach is better.

  4. EMNLP asked for many criteria to be evaluated by reviewers; more than in conferences past. I thought this was very useful to help me make my decisions. (Essentially, in addition to high overall recommendation, I looked for high scores on "depth" and "impact.") So if you think (like I used to) that these other scores are ignored: be assured, they are not (unless other area chairs behave differently).

  5. Blindness seems like a good thing. I've been very back and forth on this until now. This was the first time I got to see author names with papers and I have to say that it is really hard to not be subconsciously biased by this information. This is a debate that will never end, but for the time being, I'm happy with blind papers.

  6. A 20-25% acceptance rate is more than reasonable (for my area -- I don't know about others). I got 33 papers, of which three basically stood out as "must accepts." There were then a handful over the bar, some of which got in, some of which didn't. There is certainly some degree of randomness here (I believe largely due to the assignment of papers to reviewers), and if that randomness hurt your paper, I truly apologize. Not to belittle the vast majority of papers in my area, but I honestly don't think that the world would be a significantly worse place if only those top three papers had gotten in. This would make for a harsh 9% acceptance rate, but I don't have a problem with this.

    I know that this comment will probably not make me many friends, but probably about half of the papers in my area were clear rejects. It seems like some sort of cascaded approach to reviewing might be worth consideration. The goal wouldn't be to reduce the workload for reviewers, but to have them concentrate their time on papers that stand a chance. (Admittedly, some reviewers do this anyway internally, but it would probably be good to make it official.)

  7. Discussion among the reviewers was very useful. I want to thank those reviewers who added extra comments at the end to help me make the overall decisions (which then got passed up to the "higher powers" to be interpolated with other area chairs' decisions). There were a few cases where I had to recruit extra reviewers, either to get an additional opinion or because one of my original reviewers went AWOL (thanks, all!). I'm happy to say that overall, almost all reviews were in on time, and without significant harassment.

So that was my experience. Am I glad I did it? Yes. Would I do it again? Absolutely.

Thanks again to all my reviewers, all the authors, and especially Jason for doing a great job organizing.

8 comments:

Anonymous said...

Thanks for the post, but what can you say about the effect that sample selection has on each reviewer's scores?

I mean, obviously everyone is inclined (consciously or not) to have a quasi-normal distribution over his scores -- a very strong paper usually raises the reviewer's standards when looking at the next one; you just can't help it! Of course there are clear accept or reject papers, and I believe what I am addressing here usually pertains to the in-between categories (you mentioned something related to this in your post).

Have you seen this effect in your area, perhaps by looking at how the reviewers of those top three papers assigned scores to other papers? Is it a problem, and how can it be solved?

It looks to me like a chicken-and-egg problem, since you need to have a rough idea about a paper before assigning it, and that supports your cascading approach. Do you think knowing the authors of a paper would help in this cascading approach (at least at the area-chair scale, in assigning papers), or would it just add to the randomness and increase the bias?

Anonymous said...

In other words, whose role is it to smooth this variance, and how did you handle this yourself? I just feel it is overwhelming to go and read every single borderline paper -- and the problem gets exacerbated as you move up the reviewing hierarchy.

hal said...

There wasn't much variance in what I pushed forward as the top three papers. But that's somewhat of a chicken-and-egg statement: they may not have been the top three had they had variance :). For good-looking papers that did have high variance, I asked the reviewers to discuss. Usually they did not change their scores (sometimes by 0.5 points). But the discussion was quite helpful in understanding why there was high variance. Typically it was due to the fact that different reviewers actually *want* different things out of a paper.

I think it's important to separate out variance due to noise and variance due to reviewer preferences. I would venture that for the majority of the pretty good papers in the track (maybe 10ish), the primary variance that existed was due to reviewer preferences.

Of course, reviewer assignments introduce variance too, due primarily to the fact that different reviewer biases can show up as variance. If you assign a theoretically strong but empirically weak paper to three empirically-minded reviewers, the variance will be artificially low.

Anonymous said...

Aha, I like the last paragraph of your reply. Looks like we must accept living with some randomness in the process.

Here is a suggestion.

I always wondered why a reviewer needs to enter a final evaluation score if he/she has already entered the breakdown along different criteria. Why don't we just have a given theme for the conference -- like what you mentioned before about ACL: this year we would like to stress innovative ideas as opposed to empirical rigor. The chair puts weights on each criterion, reviewers enter detailed scores, and that is it: the final recommendation is there as a weighted average (a toy sketch is at the end of this comment).

I think this would enforce more overall consistency and would somehow downweight the reviewer-preference bias.

agree/disagree??
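
Something like this toy sketch, where the criterion names and weights are just made up for illustration:

# Toy version of the weighted-average idea: the chair fixes the weights once,
# reviewers enter only per-criterion scores, and the overall recommendation
# falls out automatically. The criteria and weights here are invented.
WEIGHTS = {"originality": 0.35, "impact": 0.25, "depth": 0.25, "clarity": 0.15}

def overall_recommendation(scores):
    """Weighted average of one reviewer's per-criterion scores (1-5 scale)."""
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)

review = {"originality": 4.0, "impact": 3.5, "depth": 4.0, "clarity": 3.0}
print(f"{overall_recommendation(review):.2f}")  # roughly 3.7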

Anonymous said...

More on the topic. I really don't know why we always throw away these valuable data!

I mean, why don't we make these detailed/overall score data public (after some anonymization, for sure)? We could do cool stuff with it, like:

- try to fit an overall regressor and understand how people derive the final recommendation from the detailed ones.
- fit a mixture of regressors and see if we end up with meaningful semantic classes (empirically-minded people vs. theoretically-minded ones).

And the opportunities are endless here. Of course you don't need to disclose the papers themselves, although some sort of summaries (like word counts) would enable more cool stuff (topic models?).

At the least, automatic analysis that provides enough of a summary could help the chair(s) focus their effort in the right direction and help them better smooth the overall individual decisions!
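
For the first suggestion, a minimal sketch of what I mean -- the score data below is synthetic and the criterion names are invented; a real version would use the released (anonymized) review scores:

# Fit a regressor from per-criterion scores to the overall recommendation,
# to see how reviewers implicitly weight the criteria. All data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
n_reviews = 200

# columns: originality, impact, depth, clarity -- scores on a 1-5 scale
criteria = rng.uniform(1, 5, size=(n_reviews, 4))

# pretend reviewers secretly weight impact and depth the most, plus noise
hidden_weights = np.array([0.2, 0.35, 0.3, 0.15])
overall = criteria @ hidden_weights + rng.normal(0, 0.3, size=n_reviews)

# least-squares fit (with an intercept column) recovers the implicit weights
X = np.column_stack([criteria, np.ones(n_reviews)])
coef, *_ = np.linalg.lstsq(X, overall, rcond=None)
print("recovered weights:", coef[:4].round(2))
print("intercept:", round(float(coef[4]), 2))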

hal said...

FYI, Fernando Pereira has something to say.

I've also wondered why they don't do what you suggest. I think that people psychologically don't like mechanized reviewing, but that's not a good excuse. I think that what ends up happening in practice is that the overall recs are used to filter the top, then the ACs read/skim those top papers, and perhaps order them by some combination of weights that they deem appropriate. Why the first step couldn't happen automatically, I don't know. It would be an interesting multilevel regression problem to see how people actually behave wrt giving overall scores :).

So this just shifts the bias to the area chair, and you'd better hope the area chairs have good biases, or that the higher-ups can instill a good bias. Importantly, though, the ACs get a much more global picture than any single reviewer, so the bias effect is probably somewhat mitigated.

hal said...

Yay simultaneous posts.

I have many thoughts along the lines of making review information public, but I think that will have to wait for another post. I think the short answer is that people fear "repercussions" from various forms of non-anonymity (nymity?).
