## 27 December 2015

### NIPS 2015 Retrospective

There have already been lots of really nice overviews of NIPS 2015. I don't have much to add, largely because I missed almost every talk and a large fraction of the poster sessions.

Before beginning, I'm really happy and somewhat humbled to have been able to participate in the quizbowl demo, which won the best demo award: congrats to Mohit Iyyer, He He and Jordan Boyd-Graber for doing all the hard work!

Here are some things I did see, and that I liked, or that other people pointed me to as something I would probably like but I haven't actually seen.

• Logarithmic Time Online Multiclass prediction by Anna Choromanska and John Langford. This paper has been on arxiv for a while, and has been built into vw for a while also, so I've known about this stuff for probably a year now, but I still think it's great. The task is to build, in an online way, a binary tree for doing multiclass to binary reductions, in a way that gives you nice regret guarantees. The tree changes over time. One interesting property is that a single class can me represented in multiple leaves which gives added representational capacity to common/difficult classes, something most approaches lack.
• Probabilistic Line Searches for Stochastic Optimization by Maren Mahsereci and Philipp Hennig. I didn't see this paper, but I heard from several people that they liked it. The idea is that conventional optimization (eg BFGS) gets a lot of mileage out of line search, something we don't conventionally do in stochastic optimization (like sgd), but perhaps we should. It's on my reading list.
• Grammar as a Foreign Language by Orio Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever and Geoffrey Hinton. This paper thinks of parsing as the output of a sequence-to-sequence model, basically outputting a tree as a sequence of symbols. This doesn't work if you just train it on labeled data. Instead, you have to train a state of the art parser, run it on a huge pile of unlabeled data, and then train the seq2seq model to predict that. I like this because this is what Percy, Dan and I tried to get to work in the structured compilation paper, but could never make it work (probably because we lacked a sufficiently rich hypothesis class). It's basically more evidence, IMO, that seq2seq models are really good memorizers: I would be interested to see an analysis of how much generalization is actually learned (I suspect quite little).
• The Human Kernel by Andrew Wilson, Christoph Dann, Chris Lucas and Eric Xing. I like this paper in the same way I like Jerry Zhu's "machine teaching" work. It's about the interplay between human learning and machine learning. If you haven't read this paper, it's a fun read.
• Calibrated Structured Prediction by Volodymyr Kuleshov and Percy Liang. The observation in this paper is the confidence estimation is important for structured prediction problems, but faces its own challenges. They approach it from a calibration perspective and adjust the notions of calibration to be appropriate for structured problems and marginal inference.
On the not-so-great side, as Kate Crawford pointed out on twitter, the diversity level for one of the NIPS symposia was quite poor, and this goes for pretty much all of the events at NIPS (other symposia, workshops, program committee, organizing committee, etc.). This is not limited to NIPS: apparently, WiML is now slightly larger than AIStats and there were approximately as many men at WiML this year as there were women at AIStats. Nando recently discussed similar issues in his recent Reddit AMA. There's at least some awareness here, but awareness is not enough.

Overall, I enjoyed NIPS, especially catching up with old friends, doing a talking machines interview (should be posted early 2016), and seeing what's going on in a field whose size has increased dramatically since I started going eleven years ago!

## 30 October 2015

### How sausage got made once

When I was working on what turned into an old short paper (Markov Random Topic Fields) I decided it might be pedagogically interesting to keep a journal of what I was doing. This journal ended when I ran out of steam and I never got back to it. My whole original idea was, after the paper got published, post everything: the journal, the code, the paper, the paper reviews, etc. It's now been 6 years and that's not going to happen, but in case anyone finds it interesting, here in the report.

Anyway, here is the report. I'm posting this so that perhaps new students can see that things don't ever work the first time, faculty still have trouble getting their code to work, etc.

The progress of a research idea

============= DAY 1 =============

* Idea

Want to do a form of topic modeling, but where there is meta
information.  There are ways to do this, eg., Supervised LDA or
Dirichlet-Multinomial Regression.  These both operate on a *feature*
level.  For some tasks, it is more natural to operate over a graph.
Along these lines, there's Pachinko Allocation, but this posits a
graph over the vocabulary, not over the documents.  (Plus, it is a
DAG, which doesn't make sense for our application.)

Question: how can we augment a standard topic model (eg., LDA), with
an underlying graph, where we assume topics vary smoothly over the
graph?

* Technology

What technology exists for statistical modeling over graphs?  Sounds
like a Markov Random Field.  So let's marry topic models (LDA) with
MRFs, to give a "Topical Markov Random Field" (TMRF).

We think of LDA a generating documents by first choosing a topic
mixture \theta, and then choosing topics z=k for each word w, where w
is drawn from a multinomial \beta_k.

Where can a graph fit in this?  The first idea is to put an MRF over
\theta.

* MRF over \theta

If we have an MRF over theta, then two issues arise.  First, we almost
certainly can't collapse out theta as we might like.  Okay, we'll live
with that.

Second, from an MRF perspective, what do the potential functions look
like?

The simplest idea is to use pairwise potentials of the form e^{-dist},
where dist is the distance between two thetas that touch on the
graph.  What Distance metric should we use?  We could use
Bhattacharyya, Hellinger, Euclidean, LogitEuclidean, etc.  Let's start
with Hellinger.

What about a variance?  We could have lengths in the graph that are
either latent or known.  Let's say they're latent and our potentials
have the form e^{-dist/l}, where l is the length (so that if you're
far away, distance doesn't matter.

** Getting data together

We have about 1000 docs and three graphs over those docs.  We get them
in a reasonable format and then subsample about 400 of the docs.  We
do this both for speed and to make sure we don't overfit the model on
the data too much.

============= DAY 2 =============

** Implementation

We implement this with an option to use or not use graphs (so we can
tell if they're helping).  We collapse out \beta, but not \theta in
both cases, and we compute log posteriors.

We run first on some simple test data (testW) from HBC and find that
it seems to be doing something kind of reasonable.  We then run on
some real data and it puts everything in one cluster after about 20-30
Gibbs iterations.

Debugging: First, we turn off all graph stuff (sampling lengths) and
things are still broken.  Then we initialize optimally and things are
still broken.  Then we turn off resampling \theta and things are still
broken.  The problem is that I'm using the collapsed \beta incorrectly
when sampling z.  I fix it and things work as expected (i.e., not
everything goes in the same cluster).

** Evaluating

So now the code seems to be working, so we want to evaluate.  We run a
model with an without a graph (where the graph is something we expect
will help).  The posterior probabilities coming out of the two
different models are all over the place.

So we do the standard trick of holding out 20% of the data as "test"
data and then evaluating log likelihood on the test.  Here, we do the
"bad" thing and just use 20% of the words in each document (so that we
already have \theta for all the documents).  Not great, but easy to
implement.  This time, no bugs.

At this point, it's a pain to recompile for every configuration change
and we'd like to be able to run a bunch of configs simultaneously.  So
we add a simple command line interface.

In order to evaluate, we plot either posteriors or held-out
likelihoods (usually the latter) as a function of iteration using
xgraph (interacts nicely with the shell and I'm used to it).

Things now seem mixed.  There's very little difference when you're not
using a graph between sampling \theta from the true posterior versus
using an MH proposal (this is good for us, since we have to use MH).
There is also little difference between the baseline LDA model and the
MRF models.

We turn off sampling of the lengths and just fix them at one.  For the
three graphs we're trying, only one seems to be doing any better than
the baseline LDA model.

** Optimization

Now that we're running experiments, we find that things are taking way
too long to run.  So we do some optimization.  First, we cache the sum
of all \beta_k posteriors.  This helps.  Second, we flip \beta from
\beta_{k,w} to \beta_{w,k}, which we've heard helps.  It doesn't.  We
put it back the other way.

All the time is being spent in resample_z, so we waste a half day
trying to figure out if there's a way to only resample a subset of the
zs.  For instance, track how often they change and only resample those
that change a lot (probabilistically).  This hurts.  Resampling those
with high entropy hurts.  I think the issue is that there are three
types of zs.  (1) those that change a lot because they have high
uncertainty but are rare enough that they don't really matter, (2)
those that change a lot and do matter, (3) those that just don't
change very much.  Probably could do something intelligent, but have

In order to really evaluate speed, we add some code that prints out
timing.

We do one more optimization that's maybe not very common.  resample_z
loops over docs, then words, then topics.  For each word, the loop
over topics is to compute p(z=k).  But if the next word you loop over
is the same word (they are both "the"), then you don't need to
recompute all the p(z=k)s -- you can cache them.  We do this, and then
sort the words.  This gives about a 20% speedup with no loss in
performance (posterior or held-out likelihood).

** Evaluating again

Since we had some success fixing the lengths at 1, we try fixing them
at 20.  Now that one graph is doing noticably better than the baseline
and the other two slightly better.  We try 5 and 10 and 50 and nothing
much seems to happen.  20 seems like a bit of a sweet spot.

============= DAY 3 =============

** A more rigorous evaluation

Running with lengths fixed at 20 seems to work, but there's always
variance due to randomness (both in the sampling and in the 80/20
split) that we'd like to account for.

So we run 8 copies of each of the four models (8 just because we
happen to have an 8 core machine, so we can run them simultaneously).

Now, we require more complex graphing technology than just xgraph.
We'll probably eventually want to graph things in matlab, but for now
all we really care about it how things do over time.  So we write a
small perl script to extract scores every 50 iterations (we've
switched from 200 to 300 just to be safe) and show means and stddevs
for each of the models.

While we're waiting for this to run, we think about...

* MRF over z?

Our initial model which may or may not be doing so well (we're waiting
on some experiments) assumes an MRF over \th.  Maybe this is not the
best place to put it.  Can we put it over z instead?

Why would we want to do this?  There are some technological reasons:
(1) this makes the MRF discrete and we know much better how to deal
with discrete MRFs.  (2) we can get rid of the MH step (though this
doesn't seem to be screwing us up much).  (3) we no longer have the
arbitrary choice of which distance function to use.  There are also
some technological reasons *not* to do it: it seems like it would be
computationally much more burdensome.

But, let's think if this makes sense in the context of our
application.  We have a bunch of research papers and our graphs are
authorship, citations, time or venue.  These really do feel like
graphs over *documents* not *words*.

We could turn them in to graphs over words by, say, connecting all
identical terms across documents, encouraging them to share the same
topic.  This could probably be done efficiently by storing an inverted
index.  On the other hand, would this really capture much?  My gut
tells me that for a given word "foo", it's probably pretty rare that
"foo" is assigned to different topics in different documents.  (Or at
least just as rare as it being assigned to different topics in the
same document.)  Note that we could evaluate this: run simple LDA, and
compute the fraction of times a word is assigned it's non-majority
topic across all the data, versus just across one documents.  I
suspect the numbers would be pretty similar.

The extreme alternative would be to link all words, but this is just
going to be computationally infeasible.  Moreover, this begins to look
just like tying the \thetas.

So for now, we put this idea on the back burner...

* MRF over \theta, continued...

We're still waiting for these experiments to run (they're about half
of the way there).  In thinking about the graph over z, though, it did
occur to me that maybe you have to use far more topics than I've been
using to really reap any benefits here.  We begin running with 8, and
then bumped it up to 20.  But maybe we really need to run with a lot
more.

So, I log in to two other machines and run just one copy with 50, 100,
150 and 200 topics, just for vanilla LDA.  The larger ones will be
slow, but we'll just have to go do something else for a while...

============= DAY 4 =============

* MRF over \theta, re-continued...

Both experiments finish and we see that: (1) with 20 topics and
lengths fixed at 10, there is no difference between raw LDA and TMRF.
(2) More topics is better.  Even 200 topics doesn't seem to be
overfitting.  Going 20, 50, 100, 150, 200, we get hidden log
likelihoods of -1.43, -1.40, -1.38, -1.36, -1.36 (all e+06).  The
significance (from the first experiments) seems to be around .002
(e+06), so these changes (even the last, which differs by 0.005) seem
to be real.  Since we weren't able to overfit, we also run with 300
and 400, and wait some more...

...and they finish and still aren't overfitting.  We get -1.35 and
-1.35 respectively (differing by about 0.004, again significant!).

This is getting ridiculous -- is there a bug somewhere?  Everything
we've seen in LDA-like papers shows that you overfit when you have a
ton of topics.  Maybe this is because our documents are really long?

** Model debugging

One thing that could be going wrong is that when we hide 20% of the
words, and evaluate on that 20%, we're skewing the evaluation to favor
long documents.  But long documents are probably precisely those that
need/want lots of topics.  Our mean document length is 2300, but the
std is over 2500.  The shortest document has 349 words, the longest
has 37120.  So, instead of hiding 20%, we try hiding a fixed number,
which is 20% of the mean, or 460.

============= DAY 5 =============

At this point, we've done a large number of evals, both with 20% hid,
and 460 words/doc hid (actually, the min of this and doclen/2), and
50-1600 (at *2) topics.  We do actually see a tiny bit of overfitting
at 800 or 1600 documents.

Then we do something we should have done a long time ago: go back and
skim through some LDA papers.  We look at the BNJ 2003 JMLR paper.  We
see that one of the selling points of LDA over LSI is that
it *doesn't* overfit!  Aaargh!  No wonder we haven't been able to get
substantial overfitting.

However, we also notice something else: aside from dropping 50 stop
words (we've been dropping 100), on one data set they don't prune rare
words at all, and on the other they prune df=1 words only.  We've been
pruning df<=5 or <=10 words (can't remember which).  Perhaps what's
going on is that the reason the graphs aren't helping is just because
there vocabulary (around 3k) isn't big enough for them to make a
difference!

We recreate the text, pruning only the df=1 words.  This leads to a
vocab of about 10k (which means inference will be ~3 times slower).
We run at 50, 100, 200 and 400 and we actualy see a tiny bit of
overfitting (maybe) on 400.  We accidentally only ran 100 iterations,
so it's a bit hard to tell, but at the very least there's
no *improvement* for going from 200 topics to 400 topics.  Strangely
running on the new text is actually about 5-20% *faster*, despite the
larger vocabulary!

** Re-running with graphs

At this point, we're ready to try running with graphs again.  Despite
the fact that it's slow, we settle on 200 topics (this took about 3.5
hours without graphs, so we will be waiting a while).  We also want to
run for more iterations, just to see what's going to happen: we do 200
again.

And again there's not much difference.  One of the graphs seems to be
just barely one std above everyone else, but that's nothing to write

============= DAY 6 =============

* Abstracts only?

At this point, things are not looking so spectacular.  Perhaps the
problem is that the documents themselves are so big that there's
really not much uncertainty.  This is reflected, to some degree, by
the lack of variance in the predictive perplexities.

So we rebuild the data on abstracts only.  This makes running
significantly faster (yay).  We run 5 copies each of 10, 20, 40, 80,
160 and 320 topics.  40 is a clear winner.  80 and above overfit

Now, we turn on graphs and get the following results (5 runs):

40-top-nog.default      -69239.4 (86.2629700392932)
40-top-nog.auth         -68920.0 (111.863756418243)
40-top-nog.cite         -68976.4 (174.920839238783)
40-top-nog.year         -69174.2 (133.769204228776)

If we compare default (not graphs) to auth (best), we see that we get
a 2-3 std separation.  This is looking promising, FINALLY!  Also, if
we plot the results, it looks like auth and cite really dominate.
Year is fairly useless.

It suggests that, maybe, we just need more data to see more of a
difference.

* Getting More Data

There are two ways we could get more data.  First, we could crawl
more.  Second, we could switch over to, say, PubMed or ACM.  This
would work since we only need abstracts, not full texts.  These sites

============= DAY 7 =============

Okay, ACM is a pain.  And it's really not that complete, so we switch
over to CiteSeer (don't know why we didn't think of this before!).  We
seed with docs from acl, cl, emnlp, icml, jmlr, nips and uai.  We
notice that CiteSeer is apparently using some crappy PDF extractor (it
misses ligatures!  ugh!) but it's not worth (yet!) actually
these seeds, we run 10 iterations of reference crawling, eventually
ending up with just over 44k documents.  We extract a subset
comprising 9277 abstracts, and six graphs: author, booktitle,
citation, url, year and time (where you connect to year-1 and year+1).
The 9k out of 44k are those that (a) have >=100 characters
"reasonable" in the abstract and (b) have connections to at least 5
other docs in *all* the graphs.  (We're no longer favoring citations.)
We then begin the runs again....



## 05 October 2015

### A small observation for prepubs on arxiv

The question of how "traditional conference publication" should react to arxiv prepublications is raised quite frequently. I'm not particularly shy about the fact that I'm not a fan, but that's not what this post is about. This post is about data.

In any discussion about the "arxiv question," proponents of the idea typically cite the idea that by posting papers early on arxiv, they are able to get feedback from the community about their work. (See for example here, which at least tries to be balanced even if the phrasing is totally biased, for instance in the poll at the end :P.)

At any rate, the question I was curious about is: is this actually borne out in practice?

I did the following experiment. Arxiv nicely lets us see revision histories. So we can see, for instance, whether a paper that was placed on arxiv before notifications for the corresponding conference have gone out, is updated more than a paper that was placed on arxiv after notifications.

For NIPS papers that were first posted to arxiv after camera ready, 75% were never updated and 19% were updated once (on average, they were updated 0.36 times +- 0.713 std).

For papers that were first posted to arxiv before notification, all were updated at least once. The real question is: how many times were they updated between posting to arxiv and acceptance to the conference. The answer is that 82% were never updated during that period. Of course, all were updated at some point later (after the camera ready deadline), and 55% were updated only once, and another 18% were updated twice.

[Note: I only count updated that come at least two week after the first posting to arxiv because before is more likely to be typo fixing, rather than real feedback from the community.]

The sample size is small enough that I can actually look at all of the ones that were updated between posting to arxiv and notification. One of these seems was first posted in mid-Feb, updated twice is late March, and then again in Nov (acceptance) and Dec (camera ready). Another is very similar. Two were most likely posted the previous year when it was submitted to AIStats (the dates match up) and then updated when submitted to NIPS. Those were the only four, and two of them seem like a legit possible case of update due to community feedback.

As far as the question of "rapid impact on the field" this is harder to answer. I took a random sample of ten papers from each of the groups (prepub versus non-prepub) and got citation counts from google scholar. The median citation count was 10 for both sets. The average was slightly higher for the prepub set (15 versus 11, but with giant standard deviations of 12 and 16). Considering the prepub set has been out at least 6 months longer (this is NIPS 2013 and 2014 so this is a sizeable percentage), this is a pretty small difference. And it's a difference that might be attributable to other factors like "famous people are perhaps more likely to prepub" [actually it's not clear the data play this out; in a totally unscientific study of "does Hal think this person is famous" and a sample of 20 for each, it's even split 10/10 in both sets].

Anyway, I'm posting this because I've heard this argument many times and I've always felt it's a bit dubious. I've never seen data to back it up. This data suggests it's not true. If someone really believes this argument, it would be nice to see it backed up with data!

[Notes: I took only papers that were marked on arxiv as having appeared in NIPS, and which were first posted to arxiv in 2013 or 2014; this is 175 papers. I then hand-checked them all to exclude things like workshops or just submissions, and labeled them as to whether they appeared in 2013 or 2014.  That left a sample of papers. The rest of the data was extracted automatically from the arxiv abstract. The total number that was posted before notification (the prepub cases) is 22 (or 27%) and the total number that were posted after notification is 59 (or 73%). So the sample is indeed small. Not much I can do about that.]

## 11 September 2015

### How long'll it take to say that?

tl;dr: Given a string of text in some language you might want to know how long it would take to speak it. Here are some perl/python "one-liners" to estimate that. Currently supports English (US), German, French, Italian, Spanish and Japanese. The estimates are all in seconds.

Brave are those who read the source code. I promise it's not intentionally obfuscated -- there's just a lot of unicoding going on, and then copious use of the right-handed saturn operator in perl.

Why would you want code for this rather than a dictionary? Dictionaries are limited to their vocabulary, which is also sensitive to things like tokenization. Having a completely parametric solution seemed much more generalizable.

Why would I possibly want this?

We've been doing a bunch of work recently on simultaneous machine interpretation (aka "real time machine translation"). None of us is a speech person, either in the "recognition" or "synthesis" sense. This unfortunately means that to date, all of our models treat each word as "equally long" when training and, perhaps more importantly, evaluating models. For instance, if we want to measure the décalage (aka time-lag, aka ear-voice-span) between when a Japanese word is "heard" and the corresponding English translation is "spoken", we've been assuming all words take precisely 1 second to speak. This is obviously ridiculous.

One quick and dirty alternative is to use a text-to-speech (aka speech synthesis) system to synthesize a sentence and use its length as an estimate of how long it would take a person to speak it. This is a totally plausible approach since decent open source synthesis software exists (we use such software to crease these one-liners), but it's slow and bloated and can't be easily distributed. This would be more accurate, but I was after a quick and filthy solution.

How do these scripts work?

There are two functions: one for estimating the amount of time it would take to speak a single word (sayWord in the code), and another for estimating the amount of time it would take to speak a sentence (sayit in the code). I'll first describe how sayWord works; sayit is pretty straightforward.

sayWord works by extracting a bunch of features from the word to be spoken (each of these is a particular regular expression) and evaluating a linear function of the counts of the matches of those regular expressions. The process by which coefficients were generated is explained later. The features are things like: number of characters, number of vowels, number of consonant groups, number of non-letters, number of vowel-consonant switches, number of digits of various lengths, and whether the word starts or ends with a vowel; and then, for each "reasonable" unicode character for European languages, the count of those characters. Not all of these features appears for each language because I used l1 regularization to prune down the feature set.

[Note: Japanese is different because it uses a different character set. The feature structure is basically the same, but replace "consonant" with "kana" and "vowel" with "kanji" and "reasonable character" with "each of the 100 most frequent Japanese characters" in Japanese Wikipedia.]

sayit attempts to pronounce a sentence by pronouncing each word individually (via sayWord) and then rescaling the resulting estimate because we ... don't ... pause ... between ... words. This rescaling is linear, and again estimated from data (explained below).

How are the coefficients estimated for words?

Basically I take a vocabulary of the 50k-100k most frequent words for each language, use a speech synthesis program to say them (for the European languages, I used MaryTTS; for Japanese I used Open JTalk).

I then extract all the features mentioned above, create a regression problem regressing on the number of seconds it takes to speak (after removing quiet time around the word) and throwing in some l1 regularization with vw to make sure that it didn't use too many features. I optimized quantile loss (aka absolute value loss) rather than squared loss.

One thing I did not do was weight the words by their frequency. I could have done this and the resulting regression weights change a bit but not too much. At the end it spends a lot of energy making sure it estimates the speaking time of "a" and "the" and "an" correctly. Because short words are typically high frequency (and vice versa) this meant that it tended to underestimate all other words. And because my relatively frequencies were from Wikipedia, they might not match yours. Keeping a uniform distribution felt like a better solution.

I also tried not regularizing and also using things like character ngram features to get a better fit. In the end I didn't include this either. You can about halve the error rate by doing these things, but I felt like having simpler, smaller models made more sense here. After all, the thing we're regressing on (speaking time from a TTS system) is sort of an artificial benchmark anyway, so getting a bit lower error is not obviously that meaningful.

Overall the mean absolute errors in prediction are:

• German: 51ms
• English: 55ms
• French: 59ms
• Italian: 41ms
• Japanese: 33ms
If you were to just predict the median on each, you would get mean absolute errors in the 550ms (Japanese) to 980ms (English) range, so this is quite an improvement.

How are the coefficients estimated for sentences?

For sentences, I fit a model of the form:
• time = const + a * [summed time for words] + b * [# of words]
The three parameters (const, a and b) were estimated by having the TTS systems speak entire sentences (about 40k from each language, taken randomly from Wikipedia) and then using the total time from the TTS (unioned across all languages) and the corresponding features. These are simply hardcoded and shared across languages. You could of course do a bit better by doing this on a language-by-language basis, but I didn't do this again for simplicity.

Conclusion

Chances are no one except me wants this. But if you do, please feel free to use it. I'd appreciate some sort of credit though :). And if you really want the dictionaries, you can find them here: de, en-US, fr, it, ja.

Thanks to Graham Neubig for providing the tokenized Japanese Wikipedia text and pointing me to Open JTalk!

## 08 September 2015

### Overfitting

Many students, in response to their assigned reading for today's undergrad ML class, asked me for a formal definition of overfitting. The fact that I don't really have one irked me, especially this this basically is the problem in machine learning. This blog post is an extended version of the answer I gave in class.

Intuitively, overfitting is when your learned model is doing much better on training data than it will do on test data from the same distribution. This happens because your model learns to fit random idiosyncrasies in the training data rather than model the true underlying trends.

I then pointed to a recent high-profile paper by Cynthia Dwork and colleagues on Holdout Reuse (appeared behind a paywall in Science and something similar seems to be in NIPS too). This paper presents some really cool ideas that use machinery from differential privacy to allow one to effectively reuse the same heldout ("development") data multiple times, without worrying about overfitting the holdout. But that's not what I wanted to highlight; I wanted to highlight the formal definition Dwork et al. give for overfitting. They (roughly) say that a model has overfit the training data when its test error is larger than its training error (by at least some small amount).

I feel like this definition captures some of the essence of overfitting, but it also misses what I would call my "everyday experience" with using machine learning in practical, real world settings. And my everyday experience is that you training error is always lower than your test error. To confirm to myself that I wasn't misremembering my experiences, I ran a few experiments on some simple classification tasks. Below is a plot of train/dev/test error on Reuters RCV1:

Here, the x-axis is regularization strength (this is all with libsvm, C=2^-xvalue) and the y-axis is error rate. There are error bars on dev/test but they're so small you can't really see them.

The thing I want to highlight in this figure is that the best performance, measured by dev or test performance, occurs when the regularization is a bit below 0. But the point at which the gap between dev performance and training performance starts to inflate is much earlier, maybe around -4. The title of the figure tells us that the test error achieved when the training error drops below dev error is 9.48%; but when dev bottomw out the test error is 6.08%. That's leaving 33% relative error on the table.

RCV1 isn't unique. Below are the same plots for MNIST on several binary splits:

For all three splits, some easier, some harder, the relative improvement is always around 20-30% by not using the definition provided by Dwork et al.

Having somewhat convinced myself that I'm not crazy, the immediate question is: what is a better definition of overfitting that captures my everyday experience? I completely agree that the definition quoted above is a necessary condition, but somehow it does not seem sufficient.

Even though this was my experience, I felt a bit flummoxed because I had a hard time pointing out exactly what I felt was missing in the definition above.

Here's what I ended up settling on. Overfitting typically happens, as stated above, because there are random idiosyncrasies in the training data that don't generalize. This is a finite sample effect. If I give you a dataset with 1000 training examples, and I add 100 completely random features, at least one of these features is going to correlate somewhat with the training labels. Overfitting occurs when the model chooses to use one of these completely random features, which helps drive the training error down, but hurts at test time.

But there's a "throwing the baby out with the bathwater" effect going on here. Although there may be completely random features that accidentally correlate slightly with the true label, there are also weak but nevertheless useful features that correlate a small amount with the true label. On a finite sample it's going to be virtually impossible to distinguish between these two.

By adopting the "training error < test error" rubric for overfitting, we're forcing ourselves to throw out both types of features. This is a trade-off, and empirically at least, it appears that this is a losing trade-off. It seems better to keep around a few of these weak-but-useful features in exchange for the risk of also keeping around a few truly-useless-and-misleading features. If this weren't the case, I would have a hard time explaining my "everyday experience."

So do I have a better suggestion?

Not really.

One really nice property of the Dwork et al. definition is that it's testable for a single model. That is, I hand you a learned function h and you can run h on some training data and some dev data and you can tell me if it's overfit. I don't have to draw curves like those above. I don't even have to know anything about how the model was built, and I don't have to sweep hyperparameters (which may or may not be sweepable).

I really like all of those properties. And I believe that if we came up with an alternative definition of overfitting that obeys these properties, we can reuse much of the awesome work in the Dwork et al. paper.

But I also feel like the definition above is overly conservative and it's risky to use.

## 09 June 2015

### Some NAACL 2013 statistics on author response, review quality, etc.

NAACL 2015 has just passed, NAACL 2013 is long in the past.

One bonus of being a program chair is that you get to have fun with data. In this post I'd like to review two pieces of data, one related to author response and one related to review quality assessment.

tldr: Generally, I think author response is useless, except insofar as it can be cathartic to authors and thereby provide some small psychological benefit. And in general people don't seem that dissatisfied with their papers' reviews, and this is largely independent of the outcome of the paper (conditioned on selection bias of those who responded).

On Author Response

NAACL and other conferences have for some time allowed authors to respond to their reviewers, ostensibly to correct errors, but more often just to argue their position. Most people I've talked to who are in favor of author response say that it made the difference for one paper between it being rejected (pre-response) and accepted (post-response). Of course this is unknowable because decisions aren't made pre-response, so what these authors are reporting is their guess as to accept/reject before response versus actual accept after response. My own internal guesses of accept/reject are often off the mark, so you can imagine I don't find this argument particularly strong. To give a sense as to how hard this prediction is, here is a plot showing whether your paper got accepted or not as a function of the mean overall score.

Here, the x-axis is the cumulative distribution of papers (there were more rejects than accepts, so the length of these is normalized to percentage) and the y-axis is the mean overall score for the paper.

Focusing just on the solid lines (long papers), you can see that there were one or two papers with average scores of 3.6 that got rejected, and several papers with average scores of 3.0 that got accepted. If instead you look at "probability of accept" given mean overall score, basically what you see is papers with a score of 2.8 or lower are almost certainly rejected; with 3.9 or higher are almost certainly accepted, and around 3.2 it's an absolute toss-up.

Now, you might ask: who chose to respond? Perhaps people with clear outcomes elected not to respond at all. Here are the numbers for that:

This is only for long papers (it looks the same for short). Each dot is a paper and red are rejects, blue are accepts; open circles means no response. Definitely people with scores less than 3 responded less frequently than those with scores above 3, but even the top scoring papers all included some response. There were some "hopeful" papers with scores less than 2.0 that responded, even though it was certainly not useful.

But wait, you might say: perhaps those papers with average score or 1.33 gave a response and their final scores went up to 4.5? Ok first of all, these are final scores plotted above, not pre-response scores. But we certainly can look at how scores changed between the initial reviews and the final reviews. (Note this conflates two things: author response and reviewer discussion. We'll get back to that later.)

Here, we again have one dot per paper (these are randomly perturbed with small variance so they're all visible). Along the x-axis is the original score of this paper; along the y-axis is how much this score changed between the initial review and the final review. Overall the average absolute change across all papers is 0.1. The vast majority (87%) didn't change at all.

And note: scores were just about as likely to go down as up. (Several scores went down with no response, which I find troublesome, except in cases where reviewers specifically said "I'm giving benefit of the doubt, but need a clarification here.")

You could also say: well, maybe the scores didn't change much, but the review text did. Again, it's pretty rare. Out of 430 reviews (long papers only), 3 reviews decreased in length (~50 words), 46 increased by at most 100 words, and 49 increased by over 100 words (wow, awesome!). But for 80% the review didn't change at all.

Now, getting back to the question of "was this due to AR or to reviewer discussion", this is of course impossible to de-conflate, but we can look at a similar plot as a function of discussion rather than AR:

Here, a few results stand out. First, there are slightly more score changes as a function of discussion than AR. Most of those worrying dots below y=0 that didn't have a response seem to have gone down due to discussions. Moreover, a few scores that went up (on the low-scoring, almost-certainly-rejected papers) did so with no discussion, suggesting that AR did change those scores. But around the x=3.2 position (the true borderlines), almost all of these with score changes had discussion, though plenty had discussion and no score change. As opposed to the x=3.2 position in AR in which they all have responses, and still the majority of scores don't change.

An obvious response at this point is that we still don't know whether the final decision really changed as a function of response. This is a causality question we cannot perform the counterfactual for, so we're not going to get a solid answer. However, we can look at the following:

On the x-axis we have the final score of the paper, and on the y-axis we have (a smoothed version of) the probability that this paper is accepted.

The two nearly identical curves (red and blue) correspond to blue=all long papers and red=all long papers for which the authors submitted any response. In this setting, there really is almost no difference.

The slightly different curve (the black one) corresponds only to papers in which the reviewers engaged in a discussion. Around the critical point (score in the 3-3.5 range), discussion uniformly decreased odds of acceptance, which is of course not surprising for anyone who has ever participated in discussions.

You should take all this with a grain of salt, because all three of these curves are within one standard deviation of the blue curve (dotted lines).

Finally, there's of course the chance that response and discussion are highly correlated: that is, response should perhaps spur discussion. This is probably false (by the above plot), but just to drive the point home, here is the data:

The x-axis is the amount of discussion and the y-axis is the length of the response. See all those dots along x=0? Those are the ones for which the authors responded (in one case with a 2000 word essay!) and for which there was absolutely no discussion. And most of them are rejected papers.

On Review Usefulness

Some of you may recall that we did a post-conference survey of how much you liked the reviews you got. We truly did this at a later point in time because we didn't get our acts together earlier, but now I'll claim we did it at a later point in time so that authors were no longer emotionally distraught by their reviews.

We presented each corresponding author with their old reviews, but hid the scores from them (I doubt any/many went back to look). They were then asked, for each review in turn, how informative it was and how helpful it was. They could choose 0=not, 1=sort of, or 2=very.

The first hypothesis one might have (certainly I had it!) is that people "liked" reviews that were nice to them and "disliked" reviews that were not. (I've heard this called the craigslist effect: if the transaction is successful, everyone's happy; else, everyone's unhappy.) This actually appears not really to be true. Here are the results:
rejects      1.24  +-0.57     1.25 +-0.72
accepts      1.38  +-0.48     1.27 +-0.70
(+- is one standard deviation)

Basically authors of accepted papers didn't particularly find their reviews much more informative (+0.14) or helpful (+0.02) than those of rejected papers.

[Note: the sample size is 110 rejects and 86 accepts, so there was a higher response rate on accepts since the acceptance rate is about 25%.]

You might suspect, however, that how much an author likes a review is correlated with that reviewer's score rather than the overall decision. We can look at that too (recall authors didn't see the scores when completing this survey):
0            3.04   +- 0.89        3.23  +- 0.87
1            3.22   +- 0.91        3.15  +- 0.93
2            3.21   +- 0.97        3.22  +- 0.96
Here, the first column (rating) is how the author scored the review. The second column is the average overall score that a reviewer gave an author who gave them a score of zero. For example, for reviews that were rated informative=2, the average score for those reviews was 3.21. Again, basically there's no effect.

Just to feel good about ourselves, authors do, in general, seem not-too-unhappy with their reviews.

Here's a histogram of informativeness for rejected papers:
0: ##############                                 (15%)
1: ############################################   (44%)
2: ########################################       (41%)
Here's the same histogram for accepted papers:
0: ########                                       ( 8%)
1: ############################################## (46%)
2: ############################################## (46%)
The only real difference is that rejected papers did have twice as large a frequency of "informative=0", but the rest was pretty close.

(And in case you're wondering, it's basically the same for "helpfulness." As pointed out by a friend, the inability to separate different aspects of the same object is typical when people are asked for intuitive judgements that don't involve reasoning. 70% of responses gave the same score for helpfulness and informativeness and another 29% gave scores within 1. Only 0.8% said inf=2 help=0 and 0.6% said inf=0 and help=2.)

Overall

I think of complaining about review(s|ers) as a staple hallway discussion, sort of an attractor state if you run out of other things to talk about. And we all remember the worst reviews we have had and often forget about the good ones that really do help our work get better. As Lillian Lee said at NAACL this year, it's usually our own fault when reviewers don't understand our papers. And keep in mind that all these numbers have huge randomness associated with them, for instance related to the NIPS experiment.

Despite this, it actually seems that we aren't too dissatisfied with our reviews. Only 11% of reviews were considered uninformative and, while I think we should strive to get this number down, I don't think it's that horrible. Especially when I consider the fact that 19% of papers have an average score of 2.0 or less, which basically means they were submitted not standing a chance (for a wide variety of reasons).

In general I don't think author response has an effect on the outcome that makes it worth the time and energy it takes. It might make authors feel better, but this is temporary when their paper almost certainly enjoys the same fate after response as before. I do think discussion is useful, and I'd rather see us use more time for discussion but cutting out response. Perhaps by reducing the number of papers any given area chair has to deal with. We should realize, though, that discussion almost always serves to lessen the likelihood of acceptance, probably because it's easier to argue against a paper than for, and reviewers don't really have any incentive to stand up for a paper they like, leading to a veto-effect.