30 June 2009

ICML/COLT/UAI 2009 retrospective

This will probably be a bit briefer than my corresponding NAACL post because even by day two of ICML, I was a bit burnt out; I was also constantly swapping in other tasks (grants, etc.). Note that John has already posted his list of papers.

  1. #317: Multi-View Clustering via Canonical Correlation Analysis (Chaudhuri, Kakade, Livescu, Sridharan). This paper shows a new application of CCA to clustering across multiple views. They use some wikipedia data in experiments and actually prove something about the fact that (under certain multi-view-like assumptions), CCA does the "right thing."
  2. #295: Learning Nonlinear Dynamic Models (Langford, Salakhutdinov,, Zhang). The cool idea here is to cut a deterministic classifier in half and use its internal state as a sort of sufficient statistic. Think about what happens if you represent your classifier as a circuit (DAG); then anywhere you cut along the circuit gives you a sufficient representation to predict. To avoid making circuits, they use neural nets, which have an obvious "place to cut" -- namely, the internal nodes.
  3. #364: Online Dictionary Learning for Sparse Coding (Mairal, Bach, Ponce, Sapiro). A new approach to sparse coding; the big take-away is that it's online and fast.
  4. 394: MedLDA: Maximum Margin Supervised Topic Models for Regression and Classification (Zhu, Ahmed, Xing). This is a very cute idea for combining objectives across topic models (namely, the variational objective) and classification (the SVM objective) to learn topics that are good for performing a classification task.
  5. #393: Learning from Measurements in Exponential Families (Liang, Jordan, Klein). Suppose instead of seeing (x,y) pairs, you just see some statistics on (x,y) pairs -- well, you can still learn. (In a sense, this formalizes some work out of the UMass group; see also the Bellare, Druck and McCallum paper at UAI this year.)
  6. #119: Curriculum Learning (Bengio, Louradour, Collobert, Weston). The idea is to present examples in a well thought-out order rather than randomly. It's a cool idea; I've tried it in the context of unsupervised parsing (the unsearn paper at ICML) and it never helped and often hurt (sadly). I curriculum-ified by sentence length, though, which is maybe not a good model, especially when working with WSJ10 -- maybe using vocabulary would help.
  7. #319: A Stochastic Memoizer for Sequence Data (Wood, Archambeau, Gasthaus, James, Whye Teh). If you do anything with Markov models, you should read this paper. The take away is: how can I learn a Markov model with (potentially) infinite memory in a linear amount of time and space, and with good "backoff" properties. Plus, there's some cool new technology in there.
  8. A Uniqueness Theorem for Clustering Reza Bosagh Zadeh, Shai Ben-David. I already talked about this issue a bit, but the idea here is that if you fix k, then the clustering axioms become satisfiable, and are satisfied by two well known algorithms. Fixing k is a bit unsatisfactory, but I think this is a good step in the right direction.
  9. Convex Coding David Bradley, J. Andrew Bagnell. The idea is to make coding convex by making it infinite! And then do something like boosting.
  10. On Smoothing and Inference for Topic Models Arthur Asuncion, Max Welling, Padhraic Smyth, Yee Whye Teh. If you do topic models, read this paper: basically, none of the different inference algorithms do any better than the others (perplexity-wise) if you estimate hyperparameters well. Come are, of course, faster though.
  11. Correlated Non-Parametric Latent Feature Models Finale Doshi-Velez, Zoubin Ghahramani. This is an indian-buffet-process-like model that allows factors to be correlated. It's somewhat in line with our own paper from NIPS last year. There's still something a bit unsatisfactory in both our approach and their approach that we can't do this "directly."
  12. Domain Adaptation: Learning Bounds and Algorithms. Yishay Mansour, Mehryar Mohri and Afshin Rostamizadeh. Very good work on some learning theory for domain adaptation based on the idea of stability.
Okay, that's it. Well, not really: there's lots more good stuff, but those were the things that caught my eye. Feel free to tout your own favorites in the comments.

25 June 2009

Should there be a shared task for semi-supervised learning in NLP?

(Guest post by Kevin Duh. Thanks, Kevin!)

At the NAACL SSL-NLP Workshop recently, we discussed whether there ought to be a "shared task" for semi-supervised learning in NLP. The panel discussion consisted of Hal Daume, David McClosky, and Andrew Goldberg as panelists and audience input from Jason Eisner, Tom Mitchell, and many others. Here we will briefly summarize the points raised and hopefully solicit some feedback from blog readers.

Three motivations for a shared task

A1. Fair comparison of methods: A common dataset will allow us to compare different methods in an insightful way. Currently, different research papers use different datasets or data-splits, making it difficult to draw general intuitions from the combined body of research.

A2. Establish a methodology for evaluating SSLNLP results: How exactly should a semi-supervised learning method be evaluated? Should we would evaluate the same method for both low-resource scenarios (few labeled points, many unlabeled points) and high-resource scenarios (many labeled points, even more unlabeled points)? Should we evaluate the same method under different ratios of labeled/unlabeled data? Currently there is no standard methodology for evaluating SSLNLP results, which means that the completeness/quality of experimental sections in research papers varies considerably.

A3. Encourage more research in the area: A shared task can potentially lower the barrier of entry to SSLNLP, especially if it involves pre-processed data and community support network. This will make it easier for researchers in outside fields, or researchers with smaller budgets to contribute their expertise to the field. Furthermore, a shared task can potentially direct the community research efforts in a particular subfield. For example, "online/lifelong learning for SSL" and "SSL as joint inference of multiple tasks and heterogeneous labels" (a la Jason Eisner's keynote) were identified as new, promising areas to focus on in the panel discussion. A shared task along those lines may help us rally the community behind these efforts.

Arguments against the above points

B1. Fair comparison: Nobody really argues against fair comparison of methods. The bigger question, however, is whether there exist a *common* dataset or task where everyone is interested in. At the SSLNLP Workshop, for example, we had papers in a wide range of areas ranging from information extraction to parsing to text classification to speech. We also had papers where the need for unlabeled data is intimately tied in to particular components of a larger system. So, a common dataset is good, but what dataset can we all agree upon?

B2. Evaluation methodology: A consistent standard for evaluating SSLNLP results is nice to have, but this can be done independently from a shared task through, e.g. an influential paper or gradual recognition of its importance by reviewers. Further, one may argue: what makes you think that your particular evaluation methodology is the best? What makes you think people will adopt it generally, both inside and outside of the shared task?

B3. Encourage more research: It is nice to lower the barriers to entry, especially if we have pre-processed data and scripts. However, it has been observed in other shared tasks that often it is the pre-processing and features that matter most (more than the actual training algorithm). This presents a dilemma: If the shared task pre-processes the data to make it easy for anyone to join, will we lose the insights that may be gained via domain knowledge? On the other hand, if we present the data in raw form, will this actually encourage outside researchers to join the field?

Rejoinder

A straw poll at the panel discussion showed that people are generally in favor of looking into the idea of a shared task. The important question is how to make it work, and especially how to address counterpoints B1 (what task to choose) and B3 (how to prepare the data). We did not have enough time during the panel discussion to go through the details, but here are some suggestions:

  • We can view NLP problems as several big "classes" of problems: sequence prediction, tree prediction, multi-class classification, etc. In choosing a task, we can pick a representative task in each class, such as name-entity recognition for sequence prediction, dependency parsing for tree prediction, etc. This common dataset won't attract everyone in NLP, but at least it will be relevant for a large subset of researchers.

  • If participants are allowed to pre-process their own data, the evaluation might require participant to submit a supervised system along with their semi-supervised system, using the same feature set and setup, if possible. This may make it easier to learn from results if there are differences in pre-processing.

  • There should also be a standard supervised and semi-supervised baseline (software) provided by the shared task organizer. This may lower the barrier of entry for new participants, as well as establish a common baseline result.
Any suggestions? Thoughts?

20 June 2009

Why I Don't Buy Clustering Axioms

In NIPS 15, Jon Kleinberg presented some impossibility results for clustering. The idea is to specify three axioms that all clustering functions should obey and examine those axioms.

Let (X,d) be a metric space (so X is a discrete set of points and d is a metric over the points of X). A clustering function F takes d as input and produces a clustering of the data. The three axioms Jon states that all clustering functions should satisfy are:

  1. Scale invariance: For all d, for all a>0, F(ad) = F(d). In other words, if I transform all my distances by scaling them uniformly, then the output of my clustering function should stay the same.
  2. Richness: The range of F is the set of all partitions. In other words, there isn't any bias that prohibits us from producing some particular partition.
  3. Consistency: Suppose F(d) gives some clustering, C. Now, modify d by shrinking distances within clusters of C and expanding distances between clusters in C. Call the new metric d'. Then F(d') = C.
Kleinberg's result is that there is no function F that simultaneously satisfies all these requirements. Functions can satisfy two, but never all three. There have been a bunch of follow on papers, including one at NIPS last year and one that I just saw at UAI.

If you think about these axioms a little bit, they kind of make sense. My problem is that if you think about them a lot of bit, you (or at least I) realize that they're broken. The biggest broken one is consistency, which becomes even more broken when combined with scale invariance.

What I'm going to do to convince you that consistency is broken is start with some data in which there is (what I consider) a natural clustering into two clusters. I'll then apply consistency a few times to get something that (I think) should yield a different clustering.

Let's start with some data. The colors are my opinion as to how the data should be clustered:

I hope you agree with my color coding. Now, let's apply consistency. In particular, let's move some of the red points, only reducing inter-clustering distances. Formally, we find the closest pair of points and move things toward those.

The arrows denote the directions these points will be moved. To make the situation more visually appealing, let's move things into lines:
Okay, this is already looking funky. Let's make it even worse. Let's apply consistency again and start moving some blue points:
Again, let's organize these into a line:
And if I had given you this data to start with, my guess is the clustering you'd have come up with is more like:
This is a violation of consistency.

So, what I'd like someone to do is to argue to my why consistency is actually a desirable property.

I can come up with lots of other examples. One reason why this invariance is bad is because it renders the notion of "reference sizes" irrelevant. This is of course a problem if you have prior knowledge (eg., one thing measured in millimeters, the other in kilometers). But even in the case where you don't know knowledge, what you can do is take the following. Take data generated by thousands of well separated Gaussians, so that the clearly right thing to do is have one cluster per Gaussian. Now, for each of these clusters except for one, shrink them down to single points. This is possible by consistency. Now, your data basically looks like thousands-1 of clusters with zero inter-cluster distances and then one cluster that's spread out. But now it seems that the reasonable thing is to put each data point that was in this old cluster into its own cluster, essentially because I feel like the other data shows you what clusters should look like. If you're not happy with this, apply scaling and push these points out super far from each other. (I don't think this example is as compelling as the one I drew in pictures, but I still think it's reasonable.

Now, in the UAI paper this year, they show that if you fix the number of clusters, these axioms are now consistent. (Perhaps this has to do with the fact that all of my "weird" examples change the number of clusters -- though frankly I don't think this is necessary... I probably could have arranged it so that the resulting green and blue clusters look like a single line that maybe should just be one cluster by itself.) But I still feel like consistency isn't even something we want.

(Thanks to the algorithms group at Utah for discussions related to some of these issues.)

UPDATE 20 JUNE 2009, 3:49PM EST

Here's some data to justify the "bad things happen even when the number of clusters stays the same" claim.

Start with this data:


Now, move some points toward the middle (note they have to spread to the side a bit so as not to decrease intra-cluster distances).

Yielding data like the following:

Now, I feel like two horizontal clusters are most natural here. But you may disagree. What if I add some more data (ie., this is data that would have been in the original data set too, where it clearly would have been a third cluster):

And if you still disagree, well then I guess that's fine. But what if there were hundreds of other clusters like that.

I guess the thing that bugs me is that I seem to like clusters that have similar structures. Even if some of these bars were rotated arbitrarily (or, preferably, in an axis-aligned manner), I would still feel like there's some information getting shared across the clusters.

14 June 2009

NL Generation: A new problem

Those who talk to me a lot over the years know that I really think that generation is a cool and interesting problem, but one that is hampered by a lack of clarity of what it is, or at least what the input/output is. It's like the problem with defining summarization, but one hundred times worse.

I have no idea how it came up. I think I was talking to a bunch of people from Microsoft Research and we were talking about translation on the web and what not. And how this is not the business model of Language Weaver. And how when you're on the web you make money by advertising.

And voila! The most incredible NL Generation task occurred to me! (I apologize if you ran in to me at all during NAACL because this was pretty much all I could talk about :P.) The initial idea for the task was embedded in MT, though it needn't be. But I'll tell it in the MT setting.

So I go to some web page in some weirdo language (say, French) that I don't understand (because I was a moron and took Latin instead of French or Spanish in middle school and high school). So I ask my favorite translation system (Google or Microsoft or Babelfish or whatever) to translate it. However, the translation system takes certain liberties with the translation. In particular, it might embed a few "product placements" in the text. For instance, maybe it's translating "Je suis vraiment soif" into English (if this is incorrect, blame Google). And perhaps it decides that instead of translating this as "I'm really thirsty," it will translate it as "I'm really thirsty for a Snapple," or "I'm really thirsty: I could go for a Snapple," perhaps with a link to snapple.com.

Product placement is all over the place, even in America where it's made fun of and kept a bit at bay. Not so in China: any American remotely turned off by the Coca-cola cups from which the judges on American Idol drink twice a week would be appalled by the ridiculous (my sentiment) product placement that goes on over there. The presence of the link would probably give away that it's an ad, though of course you could leave this off.

But why limit this to translation. You could put such a piece of technology directly on blogger, or on some proxy server that you can go through to earn some extra cash browsing the web (thanks to someone -- can't remember who -- at NAACL for this latter suggestion). I mean, you could just have random ads pop up in the middle of text on any web page, for instance one you could host on webserve.ca!

(See what I did there?)

So now here's a real generation problem! You're given some text. And maybe you're even given adwords or something like that, so you can assume that the "select which thing to advertise" problem has been solved. (Yes, I know it's not.) Your job is just to insert the ad in the most natural way in the text. You could evaluate in at least two ways: click through (as is standard in a lot of this advertising business) and human judgments of naturalness. I think the point of product placement is to (a) get your product on the screen more or less constantly, rather than just during commercial breaks which most people skip anyway, and (b) perhaps appeal to people's subconscious. I don't know. My parents (used to) do advertising like things but I know next to nothing about that world.

Okay, so this is slightly tongue in cheek, but not entirely. And frankly, I wouldn't be surprised if something like it were the norm in five years. (If you want to get more fancy, insert product placements into youtube videos!)

12 June 2009

NAACL-HLT 2009 Retrospective

I hope this post will be a small impetus to get other people to post comments about papers they saw at NAACL (and associated workshops) that they really liked.

As usual, I stayed for the whole conference, plus workshops. As usual, I also hit that day -- about halfway through the first workshop day -- where I was totally burnt out and was wondering why I always stick around for the entire week. That's not to say anything bad about the workshops specifically (there definitely were good ones going on, in fact, see some comments below), but I was just wiped.

Anyway, I saw a bunch of papers and missed even more. I don't think I saw any papers that I actively didn't like (or wondered how they got in), including short papers, which I think is fantastic. Many thanks to all the organizers (Mari Ostendorf for organizing everything, Mike Collins, Lucy Vanderwende, Doug Oard and Shri Narayanan for putting together a great program, James Martin and Martha Palmer for local arrangements -- which were fantastic -- and all the other organizers who sadly we -- i.e., the NAACL board -- didn't get a chance to thank publicly).

Here are some things I thought were interesting:

  1. Classifier Combination Techniques Applied to Coreference Resolution (Vemulapalli, Luo, Pitrelli and Zitouni). This was a student research workshop paper; in fact, it was the one that I was moderating (together with Claire Cardie). The student author, Smita, performed this work while at IBM; though her main research is on similar techniques applied to really cool sounding problems in recognizing stuff that happens in the classroom. Somehow classifier combination, and general system combination, issues came up a lot at this conference (mostly in the hallways where someone was begrudgingly admitting to working on something as dirty as system combination). I used to think system combination was yucky, but I don't really feel that way anymore. Yes, it would be nice to have one huge monolithic system that does everything, but that's often infeasible. My main complaint with system combination stuff is that in many cases I don't really understand why it's helping, which means that unless it's applied to a problem I really care about (of which there are few), it's hard for me to take anything away. But I think it's interesting. Getting back to Smita's paper, the key thing she did to make this work is introduce the notion of alignments between different clusterings, which seemed like a good idea. The results probably weren't as good as they were hoping for, but still interesting. My only major pointers as a panelist were to try using different systems, rather than bootstrapped versions of the same system, and to take a look at the literature on consensus clustering, which is fairly relevant for this problem.

  2. Graph-based Learning for Statistical Machine Translation (Alexandrescu and Kirchhoff). I'd heard of some of this work before in small group meetings with Andrei and Kathrin, but this is the first time I'd seen the results they presented. This is an MT paper, but really it's about how to do graph-based semi-supervised learning in a structured prediction context, when you have some wacky metric (read: BLEU) on which you're evaluating. Computation is a problem, but we should just hire some silly algorithms people to figure this out for us. (Besides, there was a paper last year at ICML -- I'm too lazy to dig it up -- that showed how to do graph-based stuff on billions of examples.)

  3. Intersecting Multilingual Data for Faster and Better Statistical Translations (Chen, Kay and Eisele). This is a very simple idea that works shockingly well. Had I written this paper, "Frustrating" would probably have made it into the title. Let's say we want an English to French phrase table. Well, we do phrase table extraction and we get something giant and ridiculous (have you ever looked at those phrase pairs) that takes tons of disk space and memory, and makes translation slow (it's like the "grammar constant" in parsing that means that O(n^3) for n=40 is impractical). Well, just make two more phrase tables, English to German and German to French and intersect. And viola, you have tiny phrase tables and even slightly better performance. The only big caveat seems to be that they estimate all these things on Europarl. What if your data sets are disjoint: I'd be worried that you'd end up with nothing in the resulting phrase table except the/le and sometimes/quelquefois (okay, I just used that example because I love that word).

  4. Quadratic Features and Deep Architectures for Chunking (Turian, Bergstra and Bengio). I definitely have not drunk the deep architectures kool-aid, but I still think this sort of stuff is interesting. The basic idea here stems from some work Bergstra did for modeling vision, where they replaced a linear classifier(y = w'x) with a low rank approximation to a quadratic classifier (y = w'x + sqrt[(a'x)^2 + (b'x)^2 + ... ]). Here, the a,b,... vectors are all estimated as part of the learning process (eg., by stochastic gradient descent). If you use a dozen of them, you get some quadratic style features, but without the expense of doing, say, an implicit (or worse, explicit) quadratic kernel. My worry (that I asked about during the talk) is that you obviously can't initialize these things to zero or else you're in a local minimum, so you have to do some randomization and maybe that makes training these things a nightmare. Joseph reassured me that they have initialization methods that make my worries go away. If I have enough time, maybe I'll give it a whirl.

  5. Exploring Content Models for Multi-Document Summarization (Haghighi and Vanderwende). This combines my two favorite things: summarization and topic models. My admittedly biased view was they started with something similar to BayeSum and then ran a marathon. There are a bunch of really cool ideas in here for content-based summarization.

  6. Global Models of Document Structure using Latent Permutations (Chen, Branavan, Barzilay and Karger). This is a really cool idea (previously mentioned in a comment on this blog) based on using generalized Mallow's models for permutation modeling (incidentally, see a just-appeared JMLR paper for some more stuff related to permutations!). The idea is that documents on a similar topic (eg., "cities") tend to structure their information in similar ways, which is modeled as a permutation over "things that could be discussed." It's really cool looking, and I wonder if something like this could be used in conjunction with the paper I talk about below on summarization for scientific papers (9, below). One concern raised during the questions that I also had was how well this would work for things not as standardized as cities, where maybe you want to express preferences of pairwise ordering, not overall permutations. (Actually, you can do this, at least theoretically: a recent math visitor here, Mark Huber, has some papers on exact sampling from permutations under such partial order constraints using coupling from the past.) The other thing that I was thinking during that talk that I thought would be totally awesome would be to do a hierarchical Mallow's model. Someone else asked this question, and Harr said they're thinking about this. Oh, well... I guess I'm not the only one :(.

  7. Dan Jurafsky's invited talk was awesome. It appealed to me in three ways: as someone who loves language, as a foodie, and as an NLPer. You just had to be there. I can't do it justice in a post.

  8. More than Words: Syntactic Packaging and Implicit Sentiment (Greene and Resnik). This might have been one of my favorite papers of the conference. The idea is that how you say things can express your point of view as much as what you say. They look specifically at effects like passivization in English, where you might say something like "The truck drove into the crowd" rather than "The soldier drove the truck into the crowd." The missing piece here seems to be identifying the "whodunnit" in the first sentence. This is like figuring out subjects in languages that like the drop subjects (like Japanese). Could probably be done; maybe it has been (I know it's been worked on in Japanese; I don't know about English).

  9. Using Citations to Generate Surveys of Scientific Paradigms (Mohammad, Dorr, Egan, Hassan, Muthukrishan, Qazvinian, Radev and Zajic). I really really want these guys to succeed. They basically study how humans and machines create summaries of scientific papers when given either the text of the paper, or citation snippets to the paper. The idea is to automatically generate survey papers. This is actually an area I've toyed with getting in to for a while. The summarization aspect appeals to me, and I actually know and understand the customer very well. The key issue I would like to see addressed is how these summaries vary across different users. I've basically come to the conclusion that in summarization, if you don't pay attention to the user, you're sunk. This is especially true here. If I ask for a summary of generalization bound stuff, it's going to look very different than if Peter Bartlett asks for it.

  10. Online EM for Unsupervised Models (Liang and Klein). If you want to do online EM read this paper. On the other hand, you're going to have to worry about things like learning rate and batch size (think Pegasos). I was thinking about stuff like this a year or two ago and was wondering how this would compare to doing SGD on the log likelihood directly and not doing EM at all. Percy says that asymptotically they're the same, but who knows what they're like in the real world :). I think it's interesting, but I'm probably not going to stop doing vanilla EM.
I then spent some time at workshops.

I spent the first morning in the Computational Approaches to Linguistic Creativity workshop, which was just a lot of fun. I really liked all of the morning talks: if you love language and want to see stuff somewhat off the beaten path, you should definitely read these. I went by the Semantic Evaluation Workshop for a while and learned that the most frequent sense baseline is hard to beat. Moreover, there might be something to this discourse thing after all: Marine tells us that translators don't like to use multiple translations when one will do (akin to the one sense per discourse observation). The biggest question in my head here is how much the direction of translation matters (eg., when this heuristic is violated, is it violated by the translator, or the original author)? Apparently this is under investigation. But it's cool because it says that even MT people shouldn't just look at one sentence at a time!

Andrew McCallum gave a great, million-mile-an-hour invited talk on joint inference in CoNLL. I'm pretty interested in this whole joint inference business, which also played a big role in Jason Eisner's invited talk (that I sadly missed) at the semi-supervised learning workshop. To me, the big question is: what happens if you don't actually care about some of the tasks. In a probabilistic model, I suppose you'd marginalize them out... but how should you train? In a sense, since you don't care about them, it doesn't make sense to have a real loss associated with them. But if you don't put a loss, what are you doing? Again,in probabilistic land you're saved because you're just modeling a distribution, but this doesn't answer the whole question.

Al Aho gave a fantastically entertaining talk in the machine translation workshop about unnatural language processing. How the heck they managed to get Al Aho to give an invited talk is beyond me, but I suspect we owe Dekai some thanks for this. He pointed to some interesting work that I wasn't familiar with, both in raw parsing (eg., how to parse errorfull strings with a CFG when you want to find the closest in edit distance string that is parseable by a CFG) and natural language/programming language interfaces. (In retrospect, the first result is perhaps obvious had I actually thought much about it, though probably not so back in 1972: you can represent edit distance by a lattice and then parse the lattice, which we know is efficient.)

Anyway, there were other things that were interesting, but those are the ones that stuck in my head somehow (note, of course, that this list is unfairly biased toward my friends... what can I say? :P).

So, off to ICML on Sunday. I hope to see many of you there!

09 June 2009

Navigating ICML

I couldn't find a schedule for ICML that had paper titles/authors written in, so I joined the existing schedule with the abstracts to create a printable schedule with titles. You can find organizer-created schedules for UAI and COLT already. In addition, I've put ICML 2009 up on WhatToSee, so have fun! (I haven't done UAI and/or COLT because their papers haven't appeared online yet.)

08 June 2009

The importance of input representations

As some of you know, I run a (machine learning) reading group every semester. This summer we're doing "assorted" topics, which basically means students pick a few papers from the past 24 months that are related and present on them. The week before I went out of town, we read two papers about inferring features from raw data; one was a deep learning approach; the other was more Bayesian. (As a total aside, I found it funny that in the latter paper they talk a lot about trying to find independent features, but in all cog sci papers I've seen where humans list features of objects/categories, they're highly dependent: eg., "has fur" and "barks" are reasonable features that humans like to produce that are very much not independent. In general, I tend to think that modeling things as explicitly dependent is a good idea.)

Papers like this love to use vision examples, I guess because we actually have some understanding of how the visual cortex words (from a neuroscience perspective), which we sorely lack for language (it seems much more complicated). They also love to start with pixel representations; perhaps this is neurologically motivated: I don't really know. But I find it kind of funny, primarily because there's a ton of information hard wired into the pixel representation. Why not feed .jpg and .png files directly into your system?

On the language side, an analogy is the bag of words representation. Yes, it's simple. But only simple if you know the language. If I handed you a bunch of text files in Arabic (suppose you'd never done any Arabic NLP) and asked you to make a bag of words, what would you do? What about Chinese? There, it's well known that word segmentation is hard. There's already a huge amount of information in a bag of words format.

The question is: does it matter?

Here's an experiment I did. I took the twenty newsgroups data (standard train/test split) and made classification data. To make the classification data, I took a posting, fed it through a module "X". "X" produced a sequence of tokens. I then extract n-gram features over these tokens and throw out anything that appears less than ten times. I then train a multiclass SVM on these (using libsvm). The only thing that varies in this setup is what "X" does. Here are four "X"s that I tried:

  1. Extract words. When composed with extracting n-gram tokens, this leads to a bag of words, bag of bigrams, bag of trigrams, etc., representation.
  2. Extract characters. This leads to character unigrams, character bigrams, etc.
  3. Extract bits from characters. That is, represent each character in its 8 bit ascii form and extract a sequence of zeros and ones.
  4. Extract bits from a gzipped version of the posting. This is the same as (3), but before extracting the data, we gzip the file.
The average word length for me is 3.55 characters, so a character ngram with length 4.5 is approximately equivalent to a bag of words model. I've plotted the results below for everything except words (words were boring: BOW got 79% accuracy, going to higher ngram length hurt by 2-3%). The x-axis is number of bits, so the unigram character model starts out at eight bits. The y-axis is accuracy:
As we can see, characters do well, even at the same bit sizes. Basically you get a ton of binary sequence features from raw bits that are just confusing the classifier. Zipped bits do markedly worse than raw bits. The reason the bit-based models don't extend further is because it started taking gigantic amounts of memory (more than my poor 32gb machine could handle) to process and train on those files. But 40 bits is about five characters, which is just over a word, so in theory the 40 bit models have the same information that the bag of words model (at 79% accuracy) has.

So yes, it does seem that the input representation matters. This isn't shocking, but I've never seen anyone actually try something like this before.