The weather was beautiful in San Diego last week during the North American chapter of the Association for Computational Linguistics conference, better known as NAACL. There was plenty of interesting material inside the meeting hotel as well. The conference and affiliated workshops ran all week, so there is too much material for me to describe in a reasonable-length blog post. Check the link above for the schedule and proceedings, then enjoy the kinds of papers you like. Rather than describing all (or even many) of the talks I attended, I’ll try to convey the context and a few high points. The workshops on Thursday and Friday were really great too, so I’ll address them in another post.
Deep learning was the major semi-official theme of the conference. Is it fashionable (yes), is it over-hyped (probably), is it all a bunch of BS (definitely not), etc. During lunch on Monday I was chatting with a grad student who asked whether I thought neural nets were going to remain ascendant this time, as opposed to their earlier rides on the hype cycle. “Probably not,” was my answer, which elicited an astonished look. I went on to explain that this is partly due to known issues, such as the difficulty of getting explanations of their decisions (although there has been work on reducing that problem). It is also partly due to the way that ensemble models tend to work better: a neural net may make the decisions in an ensemble, but there will be other methods in there as well. On reflection, as I write this post, it just seems unlikely that we have finally come across the great secret of AI and that now it’s just a matter of hyperparameter optimization. I just don’t think things are going to be that easy. I can hope that neural network methods don’t hit the depths of despair that they have in the past, but I have little doubt that something new will come along in five years and become the newly fashionable thing.
NAACL is a heavily academic conference, despite considerable participation by research groups from Google, Microsoft, Facebook, etc. This was highlighted in the opening keynote when the speaker, Regina Barzilay of MIT, mentioned her surprise at finding out that most current medical NLP work is rule-based. Learning-based methods have certainly dominated the academic literature for the last 10 or more years, but rules still dominate commercial practice. Rule-based methods have characteristics that fit commercial constraints well. You can start simply. You don’t need a large existing training and test set. The rules tend to be high precision, which is a good thing since errors in precision are more blatant to people than errors in recall. Importantly, you can understand why a set of rules gave the decision that it did, as opposed to puzzling over an uninterpretable matrix of numbers. That reason in particular is important to medical practitioners. All of this has been known in the commercial community for years. In 2013 it was nicely described to the academic community in the paper “Rule-based Information Extraction is Dead! Long Live Rule-based Information Extraction Systems!” by Laura Chiticariu and colleagues at IBM’s Almaden research lab. You may not be able to get a paper about a rule-based system into an NLP conference, but that doesn’t mean they are not a pragmatic way to get things done.
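To make the contrast concrete, here is a minimal sketch of what a rule-based extractor looks like in practice, using a single hypothetical dosage pattern. The rule, names, and example text are purely illustrative, not taken from any system discussed above:

```python
import re

# A single hand-written, high-precision rule of the kind commercial
# systems accumulate. Pattern and names are hypothetical, for
# illustration only.
DOSAGE_RULE = re.compile(r"\b(\d+(?:\.\d+)?)\s*(mg|mcg|ml)\b", re.IGNORECASE)

def extract_dosages(text):
    """Return (amount, unit) pairs matched by the dosage rule."""
    return [(float(amount), unit.lower())
            for amount, unit in DOSAGE_RULE.findall(text)]

# "2 tablets" is deliberately not matched: no recognized unit, so the
# rule stays high precision at the cost of some recall.
print(extract_dosages("Take 2 tablets of aspirin 81 mg daily with 5 ml syrup."))
```

A handful of such patterns can be written and audited in an afternoon, which is exactly the low-startup-cost, high-precision, explainable profile described above.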
Enough about the context. Here are a few papers that particularly caught my interest:
Dynamic Entity Representation with Max-pooling Improves Machine Reading by Sosuke Kobayashi et al. I tweeted about this paper right after the talk wound up. As mentioned then, it looks like a very interesting way to accumulate a word vector for a named entity based on its contexts in a document, and also on its interactions with other named entities. There are a couple of caveats to keep in mind. The paper uses gold-standard entity boundaries, so it is not clear how well the method will work in the wild when it has to find its own entities. Also, specialized contexts for embeddings tend to help with intrinsic evaluation tasks but not make much difference in extrinsic evaluations, which are the ones that matter more.
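For intuition, the max-pooling step at the heart of the approach can be sketched as an elementwise max over the context vectors accumulated for an entity’s mentions. The vectors below are toy values; the paper derives them with an RNN reader, which is omitted here:

```python
# Max-pooling step only: given one context vector per mention of the
# same entity, take the elementwise max to form a single dynamic
# representation. Values are made up for illustration.
def max_pool(vectors):
    return [max(dims) for dims in zip(*vectors)]

mention_vectors = [
    [0.1, -0.4, 0.9],   # the entity's context in sentence 1
    [0.7,  0.2, -0.3],  # ... in sentence 2
    [0.0,  0.5,  0.1],  # ... in sentence 3
]
print(max_pool(mention_vectors))  # elementwise max across mentions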
How can I say that special contexts don’t make a big difference? Check out “The Role of Context Types and Dimensionality in Learning Word Embeddings” by Oren Melamud et al., which appeared on arXiv a few months ago. The short version: if you are limited in the amount of space you can devote to your embeddings, just use a basic neighborhood context. If you have lots of space, make your basic neighborhood vector longer until you stop gaining benefit, then augment it with vectors from other contexts such as syntactic dependencies or semantic roles.
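The advice reduces to a simple recipe: one long neighborhood vector first, then concatenation of other context types if space remains. A toy sketch (all vector values are made up; real embeddings would come from training):

```python
# Illustrative only: under a tight budget, spend all dimensions on the
# window-context ("neighborhood") embedding; with room to spare,
# concatenate embeddings trained from other context types.
def concat(*vectors):
    combined = []
    for v in vectors:
        combined.extend(v)
    return combined

neighborhood = [0.2, -0.1, 0.4, 0.3]  # window-context embedding (toy)
dependency = [0.5, 0.0]               # syntactic-dependency embedding (toy)
augmented = concat(neighborhood, dependency)
print(len(augmented))  # 6: neighborhood dims plus dependency dims
```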
“Comparing Convolutional Neural Networks to Traditional Models for Slot Filling” by Heike Adel et al.
This paper compares three methods for slot-filling: pattern-based, SVM over common features, and CNN methods. A combination of the three is found to be best. However, the performance on an end-to-end slot-filling benchmark has an F1 of less than 0.3, so this is not ready for prime time.
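As a rough illustration of combining the three systems, here is one common ensembling scheme, averaging per-candidate confidence scores across the component models; the paper’s actual combination method may well differ:

```python
# Hypothetical score-averaging ensemble over three slot fillers.
# Scores and candidates are invented for illustration.
def ensemble(score_dicts):
    candidates = set().union(*score_dicts)
    avg = {c: sum(d.get(c, 0.0) for d in score_dicts) / len(score_dicts)
           for c in candidates}
    return max(avg, key=avg.get)

patterns = {"Paris": 0.9, "Lyon": 0.1}  # pattern-based scores
svm = {"Paris": 0.6, "Lyon": 0.4}       # SVM over common features
cnn = {"Lyon": 0.7, "Paris": 0.5}       # CNN scores
print(ensemble([patterns, svm, cnn]))
```

A missing candidate simply contributes zero from that model, so a slot value backed by all three systems tends to win over one that a single model likes.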
“Abstractive Sentence Summarization with Attentive Recurrent Neural Networks”, Sumit Chopra et al.
Abstractive summarization is harder than extractive summarization, but it also promises more meaningful summaries than simply picking the N most salient sentences. And attentive RNNs are just flat-out cool. I think we will see more work along this line in the near future. But if you need to do summarization right now, extractive remains the way to go. The abstractive methods have this unfortunate little problem with negation.
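If extractive summarization is what you need today, a bare-bones frequency-based extractor, the kind of baseline this advice points at, can be sketched in a few lines (purely illustrative, not any system from the conference):

```python
import re
from collections import Counter

# Minimal extractive summarizer: score each sentence by the document
# frequency of its words, keep the top n, and return them in their
# original order. A toy baseline, not a production system.
def extractive_summary(text, n=1):
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))
    scored = sorted(sentences,
                    key=lambda s: sum(freq[w] for w in re.findall(r"\w+", s.lower())),
                    reverse=True)
    keep = set(scored[:n])
    return [s for s in sentences if s in keep]

print(extractive_summary("Cats purr. Cats sleep a lot. Dogs bark."))
```

Because it only copies sentences, it cannot garble a negation the way abstractive generation can, which is exactly the trade-off noted above.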
“Neural Architectures for Named Entity Recognition”, Guillaume Lample et al.
This is an update of some very interesting work that has been on arXiv since early March. They use a CRF on top of a bidirectional LSTM, and the LSTM combines both character-based and word-based embeddings. The performance on the CoNLL-2003 test set hits an F1 of 0.909, trailing only one other system they know of, which was trained with external data and used a gazetteer. Even better, their code is available on GitHub. That’s great, but to add a bit of perspective, the top system from 2003 had an F1 of 0.8876, so the absolute amount of improvement in 13 years is disappointing. Also, this benchmark covers only Person, Location, and Organization entities; the performance of their code on scientific entities across a range of disciplines is sure to be much lower. Nevertheless, in my heart of hearts I’m hoping to be proven wrong. I’m going to try to get some data together and find out!
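To sketch just the input layer of such a tagger: each token’s representation is a word-level vector concatenated with a character-derived vector, and that concatenation is what feeds the bidirectional LSTM. The vectors below are deterministic toys standing in for learned embeddings (the real character vectors come from a character-level LSTM, omitted here):

```python
import hashlib

# Toy stand-in for a learned embedding: deterministic pseudo-random
# values derived from a key, so the example is self-contained.
def toy_vector(key, dim=4):
    digest = hashlib.md5(key.encode()).digest()
    return [b / 255.0 for b in digest[:dim]]

def token_representation(word):
    # Word-level vector (case-insensitive lookup, as is common).
    word_vec = toy_vector("word:" + word.lower())
    # Character-derived vector; real systems build this with a char
    # LSTM so capitalization and morphology can inform the tagger.
    char_vec = toy_vector("chars:" + word)
    return word_vec + char_vec  # concatenation feeds the BiLSTM

rep = token_representation("Obama")
print(len(rep))  # 8: word dims plus character dims
```

The CRF layer on top then scores whole label sequences rather than individual tokens, which is what keeps the predicted BIO tags globally consistent.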