This third and final part of a trip report about NAACL 2016 covers Thursday’s workshop on human-computer question answering. It featured several good talks and posters. Plus, at the end of the workshop the winning system for the quizbowl shared task faced off with the California championship team. This was possible since, conveniently, all the members of that team come from San Diego. Read more to find out how it turned out!
The workshop began in a jam-packed room with a talk by Ray Mooney about Ensembling Diverse Approaches to Question Answering. Ray is on the faculty of the University of Texas, but nobody will mistake him for a slow-talking Texan with a soft Southern drawl. 😉 Ray started by covering a range of types of question answering and the systems that were developed to do them. He then hit his key point: the importance of “stacking”, both as an ensembling method and as a way to bring unsupervised components into what is normally a fully supervised framework. Highly recommended.
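To make the stacking idea concrete, here is a minimal sketch using scikit-learn. The generic classifiers below are stand-ins of my own choosing, not the QA systems from Ray’s talk: each base model produces out-of-fold confidence scores, and a meta-classifier (the “stacker”) learns how to combine them.

```python
# Minimal stacking sketch: base models' out-of-fold predictions become
# features for a meta-classifier. The models here are illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=300, random_state=0)

base_models = [GaussianNB(), DecisionTreeClassifier(random_state=0)]

# Out-of-fold probability estimates from each base model; using
# cross-validated predictions avoids leaking training labels to the stacker.
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

stacker = LogisticRegression()
stacker.fit(meta_features, y)
```

The appeal for QA is that each base system only has to expose a confidence score, so heterogeneous (even unsupervised) components can be folded into the supervised stacker.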
Jason Weston from Facebook AI Research then spoke about Question Answering via Dialog-based Language Learning. The material for the talk is also covered in the paper Dialog-Based Language Learning. The bulk of the talk covered the details of 10 different supervision schemes, which led to different kinds of dialog between the computer learner and its hypothetical teacher. Each of the schemes was tried on four different training strategies. At the end he quickly gave the results comparing the strategies on the bAbI dataset and on the MovieQA dataset. In the QA period most of the questions were people giving him trouble about the bAbI dataset. Some of that was justified; the dataset is very artificial, but it is there for a purpose. After a bit it just seemed to me to be people piling on with unjustified criticism, since he had also evaluated the strategies on the MovieQA dataset. The results from both evaluations were consistent in identifying the highest performing schemes, with the MovieQA dataset having lower scores because of its greater complexity.
After the coffee break the workshop resumed in a larger room, to the greater comfort of all concerned. Eunsol Choi spoke on Semantic parsing with Freebase: a very large, yet wildly incomplete knowledge base. I have to confess that I didn’t get a lot out of this talk. One key theme was the desire to use text instead of a structured knowledge base because of the greater amount of information available that way. That’s perfectly reasonable to me. At work we regularly see how sparse the coverage of WikiData is for scientific topics. Most interesting to me was the work on using CCG parses to go from the text to something close to a logical form. This is something I’d like to get back to and study more.
Charlie Beller spoke on the Watson Discovery Advisor: Question-answering in an industrial setting. This is pretty familiar material to me. Certainly if you have not read about Watson’s architecture you should do so. You should also try to find out more about the knowledge engineering efforts that go on behind the scenes in the modern Watson deployments.
Denis Savenkov presented Crowdsourcing for (almost) Real-time Question Answering, which looked at timeboxing the work of Mechanical Turkers so that they could be used in nearly interactive question answering systems. Accuracy did not suffer greatly if tasks were limited to 1 minute, and in fact it looked like most of their tasks could be limited to 30 or even 20 seconds without severe damage to accuracy. (Obviously, your mileage may vary depending on the complexity of the task.) One of the takeaways after the talk and its Q&A period was that for many tasks, you get better results by having only 1 turker tagging each item, instead of ending up with 1/3 as many items that have each been checked by 3 turkers. This is good advice in machine learning contexts where you expect some amount of noise to be generalized away. If you are trying to create a benchmark standard, that advice would not necessarily apply.
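The labeling-budget tradeoff is easy to simulate. This toy sketch is my own illustration, not from the talk: given a fixed budget of paid judgments and turkers who are each right 80% of the time (an assumed accuracy), one judgment per item labels three times as many items as 3-way majority voting, at the cost of noisier individual labels.

```python
# Toy simulation of the crowdsourcing budget tradeoff (illustrative only):
# 1 judgment per item vs. 3-way majority voting under a fixed budget.
import random

random.seed(0)

def noisy_label(true_label, accuracy=0.8):
    """One simulated turker: correct with probability `accuracy`."""
    return true_label if random.random() < accuracy else 1 - true_label

def majority(labels):
    return int(sum(labels) > len(labels) / 2)

budget = 3000  # total judgments we can pay for
truth = [random.randint(0, 1) for _ in range(budget)]

# Option A: 1 judgment each -> `budget` labeled items, each ~80% reliable.
single = [noisy_label(t) for t in truth]

# Option B: 3 judgments each -> only budget // 3 items, but cleaner labels.
voted = [majority([noisy_label(t) for _ in range(3)])
         for t in truth[: budget // 3]]
```

For training data with a noise-tolerant learner, option A’s larger dataset often wins; for a benchmark, option B’s cleaner labels matter more.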
Attention-Based Convolutional Neural Network for Machine Comprehension was presented by Hinrich Schütze.
After the lunch break, Peter Clark of AI2 presented Project Aristo: Towards a System that can Answer Elementary Science Questions. I’ve heard Peter talk about Aristo a couple of times already. Nevertheless, this was one of the more valuable talks for me, showing how they are ensembling multiple systems together and the overlap in capabilities between different systems. Recently his team has added a couple of new modules for different types of question answering, and those have been incorporated into their ensemble model. One of the new models is something they call the ‘table model’. It features information collected into relational-like tables. Some of the tables contain pretty straightforward information, such as cities within countries. Other tables are much less obvious, and devising a uniform set of keys for them looks like a difficult task. After the talk I asked Peter how the schema for those tables was developed. I’ll paraphrase his reply as “Lots of iterative manual effort.” A lot of that work was carried out in their IKE tool, which is definitely worth a look.
Towards Neural Network-based Question-Answering by Zhengdong Lu presented the planning for a fully neural QA system. Unfortunately this is another talk I did not get a lot out of.
Definitely a high point of the day for me was Richard Socher’s talk on Dynamic Memory Networks for Visual and Textual Question Answering. Richard’s slides are not online but the material is covered in the arXiv paper linked to above. The modular nature of the Dynamic Memory Network seems like an important aspect for future systems. The paper shows how different input units – one for text and one for images – were used in the system. In the future I expect we will see other kinds of input modules such as one for tables. The Episodic Memory attention mechanism is also interesting and this is an architecture I want to play with. One issue is that I am not clear on the capacity of the DMN for handling information from a couple of terabytes of text. I want to look into its scalability over the summer.
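To fix the episodic memory idea in my own head, here is a bare-bones numpy sketch. This is my simplification, not the paper’s exact architecture (the real DMN uses learned gating networks and a GRU for the memory update): the module makes repeated attention passes over encoded fact vectors, each pass updating a memory vector conditioned on the question.

```python
# Simplified sketch of DMN-style episodic memory attention (illustrative,
# not the paper's exact gating/update equations).
import numpy as np

rng = np.random.default_rng(0)
d, n_facts, n_passes = 8, 5, 3

facts = rng.normal(size=(n_facts, d))   # encoded input sentences
question = rng.normal(size=d)           # encoded question
memory = question.copy()                # memory initialized to the question

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(n_passes):
    # Attention gate: relevance of each fact to the question + current memory.
    scores = facts @ question + facts @ memory
    gates = softmax(scores)
    episode = gates @ facts             # weighted summary of the facts
    memory = np.tanh(memory + episode)  # stand-in for the DMN's GRU update
```

The multiple passes are what let the model chain facts together (transitive reasoning), which a single attention pass cannot do.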
The QA workshop concluded with a competition between California’s team for the National Quiz Bowl Championship, and the best entry for the quizbowl shared task. The humans won, but the computer gave a very creditable performance. Unfortunately, I don’t have a link to the description of the winning system, but it was a much smaller effort than the scale that was needed for Watson to win at Jeopardy! a few years ago. If the task were repeated next year then it seems very likely that the results would be reversed.