Ambiguity

I was exchanging email with a colleague about sentence simplification (see https://github.com/sebastianruder/NLP-progress/blob/master/english/simplification.md  if you need an intro). To illustrate a point about it, I picked a sentence basically at random from my favorite corpus. It happened to be:

However, in this instance, cells grow in aggregates and do not require centrifugation as they settle under gravity.

As I started to convert this to several simpler sentences, the degree of ambiguity I found was really surprising. (It pretty much always is; we just don't typically notice it.) It got to be fun as I deliberately started looking for alternative interpretations.

cells
Would that be biological cells, prison cells, some other kind of cell? Of course, it’s really some specific combo of cell line and treatment that can only be determined from analysing “this instance”. That’s one of the problems with sentence simplification. But I digress. See https://en.wikipedia.org/wiki/Cell for more senses of the term than you imagine.
grow “in”
Here they are talking about growing in clumps, but normally I think about containers when hearing “grow in”.  A colleague is looking at the scientific literature on the cell culture media (very fancy Gatorade) in which cells grow, so “grow in cell media” came quickly to my mind.
aggregates
Would that be clumps, or the stuff used to make concrete?  See https://en.wikipedia.org/wiki/Aggregate for some more senses of the term.
grow in aggregates
So do the cells grow in groups, or do they find rock and sand to be a hospitable environment? Won’t my colleague be surprised!
settle
Would that be the cells falling to the bottom of the container we were not talking about earlier? Or perhaps they are going to homestead new territory? Will they settle down like children, or settle for less like their parents? Or perhaps they reach a legal accommodation? (Which itself is ambiguous, but we don’t need to digress into their trek to a hotel since they still didn’t have a container in which to live). Since ‘settle’ is a verb, its Wikipedia disambiguation page isn’t quite so fun, but it is still worth a quick look if you have been reading this far and looking at the other ones.
under
“Under [the force of]” is a much less common sense of the term than “beneath”. And I must say that ‘under’, being a preposition, and an adverb, and an adjective, is terribly under-appreciated. It’s quite the over-achiever. (Hey, if you come to a site that says it is words about words, you have to expect a few puns).
gravity
Isaac Newton can be mentioned here, not only for the apple falling down, but also for the gravity of his nature as a vindictive SOB. Wikipedia’s disambiguation page about gravity is mostly a list of books, songs, and movies by that name, but eventually it gets to some more interesting senses.
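Just to drive the point home, WordNet will happily enumerate the senses behind these words. Here is a tiny sketch using NLTK (it assumes you have fetched the data with nltk.download('wordnet') at some point):

from nltk.corpus import wordnet as wn

for word in ["cell", "aggregate", "settle"]:
    senses = wn.synsets(word)
    print(word, len(senses))
    for syn in senses[:3]:                      # just a taste; there are plenty more
        print("  ", syn.name(), "-", syn.definition())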

I hope you all had some fun reading this, I certainly had fun writing it. I’d like to say that I’ll post again in a few days, rather than a few years, but we shall see.

Human-Computer Question Answering Workshop

This third and final part of a trip report about NAACL 2016 covers Thursday’s workshop on human-computer question answering. It featured several good talks and posters. Plus, at the end of the workshop the winning system for the quizbowl shared task faced off with the California championship team. This was possible since, conveniently, all the members of that team come from San Diego. Read more to find out how it turned out!

The workshop began in a jam-packed room with a talk by Ray Mooney about Ensembling Diverse Approaches to Question Answering. Ray is on the faculty of the University of Texas, but nobody will mistake him for a slow-talking Texan with a soft Southern drawl. 😉  Ray started by covering a range of types of question answering, and the systems that were developed to do them. He then hit his key point – the importance of “stacking” as an ensembling method, and as a way of bringing unsupervised methods into its normally fully-supervised framework. Highly recommended.
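For anyone who hasn’t played with stacking, here is a minimal sketch of the idea in scikit-learn terms: out-of-fold predictions from several base models become the features for a meta-classifier. The models and toy data below are placeholders, not anything from Ray’s actual systems.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(200, 10)                      # toy features
y = np.random.randint(0, 2, 200)                 # toy binary labels

base_models = [SVC(probability=True), DecisionTreeClassifier()]

# Out-of-fold probabilities keep the meta-model from seeing leaked training answers.
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])
stacker = LogisticRegression().fit(meta_features, y)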

Jason Weston from Facebook AI Research then spoke about Question Answering via Dialog-based Language Learning. The material for the talk is also covered in the paper Dialog-Based Language Learning. The bulk of the talk covered the details of 10 different supervision schemes, which led to different kinds of dialog between the computer learner and its hypothetical teacher. Each of the schemes was tried on four different training strategies. At the end he quickly gave the results comparing the strategies on the bAbI dataset and on the MovieQA dataset. In the Q&A period most of the questions were people giving him trouble about the bAbI dataset. Some of that was justified; the dataset is very artificial, but it is there for a purpose. After a bit it just seemed to me to be people piling on with unjustified criticism, since he had also evaluated the strategies on the MovieQA dataset. The results from both evaluations were consistent in identifying the highest-performing schemes, with the MovieQA dataset having lower scores because of its greater complexity.

After the coffee break the workshop resumed in a larger room, to the greater comfort of all concerned.  Eunsol Choi spoke on Semantic parsing with Freebase: a very large, yet wildly incomplete knowledge base. I have to confess that I didn’t get a lot out of this talk. One key theme was the desire to use text instead of a structured knowledge base because of the greater amount of information available that way. That’s perfectly reasonable to me. At work we regularly see how sparse the coverage of WikiData is for scientific topics. Most interesting to me was the work on using CCG parses to go from the text to something close to a logical form. This is something I’d like to get back to and study more.

Charlie Beller spoke on the Watson Discovery Advisor: Question-answering in an industrial setting. This is pretty familiar material to me. Certainly if you have not read about Watson’s architecture you should do so. You should also try to find out more about the knowledge engineering efforts that go on behind the scenes in the modern Watson deployments.

Denis Savenkov presented Crowdsourcing for (almost) Real-time Question Answering, which looked at timeboxing the work of Mechanical Turkers so that they could be used in nearly interactive question answering systems. Accuracy did not suffer greatly if tasks were limited to 1 minute, and in fact it looked like most of their tasks could be limited to 30 or even 20 seconds without severe damage to accuracy. (Obviously, your mileage may vary depending on the complexity of the task.) One of the takeaways from the talk and its Q&A period was that for many tasks you get better results by having only 1 turker tag each item, instead of ending up with 1/3 as many items that have each been checked by 3 turkers. This is good advice in machine learning contexts where you expect some amount of noise to be generalized away. If you are trying to create a benchmark standard, that advice would not necessarily apply.

Attention-Based Convolutional Neural Network for Machine Comprehension was presented by Hinrich Schütze. 

After the lunch break, Peter Clark of AI2 presented Project Aristo: Towards a System that can Answer Elementary Science Questions. I’ve heard Peter talk about Aristo a couple of times already. Nevertheless, this was one of the more valuable talks for me, showing how they are ensembling multiple systems together and the overlap in capabilities between different systems. Recently his team has added a couple of new modules for different types of question answering. Those have been incorporated into their ensemble model. One of the new models is something they call the ‘table model’. It features information collected into relational-like tables. Some of the tables contain pretty straightforward information, such as cities within countries. Other tables were much less obvious. Having a uniform set of keys looks like a difficult task. After the talk I asked Peter how the schema for those tables was developed. I’ll paraphrase his reply as “Lots of iterative manual effort.” A lot of that work was carried out in their IKE tool, which is definitely worth a look.

Towards Neural Network-based Question-Answering, by Zhengdong Lu, presented the planning for a fully neural QA system. Unfortunately this is another talk I did not get a lot out of.

Definitely a high point of the day for me was Richard Socher’s talk on Dynamic Memory Networks for Visual and Textual Question Answering. Richard’s slides are not online but the material is covered in the arXiv paper linked to above. The modular nature of the Dynamic Memory Network seems like an important aspect for future systems. The paper shows how different input units – one for text and one for images – were used in the system. In the future I expect we will see other kinds of input modules such as one for tables. The Episodic Memory attention mechanism is also interesting and this is an architecture I want to play with. One issue is that I am not clear on the capacity of the DMN for handling information from a couple of terabytes of text. I want to look into its scalability over the summer.

The QA workshop concluded with a competition between California’s team for the National Quiz Bowl Championship, and the best entry for the quizbowl shared task.  The humans won, but the computer gave a very creditable performance. Unfortunately, I don’t have a link to the description of the winning system, but it was a much smaller effort than the scale that was needed for Watson to win at Jeopardy! a few years ago. If the task were repeated next year then it seems very likely that the results would be reversed.

AKBC 2016

Earlier I blogged about the NAACL conference. The conference proper ran Monday through Wednesday. It was followed by workshops on Thursday and Friday. I’ll describe Thursday’s workshop on Question Answering in a future posting. This one is about Friday’s workshop on Automated Knowledge Base Construction (AKBC). It was the highlight of the week for me, and I left there re-energized to dig into some tasks I’ve been wanting to do for a while, and with some cool ideas for new tasks.

The organizers put a lot of effort into the workshop, including humorous introductions for most of the invited speakers. For better or worse, they had more invited talks than contributed ones. I think this was for the better, as we got to hear about the efforts underway at several leading programs, and the speakers were good communicators. The disadvantage is that the slides of their talks are not available for download (at least yet).

Oren Etzioni spoke about things underway at the Allen Institute for Artificial Intelligence, and he put in a plug for the poster on IKE (Interactive Knowledge Explorer), which looks like a very nice tool for helping in KB construction. Oren’s work on Semantic Scholar has me itching to do a lot more analysis on the text of sentences that cite other works in the literature.

Andrew McCallum spoke about recent progress in Universal Schemas, and put in a plug for a couple of posters from his group when those topics came up in his presentation: “Incorporating Selectional Preferences in Multi-hop Relation Extraction” by Rajarshi Das et al., and “Row-less Universal Schema” by Patrick Verga et al. Patrick’s work is very interesting as it allows us to train a Universal Schema, and then use it even when the incoming entities are not ones that were seen at training time. There is some complementary work on column-less Universal Schema, and Patrick has identified a merger of the two as an area of future work.
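If you haven’t seen a universal schema before, the core object is a matrix whose rows are entity pairs and whose columns mix KB relations with textual patterns, and a low-rank factorization is used to score the unobserved cells. Here is a toy sketch of the shape of the problem; real models learn embeddings with ranking or logistic losses, so plain SVD here is only an illustration, and the entity pairs and relations are made up.

import numpy as np

pairs = ["(Paris, France)", "(Tokyo, Japan)", "(Berlin, Germany)"]
relations = ["capital_of", "located_in", "X is the capital of Y", "X, a city in Y"]

# 1 = observed in the KB or in text, 0 = unobserved.
observed = np.array([[1, 1, 1, 1],
                     [1, 1, 0, 1],
                     [0, 1, 1, 0]], dtype=float)

U, s, Vt = np.linalg.svd(observed, full_matrices=False)
k = 2
scores = (U[:, :k] * s[:k]).dot(Vt[:k, :])       # fills in scores for the empty cells
print(scores[1, 2])                              # (Tokyo, Japan) with "X is the capital of Y"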

William Cohen spoke about some statistical relational learning systems like ProPPR and the quality of results they produce. He also noted that they were hard to integrate into an end-to-end system. Neural networks are easy to integrate that way because the individual components are differentiable, therefore the whole system can be trained with backprop. He went on to talk about making logical systems that were differentiable, such as their most current work on TensorLog.

Chris Manning spoke about a couple of things – one was contrasting some neural networks against traditional methods, then reworking those networks for improved performance. He also spoke about Natural Logic Inference – breaking complex sentences into clauses, then mutating the text of those clauses, to come to a set of simpler statements that are still valid inferences from the original text. That work has been recently extended to come up with a broader range of inferences based on text similarity, at the cost of shifting to probabilities of being a correct inference. Some of us see that as a benefit.

Richard Socher spoke about Dynamic Memory Networks. The Episodic Memory Module in these networks is an exceedingly cool attention mechanism and I recommend you read the paper “Dynamic Memory Networks for Visual and Textual Question Answering“. The composition of several modules to provide an overall system is probably a foretaste of the things to come.

The talks from Richard, Andrew and Chris made me want to go out and get their code and start playing around. Quite a bit of that code is available. Based on a quick peek before the flight home, CoreNLP now has an Open IE annotator that seems to use the Natural Logic Inference, although I don’t think it has the lexical similarity work in it yet. Patrick Verga, from Andrew’s group, has code on GitHub for working with a few different Universal Schema models including the new row-less one. There are several reimplementations of Richard’s work on GitHub. I think I will have fun over the summer with these things!

I should note that my colleague Paul Groth submitted a paper to the AKBC workshop which was accepted for poster presentation. He couldn’t make the trip due to schedule issues, so I presented it for him. Unlike Verga’s paper, ours is not about pushing the state of the art in Universal Schemas. What it does show is how it was pretty easy to use Universal Schemas and some other unsupervised techniques to come up with a system to help in the maintenance of something we might call a medical knowledge base. Since the panel of experts at the end of the day predicted that medical was the big domain for NLP in the near-to-medium term, and they also said they didn’t see a lot of people working on integrating a flow of new knowledge into a knowledge base, you might want to check it out: Applying Universal Schemas for Domain Specific Ontology Expansion, Paul Groth et al.

One final comment – the organizers had a set of questions for the audience and panel. One of them was whether the next big challenge for NLP/AKBC was precision, recall, scalability, or something else. Recall was the dominant answer. Personally, I’d go for something else. Make no mistake, I think recall is both vital and difficult. But it seems to me that the question assumes too much. How are we going to evaluate recall, or even precision, on a large knowledge base that is automatically constructed and maintained? I think we need some fundamental advance in how we can perform such evaluations before we can tell what kind of progress we are making on recall.

NAACL 2016

The weather was beautiful in San Diego last week during the North American chapter of the Association for Computational Linguistics conference, better known as NAACL.  Lots of interesting stuff on the inside of the meeting hotel as well. The conference and affiliated workshops took all week, so there is too much material for me to describe in a reasonable-length blog posting.  Check the link above for the schedule and proceedings, then enjoy the kinds of papers you like. Instead of describing all (or even many) of the talks I attended, I’ll try to describe the context and a few high points. The workshops on Thursday and Friday were really great too, so I’ll address them in another post.

Deep Learning was the major semi-official theme of the conference.  Is it fashionable (yes), is it over-hyped (probably), is it all a bunch of BS (definitely not), etc. During lunch on Monday I was chatting with a grad student who asked something to the effect of whether I thought neural nets were going to remain ascendant this time, as opposed to their earlier rides on the hype cycle. “Probably not.” was my answer, which elicited an astonished look. I went on to explain that this is partially due to known issues, such as the difficulty of getting explanations of their decisions (although there has been work on reducing that problem). It is partially due to the way that ensemble models tend to work better. A neural net may make the decisions in an ensemble, but there will be other methods in there as well. On reflection, as I write this post, it just seems unlikely that we have finally come across the great secret of AI and now it’s just a matter of hyperparameter optimization.  I just don’t think things are going to be that easy. I can hope that neural network methods don’t hit the depths of despair that they have in the past, but I have little doubt that something new will come along in five years and be the newly fashionable thing.

NAACL is a heavily academic conference, despite considerable participation by research groups from Google, Microsoft, Facebook, etc. This was highlighted in the opening keynote when the speaker, Regina Barzilay of MIT, mentioned her surprise in finding out that most current medical NLP work is rules-based. Certainly learning-based methods have dominated the academic literature in the last 10 or more years, but rules still dominate commercial practice. Rule-based methods have characteristics that fit commercial constraints well. You can start simply. You don’t need a large existing training and test set. The rules tend to be high precision, which is a good thing since errors in precision are more blatant to people than errors in recall. Importantly, you can understand why a set of rules gave the decision that it did, as opposed to having an uninterpretable matrix of numbers. That reason in particular is important to medical practitioners. All of this has been known in the commercial community for years. In 2013 it was nicely described to the academic community in the paper “Rule-based Information Extraction is Dead!  Long Live Rule-based Information Extraction Systems!” by Laura Chiticariu and colleagues at IBM’s Almaden research lab. You may not be able to get a paper about a rules-based system into an NLP conference, but that doesn’t mean they are not a pragmatic way to get things done.
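To make the “you can start simply” point concrete, here is about the smallest useful rule I can imagine: a single high-precision pattern with no training data in sight. The pattern itself is just an illustration, not from any real system.

import re

# Pull drug dosages out of clinical text with one regular expression.
DOSE = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|mcg|g)\b", re.IGNORECASE)

text = "Patients received 50 mg of compound X twice daily."
for amount, unit in DOSE.findall(text):
    print(amount, unit)                           # -> 50 mg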

Enough about the context. Here are a few papers that particularly caught my interest:

Dynamic Entity Representation with Max-pooling Improves Machine Reading by Sosuke Kobayashi et al. I tweeted about this paper right after the talk wound up. As mentioned then, it looks like a very interesting way to accumulate a word vector for a named entity based on its contexts in a document, and also on its interactions with other named entities. There are a couple of caveats to keep in mind. This paper uses gold mentions of entity boundaries, so it is not clear how well it will work in the wild when it has to find its own entities. Also, special contexts for embeddings tend to help with intrinsic evaluation tasks, but not make much difference in extrinsic evaluations which are more important.

How can I say that special contexts don’t make a big difference? Check out: “The Role of Context Types and Dimensionality in Learning Word Embeddings” by Oren Melamud et al. This paper appeared on arXiv a few months ago. The short version is that if you are limited on the amount of space you want to use for your embeddings, you should just use a basic neighborhood. If you have lots of space, you should make your basic neighborhood vector  longer and longer until you are not gaining benefit, then augment that with vectors from other contexts such as syntactic dependencies or semantic roles.
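In practice that augmentation is just concatenation. A hedged sketch using gensim’s KeyedVectors, where the two vector files are placeholders for embeddings trained with different context types (say, plain windows and syntactic dependencies):

import numpy as np
from gensim.models import KeyedVectors

bow = KeyedVectors.load_word2vec_format("window_contexts.vec")        # placeholder file
dep = KeyedVectors.load_word2vec_format("dependency_contexts.vec")    # placeholder file

def combined(word):
    # Concatenate the two views; back off to zeros if one view lacks the word.
    parts = [kv[word] if word in kv else np.zeros(kv.vector_size) for kv in (bow, dep)]
    return np.concatenate(parts)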

“Comparing Convolutional Neural Networks to Traditional Models for Slot Filling” by Heike Adel et al.

This paper compares three methods for slot-filling: pattern-based, SVM over common features, and CNN methods. A combination of the three is found to be best. However, the performance on an end-to-end slot-filling benchmark has an F1 of less than 0.3, so this is not ready for prime time.

“Abstractive Sentence Summarization with Attentive Recurrent Neural Networks”, Sumit Chopra et al.

Abstractive summarization is more difficult than extractive, but also promises to have more meaningful summaries as opposed to picking N key sentences. And attentive RNNs are just flat-out cool. I think we will see more work along this line in the near future. But if you need to do summarization right now, extractive remains the way to go. The abstractive methods have this unfortunate little problem with negation. 

“Neural Architectures for Named Entity Recognition”, Guillaume Lample et al.

This is an update of some very interesting work that has been on arXiv since early March. They use a CRF on top of a bidirectional LSTM, and the LSTM combines both character-based and word-based embeddings. The performance on the CoNLL-2003 test set hits an F1 of .909, trailing only one other system they know of, which was trained with external data and used a gazetteer. Even better, their code is available on GitHub.  That’s great, but to add a bit of perspective, the top system from 2003 had an F1 of .8876, so the absolute amount of improvement in 13 years is disappointing. Also, this benchmark looks at Person, Location, and Organization entities; the performance of their code on scientific entities across a range of disciplines is sure to be much lower. Nevertheless, in my heart of hearts I’m hoping to be proven wrong. I’m going to try to get some data together and find out!

I could go on, but really, go check out the program  and the proceedings. If you have read this far, you are the kind of person who will find them a good way to spend many hours.

On replicating “Grammar as a Foreign Language”

Last December at the NIPS conference I got the opportunity to talk with Lukasz Kaiser of Google about the work described in “Grammar as a Foreign Language”. If you are not familiar with the paper, check it out (so long as you want to know about using Bidirectional LSTMs with Attention to implement a seq2seq model for syntactic constituents parsing, of course).  🙂

As it turns out, replicating this work is not too hard. You just need to copy and modify TensorFlow’s translation example. Here are the tips Lukasz gave me for doing so:

1) Make your input and output data. The input should be sentences, the output should be constituent parse trees. We used sentences that were already tokenized, which required some tweaking in the code to not use the default tokenizer.  There are a couple of differences from standard linearized parse tree syntax. First, you should have indicators for the types of closing parens. For example, if we have a noun phrase, we are used to seeing ‘(NP … )’. You just need to change the closing paren from ‘)’ to ‘)NP’. Second, drop the tokens from the output. For example, change:

(NP (DT the) (JJ big) (NN door))

to

(NP (DT )DT (JJ )JJ (NN )NN )NP

You will have to insert the tokens in the right place in a post-processing step. Excluding the tokens from the desired output was something that my colleague, Sujit Pal, and I tripped over in trying to replicate the work. Don’t make the same mistake I did and try to have it produce the tokens as well as the parse tree. I’m hoping there is at least one person out there in the world who finds this and saves themselves some time.
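Here is a sketch of that linearization using nltk’s Tree class, typed closing parens and all, with the tokens dropped on the way out:

from nltk import Tree

def linearize(node):
    if isinstance(node, str):                     # a leaf token: drop it
        return []
    symbols = ["(" + node.label()]
    for child in node:
        symbols.extend(linearize(child))
    symbols.append(")" + node.label())
    return symbols

tree = Tree.fromstring("(NP (DT the) (JJ big) (NN door))")
print(" ".join(linearize(tree)))                  # -> (NP (DT )DT (JJ )JJ (NN )NN )NP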

2) Simplify the network. The default translation example uses 1024-element cells. Translation is more difficult than parsing, so shrink those to about 256 elements. Stay at three layers, at least to start.

3) Adjust input/output bucket sizes. The translation example has a number of different-sized buckets for short, medium, and long sentences. The output buckets are a little bit longer than the input buckets. For parsing, they should be a lot longer.  Notice that one token from the input sentence can lead to many output symbols. I made my outputs about 4x longer than the inputs.
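For what it’s worth, the bucket list ends up looking something like the line below; these particular numbers just illustrate the roughly 4x ratio rather than being the ones we finally settled on.

# (input length, output length) pairs; outputs need room for all the extra symbols.
_buckets = [(10, 40), (25, 100), (40, 160), (60, 240)]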

4) Use lots of automatically-parsed data to start the training.  We used BLLIP as our constituents parser for starting things, just because we had about 100k articles already parsed as a result of an earlier experiment.  That was more data than needed. We never even completed the first epoch through the data. Once the training on BLLIP data got to a point where the perplexity wasn’t changing very fast, we switched to tuning using the CRAFT corpus. As we work on improving performance we may go back and use more of the automatic training data.

5) Be careful about the numbers in some constituents. By default, the TensorFlow translation example will normalize all digits to 0. That’s not so good for constituents like WHNP-1, NP-SBJ-2, SBAR-3, etc. In our experiments so far we are just normalizing all those to 0. We will try to improve things later when we get serious about improving the error rate.
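One possible workaround, sketched below, is to normalize digits only in ordinary tokens and to leave anything that looks like a constituent label alone. This is something we may try later, not what the stock data_utils does:

import re

_DIGIT_RE = re.compile(r"\d")
_LABEL_RE = re.compile(r"^[()]?[A-Z]+(?:-[A-Z]+)*-\d+$")   # e.g. (WHNP-1, )NP-SBJ-2

def normalize(token):
    return token if _LABEL_RE.match(token) else _DIGIT_RE.sub("0", token)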

To evaluate this, we pulled out a random selection of 1k sentences from the CRAFT corpus to be our test set. The figures we are getting currently are:

Error Rate 0.39
Cross Entropy 15.768
Perplexity 1.143

We have not done any hyperparameter tuning, and still have some #FIXMEs to do like handle the numbers in some constituents, but this seems like a good start. Thanks, Lukasz, for all the advice; and thanks to Sujit for all the work in doing the training and testing.

One note Sujit added when he proofread this post for accuracy is that we are using a large AWS instance. TensorFlow works across all the CPUs, so the performance is close to that of a single GPU. When TensorFlow starts working across multiple GPUs in one machine then we will change things.

At your service…

Google announced TensorFlow Serving today. The basic notion is simple – take your trained TensorFlow model and make it into a web service running on some scalable hardware. Predictions on demand.

The fact that this is part of the TensorFlow way of doing things is not the important bit. (In fact, truth be told, I’m finding TensorFlow to be more of a pain as we use it and I’m looking at alternatives). But the notion of deploying models behind a service API is the big idea. People should get comfy with that notion because I expect we will end up using it a lot.
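To make the notion concrete, here is a bare-bones sketch of predictions-on-demand behind an HTTP endpoint, using Flask and a dummy model standing in for whatever you actually trained. TensorFlow Serving itself uses gRPC and a purpose-built server, so treat this only as an illustration of the shape of the idea:

from flask import Flask, request, jsonify

app = Flask(__name__)

class DummyModel(object):                         # stand-in for a restored trained model
    def predict(self, rows):
        return [sum(row) for row in rows]

model = DummyModel()

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]     # e.g. {"features": [0.2, 1.7, 3.0]}
    return jsonify(prediction=model.predict([features])[0])

if __name__ == "__main__":
    app.run(port=8080)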

On a related matter, you might want to look at the Velox model manager for Spark and its  machine learning facilities. Plopping a model behind a REST-ful API is only touching part of the problem.

Don’t worry. Be happy!

It seems like I only make time to write on this blog when I’m at a conference. And the reason for that is that I use these as my trip reports. 🙂

I’m at the NIPS conference in Montreal. Things are crazy here with 3700 attendees. The thing about NIPS that is really unusual is that every accepted paper ends up being presented as a poster. Some of the papers are then selected for oral presentations of varying lengths. There are lots of 3 min. spotlight announcements and a handful of 30 min oral presentations. The other thing that is unusual for a conference of this size is that it is a single track. This sounded like a pretty cool arrangement when it was a small conference, but I’m not so sure it works well with the current size. Good luck talking with the author of a popular paper during the poster session.

But I digress. I wanted to tell you about happy things, not moan about the world’s major machine learning conference.

So the good news here comes from Yoshua Bengio during the Deep Learning tutorial on Monday. Anyone doing optimization has worried about the problem of being caught in local minima when trying to optimize some cost function. This is, in fact, a big deal when the cost function has a modest number of dimensions. But Bengio claims that as the number of dimensions increases, saddle points proliferate. When dealing with 100 or more dimensions, it is very rare to hit a true local minimum. Most of the time, what looks like a minimum will have at least one dimension in which it is actually a saddle point, which lets us continue our optimization. Furthermore, he claims that when you do hit a true local minimum, most of them are not radically different in value from each other or from the true global minimum. Thus, the title of this post.
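You can get a feel for the claim with a toy experiment: draw random symmetric “Hessians” and see how often every eigenvalue is positive, i.e. how often a critical point would be a true minimum rather than a saddle. This is only a numerical illustration, not Bengio’s actual argument:

import numpy as np

rng = np.random.RandomState(0)

def fraction_minima(dim, trials=2000):
    hits = 0
    for _ in range(trials):
        A = rng.randn(dim, dim)
        H = (A + A.T) / 2.0                       # random symmetric matrix
        if np.all(np.linalg.eigvalsh(H) > 0):
            hits += 1
    return hits / float(trials)

for d in (1, 2, 5, 10):
    print(d, fraction_minima(d))                  # the fraction collapses as d grows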

All is not sweetness and light of course. The gradient at a saddle point is effectively zero, so it can be hard to escape. But Bengio’s observation provides hope that usually, an escape is possible.  Hmmm – do I hear the sound of simulated annealing coming back over the hill?

Spark Debugging and the Wheel of Reincarnation

Yesterday, Databricks announced that they are making Spark debugging easier by their integration of the Spark UI into the Databricks platform. True enough, but don’t confuse “easier” with “easy”.

Don’t get me wrong – we have Databricks at work and I love it. But debugging has been its weak point. The Spark UI integration gives us some visuals to indicate job status, percentage completion, etc. It has another visualization to show the data dependency graph that underlies everything Spark does. But these pretty pictures are not what I really want.

One of the first things I worked on after grad school was developing a parallel debugger – because I was tired of writing parallel code without the source-level debugging tools that I had on serial machines. That debugger also started out with visuals – in that case it was illustrations of the messages being sent between the processes, as well as information about the process state over time – running, blocked waiting, idle, etc. That was based on the data that was easiest to capture. The display was kind of pretty, IMHO, but it turned out to be largely useless. What really makes debugging easier for me is being able to step through the code, examine variables, set breakpoints, run until a condition is encountered, etc. You don’t even need a WYSIWYG display of the code for that. I pretty much run all my Perl and Python code the first few times in their simple debuggers to catch the errors I make. But if your ‘debugger’ doesn’t let me see the state of the program and the results of a statement, then I’m pretty much forced back to debug print statements to figure that out, and your debugger doesn’t get used.

Similarly, there is no point in showing me details about things over which I have no control. As an example, the image below comes from looking at the data graph visualization. But all of that information comes from the statement:

inputTest = sqlContext.read.parquet(aFile)

I have no control over all the things in the graph, so why show them to me?

Make no mistake about it, creating a source level debugger for parallel programs is hard. You can’t even easily let people inspect the value of a variable if there are N copies of that piece of code running at the time. Which one do you show? How do you help the user manage the complexity of N copies of the same routine, never mind the interactions of different routines?

I think the next stage of debugging in Spark will be to do a better job of handling exceptions and letting us inspect the values of variables just before things went off the rails. The notebook environment is helpful since the pieces of code are frequently shorter than is the norm in large IDEs. But I still want a parallel debugger that gives me the stuff I use most when writing serial code.

Spark Summit West 2015

I was at the Spark Summit last week. Lots of interesting talks and some good chats with people in the booths in the exhibit hall. Different people will have different take-aways from the conference; I’d like to call out a couple of big themes and some miscellaneous topics. The big themes are Spark’s growth and respectability, and a renewed focus on CPU performance.

Growth & Respectability

Last year’s Spark Summit in San Francisco had about 400 attendees, which seemed like a lot at the time. This year’s had about 2000. 5x attendance growth in one year is pretty strong by anyone’s standards. It’s even more impressive when you consider there was an East Coast meeting a few months ago and there is a European meeting coming up in October. Neither of those was the case last year.

An indication of the respectability of Spark is IBM’s recent commitment to it. As noted by Ben Horowitz, having the blessing of Big Blue makes this something even more people can start to consider as part of their tech stacks. Also, the number of developers IBM can bring to the table should help with performance, reliability, and functionality fixes. It does, however, bring a whole new level of project management responsibility to the main committers.

The release of v1.4 happened late last week. Given that 1.5 will come out in about 3 months, and 1.6 is scheduled for 3 months after that, the release is less important for its specific features (although there are some we’ve been wanting) than for showing that the committers are on track, continuing to quickly deliver valuable updates, and working to a good roadmap for several upcoming releases.

CPU Performance

As noted by Reynold Xin in his talk “From DataFrames to Tungsten: A Peek into Spark’s Future“, I/O and network speed have increased by about 10x since the very earliest days of Spark. Memory capacity has done about the same. But CPU clock speed has hardly budged. CPU performance increases have mostly come from adding cores. Those extra cores are harder to exploit than the simple clock-speed changes that drove performance improvements over the last few decades. Of course, that is part of Spark’s reason for being. We can run Spark jobs on a multicore machine with a few workers and make good use of the cores. For essentially all of my professional life, the emphasis in parallel programming has been hiding the latency of disk and network access. There were two very good talks about how the focus on performance is shifting back to the CPU:

Making Sense of Spark Performance by Kay Ousterhout of Berkeley

Kay and colleagues did some good analysis of execution traces and found out how much time the programs spent blocked waiting for I/O, blocked waiting for network communication, and how much time was spent fully using the CPU.  For several reasons the old truths about most programs being I/O or network bound are no longer true. As noted above, disk and network performance has improved while single-threaded CPU performance has not. Another factor is the software improvements. Spark, the OS, and their integration with the hardware already do a great job of hiding communication latency. We will be stalled on reading the first blocks from disk, but once we start computing with that, the rest of the data access goes on in the background so its latency is hidden.

Looking at the traces and the times when a process is blocked waiting for disk or network access, Kay and colleagues found that speed would improve by no more than about 20% if the disk access was infinitely fast so that we never had to wait on it. Network delays were even less important – only about 2% improvement could be made. These performance hits are the maximum they saw across a relatively small number of programs, but the typical hit was much less.

Another performance issue Kay discussed was about ‘stragglers’ – those processes that stall everything because they are still grinding away long after all the other jobs are done. She mentioned that the cause of about ½ of the stragglers could be identified and performance improved, although she did not give more details in the talk. Afterwards when I asked about this, she mentioned one example was during shuffles. The default filesystem on EC2 Linux boxes is ext3, which does not do a great job of handling lots of little files. Even though the files go to SSD, there is still a noticeable overhead for creating and deleting them as they are needed to hold information for shuffling the data, which led to stragglers. In one case they essentially doubled the speed of a program by changing the underlying file system. (Sorry, don’t know what they switched to.)

Deep Dive into Project Tungsten: Bringing Spark Closer to the Bare Metal by Josh Rosen

As they became aware of Kay’s work and its implications, the Spark team responded with the Project Tungsten effort. It is an effort to improve the speed of Spark programs by addressing CPU speed limits through several methods. One overall method is more sophisticated memory management, which has three expected benefits. The first is avoiding long garbage collection pauses, which have been the bane of JVM performance for some time. The second is better performance through simpler representations of data structures – using primitive arrays of floats instead of full Java arrays of Java Floats. This uses less memory, and performs better by avoiding extra housekeeping operations. The third is that the simpler C-like data structures will be much more cache-friendly.

A second method of improving performance is changing algorithms into more cache-aware ones. They used the example of sorting a list of records. The usual way is to have a list of pointers to the records, and move the pointers in the list while the data records stay where they are in main memory. That’s OK, but every comparison means we chase two pointers, and as the list gets sorted the locality gets worse, so this method is brutal on the cache. If the data structure is changed to be a list of (pointer, prefix) pairs, the initial sorting can be done on the prefixes, and that will fit into cache much better.  (If you are sorting on a single field, the prefix would be the value of that field, and you would never have to go back to the full records in main memory during the sort).
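In Python terms, the two-step shape looks like the sketch below. The actual cache win only shows up in the packed, C-like layout Tungsten uses, so this is just to show the structure of the trick:

records = [{"name": "carol", "age": 35},
           {"name": "alice", "age": 41},
           {"name": "bob",   "age": 29}]

# Step 1: sort small (prefix, index) pairs instead of repeatedly chasing full records.
pairs = sorted((rec["name"], i) for i, rec in enumerate(records))

# Step 2: a single pass gathers the records into their final order.
sorted_records = [records[i] for _, i in pairs]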

There is lots more to Tungsten. I’m looking forward to when we can easily use large numbers of GPUs through Spark.

Miscellaneous

Two other talks I particularly enjoyed were:

Integrating Spark and Solr by Timothy Potter of Lucidworks

For more than a year Spark has had a connector to ElasticSearch that will let you either index an RDD, or run a query and bring back the results as an RDD. Tim is doing something similar with Solr. His code to connect Spark and Solr is on Github: https://github.com/LucidWorks/spark-solr. An interesting question is how far either of those efforts can go towards making a full-text engine conform to Spark’s Data Source API, and how far Spark’s Catalyst optimizer will be able to go in optimizing queries involving full text. I’m looking forward to seeing what happens there over the next year.
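As a hedged sketch of the “run a query, get an RDD back” direction: with the elasticsearch-hadoop jar on the classpath, something along these lines should work from PySpark. The cluster address, index name, and query below are placeholders; check the es-hadoop docs for the current configuration keys.

es_conf = {
    "es.nodes": "localhost:9200",                 # placeholder cluster address
    "es.resource": "articles/doc",                # placeholder index/type
    "es.query": "?q=centrifugation",
}
hits = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_conf,
)
print(hits.count())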

Building the Autodesk Design Graph by Yotto Koga of Autodesk

Autodesk is working to make it possible to search for designs, as opposed to searching for words that may appear in some design documentation. A pretty interesting and difficult problem when you think about it. They have developed some interesting methods to search for the parts within a design. For example, they have something to normalize the different triangle meshes that tile the surface of a part. Yotto had a very interesting analogy between text retrieval and classification with the Bag of Words model, and design retrieval and classification using a Bag of Parts model.

All the conference talks were videotaped. As those videos are edited, they will appear on the conference site, along with the presentations. Check out the full agenda! I hope to see you at next year’s show!

The more the merrier

Last week I was at the AAAI Symposium on Knowledge Representation and Reasoning (KRR). Check out the schedule at that link, and the accepted papers presented as posters. Lots of good stuff there if, of course, you are into that kind of thing.

I think getting to hear Geoff Hinton and Doug Lenat recapitulate the battle of the neats vs. the scruffies was probably the highlight of the conference for many. That was, frankly, kind of fun. What I really got from the symposium was the diversity of methods that are available and need to be reconciled. There is little doubt that the machinery under our veneer of conscious thought is much closer to the neural network models than to the rules and assertions of symbolic logic. So what? Airplanes don’t flap their wings, as has been pointed out for most of the life of AI. There are times when being able to set forth a few facts and rules seems like a great and pragmatic efficiency hack. Josh Tenenbaum’s talk about the things babies learn before they can speak or even understand much of what is said – things like gravity, density, and the stability of towers of blocks – pointed out that part of our thinking is akin to running simulations using a game engine.

People who are trying to figure out consciousness may not be happy with efficiency hacks, but people trying to engineer useful tools probably will be. Integrating those three models of cognition, plus a few others that will come along, is going to keep a lot of grad students locked in their labs and staying out of trouble at night.

The same theme of integrating multiple methods repeats itself at a smaller scale. Also present at the meeting were some people pushing forward with Probabilistic Soft Logic, and others pushing forward with matrix factorization and universal schemas.  How do we tie those in with networks trained on multi-modal content? I’m seriously interested to see how that works out over the next couple of years.