I was at the Spark Summit last week. Lots of interesting talks and some good chats with people in the booths in the exhibit hall. Different people will have different take-aways from the conference; I’d like to call out a couple of big themes and some miscellaneous topics. The big themes are Spark’s growth and respectability, and a renewed focus on CPU performance.
Growth & Respectability
Last year’s Spark Summit in San Francisco had about 400 attendees, which seemed like a lot at the time. This year’s had about 2000. 5x attendance growth in one year is pretty strong by anyone’s standards. It’s even more impressive when you consider there was an East Coast meeting a few months ago and there is a European meeting coming up in October. Neither of those was the case last year.
An indication of the respectability of Spark is IBM’s recent commitment to it. As noted by Ben Horowitz, having the blessing of Big Blue makes this something even more people can start to consider as part of their tech stacks. Also, the number of developers IBM can bring to the table should help with performance, reliability, and functionality fixes. It does, however, bring a whole new level of project management responsibility to the main committers.
The release of v1.4 happened late last week. Given that 1.5 will come out in about 3 months, and 1.6 is scheduled for 3 months after that, that is less important for its specific features (although there are some we’ve been wanting) than it is for showing that the committers are on-track and are continuing to quickly deliver valuable updates and have a good roadmap for several upcoming releases.
As noted by Reynold Xin in his talk “From DataFrames to Tungsten: A Peek into Spark’s Future“, I/O and network speed have increased by about 10x since the very earliest days of Spark. Memory capacity has done about the same. But CPU clock speed has hardly budged. CPU performance increases have mostly come from adding cores. Those extra cores are harder to exploit than the simple clock-speed changes that drove performance improvements over the last few decades. Of course, that is part of Spark’s reason for being. We can run Spark jobs on a multicore machine with a few workers and make good use of the cores. For essentially all of my professional life, the emphasis in parallel programming has been hiding the latency of disk and network access. There were two very good talks about how the focus on performance is shifting back to the CPU:
Making Sense of Spark Performance by Kay Osterhout of Berkeley
Kay and colleagues did some good analysis of execution traces and found out how much time the programs spent blocked while waiting for I/O, blocked while waiting for network communication, and blocked while fully using the CPU. For several reasons the old truths about most programs being I/O or network bound are no longer true. As noted above, disk and network performance has improved while single-threaded CPU performance has not. Another factor is the software improvements. Spark, the OS, and their integration with the hardware already do a great job of hiding communication latency. We will be stalled on reading the first blocks from disk, but once we start computing with that, the rest of the data access goes on in the background so its latency is hidden.
Looking at the traces and the times when a process is blocked waiting for disk or network access, Kay and colleagues found that speed would improve by no more than about 20% if the disk access was infinitely fast so that we never had to wait on it. Network delays were even less important – only about 2% improvement could be made. These performance hits are the maximum they saw across a relatively small number of programs, but the typical hit was much less.
Another performance issue Kay discussed was about ‘stragglers’ – those processes that stall everything because they are still grinding away long after all the other jobs are done. She mentioned that the cause of about ½ of the stragglers could be identified and performance improved, although she did not give more details in the talk. Afterwards when I asked about this, she mentioned one example was during shuffles. The default filesystem on EC2 Linux boxes is ext3, which does not do a great job of handling lots of little files. Even though the files go to SSD, there is still a noticeable overhead for creating and deleting them as they are needed to hold information for shuffling the data, which led to stragglers. In one case they essentially doubled the speed of a program by changing the underlying file system. (Sorry, don’t know what they switched to.)
As they became aware of Kay’s work and its implications, the Spark team has responded with the Project Tungsten effort. It is an effort to improve the speed of Spark Programs by addressing CPU speed limits through several methods. One overall method is more sophisticated memory management, which has three expected benefits. The first is to avoid long garbage collection pauses, which has been the bane of JVM performance for some time. The second is better performance by simpler representations of data structures – using more primitive arrays of floats instead of full Java arrays of Java Floats. This will use less memory, and have better performance by avoiding extra housekeeping operations. The simpler C-like datastructures will also be much more cache-friendly.
A second method of improving performance is by changing algorithms into more cache-aware ones. They used the example of sorting a list of records. The usual way is to have a list of pointers to the records, and move the pointers in the list while the data records stay where they are in main memory. That’s OK, but every comparison means we chase 2 pointers. As the list is sorted the locality will get worse so this method is brutal on the cache. If the data structure is changed to be a list of (pointer, prefix) pairs, the initial sorting can be done on the prefixes and that will fit into cache much better. (If you are sorting on a single field, the prefix would be the value of that field, and you would never have to go back to the full records in main memory during the sort).
There is lots more to Tungsten. I’m looking forward to when we can easily use large numbers of GPUs through Spark.
Two other talks I particularly enjoyed were:
Integrating Spark and Solr by Timothy Potter of Lucidworks
For more than a year Spark has had a connector to ElasticSearch that will let you either index an RDD, or run a query and bring back the results as an RDD. Tim is doing something similar with Solr. His code to connect Spark and Solr is on Github: https://github.com/LucidWorks/spark-solr. An interesting question is how far either of those efforts can go towards making a Full Text engine conform to Sparks Data Source API, and how far Spark’s Catalyst optimizer will be able to go in optimizing queries involving full text. I’m looking forward to seeing what happens there over the next year.
Building the Autodesk Design Graph by Yotto Koga of Autodesk
Autodesk is working to make it possible to search for designs, as opposed to searching for words that may appear in some design documentation. A pretty interesting and difficult problem when you think about it.They have developed some interesting methods to search for the parts within a design. For example, they have something to normalize the different triangle meshes that tile the surface of a part. Yotto had a very interesting analogy between text retrieval and classification with the Bag of Words model, and design retrieval and classification using a Bag of Parts model.
All the conference talks were videotaped. As those videos are edited, they will appear on the conference site, along with the presentations. Check out the full agenda! I hope to see you at next year’s show!