Yesterday, Databricks announced that they are making Spark debugging easier by integrating the Spark UI into the Databricks platform. True enough, but don’t confuse “easier” with “easy”.
Don’t get me wrong – we have Databricks at work and I love it. But debugging has been its weak point. The Spark UI integration gives us some visuals to indicate job status, percentage completion, etc. It has another visualization to show the data dependency graph that underlies everything Spark does. But these pretty pictures are not what I really want.
One of the first things I worked on after grad school was developing a parallel debugger – because I was tired of writing parallel code without the source-level debugging tools that I had on serial machines. That debugger also started out with visuals – in that case, illustrations of the messages being sent between the processes, as well as information about the process state over time – running, blocked waiting, idle, etc. That was based on the data that was easiest to capture. The display was kind of pretty, IMHO, but it turned out to be largely useless.
What really makes debugging easier for me is being able to step through the code, examine variables, set breakpoints, run until a condition is encountered, etc. You don’t even need a WYSIWYG display of the code for that. I pretty much run all my Perl and Python code the first few times in their simple debuggers to catch the errors I make. But if your ‘debugger’ doesn’t let me see the state of the program and the results of a statement, then I’m pretty much forced back to debug print statements to figure that out, and your debugger doesn’t get used.
Similarly, there is no point in showing me details about things over which I have no control. As an example, the image below comes from looking at the data graph visualization. But all of that information comes from the statement:
inputTest = sqlContext.read.parquet(aFile)
Make no mistake about it, creating a source-level debugger for parallel programs is hard. You can’t even easily let people inspect the value of a variable if there are N copies of that piece of code running at the time. Which one do you show? How do you help the user manage the complexity of N copies of the same routine, never mind the interactions of different routines?
I think the next stage of debugging in Spark will be to do a better job of handling exceptions and letting us inspect the values of variables just before things went off the rails. The notebook environment is helpful since the pieces of code are frequently shorter than is the norm in large IDEs. But I still want a parallel debugger that gives me the stuff I use most when writing serial code.
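Until something like that exists, one workaround is to capture the failing input yourself. The following is only a minimal sketch in plain Python, not anything Databricks or Spark provides: `capture_failures`, `parse_amount` and the sample records are hypothetical names made up for illustration. The idea is to wrap the per-record function so that an exception comes back as data, together with the record that caused it, instead of killing the job.

```python
import traceback

def capture_failures(fn):
    """Wrap a per-record function so failures come back as data instead of being raised."""
    def wrapped(record):
        try:
            return ("ok", fn(record))
        except Exception as exc:
            # Keep the offending input and the traceback text for later inspection.
            return ("error", {
                "record": record,
                "exception": repr(exc),
                "traceback": traceback.format_exc(),
            })
    return wrapped

def parse_amount(record):
    # Hypothetical per-record logic that can fail on bad input.
    return float(record["amount"])

safe_parse = capture_failures(parse_amount)

# Hypothetical sample data: one good record, two bad ones.
records = [{"amount": "1.5"}, {"amount": "oops"}, {}]
results = [safe_parse(r) for r in records]
failures = [payload for status, payload in results if status == "error"]

for failure in failures:
    print(failure["record"], failure["exception"])
```

In a notebook, I would expect the same wrapper to work when passed to something like `rdd.map(safe_parse)`, with the "error" rows filtered out and collected for inspection, but that is an assumption I have not verified here. It is still just a structured version of debug print statements, not a replacement for a real debugger.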