Last December at the NIPS conference I got the opportunity to talk with Lukasz Kaiser of Google about the work described in “Grammar as a Foreign Language”. If you are not familiar with the paper, check it out (so long as you want to know about using Bidirectional LSTMs with Attention to implement a seq2seq model for syntactic constituency parsing, of course). 🙂
As it turns out, replicating this work is not too hard. You just need to copy and modify TensorFlow’s translation example. Here are the tips Lukasz gave me for doing so:
1) Make your input and output data. The input should be sentences, the output should be constituent parse trees. We used sentences that were already tokenized, which required some tweaking in the code to not use the default tokenizer. There are a couple of differences from standard linearized parse tree syntax. First, you should have indicators for the types of closing parens. For example, if we have a noun phrase, we are used to seeing ‘(NP … )’. You just need to change the closing paren from ‘)’ to ‘)NP’. Second, drop the tokens from the output. For example, change:
(NP (DT the) (JJ big) (NN door))
to:
(NP (DT )DT (JJ )JJ (NN )NN )NP
You will have to insert the tokens in the right place in a post-processing step. Excluding the tokens from the desired output was something that my colleague, Sujit Pal, and I tripped over in trying to replicate the work. Don’t make the same mistake I did and try to have it produce the tokens as well as the parse tree. I’m hoping there is at least one person out there in the world who finds this and saves themselves some time.
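The two transformations in this tip can be sketched as follows. Function names are mine, and this is a minimal sketch assuming whitespace-separated, well-formed bracketings, not the code we actually ran:

```python
import re

def linearize_for_seq2seq(tree):
    """Convert a standard bracketed parse into the target-side format:
    typed closing parens and no terminal tokens."""
    tokens = re.findall(r'\(|\)|[^\s()]+', tree)
    out, stack = [], []
    for i, tok in enumerate(tokens):
        if tok == '(':
            continue  # the label arrives as the next token
        if tok == ')':
            out.append(')' + stack.pop())  # typed closing paren
        elif i > 0 and tokens[i - 1] == '(':
            stack.append(tok)              # a constituent label
            out.append('(' + tok)
        # any other token is a terminal word: dropped from the output

    return ' '.join(out)

def reinsert_tokens(output, words):
    """Post-processing: splice the source tokens back in. A '(X'
    immediately followed by ')X' marks a preterminal slot."""
    syms = output.split()
    words = list(words)
    out = []
    for i, s in enumerate(syms):
        out.append(s)
        if s.startswith('(') and i + 1 < len(syms) and syms[i + 1] == ')' + s[1:]:
            out.append(words.pop(0))
    return ' '.join(out)
```

So `linearize_for_seq2seq('(NP (DT the) (JJ big) (NN door))')` yields `(NP (DT )DT (JJ )JJ (NN )NN )NP`, and `reinsert_tokens` puts the words back under their preterminals afterwards.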
2) Simplify the network. The default translation example uses 1024-element cells. Translation is more difficult than parsing, so shrink those to about 256 elements. Stay at three layers, at least to start.
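In the translation example's terms, the change is just two numbers. A hedged sketch (in the version we used these were the `--size` and `--num_layers` flags of translate.py; names and defaults may differ in your copy):

```python
# Shrink the per-layer cell size; parsing needs less capacity
# than translation.
LAYER_SIZE = 256   # down from the translation default of 1024
NUM_LAYERS = 3     # keep three layers, at least to start

# Building the equivalent cell by hand with the RNN API of that era
# would look roughly like:
#   cell = tf.nn.rnn_cell.MultiRNNCell(
#       [tf.nn.rnn_cell.BasicLSTMCell(LAYER_SIZE)
#        for _ in range(NUM_LAYERS)])
```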
3) Adjust input/output bucket sizes. The translation example has a number of different-sized buckets for short, medium, and long sentences. The output buckets are a little bit longer than the input buckets. For parsing, they should be a lot longer. Notice that one token from the input sentence can lead to many output symbols. I made my outputs about 4x longer than the inputs.
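To make the bucket change concrete, here is a sketch. The translation defaults match what I remember from the example's `_buckets` list, and the parsing numbers are illustrative, not the exact ones we used:

```python
# Buckets are (max_input_len, max_output_len) pairs. The translation
# example pairs similar lengths; for parsing, the output side should be
# roughly 4x the input, since each token expands into several symbols.
translation_buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]
parsing_buckets = [(5, 20), (10, 40), (20, 80), (40, 160)]

def pick_bucket(src_len, tgt_len, buckets):
    """Return the index of the smallest bucket that fits the pair."""
    for i, (s, t) in enumerate(buckets):
        if src_len <= s and tgt_len <= t:
            return i
    return None  # too long for every bucket; such pairs get skipped
```

An 8-token sentence with a 30-symbol parse lands in the second parsing bucket, but would be pushed all the way into the largest translation bucket, wasting padding.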
4) Use lots of automatically-parsed data to start the training. We used BLLIP as our constituency parser for bootstrapping, just because we had about 100k articles already parsed as a result of an earlier experiment. That was more data than needed; we never even completed the first epoch through it. Once the training on BLLIP data got to a point where the perplexity wasn’t changing very fast, we switched to tuning on the CRAFT corpus. As we work on improving performance we may go back and use more of the automatic training data.
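We judged the switchover point by eye, but the heuristic is easy to mechanize. A sketch, with a made-up window and threshold:

```python
def perplexity_plateaued(history, window=3, min_rel_drop=0.01):
    """True once the relative perplexity improvement over the last
    `window` eval checkpoints falls below `min_rel_drop` -- a cue to
    switch from the automatically-parsed data to the gold corpus."""
    if len(history) < window + 1:
        return False
    old, new = history[-window - 1], history[-1]
    return (old - new) / old < min_rel_drop
```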
5) Be careful about the numbers in some constituents. By default, the TensorFlow translation example will normalize all digits to 0. That’s not so good for constituents like WHNP-1, NP-SBJ-2, SBAR-3, etc. In our experiments so far we are just normalizing all those to 0. We will try to improve things later when we get serious about improving the error rate.
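We have not implemented the fix yet, but it might look something like this sketch: normalize digits everywhere except in tokens that look like indexed constituent labels. The label regex is an assumption; adjust it to your tag set:

```python
import re

# Matches indexed labels like WHNP-1, NP-SBJ-2, SBAR-3, optionally with
# a leading paren from the linearized tree (e.g. '(WHNP-1').
_INDEXED_LABEL = re.compile(r'^[()]?[A-Z$]+(?:-[A-Z]+)*-\d+$')

def normalize_digits(token):
    """Digit normalization that spares indexed constituent labels."""
    if _INDEXED_LABEL.match(token):
        return token  # keep WHNP-1, NP-SBJ-2, SBAR-3, ...
    return re.sub(r'\d', '0', token)
```

With this, `1984` still becomes `0000`, but `NP-SBJ-2` survives intact instead of collapsing into `NP-SBJ-0`.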
To evaluate this, we pulled out a random selection of 1k sentences from the CRAFT corpus to be our test set. The figures we are getting currently are:
We have not done any hyperparameter tuning, and there are still some #FIXMEs left, like handling the numbers in some constituents, but this seems like a good start. Thanks, Lukasz, for all the advice; and thanks to Sujit for all the work in doing the training and testing.
One note Sujit added when he proofread this post for accuracy is that we are using a large AWS instance. TensorFlow works across all the CPUs, so the performance is close to that of a single GPU. When TensorFlow starts working across multiple GPUs in one machine, we will change things.