On replicating “Grammar as a Foreign Language”

Last December at the NIPS conference I got the opportunity to talk with Lukasz Kaiser of Google about the work described in “Grammar as a Foreign Language”. If you are not familiar with the paper, check it out (so long as you want to know about using Bidirectional LSTMs with Attention to implement a seq2seq model for syntactic constituency parsing, of course). 🙂

As it turns out, replicating this work is not too hard. You just need to copy and modify TensorFlow’s translation example. Here are the tips Lukasz gave me for doing so:

1) Make your input and output data. The input should be sentences; the output should be linearized constituent parse trees. We used sentences that were already tokenized, which required some tweaking in the code so it would not use the default tokenizer. There are a couple of differences from the standard linearized parse tree syntax. First, you should have indicators for the types of closing parens: for a noun phrase we are used to seeing ‘(NP … )’, so just change the closing paren from ‘)’ to ‘)NP’. Second, drop the tokens from the output. For example, change:

(NP (DT the) (JJ big) (NN door))

to

(NP (DT )DT (JJ )JJ (NN )NN )NP

You will have to insert the tokens in the right places in a post-processing step. Excluding the tokens from the desired output was something that my colleague, Sujit Pal, and I tripped over in trying to replicate the work. Don’t make the same mistake I did and try to have it produce the tokens as well as the parse tree. I’m hoping there is at least one person out there in the world who finds this and saves themselves some time.
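To make step 1 concrete, here is a minimal sketch of the linearization in Python. The function name and the crude paren tokenizer are mine, not code from the paper or from the TensorFlow example; it just produces the label-only output with typed closing parens described above.

import re

def linearize(tree):
    """Turn '(NP (DT the) (JJ big) (NN door))' into
    '(NP (DT )DT (JJ )JJ (NN )NN )NP': labels only, typed closing parens."""
    pieces = re.findall(r"\(|\)|[^\s()]+", tree)   # parens and bare symbols
    out, stack, i = [], [], 0
    while i < len(pieces):
        if pieces[i] == "(":
            label = pieces[i + 1]          # the non-terminal right after the paren
            stack.append(label)
            out.append("(" + label)
            i += 2
        elif pieces[i] == ")":
            out.append(")" + stack.pop())  # typed closing paren
            i += 1
        else:
            i += 1                         # drop the terminal token itself
    return " ".join(out)

print(linearize("(NP (DT the) (JJ big) (NN door))"))
# prints: (NP (DT )DT (JJ )JJ (NN )NN )NP

One way to do the post-processing step mentioned above is then to walk the original sentence and the predicted tree in parallel, filling each pre-terminal with the next token in order.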

2) Simplify the network. The default translation example uses 1024-element cells. Translation is more difficult than parsing, so shrink those to about 256 elements. Stay at three layers, at least to start.
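In the translate.py we started from, the cell size and depth are command-line flags, so this step is just a settings change; the sketch below records the values meant above (flag names are from the copy we used and may differ in yours).

# Step 2: shrink the cells, keep the depth. Equivalent to running the
# translation example with --size=256 --num_layers=3.
SIZE = 256        # the stock example defaults to 1024
NUM_LAYERS = 3    # unchanged, at least to start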

3) Adjust input/output bucket sizes. The translation example has a number of different-sized buckets for short, medium, and long sentences, with output buckets only a little longer than the input buckets. For parsing they should be a lot longer, since a single input token can expand into several output symbols. I made my outputs about 4x longer than the inputs.
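The bucket list in the translate.py we copied is a module-level constant, so the change for this step looks something like the sketch below. The stock values and the variable name are from the example we used; the parsing-sized numbers are illustrative rather than tuned.

# (input_length, output_length) buckets. The stock translation example pairs
# inputs with only slightly longer outputs, roughly:
#   _buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]
# For parsing, each input token expands into several output symbols, so
# stretch the output side to roughly 4x the input side:
_buckets = [(10, 40), (25, 100), (40, 160), (60, 240)]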

4) Use lots of automatically-parsed data to start the training. We used BLLIP as our constituency parser to bootstrap, simply because we already had about 100k articles parsed from an earlier experiment. That was more data than we needed; we never even completed the first epoch through it. Once the training on the BLLIP data reached a point where the perplexity wasn’t changing very fast, we switched to tuning on the CRAFT corpus. As we work on improving performance we may go back and use more of the automatically parsed data.

5) Be careful about the numbers in some constituents. By default, the TensorFlow translation example normalizes all digits to 0. That’s not so good for indexed constituents like WHNP-1, NP-SBJ-2, SBAR-3, etc. In our experiments so far we are just letting those digits be normalized to 0 as well; we will try to improve this later, when we get serious about driving down the error rate.
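One possible fix, which we have not tried yet, is to normalize digits selectively and leave them alone inside constituent labels. A rough sketch follows; the helper name and the label pattern are mine, not part of the TensorFlow example.

import re

_DIGIT_RE = re.compile(r"\d")
_LABEL_RE = re.compile(r"^[()]?[A-Z$|.,:]+(-[A-Z]+)*(-\d+)?\)?$")  # rough guess at label shapes

def normalize_token(token):
    """Replace digits with 0, except in constituent labels like (NP-SBJ-2 or )WHNP."""
    if _LABEL_RE.match(token):
        return token                    # keep the co-index digit intact
    return _DIGIT_RE.sub("0", token)

print(normalize_token("1984"))        # -> 0000
print(normalize_token("(NP-SBJ-2"))   # -> (NP-SBJ-2  (unchanged)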

To evaluate this, we pulled out a random selection of 1k sentences from the CRAFT corpus to be our test set. The figures we are getting currently are:

Error rate: 0.39
Cross entropy: 15.768
Perplexity: 1.143

We have not done any hyperparameter tuning, and we still have some #FIXMEs to address, such as handling the numbers in some constituents, but this seems like a good start. Thanks, Lukasz, for all the advice; and thanks to Sujit for all the work in doing the training and testing.

One note Sujit added when he proofread this post for accuracy is that we are using a large AWS instance. TensorFlow spreads the work across all the CPUs, so the performance is close to that of a single GPU. When TensorFlow starts working across multiple GPUs in one machine, we will change things.


2 thoughts on “On replicating ‘Grammar as a Foreign Language’”

  1. I am doing the same work now, and I have some questions about it. Did you change the loss function, and how many epochs did you run? For punctuation symbols like ‘.’ and ‘,’, how do you deal with them: as (PU PU), or do you put them directly into the target sentence? Thank you very much.


    1. I didn’t change the loss function. As for the number of epochs, that will depend on how much training data you have. You should have a held-out validation set and keep training as long as the validation scores keep improving. I will note that we started with automatically parsed data from BLLIP for initial training, then switched to the manually parsed CRAFT corpus for some tuning, then to Elsevier’s Open Access STM corpus for final tuning and testing. The accuracy we are getting on that final corpus is not where we want it to be, so there are probably going to be lots of trials before we get this where we want it.
      Re. punctuation symbols: they were removed along with all the other tokens when we switched to predicting only the parse elements, like (S (NP )NP (VP )VP )S.

      Hope this helps!

