
Graph-Based Biaffine Dependency Parsing

Timothy Dozat
Stanford University
May 17, 2018, 1:30pm to 2:45pm
Margaret Jacks Hall, Greenberg Room (460-126)

The focus of this work is a fast, accurate, and simple syntactic dependency parser that achieves high performance on datasets for a wide variety of languages and dependency formalisms.  

Most, if not all, theories of syntax assume sentences have some sort of recursive hierarchical structure, where syntactically or semantically "important" words (heads) are given privileged status over words that modify them. Dependency formalisms aim to represent this hierarchy between the overt words in a sentence, with directed edges pointing from each head word to its modifying dependents. These formalisms have been used successfully to provide a number of natural language processing (NLP) tasks with high-level, easy-to-use syntactic structure. Historically, research on automatically annotating text according to these formalisms has focused primarily on transition-based strategies, which implicitly learn and parse with a context-free grammar. While these strategies have some advantages in terms of psychological plausibility and theoretical efficiency, they have quite a few drawbacks as well: they depend on complex, task-specific architectures, sacrifice the ability to process multiple sentences in parallel, and require significant extension to produce context-sensitive dependency representations.
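For concreteness, the following illustrative Python sketch (not part of the parser itself) shows one common way a dependency tree is encoded: each word records the index of its head, with 0 reserved for an artificial root. The sentence, head indices, and relation labels here are assumptions chosen for illustration.

    # Illustrative dependency-tree encoding for "Sandy hugged Kim".
    words  = ["Sandy", "hugged", "Kim"]
    heads  = [2, 0, 2]                  # 0 marks the root; other values are 1-based word indices
    labels = ["nsubj", "root", "obj"]   # Universal Dependencies-style relation names

    for word, head, label in zip(words, heads, labels):
        head_word = "ROOT" if head == 0 else words[head - 1]
        print(f"{head_word} --{label}--> {word}")  # edges point from head to dependent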

I take a graph-based approach to parsing instead, providing a Bidirectional Long Short-Term Memory network (BiLSTM; Hochreiter and Schmidhuber, 1997) with a sentence's sequence of tokens and part-of-speech tags in order to develop contextualized representations for each word. An attention-like mechanism (Bahdanau et al., 2014; Luong et al., 2015) then scores each possible head-dependent pair, and these scores can be straightforwardly converted into a tree structure. BiLSTMs and attention are both extremely commonplace in neural approaches to NLP, making the approach simple and easy to understand even without an extensive parsing background. This alone is enough to outperform most considerably more complex systems on standard benchmarks for multiple languages. Furthermore, by forgoing the transition-based strategy, my system can process many sentences in parallel, making it fast and scalable to large datasets or online applications.
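For readers who want the scoring step spelled out, here is a minimal NumPy sketch of biaffine arc scoring; it is a sketch under stated assumptions rather than the system's actual implementation. The names H_dep, H_head, U, and b are illustrative, and the BiLSTM outputs are taken as given rather than computed.

    import numpy as np

    def biaffine_arc_scores(H_dep, H_head, U, b):
        # H_dep:  (n, d) dependent-role vector for each of n words
        # H_head: (n, d) head-role vector for each word
        # U:      (d, d) bilinear weight matrix
        # b:      (d,)   weights scoring each word's propensity to be a head
        # Returns an (n, n) matrix where scores[i, j] rates word j
        # as the head of word i.
        return H_dep @ U @ H_head.T + H_head @ b

    # Toy usage: 5 words with 8-dimensional recurrent states.
    rng = np.random.default_rng(0)
    n, d = 5, 8
    H_dep, H_head = rng.normal(size=(n, d)), rng.normal(size=(n, d))
    U, b = rng.normal(size=(d, d)), rng.normal(size=d)
    scores = biaffine_arc_scores(H_dep, H_head, U, b)
    predicted_heads = scores.argmax(axis=1)  # greedy head choice per word

In practice, graph-based parsers mask self-loops, add an artificial root token, and fall back on a spanning-tree algorithm when the greedy row-wise choices do not form a well-formed tree.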

I also explore how properties of the language the system was trained on affect relative performance. Parsers for languages with more syntactic "scrambling" seem to benefit from having a graph-based rather than a transition-based architecture, and parsers for languages with richer morphology benefit substantially from having character-level information about each word. Adding a character model and a high-performing part-of-speech tagger to the system allowed it to win, by a substantial margin, the 2017 CoNLL shared task on parsing Universal Dependencies (Nivre et al., 2016; Zeman et al., 2017; Nivre et al., 2017a,b), in which 28 teams competed to build the best parsers for 81 treebanks of 49 languages.

There are also several dependency formalisms that aim to be more "semantic" than syntactic in nature (Oepen et al., 2014). Since some words can be arguments of multiple predicates at the level of semantics (such as Sandy in Sandy wanted to hug Kim, which is an argument of both want and hug), and some words are semantically vacuous (such as auxiliary to), these formalisms represent sentences as directed graphs rather than trees. The graph-based nature of my system also makes extending it to these non-tree dependency formalisms trivial: a small tweak is enough to generate dependency representations in which many words have multiple heads and many have none. After modifying the system to produce graphs rather than trees, it achieves state-of-the-art performance on these datasets as well, by a fairly large margin. This work not only provides an open-source, simple, fast, flexible, and powerful dependency parser, but more generally demonstrates the power of well-tuned, simple neural machine learning systems.
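The tweak can be sketched in the same illustrative style: replace the per-word choice of a single best head with an independent sigmoid decision on each candidate edge, so that a word may receive several heads or none. The threshold of 0.5 below is an assumed value for illustration, not the system's actual setting.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def predict_graph(scores, threshold=0.5):
        # scores: (n, n) biaffine arc scores, as in the earlier sketch.
        # Returns a boolean adjacency matrix: edges[i, j] is True when word j
        # is predicted to be a head of word i; each row may contain zero,
        # one, or many True entries, yielding a graph rather than a tree.
        return sigmoid(scores) > threshold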

(The format for this open part of the oral exam is a 30-45 minute talk by the Ph.D. candidate followed by questions from those attending, for a total of no more than 75 minutes. Please arrive promptly!)

Oral exam committee: Christopher D. Manning (Advisor), Dan Jurafsky, Martin Kay, Boris Harizanov

University oral exam chair: Thomas Icard 

Event Type: Dissertation Oral Presentation