by Lan Xiao (Research Engineer)
Computational Linguistics Team, SAIC-Toronto

by Federico Fancellu (Research Scientist)
Computational Linguistics Team, SAIC-Toronto

Recent successes in machine learning have led to numerous Artificial Intelligence applications, such as automatic translation and chatbots. However, the effectiveness of these systems is limited by their opaqueness: predictions made by the machine learning models cannot be easily understood by humans, and hence it is hard to discern what the model learns well and what it doesn’t — which is a fundamental step to build more robust AI systems. In this blog we share our recent work on building an explainable parser that draws on the geometric structure of the underlying syntactic tree of a natural language sentence. We hope our work encourages future research in building explainable systems for natural language understanding.

Understanding the connection between form and meaning in human languages has long been a topic of study across several academic disciplines, including linguistics, philosophy, and psychology. Since the advent of the field of artificial intelligence (AI), there have been attempts at not only understanding human language, but at building systems and machines that can communicate with us through natural language.

Human languages encode an intended meaning or semantics in a sentence through the composition of smaller units of meaning (phrases and clauses) in varying degrees of complexity. The way in which these phrases and clauses can be combined to form larger units is defined by a set of rules, known as syntax. There is a tight connection between syntax and semantics, and therefore a wide range of AI research studies focus on understanding and encoding the syntactic structure underlying natural language utterances.

Dependency grammar defines one way of capturing the syntactic structure of a sentence based on a simple observation: That each token is a grammatical dependent of exactly one other token, referred to as the head or parent of the dependent token. These head—dependent relations, plus a special ROOT node, define a rooted tree structure over a sentence, a.k.a. the dependency tree. The following figure shows the dependency tree of the sentence: ‘John hit the ball with a bat’. The labels represent grammatical relations. In this work, however, we are interested in the structure of the parse tree hence focus on unlabeled dependency parsing.

Figure 1 Sample dependency parse tree

Dependency parsing then aims to recover the dependency structure of a given sentence via finding head—dependent relationships among tokens. Recent graph-based dependency parsers represent each token in a sentence as a vector in a high dimensional space formed by aggregating information from nearby tokens. Complemented with a learnable edge-scoring function that takes these vectors as input and predicts the likelihood of each token being the head/parent of another, parsing then becomes the task of finding a maximum spanning tree over the learned edge scores. However, the resulting representations learned through such an approach are opaque, making it difficult to interpret prediction errors.

Figure 2 Traditional graph-based parser is trained to maximize parent prediction likelihood without explicit constraint on the geometry of the embedding space. Hence, it does not preserve geometry of the underlying tree structure it tries to model.

To fill in this gap, we seek to train a parser to explicitly preserve structural properties while maintaining high parsing performance. Specifically, we want to recover the tree distances between two nodes from the learned representations (embeddings) of their corresponding word tokens. Since distance is symmetric, we also include the depth difference between two nodes to encode edge direction (from heads to dependents). By training the parser to explicitly preserve tree distances and depths, the embedding space is more interpretable. Therefore, we can better understand the sources of error in relation to what the model has learnt.

Figure 3 Dependency parse tree encodes geometric properties such as tree distances and depth difference between nodes.

In our work on dependency parsing with structural preserving embedding – to appear in the forthcoming EACL 2021 (https://2021.eacl.org/) – we show that it is possible to train a dependency parser to preserve tree distances with near-isometric embeddings without sacrificing performance. We do so by casting dependency parsing as a tree embedding problem. The resulting embedding space is fully interpretable.

Earlier we have mentioned tree distance and we start by formally defining it as a metric space. Our tree metric space consists of the set of tokens in a sentence and a metric on this set that measures the length of the shortest path between a pair of nodes in the corresponding dependency tree of the sentence. A mapping from one metric space to another is called an embedding; an embedding is isometric if it preserves the distance. Hence, our goal is to embed each token in a sentence into an m-dimensional space using a metric such that the embedded tree is nearly-isometric to the dependency parse tree of the sentence. In this work we use two distances as our metric: the squared Euclidean distance (SED) (actually a semi-metric) and l1 norm distance (L1). In the paper we motivate our choice of metric with a proof of existence of an isometric embedding in both metric spaces, as well as a description of the exact geometry of the isometric embedding using SED as metric.

Recall that previous work treats dependency parsing as a parent prediction problem by training an edge scoring function to predict parent likelihoods. We follow the same approach, but define a geometric edge scoring function based on the geometry of the ground-truth tree of a sentence to encourage the resulting embedding space to preserve tree structure, as illustrated in figure 4.

Figure 4 Parent node is at a tree distance 1 away from its children nodes, and is 1 level higher in terms of depth from ROOT. We define our geometric edge scoring function based on this.

We build our parser on top of the biaffine parser of (Qi et al., 2018) but use a single representation per token instead of two separate ones to distinguish head and dependent representations as in the original parser. Moreover, we propose novel geometric loss functions for parsing that incorporate geometric properties of the dependency trees. Inspired by Hewitt and Manning (2019), we consider an isometric loss that is the mean absolute error (MAE) in predicted distance and tree distances between any two nodes. In the case where this MAE loss is zero, we learn an isometric tree embedding with respect to the selected metric in a m-dimensional space.

However, the approach of minimizing the MAE is not directly optimized for parsing, since the edge scoring function is applied post-hoc during inference. Therefore, we consider additional geometric parsing losses that directly optimize for head—dependent selection, which all achieve stronger parsing performance compared to simply minimizing the MAE. To leverage the benefits of both worlds, we consider loss combinations with the MAE loss as a soft global constraint on isometry. We find adding the MAE loss improves global isometry for all losses without compromising parsing performance.

To better understand the embedding space, we take a closer look at the resulting embeddings by plotting the predicted distance distribution for each particular ground-truth tree distance. The learned embedding of one parsing loss which directly maximizes the probability of true parent node is shown in the following plot. We observe that embedding trained with this maximum likelihood loss alone is far from distance preserving: the model predicts a median distance of around 10 for head—dependent pairs that have a true tree distance of 1. On the other hand, adding the MAE loss greatly regulates the resulting embedding space, where the predicted distances better correlate with true tree distances.

Figure 5 Distribution of predicted distances trained with maximum likelihood loss on parent prediction. Each x-tick corresponds to a particular tree distance; the four box plots at each x-tick are distributions of predicted distance with SED (d_2^2) and L1 ((d_1)as metric and optionally adding the additional MAE isometry constraint (+L_MAE ); The dashed line represents perfect correlation.

Our best performing model is trained with an alternative parsing loss that tries to match the parent likelihood with the one produced by an isometric embedding, which learns a different embedding as shown in the following figure. It encourages predicted distances for all word pairs to be correlated with tree distances and achieves small geometric losses up to tree distance five, indicating local-isometry in the embedding space. Similarly, we find adding MAE loss further regulates the embedding space.

Figure 6 Distribution of predicted distances for our best performing model.

We compare our results with the biaffine parser of Qi et al. (2018) across all languages. We evaluate parsing performance based on two standard metrics: unlabeled attachment score (UAS) that measures the overall accuracy of parent/head prediction; and labeled attachment score (LAS) that measures the accuracy of predicting the correct parent with the correct grammatical relation. Our approach achieves competitive results within 1% of the biaffine parser of Qi et al. for both UAS and LAS despite using a simpler model. We also find empirical evidence for good structure preservation properties across all languages as measured by the Spearman’s correlation between predicted and true distances.

Figure 7 Parsing performance for all loss combinations averaged over 16 languages.

We then look at parsing errors in terms of tree relations which can be defined by the pairwise tree distance and the change in depth. Example tree relations are illustrated with the following dependency parse tree: the word hit is tree distance 2 away from bat and has a delta depth of 2, hence hit is the grandparent of bat. Similarly, the word with is a sister of John (with a tree distance of 2 and delta depth of 0).

Figure 8 Tree relations in a sample dependency parse tree.

Although the predicted distances deviate from true distances for long-distant edges, we do not identify long distance ambiguities as a major source of error. As shown in figure 9 left, 99.8% of UAS errors are local (up to tree distance 5), where sisters and grandparents account for 36.2% and 34.8% of the errors for our best performing model. To put it in perspective, we construct a random error distribution: for each tree in the validation set, we generate all trees with one incorrect edge. As shown in figure 9 right, only 81.6% of UAS errors are local in the random baseline, indicating locally concentrated error distribution is a characteristic of a trained parser. This opens up future research directions on reducing these specific high-frequency errors.

Figure 9 Error distribution for our best performing parser (left) and a random baseline (right). Each bar represents UAS errors that incorrectly assign a head node to a node with specific tree relations. UAS errors that incorrectly assign sisters and grandparents as parents are highlighted.

We hope that our work encourages future research on geometric approaches to dependency parsing.

**Reference:**

Peng Qi, Timothy Dozat, Yuhao Zhang, and Christo- pher D Manning. 2018. Universal dependency pars- ing from scratch. In CoNLL 2018 Shared Task: Mul- tilingual Parsing from Raw Text to Universal Depen- dencies.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word repre- sentations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Associ- ation for Computational Linguistics.