DS-2010-01: Relational-Realizational Parsing

DS-2010-01: Tsarfaty, Reut (2010) Relational-Realizational Parsing. Doctoral thesis, University of Amsterdam.

Full Text: DS-2010-01.text.pdf (4MB)
Dutch Summary (Samenvatting): DS-2010-01.samenvatting.txt (5kB)

Abstract

Statistical parsing models aim to assign accurate syntactic analyses
to natural language sentences based on the patterns and frequencies
observed in human-annotated training data. State-of-the-art
statistical parsers to date demonstrate excellent performance
in parsing English, but when the same models are applied to languages
other than English, they rarely obtain comparable
results. The grammar of English is quite unusual in that it is fairly
configurational. This means that the order of words inside sentences
in English is relatively rigid and that the morphology of words is
rather impoverished.

The main challenge associated with parsing languages that are less
configurational than English, such as German, Arabic, Hebrew or
Warlpiri, is the need to model and to statistically learn complex
correspondence patterns between functions, i.e., sets of
abstract grammatical relations, and their morphological and syntactic
forms of realization.  This thesis proposes a new model, called the
Relational-Realizational (RR) model, that can effectively cope with
parsing languages that allow for flexible word-order patterns and rich
morphological marking. The RR model is applied to parsing the
Semitic language Modern Hebrew, obtaining significant improvements over
previously reported results.

Whereas grammatical relations are largely universal, their realization
is known to vary across languages. Different means of realization
encompass the interaction of (at least) two typological dimensions,
one associated with word order (Greenberg 1963), and another
associated with word-level morphology (Sapir 1921, Greenberg
1954). To adequately model complex form-function
correspondence patterns that emerge from such interactions, we first
consider morphological models that map grammatical properties of words
to the surface formatives that realize them. In this work I adopt the
principles of word-and-paradigm morphology (Anderson 1992, Stump 2001)
and extend them to modeling correspondence patterns in the syntax. In
the proposed RR model, constituents are organized into syntactic
paradigms (Pike 1962, 1963). Each cell in a paradigm is associated
with a Relational Network (Postal and Perlmutter 1977) and a set of
properties that jointly define the function of the constituent. The
form of a constituent emerges from the (i) internal grouping,
(ii) linear ordering, and (iii) morphological marking of its
subconstituents.

The RR decomposition of the rules that spell out the form of
constituents reflects different typological parameters, separating the
functional, configurational and morphological dimensions. The
dominated constituents may be associated with their own relational
networks, and the process continues recursively until fully-specified
morphosyntactic representations map to words. This three-phase spell-out
process gives rise to a recursive generative process that can be used
as a probabilistic model whose parameters can be estimated from
data.
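
As a rough illustration of how such a three-phase decomposition can be read as a probabilistic model, the sketch below scores a single constituent expansion as the product of a functional step (choosing a set of grammatical relations), a configurational step (ordering them), and a morphological step (realizing each relation as a marked category). All table names, labels, and probabilities are hypothetical toy values chosen for this example, not parameters or notation from the thesis.

```python
from math import prod

# Hypothetical toy parameters, invented for illustration only.
SBJ_PRD_OBJ = frozenset({"SBJ", "PRD", "OBJ"})

# Functional step: P(relation set | parent category).
P_RELATIONS = {("S", SBJ_PRD_OBJ): 0.6}

# Configurational step: P(linear order | parent, relation set).
P_ORDER = {
    ("S", SBJ_PRD_OBJ, ("SBJ", "PRD", "OBJ")): 0.5,
    ("S", SBJ_PRD_OBJ, ("PRD", "SBJ", "OBJ")): 0.3,
}

# Morphological step: P(marked category | parent, relation).
P_REALIZE = {
    ("S", "SBJ", "NP"): 0.9,
    ("S", "PRD", "VP"): 0.95,
    ("S", "OBJ", "NP+acc"): 0.7,
}


def expansion_probability(parent, ordered_relations, realizations):
    """Multiply the three step probabilities for one constituent expansion."""
    relations = frozenset(ordered_relations)
    p_relations = P_RELATIONS[(parent, relations)]
    p_order = P_ORDER[(parent, relations, tuple(ordered_relations))]
    p_realize = prod(
        P_REALIZE[(parent, rel, form)]
        for rel, form in zip(ordered_relations, realizations)
    )
    return p_relations * p_order * p_realize


svo = expansion_probability("S", ["SBJ", "PRD", "OBJ"], ["NP", "VP", "NP+acc"])
vso = expansion_probability("S", ["PRD", "SBJ", "OBJ"], ["VP", "NP", "NP+acc"])
print(svo, vso)
```

In this toy setup the two word orders share the relation-set and realization parameters and differ only in the ordering step, which is one way to picture the separation of the functional, configurational and morphological dimensions. In a treebank setting each table could be filled with relative-frequency estimates, but that estimation choice is an assumption of this sketch.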

The resulting statistical model is empirically evaluated by parsing
sentences in the Semitic language Modern Hebrew on the basis of a
small annotated treebank (Sima'an et al. 2001). Through a series of
experiments we report significant improvements over the
state-of-the-art Head-Driven (HD) alternative on various measures,
without incurring extra computational cost. The typological
characterization of the RR statistical distributions further suggests
that the model may be useful for developing corpus-based quantitative
methods for typological classification of natural language data.

This thesis is organized as follows:

Chapter 1: Linguistic Typology. This chapter introduces basic concepts
in linguistic typology, and associates grammatical relations with the
morphological and syntactic dimensions of realization. It further
introduces the notion of nonconfigurationality in relation to the
interplay between the two.

Chapter 2: Parsing Technology. This chapter reviews generative and
discriminative approaches that were applied to parsing English, and
describes the application of existing generative models to Chinese,
German and Arabic. The results suggest that less configurational
languages are harder to parse.

Chapter 3: The Data. This chapter describes the blend of
configurational and nonconfigurational phenomena we find in the
grammar of the Semitic language Modern Hebrew, and illustrates
different instances in which morphological information enhances the
interpretation of configurational structures.

Chapter 4: The Model. This chapter describes the linguistic, formal,
and computational properties of the Relational-Realizational model. It
starts out with morphological modeling and extends the underlying
principles to the syntactic domain. It formally defines the RR model
as a generative rewrite rule-system and describes a probabilistic
generative model based on it.

Chapter 5: The Application. This chapter applies the RR model
developed in chapter 4 to the Hebrew morphosyntactic phenomena
described in chapter 3. The application illustrates the theoretical
reach of the model, and it serves as the theoretical basis for
implementing different treebank grammars.

Chapter 6: Experiments. This chapter reports the results of parsing
experiments for Modern Hebrew in the form of a head-to-head comparison
of the RR model with the state-of-the-art HD approach.

Chapter 7: Extensions. This chapter discusses potential extensions of
the model towards handling related tasks including semantic modeling
and morphological disambiguation. Finally, it suggests studying the
potential application of the model to quantifying the
information-theoretic content of the morphological and syntactic
dimensions of realization for different languages.

Item Type: Thesis (Doctoral)
Report Nr: DS-2010-01
Series Name: ILLC Dissertation (DS) Series
Year: 2010
Subjects: Computation, Language, Logic
Depositing User: Dr Marco Vervoort
Date Deposited: 14 Jun 2022 15:16
Last Modified: 14 Jun 2022 15:16
URI: https://eprints.illc.uva.nl/id/eprint/2084
