Rank Consistent Estimation: The DOP Case
Thuy Linh Nguyen

Abstract:
The goal of an estimator is to approximate the unknown distribution of
the language from its partial evidence. In this thesis, a rank
consistent estimator is defined as an estimator that preserves the
frequency ranking of all the full parse trees in the treebank, that is,
one that is rank consistent with respect to the training treebank. The rank
consistency property adopts Laplace's Principle of Insufficient Reason
for statistical parsing: a rank consistent estimator assigns the same
probability to all trees that occur the same number of times in the
training data.
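
The rank consistency property described above can be illustrated with a
minimal sketch. It assumes one reading of "preserves the frequency
ranking": trees with equal treebank counts must receive equal
probability, and a tree with a higher count must receive a strictly
higher probability. The function name is_rank_consistent, the tree
labels t1 to t4, and the toy counts are hypothetical and are not taken
from the thesis.

```python
from collections import Counter
from itertools import combinations
import math


def is_rank_consistent(counts, probs, tol=1e-12):
    """Check whether `probs` preserves the frequency ranking in `counts`.

    Assumed reading of the definition (not the thesis's formal statement):
    equal counts -> equal probability; higher count -> higher probability.
    """
    for t1, t2 in combinations(counts, 2):
        c1, c2 = counts[t1], counts[t2]
        p1, p2 = probs[t1], probs[t2]
        if c1 == c2:
            # Principle of Insufficient Reason: same count, same probability.
            if not math.isclose(p1, p2, abs_tol=tol):
                return False
        elif (c1 > c2) != (p1 > p2):
            # The probability order disagrees with the frequency order.
            return False
    return True


# Hypothetical toy treebank: each string stands for one full parse tree.
treebank = ["t1", "t1", "t1", "t2", "t2", "t3", "t3", "t4"]
counts = Counter(treebank)

# The relative frequency estimator is rank consistent by construction.
total = sum(counts.values())
rel_freq = {t: c / total for t, c in counts.items()}

# A hypothetical estimator that ranks t2 above the more frequent t1.
skewed = {"t1": 0.3, "t2": 0.4, "t3": 0.2, "t4": 0.1}

print(is_rank_consistent(counts, rel_freq))
print(is_rank_consistent(counts, skewed))
```

Under these assumptions the check succeeds for the relative frequency
estimator and fails for the skewed one, because the latter gives t2 a
higher probability than t1 although t1 occurs more often.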
This thesis presents the first non-trivial DOP estimator in which the
treebank is considered not only as a stochastic generating system but
also as a sample of the stochastic process. In this thesis, the existing
DOP definitions of probability and derivation of full parse trees are
generalized to subtrees. Fragments in the treebank's fragment corpus
are assigned weights so that their probabilities are proportional to
their relative frequencies. The estimator is proved to be rank
consistent.
The theoretical property of the model is substantiated by empirical
results. The new estimator outperforms the DOP1 estimator on the OVIS
corpus.

Keywords: statistical parsing, parse trees, language distribution