Rank Consistent Estimation: The DOP Case

Thuy Linh Nguyen

Abstract: The goal of an estimator is to approximate the unknown distribution of a language from partial evidence. In this thesis, a rank consistent estimator is defined as an estimator that preserves the frequency ranking of all full parse trees in the treebank; such an estimator is said to be rank consistent with respect to the training treebank. The rank consistency property adopts Laplace's Principle of Insufficient Reason for statistical parsing: a rank consistent estimator assigns the same probability to all trees that occur the same number of times in the training data. This thesis presents the first non-trivial DOP estimator in which the treebank is treated not only as a stochastic generating system but also as a sample of that stochastic process. The existing DOP definitions of the probability and derivation of full parse trees are generalized to subtrees. Fragments in the treebank's fragment corpus are assigned weights so that their probabilities are proportional to their relative frequencies. The estimator is proved to be rank consistent. This theoretical property is substantiated by empirical results: the new estimator outperforms the DOP1 estimator on the OVIS corpus.

Keywords: statistical parsing, parse trees, language distribution
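
The rank consistency property described in the abstract can be illustrated with a minimal Python sketch. This is not the estimator developed in the thesis. It only checks whether a given probability assignment preserves the frequency ranking of trees in a toy treebank. The function name `is_rank_consistent`, the bracketed tree strings, and the tolerance `tol` are hypothetical choices made for illustration. The sketch also assumes a strict reading of "preserves the ranking": a more frequent tree must receive a strictly higher probability, and trees with equal counts must receive equal probabilities.

```python
from collections import Counter
from itertools import combinations


def is_rank_consistent(counts, probs, tol=1e-12):
    """Return True if probs preserves the frequency ranking given by counts.

    Assumed reading of the property (hypothetical, for illustration):
      - equal counts   -> equal probabilities (within tol)
      - higher count   -> strictly higher probability
    """
    for a, b in combinations(counts, 2):
        count_diff = counts[a] - counts[b]
        prob_diff = probs[a] - probs[b]
        if count_diff == 0:
            # Trees seen equally often must get the same probability.
            if abs(prob_diff) > tol:
                return False
        elif abs(prob_diff) <= tol or (prob_diff > 0) != (count_diff > 0):
            # The probability order must follow the frequency order.
            return False
    return True


# Toy treebank: each string stands in for one full parse tree.
treebank = [
    "(S a b)", "(S a b)", "(S a b)",
    "(S a c)", "(S a c)",
    "(S b c)", "(S b c)",
    "(S c)",
]
counts = Counter(treebank)
total = sum(counts.values())

# Relative frequency is rank consistent by construction.
relative_frequency = {tree: n / total for tree, n in counts.items()}

# A hypothetical assignment that breaks the tie between two trees
# that both occur twice, so it is not rank consistent.
skewed = {"(S a b)": 0.4, "(S a c)": 0.3, "(S b c)": 0.2, "(S c)": 0.1}

print(is_rank_consistent(counts, relative_frequency))
print(is_rank_consistent(counts, skewed))
```

The relative-frequency assignment passes the check trivially, while the second assignment fails because it gives different probabilities to two trees with the same count. The check covers only trees that occur in the treebank.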