MoL-2004-02: A Consistent and Efficient Estimator for the Data-Oriented Parsing Model

MoL-2004-02: Zollmann, Andreas (2004) A Consistent and Efficient Estimator for the Data-Oriented Parsing Model. [Report]

Full Text (PS):	MoL-2004-02.text.ps.gz (296kB)
Full Text (PDF):	MoL-2004-02.text.pdf (439kB)
Abstract:	MoL-2004-02.abstract.txt (1kB)

Abstract

Given a sequence of samples from an unknown probability distribution, a statistical estimator aims at providing an approximate guess of the distribution by utilizing statistics from the samples. One desired property of an estimator is that its guess approaches the unknown distribution as the sample sequence grows large. Mathematically speaking, this property is called consistency.

This thesis presents the first (non-trivial) consistent estimator for the Data-Oriented Parsing (DOP) model. A consistency proof is given that addresses a gap in the current probabilistic grammar literature and can serve as the basis for consistency proofs for other estimators in statistical parsing. The thesis also expounds the computational and empirical superiority of the new estimator over the common DOP estimator DOP1: while achieving an exponential reduction in the number of fragments extracted from the treebank (and thus in parsing time), the new estimator also improves parsing accuracy over DOP1.

Another formal property of estimators is bias. This thesis studies that property for the case of DOP and presents the somewhat surprising finding that every unbiased DOP estimator overfits the training data.
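The notion of consistency described in the abstract can be illustrated with a small simulation. The sketch below is not the DOP estimator from the thesis. It uses a plain relative-frequency estimator on a made-up three-outcome distribution (`TRUE_DIST`, an assumption chosen only for this illustration) and measures how far the estimate is from the true distribution as the sample grows. The helper names `relative_frequency`, `total_variation`, and `estimation_error` are likewise hypothetical.

```python
import random
from collections import Counter


def relative_frequency(samples):
    """Estimate a distribution by the relative frequency of each outcome."""
    counts = Counter(samples)
    total = len(samples)
    return {outcome: count / total for outcome, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two distributions given as dicts."""
    outcomes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in outcomes)


# Hypothetical "unknown" distribution, chosen only for this illustration.
TRUE_DIST = {"a": 0.5, "b": 0.3, "c": 0.2}


def estimation_error(n, rng):
    """Draw n samples from TRUE_DIST and return the estimator's error."""
    outcomes = list(TRUE_DIST)
    weights = [TRUE_DIST[o] for o in outcomes]
    samples = rng.choices(outcomes, weights=weights, k=n)
    return total_variation(relative_frequency(samples), TRUE_DIST)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for n in (10, 100, 1000, 10000, 100000):
        print(n, round(estimation_error(n, rng), 4))
```

For a consistent estimator such as this one, the printed error should tend toward zero as the sample size increases, although individual runs may fluctuate.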

Item Type:	Report
Report Nr:	MoL-2004-02
Series Name:	Master of Logic Thesis (MoL) Series
Year:	2004
Uncontrolled Keywords:	Data-Oriented Parsing, statistical estimator
Date Deposited:	12 Oct 2016 14:38
Last Modified:	12 Oct 2016 14:38
URI:	https://eprints.illc.uva.nl/id/eprint/747
