MoL-2020-16: Statistical Methodology for Quantitative Linguistics: A Case Study of Learnability and Zipf’s Law

MoL-2020-16: Vogelmann, Valentin (2020) Statistical Methodology for Quantitative Linguistics: A Case Study of Learnability and Zipf’s Law. [Report]


Quantitative linguistics is a large and rich field in the study of language that has brought about or played an essential role in debates such as those around Zipf’s law or the learnability of language. Using these two topics and their intersection as a case study, we identify in this thesis a fundamental problem of statistical methodology that seems pervasive in quantitative linguistic practice. In essence, the problem is a widespread neglect of the distinction between observed samples of language, that is, corpora, and their source distribution, that is, the underlying language.
In the first part, we re-derive how upholding the sample-source distinction naturally leads to the problem of statistical estimation and propose and show how to use standard resampling methods to obtain representative and reliable estimates, particularly given the scarcity of resources in linguistics. We use this method to obtain the most reliable estimates of Zipf’s law to date and highlight the importance and potential of proper estimation by analysing some of the estimates’ properties.
The second contribution consists of the Filtering method, a novel and general adaptation of resampling methods grounded in information theory. This method is intended to facilitate realistic large-scale learnability analyses of the distributional properties of language, in our case of Zipf’s law. We derive the Filtering method itself starting again from the sample-source distinction and instantiate it in two exemplary implementations. Subsequently, we validate its usefulness by analysing the sampled corpora in terms of the sampling objectives and the corpora’s naturalness and diversity. Given that these objectives seek to weaken Zipf’s law, and that this is a difficult objective to achieve, we find relatively high naturalness and diversity of the resulting corpora.
Finally, bringing the resampling and Filtering methods together, we make a proposal for empirically assessing recent advancements in the innateness debate, which analyse the learnability of language via Kolmogorov complexity. The high degree of abstraction makes it difficult to directly evaluate the proposed learning strategies, but with the help of resampling and Filtering, and the sample-source distinction more generally, we make a concrete proposal for how this may in fact be achieved.
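As a rough illustration of the resampling idea from the first part, the sketch below treats a corpus as one sample from an underlying source, draws repeated subsamples from it, and reports the mean and spread of a Zipf exponent estimate across those subsamples. It is not the thesis's implementation: the function names, the toy corpus, and the least-squares fit on log-log rank-frequency data are assumptions chosen for brevity, and the thesis may use a different estimator and sampling scheme.

```python
import math
import random
from collections import Counter


def zipf_exponent(tokens):
    """Estimate a Zipf exponent as the negated least-squares slope of
    log(frequency) against log(rank). Assumed estimator, kept simple."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


def resampled_exponent(tokens, sample_size, n_samples, seed=0):
    """Draw n_samples subsamples (without replacement) of sample_size tokens
    and return the mean and standard deviation of the exponent estimates."""
    rng = random.Random(seed)
    estimates = [
        zipf_exponent(rng.sample(tokens, sample_size)) for _ in range(n_samples)
    ]
    mean = sum(estimates) / n_samples
    var = sum((e - mean) ** 2 for e in estimates) / (n_samples - 1)
    return mean, math.sqrt(var)


# Hypothetical toy corpus: word i occurs roughly 1000 / i times.
toy_corpus = [f"w{i}" for i in range(1, 201) for _ in range(round(1000 / i))]

mean, spread = resampled_exponent(toy_corpus, sample_size=500, n_samples=20, seed=1)
print(f"exponent estimate: {mean:.3f} (sd across subsamples: {spread:.3f})")
```

The spread across subsamples is the point of the exercise: a single fit on the full corpus gives one number with no indication of how much it would vary had a different sample been drawn from the same source.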

Item Type: Report
Report Nr: MoL-2020-16
Series Name: Master of Logic Thesis (MoL) Series
Year: 2020
Subjects: Computation
Depositing User: Dr Marco Vervoort
Date Deposited: 10 May 2021 22:31
Last Modified: 10 May 2021 22:31
