DS-2016-07: Rich Statistical Parsing and Literary Language

DS-2016-07: van Cranenburgh, Andreas (2016) Rich Statistical Parsing and Literary Language. Doctoral thesis, University of Amsterdam.

This thesis applies the Data-Oriented Parsing framework in two areas: parsing and literature. The data-oriented approach rests on the assumption that re-use of chunks of training data can be detected and exploited at test time. Syntactic tree fragments form the common thread in the thesis.

Chapter 2 presents a method to efficiently extract tree fragments from treebanks, based on heuristics of re-occurrence. This method is thus able to discover the potential building blocks of large corpora. Chapter 3 then develops a multi-lingual statistical parser based on tree-substitution grammar that handles discontinuous constituents and function tags. We show how a mildly context-sensitive grammar can be employed to produce discontinuous constituents, and then compare this to an approximation that stays within the efficiently parsable context-free framework. The conclusion from the empirical evaluation is that tree fragments allow the grammar to adequately capture the statistical regularities of non-local relations, without the need for the increased generative capacity of mildly context-sensitive grammar.
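
As a rough illustration of what "re-occurrence" means here, the sketch below counts complete subtrees that appear more than once in a toy treebank. This is a deliberate simplification and not the extraction method of Chapter 2: fragments in Data-Oriented Parsing may be arbitrary connected parts of a tree, whereas this example only considers whole subtrees. The nested-tuple tree encoding and the function names are assumptions made for the example.

```python
from collections import Counter

# Assumed encoding: a tree is a tuple (label, child, child, ...);
# a leaf (word) is a plain string.

def subtrees(tree):
    """Yield every complete subtree of a nested-tuple tree."""
    if isinstance(tree, tuple):
        yield tree
        for child in tree[1:]:
            yield from subtrees(child)

def recurring_subtrees(treebank):
    """Return the subtrees that occur more than once, with their counts."""
    counts = Counter()
    for tree in treebank:
        counts.update(subtrees(tree))
    return {t: n for t, n in counts.items() if n > 1}

treebank = [
    ("S", ("NP", "she"), ("VP", ("V", "sleeps"))),
    ("S", ("NP", "she"), ("VP", ("V", "runs"))),
]
print(recurring_subtrees(treebank))
```

In this toy treebank only the noun phrase is shared between the two trees, so it is the only candidate building block that is reported.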

The second part investigates what separates literary novels from other novels. In addition to an introduction to machine learning in Chapter 4, we discuss the difference between explanation and prediction.

Chapter 5 discusses the data used for this investigation. We work with a corpus of novels and a reader survey with ratings of how literary the novels are perceived to be. While considerable questions remain with respect to whether a survey of the general public is an appropriate instrument to probe the concept of literature, when viewed as a barometer of public opinion we may consider the basic question of whether such opinions are at all predictable. The first goal is therefore to find out the extent to which the literary ratings can be predicted from the texts; the second, more challenging goal is to characterize the kind of patterns that are predictors of more or less literary texts.

Chapter 6 establishes baselines for this question. We show that literary novels contain fewer adjectives and adverbs than non-literary novels, and present several simple measures that are significantly correlated with the literary ratings, such as vocabulary richness and text compressibility. Cliché expressions are established as a negative marker of literary language. A topic model of the corpus is developed, revealing a number of clearly interpretable themes in the novels.
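
To make the two simple measures concrete, here is a minimal sketch of how vocabulary richness and text compressibility might be computed. The specific choices are assumptions for the example, not necessarily those made in the thesis: vocabulary richness is taken to be the type-token ratio, compressibility is taken to be the ratio of zlib-compressed size to raw size, and tokenization is a plain regular expression.

```python
import re
import zlib

def type_token_ratio(text: str) -> float:
    """Distinct words divided by total words (assumed richness measure)."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size; lower means more repetitive."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw)) / len(raw)

sample = "the cat and the dog"
print(type_token_ratio(sample))
print(compression_ratio("the same phrase again " * 200))
```

Note that the type-token ratio depends on text length, so comparing novels this way would presumably require fixed-size samples.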

Special attention is given in Chapter 7 to syntactic aspects, as investigated in the first part. The syntactic methods are contrasted with lexical baselines based on bigrams (sequences of two consecutive words). The combination of lexical and syntactic features gives an improvement, and the syntactic features are more interpretable.
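
The bigram baseline mentioned above can be illustrated with a short sketch that counts pairs of consecutive words. The tokenization (lowercasing and a simple regular expression) and the function name are assumptions for the example; the thesis may preprocess texts differently.

```python
from collections import Counter
import re

def bigram_counts(text: str) -> Counter:
    """Count sequences of two consecutive words in a text."""
    tokens = re.findall(r"\w+", text.lower())
    # Pair each token with its successor.
    return Counter(zip(tokens, tokens[1:]))

counts = bigram_counts("the cat saw the cat")
print(counts.most_common(1))
```

Counts of this kind could then serve as features for a model that predicts the literary ratings.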

In the end, the literary ratings are predictable from textual features to a large extent. While it is not possible to infer a causal relation between these textual features and the ratings from the survey participants, this result clearly rules out the notion that these value-judgments of literary merit were arbitrary, or predominantly determined by factors beyond the text.

Item Type: Thesis (Doctoral)
Report Nr: DS-2016-07
Series Name: ILLC Dissertation (DS) Series
Year: 2016
Subjects: Language
Depositing User: Dr Marco Vervoort
Date Deposited: 14 Jun 2022 15:17
Last Modified: 14 Jun 2022 15:17
URI: https://eprints.illc.uva.nl/id/eprint/2137