DS-2018-05: Typologically Robust Statistical Machine Translation: Understanding and Exploiting Differences and Similarities Between Languages in Machine Translation

DS-2018-05: Daiber, Joachim (2018) Typologically Robust Statistical Machine Translation: Understanding and Exploiting Differences and Similarities Between Languages in Machine Translation. Doctoral thesis, University of Amsterdam.

Machine translation systems often incorporate modeling assumptions motivated by properties of the language pairs they initially target. When such systems are applied to language families with considerably different properties, translation quality can deteriorate. Phrase-based machine translation systems, for instance, are ill-equipped to handle the challenges caused by relaxed word order constraints and productive word formation processes in morphologically rich languages. In this thesis, we ask what role the properties of languages, as studied in the field of linguistic typology, play in how well machine translation systems perform. We focus in particular on word order and morphology, and show that typological differences in these areas can be bridged by making certain linguistic phenomena overt to the translation system. Understanding and exploiting typological differences between languages enables improvements to the typological robustness of translation systems without significantly changing the assumptions of the underlying translation models.

We begin by studying the effect of word order freedom on preordering, a popular technique to model word order in phrase-based machine translation. We show that producing a space of potential word order choices instead of a single word order and integrating this space into the translation model via word order permutation lattices provides a principled way of improving the typological robustness of preordering.
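As a rough illustration of the idea of a word order permutation lattice, the following sketch merges several candidate source word orders into one directed acyclic graph. It is not the construction used in the thesis: the function names, the choice of lattice states (sets of source positions already emitted) and the toy candidate permutations are all assumptions made for this example, and the candidates are assumed to come from some preordering model.

```python
from collections import defaultdict


def build_permutation_lattice(permutations):
    """Merge candidate permutations of source positions into a DAG.

    Hypothetical helper: a state is the frozenset of source positions
    emitted so far, and edges[state] holds the positions that may be
    emitted next from that state.
    """
    edges = defaultdict(set)
    for perm in permutations:
        state = frozenset()
        for pos in perm:
            edges[state].add(pos)
            state = state | {pos}
    return edges


def enumerate_paths(edges, n):
    """List every complete word order (path) encoded by the lattice."""
    final = frozenset(range(n))

    def walk(state, prefix):
        if state == final:
            yield tuple(prefix)
            return
        for pos in sorted(edges.get(state, ())):
            yield from walk(state | {pos}, prefix + [pos])

    return list(walk(frozenset(), []))


# Toy input: two candidate orders for a four-word source sentence.
candidates = [(0, 1, 2, 3), (1, 0, 3, 2)]
lattice = build_permutation_lattice(candidates)
paths = enumerate_paths(lattice, 4)
```

Because states are merged whenever the same set of positions has been covered, the lattice can encode more word orders than were given as input: here the two candidates yield four paths. A decoder would then choose among these paths during translation instead of committing to a single preordered sentence up front.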

Then, we show that reducing the dissimilarity between the source and target language in the area of morphological complexity improves phrase-based machine translation for typologically diverse language pairs. For inflectional morphology, we do so by enriching the morphologically impoverished source language with unexpressed morphological attributes, which enables better lexical choice in the target language. For non-inflectional morphology, we introduce a semantically motivated model of compounding, which can be used to split compound words into their meaning-carrying subparts, thus enabling the translation system to work with comparable translation units in the source and target language.
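To make the notion of splitting a compound into meaning-carrying subparts more concrete, here is a minimal sketch of a binary compound splitter. It is a simplified stand-in and not the model introduced in the thesis: it assumes that word vectors are available, proposes only two-part splits whose parts are both in the vocabulary, and prefers the split whose head (the final part) is most similar to the full compound. The function names, the `min_len` parameter and the two-dimensional toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def split_compound(word, vectors, min_len=3):
    """Return the best (modifier, head) split of `word`, or None.

    Assumption for this sketch: a split is plausible when both parts
    are known words, and better when the head is semantically close
    to the whole compound.
    """
    if word not in vectors:
        return None
    best, best_score = None, -1.0
    for i in range(min_len, len(word) - min_len + 1):
        modifier, head = word[:i], word[i:]
        if modifier in vectors and head in vectors:
            score = cosine(vectors[word], vectors[head])
            if score > best_score:
                best, best_score = (modifier, head), score
    return best


# Toy vectors, invented for illustration only.
vectors = {
    "handschuh": [0.9, 0.2],  # German "glove"
    "hand": [0.1, 0.9],
    "schuh": [1.0, 0.0],      # German "shoe"
}
result = split_compound("handschuh", vectors)
```

With these toy vectors, `result` is the pair `("hand", "schuh")`, while a word with no valid split such as `"schuh"` yields `None`. Splitting in this way lets the source side expose units that are comparable to the separate words a target language without productive compounding would use.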

Finally, we show that besides helping to bridge the performance gaps between typologically diverse languages, linguistic typology can also serve as a source of knowledge to guide reordering models and to facilitate universal reordering models applicable to multiple target languages. Such universal reordering models can learn in a data-driven manner which aspects of linguistic typology to pay attention to, enable better generalization and require less training data than models for individual languages.

Item Type: Thesis (Doctoral)
Report Nr: DS-2018-05
Series Name: ILLC Dissertation (DS) Series
Year: 2018
Subjects: Language
Depositing User: Dr Marco Vervoort
Date Deposited: 14 Jun 2022 15:17
Last Modified: 14 Jun 2022 15:17
URI: https://eprints.illc.uva.nl/id/eprint/2154
