PP-2009-50: A syllable frequency list for Dutch

PP-2009-50: Zuidema, Willem (2009) A syllable frequency list for Dutch. [Report]

The Corpus Gesproken Nederlands (CGN) is a large corpus of spoken Dutch, partly annotated with syntactic and phonological information (see http://lands.let.kun.nl/cgn/). Although it contains files with syllabified words and files with word frequency counts, it offers no direct way to extract a list of syllable frequencies. This document describes some simple scripts that combine the relevant information from various CGN files (using version 6 and the Linux utilities grep, sed, sort, uniq, awk, cut and paste), and gives the complete list of syllable frequencies obtained by running these scripts. The list is made available in the hope that it will be helpful, for instance in experimental studies that must control for syllable frequency. Depending on the intended use or the required level of accuracy, the scripts may have to be adapted and the frequency counts changed accordingly.
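
The combination step described above can be illustrated with a minimal sketch. It is not the report's own method: the report uses shell utilities, whereas this sketch uses Python, and it assumes that the syllabified forms and the word counts have already been read into two dictionaries. The function name `syllable_frequencies`, the hyphen as syllable separator, and the toy orthographic entries are all hypothetical; the actual CGN files use their own formats and transcriptions.

```python
from collections import Counter


def syllable_frequencies(syllabified, word_counts, separator="-"):
    """Add each word's frequency to every syllable that word contains.

    syllabified: maps a word to its syllabified form (assumed format,
                 e.g. "wa-ter"; the real CGN layout may differ).
    word_counts: maps a word to its corpus frequency.
    """
    totals = Counter()
    for word, form in syllabified.items():
        # Words without a frequency entry contribute nothing.
        count = word_counts.get(word, 0)
        for syllable in form.split(separator):
            totals[syllable] += count
    return totals


# Invented toy data for illustration only, not taken from CGN.
syllabified = {"water": "wa-ter", "later": "la-ter", "la": "la"}
word_counts = {"water": 10, "later": 4, "la": 1}

freqs = syllable_frequencies(syllabified, word_counts)
for syllable, count in freqs.most_common():
    print(syllable, count)
```

In this sketch a syllable that occurs twice in one word is counted twice per token of that word, which is one of the choices that might need to be adapted depending on the intended use.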

Item Type: Report
Report Nr: PP-2009-50
Series Name: Prepublication (PP) Series
Year: 2009
Uncontrolled Keywords: Dutch; phonology; computational linguistics
Subjects: Computation
Depositing User: Jelle Zuidema
Date Deposited: 12 Oct 2016 14:37
Last Modified: 12 Oct 2016 14:37
URI: https://eprints.illc.uva.nl/id/eprint/379
