About DoReCo

DoReCo 1.2 was published online on December 16, 2022. It is maintained at the laboratoire Dynamique Du Langage using the CLLD framework. It is edited by Frank Seifart, Ludger Paschen, and Matthew Stave, who are jointly responsible for its contents.

Previous versions of DoReCo were DoReCo 1.0, released on July 29, 2022, and DoReCo 1.1, released on August 23, 2022.

The DoReCo project

The DoReCo database was created within the DoReCo project from 2019 to 2022. This project was funded by an ANR-DFG grant (ANR-18-FRAL-0010-01, KR951/17-1) awarded to Frank Seifart, Manfred Krifka (March-July 2019), and François Pellegrino (August 2019-August 2022). The project was housed at the Leibniz-Zentrum Allgemeine Sprachwissenschaft in Berlin and the laboratoire Dynamique Du Langage (UMR5596, CNRS & Université Lyon 2) in Lyon and cooperated with the Bavarian Archive for Speech Signals, Munich. The aim of the DoReCo project was to carry out research on local variations of speech rate based on a broader sample of the world’s languages than has previously been available in a single linguistic database.

Data contained in DoReCo

The DoReCo database contains corpora on 51 languages from 32 top-level language families (as classified in Glottolog), covering languages from all inhabited continents and all linguistic macro-areas. Most of these data were originally collected in the context of language documentation projects focusing on preserving linguistic practices and traditions. They contain mostly monological, narrative texts, though some texts also represent conversations and stimulus retelling. Most datasets were extracted from larger collections archived in repositories such as TLA or ELAR.

In total, DoReCo contains over 100 hours of recordings with almost half a million transcribed words that are time-aligned at the word and phone levels. The target minimum amount of data per language is 35,000 phones (some datasets fall slightly below that mark), corresponding to more than 10,000 word tokens for isolating languages. The total number of core texts is 893, equivalent to about 17 texts on average per language. Numbers of unique speakers per core dataset range from 1 (Kamas, Texistepec Popoluca, Yongning Na) to 30 (Urum). All texts are also translated, mostly into English, but in some cases also into Portuguese, German, Russian, Swahili and other languages.

For 38 languages, DoReCo provides time-aligned interlinear morpheme glosses. For most of these 38 languages, additional texts with interlinear glosses that are not time-aligned are contained in the DoReCo extended set. In total, DoReCo provides over 300,000 word tokens of time-aligned interlinear glossed text and another 300,000 word tokens of glossed texts without time alignment. Each DoReCo dataset is accompanied by extensive corpus documentation on orthographic conventions, abbreviations used in glosses, and other useful information.

Example of time alignment at the word, morph and phone levels from the DoReCo Dolgan dataset (Däbritz, Kudryakova, Stapert & Arkhipov 2022).

Data processing in DoReCo

DoReCo datasets were originally created and are authored by experts on the language, who recorded and annotated these data and who agreed to share these data as part of the DoReCo database. DoReCo team members further processed these data in nine main steps:

  1. Receiving language documentation data collections from corpus creators
  2. Selection of DoReCo-compatible datasets based on criteria such as audio quality, shareability under CC-BY (+x) licenses, minimum number of transcribed words, and quality of annotation
  3. Automatic time-alignment of audio transcriptions using the MAUS forced alignment system
  4. Manual correction of word start and end times and transcription mismatches in the alignment output, including labeling of filled pauses, code switching etc.
  5. Another round of automatic alignment of phone segments, this time within the manually corrected word start and end times (step 4)
  6. Creating consistent and uniform morphological annotations from input files
  7. Re-injection of translation and morphological annotation into time-aligned transcription, creating time-aligned morphological annotations
  8. Creation of annotation files in various output formats: TextGrid, EAF, TEI XML and CSV
  9. Making audio and annotation files available for download through the DoReCo website along with metadata, license information, etc.
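
The result of steps 3 to 5 can be pictured as two tiers of labelled time intervals, with each phone interval lying inside the boundaries of its word. The following is a minimal Python sketch of that idea, showing how a local speech rate (phones per second within a word) could be derived from such tiers. The names `Interval`, `phones_in_word` and `speech_rate`, the toy data, and the in-memory representation are all assumptions made for illustration; they do not reflect the actual DoReCo file schemas or the tools used by the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """One labelled stretch of audio (hypothetical representation)."""
    start: float  # seconds
    end: float    # seconds
    label: str


def phones_in_word(word, phones, tol=1e-6):
    """Return the phone intervals that lie within the word's boundaries."""
    return [
        p for p in phones
        if p.start >= word.start - tol and p.end <= word.end + tol
    ]


def speech_rate(word, phones):
    """Phones per second within one word; None for zero-length words."""
    duration = word.end - word.start
    if duration <= 0:
        return None
    return len(phones_in_word(word, phones)) / duration


# Toy data, invented for illustration only.
words = [Interval(0.0, 0.5, "kata"), Interval(0.75, 1.0, "mi")]
phones = [
    Interval(0.0, 0.125, "k"), Interval(0.125, 0.25, "a"),
    Interval(0.25, 0.375, "t"), Interval(0.375, 0.5, "a"),
    Interval(0.75, 0.875, "m"), Interval(0.875, 1.0, "i"),
]

rates = [speech_rate(w, phones) for w in words]
```

In practice the intervals would be read from one of the distributed annotation formats rather than typed in by hand, and pauses or filled pauses would likely need separate handling.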

Steps 1-5 were principally carried out by the Berlin DoReCo team, with Frank Seifart being responsible for the overall coordination, and Ludger Paschen for the time-alignment, in cooperation with Florian Schiel and Christoph Draxler (for details, see this 2020 LREC paper). Steps 6-9 were principally carried out by the Lyon team. Matthew Stave and François Delafontaine worked on steps 6-8 using the corflow software suite and the TEICORPO conversion tool, while Sébastien Flavier was responsible for step 9. Time-intensive manual checking, especially at step 4 but also at steps 6-7, was supported by DoReCo research assistants and interns Webb Abernethy, Celia Birle, Frederic Blum, Alejandra Camelo Cruz, Laura Günther, Indira Hajnács, Nora Hofmann, Francie Höhler, Hannah Ida Hullmeine, Johanna Kimmerl, Cheslie Klein, Elena Lazarenko, Runzhi Lou, Stephan Lünser, Magdalena Nischik, Emma Ritz, Laura Schleicher, Jianqi Sun, Michelle Elizabeth Throssell Balagué, and Christin Walch.


DoReCo editors can be contacted at dorecoproject@gmail.com. Bugs, errata and other issues can be reported using the DoReCo GitHub issue tracker.