DoReCo - About the project

About DoReCo

DoReCo 2.0 was published online on December 12, 2024. It is maintained at the laboratoire Dynamique Du Langage using the CLLD framework. DoReCo was conceived and initiated by Frank Seifart and is jointly edited by Frank Seifart, Ludger Paschen, and Matthew Stave. Frank Seifart serves as managing director of the database.

Previous versions of DoReCo were DoReCo 1.0, released on July 29, 2022; DoReCo 1.1, released on August 23, 2022; and DoReCo 1.2, released on December 16, 2022.

The DoReCo project

The DoReCo database was created within the DoReCo project from 2019 to 2022. This project was funded by an ANR-DFG grant (ANR-18-FRAL-0010-01, KR951/17-1) awarded to Frank Seifart, Manfred Krifka (March-July 2019), and François Pellegrino (August 2019-August 2022). The project was housed at the Leibniz-Zentrum Allgemeine Sprachwissenschaft in Berlin and the laboratoire Dynamique Du Langage (UMR5596, CNRS & Université Lyon 2) in Lyon and cooperated with the Bavarian Archive for Speech Signals, Munich. The aim of the DoReCo project was to carry out research on local variations of speech rate based on a broader sample of the world’s languages than had previously been available in a single linguistic database.

Data contained in DoReCo

The DoReCo database contains corpora on 53 languages from 33 top-level language families (as classified in Glottolog), covering languages from all inhabited continents and all linguistic macro-areas. Most of these data were originally collected in the context of language documentation projects focusing on preserving linguistic practices and traditions. They contain mostly monological, narrative texts, though some texts also represent conversations and stimulus retelling. Most datasets were extracted from larger collections archived in repositories such as TLA or ELAR.

In total, DoReCo contains over 100 hours of recordings with almost half a million transcribed words that are time-aligned at the word and phone levels. The minimum amount of data per language is 35,000 phones (although some datasets are slightly below that mark), corresponding to more than 10,000 word tokens for isolating languages. The total number of core texts is 934, equivalent to 17 texts on average per language. Numbers of unique speakers per core dataset range from 1 (Kamas, Texistepec Popoluca, Yongning Na) to 30 (Urum). All texts are also translated, mostly into English, but in some cases also Portuguese, German, Russian, Swahili and other languages.

For 39 languages, DoReCo provides time-aligned interlinear morpheme glosses. For most of these 39 languages, additional texts with interlinear glosses that are not time-aligned are contained in the DoReCo extended set. In total, DoReCo provides over 300,000 word tokens of time-aligned interlinear glossed text and another 300,000 word tokens of glossed texts without time alignment. Each DoReCo dataset is accompanied by extensive corpus documentation on orthographic conventions, abbreviations used in glosses, and other useful information.

Figure: Example of time alignment at the word, morph and phone levels from the DoReCo Dolgan dataset (Däbritz, Kudryakova, Stapert & Arkhipov 2024).

Data processing in DoReCo

DoReCo datasets were originally created and are authored by experts on the language, who recorded and annotated these data and who agreed to share these data as part of the DoReCo database. DoReCo team members further processed these data in nine main steps:

1. Receiving language documentation data collections from corpus creators
2. Selection of DoReCo-compatible datasets based on criteria such as audio quality, shareability under CC-BY (+x) licenses, minimum number of transcribed words, and quality of annotation
3. Automatic time-alignment of audio transcriptions using the MAUS forced alignment system
4. Manual correction of word start and end times and transcription mismatches in the alignment output, including labeling of filled pauses, code switching etc.
5. Another round of automatic alignment of phone segments, this time within the manually corrected word start and end times (step 4)
6. Creating consistent and uniform morphological annotations from input files
7. Re-injection of translation and morphological annotation into time-aligned transcription, creating time-aligned morphological annotations
8. Creation of annotation files in various output formats: TextGrid, EAF, TEI XML and CSV
9. Making audio and annotation files available for download through the DoReCo website along with metadata, license information, etc.
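
The result of the second alignment round, in which phone segments sit inside the manually corrected word boundaries, can be illustrated with a minimal sketch. It is not the MAUS or DoReCo implementation: the `Interval` class, the function names, and the example values are all hypothetical and only show how nested word and phone intervals allow a local speech rate (phones per second) to be computed for a single word.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Hypothetical time-aligned unit (a word or a phone)."""
    label: str
    start: float  # seconds
    end: float    # seconds


def phones_in_word(word, phones, tol=1e-6):
    """Return the phone intervals that lie inside the word interval."""
    return [
        p for p in phones
        if p.start >= word.start - tol and p.end <= word.end + tol
    ]


def phones_per_second(word, phones):
    """Local speech rate for one word: phone count divided by word duration."""
    duration = word.end - word.start
    if duration <= 0:
        raise ValueError("word interval must have positive duration")
    return len(phones_in_word(word, phones)) / duration


# Invented example values, not taken from any DoReCo dataset.
word = Interval("kata", 1.00, 1.40)
phones = [
    Interval("k", 1.00, 1.08),
    Interval("a", 1.08, 1.20),
    Interval("t", 1.20, 1.28),
    Interval("a", 1.28, 1.40),
    Interval("s", 1.40, 1.55),  # starts where the word ends, so it is excluded
]

rate = phones_per_second(word, phones)
```

Here four phones fall inside a word lasting about 0.4 seconds, so `rate` comes out at roughly 10 phones per second. Real analyses would read the intervals from one of the published annotation formats rather than from hand-written values.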

Steps 1-5 were principally carried out by the Berlin DoReCo team, with Frank Seifart being responsible for the overall coordination, and Ludger Paschen for the time-alignment, in cooperation with Florian Schiel and Christoph Draxler (for details, see this 2020 LREC paper). Steps 6-9 were principally carried out by the Lyon team. Matthew Stave and François Delafontaine worked on steps 6-8 using the TEICONVERT conversion tool, and especially the CORFLOW software suite, authored by François Delafontaine, while Sébastien Flavier was responsible for step 9. Time-intensive manual checking especially at step 4, but also steps 6-7, was supported by DoReCo research assistants and interns Webb Abernethy, Celia Birle, Frederic Blum, Alejandra Camelo Cruz, Laura Günther, Indira Hajnács, Nora Hofmann, Francie Höhler, Hannah Ida Hullmeine, Johanna Kimmerl, Cheslie Klein, Elena Lazarenko, Runzhi Lou, Stephan Lünser, Magdalena Nischik, Emma Ritz, Laura Schleicher, Jianqi Sun, Michelle Throssell, and Christin Walch. Additional data processing for DoReCo 2.0, especially regarding step 6, was carried out in 2022-2024 by Ludger Paschen’s AIRAL project, including assistants Bruno Behling, Aleksandr Schamberger, and Michelle Throssell.

Contact

DoReCo editors can be contacted at dorecoproject@gmail.com. Bugs, errata and other issues can be reported using the DoReCo GitHub issue tracker.