The DoReCo 1.1 database was published online on August 23, 2022. It is maintained at the laboratoire Dynamique Du Langage using the CLLD framework. It is edited by Frank Seifart, Ludger Paschen, and Matthew Stave, who are jointly responsible for its contents.
The first version of the corpus, DoReCo 1.0, was released on July 29, 2022.
The DoReCo database was created within the DoReCo project from 2019 to 2022. This project was funded by an ANR-DFG grant (ANR-18-FRAL-0010-01, KR951/17-1) awarded to Frank Seifart, Manfred Krifka (March-July 2019), and François Pellegrino (August 2019-August 2022). The project was housed at the Leibniz-Zentrum Allgemeine Sprachwissenschaft in Berlin and the laboratoire Dynamique Du Langage (UMR5596, CNRS & Université Lyon 2) in Lyon and cooperated with the Bavarian Archive for Speech Signals, Munich. The aim of the DoReCo project was to carry out research on local variations of speech rate based on a broader sample of the world’s languages than has previously been available in a single linguistic database.
The DoReCo database contains corpora on 51 languages from 32 top-level language families (as classified in Glottolog), covering languages from all inhabited continents and all linguistic macro-areas. Most of these data were originally collected in the context of language documentation projects focusing on preserving linguistic practices and traditions. They contain mostly monological, narrative texts, though some texts also represent conversations and stimulus retelling. Most datasets were extracted from larger collections archived in repositories such as TLA or ELAR.
In total, DoReCo contains over 100 hours of recordings with transcriptions that are time-aligned at the word and phone levels. The target minimum amount of data per language is 35,000 phones (a few datasets fall slightly below that mark), corresponding to more than 10,000 word tokens for isolating languages. The total number of core texts is 893, an average of about 17 texts per language. The number of unique speakers per core dataset ranges from 1 (Kamas, Texistepec Popoluca, Yongning Na) to 30 (Urum). All texts are also translated, mostly into English, but in some cases into Portuguese, German, Russian, Swahili, and other languages.
For 38 languages, DoReCo provides time-aligned interlinear morpheme glosses. For most of these 38 languages, additional texts with interlinear glosses that are not time-aligned are contained in the DoReCo extended set. In total, DoReCo provides over 300,000 word tokens of time-aligned interlinear glossed text and another 300,000 word tokens of glossed texts without time alignment. Each DoReCo dataset is accompanied by extensive corpus documentation on orthographic conventions, abbreviations used in glosses, and other useful information.
DoReCo datasets were originally created and are authored by experts on the respective languages, who recorded and annotated these data and agreed to share them as part of the DoReCo database. DoReCo team members further processed these data in nine main steps:
Steps 1-5 were principally carried out by the Berlin DoReCo team, with Frank Seifart responsible for the overall coordination and Ludger Paschen for the time alignment, in cooperation with Florian Schiel and Christoph Draxler (for details, see this 2020 LREC paper). Steps 6-9 were principally carried out by the Lyon team. Matthew Stave and François Delafontaine worked on steps 6-8 using the corflow software suite and the TEICORPO conversion tool, while Sébastien Flavier was responsible for step 9. Time-intensive manual checking, especially at step 4 but also at steps 6-7, was supported by DoReCo research assistants and interns Webb Abernethy, Celia Birle, Frederic Blum, Alejandra Camelo Cruz, Laura Günther, Indira Hajnács, Nora Hofmann, Francie Höhler, Hannah Ida Hullmeine, Johanna Kimmerl, Cheslie Klein, Elena Lazarenko, Runzhi Lou, Stephan Lünser, Magdalena Nischik, Emma Ritz, Laura Schleicher, Jianqi Sun, Michelle Elizabeth Throssell Balagué, and Christin Walch.