DoReCo - Homepage

Welcome to DoReCo

DoReCo (Language DOcumentation REference COrpus) brings together spoken language corpora on a world-wide sample of 53 languages.

It focuses on corpora that originated in fieldwork-based documentation of small and endangered languages, carried out by DoReCo contributors.

DoReCo contains over 100 hours of audio-recorded, mostly narrative texts with transcriptions that are time-aligned at the phone level, translations, and – for 39 languages – also time-aligned morphological annotations.

DoReCo data are freely accessible under Creative Commons licenses, providing the language sciences with fully contextualized, spoken data from a diverse sample of the world’s languages.

How to cite

Please note that when actual data from any number of DoReCo datasets are used, the full reference for each individual dataset must be provided, including the name(s) of the creator(s) of each dataset. It is NOT sufficient to refer to DoReCo as a whole. We are aware that this may result in very long lists of references, but it is only in this way that corpus creators get due recognition for their work. The default is to include the full set of bibliographical references in the reference section of the main text of a paper or abstract. If this is absolutely impossible (because of page limit restrictions, for instance), then inclusion of the full list of references in an appendix is acceptable, or – as a last resort – in supplementary material published separately, e.g. on Zenodo or OSF, in which case the main text of the paper or the abstract must explicitly refer to this list and provide its URL or PID.

The reference for the DoReCo database as a whole (which does NOT replace references to datasets if actual data are used) is:

Seifart, Frank, Ludger Paschen & Matthew Stave (eds.). 2024. Language Documentation Reference Corpus (DoReCo) 2.0. Lyon: Laboratoire Dynamique Du Langage (UMR5596, CNRS & Université Lyon 2). DOI:10.34847/nkl.7cbfq779

For the methods used in building DoReCo, the following can be cited:

Paschen, Ludger, François Delafontaine, Christoph Draxler, Susanne Fuchs, Matthew Stave & Frank Seifart. 2020. Building a Time-Aligned Cross-Linguistic Reference Corpus from Language Documentation Data (DoReCo). In Proceedings of The 12th Language Resources and Evaluation Conference, 2657–2666. Marseille, France: European Language Resources Association. https://www.aclweb.org/anthology/2020.lrec-1.324