DoReCo (Language DOcumentation REference COrpus) brings together spoken-language corpora from a worldwide sample of 51 languages.
It focuses on corpora that originated in fieldwork-based documentation of small and endangered languages, carried out by DoReCo contributors.
DoReCo contains over 100 hours of audio-recorded, mostly narrative texts, with transcriptions that are time-aligned at the phone level, translations, and – for 38 languages – time-aligned morphological annotations.
DoReCo data are freely accessible under Creative Commons licenses, providing the language sciences with fully contextualized, spoken data from a diverse sample of the world’s languages.