This page explains how DoReCo is organized, what kind of data it contains, and how to navigate the website. To access the dataset for an individual language, click on the name of the language on the map or in the table on the Languages page. This will take you to a page with basic metadata and the option to download the annotation and audio files for that language. Sometimes, audio files are not made available through DoReCo, in which case a link to a repository holding the audio files is provided.
The terms of use for each DoReCo dataset are specified by the license associated with it, which is also provided with each download bundle. We encourage colleagues carrying out studies with DoReCo data to reach out to us so that we can keep track of such studies and provide expert input if desired.
For many languages, DoReCo distinguishes between a core set and an extended set of data. The former contains fully time-aligned annotation files in various formats and, in most cases, also .wav audio files. Extended sets contain additional annotation with morphological glossing but no time-alignment and no audio files.
Annotation files are offered for download as a single .zip file, whereas audio files can be downloaded in bulk or individually. To download audio files in bulk, click “TOUT TÉLÉCHARGER” (“Download all”) in the top right corner after choosing the appropriate download option; to download individual audio files, click on the small icon to the left of the respective filename. Annotation files are provided in ELAN .eaf, Praat .TextGrid, TEI .xml, and tabular .csv formats. The .eaf is the master format from which the three other formats were generated. In addition, every dataset includes documentation files with information on licenses, labels, and other conventions used in DoReCo.
Annotation files for individual datasets can be downloaded as a single .zip archive which includes .eaf, .TextGrid, TEI .xml, and .csv files. The first three formats are provided once for each text, while there are two versions of the .csv file for each dataset: one in which each row represents an entry on the ph tier, covering all core files in the bundle, and one in which each row represents an entry on the wd tier, covering all files (core + extended) in the bundle. The latter table thus contains both time-aligned and non-time-aligned data, which can be distinguished by checking whether there is an entry in the “ph” column.
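As an illustration of the “ph” column check described above, the following is a minimal Python sketch, not an official DoReCo tool. Only the column name "ph" is taken from the description; the other column names ("file", "wd"), the sample rows, and the function name split_by_alignment are made up for the example, and the actual .csv files may be laid out differently.

```python
import csv
import io


def split_by_alignment(rows):
    """Split wd-table rows into (time_aligned, not_time_aligned).

    A row counts as time-aligned when its "ph" column is non-empty.
    """
    aligned, unaligned = [], []
    for row in rows:
        # Treat a missing or whitespace-only "ph" value as "no entry".
        if (row.get("ph") or "").strip():
            aligned.append(row)
        else:
            unaligned.append(row)
    return aligned, unaligned


# Made-up sample data; only the "ph" column name comes from the text above.
SAMPLE = "file,wd,ph\ntext_a,kata,k a t a\ntext_b,mira,\n"

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
aligned, unaligned = split_by_alignment(rows)
print(len(aligned), "time-aligned row(s);", len(unaligned), "without time alignment")
```

To apply the same check to a downloaded table, the in-memory sample can be swapped for a file opened with open(path, newline="", encoding="utf-8"), assuming the file is UTF-8 encoded and comma-separated.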
A bundle of additional documentation is provided with every bulk download of DoReCo audio and annotation files. This includes the following files:
Morphologically annotated files in DoReCo contain up to 8 core tiers and 2 supplementary tiers, in addition to the other tiers present in the files.
The word (wd@...) and phone (ph@...) tiers in DoReCo annotation files sometimes contain elements in angle brackets referred to as labels (see below for a full list). There were two main motivations for using labels in DoReCo. First, they help identify exceptional speech events such as disfluencies or filled pauses that users should be aware of, e.g. in case they wish to exclude them from subsequent studies. Second, labels replace time-alignment at the phone or morpheme level. This makes the alignments overall more consistent, and ensures that grapheme-to-phoneme rules (see doreco_[glottocode]_transcription-conventions.csv) only include phonologically relevant mappings.
Labels consist of two opening brackets, the label proper, a closing bracket, the content (optional), and another closing bracket, e.g. <<ui>word>. Labels may also appear on their own if the content is not known, e.g. <<ui>>. Silent pauses are marked by a special symbol, <p:>. The location of silent pauses is manually checked by the DoReCo team, while the symbol itself is inserted by the WebMAUS service. Unlike the other labels, the symbol has only one of each bracket, and no other content may be included in it.
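The bracket structure described above can be recognized with a short regular expression. The Python sketch below is an illustration under stated assumptions rather than part of DoReCo: the names LABEL_RE, PAUSE, and parse_element are hypothetical, and the pattern assumes that neither the label nor its content contains further angle brackets.

```python
import re

# "<<" + label + ">" + optional content + ">", e.g. <<ui>word> or <<ui>>.
# Assumption: label and content contain no angle brackets of their own.
LABEL_RE = re.compile(r"<<(?P<label>[^<>]+)>(?P<content>[^<>]*)>")

# The silent-pause symbol has a single pair of brackets and no content.
PAUSE = "<p:>"


def parse_element(text):
    """Classify one wd/ph tier entry as a pause, a label, or plain content.

    Returns a tuple (kind, label, content).
    """
    if text == PAUSE:
        return ("pause", None, None)
    match = LABEL_RE.fullmatch(text)
    if match:
        # An empty content part (as in <<ui>>) is reported as None.
        return ("label", match.group("label"), match.group("content") or None)
    return ("plain", None, text)


for entry in ["<<ui>word>", "<<ui>>", "<p:>", "word"]:
    print(entry, parse_element(entry))
```

A study that wishes to exclude labelled events could then keep only the entries for which the first element of the returned tuple is "plain".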