This page explains how DoReCo is organized, what kind of data it contains, and how to navigate the website. To access the dataset for an individual language, click on the name of the language on the map or in the table on the Languages page. This will take you to a page with basic metadata and the option to download the annotation and audio files for that language. Sometimes, audio files are not made available through DoReCo, in which case a link to a repository holding the audio files is provided.

The terms of use for the various DoReCo datasets are specified through the type of license associated with them, which is also provided with each download bundle. We encourage colleagues carrying out studies with DoReCo data to reach out to us so that we can keep track of such studies and provide expert input if desired.

For many languages, DoReCo distinguishes between a core set and an extended set of data. The former contains fully time-aligned annotation files in various formats and, in most cases, also .wav audio files. Extended sets contain additional annotation with morphological glossing but no time-alignment and no audio files.

Annotation files are offered for download as a single .zip file, whereas audio files can be downloaded in bulk or individually. To download audio files in bulk, click “TOUT TÉLÉCHARGER” (“Download all”) in the top right corner after choosing the appropriate download option; to download individual audio files, click on the small icon to the left of the respective filename. Annotation files are provided in ELAN .eaf, Praat .TextGrid, TEI .xml, and tabular .csv formats. The .eaf is the master format from which the three other formats were generated. In addition, every dataset includes documentation files with information on licenses, labels, and other conventions used in DoReCo.

Files included in download bundles

Annotation files for individual datasets can be downloaded as a single .zip archive which includes .eaf, .TextGrid, TEI .xml and .csv files. The first three formats are provided once for each text, while there are two versions of .csv files for each dataset: one in which each row represents an entry on the ph tier, covering all core files in the bundle, and one in which each row represents an entry on the wd tier, covering all files (core + extended) in the bundle. The latter table thus contains both time-aligned and non-time-aligned data, which can be distinguished by checking whether there is an entry in the “ph” column.
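As an illustration of the “ph” column check, here is a minimal Python sketch. The sample table, including the "wd" column name and the example values, is hypothetical and much smaller than a real DoReCo table; the sketch also assumes that the “ph” cell is simply left empty for rows that are not time-aligned.

```python
import csv
import io

# Hypothetical excerpt of a word-level table; real tables have more columns
# and the values below are invented for illustration.
SAMPLE = """wd,ph
kata,k a t a
mira,
"""


def split_by_alignment(rows):
    """Separate rows with an entry in the "ph" column from rows without one."""
    aligned, unaligned = [], []
    for row in rows:
        if (row.get("ph") or "").strip():
            aligned.append(row)
        else:
            unaligned.append(row)
    return aligned, unaligned


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
aligned, unaligned = split_by_alignment(rows)
print([r["wd"] for r in aligned], [r["wd"] for r in unaligned])
```

To use this with a downloaded table, replace the in-memory sample with an opened .csv file from the bundle.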

A bundle of additional documentation is provided with every bulk download of DoReCo audio and annotation files. This includes the following files:

  1. Project-wide files (same for all datasets):
    • README file (doreco_README.txt)
      • Description of the contents of the bundle
      • How to cite the DoReCo database
    • DoReCo conventions (doreco_CONVENTIONS.txt)
      • Labels used in transcriptions for non-aligned elements
      • Documentation of annotation tiers
  2. Language-specific files:
    • Language information (doreco_[glottocode]_dataset-info.txt)
      • Information about the language, the corpus and its creator(s), and how to cite the downloaded dataset, including license information
    • File metadata (doreco_[glottocode]_metadata.csv)
      • Speaker information: ID, age, sex
      • File information: genre, recording date, glossing coverage, word count, sound quality
    • Grapheme-to-phoneme mappings (doreco_[glottocode]_transcription-conventions.csv)
      • Correspondence tables mapping the symbols used in the fieldworker's transcription to X-SAMPA characters
    • List of abbreviations (doreco_[glottocode]_gloss-abbreviations.csv)
      • A list of all grammatical abbreviations on the “gl” and “ps” tiers and their meanings
    • Changes to tier names (doreco_[glottocode]_tier-name-changes.csv)
      • A list of all changes made to tier names in a dataset
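
The file names listed above follow a fixed pattern, so a short Python sketch can check whether a downloaded bundle is complete. The function names are hypothetical, "abcd1234" is a placeholder rather than a real Glottocode, and the sketch assumes that all documentation files sit directly in one folder, which may not match the actual layout of a bundle.

```python
from pathlib import Path

# File names as listed in the documentation overview above.
PROJECT_WIDE = ["doreco_README.txt", "doreco_CONVENTIONS.txt"]
LANGUAGE_SPECIFIC = [
    "doreco_{glottocode}_dataset-info.txt",
    "doreco_{glottocode}_metadata.csv",
    "doreco_{glottocode}_transcription-conventions.csv",
    "doreco_{glottocode}_gloss-abbreviations.csv",
    "doreco_{glottocode}_tier-name-changes.csv",
]


def expected_documentation_files(glottocode):
    """Return the documentation file names expected for one dataset."""
    return PROJECT_WIDE + [
        name.format(glottocode=glottocode) for name in LANGUAGE_SPECIFIC
    ]


def missing_documentation(bundle_dir, glottocode):
    """List expected documentation files not found directly in bundle_dir."""
    root = Path(bundle_dir)
    return [
        name
        for name in expected_documentation_files(glottocode)
        if not (root / name).is_file()
    ]


# "abcd1234" is a placeholder Glottocode used only for illustration.
print(expected_documentation_files("abcd1234"))
```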

Tier name conventions

Morphologically annotated files in DoReCo contain up to 8 core tiers and 2 supplementary tiers, in addition to the other tiers present in the files.

Core tiers

  • ref: Reference ID for tx units; generated by DoReCo
  • tx: Transcription of chunks of speech, separated by corpus creators based on a variety of criteria (syntactic, prosodic, semantic, or pragmatic)
  • ft: Free translation of the tx tier into a widely spoken language
  • wd: Word unit
  • mb: Morph unit
  • gl: Morphematic gloss of morph unit
  • ps: Part of speech category; can be of the word unit or the morph unit, depending on the dataset
  • ph: Phone unit (in X-SAMPA format)

Supplementary tiers

  • doreco-mb-algn: Units indicate a morpheme boundary which the alignment algorithm was not able to confidently assign to a phoneme boundary; helpful for filtering out such cases
  • mc-zero: Only present in MultiCAST corpora; units contain two kinds of information which could not be assigned to a phone: 1) clause boundaries and 2) zero words; the contents of the mb/gl/ps tiers are separated by spaces ' ', and multiple adjacent units are further separated by pipes '|'
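
The separators described for the mc-zero tier can be handled with a small Python sketch. The function name is hypothetical and the example string uses invented placeholder values; the sketch only splits on the pipe and space separators and makes no assumption about the order of the mb/gl/ps fields within a unit.

```python
def parse_mc_zero(value):
    """Split an mc-zero entry into units (by '|') and fields (by spaces)."""
    units = []
    for chunk in value.split("|"):
        fields = chunk.split()
        if fields:  # skip empty chunks, e.g. from a trailing pipe
            units.append(fields)
    return units


# Placeholder content, not taken from a real corpus.
print(parse_mc_zero("a b c|d e f"))
```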

Labelling conventions

The word (wd@...) and phone (ph@...) tiers in DoReCo annotation files sometimes contain elements in angle brackets referred to as labels (see below for a full list). There were two main motivations for using labels in DoReCo. First, they help identify exceptional speech events such as disfluencies or filled pauses that users should be aware of, e.g. in case they wish to exclude them from subsequent studies. Second, labelled elements are not time-aligned at the phone or morpheme level; the label stands in for that alignment. This makes the alignments overall more consistent, and ensures that grapheme-to-phoneme rules (see doreco_[glottocode]_transcription-conventions.csv) only include phonologically relevant mappings.

Full list of labels used in DoReCo

  • Filled pause: <<fp>>
  • False start: <<fs>>
  • Prolongation: <<pr>>
  • Foreign material: <<fm>>
  • Singing: <<sg>>
  • Backchannel: <<bc>>
  • Ideophone: <<id>>
  • Onomatopoeic: <<on>>
  • Word-internal pause: <<wip>>
  • Unidentifiable: <<ui>>
  • Silent pause: <p:>

Labels consist of two opening brackets, the label proper, a closing bracket, the content (optional), and another closing bracket, e.g. <<ui>word>. Labels may also appear on their own if the content is not known, e.g. <<ui>>. Silent pauses are marked by a special symbol, <p:>. The location of silent pauses is manually checked by the DoReCo team, while the symbol itself is inserted by the WebMAUS service. Unlike the other labels, the symbol has only one of each bracket, and no other content may be included in it.
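
The label syntax described above can be recognized with a regular expression. The following Python sketch is one possible way to do this, not a tool provided by DoReCo; the function name and the pattern are assumptions, and the pattern assumes that label names consist of lowercase letters (as in the list above) and that the optional content contains no angle brackets.

```python
import re

# <<label>content> with optional content; the pattern is an assumption
# derived from the description and the list of labels above.
LABEL_RE = re.compile(r"^<<(?P<label>[a-z]+)>(?P<content>[^<>]*)>$")
SILENT_PAUSE = "<p:>"


def parse_label(token):
    """Return (label, content) for a label token, or None for other tokens.

    Content is None when the label appears on its own or for silent pauses.
    """
    if token == SILENT_PAUSE:
        return ("p:", None)
    match = LABEL_RE.match(token)
    if match is None:
        return None
    return (match.group("label"), match.group("content") or None)


for token in ["<<ui>word>", "<<ui>>", "<p:>", "word"]:
    print(token, parse_label(token))
```

Filtering a word or phone column with a check like this is one way to exclude labelled elements from a study.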