Representing a corpus

In Lhotse, we represent the data using YAML (more readable), JSON, or JSONL (faster) manifests. For most audio corpora, we will need two types of manifests to fully describe them: a recording manifest and a supervision manifest.

Caution

We show all the examples in YAML format for improved readability. However, when processing medium/large datasets, we recommend using JSON or JSONL, which are much quicker to load and save.

Recording manifest

class lhotse.audio.Recording(id: str, sources: List[lhotse.audio.AudioSource], sampling_rate: int, num_samples: int, duration: float, transforms: Optional[List[Dict]] = None)

The Recording manifest describes the recordings in a given corpus. It contains information about each recording, such as its path(s), duration, the number of samples, etc. It can represent multiple channels coming from one or more files.

This manifest does not specify any segmentation or supervision information such as the transcript or the speaker. This means that even when a recording is a one-hour-long file, it is a single item in this manifest.

Hint

Lhotse reads audio recordings using pysoundfile and audioread, similarly to librosa, to support multiple audio formats.

A Recording can be simply created from a local audio file:

>>> from lhotse import RecordingSet, Recording, AudioSource
>>> recording = Recording.from_file('meeting.wav')
>>> recording
Recording(
    id='meeting',
    sources=[AudioSource(type='file', channels=[0], source='meeting.wav')],
    sampling_rate=16000,
    num_samples=57600000,
    duration=3600.0,
    transforms=None
)

This manifest can be easily converted to a Python dict and serialized to JSON/JSONL/YAML/etc:

>>> recording.to_dict()
{'id': 'meeting',
 'sources': [{'type': 'file',
   'channels': [0],
   'source': 'meeting.wav'}],
 'sampling_rate': 16000,
 'num_samples': 57600000,
 'duration': 3600.0}
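
The reverse conversion is also available, so a manifest can be reconstructed from its dict form (a quick sketch, reusing the recording from above):

>>> recording2 = Recording.from_dict(recording.to_dict())
>>> recording2 == recording
True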

Recordings can also be created programmatically, e.g., when they refer to URLs stored in S3 or elsewhere:

>>> s3_audio_files = ['s3://my-bucket/123-5678.flac', ...]
>>> recs = RecordingSet.from_recordings(
        Recording(
            id=url.split('/')[-1].replace('.flac', ''),
            sources=[AudioSource(type='url', source=url, channels=[0])],
            sampling_rate=16000,
            # get_num_samples() and get_duration() are placeholders for
            # user-provided helpers that inspect the remote files.
            num_samples=get_num_samples(url),
            duration=get_duration(url)
        )
        for url in s3_audio_files
    )

It allows reading a subset of the audio samples as a numpy array:

>>> samples = recording.load_audio(duration=1.0)
>>> assert samples.shape == (1, 16000)
>>> samples2 = recording.load_audio(offset=0.5, duration=0.5)
>>> assert samples2.shape == (1, 8000)

class lhotse.audio.RecordingSet(recordings=None)

RecordingSet represents a collection of recordings. It does not contain any annotation such as the transcript or the speaker identity – just the information needed to retrieve a recording such as its path, URL, number of channels, and some recording metadata (duration, number of samples).

It also supports (de)serialization to/from YAML/JSON/etc. and takes care of mapping between rich Python classes and YAML/JSON/etc. primitives during conversion.

When coming from Kaldi, think of it as wav.scp on steroids: RecordingSet also has the information from reco2dur and reco2num_samples, is able to represent multi-channel recordings and read a specified subset of channels, and supports reading audio files directly, via a unix pipe, or downloading them on-the-fly from a URL (HTTPS/S3/Azure/GCP/etc.).
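
To illustrate the multi-channel case, here is a sketch of a two-channel session assembled from two single-channel files (the paths, duration, and sample count below are made up):

>>> two_channel = Recording(
...     id='session-1',
...     sources=[
...         AudioSource(type='file', channels=[0], source='session-1-ch0.wav'),
...         AudioSource(type='file', channels=[1], source='session-1-ch1.wav'),
...     ],
...     sampling_rate=16000,
...     num_samples=960000,
...     duration=60.0,
... )
>>> first_channel = two_channel.load_audio(channels=0)  # array of shape (1, 960000)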

RecordingSet can be created from an iterable of Recording objects:

>>> from lhotse import RecordingSet
>>> audio_paths = ['123-5678.wav', ...]
>>> recs = RecordingSet.from_recordings(Recording.from_file(p) for p in audio_paths)

It behaves similarly to a dict:

>>> '123-5678' in recs
True
>>> recording = recs['123-5678']
>>> for recording in recs:
...     pass
>>> len(recs)
127

It also provides some utilities for I/O:

>>> recs.to_file('recordings.jsonl')
>>> recs.to_file('recordings.json.gz')  # auto-compression
>>> recs2 = RecordingSet.from_file('recordings.jsonl')

Manipulation:

>>> longer_than_5s = recs.filter(lambda r: r.duration > 5)
>>> first_100 = recs.subset(first=100)
>>> split_into_4 = recs.split(num_splits=4)

And lazy data augmentation/transformation that requires adjusting some information in the manifest (e.g., num_samples or duration). Note that in the following examples the audio is untouched: the operations are stored in the manifest and executed only when the audio is loaded:

>>> recs_sp = recs.perturb_speed(factor=1.1)
>>> recs_24k = recs.resample(24000)
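
For instance, the resampled copy reports the new sampling rate even though the file on disk is unchanged (a quick check, reusing the set above):

>>> recs_24k['123-5678'].sampling_rate
24000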

Finally, since we support importing Kaldi data dirs, if wav.scp contains unix pipes, Recording will also handle them correctly.
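
Such pipe-based sources are represented with an AudioSource of type 'command', whose standard output is interpreted as the audio stream. A sketch, assuming sph2pipe is installed and the (hypothetical) file holds 5 minutes of 8 kHz audio:

>>> piped = Recording(
...     id='sw02001',
...     sources=[AudioSource(type='command', channels=[0],
...                          source='sph2pipe -f wav sw02001.sph')],
...     sampling_rate=8000,
...     num_samples=2400000,
...     duration=300.0,
... )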


Supervision manifest

The supervision manifest contains the supervision information that we have about the recordings. In particular, it describes the segmentation: there might be a single segment for a single-utterance recording, and multiple segments for a recording of a conversation.

When coming from Kaldi, think of it as a segments file on steroids that also contains utt2spk, utt2gender, utt2dur, etc.

This is a YAML supervision manifest:

---
- id: 'segment-1'
  recording_id: 'recording-2'
  channel: 0
  start: 0.1
  duration: 0.3
  text: 'transcript of the first segment'
  language: 'english'
  speaker: 'Norman Dyhrentfurth'

- id: 'segment-2'
  recording_id: 'recording-2'
  start: 0.5
  duration: 0.4

Each segment is characterized by the following attributes:

  • a unique id,

  • a corresponding recording id,

  • start time in seconds, relative to the beginning of the recording,

  • the duration in seconds.

Each segment may be assigned optional supervision information. In this example, the first segment contains the transcription text, the language of the utterance and a speaker name. The second segment contains only the minimal amount of information, which should be interpreted as: “this is some area of interest in the recording that we know nothing else about.”

Python

In Python, the supervision manifest is represented by classes SupervisionSet and SupervisionSegment. Example usage:

from lhotse import SupervisionSet, SupervisionSegment

supervisions = SupervisionSet.from_segments([
    SupervisionSegment(
        id='segment-1',
        recording_id='recording-1',
        start=0.5,
        duration=10.7,
        text='quite a long utterance'
    )
])
print(f'There is {len(supervisions)} supervision in the set.')
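
Like RecordingSet, SupervisionSet supports (de)serialization and simple lookups. A short sketch (the file name is arbitrary; find() yields the segments attached to a given recording):

>>> supervisions.to_file('supervisions.jsonl')
>>> sups = SupervisionSet.from_file('supervisions.jsonl')
>>> for segment in sups.find(recording_id='recording-1'):
...     print(segment.id, segment.start, segment.duration)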

Standard data preparation recipes

We provide a number of standard data preparation recipes. By that, we mean a pair of a Python function and a CLI tool that create the manifests given a corpus directory.
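
For example, the Python variant of a recipe can be invoked directly. A sketch, assuming LibriSpeech has been extracted to /data/LibriSpeech (the available split names depend on the corpus):

>>> from lhotse.recipes import prepare_librispeech
>>> manifests = prepare_librispeech('/data/LibriSpeech', output_dir='data/manifests')
>>> recs = manifests['train-clean-100']['recordings']
>>> sups = manifests['train-clean-100']['supervisions']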

Currently supported corpora

Each corpus name is listed with its data preparation function:

  • Aishell: lhotse.recipes.prepare_aishell()

  • AMI: lhotse.recipes.prepare_ami()

  • BABEL: lhotse.recipes.prepare_single_babel_language()

  • CallHome Egyptian: lhotse.recipes.prepare_callhome_egyptian()

  • CallHome English: lhotse.recipes.prepare_callhome_english()

  • CMU Arctic: lhotse.recipes.prepare_cmu_arctic()

  • CMU Kids: lhotse.recipes.prepare_cmu_kids()

  • CSLU Kids: lhotse.recipes.prepare_cslu_kids()

  • DIHARD III: lhotse.recipes.prepare_dihard3()

  • English Broadcast News 1997: lhotse.recipes.prepare_broadcast_news()

  • GALE Arabic Broadcast Speech: lhotse.recipes.prepare_gale_arabic()

  • GALE Mandarin Broadcast Speech: lhotse.recipes.prepare_gale_mandarin()

  • GigaSpeech: lhotse.recipes.prepare_gigaspeech()

  • Heroico: lhotse.recipes.prepare_heroico()

  • L2 Arctic: lhotse.recipes.prepare_l2_arctic()

  • LibriSpeech (including “mini”): lhotse.recipes.prepare_librispeech()

  • LibriTTS: lhotse.recipes.prepare_libritts()

  • LJ Speech: lhotse.recipes.prepare_ljspeech()

  • MiniLibriMix: lhotse.recipes.prepare_librimix()

  • MTEDx: lhotse.recipes.prepare_mtedx()

  • MobvoiHotWord: lhotse.recipes.prepare_mobvoihotwords()

  • Multilingual LibriSpeech (MLS): lhotse.recipes.prepare_mls()

  • MUSAN: lhotse.recipes.prepare_musan()

  • National Speech Corpus (Singaporean English): lhotse.recipes.prepare_nsc()

  • Switchboard: lhotse.recipes.prepare_switchboard()

  • TED-LIUM v3: lhotse.recipes.prepare_tedlium()

  • VCTK: lhotse.recipes.prepare_vctk()

Adding new corpora

General pointers:

  • Each corpus has a dedicated Python file in lhotse/recipes.

  • For publicly available corpora that can be freely downloaded, we usually define a function called download, download_and_untar, etc.

  • Each data preparation recipe should expose a single function called prepare_X, with X being the name of the corpus, that produces dicts like: {'recordings': <RecordingSet>, 'supervisions': <SupervisionSet>} for the data in that corpus.

  • When a corpus defines a standard split (e.g. train/dev/test), we return a dict with the following structure: {'train': {'recordings': <RecordingSet>, 'supervisions': <SupervisionSet>}, 'dev': ...}

  • Some corpora (like LibriSpeech) come with pre-segmented recordings. In these cases, the SupervisionSegment will exactly match the Recording duration (and there will likely be exactly one segment corresponding to any recording).

  • Corpora with longer recordings (e.g. conversational, like Switchboard) should have exactly one Recording object corresponding to a single conversation/session, spanning its whole duration. Each speech segment in that recording should be represented as a SupervisionSegment with the same recording_id value (see the sketch after this list).

  • Corpora with multiple channels for each session (e.g. AMI) should have a single Recording with multiple AudioSource objects - each corresponding to a separate channel.
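
To make the last few conventions concrete, here is a sketch of the manifests for a hypothetical two-speaker conversation (the IDs, times, and audio path are made up):

>>> from lhotse import Recording, SupervisionSegment, SupervisionSet
>>> conversation = Recording.from_file('session-42.wav')  # one Recording per session
>>> supervisions = SupervisionSet.from_segments([
...     SupervisionSegment(id='session-42-seg0', recording_id=conversation.id,
...                        start=1.2, duration=3.5, speaker='spk-A'),
...     SupervisionSegment(id='session-42-seg1', recording_id=conversation.id,
...                        start=5.0, duration=2.1, speaker='spk-B'),
... ])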