Representing a corpus
In Lhotse, we represent the data using YAML (more readable), JSON, or JSONL (faster) manifests. For most audio corpora, we will need two types of manifests to fully describe them: a recording manifest and a supervision manifest.
Caution
We show all the examples in YAML format for improved readability. However, when processing medium or large datasets, we recommend using JSON or JSONL, which are much quicker to load and save.
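A minimal sketch of why JSONL is quick to load: each line is a self-contained JSON object, so a reader can stream the manifest line by line (the record fields below are illustrative):

```python
import json

# Two toy manifest entries, serialized as JSONL: one JSON object per line.
records = [
    {"id": "rec-1", "duration": 3600.0},
    {"id": "rec-2", "duration": 1800.0},
]
jsonl = "\n".join(json.dumps(r) for r in records)

# Reading back is a simple line-by-line loop - no full-document parse needed.
parsed = [json.loads(line) for line in jsonl.splitlines()]
```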
Recording manifest
class lhotse.audio.Recording(id: str, sources: List[lhotse.audio.AudioSource], sampling_rate: int, num_samples: int, duration: float, transforms: Optional[List[Dict]] = None)

The Recording manifest describes the recordings in a given corpus. It contains information about the recording, such as its path(s), duration, the number of samples, etc. It allows representing multiple channels coming from one or more files.

This manifest does not specify any segmentation or supervision information, such as the transcript or the speaker. This means that even when a recording is a 1-hour-long file, it is a single item in this manifest.
Hint
Lhotse reads audio recordings using pysoundfile and audioread, similarly to librosa, to support multiple audio formats.
A Recording can be simply created from a local audio file:

>>> from lhotse import RecordingSet, Recording, AudioSource
>>> recording = Recording.from_file('meeting.wav')
>>> recording
Recording(
    id='meeting',
    sources=[AudioSource(type='file', channels=[0], source='meeting.wav')],
    sampling_rate=16000,
    num_samples=57600000,
    duration=3600.0,
    transforms=None
)
This manifest can be easily converted to a Python dict and serialized to JSON/JSONL/YAML/etc:
>>> recording.to_dict()
{'id': 'meeting',
 'sources': [{'type': 'file', 'channels': [0], 'source': 'meeting.wav'}],
 'sampling_rate': 16000,
 'num_samples': 57600000,
 'duration': 3600.0}
Recordings can also be created programmatically, e.g. when they refer to URLs stored in S3 or somewhere else:
>>> s3_audio_files = ['s3://my-bucket/123-5678.flac', ...]
>>> recs = RecordingSet.from_recordings(
...     Recording(
...         id=url.split('/')[-1].replace('.flac', ''),
...         sources=[AudioSource(type='url', source=url, channels=[0])],
...         sampling_rate=16000,
...         num_samples=get_num_samples(url),
...         duration=get_duration(url)
...     ) for url in s3_audio_files
... )
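The id derivation above can equivalently be written with the standard library's pathlib (the URL is illustrative):

```python
from pathlib import PurePosixPath

url = 's3://my-bucket/123-5678.flac'
# Equivalent to url.split('/')[-1].replace('.flac', ''):
# take the last path component and drop its extension.
rec_id = PurePosixPath(url).stem
```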
It also allows reading the audio samples (or a subset of them) as a numpy array:
>>> samples = recording.load_audio()
>>> assert samples.shape == (1, 16000)
>>> samples2 = recording.load_audio(offset=0.5)
>>> assert samples2.shape == (1, 8000)
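The shapes asserted above follow from simple arithmetic: the number of returned samples is (duration - offset) × sampling_rate. A minimal sketch, assuming a 1-second mono recording at 16 kHz as those asserts imply (expected_num_samples is a hypothetical helper, not part of Lhotse):

```python
# Hypothetical helper mirroring the shape arithmetic of load_audio(offset=...):
# reading from `offset` seconds onward yields the remaining samples.
def expected_num_samples(duration: float, sampling_rate: int, offset: float = 0.0) -> int:
    return round((duration - offset) * sampling_rate)
```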
__init__(id, sources, sampling_rate, num_samples, duration, transforms=None)

Initialize self. See help(type(self)) for accurate signature.
class lhotse.audio.RecordingSet(recordings=None)

RecordingSet represents a collection of recordings. It does not contain any annotation such as the transcript or the speaker identity – just the information needed to retrieve a recording, such as its path, URL, number of channels, and some recording metadata (duration, number of samples).
It also supports (de)serialization to/from YAML/JSON/etc. and takes care of mapping between rich Python classes and YAML/JSON/etc. primitives during conversion.
When coming from Kaldi, think of it as wav.scp on steroids: RecordingSet also has the information from reco2dur and reco2num_samples, is able to represent multi-channel recordings and read a specified subset of channels, and supports reading audio files directly, via a Unix pipe, or downloading them on the fly from a URL (HTTPS/S3/Azure/GCP/etc.).

RecordingSet can be created from an iterable of Recording objects:

>>> from lhotse import RecordingSet
>>> audio_paths = ['123-5678.wav', ...]
>>> recs = RecordingSet.from_recordings(Recording.from_file(p) for p in audio_paths)
It behaves similarly to a dict:

>>> '123-5678' in recs
True
>>> recording = recs['123-5678']
>>> for recording in recs:
...     pass
>>> len(recs)
127
It also provides some utilities for I/O:
>>> recs.to_file('recordings.jsonl')
>>> recs.to_file('recordings.json.gz')  # auto-compression
>>> recs2 = RecordingSet.from_file('recordings.jsonl')
Manipulation:
>>> longer_than_5s = recs.filter(lambda r: r.duration > 5)
>>> first_100 = recs.subset(first=100)
>>> split_into_4 = recs.split(num_splits=4)
And lazy data augmentation/transformation, which requires adjusting some information in the manifest (e.g., num_samples or duration). Note that in the following examples, the audio is untouched – the operations are stored in the manifest and executed upon reading the audio:

>>> recs_sp = recs.perturb_speed(factor=1.1)
>>> recs_24k = recs.resample(24000)
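The metadata adjustments these lazy operations make can be sketched as plain arithmetic (the helper names are illustrative, not Lhotse internals): speeding up by a factor shortens the signal, while resampling rescales the sample count at a fixed duration.

```python
# Speed perturbation by `factor` makes the audio play faster, so both the
# duration and the number of samples shrink by that factor.
def perturb_speed_meta(duration: float, num_samples: int, factor: float):
    return duration / factor, round(num_samples / factor)

# Resampling keeps the duration but rescales num_samples by the rate ratio.
def resample_meta(num_samples: int, old_sr: int, new_sr: int) -> int:
    return round(num_samples * new_sr / old_sr)
```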
Finally, since we support importing Kaldi data dirs, if wav.scp contains Unix pipes, Recording will also handle them correctly.
__init__(recordings=None)

Initialize self. See help(type(self)) for accurate signature.
Supervision manifest
The supervision manifest contains the supervision information that we have about the recordings. In particular, it involves the segmentation - there might be a single segment for a single-utterance recording, and multiple segments for a recording of a conversation.
When coming from Kaldi, think of it as a segments file on steroids, which also contains utt2spk, utt2gender, utt2dur, etc.
This is a YAML supervision manifest:
---
- id: 'segment-1'
  recording_id: 'recording-2'
  channel: 0
  start: 0.1
  duration: 0.3
  text: 'transcript of the first segment'
  language: 'english'
  speaker: 'Norman Dyhrentfurth'
- id: 'segment-2'
  recording_id: 'recording-2'
  start: 0.5
  duration: 0.4
Each segment is characterized by the following attributes:
a unique id,
a corresponding recording id,
start time in seconds, relative to the beginning of the recording,
the duration in seconds.
Each segment may be assigned optional supervision information. In this example, the first segment contains the transcription text, the language of the utterance and a speaker name. The second segment contains only the minimal amount of information, which should be interpreted as: “this is some area of interest in the recording that we know nothing else about.”
Python
In Python, the supervision manifest is represented by the classes SupervisionSet and SupervisionSegment.
Example usage:
supervisions = SupervisionSet.from_segments([
SupervisionSegment(
id='segment-1',
recording_id='recording-1',
start=0.5,
duration=10.7,
text='quite a long utterance'
)
])
print(f'There is {len(supervisions)} supervision in the set.')
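For quick sanity checks, the segments from the YAML example earlier can be mirrored as plain Python dicts, without constructing SupervisionSegment objects - a minimal stdlib-only sketch:

```python
# The two segments from the YAML supervision manifest, as plain dicts
# (only the required fields are kept for the second, minimal segment).
segments = [
    {'id': 'segment-1', 'recording_id': 'recording-2', 'start': 0.1, 'duration': 0.3},
    {'id': 'segment-2', 'recording_id': 'recording-2', 'start': 0.5, 'duration': 0.4},
]

# Total supervised (annotated) time in the recording, in seconds.
total_supervised = sum(s['duration'] for s in segments)
```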
Standard data preparation recipes
We provide a number of standard data preparation recipes. By that, we mean a Python function paired with a CLI tool that creates the manifests given a corpus directory.
Supported corpora:

Aishell
AMI
BABEL
CallHome Egyptian
CallHome English
CMU Arctic
CMU Kids
CSLU Kids
DIHARD III
English Broadcast News 1997
GALE Arabic Broadcast Speech
GALE Mandarin Broadcast Speech
GigaSpeech
Heroico
L2 Arctic
LibriSpeech (including “mini”)
LibriTTS
LJ Speech
MiniLibriMix
MTEDx
MobvoiHotWord
Multilingual LibriSpeech (MLS)
MUSAN
National Speech Corpus (Singaporean English)
Switchboard
TED-LIUM v3
VCTK
Adding new corpora
General pointers:

Each corpus has a dedicated Python file in lhotse/recipes.

For publicly available corpora that can be freely downloaded, we usually define a function called download, download_and_untar, etc.

Each data preparation recipe should expose a single function called prepare_X, with X being the name of the corpus, that produces dicts like:

{'recordings': <RecordingSet>, 'supervisions': <SupervisionSet>}

for the data in that corpus.

When a corpus defines a standard split (e.g. train/dev/test), we return a dict with the following structure:

{'train': {'recordings': <RecordingSet>, 'supervisions': <SupervisionSet>}, 'dev': ...}

Some corpora (like LibriSpeech) come with pre-segmented recordings. In these cases, the SupervisionSegment will exactly match the Recording duration (and there will likely be exactly one segment corresponding to any recording).

Corpora with longer recordings (e.g. conversational, like Switchboard) should have exactly one Recording object corresponding to a single conversation/session, spanning its whole duration. Each speech segment in that recording should be represented as a SupervisionSegment with the same recording_id value.

Corpora with multiple channels for each session (e.g. AMI) should have a single Recording with multiple AudioSource objects, each corresponding to a separate channel.
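The return-value conventions above can be sketched as a toy recipe skeleton (prepare_toy_corpus is hypothetical; real recipes return RecordingSet and SupervisionSet objects, replaced here by placeholder strings so the sketch stays self-contained):

```python
# Hypothetical skeleton of a prepare_X recipe following the split/dict
# structure described above.
def prepare_toy_corpus(splits=('train', 'dev', 'test')):
    return {
        split: {
            'recordings': f'<RecordingSet: {split}>',    # placeholder
            'supervisions': f'<SupervisionSet: {split}>',  # placeholder
        }
        for split in splits
    }
```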