Representing a corpus

In Lhotse, we represent the data using a small number of Python classes, enhanced with methods for common data manipulation tasks, which can be stored as JSON or JSONL manifests. For most audio corpora, we need two types of manifests to fully describe them: a recording manifest and a supervision manifest.

Recording manifest

class lhotse.audio.Recording(id, sources, sampling_rate, num_samples, duration, channel_ids=None, transforms=None)[source]

The Recording manifest describes the recordings in a given corpus. It contains information about the recording, such as its path(s), duration, the number of samples, etc. It can represent multiple channels coming from one or more files.

This manifest does not specify any segmentation information or supervision such as the transcript or the speaker – we use SupervisionSegment for that.

Note that Recording can represent both a single utterance (e.g., in LibriSpeech) and a 1-hour session with multiple channels and speakers (e.g., in AMI). In the latter case, it is partitioned into data suitable for model training using Cut.
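For instance, a long session is typically cut down to per-utterance training examples by combining its recordings and supervisions (a minimal sketch; it assumes the recordings and supervisions manifests for the corpus have already been prepared):

>>> from lhotse import CutSet
>>> cuts = CutSet.from_manifests(recordings=recordings, supervisions=supervisions)
>>> utterance_cuts = cuts.trim_to_supervisions()  # one cut per supervision segment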

Internally, Lhotse supports multiple audio backends to read audio files. By default, we try to use libsoundfile, then torchaudio (with FFMPEG integration starting with torchaudio 2.1), and then audioread (which is an ffmpeg CLI wrapper). For sphere files we prefer the sph2pipe binary, as it can handle certain unique encodings such as “shorten”.

Audio backends in Lhotse are configurable. See:

  • available_audio_backends()

  • audio_backend()

  • get_current_audio_backend()

  • set_current_audio_backend()

  • get_default_audio_backend()
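For instance, one can list the backends available in the current environment and temporarily select one (a minimal sketch; it assumes these functions are exposed at the top level of the lhotse package and that audio_backend() works as a context manager – the backend name below is only illustrative):

>>> import lhotse
>>> lhotse.available_audio_backends()                 # names of the backends usable in this environment
>>> with lhotse.audio_backend("LibsndfileBackend"):   # illustrative name; pick one from the list above
...     samples = recording.load_audio()              # this read uses only the selected backend
>>> lhotse.get_current_audio_backend()                # inspect which backend is currently active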

Examples

A Recording can be simply created from a local audio file:

>>> from lhotse import RecordingSet, Recording, AudioSource
>>> recording = Recording.from_file('meeting.wav')
>>> recording
Recording(
    id='meeting',
    sources=[AudioSource(type='file', channels=[0], source='meeting.wav')],
    sampling_rate=16000,
    num_samples=57600000,
    duration=3600.0,
    transforms=None
)

This manifest can be easily converted to a Python dict and serialized to JSON/JSONL/YAML/etc:

>>> recording.to_dict()
{'id': 'meeting',
 'sources': [{'type': 'file',
   'channels': [0],
   'source': 'meeting.wav'}],
 'sampling_rate': 16000,
 'num_samples': 57600000,
 'duration': 3600.0}

Recordings can also be created programmatically, e.g. when they refer to URLs stored in S3 or somewhere else:

>>> s3_audio_files = ['s3://my-bucket/123-5678.flac', ...]
>>> recs = RecordingSet.from_recordings(
...     Recording(
...         id=url.split('/')[-1].replace('.flac', ''),
...         sources=[AudioSource(type='url', source=url, channels=[0])],
...         sampling_rate=16000,
...         num_samples=get_num_samples(url),
...         duration=get_duration(url)
...     )
...     for url in s3_audio_files
... )

A Recording also allows reading the audio samples as a numpy array, optionally restricted via an offset and duration (the shapes below assume a 1-second recording sampled at 16000 Hz):

>>> samples = recording.load_audio()
>>> assert samples.shape == (1, 16000)
>>> samples2 = recording.load_audio(offset=0.5)
>>> assert samples2.shape == (1, 8000)

See also: RecordingSet, Cut, CutSet.

class lhotse.audio.RecordingSet(recordings=None)[source]

RecordingSet represents a collection of recordings. It does not contain any annotation such as the transcript or the speaker identity – just the information needed to retrieve a recording such as its path, URL, number of channels, and some recording metadata (duration, number of samples).

It also supports (de)serialization to/from YAML/JSON/etc. and takes care of mapping between rich Python classes and YAML/JSON/etc. primitives during conversion.

When coming from Kaldi, think of it as wav.scp on steroids: RecordingSet also has the information from reco2dur and reco2num_samples, is able to represent multi-channel recordings and read a specified subset of channels, and supports reading audio files directly, via a unix pipe, or downloading them on-the-fly from a URL (HTTPS/S3/Azure/GCP/etc.).
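As an illustration, a Recording can point at a shell command that writes audio to its stdout, similarly to a Kaldi pipe (a sketch; the command, sample count, and duration below are illustrative and have to be known up front):

>>> from lhotse import Recording, AudioSource
>>> piped = Recording(
...     id='sw02001',
...     sources=[AudioSource(
...         type='command', channels=[0],
...         source='sph2pipe -f wav -p -c 1 sw02001.sph',
...     )],
...     sampling_rate=8000,
...     num_samples=2400000,
...     duration=300.0,
... )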

Examples:

RecordingSet can be created from an iterable of Recording objects:

>>> from lhotse import RecordingSet
>>> audio_paths = ['123-5678.wav', ...]
>>> recs = RecordingSet.from_recordings(Recording.from_file(p) for p in audio_paths)

It can also be created from a directory, which is scanned recursively for matching files, optionally with parallel processing:

>>> recs2 = RecordingSet.from_dir('/data/audio', pattern='*.flac', num_jobs=4)

It behaves similarly to a dict:

>>> '123-5678' in recs
True
>>> recording = recs['123-5678']
>>> for recording in recs:
...     pass
>>> len(recs)
127

It also provides some utilities for I/O:

>>> recs.to_file('recordings.jsonl')
>>> recs.to_file('recordings.json.gz')  # auto-compression
>>> recs2 = RecordingSet.from_file('recordings.jsonl')

Manipulation:

>>> longer_than_5s = recs.filter(lambda r: r.duration > 5)
>>> first_100 = recs.subset(first=100)
>>> split_into_4 = recs.split(num_splits=4)
>>> shuffled = recs.shuffle()

It also supports lazy data augmentation/transformation, which requires adjusting some information in the manifest (e.g., num_samples or duration). Note that in the following examples the audio is untouched – the operations are stored in the manifest and executed when the audio is read:

>>> recs_sp = recs.perturb_speed(factor=1.1)
>>> recs_vp = recs.perturb_volume(factor=2.)
>>> recs_rvb = recs.reverb_rir(rir_recs)
>>> recs_24k = recs.resample(24000)

Supervision manifest

class lhotse.supervision.SupervisionSegment(id, recording_id, start, duration, channel=0, text=None, language=None, speaker=None, gender=None, custom=None, alignment=None)[source]

SupervisionSegment represents a time interval (segment) annotated with some supervision labels and/or metadata, such as the transcription, the speaker identity, the language, etc.

Each supervision has a unique id and always refers to a specific recording (via recording_id) and one or more channels (by default, 0). Note that multiple channels of the recording may share the same supervision, in which case the channel field will be a list of integers.

It’s also characterized by the start time (relative to the beginning of a Recording or a Cut) and a duration, both expressed in seconds.

The remaining fields are all optional, and their availability depends on specific corpora. Since it is difficult to predict all possible types of metadata, the custom field (a dict) can be used to insert types of supervisions that are not supported out of the box.

SupervisionSegment may contain multiple types of alignments. The alignment field is a dict, indexed by the alignment’s type (e.g., word or phone), and contains a list of AlignmentItem objects – simple structures that hold a given symbol and its time interval. Alignments can be read from CTM files or created programmatically.

Examples

A simple segment with no supervision information:

>>> from lhotse import SupervisionSegment
>>> sup0 = SupervisionSegment(
...     id='rec00001-sup00000', recording_id='rec00001',
...     start=0.5, duration=5.0, channel=0
... )

Typical supervision containing transcript, speaker ID, gender, and language:

>>> sup1 = SupervisionSegment(
...     id='rec00001-sup00001', recording_id='rec00001',
...     start=5.5, duration=3.0, channel=0,
...     text='transcript of the second segment',
...     speaker='Norman Dyhrentfurth', language='English', gender='M'
... )

Two supervisions denoting overlapping speech on two separate channels in a microphone array/multiple headsets (pay attention to start, duration, and channel):

>>> sup2 = SupervisionSegment(
...     id='rec00001-sup00002', recording_id='rec00001',
...     start=15.0, duration=5.0, channel=0,
...     text="i have incredibly good news for you",
...     speaker='Norman Dyhrentfurth', language='English', gender='M'
... )
>>> sup3 = SupervisionSegment(
...     id='rec00001-sup00003', recording_id='rec00001',
...     start=18.0, duration=3.0, channel=1,
...     text="say what",
...     speaker='Hervey Arman', language='English', gender='M'
... )

A supervision with a phone alignment:

>>> from lhotse.supervision import AlignmentItem
>>> sup4 = SupervisionSegment(
...     id='rec00001-sup00004', recording_id='rec00001',
...     start=33.0, duration=1.0, channel=0,
...     text="ice",
...     speaker='Maryla Zechariah', language='English', gender='F',
...     alignment={
...         'phone': [
...             AlignmentItem(symbol='AY0', start=33.0, duration=0.6),
...             AlignmentItem(symbol='S', start=33.6, duration=0.4)
...         ]
...     }
... )

A supervision shared across multiple channels of a recording (e.g. a microphone array):

>>> sup5 = SupervisionSegment(
...     id='rec00001-sup00005', recording_id='rec00001',
...     start=33.0, duration=1.0, channel=[0, 1],
...     text="ice",
...     speaker='Maryla Zechariah',
... )
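
A supervision carrying extra, corpus-specific metadata in the custom field (the keys below are purely illustrative):

>>> sup6 = SupervisionSegment(
...     id='rec00001-sup00006', recording_id='rec00001',
...     start=40.0, duration=2.5, channel=0,
...     text="thank you",
...     custom={'snr': 7.3, 'utterance_type': 'backchannel'}
... )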

Converting SupervisionSegment to a dict:

>>> sup0.to_dict()
{'id': 'rec00001-sup00000', 'recording_id': 'rec00001', 'start': 0.5, 'duration': 5.0, 'channel': 0}

class lhotse.supervision.SupervisionSet(segments=None)[source]

SupervisionSet represents a collection of segments containing some supervision information (see SupervisionSegment).

It acts like a Python dict, extended with an efficient find operation that indexes and caches the supervision segments in an interval tree. This makes it quick to find supervision segments that correspond to a specific time interval. However, it can also work with lazy iterables.

When coming from Kaldi, think of SupervisionSet as a segments file on steroids, that may also contain text, utt2spk, utt2gender, utt2dur, etc.

Examples

Building a SupervisionSet:

>>> from lhotse import SupervisionSet, SupervisionSegment
>>> sups = SupervisionSet.from_segments([SupervisionSegment(...), ...])

Writing/reading a SupervisionSet:

>>> sups.to_file('supervisions.jsonl.gz')
>>> sups2 = SupervisionSet.from_file('supervisions.jsonl.gz')

Using SupervisionSet like a dict:

>>> 'rec00001-sup00000' in sups
True
>>> sups['rec00001-sup00000']
SupervisionSegment(id='rec00001-sup00000', recording_id='rec00001', start=0.5, ...)
>>> for segment in sups:
...     pass

Searching by recording_id and time interval:

>>> matched_segments = sups.find(recording_id='rec00001', start_after=17.0, end_before=25.0)

Manipulation:

>>> longer_than_5s = sups.filter(lambda s: s.duration > 5)
>>> first_100 = sups.subset(first=100)
>>> split_into_4 = sups.split(num_splits=4)
>>> shuffled = sups.shuffle()

Standard data preparation recipes

We provide a number of standard data preparation recipes. By that, we mean a pairing of a Python function and a CLI tool that create the manifests given a corpus directory.
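For example, the LibriSpeech recipe can be invoked from Python roughly like this (a sketch; the exact arguments and the returned partition keys vary between recipes, so consult each recipe's documentation):

>>> from lhotse.recipes import prepare_librispeech
>>> manifests = prepare_librispeech('/data/LibriSpeech', output_dir='data/manifests')
>>> part = manifests['train-clean-100']
>>> part['recordings'], part['supervisions']

The equivalent CLI call typically looks like: lhotse prepare librispeech /data/LibriSpeech data/manifests.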

Currently supported audio corpora

• ADEPT: lhotse.recipes.prepare_adept()
• Aidatatang_200zh: lhotse.recipes.prepare_aidatatang_200zh()
• Aishell: lhotse.recipes.prepare_aishell()
• Aishell-3: lhotse.recipes.prepare_aishell3()
• AISHELL-4: lhotse.recipes.prepare_aishell4()
• AliMeeting: lhotse.recipes.prepare_alimeeting()
• AMI: lhotse.recipes.prepare_ami()
• ASpIRE: lhotse.recipes.prepare_aspire()
• ATCOSIM: lhotse.recipes.prepare_atcosim()
• AudioMNIST: lhotse.recipes.prepare_audio_mnist()
• BABEL: lhotse.recipes.prepare_single_babel_language()
• Bengali.AI Speech: lhotse.recipes.prepare_bengaliai_speech()
• BUT ReverbDB: lhotse.recipes.prepare_but_reverb_db()
• BVCC / VoiceMOS Challenge: lhotse.recipes.bvcc()
• CallHome Egyptian: lhotse.recipes.prepare_callhome_egyptian()
• CallHome English: lhotse.recipes.prepare_callhome_english()
• CHiME-6: lhotse.recipes.prepare_chime6()
• CMU Arctic: lhotse.recipes.prepare_cmu_arctic()
• CMU Indic: lhotse.recipes.prepare_cmu_indic()
• CMU Kids: lhotse.recipes.prepare_cmu_kids()
• CommonVoice: lhotse.recipes.prepare_commonvoice()
• Corpus of Spontaneous Japanese: lhotse.recipes.prepare_csj()
• CSLU Kids: lhotse.recipes.prepare_cslu_kids()
• DailyTalk: lhotse.recipes.prepare_daily_talk()
• DIHARD III: lhotse.recipes.prepare_dihard3()
• DiPCo: lhotse.recipes.prepare_dipco()
• Earnings’21: lhotse.recipes.prepare_earnings21()
• Earnings’22: lhotse.recipes.prepare_earnings22()
• The Edinburgh International Accents of English Corpus: lhotse.recipes.prepare_edacc()
• English Broadcast News 1997: lhotse.recipes.prepare_broadcast_news()
• Fisher English Part 1, 2: lhotse.recipes.prepare_fisher_english()
• Fisher Spanish: lhotse.recipes.prepare_fisher_spanish()
• Fluent Speech Commands: lhotse.recipes.slu()
• GALE Arabic Broadcast Speech: lhotse.recipes.prepare_gale_arabic()
• GALE Mandarin Broadcast Speech: lhotse.recipes.prepare_gale_mandarin()
• GigaSpeech: lhotse.recipes.prepare_gigaspeech()
• GigaST: lhotse.recipes.prepare_gigast()
• Heroico: lhotse.recipes.prepare_heroico()
• HiFiTTS: lhotse.recipes.prepare_hifitts()
• HI-MIA (including HI-MIA-CW): lhotse.recipes.prepare_himia()
• ICMC-ASR: lhotse.recipes.prepare_icmcasr()
• ICSI: lhotse.recipes.prepare_icsi()
• IWSLT22_Ta: lhotse.recipes.prepare_iwslt22_ta()
• KeSpeech: lhotse.recipes.prepare_kespeech()
• L2 Arctic: lhotse.recipes.prepare_l2_arctic()
• LibriCSS: lhotse.recipes.prepare_libricss()
• LibriLight: lhotse.recipes.prepare_librilight()
• LibriSpeech (including “mini”): lhotse.recipes.prepare_librispeech()
• LibriTTS: lhotse.recipes.prepare_libritts()
• LibriTTS-R: lhotse.recipes.prepare_librittsr()
• LJ Speech: lhotse.recipes.prepare_ljspeech()
• MDCC: lhotse.recipes.prepare_mdcc()
• Medical: lhotse.recipes.prepare_medical()
• MiniLibriMix: lhotse.recipes.prepare_librimix()
• MTEDx: lhotse.recipes.prepare_mtedx()
• MobvoiHotWord: lhotse.recipes.prepare_mobvoihotwords()
• Multilingual LibriSpeech (MLS): lhotse.recipes.prepare_mls()
• MUSAN: lhotse.recipes.prepare_musan()
• MuST-C: lhotse.recipes.prepare_must_c()
• National Speech Corpus (Singaporean English): lhotse.recipes.prepare_nsc()
• People’s Speech: lhotse.recipes.prepare_peoples_speech()
• RIRs and Noises Corpus (OpenSLR 28): lhotse.recipes.prepare_rir_noise()
• Speech Commands: lhotse.recipes.prepare_speechcommands()
• SpeechIO: lhotse.recipes.prepare_speechio()
• SPGISpeech: lhotse.recipes.prepare_spgispeech()
• Switchboard: lhotse.recipes.prepare_switchboard()
• TED-LIUM v2: lhotse.recipes.prepare_tedlium2()
• TED-LIUM v3: lhotse.recipes.prepare_tedlium()
• TIMIT: lhotse.recipes.prepare_timit()
• This American Life: lhotse.recipes.prepare_this_american_life()
• UWB-ATCC: lhotse.recipes.prepare_uwb_atcc()
• VCTK: lhotse.recipes.prepare_vctk()
• VoxCeleb: lhotse.recipes.prepare_voxceleb()
• VoxConverse: lhotse.recipes.prepare_voxconverse()
• VoxPopuli: lhotse.recipes.prepare_voxpopuli()
• WenetSpeech: lhotse.recipes.prepare_wenet_speech()
• YesNo: lhotse.recipes.prepare_yesno()
• Eval2000: lhotse.recipes.prepare_eval2000()
• MGB2: lhotse.recipes.prepare_mgb2()
• XBMU-AMDO31: lhotse.recipes.xbmu_amdo31()

Currently supported video corpora

• Grid Audio-Visual Speech Corpus: lhotse.recipes.prepare_grid()

Adding new corpora

Hint

Python data preparation recipes. Each corpus has a dedicated Python file in lhotse/recipes, which you can use as the basis for your own recipe.

Hint

(optional) Downloading utility. For publicly available corpora that can be freely downloaded, we usually define a function called download_<corpus-name>().

Hint

Data preparation Python entry-point. Each data preparation recipe should expose a single function called prepare_<corpus-name> that produces dicts like: {'recordings': <RecordingSet>, 'supervisions': <SupervisionSet>}.

Hint

CLI recipe wrappers. We provide a command-line interface that wraps the download and prepare functions – see lhotse/bin/modes/recipes for examples of how to do it.

Hint

Pre-defined train/dev/test splits. When a corpus defines standard split (e.g. train/dev/test), we return a dict with the following structure: {'train': {'recordings': <RecordingSet>, 'supervisions': <SupervisionSet>}, 'dev': ...}
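
Putting the above hints together, a new recipe's entry point could look roughly like this (a sketch built around a hypothetical corpus layout – the directory structure, annotation format, and corpus name are made up for illustration):

>>> from pathlib import Path
>>> from lhotse import Recording, RecordingSet, SupervisionSegment, SupervisionSet
>>> def prepare_mycorpus(corpus_dir, output_dir=None):
...     corpus_dir = Path(corpus_dir)
...     manifests = {}
...     for split in ('train', 'dev', 'test'):
...         # One Recording per audio file in this hypothetical layout.
...         recordings = RecordingSet.from_recordings(
...             Recording.from_file(p) for p in (corpus_dir / split).glob('*.wav')
...         )
...         # Hypothetical annotation format: one "<recording_id> <start> <duration> <text>" entry per line.
...         supervisions = SupervisionSet.from_segments(
...             SupervisionSegment(
...                 id=f'{rec_id}-{idx}', recording_id=rec_id,
...                 start=float(start), duration=float(dur), channel=0, text=text,
...             )
...             for idx, (rec_id, start, dur, text) in enumerate(
...                 line.split(maxsplit=3)
...                 for line in (corpus_dir / split / 'segments.txt').read_text().splitlines()
...             )
...         )
...         if output_dir is not None:
...             out = Path(output_dir)
...             recordings.to_file(out / f'mycorpus_recordings_{split}.jsonl.gz')
...             supervisions.to_file(out / f'mycorpus_supervisions_{split}.jsonl.gz')
...         manifests[split] = {'recordings': recordings, 'supervisions': supervisions}
...     return manifests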

Hint

Manifest naming convention. The default naming convention is <corpus-name>_<manifest-type>_<split>.jsonl.gz, i.e., we save the manifests in a compressed JSONL file. Here, <manifest-type> can be recordings, supervisions, etc., and <split> can be train, dev, test, etc. In case the corpus has no such split defined, we can use all as default. Other information, e.g., mic type, language, etc. may be included in the <corpus-name>. Some examples are: cmu-indic_recordings_all.jsonl.gz, ami-ihm_supervisions_dev.jsonl.gz, mtedx-english_recordings_train.jsonl.gz.

Hint

Isolated utterance corpora. Some corpora (like LibriSpeech) come with pre-segmented recordings. In these cases, the SupervisionSegment will exactly match the Recording duration (and there will likely be exactly one segment corresponding to any recording).

Hint

Conversational corpora. Corpora with longer recordings (e.g. conversational, like Switchboard) should have exactly one Recording object corresponding to a single conversation/session, that spans its whole duration. Each speech segment in that recording should be represented as a SupervisionSegment with the same recording_id value.

Hint

Multi-channel corpora. Corpora with multiple channels for each session (e.g. AMI) should have a single Recording with multiple AudioSource objects – each corresponding to a separate channel. Remember to make the SupervisionSegment objects correspond to the right channels!
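
A minimal sketch of such a multi-channel Recording with a supervision pinned to one of its channels (file names, durations, and ids below are illustrative):

>>> from lhotse import Recording, AudioSource, SupervisionSegment
>>> session = Recording(
...     id='ami-es2004a',
...     sources=[
...         AudioSource(type='file', channels=[0], source='ES2004a.Array1-01.wav'),
...         AudioSource(type='file', channels=[1], source='ES2004a.Array1-02.wav'),
...     ],
...     sampling_rate=16000,
...     num_samples=57600000,
...     duration=3600.0,
... )
>>> sup = SupervisionSegment(
...     id='ami-es2004a-sup00001', recording_id='ami-es2004a',
...     start=12.5, duration=4.2, channel=0,
...     text='so, shall we get started?'
... )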