Representing a corpus¶
In Lhotse, we represent the data using YAML (more readable) or JSON (faster) manifests. For most audio corpora, we will need two types of manifests to fully describe them: a recording manifest and a supervision manifest.
Caution
We show all the examples in YAML format for improved readability. However, when processing medium or large datasets, we recommend using JSON, which is much quicker to load and save.
Recording manifest¶
The recording manifest describes the recordings in a given corpus. It contains only information about the recording itself: it does not specify any segmentation or supervision such as the transcript or the speaker. This means that a recording stored as a single 1-hour file is a single item in this manifest.
When coming from Kaldi, think of it as wav.scp on steroids that also contains reco2dur, reco2num_samples, and some extra information.
This is a YAML manifest for a corpus with two recordings:
---
- id: 'recording-1'
  sampling_rate: 8000
  num_samples: 4000
  duration: 0.5
  sources:
    - type: file
      channels: [0]
      source: 'test/fixtures/mono_c0.wav'
    - type: file
      channels: [1]
      source: 'test/fixtures/mono_c1.wav'
- id: 'recording-2'
  sampling_rate: 8000
  num_samples: 8000
  duration: 1.0
  sources:
    - type: file
      channels: [0, 1]
      source: 'test/fixtures/stereo.wav'
Each recording is described by:
a unique id,
its sampling rate,
the number of samples,
the duration in seconds,
a list of audio sources.
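The fields of a recording entry are related to each other: in the example above, the duration equals the number of samples divided by the sampling rate (4000 / 8000 = 0.5). The following is a minimal sketch of a consistency check over a manifest entry loaded as a plain dict. The function name `check_recording`, the tolerance, and the rule that each channel appears in only one source are assumptions made for illustration; this is not part of the Lhotse API.

```python
def check_recording(entry: dict, tolerance: float = 1e-6) -> None:
    """Raise ValueError if a recording entry is internally inconsistent."""
    # The duration should agree with num_samples / sampling_rate.
    expected = entry["num_samples"] / entry["sampling_rate"]
    if abs(expected - entry["duration"]) > tolerance:
        raise ValueError(
            f"{entry['id']}: duration {entry['duration']} does not match "
            f"num_samples / sampling_rate = {expected}"
        )
    # Assumption for this sketch: each channel is provided by one source only.
    seen = set()
    for source in entry["sources"]:
        for channel in source["channels"]:
            if channel in seen:
                raise ValueError(f"{entry['id']}: channel {channel} listed twice")
            seen.add(channel)


# The first recording from the YAML example, written as a Python dict.
recording_1 = {
    "id": "recording-1",
    "sampling_rate": 8000,
    "num_samples": 4000,
    "duration": 0.5,
    "sources": [
        {"type": "file", "channels": [0], "source": "test/fixtures/mono_c0.wav"},
        {"type": "file", "channels": [1], "source": "test/fixtures/mono_c1.wav"},
    ],
}
check_recording(recording_1)  # passes silently
```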
An audio source is a useful abstraction for cases where the user has an audio format not supported by the library, or wants to use shell tools such as SoX to perform some additional preprocessing. An audio source has the following properties:
type: either file or command
channels: a list of integer identifiers for each channel in the recording
source: in case of a file, it's a path; in case of a command, it's a shell command that will be expected to write a WAVE file to stdout.
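To illustrate the difference between the two source types, here is a rough sketch that reads WAVE data either from a path or from the stdout of a shell command, using only the Python standard library. The helper names `read_wave_bytes` and `describe_source` are hypothetical, and this is not how Lhotse itself loads audio; it only demonstrates the contract that a command source writes a WAVE file to stdout.

```python
import io
import subprocess
import wave
from typing import Tuple


def read_wave_bytes(source_type: str, source: str) -> bytes:
    """Fetch the raw bytes of a WAVE file from a 'file' or 'command' source."""
    if source_type == "file":
        with open(source, "rb") as f:
            return f.read()
    if source_type == "command":
        # The command is expected to write a complete WAVE file to stdout.
        result = subprocess.run(
            source, shell=True, check=True, stdout=subprocess.PIPE
        )
        return result.stdout
    raise ValueError(f"Unsupported source type: {source_type!r}")


def describe_source(source_type: str, source: str) -> Tuple[int, int, int]:
    """Return (sampling_rate, num_channels, num_samples) from the WAVE header."""
    data = read_wave_bytes(source_type, source)
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getframerate(), wav.getnchannels(), wav.getnframes()
```

With this sketch, a command source that simply prints an existing WAVE file to stdout yields the same description as the corresponding file source. Note that `shell=True` runs the string through the shell, so it should only be used with commands from a trusted manifest.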
Python¶
In Python, the recording manifest is represented by the classes RecordingSet, Recording, and AudioSource.
Example usage:
from lhotse.audio import RecordingSet

recordings = RecordingSet.from_yaml('audio.yml')
for recording in recordings:
    # Note: all time units in Lhotse are seconds
    if recording.duration >= 7.5:
        samples = recording.load_audio(
            channels=0,
            offset=2.5,
            duration=5.0
        )
        # Further sample processing
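Because offsets and durations are given in seconds, it can help to see how they map onto sample indices. The sketch below is only an illustration of that arithmetic; the function name `seconds_to_sample_range` is hypothetical, and the rounding convention is an assumption rather than a description of what Lhotse does internally.

```python
from typing import Tuple


def seconds_to_sample_range(
    offset: float, duration: float, sampling_rate: int
) -> Tuple[int, int]:
    """Convert a time window in seconds to a [start, end) range of sample indices."""
    start = round(offset * sampling_rate)
    end = start + round(duration * sampling_rate)
    return start, end


# The window used in the example above, for an 8 kHz recording.
print(seconds_to_sample_range(2.5, 5.0, 8000))  # (20000, 60000)
```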
Supervision manifest¶
The supervision manifest contains the supervision information that we have about the recordings. In particular, it involves the segmentation - there might be a single segment for a single utterance recording, and multiple segments for a recording of a conversation.
When coming from Kaldi, think of it as a segments file on steroids, that also contains utt2spk, utt2gender, utt2dur, etc.
This is a YAML supervision manifest:
---
- id: 'segment-1'
  recording_id: 'recording-2'
  channel: 0
  start: 0.1
  duration: 0.3
  text: 'transcript of the first segment'
  language: 'english'
  speaker: 'Norman Dyhrentfurth'
- id: 'segment-2'
  recording_id: 'recording-2'
  start: 0.5
  duration: 0.4
Each segment is characterized by the following attributes:
a unique id,
a corresponding recording id,
start time in seconds, relative to the beginning of the recording,
the duration in seconds.
Each segment may be assigned optional supervision information. In this example, the first segment contains the transcription text, the language of the utterance and a speaker name. The second segment contains only the minimal amount of information, which should be interpreted as: “this is some area of interest in the recording that we know nothing else about.”
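Since every segment refers to a recording and is positioned relative to its beginning, a natural sanity check is that each segment points to a known recording and stays within its bounds. The following is a small sketch over plain dicts mirroring the YAML examples; `find_invalid_segments` and its tolerance are hypothetical names and values chosen for illustration, not Lhotse functionality.

```python
from typing import Dict, List, Tuple


def find_invalid_segments(
    recordings: List[Dict], supervisions: List[Dict], tolerance: float = 1e-6
) -> List[Tuple[str, str]]:
    """Return (segment id, reason) pairs for segments that do not fit a recording."""
    durations = {rec["id"]: rec["duration"] for rec in recordings}
    problems = []
    for seg in supervisions:
        rec_duration = durations.get(seg["recording_id"])
        if rec_duration is None:
            problems.append((seg["id"], "unknown recording_id"))
        elif seg["start"] < 0 or seg["start"] + seg["duration"] > rec_duration + tolerance:
            problems.append((seg["id"], "segment exceeds recording bounds"))
    return problems


# Values taken from the recording and supervision manifests shown earlier.
recordings = [{"id": "recording-2", "duration": 1.0}]
supervisions = [
    {"id": "segment-1", "recording_id": "recording-2", "start": 0.1, "duration": 0.3},
    {"id": "segment-2", "recording_id": "recording-2", "start": 0.5, "duration": 0.4},
]
print(find_invalid_segments(recordings, supervisions))  # []
```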
Python¶
In Python, the supervision manifest is represented by the classes SupervisionSet and SupervisionSegment.
Example usage:
from lhotse.supervision import SupervisionSegment, SupervisionSet

supervisions = SupervisionSet.from_segments([
    SupervisionSegment(
        id='segment-1',
        recording_id='recording-1',
        start=0.5,
        duration=10.7,
        text='quite a long utterance'
    )
])
print(f'There is {len(supervisions)} supervision in the set.')
Standard data preparation recipes¶
We provide a number of standard data preparation recipes. Each recipe consists of a Python function and a CLI tool that create the manifests given a corpus directory.
Currently supported corpora:
AMI: lhotse.recipes.prepare_ami()
English Broadcast News 1997: lhotse.recipes.prepare_broadcast_news()
Full or Mini LibriSpeech: lhotse.recipes.prepare_librispeech()
Heroico: lhotse.recipes.prepare_heroico()
LJ Speech: lhotse.recipes.prepare_ljspeech()
Mini LibriMix: lhotse.recipes.prepare_librimix()
Switchboard: lhotse.recipes.prepare_switchboard()
TED-LIUM v3: lhotse.recipes.prepare_tedlium()
Adding new corpora¶
General pointers:
Each corpus has a dedicated Python file in lhotse/recipes.
For publicly available corpora that can be freely downloaded, we usually define a function called download, download_and_untar, etc.
Each data preparation recipe should expose a single function called prepare_X, with X being the name of the corpus, that produces dicts like: {'recordings': <RecordingSet>, 'supervisions': <SupervisionSet>} for the data in that corpus.
When a corpus defines a standard split (e.g. train/dev/test), we return a dict with the following structure: {'train': {'recordings': <RecordingSet>, 'supervisions': <SupervisionSet>}, 'dev': ...}
Some corpora (like LibriSpeech) come with pre-segmented recordings. In these cases, the SupervisionSegment will exactly match the Recording duration (and there will likely be exactly one segment corresponding to any recording).
Corpora with longer recordings (e.g. conversational, like Switchboard) should have exactly one Recording object corresponding to a single conversation/session, that spans its whole duration. Each speech segment in that recording should be represented as a SupervisionSegment with the same recording_id value.
Corpora with multiple channels for each session (e.g. AMI) should have a single Recording with multiple AudioSource objects - each corresponding to a separate channel.
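As a rough illustration of the shape of such a recipe, the sketch below prepares a made-up, pre-segmented corpus where every utterance is stored as a `<name>.wav` file with a `<name>.txt` transcript next to it. Everything here is an assumption for the sake of the example: the corpus layout, the function name `prepare_toy`, and the use of plain lists of dicts as stand-ins for RecordingSet and SupervisionSet. A real recipe would build the Lhotse objects instead.

```python
import wave
from pathlib import Path
from typing import Dict, List


def prepare_toy(corpus_dir: str) -> Dict[str, List[dict]]:
    """Hypothetical recipe for a made-up corpus of <name>.wav + <name>.txt pairs."""
    recordings, supervisions = [], []
    for wav_path in sorted(Path(corpus_dir).glob("*.wav")):
        # Read the recording metadata from the WAVE header.
        with wave.open(str(wav_path), "rb") as wav:
            sampling_rate = wav.getframerate()
            num_samples = wav.getnframes()
            num_channels = wav.getnchannels()
        duration = num_samples / sampling_rate
        recordings.append({
            "id": wav_path.stem,
            "sampling_rate": sampling_rate,
            "num_samples": num_samples,
            "duration": duration,
            "sources": [{
                "type": "file",
                "channels": list(range(num_channels)),
                "source": str(wav_path),
            }],
        })
        # Pre-segmented corpus: one segment spanning the whole recording.
        text = wav_path.with_suffix(".txt").read_text(encoding="utf-8").strip()
        supervisions.append({
            "id": f"{wav_path.stem}-seg",
            "recording_id": wav_path.stem,
            "start": 0.0,
            "duration": duration,
            "text": text,
        })
    # Stand-ins for {'recordings': <RecordingSet>, 'supervisions': <SupervisionSet>}.
    return {"recordings": recordings, "supervisions": supervisions}
```

For a corpus with a standard split, the same function could be called once per subset directory and the results collected in a dict keyed by split name.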