Representing a corpus
In Lhotse, we represent the data using YAML (more readable), JSON, or JSONL (faster) manifests. For most audio corpora, we will need two types of manifests to fully describe them: a recording manifest and a supervision manifest.
Caution
We show all the examples in YAML format for improved readability. However, when processing medium or large datasets, we recommend using JSON or JSONL, which are much quicker to load and save.
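A minimal sketch of why JSONL is quick to load: each line is a self-contained JSON object, so a reader can stream the manifest line by line (the record fields below are illustrative):

```python
import json

# Two toy manifest entries, serialized as JSONL: one JSON object per line.
records = [
    {"id": "rec-1", "duration": 3600.0},
    {"id": "rec-2", "duration": 1800.0},
]
jsonl = "\n".join(json.dumps(r) for r in records)

# Reading back is a simple line-by-line loop - no full-document parse needed.
parsed = [json.loads(line) for line in jsonl.splitlines()]
```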
Recording manifest
class lhotse.audio.Recording(id: str, sources: List[lhotse.audio.AudioSource], sampling_rate: int, num_samples: int, duration: float, transforms: Optional[List[Dict]] = None)

The Recording manifest describes the recordings in a given corpus. It contains information about the recording, such as its path(s), duration, the number of samples, etc. It allows representing multiple channels coming from one or more files.

This manifest does not specify any segmentation or supervision information, such as the transcript or the speaker. This means that even when a recording is a 1-hour-long file, it is a single item in this manifest.
Hint
Lhotse reads audio recordings using pysoundfile and audioread, similarly to librosa, to support multiple audio formats.
A Recording can be simply created from a local audio file:

>>> from lhotse import RecordingSet, Recording, AudioSource
>>> recording = Recording.from_file('meeting.wav')
>>> recording
Recording(
    id='meeting',
    sources=[AudioSource(type='file', channels=[0], source='meeting.wav')],
    sampling_rate=16000,
    num_samples=57600000,
    duration=3600.0,
    transforms=None
)
This manifest can be easily converted to a Python dict and serialized to JSON/JSONL/YAML/etc:
>>> recording.to_dict()
{'id': 'meeting',
 'sources': [{'type': 'file', 'channels': [0], 'source': 'meeting.wav'}],
 'sampling_rate': 16000,
 'num_samples': 57600000,
 'duration': 3600.0}
Recordings can also be created programmatically, e.g. when they refer to URLs stored in S3 or somewhere else:
>>> s3_audio_files = ['s3://my-bucket/123-5678.flac', ...]
>>> recs = RecordingSet.from_recordings(
...     Recording(
...         id=url.split('/')[-1].replace('.flac', ''),
...         sources=[AudioSource(type='url', source=url, channels=[0])],
...         sampling_rate=16000,
...         num_samples=get_num_samples(url),
...         duration=get_duration(url)
...     ) for url in s3_audio_files
... )
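The id derivation above can equivalently be written with the standard library's pathlib (the URL is illustrative):

```python
from pathlib import PurePosixPath

url = 's3://my-bucket/123-5678.flac'
# Equivalent to url.split('/')[-1].replace('.flac', ''):
# take the last path component and drop its extension.
rec_id = PurePosixPath(url).stem
```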
It also allows reading the audio samples (or a subset of them) as a numpy array:
>>> samples = recording.load_audio()
>>> assert samples.shape == (1, 16000)
>>> samples2 = recording.load_audio(offset=0.5)
>>> assert samples2.shape == (1, 8000)
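The shapes asserted above follow from simple arithmetic: the number of returned samples is (duration - offset) × sampling_rate. A minimal sketch, assuming a 1-second mono recording at 16 kHz as those asserts imply (expected_num_samples is a hypothetical helper, not part of Lhotse):

```python
# Hypothetical helper mirroring the shape arithmetic of load_audio(offset=...):
# reading from `offset` seconds onward yields the remaining samples.
def expected_num_samples(duration: float, sampling_rate: int, offset: float = 0.0) -> int:
    return round((duration - offset) * sampling_rate)
```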
__init__(id, sources, sampling_rate, num_samples, duration, transforms=None)

Initialize self. See help(type(self)) for accurate signature.
class lhotse.audio.RecordingSet(recordings=None)

RecordingSet represents a collection of recordings. It does not contain any annotation such as the transcript or the speaker identity – just the information needed to retrieve a recording, such as its path, URL, number of channels, and some recording metadata (duration, number of samples).
It also supports (de)serialization to/from YAML/JSON/etc. and takes care of mapping between rich Python classes and YAML/JSON/etc. primitives during conversion.
When coming from Kaldi, think of it as wav.scp on steroids: RecordingSet also has the information from reco2dur and reco2num_samples, is able to represent multi-channel recordings and read a specified subset of channels, and supports reading audio files directly, via a Unix pipe, or downloading them on the fly from a URL (HTTPS/S3/Azure/GCP/etc.).

RecordingSet can be created from an iterable of Recording objects:

>>> from lhotse import RecordingSet
>>> audio_paths = ['123-5678.wav', ...]
>>> recs = RecordingSet.from_recordings(Recording.from_file(p) for p in audio_paths)
It behaves similarly to a dict:

>>> '123-5678' in recs
True
>>> recording = recs['123-5678']
>>> for recording in recs:
...     pass
>>> len(recs)
127
It also provides some utilities for I/O:
>>> recs.to_file('recordings.jsonl')
>>> recs.to_file('recordings.json.gz')  # auto-compression
>>> recs2 = RecordingSet.from_file('recordings.jsonl')
Manipulation:
>>> longer_than_5s = recs.filter(lambda r: r.duration > 5)
>>> first_100 = recs.subset(first=100)
>>> split_into_4 = recs.split(num_splits=4)
And lazy data augmentation/transformation, which requires adjusting some information in the manifest (e.g., num_samples or duration). Note that in the following examples, the audio is untouched – the operations are stored in the manifest and executed upon reading the audio:

>>> recs_sp = recs.perturb_speed(factor=1.1)
>>> recs_24k = recs.resample(24000)
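The metadata adjustments these lazy operations make can be sketched as plain arithmetic (the helper names are illustrative, not Lhotse internals): speeding up by a factor shortens the signal, while resampling rescales the sample count at a fixed duration.

```python
# Speed perturbation by `factor` makes the audio play faster, so both the
# duration and the number of samples shrink by that factor.
def perturb_speed_meta(duration: float, num_samples: int, factor: float):
    return duration / factor, round(num_samples / factor)

# Resampling keeps the duration but rescales num_samples by the rate ratio.
def resample_meta(num_samples: int, old_sr: int, new_sr: int) -> int:
    return round(num_samples * new_sr / old_sr)
```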
Finally, since we support importing Kaldi data dirs, if wav.scp contains Unix pipes, Recording will also handle them correctly.
__init__(recordings=None)

Initialize self. See help(type(self)) for accurate signature.
Supervision manifest
The supervision manifest contains the supervision information that we have about the recordings. In particular, it involves the segmentation - there might be a single segment for a single-utterance recording, and multiple segments for a recording of a conversation.
When coming from Kaldi, think of it as a segments file on steroids, which also contains utt2spk, utt2gender, utt2dur, etc.
This is a YAML supervision manifest:
---
- id: 'segment-1'
  recording_id: 'recording-2'
  channel: 0
  start: 0.1
  duration: 0.3
  text: 'transcript of the first segment'
  language: 'english'
  speaker: 'Norman Dyhrentfurth'
- id: 'segment-2'
  recording_id: 'recording-2'
  start: 0.5
  duration: 0.4
Each segment is characterized by the following attributes:
a unique id,
a corresponding recording id,
start time in seconds, relative to the beginning of the recording,
the duration in seconds.
Each segment may be assigned optional supervision information. In this example, the first segment contains the transcription text, the language of the utterance and a speaker name. The second segment contains only the minimal amount of information, which should be interpreted as: “this is some area of interest in the recording that we know nothing else about.”
Python
In Python, the supervision manifest is represented by the classes SupervisionSet and SupervisionSegment.
Example usage:
supervisions = SupervisionSet.from_segments([
SupervisionSegment(
id='segment-1',
recording_id='recording-1',
start=0.5,
duration=10.7,
text='quite a long utterance'
)
])
print(f'There is {len(supervisions)} supervision in the set.')
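For quick sanity checks, the segments from the YAML example earlier can be mirrored as plain Python dicts, without constructing SupervisionSegment objects - a minimal stdlib-only sketch:

```python
# The two segments from the YAML supervision manifest, as plain dicts
# (only the required fields are kept for the second, minimal segment).
segments = [
    {'id': 'segment-1', 'recording_id': 'recording-2', 'start': 0.1, 'duration': 0.3},
    {'id': 'segment-2', 'recording_id': 'recording-2', 'start': 0.5, 'duration': 0.4},
]

# Total supervised (annotated) time in the recording, in seconds.
total_supervised = sum(s['duration'] for s in segments)
```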
Standard data preparation recipes
We provide a number of standard data preparation recipes. By that, we mean a Python function paired with a CLI tool that creates the manifests given a corpus directory.
Supported corpora:

Aishell
AMI
BABEL
CallHome Egyptian
CallHome English
CMU Arctic
CMU Kids
CSLU Kids
DIHARD III
English Broadcast News 1997
GALE Arabic Broadcast Speech
GALE Mandarin Broadcast Speech
GigaSpeech
Heroico
L2 Arctic
LibriSpeech (including “mini”)
LibriTTS
LJ Speech
MiniLibriMix
MTEDx
MobvoiHotWord
Multilingual LibriSpeech (MLS)
MUSAN
National Speech Corpus (Singaporean English)
Switchboard
TED-LIUM v3
VCTK
Adding new corpora
General pointers:

Each corpus has a dedicated Python file in lhotse/recipes.

For publicly available corpora that can be freely downloaded, we usually define a function called download, download_and_untar, etc.

Each data preparation recipe should expose a single function called prepare_X, with X being the name of the corpus, that produces dicts like:

{'recordings': <RecordingSet>, 'supervisions': <SupervisionSet>}

for the data in that corpus.

When a corpus defines a standard split (e.g. train/dev/test), we return a dict with the following structure:

{'train': {'recordings': <RecordingSet>, 'supervisions': <SupervisionSet>}, 'dev': ...}

Some corpora (like LibriSpeech) come with pre-segmented recordings. In these cases, the SupervisionSegment will exactly match the Recording duration (and there will likely be exactly one segment corresponding to any recording).

Corpora with longer recordings (e.g. conversational, like Switchboard) should have exactly one Recording object corresponding to a single conversation/session, spanning its whole duration. Each speech segment in that recording should be represented as a SupervisionSegment with the same recording_id value.

Corpora with multiple channels for each session (e.g. AMI) should have a single Recording with multiple AudioSource objects, each corresponding to a separate channel.
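The return-value conventions above can be sketched as a toy recipe skeleton (prepare_toy_corpus is hypothetical; real recipes return RecordingSet and SupervisionSet objects, replaced here by placeholder strings so the sketch stays self-contained):

```python
# Hypothetical skeleton of a prepare_X recipe following the split/dict
# structure described above.
def prepare_toy_corpus(splits=('train', 'dev', 'test')):
    return {
        split: {
            'recordings': f'<RecordingSet: {split}>',    # placeholder
            'supervisions': f'<SupervisionSet: {split}>',  # placeholder
        }
        for split in splits
    }
```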