Cuts
Overview
Audio cuts are one of the main Lhotse features. A cut is a part of a recording, but it can be longer than a supervision segment, or even span multiple segments. The regions without a supervision are just audio that we make no assumptions about: there may be silence, noise, non-transcribed speech, etc. Task-specific datasets can leverage this information to generate masks for such regions.
class lhotse.cut.Cut

Caution: Cut is just an abstract class – the actual logic is implemented by its child classes (scroll down for references).

Cut is a base class for audio cuts. An "audio cut" is a subset of a Recording – it can also be thought of as a "view" or a pointer to a chunk of audio. It is not limited to audio data – cuts may also point to (sub-spans of) precomputed Features.

Cuts are different from SupervisionSegment in that they may be arbitrarily longer or shorter than supervisions; cuts may even contain multiple supervisions for creating contextual training data, and unsupervised regions that provide real or synthetic acoustic background context for the supervised segments.

The following example visualizes how a cut may represent a part of a single-channel recording with two utterances and some background noise in between:

Recording
|-------------------------------------------|
  "Hey, Matt!"       "Yes?"    "Oh, nothing"
  |----------|       |----|    |-----------|
Cut1
|------------------------|

This scenario can be represented in code, using MonoCut, as:

>>> from lhotse import Recording, SupervisionSegment, MonoCut
>>> rec = Recording(id='rec1', duration=10.0, sampling_rate=8000, num_samples=80000, sources=[...])
>>> sups = [
...     SupervisionSegment(id='sup1', recording_id='rec1', start=0, duration=3.37, text='Hey, Matt!'),
...     SupervisionSegment(id='sup2', recording_id='rec1', start=4.5, duration=0.9, text='Yes?'),
...     SupervisionSegment(id='sup3', recording_id='rec1', start=6.9, duration=2.9, text='Oh, nothing'),
... ]
>>> cut = MonoCut(id='rec1-cut1', start=0.0, duration=6.0, channel=0, recording=rec,
...               supervisions=[sups[0], sups[1]])
Note: All Cut classes assume that the SupervisionSegment time boundaries are relative to the beginning of the cut. E.g. if the underlying Recording starts at 0s (always true), the cut starts at 100s, and the SupervisionSegment inside the cut starts at 3s, then the supervision really starts at the 103rd second of the recording. In some cases, the supervision might have a negative start, or a duration exceeding the duration of the cut; this means that the supervision in the recording extends beyond the cut.

A cut allows checking for and reading audio data or feature data:
>>> assert cut.has_recording
>>> samples = cut.load_audio()
>>> if cut.has_features:
...     feats = cut.load_features()
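The cut-relative time convention from the note above amounts to simple offset arithmetic. A minimal sketch, using plain dataclasses as hypothetical stand-ins for Lhotse's MonoCut and SupervisionSegment:

```python
# Sketch: converting a supervision's cut-relative start time into an
# absolute time within the recording. Simplified, illustrative classes.
from dataclasses import dataclass

@dataclass
class Supervision:
    start: float      # relative to the cut's start, in seconds
    duration: float

@dataclass
class Cut:
    start: float      # offset into the recording, in seconds
    duration: float

def absolute_start(cut: Cut, sup: Supervision) -> float:
    """Map a cut-relative supervision start to recording time."""
    return cut.start + sup.start

cut = Cut(start=100.0, duration=6.0)
sup = Supervision(start=3.0, duration=2.0)
print(absolute_start(cut, sup))  # the supervision really starts at 103.0 s
```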
It can be visualized, and listened to, inside Jupyter Notebooks:
>>> cut.plot_audio()
>>> cut.play_audio()
>>> cut.plot_features()
Cuts can be used with Lhotse's FeatureExtractor to compute features:

>>> from lhotse import Fbank
>>> feats = cut.compute_features(extractor=Fbank())
It is also possible to use a FeaturesWriter to store the features and attach their manifest to a copy of the cut:

>>> from lhotse import LilcomHdf5Writer
>>> with LilcomHdf5Writer('feats.h5') as storage:
...     cut_with_feats = cut.compute_and_store_features(
...         extractor=Fbank(),
...         storage=storage
...     )
Cuts have several methods that allow their manipulation, transformation, and mixing. Some examples (see the respective methods' documentation for details):

>>> cut_2_to_4s = cut.truncate(offset=2, duration=2)
>>> cut_padded = cut.pad(duration=10.0)
>>> cut_mixed = cut.mix(other_cut, offset_other_by=5.0, snr=20)
>>> cut_append = cut.append(other_cut)
>>> cut_24k = cut.resample(24000)
>>> cut_sp = cut.perturb_speed(1.1)
>>> cut_vp = cut.perturb_volume(2.)
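These transforms are non-mutating and evaluated lazily. A minimal sketch of that pattern, with illustrative names that do not reflect Lhotse's actual internals:

```python
# Sketch of the lazy-transform pattern: each operation records what to do,
# returns a modified copy, and the work happens only when audio is loaded.

class LazyCut:
    def __init__(self, samples, transforms=None):
        self._samples = samples                 # stands in for on-disk audio
        self._transforms = transforms or []     # queued, not yet applied

    def perturb_volume(self, factor):
        # Return a modified copy; the original cut stays untouched.
        new_op = lambda xs: [x * factor for x in xs]
        return LazyCut(self._samples, self._transforms + [new_op])

    def load_audio(self):
        # Only now are the queued transforms actually applied.
        samples = list(self._samples)
        for fn in self._transforms:
            samples = fn(samples)
        return samples

cut = LazyCut([0.1, -0.2, 0.3])
louder = cut.perturb_volume(2.0)
print(cut.load_audio())     # original unchanged: [0.1, -0.2, 0.3]
print(louder.load_audio())  # [0.2, -0.4, 0.6]
```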
Note: All cut transformations are performed lazily, on-the-fly, upon calling load_audio or load_features. The stored waveforms and features are untouched.

Caution: Operations on cuts are not mutating – they return modified copies of Cut objects, leaving the original object unmodified.

A Cut that contains multiple segments (SupervisionSegment) can be split into smaller cuts that correspond directly to supervisions:

>>> smaller_cuts = cut.trim_to_supervisions()
Cuts can be detached from parts of their metadata:
>>> cut_no_feat = cut.drop_features()
>>> cut_no_rec = cut.drop_recording()
>>> cut_no_sup = cut.drop_supervisions()
Finally, cuts provide convenience methods to compute feature frame and audio sample masks for supervised regions:
>>> sup_frames = cut.supervisions_feature_mask()
>>> sup_samples = cut.supervisions_audio_mask()
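The idea behind such masks can be sketched in plain Python: mark samples inside supervised regions with 1 and everything else with 0. The helper below is a hypothetical illustration, not Lhotse's implementation:

```python
# Sketch of a sample-level supervision mask: 1 inside supervised regions,
# 0 elsewhere (silence, noise, non-transcribed speech).

def supervisions_audio_mask(cut_duration, supervisions, sampling_rate):
    """supervisions: list of (start, duration) pairs relative to the cut."""
    num_samples = int(round(cut_duration * sampling_rate))
    mask = [0] * num_samples
    for start, duration in supervisions:
        begin = max(0, int(round(start * sampling_rate)))
        end = min(num_samples, int(round((start + duration) * sampling_rate)))
        for i in range(begin, end):
            mask[i] = 1
    return mask

# A 1-second cut at 10 Hz with speech from 0.2s to 0.5s:
mask = supervisions_audio_mask(1.0, [(0.2, 0.3)], sampling_rate=10)
print(mask)  # [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]
```

A frame-level mask follows the same logic with the feature frame shift in place of the sampling period.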
class lhotse.cut.CutSet(cuts=None)

CutSet represents a collection of cuts, indexed by cut IDs. CutSet ties together all types of data – audio, features, and supervisions – and is suitable to represent training/dev/test sets.

Note: CutSet is the basic building block of PyTorch-style Datasets for speech/audio processing tasks.

When coming from Kaldi, there is really no good equivalent – the closest concept may be Kaldi's "egs" for training neural networks, which are chunks of feature matrices and corresponding alignments used respectively as inputs and supervisions. CutSet is different because it provides you with all kinds of metadata, and you can select just the interesting bits to feed to your models.

CutSet can be created from any combination of RecordingSet, SupervisionSet, and FeatureSet with lhotse.cut.CutSet.from_manifests():

>>> from lhotse import CutSet
>>> cuts = CutSet.from_manifests(recordings=my_recording_set)
>>> cuts2 = CutSet.from_manifests(features=my_feature_set)
>>> cuts3 = CutSet.from_manifests(
...     recordings=my_recording_set,
...     features=my_feature_set,
...     supervisions=my_supervision_set,
... )
When creating a CutSet with CutSet.from_manifests(), the resulting cuts will have the same duration as the input recordings or features. For long recordings, that is not viable for training. We provide several methods to transform the cuts into shorter ones.

Consider the following scenario:

Recording
|-------------------------------------------|
  "Hey, Matt!"       "Yes?"    "Oh, nothing"
  |----------|       |----|    |-----------|

.......... CutSet.from_manifests() ..........
Cut1
|-------------------------------------------|

............. Example CutSet A ..............
Cut1                 Cut2      Cut3
  |----------|       |----|    |-----------|

............. Example CutSet B ..............
Cut1                   Cut2
|---------------------||--------------------|

............. Example CutSet C ..............
       Cut1        Cut2
              |---|      |------|

The CutSets A, B, and C can be created like:

>>> cuts_A = cuts.trim_to_supervisions()
>>> cuts_B = cuts.cut_into_windows(duration=5.0)
>>> cuts_C = cuts.trim_to_unsupervised_segments()
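The windowing behind example B amounts to chopping a long span into fixed-length pieces with a shorter remainder at the end. A minimal sketch of that idea (illustrative code, not Lhotse's implementation):

```python
# Sketch of the windowing idea behind cuts.cut_into_windows(duration=...):
# a long span becomes fixed-length (start, duration) windows, and the last
# window keeps whatever remainder is left.

def cut_into_windows(total_duration, window):
    windows = []
    start = 0.0
    while start < total_duration:
        duration = min(window, total_duration - start)
        windows.append((start, duration))
        start += window
    return windows

print(cut_into_windows(12.0, 5.0))  # [(0.0, 5.0), (5.0, 5.0), (10.0, 2.0)]
```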
Note: Some operations support parallel execution via an optional num_jobs parameter. By default, all processing is single-threaded.

Caution: Operations on cut sets are not mutating – they return modified copies of CutSet objects, leaving the original object unmodified (and all of its cuts are also unmodified).

CutSet can be stored in and read from JSON, JSONL, etc., with optional gzip compression:

>>> cuts.to_file('cuts.jsonl.gz')
>>> cuts4 = CutSet.from_file('cuts.jsonl.gz')
It behaves similarly to a dict:

>>> 'rec1-1-0' in cuts
True
>>> cut = cuts['rec1-1-0']
>>> for cut in cuts:
...     pass
>>> len(cuts)
127
CutSet has some convenience properties and methods to gather information about the dataset:

>>> ids = list(cuts.ids)
>>> speaker_id_set = cuts.speakers
>>> # The following prints a message:
>>> cuts.describe()
Cuts count: 547
Total duration (hours): 326.4
Speech duration (hours): 79.6 (24.4%)
***
Duration statistics (seconds):
mean     2148.0
std       870.9
min       477.0
25%      1523.0
50%      2157.0
75%      2423.0
max      5415.0
dtype: float64
Manipulation examples:
>>> longer_than_5s = cuts.filter(lambda c: c.duration > 5)
>>> first_100 = cuts.subset(first=100)
>>> split_into_4 = cuts.split(num_splits=4)
>>> shuffled = cuts.shuffle()
>>> random_sample = cuts.sample(n_cuts=10)
>>> new_ids = cuts.modify_ids(lambda c: c.id + '-newid')
These operations can be composed to implement more complex operations, e.g. bucketing by duration:
>>> buckets = cuts.sort_by_duration().split(num_splits=30)
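The sort-then-split composition above can be sketched with plain Python lists standing in for cuts. The helper below is illustrative, not Lhotse's implementation:

```python
# Sketch of duration bucketing: sort items by duration, then split into
# equal-sized chunks so each bucket holds similar-length items.

def sort_and_split(durations, num_splits):
    ordered = sorted(durations)
    chunk = (len(ordered) + num_splits - 1) // num_splits  # ceil division
    return [ordered[i:i + chunk] for i in range(0, len(ordered), chunk)]

buckets = sort_and_split([3.2, 1.1, 9.8, 4.5, 2.0, 7.7], num_splits=3)
print(buckets)  # [[1.1, 2.0], [3.2, 4.5], [7.7, 9.8]]
```

Batching within a bucket then yields mini-batches with little padding, since the durations inside each bucket are close.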
Cuts in a CutSet can be detached from parts of their metadata:

>>> cuts_no_feat = cuts.drop_features()
>>> cuts_no_rec = cuts.drop_recordings()
>>> cuts_no_sup = cuts.drop_supervisions()
Sometimes specific sorting patterns are useful when a small CutSet represents a mini-batch:
>>> cuts = cuts.sort_by_duration(ascending=False)
>>> cuts = cuts.sort_like(other_cuts)
CutSet offers some batch processing operations:

>>> cuts = cuts.pad(num_frames=300)  # or duration=30.0
>>> cuts = cuts.truncate(max_duration=30.0, offset_type='start')  # truncate from start to 30.0s
>>> cuts = cuts.mix(other_cuts, snr=[10, 30], mix_prob=0.5)
CutSet supports lazy data augmentation/transformation methods which require adjusting some information in the manifest (e.g., num_samples or duration). Note that in the following examples, the audio is untouched – the operations are stored in the manifest and executed upon reading the audio:

>>> cuts_sp = cuts.perturb_speed(factor=1.1)
>>> cuts_vp = cuts.perturb_volume(factor=2.)
>>> cuts_24k = cuts.resample(24000)
Caution: If the CutSet contained Features manifests, they will be detached after performing audio augmentations such as CutSet.perturb_speed(), CutSet.resample(), or CutSet.perturb_volume().

CutSet offers parallel feature extraction capabilities (see CutSet.compute_and_store_features() for details), and can be used to estimate global mean and variance:

>>> from lhotse import Fbank
>>> cuts = CutSet()
>>> cuts = cuts.compute_and_store_features(
...     extractor=Fbank(),
...     storage_path='/data/feats',
...     num_jobs=4
... )
>>> mvn_stats = cuts.compute_global_feature_stats('/data/features/mvn_stats.pkl', max_cuts=10000)
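Global mean/variance statistics can be accumulated in a single pass, without holding all feature matrices in memory at once. A sketch of that accumulator (illustrative, not Lhotse's implementation):

```python
# Sketch: one-pass global mean and standard deviation over feature values,
# accumulating count, sum, and sum of squares.
import math

def global_stats(feature_matrices):
    count, total, total_sq = 0, 0.0, 0.0
    for matrix in feature_matrices:      # each matrix: a list of frames
        for frame in matrix:
            for value in frame:
                count += 1
                total += value
                total_sq += value * value
    mean = total / count
    variance = total_sq / count - mean * mean   # E[x^2] - (E[x])^2
    return mean, math.sqrt(variance)

mean, std = global_stats([[[1.0, 3.0]], [[5.0, 7.0]]])
print(mean, std)  # mean 4.0, std ~2.236 (sqrt of 5)
```

Per-dimension statistics for mean/variance normalization follow the same pattern with one accumulator per feature dimension.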
Types of cuts

There are three cut classes: MonoCut, MixedCut, and PaddingCut. They are described below in more detail.
class lhotse.cut.MonoCut(id: str, start: float, duration: float, channel: int, supervisions: List[lhotse.supervision.SupervisionSegment] = <factory>, features: Optional[lhotse.features.base.Features] = None, recording: Optional[lhotse.audio.Recording] = None)

MonoCut is a Cut of a single channel of a Recording. In addition to Cut, it has a specified channel attribute. This is the most commonly used type of cut.

Please refer to the documentation of Cut to learn more about using cuts.
class lhotse.cut.MixedCut(id: str, tracks: List[lhotse.cut.MixTrack])

MixedCut is a Cut that actually consists of multiple other cuts. It can be interpreted as a multi-channel cut, but its primary purpose is to allow time-domain and feature-domain augmentation via mixing the training cuts with noise, music, and babble cuts. The actual mixing operations are performed on-the-fly.

Internally, MixedCut holds the other cuts in multiple tracks (MixTrack), each with its own offset and SNR relative to the first track.

Please refer to the documentation of Cut to learn more about using cuts.

In addition to the methods available in Cut, MixedCut provides methods to read the audio and features of all of its tracks as separate channels:

>>> cut = MixedCut(...)
>>> mono_features = cut.load_features()
>>> assert len(mono_features.shape) == 2
>>> multi_features = cut.load_features(mixed=False)
>>> # Now, the first dimension is the channel.
>>> assert len(multi_features.shape) == 3
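The per-track SNR controls how much the non-reference tracks are attenuated or amplified before summation. A sketch of the underlying decibel arithmetic (illustrative math, not Lhotse's mixing code):

```python
# Sketch: scale a noise track so that the signal-to-noise ratio (in dB)
# of the mix matches the requested value, then sum sample-by-sample.
import math

def energy(samples):
    return sum(s * s for s in samples)

def mix(signal, noise, snr_db):
    # Choose gain so that 10 * log10(E_signal / E_scaled_noise) == snr_db.
    gain = math.sqrt(energy(signal) / (energy(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(signal, noise)]

signal = [0.5, -0.5, 0.5, -0.5]
noise = [0.1, 0.1, -0.1, -0.1]
mixed = mix(signal, noise, snr_db=10)
print(len(mixed))  # same length as the inputs: 4
```

Higher SNR values therefore mean quieter noise relative to the first (reference) track.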
class lhotse.cut.PaddingCut(id: str, duration: float, sampling_rate: int, feat_value: float, num_frames: Optional[int] = None, num_features: Optional[int] = None, frame_shift: Optional[float] = None, num_samples: Optional[int] = None)

PaddingCut is a dummy Cut that doesn't refer to actual recordings or features – it simply returns zero samples in the time domain and a specified feature value in the feature domain. Its main role is to be appended to other cuts to make them evenly sized.

Please refer to the documentation of Cut to learn more about using cuts.
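The behavior described above can be sketched with a simple stand-in class (hypothetical, not Lhotse's PaddingCut):

```python
# Sketch: a padding cut returns zeros in the time domain and a constant
# value in the feature domain; no recording or feature storage is involved.

class SimplePaddingCut:
    def __init__(self, duration, sampling_rate, feat_value, num_frames, num_features):
        self.duration = duration
        self.sampling_rate = sampling_rate
        self.feat_value = feat_value
        self.num_frames = num_frames
        self.num_features = num_features

    def load_audio(self):
        # Silence: all-zero samples for the requested duration.
        return [0.0] * int(self.duration * self.sampling_rate)

    def load_features(self):
        # A constant feature matrix, e.g. a low log-energy value.
        return [[self.feat_value] * self.num_features for _ in range(self.num_frames)]

pad = SimplePaddingCut(duration=0.5, sampling_rate=8, feat_value=-23.0,
                       num_frames=4, num_features=2)
print(pad.load_audio())     # [0.0, 0.0, 0.0, 0.0]
```

Appending such a cut to a shorter real cut is how padding to a fixed duration or frame count works.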
Each of these types has additional attributes that are not common to all of them – e.g., it makes sense to specify start for MonoCut to locate it in the source recording, but it is undefined for MixedCut and PaddingCut.
CLI
We provide a limited CLI to manipulate Lhotse manifests. Some examples of how to perform manipulations in the terminal:
# Reject short segments
lhotse yaml filter 'duration>=3.0' cuts.jsonl cuts-3s.jsonl
# Pad short segments to 5 seconds.
lhotse cut pad --duration 5.0 cuts-3s.jsonl cuts-5s-pad.jsonl
# Truncate longer segments to 5 seconds.
lhotse cut truncate --max-duration 5.0 --offset-type random cuts-5s-pad.jsonl cuts-5s.jsonl