Cuts

Overview

Audio cuts are one of the main Lhotse features. A cut is a part of a recording, but it can be longer than a supervision segment, or even span multiple segments. The regions without a supervision are just audio about which we make no assumptions - they may contain silence, noise, non-transcribed speech, etc. Task-specific datasets can leverage this information to generate masks for such regions.

class lhotse.cut.Cut[source]

Caution

Cut is just an abstract class – the actual logic is implemented by its child classes (scroll down for references).

Cut is a base class for audio cuts. An “audio cut” is a subset of a Recording – it can also be thought of as a “view” or a pointer to a chunk of audio. It is not limited to audio data – cuts may also point to (sub-spans of) precomputed Features.

Cuts are different from SupervisionSegment in that they may be arbitrarily longer or shorter than supervisions; cuts may even contain multiple supervisions for creating contextual training data, and unsupervised regions that provide real or synthetic acoustic background context for the supervised segments.

The following example visualizes how a cut may represent a part of a single-channel recording with two utterances and some background noise in between:

                  Recording
|-------------------------------------------|
"Hey, Matt!"     "Yes?"        "Oh, nothing"
|----------|     |----|        |-----------|
           Cut1
|------------------------|

This scenario can be represented in code, using MonoCut, as:

>>> from lhotse import Recording, SupervisionSegment, MonoCut
>>> rec = Recording(id='rec1', duration=10.0, sampling_rate=8000, num_samples=80000, sources=[...])
>>> sups = [
...     SupervisionSegment(id='sup1', recording_id='rec1', start=0, duration=3.37, text='Hey, Matt!'),
...     SupervisionSegment(id='sup2', recording_id='rec1', start=4.5, duration=0.9, text='Yes?'),
...     SupervisionSegment(id='sup3', recording_id='rec1', start=6.9, duration=2.9, text='Oh, nothing'),
... ]
>>> cut = MonoCut(id='rec1-cut1', start=0.0, duration=6.0, channel=0, recording=rec,
...     supervisions=[sups[0], sups[1]])

Note

All Cut classes assume that the SupervisionSegment time boundaries are relative to the beginning of the cut. E.g. if the underlying Recording starts at 0s (always true), the cut starts at 100s, and the SupervisionSegment inside the cut starts at 3s, it really starts at the 103rd second of the recording. In some cases, the supervision might have a negative start, or a duration exceeding the duration of the cut; this means that the supervision in the recording extends beyond the cut.
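For example, here is a minimal sketch (using the cut defined above and the default keep_excessive_supervisions=True) of how truncation shifts the cut-relative supervision times:

>>> shifted = cut.truncate(offset=2.0, duration=4.0)
>>> # 'sup1' started at 0.0s within the original cut; within the truncated cut
>>> # it starts 2 seconds before the cut's beginning, i.e. at a negative offset:
>>> shifted.supervisions[0].start
-2.0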

Cut allows you to check for and read audio or feature data:

>>> assert cut.has_recording
>>> samples = cut.load_audio()
>>> if cut.has_features:
...     feats = cut.load_features()

It can be visualized and listened to inside Jupyter notebooks:

>>> cut.plot_audio()
>>> cut.play_audio()
>>> cut.plot_features()

Cuts can be used with Lhotse’s FeatureExtractor to compute features.

>>> from lhotse import Fbank
>>> feats = cut.compute_features(extractor=Fbank())

It is also possible to use a FeaturesWriter to store the features and attach their manifest to a copy of the cut:

>>> from lhotse import LilcomChunkyWriter
>>> with LilcomChunkyWriter('feats.lca') as storage:
...     cut_with_feats = cut.compute_and_store_features(
...         extractor=Fbank(),
...         storage=storage
...     )

Cuts have several methods that allow their manipulation, transformation, and mixing. Some examples (see the respective methods' documentation for details):

>>> cut_2_to_4s = cut.truncate(offset=2, duration=2)
>>> cut_padded = cut.pad(duration=10.0)
>>> cut_extended = cut.extend_by(duration=5.0, direction='both')
>>> cut_mixed = cut.mix(other_cut, offset_other_by=5.0, snr=20)
>>> cut_append = cut.append(other_cut)
>>> cut_24k = cut.resample(24000)
>>> cut_sp = cut.perturb_speed(1.1)
>>> cut_vp = cut.perturb_volume(2.)
>>> cut_rvb = cut.reverb_rir(rir_recording)

Note

All cut transformations are performed lazily, on-the-fly, upon calling load_audio or load_features. The stored waveforms and features are untouched.
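As a rough illustration (a sketch, not part of the original examples), nothing is read from disk until the audio is actually requested:

>>> cut_sp = cut.perturb_speed(1.1)  # only the manifest is modified at this point
>>> samples = cut_sp.load_audio()    # the audio is read and speed-perturbed here, on-the-fly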

Caution

Operations on cuts are not mutating – they return modified copies of Cut objects, leaving the original object unmodified.
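For instance, a minimal sketch based on the 6-second cut defined earlier:

>>> truncated = cut.truncate(duration=3.0)
>>> truncated.duration
3.0
>>> cut.duration  # the original cut is left untouched
6.0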

A Cut that contains multiple segments (SupervisionSegment) can be split into smaller cuts that correspond directly to its supervisions:

>>> smaller_cuts = cut.trim_to_supervisions()

Cuts can be detached from parts of their metadata:

>>> cut_no_feat = cut.drop_features()
>>> cut_no_rec = cut.drop_recording()
>>> cut_no_sup = cut.drop_supervisions()

Finally, cuts provide convenience methods to compute feature frame and audio sample masks for supervised regions:

>>> sup_frames = cut.supervisions_feature_mask()
>>> sup_samples = cut.supervisions_audio_mask()

class lhotse.cut.CutSet(cuts=None)[source]

CutSet represents a collection of cuts. A CutSet ties together all types of data (audio, features, and supervisions) and is suitable for representing training/dev/test sets.

CutSet can be either “lazy” (acts as an iterable) which is best for representing full datasets, or “eager” (acts as a list), which is best for representing individual mini-batches (and sometimes test/dev datasets). Almost all operations are available for both modes, but some of them are more efficient depending on the mode (e.g. indexing an “eager” manifest is O(1)).
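A sketch of the distinction is shown below; it assumes CutSet.from_jsonl_lazy() and CutSet.to_eager(), whose availability may vary between Lhotse versions:

>>> cuts_lazy = CutSet.from_jsonl_lazy('cuts.jsonl.gz')  # iterates the file on demand
>>> cuts_eager = cuts_lazy.to_eager()                    # materializes all cuts in memory; O(1) indexing by ID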

Note

CutSet is the basic building block of PyTorch-style Datasets for speech/audio processing tasks.

When coming from Kaldi, there is really no good equivalent – the closest concept may be Kaldi’s “egs” for training neural networks, which are chunks of feature matrices and corresponding alignments used respectively as inputs and supervisions. CutSet is different because it provides you with all kinds of metadata, and you can select just the interesting bits to feed to your models.

CutSet can be created from any combination of RecordingSet, SupervisionSet, and FeatureSet with lhotse.cut.CutSet.from_manifests():

>>> from lhotse import CutSet
>>> cuts = CutSet.from_manifests(recordings=my_recording_set)
>>> cuts2 = CutSet.from_manifests(features=my_feature_set)
>>> cuts3 = CutSet.from_manifests(
...     recordings=my_recording_set,
...     features=my_feature_set,
...     supervisions=my_supervision_set,
... )

When creating a CutSet with CutSet.from_manifests(), the resulting cuts have the same duration as the input recordings or features. For long recordings this is not practical for training, so we provide several methods to transform the cuts into shorter ones.

Consider the following scenario:

                  Recording
|-------------------------------------------|
"Hey, Matt!"     "Yes?"        "Oh, nothing"
|----------|     |----|        |-----------|

.......... CutSet.from_manifests() ..........
                    Cut1
|-------------------------------------------|

............. Example CutSet A ..............
    Cut1          Cut2              Cut3
|----------|     |----|        |-----------|

............. Example CutSet B ..............
          Cut1                  Cut2
|---------------------||--------------------|

............. Example CutSet C ..............
             Cut1        Cut2
            |---|      |------|

CutSets A, B, and C can be created like this:

>>> cuts_A = cuts.trim_to_supervisions()
>>> cuts_B = cuts.cut_into_windows(duration=5.0)
>>> cuts_C = cuts.trim_to_unsupervised_segments()

Note

Some operations support parallel execution via an optional num_jobs parameter. By default, all processing is single-threaded.
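For example, a sketch assuming that trim_to_supervisions() accepts num_jobs in your Lhotse version:

>>> cuts_trimmed = cuts.trim_to_supervisions(num_jobs=4)  # runs in 4 parallel workers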

Caution

Operations on cut sets are not mutating – they return modified copies of CutSet objects, leaving the original object unmodified (and all of its cuts are also unmodified).

CutSet can be stored to and read from JSON, JSONL, etc., with optional gzip compression:

>>> cuts.to_file('cuts.jsonl.gz')
>>> cuts4 = CutSet.from_file('cuts.jsonl.gz')

It behaves similarly to a dict:

>>> 'rec1-1-0' in cuts
True
>>> cut = cuts['rec1-1-0']
>>> for cut in cuts:
...     pass
>>> len(cuts)
127

CutSet has some convenience properties and methods to gather information about the dataset:

>>> ids = list(cuts.ids)
>>> speaker_id_set = cuts.speakers
>>> # The following prints a message:
>>> cuts.describe()
Cuts count: 547
Total duration (hours): 326.4
Speech duration (hours): 79.6 (24.4%)
***
Duration statistics (seconds):
mean    2148.0
std      870.9
min      477.0
25%     1523.0
50%     2157.0
75%     2423.0
max     5415.0
dtype: float64

Manipulation examples:

>>> longer_than_5s = cuts.filter(lambda c: c.duration > 5)
>>> first_100 = cuts.subset(first=100)
>>> split_into_4 = cuts.split(num_splits=4)
>>> shuffled = cuts.shuffle()
>>> random_sample = cuts.sample(n_cuts=10)
>>> new_ids = cuts.modify_ids(lambda c: c.id + '-newid')

These operations can be composed to implement more complex operations, e.g. bucketing by duration:

>>> buckets = cuts.sort_by_duration().split(num_splits=30)

Cuts in a CutSet can be detached from parts of their metadata:

>>> cuts_no_feat = cuts.drop_features()
>>> cuts_no_rec = cuts.drop_recordings()
>>> cuts_no_sup = cuts.drop_supervisions()

Sometimes specific sorting patterns are useful when a small CutSet represents a mini-batch:

>>> cuts = cuts.sort_by_duration(ascending=False)
>>> cuts = cuts.sort_like(other_cuts)

CutSet offers some batch processing operations:

>>> cuts = cuts.pad(num_frames=300)  # or duration=30.0
>>> cuts = cuts.truncate(max_duration=30.0, offset_type='start')  # truncate from start to 30.0s
>>> cuts = cuts.mix(other_cuts, snr=[10, 30], mix_prob=0.5)

CutSet supports lazy data augmentation/transformation methods which require adjusting some information in the manifest (e.g., num_samples or duration). Note that in the following examples, the audio is untouched – the operations are stored in the manifest, and executed upon reading the audio:

>>> cuts_sp = cuts.perturb_speed(factor=1.1)
>>> cuts_vp = cuts.perturb_volume(factor=2.)
>>> cuts_24k = cuts.resample(24000)
>>> cuts_rvb = cuts.reverb_rir(rir_recordings)

Caution

If the CutSet contained Features manifests, they will be detached after performing audio augmentations such as CutSet.perturb_speed(), CutSet.resample(), CutSet.perturb_volume(), or CutSet.reverb_rir().
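A rough sketch of this behavior (cuts_with_feats is a hypothetical CutSet whose cuts carry Features manifests):

>>> cuts_sp = cuts_with_feats.perturb_speed(factor=1.1)
>>> all(c.has_features for c in cuts_with_feats)  # the originals still have features
True
>>> any(c.has_features for c in cuts_sp)          # the perturbed copies do not
False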

CutSet offers parallel feature extraction capabilities (see CutSet.compute_and_store_features() for details), and can be used to estimate global mean and variance:

>>> from lhotse import Fbank
>>> cuts = CutSet()
>>> cuts = cuts.compute_and_store_features(
...     extractor=Fbank(),
...     storage_path='/data/feats',
...     num_jobs=4
... )
>>> mvn_stats = cuts.compute_global_feature_stats('/data/features/mvn_stats.pkl', max_cuts=10000)

Types of cuts

There are three cut classes, described below in more detail: MonoCut, MixedCut, and PaddingCut.

class lhotse.cut.MonoCut(id, start, duration, channel, supervisions=<factory>, features=None, recording=None, custom=None)[source]

MonoCut is a Cut of a single channel of a Recording. In addition to the attributes of Cut, it has a channel attribute specifying which channel of the recording it refers to. This is the most commonly used type of cut.

Please refer to the documentation of Cut to learn more about using cuts.

class lhotse.cut.MixedCut(id, tracks, transforms=None)[source]

MixedCut is a Cut that actually consists of multiple other cuts. Its primary purpose is to allow time-domain and feature-domain augmentation via mixing the training cuts with noise, music, and babble cuts. The actual mixing operations are performed on-the-fly.

Internally, MixedCut holds other cuts in multiple tracks (MixTrack), each with its own offset and SNR that is relative to the first track.
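The typical way to obtain a MixedCut is by mixing existing cuts; a minimal sketch, where noise_cut is a hypothetical MonoCut containing background noise:

>>> mixed = cut.mix(noise_cut, offset_other_by=1.0, snr=10)
>>> isinstance(mixed, MixedCut)
True
>>> len(mixed.tracks)  # the original cut plus the noise cut
2
>>> mixed.tracks[1].offset, mixed.tracks[1].snr
(1.0, 10)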

Please refer to the documentation of Cut to learn more about using cuts.

In addition to the methods available in Cut, MixedCut provides methods to read the audio and features of all its tracks as separate channels:

>>> cut = MixedCut(...)
>>> mono_features = cut.load_features()
>>> assert len(mono_features.shape) == 2
>>> multi_features = cut.load_features(mixed=False)
>>> # Now, the first dimension is the channel.
>>> assert len(multi_features.shape) == 3

Note

MixedCut is different from MultiCut, which is intended to represent multi-channel recordings that share the same supervisions.

Note

Each track in a MixedCut can be either a MonoCut, MultiCut, or PaddingCut.

Note

The transforms field is a list of dictionaries that describe the transformations applied to the mixed audio after the tracks are mixed.

class lhotse.cut.PaddingCut(id, duration, sampling_rate, feat_value, num_frames=None, num_features=None, frame_shift=None, num_samples=None, video=None, custom=None)[source]

PaddingCut is a dummy Cut that doesn’t refer to any actual recording or features – it simply returns zero samples in the time domain and a specified feature value in the feature domain. Its main role is to be appended to other cuts to make them evenly sized.
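In practice, PaddingCut instances are usually created implicitly by padding other cuts; a minimal sketch using the cut from the earlier examples:

>>> padded = cut.pad(duration=10.0)
>>> # pad() typically returns a MixedCut whose last track is a PaddingCut:
>>> [type(t.cut).__name__ for t in padded.tracks]
['MonoCut', 'PaddingCut']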

Please refer to the documentation of Cut to learn more about using cuts.

CLI

We provide a limited CLI to manipulate Lhotse manifests. Some examples of how to perform manipulations in the terminal:

# Reject short segments
lhotse filter 'duration>=3.0' cuts.jsonl cuts-3s.jsonl
# Pad short segments to 5 seconds.
lhotse cut pad --duration 5.0 cuts-3s.jsonl cuts-5s-pad.jsonl
# Truncate longer segments to 5 seconds.
lhotse cut truncate --max-duration 5.0 --offset-type random cuts-5s-pad.jsonl cuts-5s.jsonl