PyTorch Datasets

Caution

Lhotse datasets are still a work in progress and are subject to breaking changes.

We supply subclasses of torch.utils.data.Dataset for various audio/speech tasks. These datasets are created from CutSet objects and load the features from disk into memory on the fly. Each dataset accepts an optional root_dir argument, which is used as a prefix for the paths to features and audio.
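
A typical workflow is to build a CutSet, construct one of the dataset classes described below, and iterate over it directly or through a PyTorch DataLoader. The following is a minimal sketch; the manifest path 'cuts.json' and the choice of SpeechRecognitionDataset are illustrative only:

import torch
from lhotse import CutSet
from lhotse.dataset.speech_recognition import SpeechRecognitionDataset

cuts = CutSet.from_json('cuts.json')  # hypothetical manifest path

# Map-style datasets in this module can be wrapped in a regular DataLoader;
# batch_size=None disables automatic batching, which is convenient because the
# feature matrices have variable length.
dataset = SpeechRecognitionDataset(cuts)
loader = torch.utils.data.DataLoader(dataset, batch_size=None)

for item in loader:
    features = item['features']  # (T x F) tensor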

Currently, we provide the following:

class lhotse.dataset.diarization.DiarizationDataset(cuts, min_speaker_dim=None, global_speaker_ids=False)

A PyTorch Dataset for the speaker diarization task. Our assumptions about speaker diarization are the following:

  • we assume a single channel input (for now), which could be either a true mono signal or a beamforming result from a microphone array;

  • we assume that the supervision used for model training is a speech activity matrix, with one row dedicated to each speaker (either in the current cut or the whole dataset, depending on the settings). The columns correspond to feature frames. Each row is effectively a Voice Activity Detection supervision for a single speaker. This setup is somewhat inspired by the TS-VAD paper: https://arxiv.org/abs/2005.07272

Each item in this dataset is a dict of:

{
    'features': (T x F) tensor
    'speaker_activity': (num_speaker x T) tensor
}

Constructor arguments:

Parameters
  • cuts (CutSet) – a CutSet used to create the dataset object.

  • min_speaker_dim (Optional[int]) – optional int; when specified, it will enforce that the speaker dimension of the matrix is at least this value (useful for datasets like CHiME-6, where the number of speakers is always 4, but some cuts might have fewer speakers than that).

  • global_speaker_ids (bool) – a bool that indicates whether the same speaker should always retain the same row index in the speaker activity matrix (useful for speaker-dependent systems).

  • root_dir – a prefix path to be attached to the feature file paths.

__init__(cuts, min_speaker_dim=None, global_speaker_ids=False)

Initialize self. See help(type(self)) for accurate signature.
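
A minimal usage sketch (the manifest path is hypothetical; min_speaker_dim=4 mirrors the CHiME-6 example above):

from lhotse import CutSet
from lhotse.dataset.diarization import DiarizationDataset

cuts = CutSet.from_json('cuts.json')  # hypothetical manifest path

# Keep a consistent row index per speaker and force at least 4 rows,
# as in the CHiME-6 example above.
dataset = DiarizationDataset(cuts, min_speaker_dim=4, global_speaker_ids=True)

item = dataset[0]
features = item['features']          # (T x F)
activity = item['speaker_activity']  # (num_speaker x T)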

class lhotse.dataset.unsupervised.UnsupervisedDataset(cuts)

Dataset that contains no supervision - it only provides the features extracted from recordings. The returned features are a torch.Tensor of shape (T x F), where T is the number of frames, and F is the feature dimension.

__init__(cuts)

Initialize self. See help(type(self)) for accurate signature.

class lhotse.dataset.unsupervised.UnsupervisedWaveformDataset(cuts)

A variant of UnsupervisedDataset that provides waveform samples instead of features. The output is a tensor of shape (C, T), with C being the number of channels and T the number of audio samples. In this implementation, there will always be a single channel.

class lhotse.dataset.unsupervised.DynamicUnsupervisedDataset(feature_extractor, cuts, augment_fn=None)

An example dataset that shows how to use on-the-fly feature extraction in Lhotse. It accepts two additional inputs: a FeatureExtractor and an optional WavAugmenter for time-domain data augmentation. The output is approximately the same as that of UnsupervisedDataset - there might be slight differences for MixedCuts, because this dataset mixes them in the time domain, while UnsupervisedDataset does that in the feature domain. Cuts that are not mixed will yield identical results in both dataset classes.

__init__(feature_extractor, cuts, augment_fn=None)

Initialize self. See help(type(self)) for accurate signature.
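
A minimal sketch of on-the-fly extraction, assuming the Fbank extractor from the top-level lhotse package and a hypothetical manifest path:

from lhotse import CutSet, Fbank
from lhotse.dataset.unsupervised import DynamicUnsupervisedDataset

cuts = CutSet.from_json('cuts.json')  # hypothetical manifest path

# Compute log-Mel filter bank features on the fly instead of reading
# pre-computed features from disk.
dataset = DynamicUnsupervisedDataset(feature_extractor=Fbank(), cuts=cuts)

features = dataset[0]  # (T x F) tensor, as in UnsupervisedDataset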

class lhotse.dataset.speech_recognition.SpeechRecognitionDataset(cuts)

The PyTorch Dataset for the speech recognition task. Each item in this dataset is a dict of:

{
    'features': (T x F) tensor,
    'text': string,
    'supervisions_mask': (T) tensor
}

The supervisions_mask field is a mask that specifies which frames are covered by a supervision, by assigning a value of 1 (in this case: segments with transcribed speech content), and which are not, by assigning a value of 0 (in this case: padding, contextual noise, or in general the acoustic context without transcription).

In the future, it will be extended with graph supervisions.

__init__(cuts)

Initialize self. See help(type(self)) for accurate signature.
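
A minimal sketch of one possible way to use the mask (the manifest path is hypothetical):

from lhotse import CutSet
from lhotse.dataset.speech_recognition import SpeechRecognitionDataset

cuts = CutSet.from_json('cuts.json')  # hypothetical manifest path
dataset = SpeechRecognitionDataset(cuts)

item = dataset[0]
feats = item['features']          # (T x F)
mask = item['supervisions_mask']  # (T,) - 1 for transcribed frames, 0 otherwise
text = item['text']

# One possible use of the mask: keep only frames covered by a supervision.
transcribed_feats = feats[mask.bool()]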

class lhotse.dataset.speech_recognition.K2SpeechRecognitionIterableDataset(cuts, max_frames=26000, max_cuts=None, shuffle=False, concat_cuts=True, concat_cuts_gap=1.0, concat_cuts_duration_factor=1)

The PyTorch Dataset for the speech recognition task using K2 library.

This dataset internally batches and collates the Cuts, and should be used with a PyTorch DataLoader with batch_size=None to work properly (see the usage sketch after the constructor parameters below). The batch size is determined automatically to satisfy the constraints of max_frames and max_cuts.

This dataset will automatically partition itself when used with a multiprocessing DataLoader (i.e. the same cut will not appear twice in the same epoch).

By default, we “pack” the batches to minimize the amount of padding - we achieve that by concatenating the cuts’ feature matrices with a small amount of silence (padding) in between.

Each item in this dataset is a dict of:

{
    'features': float tensor of shape (B, T, F)
    'supervisions': [
        {
            'cut_id': List[str] of len S
            'sequence_idx': Tensor[int] of shape (S,)
            'text': List[str] of len S
            'start_frame': Tensor[int] of shape (S,)
            'num_frames': Tensor[int] of shape (S,)
        }
    ]
}

Dimension symbols legend:

  • B - batch size (number of Cuts)

  • S - number of supervision segments (greater or equal to B, as each Cut may have multiple supervisions)

  • T - number of frames of the longest Cut

  • F - number of features

The ‘sequence_idx’ field is the index of the Cut used to create the example in the Dataset.

__init__(cuts, max_frames=26000, max_cuts=None, shuffle=False, concat_cuts=True, concat_cuts_gap=1.0, concat_cuts_duration_factor=1)

K2 ASR IterableDataset constructor.

Parameters
  • cuts (CutSet) – the CutSet to sample data from.

  • max_frames (int) – The maximum number of feature frames that we’re going to put in a single batch. The padding frames do not contribute to that limit, since we pack the batch by default to minimize the amount of padding.

  • max_cuts (Optional[int]) – The maximum number of cuts sampled to form a mini-batch. By default, this constraint is off.

  • shuffle (bool) – When True, the cuts will be shuffled at the start of iteration. Convenient when the mini-batch loop is nested inside an outer epoch-level loop, e.g. for epoch in range(10): for batch in dataset: …, as every epoch will see a different order of cuts.

  • concat_cuts (bool) – When True, we will concatenate the cuts to minimize the total amount of padding; e.g. instead of creating a batch with 40 examples, we will merge some of the examples together adding some silence between them to avoid a large number of padding frames that waste the computation. Enabled by default.

  • concat_cuts_gap (float) – The duration of silence in seconds that is inserted between the cuts; its goal is to let the model “know” that there are separate utterances in a single example.

  • concat_cuts_duration_factor (float) – Determines the maximum duration of the concatenated cuts; by default it’s 1, setting the limit at the duration of the longest cut in the batch.
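
A minimal usage sketch (the manifest path is hypothetical; note the required batch_size=None):

import torch
from lhotse import CutSet
from lhotse.dataset.speech_recognition import K2SpeechRecognitionIterableDataset

cuts = CutSet.from_json('cuts.json')  # hypothetical manifest path

dataset = K2SpeechRecognitionIterableDataset(cuts, max_frames=26000, shuffle=True)

# batch_size=None is required - the dataset batches and collates internally,
# and it partitions itself across DataLoader workers.
loader = torch.utils.data.DataLoader(dataset, batch_size=None, num_workers=2)

for epoch in range(10):
    for batch in loader:
        features = batch['features']          # (B, T, F)
        supervisions = batch['supervisions']  # per-segment fields described above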

lhotse.dataset.speech_recognition.concat_cuts(cuts, gap=1.0, max_duration=None)

We concatenate the cuts to minimize the total number of padding frames used. This is effectively a knapsack problem. In this initial implementation we use a greedy approach: starting from the back (i.e. the shortest cuts), we try to concatenate each one onto the longest cut that still has some “space” left at the end.

Parameters
  • cuts (List[Union[Cut, MixedCut, PaddingCut]]) – a list of cuts to pack.

  • gap (float) – the duration of silence inserted between concatenated cuts.

  • max_duration (Optional[float]) – the maximum duration for the concatenated cuts (by default set to the duration of the first cut).

Returns

a list of packed cuts.

Return type

List[Union[Cut, MixedCut, PaddingCut]]
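
A minimal sketch of packing a list of cuts (the manifest path is hypothetical; sorting by duration matches the greedy strategy described above):

from lhotse import CutSet
from lhotse.dataset.speech_recognition import concat_cuts

cuts = CutSet.from_json('cuts.json')  # hypothetical manifest path

# Sort from longest to shortest, matching the greedy packing described above,
# then pack with 1 second of silence between concatenated cuts.
cuts_by_duration = sorted(cuts, key=lambda cut: cut.duration, reverse=True)
packed = concat_cuts(cuts_by_duration, gap=1.0)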

class lhotse.dataset.speech_recognition.K2SpeechRecognitionDataset(cuts)

The PyTorch Dataset for the speech recognition task using K2 library. Each item in this dataset is a dict of:

{
    'features': (T x F) tensor,
    'supervisions': List[Dict] -> [
        {
            'sequence_idx': int
            'text': string,
            'start_frame': int,
            'num_frames': int
        } (multiplied N times, for each of the N supervisions present in the Cut)
    ]
}

The ‘sequence_idx’ field is the index of the Cut used to create the example in the Dataset. It is mapped to the batch index later in the DataLoader.

__init__(cuts)

Initialize self. See help(type(self)) for accurate signature.

class lhotse.dataset.speech_recognition.K2DataLoader(*args, **kwds)

A PyTorch DataLoader that has a custom collate_fn that complements the K2SpeechRecognitionDataset.

The ‘features’ tensor is collated in a standard way to return a tensor of shape (B, T, F).

The ‘supervisions’ dict contains the same fields as in K2SpeechRecognitionDataset, except that each sub-field (like ‘start_frame’) is a 1D PyTorch tensor with shape (B,). The ‘text’ sub-field is an exception - it’s a list of strings with length equal to batch size.

The ‘sequence_idx’ sub-field in ‘supervisions’, which originally points to index of the example in the Dataset, is remapped to the index of the corresponding features matrix in the collated ‘features’. Multiple supervisions coming from the same cut will share the same ‘sequence_idx’.

For an example, see test/dataset/test_speech_recognition_dataset.py::test_k2_dataloader().

__init__(*args, **kwargs)

Initialize self. See help(type(self)) for accurate signature.

Standard torch.utils.data.DataLoader attributes (dataset, batch_size, num_workers, pin_memory, drop_last, timeout, sampler, prefetch_factor) are inherited from the base class.
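
A minimal usage sketch (the manifest path is hypothetical):

from lhotse import CutSet
from lhotse.dataset.speech_recognition import K2DataLoader, K2SpeechRecognitionDataset

cuts = CutSet.from_json('cuts.json')  # hypothetical manifest path

dataset = K2SpeechRecognitionDataset(cuts)
loader = K2DataLoader(dataset, batch_size=8, shuffle=True)

for batch in loader:
    features = batch['features']          # (B, T, F)
    supervisions = batch['supervisions']  # collated per-segment fields
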
lhotse.dataset.speech_recognition.multi_supervision_collate_fn(batch)

Custom collate_fn for K2SpeechRecognitionDataset.

It merges the items provided by K2SpeechRecognitionDataset into the following structure:

{
    'features': float tensor of shape (B, T, F)
    'supervisions': [
        {
            'sequence_idx': Tensor[int] of shape (S,)
            'text': List[str] of len S
            'start_frame': Tensor[int] of shape (S,)
            'num_frames': Tensor[int] of shape (S,)
        }
    ]
}

Dimension symbols legend:

  • B - batch size (number of Cuts)

  • S - number of supervision segments (greater or equal to B, as each Cut may have multiple supervisions)

  • T - number of frames of the longest Cut

  • F - number of features

Return type

Dict

lhotse.dataset.speech_synthesis

A module providing datasets for the speech synthesis task.

class lhotse.dataset.source_separation.DynamicallyMixedSourceSeparationDataset(sources_set, mixtures_set, nonsources_set=None)

A PyTorch Dataset for the source separation task. It’s created from a number of CutSets:

  • sources_set: provides the audio cuts for the sources (the targets of source separation),

  • mixtures_set: provides the audio cuts for the signal mix (the input of source separation),

  • nonsources_set: (optional) provides the audio cuts for other signals that are in the mix, but are not the targets of source separation. Useful for adding noise.

When queried for data samples, it returns a dict of:

{
    'sources': (N x T x F) tensor,
    'mixture': (T x F) tensor,
    'real_mask': (N x T x F) tensor,
    'binary_mask': (T x F) tensor
}

This Dataset performs on-the-fly feature-domain mixing of the sources. It expects the mixtures_set to contain MixedCuts, so that it knows which Cuts should be mixed together.

__init__(sources_set, mixtures_set, nonsources_set=None)

Initialize self. See help(type(self)) for accurate signature.

class lhotse.dataset.source_separation.PreMixedSourceSeparationDataset(sources_set, mixtures_set)

A PyTorch Dataset for the source separation task. It’s created from two CutSets - one provides the audio cuts for the sources, and the other one the audio cuts for the signal mix. When queried for data samples, it returns a dict of:

{
    'sources': (N x T x F) tensor,
    'mixture': (T x F) tensor,
    'real_mask': (N x T x F) tensor,
    'binary_mask': (T x F) tensor
}

It expects both CutSets to return regular Cuts, meaning that the signals were mixed in the time domain. In contrast to DynamicallyMixedSourceSeparationDataset, no on-the-fly feature-domain-mixing is performed.

__init__(sources_set, mixtures_set)

Initialize self. See help(type(self)) for accurate signature.
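
A minimal usage sketch for the pre-mixed variant; DynamicallyMixedSourceSeparationDataset returns the same dict (the manifest paths are hypothetical):

from lhotse import CutSet
from lhotse.dataset.source_separation import PreMixedSourceSeparationDataset

# Hypothetical manifest paths - both classes return the dict shown above.
sources = CutSet.from_json('sources_cuts.json')
mixtures = CutSet.from_json('mixtures_cuts.json')

dataset = PreMixedSourceSeparationDataset(sources_set=sources, mixtures_set=mixtures)

item = dataset[0]
sources_feats = item['sources']    # (N x T x F)
mixture_feats = item['mixture']    # (T x F)
real_mask = item['real_mask']      # (N x T x F)
binary_mask = item['binary_mask']  # (T x F)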

class lhotse.dataset.vad.VadDataset(cuts)

The PyTorch Dataset for the voice activity detection task. Each item in this dataset is a dict of:

{
    'features': (T x F) tensor
    'is_voice': (T x 1) tensor
}

__init__(cuts)

Initialize self. See help(type(self)) for accurate signature.
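
A minimal usage sketch (the manifest path is hypothetical):

from lhotse import CutSet
from lhotse.dataset.vad import VadDataset

cuts = CutSet.from_json('cuts.json')  # hypothetical manifest path
dataset = VadDataset(cuts)

item = dataset[0]
features = item['features']  # (T x F)
is_voice = item['is_voice']  # (T x 1)

# Example use: fraction of voiced frames in this cut.
voiced_ratio = is_voice.float().mean().item()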