PyTorch Datasets

Lhotse supports PyTorch’s dataset API, providing implementations for the Dataset and Sampler concepts. They can be used together with the standard DataLoader class for efficient mini-batch collection with multiple parallel readers and pre-fetching.

A quick recap of PyTorch’s data API

PyTorch defines the Dataset class that is responsible for reading the data from disk/memory/Internet/database/etc., and converting it to tensors that can be used for network training or inference. These Datasets are typically "map-style" datasets: they are given an index (or a list of indices) and return the corresponding data samples.

The selection of indices is performed by the Sampler class. Knowing the length (number of items) of a Dataset, a Sampler can use various strategies to determine the order of elements to read (e.g. sequential or random reads).
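
To make this concrete, below is a minimal sketch of the vanilla PyTorch flow, using a toy map-style dataset (all names here are illustrative, not part of Lhotse):

import torch
from torch.utils.data import DataLoader, Dataset, RandomSampler

class ToyDataset(Dataset):
    """A map-style dataset: given an index, it returns one data sample."""
    def __init__(self, n_items: int = 100):
        self.data = torch.randn(n_items, 8)

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, idx: int) -> torch.Tensor:
        return self.data[idx]

dataset = ToyDataset()
# The sampler decides the order of indices; here, a random permutation.
loader = DataLoader(dataset, sampler=RandomSampler(dataset), batch_size=4)
for batch in loader:
    ...  # batch is a (4, 8) tensor collated by the default collate_fn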

More details about the data pipeline API in PyTorch can be found in the PyTorch documentation: https://pytorch.org/docs/stable/data.html

About Lhotse’s Datasets and Samplers

Lhotse provides a number of utilities that make it simpler to define Datasets for speech processing tasks. CutSet is the base data structure used to initialize the Dataset class. This makes it possible to manipulate the speech data in convenient ways: pad, mix, concatenate, augment, compute features, look up the supervision information, etc.

Lhotse’s Datasets will perform batching by themselves, because auto-collation in DataLoader is too limiting for speech data handling. These Datasets expect to be handed lists of element indices, so that they can collate the data before it is passed to the DataLoader (which must use batch_size=None). This allows for interesting collation methods - e.g. padding the speech with noise recordings, or actual acoustic context, rather than artificial zeroes; or dynamic batch sizes.

The items for mini-batch creation are selected by the Sampler. Lhotse defines Sampler classes that are initialized with CutSets, so that they can look up specific properties of an utterance to stratify the sampling. For example, SingleCutSampler has a max_frames attribute, and it will keep sampling cuts for a batch as long as they do not exceed the specified number of frames. Another strategy — used in BucketingSampler — will first group the cuts of similar durations into buckets, and then randomly select a bucket to draw the whole batch from.

For tasks where both input and output of the model are speech utterances, we can use the CutPairsSampler, which accepts two CutSets and will match the cuts in them by their IDs.

Typical usage of Lhotse’s dataset API might look like this:

from torch.utils.data import DataLoader
from lhotse import CutSet
from lhotse.dataset import SpeechRecognitionDataset, SingleCutSampler

# Load or create a CutSet describing the data, e.g. from a manifest file.
cuts = CutSet(...)
dset = SpeechRecognitionDataset(cuts)
sampler = SingleCutSampler(cuts, max_frames=50000)
# Dataset performs batching by itself, so we have to indicate that
# to the DataLoader with batch_size=None
dloader = DataLoader(dset, sampler=sampler, batch_size=None, num_workers=1)
for batch in dloader:
    ...  # process data

Datasets’ list

class lhotse.dataset.diarization.DiarizationDataset(cuts, min_speaker_dim=None, global_speaker_ids=False)

A PyTorch Dataset for the speaker diarization task. Our assumptions about speaker diarization are the following:

  • we assume a single channel input (for now), which could be either a true mono signal or a beamforming result from a microphone array.

  • we assume that the supervision used for model training is a speech activity matrix, with one row dedicated to each speaker (either in the current cut or the whole dataset, depending on the settings). The columns correspond to feature frames. Each row is effectively a Voice Activity Detection supervision for a single speaker. This setup is somewhat inspired by the TS-VAD paper: https://arxiv.org/abs/2005.07272

Each item in this dataset is a dict of:

{
    'features': (B x T x F) tensor
    'speaker_activity': (B x num_speaker x T) tensor
}

Constructor arguments:

Parameters
  • cuts (CutSet) – a CutSet used to create the dataset object.

  • min_speaker_dim (Optional[int]) – optional int; when specified, it will enforce that the matrix shape is at least that value (useful for datasets like CHiME-6, where the number of speakers is always 4, but some cuts might have fewer speakers than that).

  • global_speaker_ids (bool) – a bool that indicates whether the same speaker should always retain the same row index in the speaker activity matrix (useful for speaker-dependent systems).

__init__(cuts, min_speaker_dim=None, global_speaker_ids=False)

Initialize self. See help(type(self)) for accurate signature.
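
A minimal usage sketch, following the batching conventions described above (the manifest path is illustrative, and the exact sampler integration may differ per Lhotse version):

from torch.utils.data import DataLoader
from lhotse import CutSet
from lhotse.dataset.diarization import DiarizationDataset
from lhotse.dataset.sampling import SingleCutSampler

cuts = CutSet.from_json('cuts.json')  # illustrative manifest path
dataset = DiarizationDataset(cuts, global_speaker_ids=True)
sampler = SingleCutSampler(cuts, max_frames=20000)
loader = DataLoader(dataset, sampler=sampler, batch_size=None)
for batch in loader:
    features = batch['features']                  # (B, T, F)
    speaker_activity = batch['speaker_activity']  # (B, num_speaker, T)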

class lhotse.dataset.unsupervised.UnsupervisedDataset(cuts)

Dataset that contains no supervision - it only provides the features extracted from recordings. The returned features are a torch.Tensor of shape (T x F), where T is the number of frames, and F is the feature dimension.

__init__(cuts)

Initialize self. See help(type(self)) for accurate signature.

class lhotse.dataset.unsupervised.UnsupervisedWaveformDataset(cuts)

A variant of UnsupervisedDataset that provides waveform samples instead of features. The output is a tensor of shape (C, T), with C being the number of channels and T the number of audio samples. In this implementation, there will always be a single channel.

class lhotse.dataset.unsupervised.DynamicUnsupervisedDataset(feature_extractor, cuts, augment_fn=None)

An example dataset that shows how to use on-the-fly feature extraction in Lhotse. It accepts two additional inputs: a FeatureExtractor and an optional WavAugmenter for time-domain data augmentation. The output is approximately the same as that of the UnsupervisedDataset - there might be slight differences for MixedCuts, because this dataset mixes them in the time domain, while UnsupervisedDataset does that in the feature domain. Cuts that are not mixed will yield identical results in both dataset classes.

__init__(feature_extractor, cuts, augment_fn=None)

Initialize self. See help(type(self)) for accurate signature.
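
A minimal sketch of on-the-fly extraction with this dataset, using Lhotse’s Fbank extractor (the manifest path is illustrative):

from lhotse import CutSet, Fbank
from lhotse.dataset.unsupervised import DynamicUnsupervisedDataset

cuts = CutSet.from_json('cuts.json')  # illustrative manifest path
dataset = DynamicUnsupervisedDataset(
    feature_extractor=Fbank(),  # features are computed from audio on the fly
    cuts=cuts,
)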

class lhotse.dataset.speech_recognition.K2SpeechRecognitionDataset(cuts, return_cuts=False, cut_transforms=None)

The PyTorch Dataset for the speech recognition task using the K2 library.

This dataset expects to be queried with lists of cut IDs, for which it loads features and automatically collates/batches them.

To use it with a PyTorch DataLoader, set batch_size=None and provide a SingleCutSampler sampler.

Each item in this dataset is a dict of:

{
    'features': float tensor of shape (B, T, F)
    'supervisions': [
        {
            'sequence_idx': Tensor[int] of shape (S,)
            'text': List[str] of len S
            'start_frame': Tensor[int] of shape (S,)
            'num_frames': Tensor[int] of shape (S,)
            # Optionally, when return_cuts=True
            'cut': List[AnyCut] of len S
        }
    ]
}

Dimension symbols legend:

  • B - batch size (number of Cuts)

  • S - number of supervision segments (greater or equal to B, as each Cut may have multiple supervisions)

  • T - number of frames of the longest Cut

  • F - number of features

The ‘sequence_idx’ field is the index of the Cut used to create the example in the Dataset.

__init__(cuts, return_cuts=False, cut_transforms=None)

K2 ASR Dataset constructor.

Parameters
  • cuts (CutSet) – the CutSet to sample data from.

  • return_cuts (bool) – When True, will additionally return a “cut” field in each batch with the Cut objects used to create that batch.

  • cut_transforms (Optional[List[Callable[[CutSet], CutSet]]]) – A list of transforms to be applied on each sampled batch (e.g. cut concatenation, noise cuts mixing, etc.).
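
A minimal usage sketch tying the pieces together (the manifest path and max_frames value are illustrative):

from torch.utils.data import DataLoader
from lhotse import CutSet
from lhotse.dataset.sampling import SingleCutSampler
from lhotse.dataset.speech_recognition import K2SpeechRecognitionDataset

cuts = CutSet.from_json('cuts.json')  # illustrative manifest path
dataset = K2SpeechRecognitionDataset(cuts)
sampler = SingleCutSampler(cuts, max_frames=30000)
loader = DataLoader(dataset, sampler=sampler, batch_size=None)
for batch in loader:
    features = batch['features']          # (B, T, F)
    supervisions = batch['supervisions']  # structure shown above: 'text' is a
    # list of S strings; 'sequence_idx', 'start_frame', 'num_frames' are
    # int tensors of shape (S,)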

class lhotse.dataset.source_separation.DynamicallyMixedSourceSeparationDataset(sources_set, mixtures_set, nonsources_set=None)

A PyTorch Dataset for the source separation task. It’s created from a number of CutSets:

  • sources_set: provides the audio cuts for the sources (the targets of source separation),

  • mixtures_set: provides the audio cuts for the signal mix (the input of source separation),

  • nonsources_set: (optional) provides the audio cuts for other signals that are in the mix, but are not the targets of source separation. Useful for adding noise.

When queried for data samples, it returns a dict of:

{
    'sources': (N x T x F) tensor,
    'mixture': (T x F) tensor,
    'real_mask': (N x T x F) tensor,
    'binary_mask': (T x F) tensor
}

This Dataset performs on-the-fly feature-domain mixing of the sources. It expects the mixtures_set to contain MixedCuts, so that it knows which Cuts should be mixed together.

__init__(sources_set, mixtures_set, nonsources_set=None)

Initialize self. See help(type(self)) for accurate signature.

validate()

class lhotse.dataset.source_separation.PreMixedSourceSeparationDataset(sources_set, mixtures_set)

A PyTorch Dataset for the source separation task. It’s created from two CutSets - one provides the audio cuts for the sources, and the other one the audio cuts for the signal mix. When queried for data samples, it returns a dict of:

{
    'sources': (N x T x F) tensor,
    'mixture': (T x F) tensor,
    'real_mask': (N x T x F) tensor,
    'binary_mask': (T x F) tensor
}

It expects both CutSets to return regular Cuts, meaning that the signals were mixed in the time domain. In contrast to DynamicallyMixedSourceSeparationDataset, no on-the-fly feature-domain-mixing is performed.

__init__(sources_set, mixtures_set)

Initialize self. See help(type(self)) for accurate signature.

class lhotse.dataset.vad.VadDataset(cuts)

The PyTorch Dataset for the voice activity detection task. Each item in this dataset is a dict of:

{
    'features': (T x F) tensor,
    'is_voice': (T x 1) tensor,
    'cut': List[Cut]
}

__init__(cuts)

Initialize self. See help(type(self)) for accurate signature.

Samplers’ list

class lhotse.dataset.sampling.CutSampler(cut_ids, shuffle=False, world_size=None, rank=None, seed=0)

CutSampler is responsible for collecting batches of cuts, given specified criteria. It implements correct handling of distributed sampling in DataLoader, so that the cuts are not duplicated across workers.

Sampling in a CutSampler is intended to be very quick - it only uses the metadata in the CutSet manifest to select the cuts, and is not intended to perform any I/O.

CutSampler works similarly to PyTorch’s DistributedSampler - when shuffle=True, you should call sampler.set_epoch(epoch) at each new epoch to have a different ordering of returned elements.

Example usage:

>>> dataset = K2SpeechRecognitionDataset(cuts)
>>> sampler = SingleCutSampler(cuts, shuffle=True)
>>> loader = DataLoader(dataset, sampler=sampler, batch_size=None)
>>> for epoch in range(start_epoch, n_epochs):
...     sampler.set_epoch(epoch)
...     train(loader)

Note

For implementers of new samplers: Subclasses of CutSampler are expected to implement __next__() to introduce specific sampling logic (e.g. based on filters such as max number of frames/tokens/etc.). CutSampler defines __iter__(), which optionally shuffles the cut IDs and resets self.current_idx to zero (to be used and incremented inside of __next__()).
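
For illustration, here is a minimal sketch of such a subclass. The data_source attribute name is an assumption about the base class internals (only current_idx is documented above), so treat this as a sketch rather than a drop-in implementation:

from lhotse.dataset.sampling import CutSampler

class FixedSizeSampler(CutSampler):
    """A toy sampler that yields batches of at most max_cuts cut IDs."""

    def __init__(self, cut_ids, max_cuts=32, **kwargs):
        super().__init__(cut_ids, **kwargs)
        self.max_cuts = max_cuts

    def __next__(self):
        # self.data_source (this worker's cut IDs) is an assumed attribute.
        if self.current_idx >= len(self.data_source):
            raise StopIteration
        batch = self.data_source[self.current_idx : self.current_idx + self.max_cuts]
        self.current_idx += len(batch)
        return batch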

__init__(cut_ids, shuffle=False, world_size=None, rank=None, seed=0)
Parameters
  • cut_ids (Iterable[str]) – An iterable of cut IDs for the full dataset. CutSampler will take care of partitioning that into distributed workers (if needed).

  • shuffle (bool) – When True, the cuts will be shuffled at the start of iteration. Convenient when the mini-batch loop is nested inside an outer epoch-level loop (e.g. for epoch in range(10): for batch in dataset: ...), as every epoch will see a different order of cuts.

  • world_size (Optional[int]) – Total number of distributed nodes. We will try to infer it by default.

  • rank (Optional[int]) – Index of distributed node. We will try to infer it by default.

  • seed (int) – Random seed used to consistently shuffle the dataset across different processes.

set_epoch(epoch)

Sets the epoch for this sampler. When shuffle=True, this ensures all replicas use a different random ordering for each epoch. Otherwise, the next iteration of this sampler will yield the same ordering.

Parameters

epoch (int) – Epoch number.

Return type

None

class lhotse.dataset.sampling.SingleCutSampler(cuts, max_frames=26000, max_cuts=None, **kwargs)

Samples cuts from a CutSet to satisfy the criteria of max_frames and max_cuts. It behaves like an iterable that yields lists of strings (cut IDs).

__init__(cuts, max_frames=26000, max_cuts=None, **kwargs)

SingleCutSampler’s constructor.

Parameters
  • cuts (CutSet) – the CutSet to sample data from.

  • max_frames (int) – The maximum number of feature frames from cuts that we’re going to put in a single batch. The padding introduced during collation does not contribute to that limit.

  • max_cuts (Optional[int]) – The maximum number of cuts sampled to form a mini-batch. By default, this constraint is off.

  • kwargs – Arguments to be passed into CutSampler.

class lhotse.dataset.sampling.CutPairsSampler(source_cuts, target_cuts, max_source_frames=26000, max_target_frames=26000, max_cuts=None, **kwargs)

Samples pairs of cuts from a “source” and “target” CutSet. It expects that both CutSets strictly consist of Cuts with corresponding IDs. It will try to satisfy the criteria of max_source_frames, max_target_frames, and max_cuts. It behaves like an iterable that yields lists of strings (cut IDs).

__init__(source_cuts, target_cuts, max_source_frames=26000, max_target_frames=26000, max_cuts=None, **kwargs)

CutPairsSampler’s constructor.

Parameters
  • source_cuts (CutSet) – the first CutSet to sample data from.

  • target_cuts (CutSet) – the second CutSet to sample data from.

  • max_source_frames (int) – The maximum number of feature frames from source_cuts that we’re going to put in a single batch. The padding introduced during collation does not contribute to that limit.

  • max_target_frames (int) – The maximum number of feature frames from target_cuts that we’re going to put in a single batch. The padding introduced during collation does not contribute to that limit.

  • max_cuts (Optional[int]) – The maximum number of cuts sampled to form a mini-batch. By default, this constraint is off.
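
A minimal usage sketch (source_cuts and target_cuts are assumed to be existing CutSets with matching IDs, e.g. noisy inputs and clean targets, and dataset is a Dataset built from them):

from torch.utils.data import DataLoader
from lhotse.dataset.sampling import CutPairsSampler

sampler = CutPairsSampler(
    source_cuts,
    target_cuts,
    max_source_frames=20000,
    max_target_frames=20000,
)
loader = DataLoader(dataset, sampler=sampler, batch_size=None)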

class lhotse.dataset.sampling.BucketingSampler(*cuts, sampler_type=<class 'lhotse.dataset.sampling.SingleCutSampler'>, num_buckets=10, **kwargs)

Sorts the cuts in a CutSet by their duration and puts them into similar duration buckets. For each bucket, it instantiates a simpler sampler instance, e.g. SingleCutSampler.

It behaves like an iterable that yields lists of strings (cut IDs). During iteration, it randomly selects one of the buckets to yield the batch from, until all the underlying samplers are depleted (which means it’s the end of an epoch).

Examples:

Bucketing sampler with 20 buckets, sampling single cuts:

>>> sampler = BucketingSampler(
...    cuts,
...    # BucketingSampler specific args
...    sampler_type=SingleCutSampler, num_buckets=20,
...    # Args passed into SingleCutSampler
...    max_frames=20000
... )

Bucketing sampler with 20 buckets, sampling pairs of source-target cuts:

>>> sampler = BucketingSampler(
...    cuts, target_cuts,
...    # BucketingSampler specific args
...    sampler_type=CutPairsSampler, num_buckets=20,
...    # Args passed into CutPairsSampler
...    max_source_frames=20000, max_target_frames=15000
... )

__init__(*cuts, sampler_type=<class 'lhotse.dataset.sampling.SingleCutSampler'>, num_buckets=10, **kwargs)

BucketingSampler’s constructor.

Parameters
  • cuts (CutSet) – one or more CutSet objects. The first one will be used to determine the buckets for all of them. Then, all of them will be used to instantiate the per-bucket samplers.

  • sampler_type (Type) – a sampler type that will be created for each underlying bucket.

  • num_buckets (int) – how many buckets to create.

  • kwargs – Arguments used to create the underlying sampler for each bucket.

set_epoch(epoch)

Sets the epoch for this sampler. When shuffle=True, this ensures all replicas use a different random ordering for each epoch. Otherwise, the next iteration of this sampler will yield the same ordering.

Parameters

epoch (int) – Epoch number.

Return type

None

property is_depleted
Return type

bool

lhotse.dataset.sampling.partition_cut_ids(data_source, world_size=1, rank=0)

Returns a list of cut IDs to be used by a single dataloading process. For multiple dataloader workers or DistributedDataParallel training, that list will be a subset of sampler.full_data_source.

Parameters
  • data_source (List[str]) – a list of Cut IDs, representing the full dataset.

  • world_size (int) – Total number of distributed nodes. Set only when using DistributedDataParallel.

  • rank (int) – Index of distributed node. Set only when using DistributedDataParallel.

Return type

List[str]
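
A minimal sketch of how the partitioning is used (the IDs are illustrative; per the distributed sampling description above, the partitions together cover the full list without duplicating cuts):

from lhotse.dataset.sampling import partition_cut_ids

all_ids = [f'cut-{i}' for i in range(10)]
ids_rank0 = partition_cut_ids(all_ids, world_size=2, rank=0)
ids_rank1 = partition_cut_ids(all_ids, world_size=2, rank=1)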

Collation utilities for building custom Datasets

lhotse.dataset.collation.collate_features(cuts)

Load features for all the cuts and return them as a batch in a torch tensor. The output shape is (batch, time, features). The cuts will be padded with silence if necessary.

Return type

Tensor

lhotse.dataset.collation.collate_audio(cuts)

Load audio samples for all the cuts and return them as a batch in a torch tensor. The output shape is (batch, time). The cuts will be padded with silence if necessary.

Return type

Tensor
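
As a minimal sketch of how these utilities support custom Datasets (per the convention above, the Dataset is queried with a list of cut IDs; the lookup-by-ID and CutSet.from_cuts calls reflect the public CutSet API, but double-check them against your Lhotse version):

from torch.utils.data import Dataset
from lhotse import CutSet
from lhotse.dataset.collation import collate_features

class MyCustomDataset(Dataset):
    """Returns a whole collated batch for a list of cut IDs."""

    def __init__(self, cuts: CutSet):
        self.cuts = cuts

    def __getitem__(self, cut_ids):
        # Look up the sampled cuts by their IDs and collate their features.
        cuts = CutSet.from_cuts(self.cuts[cut_id] for cut_id in cut_ids)
        return {'features': collate_features(cuts)}  # (B, T, F), padded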

lhotse.dataset.collation.collate_multi_channel_features(cuts)

Load features for all the cuts and return them as a batch in a torch tensor. The cuts have to be of type MixedCut and their tracks will be interpreted as individual channels. The output shape is (batch, channel, time, features). The cuts will be padded with silence if necessary.

Return type

Tensor

lhotse.dataset.collation.collate_multi_channel_audio(cuts)

Load audio samples for all the cuts and return them as a batch in a torch tensor. The cuts have to be of type MixedCut and their tracks will be interpreted as individual channels. The output shape is (batch, channel, time). The cuts will be padded with silence if necessary.

Return type

Tensor

lhotse.dataset.collation.collate_vectors(tensors, padding_value=-100, matching_shapes=False)

Convert an iterable of 1-D tensors (of possibly various lengths) into a single stacked tensor.

Parameters
  • tensors (Iterable[Union[Tensor, ndarray]]) – an iterable of 1-D tensors.

  • padding_value (Union[int, float]) – the padding value inserted to make all tensors have the same length.

  • matching_shapes (bool) – when True, will fail when input tensors have different shapes.

Return type

Tensor

Returns

a tensor with shape (B, L) where B is the number of input tensors and L is the number of items in the longest tensor.

lhotse.dataset.collation.collate_matrices(tensors, padding_value=0, matching_shapes=False)

Convert an iterable of 2-D tensors (of possibly various first dimension, but consistent second dimension) into a single stacked tensor.

Parameters
  • tensors (Iterable[Union[Tensor, ndarray]]) – an iterable of 2-D tensors.

  • padding_value (Union[int, float]) – the padding value inserted to make all tensors have the same length.

  • matching_shapes (bool) – when True, will fail when input tensors have different shapes.

Return type

Tensor

Returns

a tensor with shape (B, L, F) where B is the number of input tensors, L is the largest found shape[0], and F is equal to shape[1].
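
A small illustration of both functions with toy tensors (shapes follow the descriptions above):

import torch
from lhotse.dataset.collation import collate_vectors, collate_matrices

tokens = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]
padded = collate_vectors(tokens)   # shape (2, 3); pads with -100 by default

feats = [torch.ones(4, 80), torch.ones(6, 80)]
stacked = collate_matrices(feats)  # shape (2, 6, 80); pads with 0 by default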

lhotse.dataset.collation.maybe_pad(cuts)

Check if all cuts’ durations are equal and pad them to match the longest cut otherwise.

Return type

CutSet