PyTorch Datasets
Lhotse supports PyTorch’s dataset API, providing implementations for the Dataset and Sampler concepts.
They can be used together with the standard DataLoader class for efficient mini-batch collection with multiple parallel readers and pre-fetching.
A quick recap of PyTorch’s data API
PyTorch defines the Dataset class that is responsible for reading the data from disk/memory/Internet/database/etc., and converting it to tensors that can be used for network training or inference.
These Datasets are typically “map-style” datasets which are given an index (or a list of indices) and return the corresponding data samples.
The selection of indices is performed by the Sampler class.
A Sampler, knowing the length (number of items) of a Dataset, can use various strategies to determine the order of elements to read (e.g. sequential reads, or random reads).
More details about the data pipeline API in PyTorch can be found in the PyTorch documentation for torch.utils.data.
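The split of responsibilities described above can be illustrated with a minimal, framework-free sketch. The class names ToyMapDataset and ToyRandomSampler are hypothetical and only mimic the roles of PyTorch’s Dataset and Sampler; they are not PyTorch classes.

```python
import random
from typing import Iterator, List


class ToyMapDataset:
    """Map-style dataset: maps an integer index to a data sample."""

    def __init__(self, samples: List[str]) -> None:
        self.samples = samples

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, index: int) -> str:
        return self.samples[index]


class ToyRandomSampler:
    """Yields every index of the dataset exactly once, in random order."""

    def __init__(self, num_items: int, seed: int = 0) -> None:
        self.num_items = num_items
        self.seed = seed

    def __iter__(self) -> Iterator[int]:
        indices = list(range(self.num_items))
        random.Random(self.seed).shuffle(indices)
        return iter(indices)


dataset = ToyMapDataset(["utt-a", "utt-b", "utt-c", "utt-d"])
sampler = ToyRandomSampler(len(dataset), seed=0)
# The sampler decides the order; the dataset only answers index lookups.
read_order = [dataset[i] for i in sampler]
```

In real PyTorch code the DataLoader drives this loop and adds parallel workers and pre-fetching on top.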
About Lhotse’s Datasets and Samplers
Lhotse provides a number of utilities that make it simpler to define Datasets for speech processing tasks.
CutSet is the base data structure that is used to initialize the Dataset class.
This makes it possible to manipulate the speech data in convenient ways: pad, mix, concatenate, augment, compute features, look up the supervision information, etc.
Lhotse’s Datasets perform batching by themselves, because auto-collation in DataLoader is too limiting for speech data handling.
These Datasets expect to be handed lists of element indices, so that they can collate the data before it is passed to the DataLoader (which must use batch_size=None).
This allows for interesting collation methods, e.g. padding the speech with noise recordings or actual acoustic context rather than artificial zeroes, or dynamic batch sizes.
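To make the idea of a self-batching dataset concrete, here is a small sketch in plain Python. ToyBatchingDataset and its ID-to-values store are hypothetical stand-ins; Lhotse’s real Datasets work on CutSets and return tensors.

```python
from typing import Dict, List


class ToyBatchingDataset:
    """Receives a whole list of IDs and returns one collated batch."""

    def __init__(self, features: Dict[str, List[float]], pad_value: float = 0.0) -> None:
        self.features = features  # hypothetical store: ID -> per-frame values
        self.pad_value = pad_value

    def __getitem__(self, ids: List[str]) -> Dict[str, list]:
        items = [self.features[i] for i in ids]
        longest = max(len(item) for item in items)
        # Pad every item to the length of the longest one in this batch.
        padded = [item + [self.pad_value] * (longest - len(item)) for item in items]
        return {"features": padded, "lengths": [len(item) for item in items]}


dataset = ToyBatchingDataset({"a": [1.0, 2.0, 3.0], "b": [4.0], "c": [5.0, 6.0]})
# A sampler would yield lists of IDs like these; batch sizes may differ.
batches = [dataset[ids] for ids in (["a", "b"], ["c"])]
```

Because the dataset sees the whole batch at once, it is free to choose the padding strategy, which is the flexibility that auto-collation in DataLoader does not offer.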
The items for mini-batch creation are selected by the Sampler.
Lhotse defines Sampler classes that are initialized with CutSets, so that they can look up specific properties of an utterance to stratify the sampling.
For example, SingleCutSampler has a max_frames attribute, and it will keep adding cuts to a batch as long as their total number of frames does not exceed the specified limit.
Another strategy, used in BucketingSampler, will first group the cuts of similar durations into buckets, and then randomly select a bucket to draw the whole batch from.
For tasks where both the input and the output of the model are speech utterances, we can use the CutPairsSampler, which accepts two CutSets and will match the cuts in them by their IDs.
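The frame-budget strategy can be sketched as a greedy loop. This is an illustration under assumptions, not Lhotse’s implementation: the function name batches_by_max_frames and the frame counts are made up, and a cut longer than the limit is simply placed in a batch of its own.

```python
from typing import Dict, Iterator, List


def batches_by_max_frames(num_frames: Dict[str, int], max_frames: int) -> Iterator[List[str]]:
    """Greedily group cut IDs so each batch stays within max_frames."""
    batch: List[str] = []
    total = 0
    for cut_id, frames in num_frames.items():
        # Close the current batch if adding this cut would exceed the limit.
        if batch and total + frames > max_frames:
            yield batch
            batch, total = [], 0
        batch.append(cut_id)
        total += frames
    if batch:
        yield batch


# Hypothetical frame counts per cut ID.
frames_per_cut = {"c1": 400, "c2": 500, "c3": 300, "c4": 900, "c5": 100}
batches = list(batches_by_max_frames(frames_per_cut, max_frames=1000))
```

Note that the number of cuts per batch varies, which is why the DataLoader cannot be given a fixed batch size.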
A typical usage of Lhotse’s dataset API might look like this:
from torch.utils.data import DataLoader
from lhotse import CutSet
from lhotse.dataset import SpeechRecognitionDataset, SingleCutSampler

cuts = CutSet(...)  # placeholder: construct or load a CutSet here
dset = SpeechRecognitionDataset(cuts)
sampler = SingleCutSampler(cuts, max_frames=50000)
# Dataset performs batching by itself, so we have to indicate that
# to the DataLoader with batch_size=None
dloader = DataLoader(dset, sampler=sampler, batch_size=None, num_workers=1)
for batch in dloader:
    ...  # process data
List of Datasets
class lhotse.dataset.diarization.DiarizationDataset(cuts, min_speaker_dim=None, global_speaker_ids=False)
A PyTorch Dataset for the speaker diarization task. Our assumptions about speaker diarization are the following:
- we assume a single channel input (for now), which could be either a true mono signal or a beamforming result from a microphone array.
- we assume that the supervision used for model training is a speech activity matrix, with one row dedicated to each speaker (either in the current cut or the whole dataset, depending on the settings). The columns correspond to feature frames. Each row is effectively a Voice Activity Detection supervision for a single speaker. This setup is somewhat inspired by the TS-VAD paper: https://arxiv.org/abs/2005.07272
Each item in this dataset is a dict of:
{
    'features': (B x T x F) tensor
    'speaker_activity': (B x num_speaker x T) tensor
}
Constructor arguments:
Parameters
- cuts (CutSet) – a CutSet used to create the dataset object.
- min_speaker_dim (Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).
- global_speaker_ids (bool) – a bool, indicates whether the same speaker should always retain the same row index in the speaker activity matrix (useful for speaker-dependent systems).
- root_dir – a prefix path to be attached to the feature files paths.
__init__(cuts, min_speaker_dim=None, global_speaker_ids=False)
Initialize self. See help(type(self)) for accurate signature.
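The speech activity matrix used as the supervision can be sketched for a single cut as follows. This is an illustrative reconstruction in plain Python, not the dataset’s code: the Segment tuple layout and the function speaker_activity_matrix are assumptions, and rows are assigned per cut in sorted speaker order (roughly what global_speaker_ids=False would imply).

```python
from typing import List, Optional, Tuple

# (speaker, start_frame, end_frame) with end exclusive; hypothetical segments.
Segment = Tuple[str, int, int]


def speaker_activity_matrix(
    segments: List[Segment], num_frames: int, min_speaker_dim: Optional[int] = None
) -> List[List[int]]:
    """Build a (num_speakers x num_frames) 0/1 matrix, one row per speaker."""
    speakers = sorted({spk for spk, _, _ in segments})
    # Optionally enforce a minimum number of rows, leaving extra rows silent.
    num_rows = max(len(speakers), min_speaker_dim or 0)
    matrix = [[0] * num_frames for _ in range(num_rows)]
    for spk, start, end in segments:
        row = speakers.index(spk)
        for t in range(max(start, 0), min(end, num_frames)):
            matrix[row][t] = 1
    return matrix


segments = [("alice", 0, 3), ("bob", 2, 5)]
activity = speaker_activity_matrix(segments, num_frames=6, min_speaker_dim=3)
```

Frame 2 is active for both speakers, which shows how overlapping speech is represented; the third row exists only because of min_speaker_dim.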
class lhotse.dataset.unsupervised.UnsupervisedDataset(cuts)
Dataset that contains no supervision - it only provides the features extracted from recordings. The returned features are a torch.Tensor of shape (T x F), where T is the number of frames, and F is the feature dimension.
__init__(cuts)
Initialize self. See help(type(self)) for accurate signature.
class lhotse.dataset.unsupervised.UnsupervisedWaveformDataset(cuts)
A variant of UnsupervisedDataset that provides waveform samples instead of features. The output is a tensor of shape (C, T), with C being the number of channels and T the number of audio samples. In this implementation, there will always be a single channel.
class lhotse.dataset.unsupervised.DynamicUnsupervisedDataset(feature_extractor, cuts, augment_fn=None)
An example dataset that shows how to use on-the-fly feature extraction in Lhotse. It accepts two additional inputs - a FeatureExtractor and an optional WavAugmenter for time-domain data augmentation. The output is approximately the same as that of the UnsupervisedDataset - there might be slight differences for MixedCuts, because this dataset mixes them in the time domain, and UnsupervisedDataset does that in the feature domain. Cuts that are not mixed will yield identical results in both dataset classes.
__init__(feature_extractor, cuts, augment_fn=None)
Initialize self. See help(type(self)) for accurate signature.
class lhotse.dataset.speech_recognition.K2SpeechRecognitionDataset(cuts, return_cuts=False, cut_transforms=None)
The PyTorch Dataset for the speech recognition task using K2 library.
This dataset expects to be queried with lists of cut IDs, for which it loads features and automatically collates/batches them.
To use it with a PyTorch DataLoader, set batch_size=None and provide a SingleCutSampler sampler.
Each item in this dataset is a dict of:
{
    'features': float tensor of shape (B, T, F)
    'supervisions': [
        {
            'sequence_idx': Tensor[int] of shape (S,)
            'text': List[str] of len S
            'start_frame': Tensor[int] of shape (S,)
            'num_frames': Tensor[int] of shape (S,)
            # Optionally, when return_cuts=True
            'cut': List[AnyCut] of len S
        }
    ]
}
Dimension symbols legend:
- B - batch size (number of Cuts)
- S - number of supervision segments (greater or equal to B, as each Cut may have multiple supervisions)
- T - number of frames of the longest Cut
- F - number of features
The ‘sequence_idx’ field is the index of the Cut used to create the example in the Dataset.
__init__(cuts, return_cuts=False, cut_transforms=None)
K2 ASR Dataset constructor.
Parameters
- cuts (CutSet) – the CutSet to sample data from.
- return_cuts (bool) – When True, will additionally return a “cut” field in each batch with the Cut objects used to create that batch.
- cut_transforms (Optional[List[Callable[[CutSet], CutSet]]]) – A list of transforms to be applied on each sampled batch (e.g. cut concatenation, noise cuts mixing, etc.).
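As an illustration of how the per-supervision fields of K2SpeechRecognitionDataset relate to the feature batch, the sketch below uses sequence_idx, start_frame and num_frames to cut out the frames belonging to each supervision. It assumes the fields are available as parallel sequences of length S and uses nested lists in place of tensors; the helper slice_supervisions is hypothetical.

```python
from typing import List, Tuple

Frames = List[List[float]]  # (T, F) nested lists standing in for a tensor


def slice_supervisions(
    features: List[Frames],
    sequence_idx: List[int],
    start_frame: List[int],
    num_frames: List[int],
    text: List[str],
) -> List[Tuple[str, Frames]]:
    """Pair each supervision's text with its own span of feature frames."""
    pairs = []
    for idx, start, length, transcript in zip(sequence_idx, start_frame, num_frames, text):
        pairs.append((transcript, features[idx][start:start + length]))
    return pairs


# Toy batch with B=2 cuts, T=4 frames, F=1 feature and S=3 supervisions.
features = [
    [[0.1], [0.2], [0.3], [0.4]],
    [[1.1], [1.2], [0.0], [0.0]],  # the second cut is padded up to T=4
]
pairs = slice_supervisions(
    features,
    sequence_idx=[0, 0, 1],
    start_frame=[0, 2, 0],
    num_frames=[2, 2, 2],
    text=["hello", "world", "hi"],
)
```

Here S=3 is greater than B=2 because the first cut carries two supervisions.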
lhotse.dataset.speech_synthesis
alias of lhotse.dataset.speech_synthesis
class lhotse.dataset.source_separation.DynamicallyMixedSourceSeparationDataset(sources_set, mixtures_set, nonsources_set=None)
A PyTorch Dataset for the source separation task. It’s created from a number of CutSets:
- sources_set: provides the audio cuts for the sources (the targets of source separation),
- mixtures_set: provides the audio cuts for the signal mix (the input of source separation),
- nonsources_set: (optional) provides the audio cuts for other signals that are in the mix, but are not the targets of source separation. Useful for adding noise.
When queried for data samples, it returns a dict of:
{
    'sources': (N x T x F) tensor,
    'mixture': (T x F) tensor,
    'real_mask': (N x T x F) tensor,
    'binary_mask': (T x F) tensor
}
This Dataset performs on-the-fly feature-domain mixing of the sources. It expects the mixtures_set to contain MixedCuts, so that it knows which Cuts should be mixed together.
__init__(sources_set, mixtures_set, nonsources_set=None)
Initialize self. See help(type(self)) for accurate signature.
validate()
class lhotse.dataset.source_separation.PreMixedSourceSeparationDataset(sources_set, mixtures_set)
A PyTorch Dataset for the source separation task. It’s created from two CutSets - one provides the audio cuts for the sources, and the other one the audio cuts for the signal mix. When queried for data samples, it returns a dict of:
{
    'sources': (N x T x F) tensor,
    'mixture': (T x F) tensor,
    'real_mask': (N x T x F) tensor,
    'binary_mask': (T x F) tensor
}
It expects both CutSets to return regular Cuts, meaning that the signals were mixed in the time domain. In contrast to DynamicallyMixedSourceSeparationDataset, no on-the-fly feature-domain-mixing is performed.
__init__(sources_set, mixtures_set)
Initialize self. See help(type(self)) for accurate signature.
List of Samplers
class lhotse.dataset.sampling.CutSampler(cut_ids, shuffle=False, world_size=None, rank=None, seed=0)
CutSampler is responsible for collecting batches of cuts, given specified criteria. It implements correct handling of distributed sampling in DataLoader, so that the cuts are not duplicated across workers.
Sampling in a CutSampler is intended to be very quick - it only uses the metadata in the CutSet manifest to select the cuts, and is not intended to perform any I/O.
CutSampler works similarly to PyTorch’s DistributedSampler - when shuffle=True, you should call sampler.set_epoch(epoch) at each new epoch to have a different ordering of returned elements.
Example usage:
>>> dataset = K2SpeechRecognitionDataset(cuts)
>>> sampler = SingleCutSampler(cuts, shuffle=True)
>>> loader = DataLoader(dataset, sampler=sampler, batch_size=None)
>>> for epoch in range(start_epoch, n_epochs):
...     sampler.set_epoch(epoch)
...     train(loader)
Note
For implementers of new samplers: Subclasses of CutSampler are expected to implement __next__() to introduce specific sampling logic (e.g. based on filters such as max number of frames/tokens/etc.). CutSampler defines __iter__(), which optionally shuffles the cut IDs, and resets self.current_idx to zero (to be used and incremented inside of __next__()).
__init__(cut_ids, shuffle=False, world_size=None, rank=None, seed=0)
Parameters
- cut_ids (Iterable[str]) – An iterable of cut IDs for the full dataset. CutSampler will take care of partitioning that into distributed workers (if needed).
- shuffle (bool) – When True, the cuts will be shuffled at the start of iteration. Convenient when the mini-batch loop is inside an outer epoch-level loop, as every epoch will see a different cuts order, e.g.:
      for epoch in range(10):
          for batch in dataset:
              ...
- world_size (Optional[int]) – Total number of distributed nodes. We will try to infer it by default.
- rank (Optional[int]) – Index of distributed node. We will try to infer it by default.
- seed (int) – Random seed used to consistently shuffle the dataset across different processes.
set_epoch(epoch)
Sets the epoch for this sampler. When shuffle=True, this ensures all replicas use a different random ordering for each epoch. Otherwise, the next iteration of this sampler will yield the same ordering.
Parameters
- epoch (int) – Epoch number.
Return type
None
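One common way to obtain an ordering that changes between epochs yet stays identical across processes is to derive the shuffling seed from both seed and epoch. The sketch below shows that idea with the standard library only; epoch_order is a hypothetical helper and the exact seeding scheme used by CutSampler is an assumption here.

```python
import random
from typing import List


def epoch_order(cut_ids: List[str], seed: int, epoch: int, shuffle: bool = True) -> List[str]:
    """Return the cut IDs in the order used for the given epoch."""
    order = list(cut_ids)
    if shuffle:
        # Same seed and epoch give the same order in every process.
        random.Random(seed + epoch).shuffle(order)
    return order


cut_ids = [f"cut-{i}" for i in range(10)]
first_epoch = epoch_order(cut_ids, seed=0, epoch=0)
second_epoch = epoch_order(cut_ids, seed=0, epoch=1)
# Re-running with the same (seed, epoch) reproduces the ordering exactly.
replayed = epoch_order(cut_ids, seed=0, epoch=0)
```

If set_epoch() is never called, the effective seed never changes, which is why every epoch would then see the same ordering.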
class lhotse.dataset.sampling.SingleCutSampler(cuts, max_frames=26000, max_cuts=None, **kwargs)
Samples cuts from a CutSet to satisfy the criteria of max_frames and max_cuts. It behaves like an iterable that yields lists of strings (cut IDs).
__init__(cuts, max_frames=26000, max_cuts=None, **kwargs)
SingleCutSampler’s constructor.
Parameters
- cuts (CutSet) – the CutSet to sample data from.
- max_frames (int) – The maximum number of feature frames from cuts that we’re going to put in a single batch. The padding introduced during collation does not contribute to that limit.
- max_cuts (Optional[int]) – The maximum number of cuts sampled to form a mini-batch. By default, this constraint is off.
- kwargs – Arguments to be passed into CutSampler.
class lhotse.dataset.sampling.CutPairsSampler(source_cuts, target_cuts, max_source_frames=26000, max_target_frames=26000, max_cuts=None, **kwargs)
Samples pairs of cuts from a “source” and “target” CutSet. It expects that both CutSets strictly consist of Cuts with corresponding IDs. It will try to satisfy the criteria of max_source_frames, max_target_frames, and max_cuts. It behaves like an iterable that yields lists of strings (cut IDs).
__init__(source_cuts, target_cuts, max_source_frames=26000, max_target_frames=26000, max_cuts=None, **kwargs)
CutPairsSampler’s constructor.
Parameters
- source_cuts (CutSet) – the first CutSet to sample data from.
- target_cuts (CutSet) – the second CutSet to sample data from.
- max_source_frames (int) – The maximum number of feature frames from source_cuts that we’re going to put in a single batch. The padding introduced during collation does not contribute to that limit.
- max_target_frames (int) – The maximum number of feature frames from target_cuts that we’re going to put in a single batch. The padding introduced during collation does not contribute to that limit.
- max_cuts (Optional[int]) – The maximum number of cuts sampled to form a mini-batch. By default, this constraint is off.
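Sampling pairs under two independent frame budgets can be sketched as below. This is a simplified illustration rather than CutPairsSampler’s code: paired_batches is a hypothetical function, the frame counts are invented, and max_cuts is left out for brevity.

```python
from typing import Dict, Iterator, List


def paired_batches(
    source_frames: Dict[str, int],
    target_frames: Dict[str, int],
    max_source_frames: int,
    max_target_frames: int,
) -> Iterator[List[str]]:
    """Yield lists of cut IDs that respect both frame limits."""
    if source_frames.keys() != target_frames.keys():
        raise ValueError("source and target cuts must share the same IDs")
    batch: List[str] = []
    src_total = tgt_total = 0
    for cut_id in source_frames:
        src, tgt = source_frames[cut_id], target_frames[cut_id]
        too_big = src_total + src > max_source_frames or tgt_total + tgt > max_target_frames
        # Start a new batch as soon as either budget would be exceeded.
        if batch and too_big:
            yield batch
            batch, src_total, tgt_total = [], 0, 0
        batch.append(cut_id)
        src_total += src
        tgt_total += tgt
    if batch:
        yield batch


source = {"u1": 300, "u2": 300, "u3": 300}
target = {"u1": 200, "u2": 500, "u3": 100}
pair_batches = list(
    paired_batches(source, target, max_source_frames=1000, max_target_frames=600)
)
```

In this toy run the target budget is the binding constraint: the first batch closes after one pair even though the source budget would have allowed more.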
class lhotse.dataset.sampling.BucketingSampler(*cuts, sampler_type=<class 'lhotse.dataset.sampling.SingleCutSampler'>, num_buckets=10, **kwargs)
Sorts the cuts in a CutSet by their duration and puts them into similar duration buckets. For each bucket, it instantiates a simpler sampler instance, e.g. SingleCutSampler.
It behaves like an iterable that yields lists of strings (cut IDs). During iteration, it randomly selects one of the buckets to yield the batch from, until all the underlying samplers are depleted (which means it’s the end of an epoch).
Examples:
Bucketing sampler with 20 buckets, sampling single cuts:
>>> sampler = BucketingSampler(
...     cuts,
...     # BucketingSampler specific args
...     sampler_type=SingleCutSampler, num_buckets=20,
...     # Args passed into SingleCutSampler
...     max_frames=20000
... )
Bucketing sampler with 20 buckets, sampling pairs of source-target cuts:
>>> sampler = BucketingSampler(
...     cuts, target_cuts,
...     # BucketingSampler specific args
...     sampler_type=CutPairsSampler, num_buckets=20,
...     # Args passed into CutPairsSampler
...     max_source_frames=20000, max_target_frames=15000
... )
__init__(*cuts, sampler_type=<class 'lhotse.dataset.sampling.SingleCutSampler'>, num_buckets=10, **kwargs)
BucketingSampler’s constructor.
Parameters
- cuts (CutSet) – one or more CutSet objects. The first one will be used to determine the buckets for all of them. Then, all of them will be used to instantiate the per-bucket samplers.
- sampler_type (Type) – a sampler type that will be created for each underlying bucket.
- num_buckets (int) – how many buckets to create.
- kwargs – Arguments used to create the underlying sampler for each bucket.
set_epoch(epoch)
Sets the epoch for this sampler. When shuffle=True, this ensures all replicas use a different random ordering for each epoch. Otherwise, the next iteration of this sampler will yield the same ordering.
Parameters
- epoch (int) – Epoch number.
Return type
None
property is_depleted
Return type
bool
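The bucketing strategy can be sketched in two steps: sort by duration and split into buckets, then repeatedly pick a random non-depleted bucket and take a batch from it. The functions make_buckets and bucketed_batches are hypothetical, and a fixed batch_size stands in for the per-bucket sampler (such as SingleCutSampler) that the real class instantiates.

```python
import random
from typing import Dict, Iterator, List


def make_buckets(durations: Dict[str, float], num_buckets: int) -> List[List[str]]:
    """Sort cut IDs by duration and split them into roughly equal buckets."""
    ordered = sorted(durations, key=durations.get)
    size = max(1, -(-len(ordered) // num_buckets))  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def bucketed_batches(
    buckets: List[List[str]], batch_size: int, seed: int = 0
) -> Iterator[List[str]]:
    """Randomly pick a non-empty bucket and draw the next batch from it."""
    rng = random.Random(seed)
    remaining = [list(bucket) for bucket in buckets]
    # Stop once every bucket is depleted, i.e. at the end of the epoch.
    while any(remaining):
        bucket = rng.choice([b for b in remaining if b])
        batch, bucket[:] = bucket[:batch_size], bucket[batch_size:]
        yield batch


# Hypothetical durations in seconds.
durations = {"a": 1.0, "b": 9.5, "c": 1.2, "d": 9.0, "e": 5.1, "f": 4.8}
buckets = make_buckets(durations, num_buckets=3)
batches = list(bucketed_batches(buckets, batch_size=2))
```

Every batch contains cuts of similar duration, so little padding is needed at collation time.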
lhotse.dataset.sampling.partition_cut_ids(data_source, world_size=1, rank=0)
Returns a list of cut IDs to be used by a single dataloading process. For multiple dataloader workers or DistributedDataParallel training, that list will be a subset of sampler.full_data_source.
Parameters
- data_source (List[str]) – a list of Cut IDs, representing the full dataset.
- world_size (int) – Total number of distributed nodes. Set only when using DistributedDataParallel.
- rank (int) – Index of distributed node. Set only when using DistributedDataParallel.
Return type
List[str]
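A simple way to give each process a disjoint share of the IDs is a strided split, sketched below. This only illustrates the contract (disjoint subsets that together cover the full list); partition_ids is a hypothetical function and the actual splitting scheme of partition_cut_ids may differ.

```python
from typing import List


def partition_ids(data_source: List[str], world_size: int = 1, rank: int = 0) -> List[str]:
    """Return the slice of IDs assigned to one of world_size processes."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # Strided split: process r takes items r, r + world_size, r + 2 * world_size, ...
    return data_source[rank::world_size]


all_ids = [f"cut-{i}" for i in range(7)]
parts = [partition_ids(all_ids, world_size=3, rank=r) for r in range(3)]
```

With the defaults (world_size=1, rank=0) the full list is returned unchanged.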
Collation utilities for building custom Datasets
lhotse.dataset.collation.collate_features(cuts)
Load features for all the cuts and return them as a batch in a torch tensor. The output shape is (batch, time, features). The cuts will be padded with silence if necessary.
Return type
Tensor
lhotse.dataset.collation.collate_audio(cuts)
Load audio samples for all the cuts and return them as a batch in a torch tensor. The output shape is (batch, time). The cuts will be padded with silence if necessary.
Return type
Tensor
lhotse.dataset.collation.collate_multi_channel_features(cuts)
Load features for all the cuts and return them as a batch in a torch tensor. The cuts have to be of type MixedCut and their tracks will be interpreted as individual channels. The output shape is (batch, channel, time, features). The cuts will be padded with silence if necessary.
Return type
Tensor
lhotse.dataset.collation.collate_multi_channel_audio(cuts)
Load audio samples for all the cuts and return them as a batch in a torch tensor. The cuts have to be of type MixedCut and their tracks will be interpreted as individual channels. The output shape is (batch, channel, time). The cuts will be padded with silence if necessary.
Return type
Tensor
lhotse.dataset.collation.collate_vectors(tensors, padding_value=-100, matching_shapes=False)
Convert an iterable of 1-D tensors (of possibly various lengths) into a single stacked tensor.
Parameters
- tensors (Iterable[Union[Tensor, ndarray]]) – an iterable of 1-D tensors.
- padding_value (Union[int, float]) – the padding value inserted to make all tensors have the same length.
- matching_shapes (bool) – when True, will fail when input tensors have different shapes.
Return type
Tensor
Returns
a tensor with shape (B, L) where B is the number of input tensors and L is the number of items in the longest tensor.
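The padding behaviour described for collate_vectors can be mimicked on plain Python lists, as in the sketch below. pad_vectors is a hypothetical stand-in that returns nested lists rather than a torch Tensor; the parameter names mirror the documented ones only for readability.

```python
from typing import List, Sequence, Union

Number = Union[int, float]


def pad_vectors(
    vectors: Sequence[Sequence[Number]],
    padding_value: Number = -100,
    matching_shapes: bool = False,
) -> List[List[Number]]:
    """Right-pad 1-D sequences to a common length, giving a (B, L) nested list."""
    lengths = [len(v) for v in vectors]
    if matching_shapes and len(set(lengths)) > 1:
        raise ValueError(f"Expected equal lengths, got {lengths}")
    longest = max(lengths, default=0)
    return [list(v) + [padding_value] * (longest - len(v)) for v in vectors]


padded = pad_vectors([[1, 2, 3], [4], [5, 6]])
```

Here B=3 and L=3; shorter inputs are filled up with the padding value on the right.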
lhotse.dataset.collation.collate_matrices(tensors, padding_value=0, matching_shapes=False)
Convert an iterable of 2-D tensors (of possibly various first dimension, but consistent second dimension) into a single stacked tensor.
Parameters
- tensors (Iterable[Union[Tensor, ndarray]]) – an iterable of 2-D tensors.
- padding_value (Union[int, float]) – the padding value inserted to make all tensors have the same length.
- matching_shapes (bool) – when True, will fail when input tensors have different shapes.
Return type
Tensor
Returns
a tensor with shape (B, L, F) where B is the number of input tensors, L is the largest found shape[0], and F is equal to shape[1].
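The 2-D case works the same way, except that whole rows of padding are appended along the first dimension. The sketch below again uses nested lists instead of tensors, and pad_matrices is a hypothetical helper, not the library function.

```python
from typing import List, Sequence, Union

Number = Union[int, float]
Matrix = Sequence[Sequence[Number]]


def pad_matrices(matrices: Sequence[Matrix], padding_value: Number = 0) -> List[List[List[Number]]]:
    """Pad (L_i, F) matrices with extra rows to a common (L, F) shape."""
    widths = {len(row) for m in matrices for row in m}
    if len(widths) > 1:
        raise ValueError(f"Inconsistent second dimension: {sorted(widths)}")
    num_features = widths.pop() if widths else 0
    longest = max((len(m) for m in matrices), default=0)
    return [
        # Keep the original rows, then append rows filled with padding_value.
        [list(row) for row in m]
        + [[padding_value] * num_features for _ in range(longest - len(m))]
        for m in matrices
    ]


stacked = pad_matrices([[[1, 2], [3, 4]], [[5, 6]]])
```

The result has shape (B, L, F) = (2, 2, 2), with the second matrix extended by one row of zeros.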