PyTorch Datasets
Lhotse supports PyTorch’s dataset API, providing implementations for the Dataset and Sampler concepts.
They can be used together with the standard DataLoader class for efficient mini-batch collection with multiple parallel readers and pre-fetching.
A quick recap of PyTorch’s data API
PyTorch defines the Dataset class, which is responsible for reading the data from disk, memory, the Internet, a database, etc., and converting it to tensors that can be used for network training or inference.
These Datasets are typically "map-style" datasets: they are given an index (or a list of indices) and return the corresponding data samples.
The selection of indices is performed by the Sampler class.
A Sampler, knowing the length (number of items) of a Dataset, can use various strategies to determine the order of elements to read (e.g. sequential or random reads).
More details about the data pipeline API in PyTorch can be found in the documentation of torch.utils.data.
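The division of labour between a map-style dataset and a sampler can be sketched in plain Python. This is only an illustration of the protocol (`__len__`, `__getitem__`, and an iterable of indices); the class names `SquaresDataset` and `ShuffledSampler` are hypothetical and it does not use the actual torch.utils.data classes.

```python
import random


class SquaresDataset:
    """Map-style dataset: maps an integer index to a sample."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return idx * idx


class ShuffledSampler:
    """Yields every index of the dataset exactly once, in random order."""

    def __init__(self, data_len, seed=0):
        self.data_len = data_len
        self.seed = seed

    def __iter__(self):
        order = list(range(self.data_len))
        random.Random(self.seed).shuffle(order)
        return iter(order)


dataset = SquaresDataset(5)
sampler = ShuffledSampler(len(dataset))
# The sampler decides the order; the dataset turns each index into data.
samples = [dataset[i] for i in sampler]
```

The sampler only needs the dataset's length, which is why the two can be developed and swapped independently.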
About Lhotse’s Datasets and Samplers
Lhotse provides a number of utilities that make it simpler to define Datasets for speech processing tasks.
CutSet is the base data structure that is used to initialize the Dataset class.
This makes it possible to manipulate the speech data in convenient ways: pad, mix, concatenate, augment, compute features, look up the supervision information, etc.
Lhotse’s Datasets perform batching by themselves, because auto-collation in DataLoader is too limiting for speech data handling.
These Datasets expect to be handed lists of element indices, so that they can collate the data before it is passed to the DataLoader (which must use batch_size=None).
This allows for interesting collation methods, e.g. padding the speech with noise recordings or actual acoustic context rather than artificial zeroes, or dynamic batch sizes.
The items for mini-batch creation are selected by the Sampler.
Lhotse defines Sampler classes that are initialized with CutSets, so that they can look up specific properties of an utterance to stratify the sampling.
For example, SingleCutSampler has a max_frames attribute, and it keeps adding cuts to a batch as long as their total number of frames does not exceed that limit.
Another strategy, used in BucketingSampler, first groups cuts of similar durations into buckets, and then randomly selects a bucket to draw the whole batch from.
For tasks where both the input and the output of the model are speech utterances, we can use the CutPairsSampler, which accepts two CutSets and matches the cuts in them by their IDs.
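The two sampling strategies described above can be sketched without Lhotse. The helper names `batches_by_max_frames` and `bucket_by_duration` are hypothetical, and the code is a simplified illustration of the ideas, not the samplers' actual implementation.

```python
import random


def batches_by_max_frames(num_frames, max_frames):
    """Greedily group item indices so each batch stays within max_frames.

    An item that alone exceeds max_frames still forms a batch of its own.
    """
    batch, total = [], 0
    for idx, n in enumerate(num_frames):
        if batch and total + n > max_frames:
            yield batch
            batch, total = [], 0
        batch.append(idx)
        total += n
    if batch:
        yield batch


def bucket_by_duration(durations, num_buckets):
    """Sort indices by duration and split them into roughly equal buckets."""
    order = sorted(range(len(durations)), key=lambda i: durations[i])
    size = -(-len(order) // num_buckets)  # ceiling division
    return [order[i:i + size] for i in range(0, len(order), size)]


frames = [300, 500, 400, 900, 100]
batches = list(batches_by_max_frames(frames, max_frames=1000))

buckets = bucket_by_duration([3.0, 12.5, 2.5, 11.0, 7.0, 6.5], num_buckets=3)
# A whole batch is then drawn from a single, randomly chosen bucket.
bucket = random.Random(0).choice(buckets)
```

Because each bucket holds items of similar duration, batches drawn from one bucket need little padding.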
A typical usage of Lhotse’s dataset API might look like this:
from torch.utils.data import DataLoader
from lhotse import CutSet
from lhotse.dataset import SpeechRecognitionDataset, SingleCutSampler

cuts = CutSet(...)
dset = SpeechRecognitionDataset(cuts)
sampler = SingleCutSampler(cuts, max_frames=50000)
# Dataset performs batching by itself, so we have to indicate that
# to the DataLoader with batch_size=None
dloader = DataLoader(dset, sampler=sampler, batch_size=None, num_workers=1)
for batch in dloader:
    ...  # process data
Restoring sampler’s state: continuing the training
All CutSampler types can save their progress and pick up from that checkpoint.
For consistency with PyTorch modules and optimizers, the relevant methods are called .state_dict() and .load_state_dict().
The following example illustrates how to save the sampler’s state (pay attention to the last bit):
import torch

dataset = ...  # Some task-specific dataset initialization
sampler = BucketingSampler(cuts, max_duration=200, shuffle=True, num_buckets=30)
dloader = DataLoader(dataset, batch_size=None, sampler=sampler, num_workers=4)
global_step = 0
for epoch in range(30):
    dloader.sampler.set_epoch(epoch)
    for batch in dloader:
        # ... processing forward, backward, etc.
        global_step += 1
        # Periodically checkpoint the sampler's progress.
        if global_step % 5000 == 0:
            state = dloader.sampler.state_dict()
            torch.save(state, f'sampler-ckpt-ep{epoch}-step{global_step}.pt')
If the training ends abruptly and the epochs are very long (10k+ steps, which is not uncommon with large datasets), we can resume the training from where it left off as follows:
# Creating a vanilla sampler, we will read the previous progress into it.
sampler = BucketingSampler(cuts, max_duration=200, shuffle=True, num_buckets=30)
# Restore the sampler's state.
state = torch.load('sampler-ckpt-ep5-step75000.pt')
sampler.load_state_dict(state)
dloader = DataLoader(dataset, batch_size=None, sampler=sampler, num_workers=4)
global_step = sampler.diagnostics.total_cuts  # <-- Restore the global step idx.
for epoch in range(sampler.epoch, 30):  # <-- Skip previous epochs that are already processed.
    dloader.sampler.set_epoch(epoch)
    for batch in dloader:
        # Note: the first batch is going to be from step 75009.
        # With DataLoader num_workers==0, it would have been 75001, but we get
        # +8 because of num_workers==4 * prefetching_factor==2
        # ... processing forward, backward, etc.
        global_step += 1
Note
In general, the sampler arguments may be different: loading a state_dict will overwrite the arguments and emit a warning so that the user is aware of what happened. BucketingSampler is an exception: num_buckets and bucket_method must be consistent, otherwise we couldn’t guarantee identical outcomes after training resumption.
Note
The DataLoader’s num_workers can be different after resuming.
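The save-and-resume mechanism can be illustrated with a toy sampler. `ResumableSampler` is a hypothetical class written for this sketch; it assumes that remembering the epoch, the position within the epoch, and the shuffling seed is enough to restore progress, which is a simplification of what Lhotse's samplers track.

```python
import random


class ResumableSampler:
    """Toy sampler that shuffles deterministically and can resume mid-epoch."""

    def __init__(self, num_items, seed=0):
        self.num_items = num_items
        self.seed = seed
        self.epoch = 0
        self.position = 0  # number of items already yielded in this epoch

    def set_epoch(self, epoch):
        # Moving to a different epoch resets the in-epoch progress.
        if epoch != self.epoch:
            self.epoch = epoch
            self.position = 0

    def state_dict(self):
        return {"epoch": self.epoch, "position": self.position, "seed": self.seed}

    def load_state_dict(self, state):
        self.epoch = state["epoch"]
        self.position = state["position"]
        self.seed = state["seed"]

    def __iter__(self):
        order = list(range(self.num_items))
        # The same (seed, epoch) pair always gives the same order.
        random.Random(self.seed + self.epoch).shuffle(order)
        while self.position < len(order):
            idx = order[self.position]
            self.position += 1
            yield idx
        self.position = 0


reference = list(ResumableSampler(10))  # an uninterrupted epoch

interrupted = ResumableSampler(10)
it = iter(interrupted)
seen = [next(it) for _ in range(4)]  # "training" stops after 4 items
state = interrupted.state_dict()

resumed = ResumableSampler(10)
resumed.load_state_dict(state)
rest = list(resumed)  # continues from the 5th item
```

Resuming yields exactly the items that the interrupted run had not seen yet, in the same order as an uninterrupted epoch.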
Batch I/O: pre-computed vs. on-the-fly features
Depending on the experimental setup and infrastructure, it might be more convenient to either pre-compute and store features like filter-bank energies for later use (as traditionally done in Kaldi/ESPnet/Espresso toolkits), or compute them dynamically during training ("on-the-fly").
Lhotse supports both modes of computation by introducing a class called BatchIO.
It is accepted as an argument in most dataset classes, and defaults to PrecomputedFeatures.
Other available choices are AudioSamples, for working with waveforms directly, and OnTheFlyFeatures, which wraps a FeatureExtractor and applies it to a batch of recordings.
These strategies automatically pad and collate the inputs, and provide information about the original signal lengths: as a number of frames/samples, a binary mask, or start-end frame/sample pairs.
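To make the padding and length bookkeeping concrete, here is a minimal sketch that uses nested lists in place of tensors. The function name `pad_and_collate` and the zero padding value are assumptions made for illustration; the real strategies operate on torch tensors.

```python
def pad_and_collate(matrices, pad_value=0.0):
    """Right-pad (T_i x F) matrices to the longest T; also return original lengths."""
    lens = [len(m) for m in matrices]
    max_len = max(lens)
    num_feats = len(matrices[0][0])
    batch = [
        [list(row) for row in m]
        + [[pad_value] * num_feats for _ in range(max_len - len(m))]
        for m in matrices
    ]
    return batch, lens


feats = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],  # 3 frames, 2 features
    [[7.0, 8.0]],                          # 1 frame, 2 features
]
batch, lens = pad_and_collate(feats)

# The lengths can be turned into a binary mask marking the real (unpadded) frames.
mask = [[1 if t < n else 0 for t in range(len(batch[0]))] for n in lens]
```

The collated batch has shape (B, T, F), and either `lens` or `mask` lets a model ignore the padded frames.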
Which strategy to choose?
In general, pre-computed features can be greatly compressed (we achieve 70% size reduction with regard to un-compressed features), and so the I/O load on your computing infrastructure will be much smaller than if you read the recordings directly. This is especially valuable when working with network file systems (NFS) that are typically used in computational grids for storage. When your experiment is I/O bound, then it is best to use pre-computed features.
When I/O is not the issue, it might be preferable to use on-the-fly computation as it shouldn’t require any prior steps to perform the network training. It is also simpler to apply a vast range of data augmentation methods in a fully randomized way (e.g. reverberation), although Lhotse provides support for approximate feature-domain signal mixing (e.g. for additive noise augmentation) to alleviate that to some extent.
Dataset’s list
- class lhotse.dataset.diarization.DiarizationDataset(cuts, uem=None, min_speaker_dim=None, global_speaker_ids=False)
A PyTorch Dataset for the speaker diarization task. Our assumptions about speaker diarization are the following:
- we assume a single channel input (for now), which could be either a true mono signal or a beamforming result from a microphone array.
- we assume that the supervision used for model training is a speech activity matrix, with one row dedicated to each speaker (either in the current cut or the whole dataset, depending on the settings). The columns correspond to feature frames. Each row is effectively a Voice Activity Detection supervision for a single speaker. This setup is somewhat inspired by the TS-VAD paper: https://arxiv.org/abs/2005.07272
Each item in this dataset is a dict of:
{
    'features': (B x T x F) tensor
    'features_lens': (B, ) tensor
    'speaker_activity': (B x num_speaker x T) tensor
}
Constructor arguments:
- Parameters
  - cuts (CutSet) – a CutSet used to create the dataset object.
  - uem (Optional[SupervisionSet]) – a SupervisionSet used to set regions for diarization.
  - min_speaker_dim (Optional[int]) – optional int; when specified, it enforces that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).
  - global_speaker_ids (bool) – a bool, indicates whether the same speaker should always retain the same row index in the speaker activity matrix (useful for speaker-dependent systems).
  - root_dir – a prefix path to be attached to the feature files paths.
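The speech activity matrix described above can be sketched as follows. The helper `speaker_activity_matrix`, the (speaker, start, end) segment format, and the 10 ms frame shift are assumptions for illustration, not the dataset's actual code.

```python
def speaker_activity_matrix(segments, speakers, num_frames,
                            frame_shift=0.01, min_speaker_dim=None):
    """Build a (num_speakers x num_frames) 0/1 matrix from (speaker, start, end) segments."""
    # min_speaker_dim pads the matrix with all-zero rows for absent speakers.
    num_rows = max(len(speakers), min_speaker_dim or 0)
    matrix = [[0] * num_frames for _ in range(num_rows)]
    for speaker, start, end in segments:
        row = speakers.index(speaker)
        first = int(round(start / frame_shift))
        last = min(num_frames, int(round(end / frame_shift)))
        for t in range(first, last):
            matrix[row][t] = 1
    return matrix


# Two speakers that overlap in the frame starting at 0.03 s.
segments = [("A", 0.00, 0.04), ("B", 0.03, 0.06)]
activity = speaker_activity_matrix(
    segments, speakers=["A", "B"], num_frames=8, min_speaker_dim=4
)
```

Each row is a per-speaker voice activity track, and overlapping speech shows up as a column with more than one active row.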
- class lhotse.dataset.unsupervised.UnsupervisedDataset
Dataset that contains no supervision - it only provides the features extracted from recordings.
{
    'features': (B x T x F) tensor
    'features_lens': (B, ) tensor
}
- class lhotse.dataset.unsupervised.UnsupervisedWaveformDataset(collate=True)
A variant of UnsupervisedDataset that provides waveform samples instead of features. The output is a tensor of shape (C, T), with C being the number of channels and T the number of audio samples. In this implementation, there will always be a single channel.
Returns:
{
    'audio': (B x NumSamples) float tensor
    'audio_lens': (B, ) int tensor
}
- class lhotse.dataset.unsupervised.DynamicUnsupervisedDataset(feature_extractor, augment_fn=None)
An example dataset that shows how to use on-the-fly feature extraction in Lhotse. It accepts two additional inputs: a FeatureExtractor and an optional WavAugmenter for time-domain data augmentation. The output is approximately the same as that of the UnsupervisedDataset. There might be slight differences for MixedCuts, because this dataset mixes them in the time domain, and UnsupervisedDataset does that in the feature domain. Cuts that are not mixed will yield identical results in both dataset classes.
- class lhotse.dataset.speech_recognition.K2SpeechRecognitionDataset(return_cuts=False, cut_transforms=None, input_transforms=None, input_strategy=<lhotse.dataset.input_strategies.PrecomputedFeatures object>)
The PyTorch Dataset for the speech recognition task using the k2 library.
This dataset expects to be queried with lists of cut IDs, for which it loads features and automatically collates/batches them.
To use it with a PyTorch DataLoader, set batch_size=None and provide a SingleCutSampler sampler.
Each item in this dataset is a dict of:
{
    'inputs': float tensor with shape determined by input_strategy:
              - single-channel:
                  - features: (B, T, F)
                  - audio: (B, T)
              - multi-channel: currently not supported
    'supervisions': [
        {
            'sequence_idx': Tensor[int] of shape (S,)
            'text': List[str] of len S

            # For feature input strategies
            'start_frame': Tensor[int] of shape (S,)
            'num_frames': Tensor[int] of shape (S,)

            # For audio input strategies
            'start_sample': Tensor[int] of shape (S,)
            'num_samples': Tensor[int] of shape (S,)

            # Optionally, when return_cuts=True
            'cut': List[AnyCut] of len S
        }
    ]
}
Dimension symbols legend:
- B - batch size (number of Cuts)
- S - number of supervision segments (greater or equal to B, as each Cut may have multiple supervisions)
- T - number of frames of the longest Cut
- F - number of features
The 'sequence_idx' field is the index of the Cut used to create the example in the Dataset.
- __init__(return_cuts=False, cut_transforms=None, input_transforms=None, input_strategy=<lhotse.dataset.input_strategies.PrecomputedFeatures object>)
k2 ASR IterableDataset constructor.
- Parameters
  - return_cuts (bool) – When True, will additionally return a "cut" field in each batch with the Cut objects used to create that batch.
  - cut_transforms (Optional[List[Callable[[CutSet], CutSet]]]) – A list of transforms to be applied on each sampled batch, before converting cuts to an input representation (audio/features). Examples: cut concatenation, noise cuts mixing, etc.
  - input_transforms (Optional[List[Callable[[Tensor], Tensor]]]) – A list of transforms to be applied on each sampled batch, after the cuts are converted to audio/features. Examples: normalization, SpecAugment, etc.
  - input_strategy (BatchIO) – Converts cuts into a collated batch of audio/features. By default, reads pre-computed features from disk.
- class lhotse.dataset.source_separation.DynamicallyMixedSourceSeparationDataset(sources_set, mixtures_set, nonsources_set=None)
A PyTorch Dataset for the source separation task. It’s created from a number of CutSets:
- sources_set: provides the audio cuts for the sources (the targets of source separation),
- mixtures_set: provides the audio cuts for the signal mix (the input of source separation),
- nonsources_set: (optional) provides the audio cuts for other signals that are in the mix, but are not the targets of source separation. Useful for adding noise.
When queried for data samples, it returns a dict of:
{
    'sources': (N x T x F) tensor,
    'mixture': (T x F) tensor,
    'real_mask': (N x T x F) tensor,
    'binary_mask': (T x F) tensor
}
This Dataset performs on-the-fly feature-domain mixing of the sources. It expects the mixtures_set to contain MixedCuts, so that it knows which Cuts should be mixed together.
- class lhotse.dataset.source_separation.PreMixedSourceSeparationDataset(sources_set, mixtures_set)
A PyTorch Dataset for the source separation task. It’s created from two CutSets: one provides the audio cuts for the sources, and the other one the audio cuts for the signal mix. When queried for data samples, it returns a dict of:
{
    'sources': (N x T x F) tensor,
    'mixture': (T x F) tensor,
    'real_mask': (N x T x F) tensor,
    'binary_mask': (T x F) tensor
}
It expects both CutSets to return regular Cuts, meaning that the signals were mixed in the time domain. In contrast to DynamicallyMixedSourceSeparationDataset, no on-the-fly feature-domain mixing is performed.
- class lhotse.dataset.vad.VadDataset(input_strategy=<lhotse.dataset.input_strategies.PrecomputedFeatures object>, cut_transforms=None, input_transforms=None)
The PyTorch Dataset for the voice activity detection task. Each item in this dataset is a dict of:
{
    'inputs': (B x T x F) tensor
    'input_lens': (B,) tensor
    'is_voice': (T x 1) tensor
    'cut': List[Cut]
}
Sampler’s list
Input strategies’ list
- class lhotse.dataset.input_strategies.BatchIO(num_workers=0)
Converts a CutSet into a collated batch of audio representations. These representations can be e.g. audio samples or features. They might also be single or multi channel.
All InputStrategies support the executor parameter in the constructor. It allows passing a ThreadPoolExecutor or a ProcessPoolExecutor to parallelize reading audio/features from wherever they are stored. Note that this approach is incompatible with specifying the num_workers to torch.utils.data.DataLoader, but in some instances may be faster.
Note
This is a base class that only defines the interface.
- __call__(cuts)
Returns a tensor with collated input signals, and a tensor with the length of each signal before padding.
- Return type
  Tuple[Tensor, IntTensor]
- supervision_intervals(cuts)
Returns a dict that specifies the start and end bounds for each supervision, as a 1-D int tensor.
Depending on the strategy, the dict should contain either the keys sequence_idx, start_frame and num_frames (feature-based strategies), or the keys sequence_idx, start_sample and num_samples (audio-based strategies), each holding a tensor of shape (S,).
Here S is the total number of supervisions encountered in the CutSet. Note that S might be different than the number of cuts (B). sequence_idx means the index of the corresponding feature matrix (or cut) in a batch.
- Return type
  Dict[str, Tensor]
- supervision_masks(cuts)
Returns a collated batch of masks, marking the supervised regions in cuts. They are zero-padded to the longest cut.
Depending on the strategy implementation, it is expected to be a tensor of shape (B, NF) or (B, NS), where B denotes the number of cuts, NF the number of frames and NS the total number of samples. NF and NS are determined by the longest cut in a batch.
- Return type
  Tensor
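The relation between supervision intervals and supervision masks can be sketched with plain lists. The key names follow the supervision dict shown for K2SpeechRecognitionDataset; the helper `masks_from_intervals` is hypothetical and only illustrates the idea.

```python
def masks_from_intervals(intervals, num_cuts, num_frames):
    """Turn per-supervision (sequence_idx, start_frame, num_frames) into (B, NF) masks."""
    masks = [[0] * num_frames for _ in range(num_cuts)]
    triples = zip(
        intervals["sequence_idx"], intervals["start_frame"], intervals["num_frames"]
    )
    for seq, start, length in triples:
        # Clip to the padded length of the batch.
        for t in range(start, min(start + length, num_frames)):
            masks[seq][t] = 1
    return masks


# S = 3 supervisions spread over B = 2 cuts: the first cut has two of them.
intervals = {
    "sequence_idx": [0, 0, 1],
    "start_frame": [0, 4, 1],
    "num_frames": [2, 2, 3],
}
masks = masks_from_intervals(intervals, num_cuts=2, num_frames=6)
```

This also shows why S can differ from B: a single cut may carry several supervisions, each contributing its own interval.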
- class lhotse.dataset.input_strategies.PrecomputedFeatures(num_workers=0)
InputStrategy that reads pre-computed features, whose manifests are attached to cuts, from disk.
It pads the feature matrices, if needed.
- __call__(cuts)
Reads the pre-computed features from disk/other storage. The returned shape is (B, T, F) => (batch_size, num_frames, num_features).
- Return type
  Tuple[Tensor, IntTensor]
- Returns
  a tensor with collated features, and a tensor of num_frames of each cut before padding.
- supervision_intervals(cuts)
Returns a dict that specifies the start and end bounds for each supervision, as a 1-D int tensor, in terms of frames (keys sequence_idx, start_frame and num_frames, each a tensor of shape (S,)).
Here S is the total number of supervisions encountered in the CutSet. Note that S might be different than the number of cuts (B). sequence_idx means the index of the corresponding feature matrix (or cut) in a batch.
- Return type
  Dict[str, Tensor]
- class lhotse.dataset.input_strategies.AudioSamples(num_workers=0)
InputStrategy that reads single-channel recordings, whose manifests are attached to cuts, from disk (or other audio source).
It pads the recordings, if needed.
- __call__(cuts)
Reads the audio samples from recordings on disk/other storage. The returned shape is (B, T) => (batch_size, num_samples).
- Return type
  Tuple[Tensor, IntTensor]
- Returns
  a tensor with collated audio samples, and a tensor of num_samples of each cut before padding.
- supervision_intervals(cuts)
Returns a dict that specifies the start and end bounds for each supervision, as a 1-D int tensor, in terms of samples (keys sequence_idx, start_sample and num_samples, each a tensor of shape (S,)).
Here S is the total number of supervisions encountered in the CutSet. Note that S might be different than the number of cuts (B). sequence_idx means the index of the corresponding feature matrix (or cut) in a batch.
- Return type
  Dict[str, Tensor]
- class lhotse.dataset.input_strategies.OnTheFlyFeatures(extractor, wave_transforms=None, num_workers=0, use_batch_extract=True)
InputStrategy that reads single-channel recordings, whose manifests are attached to cuts, from disk (or other audio source). Then, it uses a FeatureExtractor to compute their features on-the-fly.
It pads the recordings, if needed.
- __call__(cuts)
Reads the audio samples from recordings on disk/other storage and computes their features. The returned shape is (B, T, F) => (batch_size, num_frames, num_features).
- Return type
  Tuple[Tensor, IntTensor]
- Returns
  a tensor with collated features, and a tensor of num_frames of each cut before padding.
- __init__(extractor, wave_transforms=None, num_workers=0, use_batch_extract=True)
OnTheFlyFeatures’ constructor.
- Parameters
  - extractor (FeatureExtractor) – the feature extractor used on-the-fly (individually on each waveform).
  - wave_transforms (Optional[List[Callable[[Tensor], Tensor]]]) – an optional list of transforms applied on the batch of audio waveforms collated into a single tensor, right before the feature extraction.
  - use_batch_extract (bool) – when True, we will call extract_batch() to compute the features as it is possibly faster. It has a restriction that all cuts must have the same sampling rate. If that is not the case, set this to False.
- supervision_intervals(cuts)
Returns a dict that specifies the start and end bounds for each supervision, as a 1-D int tensor, in terms of frames (keys sequence_idx, start_frame and num_frames, each a tensor of shape (S,)).
Here S is the total number of supervisions encountered in the CutSet. Note that S might be different than the number of cuts (B). sequence_idx means the index of the corresponding feature matrix (or cut) in a batch.
- Return type
  Dict[str, Tensor]
Augmentation - transforms on cuts
Some transforms have to be performed on cuts (or CutSets), so that we keep accurate information about the start and end times of the signal and its supervisions.
- class lhotse.dataset.cut_transforms.CutConcatenate(gap=1.0, duration_factor=1.0)
A transform on a batch of cuts (CutSet) that concatenates the cuts to minimize the total amount of padding; e.g. instead of creating a batch with 40 examples, we will merge some of the examples together, adding some silence between them, to avoid a large number of padding frames that waste computation.
- __init__(gap=1.0, duration_factor=1.0)
CutConcatenate’s constructor.
- Parameters
  - gap (float) – The duration of silence in seconds that is inserted between the cuts; its goal is to let the model "know" that there are separate utterances in a single example.
  - duration_factor (float) – Determines the maximum duration of the concatenated cuts; by default it’s 1, setting the limit at the duration of the longest cut in the batch.
- class lhotse.dataset.cut_transforms.CutMix(cuts, snr=(10, 20), prob=0.5, pad_to_longest=True, preserve_id=False)
A transform for batches of cuts (CutSets) that stochastically performs noise augmentation with a constant or varying SNR.
- __init__(cuts, snr=(10, 20), prob=0.5, pad_to_longest=True, preserve_id=False)
CutMix’s constructor.
- Parameters
  - cuts (CutSet) – a CutSet containing augmentation data, e.g. noise, music, babble.
  - snr (Union[float, Tuple[float, float], None]) – either a float, a pair (range) of floats, or None. It determines the SNR of the speech signal vs the noise signal that’s mixed into it. When a range is specified, we will uniformly sample SNR in that range. When it’s None, the noise will be mixed as-is, i.e. without any level adjustment. Note that it’s different from snr=0, which will adjust the noise level so that the SNR is 0.
  - prob (float) – a float probability in range [0, 1]. Specifies the probability with which we will augment the cuts by mixing.
  - pad_to_longest (bool) – when True, each processed CutSet will be padded with noise to match the duration of the longest Cut in a batch.
  - preserve_id (bool) – When True, preserves the IDs the cuts had before augmentation. Otherwise, new random IDs are generated for the augmented cuts (default).
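The difference between snr=None and a numeric SNR can be sketched on raw sample lists. This is an illustration of the standard definition SNR = 10 * log10(P_speech / P_noise), with hypothetical helper names; it assumes equal-length signals and non-silent noise, and is not the CutMix implementation.

```python
import math


def power(signal):
    """Mean power of a signal given as a list of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that the speech-to-noise ratio equals snr_db, then add it."""
    if snr_db is None:
        gain = 1.0  # mix the noise as-is, without any level adjustment
    else:
        target_noise_power = power(speech) / (10 ** (snr_db / 10))
        gain = math.sqrt(target_noise_power / power(noise))
    mixed = [s + gain * n for s, n in zip(speech, noise)]
    return mixed, gain


speech = [0.5, -0.5, 0.5, -0.5]
noise = [0.1, 0.1, -0.1, -0.1]

mixed, gain = mix_at_snr(speech, noise, snr_db=10.0)
achieved_snr = 10 * math.log10(power(speech) / power([gain * n for n in noise]))

_, gain_as_is = mix_at_snr(speech, noise, snr_db=None)
```

With snr_db=0 the noise would be rescaled to the same power as the speech, which is why it differs from snr=None.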
- class lhotse.dataset.cut_transforms.ExtraPadding(extra_frames=None, extra_samples=None, extra_seconds=None, pad_feat_value=-23.025850929940457, randomized=False, preserve_id=False)
A transform on a batch of cuts (CutSet) that adds a number of extra context frames/samples/seconds on both sides of the cut. Exactly one type of duration has to be specified in the constructor.
It is intended mainly for training frame-synchronous ASR models with convolutional layers to avoid using padding inside of the hidden layers, by giving the model larger context in the input. Another useful application is to shift the input by a little, so that the data seen after frame subsampling is a bit different, which makes this a data augmentation technique.
This is best used as the first transform in the transform list for a dataset: it will ensure that each individual cut gets extra context before concatenation, or that it will be filled with noise, etc.
- __init__(extra_frames=None, extra_samples=None, extra_seconds=None, pad_feat_value=-23.025850929940457, randomized=False, preserve_id=False)
ExtraPadding’s constructor.
- Parameters
  - extra_frames (Optional[int]) – The total number of frames to add to each cut. We will add half that number on each side of the cut ("both" directions padding).
  - extra_samples (Optional[int]) – The total number of samples to add to each cut. We will add half that number on each side of the cut ("both" directions padding).
  - extra_seconds (Optional[float]) – The total duration in seconds to add to each cut. We will add half that number on each side of the cut ("both" directions padding).
  - pad_feat_value (float) – When padding a cut with precomputed features, what value should be used for padding (the default is a very low log-energy).
  - randomized (bool) – When True, we will sample a value from a uniform distribution of [0, extra_X] for each cut (for samples/frames, sample an int; for duration, sample a float).
  - preserve_id (bool) – When True, preserves the IDs the cuts had before augmentation. Otherwise, new random IDs are generated for the augmented cuts (default).
- class lhotse.dataset.cut_transforms.PerturbSpeed(factors, p, randgen=None, preserve_id=False)
A transform on a batch of cuts (CutSet) that perturbs the speed of the recordings with a given probability p.
If the effect is applied, then one of the perturbation factors from the constructor’s factors parameter is sampled with uniform probability.
- class lhotse.dataset.cut_transforms.PerturbTempo(factors, p, randgen=None, preserve_id=False)
A transform on a batch of cuts (CutSet) that perturbs the tempo of the recordings with a given probability p.
If the effect is applied, then one of the perturbation factors from the constructor’s factors parameter is sampled with uniform probability.
- class lhotse.dataset.cut_transforms.PerturbVolume(factors, p, randgen=None, preserve_id=False)
A transform on a batch of cuts (CutSet) that perturbs the volume of the recordings with a given probability p.
If the effect is applied, then one of the perturbation factors from the constructor’s factors parameter is sampled with uniform probability.
Augmentation - transforms on signals
These transforms work directly on batches of collated feature matrices (or possibly raw waveforms, if applicable).
- class lhotse.dataset.signal_transforms.GlobalMVN(feature_dim)
Apply global mean and variance normalization.
- __init__(feature_dim)
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(features, *args, **kwargs)
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- Return type
  Tensor
- training: bool
- class lhotse.dataset.signal_transforms.SpecAugment(time_warp_factor=80, num_feature_masks=1, features_mask_size=13, num_frame_masks=1, frames_mask_size=70, max_frames_mask_fraction=0.2, p=0.5)
SpecAugment performs three augmentations:
- time warping of the feature matrix
- masking of ranges of features (frequency bands)
- masking of ranges of frames (time)
The current implementation works with batches, but processes each example separately in a loop rather than simultaneously to achieve different augmentation parameters for each example.
- __init__(time_warp_factor=80, num_feature_masks=1, features_mask_size=13, num_frame_masks=1, frames_mask_size=70, max_frames_mask_fraction=0.2, p=0.5)
SpecAugment’s constructor.
- Parameters
  - time_warp_factor (Optional[int]) – parameter for the time warping; larger values mean more warping. Set to None, or less than 1, to disable.
  - num_feature_masks (int) – how many feature masks should be applied. Set to 0 to disable.
  - features_mask_size (int) – the width of the feature mask (expressed in the number of masked feature bins). This is the F parameter from the SpecAugment paper.
  - num_frame_masks (int) – how many frame (temporal) masks should be applied. Set to 0 to disable.
  - frames_mask_size (int) – the width of the frame (temporal) masks (expressed in the number of masked frames). This is the T parameter from the SpecAugment paper.
  - max_frames_mask_fraction (float) – limits the size of the frame (temporal) mask to this value times the length of the utterance (or supervision segment). This is the parameter denoted by p in the SpecAugment paper.
  - p – the probability of applying this transform. It is different from p in the SpecAugment paper!
- forward(features, supervision_segments=None, *args, **kwargs)
Computes SpecAugment for a batch of feature matrices.
Since the batch will usually already be padded, the user can optionally provide a supervision_segments tensor that will be used to apply SpecAugment only to selected areas of the input. The format of this input is described below.
- Parameters
  - features (Tensor) – a batch of feature matrices with shape (B, T, F).
  - supervision_segments (Optional[IntTensor]) – an int tensor of shape (S, 3). S is the number of supervision segments that exist in features; there may be either fewer or more than the batch size. The second dimension encodes three kinds of information: the sequence index of the corresponding feature matrix in features, the start frame index, and the number of frames for each segment.
- Return type
  Tensor
- Returns
  an augmented tensor of shape (B, T, F).
- training: bool
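How frames_mask_size and max_frames_mask_fraction interact can be sketched for the frame (temporal) masking step alone. The function `mask_frames` is hypothetical; masking with 0.0 and sampling the width and start uniformly are assumptions of this sketch, and time warping and feature masking are left out.

```python
import random


def mask_frames(features, num_frame_masks, frames_mask_size,
                max_frames_mask_fraction, rng, mask_value=0.0):
    """Mask random runs of consecutive frames in a (T x F) matrix (list of rows)."""
    num_frames = len(features)
    # The mask can never be wider than a fraction of the utterance length.
    max_width = min(frames_mask_size, int(max_frames_mask_fraction * num_frames))
    out = [list(row) for row in features]  # leave the input untouched
    for _ in range(num_frame_masks):
        width = rng.randint(0, max_width)
        start = rng.randint(0, num_frames - width)
        for t in range(start, start + width):
            out[t] = [mask_value] * len(out[t])
    return out


rng = random.Random(0)
features = [[1.0, 1.0, 1.0] for _ in range(20)]  # T=20 frames, F=3 features
augmented = mask_frames(
    features,
    num_frame_masks=1,
    frames_mask_size=70,
    max_frames_mask_fraction=0.2,
    rng=rng,
)
num_masked = sum(1 for row in augmented if row == [0.0, 0.0, 0.0])
```

For this 20-frame example the fraction limit (0.2 * 20 = 4 frames) is stricter than frames_mask_size=70, so short utterances are never masked out entirely.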
- class lhotse.dataset.signal_transforms.RandomizedSmoothing(sigma=0.1, sample_sigma=True, p=0.3)
Randomized smoothing: Gaussian noise added to an input waveform, or a batch of waveforms. The summed audio is clipped to [-1.0, 1.0] before returning.
- __init__(sigma=0.1, sample_sigma=True, p=0.3)
RandomizedSmoothing’s constructor.
- Parameters
  - sigma (Union[float, Sequence[Tuple[int, float]]]) – standard deviation of the Gaussian noise. Either a constant float, or a schedule, i.e. a list of tuples that specify which value to use from which step. For example, [(0, 0.01), (1000, 0.1)] means that from steps 0-999, the sigma value will be 0.01, and from step 1000 onwards, it will be 0.1.
  - sample_sigma (bool) – when False, then sigma is used as the standard deviation in each forward step. When True, the standard deviation is sampled from a uniform distribution of [-sigma, sigma] for each forward step.
  - p (float) – the probability of applying this transform.
- forward(audio, *args, **kwargs)
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- Return type
  Tensor
- training: bool
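The sigma schedule semantics described for RandomizedSmoothing can be sketched as a small lookup. The helper `sigma_at_step` is hypothetical and assumes the schedule is sorted by step; it only illustrates how a schedule entry is selected, not how the transform is implemented.

```python
def sigma_at_step(schedule, step):
    """Return the sigma in effect at `step` for a constant or a (step, value) schedule."""
    if isinstance(schedule, (int, float)):
        return float(schedule)  # a constant sigma
    current = schedule[0][1]
    for start_step, value in schedule:
        # Keep the value of the last entry whose start step has been reached.
        if step >= start_step:
            current = value
    return current


schedule = [(0, 0.01), (1000, 0.1)]
early = sigma_at_step(schedule, 999)
late = sigma_at_step(schedule, 1000)
```

A schedule like this lets the noise level start small and grow once training has stabilized.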
Collation utilities for building custom Datasets
- class lhotse.dataset.collation.TokenCollater(cuts, add_eos=True, add_bos=True, pad_symbol='<pad>', bos_symbol='<bos>', eos_symbol='<eos>', unk_symbol='<unk>')
Collate list of tokens.
Map sentences to integers. Sentences are padded to equal length. Beginning and end-of-sequence symbols can be added. Call .inverse(tokens_batch, tokens_lens) to reconstruct the batch as string sentences.
- Example:
  >>> token_collater = TokenCollater(cuts)
  >>> tokens_batch, tokens_lens = token_collater(cuts.subset(first=32))
  >>> original_sentences = token_collater.inverse(tokens_batch, tokens_lens)
- Returns:
  - tokens_batch: IntTensor of shape (B, L)
    B: batch dimension, number of input sentences
    L: length of the longest sentence
  - tokens_lens: IntTensor of shape (B,)
    Length of each sentence after adding <eos> and <bos> but before padding.
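The round trip between sentences and padded integer batches can be sketched with a toy collater. `CharCollater` is a hypothetical class that works on plain strings, tokenizes at the character level, and always adds <bos> and <eos>; these are assumptions of the sketch rather than a description of TokenCollater's internals.

```python
class CharCollater:
    """Toy character-level collater with <bos>/<eos> symbols and right padding."""

    def __init__(self, sentences, pad="<pad>", bos="<bos>", eos="<eos>", unk="<unk>"):
        self.pad, self.bos, self.eos, self.unk = pad, bos, eos, unk
        chars = sorted({ch for s in sentences for ch in s})
        self.idx2token = [pad, unk, bos, eos] + chars
        self.token2idx = {tok: i for i, tok in enumerate(self.idx2token)}

    def __call__(self, sentences):
        seqs = [[self.bos] + list(s) + [self.eos] for s in sentences]
        lens = [len(seq) for seq in seqs]  # lengths before padding
        max_len = max(lens)
        unk_id = self.token2idx[self.unk]
        pad_id = self.token2idx[self.pad]
        batch = [
            [self.token2idx.get(tok, unk_id) for tok in seq]
            + [pad_id] * (max_len - len(seq))
            for seq in seqs
        ]
        return batch, lens

    def inverse(self, batch, lens):
        # Drop the padding using the lengths, then strip <bos> and <eos>.
        return [
            "".join(self.idx2token[i] for i in row[1:n - 1])
            for row, n in zip(batch, lens)
        ]


collater = CharCollater(["hi", "hello"])
tokens_batch, tokens_lens = collater(["hi", "hello"])
restored = collater.inverse(tokens_batch, tokens_lens)
```

The lengths returned next to the batch are what makes the inverse mapping possible, since padding alone cannot be told apart from content without them.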
- lhotse.dataset.collation.collate_features(cuts, pad_direction='right', executor=None)
Load features for all the cuts and return them as a batch in a torch tensor. The output shape is (batch, time, features). The cuts will be padded with silence if necessary.
- Parameters
  - cuts (CutSet) – a CutSet used to load the features.
  - pad_direction (str) – where to apply the padding (right, left, or both).
  - executor (Optional[Executor]) – an instance of ThreadPoolExecutor or ProcessPoolExecutor; when provided, we will use it to read the features concurrently.
- Return type
  Tuple[Tensor, IntTensor]
- Returns
  a tuple of tensors (features, features_lens).
- lhotse.dataset.collation.collate_audio(cuts, pad_direction='right', executor=None)
Load audio samples for all the cuts and return them as a batch in a torch tensor. The output shape is (batch, time). The cuts will be padded with silence if necessary.
- Parameters
  - cuts (CutSet) – a CutSet used to load the audio samples.
  - pad_direction (str) – where to apply the padding (right, left, or both).
  - executor (Optional[Executor]) – an instance of ThreadPoolExecutor or ProcessPoolExecutor; when provided, we will use it to read audio concurrently.
- Return type
  Tuple[Tensor, IntTensor]
- Returns
  a tuple of tensors (audio, audio_lens).
- lhotse.dataset.collation.collate_multi_channel_features(cuts)
Load features for all the cuts and return them as a batch in a torch tensor. The cuts have to be of type MixedCut and their tracks will be interpreted as individual channels. The output shape is (batch, channel, time, features). The cuts will be padded with silence if necessary.
- Return type
  Tensor
- lhotse.dataset.collation.collate_multi_channel_audio(cuts)
Load audio samples for all the cuts and return them as a batch in a torch tensor. The cuts have to be of type MixedCut and their tracks will be interpreted as individual channels. The output shape is (batch, channel, time). The cuts will be padded with silence if necessary.
- Return type
  Tensor
- lhotse.dataset.collation.collate_vectors(tensors, padding_value=-100, matching_shapes=False)
Convert an iterable of 1-D tensors (of possibly various lengths) into a single stacked tensor.
- Parameters
  - tensors (Iterable[Union[Tensor, ndarray]]) – an iterable of 1-D tensors.
  - padding_value (Union[int, float]) – the padding value inserted to make all tensors have the same length.
  - matching_shapes (bool) – when True, will fail when input tensors have different shapes.
- Return type
  Tensor
- Returns
  a tensor with shape (B, L) where B is the number of input tensors and L is the number of items in the longest tensor.
- lhotse.dataset.collation.collate_matrices(tensors, padding_value=0, matching_shapes=False)
Convert an iterable of 2-D tensors (of possibly various first dimension, but consistent second dimension) into a single stacked tensor.
- Parameters
  - tensors (Iterable[Union[Tensor, ndarray]]) – an iterable of 2-D tensors.
  - padding_value (Union[int, float]) – the padding value inserted to make all tensors have the same length.
  - matching_shapes (bool) – when True, will fail when input tensors have different shapes.
- Return type
  Tensor
- Returns
  a tensor with shape (B, L, F) where B is the number of input tensors, L is the largest found shape[0], and F is equal to shape[1].
- lhotse.dataset.collation.maybe_pad(cuts, duration=None, num_frames=None, num_samples=None, direction='right')
Check if all cuts’ durations are equal and pad them to match the longest cut otherwise.
- Return type
Experimental: LhotseDataLoader
- class lhotse.dataset.dataloading.LhotseDataLoader(dataset, sampler, num_workers=1, prefetch_factor=2)
A simplified DataLoader implementation that relies on a ProcessPoolExecutor. The main difference between this and torch.utils.data.DataLoader is that LhotseDataLoader allows launching subprocesses inside of its workers. This is useful for working with dataset classes which perform dynamic batching and need to perform concurrent I/O to read all the necessary data from disk/network.
Note
LhotseDataLoader does not support num_workers=0.
Warning
LhotseDataLoader is experimental and not guaranteed to work correctly across all possible edge cases related to subprocess worker termination. If you experience stability problems, contact us or use a standard DataLoader instead.
Warning
LhotseDataLoader requires Python >= 3.7.