API Reference¶
This page contains a comprehensive list of all classes and functions within lhotse.
Datasets¶
PyTorch Dataset wrappers for common tasks.
Speech Recognition¶
-
class
lhotse.dataset.speech_recognition.
SpeechRecognitionDataset
(cuts)¶ The PyTorch Dataset for the speech recognition task. Each item in this dataset is a dict of:
{ 'features': (T x F) tensor, 'text': string, 'supervisions_mask': (T) tensor }
The
supervisions_mask
field is a mask that specifies which frames are covered by a supervision by assigning a value of 1 (in this case: segments with transcribed speech contents), and which are not by assigning a value of 0 (in this case: padding, contextual noise, or in general the acoustic context without transcription). In the future, it will be extended with graph supervisions.
-
__init__
(cuts)¶ Initialize self. See help(type(self)) for accurate signature.
-
-
class
lhotse.dataset.speech_recognition.
K2SpeechRecognitionIterableDataset
(cuts, max_frames=26000, max_cuts=None, shuffle=False, concat_cuts=True, concat_cuts_gap=1.0, concat_cuts_duration_factor=2)¶ The PyTorch Dataset for the speech recognition task using K2 library.
This dataset internally batches and collates the Cuts and should be used with PyTorch DataLoader with argument batch_size=None to work properly. The batch size is determined automatically to satisfy the constraints of max_frames and max_cuts.
This dataset will automatically partition itself when used with a multiprocessing DataLoader (i.e. the same cut will not appear twice in the same epoch).
By default, we “pack” the batches to minimize the amount of padding - we achieve that by concatenating the cuts’ feature matrices with a small amount of silence (padding) in between.
Each item in this dataset is a dict of:
{
    'features': float tensor of shape (B, T, F)
    'supervisions': [
        {
            'cut_id': List[str] of len S
            'sequence_idx': Tensor[int] of shape (S,)
            'text': List[str] of len S
            'start_frame': Tensor[int] of shape (S,)
            'num_frames': Tensor[int] of shape (S,)
        }
    ]
}
Dimension symbols legend:
* B - batch size (number of Cuts)
* S - number of supervision segments (greater or equal to B, as each Cut may have multiple supervisions)
* T - number of frames of the longest Cut
* F - number of features
The ‘sequence_idx’ field is the index of the Cut used to create the example in the Dataset.
-
__init__
(cuts, max_frames=26000, max_cuts=None, shuffle=False, concat_cuts=True, concat_cuts_gap=1.0, concat_cuts_duration_factor=2)¶ K2 ASR IterableDataset constructor.
- Parameters
cuts (CutSet) – the CutSet to sample data from.
max_frames (int) – The maximum number of feature frames that we’re going to put in a single batch. The padding frames do not contribute to that limit, since we pack the batch by default to minimize the amount of padding.
max_cuts (Optional[int]) – The maximum number of cuts sampled to form a mini-batch. By default, this constraint is off.
shuffle (bool) – When True, the cuts will be shuffled at the start of iteration. Convenient when the mini-batch loop is inside an outer epoch-level loop, e.g.:
    for epoch in range(10):
        for batch in dataset:
            …
as every epoch will see a different order of cuts.
concat_cuts (bool) – When True, we will concatenate the cuts to minimize the total amount of padding; e.g. instead of creating a batch with 40 examples, we will merge some of the examples together, adding some silence between them, to avoid a large number of padding frames that waste computation. Enabled by default.
concat_cuts_gap (float) – The duration of silence in seconds that is inserted between the cuts; its goal is to let the model “know” that there are separate utterances in a single example.
concat_cuts_duration_factor (float) – Determines the maximum duration of the concatenated cuts; by default it’s twice the duration of the longest cut in the batch.
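The interaction of max_frames and max_cuts can be illustrated with a small standalone sketch. It is not the lhotse implementation: the function name batch_by_constraints is hypothetical, cuts are represented only by their frame counts, and no concatenation or shuffling is performed.

```python
from typing import List, Optional


def batch_by_constraints(
    frame_counts: List[int],
    max_frames: int,
    max_cuts: Optional[int] = None,
) -> List[List[int]]:
    """Group cut indices into batches (hypothetical helper, not lhotse code).

    A batch is closed as soon as adding the next cut would exceed
    ``max_frames`` or the batch already holds ``max_cuts`` cuts.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    current_frames = 0
    for idx, num_frames in enumerate(frame_counts):
        too_many_frames = current_frames + num_frames > max_frames
        too_many_cuts = max_cuts is not None and len(current) >= max_cuts
        if current and (too_many_frames or too_many_cuts):
            batches.append(current)
            current, current_frames = [], 0
        current.append(idx)
        current_frames += num_frames
    if current:
        batches.append(current)
    return batches


# Five cuts described only by their number of feature frames.
batches = batch_by_constraints([1000, 1500, 800, 2000, 500], max_frames=3000, max_cuts=2)
print(batches)
```

Because the batch size varies from batch to batch, a dataset that batches internally like this is the reason the DataLoader has to be created with batch_size=None.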
-
-
lhotse.dataset.speech_recognition.
concat_cuts
(cuts, gap=1.0, max_duration=None)¶ We’re going to concatenate the cuts to minimize the amount of total padding frames used. This is actually solving a knapsack problem. In this initial implementation we’re using a greedy approach: going from the back (i.e. the shortest cuts) we’ll try to concat them to the longest cut that still has some “space” at the end.
- Parameters
cuts (List[Union[ForwardRef, ForwardRef, ForwardRef]]) – a list of cuts to pack.
gap (float) – the duration of silence inserted between concatenated cuts.
max_duration (Optional[float]) – the maximum duration for the concatenated cuts (by default set to the duration of the first cut).
- Returns
a list of packed cuts.
- Return type
List[Union[ForwardRef, ForwardRef, ForwardRef]]
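The greedy packing strategy described for concat_cuts can be sketched on plain durations. This is an illustration under assumptions rather than the library code: the name pack_durations is hypothetical, cuts are reduced to floats, and the input is sorted from longest to shortest so that the default max_duration equals the first (longest) duration.

```python
from typing import List, Optional


def pack_durations(
    durations: List[float],
    gap: float = 1.0,
    max_duration: Optional[float] = None,
) -> List[List[float]]:
    """Greedily pack short items behind longer ones (hypothetical helper)."""
    if not durations:
        return []
    durations = sorted(durations, reverse=True)
    if max_duration is None:
        max_duration = durations[0]
    groups = [[d] for d in durations]   # each group becomes one concatenated example
    totals = list(durations)            # total duration of each group, gaps included
    # Walk from the back (shortest first) and try to append the group
    # to the longest earlier group that still has room for it.
    for i in range(len(groups) - 1, 0, -1):
        for j in range(i):
            if totals[j] + gap + totals[i] <= max_duration:
                groups[j].extend(groups[i])
                totals[j] += gap + totals[i]
                del groups[i], totals[i]
                break
    return groups


packed = pack_durations([10.0, 6.0, 3.0, 2.0], gap=1.0)
print(packed)
```

In this example the 2-second item fits behind the 6-second one (6 + 1 + 2 = 9 seconds), while the 3-second item no longer fits anywhere and stays on its own.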
-
class
lhotse.dataset.speech_recognition.
K2SpeechRecognitionDataset
(cuts)¶ The PyTorch Dataset for the speech recognition task using K2 library. Each item in this dataset is a dict of:
{
    'features': (T x F) tensor,
    'supervisions': List[Dict] -> [
        {
            'sequence_idx': int
            'text': string,
            'start_frame': int,
            'num_frames': int
        } (multiplied N times, for each of the N supervisions present in the Cut)
    ]
}
The ‘sequence_idx’ field is the index of the Cut used to create the example in the Dataset. It is mapped to the batch index later in the DataLoader.
-
__init__
(cuts)¶ Initialize self. See help(type(self)) for accurate signature.
-
-
class
lhotse.dataset.speech_recognition.
K2DataLoader
(*args, **kwds)¶ A PyTorch DataLoader that has a custom collate_fn that complements the K2SpeechRecognitionDataset.
The ‘features’ tensor is collated in a standard way to return a tensor of shape (B, T, F).
The ‘supervisions’ dict contains the same fields as in K2SpeechRecognitionDataset, except that each sub-field (like ‘start_frame’) is a 1D PyTorch tensor with shape (B,). The ‘text’ sub-field is an exception - it’s a list of strings with length equal to batch size.
The ‘sequence_idx’ sub-field in ‘supervisions’, which originally points to the index of the example in the Dataset, is remapped to the index of the corresponding features matrix in the collated ‘features’. Multiple supervisions coming from the same cut will share the same ‘sequence_idx’.
For an example, see test/dataset/test_speech_recognition_dataset.py::test_k2_dataloader().
-
__init__
(*args, **kwargs)¶ Initialize self. See help(type(self)) for accurate signature.
-
dataset
¶
-
batch_size
¶
-
num_workers
¶
-
pin_memory
¶
-
drop_last
¶
-
timeout
¶
-
sampler
¶
-
prefetch_factor
¶
-
-
lhotse.dataset.speech_recognition.
multi_supervision_collate_fn
(batch)¶ Custom collate_fn for K2SpeechRecognitionDataset.
It merges the items provided by K2SpeechRecognitionDataset into the following structure:
{
    'features': float tensor of shape (B, T, F)
    'supervisions': [
        {
            'sequence_idx': Tensor[int] of shape (S,)
            'text': List[str] of len S
            'start_frame': Tensor[int] of shape (S,)
            'num_frames': Tensor[int] of shape (S,)
        }
    ]
}
Dimension symbols legend:
* B - batch size (number of Cuts),
* S - number of supervision segments (greater or equal to B, as each Cut may have multiple supervisions),
* T - number of frames of the longest Cut
* F - number of features
- Return type
Dict
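The merging and the remapping of ‘sequence_idx’ can be sketched without PyTorch. The following is a simplified illustration, not the actual collate function: the name collate_multi_supervision is hypothetical, nested Python lists stand in for tensors, shorter feature matrices are zero-padded to the longest one, and ‘supervisions’ is returned as a single dict.

```python
from typing import Any, Dict, List


def collate_multi_supervision(batch: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge per-cut items into one batch (hypothetical stand-in using lists)."""
    num_features = len(batch[0]["features"][0])
    max_frames = max(len(item["features"]) for item in batch)
    # Zero-pad every feature matrix to T = frames of the longest cut -> (B, T, F).
    features = [
        [list(frame) for frame in item["features"]]
        + [[0.0] * num_features for _ in range(max_frames - len(item["features"]))]
        for item in batch
    ]
    supervisions: Dict[str, list] = {
        "sequence_idx": [], "text": [], "start_frame": [], "num_frames": []
    }
    for batch_idx, item in enumerate(batch):
        for sup in item["supervisions"]:
            # The dataset-level index is replaced by the position in this batch.
            supervisions["sequence_idx"].append(batch_idx)
            supervisions["text"].append(sup["text"])
            supervisions["start_frame"].append(sup["start_frame"])
            supervisions["num_frames"].append(sup["num_frames"])
    return {"features": features, "supervisions": supervisions}


batch = [
    {
        "features": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
        "supervisions": [
            {"sequence_idx": 17, "text": "hello", "start_frame": 0, "num_frames": 1},
            {"sequence_idx": 17, "text": "world", "start_frame": 2, "num_frames": 1},
        ],
    },
    {
        "features": [[0.7, 0.8], [0.9, 1.0]],
        "supervisions": [
            {"sequence_idx": 42, "text": "hi", "start_frame": 0, "num_frames": 2},
        ],
    },
]
collated = collate_multi_supervision(batch)
print(collated["supervisions"]["sequence_idx"])
```

Here B = 2 and S = 3: both supervisions of the first cut share the batch index 0, which is what lets a downstream model look up the matching row in ‘features’.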
Source Separation¶
-
class
lhotse.dataset.source_separation.
SourceSeparationDataset
(sources_set, mixtures_set)¶ An abstract base class, implementing PyTorch Dataset for the source separation task. It’s created from two CutSets - one provides the audio cuts for the sources, and the other one the audio cuts for the signal mix. When queried for data samples, it returns a dict of:
{ 'sources': (N x T x F) tensor, 'mixture': (T x F) tensor, 'real_mask': (N x T x F) tensor, 'binary_mask': (T x F) tensor }
-
__init__
(sources_set, mixtures_set)¶ Initialize self. See help(type(self)) for accurate signature.
-
validate
()¶
-
-
class
lhotse.dataset.source_separation.
DynamicallyMixedSourceSeparationDataset
(sources_set, mixtures_set, nonsources_set=None)¶ A PyTorch Dataset for the source separation task. It’s created from a number of CutSets:
sources_set: provides the audio cuts for the sources (the targets of source separation),
mixtures_set: provides the audio cuts for the signal mix (the input of source separation),
nonsources_set: (optional) provides the audio cuts for other signals that are in the mix, but are not the targets of source separation. Useful for adding noise.
When queried for data samples, it returns a dict of:
{ 'sources': (N x T x F) tensor, 'mixture': (T x F) tensor, 'real_mask': (N x T x F) tensor, 'binary_mask': (T x F) tensor }
This Dataset performs on-the-fly feature-domain mixing of the sources. It expects the mixtures_set to contain MixedCuts, so that it knows which Cuts should be mixed together.
-
__init__
(sources_set, mixtures_set, nonsources_set=None)¶ Initialize self. See help(type(self)) for accurate signature.
-
class
lhotse.dataset.source_separation.
PreMixedSourceSeparationDataset
(sources_set, mixtures_set)¶ A PyTorch Dataset for the source separation task. It’s created from two CutSets - one provides the audio cuts for the sources, and the other one the audio cuts for the signal mix. When queried for data samples, it returns a dict of:
{ 'sources': (N x T x F) tensor, 'mixture': (T x F) tensor, 'real_mask': (N x T x F) tensor, 'binary_mask': (T x F) tensor }
It expects both CutSets to return regular Cuts, meaning that the signals were mixed in the time domain. In contrast to DynamicallyMixedSourceSeparationDataset, no on-the-fly feature-domain-mixing is performed.
-
__init__
(sources_set, mixtures_set)¶ Initialize self. See help(type(self)) for accurate signature.
-
Unsupervised¶
-
class
lhotse.dataset.unsupervised.
UnsupervisedDataset
(cuts)¶ Dataset that contains no supervision - it only provides the features extracted from recordings. The returned features are a torch.Tensor of shape (T x F), where T is the number of frames, and F is the feature dimension.
-
__init__
(cuts)¶ Initialize self. See help(type(self)) for accurate signature.
-
-
class
lhotse.dataset.unsupervised.
UnsupervisedWaveformDataset
(cuts)¶ A variant of UnsupervisedDataset that provides waveform samples instead of features. The output is a tensor of shape (C, T), with C being the number of channels and T the number of audio samples. In this implementation, there will always be a single channel.
-
class
lhotse.dataset.unsupervised.
DynamicUnsupervisedDataset
(feature_extractor, cuts, augment_fn=None)¶ An example dataset that shows how to use on-the-fly feature extraction in Lhotse. It accepts two additional inputs - a FeatureExtractor and an optional WavAugmenter for time-domain data augmentation. The output is approximately the same as that of the UnsupervisedDataset - there might be slight differences for MixedCuts, because this dataset mixes them in the time domain, and UnsupervisedDataset does that in the feature domain. Cuts that are not mixed will yield identical results in both dataset classes.
-
__init__
(feature_extractor, cuts, augment_fn=None)¶ Initialize self. See help(type(self)) for accurate signature.
-
Voice Activity Detection¶
Diarization (experimental)¶
-
class
lhotse.dataset.diarization.
DiarizationDataset
(cuts, min_speaker_dim=None, global_speaker_ids=False)¶ A PyTorch Dataset for the speaker diarization task. Our assumptions about speaker diarization are the following:
- we assume a single channel input (for now), which could be either a true mono signal or a beamforming result from a microphone array.
- we assume that the supervision used for model training is a speech activity matrix, with one row dedicated to each speaker (either in the current cut or the whole dataset, depending on the settings). The columns correspond to feature frames. Each row is effectively a Voice Activity Detection supervision for a single speaker. This setup is somewhat inspired by the TS-VAD paper: https://arxiv.org/abs/2005.07272
Each item in this dataset is a dict of:
{ 'features': (T x F) tensor 'speaker_activity': (num_speaker x T) tensor }
Constructor arguments:
- Parameters
cuts (CutSet) – a CutSet used to create the dataset object.
min_speaker_dim (Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).
global_speaker_ids (bool) – a bool, indicates whether the same speaker should always retain the same row index in the speaker activity matrix (useful for speaker-dependent systems).
root_dir – a prefix path to be attached to the feature files paths.
-
__init__
(cuts, min_speaker_dim=None, global_speaker_ids=False)¶ Initialize self. See help(type(self)) for accurate signature.
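The speaker activity matrix and the effect of min_speaker_dim can be illustrated with a small sketch. It makes several assumptions that are not taken from lhotse: the helper name speaker_activity_matrix is hypothetical, supervisions are given as (speaker, start, duration) tuples in seconds, rows are ordered by sorted speaker name, and frame boundaries are obtained by rounding.

```python
from typing import List, Optional, Tuple


def speaker_activity_matrix(
    segments: List[Tuple[str, float, float]],
    num_frames: int,
    frame_shift: float,
    min_speaker_dim: Optional[int] = None,
) -> List[List[int]]:
    """Build a (num_speakers x T) 0/1 matrix (hypothetical helper)."""
    speakers = sorted({speaker for speaker, _, _ in segments})
    # Pad with all-zero rows when fewer speakers are present than requested.
    num_rows = max(len(speakers), min_speaker_dim or 0)
    matrix = [[0] * num_frames for _ in range(num_rows)]
    for speaker, start, duration in segments:
        row = speakers.index(speaker)
        first = max(0, round(start / frame_shift))
        last = min(num_frames, round((start + duration) / frame_shift))
        for t in range(first, last):
            matrix[row][t] = 1
    return matrix


activity = speaker_activity_matrix(
    [("A", 0.0, 0.3), ("B", 0.2, 0.2)],
    num_frames=5,
    frame_shift=0.1,
    min_speaker_dim=4,
)
for row in activity:
    print(row)
```

Frame 2 is active for both speakers, which shows that overlapping speech is represented naturally; the two extra rows stay at zero because only two of the four expected speakers are present.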
Recording manifests¶
Data structures used for describing audio recordings in a dataset.
-
class
lhotse.audio.
AudioSource
(type: str, channels: List[int], source: str)¶ AudioSource represents audio data that can be retrieved from somewhere. Supported sources of audio are currently:
- ‘file’ (formats supported by librosa, possibly multi-channel)
- ‘command’ [unix pipe] (must be WAVE, possibly multi-channel)
-
type
: str¶
-
channels
: List[int]¶
-
source
: str¶
-
load_audio
(offset_seconds=0.0, duration_seconds=None)¶ Load the AudioSource (both files and commands) with librosa, accounting for many audio formats and multi-channel inputs. Returns numpy array with shapes: (n_samples) for single-channel, (n_channels, n_samples) for multi-channel.
- Return type
ndarray
-
with_path_prefix
(path)¶ - Return type
-
static
from_dict
(data)¶ - Return type
-
__init__
(type, channels, source)¶ Initialize self. See help(type(self)) for accurate signature.
-
-
lhotse.audio.
read_audio
(path, offset, duration)¶ - Return type
Tuple
[ndarray
,int
]
-
class
lhotse.audio.
Recording
(id: str, sources: List[lhotse.audio.AudioSource], sampling_rate: int, num_samples: int, duration: float)¶ Recording represents an AudioSource along with some metadata.
-
id
: str¶
-
sources
: List[AudioSource]¶
-
sampling_rate
: int¶
-
num_samples
: int¶
-
duration
: Seconds¶
-
static
from_sphere
(sph_path, relative_path_depth=None)¶ Read a SPHERE file’s header and create the corresponding Recording.
- Parameters
sph_path (Union[Path, str]) – Path to the sphere (.sph) file.
relative_path_depth (Optional[int]) – optional int specifying how many last parts of the file path should be retained in the AudioSource. By default writes the path as is.
- Return type
- Returns
a new Recording instance pointing to the sphere file.
-
property
num_channels
¶
-
property
channel_ids
¶
-
load_audio
(channels=None, offset_seconds=0.0, duration_seconds=None)¶ - Return type
ndarray
-
__init__
(id, sources, sampling_rate, num_samples, duration)¶ Initialize self. See help(type(self)) for accurate signature.
-
-
class
lhotse.audio.
RecordingSet
(*args, **kwds)¶ RecordingSet represents a dataset of recordings. It does not contain any annotation - just the information needed to retrieve a recording (possibly multi-channel, from files or from shell commands and pipes) and some metadata for each of them.
It also supports (de)serialization to/from YAML and takes care of mapping between rich Python classes and YAML primitives during conversion.
-
recordings
: Dict[str, Recording]¶
-
static
from_recordings
(recordings)¶ - Return type
-
static
from_dicts
(data)¶ - Return type
-
to_dicts
()¶ - Return type
List
[dict
]
-
filter
(predicate)¶ Return a new RecordingSet with the Recordings that satisfy the predicate.
- Parameters
predicate (Callable[[Recording], bool]) – a function that takes a recording as an argument and returns bool.
- Return type
- Returns
a filtered RecordingSet.
-
split
(num_splits, randomize=False)¶ Split the RecordingSet into num_splits pieces of equal size.
- Parameters
num_splits (int) – Requested number of splits.
randomize (bool) – Optionally randomize the recordings order first.
- Return type
List[RecordingSet]
- Returns
A list of RecordingSet pieces.
-
load_audio
(recording_id, channels=None, offset_seconds=0.0, duration_seconds=None)¶ - Return type
ndarray
-
with_path_prefix
(path)¶ - Return type
-
num_channels
(recording_id)¶ - Return type
int
-
sampling_rate
(recording_id)¶ - Return type
int
-
num_samples
(recording_id)¶ - Return type
int
-
duration
(recording_id)¶ - Return type
float
-
__init__
(recordings)¶ Initialize self. See help(type(self)) for accurate signature.
-
-
class
lhotse.audio.
AudioMixer
(base_audio, sampling_rate)¶ Utility class to mix multiple waveforms into a single one. It should be instantiated separately for each mixing session (i.e. each MixedCut will create a separate AudioMixer to mix its tracks). It is initialized with a numpy array of audio samples (typically float32 in [-1, 1] range) that represents the “reference” signal for the mix. Other signals can be mixed to it with different time offsets and SNRs using the add_to_mix method. The time offset is relative to the start of the reference signal (only positive values are supported). The SNR is relative to the energy of the signal used to initialize the AudioMixer.
-
__init__
(base_audio, sampling_rate)¶
- Parameters
base_audio (ndarray) – A numpy array with the audio samples for the base signal (all the other signals will be mixed to it).
sampling_rate (int) – Sampling rate of the audio.
-
property
unmixed_audio
¶ Return a numpy ndarray with the shape (num_tracks, num_samples), where each track is zero padded and scaled adequately to the offsets and SNR used in the add_to_mix call.
- Return type
ndarray
-
property
mixed_audio
¶ Return a numpy ndarray with the shape (1, num_samples) - a mono mix of the tracks supplied with add_to_mix calls.
- Return type
ndarray
-
add_to_mix
(audio, snr=None, offset=0.0)¶ Add the audio of a new track (only mono-channel audio is supported) into the mix.
- Parameters
audio (ndarray) – An array of audio samples to be mixed in.
snr (Optional[float]) – Signal-to-noise ratio, assuming audio represents noise (positive SNR - lower audio energy, negative SNR - higher audio energy).
offset (float) – How many seconds to shift audio in time. For mixing, the signal will be padded before the start with low energy values.
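The roles of snr and offset can be illustrated with a standalone sketch on plain Python lists. This is not the AudioMixer code: the helper names mean_power and mix_track are hypothetical, energy is assumed to be the mean of squared samples, and padding uses exact zeros instead of low energy values.

```python
import math
from typing import List, Optional


def mean_power(samples: List[float]) -> float:
    """Average squared amplitude, used here as the energy measure (assumption)."""
    return sum(s * s for s in samples) / len(samples)


def mix_track(
    reference: List[float],
    audio: List[float],
    sampling_rate: int,
    snr: Optional[float] = None,
    offset: float = 0.0,
) -> List[float]:
    """Mix ``audio`` into ``reference`` at a given SNR (dB) and time offset (s)."""
    pad = round(offset * sampling_rate)
    shifted = [0.0] * pad + list(audio)
    length = max(len(reference), len(shifted))
    ref = list(reference) + [0.0] * (length - len(reference))
    shifted += [0.0] * (length - len(shifted))
    gain = 1.0
    if snr is not None:
        # Scale the added track so that reference power / track power = 10^(snr/10).
        target_power = mean_power(reference) / 10 ** (snr / 10)
        gain = math.sqrt(target_power / mean_power(audio))
    return [r + gain * a for r, a in zip(ref, shifted)]


reference = [1.0, -1.0] * 4   # power 1.0
noise = [0.5, -0.5] * 2       # power 0.25
mixed = mix_track(reference, noise, sampling_rate=4, snr=20.0, offset=0.5)
print(mixed)
```

With an SNR of 20 dB the noise is scaled to one hundredth of the reference power, so its amplitude drops from 0.5 to 0.1, and the 0.5 second offset at 4 samples per second delays it by two samples.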
-
-
lhotse.audio.
audio_energy
(audio)¶ - Return type
float
Supervision manifests¶
Data structures used for describing supervisions in a dataset.
-
class
lhotse.supervision.
SupervisionSegment
(id: str, recording_id: str, start: float, duration: float, channel: int = 0, text: Union[str, NoneType] = None, language: Union[str, NoneType] = None, speaker: Union[str, NoneType] = None, gender: Union[str, NoneType] = None, custom: Union[Dict[str, Any], NoneType] = None)¶ -
id
: str¶
-
recording_id
: str¶
-
start
: Seconds¶
-
duration
: Seconds¶
-
channel
: int = 0¶
-
text
: Optional[str] = None¶
-
language
: Optional[str] = None¶
-
speaker
: Optional[str] = None¶
-
gender
: Optional[str] = None¶
-
custom
: Optional[Dict[str, Any]] = None¶
-
property
end
¶ - Return type
float
-
with_offset
(offset)¶ Return an identical SupervisionSegment, but with the offset added to the start field.
- Return type
-
trim
(end)¶ Return an identical SupervisionSegment, but ensure that self.start is not negative (in which case it’s set to 0) and self.end does not exceed the end parameter.
This method is useful for ensuring that the supervision does not exceed a cut’s bounds, in which case pass cut.duration as the end argument, since supervision times are relative to the cut.
- Return type
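The behaviour of with_offset and trim can be sketched with a minimal stand-in class. The class Segment below is hypothetical and keeps only the start and duration fields; it is meant to show the arithmetic, not to mirror SupervisionSegment exactly.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Segment:
    """Hypothetical, reduced stand-in for a supervision segment."""
    start: float
    duration: float

    @property
    def end(self) -> float:
        return self.start + self.duration

    def with_offset(self, offset: float) -> "Segment":
        # Shift the segment in time; the duration is unchanged.
        return replace(self, start=self.start + offset)

    def trim(self, end: float) -> "Segment":
        # Clamp the segment to the [0, end] range.
        new_start = max(0.0, self.start)
        new_end = min(end, self.end)
        return replace(self, start=new_start, duration=new_end - new_start)


# A supervision at 1.0 s in the recording, seen from a cut that starts at 1.5 s
# and lasts 2.0 s: shift first, then trim to the cut's duration.
shifted = Segment(start=1.0, duration=3.0).with_offset(-1.5)
trimmed = shifted.trim(end=2.0)
print(shifted, trimmed)
```

After the shift the segment starts at -0.5 s relative to the cut; trimming moves the start to 0 and caps the end at the cut's duration of 2.0 s.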
-
map
(transform_fn)¶ Return a copy of the current segment, transformed with transform_fn.
- Parameters
transform_fn (Callable[[SupervisionSegment], SupervisionSegment]) – a function that takes a segment as input, transforms it and returns a new segment.
- Return type
- Returns
a modified SupervisionSegment.
-
transform_text
(transform_fn)¶ Return a copy of the current segment with transformed text field. Useful for text normalization, phonetic transcription, etc.
- Parameters
transform_fn (Callable[[str], str]) – a function that accepts a string and returns a string.
- Return type
- Returns
a SupervisionSegment with adjusted text.
-
static
from_dict
(data)¶ - Return type
-
__init__
(id, recording_id, start, duration, channel=0, text=None, language=None, speaker=None, gender=None, custom=None)¶ Initialize self. See help(type(self)) for accurate signature.
-
-
class
lhotse.supervision.
SupervisionSet
(*args, **kwds)¶ SupervisionSet represents a collection of segments containing some supervision information. The only required fields are the ID of the segment, ID of the corresponding recording, and the start and duration of the segment in seconds. All other fields, such as text, language or speaker, are deliberately optional to support a wide range of tasks, as well as adding more supervision types in the future, while retaining backwards compatibility.
-
segments
: Dict[str, SupervisionSegment]¶
-
static
from_segments
(segments)¶ - Return type
-
static
from_dicts
(data)¶ - Return type
-
to_dicts
()¶ - Return type
List
[dict
]
-
split
(num_splits, randomize=False)¶ Split the SupervisionSet into num_splits pieces of equal size.
- Parameters
num_splits (int) – Requested number of splits.
randomize (bool) – Optionally randomize the supervisions order first.
- Return type
List[SupervisionSet]
- Returns
A list of SupervisionSet pieces.
-
filter
(predicate)¶ Return a new SupervisionSet with the SupervisionSegments that satisfy the predicate.
- Parameters
predicate (Callable[[SupervisionSegment], bool]) – a function that takes a supervision as an argument and returns bool.
- Return type
- Returns
a filtered SupervisionSet.
-
map
(transform_fn)¶ Map a transform_fn to the SupervisionSegments and return a new SupervisionSet.
- Parameters
transform_fn (Callable[[SupervisionSegment], SupervisionSegment]) – a function that takes a supervision as an argument and returns a modified supervision.
- Return type
- Returns
a new SupervisionSet with modified segments.
-
transform_text
(transform_fn)¶ Return a copy of the current SupervisionSet with the segments having a transformed text field. Useful for text normalization, phonetic transcription, etc.
- Parameters
transform_fn (Callable[[str], str]) – a function that accepts a string and returns a string.
- Return type
- Returns
a SupervisionSet with adjusted text.
-
find
(recording_id, channel=None, start_after=0, end_before=None, adjust_offset=False)¶ Return an iterable of segments that match the provided recording_id.
- Parameters
recording_id (str) – Desired recording ID.
channel (Optional[int]) – When specified, return supervisions in that channel - otherwise, in all channels.
start_after (float) – When specified, return segments that start after the given value.
end_before (Optional[float]) – When specified, return segments that end before the given value.
adjust_offset (bool) – When true, return segments as if the recordings had started at start_after. This is useful for creating Cuts. From a user perspective, when dealing with a Cut, it is no longer helpful to know when the supervision starts in a recording - instead, it’s useful to know when the supervision starts relative to the start of the Cut. In the anticipated use-case, start_after and end_before would be the beginning and end of a cut; this option converts the times to be relative to the start of the cut.
- Return type
Iterable[SupervisionSegment]
- Returns
An iterator over supervision segments satisfying all criteria.
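The filtering criteria and the effect of adjust_offset can be sketched with plain dicts. This is an illustration only: the function find_segments is hypothetical, segments are dicts rather than SupervisionSegment objects, and the boundaries are treated as inclusive, which is an assumption.

```python
from typing import Dict, Iterable, Iterator, Optional


def find_segments(
    segments: Iterable[Dict],
    recording_id: str,
    channel: Optional[int] = None,
    start_after: float = 0.0,
    end_before: Optional[float] = None,
    adjust_offset: bool = False,
) -> Iterator[Dict]:
    """Yield the segments matching all criteria (hypothetical helper)."""
    for seg in segments:
        if seg["recording_id"] != recording_id:
            continue
        if channel is not None and seg["channel"] != channel:
            continue
        if seg["start"] < start_after:
            continue
        if end_before is not None and seg["start"] + seg["duration"] > end_before:
            continue
        if adjust_offset:
            # Express the start time relative to ``start_after`` (the cut's start).
            seg = dict(seg, start=seg["start"] - start_after)
        yield seg


segments = [
    {"id": "s1", "recording_id": "rec-1", "channel": 0, "start": 1.0, "duration": 2.0},
    {"id": "s2", "recording_id": "rec-1", "channel": 0, "start": 5.0, "duration": 1.0},
    {"id": "s3", "recording_id": "rec-1", "channel": 0, "start": 12.0, "duration": 1.0},
    {"id": "s4", "recording_id": "rec-2", "channel": 0, "start": 5.0, "duration": 1.0},
]
# Supervisions for a cut spanning 4.0 s to 10.0 s of "rec-1".
found = list(find_segments(segments, "rec-1", start_after=4.0, end_before=10.0, adjust_offset=True))
print(found)
```

Only s2 lies inside the 4.0 to 10.0 second window of the requested recording, and with adjust_offset enabled its start is reported as 1.0 s, i.e. relative to the start of the cut.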
-
__init__
(segments, _segments_by_recording_id=None)¶ Initialize self. See help(type(self)) for accurate signature.
-
Feature extraction and manifests¶
Data structures and tools used for feature extraction and description.
Features API - extractor and manifests¶
-
class
lhotse.features.base.
FeatureExtractor
(config=None)¶ The base class for all feature extractors in Lhotse. It is initialized with a config object, specific to a particular feature extraction method. The config is expected to be a dataclass so that it can be easily serialized.
All derived feature extractors must implement at least the following:
a
name
class attribute (how are these features called, e.g. ‘mfcc’)a
config_type
class attribute that points to the configuration dataclass typethe
extract
method,the
frame_shift
property.
Feature extractors that support feature-domain mixing should additionally specify two static methods:
compute_energy
, andmix
.
By itself, the
FeatureExtractor
offers the following high-level methods that are not intended for overriding:extract_from_samples_and_store
extract_from_recording_and_store
These methods run a larger feature extraction pipeline that involves data augmentation and disk storage.
-
name
= None¶
-
config_type
= None¶
-
__init__
(config=None)¶ Initialize self. See help(type(self)) for accurate signature.
-
abstract
extract
(samples, sampling_rate)¶ Defines how to extract features using a numpy ndarray of audio samples and the sampling rate.
- Return type
ndarray
- Returns
a numpy ndarray representing the feature matrix.
-
abstract property
frame_shift
¶ - Return type
float
-
abstract
feature_dim
(sampling_rate)¶ - Return type
int
-
static
mix
(features_a, features_b, energy_scaling_factor_b)¶ Perform feature-domain mix of two signals, a and b, and return the mixed signal.
- Parameters
features_a (ndarray) – Left-hand side (reference) signal.
features_b (ndarray) – Right-hand side (mixed-in) signal.
energy_scaling_factor_b (float) – A scaling factor for features_b energy. It is used to achieve a specific SNR. E.g. to mix with an SNR of 10dB when both features_a and features_b energies are 100, the features_b signal energy needs to be scaled by 0.1. Since different features (e.g. spectrogram, fbank, MFCC) require different combinations of transformations (e.g. exp, log, sqrt, pow) to allow mixing of two signals, the exact place where to apply energy_scaling_factor_b to the signal is determined by the implementer.
- Return type
ndarray
- Returns
A mixed feature matrix.
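The SNR arithmetic from the description of energy_scaling_factor_b can be checked with a short sketch. Both helper names are hypothetical, and mix_log_power assumes features that are natural-log power values, which is only one of the feature types the text mentions.

```python
import math
from typing import List


def energy_scaling_factor(energy_a: float, energy_b: float, snr_db: float) -> float:
    """Factor for the energy of b so that energy_a / (factor * energy_b) = 10^(snr/10)."""
    return energy_a / (energy_b * 10 ** (snr_db / 10))


def mix_log_power(features_a: List[float], features_b: List[float], factor_b: float) -> List[float]:
    """Mix two log-power feature vectors (assumed feature type, for illustration).

    The values are moved to the linear power domain, added with the scaling
    factor applied to b, and moved back to the log domain.
    """
    return [math.log(math.exp(a) + factor_b * math.exp(b)) for a, b in zip(features_a, features_b)]


# The worked example from the parameter description: equal energies, 10 dB SNR.
factor = energy_scaling_factor(energy_a=100.0, energy_b=100.0, snr_db=10.0)
mixed = mix_log_power([0.0, 1.0], [0.0, 1.0], factor_b=1.0)
print(factor, mixed)
```

For other feature types the place where the factor is applied differs, which is why the base class leaves mix to each extractor.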
-
static
compute_energy
(features)¶ Compute the total energy of a feature matrix. How the energy is computed depends on a particular type of features. It is expected that when implemented, compute_energy will never return zero.
- Parameters
features (ndarray) – A feature matrix.
- Return type
float
- Returns
A positive float value of the signal energy.
-
extract_from_samples_and_store
(samples, storage, sampling_rate, offset=0, augment_fn=None)¶ Extract the features from an array of audio samples in a full pipeline:
optional audio augmentation;
extract the features;
save them to disk in a specified directory;
return a Features object with a description of the extracted features.
Note, unlike in extract_from_recording_and_store, the returned Features object might not be suitable to store in a FeatureSet, as it does not reference any particular Recording. Instead, this method is useful when extracting features from cuts - especially MixedCut instances, which may be created from multiple recordings and channels.
- Parameters
samples (ndarray) – a numpy ndarray with the audio samples.
sampling_rate (int) – integer sampling rate of samples.
storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices.
offset (float) – an offset in seconds for where to start reading the recording - when used for Cut feature extraction, must be equal to Cut.start.
augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.
- Returns
a Features manifest item for the extracted feature matrix (it is not written to disk).
-
extract_from_recording_and_store
(recording, storage, offset=0, duration=None, channels=None, augment_fn=None)¶ Extract the features from a Recording in a full pipeline:
load audio from disk;
optionally, perform audio augmentation;
extract the features;
save them to disk in a specified directory;
return a Features object with a description of the extracted features and the source data used.
- Parameters
recording (Recording) – a Recording that specifies what’s the input audio.
storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices.
offset (float) – an optional offset in seconds for where to start reading the recording.
duration (Optional[float]) – an optional duration specifying how much audio to load from the recording.
channels (Union[int, List[int], None]) – an optional int or list of ints, specifying the channels; by default, all channels will be used.
augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.
- Returns
a Features manifest item for the extracted feature matrix.
-
classmethod
from_dict
(data)¶ - Return type
-
classmethod
from_yaml
(path)¶ - Return type
-
to_yaml
(path)¶
-
lhotse.features.base.
get_extractor_type
(name)¶ Return the feature extractor type corresponding to the given name.
- Parameters
name (str) – specifies which feature extractor should be used.
- Return type
Type
- Returns
A feature extractor’s type.
-
lhotse.features.base.
create_default_feature_extractor
(name)¶ Create a feature extractor object with a default configuration.
- Parameters
name (str) – specifies which feature extractor should be used.
- Return type
Optional[FeatureExtractor]
- Returns
A new feature extractor instance.
-
lhotse.features.base.
register_extractor
(cls)¶ This decorator is used to register feature extractor classes in Lhotse so they can be easily created just by knowing their name.
An example of usage:
    @register_extractor
    class MyFeatureExtractor:
        …
- Parameters
cls – A type (class) that is being registered.
- Returns
Registered type.
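The name-based registry behind this decorator can be sketched in a few lines. The code below is a self-contained approximation, not the lhotse source: the EXTRACTORS dictionary is a hypothetical name, the three functions are local stand-ins that only mirror the documented names, and the class is assumed to expose its registry key through the name class attribute described for FeatureExtractor.

```python
from typing import Dict, Type

# Hypothetical module-level registry mapping extractor names to classes.
EXTRACTORS: Dict[str, Type] = {}


def register_extractor(cls):
    """Class decorator: record ``cls`` under its ``name`` attribute and return it unchanged."""
    EXTRACTORS[cls.name] = cls
    return cls


def get_extractor_type(name: str) -> Type:
    """Look up a registered extractor class by name."""
    return EXTRACTORS[name]


def create_default_feature_extractor(name: str):
    """Instantiate a registered extractor with its default configuration."""
    return get_extractor_type(name)()


@register_extractor
class MyFeatureExtractor:
    name = "my-features"

    def __init__(self, config=None):
        self.config = config


extractor = create_default_feature_extractor("my-features")
print(type(extractor).__name__)
```

Returning the class from the decorator is what keeps MyFeatureExtractor usable as a normal class after registration.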
-
class
lhotse.features.base.
TorchaudioFeatureExtractor
(config=None)¶ Common abstract base class for all torchaudio based feature extractors.
-
feature_fn
= None¶
-
extract
(samples, sampling_rate)¶ Defines how to extract features using a numpy ndarray of audio samples and the sampling rate.
- Return type
ndarray
- Returns
a numpy ndarray representing the feature matrix.
-
property
frame_shift
¶ - Return type
float
-
-
class
lhotse.features.base.
Features
(type: str, num_frames: int, num_features: int, sampling_rate: int, start: float, duration: float, storage_type: str, storage_path: str, storage_key: str, recording_id: Optional[str] = None, channels: Optional[Union[int, List[int]]] = None)¶ Represents features extracted for some particular time range in a given recording and channel. It contains metadata about how it’s stored: storage_type describes “how to read it”, for now it supports numpy arrays serialized with np.save, as well as arrays compressed with lilcom; storage_path is the path to the file on the local filesystem.
-
type
: str¶
-
num_frames
: int¶
-
num_features
: int¶
-
sampling_rate
: int¶
-
start
: Seconds¶
-
duration
: Seconds¶
-
storage_type
: str¶
-
storage_path
: str¶
-
storage_key
: str¶
-
recording_id
: Optional[str] = None¶
-
channels
: Optional[Union[int, List[int]]] = None¶
-
property
end
¶ - Return type
float
-
property
frame_shift
¶ - Return type
float
-
load
(start=None, duration=None)¶ - Return type
ndarray
-
__init__
(type, num_frames, num_features, sampling_rate, start, duration, storage_type, storage_path, storage_key, recording_id=None, channels=None)¶ Initialize self. See help(type(self)) for accurate signature.
-
-
class
lhotse.features.base.
FeatureSet
(*args, **kwds)¶ Represents a feature manifest, and allows to read features for given recordings within particular channels and time ranges. It also keeps information about the feature extractor parameters used to obtain this set. When a given recording/time-range/channel is unavailable, raises a KeyError.
-
features
: List[Features]¶
-
static
from_features
(features)¶ - Return type
-
static
from_dicts
(data)¶ - Return type
-
to_dicts
()¶ - Return type
List
[dict
]
-
with_path_prefix
(path)¶ - Return type
-
split
(num_splits, randomize=False)¶ Split the
FeatureSet
intonum_splits
pieces of equal size.- Parameters
num_splits (
int
) – Requested number of splits.randomize (
bool
) – Optionally randomize the features order first.
- Return type
List
[FeatureSet
]- Returns
A list of
FeatureSet
pieces.
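The idea of an even split can be sketched on a plain Python list. `split_evenly` is a hypothetical helper and not part of lhotse; how lhotse itself distributes a remainder may differ.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_evenly(items: Sequence[T], num_splits: int, randomize: bool = False) -> List[List[T]]:
    """Hypothetical helper: partition ``items`` into at most ``num_splits`` contiguous chunks."""
    if num_splits < 1:
        raise ValueError("num_splits must be a positive integer")
    items = list(items)
    if randomize:
        random.shuffle(items)  # shuffles the local copy only
    # Ceiling division, so that no more than ``num_splits`` chunks are produced.
    chunk_size = max(1, -(-len(items) // num_splits))
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


pieces = split_evenly(list(range(10)), num_splits=5)
```

With ten items and five splits, every piece holds two items and the original order is preserved unless `randomize=True`.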
-
find
(recording_id, channel_id=0, start=0.0, duration=None, leeway=0.05)¶ Find and return a Features object that best satisfies the search criteria. Raise a KeyError when no such object is available.
- Parameters
recording_id (
str
) – str, requested recording ID.channel_id (
int
) – int, requested channel.start (
float
) – float, requested start time in seconds for the feature chunk.duration (
Optional
[float
]) – optional float, requested duration in seconds for the feature chunk. By default, return everything from the start.leeway (
float
) – float, controls how strictly we have to match the requested start and duration criteria. It is necessary to keep a small positive value here (default 0.05s), as there might be differences between the duration of the recording/supervision segment and the duration of the features. The latter is constrained to be a multiple of frame_shift, while the former can be arbitrary.
- Return type
- Returns
a Features object satisfying the search criteria.
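The role of `leeway` can be illustrated with a small standalone check. This is only a sketch of the matching idea, assuming a feature chunk qualifies when it covers the requested range up to `leeway` seconds on either side; `satisfies` is a hypothetical name and the actual search logic in lhotse may differ.

```python
def satisfies(
    feat_start: float,
    feat_duration: float,
    req_start: float,
    req_duration: float,
    leeway: float = 0.05,
) -> bool:
    """Hypothetical check: does a feature chunk cover the requested range, within ``leeway``?"""
    feat_end = feat_start + feat_duration
    req_end = req_start + req_duration
    return feat_start - leeway <= req_start and req_end <= feat_end + leeway


# A 3.017 s segment whose features span 3.01 s (301 frames with a 0.01 s frame shift).
strict = satisfies(0.0, 3.01, 0.0, 3.017, leeway=0.0)
lenient = satisfies(0.0, 3.01, 0.0, 3.017, leeway=0.05)
```

With zero leeway the request fails because the features are 7 ms shorter than the segment; the default leeway absorbs that rounding difference.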
-
load
(recording_id, channel_id=0, start=0.0, duration=None)¶ Find a Features object that best satisfies the search criteria and load the features as a numpy ndarray. Raise a KeyError when no such object is available.
- Return type
ndarray
-
__init__
(features=<factory>, _features_by_recording_id=None)¶ Initialize self. See help(type(self)) for accurate signature.
-
-
class
lhotse.features.base.
FeatureSetBuilder
(feature_extractor, storage, augment_fn=None)¶ An extended constructor for the FeatureSet. Think of it as a class wrapper for a feature extraction script. It consumes an iterable of Recordings, extracts the features specified by the FeatureExtractor config, and stores them on disk.
Eventually, we plan to extend it with the capability to extract only the features in specified regions of recordings and to perform some time-domain data augmentation.
-
__init__
(feature_extractor, storage, augment_fn=None)¶ Initialize self. See help(type(self)) for accurate signature.
-
process_and_store_recordings
(recordings, output_manifest=None, num_jobs=1)¶ - Return type
-
-
lhotse.features.base.
store_feature_array
(feats, storage)¶ Store
feats
array on disk, usinglilcom
compression by default.- Parameters
feats (
ndarray
) – a numpy ndarray containing features.storage (
FeaturesWriter
) – aFeaturesWriter
object to use for array storage.
- Return type
str
- Returns
a path to the file containing the stored array.
Torchaudio feature extractors¶
-
class
lhotse.features.fbank.
FbankConfig
(dither: float = 0.0, window_type: str = 'povey', frame_length: float = 0.025, frame_shift: float = 0.01, remove_dc_offset: bool = True, round_to_power_of_two: bool = True, energy_floor: float = 1e-10, min_duration: float = 0.0, preemphasis_coefficient: float = 0.97, raw_energy: bool = True, low_freq: float = 20.0, high_freq: float = - 400.0, num_mel_bins: int = 40, use_energy: bool = False, vtln_low: float = 100.0, vtln_high: float = - 500.0, vtln_warp: float = 1.0)¶ -
dither
: float = 0.0¶
-
window_type
: str = 'povey'¶
-
frame_length
: float = 0.025¶
-
frame_shift
: float = 0.01¶
-
remove_dc_offset
: bool = True¶
-
round_to_power_of_two
: bool = True¶
-
energy_floor
: float = 1e-10¶
-
min_duration
: float = 0.0¶
-
preemphasis_coefficient
: float = 0.97¶
-
raw_energy
: bool = True¶
-
low_freq
: float = 20.0¶
-
high_freq
: float = -400.0¶
-
num_mel_bins
: int = 40¶
-
use_energy
: bool = False¶
-
vtln_low
: float = 100.0¶
-
vtln_high
: float = -500.0¶
-
vtln_warp
: float = 1.0¶
-
__init__
(dither=0.0, window_type='povey', frame_length=0.025, frame_shift=0.01, remove_dc_offset=True, round_to_power_of_two=True, energy_floor=1e-10, min_duration=0.0, preemphasis_coefficient=0.97, raw_energy=True, low_freq=20.0, high_freq=- 400.0, num_mel_bins=40, use_energy=False, vtln_low=100.0, vtln_high=- 500.0, vtln_warp=1.0)¶ Initialize self. See help(type(self)) for accurate signature.
-
-
class
lhotse.features.fbank.
Fbank
(config=None)¶ Log Mel energy filter bank feature extractor based on
torchaudio.compliance.kaldi.fbank
function.-
name
= 'fbank'¶
-
config_type
¶ alias of
FbankConfig
-
feature_dim
(sampling_rate)¶ - Return type
int
-
static
mix
(features_a, features_b, energy_scaling_factor_b)¶ Perform feature-domain mix of two signals,
a
andb
, and return the mixed signal.- Parameters
features_a (
ndarray
) – Left-hand side (reference) signal.features_b (
ndarray
) – Right-hand side (mixed-in) signal.energy_scaling_factor_b (
float
) – A scaling factor forfeatures_b
energy. It is used to achieve a specific SNR. E.g. to mix with an SNR of 10dB when bothfeatures_a
andfeatures_b
energies are 100, thefeatures_b
signal energy needs to be scaled by 0.1. Since different features (e.g. spectrogram, fbank, MFCC) require different combination of transformations (e.g. exp, log, sqrt, pow) to allow mixing of two signals, the exact place where to applyenergy_scaling_factor_b
to the signal is determined by the implementer.
- Return type
ndarray
- Returns
A mixed feature matrix.
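The SNR example from the parameter description can be worked through in a few lines. This is a sketch under the assumption that the features are natural-log energies, so mixing means leaving log space, adding the scaled energies, and taking the log again; both function names are hypothetical and the exact transformations used by lhotse are not confirmed here.

```python
import math
from typing import List


def energy_scaling_factor(energy_a: float, energy_b: float, snr_db: float) -> float:
    """Factor by which the energy of ``b`` is multiplied to reach ``snr_db`` relative to ``a``."""
    target_energy_b = energy_a / (10.0 ** (snr_db / 10.0))
    return target_energy_b / energy_b


def mix_log_energies(feats_a: List[float], feats_b: List[float], factor_b: float) -> List[float]:
    """Assumed log-domain mix: exponentiate, add the scaled energies, take the log again."""
    return [math.log(math.exp(a) + factor_b * math.exp(b)) for a, b in zip(feats_a, feats_b)]


# Both signals have an energy of 100 and the requested SNR is 10 dB.
factor = energy_scaling_factor(energy_a=100.0, energy_b=100.0, snr_db=10.0)
mixed = mix_log_energies([math.log(100.0)], [math.log(100.0)], factor)
```

Here `factor` comes out as 0.1, matching the description, and the mixed bin carries an energy of about 110 (100 from `a` plus 10 from the scaled `b`).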
-
static
compute_energy
(features)¶ Compute the total energy of a feature matrix. How the energy is computed depends on a particular type of features. It is expected that when implemented,
compute_energy
will never return zero.- Parameters
features (
ndarray
) – A feature matrix.- Return type
float
- Returns
A positive float value of the signal energy.
-
-
class
lhotse.features.mfcc.
MfccConfig
(dither: float = 0.0, window_type: str = 'povey', frame_length: float = 0.025, frame_shift: float = 0.01, remove_dc_offset: bool = True, round_to_power_of_two: bool = True, energy_floor: float = 1e-10, min_duration: float = 0.0, preemphasis_coefficient: float = 0.97, raw_energy: bool = True, low_freq: float = 20.0, high_freq: float = 0.0, num_mel_bins: int = 23, use_energy: bool = False, vtln_low: float = 100.0, vtln_high: float = - 500.0, vtln_warp: float = 1.0, cepstral_lifter: float = 22.0, num_ceps: int = 13)¶ -
dither
: float = 0.0¶
-
window_type
: str = 'povey'¶
-
frame_length
: float = 0.025¶
-
frame_shift
: float = 0.01¶
-
remove_dc_offset
: bool = True¶
-
round_to_power_of_two
: bool = True¶
-
energy_floor
: float = 1e-10¶
-
min_duration
: float = 0.0¶
-
preemphasis_coefficient
: float = 0.97¶
-
raw_energy
: bool = True¶
-
low_freq
: float = 20.0¶
-
high_freq
: float = 0.0¶
-
num_mel_bins
: int = 23¶
-
use_energy
: bool = False¶
-
vtln_low
: float = 100.0¶
-
vtln_high
: float = -500.0¶
-
vtln_warp
: float = 1.0¶
-
cepstral_lifter
: float = 22.0¶
-
num_ceps
: int = 13¶
-
__init__
(dither=0.0, window_type='povey', frame_length=0.025, frame_shift=0.01, remove_dc_offset=True, round_to_power_of_two=True, energy_floor=1e-10, min_duration=0.0, preemphasis_coefficient=0.97, raw_energy=True, low_freq=20.0, high_freq=0.0, num_mel_bins=23, use_energy=False, vtln_low=100.0, vtln_high=- 500.0, vtln_warp=1.0, cepstral_lifter=22.0, num_ceps=13)¶ Initialize self. See help(type(self)) for accurate signature.
-
-
class
lhotse.features.mfcc.
Mfcc
(config=None)¶ MFCC feature extractor based on
torchaudio.compliance.kaldi.mfcc
function.-
name
= 'mfcc'¶
-
config_type
¶ alias of
MfccConfig
-
feature_dim
(sampling_rate)¶ - Return type
int
-
-
class
lhotse.features.spectrogram.
SpectrogramConfig
(dither: float = 0.0, window_type: str = 'povey', frame_length: float = 0.025, frame_shift: float = 0.01, remove_dc_offset: bool = True, round_to_power_of_two: bool = True, energy_floor: float = 1e-10, min_duration: float = 0.0, preemphasis_coefficient: float = 0.97, raw_energy: bool = True)¶ -
dither
: float = 0.0¶
-
window_type
: str = 'povey'¶
-
frame_length
: float = 0.025¶
-
frame_shift
: float = 0.01¶
-
remove_dc_offset
: bool = True¶
-
round_to_power_of_two
: bool = True¶
-
energy_floor
: float = 1e-10¶
-
min_duration
: float = 0.0¶
-
preemphasis_coefficient
: float = 0.97¶
-
raw_energy
: bool = True¶
-
__init__
(dither=0.0, window_type='povey', frame_length=0.025, frame_shift=0.01, remove_dc_offset=True, round_to_power_of_two=True, energy_floor=1e-10, min_duration=0.0, preemphasis_coefficient=0.97, raw_energy=True)¶ Initialize self. See help(type(self)) for accurate signature.
-
-
class
lhotse.features.spectrogram.
Spectrogram
(config=None)¶ Log spectrogram feature extractor based on
torchaudio.compliance.kaldi.spectrogram
function.-
name
= 'spectrogram'¶
-
config_type
¶ alias of
SpectrogramConfig
-
feature_dim
(sampling_rate)¶ - Return type
int
-
static
mix
(features_a, features_b, energy_scaling_factor_b)¶ Perform feature-domain mix of two signals,
a
andb
, and return the mixed signal.- Parameters
features_a (
ndarray
) – Left-hand side (reference) signal.features_b (
ndarray
) – Right-hand side (mixed-in) signal.energy_scaling_factor_b (
float
) – A scaling factor forfeatures_b
energy. It is used to achieve a specific SNR. E.g. to mix with an SNR of 10dB when bothfeatures_a
andfeatures_b
energies are 100, thefeatures_b
signal energy needs to be scaled by 0.1. Since different features (e.g. spectrogram, fbank, MFCC) require different combination of transformations (e.g. exp, log, sqrt, pow) to allow mixing of two signals, the exact place where to applyenergy_scaling_factor_b
to the signal is determined by the implementer.
- Return type
ndarray
- Returns
A mixed feature matrix.
-
static
compute_energy
(features)¶ Compute the total energy of a feature matrix. How the energy is computed depends on a particular type of features. It is expected that when implemented,
compute_energy
will never return zero.- Parameters
features (
ndarray
) – A feature matrix.- Return type
float
- Returns
A positive float value of the signal energy.
-
Feature storage¶
-
class
lhotse.features.io.
FeaturesWriter
¶ FeaturesWriter
defines the interface of how to store numpy arrays in a particular storage backend. This backend could be:
- separate files on a local filesystem;
- a single file with multiple arrays;
- cloud storage;
- etc.
Each class inheriting from FeaturesWriter must define:
- the write() method, which defines the storing operation (accepts a key used to place the value array in the storage);
- the storage_path() property, which is either a common directory for the files, the name of the file storing multiple arrays, the name of the cloud bucket, etc.
- the name() property that is unique to this particular storage mechanism - it is stored in the features manifests (metadata) and used to automatically deduce the backend when loading the features.
Each
FeaturesWriter
can also be used as a context manager, as some implementations might need to free a resource after the writing is finalized. By default nothing happens in the context manager functions, and this can be modified by the inheriting subclasses.- Example:
- with MyWriter('some/path') as storage:
extractor.extract_from_recording_and_store(recording, storage)
The features loading must be defined separately in a class inheriting from
FeaturesReader
.-
abstract property
name
¶ - Return type
str
-
abstract property
storage_path
¶ - Return type
str
-
abstract
write
(key, value)¶ - Return type
str
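The writer interface described above can be mimicked in a standalone sketch. `WriterInterface` is a stand-in written for this example (it is not lhotse's `FeaturesWriter`), and `InMemoryWriter` and its `"in_memory"` name are hypothetical; the sketch also assumes that `write()` returns the key under which the array can be found later.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class WriterInterface(ABC):
    """Stand-in for the interface described above (not lhotse's actual class)."""

    @property
    @abstractmethod
    def name(self) -> str: ...

    @property
    @abstractmethod
    def storage_path(self) -> str: ...

    @abstractmethod
    def write(self, key: str, value: List[List[float]]) -> str: ...

    # By default the context manager does nothing; subclasses may override.
    def __enter__(self):
        return self

    def __exit__(self, *exc_info) -> None:
        pass


class InMemoryWriter(WriterInterface):
    """Hypothetical backend that keeps arrays in a dict instead of on disk."""

    def __init__(self, storage_path: str):
        self._storage_path = storage_path
        self.arrays: Dict[str, List[List[float]]] = {}

    @property
    def name(self) -> str:
        return "in_memory"

    @property
    def storage_path(self) -> str:
        return self._storage_path

    def write(self, key: str, value: List[List[float]]) -> str:
        self.arrays[key] = value
        return key  # assumption: the returned string identifies the stored array


with InMemoryWriter("some/path") as storage:
    stored_key = storage.write("utt-1", [[0.1, 0.2], [0.3, 0.4]])
```

A backend that holds an open resource (for example a single file with many arrays) would release it in `__exit__`.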
-
class
lhotse.features.io.
FeaturesReader
¶ FeaturesReader
defines the interface of how to load numpy arrays from a particular storage backend. This backend could be:
- separate files on a local filesystem;
- a single file with multiple arrays;
- cloud storage;
- etc.
Each class inheriting from FeaturesReader must define:
- the read() method, which defines the loading operation (accepts the key to locate the array in the storage and returns it). The read method should support selecting only a subset of the feature matrix, with the bounds expressed as arguments left_offset_frames and right_offset_frames. It’s up to the Reader implementation to load only the required part or trim it to that range only after loading. It is assumed that the time dimension is always the first one.
- the name() property that is unique to this particular storage mechanism - it is stored in the features manifests (metadata) and used to automatically deduce the backend when loading the features.
The features writing must be defined separately in a class inheriting from
FeaturesWriter
.-
abstract property
name
¶ - Return type
str
-
abstract
read
(key, left_offset_frames=0, right_offset_frames=None)¶ - Return type
ndarray
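The frame-offset arguments of `read()` can be pictured as a slice along the first (time) dimension. The sketch below trims after loading and assumes both bounds are frame indices counted from the start, as in ordinary Python slicing; `trim_frames` is a hypothetical helper operating on nested lists rather than numpy arrays.

```python
from typing import List, Optional


def trim_frames(
    matrix: List[List[float]],
    left_offset_frames: int = 0,
    right_offset_frames: Optional[int] = None,
) -> List[List[float]]:
    """Hypothetical post-load trimming; time (frames) is the first dimension."""
    return matrix[left_offset_frames:right_offset_frames]


# Five frames with two feature bins each.
full = [[float(t), float(t) + 0.5] for t in range(5)]
middle = trim_frames(full, left_offset_frames=1, right_offset_frames=4)
tail = trim_frames(full, left_offset_frames=3)
```

A reader backed by a format with partial reads could apply the same bounds while loading instead of afterwards.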
-
lhotse.features.io.
available_storage_backends
()¶ - Return type
List
[str
]
-
lhotse.features.io.
register_reader
(cls)¶ Decorator used to add a new
FeaturesReader
to Lhotse’s registry.Example:
@register_reader
class MyFeatureReader(FeaturesReader):
…
-
lhotse.features.io.
register_writer
(cls)¶ Decorator used to add a new
FeaturesWriter
to Lhotse’s registry.Example:
@register_writer
class MyFeatureWriter(FeaturesWriter):
…
-
lhotse.features.io.
get_reader
(name)¶ Find a
FeaturesReader
sub-class that corresponds to the providedname
and return its type.Example:
reader_type = get_reader("lilcom_files")
reader = reader_type("/storage/features/")
- Return type
Type
[FeaturesReader
]
-
lhotse.features.io.
get_writer
(name)¶ Find a
FeaturesWriter
sub-class that corresponds to the providedname
and return its type.Example:
writer_type = get_writer("lilcom_files")
writer = writer_type("/storage/features/")
- Return type
Type
[FeaturesWriter
]
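The register/lookup pair follows a common decorator-registry pattern, which can be sketched without lhotse. All names below (`_WRITERS`, `register`, `lookup`, `"my_backend"`) are hypothetical, and the error type raised for an unknown name is an assumption of this sketch.

```python
from typing import Dict

_WRITERS: Dict[str, type] = {}  # hypothetical registry: backend name -> class


def register(cls: type) -> type:
    """Class decorator that records ``cls`` under its ``name`` attribute."""
    _WRITERS[cls.name] = cls
    return cls


def lookup(name: str) -> type:
    """Return the registered class (not an instance) for ``name``."""
    if name not in _WRITERS:
        raise ValueError(f"Unknown backend {name!r}; available: {sorted(_WRITERS)}")
    return _WRITERS[name]


@register
class MyFeatureWriter:
    name = "my_backend"

    def __init__(self, storage_path: str):
        self.storage_path = storage_path


writer_type = lookup("my_backend")
writer = writer_type("/storage/features/")
```

Because the lookup is keyed by the `name` stored in the manifest, code that loads features does not need to know which backend wrote them.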
-
class
lhotse.features.io.
LilcomFilesReader
(storage_path, *args, **kwargs)¶ Reads Lilcom-compressed files from a directory on the local filesystem.
storage_path
corresponds to the directory path;storage_key
for each utterance is the name of the file in that directory.-
name
= 'lilcom_files'¶
-
__init__
(storage_path, *args, **kwargs)¶ Initialize self. See help(type(self)) for accurate signature.
-
read
(key, left_offset_frames=0, right_offset_frames=None)¶ - Return type
ndarray
-
-
class
lhotse.features.io.
LilcomFilesWriter
(storage_path, tick_power=- 5, *args, **kwargs)¶ Writes Lilcom-compressed files to a directory on the local filesystem.
storage_path
corresponds to the directory path;storage_key
for each utterance is the name of the file in that directory.-
name
= 'lilcom_files'¶
-
__init__
(storage_path, tick_power=- 5, *args, **kwargs)¶ Initialize self. See help(type(self)) for accurate signature.
-
property
storage_path
¶ - Return type
str
-
write
(key, value)¶ - Return type
str
-
-
class
lhotse.features.io.
NumpyFilesReader
(storage_path, *args, **kwargs)¶ Reads non-compressed numpy arrays from files in a directory on the local filesystem.
storage_path
corresponds to the directory path;storage_key
for each utterance is the name of the file in that directory.-
name
= 'numpy_files'¶
-
__init__
(storage_path, *args, **kwargs)¶ Initialize self. See help(type(self)) for accurate signature.
-
read
(key, left_offset_frames=0, right_offset_frames=None)¶ - Return type
ndarray
-
-
class
lhotse.features.io.
NumpyFilesWriter
(storage_path, *args, **kwargs)¶ Writes non-compressed numpy arrays to files in a directory on the local filesystem.
storage_path
corresponds to the directory path;storage_key
for each utterance is the name of the file in that directory.-
name
= 'numpy_files'¶
-
__init__
(storage_path, *args, **kwargs)¶ Initialize self. See help(type(self)) for accurate signature.
-
property
storage_path
¶ - Return type
str
-
write
(key, value)¶ - Return type
str
-
-
lhotse.features.io.
lookup_cache_or_open
(storage_path)¶ Helper internal function used in HDF5 readers. It opens the HDF files and keeps their handles open in a global program cache to avoid excessive amount of syscalls when the *Reader class is instantiated and destroyed in a loop repeatedly (frequent use-case).
The file handles can be freed at any time by calling
close_cached_file_handles()
.
-
lhotse.features.io.
close_cached_file_handles
()¶ Closes the cached file handles in
lookup_cache_or_open
(see its docs for more details).- Return type
None
-
class
lhotse.features.io.
NumpyHdf5Reader
(storage_path, *args, **kwargs)¶ Reads non-compressed numpy arrays from a HDF5 file with a “flat” layout. Each array is stored as a separate HDF
Dataset
because their shapes (numbers of frames) may vary.storage_path
corresponds to the HDF5 file path;storage_key
for each utterance is the key corresponding to the array (i.e. HDF5 “Group” name).-
name
= 'numpy_hdf5'¶
-
__init__
(storage_path, *args, **kwargs)¶ Initialize self. See help(type(self)) for accurate signature.
-
read
(key, left_offset_frames=0, right_offset_frames=None)¶ - Return type
ndarray
-
-
class
lhotse.features.io.
NumpyHdf5Writer
(storage_path, *args, **kwargs)¶ Writes non-compressed numpy arrays to a HDF5 file with a “flat” layout. Each array is stored as a separate HDF
Dataset
because their shapes (numbers of frames) may vary.storage_path
corresponds to the HDF5 file path;storage_key
for each utterance is the key corresponding to the array (i.e. HDF5 “Group” name).-
name
= 'numpy_hdf5'¶
-
__init__
(storage_path, *args, **kwargs)¶ Initialize self. See help(type(self)) for accurate signature.
-
property
storage_path
¶ - Return type
str
-
write
(key, value)¶ - Return type
str
-
close
()¶ - Return type
None
-
-
class
lhotse.features.io.
LilcomHdf5Reader
(storage_path, *args, **kwargs)¶ Reads lilcom-compressed numpy arrays from a HDF5 file with a “flat” layout. Each array is stored as a separate HDF
Dataset
because their shapes (numbers of frames) may vary.storage_path
corresponds to the HDF5 file path;storage_key
for each utterance is the key corresponding to the array (i.e. HDF5 “Group” name).-
name
= 'lilcom_hdf5'¶
-
__init__
(storage_path, *args, **kwargs)¶ Initialize self. See help(type(self)) for accurate signature.
-
read
(key, left_offset_frames=0, right_offset_frames=None)¶ - Return type
ndarray
-
-
class
lhotse.features.io.
LilcomHdf5Writer
(storage_path, tick_power=- 5, *args, **kwargs)¶ Writes lilcom-compressed numpy arrays to a HDF5 file with a “flat” layout. Each array is stored as a separate HDF
Dataset
because their shapes (numbers of frames) may vary.storage_path
corresponds to the HDF5 file path;storage_key
for each utterance is the key corresponding to the array (i.e. HDF5 “Group” name).-
name
= 'lilcom_hdf5'¶
-
__init__
(storage_path, tick_power=- 5, *args, **kwargs)¶ Initialize self. See help(type(self)) for accurate signature.
-
property
storage_path
¶ - Return type
str
-
write
(key, value)¶ - Return type
str
-
close
()¶ - Return type
None
-
Feature-domain mixing¶
-
class
lhotse.features.mixer.
FeatureMixer
(feature_extractor, base_feats, frame_shift, padding_value=- 1000.0)¶ Utility class to mix multiple feature matrices into a single one. It should be instantiated separately for each mixing session (i.e. each
MixedCut
will create a separateFeatureMixer
to mix its tracks). It is initialized with a numpy array of features (typically float32) that represents the “reference” signal for the mix. Other signals can be mixed to it with different time offsets and SNRs using theadd_to_mix
method. The time offset is relative to the start of the reference signal (only positive values are supported). The SNR is relative to the energy of the signal used to initialize theFeatureMixer
.It relies on the
FeatureExtractor
to have definedmix
andcompute_energy
methods, so that theFeatureMixer
knows how to scale and add two feature matrices together.-
__init__
(feature_extractor, base_feats, frame_shift, padding_value=- 1000.0)¶ - Parameters
feature_extractor (
FeatureExtractor
) – TheFeatureExtractor
instance that specifies how to mix the features.base_feats (
ndarray
) – The features used to initialize theFeatureMixer
are a point of reference in terms of energy and offset for all features mixed into them.frame_shift (
float
) – Required to correctly compute offset and padding during the mix.padding_value (
float
) – The value used to pad the shorter features during the mix. This value is adequate only for log space features. For non-log space features, e.g. energies, use either 0 or a small positive value like 1e-5.
-
property
num_features
¶
-
property
unmixed_feats
¶ Return a numpy ndarray with the shape (num_tracks, num_frames, num_features), where each track’s feature matrix is padded and scaled adequately to the offsets and SNR used in
add_to_mix
call.- Return type
ndarray
-
property
mixed_feats
¶ Return a numpy ndarray with the shape (num_frames, num_features) - a mono mixed feature matrix of the tracks supplied with
add_to_mix
calls.- Return type
ndarray
-
add_to_mix
(feats, snr=None, offset=0.0)¶ Add the feature matrix of a new track into the mix.
- Parameters
feats (ndarray) – A 2D feature matrix to be mixed in.
snr (Optional[float]) – Signal-to-noise ratio, assuming feats represents noise (positive SNR: lower feats energy; negative SNR: higher feats energy).
offset (float) – How many seconds to shift feats in time. For mixing, the signal will be padded before the start with low-energy values.
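How a time offset turns into padding frames can be sketched as follows. The helper assumes the offset is converted to frames by dividing by `frame_shift` and rounding, and that padded frames are filled with the log-domain `padding_value`; `pad_for_offset` is a hypothetical function working on nested lists, not the mixer's actual code.

```python
from typing import List

Matrix = List[List[float]]


def pad_for_offset(feats: Matrix, offset: float, frame_shift: float, total_frames: int,
                   padding_value: float = -1000.0) -> Matrix:
    """Hypothetical helper: shift ``feats`` by ``offset`` seconds and pad to ``total_frames``."""
    num_bins = len(feats[0])
    frames_before = round(offset / frame_shift)
    frames_after = max(0, total_frames - frames_before - len(feats))
    return ([[padding_value] * num_bins for _ in range(frames_before)]
            + [list(row) for row in feats]
            + [[padding_value] * num_bins for _ in range(frames_after)])


track = [[-1.0, -2.0], [-3.0, -4.0]]  # two frames, two bins, log-energy values
padded = pad_for_offset(track, offset=0.02, frame_shift=0.01, total_frames=5)
```

The track now starts at frame 2 and is surrounded by very low log-energy frames, so it aligns frame-by-frame with a five-frame reference signal.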
-
Augmentation¶
Cuts¶
Data structures and tools used to create training/testing examples.
-
class
lhotse.cut.
CutUtilsMixin
¶ A mixin class for cuts which contains all the methods that share common implementations.
Note: Ideally, this would’ve been an abstract base class specifying the common interface, but ABC’s do not mix well with dataclasses in Python. It is possible we’ll ditch the dataclass for cuts in the future and make this an ABC instead.
-
property
trimmed_supervisions
¶ Return the supervisions in this Cut that have modified time boundaries so as not to exceed the Cut’s start or end.
Note that when
cut.supervisions
is called, the supervisions may have negativestart
values that indicate the supervision actually begins before the cut, orend
values that exceed the Cut’s duration (it means the supervision continued in the original recording after the Cut’s ending).- Return type
List
[SupervisionSegment
]
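Trimming a supervision to the cut boundaries amounts to clamping its start and end to the range from 0 to the cut's duration. A minimal sketch, with `trim_to_cut` as a hypothetical helper working on plain numbers rather than `SupervisionSegment` objects:

```python
from typing import Tuple


def trim_to_cut(sup_start: float, sup_duration: float, cut_duration: float) -> Tuple[float, float]:
    """Hypothetical helper: clamp a supervision (start, duration) to [0, cut_duration]."""
    new_start = max(0.0, sup_start)
    new_end = min(cut_duration, sup_start + sup_duration)
    return new_start, new_end - new_start


# The supervision begins 1.5 s before the cut and runs 2 s past the end of a 10 s cut.
trimmed = trim_to_cut(sup_start=-1.5, sup_duration=13.5, cut_duration=10.0)
```

The result is a supervision that starts at 0 and lasts exactly as long as the cut.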
-
append
(other, snr=None)¶ Append the other Cut after the current Cut. Conceptually the same as mix, but with an offset matching the current cut’s length. Optionally scale down (positive SNR) or scale up (negative SNR) the other cut. Returns a MixedCut, which only keeps the information about the mix; actual mixing is performed during the call to load_features.
- Return type
-
compute_features
(extractor, augment_fn=None)¶ Compute the features from this cut. This cut has to be able to load audio.
- Parameters
extractor (
FeatureExtractor
) – aFeatureExtractor
instance used to compute the features.augment_fn (
Optional
[Callable
[[ndarray
,int
],ndarray
]]) – optionalWavAugmenter
instance for audio augmentation.
- Return type
ndarray
- Returns
a numpy ndarray with the computed features.
-
plot_audio
()¶ Display a plot of the waveform. Requires matplotlib to be installed.
-
play_audio
()¶ Display a Jupyter widget that allows to listen to the waveform. Works only in Jupyter notebook/lab or similar (e.g. Colab).
-
plot_features
()¶ Display the feature matrix as an image. Requires matplotlib to be installed.
-
speakers_feature_mask
(min_speaker_dim=None, speaker_to_idx_map=None)¶ Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_frames), and its values are 0 for nonspeech frames and 1 for speech frames for each respective speaker.
This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272
- Parameters
min_speaker_dim (
Optional
[int
]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).
speaker_to_idx_map (
Optional
[Dict
[str
,int
]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2)
- Return type
ndarray
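The shape and contents of the per-speaker mask can be illustrated with nested lists. This is a sketch of the idea only: `speaker_activity` is a hypothetical function, speakers are indexed in sorted order when no map is given (an assumption), and seconds are converted to frames by rounding `time / frame_shift`.

```python
from typing import Dict, List, Optional, Tuple


def speaker_activity(
    segments: List[Tuple[str, float, float]],  # (speaker, start, duration) in seconds
    num_frames: int,
    frame_shift: float,
    min_speaker_dim: Optional[int] = None,
    speaker_to_idx_map: Optional[Dict[str, int]] = None,
) -> List[List[int]]:
    """Hypothetical re-creation of the idea with plain lists instead of numpy."""
    if speaker_to_idx_map is None:
        speakers = sorted({spk for spk, _, _ in segments})
        speaker_to_idx_map = {spk: idx for idx, spk in enumerate(speakers)}
    num_speakers = max(speaker_to_idx_map.values(), default=-1) + 1
    if min_speaker_dim is not None:
        num_speakers = max(num_speakers, min_speaker_dim)
    mask = [[0] * num_frames for _ in range(num_speakers)]
    for spk, start, duration in segments:
        first = max(0, round(start / frame_shift))
        last = min(num_frames, round((start + duration) / frame_shift))
        for frame in range(first, last):
            mask[speaker_to_idx_map[spk]][frame] = 1
    return mask


mask = speaker_activity(
    [("alice", 0.0, 0.02), ("bob", 0.03, 0.02)],
    num_frames=5, frame_shift=0.01, min_speaker_dim=4,
)
```

With `min_speaker_dim=4` the mask has four rows even though only two speakers are active; the extra rows stay all-zero.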
-
speakers_audio_mask
(min_speaker_dim=None, speaker_to_idx_map=None)¶ Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_samples), and its values are 0 for nonspeech samples and 1 for speech samples for each respective speaker.
This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272
- Parameters
min_speaker_dim (
Optional
[int
]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).
speaker_to_idx_map (
Optional
[Dict
[str
,int
]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2)
- Return type
ndarray
-
supervisions_feature_mask
()¶ Return a 1D numpy array with value 1 for frames covered by at least one supervision, and 0 for frames not covered by any supervision.
- Return type
ndarray
-
supervisions_audio_mask
()¶ Return a 1D numpy array with value 1 for samples covered by at least one supervision, and 0 for samples not covered by any supervision.
- Return type
ndarray
-
with_id
(id_)¶ Return a copy of the Cut with a new ID.
- Return type
Union
[Cut
,MixedCut
,PaddingCut
]
-
-
class
lhotse.cut.
Cut
(id: str, start: float, duration: float, channel: int, supervisions: List[lhotse.supervision.SupervisionSegment] = <factory>, features: Optional[lhotse.features.base.Features] = None, recording: Optional[lhotse.audio.Recording] = None)¶ A Cut is a single “segment” that we’ll train on. It contains the features corresponding to a piece of a recording, with zero or more SupervisionSegments.
The SupervisionSegments indicate which time spans of the Cut contain some kind of supervision information: e.g. transcript, speaker, language, etc. The regions without a corresponding SupervisionSegment may contain anything - usually we assume it’s either silence or some kind of noise.
Note: The SupervisionSegment time boundaries are relative to the beginning of the cut. E.g. if the underlying Recording starts at 0s (always true), the Cut starts at 100s, and the SupervisionSegment starts at 3s, it means that in the Recording the supervision actually started at 103s. In some cases, the supervision might have a negative start, or a duration exceeding the duration of the Cut; this means that the supervision in the recording extends beyond the Cut.
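The arithmetic in the note above can be written out directly; `to_recording_time` is a hypothetical helper used only for illustration.

```python
def to_recording_time(cut_start: float, supervision_start: float) -> float:
    """Convert a cut-relative supervision start into recording time (both in seconds)."""
    return cut_start + supervision_start


# The example from the note: the cut starts at 100 s, the supervision at 3 s within the cut.
inside = to_recording_time(cut_start=100.0, supervision_start=3.0)
# A negative start means the supervision began before the cut did.
before = to_recording_time(cut_start=100.0, supervision_start=-2.0)
```

The first call gives 103 s in the recording, the second 98 s, which lies before the cut's own start.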
-
id
: str¶
-
start
: Seconds¶
-
duration
: Seconds¶
-
channel
: int¶
-
supervisions
: List[SupervisionSegment]¶
-
features
: Optional[lhotse.features.base.Features] = None¶
-
recording
: Optional[lhotse.audio.Recording] = None¶
-
property
recording_id
¶ - Return type
str
-
property
end
¶ - Return type
float
-
property
has_features
¶ - Return type
bool
-
property
has_recording
¶ - Return type
bool
-
property
frame_shift
¶ - Return type
Optional
[float
]
-
property
num_frames
¶ - Return type
Optional
[int
]
-
property
num_samples
¶ - Return type
Optional
[int
]
-
property
num_features
¶ - Return type
Optional
[int
]
-
property
features_type
¶ - Return type
Optional
[str
]
-
property
sampling_rate
¶ - Return type
int
-
load_features
()¶ Load the features from the underlying storage and cut them to the relevant [begin, duration] region of the current Cut.
- Return type
Optional
[ndarray
]
-
load_audio
()¶ Load the audio by locating the appropriate recording in the supplied RecordingSet. The audio is trimmed to the [begin, end] range specified by the Cut.
- Return type
Optional
[ndarray
]- Returns
a numpy ndarray with audio samples, with shape (1 <channel>, N <samples>)
-
compute_and_store_features
(extractor, storage, augment_fn=None, *args, **kwargs)¶ Compute the features from this cut, store them on disk, and attach a feature manifest to this cut. This cut has to be able to load audio.
- Parameters
extractor (
FeatureExtractor
) – aFeatureExtractor
instance used to compute the features.output_dir – the directory where the computed features will be stored.
augment_fn (
Optional
[Callable
[[ndarray
,int
],ndarray
]]) – an optional callable used for audio augmentation.
- Return type
Union
[Cut
,MixedCut
,PaddingCut
]- Returns
a new
Cut
instance with aFeatures
manifest attached to it.
-
truncate
(*, offset=0.0, duration=None, keep_excessive_supervisions=True, preserve_id=False)¶ Returns a new Cut that is a sub-region of the current Cut.
Note that no operation is done on the actual features - it’s only during the call to load_features() when the actual changes happen (a subset of features is loaded).
- Parameters
offset (
float
) – float (seconds), controls the start of the new cut relative to the current Cut’s start. E.g., if the current Cut starts at 10.0, and offset is 2.0, the new start is 12.0.duration (
Optional
[float
]) – optional float (seconds), controls the duration of the resulting Cut. By default, the duration is (end of the cut before truncation) - (offset).keep_excessive_supervisions (
bool
) – bool. Since trimming may happen inside a SupervisionSegment, the caller has an option to either keep or discard such supervisions.preserve_id (
bool
) – bool. Should the truncated cut keep the same ID or get a new, random one.
- Return type
- Returns
a new Cut instance. If the current Cut is shorter than the duration, return None.
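The effect of `offset` and the default `duration` can be sketched with plain numbers. `truncated_bounds` is a hypothetical helper; it ignores supervisions and IDs, and does not model the case where `None` is returned.

```python
from typing import Optional, Tuple


def truncated_bounds(cut_start: float, cut_duration: float, offset: float = 0.0,
                     duration: Optional[float] = None) -> Tuple[float, float]:
    """Hypothetical helper: (start, duration) of the sub-region, in recording time."""
    new_start = cut_start + offset
    if duration is None:
        # Default: everything from the offset to the end of the original cut.
        duration = cut_duration - offset
    return new_start, duration


# The example from the parameter description: the cut starts at 10.0 and offset is 2.0.
bounds = truncated_bounds(cut_start=10.0, cut_duration=5.0, offset=2.0)
```

The new cut starts at 12.0 and, with no explicit duration, keeps the remaining 3.0 seconds.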
-
pad
(duration)¶ Return a new MixedCut, padded to
duration
seconds with zeros in the recording, and low-energy values in each feature bin.- Parameters
duration (
float
) – The cut’s minimal duration after padding.- Return type
Union
[Cut
,MixedCut
,PaddingCut
]- Returns
a padded MixedCut if
duration
is greater than this cut’s duration, otherwiseself
.
-
map_supervisions
(transform_fn)¶ Modify the SupervisionSegments of this Cut by applying transform_fn to each of them.
- Parameters
transform_fn (
Callable
[[SupervisionSegment
],SupervisionSegment
]) – a function that takes a supervision as an argument and returns a modified one.
- Return type
Union
[Cut
,MixedCut
,PaddingCut
]- Returns
a modified Cut.
-
__init__
(id, start, duration, channel, supervisions=<factory>, features=None, recording=None)¶ Initialize self. See help(type(self)) for accurate signature.
-
-
class
lhotse.cut.
PaddingCut
(id: str, duration: float, sampling_rate: int, use_log_energy: bool, num_frames: Optional[int] = None, num_features: Optional[int] = None, num_samples: Optional[int] = None)¶ This represents a cut filled with zeroes in the time domain, or low energy/log-energy values in the frequency domain. It’s used to make training samples evenly sized (same duration/number of frames).
-
id
: str¶
-
duration
: Seconds¶
-
sampling_rate
: int¶
-
use_log_energy
: bool¶
-
num_frames
: Optional[int] = None¶
-
num_features
: Optional[int] = None¶
-
num_samples
: Optional[int] = None¶
-
property
start
¶ - Return type
float
-
property
end
¶ - Return type
float
-
property
supervisions
¶
-
property
has_features
¶ - Return type
bool
-
property
has_recording
¶ - Return type
bool
-
property
frame_shift
¶
-
load_features
(*args, **kwargs)¶ - Return type
Optional
[ndarray
]
-
load_audio
(*args, **kwargs)¶ - Return type
Optional
[ndarray
]
-
truncate
(*, offset=0.0, duration=None, keep_excessive_supervisions=True, preserve_id=False)¶ - Return type
-
pad
(duration)¶ Create a new PaddingCut with
duration
when it’s longer than this Cut’s duration. Helper function used in batch cut padding.
- Parameters
duration (
float
) – The cut’s minimal duration after padding.
- Return type
- Returns
self
or a new PaddingCut, depending onduration
.
-
compute_and_store_features
(extractor, *args, **kwargs)¶ Returns a new PaddingCut with updated information about the feature dimension and number of feature frames, depending on the
extractor
properties.- Return type
Union
[Cut
,MixedCut
,PaddingCut
]
-
map_supervisions
(transform_fn)¶ Just for consistency with Cut and MixedCut.
- Parameters
transform_fn (
Callable
[[Any
],Any
]) – a dummy function that is never actually called.
- Return type
Union
[Cut
,MixedCut
,PaddingCut
]- Returns
the PaddingCut itself.
-
static
from_dict
(data)¶ - Return type
-
with_features_path_prefix
(path)¶ - Return type
-
with_recording_path_prefix
(path)¶ - Return type
-
__init__
(id, duration, sampling_rate, use_log_energy, num_frames=None, num_features=None, num_samples=None)¶ Initialize self. See help(type(self)) for accurate signature.
-
-
class
lhotse.cut.
MixTrack
(cut: Union[lhotse.cut.Cut, lhotse.cut.PaddingCut], offset: float = 0.0, snr: Optional[float] = None)¶ Represents a single track in a mix of Cuts. Points to a specific Cut and holds information on how to mix it with other Cuts, relative to the first track in a mix.
-
cut
: Union[Cut, PaddingCut]¶
-
offset
: float = 0.0¶
-
snr
: Optional[float] = None¶
-
static
from_dict
(data)¶
-
__init__
(cut, offset=0.0, snr=None)¶ Initialize self. See help(type(self)) for accurate signature.
-
-
class
lhotse.cut.
MixedCut
(id: str, tracks: List[lhotse.cut.MixTrack])¶ Represents a Cut that’s created from other Cuts via mix or append operations. The actual mixing operations are performed upon loading the features into memory. In order to load the features, it needs to access the CutSet object that holds the “ingredient” cuts, as it only holds their IDs (“pointers”). The SNR and offset of all the tracks are specified relative to the first track.
-
id
: str¶
-
tracks
: List[MixTrack]¶
-
property
supervisions
¶ Lists the supervisions of the underlying source cuts. Each segment start time will be adjusted by the track offset.
- Return type
List
[SupervisionSegment
]
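The offset adjustment described for the supervisions property can be pictured with a small stand-alone sketch. The `Segment` class and `shift_segments` helper below are hypothetical stand-ins, not Lhotse types; the sketch only assumes that each supervision start time is shifted by the offset of the track it belongs to.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for a supervision segment (not a Lhotse class)."""
    id: str
    start: float     # seconds, relative to the source cut
    duration: float  # seconds


def shift_segments(segments: List[Segment], track_offset: float) -> List[Segment]:
    """Return copies of the segments with start times moved by the track offset."""
    return [replace(s, start=s.start + track_offset) for s in segments]


# A segment starting 0.5 s into a track that itself starts 3.0 s into the mix.
shifted = shift_segments([Segment("sup-1", start=0.5, duration=2.0)], track_offset=3.0)
```

Only the start time changes; the duration of each segment stays the same.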
-
property
start
¶ - Return type
float
-
property
end
¶ - Return type
float
-
property
duration
¶ - Return type
float
-
property
has_features
¶ - Return type
bool
-
property
has_recording
¶ - Return type
bool
-
property
num_frames
¶ - Return type
Optional
[int
]
-
property
frame_shift
¶ - Return type
Optional
[float
]
-
property
sampling_rate
¶ - Return type
Optional
[int
]
-
property
num_samples
¶ - Return type
Optional
[int
]
-
property
num_features
¶ - Return type
Optional
[int
]
-
property
features_type
¶ - Return type
Optional
[str
]
-
truncate
(*, offset=0.0, duration=None, keep_excessive_supervisions=True, preserve_id=False)¶ Returns a new MixedCut that is a sub-region of the current MixedCut. This method truncates the underlying Cuts and modifies their offsets in the mix, as needed. Tracks that do not fit in the truncated cut are removed.
Note that no operation is done on the actual features - it’s only during the call to load_features() when the actual changes happen (a subset of features is loaded).
- Parameters
offset (
float
) – float (seconds), controls the start of the new cut relative to the current MixedCut’s start.
duration (
Optional
[float
]) – optional float (seconds), controls the duration of the resulting MixedCut. By default, the duration is (end of the cut before truncation) - (offset).
keep_excessive_supervisions (
bool
) – bool. Since trimming may happen inside a SupervisionSegment, the caller has an option to either keep or discard such supervisions.
preserve_id (
bool
) – bool. Should the truncated cut keep the same ID or get a new, random one.
- Return type
- Returns
a new MixedCut instance.
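As a rough illustration of the track bookkeeping described above, the sketch below truncates a list of tracks to a time window. It is not Lhotse's implementation: tracks are modelled as hypothetical `(offset, duration)` tuples, and the helper name `truncate_tracks` is made up for this example.

```python
from typing import List, Optional, Tuple

Track = Tuple[float, float]  # (offset in the mix, duration), both in seconds


def truncate_tracks(
    tracks: List[Track], offset: float = 0.0, duration: Optional[float] = None
) -> List[Track]:
    """Keep only the parts of each track that fall inside the requested window."""
    mix_end = max(o + d for o, d in tracks)
    new_end = mix_end if duration is None else offset + duration
    kept = []
    for track_offset, track_duration in tracks:
        start = max(track_offset, offset)
        end = min(track_offset + track_duration, new_end)
        if end <= start:
            continue  # the track lies entirely outside the window, so drop it
        # Re-express the track offset relative to the start of the new window.
        kept.append((start - offset, end - start))
    return kept


# Window [1.0, 6.0): the second track starts at 8.0 and is removed.
result = truncate_tracks([(0.0, 10.0), (8.0, 4.0)], offset=1.0, duration=5.0)
```

No audio or features are touched here, which mirrors the note that the actual changes happen only when the features are loaded.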
-
pad
(duration)¶ Return a new MixedCut, padded to
duration
seconds with zeros in the recording, and low-energy values in each feature bin.- Parameters
duration (
float
) – The cut’s minimal duration after padding.- Return type
Union
[Cut
,MixedCut
,PaddingCut
]- Returns
a padded MixedCut if duration is greater than this cut’s duration, otherwise
self
.
-
load_features
(mixed=True)¶ Loads the features of the source cuts and mixes them on-the-fly.
- Parameters
mixed (
bool
) – when True (default), returns a 2D array of features mixed in the feature domain. Otherwise returns a 3D array with the first dimension equal to the number of tracks.- Return type
Optional
[ndarray
]- Returns
A numpy ndarray with features and with shape
(num_frames, num_features)
, or (num_tracks, num_frames, num_features)
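The two return shapes can be illustrated with plain nested lists. The sketch below assumes log-energy features that are mixed by summing energies in the linear domain; that assumption, and the helper name `mix_log_energies`, are illustrative and not taken from the Lhotse source.

```python
import math
from typing import List

Matrix = List[List[float]]  # shape: (num_frames, num_features)


def mix_log_energies(tracks: List[Matrix]) -> Matrix:
    """Collapse (num_tracks, num_frames, num_features) into (num_frames, num_features)."""
    num_frames = len(tracks[0])
    num_features = len(tracks[0][0])
    return [
        [
            # Convert each log-energy to energy, add across tracks, go back to log.
            math.log(sum(math.exp(track[frame][feat]) for track in tracks))
            for feat in range(num_features)
        ]
        for frame in range(num_frames)
    ]


# Two tracks, three frames, two feature bins: the "unmixed" 3D layout.
unmixed = [
    [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
]
mixed = mix_log_energies(unmixed)  # the "mixed" 2D layout
```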
-
load_audio
(mixed=True)¶ Loads the audio of the source cuts and mixes it on-the-fly.
- Parameters
mixed (
bool
) – When True (default), returns a mono mix of the underlying tracks. Otherwise returns a numpy array with the number of channels equal to the number of tracks.- Return type
Optional
[ndarray
]- Returns
A numpy ndarray with audio samples and with shape
(num_channels, num_samples)
-
plot_tracks_features
()¶ Display the feature matrix as an image. Requires matplotlib to be installed.
-
plot_tracks_audio
()¶ Display plots of the individual tracks’ waveforms. Requires matplotlib to be installed.
-
compute_and_store_features
(extractor, storage, augment_fn=None, mix_eagerly=True)¶ Compute the features from this cut, store them on disk, and create a new Cut object with the feature manifest attached. This cut has to be able to load audio.
- Parameters
extractor (
FeatureExtractor
) – a FeatureExtractor instance used to compute the features.
storage (
FeaturesWriter
) – a FeaturesWriter instance used to store the features.
augment_fn (
Optional
[Callable
[[ndarray
,int
],ndarray
]]) – an optional callable used for audio augmentation.
mix_eagerly (
bool
) – when False, extract and store the features for each track separately, and mix them dynamically when loading the features. When True, mix the audio first and store the mixed features, returning a new Cut instance with the same ID. The returned Cut will not have a Recording attached.
- Return type
Union
[Cut
,MixedCut
,PaddingCut
]- Returns
a new Cut instance if mix_eagerly is True; otherwise self, with each of the tracks containing the Features manifests.
-
map_supervisions
(transform_fn)¶ Modify the SupervisionSegments of this MixedCut with transform_fn.
- Parameters
transform_fn (
Callable
[[SupervisionSegment
],SupervisionSegment
]) – a function that takes a supervision as an argument and returns a modified supervision.- Return type
Union
[Cut
,MixedCut
,PaddingCut
]- Returns
a modified MixedCut.
-
__init__
(id, tracks)¶ Initialize self. See help(type(self)) for accurate signature.
-
-
class
lhotse.cut.
CutSet
(*args, **kwds)¶ CutSet combines features with their corresponding supervisions. It may have a wider span than the actual supervisions, provided the features for the whole span exist. It is the basic building block of PyTorch-style Datasets for speech/audio processing tasks.
-
cuts
: Dict[str, AnyCut]¶
-
property
ids
¶ - Return type
Iterable
[str
]
-
property
speakers
¶ - Return type
FrozenSet
[str
]
-
static
from_manifests
(recordings=None, supervisions=None, features=None)¶ Create a CutSet from any combination of supervision, feature and recording manifests. At least one of
recordings
or features
is required. The Cut boundaries correspond to those found in features
, when available, otherwise to those found in recordings
. When supervisions
is provided, we’ll attach to the Cut all supervisions that have a matching recording ID and are fully contained in the Cut’s boundaries.
- Return type
-
to_dicts
()¶ - Return type
List
[dict
]
-
describe
()¶ Print a message describing details about the
CutSet
- the number of cuts and the duration statistics, including the total duration and the percentage of speech segments.- Example output:
Cuts count: 547
Total duration (hours): 326.4
Speech duration (hours): 79.6 (24.4%)
***
Duration statistics (seconds):
mean 2148.0
std 870.9
min 477.0
25% 1523.0
50% 2157.0
75% 2423.0
max 5415.0
dtype: float64
- Return type
None
-
split
(num_splits, randomize=False)¶ Split the CutSet into num_splits pieces of equal size.
- Parameters
num_splits (
int
) – Requested number of splits.
randomize (
bool
) – Optionally randomize the order of the cuts first.
- Return type
List
[CutSet
]- Returns
A list of
CutSet
pieces.
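A minimal sketch of splitting a collection into nearly equal pieces, optionally after shuffling, is shown below. It operates on plain lists of cut IDs; the helper `split_evenly` and its `seed` argument are hypothetical, and how Lhotse distributes a remainder when the count is not divisible by num_splits is an assumption here.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_evenly(
    items: Sequence[T], num_splits: int, randomize: bool = False, seed: int = 0
) -> List[List[T]]:
    """Split items into num_splits contiguous pieces whose sizes differ by at most one."""
    if not 1 <= num_splits <= len(items):
        raise ValueError("num_splits must be between 1 and the number of items")
    pool = list(items)
    if randomize:
        random.Random(seed).shuffle(pool)  # seeded so the split is reproducible
    base, extra = divmod(len(pool), num_splits)
    pieces, start = [], 0
    for index in range(num_splits):
        size = base + (1 if index < extra else 0)  # first `extra` pieces get one more
        pieces.append(pool[start:start + size])
        start += size
    return pieces


pieces = split_evenly(["c1", "c2", "c3", "c4", "c5"], num_splits=2)
```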
-
filter
(predicate)¶ Return a new CutSet with the Cuts that satisfy the predicate.
- Parameters
predicate (
Callable
[[Union
[Cut
,MixedCut
,PaddingCut
]],bool
]) – a function that takes a cut as an argument and returns bool.- Return type
- Returns
a filtered CutSet.
-
trim_to_supervisions
()¶ Return a new CutSet with Cuts that have identical spans as their supervisions.
- Return type
- Returns
a
CutSet
.
-
trim_to_unsupervised_segments
()¶ Return a new CutSet with Cuts created from segments that have no supervisions (likely silence or noise).
- Return type
- Returns
a
CutSet
.
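Finding the regions that have no supervisions amounts to computing the gaps between supervised intervals. The sketch below does this for plain `(start, end)` tuples in seconds; `unsupervised_gaps` is a hypothetical helper written for illustration and does not reflect how Lhotse implements the method.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds, relative to the cut


def unsupervised_gaps(cut_duration: float, supervisions: List[Interval]) -> List[Interval]:
    """Return the parts of [0, cut_duration] not covered by any supervision."""
    gaps = []
    cursor = 0.0  # end of the region already known to be covered or reported
    for start, end in sorted(supervisions):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)  # handles overlapping supervisions
    if cursor < cut_duration:
        gaps.append((cursor, cut_duration))
    return gaps


gaps = unsupervised_gaps(10.0, [(1.0, 3.0), (2.5, 4.0), (7.0, 10.0)])
```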
-
mix_same_recording_channels
()¶ Find cuts that come from the same recording and have matching start and end times, but represent different channels. Then, mix them together (in matching groups) and return a new
CutSet
that contains their mixes. This is useful for processing microphone array recordings.
It is intended to be used as the first operation after creating a new
CutSet
(but might also work in other circumstances, e.g. if it was cut to windows first).
- Example:
>>> ami = prepare_ami('path/to/ami')
>>> cut_set = CutSet.from_manifests(recordings=ami['train']['recordings'])
>>> multi_channel_cut_set = cut_set.mix_same_recording_channels()
In the AMI example, the
multi_channel_cut_set
will yield MixedCuts that hold all single-channel Cuts together.- Return type
-
sort_by_duration
(ascending=False)¶ Sort the CutSet according to cut duration. Descending by default.
- Return type
-
pad
(duration=None)¶ Return a new CutSet with Cuts padded to
duration
in seconds. Cuts longer than duration will not be affected. Cuts will be padded to the right (i.e. after the signal).
- Parameters
duration (
Optional
[float
]) – The cuts’ minimal duration after padding. When not specified, we’ll choose the duration of the longest cut in the CutSet.
- Return type
CutSet
- Returns
A padded CutSet.
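The padding rule can be sketched on a plain list of durations: shorter items are raised to the target, longer ones are left alone, and the target defaults to the longest item. The helper `pad_durations` is hypothetical and only models the durations, not the audio or features.

```python
from typing import List, Optional


def pad_durations(durations: List[float], duration: Optional[float] = None) -> List[float]:
    """Return the durations each cut would have after right-padding."""
    # With no explicit target, pad everything to the longest cut.
    target = max(durations) if duration is None else duration
    # Cuts already longer than the target are not affected.
    return [max(d, target) for d in durations]


padded_to_longest = pad_durations([2.0, 5.5, 3.0])
padded_to_four = pad_durations([2.0, 5.5, 3.0], duration=4.0)
```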
-
truncate
(max_duration, offset_type, keep_excessive_supervisions=True, preserve_id=False)¶ Return a new CutSet with the Cuts truncated so that their durations are at most max_duration. Cuts shorter than max_duration will not be changed.
- Parameters
max_duration (
float
) – float, the maximum duration in seconds of a cut in the resulting manifest.
offset_type (
str
) – str, can be:
- ‘start’ => cuts are truncated from their start;
- ‘end’ => cuts are truncated from their end minus max_duration;
- ‘random’ => cuts are truncated randomly between their start and their end minus max_duration.
keep_excessive_supervisions (
bool
) – bool. When a cut is truncated in the middle of a supervision segment, should the supervision be kept.
preserve_id (
bool
) – bool. Should the truncated cut keep the same ID or get a new, random one.
- Return type
CutSet
- Returns
a new CutSet instance with truncated cuts.
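The three offset_type values differ only in where the retained window starts. The sketch below computes that window for a single cut duration; `truncation_window` is a hypothetical helper, and the optional `rng` argument is an assumption added so that the random case can be made reproducible.

```python
import random
from typing import Optional, Tuple


def truncation_window(
    cut_duration: float,
    max_duration: float,
    offset_type: str,
    rng: Optional[random.Random] = None,
) -> Tuple[float, float]:
    """Return (offset, duration) of the region kept after truncation."""
    if cut_duration <= max_duration:
        return 0.0, cut_duration  # shorter cuts are not changed
    latest_start = cut_duration - max_duration
    if offset_type == "start":
        offset = 0.0
    elif offset_type == "end":
        offset = latest_start
    elif offset_type == "random":
        offset = (rng or random).uniform(0.0, latest_start)
    else:
        raise ValueError(f"Unknown offset_type: {offset_type!r}")
    return offset, max_duration


from_start = truncation_window(10.0, 4.0, "start")
from_end = truncation_window(10.0, 4.0, "end")
at_random = truncation_window(10.0, 4.0, "random", rng=random.Random(0))
```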
-
cut_into_windows
(duration, keep_excessive_supervisions=True)¶ Return a new CutSet, made by traversing each Cut in windows of duration seconds and creating new Cuts out of them.
The last window might have a shorter duration if there was not enough audio, so you might want to use either .filter() or .pad() afterwards to obtain a uniform-duration CutSet.
- Parameters
duration (
float
) – Desired duration of the new cuts in seconds.
keep_excessive_supervisions (
bool
) – bool. When a cut is truncated in the middle of a supervision segment, should the supervision be kept.
- Return type
- Returns
a new CutSet with cuts made from shorter duration windows.
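To show why the last window can come out shorter, here is a small sketch that computes window boundaries for one cut and then applies the two clean-up options mentioned above (dropping or padding the short window). The helper `window_bounds` is hypothetical and works on durations only.

```python
from typing import List, Tuple


def window_bounds(cut_duration: float, window: float) -> List[Tuple[float, float]]:
    """Return (start, duration) pairs for consecutive, non-overlapping windows."""
    if window <= 0:
        raise ValueError("window must be positive")
    bounds = []
    index = 0
    while index * window < cut_duration:
        start = index * window
        # The final window is clipped to whatever audio is left.
        bounds.append((start, min(window, cut_duration - start)))
        index += 1
    return bounds


windows = window_bounds(10.0, 4.0)
# Option 1: keep only full-length windows (the role played by .filter()).
full_only = [w for w in windows if w[1] == 4.0]
# Option 2: extend the short window to the full length (the role played by .pad()).
padded = [(start, max(dur, 4.0)) for start, dur in windows]
```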
-
compute_and_store_features
(extractor, storage, augment_fn=None, executor=None, mix_eagerly=True)¶ Modify the current CutSet by extracting features and attaching the feature manifests to the cuts.
- Parameters
extractor (
FeatureExtractor
) – A FeatureExtractor instance (either Lhotse’s built-in or a custom implementation).
storage (
FeaturesWriter
) – A FeaturesWriter instance used to store the features.
augment_fn (
Optional
[Callable
[[ndarray
,int
],ndarray
]]) – an optional callable used for audio augmentation.
executor (
Optional
[Any
]) – when provided, will be used to parallelize the feature extraction process. Any executor satisfying the standard concurrent.futures interface will be suitable; e.g. ProcessPoolExecutor, ThreadPoolExecutor, or dask.Client for distributed task execution (see: https://docs.dask.org/en/latest/futures.html?highlight=Client#start-dask-client).
mix_eagerly (
bool
) – Related to how the features are extracted for MixedCut instances, if any are present. When False, extract and store the features for each track separately, and mix them dynamically when loading the features. When True, mix the audio first and store the mixed features, returning a new Cut instance with the same ID. The returned Cut will not have a Recording attached.
- Return type
- Returns
a new CutSet instance with the same Cuts, but with attached Features objects.
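The executor parameter relies on the standard `concurrent.futures` interface (`submit` returning a future with `result()`). The sketch below shows that pattern with a stand-in extraction function; `extract_all` and `fake_extract` are hypothetical names and do not compute real features.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Callable, List, Optional


def extract_all(
    cut_ids: List[str],
    extract_fn: Callable[[str], str],
    executor: Optional[Executor] = None,
) -> List[str]:
    """Run extract_fn for every cut, in parallel when an executor is given."""
    if executor is None:
        return [extract_fn(cut_id) for cut_id in cut_ids]
    # Submit everything first, then collect results in the original order.
    futures = [executor.submit(extract_fn, cut_id) for cut_id in cut_ids]
    return [future.result() for future in futures]


def fake_extract(cut_id: str) -> str:
    """Stand-in for real feature extraction; returns a label instead of features."""
    return f"{cut_id}-features"


with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = extract_all(["cut-1", "cut-2"], fake_extract, executor=pool)
sequential = extract_all(["cut-1", "cut-2"], fake_extract)
```

Because only `submit` and `result` are used, any executor exposing those two calls could be swapped in.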
-
map_supervisions
(transform_fn)¶ Modify the SupervisionSegments in this CutSet with transform_fn.
- Parameters
transform_fn (
Callable
[[SupervisionSegment
],SupervisionSegment
]) – a function that takes a supervision as an argument and returns a modified supervision.- Return type
- Returns
a new, modified CutSet.
-
transform_text
(transform_fn)¶ Return a copy of this CutSet with the text of all SupervisionSegments transformed with transform_fn. Useful for text normalization, phonetic transcription, etc.
- Parameters
transform_fn (
Callable
[[str
],str
]) – a function that accepts a string and returns a string.- Return type
- Returns
a new, modified CutSet.
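Any function that accepts a string and returns a string fits the transform_fn signature. Below is one possible text normalizer, given purely as an example; the normalization rules (upper-casing, keeping only letters, apostrophes and spaces) are an arbitrary choice and not something Lhotse prescribes.

```python
import re


def normalize_text(text: str) -> str:
    """Upper-case the text and keep only A-Z, apostrophes and single spaces."""
    text = text.upper()
    text = re.sub(r"[^A-Z' ]", " ", text)     # replace everything else with a space
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


normalized = normalize_text("Hello,  world! It's 5pm.")
```

Given the signature documented above, such a function could then be passed as `cut_set.transform_text(normalize_text)`.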
-
__init__
(cuts)¶ Initialize self. See help(type(self)) for accurate signature.
-
-
lhotse.cut.
make_windowed_cuts_from_features
(feature_set, cut_duration, cut_shift=None, keep_shorter_windows=False)¶ Converts a FeatureSet to a CutSet by traversing each Features object in - possibly overlapping - windows, and creating a Cut out of that area. By default, the last window in traversal will be discarded if it cannot satisfy the cut_duration requirement.
- Parameters
feature_set (
FeatureSet
) – a FeatureSet object.
cut_duration (
float
) – float, duration of created Cuts in seconds.
cut_shift (
Optional
[float
]) – optional float, specifies how many seconds are in between the starts of consecutive windows. Equals cut_duration by default.
keep_shorter_windows (
bool
) – bool, when True, the last window will be used to create a Cut even if its duration is shorter than cut_duration.
- Return type
- Returns
a CutSet object.
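The interplay of cut_duration, cut_shift and keep_shorter_windows can be sketched on a single total duration. The helper `windowed_spans` below is hypothetical; in particular, treating every window clipped by the end of the features as a "shorter window" is an assumption about edge cases, not a statement about Lhotse's behavior.

```python
from typing import List, Optional, Tuple


def windowed_spans(
    total_duration: float,
    cut_duration: float,
    cut_shift: Optional[float] = None,
    keep_shorter_windows: bool = False,
) -> List[Tuple[float, float]]:
    """Return (start, duration) pairs for possibly overlapping windows."""
    shift = cut_duration if cut_shift is None else cut_shift
    spans = []
    index = 0
    while index * shift < total_duration:
        start = index * shift
        duration = min(cut_duration, total_duration - start)
        # Windows clipped by the end of the features are dropped by default.
        if duration == cut_duration or keep_shorter_windows:
            spans.append((start, duration))
        index += 1
    return spans


non_overlapping = windowed_spans(10.0, 4.0)
overlapping = windowed_spans(10.0, 4.0, cut_shift=2.0)
with_tail = windowed_spans(10.0, 4.0, keep_shorter_windows=True)
```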
-
lhotse.cut.
mix
(reference_cut, mixed_in_cut, offset=0, snr=None)¶ Overlay, or mix, two cuts. Optionally the mixed_in_cut may be shifted by offset seconds and scaled down (positive SNR) or scaled up (negative SNR). Returns a MixedCut, which contains both cuts and the mix information. The actual feature mixing is performed during the call to
MixedCut.load_features()
.- Parameters
reference_cut (
Union
[Cut
,MixedCut
,PaddingCut
]) – The reference cut for the mix - offset and snr are specified w.r.t. this cut.
mixed_in_cut (
Union
[Cut
,MixedCut
,PaddingCut
]) – The mixed-in cut - it will be offset and rescaled to match the offset and snr parameters.
offset (
float
) – How many seconds to shift the mixed_in_cut w.r.t. the reference_cut.
snr (
Optional
[float
]) – Desired SNR of the mixed_in_cut w.r.t. the reference_cut in the mix.
- Return type
- Returns
A MixedCut instance.
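The scaling behavior (positive SNR scales the mixed-in signal down, negative SNR scales it up) follows from the usual decibel relation between signal powers. The sketch below mixes two lists of samples in the time domain; it is an assumption-laden illustration, not Lhotse's code: the helper names are made up, the offset is given in samples rather than seconds, and power is taken as the mean of squared samples.

```python
import math
from typing import List, Optional


def mean_power(samples: List[float]) -> float:
    """Average signal power (mean of squared samples)."""
    return sum(s * s for s in samples) / len(samples)


def snr_gain(reference: List[float], mixed_in: List[float], snr_db: float) -> float:
    """Gain for mixed_in so that reference power / scaled power matches snr_db."""
    target_power = mean_power(reference) / (10 ** (snr_db / 10))
    return math.sqrt(target_power / mean_power(mixed_in))


def mix_samples(
    reference: List[float],
    mixed_in: List[float],
    offset_samples: int = 0,
    snr_db: Optional[float] = None,
) -> List[float]:
    """Overlay mixed_in on reference, shifted by offset_samples and rescaled."""
    gain = 1.0 if snr_db is None else snr_gain(reference, mixed_in, snr_db)
    length = max(len(reference), offset_samples + len(mixed_in))
    mixed = [0.0] * length
    for i, sample in enumerate(reference):
        mixed[i] += sample
    for i, sample in enumerate(mixed_in):
        mixed[offset_samples + i] += gain * sample
    return mixed


reference = [1.0, 1.0, 1.0, 1.0]
noise = [2.0, 2.0]
mixed = mix_samples(reference, noise, offset_samples=1, snr_db=0.0)
```

At 0 dB the noise is scaled to the same power as the reference; raising the SNR lowers the gain and lowering it raises the gain.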
-
lhotse.cut.
append
(left_cut, right_cut, snr=None)¶ Helper method for functional-style appending of Cuts.
- Return type
-
lhotse.cut.
mix_cuts
(cuts)¶ Return a MixedCut that consists of the input Cuts mixed with each other as-is.
- Return type
-
lhotse.cut.
append_cuts
(cuts)¶ Return a MixedCut that consists of the input Cuts appended to each other as-is.
- Return type
Union
[Cut
,MixedCut
,PaddingCut
]
Recipes¶
Convenience methods used to prepare recording and supervision manifests for standard corpora.
Kaldi conversion¶
Convenience methods used to interact with Kaldi data directories.
-
lhotse.kaldi.
load_kaldi_data_dir
(path, sampling_rate)¶ Load a Kaldi data directory and convert it to Lhotse RecordingSet and SupervisionSet manifests. For this to work, at least the wav.scp file must exist. SupervisionSet is created only when a segments file exists. All the other files (text, utt2spk, etc.) are optional, and some of them might not be handled yet. In particular, feats.scp files are ignored.
- Return type
Tuple
[RecordingSet
,Optional
[SupervisionSet
]]
-
lhotse.kaldi.
load_kaldi_text_mapping
(path, must_exist=False)¶ Load Kaldi files such as utt2spk, spk2gender, text, etc. as a dict.
- Return type
Dict
[str
,Optional
[str
]]
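Kaldi mapping files such as text or utt2spk hold one entry per line: a key, a space, then the value. A minimal stand-alone reader along those lines is sketched below; `read_text_mapping` is a hypothetical helper written for illustration, and returning None for unknown keys when the file is missing is an assumption suggested by the Optional value type in the signature.

```python
import tempfile
from collections import defaultdict
from pathlib import Path
from typing import Dict, Optional


def read_text_mapping(path: Path, must_exist: bool = False) -> Dict[str, Optional[str]]:
    """Read 'key value...' lines into a dict; unknown keys map to None."""
    mapping: Dict[str, Optional[str]] = defaultdict(lambda: None)
    if path.is_file():
        with path.open() as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                # Split on the first space only: the value may contain spaces.
                key, _, value = line.partition(" ")
                mapping[key] = value.strip() or None
    elif must_exist:
        raise FileNotFoundError(f"No such file: {path}")
    return mapping


with tempfile.TemporaryDirectory() as tmp:
    text_path = Path(tmp) / "text"
    text_path.write_text("utt1 hello world\nutt2 good morning\n")
    text = read_text_mapping(text_path)
    missing = read_text_mapping(Path(tmp) / "utt2spk")  # optional file, absent
```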
Others¶
Helper methods used throughout the codebase.
-
lhotse.manipulation.
combine
(*manifests)¶ Combine multiple manifests of the same type into one.
- Return type
~Manifest
-
lhotse.manipulation.
to_manifest
(items)¶ Take an iterable of data types in Lhotse such as Recording, SupervisionSegment or Cut, and create the manifest of the corresponding type. When the iterable is empty, returns None.
- Return type
Optional
[~Manifest]
-
lhotse.manipulation.
load_manifest
(path)¶ Generic utility for reading an arbitrary manifest.
- Return type
~Manifest