API Reference

This page contains a comprehensive list of all classes and functions within lhotse.

Audio loading, saving, and manifests

Data structures and utilities used for describing and manipulating audio recordings.

class lhotse.audio.AudioSource(type, channels, source, video=None)[source]

AudioSource represents audio data that can be retrieved from somewhere.

type: str

The type of audio source. Supported types are:

  • ‘file’ – supports most standard audio encodings, possibly multi-channel.

  • ‘command’ (unix pipe) – supports most standard audio encodings, possibly multi-channel.

  • ‘url’ – any URL type that is supported by the “smart_open” library, e.g. http/https/s3/gcp/azure/etc.

  • ‘memory’ – any format, read from a binary string attached to the ‘source’ member of AudioSource.

  • ‘shar’ – indicates a placeholder that will be filled later when using the Lhotse Shar data format.
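For illustration, a minimal sketch of constructing two source types (the paths and pipe command are hypothetical):

>>> from lhotse import AudioSource
>>> file_src = AudioSource(type='file', channels=[0], source='audio/utt1.wav')
>>> pipe_src = AudioSource(type='command', channels=[0], source='sph2pipe -f wav audio/utt1.sph')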

channels: List[int]

A list of integer channel IDs available in this AudioSource.

source: Union[str, bytes]

The actual source to read from. The contents depend on the type field, but in general it can be a path, a URL, or the encoded binary data itself.

video: Optional[VideoInfo] = None

Optional information about the video contained in this source, if any.

property has_video: bool
load_audio(offset=0.0, duration=None, force_opus_sampling_rate=None)[source]

Load the AudioSource (from files, commands, or URLs) with soundfile, accounting for many audio formats and multi-channel inputs. Returns a numpy array with shape (n_samples,) for single-channel audio, or (n_channels, n_samples) for multi-channel audio.

Note: The elements in the returned array are in the range [-1.0, 1.0] and are of dtype np.float32.

Parameters:

force_opus_sampling_rate (Optional[int]) – This parameter is only used when we detect an OPUS file. It will tell ffmpeg to resample OPUS to this sampling rate.

Return type:

ndarray
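A usage sketch (reusing the hypothetical file_src from the example above; note the normalized float32 output):

>>> import numpy as np
>>> samples = file_src.load_audio(offset=0.5, duration=1.0)
>>> assert samples.dtype == np.float32
>>> assert np.abs(samples).max() <= 1.0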

load_video(offset=0.0, duration=None, with_audio=True)[source]
Return type:

Tuple[Tensor, Optional[Tensor]]

with_video_resolution(width, height)[source]
Return type:

AudioSource

with_path_prefix(path)[source]
Return type:

AudioSource

__init__(type, channels, source, video=None)
to_dict()[source]
Return type:

dict

static from_dict(data)[source]
Return type:

AudioSource

class lhotse.audio.Recording(id, sources, sampling_rate, num_samples, duration, channel_ids=None, transforms=None)[source]

The Recording manifest describes the recordings in a given corpus. It contains information about the recording, such as its path(s), duration, the number of samples, etc. It can represent multiple channels coming from one or more files.

This manifest does not specify any segmentation information or supervision such as the transcript or the speaker – we use SupervisionSegment for that.

Note that Recording can represent both a single utterance (e.g., in LibriSpeech) and a 1-hour session with multiple channels and speakers (e.g., in AMI). In the latter case, it is partitioned into data suitable for model training using Cut.

Internally, Lhotse supports multiple audio backends to read audio files. By default, we try to use libsoundfile, then torchaudio (with FFMPEG integration starting with torchaudio 2.1), and then audioread (which is an ffmpeg CLI wrapper). For sphere files we prefer to use the sph2pipe binary, as it can handle certain unique encodings such as “shorten”.

Audio backends in Lhotse are configurable. See:

  • available_audio_backends()

  • audio_backend()

  • get_current_audio_backend()

  • set_current_audio_backend()

  • get_default_audio_backend()

Examples

A Recording can be simply created from a local audio file:

>>> from lhotse import RecordingSet, Recording, AudioSource
>>> recording = Recording.from_file('meeting.wav')
>>> recording
Recording(
    id='meeting',
    sources=[AudioSource(type='file', channels=[0], source='meeting.wav')],
    sampling_rate=16000,
    num_samples=57600000,
    duration=3600.0,
    transforms=None
)

This manifest can be easily converted to a Python dict and serialized to JSON/JSONL/YAML/etc:

>>> recording.to_dict()
{'id': 'meeting',
 'sources': [{'type': 'file',
   'channels': [0],
   'source': 'meeting.wav'}],
 'sampling_rate': 16000,
 'num_samples': 57600000,
 'duration': 3600.0}

Recordings can also be created programmatically, e.g. when they refer to URLs stored in S3 or somewhere else:

>>> s3_audio_files = ['s3://my-bucket/123-5678.flac', ...]
>>> recs = RecordingSet.from_recordings(
...     Recording(
...         id=url.split('/')[-1].replace('.flac', ''),
...         sources=[AudioSource(type='url', source=url, channels=[0])],
...         sampling_rate=16000,
...         num_samples=get_num_samples(url),
...         duration=get_duration(url)
...     )
...     for url in s3_audio_files
... )

It allows reading a subset of the audio samples as a numpy array (the shapes below assume a one-second, single-channel 16kHz recording):

>>> samples = recording.load_audio()
>>> assert samples.shape == (1, 16000)
>>> samples2 = recording.load_audio(offset=0.5)
>>> assert samples2.shape == (1, 8000)

See also: Recording, Cut, CutSet.

id: str
sources: List[AudioSource]
sampling_rate: int
num_samples: int
duration: float
channel_ids: Optional[List[int]] = None
transforms: Optional[List[Dict]] = None
property video: VideoInfo | None
property has_video: bool
property num_channels: int
static from_file(path, recording_id=None, relative_path_depth=None, force_opus_sampling_rate=None, force_read_audio=False)[source]

Read an audio file’s header and create the corresponding Recording. Suitable to use when each physical file represents a separate recording session.

Caution

If a recording session consists of multiple files (e.g. one per channel), it is advisable to create the Recording object manually, with each file represented as a separate AudioSource object.

Parameters:
  • path (Union[Path, str]) – Path to an audio file supported by libsoundfile (pysoundfile).

  • recording_id (Union[str, Callable[[Path], str], None]) – recording id; when not specified, we use the filename’s stem (“x.wav” -> “x”). It can be specified as a string or a function that takes the recording path and returns a string.

  • relative_path_depth (Optional[int]) – optional int specifying how many last parts of the file path should be retained in the AudioSource. By default writes the path as is.

  • force_opus_sampling_rate (Optional[int]) – when specified, this value will be used as the sampling rate instead of the one we read from the manifest. This is useful for OPUS files that always have 48kHz rate and need to be resampled to the real one – we will perform that operation “under-the-hood”. For non-OPUS files this input is undefined.

  • force_read_audio (bool) – Set it to True for audio files that do not have any metadata in their headers (e.g., “The People’s Speech” FLAC files).

Return type:

Recording

Returns:

a new Recording instance pointing to the audio file.

static from_bytes(data, recording_id)[source]

Like Recording.from_file(), but creates a manifest for a byte string with raw encoded audio data. This data is first decoded to obtain info such as the sampling rate, number of channels, etc. Then, the binary data is attached to the manifest. Calling Recording.load_audio() does not perform any I/O and instead decodes the byte string contents in memory.

Note

Intended use of this method is for packing Recordings into archives where metadata and data should be available together (e.g., in WebDataset style tarballs).

Caution

Manifest created with this method cannot be stored as JSON because JSON doesn’t allow serializing binary data.

Parameters:
  • data (bytes) – bytes, byte string containing encoded audio contents.

  • recording_id (str) – recording id, unique string identifier.

Return type:

Recording

Returns:

a new Recording instance that owns the byte string data.
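A possible usage sketch (the file path is hypothetical):

>>> from lhotse import Recording
>>> data = open('audio/utt1.wav', 'rb').read()
>>> rec = Recording.from_bytes(data, recording_id='utt1')
>>> samples = rec.load_audio()  # decodes the in-memory bytes, no file I/O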

move_to_memory(channels=None, offset=None, duration=None, format=None)[source]

Read audio data and return a copy of the manifest with binary data attached. Calling Recording.load_audio() on that copy will not trigger I/O.

If all arguments are left as defaults, we won’t decode the audio, and will attach the bytes we read from disk/other source as-is. If channels, duration, or offset are specified, we’ll decode the audio and re-encode it into format before attaching. The default format is FLAC; other formats compatible with torchaudio.save are also accepted.

Return type:

Recording
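For example (a sketch; rec is assumed to be a Recording backed by a file):

>>> rec_mem = rec.move_to_memory()  # attaches the raw file bytes as-is, no decoding
>>> rec_flac = rec.move_to_memory(channels=[0], format='flac')  # decodes and re-encodes to FLAC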

to_dict()[source]
Return type:

dict

to_cut()[source]

Create a Cut out of this recording — MonoCut or MultiCut, depending on the number of channels.

load_audio(channels=None, offset=0.0, duration=None)[source]

Read the audio samples from the underlying audio source (path, URL, unix pipe/command).

Parameters:
  • channels (Union[List[int], int, None]) – int or iterable of ints, a subset of channel IDs to read (reads all by default).

  • offset (float) – seconds, where to start reading the audio (at offset 0 by default). Note that it is only efficient for local filesystem files, i.e. URLs and commands will read all the samples first and discard the unneeded ones afterwards.

  • duration (Optional[float]) – seconds, indicates the total audio time to read (starting from offset).

Return type:

ndarray

Returns:

a numpy array of audio samples with shape (num_channels, num_samples).
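For example, reading a two-second excerpt of the first channel (a sketch assuming a 16kHz recording rec):

>>> samples = rec.load_audio(channels=0, offset=1.5, duration=2.0)
>>> assert samples.shape == (1, 32000)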

load_video(channels=None, offset=0.0, duration=None, with_audio=True, force_consistent_duration=True)[source]

Read the video frames and audio samples from the underlying source (path, URL, unix pipe/command).

Parameters:
  • channels (Union[List[int], int, None]) – int or iterable of ints, a subset of channel IDs to read (reads all by default).

  • offset (float) – seconds, where to start reading the video (at offset 0 by default). Note that it is only efficient for local filesystem files, i.e. URLs and commands will read all the samples first and discard the unneeded ones afterwards.

  • duration (Optional[float]) – seconds, indicates the total video time to read (starting from offset).

  • with_audio (bool) – bool, whether to load and return audio alongside video. True by default.

  • force_consistent_duration (bool) – bool, if audio duration is different than video duration (as counted by num_frames / fps), we’ll either truncate or pad the audio with zeros. True by default.

Return type:

Tuple[Tensor, Optional[Tensor]]

Returns:

a tuple of video tensor and optional audio tensor (or None).

play_video()[source]
with_path_prefix(path)[source]
Return type:

Recording

with_video_resolution(width, height)[source]
Return type:

Recording

perturb_speed(factor, affix_id=True)[source]

Return a new Recording that will lazily perturb the speed while loading audio. The num_samples and duration fields are updated to reflect the shrinking/extending effect of speed.

Parameters:
  • factor (float) – The speed will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • affix_id (bool) – When true, we will modify the Recording.id field by affixing it with “_sp{factor}”.

Return type:

Recording

Returns:

a modified copy of the current Recording.

perturb_tempo(factor, affix_id=True)[source]

Return a new Recording that will lazily perturb the tempo while loading audio.

Compared to speed perturbation, tempo preserves pitch. The num_samples and duration fields are updated to reflect the shrinking/extending effect of tempo.

Parameters:
  • factor (float) – The tempo will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • affix_id (bool) – When true, we will modify the Recording.id field by affixing it with “_tp{factor}”.

Return type:

Recording

Returns:

a modified copy of the current Recording.

perturb_volume(factor, affix_id=True)[source]

Return a new Recording that will lazily perturb the volume while loading audio.

Parameters:
  • factor (float) – The volume scale to be applied (e.g. factor=1.1 means 1.1x louder).

  • affix_id (bool) – When true, we will modify the Recording.id field by affixing it with “_vp{factor}”.

Return type:

Recording

Returns:

a modified copy of the current Recording.
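A sketch combining the lazy perturbations above (assuming rec is a 10-second Recording; nothing is recomputed until load_audio() is called):

>>> rec_sp = rec.perturb_speed(1.1)   # id affixed with '_sp1.1'; duration ~= 9.09s
>>> rec_tp = rec.perturb_tempo(1.1)   # like speed, but preserves pitch
>>> rec_vp = rec.perturb_volume(2.0)  # id affixed with '_vp2.0'; duration unchanged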

normalize_loudness(target, affix_id=False)[source]

Return a new Recording that will lazily normalize the loudness to the target level while loading audio.

Parameters:
  • target (float) – The target loudness (in dB) to normalize to.

  • affix_id (bool) – When true, we will modify the Recording.id field by affixing it with “_ln{target}”.

Return type:

Recording

Returns:

a modified copy of the current Recording.

dereverb_wpe(affix_id=True)[source]

Return a new Recording that will lazily apply WPE dereverberation.

Parameters:

affix_id (bool) – When true, we will modify the Recording.id field by affixing it with “_wpe”.

Return type:

Recording

Returns:

a modified copy of the current Recording.

reverb_rir(rir_recording=None, normalize_output=True, early_only=False, affix_id=True, rir_channels=None, room_rng_seed=None, source_rng_seed=None)[source]

Return a new Recording that will lazily apply reverberation based on provided impulse response while loading audio. If no impulse response is provided, we will generate an RIR using a fast random generator (https://arxiv.org/abs/2208.04101).

Parameters:
  • rir_recording (Optional[Recording]) – The impulse response to be used.

  • normalize_output (bool) – When true, the output will be normalized to have the same energy as the input.

  • early_only (bool) – When true, only the early reflections (first 50 ms) will be used.

  • affix_id (bool) – When true, we will modify the Recording.id field by affixing it with “_rvb”.

  • rir_channels (Optional[List[int]]) – The channels of the impulse response to be used (in case of multi-channel impulse responses). By default, only the first channel is used. If no RIR is provided, we will generate one with as many channels as this argument specifies.

  • room_rng_seed (Optional[int]) – The seed to be used for the room configuration.

  • source_rng_seed (Optional[int]) – The seed to be used for the source position.

Return type:

Recording

Returns:

the perturbed Recording.
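For example (a sketch; rir is assumed to be a Recording holding an impulse response):

>>> rec_rvb = rec.reverb_rir(rir_recording=rir)  # convolve with the provided RIR
>>> rec_synth = rec.reverb_rir(room_rng_seed=0, source_rng_seed=0)  # no RIR given: a random one is generated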

resample(sampling_rate)[source]

Return a new Recording that will be lazily resampled while loading audio.

Parameters:

sampling_rate (int) – The new sampling rate.

Return type:

Recording

Returns:

A resampled Recording.
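For example (assuming rec is a 16kHz Recording):

>>> rec_8k = rec.resample(8000)  # sampling_rate becomes 8000; samples are resampled lazily upon load_audio()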

static from_dict(data)[source]
Return type:

Recording

__init__(id, sources, sampling_rate, num_samples, duration, channel_ids=None, transforms=None)
class lhotse.audio.RecordingSet(recordings=None)[source]

RecordingSet represents a collection of recordings. It does not contain any annotation such as the transcript or the speaker identity – just the information needed to retrieve a recording such as its path, URL, number of channels, and some recording metadata (duration, number of samples).

It also supports (de)serialization to/from YAML/JSON/etc. and takes care of mapping between rich Python classes and YAML/JSON/etc. primitives during conversion.

When coming from Kaldi, think of it as wav.scp on steroids: RecordingSet also has the information from reco2dur and reco2num_samples, is able to represent multi-channel recordings and read a specified subset of channels, and supports reading audio files directly, via a unix pipe, or downloading them on-the-fly from a URL (HTTPS/S3/Azure/GCP/etc.).

Examples:

RecordingSet can be created from an iterable of Recording objects:

>>> from lhotse import RecordingSet
>>> audio_paths = ['123-5678.wav', ...]
>>> recs = RecordingSet.from_recordings(Recording.from_file(p) for p in audio_paths)

As well as from a directory, which will be scanned recursively for files with parallel processing:

>>> recs2 = RecordingSet.from_dir('/data/audio', pattern='*.flac', num_jobs=4)

It behaves similarly to a dict:

>>> '123-5678' in recs
True
>>> recording = recs['123-5678']
>>> for recording in recs:
>>>    pass
>>> len(recs)
127

It also provides some utilities for I/O:

>>> recs.to_file('recordings.jsonl')
>>> recs.to_file('recordings.json.gz')  # auto-compression
>>> recs2 = RecordingSet.from_file('recordings.jsonl')

Manipulation:

>>> longer_than_5s = recs.filter(lambda r: r.duration > 5)
>>> first_100 = recs.subset(first=100)
>>> split_into_4 = recs.split(num_splits=4)
>>> shuffled = recs.shuffle()

And lazy data augmentation/transformation that requires adjusting some information in the manifest (e.g., num_samples or duration). Note that in the following examples, the audio is untouched – the operations are stored in the manifest and executed upon reading the audio:

>>> recs_sp = recs.perturb_speed(factor=1.1)
>>> recs_vp = recs.perturb_volume(factor=2.)
>>> recs_rvb = recs.reverb_rir(rir_recs)
>>> recs_24k = recs.resample(24000)
__init__(recordings=None)[source]
property data: Dict[str, Recording] | Iterable[Recording]

Alias property for self.recordings

property ids: Iterable[str]
static from_recordings(recordings)[source]
Return type:

RecordingSet

static from_items(recordings)

Function to be implemented by every sub-class of this mixin. It’s expected to create a sub-class instance out of an iterable of items that are held by the sub-class (e.g., CutSet.from_items(iterable_of_cuts)).

Return type:

RecordingSet

static from_dir(path, pattern, num_jobs=1, force_opus_sampling_rate=None, recording_id=None, exclude_pattern=None)[source]

Recursively scan a directory path for audio files that match the given pattern and create a RecordingSet manifest for them. Suitable to use when each physical file represents a separate recording session.

Caution

If a recording session consists of multiple files (e.g. one per channel), it is advisable to create each Recording object manually, with each file represented as a separate AudioSource object, and then a RecordingSet that contains all the recordings.

Parameters:
  • path (Union[Path, str]) – Path to a directory of audio files (possibly with sub-directories).

  • pattern (str) – A bash-like pattern specifying allowed filenames, e.g. *.wav or session1-*.flac.

  • num_jobs (int) – The number of parallel workers for reading audio files to get their metadata.

  • force_opus_sampling_rate (Optional[int]) – when specified, this value will be used as the sampling rate instead of the one we read from the manifest. This is useful for OPUS files that always have 48kHz rate and need to be resampled to the real one – we will perform that operation “under-the-hood”. For non-OPUS files this input does nothing.

  • recording_id (Optional[Callable[[Path], str]]) – A function which takes the audio file path and returns the recording ID. If not specified, the filename will be used as the recording ID.

  • exclude_pattern (Optional[str]) – optional regex string for identifying file name patterns to exclude. There has to be a full regex match to trigger exclusion.

Returns:

a new RecordingSet instance pointing to the audio files.
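For illustration, a sketch using a custom recording_id function and an exclusion pattern (the paths are hypothetical):

>>> recs = RecordingSet.from_dir(
...     '/data/audio',
...     pattern='*.wav',
...     num_jobs=4,
...     recording_id=lambda p: f'{p.parent.name}-{p.stem}',
...     exclude_pattern='.*backup.*',
... )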

static from_dicts(data)[source]
Return type:

RecordingSet

to_dicts()[source]
Return type:

Iterable[dict]

split(num_splits, shuffle=False, drop_last=False)[source]

Split the RecordingSet into num_splits pieces of equal size.

Parameters:
  • num_splits (int) – Requested number of splits.

  • shuffle (bool) – Optionally shuffle the recordings order first.

  • drop_last (bool) – determines how to handle splitting when len(seq) is not divisible by num_splits. When False (default), the splits might have unequal lengths. When True, it may discard the last element in some splits to ensure they are equally long.

Return type:

List[RecordingSet]

Returns:

A list of RecordingSet pieces.

split_lazy(output_dir, chunk_size, prefix='')[source]

Splits a manifest (either lazily or eagerly opened) into chunks, each with chunk_size items (except for the last one, typically).

In order to be memory efficient, this implementation saves each chunk to disk in a .jsonl.gz format as the input manifest is sampled.

Note

For lowest memory usage, use load_manifest_lazy to open the input manifest for this method.

Parameters:
  • output_dir (Union[Path, str]) – directory where the split manifests are saved. Each manifest is saved at: {output_dir}/{prefix}.{split_idx}.jsonl.gz

  • chunk_size (int) – the number of items in each chunk.

  • prefix (str) – the prefix of each manifest.

Return type:

List[RecordingSet]

Returns:

a list of lazily opened chunk manifests.
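For example (a sketch; the output paths are hypothetical):

>>> from lhotse import load_manifest_lazy
>>> recs = load_manifest_lazy('recordings.jsonl.gz')
>>> chunks = recs.split_lazy(output_dir='exp/chunks', chunk_size=1000, prefix='recordings')
>>> # produces exp/chunks/recordings.0.jsonl.gz, exp/chunks/recordings.1.jsonl.gz, ...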

subset(first=None, last=None)[source]

Return a new RecordingSet according to the selected subset criterion. Only a single argument to subset is supported at this time.

Parameters:
  • first (Optional[int]) – int, the number of first recordings to keep.

  • last (Optional[int]) – int, the number of last recordings to keep.

Return type:

RecordingSet

Returns:

a new RecordingSet with the subset results.

load_audio(recording_id, channels=None, offset_seconds=0.0, duration_seconds=None)[source]
Return type:

ndarray

with_path_prefix(path)[source]
Return type:

RecordingSet

num_channels(recording_id)[source]
Return type:

int

sampling_rate(recording_id)[source]
Return type:

int

num_samples(recording_id)[source]
Return type:

int

duration(recording_id)[source]
Return type:

float

perturb_speed(factor, affix_id=True)[source]

Return a new RecordingSet that will lazily perturb the speed while loading audio. The num_samples and duration fields are updated to reflect the shrinking/extending effect of speed.

Parameters:
  • factor (float) – The speed will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • affix_id (bool) – When true, we will modify the Recording.id field by affixing it with “_sp{factor}”.

Return type:

RecordingSet

Returns:

a RecordingSet containing the perturbed Recording objects.

perturb_tempo(factor, affix_id=True)[source]

Return a new RecordingSet that will lazily perturb the tempo while loading audio. The num_samples and duration fields are updated to reflect the shrinking/extending effect of tempo.

Parameters:
  • factor (float) – The tempo will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • affix_id (bool) – When true, we will modify the Recording.id field by affixing it with “_tp{factor}”.

Return type:

RecordingSet

Returns:

a RecordingSet containing the perturbed Recording objects.

perturb_volume(factor, affix_id=True)[source]

Return a new RecordingSet that will lazily perturb the volume while loading audio.

Parameters:
  • factor (float) – The volume scale to be applied (e.g. factor=1.1 means 1.1x louder).

  • affix_id (bool) – When true, we will modify the Recording.id field by affixing it with “_vp{factor}”.

Return type:

RecordingSet

Returns:

a RecordingSet containing the perturbed Recording objects.

reverb_rir(rir_recordings=None, normalize_output=True, early_only=False, affix_id=True, rir_channels=[0], room_rng_seed=None, source_rng_seed=None)[source]

Return a new RecordingSet that will lazily apply reverberation based on provided impulse responses while loading audio. If no rir_recordings are provided, we will generate a set of impulse responses using a fast random generator (https://arxiv.org/abs/2208.04101).

Parameters:
  • rir_recordings (Optional[RecordingSet]) – The impulse responses to be used.

  • normalize_output (bool) – When true, the output will be normalized to have the same energy as the input.

  • early_only (bool) – When true, only the early reflections (first 50 ms) will be used.

  • affix_id (bool) – When true, we will modify the Recording.id field by affixing it with “_rvb”.

  • rir_channels (List[int]) – The channels to be used for the RIRs (if multi-channel). Uses first channel by default. If no RIR is provided, we will generate one with as many channels as this argument specifies.

  • room_rng_seed (Optional[int]) – The seed to be used for the room configuration.

  • source_rng_seed (Optional[int]) – The seed to be used for the source positions.

Return type:

RecordingSet

Returns:

a RecordingSet containing the perturbed Recording objects.

resample(sampling_rate)[source]

Apply resampling to all recordings in the RecordingSet and return a new RecordingSet.

Parameters:

sampling_rate (int) – The new sampling rate.

Return type:

RecordingSet

Returns:

a new RecordingSet with lazily resampled Recording objects.

filter(predicate)

Return a new manifest containing only the items that satisfy predicate. If the manifest is lazy, the filtering will also be applied lazily.

Parameters:

predicate (Callable[[TypeVar(T)], bool]) – a function that takes a cut as an argument and returns bool.

Returns:

a filtered manifest.

classmethod from_file(path)
Return type:

Any

classmethod from_json(path)
Return type:

Any

classmethod from_jsonl(path)
Return type:

Any

classmethod from_jsonl_lazy(path)

Read a JSONL manifest in a lazy manner, which opens the file but does not read it immediately. It is only suitable for sequential reads and iteration.

Return type:

Any

Warning

Opening the manifest in this way might cause some methods that rely on random access to fail.
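A usage sketch (the manifest path is hypothetical; note that random access is unavailable in lazy mode):

>>> recs = RecordingSet.from_jsonl_lazy('recordings.jsonl')
>>> for recording in recs:  # sequential iteration works
...     pass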

classmethod from_yaml(path)
Return type:

Any

classmethod infinite_mux(*manifests, weights=None, seed=0, max_open_streams=None)

Merges multiple manifest iterables into a new iterable by lazily multiplexing them during iteration time. Unlike mux(), this method allows limiting the maximum number of open sub-iterators at any given time.

To enable this, it performs 2-stage sampling. First, it samples with replacement the set of iterators I to construct a subset I_sub of size max_open_streams. Then, for each iteration step, it samples an iterator i from I_sub, fetches the next item from it, and yields it. Once i becomes exhausted, it is replaced with a new iterator j sampled from I_sub.

Caution

Do not use this method with inputs that are infinitely iterable as they will silently break the multiplexing property by only using a subset of the input iterables.

Caution

This method is not recommended when multiplexing for a small number of iterations, as it may be much less accurate than mux() depending on the number of open streams, iterable sizes, and the random seed.

Parameters:
  • manifests – iterables to be multiplexed. They can be either lazy or eager, but the resulting manifest will always be lazy.

  • weights (Optional[List[Union[int, float]]]) – an optional weight for each iterable, affects the probability of it being sampled. The weights are uniform by default. If lengths are known, it makes sense to pass them here to achieve a uniform distribution of items in expectation.

  • seed (Union[int, Literal['trng', 'randomized']]) – the random seed, ensures deterministic order across multiple iterations.

  • max_open_streams (Optional[int]) – the number of iterables that can be open simultaneously at any given time.

property is_lazy: bool

Indicates whether this manifest was opened in lazy (read-on-the-fly) mode or not.

map(transform_fn)

Apply transform_fn to each item in this manifest and return a new manifest. If the manifest is opened lazy, the transform is also applied lazily.

Parameters:

transform_fn (Callable[[TypeVar(T)], TypeVar(T)]) – A callable (function) that accepts a single item instance and returns a new (or the same) instance of the same type. E.g. with CutSet, callable accepts Cut and returns also Cut.

Returns:

a new CutSet with transformed cuts.

classmethod mux(*manifests, stop_early=False, weights=None, seed=0)

Merges multiple manifest iterables into a new iterable by lazily multiplexing them during iteration time. If one of the iterables is exhausted before the others, we will keep iterating until all iterables are exhausted. This behavior can be changed with stop_early parameter.

Parameters:
  • manifests – iterables to be multiplexed. They can be either lazy or eager, but the resulting manifest will always be lazy.

  • stop_early (bool) – should we stop the iteration as soon as we exhaust one of the manifests.

  • weights (Optional[List[Union[int, float]]]) – an optional weight for each iterable, affects the probability of it being sampled. The weights are uniform by default. If lengths are known, it makes sense to pass them here to achieve a uniform distribution of items in expectation.

  • seed (Union[int, Literal['trng', 'randomized']]) – the random seed, ensures deterministic order across multiple iterations.
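For example, interleaving two sets with a 2:1 sampling ratio (a sketch; recs_a and recs_b are assumed to exist):

>>> mixed = RecordingSet.mux(recs_a, recs_b, weights=[2, 1], seed=0)
>>> # 'mixed' is lazy; items are drawn from recs_a about twice as often as from recs_b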

classmethod open_writer(path, overwrite=True)

Open a sequential writer that allows storing the manifests one by one, without the necessity of keeping the whole manifest set in memory. Supports writing to JSONL format (.jsonl), with optional gzip compression (.jsonl.gz).

Return type:

Union[SequentialJsonlWriter, InMemoryWriter]

Note

when path is None, we will return an InMemoryWriter instead, which has the same API but stores the manifests in memory. It is convenient when you want to make disk saving optional.

Example:

>>> from lhotse import RecordingSet
... recordings = [...]
... with RecordingSet.open_writer('recordings.jsonl.gz') as writer:
...     for recording in recordings:
...         writer.write(recording)

This writer can be useful for continuing to write files that were previously stopped – it will open the existing file and scan it for item IDs to skip writing them later. It can also be queried for existing IDs so that the user code may skip preparing the corresponding manifests.

Example:

>>> from lhotse import RecordingSet, Recording
... with RecordingSet.open_writer('recordings.jsonl.gz', overwrite=False) as writer:
...     for path in Path('.').rglob('*.wav'):
...         recording_id = path.stem
...         if writer.contains(recording_id):
...             # Item already written previously - skip processing.
...             continue
...         # Item doesn't exist yet - run extra work to prepare the manifest
...         # and store it.
...         recording = Recording.from_file(path, recording_id=recording_id)
...         writer.write(recording)
repeat(times=None, preserve_id=False)

Return a new, lazily evaluated manifest that iterates over the original elements as many times as specified by times.

Parameters:
  • times (Optional[int]) – how many times to repeat (infinite by default).

  • preserve_id (bool) – when True, we won’t update the element ID with repeat number.

Returns:

a repeated manifest.

shuffle(rng=None, buffer_size=10000)

Shuffles the elements and returns a shuffled variant of self. If the manifest is opened lazily, performs shuffling on-the-fly with a fixed buffer size.

Parameters:

rng (Optional[Random]) – an optional instance of random.Random for precise control of randomness.

Returns:

a shuffled copy of self, or a manifest that is shuffled lazily.

to_eager()

Evaluates all lazy operations on this manifest, if any, and returns a copy that keeps all items in memory. If the manifest was “eager” already, this is a no-op and won’t copy anything.

to_file(path)
Return type:

None

to_json(path)
Return type:

None

to_jsonl(path)
Return type:

None

to_yaml(path)
Return type:

None

exception lhotse.audio.AudioLoadingError[source]
__init__(*args, **kwargs)
args
with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

exception lhotse.audio.DurationMismatchError[source]
__init__(*args, **kwargs)
args
with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

class lhotse.audio.VideoInfo(fps, num_frames, height, width)[source]

Metadata about video content in a Recording.

fps: float

Video frame rate (frames per second). It’s a float because some standard FPS are fractional (e.g. 59.94)

num_frames: int

Number of video frames.

height: int

Height in pixels.

width: int

Width in pixels.

property duration: float
property frame_length: float
copy_with(**kwargs)[source]
Return type:

VideoInfo

__init__(fps, num_frames, height, width)
lhotse.audio.audio_backend(backend)[source]

Context manager that sets Lhotse’s audio backend to the specified value and restores the previous audio backend at the end of its scope.

Example:

>>> with audio_backend("LibsndfileBackend"):
...     some_audio_loading_fn()
Return type:

Generator[AudioBackend, None, None]

lhotse.audio.available_audio_backends()[source]

Return a list of names of available audio backends, including “default”.

Return type:

List[str]

lhotse.audio.get_current_audio_backend()[source]

Return the audio backend currently set by the user, or default.

Return type:

AudioBackend

lhotse.audio.get_default_audio_backend()[source]

Return a backend that can be used to read all audio formats supported by Lhotse.

It first looks for special cases that need very specific handling (such as: opus, sphere/shorten, in-memory buffers) and tries to match them against relevant audio backends.

Then, it tries to use several audio loading libraries (torchaudio, soundfile, audioread). In case the first fails, it tries the next one, and so on.

Return type:

AudioBackend

lhotse.audio.get_audio_duration_mismatch_tolerance()[source]

Retrieve the current audio duration mismatch tolerance in seconds.

Return type:

float

lhotse.audio.get_ffmpeg_torchaudio_info_enabled()[source]

Return FFMPEG_TORCHAUDIO_INFO_ENABLED, which is Lhotse’s global setting for whether to use ffmpeg-torchaudio to compute the duration of audio files.

Example:

>>> import lhotse
>>> lhotse.get_ffmpeg_torchaudio_info_enabled()
Return type:

bool

lhotse.audio.info(path, force_opus_sampling_rate=None, force_read_audio=False)[source]
Return type:

LibsndfileCompatibleAudioInfo

lhotse.audio.read_audio(path_or_fd, offset=0.0, duration=None, force_opus_sampling_rate=None)[source]
Return type:

Tuple[ndarray, int]

lhotse.audio.save_audio(dest, src, sampling_rate, format=None, encoding=None)[source]
Return type:

None

lhotse.audio.set_current_audio_backend(backend)[source]

Force Lhotse to use a specific audio backend to read every audio file, overriding the default behaviour of educated guessing + trial-and-error.

Example forcing Lhotse to use audioread library for every audio loading operation:

>>> set_current_audio_backend(AudioreadBackend())
Return type:

AudioBackend

lhotse.audio.set_audio_duration_mismatch_tolerance(delta)[source]

Override Lhotse’s global threshold for allowed audio duration mismatch between the manifest and the actual data.

Some scenarios when a mismatch can happen:

  • the Recording manifest duration is rounded off too much (i.e., bad user input, but too inconvenient to go back and fix the manifests)

  • data augmentation changes the number of samples a bit in a difficult-to-predict way

When there is a mismatch, Lhotse will either trim or replicate the diff to match the value found in the Recording manifest.

Note

We don’t recommend setting this lower than the default value, as it could break some data augmentation transforms.

Example:

>>> import lhotse
>>> lhotse.set_audio_duration_mismatch_tolerance(0.01)  # 10ms tolerance
Parameters:

delta (float) – New tolerance in seconds.

Return type:

None

lhotse.audio.set_ffmpeg_torchaudio_info_enabled(enabled)[source]

Override Lhotse’s global setting for whether to use ffmpeg-torchaudio to compute the duration of audio files. If disabled, we fall back to using a different backend such as sox_io or soundfile.

Note

See this issue for more details: https://github.com/lhotse-speech/lhotse/issues/1026

Example:

>>> import lhotse
>>> lhotse.set_ffmpeg_torchaudio_info_enabled(False)  # don't use ffmpeg-torchaudio
Parameters:

enabled (bool) – Whether to use torchaudio to compute audio file duration.

Return type:

None

lhotse.audio.null_result_on_audio_loading_error(func)[source]

This is a decorator that makes a function return None when reading audio with Lhotse fails.

Example:

>>> @null_result_on_audio_loading_error
... def func_loading_audio(rec):
...     audio = rec.load_audio()  # if this fails, will return None instead
...     return other_func(audio)

Another example:

>>> # crashes on loading audio
>>> audio = load_audio(cut)
>>> # does not crash on loading audio, return None instead
>>> maybe_audio: Optional = null_result_on_audio_loading_error(load_audio)(cut)
Return type:

Callable

lhotse.audio.suppress_audio_loading_errors(enabled=True)[source]

Context manager that suppresses errors related to audio loading. Emits warning to the console.
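A usage sketch (recording is assumed to exist; if loading fails, a warning is emitted and the block exits early; consume is a hypothetical downstream function):

>>> from lhotse.audio import suppress_audio_loading_errors
>>> with suppress_audio_loading_errors():
...     audio = recording.load_audio()
...     consume(audio)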

Supervision manifests

Data structures used for describing supervisions in a dataset.

class lhotse.supervision.AlignmentItem(symbol: str, start: float, duration: float, score: float | None = None)[source]

This class contains an alignment item, for example a word, along with its start time (w.r.t. the start of recording) and duration. It can potentially be used to store other kinds of alignment items, such as subwords, pdfid’s etc.

symbol: str

Alias for field number 0

start: float

Alias for field number 1

duration: float

Alias for field number 2

score: Optional[float]

Alias for field number 3
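For example:

>>> from lhotse.supervision import AlignmentItem
>>> item = AlignmentItem(symbol='AY0', start=33.0, duration=0.6)
>>> item.end
33.6
>>> shifted = item.with_offset(2.0)  # start becomes 35.0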

static deserialize(data)[source]
Return type:

AlignmentItem

serialize()[source]
Return type:

list

property end: float
with_offset(offset)[source]

Return an identical AlignmentItem, but with the offset added to the start field.

Return type:

AlignmentItem

perturb_speed(factor, sampling_rate)[source]

Return an AlignmentItem that has time boundaries matching the recording/cut perturbed with the same factor. See SupervisionSegment.perturb_speed() for details.

Return type:

AlignmentItem

trim(end, start=0)[source]

See SupervisionSegment.trim().

Return type:

AlignmentItem

transform(transform_fn)[source]

Perform specified transformation on the alignment content.

Return type:

AlignmentItem

count(value, /)

Return number of occurrences of value.

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.

class lhotse.supervision.SupervisionSegment(id, recording_id, start, duration, channel=0, text=None, language=None, speaker=None, gender=None, custom=None, alignment=None)[source]

SupervisionSegment represents a time interval (segment) annotated with some supervision labels and/or metadata, such as the transcription, the speaker identity, the language, etc.

Each supervision has a unique id and always refers to a specific recording (via recording_id) and one or more channels (by default, 0). Note that multiple channels of the recording may share the same supervision, in which case the channel field will be a list of integers.

It’s also characterized by the start time (relative to the beginning of a Recording or a Cut) and a duration, both expressed in seconds.

The remaining fields are all optional, and their availability depends on specific corpora. Since it is difficult to predict all possible types of metadata, the custom field (a dict) can be used to insert types of supervisions that are not supported out of the box.

SupervisionSegment may contain multiple types of alignments. The alignment field is a dict, indexed by alignment’s type (e.g., word or phone), and contains a list of AlignmentItem objects – simple structures that contain a given symbol and its time interval. Alignments can be read from CTM files or created programmatically.

Examples

A simple segment with no supervision information:

>>> from lhotse import SupervisionSegment
>>> sup0 = SupervisionSegment(
...     id='rec00001-sup00000', recording_id='rec00001',
...     start=0.5, duration=5.0, channel=0
... )

Typical supervision containing transcript, speaker ID, gender, and language:

>>> sup1 = SupervisionSegment(
...     id='rec00001-sup00001', recording_id='rec00001',
...     start=5.5, duration=3.0, channel=0,
...     text='transcript of the second segment',
...     speaker='Norman Dyhrentfurth', language='English', gender='M'
... )

Two supervisions denoting overlapping speech on two separate channels in a microphone array/multiple headsets (pay attention to start, duration, and channel):

>>> sup2 = SupervisionSegment(
...     id='rec00001-sup00002', recording_id='rec00001',
...     start=15.0, duration=5.0, channel=0,
...     text="i have incredibly good news for you",
...     speaker='Norman Dyhrentfurth', language='English', gender='M'
... )
>>> sup3 = SupervisionSegment(
...     id='rec00001-sup00003', recording_id='rec00001',
...     start=18.0, duration=3.0, channel=1,
...     text="say what",
...     speaker='Hervey Arman', language='English', gender='M'
... )

A supervision with a phone alignment:

>>> from lhotse.supervision import AlignmentItem
>>> sup4 = SupervisionSegment(
...     id='rec00001-sup00004', recording_id='rec00001',
...     start=33.0, duration=1.0, channel=0,
...     text="ice",
...     speaker='Maryla Zechariah', language='English', gender='F',
...     alignment={
...         'phone': [
...             AlignmentItem(symbol='AY0', start=33.0, duration=0.6),
...             AlignmentItem(symbol='S', start=33.6, duration=0.4)
...         ]
...     }
... )

A supervision shared across multiple channels of a recording (e.g. a microphone array):

>>> sup5 = SupervisionSegment(
...     id='rec00001-sup00005', recording_id='rec00001',
...     start=33.0, duration=1.0, channel=[0, 1],
...     text="ice",
...     speaker='Maryla Zechariah',
... )

Converting SupervisionSegment to a dict:

>>> sup0.to_dict()
{'id': 'rec00001-sup00000', 'recording_id': 'rec00001', 'start': 0.5, 'duration': 5.0, 'channel': 0}
id: str
recording_id: str
start: float
duration: float
channel: Union[int, List[int]] = 0
text: Optional[str] = None
language: Optional[str] = None
speaker: Optional[str] = None
gender: Optional[str] = None
custom: Optional[Dict[str, Any]] = None
alignment: Optional[Dict[str, List[AlignmentItem]]] = None
property end: float
with_alignment(kind, alignment)[source]
Return type:

SupervisionSegment

with_offset(offset)[source]

Return an identical SupervisionSegment, but with the offset added to the start field.

Return type:

SupervisionSegment

perturb_speed(factor, sampling_rate, affix_id=True)[source]

Return a SupervisionSegment that has time boundaries matching the recording/cut perturbed with the same factor.

Parameters:
  • factor (float) – The speed will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • sampling_rate (int) – The sampling rate is necessary to accurately perturb the start and duration (going through the sample counts).

  • affix_id (bool) – When true, we will modify the id and recording_id fields by affixing it with “_sp{factor}”.

Return type:

SupervisionSegment

Returns:

a modified copy of the current SupervisionSegment.

perturb_tempo(factor, sampling_rate, affix_id=True)[source]

Return a SupervisionSegment that has time boundaries matching the recording/cut perturbed with the same factor.

Parameters:
  • factor (float) – The tempo will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • sampling_rate (int) – The sampling rate is necessary to accurately perturb the start and duration (going through the sample counts).

  • affix_id (bool) – When true, we will modify the id and recording_id fields by affixing it with “_tp{factor}”.

Return type:

SupervisionSegment

Returns:

a modified copy of the current SupervisionSegment.

perturb_volume(factor, affix_id=True)[source]

Return a SupervisionSegment with modified ids.

Parameters:
  • factor (float) – The volume will be adjusted this many times (e.g. factor=1.1 means 1.1x louder).

  • affix_id (bool) – When true, we will modify the id and recording_id fields by affixing it with “_vp{factor}”.

Return type:

SupervisionSegment

Returns:

a modified copy of the current SupervisionSegment.

reverb_rir(affix_id=True, channel=None)[source]

Return a SupervisionSegment with modified ids.

Parameters:

affix_id (bool) – When true, we will modify the id and recording_id fields by affixing it with “_rvb”.

Return type:

SupervisionSegment

Returns:

a modified copy of the current SupervisionSegment.

trim(end, start=0)[source]

Return an identical SupervisionSegment, but ensure that self.start is not negative (in which case it’s set to 0) and self.end does not exceed the end parameter. If a start is optionally provided, the supervision is trimmed from the left (note that start should be relative to the cut times).

This method is useful for ensuring that the supervision does not exceed a cut’s bounds, in which case pass cut.duration as the end argument, since supervision times are relative to the cut.

Return type:

SupervisionSegment
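For example, clamping a supervision to a 10-second cut (a sketch; sup is assumed to be a SupervisionSegment with times relative to the cut):

>>> trimmed = sup.trim(end=10.0)
>>> assert trimmed.start >= 0 and trimmed.end <= 10.0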

map(transform_fn)[source]

Return a copy of the current segment, transformed with transform_fn.

Parameters:

transform_fn (Callable[[SupervisionSegment], SupervisionSegment]) – a function that takes a segment as input, transforms it and returns a new segment.

Return type:

SupervisionSegment

Returns:

a modified SupervisionSegment.

transform_text(transform_fn)[source]

Return a copy of the current segment with transformed text field. Useful for text normalization, phonetic transcription, etc.

Parameters:

transform_fn (Callable[[str], str]) – a function that accepts a string and returns a string.

Return type:

SupervisionSegment

Returns:

a SupervisionSegment with adjusted text.

transform_alignment(transform_fn, type='word')[source]

Return a copy of the current segment with transformed alignment field. Useful for text normalization, phonetic transcription, etc.

Parameters:
  • type (Optional[str]) – alignment type to transform (key for alignment dict).

  • transform_fn (Callable[[str], str]) – a function that accepts a string and returns a string.

Return type:

SupervisionSegment

Returns:

a SupervisionSegment with adjusted alignments.

to_dict()[source]
Return type:

dict

static from_dict(data)[source]
Return type:

SupervisionSegment

__init__(id, recording_id, start, duration, channel=0, text=None, language=None, speaker=None, gender=None, custom=None, alignment=None)
drop_custom(name)
has_custom(name)

Check if the Cut has a custom attribute with name name.

Parameters:

name (str) – name of the custom attribute.

Return type:

bool

Returns:

a boolean.

load_custom(name)

Load custom data as numpy array. The custom data is expected to have been stored in the custom field as an Array or TemporalArray manifest.

Note

It works with Array manifests stored via attribute assignments, e.g.: cut.my_custom_data = Array(...).

Parameters:

name (str) – name of the custom attribute.

Return type:

ndarray

Returns:

a numpy array with the data.

with_custom(name, value)

Return a copy of this object with an extra custom field assigned to it.

class lhotse.supervision.SupervisionSet(segments=None)[source]

SupervisionSet represents a collection of segments containing some supervision information (see SupervisionSegment).

It acts as a Python list, extended with an efficient find operation that indexes and caches the supervision segments in an interval tree. It allows quickly finding supervision segments that correspond to a specific time interval. However, it can also work with lazy iterables.

When coming from Kaldi, think of SupervisionSet as a segments file on steroids, that may also contain text, utt2spk, utt2gender, utt2dur, etc.

Examples

Building a SupervisionSet:

>>> from lhotse import SupervisionSet, SupervisionSegment
>>> sups = SupervisionSet.from_segments([SupervisionSegment(...), ...])

Writing/reading a SupervisionSet:

>>> sups.to_file('supervisions.jsonl.gz')
>>> sups2 = SupervisionSet.from_file('supervisions.jsonl.gz')

Using SupervisionSet like a dict:

>>> 'rec00001-sup00000' in sups
True
>>> sups['rec00001-sup00000']
SupervisionSegment(id='rec00001-sup00000', recording_id='rec00001', start=0.5, ...)
>>> for segment in sups:
...     pass

Searching by recording_id and time interval:

>>> matched_segments = sups.find(recording_id='rec00001', start_after=17.0, end_before=25.0)

Manipulation:

>>> longer_than_5s = sups.filter(lambda s: s.duration > 5)
>>> first_100 = sups.subset(first=100)
>>> split_into_4 = sups.split(num_splits=4)
>>> shuffled = sups.shuffle()
__init__(segments=None)[source]
property data: Dict[str, SupervisionSegment] | Iterable[SupervisionSegment]

Alias property for self.segments

property ids: Iterable[str]
static from_segments(segments)[source]
Return type:

SupervisionSet

static from_items(segments)

Function to be implemented by every sub-class of this mixin. It’s expected to create a sub-class instance out of an iterable of items that are held by the sub-class (e.g., CutSet.from_items(iterable_of_cuts)).

Return type:

SupervisionSet

static from_dicts(data)[source]
Return type:

SupervisionSet

static from_rttm(path)[source]

Read an RTTM file located at path (or an iterator of paths) and create a SupervisionSet manifest from its contents. Can be used to create supervisions from custom RTTM files (see, for example, lhotse.dataset.DiarizationDataset).

>>> from lhotse import SupervisionSet
>>> sup1 = SupervisionSet.from_rttm('/path/to/rttm_file')
>>> sup2 = SupervisionSet.from_rttm(Path('/path/to/rttm_dir').rglob('ref_*'))

The following description is taken from the dscore toolkit (https://github.com/nryant/dscore#rttm):

Rich Transcription Time Marked (RTTM) files are space-delimited text files containing one turn per line, each line containing ten fields:

  • Type – segment type; should always be SPEAKER

  • File ID – file name; basename of the recording minus extension (e.g., rec1_a)

  • Channel ID – channel (1-indexed) that turn is on; should always be 1

  • Turn Onset – onset of turn in seconds from beginning of recording

  • Turn Duration – duration of turn in seconds

  • Orthography Field – should always be <NA>

  • Speaker Type – should always be <NA>

  • Speaker Name – name of speaker of turn; should be unique within scope of each file

  • Confidence Score – system confidence (probability) that information is correct; should always be <NA>

  • Signal Lookahead Time – should always be <NA>

For instance:

SPEAKER CMU_20020319-1400_d01_NONE 1 130.430000 2.350 <NA> <NA> juliet <NA> <NA>
SPEAKER CMU_20020319-1400_d01_NONE 1 157.610000 3.060 <NA> <NA> tbc <NA> <NA>
SPEAKER CMU_20020319-1400_d01_NONE 1 130.490000 0.450 <NA> <NA> chek <NA> <NA>

Parameters:

path (Union[Path, str, Iterable[Union[Path, str]]]) – Path to RTTM file or an iterator of paths to RTTM files.

Return type:

SupervisionSet

Returns:

a new SupervisionSet instance containing segments from the RTTM file.

with_alignment_from_ctm(ctm_file, type='word', match_channel=False, verbose=False)[source]

Add alignments from CTM file to the supervision set.

Parameters:
  • ctm_file (Union[Path, str]) – Path to CTM file.

  • type (str) – Alignment type (optional, default = word).

  • match_channel (bool) – if True, also match channel between CTM and SupervisionSegment

  • verbose (bool) – if True, show progress bar

Return type:

SupervisionSet

Returns:

A new SupervisionSet with AlignmentItem objects added to the segments.

write_alignment_to_ctm(ctm_file, type='word')[source]

Write alignments to CTM file.

Parameters:
  • ctm_file (Union[Path, str]) – Path to output CTM file (will be created if it does not exist)

  • type (str) – Alignment type to write (default = word)

Return type:

None
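A round-trip sketch for the two CTM methods above (the file paths are hypothetical):

>>> sups = SupervisionSet.from_file('supervisions.jsonl.gz')
>>> sups_ali = sups.with_alignment_from_ctm('word_alignments.ctm', type='word')
>>> sups_ali.write_alignment_to_ctm('word_alignments_roundtrip.ctm', type='word')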

to_dicts()[source]
Return type:

Iterable[dict]

split(num_splits, shuffle=False, drop_last=False)[source]

Split the SupervisionSet into num_splits pieces of equal size.

Parameters:
  • num_splits (int) – Requested number of splits.

  • shuffle (bool) – Optionally shuffle the recordings order first.

  • drop_last (bool) – determines how to handle splitting when len(seq) is not divisible by num_splits. When False (default), the splits might have unequal lengths. When True, it may discard the last element in some splits to ensure they are equally long.

Return type:

List[SupervisionSet]

Returns:

A list of SupervisionSet pieces.

split_lazy(output_dir, chunk_size, prefix='')[source]

Splits a manifest (either lazily or eagerly opened) into chunks, each with chunk_size items (except for the last one, typically).

In order to be memory efficient, this implementation saves each chunk to disk in a .jsonl.gz format as the input manifest is sampled.

Note

For lowest memory usage, use load_manifest_lazy to open the input manifest for this method.

Parameters:
  • output_dir (Union[Path, str]) – directory where the split manifests are saved. Each manifest is saved at: {output_dir}/{prefix}.{split_idx}.jsonl.gz

  • chunk_size (int) – the number of items in each chunk.

  • prefix (str) – the prefix of each manifest.

Return type:

List[SupervisionSet]

Returns:

a list of lazily opened chunk manifests.

subset(first=None, last=None)[source]

Return a new SupervisionSet according to the selected subset criterion. Only a single argument to subset is supported at this time.

Parameters:
  • first (Optional[int]) – int, the number of first supervisions to keep.

  • last (Optional[int]) – int, the number of last supervisions to keep.

Return type:

SupervisionSet

Returns:

a new SupervisionSet with the subset results.

transform_text(transform_fn)[source]

Return a copy of the current SupervisionSet with the segments having a transformed text field. Useful for text normalization, phonetic transcription, etc.

Parameters:

transform_fn (Callable[[str], str]) – a function that accepts a string and returns a string.

Return type:

SupervisionSet

Returns:

a SupervisionSet with adjusted text.

transform_alignment(transform_fn, type='word')[source]

Return a copy of the current SupervisionSet with the segments having a transformed alignment field. Useful for text normalization, phonetic transcription, etc.

Parameters:
  • transform_fn (Callable[[str], str]) – a function that accepts a string and returns a string.

  • type (str) – alignment type to transform (key for alignment dict).

Return type:

SupervisionSet

Returns:

a SupervisionSet with adjusted text.

find(recording_id, channel=None, start_after=0, end_before=None, adjust_offset=False, tolerance=0.001)[source]

Return an iterable of segments that match the provided recording_id.

Parameters:
  • recording_id (str) – Desired recording ID.

  • channel (Optional[int]) – When specified, return supervisions in that channel - otherwise, in all channels.

  • start_after (float) – When specified, return segments that start after the given value.

  • end_before (Optional[float]) – When specified, return segments that end before the given value.

  • adjust_offset (bool) – When true, return segments as if the recordings had started at start_after. This is useful for creating Cuts. From a user perspective, when dealing with a Cut, it is no longer helpful to know when the supervisions starts in a recording - instead, it’s useful to know when the supervision starts relative to the start of the Cut. In the anticipated use-case, start_after and end_before would be the beginning and end of a cut; this option converts the times to be relative to the start of the cut.

  • tolerance (float) – Additional margin to account for floating point rounding errors when comparing segment boundaries.

Return type:

Iterable[SupervisionSegment]

Returns:

An iterator over supervision segments satisfying all criteria.
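For example, collecting supervisions for a 10-second window and re-expressing their times relative to that window (as when building a Cut):

>>> segments = sups.find(
...     recording_id='rec00001', start_after=10.0, end_before=20.0, adjust_offset=True
... )
>>> for seg in segments:  # each start is now relative to 10.0s into the recording
...     pass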

filter(predicate)

Return a new manifest containing only the items that satisfy predicate. If the manifest is lazy, the filtering will also be applied lazily.

Parameters:

predicate (Callable[[TypeVar(T)], bool]) – a function that takes a cut as an argument and returns bool.

Returns:

a filtered manifest.

classmethod from_file(path)
Return type:

Any

classmethod from_json(path)
Return type:

Any

classmethod from_jsonl(path)
Return type:

Any

classmethod from_jsonl_lazy(path)

Read a JSONL manifest in a lazy manner, which opens the file but does not read it immediately. It is only suitable for sequential reads and iteration.

Return type:

Any

Warning

Opening the manifest in this way might cause some methods that rely on random access to fail.

classmethod from_yaml(path)
Return type:

Any

classmethod infinite_mux(*manifests, weights=None, seed=0, max_open_streams=None)

Merges multiple manifest iterables into a new iterable by lazily multiplexing them during iteration time. Unlike mux(), this method allows limiting the maximum number of open sub-iterators at any given time.

To enable this, it performs 2-stage sampling. First, it samples with replacement the set of iterators I to construct a subset I_sub of size max_open_streams. Then, for each iteration step, it samples an iterator i from I_sub, fetches the next item from it, and yields it. Once i becomes exhausted, it is replaced with a new iterator j sampled from I_sub.

Caution

Do not use this method with inputs that are infinitely iterable as they will silently break the multiplexing property by only using a subset of the input iterables.

Caution

This method is not recommended when multiplexing for a small number of iterations, as it may be much less accurate than mux() depending on the number of open streams, iterable sizes, and the random seed.

Parameters:
  • manifests – iterables to be multiplexed. They can be either lazy or eager, but the resulting manifest will always be lazy.

  • weights (Optional[List[Union[int, float]]]) – an optional weight for each iterable that affects the probability of it being sampled. The weights are uniform by default. If the iterable lengths are known, it makes sense to pass them here so that the items are uniformly distributed in expectation.

  • seed (Union[int, Literal['trng', 'randomized']]) – the random seed, ensures deterministic order across multiple iterations.

  • max_open_streams (Optional[int]) – the number of iterables that can be open simultaneously at any given time.
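
A minimal usage sketch with hypothetical manifests sups_a, sups_b, and sups_c; since the result is an endless stream, the caller must bound the iteration:

>>> from itertools import islice
>>> mixed = SupervisionSet.infinite_mux(sups_a, sups_b, sups_c, weights=[3, 2, 1], seed=0, max_open_streams=2)
>>> first_ten = list(islice(mixed, 10))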

property is_lazy: bool

Indicates whether this manifest was opened in lazy (read-on-the-fly) mode or not.

map(transform_fn)

Apply transform_fn to each item in this manifest and return a new manifest. If the manifest is opened lazy, the transform is also applied lazily.

Parameters:

transform_fn (Callable[[TypeVar(T)], TypeVar(T)]) – A callable (function) that accepts a single item instance and returns a new (or the same) instance of the same type. E.g., with a CutSet, the callable accepts a Cut and returns a Cut.

Returns:

a new manifest with transformed items.
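
For instance, a hedged sketch that upper-cases every supervision’s text (the supervisions manifest is assumed; SupervisionSegment.transform_text is used here for illustration):

>>> uppercased = supervisions.map(lambda seg: seg.transform_text(str.upper))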

classmethod mux(*manifests, stop_early=False, weights=None, seed=0)

Merges multiple manifest iterables into a new iterable by lazily multiplexing them during iteration time. If one of the iterables is exhausted before the others, we will keep iterating until all iterables are exhausted. This behavior can be changed with the stop_early parameter.

Parameters:
  • manifests – iterables to be multiplexed. They can be either lazy or eager, but the resulting manifest will always be lazy.

  • stop_early (bool) – should we stop the iteration as soon as we exhaust one of the manifests.

  • weights (Optional[List[Union[int, float]]]) – an optional weight for each iterable that affects the probability of it being sampled. The weights are uniform by default. If the iterable lengths are known, it makes sense to pass them here so that the items are uniformly distributed in expectation.

  • seed (Union[int, Literal['trng', 'randomized']]) – the random seed, ensures deterministic order across multiple iterations.
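
A minimal sketch with two hypothetical manifests, sampling the first twice as often as the second:

>>> combined = SupervisionSet.mux(sups_a, sups_b, weights=[2, 1], seed=0)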

classmethod open_writer(path, overwrite=True)

Open a sequential writer that allows storing manifests one by one, without keeping the whole manifest set in memory. Supports writing to JSONL format (.jsonl), with optional gzip compression (.jsonl.gz).

Return type:

Union[SequentialJsonlWriter, InMemoryWriter]

Note

when path is None, we will return an InMemoryWriter instead, which has the same API but stores the manifests in memory. It is convenient when you want to make disk saving optional.

Example:

>>> from lhotse import RecordingSet
... recordings = [...]
... with RecordingSet.open_writer('recordings.jsonl.gz') as writer:
...     for recording in recordings:
...         writer.write(recording)

This writer can be useful for resuming a previously interrupted writing session – it will open the existing file and scan it for item IDs, so that writing them again can be skipped. It can also be queried for existing IDs so that the user code may skip preparing the corresponding manifests.

Example:

>>> from lhotse import RecordingSet, Recording
... with RecordingSet.open_writer('recordings.jsonl.gz', overwrite=False) as writer:
...     for path in Path('.').rglob('*.wav'):
...         recording_id = path.stem
...         if writer.contains(recording_id):
...             # Item already written previously - skip processing.
...             continue
...         # Item doesn't exist yet - run extra work to prepare the manifest
...         # and store it.
...         recording = Recording.from_file(path, recording_id=recording_id)
...         writer.write(recording)
repeat(times=None, preserve_id=False)

Return a new, lazily evaluated manifest that iterates over the original elements times number of times.

Parameters:
  • times (Optional[int]) – how many times to repeat (infinite by default).

  • preserve_id (bool) – when True, we won’t update the element ID with repeat number.

Returns:

a repeated manifest.
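
For example (the supervisions manifest is assumed):

>>> twice = supervisions.repeat(times=2)
>>> endless = supervisions.repeat()  # infinite by default; bound the iteration yourself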

shuffle(rng=None, buffer_size=10000)

Shuffles the elements and returns a shuffled variant of self. If the manifest is opened lazily, performs shuffling on-the-fly with a fixed buffer size.

Parameters:

rng (Optional[Random]) – an optional instance of random.Random for precise control of randomness.

Returns:

a shuffled copy of self, or a manifest that is shuffled lazily.
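
A hedged example using a seeded RNG for reproducible shuffling:

>>> import random
>>> shuffled = supervisions.shuffle(rng=random.Random(42))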

to_eager()

Evaluates all lazy operations on this manifest, if any, and returns a copy that keeps all items in memory. If the manifest was “eager” already, this is a no-op and won’t copy anything.

to_file(path)
Return type:

None

to_json(path)
Return type:

None

to_jsonl(path)
Return type:

None

to_yaml(path)
Return type:

None

Lhotse Shar – sequential storage

Documentation for Lhotse Shar multi-tarfile sequential I/O format.

Lhotse Shar readers

class lhotse.shar.readers.LazySharIterator(fields=None, in_dir=None, split_for_dataloading=False, shuffle_shards=False, stateful_shuffle=True, seed=42, cut_map_fns=None)[source]

LazySharIterator reads cuts and their corresponding data from multiple shards, also recognized as the Lhotse Shar format. Each shard is numbered and represented as a collection of one text manifest and one or more binary tarfiles. Each tarfile contains a single type of data, e.g., recordings, features, or custom fields.

Given an example directory named some_dir, its expected layout is some_dir/cuts.000000.jsonl.gz, some_dir/recording.000000.tar, some_dir/features.000000.tar, and then the same names but numbered with 000001, etc. There may also be other files if the cuts have custom data attached to them.

The main idea behind Lhotse Shar format is to optimize dataloading with sequential reads, while keeping the data composition more flexible than e.g. WebDataset tar archives do. To achieve this, Lhotse Shar keeps each data type in a separate archive, alongside a single CutSet JSONL manifest. This way, the metadata can be investigated without iterating through the binary data. The format also allows iteration over a subset of fields, or extension of existing data with new fields.

As you iterate over cuts from LazySharIterator, it keeps a file handle open for the JSONL manifest and all of the tar files that correspond to the current shard. The tar files are read item by item together, and their binary data is attached to the cuts. It can be accessed normally, using methods such as cut.load_audio().

We can simply load a directory created by SharWriter. Example:

>>> cuts = LazySharIterator(in_dir="some_dir")
... for cut in cuts:
...     print("Cut", cut.id, "has duration of", cut.duration)
...     audio = cut.load_audio()
...     fbank = cut.load_features()

LazySharIterator can also be initialized from a dict, where the keys indicate fields to be read, and the values point to actual shard locations. This is useful when only a subset of data is needed, or it is stored in different directories. Example:

>>> cuts = LazySharIterator({
...     "cuts": ["some_dir/cuts.000000.jsonl.gz"],
...     "recording": ["another_dir/recording.000000.tar"],
...     "features": ["yet_another_dir/features.000000.tar"],
... })
... for cut in cuts:
...     print("Cut", cut.id, "has duration of", cut.duration)
...     audio = cut.load_audio()
...     fbank = cut.load_features()

We also support providing shell commands as shard sources, inspired by WebDataset. Example:

>>> cuts = LazySharIterator({
...     "cuts": ["pipe:curl https://my.page/cuts.000000.jsonl.gz"],
...     "recording": ["pipe:curl https://my.page/recording.000000.tar"],
... })
... for cut in cuts:
...     print("Cut", cut.id, "has duration of", cut.duration)
...     audio = cut.load_audio()

Finally, we allow specifying URLs or cloud storage URIs for the shard sources. We defer to smart_open library to handle those. Example:

>>> cuts = LazySharIterator({
...     "cuts": ["s3://my-bucket/cuts.000000.jsonl.gz"],
...     "recording": ["s3://my-bucket/recording.000000.tar"],
... })
... for cut in cuts:
...     print("Cut", cut.id, "has duration of", cut.duration)
...     audio = cut.load_audio()
Parameters:
  • fields (Optional[Dict[str, Sequence[Union[Path, str]]]]) – a dict whose keys specify which fields to load, and values are lists of shards (either paths or shell commands). The field “cuts” pointing to CutSet shards always has to be present.

  • in_dir (Union[Path, str, None]) – path to a directory created with SharWriter with all the shards in a single place. Can be used instead of fields.

  • split_for_dataloading (bool) – bool, by default False, which does nothing. Setting it to True is intended for PyTorch training with multiple dataloader workers and possibly multiple DDP nodes. It results in each node+worker combination receiving a unique subset of shards from which to read data to avoid data duplication. This is mutually exclusive with seed='randomized'.

  • shuffle_shards (bool) – bool, by default False. When True, the shards are shuffled (in case of multi-node training, the shuffling is the same on each node given the same seed).

  • seed (Union[int, Literal['randomized'], Literal['trng']]) – When shuffle_shards is True, we use this number to seed the RNG. Seed can be set to 'randomized' in which case we expect that the user provided lhotse.dataset.dataloading.worker_init_fn() as DataLoader’s worker_init_fn argument. It will cause the iterator to shuffle shards differently on each node and dataloading worker in PyTorch training. This is mutually exclusive with split_for_dataloading=True. Seed can be set to 'trng' which, like 'randomized', shuffles the shards differently on each iteration, but is not possible to control (and is not reproducible). trng mode is mostly useful when the user has limited control over the training loop and may not be able to guarantee internal Shar epoch is being incremented, but needs randomness on each iteration (e.g. useful with PyTorch Lightning).

  • stateful_shuffle (bool) – bool, by default True. When True, every time this object is fully iterated, it increments an internal epoch counter and triggers shard reshuffling with RNG seeded by seed + epoch. Doesn’t have any effect when shuffle_shards is False.

  • cut_map_fns (Optional[Sequence[Callable[[Cut], Cut]]]) – optional sequence of callables that accept cuts and return cuts. It’s expected to have the same length as the number of shards, so each function corresponds to a specific shard. It can be used to attach shard-specific custom attributes to cuts.

See also: SharWriter

__init__(fields=None, in_dir=None, split_for_dataloading=False, shuffle_shards=False, stateful_shuffle=True, seed=42, cut_map_fns=None)[source]
class lhotse.shar.readers.TarIterator(source)[source]

TarIterator is a convenience class for reading arrays/audio stored in Lhotse Shar tar files. It is specific to Lhotse Shar format and expects the tar file to have the following structure:

  • each file is stored in a separate tar member

  • the file name is the key of the array

  • every array has two corresponding files:
    • the metadata: the file extension is .json and the file contains a Lhotse manifest (Recording, Array, TemporalArray, Features) for the data item.

    • the data: the file extension is the format of the array, and the file contents are the serialized array (possibly compressed).

    • the data file can be empty in case some cut did not contain that field. In that case, the data file has extension .nodata and the manifest file has extension .nometa.

    • these files are saved one after another, the data is first, and the metadata follows.

Iterating over TarIterator yields tuples of (Optional[manifest], filename), where manifest is a Lhotse manifest with binary data attached to it, and filename is the name of the data file inside the tar archive.
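
A minimal reading sketch (the tarfile path is hypothetical, and passing a plain path as the source argument is assumed to be supported):

>>> for manifest, filename in TarIterator("some_dir/recording.000000.tar"):
...     print(filename, manifest is not None)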

__init__(source)[source]

Lhotse Shar writers

class lhotse.shar.writers.ArrayTarWriter(pattern, shard_size=1000, compression='numpy', lilcom_tick_power=-5)[source]

ArrayTarWriter writes numpy arrays or PyTorch tensors into a tar archive that is automatically sharded.

For floating point tensors, we support the option to use lilcom compression. Note that lilcom is only suitable for log-space features such as log-Mel filter banks.

Example:

>>> with ArrayTarWriter("some_dir/fbank.%06d.tar", shard_size=100, compression="lilcom") as w:
...     w.write("fbank1", fbank1_array)
...     w.write("fbank2", fbank2_array)  # etc.

It would create files such as some_dir/fbank.000000.tar, some_dir/fbank.000001.tar, etc.

It’s also possible to use ArrayTarWriter with automatic sharding disabled:

>>> with ArrayTarWriter("some_dir/fbank.tar", shard_size=None, compression="numpy") as w:
...     w.write("fbank1", fbank1_array)
...     w.write("fbank2", fbank2_array)  # etc.

See also: TarWriter, AudioTarWriter

__init__(pattern, shard_size=1000, compression='numpy', lilcom_tick_power=-5)[source]
close()[source]
property output_paths: List[str]
write_placeholder(key)[source]
Return type:

None

write(key, value, manifest)[source]
Return type:

None

class lhotse.shar.writers.AudioTarWriter(pattern, shard_size=1000, format='flac')[source]

AudioTarWriter writes audio examples in numpy arrays or PyTorch tensors into a tar archive that is automatically sharded.

It is different from ArrayTarWriter in that it supports audio-specific compression mechanisms, such as flac, opus, mp3, or wav.

Example:

>>> with AudioTarWriter("some_dir/audio.%06d.tar", shard_size=100, format="mp3") as w:
...     w.write("audio1", audio1_array)
...     w.write("audio2", audio2_array)  # etc.

It would create files such as some_dir/audio.000000.tar, some_dir/audio.000001.tar, etc.

It’s also possible to use AudioTarWriter with automatic sharding disabled:

>>> with AudioTarWriter("some_dir/audio.tar", shard_size=None, format="flac") as w:
...     w.write("audio1", audio1_array)
...     w.write("audio2", audio2_array)  # etc.

See also: TarWriter, ArrayTarWriter

__init__(pattern, shard_size=1000, format='flac')[source]
close()[source]
property output_paths: List[str]
write_placeholder(key)[source]
Return type:

None

write(key, value, sampling_rate, manifest)[source]
Return type:

None

class lhotse.shar.writers.JsonlShardWriter(pattern, shard_size=1000)[source]

JsonlShardWriter writes Cuts or dicts into multiple JSONL file shards. The JSONL can be compressed with gzip if the file extension ends with .gz.

Example:

>>> with JsonlShardWriter("some_dir/cuts.%06d.jsonl.gz", shard_size=100) as w:
...     for cut in ...:
...         w.write(cut)

It would create files such as some_dir/cuts.000000.jsonl.gz, some_dir/cuts.000001.jsonl.gz, etc.

See also: TarWriter

__init__(pattern, shard_size=1000)[source]
property sharding_enabled: bool
reset()[source]
close()[source]
property output_paths: List[str]
write(data, flush=False)[source]
Return type:

None

write_placeholder(cut_id, flush=False)[source]
Return type:

None

class lhotse.shar.writers.SharWriter(output_dir, fields, shard_size=1000, warn_unused_fields=True, include_cuts=True, shard_suffix=None)[source]

SharWriter writes cuts and their corresponding data into multiple shards, also recognized as the Lhotse Shar format. Each shard is numbered and represented as a collection of one text manifest and one or more binary tarfiles. Each tarfile contains a single type of data, e.g., recordings, features, or custom fields.

The main idea behind Lhotse Shar format is to optimize dataloading with sequential reads, while keeping the data composition more flexible than e.g. WebDataset tar archives do. To achieve this, Lhotse Shar keeps each data type in a separate archive, alongside a single CutSet JSONL manifest. This way, the metadata can be investigated without iterating through the binary data. The format also allows iteration over a subset of fields, or extension of existing data with new fields.

The user has to specify which fields should be saved, and what compression to use for each of them. Currently we support wav, flac, opus, and mp3 compression for recording and custom audio fields, and lilcom or numpy for features and custom array fields.

Example:

>>> cuts = CutSet(...)  # cuts have 'recording' and 'features'
>>> with SharWriter("some_dir", shard_size=100, fields={"recording": "opus", "features": "lilcom"}) as w:
...     for cut in cuts:
...         w.write(cut)

Note

Different audio backends in Lhotse may use different encoders for the same audio formats. It is advisable to use the same audio backend for saving and loading audio data in Shar and other formats. See: lhotse.audio.recording.Recording.

It would create a directory some_dir with files such as some_dir/cuts.000000.jsonl.gz, some_dir/recording.000000.tar, some_dir/features.000000.tar, and then the same names but numbered with 000001, etc.

When shard_size is set to None, we will disable automatic sharding and the shard number suffix will be omitted from the file names.

The option warn_unused_fields will emit a warning when cuts have some data attached to them (e.g., recording, features, or custom arrays) but saving it was not specified via fields.

The option include_cuts controls whether we store the cuts alongside fields (true by default). Turning it off is useful when extending an existing dataset with new fields/feature types, but the original cuts do not require any modification.

See also: TarWriter, AudioTarWriter, ArrayTarWriter.

__init__(output_dir, fields, shard_size=1000, warn_unused_fields=True, include_cuts=True, shard_suffix=None)[source]
property sharding_enabled: bool
property output_paths: Dict[str, List[str]]
close()[source]
write(cut)[source]
Return type:

None

class lhotse.shar.writers.TarWriter(pattern, shard_size=1000)[source]

TarWriter is a convenience wrapper over tarfile.TarFile that allows writing binary data into tar files that are automatically segmented. Each segment is a separate tar file called a “shard.”

Shards are useful in training of deep learning models that require a substantial amount of data. Each shard can be read sequentially, which allows faster reads from magnetic disks, NFS, or otherwise slow storage.

Example:

>>> with TarWriter("some_dir/data.%06d.tar", shard_size=100) as w:
...     w.write("blob1", binary_blob1)
...     w.write("blob2", binary_blob2)  # etc.

It would create files such as some_dir/data.000000.tar, some_dir/data.000001.tar, etc.

It’s also possible to use TarWriter with automatic sharding disabled:

>>> with TarWriter("some_dir/data.tar", shard_size=None) as w:
...     w.write("blob1", binary_blob1)
...     w.write("blob2", binary_blob2)  # etc.

This class is heavily inspired by the WebDataset library: https://github.com/webdataset/webdataset

__init__(pattern, shard_size=1000)[source]
property sharding_enabled: bool
reset()[source]
close()[source]
property output_paths: List[str]
write(key, data, count=True)[source]

Feature extraction and manifests

Data structures and tools used for feature extraction and description.

Features API - extractor and manifests

class lhotse.features.base.FeatureExtractor(config=None)[source]

The base class for all feature extractors in Lhotse. It is initialized with a config object, specific to a particular feature extraction method. The config is expected to be a dataclass so that it can be easily serialized.

All derived feature extractors must implement at least the following:

  • a name class attribute (what the features are called, e.g. ‘mfcc’)

  • a config_type class attribute that points to the configuration dataclass type

  • the extract method,

  • the frame_shift property.

Feature extractors that support feature-domain mixing should additionally specify two static methods:

  • compute_energy, and

  • mix.

By itself, the FeatureExtractor offers the following high-level methods that are not intended for overriding:

  • extract_from_samples_and_store

  • extract_from_recording_and_store

These methods run a larger feature extraction pipeline that involves data augmentation and disk storage.
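
To make this contract concrete, below is a minimal hypothetical subclass sketch; the names ToyEnergyConfig and ToyEnergyExtractor are invented for illustration, and mono 1-D input samples are assumed:

>>> from dataclasses import dataclass
>>> import numpy as np
>>> @dataclass
... class ToyEnergyConfig:
...     frame_shift: float = 0.01
>>> @register_extractor
... class ToyEnergyExtractor(FeatureExtractor):
...     # A toy extractor emitting one feature per frame: the frame's energy.
...     name = 'toy-energy'
...     config_type = ToyEnergyConfig
...     @property
...     def frame_shift(self) -> float:
...         return self.config.frame_shift
...     def feature_dim(self, sampling_rate: int) -> int:
...         return 1
...     def extract(self, samples: np.ndarray, sampling_rate: int) -> np.ndarray:
...         hop = int(self.config.frame_shift * sampling_rate)
...         # One energy value per non-overlapping frame of `hop` samples.
...         starts = range(0, max(len(samples) - hop + 1, 0), hop)
...         return np.array([[float(np.dot(samples[i:i + hop], samples[i:i + hop]))] for i in starts], dtype=np.float32)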

name = None
config_type = None
__init__(config=None)[source]
abstract extract(samples, sampling_rate)[source]

Defines how to extract features using a numpy ndarray of audio samples and the sampling rate.

Return type:

ndarray

Returns:

a numpy ndarray representing the feature matrix.

abstract property frame_shift: float
abstract feature_dim(sampling_rate)[source]
Return type:

int

property device: str | device
static mix(features_a, features_b, energy_scaling_factor_b)[source]

Perform feature-domain mix of two signals, a and b, and return the mixed signal.

Parameters:
  • features_a (ndarray) – Left-hand side (reference) signal.

  • features_b (ndarray) – Right-hand side (mixed-in) signal.

  • energy_scaling_factor_b (float) – A scaling factor for features_b energy. It is used to achieve a specific SNR. E.g. to mix with an SNR of 10dB when both features_a and features_b energies are 100, the features_b signal energy needs to be scaled by 0.1. Since different features (e.g. spectrogram, fbank, MFCC) require different combination of transformations (e.g. exp, log, sqrt, pow) to allow mixing of two signals, the exact place where to apply energy_scaling_factor_b to the signal is determined by the implementer.

Return type:

ndarray

Returns:

A mixed feature matrix.

static compute_energy(features)[source]

Compute the total energy of a feature matrix. How the energy is computed depends on a particular type of features. It is expected that when implemented, compute_energy will never return zero.

Parameters:

features (ndarray) – A feature matrix.

Return type:

float

Returns:

A positive float value of the signal energy.

extract_batch(samples, sampling_rate, lengths=None)[source]

Performs batch extraction. It is not guaranteed to be faster than FeatureExtractor.extract() – it depends on whether the implementation of a particular feature extractor supports accelerated batch computation. If lengths is provided, it is assumed that the input is a batch of padded sequences, so we will not perform any further collation.

Return type:

Union[ndarray, Tensor, List[ndarray], List[Tensor]]

Note

Unless overridden by child classes, it defaults to sequentially calling FeatureExtractor.extract() on the inputs.

Note

This method should support variable length inputs.

extract_from_samples_and_store(samples, storage, sampling_rate, offset=0, channel=None, augment_fn=None)[source]

Extract the features from an array of audio samples in a full pipeline:

  • optional audio augmentation;

  • extract the features;

  • save them to disk in a specified directory;

  • return a Features object with a description of the extracted features.

Note, unlike in extract_from_recording_and_store, the returned Features object might not be suitable to store in a FeatureSet, as it does not reference any particular Recording. Instead, this method is useful when extracting features from cuts - especially MixedCut instances, which may be created from multiple recordings and channels.

Parameters:
  • samples (ndarray) – a numpy ndarray with the audio samples.

  • sampling_rate (int) – integer sampling rate of samples.

  • storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices.

  • offset (float) – an offset in seconds for where to start reading the recording - when used for Cut feature extraction, must be equal to Cut.start.

  • channel (Union[List[int], int, None]) – an optional channel number(s) to insert into Features manifest.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.

Return type:

Features

Returns:

a Features manifest item for the extracted feature matrix (the manifest itself is not written to disk).

extract_from_recording_and_store(recording, storage, offset=0, duration=None, channels=None, augment_fn=None)[source]

Extract the features from a Recording in a full pipeline:

  • load audio from disk;

  • optionally, perform audio augmentation;

  • extract the features;

  • save them to disk in a specified directory;

  • return a Features object with a description of the extracted features and the source data used.

Parameters:
  • recording (Recording) – a Recording that specifies what’s the input audio.

  • storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices.

  • offset (float) – an optional offset in seconds for where to start reading the recording.

  • duration (Optional[float]) – an optional duration specifying how much audio to load from the recording.

  • channels (Union[int, List[int], None]) – an optional int or list of ints, specifying the channels; by default, all channels will be used.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.

Return type:

Features

Returns:

a Features manifest item for the extracted feature matrix.
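
A hedged end-to-end sketch (the recording object and the output path 'feats/fbank' are assumed):

>>> from lhotse import Fbank, LilcomChunkyWriter
>>> extractor = Fbank()
>>> with LilcomChunkyWriter('feats/fbank') as storage:
...     feats_manifest = extractor.extract_from_recording_and_store(recording, storage=storage)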

classmethod from_dict(data)[source]
Return type:

FeatureExtractor

to_dict()[source]
Return type:

Dict[str, Any]

classmethod from_yaml(path)[source]
Return type:

FeatureExtractor

to_yaml(path)[source]
lhotse.features.base.get_extractor_type(name)[source]

Return the feature extractor type corresponding to the given name.

Parameters:

name (str) – specifies which feature extractor should be used.

Return type:

Type

Returns:

A feature extractor type.

lhotse.features.base.create_default_feature_extractor(name)[source]

Create a feature extractor object with a default configuration.

Parameters:

name (str) – specifies which feature extractor should be used.

Return type:

Optional[FeatureExtractor]

Returns:

A new feature extractor instance.
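
For example ('fbank' is one of the extractor names registered by Lhotse):

>>> fbank = create_default_feature_extractor('fbank')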

lhotse.features.base.register_extractor(cls)[source]

This decorator is used to register feature extractor classes in Lhotse so they can be easily created just by knowing their name.

An example of usage:

@register_extractor
class MyFeatureExtractor:
    …

Parameters:

cls – A type (class) that is being registered.

Returns:

Registered type.

class lhotse.features.base.TorchaudioFeatureExtractor(config=None)[source]

Common abstract base class for all torchaudio-based feature extractors.

extract(samples, sampling_rate)[source]

Defines how to extract features using a numpy ndarray of audio samples and the sampling rate.

Return type:

ndarray

Returns:

a numpy ndarray representing the feature matrix.

property frame_shift: float
__init__(config=None)
static compute_energy(features)

Compute the total energy of a feature matrix. How the energy is computed depends on a particular type of features. It is expected that when implemented, compute_energy will never return zero.

Parameters:

features (ndarray) – A feature matrix.

Return type:

float

Returns:

A positive float value of the signal energy.

config_type = None
property device: str | device
extract_batch(samples, sampling_rate, lengths=None)

Performs batch extraction. It is not guaranteed to be faster than FeatureExtractor.extract() – it depends on whether the implementation of a particular feature extractor supports accelerated batch computation. If lengths is provided, it is assumed that the input is a batch of padded sequences, so we will not perform any further collation.

Return type:

Union[ndarray, Tensor, List[ndarray], List[Tensor]]

Note

Unless overridden by child classes, it defaults to sequentially calling FeatureExtractor.extract() on the inputs.

Note

This method should support variable length inputs.

extract_from_recording_and_store(recording, storage, offset=0, duration=None, channels=None, augment_fn=None)

Extract the features from a Recording in a full pipeline:

  • load audio from disk;

  • optionally, perform audio augmentation;

  • extract the features;

  • save them to disk in a specified directory;

  • return a Features object with a description of the extracted features and the source data used.

Parameters:
  • recording (Recording) – a Recording that specifies what’s the input audio.

  • storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices.

  • offset (float) – an optional offset in seconds for where to start reading the recording.

  • duration (Optional[float]) – an optional duration specifying how much audio to load from the recording.

  • channels (Union[int, List[int], None]) – an optional int or list of ints, specifying the channels; by default, all channels will be used.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.

Return type:

Features

Returns:

a Features manifest item for the extracted feature matrix.

extract_from_samples_and_store(samples, storage, sampling_rate, offset=0, channel=None, augment_fn=None)

Extract the features from an array of audio samples in a full pipeline:

  • optional audio augmentation;

  • extract the features;

  • save them to disk in a specified directory;

  • return a Features object with a description of the extracted features.

Note, unlike in extract_from_recording_and_store, the returned Features object might not be suitable to store in a FeatureSet, as it does not reference any particular Recording. Instead, this method is useful when extracting features from cuts - especially MixedCut instances, which may be created from multiple recordings and channels.

Parameters:
  • samples (ndarray) – a numpy ndarray with the audio samples.

  • sampling_rate (int) – integer sampling rate of samples.

  • storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices.

  • offset (float) – an offset in seconds for where to start reading the recording - when used for Cut feature extraction, must be equal to Cut.start.

  • channel (Union[List[int], int, None]) – an optional channel number(s) to insert into Features manifest.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.

Return type:

Features

Returns:

a Features manifest item for the extracted feature matrix (the manifest itself is not written to disk).

abstract feature_dim(sampling_rate)
Return type:

int

classmethod from_dict(data)
Return type:

FeatureExtractor

classmethod from_yaml(path)
Return type:

FeatureExtractor

static mix(features_a, features_b, energy_scaling_factor_b)

Perform feature-domain mix of two signals, a and b, and return the mixed signal.

Parameters:
  • features_a (ndarray) – Left-hand side (reference) signal.

  • features_b (ndarray) – Right-hand side (mixed-in) signal.

  • energy_scaling_factor_b (float) – A scaling factor for features_b energy. It is used to achieve a specific SNR. E.g. to mix with an SNR of 10dB when both features_a and features_b energies are 100, the features_b signal energy needs to be scaled by 0.1. Since different features (e.g. spectrogram, fbank, MFCC) require different combination of transformations (e.g. exp, log, sqrt, pow) to allow mixing of two signals, the exact place where to apply energy_scaling_factor_b to the signal is determined by the implementer.

Return type:

ndarray

Returns:

A mixed feature matrix.

name = None
to_dict()
Return type:

Dict[str, Any]

to_yaml(path)
class lhotse.features.base.Features(type, num_frames, num_features, frame_shift, sampling_rate, start, duration, storage_type, storage_path, storage_key, recording_id=None, channels=None)[source]

Represents features extracted for some particular time range in a given recording and channel. It contains metadata about how it’s stored: storage_type describes “how to read it”, for now it supports numpy arrays serialized with np.save, as well as arrays compressed with lilcom; storage_path is the path to the file on the local filesystem.

type: str
num_frames: int
num_features: int
frame_shift: float
sampling_rate: int
start: float
duration: float
storage_type: str
storage_path: str
storage_key: Union[str, bytes]
recording_id: Optional[str] = None
channels: Union[List[int], int, None] = None
property end: float
load(start=None, duration=None, channel_id=0)[source]
Return type:

ndarray

move_to_memory(start=0, duration=None, lilcom=False)[source]
Return type:

Features

with_path_prefix(path)[source]
Return type:

Features

to_dict()[source]
Return type:

dict

copy_feats(writer)[source]

Read the referenced feature array and save it using writer. Returns a copy of the manifest with updated fields related to the feature storage.

Return type:

Features

static from_dict(data)[source]
Return type:

Features

__init__(type, num_frames, num_features, frame_shift, sampling_rate, start, duration, storage_type, storage_path, storage_key, recording_id=None, channels=None)
class lhotse.features.base.FeatureSet(features=None)[source]

Represents a feature manifest, and allows reading features for given recordings within particular channels and time ranges. It also keeps information about the feature extractor parameters used to obtain this set. When a given recording/time-range/channel is unavailable, raises a KeyError.

__init__(features=None)[source]
property data: Dict[str, Features] | Iterable[Features]

Alias property for self.features

static from_features(features)[source]
Return type:

FeatureSet

static from_items(features)

Function to be implemented by every sub-class of this mixin. It’s expected to create a sub-class instance out of an iterable of items that are held by the sub-class (e.g., CutSet.from_items(iterable_of_cuts)).

Return type:

FeatureSet

static from_dicts(data)[source]
Return type:

FeatureSet

to_dicts()[source]
Return type:

Iterable[dict]

with_path_prefix(path)[source]
Return type:

FeatureSet

split(num_splits, shuffle=False, drop_last=False)[source]

Split the FeatureSet into num_splits pieces of equal size.

Parameters:
  • num_splits (int) – Requested number of splits.

  • shuffle (bool) – Optionally shuffle the recordings order first.

  • drop_last (bool) – determines how to handle splitting when len(seq) is not divisible by num_splits. When False (default), the splits might have unequal lengths. When True, it may discard the last element in some splits to ensure they are equally long.

Return type:

List[FeatureSet]

Returns:

A list of FeatureSet pieces.

split_lazy(output_dir, chunk_size, prefix='')[source]

Splits a manifest (either lazily or eagerly opened) into chunks, each with chunk_size items (except for the last one, typically).

In order to be memory efficient, this implementation saves each chunk to disk in a .jsonl.gz format as the input manifest is sampled.

Note

For lowest memory usage, use load_manifest_lazy to open the input manifest for this method.

Parameters:
  • output_dir (Union[Path, str]) – directory where the split manifests are saved. Each manifest is saved at: {output_dir}/{prefix}.{split_idx}.jsonl.gz

  • chunk_size (int) – the number of items in each chunk.

  • prefix (str) – the prefix of each manifest.

Return type:

List[FeatureSet]

Returns:

a list of lazily opened chunk manifests.

shuffle(*args, **kwargs)[source]

Shuffles the elements and returns a shuffled variant of self. If the manifest is opened lazily, performs shuffling on-the-fly with a fixed buffer size.

Parameters:

rng – an optional instance of random.Random for precise control of randomness.

Returns:

a shuffled copy of self, or a manifest that is shuffled lazily.

subset(first=None, last=None)[source]

Return a new FeatureSet according to the selected subset criterion. Only a single argument to subset is supported at this time.

Parameters:
  • first (Optional[int]) – int, the number of first items to keep.

  • last (Optional[int]) – int, the number of last items to keep.

Return type:

FeatureSet

Returns:

a new FeatureSet with the subset results.

find(recording_id, channel_id=0, start=0.0, duration=None, leeway=0.05)[source]

Find and return a Features object that best satisfies the search criteria. Raise a KeyError when no such object is available.

Parameters:
  • recording_id (str) – str, requested recording ID.

  • channel_id (Union[int, List[int]]) – int, requested channel.

  • start (float) – float, requested start time in seconds for the feature chunk.

  • duration (Optional[float]) – optional float, requested duration in seconds for the feature chunk. By default, return everything from the start.

  • leeway (float) – float, controls how strictly we have to match the requested start and duration criteria. It is necessary to keep a small positive value here (default 0.05s), as there might be differences between the duration of recording/supervision segment, and the duration of features. The latter one is constrained to be a multiple of frame_shift, while the former can be arbitrary.

Return type:

Features

Returns:

a Features object satisfying the search criteria.
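
A hedged lookup sketch (the feature_set manifest and the recording ID are assumed):

>>> feats = feature_set.find('rec-001', channel_id=0, start=0.0, duration=5.0)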

load(recording_id, channel_id=0, start=0.0, duration=None)[source]

Find a Features object that best satisfies the search criteria and load the features as a numpy ndarray. Raise a KeyError when no such object is available.

Return type:

ndarray

copy_feats(writer)[source]

For each manifest in this FeatureSet, read the referenced feature array and save it using writer. Returns a copy of the manifest with updated fields related to the feature storage.

Return type:

FeatureSet

compute_global_stats(storage_path=None)[source]

Compute the global means and standard deviations for each feature bin in the manifest. It follows the implementation in scikit-learn: https://github.com/scikit-learn/scikit-learn/blob/0fb307bf39bbdacd6ed713c00724f8f871d60370/sklearn/utils/extmath.py#L715 which follows the paper: “Algorithms for computing the sample variance: analysis and recommendations”, by Chan, Golub, and LeVeque.

Parameters:

storage_path (Union[Path, str, None]) – an optional path to a file where the stats will be stored with pickle.

Returns:

a dict of {'norm_means': np.ndarray, 'norm_stds': np.ndarray} with the shape of the arrays equal to the number of feature bins in this manifest.

Return type:

Dict[str, ndarray]
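
A brief usage sketch (feature_set is a hypothetical FeatureSet; the pickle path is illustrative):

>>> stats = feature_set.compute_global_stats(storage_path='global_stats.pkl')
>>> means, stds = stats['norm_means'], stats['norm_stds']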

filter(predicate)

Return a new manifest containing only the items that satisfy predicate. If the manifest is lazy, the filtering will also be applied lazily.

Parameters:

predicate (Callable[[TypeVar(T)], bool]) – a function that takes a manifest item as an argument and returns bool.

Returns:

a filtered manifest.

classmethod from_file(path)
Return type:

Any

classmethod from_json(path)
Return type:

Any

classmethod from_jsonl(path)
Return type:

Any

classmethod from_jsonl_lazy(path)

Read a JSONL manifest in a lazy manner, which opens the file but does not read it immediately. It is only suitable for sequential reads and iteration.

Return type:

Any

Warning

Opening the manifest in this way might cause some methods that rely on random access to fail.

classmethod from_yaml(path)
Return type:

Any

classmethod infinite_mux(*manifests, weights=None, seed=0, max_open_streams=None)

Merges multiple manifest iterables into a new iterable by lazily multiplexing them during iteration time. Unlike mux(), this method allows limiting the maximum number of sub-iterators open at any given time.

To enable this, it performs 2-stage sampling. First, it samples with replacement the set of iterators I to construct a subset I_sub of size max_open_streams. Then, for each iteration step, it samples an iterator i from I_sub, fetches the next item from it, and yields it. Once i becomes exhausted, it is replaced with a new iterator j sampled from I_sub.

Caution

Do not use this method with inputs that are infinitely iterable as they will silently break the multiplexing property by only using a subset of the input iterables.

Caution

This method is not recommended when multiplexing for a small number of iterations, as it may be much less accurate than mux(), depending on the number of open streams, the iterable sizes, and the random seed.

Parameters:
  • manifests – iterables to be multiplexed. They can be either lazy or eager, but the resulting manifest will always be lazy.

  • weights (Optional[List[Union[int, float]]]) – an optional weight for each iterable that affects the probability of it being sampled. The weights are uniform by default. If the iterable lengths are known, it makes sense to pass them here so that the items are uniformly distributed in expectation.

  • seed (Union[int, Literal['trng', 'randomized']]) – the random seed, ensures deterministic order across multiple iterations.

  • max_open_streams (Optional[int]) – the number of iterables that can be open simultaneously at any given time.

property is_lazy: bool

Indicates whether this manifest was opened in lazy (read-on-the-fly) mode or not.

map(transform_fn)

Apply transform_fn to each item in this manifest and return a new manifest. If the manifest is opened lazy, the transform is also applied lazily.

Parameters:

transform_fn (Callable[[TypeVar(T)], TypeVar(T)]) – A callable (function) that accepts a single item instance and returns a new (or the same) instance of the same type. E.g., with a CutSet, the callable accepts a Cut and returns a Cut.

Returns:

a new manifest with transformed items.

classmethod mux(*manifests, stop_early=False, weights=None, seed=0)

Merges multiple manifest iterables into a new iterable by lazily multiplexing them during iteration time. If one of the iterables is exhausted before the others, we will keep iterating until all iterables are exhausted. This behavior can be changed with the stop_early parameter.

Parameters:
  • manifests – iterables to be multiplexed. They can be either lazy or eager, but the resulting manifest will always be lazy.

  • stop_early (bool) – should we stop the iteration as soon as we exhaust one of the manifests.

  • weights (Optional[List[Union[int, float]]]) – an optional weight for each iterable that affects the probability of it being sampled. The weights are uniform by default. If the iterable lengths are known, it makes sense to pass them here so that the items are uniformly distributed in expectation.

  • seed (Union[int, Literal['trng', 'randomized']]) – the random seed, ensures deterministic order across multiple iterations.

classmethod open_writer(path, overwrite=True)

Open a sequential writer that allows storing manifests one by one, without keeping the whole manifest set in memory. Supports writing to JSONL format (.jsonl), with optional gzip compression (.jsonl.gz).

Return type:

Union[SequentialJsonlWriter, InMemoryWriter]

Note

when path is None, we will return an InMemoryWriter instead, which has the same API but stores the manifests in memory. It is convenient when you want to make disk saving optional.

Example:

>>> from lhotse import RecordingSet
... recordings = [...]
... with RecordingSet.open_writer('recordings.jsonl.gz') as writer:
...     for recording in recordings:
...         writer.write(recording)

This writer can be useful for resuming a previously interrupted writing session – it will open the existing file and scan it for item IDs, so that writing them again can be skipped. It can also be queried for existing IDs so that the user code may skip preparing the corresponding manifests.

Example:

>>> from lhotse import RecordingSet, Recording
... with RecordingSet.open_writer('recordings.jsonl.gz', overwrite=False) as writer:
...     for path in Path('.').rglob('*.wav'):
...         recording_id = path.stem
...         if writer.contains(recording_id):
...             # Item already written previously - skip processing.
...             continue
...         # Item doesn't exist yet - run extra work to prepare the manifest
...         # and store it.
...         recording = Recording.from_file(path, recording_id=recording_id)
...         writer.write(recording)
repeat(times=None, preserve_id=False)

Return a new, lazily evaluated manifest that iterates over the original elements times number of times.

Parameters:
  • times (Optional[int]) – how many times to repeat (infinite by default).

  • preserve_id (bool) – when True, we won’t update the element ID with repeat number.

Returns:

a repeated manifest.

to_eager()

Evaluates all lazy operations on this manifest, if any, and returns a copy that keeps all items in memory. If the manifest was “eager” already, this is a no-op and won’t copy anything.

to_file(path)
Return type:

None

to_json(path)
Return type:

None

to_jsonl(path)
Return type:

None

to_yaml(path)
Return type:

None

class lhotse.features.base.FeatureSetBuilder(feature_extractor, storage, augment_fn=None)[source]

An extended constructor for the FeatureSet. Think of it as a class wrapper for a feature extraction script. It consumes an iterable of Recordings, extracts the features specified by the FeatureExtractor config, and stores them on disk.

Eventually, we plan to extend it with the capability to extract only the features in specified regions of recordings and to perform some time-domain data augmentation.

__init__(feature_extractor, storage, augment_fn=None)[source]
process_and_store_recordings(recordings, output_manifest=None, num_jobs=1)[source]
Return type:

FeatureSet
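
A hedged usage sketch (the recordings manifest and the output path 'feats/fbank' are assumed):

>>> from lhotse import Fbank, LilcomChunkyWriter
>>> builder = FeatureSetBuilder(feature_extractor=Fbank(), storage=LilcomChunkyWriter('feats/fbank'))
>>> feature_set = builder.process_and_store_recordings(recordings, num_jobs=4)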

lhotse.features.base.store_feature_array(feats, storage)[source]

Store feats array on disk, using lilcom compression by default.

Parameters:
  • feats (ndarray) – a numpy ndarray containing features.

  • storage (FeaturesWriter) – a FeaturesWriter object to use for array storage.

Return type:

str

Returns:

a path to the file containing the stored array.

lhotse.features.base.compute_global_stats(feature_manifests, storage_path=None)[source]

Compute the global means and standard deviations for each feature bin in the manifest. It performs only a single pass over the data and iteratively updates the estimate of the means and variances.

We follow the implementation in scikit-learn: https://github.com/scikit-learn/scikit-learn/blob/0fb307bf39bbdacd6ed713c00724f8f871d60370/sklearn/utils/extmath.py#L715 which follows the paper: “Algorithms for computing the sample variance: analysis and recommendations”, by Chan, Golub, and LeVeque.

Parameters:
  • feature_manifests (Iterable[Features]) – an iterable of Features objects.

  • storage_path (Union[Path, str, None]) – an optional path to a file where the stats will be stored with pickle.

Returns:

a dict of {'norm_means': np.ndarray, 'norm_stds': np.ndarray} with the shape of the arrays equal to the number of feature bins in this manifest.

Return type:

Dict[str, ndarray]

class lhotse.features.base.StatsAccumulator(feature_dim)[source]
__init__(feature_dim)[source]
update(arr)[source]
Return type:

None

property norm_means: ndarray
property norm_stds: ndarray
get()[source]
Return type:

Dict[str, ndarray]
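
A minimal sketch (feature_matrices is a hypothetical iterable of np.ndarray feature matrices with 80 bins each):

>>> acc = StatsAccumulator(feature_dim=80)
>>> for feats in feature_matrices:
...     acc.update(feats)
>>> stats = acc.get()  # {'norm_means': ..., 'norm_stds': ...}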

Lhotse’s feature extractors

class lhotse.features.kaldi.extractors.Fbank(config=None)[source]
name = 'kaldi-fbank'
config_type

alias of FbankConfig

__init__(config=None)[source]
property device: str | device
property frame_shift: float
to(device)[source]
feature_dim(sampling_rate)[source]
Return type:

int

extract(samples, sampling_rate)[source]

Defines how to extract features using a numpy ndarray of audio samples and the sampling rate.

Return type:

Union[ndarray, Tensor]

Returns:

a numpy ndarray representing the feature matrix.

extract_batch(samples, sampling_rate, lengths=None)[source]

Performs batch extraction. It is not guaranteed to be faster than FeatureExtractor.extract() – it depends on whether the implementation of a particular feature extractor supports accelerated batch computation. If lengths is provided, it is assumed that the input is a batch of padded sequences, so we will not perform any further collation.

Return type:

Union[ndarray, Tensor, List[ndarray], List[Tensor]]

Note

Unless overridden by child classes, it defaults to sequentially calling FeatureExtractor.extract() on the inputs.

Note

This method should support variable length inputs.

static mix(features_a, features_b, energy_scaling_factor_b)[source]

Perform feature-domain mix of two signals, a and b, and return the mixed signal.

Parameters:
  • features_a (ndarray) – Left-hand side (reference) signal.

  • features_b (ndarray) – Right-hand side (mixed-in) signal.

  • energy_scaling_factor_b (float) – A scaling factor for features_b energy. It is used to achieve a specific SNR. E.g. to mix with an SNR of 10dB when both features_a and features_b energies are 100, the features_b signal energy needs to be scaled by 0.1. Since different features (e.g. spectrogram, fbank, MFCC) require different combination of transformations (e.g. exp, log, sqrt, pow) to allow mixing of two signals, the exact place where to apply energy_scaling_factor_b to the signal is determined by the implementer.

Return type:

ndarray

Returns:

A mixed feature matrix.

static compute_energy(features)[source]

Compute the total energy of a feature matrix. How the energy is computed depends on a particular type of features. It is expected that when implemented, compute_energy will never return zero.

Parameters:

features (ndarray) – A feature matrix.

Return type:

float

Returns:

A positive float value of the signal energy.

class lhotse.features.kaldi.extractors.Mfcc(config=None)[source]
name = 'kaldi-mfcc'
config_type

alias of MfccConfig

__init__(config=None)[source]
property device: str | device
property frame_shift: float
feature_dim(sampling_rate)[source]
Return type:

int

extract(samples, sampling_rate)[source]

Defines how to extract features using a numpy ndarray of audio samples and the sampling rate.

Return type:

Union[ndarray, Tensor]

Returns:

a numpy ndarray representing the feature matrix.

extract_batch(samples, sampling_rate, lengths=None)[source]

Performs batch extraction. It is not guaranteed to be faster than FeatureExtractor.extract() – it depends on whether the implementation of a particular feature extractor supports accelerated batch computation. If lengths is provided, it is assumed that the input is a batch of padded sequences, so we will not perform any further collation.

Return type:

Union[ndarray, Tensor, List[ndarray], List[Tensor]]

Note

Unless overridden by child classes, it defaults to sequentially calling FeatureExtractor.extract() on the inputs.

Note

This method should support variable length inputs.

Kaldi feature extractors as network layers

Copyright 2019 Johns Hopkins University (Author: Jesus Villalba)

2021 Johns Hopkins University (Author: Piotr Żelasko)

Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0)

This whole module is authored and contributed by Jesus Villalba, with minor changes by Piotr Żelasko to make it more consistent with Lhotse.

It contains a PyTorch implementation of feature extractors that is very close to Kaldi’s – notably, it differs in that the preemphasis and DC offset removal are applied in the time, rather than frequency domain. This should not significantly affect any results, as confirmed by Jesus.

This implementation works well with autograd and batching, and can be used as neural network layers.

Update January 2022: These modules now expose a new API function called “online_inference” that may be used to compute the features when the audio is streaming. The implementation is stateless, and passes the waveform remainders back to the user to feed them to the modules once new data becomes available. The implementation is compatible with JIT scripting via TorchScript.

class lhotse.features.kaldi.layers.Wav2Win(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, pad_length=None, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, return_log_energy=False)[source]

Apply standard Kaldi preprocessing (dithering, removing DC offset, pre-emphasis, etc.) on the input waveforms and partition them into overlapping frames (of audio samples). Note: no feature extraction happens here; the output is still a time-domain signal.

Example:

>>> x = torch.randn(1, 16000, dtype=torch.float32)
>>> x.shape
torch.Size([1, 16000])
>>> t = Wav2Win()
>>> t(x).shape
torch.Size([1, 100, 400])

The input is a tensor of shape (batch_size, num_samples). The output is a tensor of shape (batch_size, num_frames, window_length). When return_log_energy==True, returns a tuple where the second element is a log-energy tensor of shape (batch_size, num_frames).

__init__(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, pad_length=None, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, return_log_energy=False)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(x)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Return type:

Tuple[Tensor, Optional[Tensor]]

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

online_inference(x, context=None)[source]

The same as the forward() method, except it accepts an extra argument with the remainder waveform from the previous call of online_inference(), and returns a tuple of ((frames, log_energy), remainder).

Return type:

Tuple[Tuple[Tensor, Optional[Tensor]], Tensor]
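
A hedged streaming sketch (chunks is a hypothetical iterable of waveform tensors shaped (batch, num_samples)):

>>> t = Wav2Win()
>>> remainder = None
>>> for chunk in chunks:
...     (frames, log_energy), remainder = t.online_inference(chunk, context=remainder)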

T_destination = ~T_destination
add_module(name, module)

Add a child module to the current module.

The module can be accessed as an attribute using the given name.

Return type:

None

Args:
name (str): name of the child module. The child module can be accessed from this module using the given name.

module (Module): child module to be added to the module.

apply(fn)

Apply fn recursively to every submodule (as returned by .children()) as well as self.

Typical use includes initializing the parameters of a model (see also nn-init-doc).

Return type:

TypeVar(T, bound= Module)

Args:

fn (Module -> None): function to be applied to each submodule

Returns:

Module: self

Example:

>>> @torch.no_grad()
>>> def init_weights(m):
>>>     print(m)
>>>     if type(m) == nn.Linear:
>>>         m.weight.fill_(1.0)
>>>         print(m.weight)
>>> net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
>>> net.apply(init_weights)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
bfloat16()

Casts all floating point parameters and buffers to bfloat16 datatype.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Returns:

Module: self

buffers(recurse=True)

Return an iterator over module buffers.

Return type:

Iterator[Tensor]

Args:
recurse (bool): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module.

Yields:

torch.Tensor: module buffer

Example:

>>> for buf in model.buffers():
>>>     print(type(buf), buf.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
call_super_init: bool = False
children()

Return an iterator over immediate children modules.

Return type:

Iterator[Module]

Yields:

Module: a child module

compile(*args, **kwargs)

Compile this Module’s forward using torch.compile().

This Module’s __call__ method is compiled and all arguments are passed as-is to torch.compile().

See torch.compile() for details on the arguments for this function.

cpu()

Move all model parameters and buffers to the CPU.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Returns:

Module: self

cuda(device=None)

Move all model parameters and buffers to the GPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on GPU while being optimized.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Args:
device (int, optional): if specified, all parameters will be copied to that device

Returns:

Module: self

double()

Casts all floating point parameters and buffers to double datatype.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Returns:

Module: self

dump_patches: bool = False
eval()

Set the module in evaluation mode.

This has an effect only on certain modules. See the documentation of particular modules for details of their behavior in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

This is equivalent to self.train(False).

See locally-disable-grad-doc for a comparison between .eval() and several similar mechanisms that may be confused with it.

Return type:

TypeVar(T, bound= Module)

Returns:

Module: self

extra_repr()

Set the extra representation of the module.

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

Return type:

str

float()

Casts all floating point parameters and buffers to float datatype.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Returns:

Module: self

get_buffer(target)

Return the buffer given by target if it exists, otherwise throw an error.

See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.

Return type:

Tensor

Args:

target: The fully-qualified string name of the buffer to look for. (See get_submodule for how to specify a fully-qualified string.)

Returns:

torch.Tensor: The buffer referenced by target

Raises:

AttributeError: If the target string references an invalid path or resolves to something that is not a buffer

get_extra_state()

Return any extra state to include in the module’s state_dict.

Implement this and a corresponding set_extra_state() for your module if you need to store extra state. This function is called when building the module’s state_dict().

Note that extra state should be picklable to ensure working serialization of the state_dict. We only provide backwards compatibility guarantees for serializing Tensors; other objects may break backwards compatibility if their serialized pickled form changes.

Return type:

Any

Returns:

object: Any extra state to store in the module’s state_dict

get_parameter(target)

Return the parameter given by target if it exists, otherwise throw an error.

See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.

Return type:

Parameter

Args:

target: The fully-qualified string name of the Parameter to look for. (See get_submodule for how to specify a fully-qualified string.)

Returns:

torch.nn.Parameter: The Parameter referenced by target

Raises:

AttributeError: If the target string references an invalid path or resolves to something that is not an nn.Parameter

get_submodule(target)

Return the submodule given by target if it exists, otherwise throw an error.

For example, let’s say you have an nn.Module A that looks like this:

A(
    (net_b): Module(
        (net_c): Module(
            (conv): Conv2d(16, 33, kernel_size=(3, 3), stride=(2, 2))
        )
        (linear): Linear(in_features=100, out_features=200, bias=True)
    )
)

(The diagram shows an nn.Module A. A has a nested submodule net_b, which itself has two submodules net_c and linear. net_c then has a submodule conv.)

To check whether or not we have the linear submodule, we would call get_submodule("net_b.linear"). To check whether we have the conv submodule, we would call get_submodule("net_b.net_c.conv").

The runtime of get_submodule is bounded by the degree of module nesting in target. A query against named_modules achieves the same result, but it is O(N) in the number of transitive modules. So, for a simple check to see if some submodule exists, get_submodule should always be used.

Return type:

Module

Args:

target: The fully-qualified string name of the submodule to look for. (See above example for how to specify a fully-qualified string.)

Returns:

torch.nn.Module: The submodule referenced by target

Raises:

AttributeError: If the target string references an invalid path or resolves to something that is not an nn.Module
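For example, fetching a child of an nn.Sequential by its registered name:

>>> from torch import nn
>>> net = nn.Sequential(nn.Linear(2, 2), nn.ReLU())
>>> net.get_submodule("0")
Linear(in_features=2, out_features=2, bias=True)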

half()

Casts all floating point parameters and buffers to half datatype.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Returns:

Module: self

ipu(device=None)

Move all model parameters and buffers to the IPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on IPU while being optimized.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Args:

device (int, optional): if specified, all parameters will be copied to that device.

Returns:

Module: self

load_state_dict(state_dict, strict=True, assign=False)

Copy parameters and buffers from state_dict into this module and its descendants.

If strict is True, then the keys of state_dict must exactly match the keys returned by this module’s state_dict() function.

Warning

If assign is True the optimizer must be created after the call to load_state_dict.

Args:

state_dict (dict): a dict containing parameters and persistent buffers.

strict (bool, optional): whether to strictly enforce that the keys in state_dict match the keys returned by this module’s state_dict() function. Default: True

assign (bool, optional): whether to assign items in the state dictionary to their corresponding keys in the module instead of copying them in place into the module’s current parameters and buffers. When False, the properties of the tensors in the current module are preserved; when True, the properties of the Tensors in the state dict are preserved. Default: False

Returns:
NamedTuple with missing_keys and unexpected_keys fields:
  • missing_keys is a list of str containing the missing keys

  • unexpected_keys is a list of str containing the unexpected keys

Note:

If a parameter or buffer is registered as None and its corresponding key exists in state_dict, load_state_dict() will raise a RuntimeError.
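For example, round-tripping a module’s own state dict yields empty key lists:

>>> from torch import nn
>>> module = nn.Linear(2, 2)
>>> result = module.load_state_dict(module.state_dict(), strict=True)
>>> result.missing_keys, result.unexpected_keys
([], [])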

modules()

Return an iterator over all modules in the network.

Return type:

Iterator[Module]

Yields:

Module: a module in the network

Note:

Duplicate modules are returned only once. In the following example, l will be returned only once.

Example:

>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.modules()):
...     print(idx, '->', m)

0 -> Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
1 -> Linear(in_features=2, out_features=2, bias=True)
named_buffers(prefix='', recurse=True, remove_duplicate=True)

Return an iterator over module buffers, yielding both the name of the buffer as well as the buffer itself.

Return type:

Iterator[Tuple[str, Tensor]]

Args:

prefix (str): prefix to prepend to all buffer names.

recurse (bool, optional): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module. Defaults to True.

remove_duplicate (bool, optional): whether to remove the duplicated buffers in the result. Defaults to True.

Yields:

(str, torch.Tensor): Tuple containing the name and buffer

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> for name, buf in self.named_buffers():
>>>     if name in ['running_var']:
>>>         print(buf.size())
named_children()

Return an iterator over immediate children modules, yielding both the name of the module as well as the module itself.

Return type:

Iterator[Tuple[str, Module]]

Yields:

(str, Module): Tuple containing a name and child module

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> for name, module in model.named_children():
>>>     if name in ['conv4', 'conv5']:
>>>         print(module)
named_modules(memo=None, prefix='', remove_duplicate=True)

Return an iterator over all modules in the network, yielding both the name of the module as well as the module itself.

Args:

memo: a memo to store the set of modules already added to the result

prefix: a prefix that will be added to the name of the module

remove_duplicate: whether to remove the duplicated module instances in the result or not

Yields:

(str, Module): Tuple of name and module

Note:

Duplicate modules are returned only once. In the following example, l will be returned only once.

Example:

>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.named_modules()):
...     print(idx, '->', m)

0 -> ('', Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
))
1 -> ('0', Linear(in_features=2, out_features=2, bias=True))
named_parameters(prefix='', recurse=True, remove_duplicate=True)

Return an iterator over module parameters, yielding both the name of the parameter as well as the parameter itself.

Return type:

Iterator[Tuple[str, Parameter]]

Args:

prefix (str): prefix to prepend to all parameter names.

recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.

remove_duplicate (bool, optional): whether to remove the duplicated parameters in the result. Defaults to True.

Yields:

(str, Parameter): Tuple containing the name and parameter

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> for name, param in self.named_parameters():
>>>     if name in ['bias']:
>>>         print(param.size())
parameters(recurse=True)

Return an iterator over module parameters.

This is typically passed to an optimizer.

Return type:

Iterator[Parameter]

Args:

recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.

Yields:

Parameter: module parameter

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> for param in model.parameters():
>>>     print(type(param), param.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
register_backward_hook(hook)

Register a backward hook on the module.

This function is deprecated in favor of register_full_backward_hook() and the behavior of this function will change in future versions.

Return type:

RemovableHandle

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

register_buffer(name, tensor, persistent=True)

Add a buffer to the module.

This is typically used to register a buffer that should not be considered a model parameter. For example, BatchNorm’s running_mean is not a parameter, but is part of the module’s state. Buffers, by default, are persistent and will be saved alongside parameters. This behavior can be changed by setting persistent to False. The only difference between a persistent buffer and a non-persistent buffer is that the latter will not be a part of this module’s state_dict.

Buffers can be accessed as attributes using given names.

Return type:

None

Args:

name (str): name of the buffer. The buffer can be accessed from this module using the given name.

tensor (Tensor or None): buffer to be registered. If None, then operations that run on buffers, such as cuda, are ignored, and the buffer is not included in the module’s state_dict.

persistent (bool): whether the buffer is part of this module’s state_dict.

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> self.register_buffer('running_mean', torch.zeros(num_features))
register_forward_hook(hook, *, prepend=False, with_kwargs=False, always_call=False)

Register a forward hook on the module.

The hook will be called every time after forward() has computed an output.

If with_kwargs is False or not specified, the input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks, only to the forward. The hook can modify the output. It can also modify the input in place, but this will not have an effect on forward, since the hook is called after forward() has run. The hook should have the following signature:

hook(module, args, output) -> None or modified output

If with_kwargs is True, the forward hook will be passed the kwargs given to the forward function and be expected to return the output possibly modified. The hook should have the following signature:

hook(module, args, kwargs, output) -> None or modified output
Return type:

RemovableHandle

Args:

hook (Callable): The user defined hook to be registered.

prepend (bool): If True, the provided hook will be fired before all existing forward hooks on this torch.nn.modules.Module. Otherwise, the provided hook will be fired after all existing forward hooks on this torch.nn.modules.Module. Note that global forward hooks registered with register_module_forward_hook() will fire before all hooks registered by this method. Default: False

with_kwargs (bool): If True, the hook will be passed the kwargs given to the forward function. Default: False

always_call (bool): If True, the hook will be run regardless of whether an exception is raised while calling the Module. Default: False

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()
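A short sketch: a hook that scales the module’s output, then is removed via the returned handle:

>>> import torch
>>> from torch import nn
>>> layer = nn.Linear(2, 2)
>>> handle = layer.register_forward_hook(lambda module, args, output: output * 2)
>>> y = layer(torch.ones(1, 2))  # y is the hook-scaled output
>>> handle.remove()  # later calls run without the hook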

register_forward_pre_hook(hook, *, prepend=False, with_kwargs=False)

Register a forward pre-hook on the module.

The hook will be called every time before forward() is invoked.

If with_kwargs is false or not specified, the input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks, only to the forward. The hook can modify the input. The user can either return a tuple or a single modified value in the hook. We will wrap the value into a tuple if a single value is returned (unless that value is already a tuple). The hook should have the following signature:

hook(module, args) -> None or modified input

If with_kwargs is true, the forward pre-hook will be passed the kwargs given to the forward function. And if the hook modifies the input, both the args and kwargs should be returned. The hook should have the following signature:

hook(module, args, kwargs) -> None or a tuple of modified input and kwargs
Return type:

RemovableHandle

Args:

hook (Callable): The user defined hook to be registered.

prepend (bool): If true, the provided hook will be fired before all existing forward_pre hooks on this torch.nn.modules.Module. Otherwise, the provided hook will be fired after all existing forward_pre hooks on this torch.nn.modules.Module. Note that global forward_pre hooks registered with register_module_forward_pre_hook() will fire before all hooks registered by this method. Default: False

with_kwargs (bool): If true, the hook will be passed the kwargs given to the forward function. Default: False

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

register_full_backward_hook(hook, prepend=False)

Register a backward hook on the module.

The hook will be called every time the gradients with respect to a module are computed, i.e. the hook will execute if and only if the gradients with respect to module outputs are computed. The hook should have the following signature:

hook(module, grad_input, grad_output) -> tuple(Tensor) or None

The grad_input and grad_output are tuples that contain the gradients with respect to the inputs and outputs respectively. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the input that will be used in place of grad_input in subsequent computations. grad_input will only correspond to the inputs given as positional arguments and all kwarg arguments are ignored. Entries in grad_input and grad_output will be None for all non-Tensor arguments.

For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly the caller will receive a view of each Tensor returned by the Module’s forward function.

Return type:

RemovableHandle

Warning

Modifying inputs or outputs inplace is not allowed when using backward hooks and will raise an error.

Args:

hook (Callable): The user-defined hook to be registered.

prepend (bool): If true, the provided hook will be fired before all existing backward hooks on this torch.nn.modules.Module. Otherwise, the provided hook will be fired after all existing backward hooks on this torch.nn.modules.Module. Note that global backward hooks registered with register_module_full_backward_hook() will fire before all hooks registered by this method.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()
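A short sketch: a hook that reports the gradient shapes flowing out of the module during the backward pass:

>>> import torch
>>> from torch import nn
>>> layer = nn.Linear(2, 2)
>>> def report(module, grad_input, grad_output):
>>>     print([g.shape for g in grad_output if g is not None])
>>> handle = layer.register_full_backward_hook(report)
>>> layer(torch.ones(1, 2, requires_grad=True)).sum().backward()
[torch.Size([1, 2])]
>>> handle.remove()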

register_full_backward_pre_hook(hook, prepend=False)

Register a backward pre-hook on the module.

The hook will be called every time the gradients for the module are computed. The hook should have the following signature:

hook(module, grad_output) -> tuple[Tensor] or None

The grad_output is a tuple. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the output that will be used in place of grad_output in subsequent computations. Entries in grad_output will be None for all non-Tensor arguments.

For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly the caller will receive a view of each Tensor returned by the Module’s forward function.

Return type:

RemovableHandle

Warning

Modifying inputs inplace is not allowed when using backward hooks and will raise an error.

Args:

hook (Callable): The user-defined hook to be registered.

prepend (bool): If true, the provided hook will be fired before all existing backward_pre hooks on this torch.nn.modules.Module. Otherwise, the provided hook will be fired after all existing backward_pre hooks on this torch.nn.modules.Module. Note that global backward_pre hooks registered with register_module_full_backward_pre_hook() will fire before all hooks registered by this method.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

register_load_state_dict_post_hook(hook)

Register a post hook to be run after module’s load_state_dict is called.

It should have the following signature:

hook(module, incompatible_keys) -> None

The module argument is the current module that this hook is registered on, and the incompatible_keys argument is a NamedTuple consisting of attributes missing_keys and unexpected_keys. missing_keys is a list of str containing the missing keys and unexpected_keys is a list of str containing the unexpected keys.

The given incompatible_keys can be modified inplace if needed.

Note that the checks performed when calling load_state_dict() with strict=True are affected by modifications the hook makes to missing_keys or unexpected_keys, as expected. Additions to either set of keys will result in an error being thrown when strict=True, and clearing out both missing and unexpected keys will avoid an error.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()
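A short sketch: a post-hook that inspects (and could edit in place) the incompatible keys:

>>> from torch import nn
>>> def report_keys(module, incompatible_keys):
>>>     print(incompatible_keys.missing_keys, incompatible_keys.unexpected_keys)
>>> model = nn.Linear(2, 2)
>>> handle = model.register_load_state_dict_post_hook(report_keys)
>>> _ = model.load_state_dict(model.state_dict())
[] []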

register_module(name, module)

Alias for add_module().

Return type:

None

register_parameter(name, param)

Add a parameter to the module.

The parameter can be accessed as an attribute using given name.

Return type:

None

Args:

name (str): name of the parameter. The parameter can be accessed from this module using the given name.

param (Parameter or None): parameter to be added to the module. If None, then operations that run on parameters, such as cuda, are ignored, and the parameter is not included in the module’s state_dict.

register_state_dict_pre_hook(hook)

Register a pre-hook for the state_dict() method.

These hooks will be called with arguments: self, prefix, and keep_vars before calling state_dict on self. The registered hooks can be used to perform pre-processing before the state_dict call is made.

requires_grad_(requires_grad=True)

Change if autograd should record operations on parameters in this module.

This method sets the parameters’ requires_grad attributes in-place.

This method is helpful for freezing part of the module for finetuning or training parts of a model individually (e.g., GAN training).

See locally-disable-grad-doc for a comparison between .requires_grad_() and several similar mechanisms that may be confused with it.

Return type:

TypeVar(T, bound= Module)

Args:

requires_grad (bool): whether autograd should record operations on parameters in this module. Default: True.

Returns:

Module: self
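For example, freezing a submodule for fine-tuning:

>>> from torch import nn
>>> encoder = nn.Linear(2, 2)
>>> encoder.requires_grad_(False)  # autograd will ignore these parameters
Linear(in_features=2, out_features=2, bias=True)
>>> any(p.requires_grad for p in encoder.parameters())
False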

set_extra_state(state)

Set extra state contained in the loaded state_dict.

This function is called from load_state_dict() to handle any extra state found within the state_dict. Implement this function and a corresponding get_extra_state() for your module if you need to store extra state within its state_dict.

Args:

state (dict): Extra state from the state_dict
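A minimal sketch of a module that persists a plain Python counter through its state_dict via the extra-state mechanism (the class and attribute names are illustrative):

>>> from torch import nn
>>> class Counter(nn.Module):
>>>     def __init__(self):
>>>         super().__init__()
>>>         self.count = 0
>>>     def get_extra_state(self):
>>>         return {"count": self.count}
>>>     def set_extra_state(self, state):
>>>         self.count = state["count"]
>>> src, dst = Counter(), Counter()
>>> src.count = 5
>>> _ = dst.load_state_dict(src.state_dict())
>>> dst.count
5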

share_memory()

See torch.Tensor.share_memory_().

Return type:

TypeVar(T, bound= Module)

state_dict(*args, destination=None, prefix='', keep_vars=False)

Return a dictionary containing references to the whole state of the module.

Both parameters and persistent buffers (e.g. running averages) are included. Keys are corresponding parameter and buffer names. Parameters and buffers set to None are not included.

Note

The returned object is a shallow copy. It contains references to the module’s parameters and buffers.

Warning

Currently state_dict() also accepts positional arguments for destination, prefix and keep_vars in order. However, this is being deprecated and keyword arguments will be enforced in future releases.

Warning

Please avoid the use of argument destination as it is not designed for end-users.

Args:

destination (dict, optional): If provided, the state of module will be updated into the dict and the same object is returned. Otherwise, an OrderedDict will be created and returned. Default: None.

prefix (str, optional): a prefix added to parameter and buffer names to compose the keys in state_dict. Default: ''.

keep_vars (bool, optional): by default the Tensors returned in the state dict are detached from autograd. If it’s set to True, detaching will not be performed. Default: False.

Returns:
dict:

a dictionary containing a whole state of the module

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> module.state_dict().keys()
['bias', 'weight']
to(*args, **kwargs)

Move and/or cast the parameters and buffers.

This can be called as

to(device=None, dtype=None, non_blocking=False)
to(dtype, non_blocking=False)
to(tensor, non_blocking=False)
to(memory_format=torch.channels_last)

Its signature is similar to torch.Tensor.to(), but only accepts floating point or complex dtypes. In addition, this method will only cast the floating point or complex parameters and buffers to dtype (if given). The integral parameters and buffers will be moved to device, if that is given, but with dtypes unchanged. When non_blocking is set, it tries to convert/move asynchronously with respect to the host if possible, e.g., moving CPU Tensors with pinned memory to CUDA devices.

See below for examples.

Note

This method modifies the module in-place.

Args:

device (torch.device): the desired device of the parameters and buffers in this module

dtype (torch.dtype): the desired floating point or complex dtype of the parameters and buffers in this module

tensor (torch.Tensor): Tensor whose dtype and device are the desired dtype and device for all parameters and buffers in this module

memory_format (torch.memory_format): the desired memory format for 4D parameters and buffers in this module (keyword only argument)

Returns:

Module: self

Examples:

>>> # xdoctest: +IGNORE_WANT("non-deterministic")
>>> linear = nn.Linear(2, 2)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]])
>>> linear.to(torch.double)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]], dtype=torch.float64)
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_CUDA1)
>>> gpu1 = torch.device("cuda:1")
>>> linear.to(gpu1, dtype=torch.half, non_blocking=True)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16, device='cuda:1')
>>> cpu = torch.device("cpu")
>>> linear.to(cpu)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16)

>>> linear = nn.Linear(2, 2, bias=None).to(torch.cdouble)
>>> linear.weight
Parameter containing:
tensor([[ 0.3741+0.j,  0.2382+0.j],
        [ 0.5593+0.j, -0.4443+0.j]], dtype=torch.complex128)
>>> linear(torch.ones(3, 2, dtype=torch.cdouble))
tensor([[0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j]], dtype=torch.complex128)
to_empty(*, device, recurse=True)

Move the parameters and buffers to the specified device without copying storage.

Return type:

TypeVar(T, bound= Module)

Args:

device (torch.device): The desired device of the parameters and buffers in this module.

recurse (bool): Whether parameters and buffers of submodules should be recursively moved to the specified device.

Returns:

Module: self

train(mode=True)

Set the module in training mode.

This has an effect only on certain modules. See the documentation of particular modules for details of their behavior in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

Return type:

TypeVar(T, bound= Module)

Args:

mode (bool): whether to set training mode (True) or evaluation mode (False). Default: True.

Returns:

Module: self

type(dst_type)

Casts all parameters and buffers to dst_type.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Args:

dst_type (type or string): the desired type

Returns:

Module: self

xpu(device=None)

Move all model parameters and buffers to the XPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on XPU while being optimized.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Args:

device (int, optional): if specified, all parameters will be copied to that device.

Returns:

Module: self

zero_grad(set_to_none=True)

Reset gradients of all model parameters.

See similar function under torch.optim.Optimizer for more context.

Return type:

None

Args:

set_to_none (bool): instead of setting to zero, set the grads to None. See torch.optim.Optimizer.zero_grad() for details.

training: bool
class lhotse.features.kaldi.layers.Wav2FFT(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, round_to_power_of_two=True, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=True)[source]

Apply standard Kaldi preprocessing (dithering, removing DC offset, pre-emphasis, etc.) on the input waveforms and compute their Short-Time Fourier Transform (STFT). The output is a complex-valued tensor.

Example:

>>> x = torch.randn(1, 16000, dtype=torch.float32)
>>> x.shape
torch.Size([1, 16000])
>>> t = Wav2FFT()
>>> t(x).shape
torch.Size([1, 100, 257])

The input is a tensor of shape (batch_size, num_samples). The output is a tensor of shape (batch_size, num_frames, num_fft_bins) with dtype torch.complex64.
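Since the output is complex-valued, magnitude and power spectra can be derived from it directly, e.g.:

>>> import torch
>>> from lhotse.features.kaldi.layers import Wav2FFT
>>> X = Wav2FFT()(torch.randn(1, 16000))
>>> power = X.real ** 2 + X.imag ** 2  # power spectrum, shape (1, 100, 257)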

__init__(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, round_to_power_of_two=True, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=True)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

property sampling_rate: int
property frame_length: float
property frame_shift: float
property remove_dc_offset: bool
property preemph_coeff: float
property window_type: str
property dither: float
forward(x)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Return type:

Tensor

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

online_inference(x, context=None)[source]
Return type:

Tuple[Tensor, Tensor]
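Judging by the analogous method documented above, online_inference() presumably mirrors forward() while carrying a remainder waveform between calls via context; that description is inferred rather than stated here. A minimal streaming sketch under that assumption:

>>> import torch
>>> from lhotse.features.kaldi.layers import Wav2FFT
>>> fft = Wav2FFT()
>>> waveform = torch.randn(1, 16000)
>>> remainder = None
>>> feats = []
>>> for begin in range(0, 16000, 1600):  # 0.1 s chunks at 16 kHz (illustrative)
>>>     out, remainder = fft.online_inference(waveform[:, begin : begin + 1600], context=remainder)
>>>     feats.append(out)
>>> spectrum = torch.cat(feats, dim=1)  # frames concatenated across chunks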

T_destination = ~T_destination
add_module(name, module)

Add a child module to the current module.

The module can be accessed as an attribute using the given name.

Return type:

None

Args:

name (str): name of the child module. The child module can be accessed from this module using the given name.

module (Module): child module to be added to the module.

apply(fn)

Apply fn recursively to every submodule (as returned by .children()) as well as self.

Typical use includes initializing the parameters of a model (see also nn-init-doc).

Return type:

TypeVar(T, bound= Module)

Args:

fn (Module -> None): function to be applied to each submodule

Returns:

Module: self

Example:

>>> @torch.no_grad()
>>> def init_weights(m):
>>>     print(m)
>>>     if type(m) == nn.Linear:
>>>         m.weight.fill_(1.0)
>>>         print(m.weight)
>>> net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
>>> net.apply(init_weights)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
bfloat16()

Casts all floating point parameters and buffers to bfloat16 datatype.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Returns:

Module: self

buffers(recurse=True)

Return an iterator over module buffers.

Return type:

Iterator[Tensor]

Args:

recurse (bool): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module.

Yields:

torch.Tensor: module buffer

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> for buf in model.buffers():
>>>     print(type(buf), buf.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
call_super_init: bool = False
children()

Return an iterator over immediate children modules.

Return type:

Iterator[Module]

Yields:

Module: a child module

compile(*args, **kwargs)

Compile this Module’s forward using torch.compile().

This Module’s __call__ method is compiled and all arguments are passed as-is to torch.compile().

See torch.compile() for details on the arguments for this function.

cpu()

Move all model parameters and buffers to the CPU.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Returns:

Module: self

cuda(device=None)

Move all model parameters and buffers to the GPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on GPU while being optimized.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Args:

device (int, optional): if specified, all parameters will be copied to that device.

Returns:

Module: self

double()

Casts all floating point parameters and buffers to double datatype.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Returns:

Module: self

dump_patches: bool = False
eval()

Set the module in evaluation mode.

This has an effect only on certain modules. See the documentation of particular modules for details of their behavior in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

This is equivalent to self.train(False).

See locally-disable-grad-doc for a comparison between .eval() and several similar mechanisms that may be confused with it.

Return type:

TypeVar(T, bound= Module)

Returns:

Module: self

extra_repr()

Set the extra representation of the module.

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

Return type:

str

float()

Casts all floating point parameters and buffers to float datatype.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Returns:

Module: self

get_buffer(target)

Return the buffer given by target if it exists, otherwise throw an error.

See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.

Return type:

Tensor

Args:

target: The fully-qualified string name of the buffer to look for. (See get_submodule for how to specify a fully-qualified string.)

Returns:

torch.Tensor: The buffer referenced by target

Raises:

AttributeError: If the target string references an invalid path or resolves to something that is not a buffer

get_extra_state()

Return any extra state to include in the module’s state_dict.

Implement this and a corresponding set_extra_state() for your module if you need to store extra state. This function is called when building the module’s state_dict().

Note that extra state should be picklable to ensure working serialization of the state_dict. We only provide backwards compatibility guarantees for serializing Tensors; other objects may break backwards compatibility if their serialized pickled form changes.

Return type:

Any

Returns:

object: Any extra state to store in the module’s state_dict

get_parameter(target)

Return the parameter given by target if it exists, otherwise throw an error.

See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.

Return type:

Parameter

Args:

target: The fully-qualified string name of the Parameter to look for. (See get_submodule for how to specify a fully-qualified string.)

Returns:

torch.nn.Parameter: The Parameter referenced by target

Raises:

AttributeError: If the target string references an invalid path or resolves to something that is not an nn.Parameter

get_submodule(target)

Return the submodule given by target if it exists, otherwise throw an error.

For example, let’s say you have an nn.Module A that looks like this:

A(
    (net_b): Module(
        (net_c): Module(
            (conv): Conv2d(16, 33, kernel_size=(3, 3), stride=(2, 2))
        )
        (linear): Linear(in_features=100, out_features=200, bias=True)
    )
)

(The diagram shows an nn.Module A. A has a nested submodule net_b, which itself has two submodules net_c and linear. net_c then has a submodule conv.)

To check whether or not we have the linear submodule, we would call get_submodule("net_b.linear"). To check whether we have the conv submodule, we would call get_submodule("net_b.net_c.conv").

The runtime of get_submodule is bounded by the degree of module nesting in target. A query against named_modules achieves the same result, but it is O(N) in the number of transitive modules. So, for a simple check to see if some submodule exists, get_submodule should always be used.

Return type:

Module

Args:

target: The fully-qualified string name of the submodule to look for. (See above example for how to specify a fully-qualified string.)

Returns:

torch.nn.Module: The submodule referenced by target

Raises:

AttributeError: If the target string references an invalid path or resolves to something that is not an nn.Module

half()

Casts all floating point parameters and buffers to half datatype.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Returns:

Module: self

ipu(device=None)

Move all model parameters and buffers to the IPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on IPU while being optimized.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Args:

device (int, optional): if specified, all parameters will be copied to that device.

Returns:

Module: self

load_state_dict(state_dict, strict=True, assign=False)

Copy parameters and buffers from state_dict into this module and its descendants.

If strict is True, then the keys of state_dict must exactly match the keys returned by this module’s state_dict() function.

Warning

If assign is True the optimizer must be created after the call to load_state_dict.

Args:

state_dict (dict): a dict containing parameters and persistent buffers.

strict (bool, optional): whether to strictly enforce that the keys in state_dict match the keys returned by this module’s state_dict() function. Default: True

assign (bool, optional): whether to assign items in the state dictionary to their corresponding keys in the module instead of copying them in place into the module’s current parameters and buffers. When False, the properties of the tensors in the current module are preserved; when True, the properties of the Tensors in the state dict are preserved. Default: False

Returns:
NamedTuple with missing_keys and unexpected_keys fields:
  • missing_keys is a list of str containing the missing keys

  • unexpected_keys is a list of str containing the unexpected keys

Note:

If a parameter or buffer is registered as None and its corresponding key exists in state_dict, load_state_dict() will raise a RuntimeError.

modules()

Return an iterator over all modules in the network.

Return type:

Iterator[Module]

Yields:

Module: a module in the network

Note:

Duplicate modules are returned only once. In the following example, l will be returned only once.

Example:

>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.modules()):
...     print(idx, '->', m)

0 -> Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
1 -> Linear(in_features=2, out_features=2, bias=True)
named_buffers(prefix='', recurse=True, remove_duplicate=True)

Return an iterator over module buffers, yielding both the name of the buffer as well as the buffer itself.

Return type:

Iterator[Tuple[str, Tensor]]

Args:

prefix (str): prefix to prepend to all buffer names.

recurse (bool, optional): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module. Defaults to True.

remove_duplicate (bool, optional): whether to remove the duplicated buffers in the result. Defaults to True.

Yields:

(str, torch.Tensor): Tuple containing the name and buffer

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> for name, buf in self.named_buffers():
>>>     if name in ['running_var']:
>>>         print(buf.size())
named_children()

Return an iterator over immediate children modules, yielding both the name of the module as well as the module itself.

Return type:

Iterator[Tuple[str, Module]]

Yields:

(str, Module): Tuple containing a name and child module

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> for name, module in model.named_children():
>>>     if name in ['conv4', 'conv5']:
>>>         print(module)
named_modules(memo=None, prefix='', remove_duplicate=True)

Return an iterator over all modules in the network, yielding both the name of the module as well as the module itself.

Args:

memo: a memo to store the set of modules already added to the result

prefix: a prefix that will be added to the name of the module

remove_duplicate: whether to remove the duplicated module instances in the result or not

Yields:

(str, Module): Tuple of name and module

Note:

Duplicate modules are returned only once. In the following example, l will be returned only once.

Example:

>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.named_modules()):
...     print(idx, '->', m)

0 -> ('', Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
))
1 -> ('0', Linear(in_features=2, out_features=2, bias=True))
named_parameters(prefix='', recurse=True, remove_duplicate=True)

Return an iterator over module parameters, yielding both the name of the parameter as well as the parameter itself.

Return type:

Iterator[Tuple[str, Parameter]]

Args:

prefix (str): prefix to prepend to all parameter names.

recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.

remove_duplicate (bool, optional): whether to remove the duplicated parameters in the result. Defaults to True.

Yields:

(str, Parameter): Tuple containing the name and parameter

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> for name, param in self.named_parameters():
>>>     if name in ['bias']:
>>>         print(param.size())
parameters(recurse=True)

Return an iterator over module parameters.

This is typically passed to an optimizer.

Return type:

Iterator[Parameter]

Args:

recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.

Yields:

Parameter: module parameter

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> for param in model.parameters():
>>>     print(type(param), param.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
register_backward_hook(hook)

Register a backward hook on the module.

This function is deprecated in favor of register_full_backward_hook() and the behavior of this function will change in future versions.

Return type:

RemovableHandle

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

register_buffer(name, tensor, persistent=True)

Add a buffer to the module.

This is typically used to register a buffer that should not be considered a model parameter. For example, BatchNorm’s running_mean is not a parameter, but is part of the module’s state. Buffers, by default, are persistent and will be saved alongside parameters. This behavior can be changed by setting persistent to False. The only difference between a persistent buffer and a non-persistent buffer is that the latter will not be a part of this module’s state_dict.

Buffers can be accessed as attributes using given names.

Return type:

None

Args:

name (str): name of the buffer. The buffer can be accessed from this module using the given name.

tensor (Tensor or None): buffer to be registered. If None, then operations that run on buffers, such as cuda, are ignored, and the buffer is not included in the module’s state_dict.

persistent (bool): whether the buffer is part of this module’s state_dict.

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> self.register_buffer('running_mean', torch.zeros(num_features))
register_forward_hook(hook, *, prepend=False, with_kwargs=False, always_call=False)

Register a forward hook on the module.

The hook will be called every time after forward() has computed an output.

If with_kwargs is False or not specified, the input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks, only to the forward. The hook can modify the output. It can also modify the input in place, but this will not have an effect on forward, since the hook is called after forward() has run. The hook should have the following signature:

hook(module, args, output) -> None or modified output

If with_kwargs is True, the forward hook will be passed the kwargs given to the forward function and be expected to return the output possibly modified. The hook should have the following signature:

hook(module, args, kwargs, output) -> None or modified output
Return type:

RemovableHandle

Args:

hook (Callable): The user defined hook to be registered.

prepend (bool): If True, the provided hook will be fired before all existing forward hooks on this torch.nn.modules.Module. Otherwise, the provided hook will be fired after all existing forward hooks on this torch.nn.modules.Module. Note that global forward hooks registered with register_module_forward_hook() will fire before all hooks registered by this method. Default: False

with_kwargs (bool): If True, the hook will be passed the kwargs given to the forward function. Default: False

always_call (bool): If True, the hook will be run regardless of whether an exception is raised while calling the Module. Default: False

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

register_forward_pre_hook(hook, *, prepend=False, with_kwargs=False)

Register a forward pre-hook on the module.

The hook will be called every time before forward() is invoked.

If with_kwargs is false or not specified, the input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks, only to the forward. The hook can modify the input. The user can either return a tuple or a single modified value in the hook. We will wrap the value into a tuple if a single value is returned (unless that value is already a tuple). The hook should have the following signature:

hook(module, args) -> None or modified input

If with_kwargs is true, the forward pre-hook will be passed the kwargs given to the forward function. And if the hook modifies the input, both the args and kwargs should be returned. The hook should have the following signature:

hook(module, args, kwargs) -> None or a tuple of modified input and kwargs
Return type:

RemovableHandle

Args:

hook (Callable): The user defined hook to be registered.

prepend (bool): If true, the provided hook will be fired before all existing forward_pre hooks on this torch.nn.modules.Module. Otherwise, the provided hook will be fired after all existing forward_pre hooks on this torch.nn.modules.Module. Note that global forward_pre hooks registered with register_module_forward_pre_hook() will fire before all hooks registered by this method. Default: False

with_kwargs (bool): If true, the hook will be passed the kwargs given to the forward function. Default: False

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

register_full_backward_hook(hook, prepend=False)

Register a backward hook on the module.

The hook will be called every time the gradients with respect to a module are computed, i.e. the hook will execute if and only if the gradients with respect to module outputs are computed. The hook should have the following signature:

hook(module, grad_input, grad_output) -> tuple(Tensor) or None

The grad_input and grad_output are tuples that contain the gradients with respect to the inputs and outputs respectively. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the input that will be used in place of grad_input in subsequent computations. grad_input will only correspond to the inputs given as positional arguments and all kwarg arguments are ignored. Entries in grad_input and grad_output will be None for all non-Tensor arguments.

For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly the caller will receive a view of each Tensor returned by the Module’s forward function.

Return type:

RemovableHandle

Warning

Modifying inputs or outputs inplace is not allowed when using backward hooks and will raise an error.

Args:

hook (Callable): The user-defined hook to be registered.

prepend (bool): If true, the provided hook will be fired before all existing backward hooks on this torch.nn.modules.Module. Otherwise, the provided hook will be fired after all existing backward hooks on this torch.nn.modules.Module. Note that global backward hooks registered with register_module_full_backward_hook() will fire before all hooks registered by this method.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

register_full_backward_pre_hook(hook, prepend=False)

Register a backward pre-hook on the module.

The hook will be called every time the gradients for the module are computed. The hook should have the following signature:

hook(module, grad_output) -> tuple[Tensor] or None

The grad_output is a tuple. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the output that will be used in place of grad_output in subsequent computations. Entries in grad_output will be None for all non-Tensor arguments.

For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly the caller will receive a view of each Tensor returned by the Module’s forward function.

Return type:

RemovableHandle

Warning

Modifying inputs inplace is not allowed when using backward hooks and will raise an error.

Args:

hook (Callable): The user-defined hook to be registered.

prepend (bool): If true, the provided hook will be fired before all existing backward_pre hooks on this torch.nn.modules.Module. Otherwise, the provided hook will be fired after all existing backward_pre hooks on this torch.nn.modules.Module. Note that global backward_pre hooks registered with register_module_full_backward_pre_hook() will fire before all hooks registered by this method.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

register_load_state_dict_post_hook(hook)

Register a post hook to be run after module’s load_state_dict is called.

It should have the following signature:

hook(module, incompatible_keys) -> None

The module argument is the current module that this hook is registered on, and the incompatible_keys argument is a NamedTuple consisting of attributes missing_keys and unexpected_keys. missing_keys is a list of str containing the missing keys and unexpected_keys is a list of str containing the unexpected keys.

The given incompatible_keys can be modified inplace if needed.

Note that the checks performed when calling load_state_dict() with strict=True are affected by modifications the hook makes to missing_keys or unexpected_keys, as expected. Additions to either set of keys will result in an error being thrown when strict=True, and clearing out both missing and unexpected keys will avoid an error.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

register_module(name, module)

Alias for add_module().

Return type:

None

register_parameter(name, param)

Add a parameter to the module.

The parameter can be accessed as an attribute using given name.

Return type:

None

Args:

name (str): name of the parameter. The parameter can be accessed from this module using the given name.

param (Parameter or None): parameter to be added to the module. If None, then operations that run on parameters, such as cuda, are ignored, and the parameter is not included in the module’s state_dict.

register_state_dict_pre_hook(hook)

Register a pre-hook for the state_dict() method.

These hooks will be called with arguments: self, prefix, and keep_vars before calling state_dict on self. The registered hooks can be used to perform pre-processing before the state_dict call is made.

requires_grad_(requires_grad=True)

Change if autograd should record operations on parameters in this module.

This method sets the parameters’ requires_grad attributes in-place.

This method is helpful for freezing part of the module for finetuning or training parts of a model individually (e.g., GAN training).

See locally-disable-grad-doc for a comparison between .requires_grad_() and several similar mechanisms that may be confused with it.

Return type:

TypeVar(T, bound= Module)

Args:

requires_grad (bool): whether autograd should record operations on parameters in this module. Default: True.

Returns:

Module: self

set_extra_state(state)

Set extra state contained in the loaded state_dict.

This function is called from load_state_dict() to handle any extra state found within the state_dict. Implement this function and a corresponding get_extra_state() for your module if you need to store extra state within its state_dict.

Args:

state (dict): Extra state from the state_dict

share_memory()

See torch.Tensor.share_memory_().

Return type:

TypeVar(T, bound= Module)

state_dict(*args, destination=None, prefix='', keep_vars=False)

Return a dictionary containing references to the whole state of the module.

Both parameters and persistent buffers (e.g. running averages) are included. Keys are corresponding parameter and buffer names. Parameters and buffers set to None are not included.

Note

The returned object is a shallow copy. It contains references to the module’s parameters and buffers.

Warning

Currently state_dict() also accepts positional arguments for destination, prefix and keep_vars in order. However, this is being deprecated and keyword arguments will be enforced in future releases.

Warning

Please avoid the use of argument destination as it is not designed for end-users.

Args:

destination (dict, optional): If provided, the state of module will be updated into the dict and the same object is returned. Otherwise, an OrderedDict will be created and returned. Default: None.

prefix (str, optional): a prefix added to parameter and buffer names to compose the keys in state_dict. Default: ''.

keep_vars (bool, optional): by default the Tensors returned in the state dict are detached from autograd. If it’s set to True, detaching will not be performed. Default: False.

Returns:
dict:

a dictionary containing a whole state of the module

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> module.state_dict().keys()
['bias', 'weight']
to(*args, **kwargs)

Move and/or cast the parameters and buffers.

This can be called as

to(device=None, dtype=None, non_blocking=False)
to(dtype, non_blocking=False)
to(tensor, non_blocking=False)
to(memory_format=torch.channels_last)

Its signature is similar to torch.Tensor.to(), but only accepts floating point or complex dtypes. In addition, this method will only cast the floating point or complex parameters and buffers to dtype (if given). The integral parameters and buffers will be moved to device, if that is given, but with dtypes unchanged. When non_blocking is set, it tries to convert/move asynchronously with respect to the host if possible, e.g., moving CPU Tensors with pinned memory to CUDA devices.

See below for examples.

Note

This method modifies the module in-place.

Args:

device (torch.device): the desired device of the parameters and buffers in this module

dtype (torch.dtype): the desired floating point or complex dtype of the parameters and buffers in this module

tensor (torch.Tensor): Tensor whose dtype and device are the desired dtype and device for all parameters and buffers in this module

memory_format (torch.memory_format): the desired memory format for 4D parameters and buffers in this module (keyword only argument)

Returns:

Module: self

Examples:

>>> # xdoctest: +IGNORE_WANT("non-deterministic")
>>> linear = nn.Linear(2, 2)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]])
>>> linear.to(torch.double)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]], dtype=torch.float64)
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_CUDA1)
>>> gpu1 = torch.device("cuda:1")
>>> linear.to(gpu1, dtype=torch.half, non_blocking=True)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16, device='cuda:1')
>>> cpu = torch.device("cpu")
>>> linear.to(cpu)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16)

>>> linear = nn.Linear(2, 2, bias=None).to(torch.cdouble)
>>> linear.weight
Parameter containing:
tensor([[ 0.3741+0.j,  0.2382+0.j],
        [ 0.5593+0.j, -0.4443+0.j]], dtype=torch.complex128)
>>> linear(torch.ones(3, 2, dtype=torch.cdouble))
tensor([[0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j]], dtype=torch.complex128)
to_empty(*, device, recurse=True)

Move the parameters and buffers to the specified device without copying storage.

Return type:

TypeVar(T, bound= Module)

Args:
device (torch.device): The desired device of the parameters

and buffers in this module.

recurse (bool): Whether parameters and buffers of submodules should

be recursively moved to the specified device.

Returns:

Module: self

train(mode=True)

Set the module in training mode.

This has an effect only on certain modules. See the documentation of particular modules for details of their behavior in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

Return type:

TypeVar(T, bound= Module)

Args:
mode (bool): whether to set training mode (True) or evaluation

mode (False). Default: True.

Returns:

Module: self

type(dst_type)

Casts all parameters and buffers to dst_type.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Args:

dst_type (type or string): the desired type

Returns:

Module: self

xpu(device=None)

Move all model parameters and buffers to the XPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on XPU while being optimized.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Args:
device (int, optional): if specified, all parameters will be

copied to that device

Returns:

Module: self

zero_grad(set_to_none=True)

Reset gradients of all model parameters.

See similar function under torch.optim.Optimizer for more context.

Return type:

None

Args:
set_to_none (bool): instead of setting to zero, set the grads to None.

See torch.optim.Optimizer.zero_grad() for details.

training: bool
class lhotse.features.kaldi.layers.Wav2Spec(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, round_to_power_of_two=True, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=True, use_fft_mag=False)[source]

Apply standard Kaldi preprocessing (dithering, removing DC offset, pre-emphasis, etc.) on the input waveforms and compute their Short-Time Fourier Transform (STFT). The STFT is transformed either to a magnitude spectrum (use_fft_mag=True) or a power spectrum (use_fft_mag=False).

Example:

>>> x = torch.randn(1, 16000, dtype=torch.float32)
>>> x.shape
torch.Size([1, 16000])
>>> t = Wav2Spec()
>>> t(x).shape
torch.Size([1, 100, 257])

The input is a tensor of shape (batch_size, num_samples). The output is a tensor of shape (batch_size, num_frames, num_fft_bins). A sketch of the use_fft_mag switch follows below.
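
As a hedged illustration of the use_fft_mag switch described above, this sketch builds one transform per setting; the output shapes match and only the scaling of the spectrum differs:

>>> x = torch.randn(1, 16000, dtype=torch.float32)
>>> t_power = Wav2Spec(use_fft_mag=False)  # power spectrum (default)
>>> t_mag = Wav2Spec(use_fft_mag=True)     # magnitude spectrum
>>> t_power(x).shape == t_mag(x).shape
True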

__init__(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, round_to_power_of_two=True, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=True, use_fft_mag=False)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

T_destination = ~T_destination
add_module(name, module)

Add a child module to the current module.

The module can be accessed as an attribute using the given name.

Return type:

None

Args:
name (str): name of the child module. The child module can be

accessed from this module using the given name

module (Module): child module to be added to the module.

apply(fn)

Apply fn recursively to every submodule (as returned by .children()) as well as self.

Typical use includes initializing the parameters of a model (see also torch.nn.init).

Return type:

TypeVar(T, bound= Module)

Args:

fn (Module -> None): function to be applied to each submodule

Returns:

Module: self

Example:

>>> @torch.no_grad()
>>> def init_weights(m):
>>>     print(m)
>>>     if type(m) == nn.Linear:
>>>         m.weight.fill_(1.0)
>>>         print(m.weight)
>>> net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
>>> net.apply(init_weights)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
bfloat16()

Casts all floating point parameters and buffers to bfloat16 datatype.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Returns:

Module: self

buffers(recurse=True)

Return an iterator over module buffers.

Return type:

Iterator[Tensor]

Args:
recurse (bool): if True, then yields buffers of this module

and all submodules. Otherwise, yields only buffers that are direct members of this module.

Yields:

torch.Tensor: module buffer

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> for buf in model.buffers():
>>>     print(type(buf), buf.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
call_super_init: bool = False
children()

Return an iterator over immediate children modules.

Return type:

Iterator[Module]

Yields:

Module: a child module

compile(*args, **kwargs)

Compile this Module’s forward using torch.compile().

This Module’s __call__ method is compiled and all arguments are passed as-is to torch.compile().

See torch.compile() for details on the arguments for this function.
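
A minimal sketch is shown below; the mode argument is forwarded verbatim to torch.compile() and is used here purely as an illustration:

>>> # xdoctest: +SKIP("illustrative sketch; requires a torch.compile-capable setup")
>>> module.compile(mode="reduce-overhead")
>>> y = module(x)  # the first call triggers compilation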

cpu()

Move all model parameters and buffers to the CPU.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Returns:

Module: self

cuda(device=None)

Move all model parameters and buffers to the GPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on GPU while being optimized.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Args:
device (int, optional): if specified, all parameters will be

copied to that device

Returns:

Module: self

property dither: float
double()

Casts all floating point parameters and buffers to double datatype.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Returns:

Module: self

dump_patches: bool = False
eval()

Set the module in evaluation mode.

This has an effect only on certain modules. See the documentation of particular modules for details of their behavior in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

This is equivalent to self.train(False).

See the documentation on locally disabling gradient computation for a comparison between .eval() and several similar mechanisms that may be confused with it.

Return type:

TypeVar(T, bound= Module)

Returns:

Module: self

extra_repr()

Set the extra representation of the module.

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

Return type:

str

float()

Casts all floating point parameters and buffers to float datatype.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Returns:

Module: self

forward(x)

Define the computation performed at every call.

Should be overridden by all subclasses.

Return type:

Tensor

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

property frame_length: float
property frame_shift: float
get_buffer(target)

Return the buffer given by target if it exists, otherwise throw an error.

See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.

Return type:

Tensor

Args:
target: The fully-qualified string name of the buffer

to look for. (See get_submodule for how to specify a fully-qualified string.)

Returns:

torch.Tensor: The buffer referenced by target

Raises:
AttributeError: If the target string references an invalid

path or resolves to something that is not a buffer

get_extra_state()

Return any extra state to include in the module’s state_dict.

Implement this and a corresponding set_extra_state() for your module if you need to store extra state. This function is called when building the module’s state_dict().

Note that extra state should be picklable to ensure working serialization of the state_dict. We only provide backwards compatibility guarantees for serializing Tensors; other objects may break backwards compatibility if their serialized pickled form changes.

Return type:

Any

Returns:

object: Any extra state to store in the module’s state_dict

get_parameter(target)

Return the parameter given by target if it exists, otherwise throw an error.

See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.

Return type:

Parameter

Args:
target: The fully-qualified string name of the Parameter

to look for. (See get_submodule for how to specify a fully-qualified string.)

Returns:

torch.nn.Parameter: The Parameter referenced by target

Raises:
AttributeError: If the target string references an invalid

path or resolves to something that is not an nn.Parameter

get_submodule(target)

Return the submodule given by target if it exists, otherwise throw an error.

For example, let’s say you have an nn.Module A that looks like this:

A(
    (net_b): Module(
        (net_c): Module(
            (conv): Conv2d(16, 33, kernel_size=(3, 3), stride=(2, 2))
        )
        (linear): Linear(in_features=100, out_features=200, bias=True)
    )
)

(The diagram shows an nn.Module A. A has a nested submodule net_b, which itself has two submodules net_c and linear. net_c then has a submodule conv.)

To check whether or not we have the linear submodule, we would call get_submodule("net_b.linear"). To check whether we have the conv submodule, we would call get_submodule("net_b.net_c.conv").

The runtime of get_submodule is bounded by the degree of module nesting in target. A query against named_modules achieves the same result, but it is O(N) in the number of transitive modules. So, for a simple check to see if some submodule exists, get_submodule should always be used.

Return type:

Module

Args:
target: The fully-qualified string name of the submodule

to look for. (See above example for how to specify a fully-qualified string.)

Returns:

torch.nn.Module: The submodule referenced by target

Raises:
AttributeError: If the target string references an invalid

path or resolves to something that is not an nn.Module
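
A minimal usage sketch of the lookup described above (the names mirror the example diagram and are illustrative only):

>>> # xdoctest: +SKIP("illustrative sketch; `a` mirrors the diagram above")
>>> conv = a.get_submodule("net_b.net_c.conv")  # nested lookup
>>> try:
...     a.get_submodule("net_b.missing")
... except AttributeError:
...     print("no such submodule")
no such submodule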

half()

Casts all floating point parameters and buffers to half datatype.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Returns:

Module: self

ipu(device=None)

Move all model parameters and buffers to the IPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on IPU while being optimized.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Args:
device (int, optional): if specified, all parameters will be

copied to that device

Returns:

Module: self

load_state_dict(state_dict, strict=True, assign=False)

Copy parameters and buffers from state_dict into this module and its descendants.

If strict is True, then the keys of state_dict must exactly match the keys returned by this module’s state_dict() function.

Warning

If assign is True the optimizer must be created after the call to load_state_dict.

Args:
state_dict (dict): a dict containing parameters and

persistent buffers.

strict (bool, optional): whether to strictly enforce that the keys

in state_dict match the keys returned by this module’s state_dict() function. Default: True

assign (bool, optional): whether to assign items in the state

dictionary to their corresponding keys in the module instead of copying them inplace into the module’s current parameters and buffers. When False, the properties of the tensors in the current module are preserved while when True, the properties of the Tensors in the state dict are preserved. Default: False

Returns:
NamedTuple with missing_keys and unexpected_keys fields:
  • missing_keys is a list of str containing the missing keys

  • unexpected_keys is a list of str containing the unexpected keys

Note:

If a parameter or buffer is registered as None and its corresponding key exists in state_dict, load_state_dict() will raise a RuntimeError.
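
For example, a partial-loading sketch (checkpoint is a placeholder dict) that inspects which keys did not match:

>>> # xdoctest: +SKIP("illustrative sketch; `checkpoint` is a placeholder")
>>> result = module.load_state_dict(checkpoint, strict=False)
>>> result.missing_keys     # module keys absent from the checkpoint
>>> result.unexpected_keys  # checkpoint keys with no matching module entry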

modules()

Return an iterator over all modules in the network.

Return type:

Iterator[Module]

Yields:

Module: a module in the network

Note:

Duplicate modules are returned only once. In the following example, l will be returned only once.

Example:

>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.modules()):
...     print(idx, '->', m)

0 -> Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
1 -> Linear(in_features=2, out_features=2, bias=True)
named_buffers(prefix='', recurse=True, remove_duplicate=True)

Return an iterator over module buffers, yielding both the name of the buffer as well as the buffer itself.

Return type:

Iterator[Tuple[str, Tensor]]

Args:

prefix (str): prefix to prepend to all buffer names.

recurse (bool, optional): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module. Defaults to True.

remove_duplicate (bool, optional): whether to remove the duplicated buffers in the result. Defaults to True.

Yields:

(str, torch.Tensor): Tuple containing the name and buffer

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> for name, buf in self.named_buffers():
>>>     if name in ['running_var']:
>>>         print(buf.size())
named_children()

Return an iterator over immediate children modules, yielding both the name of the module as well as the module itself.

Return type:

Iterator[Tuple[str, Module]]

Yields:

(str, Module): Tuple containing a name and child module

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> for name, module in model.named_children():
>>>     if name in ['conv4', 'conv5']:
>>>         print(module)
named_modules(memo=None, prefix='', remove_duplicate=True)

Return an iterator over all modules in the network, yielding both the name of the module as well as the module itself.

Args:

memo: a memo to store the set of modules already added to the result

prefix: a prefix that will be added to the name of the module

remove_duplicate: whether to remove the duplicated module instances in the result or not

Yields:

(str, Module): Tuple of name and module

Note:

Duplicate modules are returned only once. In the following example, l will be returned only once.

Example:

>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.named_modules()):
...     print(idx, '->', m)

0 -> ('', Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
))
1 -> ('0', Linear(in_features=2, out_features=2, bias=True))
named_parameters(prefix='', recurse=True, remove_duplicate=True)

Return an iterator over module parameters, yielding both the name of the parameter as well as the parameter itself.

Return type:

Iterator[Tuple[str, Parameter]]

Args:

prefix (str): prefix to prepend to all parameter names.

recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.

remove_duplicate (bool, optional): whether to remove the duplicated parameters in the result. Defaults to True.

Yields:

(str, Parameter): Tuple containing the name and parameter

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> for name, param in self.named_parameters():
>>>     if name in ['bias']:
>>>         print(param.size())
online_inference(x, context=None)
Return type:

Tuple[Tensor, Tensor]
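
No description is provided for this method; judging by the signature alone, a plausible streaming sketch is shown below, where context carries state between consecutive chunks. This reading is an assumption, not documented behavior:

>>> # xdoctest: +SKIP("illustrative sketch; semantics of `context` are assumed")
>>> t = Wav2Spec()
>>> context = None
>>> for chunk in waveform_chunks:  # hypothetical iterable of (1, num_samples) tensors
...     feats, context = t.online_inference(chunk, context=context)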

parameters(recurse=True)

Return an iterator over module parameters.

This is typically passed to an optimizer.

Return type:

Iterator[Parameter]

Args:
recurse (bool): if True, then yields parameters of this module

and all submodules. Otherwise, yields only parameters that are direct members of this module.

Yields:

Parameter: module parameter

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> for param in model.parameters():
>>>     print(type(param), param.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
property preemph_coeff: float
register_backward_hook(hook)

Register a backward hook on the module.

This function is deprecated in favor of register_full_backward_hook() and the behavior of this function will change in future versions.

Return type:

RemovableHandle

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

register_buffer(name, tensor, persistent=True)

Add a buffer to the module.

This is typically used to register a buffer that should not be considered a model parameter. For example, BatchNorm’s running_mean is not a parameter, but is part of the module’s state. Buffers, by default, are persistent and will be saved alongside parameters. This behavior can be changed by setting persistent to False. The only difference between a persistent buffer and a non-persistent buffer is that the latter will not be a part of this module’s state_dict.

Buffers can be accessed as attributes using given names.

Return type:

None

Args:
name (str): name of the buffer. The buffer can be accessed

from this module using the given name

tensor (Tensor or None): buffer to be registered. If None, then operations

that run on buffers, such as cuda, are ignored. If None, the buffer is not included in the module’s state_dict.

persistent (bool): whether the buffer is part of this module’s

state_dict.

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> self.register_buffer('running_mean', torch.zeros(num_features))
register_forward_hook(hook, *, prepend=False, with_kwargs=False, always_call=False)

Register a forward hook on the module.

The hook will be called every time after forward() has computed an output.

If with_kwargs is False or not specified, the input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks and only to the forward. The hook can modify the output. It can modify the input inplace but it will not have an effect on forward since this is called after forward() is called. The hook should have the following signature:

hook(module, args, output) -> None or modified output

If with_kwargs is True, the forward hook will be passed the kwargs given to the forward function and be expected to return the output possibly modified. The hook should have the following signature:

hook(module, args, kwargs, output) -> None or modified output
Return type:

RemovableHandle

Args:

hook (Callable): The user defined hook to be registered.

prepend (bool): If True, the provided hook will be fired before all existing forward hooks on this torch.nn.modules.Module. Otherwise, the provided hook will be fired after all existing forward hooks on this torch.nn.modules.Module. Note that global forward hooks registered with register_module_forward_hook() will fire before all hooks registered by this method. Default: False

with_kwargs (bool): If True, the hook will be passed the

kwargs given to the forward function. Default: False

always_call (bool): If True the hook will be run regardless of

whether an exception is raised while calling the Module. Default: False

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()
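
A minimal sketch of the hook(module, args, output) signature described above, removed again via the returned handle:

>>> # xdoctest: +SKIP("illustrative sketch")
>>> def log_shape(module, args, output):
...     print(type(module).__name__, tuple(output.shape))
>>> handle = module.register_forward_hook(log_shape)
>>> _ = module(x)   # prints e.g. "Wav2Spec (1, 100, 257)"
>>> handle.remove() # stop observing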

register_forward_pre_hook(hook, *, prepend=False, with_kwargs=False)

Register a forward pre-hook on the module.

The hook will be called every time before forward() is invoked.

If with_kwargs is false or not specified, the input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks and only to the forward. The hook can modify the input. User can either return a tuple or a single modified value in the hook. We will wrap the value into a tuple if a single value is returned (unless that value is already a tuple). The hook should have the following signature:

hook(module, args) -> None or modified input

If with_kwargs is true, the forward pre-hook will be passed the kwargs given to the forward function. And if the hook modifies the input, both the args and kwargs should be returned. The hook should have the following signature:

hook(module, args, kwargs) -> None or a tuple of modified input and kwargs
Return type:

RemovableHandle

Args:

hook (Callable): The user defined hook to be registered.

prepend (bool): If true, the provided hook will be fired before all existing forward_pre hooks on this torch.nn.modules.Module. Otherwise, the provided hook will be fired after all existing forward_pre hooks on this torch.nn.modules.Module. Note that global forward_pre hooks registered with register_module_forward_pre_hook() will fire before all hooks registered by this method. Default: False

with_kwargs (bool): If true, the hook will be passed the kwargs

given to the forward function. Default: False

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

register_full_backward_hook(hook, prepend=False)

Register a backward hook on the module.

The hook will be called every time the gradients with respect to a module are computed, i.e. the hook will execute if and only if the gradients with respect to module outputs are computed. The hook should have the following signature:

hook(module, grad_input, grad_output) -> tuple(Tensor) or None

The grad_input and grad_output are tuples that contain the gradients with respect to the inputs and outputs respectively. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the input that will be used in place of grad_input in subsequent computations. grad_input will only correspond to the inputs given as positional arguments and all kwarg arguments are ignored. Entries in grad_input and grad_output will be None for all non-Tensor arguments.

For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly the caller will receive a view of each Tensor returned by the Module’s forward function.

Return type:

RemovableHandle

Warning

Modifying inputs or outputs inplace is not allowed when using backward hooks and will raise an error.

Args:

hook (Callable): The user-defined hook to be registered.

prepend (bool): If true, the provided hook will be fired before all existing backward hooks on this torch.nn.modules.Module. Otherwise, the provided hook will be fired after all existing backward hooks on this torch.nn.modules.Module. Note that global backward hooks registered with register_module_full_backward_hook() will fire before all hooks registered by this method.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

register_full_backward_pre_hook(hook, prepend=False)

Register a backward pre-hook on the module.

The hook will be called every time the gradients for the module are computed. The hook should have the following signature:

hook(module, grad_output) -> tuple[Tensor] or None

The grad_output is a tuple. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the output that will be used in place of grad_output in subsequent computations. Entries in grad_output will be None for all non-Tensor arguments.

For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly the caller will receive a view of each Tensor returned by the Module’s forward function.

Return type:

RemovableHandle

Warning

Modifying inputs inplace is not allowed when using backward hooks and will raise an error.

Args:

hook (Callable): The user-defined hook to be registered.

prepend (bool): If true, the provided hook will be fired before all existing backward_pre hooks on this torch.nn.modules.Module. Otherwise, the provided hook will be fired after all existing backward_pre hooks on this torch.nn.modules.Module. Note that global backward_pre hooks registered with register_module_full_backward_pre_hook() will fire before all hooks registered by this method.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

register_load_state_dict_post_hook(hook)

Register a post hook to be run after module’s load_state_dict is called.

It should have the following signature:

hook(module, incompatible_keys) -> None

The module argument is the current module that this hook is registered on, and the incompatible_keys argument is a NamedTuple consisting of attributes missing_keys and unexpected_keys. missing_keys is a list of str containing the missing keys and unexpected_keys is a list of str containing the unexpected keys.

The given incompatible_keys can be modified inplace if needed.

Note that the checks performed when calling load_state_dict() with strict=True are affected by modifications the hook makes to missing_keys or unexpected_keys, as expected. Additions to either set of keys will result in an error being thrown when strict=True, and clearing out both missing and unexpected keys will avoid an error.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

register_module(name, module)

Alias for add_module().

Return type:

None

register_parameter(name, param)

Add a parameter to the module.

The parameter can be accessed as an attribute using given name.

Return type:

None

Args:
name (str): name of the parameter. The parameter can be accessed

from this module using the given name

param (Parameter or None): parameter to be added to the module. If

None, then operations that run on parameters, such as cuda, are ignored. If None, the parameter is not included in the module’s state_dict.

register_state_dict_pre_hook(hook)

Register a pre-hook for the state_dict() method.

These hooks will be called with arguments: self, prefix, and keep_vars before calling state_dict on self. The registered hooks can be used to perform pre-processing before the state_dict call is made.

property remove_dc_offset: bool
requires_grad_(requires_grad=True)

Change if autograd should record operations on parameters in this module.

This method sets the parameters’ requires_grad attributes in-place.

This method is helpful for freezing part of the module for finetuning or training parts of a model individually (e.g., GAN training).

See the documentation on locally disabling gradient computation for a comparison between .requires_grad_() and several similar mechanisms that may be confused with it.

Return type:

TypeVar(T, bound= Module)

Args:
requires_grad (bool): whether autograd should record operations on

parameters in this module. Default: True.

Returns:

Module: self
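
For instance, a freezing sketch, assuming a hypothetical model with a feature_extractor to freeze and a head to keep training:

>>> # xdoctest: +SKIP("illustrative sketch; `model` and its submodules are placeholders")
>>> model.feature_extractor.requires_grad_(False)  # freeze these parameters
>>> model.head.requires_grad_(True)                # keep training the head
>>> trainable = [p for p in model.parameters() if p.requires_grad]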

property sampling_rate: int
set_extra_state(state)

Set extra state contained in the loaded state_dict.

This function is called from load_state_dict() to handle any extra state found within the state_dict. Implement this function and a corresponding get_extra_state() for your module if you need to store extra state within its state_dict.

Args:

state (dict): Extra state from the state_dict

share_memory()

See torch.Tensor.share_memory_().

Return type:

TypeVar(T, bound= Module)

state_dict(*args, destination=None, prefix='', keep_vars=False)

Return a dictionary containing references to the whole state of the module.

Both parameters and persistent buffers (e.g. running averages) are included. Keys are corresponding parameter and buffer names. Parameters and buffers set to None are not included.

Note

The returned object is a shallow copy. It contains references to the module’s parameters and buffers.

Warning

Currently state_dict() also accepts positional arguments for destination, prefix and keep_vars in order. However, this is being deprecated and keyword arguments will be enforced in future releases.

Warning

Please avoid the use of argument destination as it is not designed for end-users.

Args:
destination (dict, optional): If provided, the state of the module will

be updated into the dict and the same object is returned. Otherwise, an OrderedDict will be created and returned. Default: None.

prefix (str, optional): a prefix added to parameter and buffer

names to compose the keys in state_dict. Default: ''.

keep_vars (bool, optional): by default the Tensors

returned in the state dict are detached from autograd. If it’s set to True, detaching will not be performed. Default: False.

Returns:
dict:

a dictionary containing a whole state of the module

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> module.state_dict().keys()
['bias', 'weight']
to(*args, **kwargs)

Move and/or cast the parameters and buffers.

This can be called as

to(device=None, dtype=None, non_blocking=False)
to(dtype, non_blocking=False)
to(tensor, non_blocking=False)
to(memory_format=torch.channels_last)

Its signature is similar to torch.Tensor.to(), but only accepts floating point or complex dtypes. In addition, this method will only cast the floating point or complex parameters and buffers to dtype (if given). The integral parameters and buffers will be moved to device, if that is given, but with dtypes unchanged. When non_blocking is set, it tries to convert/move asynchronously with respect to the host if possible, e.g., moving CPU Tensors with pinned memory to CUDA devices.

See below for examples.

Note

This method modifies the module in-place.

Args:
device (torch.device): the desired device of the parameters

and buffers in this module

dtype (torch.dtype): the desired floating point or complex dtype of

the parameters and buffers in this module

tensor (torch.Tensor): Tensor whose dtype and device are the desired

dtype and device for all parameters and buffers in this module

memory_format (torch.memory_format): the desired memory

format for 4D parameters and buffers in this module (keyword only argument)

Returns:

Module: self

Examples:

>>> # xdoctest: +IGNORE_WANT("non-deterministic")
>>> linear = nn.Linear(2, 2)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]])
>>> linear.to(torch.double)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]], dtype=torch.float64)
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_CUDA1)
>>> gpu1 = torch.device("cuda:1")
>>> linear.to(gpu1, dtype=torch.half, non_blocking=True)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16, device='cuda:1')
>>> cpu = torch.device("cpu")
>>> linear.to(cpu)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16)

>>> linear = nn.Linear(2, 2, bias=None).to(torch.cdouble)
>>> linear.weight
Parameter containing:
tensor([[ 0.3741+0.j,  0.2382+0.j],
        [ 0.5593+0.j, -0.4443+0.j]], dtype=torch.complex128)
>>> linear(torch.ones(3, 2, dtype=torch.cdouble))
tensor([[0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j]], dtype=torch.complex128)
to_empty(*, device, recurse=True)

Move the parameters and buffers to the specified device without copying storage.

Return type:

TypeVar(T, bound= Module)

Args:
device (torch.device): The desired device of the parameters

and buffers in this module.

recurse (bool): Whether parameters and buffers of submodules should

be recursively moved to the specified device.

Returns:

Module: self

train(mode=True)

Set the module in training mode.

This has an effect only on certain modules. See the documentation of particular modules for details of their behavior in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

Return type:

TypeVar(T, bound= Module)

Args:
mode (bool): whether to set training mode (True) or evaluation

mode (False). Default: True.

Returns:

Module: self

type(dst_type)

Casts all parameters and buffers to dst_type.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Args:

dst_type (type or string): the desired type

Returns:

Module: self

property window_type: str
xpu(device=None)

Move all model parameters and buffers to the XPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on XPU while being optimized.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Args:
device (int, optional): if specified, all parameters will be

copied to that device

Returns:

Module: self

zero_grad(set_to_none=True)

Reset gradients of all model parameters.

See similar function under torch.optim.Optimizer for more context.

Return type:

None

Args:
set_to_none (bool): instead of setting to zero, set the grads to None.

See torch.optim.Optimizer.zero_grad() for details.

training: bool
class lhotse.features.kaldi.layers.Wav2LogSpec(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, round_to_power_of_two=True, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=True, use_fft_mag=False)[source]

Apply standard Kaldi preprocessing (dithering, removing DC offset, pre-emphasis, etc.) on the input waveforms and compute their Short-Time Fourier Transform (STFT). The STFT is transformed either to a log-magnitude spectrum (use_fft_mag=True) or a log-power spectrum (use_fft_mag=False).

Example:

>>> x = torch.randn(1, 16000, dtype=torch.float32)
>>> x.shape
torch.Size([1, 16000])
>>> t = Wav2LogSpec()
>>> t(x).shape
torch.Size([1, 100, 257])

The input is a tensor of shape (batch_size, num_samples). The output is a tensor of shape (batch_size, num_frames, num_fft_bins). A sketch of the snip_edges option follows below.
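
As a hedged illustration of the snip_edges option, the sketch below assumes the standard Kaldi convention of keeping only fully-contained frames, i.e. 1 + (16000 - 400) // 160 = 98 frames for one second of 16 kHz audio:

>>> x = torch.randn(1, 16000, dtype=torch.float32)
>>> Wav2LogSpec(snip_edges=True)(x).shape
torch.Size([1, 98, 257])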

__init__(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, round_to_power_of_two=True, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=True, use_fft_mag=False)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

T_destination = ~T_destination
add_module(name, module)

Add a child module to the current module.

The module can be accessed as an attribute using the given name.

Return type:

None

Args:
name (str): name of the child module. The child module can be

accessed from this module using the given name

module (Module): child module to be added to the module.

apply(fn)

Apply fn recursively to every submodule (as returned by .children()) as well as self.

Typical use includes initializing the parameters of a model (see also torch.nn.init).

Return type:

TypeVar(T, bound= Module)

Args:

fn (Module -> None): function to be applied to each submodule

Returns:

Module: self

Example:

>>> @torch.no_grad()
>>> def init_weights(m):
>>>     print(m)
>>>     if type(m) == nn.Linear:
>>>         m.weight.fill_(1.0)
>>>         print(m.weight)
>>> net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
>>> net.apply(init_weights)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
bfloat16()

Casts all floating point parameters and buffers to bfloat16 datatype.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Returns:

Module: self

buffers(recurse=True)

Return an iterator over module buffers.

Return type:

Iterator[Tensor]

Args:
recurse (bool): if True, then yields buffers of this module

and all submodules. Otherwise, yields only buffers that are direct members of this module.

Yields:

torch.Tensor: module buffer

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> for buf in model.buffers():
>>>     print(type(buf), buf.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
call_super_init: bool = False
children()

Return an iterator over immediate children modules.

Return type:

Iterator[Module]

Yields:

Module: a child module

compile(*args, **kwargs)

Compile this Module’s forward using torch.compile().

This Module’s __call__ method is compiled and all arguments are passed as-is to torch.compile().

See torch.compile() for details on the arguments for this function.

cpu()

Move all model parameters and buffers to the CPU.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Returns:

Module: self

cuda(device=None)

Move all model parameters and buffers to the GPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on GPU while being optimized.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Args:
device (int, optional): if specified, all parameters will be

copied to that device

Returns:

Module: self

property dither: float
double()

Casts all floating point parameters and buffers to double datatype.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Returns:

Module: self

dump_patches: bool = False
eval()

Set the module in evaluation mode.

This has an effect only on certain modules. See the documentation of particular modules for details of their behavior in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

This is equivalent to self.train(False).

See the documentation on locally disabling gradient computation for a comparison between .eval() and several similar mechanisms that may be confused with it.

Return type:

TypeVar(T, bound= Module)

Returns:

Module: self

extra_repr()

Set the extra representation of the module.

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

Return type:

str

float()

Casts all floating point parameters and buffers to float datatype.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Returns:

Module: self

forward(x)

Define the computation performed at every call.

Should be overridden by all subclasses.

Return type:

Tensor

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

property frame_length: float
property frame_shift: float
get_buffer(target)

Return the buffer given by target if it exists, otherwise throw an error.

See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.

Return type:

Tensor

Args:
target: The fully-qualified string name of the buffer

to look for. (See get_submodule for how to specify a fully-qualified string.)

Returns:

torch.Tensor: The buffer referenced by target

Raises:
AttributeError: If the target string references an invalid

path or resolves to something that is not a buffer

get_extra_state()

Return any extra state to include in the module’s state_dict.

Implement this and a corresponding set_extra_state() for your module if you need to store extra state. This function is called when building the module’s state_dict().

Note that extra state should be picklable to ensure working serialization of the state_dict. We only provide backwards compatibility guarantees for serializing Tensors; other objects may break backwards compatibility if their serialized pickled form changes.

Return type:

Any

Returns:

object: Any extra state to store in the module’s state_dict

get_parameter(target)

Return the parameter given by target if it exists, otherwise throw an error.

See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.

Return type:

Parameter

Args:
target: The fully-qualified string name of the Parameter

to look for. (See get_submodule for how to specify a fully-qualified string.)

Returns:

torch.nn.Parameter: The Parameter referenced by target

Raises:
AttributeError: If the target string references an invalid

path or resolves to something that is not an nn.Parameter

get_submodule(target)

Return the submodule given by target if it exists, otherwise throw an error.

For example, let’s say you have an nn.Module A that looks like this:

A(
    (net_b): Module(
        (net_c): Module(
            (conv): Conv2d(16, 33, kernel_size=(3, 3), stride=(2, 2))
        )
        (linear): Linear(in_features=100, out_features=200, bias=True)
    )
)

(The diagram shows an nn.Module A. A has a nested submodule net_b, which itself has two submodules net_c and linear. net_c then has a submodule conv.)

To check whether or not we have the linear submodule, we would call get_submodule("net_b.linear"). To check whether we have the conv submodule, we would call get_submodule("net_b.net_c.conv").

The runtime of get_submodule is bounded by the degree of module nesting in target. A query against named_modules achieves the same result, but it is O(N) in the number of transitive modules. So, for a simple check to see if some submodule exists, get_submodule should always be used.

Return type:

Module

Args:
target: The fully-qualified string name of the submodule

to look for. (See above example for how to specify a fully-qualified string.)

Returns:

torch.nn.Module: The submodule referenced by target

Raises:
AttributeError: If the target string references an invalid

path or resolves to something that is not an nn.Module

half()

Casts all floating point parameters and buffers to half datatype.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Returns:

Module: self

ipu(device=None)

Move all model parameters and buffers to the IPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on IPU while being optimized.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Args:
device (int, optional): if specified, all parameters will be

copied to that device

Returns:

Module: self

load_state_dict(state_dict, strict=True, assign=False)

Copy parameters and buffers from state_dict into this module and its descendants.

If strict is True, then the keys of state_dict must exactly match the keys returned by this module’s state_dict() function.

Warning

If assign is True the optimizer must be created after the call to load_state_dict.

Args:
state_dict (dict): a dict containing parameters and

persistent buffers.

strict (bool, optional): whether to strictly enforce that the keys

in state_dict match the keys returned by this module’s state_dict() function. Default: True

assign (bool, optional): whether to assign items in the state

dictionary to their corresponding keys in the module instead of copying them inplace into the module’s current parameters and buffers. When False, the properties of the tensors in the current module are preserved while when True, the properties of the Tensors in the state dict are preserved. Default: False

Returns:
NamedTuple with missing_keys and unexpected_keys fields:
  • missing_keys is a list of str containing the missing keys

  • unexpected_keys is a list of str containing the unexpected keys

Note:

If a parameter or buffer is registered as None and its corresponding key exists in state_dict, load_state_dict() will raise a RuntimeError.

modules()

Return an iterator over all modules in the network.

Return type:

Iterator[Module]

Yields:

Module: a module in the network

Note:

Duplicate modules are returned only once. In the following example, l will be returned only once.

Example:

>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.modules()):
...     print(idx, '->', m)

0 -> Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
1 -> Linear(in_features=2, out_features=2, bias=True)
named_buffers(prefix='', recurse=True, remove_duplicate=True)

Return an iterator over module buffers, yielding both the name of the buffer as well as the buffer itself.

Return type:

Iterator[Tuple[str, Tensor]]

Args:

prefix (str): prefix to prepend to all buffer names.

recurse (bool, optional): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module. Defaults to True.

remove_duplicate (bool, optional): whether to remove the duplicated buffers in the result. Defaults to True.

Yields:

(str, torch.Tensor): Tuple containing the name and buffer

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> for name, buf in self.named_buffers():
>>>     if name in ['running_var']:
>>>         print(buf.size())
named_children()

Return an iterator over immediate children modules, yielding both the name of the module as well as the module itself.

Return type:

Iterator[Tuple[str, Module]]

Yields:

(str, Module): Tuple containing a name and child module

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> for name, module in model.named_children():
>>>     if name in ['conv4', 'conv5']:
>>>         print(module)
named_modules(memo=None, prefix='', remove_duplicate=True)

Return an iterator over all modules in the network, yielding both the name of the module as well as the module itself.

Args:

memo: a memo to store the set of modules already added to the result

prefix: a prefix that will be added to the name of the module

remove_duplicate: whether to remove the duplicated module instances in the result or not

Yields:

(str, Module): Tuple of name and module

Note:

Duplicate modules are returned only once. In the following example, l will be returned only once.

Example:

>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.named_modules()):
...     print(idx, '->', m)

0 -> ('', Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
))
1 -> ('0', Linear(in_features=2, out_features=2, bias=True))
named_parameters(prefix='', recurse=True, remove_duplicate=True)

Return an iterator over module parameters, yielding both the name of the parameter as well as the parameter itself.

Return type:

Iterator[Tuple[str, Parameter]]

Args:

prefix (str): prefix to prepend to all parameter names.

recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.

remove_duplicate (bool, optional): whether to remove the duplicated parameters in the result. Defaults to True.

Yields:

(str, Parameter): Tuple containing the name and parameter

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> for name, param in self.named_parameters():
>>>     if name in ['bias']:
>>>         print(param.size())
online_inference(x, context=None)
Return type:

Tuple[Tensor, Tensor]

parameters(recurse=True)

Return an iterator over module parameters.

This is typically passed to an optimizer.

Return type:

Iterator[Parameter]

Args:
recurse (bool): if True, then yields parameters of this module

and all submodules. Otherwise, yields only parameters that are direct members of this module.

Yields:

Parameter: module parameter

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> for param in model.parameters():
>>>     print(type(param), param.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
property preemph_coeff: float
register_backward_hook(hook)

Register a backward hook on the module.

This function is deprecated in favor of register_full_backward_hook() and the behavior of this function will change in future versions.

Return type:

RemovableHandle

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

register_buffer(name, tensor, persistent=True)

Add a buffer to the module.

This is typically used to register a buffer that should not be considered a model parameter. For example, BatchNorm’s running_mean is not a parameter, but is part of the module’s state. Buffers, by default, are persistent and will be saved alongside parameters. This behavior can be changed by setting persistent to False. The only difference between a persistent buffer and a non-persistent buffer is that the latter will not be a part of this module’s state_dict.

Buffers can be accessed as attributes using given names.

Return type:

None

Args:
name (str): name of the buffer. The buffer can be accessed

from this module using the given name

tensor (Tensor or None): buffer to be registered. If None, then operations

that run on buffers, such as cuda, are ignored. If None, the buffer is not included in the module’s state_dict.

persistent (bool): whether the buffer is part of this module’s

state_dict.

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> self.register_buffer('running_mean', torch.zeros(num_features))
register_forward_hook(hook, *, prepend=False, with_kwargs=False, always_call=False)

Register a forward hook on the module.

The hook will be called every time after forward() has computed an output.

If with_kwargs is False or not specified, the input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks and only to the forward. The hook can modify the output. It can modify the input inplace but it will not have an effect on forward since this is called after forward() is called. The hook should have the following signature:

hook(module, args, output) -> None or modified output

If with_kwargs is True, the forward hook will be passed the kwargs given to the forward function and be expected to return the output possibly modified. The hook should have the following signature:

hook(module, args, kwargs, output) -> None or modified output
Return type:

RemovableHandle

Args:

hook (Callable): The user defined hook to be registered.

prepend (bool): If True, the provided hook will be fired before all existing forward hooks on this torch.nn.modules.Module. Otherwise, the provided hook will be fired after all existing forward hooks on this torch.nn.modules.Module. Note that global forward hooks registered with register_module_forward_hook() will fire before all hooks registered by this method. Default: False

with_kwargs (bool): If True, the hook will be passed the

kwargs given to the forward function. Default: False

always_call (bool): If True the hook will be run regardless of

whether an exception is raised while calling the Module. Default: False

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

register_forward_pre_hook(hook, *, prepend=False, with_kwargs=False)

Register a forward pre-hook on the module.

The hook will be called every time before forward() is invoked.

If with_kwargs is false or not specified, the input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks and only to the forward. The hook can modify the input. User can either return a tuple or a single modified value in the hook. We will wrap the value into a tuple if a single value is returned (unless that value is already a tuple). The hook should have the following signature:

hook(module, args) -> None or modified input

If with_kwargs is true, the forward pre-hook will be passed the kwargs given to the forward function. And if the hook modifies the input, both the args and kwargs should be returned. The hook should have the following signature:

hook(module, args, kwargs) -> None or a tuple of modified input and kwargs
Return type:

RemovableHandle

Args:

hook (Callable): The user defined hook to be registered.

prepend (bool): If true, the provided hook will be fired before all existing forward_pre hooks on this torch.nn.modules.Module. Otherwise, the provided hook will be fired after all existing forward_pre hooks on this torch.nn.modules.Module. Note that global forward_pre hooks registered with register_module_forward_pre_hook() will fire before all hooks registered by this method. Default: False

with_kwargs (bool): If true, the hook will be passed the kwargs

given to the forward function. Default: False

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

register_full_backward_hook(hook, prepend=False)

Register a backward hook on the module.

The hook will be called every time the gradients with respect to a module are computed, i.e. the hook will execute if and only if the gradients with respect to module outputs are computed. The hook should have the following signature:

hook(module, grad_input, grad_output) -> tuple(Tensor) or None

The grad_input and grad_output are tuples that contain the gradients with respect to the inputs and outputs respectively. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the input that will be used in place of grad_input in subsequent computations. grad_input will only correspond to the inputs given as positional arguments and all kwarg arguments are ignored. Entries in grad_input and grad_output will be None for all non-Tensor arguments.

For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly the caller will receive a view of each Tensor returned by the Module’s forward function.

Return type:

RemovableHandle

Warning

Modifying inputs or outputs inplace is not allowed when using backward hooks and will raise an error.

Args:

hook (Callable): The user-defined hook to be registered.

prepend (bool): If true, the provided hook will be fired before all existing backward hooks on this torch.nn.modules.Module. Otherwise, the provided hook will be fired after all existing backward hooks on this torch.nn.modules.Module. Note that global backward hooks registered with register_module_full_backward_hook() will fire before all hooks registered by this method.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

register_full_backward_pre_hook(hook, prepend=False)

Register a backward pre-hook on the module.

The hook will be called every time the gradients for the module are computed. The hook should have the following signature:

hook(module, grad_output) -> tuple[Tensor] or None

The grad_output is a tuple. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the output that will be used in place of grad_output in subsequent computations. Entries in grad_output will be None for all non-Tensor arguments.

For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly the caller will receive a view of each Tensor returned by the Module’s forward function.

Return type:

RemovableHandle

Warning

Modifying inputs inplace is not allowed when using backward hooks and will raise an error.

Args:

hook (Callable): The user-defined hook to be registered.

prepend (bool): If true, the provided hook will be fired before all existing backward_pre hooks on this torch.nn.modules.Module. Otherwise, the provided hook will be fired after all existing backward_pre hooks on this torch.nn.modules.Module. Note that global backward_pre hooks registered with register_module_full_backward_pre_hook() will fire before all hooks registered by this method.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

register_load_state_dict_post_hook(hook)

Register a post hook to be run after module’s load_state_dict is called.

It should have the following signature:

hook(module, incompatible_keys) -> None

The module argument is the current module that this hook is registered on, and the incompatible_keys argument is a NamedTuple consisting of attributes missing_keys and unexpected_keys. missing_keys is a list of str containing the missing keys and unexpected_keys is a list of str containing the unexpected keys.

The given incompatible_keys can be modified inplace if needed.

Note that the checks performed when calling load_state_dict() with strict=True are affected by modifications the hook makes to missing_keys or unexpected_keys, as expected. Additions to either set of keys will result in an error being thrown when strict=True, and clearing out both missing and unexpected keys will avoid an error.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()
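
A minimal sketch of such a post hook (the hook name is illustrative):

>>> from torch import nn
>>> def report_keys(module, incompatible_keys):
...     # missing_keys/unexpected_keys may be inspected or modified in-place here.
...     print(incompatible_keys.missing_keys, incompatible_keys.unexpected_keys)
>>> net = nn.Linear(2, 2)
>>> handle = net.register_load_state_dict_post_hook(report_keys)
>>> _ = net.load_state_dict(net.state_dict())
[] []
>>> handle.remove()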

register_module(name, module)

Alias for add_module().

Return type:

None

register_parameter(name, param)

Add a parameter to the module.

The parameter can be accessed as an attribute using given name.

Return type:

None

Args:

name (str): name of the parameter. The parameter can be accessed from this module using the given name.

param (Parameter or None): parameter to be added to the module. If None, then operations that run on parameters, such as cuda, are ignored, and the parameter is not included in the module’s state_dict.

register_state_dict_pre_hook(hook)

Register a pre-hook for the state_dict() method.

These hooks will be called with arguments: self, prefix, and keep_vars before calling state_dict on self. The registered hooks can be used to perform pre-processing before the state_dict call is made.

property remove_dc_offset: bool
requires_grad_(requires_grad=True)

Change if autograd should record operations on parameters in this module.

This method sets the parameters’ requires_grad attributes in-place.

This method is helpful for freezing part of the module for finetuning or training parts of a model individually (e.g., GAN training).

See locally-disable-grad-doc for a comparison between .requires_grad_() and several similar mechanisms that may be confused with it.

Return type:

TypeVar(T, bound= Module)

Args:

requires_grad (bool): whether autograd should record operations on parameters in this module. Default: True.

Returns:

Module: self
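
For instance, a common freezing pattern (a sketch; the two-layer model is illustrative):

>>> from torch import nn
>>> model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
>>> _ = model[0].requires_grad_(False)  # freeze the first layer
>>> [n for n, p in model.named_parameters() if p.requires_grad]
['1.weight', '1.bias']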

property sampling_rate: int
set_extra_state(state)

Set extra state contained in the loaded state_dict.

This function is called from load_state_dict() to handle any extra state found within the state_dict. Implement this function and a corresponding get_extra_state() for your module if you need to store extra state within its state_dict.

Args:

state (dict): Extra state from the state_dict

share_memory()

See torch.Tensor.share_memory_().

Return type:

TypeVar(T, bound= Module)

state_dict(*args, destination=None, prefix='', keep_vars=False)

Return a dictionary containing references to the whole state of the module.

Both parameters and persistent buffers (e.g. running averages) are included. Keys are corresponding parameter and buffer names. Parameters and buffers set to None are not included.

Note

The returned object is a shallow copy. It contains references to the module’s parameters and buffers.

Warning

Currently state_dict() also accepts positional arguments for destination, prefix and keep_vars in order. However, this is being deprecated and keyword arguments will be enforced in future releases.

Warning

Please avoid the use of argument destination as it is not designed for end-users.

Args:

destination (dict, optional): If provided, the state of module will be updated into the dict and the same object is returned. Otherwise, an OrderedDict will be created and returned. Default: None.

prefix (str, optional): a prefix added to parameter and buffer names to compose the keys in state_dict. Default: ''.

keep_vars (bool, optional): by default the Tensors returned in the state dict are detached from autograd. If it’s set to True, detaching will not be performed. Default: False.

Returns:
dict:

a dictionary containing a whole state of the module

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> module.state_dict().keys()
['bias', 'weight']
to(*args, **kwargs)

Move and/or cast the parameters and buffers.

This can be called as

to(device=None, dtype=None, non_blocking=False)
to(dtype, non_blocking=False)
to(tensor, non_blocking=False)
to(memory_format=torch.channels_last)

Its signature is similar to torch.Tensor.to(), but only accepts floating point or complex dtypes. In addition, this method will only cast the floating point or complex parameters and buffers to dtype (if given). The integral parameters and buffers will be moved to device, if that is given, but with dtypes unchanged. When non_blocking is set, it tries to convert/move asynchronously with respect to the host if possible, e.g., moving CPU Tensors with pinned memory to CUDA devices.

See below for examples.

Note

This method modifies the module in-place.

Args:

device (torch.device): the desired device of the parameters and buffers in this module

dtype (torch.dtype): the desired floating point or complex dtype of the parameters and buffers in this module

tensor (torch.Tensor): Tensor whose dtype and device are the desired dtype and device for all parameters and buffers in this module

memory_format (torch.memory_format): the desired memory format for 4D parameters and buffers in this module (keyword only argument)

Returns:

Module: self

Examples:

>>> # xdoctest: +IGNORE_WANT("non-deterministic")
>>> linear = nn.Linear(2, 2)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]])
>>> linear.to(torch.double)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]], dtype=torch.float64)
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_CUDA1)
>>> gpu1 = torch.device("cuda:1")
>>> linear.to(gpu1, dtype=torch.half, non_blocking=True)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16, device='cuda:1')
>>> cpu = torch.device("cpu")
>>> linear.to(cpu)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16)

>>> linear = nn.Linear(2, 2, bias=None).to(torch.cdouble)
>>> linear.weight
Parameter containing:
tensor([[ 0.3741+0.j,  0.2382+0.j],
        [ 0.5593+0.j, -0.4443+0.j]], dtype=torch.complex128)
>>> linear(torch.ones(3, 2, dtype=torch.cdouble))
tensor([[0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j]], dtype=torch.complex128)
to_empty(*, device, recurse=True)

Move the parameters and buffers to the specified device without copying storage.

Return type:

TypeVar(T, bound= Module)

Args:

device (torch.device): The desired device of the parameters and buffers in this module.

recurse (bool): Whether parameters and buffers of submodules should be recursively moved to the specified device.

Returns:

Module: self
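
A typical use, sketched under the assumption of a module built on the meta device:

>>> import torch
>>> from torch import nn
>>> with torch.device("meta"):
...     m = nn.Linear(8, 8)  # parameters have shapes but no real storage
>>> m = m.to_empty(device="cpu")  # allocate uninitialized storage on CPU
>>> m.weight.device
device(type='cpu')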

train(mode=True)

Set the module in training mode.

This has an effect only on certain modules. See the documentation of particular modules for details of their behavior in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

Return type:

TypeVar(T, bound= Module)

Args:

mode (bool): whether to set training mode (True) or evaluation mode (False). Default: True.

Returns:

Module: self

type(dst_type)

Casts all parameters and buffers to dst_type.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Args:

dst_type (type or string): the desired type

Returns:

Module: self

property window_type: str
xpu(device=None)

Move all model parameters and buffers to the XPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing optimizer if the module will live on XPU while being optimized.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Arguments:

device (int, optional): if specified, all parameters will be copied to that device

Returns:

Module: self

zero_grad(set_to_none=True)

Reset gradients of all model parameters.

See similar function under torch.optim.Optimizer for more context.

Return type:

None

Args:

set_to_none (bool): instead of setting to zero, set the grads to None. See torch.optim.Optimizer.zero_grad() for details.

training: bool
class lhotse.features.kaldi.layers.Wav2LogFilterBank(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, round_to_power_of_two=True, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=False, use_fft_mag=False, low_freq=20.0, high_freq=-400.0, num_filters=80, norm_filters=False, torchaudio_compatible_mel_scale=True)[source]

Apply standard Kaldi preprocessing (dithering, removing DC offset, pre-emphasis, etc.) on the input waveforms and compute their log-Mel filter bank energies (also known as “fbank”).

Example:

>>> x = torch.randn(1, 16000, dtype=torch.float32)
>>> x.shape
torch.Size([1, 16000])
>>> t = Wav2LogFilterBank()
>>> t(x).shape
torch.Size([1, 100, 80])

The input is a tensor of shape (batch_size, num_samples). The output is a tensor of shape (batch_size, num_frames, num_filters).
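
As a sketch of how the constructor arguments affect the output shape (the frame counts below assume the snip_edges=False default and may differ by a frame):

>>> import torch
>>> from lhotse.features.kaldi.layers import Wav2LogFilterBank
>>> x = torch.randn(2, 16000)  # two 1-second waveforms at 16 kHz
>>> t = Wav2LogFilterBank(num_filters=40, frame_shift=0.02)
>>> t(x).shape  # half as many frames, 40 mel bins
torch.Size([2, 50, 40])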

__init__(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, round_to_power_of_two=True, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=False, use_fft_mag=False, low_freq=20.0, high_freq=-400.0, num_filters=80, norm_filters=False, torchaudio_compatible_mel_scale=True)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

T_destination = ~T_destination
add_module(name, module)

Add a child module to the current module.

The module can be accessed as an attribute using the given name.

Return type:

None

Args:

name (str): name of the child module. The child module can be accessed from this module using the given name.

module (Module): child module to be added to the module.

apply(fn)

Apply fn recursively to every submodule (as returned by .children()) as well as self.

Typical use includes initializing the parameters of a model (see also nn-init-doc).

Return type:

TypeVar(T, bound= Module)

Args:

fn (Module -> None): function to be applied to each submodule

Returns:

Module: self

Example:

>>> @torch.no_grad()
>>> def init_weights(m):
>>>     print(m)
>>>     if type(m) == nn.Linear:
>>>         m.weight.fill_(1.0)
>>>         print(m.weight)
>>> net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
>>> net.apply(init_weights)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
bfloat16()

Casts all floating point parameters and buffers to bfloat16 datatype.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Returns:

Module: self

buffers(recurse=True)

Return an iterator over module buffers.

Return type:

Iterator[Tensor]

Args:

recurse (bool): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module.

Yields:

torch.Tensor: module buffer

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> for buf in model.buffers():
>>>     print(type(buf), buf.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
call_super_init: bool = False
children()

Return an iterator over immediate children modules.

Return type:

Iterator[Module]

Yields:

Module: a child module

compile(*args, **kwargs)

Compile this Module’s forward using torch.compile().

This Module’s __call__ method is compiled and all arguments are passed as-is to torch.compile().

See torch.compile() for details on the arguments for this function.

cpu()

Move all model parameters and buffers to the CPU.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Returns:

Module: self

cuda(device=None)

Move all model parameters and buffers to the GPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing optimizer if the module will live on GPU while being optimized.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Args:

device (int, optional): if specified, all parameters will be copied to that device

Returns:

Module: self

property dither: float
double()

Casts all floating point parameters and buffers to double datatype.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Returns:

Module: self

dump_patches: bool = False
eval()

Set the module in evaluation mode.

This has an effect only on certain modules. See the documentation of particular modules for details of their behavior in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

This is equivalent to self.train(False).

See locally-disable-grad-doc for a comparison between .eval() and several similar mechanisms that may be confused with it.

Return type:

TypeVar(T, bound= Module)

Returns:

Module: self

extra_repr()

Set the extra representation of the module.

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

Return type:

str

float()

Casts all floating point parameters and buffers to float datatype.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Returns:

Module: self

forward(x)

Define the computation performed at every call.

Should be overridden by all subclasses.

Return type:

Tensor

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

property frame_length: float
property frame_shift: float
get_buffer(target)

Return the buffer given by target if it exists, otherwise throw an error.

See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.

Return type:

Tensor

Args:

target: The fully-qualified string name of the buffer to look for. (See get_submodule for how to specify a fully-qualified string.)

Returns:

torch.Tensor: The buffer referenced by target

Raises:

AttributeError: If the target string references an invalid path or resolves to something that is not a buffer

get_extra_state()

Return any extra state to include in the module’s state_dict.

Implement this and a corresponding set_extra_state() for your module if you need to store extra state. This function is called when building the module’s state_dict().

Note that extra state should be picklable to ensure working serialization of the state_dict. We only provide backwards compatibility guarantees for serializing Tensors; other objects may break backwards compatibility if their serialized pickled form changes.

Return type:

Any

Returns:

object: Any extra state to store in the module’s state_dict

get_parameter(target)

Return the parameter given by target if it exists, otherwise throw an error.

See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.

Return type:

Parameter

Args:

target: The fully-qualified string name of the Parameter to look for. (See get_submodule for how to specify a fully-qualified string.)

Returns:

torch.nn.Parameter: The Parameter referenced by target

Raises:

AttributeError: If the target string references an invalid path or resolves to something that is not an nn.Parameter

get_submodule(target)

Return the submodule given by target if it exists, otherwise throw an error.

For example, let’s say you have an nn.Module A that looks like this:

A(
    (net_b): Module(
        (net_c): Module(
            (conv): Conv2d(16, 33, kernel_size=(3, 3), stride=(2, 2))
        )
        (linear): Linear(in_features=100, out_features=200, bias=True)
    )
)

(The diagram shows an nn.Module A. A has a nested submodule net_b, which itself has two submodules net_c and linear. net_c then has a submodule conv.)

To check whether or not we have the linear submodule, we would call get_submodule("net_b.linear"). To check whether we have the conv submodule, we would call get_submodule("net_b.net_c.conv").

The runtime of get_submodule is bounded by the degree of module nesting in target. A query against named_modules achieves the same result, but it is O(N) in the number of transitive modules. So, for a simple check to see if some submodule exists, get_submodule should always be used.

Return type:

Module

Args:

target: The fully-qualified string name of the submodule to look for. (See above example for how to specify a fully-qualified string.)

Returns:

torch.nn.Module: The submodule referenced by target

Raises:

AttributeError: If the target string references an invalid path or resolves to something that is not an nn.Module

half()

Casts all floating point parameters and buffers to half datatype.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Returns:

Module: self

ipu(device=None)

Move all model parameters and buffers to the IPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing optimizer if the module will live on IPU while being optimized.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Arguments:

device (int, optional): if specified, all parameters will be copied to that device

Returns:

Module: self

load_state_dict(state_dict, strict=True, assign=False)

Copy parameters and buffers from state_dict into this module and its descendants.

If strict is True, then the keys of state_dict must exactly match the keys returned by this module’s state_dict() function.

Warning

If assign is True the optimizer must be created after the call to load_state_dict.

Args:

state_dict (dict): a dict containing parameters and persistent buffers.

strict (bool, optional): whether to strictly enforce that the keys in state_dict match the keys returned by this module’s state_dict() function. Default: True

assign (bool, optional): whether to assign items in the state dictionary to their corresponding keys in the module instead of copying them inplace into the module’s current parameters and buffers. When False, the properties of the tensors in the current module are preserved; when True, the properties of the Tensors in the state dict are preserved. Default: False

Returns:
NamedTuple with missing_keys and unexpected_keys fields:
  • missing_keys is a list of str containing the missing keys

  • unexpected_keys is a list of str containing the unexpected keys

Note:

If a parameter or buffer is registered as None and its corresponding key exists in state_dict, load_state_dict() will raise a RuntimeError.
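
A sketch of loading with strict=False, where the key sets intentionally differ:

>>> from torch import nn
>>> src = nn.Linear(2, 2)
>>> dst = nn.Sequential(nn.Linear(2, 2))
>>> result = dst.load_state_dict(src.state_dict(), strict=False)
>>> result.missing_keys, result.unexpected_keys
(['0.weight', '0.bias'], ['weight', 'bias'])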

modules()

Return an iterator over all modules in the network.

Return type:

Iterator[Module]

Yields:

Module: a module in the network

Note:

Duplicate modules are returned only once. In the following example, l will be returned only once.

Example:

>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.modules()):
...     print(idx, '->', m)

0 -> Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
1 -> Linear(in_features=2, out_features=2, bias=True)
named_buffers(prefix='', recurse=True, remove_duplicate=True)

Return an iterator over module buffers, yielding both the name of the buffer as well as the buffer itself.

Return type:

Iterator[Tuple[str, Tensor]]

Args:

prefix (str): prefix to prepend to all buffer names.

recurse (bool, optional): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module. Defaults to True.

remove_duplicate (bool, optional): whether to remove the duplicated buffers in the result. Defaults to True.

Yields:

(str, torch.Tensor): Tuple containing the name and buffer

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> for name, buf in self.named_buffers():
>>>     if name in ['running_var']:
>>>         print(buf.size())
named_children()

Return an iterator over immediate children modules, yielding both the name of the module as well as the module itself.

Return type:

Iterator[Tuple[str, Module]]

Yields:

(str, Module): Tuple containing a name and child module

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> for name, module in model.named_children():
>>>     if name in ['conv4', 'conv5']:
>>>         print(module)
named_modules(memo=None, prefix='', remove_duplicate=True)

Return an iterator over all modules in the network, yielding both the name of the module as well as the module itself.

Args:

memo: a memo to store the set of modules already added to the result

prefix: a prefix that will be added to the name of the module

remove_duplicate: whether to remove the duplicated module instances in the result or not

Yields:

(str, Module): Tuple of name and module

Note:

Duplicate modules are returned only once. In the following example, l will be returned only once.

Example:

>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.named_modules()):
...     print(idx, '->', m)

0 -> ('', Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
))
1 -> ('0', Linear(in_features=2, out_features=2, bias=True))
named_parameters(prefix='', recurse=True, remove_duplicate=True)

Return an iterator over module parameters, yielding both the name of the parameter as well as the parameter itself.

Return type:

Iterator[Tuple[str, Parameter]]

Args:

prefix (str): prefix to prepend to all parameter names.

recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.

remove_duplicate (bool, optional): whether to remove the duplicated parameters in the result. Defaults to True.

Yields:

(str, Parameter): Tuple containing the name and parameter

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> for name, param in self.named_parameters():
>>>     if name in ['bias']:
>>>         print(param.size())
online_inference(x, context=None)
Return type:

Tuple[Tensor, Tensor]

parameters(recurse=True)

Return an iterator over module parameters.

This is typically passed to an optimizer.

Return type:

Iterator[Parameter]

Args:

recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.

Yields:

Parameter: module parameter

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> for param in model.parameters():
>>>     print(type(param), param.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
property preemph_coeff: float
register_backward_hook(hook)

Register a backward hook on the module.

This function is deprecated in favor of register_full_backward_hook() and the behavior of this function will change in future versions.

Return type:

RemovableHandle

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

register_buffer(name, tensor, persistent=True)

Add a buffer to the module.

This is typically used to register a buffer that should not be considered a model parameter. For example, BatchNorm’s running_mean is not a parameter, but is part of the module’s state. Buffers, by default, are persistent and will be saved alongside parameters. This behavior can be changed by setting persistent to False. The only difference between a persistent buffer and a non-persistent buffer is that the latter will not be a part of this module’s state_dict.

Buffers can be accessed as attributes using given names.

Return type:

None

Args:

name (str): name of the buffer. The buffer can be accessed from this module using the given name.

tensor (Tensor or None): buffer to be registered. If None, then operations that run on buffers, such as cuda, are ignored, and the buffer is not included in the module’s state_dict.

persistent (bool): whether the buffer is part of this module’s state_dict.

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> self.register_buffer('running_mean', torch.zeros(num_features))
register_forward_hook(hook, *, prepend=False, with_kwargs=False, always_call=False)

Register a forward hook on the module.

The hook will be called every time after forward() has computed an output.

If with_kwargs is False or not specified, the input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks and only to the forward. The hook can modify the output. It can modify the input inplace but it will not have effect on forward since this is called after forward() is called. The hook should have the following signature:

hook(module, args, output) -> None or modified output

If with_kwargs is True, the forward hook will be passed the kwargs given to the forward function and be expected to return the output possibly modified. The hook should have the following signature:

hook(module, args, kwargs, output) -> None or modified output
Return type:

RemovableHandle

Args:

hook (Callable): The user defined hook to be registered.

prepend (bool): If True, the provided hook will be fired before all existing forward hooks on this torch.nn.modules.Module. Otherwise, the provided hook will be fired after all existing forward hooks on this torch.nn.modules.Module. Note that global forward hooks registered with register_module_forward_hook() will fire before all hooks registered by this method. Default: False

with_kwargs (bool): If True, the hook will be passed the kwargs given to the forward function. Default: False

always_call (bool): If True, the hook will be run regardless of whether an exception is raised while calling the Module. Default: False

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()
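
A common use is capturing intermediate activations; a minimal sketch (the dict and names are illustrative):

>>> import torch
>>> from torch import nn
>>> feats = {}
>>> def save_output(module, args, output):
...     feats["linear"] = output.detach()  # keep a copy without autograd history
>>> lin = nn.Linear(2, 3)
>>> handle = lin.register_forward_hook(save_output)
>>> _ = lin(torch.randn(5, 2))
>>> feats["linear"].shape
torch.Size([5, 3])
>>> handle.remove()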

register_forward_pre_hook(hook, *, prepend=False, with_kwargs=False)

Register a forward pre-hook on the module.

The hook will be called every time before forward() is invoked.

If with_kwargs is false or not specified, the input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks and only to the forward. The hook can modify the input. User can either return a tuple or a single modified value in the hook. We will wrap the value into a tuple if a single value is returned (unless that value is already a tuple). The hook should have the following signature:

hook(module, args) -> None or modified input

If with_kwargs is true, the forward pre-hook will be passed the kwargs given to the forward function. And if the hook modifies the input, both the args and kwargs should be returned. The hook should have the following signature:

hook(module, args, kwargs) -> None or a tuple of modified input and kwargs
Return type:

RemovableHandle

Args:

hook (Callable): The user defined hook to be registered.

prepend (bool): If true, the provided hook will be fired before all existing forward_pre hooks on this torch.nn.modules.Module. Otherwise, the provided hook will be fired after all existing forward_pre hooks on this torch.nn.modules.Module. Note that global forward_pre hooks registered with register_module_forward_pre_hook() will fire before all hooks registered by this method. Default: False

with_kwargs (bool): If true, the hook will be passed the kwargs given to the forward function. Default: False

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

register_full_backward_hook(hook, prepend=False)

Register a backward hook on the module.

The hook will be called every time the gradients with respect to a module are computed, i.e. the hook will execute if and only if the gradients with respect to module outputs are computed. The hook should have the following signature:

hook(module, grad_input, grad_output) -> tuple(Tensor) or None

The grad_input and grad_output are tuples that contain the gradients with respect to the inputs and outputs respectively. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the input that will be used in place of grad_input in subsequent computations. grad_input will only correspond to the inputs given as positional arguments and all kwarg arguments are ignored. Entries in grad_input and grad_output will be None for all non-Tensor arguments.

For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly the caller will receive a view of each Tensor returned by the Module’s forward function.

Return type:

RemovableHandle

Warning

Modifying inputs or outputs inplace is not allowed when using backward hooks and will raise an error.

Args:

hook (Callable): The user-defined hook to be registered.

prepend (bool): If true, the provided hook will be fired before all existing backward hooks on this torch.nn.modules.Module. Otherwise, the provided hook will be fired after all existing backward hooks on this torch.nn.modules.Module. Note that global backward hooks registered with register_module_full_backward_hook() will fire before all hooks registered by this method.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

register_full_backward_pre_hook(hook, prepend=False)

Register a backward pre-hook on the module.

The hook will be called every time the gradients for the module are computed. The hook should have the following signature:

hook(module, grad_output) -> tuple[Tensor] or None

The grad_output is a tuple. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the output that will be used in place of grad_output in subsequent computations. Entries in grad_output will be None for all non-Tensor arguments.

For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly the caller will receive a view of each Tensor returned by the Module’s forward function.

Return type:

RemovableHandle

Warning

Modifying inputs inplace is not allowed when using backward hooks and will raise an error.

Args:

hook (Callable): The user-defined hook to be registered.

prepend (bool): If true, the provided hook will be fired before all existing backward_pre hooks on this torch.nn.modules.Module. Otherwise, the provided hook will be fired after all existing backward_pre hooks on this torch.nn.modules.Module. Note that global backward_pre hooks registered with register_module_full_backward_pre_hook() will fire before all hooks registered by this method.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

register_load_state_dict_post_hook(hook)

Register a post hook to be run after module’s load_state_dict is called.

It should have the following signature:

hook(module, incompatible_keys) -> None

The module argument is the current module that this hook is registered on, and the incompatible_keys argument is a NamedTuple consisting of attributes missing_keys and unexpected_keys. missing_keys is a list of str containing the missing keys and unexpected_keys is a list of str containing the unexpected keys.

The given incompatible_keys can be modified inplace if needed.

Note that the checks performed when calling load_state_dict() with strict=True are affected by modifications the hook makes to missing_keys or unexpected_keys, as expected. Additions to either set of keys will result in an error being thrown when strict=True, and clearing out both missing and unexpected keys will avoid an error.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

register_module(name, module)

Alias for add_module().

Return type:

None

register_parameter(name, param)

Add a parameter to the module.

The parameter can be accessed as an attribute using given name.

Return type:

None

Args:

name (str): name of the parameter. The parameter can be accessed from this module using the given name.

param (Parameter or None): parameter to be added to the module. If None, then operations that run on parameters, such as cuda, are ignored, and the parameter is not included in the module’s state_dict.

register_state_dict_pre_hook(hook)

Register a pre-hook for the state_dict() method.

These hooks will be called with arguments: self, prefix, and keep_vars before calling state_dict on self. The registered hooks can be used to perform pre-processing before the state_dict call is made.

property remove_dc_offset: bool
requires_grad_(requires_grad=True)

Change if autograd should record operations on parameters in this module.

This method sets the parameters’ requires_grad attributes in-place.

This method is helpful for freezing part of the module for finetuning or training parts of a model individually (e.g., GAN training).

See locally-disable-grad-doc for a comparison between .requires_grad_() and several similar mechanisms that may be confused with it.

Return type:

TypeVar(T, bound= Module)

Args:

requires_grad (bool): whether autograd should record operations on parameters in this module. Default: True.

Returns:

Module: self

property sampling_rate: int
set_extra_state(state)

Set extra state contained in the loaded state_dict.

This function is called from load_state_dict() to handle any extra state found within the state_dict. Implement this function and a corresponding get_extra_state() for your module if you need to store extra state within its state_dict.

Args:

state (dict): Extra state from the state_dict
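
A sketch of a get_extra_state()/set_extra_state() pair that round-trips a version tag through the state_dict (the Tagged class is illustrative):

>>> from torch import nn
>>> class Tagged(nn.Module):
...     def __init__(self):
...         super().__init__()
...         self.tag = "v1"
...     def get_extra_state(self):
...         return {"tag": self.tag}   # saved by state_dict()
...     def set_extra_state(self, state):
...         self.tag = state["tag"]    # restored by load_state_dict()
>>> m, m2 = Tagged(), Tagged()
>>> m2.tag = "v0"
>>> _ = m2.load_state_dict(m.state_dict())
>>> m2.tag
'v1'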

share_memory()

See torch.Tensor.share_memory_().

Return type:

TypeVar(T, bound= Module)

state_dict(*args, destination=None, prefix='', keep_vars=False)

Return a dictionary containing references to the whole state of the module.

Both parameters and persistent buffers (e.g. running averages) are included. Keys are corresponding parameter and buffer names. Parameters and buffers set to None are not included.

Note

The returned object is a shallow copy. It contains references to the module’s parameters and buffers.

Warning

Currently state_dict() also accepts positional arguments for destination, prefix and keep_vars in order. However, this is being deprecated and keyword arguments will be enforced in future releases.

Warning

Please avoid the use of argument destination as it is not designed for end-users.

Args:

destination (dict, optional): If provided, the state of module will be updated into the dict and the same object is returned. Otherwise, an OrderedDict will be created and returned. Default: None.

prefix (str, optional): a prefix added to parameter and buffer names to compose the keys in state_dict. Default: ''.

keep_vars (bool, optional): by default the Tensors returned in the state dict are detached from autograd. If it’s set to True, detaching will not be performed. Default: False.

Returns:
dict:

a dictionary containing a whole state of the module

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> module.state_dict().keys()
['bias', 'weight']
to(*args, **kwargs)

Move and/or cast the parameters and buffers.

This can be called as

to(device=None, dtype=None, non_blocking=False)
to(dtype, non_blocking=False)
to(tensor, non_blocking=False)
to(memory_format=torch.channels_last)

Its signature is similar to torch.Tensor.to(), but only accepts floating point or complex dtypes. In addition, this method will only cast the floating point or complex parameters and buffers to dtype (if given). The integral parameters and buffers will be moved to device, if that is given, but with dtypes unchanged. When non_blocking is set, it tries to convert/move asynchronously with respect to the host if possible, e.g., moving CPU Tensors with pinned memory to CUDA devices.

See below for examples.

Note

This method modifies the module in-place.

Args:

device (torch.device): the desired device of the parameters and buffers in this module

dtype (torch.dtype): the desired floating point or complex dtype of the parameters and buffers in this module

tensor (torch.Tensor): Tensor whose dtype and device are the desired dtype and device for all parameters and buffers in this module

memory_format (torch.memory_format): the desired memory format for 4D parameters and buffers in this module (keyword only argument)

Returns:

Module: self

Examples:

>>> # xdoctest: +IGNORE_WANT("non-deterministic")
>>> linear = nn.Linear(2, 2)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]])
>>> linear.to(torch.double)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]], dtype=torch.float64)
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_CUDA1)
>>> gpu1 = torch.device("cuda:1")
>>> linear.to(gpu1, dtype=torch.half, non_blocking=True)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16, device='cuda:1')
>>> cpu = torch.device("cpu")
>>> linear.to(cpu)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16)

>>> linear = nn.Linear(2, 2, bias=None).to(torch.cdouble)
>>> linear.weight
Parameter containing:
tensor([[ 0.3741+0.j,  0.2382+0.j],
        [ 0.5593+0.j, -0.4443+0.j]], dtype=torch.complex128)
>>> linear(torch.ones(3, 2, dtype=torch.cdouble))
tensor([[0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j]], dtype=torch.complex128)
to_empty(*, device, recurse=True)

Move the parameters and buffers to the specified device without copying storage.

Return type:

TypeVar(T, bound= Module)

Args:

device (torch.device): The desired device of the parameters and buffers in this module.

recurse (bool): Whether parameters and buffers of submodules should be recursively moved to the specified device.

Returns:

Module: self

train(mode=True)

Set the module in training mode.

This has an effect only on certain modules. See the documentation of particular modules for details of their behavior in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

Return type:

TypeVar(T, bound= Module)

Args:

mode (bool): whether to set training mode (True) or evaluation mode (False). Default: True.

Returns:

Module: self

type(dst_type)

Casts all parameters and buffers to dst_type.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Args:

dst_type (type or string): the desired type

Returns:

Module: self

property window_type: str
xpu(device=None)

Move all model parameters and buffers to the XPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing optimizer if the module will live on XPU while being optimized.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Arguments:

device (int, optional): if specified, all parameters will be copied to that device

Returns:

Module: self

zero_grad(set_to_none=True)

Reset gradients of all model parameters.

See similar function under torch.optim.Optimizer for more context.

Return type:

None

Args:

set_to_none (bool): instead of setting to zero, set the grads to None. See torch.optim.Optimizer.zero_grad() for details.

training: bool
class lhotse.features.kaldi.layers.Wav2MFCC(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, round_to_power_of_two=True, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=False, use_fft_mag=False, low_freq=20.0, high_freq=-400.0, num_filters=23, norm_filters=False, num_ceps=13, cepstral_lifter=22, torchaudio_compatible_mel_scale=True)[source]

Apply standard Kaldi preprocessing (dithering, removing DC offset, pre-emphasis, etc.) on the input waveforms and compute their Mel-Frequency Cepstral Coefficients (MFCC).

Example:

>>> x = torch.randn(1, 16000, dtype=torch.float32)
>>> x.shape
torch.Size([1, 16000])
>>> t = Wav2MFCC()
>>> t(x).shape
torch.Size([1, 100, 13])

The input is a tensor of shape (batch_size, num_samples). The output is a tensor of shape (batch_size, num_frames, num_ceps).
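
A sketch of varying num_ceps (the frame count assumes the defaults documented above):

>>> import torch
>>> from lhotse.features.kaldi.layers import Wav2MFCC
>>> x = torch.randn(2, 16000)
>>> t = Wav2MFCC(num_ceps=20, num_filters=30)
>>> t(x).shape  # 20 cepstral coefficients per frame
torch.Size([2, 100, 20])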

__init__(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, round_to_power_of_two=True, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=False, use_fft_mag=False, low_freq=20.0, high_freq=-400.0, num_filters=23, norm_filters=False, num_ceps=13, cepstral_lifter=22, torchaudio_compatible_mel_scale=True)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

static make_lifter(N, Q)[source]

Makes the liftering function.

Args:

N: Number of cepstral coefficients.

Q: Liftering parameter.

Returns:

Liftering vector.
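
For reference, the conventional Kaldi/HTK lifter is 1 + (Q/2) * sin(pi * n / Q); a sketch under the assumption that lhotse follows this formula:

>>> import math, torch
>>> N, Q = 13, 22
>>> lifter = 1.0 + 0.5 * Q * torch.sin(math.pi * torch.arange(N) / Q)
>>> lifter.shape
torch.Size([13])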

static make_dct_matrix(num_ceps, num_filters)[source]
T_destination = ~T_destination
add_module(name, module)

Add a child module to the current module.

The module can be accessed as an attribute using the given name.

Return type:

None

Args:

name (str): name of the child module. The child module can be accessed from this module using the given name.

module (Module): child module to be added to the module.

apply(fn)

Apply fn recursively to every submodule (as returned by .children()) as well as self.

Typical use includes initializing the parameters of a model (see also nn-init-doc).

Return type:

TypeVar(T, bound= Module)

Args:

fn (Module -> None): function to be applied to each submodule

Returns:

Module: self

Example:

>>> @torch.no_grad()
>>> def init_weights(m):
>>>     print(m)
>>>     if type(m) == nn.Linear:
>>>         m.weight.fill_(1.0)
>>>         print(m.weight)
>>> net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
>>> net.apply(init_weights)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
bfloat16()

Casts all floating point parameters and buffers to bfloat16 datatype.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Returns:

Module: self

buffers(recurse=True)

Return an iterator over module buffers.

Return type:

Iterator[Tensor]

Args:

recurse (bool): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module.

Yields:

torch.Tensor: module buffer

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> for buf in model.buffers():
>>>     print(type(buf), buf.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
call_super_init: bool = False
children()

Return an iterator over immediate children modules.

Return type:

Iterator[Module]

Yields:

Module: a child module

compile(*args, **kwargs)

Compile this Module’s forward using torch.compile().

This Module’s __call__ method is compiled and all arguments are passed as-is to torch.compile().

See torch.compile() for details on the arguments for this function.

cpu()

Move all model parameters and buffers to the CPU.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Returns:

Module: self

cuda(device=None)

Move all model parameters and buffers to the GPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing optimizer if the module will live on GPU while being optimized.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Args:

device (int, optional): if specified, all parameters will be copied to that device

Returns:

Module: self

property dither: float
double()

Casts all floating point parameters and buffers to double datatype.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Returns:

Module: self

dump_patches: bool = False
eval()

Set the module in evaluation mode.

This has an effect only on certain modules. See the documentation of particular modules for details of their behavior in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

This is equivalent to self.train(False).

See locally-disable-grad-doc for a comparison between .eval() and several similar mechanisms that may be confused with it.

Return type:

TypeVar(T, bound= Module)

Returns:

Module: self

extra_repr()

Set the extra representation of the module.

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

Return type:

str

float()

Casts all floating point parameters and buffers to float datatype.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Returns:

Module: self

forward(x)

Define the computation performed at every call.

Should be overridden by all subclasses.

Return type:

Tensor

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

property frame_length: float
property frame_shift: float
get_buffer(target)

Return the buffer given by target if it exists, otherwise throw an error.

See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.

Return type:

Tensor

Args:

target: The fully-qualified string name of the buffer to look for. (See get_submodule for how to specify a fully-qualified string.)

Returns:

torch.Tensor: The buffer referenced by target

Raises:

AttributeError: If the target string references an invalid path or resolves to something that is not a buffer

get_extra_state()

Return any extra state to include in the module’s state_dict.

Implement this and a corresponding set_extra_state() for your module if you need to store extra state. This function is called when building the module’s state_dict().

Note that extra state should be picklable to ensure working serialization of the state_dict. We only provide backwards compatibility guarantees for serializing Tensors; other objects may break backwards compatibility if their serialized pickled form changes.

Return type:

Any

Returns:

object: Any extra state to store in the module’s state_dict

get_parameter(target)

Return the parameter given by target if it exists, otherwise throw an error.

See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.

Return type:

Parameter

Args:

target: The fully-qualified string name of the Parameter to look for. (See get_submodule for how to specify a fully-qualified string.)

Returns:

torch.nn.Parameter: The Parameter referenced by target

Raises:

AttributeError: If the target string references an invalid path or resolves to something that is not an nn.Parameter

get_submodule(target)

Return the submodule given by target if it exists, otherwise throw an error.

For example, let’s say you have an nn.Module A that looks like this:

A(
    (net_b): Module(
        (net_c): Module(
            (conv): Conv2d(16, 33, kernel_size=(3, 3), stride=(2, 2))
        )
        (linear): Linear(in_features=100, out_features=200, bias=True)
    )
)

(The diagram shows an nn.Module A. A has a nested submodule net_b, which itself has two submodules net_c and linear. net_c then has a submodule conv.)

To check whether or not we have the linear submodule, we would call get_submodule("net_b.linear"). To check whether we have the conv submodule, we would call get_submodule("net_b.net_c.conv").

The runtime of get_submodule is bounded by the degree of module nesting in target. A query against named_modules achieves the same result, but it is O(N) in the number of transitive modules. So, for a simple check to see if some submodule exists, get_submodule should always be used.

Return type:

Module

Args:

target: The fully-qualified string name of the submodule to look for. (See above example for how to specify a fully-qualified string.)

Returns:

torch.nn.Module: The submodule referenced by target

Raises:

AttributeError: If the target string references an invalid path or resolves to something that is not an nn.Module
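
A short usage sketch with a nested container (the model is illustrative):

>>> from torch import nn
>>> net = nn.Sequential(nn.Linear(2, 2), nn.Sequential(nn.ReLU(), nn.Linear(2, 2)))
>>> net.get_submodule("1.1")
Linear(in_features=2, out_features=2, bias=True)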

half()

Casts all floating point parameters and buffers to half datatype.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Returns:

Module: self

ipu(device=None)

Move all model parameters and buffers to the IPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing optimizer if the module will live on IPU while being optimized.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Arguments:

device (int, optional): if specified, all parameters will be copied to that device

Returns:

Module: self

load_state_dict(state_dict, strict=True, assign=False)

Copy parameters and buffers from state_dict into this module and its descendants.

If strict is True, then the keys of state_dict must exactly match the keys returned by this module’s state_dict() function.

Warning

If assign is True the optimizer must be created after the call to load_state_dict.

Args:

state_dict (dict): a dict containing parameters and persistent buffers.

strict (bool, optional): whether to strictly enforce that the keys in state_dict match the keys returned by this module’s state_dict() function. Default: True

assign (bool, optional): whether to assign items in the state dictionary to their corresponding keys in the module instead of copying them inplace into the module’s current parameters and buffers. When False, the properties of the tensors in the current module are preserved; when True, the properties of the Tensors in the state dict are preserved. Default: False

Returns:
NamedTuple with missing_keys and unexpected_keys fields:
  • missing_keys is a list of str containing the missing keys

  • unexpected_keys is a list of str containing the unexpected keys

Note:

If a parameter or buffer is registered as None and its corresponding key exists in state_dict, load_state_dict() will raise a RuntimeError.

modules()

Return an iterator over all modules in the network.

Return type:

Iterator[Module]

Yields:

Module: a module in the network

Note:

Duplicate modules are returned only once. In the following example, l will be returned only once.

Example:

>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.modules()):
...     print(idx, '->', m)

0 -> Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
1 -> Linear(in_features=2, out_features=2, bias=True)
named_buffers(prefix='', recurse=True, remove_duplicate=True)

Return an iterator over module buffers, yielding both the name of the buffer as well as the buffer itself.

Return type:

Iterator[Tuple[str, Tensor]]

Args:

prefix (str): prefix to prepend to all buffer names.

recurse (bool, optional): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module. Defaults to True.

remove_duplicate (bool, optional): whether to remove the duplicated buffers in the result. Defaults to True.

Yields:

(str, torch.Tensor): Tuple containing the name and buffer

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> for name, buf in self.named_buffers():
>>>     if name in ['running_var']:
>>>         print(buf.size())
named_children()

Return an iterator over immediate children modules, yielding both the name of the module as well as the module itself.

Return type:

Iterator[Tuple[str, Module]]

Yields:

(str, Module): Tuple containing a name and child module

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> for name, module in model.named_children():
>>>     if name in ['conv4', 'conv5']:
>>>         print(module)
named_modules(memo=None, prefix='', remove_duplicate=True)

Return an iterator over all modules in the network, yielding both the name of the module as well as the module itself.

Args:

memo: a memo to store the set of modules already added to the result

prefix: a prefix that will be added to the name of the module

remove_duplicate: whether to remove the duplicated module instances in the result or not

Yields:

(str, Module): Tuple of name and module

Note:

Duplicate modules are returned only once. In the following example, l will be returned only once.

Example:

>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.named_modules()):
...     print(idx, '->', m)

0 -> ('', Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
))
1 -> ('0', Linear(in_features=2, out_features=2, bias=True))
named_parameters(prefix='', recurse=True, remove_duplicate=True)

Return an iterator over module parameters, yielding both the name of the parameter as well as the parameter itself.

Return type:

Iterator[Tuple[str, Parameter]]

Args:

prefix (str): prefix to prepend to all parameter names.

recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.

remove_duplicate (bool, optional): whether to remove the duplicated parameters in the result. Defaults to True.

Yields:

(str, Parameter): Tuple containing the name and parameter

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> for name, param in self.named_parameters():
>>>     if name in ['bias']:
>>>         print(param.size())
online_inference(x, context=None)
Return type:

Tuple[Tensor, Tensor]

parameters(recurse=True)

Return an iterator over module parameters.

This is typically passed to an optimizer.

Return type:

Iterator[Parameter]

Args:

recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.

Yields:

Parameter: module parameter

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> for param in model.parameters():
>>>     print(type(param), param.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
property preemph_coeff: float
register_backward_hook(hook)

Register a backward hook on the module.

This function is deprecated in favor of register_full_backward_hook() and the behavior of this function will change in future versions.

Return type:

RemovableHandle

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

register_buffer(name, tensor, persistent=True)

Add a buffer to the module.

This is typically used to register a buffer that should not be considered a model parameter. For example, BatchNorm’s running_mean is not a parameter, but is part of the module’s state. Buffers, by default, are persistent and will be saved alongside parameters. This behavior can be changed by setting persistent to False. The only difference between a persistent buffer and a non-persistent buffer is that the latter will not be a part of this module’s state_dict.

Buffers can be accessed as attributes using given names.

Return type:

None

Args:

name (str): name of the buffer. The buffer can be accessed from this module using the given name.

tensor (Tensor or None): buffer to be registered. If None, then operations that run on buffers, such as cuda, are ignored, and the buffer is not included in the module’s state_dict.

persistent (bool): whether the buffer is part of this module’s state_dict.

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> self.register_buffer('running_mean', torch.zeros(num_features))
register_forward_hook(hook, *, prepend=False, with_kwargs=False, always_call=False)

Register a forward hook on the module.

The hook will be called every time after forward() has computed an output.

If with_kwargs is False or not specified, the input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks, only to the forward. The hook can modify the output. It can modify the input inplace, but this will not have an effect on forward, since the hook is called after forward() has run. The hook should have the following signature:

hook(module, args, output) -> None or modified output

If with_kwargs is True, the forward hook will be passed the kwargs given to the forward function and be expected to return the output possibly modified. The hook should have the following signature:

hook(module, args, kwargs, output) -> None or modified output
Return type:

RemovableHandle

Args:

hook (Callable): The user-defined hook to be registered.

prepend (bool): If True, the provided hook will be fired before all existing forward hooks on this torch.nn.modules.Module. Otherwise, the provided hook will be fired after all existing forward hooks on this torch.nn.modules.Module. Note that global forward hooks registered with register_module_forward_hook() will fire before all hooks registered by this method. Default: False

with_kwargs (bool): If True, the hook will be passed the kwargs given to the forward function. Default: False

always_call (bool): If True, the hook will be run regardless of whether an exception is raised while calling the Module. Default: False

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()
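
A minimal sketch of registering and removing a forward hook (the module and hook names here are illustrative, not part of this API):

>>> # xdoctest: +SKIP("illustrative sketch")
>>> import torch
>>> from torch import nn
>>> def log_output_shape(module, args, output):
...     print(type(module).__name__, tuple(output.shape))
>>> layer = nn.Linear(2, 4)
>>> handle = layer.register_forward_hook(log_output_shape)
>>> _ = layer(torch.randn(3, 2))
Linear (3, 4)
>>> handle.remove()  # detach the hook when no longer needed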

register_forward_pre_hook(hook, *, prepend=False, with_kwargs=False)

Register a forward pre-hook on the module.

The hook will be called every time before forward() is invoked.

If with_kwargs is false or not specified, the input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks, only to the forward. The hook can modify the input. The user can either return a tuple or a single modified value in the hook. We will wrap the value into a tuple if a single value is returned (unless that value is already a tuple). The hook should have the following signature:

hook(module, args) -> None or modified input

If with_kwargs is true, the forward pre-hook will be passed the kwargs given to the forward function. And if the hook modifies the input, both the args and kwargs should be returned. The hook should have the following signature:

hook(module, args, kwargs) -> None or a tuple of modified input and kwargs
Return type:

RemovableHandle

Args:

hook (Callable): The user-defined hook to be registered.

prepend (bool): If true, the provided hook will be fired before all existing forward_pre hooks on this torch.nn.modules.Module. Otherwise, the provided hook will be fired after all existing forward_pre hooks on this torch.nn.modules.Module. Note that global forward_pre hooks registered with register_module_forward_pre_hook() will fire before all hooks registered by this method. Default: False

with_kwargs (bool): If true, the hook will be passed the kwargs given to the forward function. Default: False

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()
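
A short sketch (illustrative names) of a pre-hook that rewrites the positional input before forward() runs:

>>> # xdoctest: +SKIP("illustrative sketch")
>>> import torch
>>> from torch import nn
>>> def clamp_input(module, args):
...     # returning a tuple replaces the positional args
...     return (args[0].clamp(min=0.0),)
>>> layer = nn.Linear(2, 2)
>>> handle = layer.register_forward_pre_hook(clamp_input)
>>> out = layer(torch.randn(3, 2))  # forward() sees the clamped input
>>> handle.remove()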

register_full_backward_hook(hook, prepend=False)

Register a backward hook on the module.

The hook will be called every time the gradients with respect to a module are computed, i.e. the hook will execute if and only if the gradients with respect to module outputs are computed. The hook should have the following signature:

hook(module, grad_input, grad_output) -> tuple(Tensor) or None

The grad_input and grad_output are tuples that contain the gradients with respect to the inputs and outputs respectively. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the input that will be used in place of grad_input in subsequent computations. grad_input will only correspond to the inputs given as positional arguments and all kwarg arguments are ignored. Entries in grad_input and grad_output will be None for all non-Tensor arguments.

For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly, the caller will receive a view of each Tensor returned by the Module’s forward function.

Return type:

RemovableHandle

Warning

Modifying inputs or outputs inplace is not allowed when using backward hooks and will raise an error.

Args:

hook (Callable): The user-defined hook to be registered.

prepend (bool): If true, the provided hook will be fired before all existing backward hooks on this torch.nn.modules.Module. Otherwise, the provided hook will be fired after all existing backward hooks on this torch.nn.modules.Module. Note that global backward hooks registered with register_module_full_backward_hook() will fire before all hooks registered by this method.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()
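
A sketch (illustrative names) of a full backward hook that inspects gradient shapes:

>>> # xdoctest: +SKIP("illustrative sketch")
>>> import torch
>>> from torch import nn
>>> def report_grads(module, grad_input, grad_output):
...     print([g.shape if g is not None else None for g in grad_output])
>>> layer = nn.Linear(2, 2)
>>> handle = layer.register_full_backward_hook(report_grads)
>>> layer(torch.randn(3, 2)).sum().backward()
[torch.Size([3, 2])]
>>> handle.remove()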

register_full_backward_pre_hook(hook, prepend=False)

Register a backward pre-hook on the module.

The hook will be called every time the gradients for the module are computed. The hook should have the following signature:

hook(module, grad_output) -> tuple[Tensor] or None

The grad_output is a tuple. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the output that will be used in place of grad_output in subsequent computations. Entries in grad_output will be None for all non-Tensor arguments.

For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly, the caller will receive a view of each Tensor returned by the Module’s forward function.

Return type:

RemovableHandle

Warning

Modifying inputs inplace is not allowed when using backward hooks and will raise an error.

Args:

hook (Callable): The user-defined hook to be registered.

prepend (bool): If true, the provided hook will be fired before all existing backward_pre hooks on this torch.nn.modules.Module. Otherwise, the provided hook will be fired after all existing backward_pre hooks on this torch.nn.modules.Module. Note that global backward_pre hooks registered with register_module_full_backward_pre_hook() will fire before all hooks registered by this method.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

register_load_state_dict_post_hook(hook)

Register a post hook to be run after module’s load_state_dict is called.

It should have the following signature::

hook(module, incompatible_keys) -> None

The module argument is the current module that this hook is registered on, and the incompatible_keys argument is a NamedTuple consisting of attributes missing_keys and unexpected_keys. missing_keys is a list of str containing the missing keys and unexpected_keys is a list of str containing the unexpected keys.

The given incompatible_keys can be modified inplace if needed.

Note that the checks performed when calling load_state_dict() with strict=True are affected by modifications the hook makes to missing_keys or unexpected_keys, as expected. Additions to either set of keys will result in an error being thrown when strict=True, and clearing out both missing and unexpected keys will avoid an error.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()
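
A sketch (the hook and key names are illustrative) that silences a missing-key error for a known-optional buffer:

>>> # xdoctest: +SKIP("undefined vars")
>>> def ignore_optional_buffer(module, incompatible_keys):
...     # edits here affect the strict=True checks in load_state_dict()
...     if 'optional_buffer' in incompatible_keys.missing_keys:
...         incompatible_keys.missing_keys.remove('optional_buffer')
>>> handle = model.register_load_state_dict_post_hook(ignore_optional_buffer)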

register_module(name, module)

Alias for add_module().

Return type:

None

register_parameter(name, param)

Add a parameter to the module.

The parameter can be accessed as an attribute using given name.

Return type:

None

Args:

name (str): name of the parameter. The parameter can be accessed from this module using the given name.

param (Parameter or None): parameter to be added to the module. If None, then operations that run on parameters, such as cuda, are ignored, and the parameter is not included in the module’s state_dict.

register_state_dict_pre_hook(hook)

Register a pre-hook for the state_dict() method.

These hooks will be called with arguments: self, prefix, and keep_vars before calling state_dict on self. The registered hooks can be used to perform pre-processing before the state_dict call is made.

property remove_dc_offset: bool
requires_grad_(requires_grad=True)

Change if autograd should record operations on parameters in this module.

This method sets the parameters’ requires_grad attributes in-place.

This method is helpful for freezing part of the module for finetuning or training parts of a model individually (e.g., GAN training).

See the “Locally disabling gradient computation” section in the PyTorch autograd documentation for a comparison between .requires_grad_() and several similar mechanisms that may be confused with it.

Return type:

TypeVar(T, bound= Module)

Args:

requires_grad (bool): whether autograd should record operations on parameters in this module. Default: True.

Returns:

Module: self
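
A common finetuning pattern (sketch; model and classifier are illustrative names):

>>> # xdoctest: +SKIP("undefined vars")
>>> model.requires_grad_(False)            # freeze the whole module
>>> model.classifier.requires_grad_(True)  # unfreeze only the head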

property sampling_rate: int
set_extra_state(state)

Set extra state contained in the loaded state_dict.

This function is called from load_state_dict() to handle any extra state found within the state_dict. Implement this function and a corresponding get_extra_state() for your module if you need to store extra state within its state_dict.

Args:

state (dict): Extra state from the state_dict

share_memory()

See torch.Tensor.share_memory_().

Return type:

TypeVar(T, bound= Module)

state_dict(*args, destination=None, prefix='', keep_vars=False)

Return a dictionary containing references to the whole state of the module.

Both parameters and persistent buffers (e.g. running averages) are included. Keys are corresponding parameter and buffer names. Parameters and buffers set to None are not included.

Note

The returned object is a shallow copy. It contains references to the module’s parameters and buffers.

Warning

Currently state_dict() also accepts positional arguments for destination, prefix and keep_vars in order. However, this is being deprecated and keyword arguments will be enforced in future releases.

Warning

Please avoid the use of argument destination as it is not designed for end-users.

Args:

destination (dict, optional): If provided, the state of module will be updated into the dict and the same object is returned. Otherwise, an OrderedDict will be created and returned. Default: None.

prefix (str, optional): a prefix added to parameter and buffer names to compose the keys in state_dict. Default: ''.

keep_vars (bool, optional): by default the Tensors returned in the state dict are detached from autograd. If it’s set to True, detaching will not be performed. Default: False.

Returns:
dict:

a dictionary containing a whole state of the module

Example:

>>> # xdoctest: +SKIP("undefined vars")
>>> module.state_dict().keys()
['bias', 'weight']
to(*args, **kwargs)

Move and/or cast the parameters and buffers.

This can be called as

to(device=None, dtype=None, non_blocking=False)
to(dtype, non_blocking=False)
to(tensor, non_blocking=False)
to(memory_format=torch.channels_last)

Its signature is similar to torch.Tensor.to(), but only accepts floating point or complex dtypes. In addition, this method will only cast the floating point or complex parameters and buffers to dtype (if given). The integral parameters and buffers will be moved to device, if that is given, but with dtypes unchanged. When non_blocking is set, it tries to convert/move asynchronously with respect to the host if possible, e.g., moving CPU Tensors with pinned memory to CUDA devices.

See below for examples.

Note

This method modifies the module in-place.

Args:

device (torch.device): the desired device of the parameters and buffers in this module

dtype (torch.dtype): the desired floating point or complex dtype of the parameters and buffers in this module

tensor (torch.Tensor): Tensor whose dtype and device are the desired dtype and device for all parameters and buffers in this module

memory_format (torch.memory_format): the desired memory format for 4D parameters and buffers in this module (keyword only argument)

Returns:

Module: self

Examples:

>>> # xdoctest: +IGNORE_WANT("non-deterministic")
>>> linear = nn.Linear(2, 2)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]])
>>> linear.to(torch.double)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]], dtype=torch.float64)
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_CUDA1)
>>> gpu1 = torch.device("cuda:1")
>>> linear.to(gpu1, dtype=torch.half, non_blocking=True)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16, device='cuda:1')
>>> cpu = torch.device("cpu")
>>> linear.to(cpu)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16)

>>> linear = nn.Linear(2, 2, bias=None).to(torch.cdouble)
>>> linear.weight
Parameter containing:
tensor([[ 0.3741+0.j,  0.2382+0.j],
        [ 0.5593+0.j, -0.4443+0.j]], dtype=torch.complex128)
>>> linear(torch.ones(3, 2, dtype=torch.cdouble))
tensor([[0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j]], dtype=torch.complex128)
to_empty(*, device, recurse=True)

Move the parameters and buffers to the specified device without copying storage.

Return type:

TypeVar(T, bound= Module)

Args:

device (torch.device): The desired device of the parameters and buffers in this module.

recurse (bool): Whether parameters and buffers of submodules should be recursively moved to the specified device.

Returns:

Module: self

train(mode=True)

Set the module in training mode.

This has any effect only on certain modules. See documentations of particular modules for details of their behaviors in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

Return type:

TypeVar(T, bound= Module)

Args:

mode (bool): whether to set training mode (True) or evaluation mode (False). Default: True.

Returns:

Module: self
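
A typical toggle between training and inference behavior (sketch; model is illustrative):

>>> # xdoctest: +SKIP("undefined vars")
>>> model.train()       # enable training-time behavior of Dropout, BatchNorm, etc.
>>> model.train(False)  # disable it again for evaluation/inference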

type(dst_type)

Casts all parameters and buffers to dst_type.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Args:

dst_type (type or string): the desired type

Returns:

Module: self

property window_type: str
xpu(device=None)

Move all model parameters and buffers to the XPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on XPU while being optimized.

Return type:

TypeVar(T, bound= Module)

Note

This method modifies the module in-place.

Arguments:

device (int, optional): if specified, all parameters will be copied to that device

Returns:

Module: self

zero_grad(set_to_none=True)

Reset gradients of all model parameters.

See similar function under torch.optim.Optimizer for more context.

Return type:

None

Args:

set_to_none (bool): instead of setting to zero, set the grads to None. See torch.optim.Optimizer.zero_grad() for details.

training: bool
lhotse.features.kaldi.layers.create_mel_scale(num_filters, fft_length, sampling_rate, low_freq=0, high_freq=None, norm_filters=True)[source]
Return type:

Tensor

lhotse.features.kaldi.layers.available_windows()[source]
Return type:

List[str]

lhotse.features.kaldi.layers.create_frame_window(window_size, window_type='povey', blackman_coeff=0.42)[source]

Returns a window function with the given type and size
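
A brief sketch (the 'povey' window is the Kaldi-style default; the returned tensor has window_size elements):

>>> from lhotse.features.kaldi.layers import create_frame_window
>>> win = create_frame_window(400, window_type='povey')  # 25 ms frame at 16 kHz
>>> win.shape
torch.Size([400])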

lhotse.features.kaldi.layers.lin2mel(x)[source]
lhotse.features.kaldi.layers.mel2lin(x)[source]
lhotse.features.kaldi.layers.next_power_of_2(x)[source]

Returns the smallest power of 2 that is greater than or equal to x.

Original source: TorchAudio (torchaudio/compliance/kaldi.py)

Return type:

int
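
For example, a 400-sample frame (25 ms at 16 kHz) is rounded up to a 512-point FFT:

>>> from lhotse.features.kaldi.layers import next_power_of_2
>>> next_power_of_2(400)
512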

Torchaudio feature extractors

class lhotse.features.fbank.TorchaudioFbankConfig(dither=0.0, window_type='povey', frame_length=0.025, frame_shift=0.01, remove_dc_offset=True, round_to_power_of_two=True, energy_floor=1e-10, min_duration=0.0, preemphasis_coefficient=0.97, raw_energy=True, low_freq=20.0, high_freq=-400.0, num_mel_bins=80, use_energy=False, vtln_low=100.0, vtln_high=-500.0, vtln_warp=1.0)[source]
dither: float = 0.0
window_type: str = 'povey'
frame_length: float = 0.025
frame_shift: float = 0.01
remove_dc_offset: bool = True
round_to_power_of_two: bool = True
energy_floor: float = 1e-10
min_duration: float = 0.0
preemphasis_coefficient: float = 0.97
raw_energy: bool = True
low_freq: float = 20.0
high_freq: float = -400.0
num_mel_bins: int = 80
use_energy: bool = False
vtln_low: float = 100.0
vtln_high: float = -500.0
vtln_warp: float = 1.0
to_dict()[source]
Return type:

Dict[str, Any]

static from_dict(data)[source]
Return type:

TorchaudioFbankConfig

__init__(dither=0.0, window_type='povey', frame_length=0.025, frame_shift=0.01, remove_dc_offset=True, round_to_power_of_two=True, energy_floor=1e-10, min_duration=0.0, preemphasis_coefficient=0.97, raw_energy=True, low_freq=20.0, high_freq=-400.0, num_mel_bins=80, use_energy=False, vtln_low=100.0, vtln_high=-500.0, vtln_warp=1.0)
class lhotse.features.fbank.TorchaudioFbank(config=None)[source]

Log Mel energy filter bank feature extractor based on torchaudio.compliance.kaldi.fbank function.
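
A minimal extraction sketch on synthetic audio (illustrative; the exact number of output frames depends on the framing configuration):

>>> # xdoctest: +SKIP("illustrative sketch")
>>> import numpy as np
>>> from lhotse.features.fbank import TorchaudioFbank, TorchaudioFbankConfig
>>> extractor = TorchaudioFbank(TorchaudioFbankConfig(num_mel_bins=80))
>>> samples = np.random.randn(16000).astype(np.float32)  # 1 s of noise at 16 kHz
>>> feats = extractor.extract(samples, sampling_rate=16000)
>>> feats.shape[1] == extractor.feature_dim(16000) == 80
True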

name = 'fbank'
config_type

alias of TorchaudioFbankConfig

feature_dim(sampling_rate)[source]
Return type:

int

static mix(features_a, features_b, energy_scaling_factor_b)[source]

Perform feature-domain mix of two signals, a and b, and return the mixed signal.

Parameters:
  • features_a (ndarray) – Left-hand side (reference) signal.

  • features_b (ndarray) – Right-hand side (mixed-in) signal.

  • energy_scaling_factor_b (float) – A scaling factor for features_b energy. It is used to achieve a specific SNR. E.g. to mix with an SNR of 10dB when both features_a and features_b energies are 100, the features_b signal energy needs to be scaled by 0.1. Since different features (e.g. spectrogram, fbank, MFCC) require different combination of transformations (e.g. exp, log, sqrt, pow) to allow mixing of two signals, the exact place where to apply energy_scaling_factor_b to the signal is determined by the implementer.

Return type:

ndarray

Returns:

A mixed feature matrix.
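
The 10 dB example above follows from SNR = 10 * log10(E_a / (s * E_b)); a quick illustrative calculation of the scaling factor s (not part of the Lhotse API):

>>> E_a, E_b, snr_db = 100.0, 100.0, 10.0
>>> s = E_a / (E_b * 10 ** (snr_db / 10))
>>> s
0.1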

static compute_energy(features)[source]

Compute the total energy of a feature matrix. How the energy is computed depends on a particular type of features. It is expected that when implemented, compute_energy will never return zero.

Parameters:

features (ndarray) – A feature matrix.

Return type:

float

Returns:

A positive float value of the signal energy.

__init__(config=None)
property device: str | device
extract(samples, sampling_rate)

Defines how to extract features using a numpy ndarray of audio samples and the sampling rate.

Return type:

ndarray

Returns:

a numpy ndarray representing the feature matrix.

extract_batch(samples, sampling_rate, lengths=None)

Performs batch extraction. It is not guaranteed to be faster than FeatureExtractor.extract() – it depends on whether the implementation of a particular feature extractor supports accelerated batch computation. If lengths is provided, it is assumed that the input is a batch of padded sequences, so we will not perform any further collation.

Return type:

Union[ndarray, Tensor, List[ndarray], List[Tensor]]

Note

Unless overridden by child classes, it defaults to sequentially calling FeatureExtractor.extract() on the inputs.

Note

This method should support variable length inputs.

extract_from_recording_and_store(recording, storage, offset=0, duration=None, channels=None, augment_fn=None)

Extract the features from a Recording in a full pipeline:

  • load audio from disk;

  • optionally, perform audio augmentation;

  • extract the features;

  • save them to disk in a specified directory;

  • return a Features object with a description of the extracted features and the source data used.

Parameters:
  • recording (Recording) – a Recording that specifies what’s the input audio.

  • storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices.

  • offset (float) – an optional offset in seconds for where to start reading the recording.

  • duration (Optional[float]) – an optional duration specifying how much audio to load from the recording.

  • channels (Union[int, List[int], None]) – an optional int or list of ints, specifying the channels; by default, all channels will be used.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.

Return type:

Features

Returns:

a Features manifest item for the extracted feature matrix.

extract_from_samples_and_store(samples, storage, sampling_rate, offset=0, channel=None, augment_fn=None)

Extract the features from an array of audio samples in a full pipeline:

  • optional audio augmentation;

  • extract the features;

  • save them to disk in a specified directory;

  • return a Features object with a description of the extracted features.

Note that unlike in extract_from_recording_and_store, the returned Features object might not be suitable to store in a FeatureSet, as it does not reference any particular Recording. Instead, this method is useful when extracting features from cuts - especially MixedCut instances, which may be created from multiple recordings and channels.

Parameters:
  • samples (ndarray) – a numpy ndarray with the audio samples.

  • sampling_rate (int) – integer sampling rate of samples.

  • storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices.

  • offset (float) – an offset in seconds for where to start reading the recording - when used for Cut feature extraction, must be equal to Cut.start.

  • channel (Union[List[int], int, None]) – an optional channel number(s) to insert into Features manifest.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.

Return type:

Features

Returns:

a Features manifest item for the extracted feature matrix (it is not written to disk).

property frame_shift: float
classmethod from_dict(data)
Return type:

FeatureExtractor

classmethod from_yaml(path)
Return type:

FeatureExtractor

to_dict()
Return type:

Dict[str, Any]

to_yaml(path)
class lhotse.features.mfcc.TorchaudioMfccConfig(dither=0.0, window_type='povey', frame_length=0.025, frame_shift=0.01, remove_dc_offset=True, round_to_power_of_two=True, energy_floor=1e-10, min_duration=0.0, preemphasis_coefficient=0.97, raw_energy=True, low_freq=20.0, high_freq=-400.0, num_mel_bins=23, use_energy=False, vtln_low=100.0, vtln_high=-500.0, vtln_warp=1.0, cepstral_lifter=22.0, num_ceps=13)[source]
dither: float = 0.0
window_type: str = 'povey'
frame_length: float = 0.025
frame_shift: float = 0.01
remove_dc_offset: bool = True
round_to_power_of_two: bool = True
energy_floor: float = 1e-10
min_duration: float = 0.0
preemphasis_coefficient: float = 0.97
raw_energy: bool = True
low_freq: float = 20.0
high_freq: float = -400.0
num_mel_bins: int = 23
use_energy: bool = False
vtln_low: float = 100.0
vtln_high: float = -500.0
vtln_warp: float = 1.0
cepstral_lifter: float = 22.0
num_ceps: int = 13
to_dict()[source]
Return type:

Dict[str, Any]

static from_dict(data)[source]
Return type:

TorchaudioMfccConfig

__init__(dither=0.0, window_type='povey', frame_length=0.025, frame_shift=0.01, remove_dc_offset=True, round_to_power_of_two=True, energy_floor=1e-10, min_duration=0.0, preemphasis_coefficient=0.97, raw_energy=True, low_freq=20.0, high_freq=-400.0, num_mel_bins=23, use_energy=False, vtln_low=100.0, vtln_high=-500.0, vtln_warp=1.0, cepstral_lifter=22.0, num_ceps=13)
class lhotse.features.mfcc.TorchaudioMfcc(config=None)[source]

MFCC feature extractor based on torchaudio.compliance.kaldi.mfcc function.
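
Usage mirrors TorchaudioFbank; a short sketch (illustrative) showing that the feature dimension equals num_ceps:

>>> # xdoctest: +SKIP("illustrative sketch")
>>> from lhotse.features.mfcc import TorchaudioMfcc, TorchaudioMfccConfig
>>> extractor = TorchaudioMfcc(TorchaudioMfccConfig(num_ceps=13))
>>> extractor.feature_dim(16000)
13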

name = 'mfcc'
config_type

alias of TorchaudioMfccConfig

feature_dim(sampling_rate)[source]
Return type:

int

__init__(config=None)
static compute_energy(features)

Compute the total energy of a feature matrix. How the energy is computed depends on a particular type of features. It is expected that when implemented, compute_energy will never return zero.

Parameters:

features (ndarray) – A feature matrix.

Return type:

float

Returns:

A positive float value of the signal energy.

property device: str | device
extract(samples, sampling_rate)

Defines how to extract features using a numpy ndarray of audio samples and the sampling rate.

Return type:

ndarray

Returns:

a numpy ndarray representing the feature matrix.

extract_batch(samples, sampling_rate, lengths=None)

Performs batch extraction. It is not guaranteed to be faster than FeatureExtractor.extract() – it depends on whether the implementation of a particular feature extractor supports accelerated batch computation. If lengths is provided, it is assumed that the input is a batch of padded sequences, so we will not perform any further collation.

Return type:

Union[ndarray, Tensor, List[ndarray], List[Tensor]]

Note

Unless overridden by child classes, it defaults to sequentially calling FeatureExtractor.extract() on the inputs.

Note

This method should support variable length inputs.

extract_from_recording_and_store(recording, storage, offset=0, duration=None, channels=None, augment_fn=None)

Extract the features from a Recording in a full pipeline:

  • load audio from disk;

  • optionally, perform audio augmentation;

  • extract the features;

  • save them to disk in a specified directory;

  • return a Features object with a description of the extracted features and the source data used.

Parameters:
  • recording (Recording) – a Recording that specifies what’s the input audio.

  • storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices.

  • offset (float) – an optional offset in seconds for where to start reading the recording.

  • duration (Optional[float]) – an optional duration specifying how much audio to load from the recording.

  • channels (Union[int, List[int], None]) – an optional int or list of ints, specifying the channels; by default, all channels will be used.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.

Return type:

Features

Returns:

a Features manifest item for the extracted feature matrix.

extract_from_samples_and_store(samples, storage, sampling_rate, offset=0, channel=None, augment_fn=None)

Extract the features from an array of audio samples in a full pipeline:

  • optional audio augmentation;

  • extract the features;

  • save them to disk in a specified directory;

  • return a Features object with a description of the extracted features.

Note that unlike in extract_from_recording_and_store, the returned Features object might not be suitable to store in a FeatureSet, as it does not reference any particular Recording. Instead, this method is useful when extracting features from cuts - especially MixedCut instances, which may be created from multiple recordings and channels.

Parameters:
  • samples (ndarray) – a numpy ndarray with the audio samples.

  • sampling_rate (int) – integer sampling rate of samples.

  • storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices.

  • offset (float) – an offset in seconds for where to start reading the recording - when used for Cut feature extraction, must be equal to Cut.start.

  • channel (Union[List[int], int, None]) – an optional channel number(s) to insert into Features manifest.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.

Return type:

Features

Returns:

a Features manifest item for the extracted feature matrix (it is not written to disk).

property frame_shift: float
classmethod from_dict(data)
Return type:

FeatureExtractor

classmethod from_yaml(path)
Return type:

FeatureExtractor

static mix(features_a, features_b, energy_scaling_factor_b)

Perform feature-domain mix of two signals, a and b, and return the mixed signal.

Parameters:
  • features_a (ndarray) – Left-hand side (reference) signal.

  • features_b (ndarray) – Right-hand side (mixed-in) signal.

  • energy_scaling_factor_b (float) – A scaling factor for features_b energy. It is used to achieve a specific SNR. E.g. to mix with an SNR of 10dB when both features_a and features_b energies are 100, the features_b signal energy needs to be scaled by 0.1. Since different features (e.g. spectrogram, fbank, MFCC) require different combination of transformations (e.g. exp, log, sqrt, pow) to allow mixing of two signals, the exact place where to apply energy_scaling_factor_b to the signal is determined by the implementer.

Return type:

ndarray

Returns:

A mixed feature matrix.

to_dict()
Return type:

Dict[str, Any]

to_yaml(path)
class lhotse.features.spectrogram.TorchaudioSpectrogramConfig(dither=0.0, window_type='povey', frame_length=0.025, frame_shift=0.01, remove_dc_offset=True, round_to_power_of_two=True, energy_floor=1e-10, min_duration=0.0, preemphasis_coefficient=0.97, raw_energy=True)[source]
dither: float = 0.0
window_type: str = 'povey'
frame_length: float = 0.025
frame_shift: float = 0.01
remove_dc_offset: bool = True
round_to_power_of_two: bool = True
energy_floor: float = 1e-10
min_duration: float = 0.0
preemphasis_coefficient: float = 0.97
raw_energy: bool = True
to_dict()[source]
Return type:

Dict[str, Any]

static from_dict(data)[source]
Return type:

TorchaudioSpectrogramConfig

__init__(dither=0.0, window_type='povey', frame_length=0.025, frame_shift=0.01, remove_dc_offset=True, round_to_power_of_two=True, energy_floor=1e-10, min_duration=0.0, preemphasis_coefficient=0.97, raw_energy=True)
class lhotse.features.spectrogram.TorchaudioSpectrogram(config=None)[source]

Log spectrogram feature extractor based on torchaudio.compliance.kaldi.spectrogram function.

name = 'spectrogram'
config_type

alias of TorchaudioSpectrogramConfig

feature_dim(sampling_rate)[source]
Return type:

int

static mix(features_a, features_b, energy_scaling_factor_b)[source]

Perform feature-domain mix of two signals, a and b, and return the mixed signal.

Parameters:
  • features_a (ndarray) – Left-hand side (reference) signal.

  • features_b (ndarray) – Right-hand side (mixed-in) signal.

  • energy_scaling_factor_b (float) – A scaling factor for features_b energy. It is used to achieve a specific SNR. E.g. to mix with an SNR of 10dB when both features_a and features_b energies are 100, the features_b signal energy needs to be scaled by 0.1. Since different features (e.g. spectrogram, fbank, MFCC) require different combination of transformations (e.g. exp, log, sqrt, pow) to allow mixing of two signals, the exact place where to apply energy_scaling_factor_b to the signal is determined by the implementer.

Return type:

ndarray

Returns:

A mixed feature matrix.

static compute_energy(features)[source]

Compute the total energy of a feature matrix. How the energy is computed depends on a particular type of features. It is expected that when implemented, compute_energy will never return zero.

Parameters:

features (ndarray) – A feature matrix.

Return type:

float

Returns:

A positive float value of the signal energy.

__init__(config=None)
property device: str | device
extract(samples, sampling_rate)

Defines how to extract features using a numpy ndarray of audio samples and the sampling rate.

Return type:

ndarray

Returns:

a numpy ndarray representing the feature matrix.

extract_batch(samples, sampling_rate, lengths=None)

Performs batch extraction. It is not guaranteed to be faster than FeatureExtractor.extract() – it depends on whether the implementation of a particular feature extractor supports accelerated batch computation. If lengths is provided, it is assumed that the input is a batch of padded sequences, so we will not perform any further collation.

Return type:

Union[ndarray, Tensor, List[ndarray], List[Tensor]]

Note

Unless overridden by child classes, it defaults to sequentially calling FeatureExtractor.extract() on the inputs.

Note

This method should support variable length inputs.

extract_from_recording_and_store(recording, storage, offset=0, duration=None, channels=None, augment_fn=None)

Extract the features from a Recording in a full pipeline:

  • load audio from disk;

  • optionally, perform audio augmentation;

  • extract the features;

  • save them to disk in a specified directory;

  • return a Features object with a description of the extracted features and the source data used.

Parameters:
  • recording (Recording) – a Recording that specifies what’s the input audio.

  • storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices.

  • offset (float) – an optional offset in seconds for where to start reading the recording.

  • duration (Optional[float]) – an optional duration specifying how much audio to load from the recording.

  • channels (Union[int, List[int], None]) – an optional int or list of ints, specifying the channels; by default, all channels will be used.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.

Return type:

Features

Returns:

a Features manifest item for the extracted feature matrix.

extract_from_samples_and_store(samples, storage, sampling_rate, offset=0, channel=None, augment_fn=None)

Extract the features from an array of audio samples in a full pipeline:

  • optional audio augmentation;

  • extract the features;

  • save them to disk in a specified directory;

  • return a Features object with a description of the extracted features.

Note that unlike in extract_from_recording_and_store, the returned Features object might not be suitable to store in a FeatureSet, as it does not reference any particular Recording. Instead, this method is useful when extracting features from cuts - especially MixedCut instances, which may be created from multiple recordings and channels.

Parameters:
  • samples (ndarray) – a numpy ndarray with the audio samples.

  • sampling_rate (int) – integer sampling rate of samples.

  • storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices.

  • offset (float) – an offset in seconds for where to start reading the recording - when used for Cut feature extraction, must be equal to Cut.start.

  • channel (Union[List[int], int, None]) – an optional channel number(s) to insert into Features manifest.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.

Return type:

Features

Returns:

a Features manifest item for the extracted feature matrix (it is not written to disk).

property frame_shift: float
classmethod from_dict(data)
Return type:

FeatureExtractor

classmethod from_yaml(path)
Return type:

FeatureExtractor

to_dict()
Return type:

Dict[str, Any]

to_yaml(path)

Librosa filter-bank

class lhotse.features.librosa_fbank.LibrosaFbankConfig(sampling_rate=22050, fft_size=1024, hop_size=256, win_length=None, window='hann', num_mel_bins=80, fmin=80, fmax=7600)[source]

Default librosa config with values consistent with various TTS projects.

This config is intended for use with popular TTS projects such as ParallelWaveGAN (https://github.com/kan-bayashi/ParallelWaveGAN).

Warning: You may need to normalize your features.

sampling_rate: int = 22050
fft_size: int = 1024
hop_size: int = 256
win_length: int = None
window: str = 'hann'
num_mel_bins: int = 80
fmin: int = 80
fmax: int = 7600
to_dict()[source]
Return type:

Dict[str, Any]

static from_dict(data)[source]
Return type:

LibrosaFbankConfig

__init__(sampling_rate=22050, fft_size=1024, hop_size=256, win_length=None, window='hann', num_mel_bins=80, fmin=80, fmax=7600)
lhotse.features.librosa_fbank.pad_or_truncate_features(feats, expected_num_frames, abs_tol=1, pad_value=-23.025850929940457)[source]
lhotse.features.librosa_fbank.logmelfilterbank(audio, sampling_rate, fft_size=1024, hop_size=256, win_length=None, window='hann', num_mel_bins=80, fmin=80, fmax=7600, eps=1e-10)[source]

Compute log-Mel filterbank feature.

Args:

audio (ndarray): Audio signal (T,).

sampling_rate (int): Sampling rate.

fft_size (int): FFT size.

hop_size (int): Hop size.

win_length (int): Window length. If set to None, it will be the same as fft_size.

window (str): Window function type.

num_mel_bins (int): Number of mel basis.

fmin (int): Minimum frequency in mel basis calculation.

fmax (int): Maximum frequency in mel basis calculation.

eps (float): Epsilon value to avoid inf in log calculation.

Returns:

ndarray: Log Mel filterbank feature (#source_feats, num_mel_bins).
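
A small sketch on synthetic audio using the defaults listed above (requires librosa to be installed):

>>> # xdoctest: +SKIP("requires librosa")
>>> import numpy as np
>>> from lhotse.features.librosa_fbank import logmelfilterbank
>>> audio = np.random.randn(22050).astype(np.float32)  # 1 s at 22.05 kHz
>>> feats = logmelfilterbank(audio, sampling_rate=22050)
>>> feats.shape[1]
80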

class lhotse.features.librosa_fbank.LibrosaFbank(config=None)[source]

Librosa fbank feature extractor

Differs from the Fbank extractor in that it uses the librosa backend for STFT and mel scale calculations. It can be easily configured to be compatible with existing speech-related projects that use librosa features.

name = 'librosa-fbank'
config_type

alias of LibrosaFbankConfig

property frame_shift: float
feature_dim(sampling_rate)[source]
Return type:

int

extract(samples, sampling_rate)[source]

Defines how to extract features using a numpy ndarray of audio samples and the sampling rate.

Return type:

ndarray

Returns:

a numpy ndarray representing the feature matrix.

static mix(features_a, features_b, energy_scaling_factor_b)[source]

Perform feature-domain mix of two signals, a and b, and return the mixed signal.

Parameters:
  • features_a (ndarray) – Left-hand side (reference) signal.

  • features_b (ndarray) – Right-hand side (mixed-in) signal.

  • energy_scaling_factor_b (float) – A scaling factor for features_b energy. It is used to achieve a specific SNR. E.g. to mix with an SNR of 10dB when both features_a and features_b energies are 100, the features_b signal energy needs to be scaled by 0.1. Since different features (e.g. spectrogram, fbank, MFCC) require different combination of transformations (e.g. exp, log, sqrt, pow) to allow mixing of two signals, the exact place where to apply energy_scaling_factor_b to the signal is determined by the implementer.

Return type:

ndarray

Returns:

A mixed feature matrix.

static compute_energy(features)[source]

Compute the total energy of a feature matrix. How the energy is computed depends on a particular type of features. It is expected that when implemented, compute_energy will never return zero.

Parameters:

features (ndarray) – A feature matrix.

Return type:

float

Returns:

A positive float value of the signal energy.

__init__(config=None)
property device: str | device
extract_batch(samples, sampling_rate, lengths=None)

Performs batch extraction. It is not guaranteed to be faster than FeatureExtractor.extract() – it depends on whether the implementation of a particular feature extractor supports accelerated batch computation. If lengths is provided, it is assumed that the input is a batch of padded sequences, so we will not perform any further collation.

Return type:

Union[ndarray, Tensor, List[ndarray], List[Tensor]]

Note

Unless overridden by child classes, it defaults to sequentially calling FeatureExtractor.extract() on the inputs.

Note

This method should support variable length inputs.

extract_from_recording_and_store(recording, storage, offset=0, duration=None, channels=None, augment_fn=None)

Extract the features from a Recording in a full pipeline:

  • load audio from disk;

  • optionally, perform audio augmentation;

  • extract the features;

  • save them to disk in a specified directory;

  • return a Features object with a description of the extracted features and the source data used.

Parameters:
  • recording (Recording) – a Recording that specifies what’s the input audio.

  • storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices.

  • offset (float) – an optional offset in seconds for where to start reading the recording.

  • duration (Optional[float]) – an optional duration specifying how much audio to load from the recording.

  • channels (Union[int, List[int], None]) – an optional int or list of ints, specifying the channels; by default, all channels will be used.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.

Return type:

Features

Returns:

a Features manifest item for the extracted feature matrix.

extract_from_samples_and_store(samples, storage, sampling_rate, offset=0, channel=None, augment_fn=None)

Extract the features from an array of audio samples in a full pipeline:

  • optional audio augmentation;

  • extract the features;

  • save them to disk in a specified directory;

  • return a Features object with a description of the extracted features.

Note that unlike in extract_from_recording_and_store, the returned Features object might not be suitable to store in a FeatureSet, as it does not reference any particular Recording. Instead, this method is useful when extracting features from cuts - especially MixedCut instances, which may be created from multiple recordings and channels.

Parameters:
  • samples (ndarray) – a numpy ndarray with the audio samples.

  • sampling_rate (int) – integer sampling rate of samples.

  • storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices.

  • offset (float) – an offset in seconds for where to start reading the recording - when used for Cut feature extraction, must be equal to Cut.start.

  • channel (Union[List[int], int, None]) – an optional channel number(s) to insert into Features manifest.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.

Return type:

Features

Returns:

a Features manifest item for the extracted feature matrix (it is not written to disk).

classmethod from_dict(data)
Return type:

FeatureExtractor

classmethod from_yaml(path)
Return type:

FeatureExtractor

to_dict()
Return type:

Dict[str, Any]

to_yaml(path)

Feature storage

class lhotse.features.io.FeaturesWriter[source]

FeaturesWriter defines the interface of how to store numpy arrays in a particular storage backend. This backend could either be:

  • separate files on a local filesystem;

  • a single file with multiple arrays;

  • cloud storage;

  • etc.

Each class inheriting from FeaturesWriter must define:

  • the write() method, which defines the storing operation (accepts a key used to place the value array in the storage);

  • the storage_path() property, which is either a common directory for the files, the name of the file storing multiple arrays, name of the cloud bucket, etc.;

  • the name() property that is unique to this particular storage mechanism - it is stored in the features manifests (metadata) and used to automatically deduce the backend when loading the features.

Each FeaturesWriter can also be used as a context manager, as some implementations might need to free a resource after the writing is finalized. By default nothing happens in the context manager functions, and this can be modified by the inheriting subclasses.

Example:

>>> with MyWriter('some/path') as storage:
...     extractor.extract_from_recording_and_store(recording, storage)

The features loading must be defined separately in a class inheriting from FeaturesReader.

abstract property name: str
abstract property storage_path: str
abstract write(key, value)[source]
Return type:

str

store_array(key, value, frame_shift=None, temporal_dim=None, start=0)[source]

Store a numpy array in the underlying storage and return a manifest describing how to retrieve the data.

If the array contains a temporal dimension (e.g. it represents the frame-level features, alignment, posteriors, etc. of an utterance) then temporal_dim and frame_shift may be specified to enable downstream padding, truncating, and partial reads of the array.

Parameters:
  • key (str) – An ID that uniquely identifies the array.

  • value (ndarray) – The array to be stored.

  • frame_shift (Optional[float]) – Optional float, when the array has a temporal dimension it indicates how much time has passed between the starts of consecutive frames (expressed in seconds).

  • temporal_dim (Optional[int]) – Optional int, when the array has a temporal dimension, it indicates which dim to interpret as temporal.

  • start (float) – Float, when the array is temporal, it indicates what is the offset of the array w.r.t. the start of recording. Useful for reading subsets of an array when it represents something computed from long recordings. Ignored for non-temporal arrays.

Return type:

Union[Array, TemporalArray]

Returns:

A manifest of type Array or TemporalArray, depending on the input arguments.
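
A short sketch of store_array using one of the writers from this module (the output directory is illustrative):

>>> # xdoctest: +SKIP("illustrative sketch")
>>> import numpy as np
>>> from lhotse.features.io import NumpyFilesWriter
>>> posteriors = np.zeros((100, 500), dtype=np.float32)  # 100 frames, 10 ms apart
>>> with NumpyFilesWriter('data/arrays') as writer:
...     manifest = writer.store_array('utt-1', posteriors, frame_shift=0.01, temporal_dim=0)
>>> manifest.load().shape
(100, 500)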

class lhotse.features.io.FeaturesReader[source]

FeaturesReader defines the interface of how to load numpy arrays from a particular storage backend. This backend could either be:

  • separate files on a local filesystem;

  • a single file with multiple arrays;

  • cloud storage;

  • etc.

Each class inheriting from FeaturesReader must define:

  • the read() method, which defines the loading operation (accepts the key to locate the array in the storage and returns it). The read method should support selecting only a subset of the feature matrix, with the bounds expressed as arguments left_offset_frames and right_offset_frames. It’s up to the Reader implementation to load only the required part or trim it to that range only after loading. It is assumed that the time dimension is always the first one.

  • the name() property that is unique to this particular storage mechanism - it is stored in the features manifests (metadata) and used to automatically deduce the backend when loading the features.

The features writing must be defined separately in a class inheriting from FeaturesWriter.

abstract property name: str
abstract read(key, left_offset_frames=0, right_offset_frames=None)[source]
Return type:

ndarray

lhotse.features.io.available_storage_backends()[source]
Return type:

List[str]

lhotse.features.io.register_reader(cls)[source]

Decorator used to add a new FeaturesReader to Lhotse’s registry.

Example:

@register_reader
class MyFeatureReader(FeaturesReader):
    ...
lhotse.features.io.register_writer(cls)[source]

Decorator used to add a new FeaturesWriter to Lhotse’s registry.

Example:

@register_writer
class MyFeatureWriter(FeaturesWriter):
    ...
lhotse.features.io.get_reader(name)[source]

Find a FeaturesReader sub-class that corresponds to the provided name and return its type.

Return type:

Type[FeaturesReader]

Example:

>>> reader_type = get_reader("lilcom_files")
>>> reader = reader_type("/storage/features/")

lhotse.features.io.get_writer(name)[source]

Find a FeaturesWriter sub-class that corresponds to the provided name and return its type.

Return type:

Type[FeaturesWriter]

Example:

>>> writer_type = get_writer("lilcom_files")
>>> writer = writer_type("/storage/features/")

class lhotse.features.io.LilcomFilesReader(storage_path, *args, **kwargs)[source]

Reads Lilcom-compressed files from a directory on the local filesystem. storage_path corresponds to the directory path; storage_key for each utterance is the name of the file in that directory.

name = 'lilcom_files'
__init__(storage_path, *args, **kwargs)[source]
read(key, left_offset_frames=0, right_offset_frames=None)[source]
Return type:

ndarray

class lhotse.features.io.LilcomFilesWriter(storage_path, tick_power=-5, *args, **kwargs)[source]

Writes Lilcom-compressed files to a directory on the local filesystem. storage_path corresponds to the directory path; storage_key for each utterance is the name of the file in that directory.

name = 'lilcom_files'
__init__(storage_path, tick_power=-5, *args, **kwargs)[source]
property storage_path: str
write(key, value)[source]
Return type:

str

store_array(key, value, frame_shift=None, temporal_dim=None, start=0)

Store a numpy array in the underlying storage and return a manifest describing how to retrieve the data.

If the array contains a temporal dimension (e.g. it represents the frame-level features, alignment, posteriors, etc. of an utterance) then temporal_dim and frame_shift may be specified to enable downstream padding, truncating, and partial reads of the array.

Parameters:
  • key (str) – An ID that uniquely identifies the array.

  • value (ndarray) – The array to be stored.

  • frame_shift (Optional[float]) – Optional float, when the array has a temporal dimension it indicates how much time has passed between the starts of consecutive frames (expressed in seconds).

  • temporal_dim (Optional[int]) – Optional int, when the array has a temporal dimension, it indicates which dim to interpret as temporal.

  • start (float) – Float, when the array is temporal, it indicates what is the offset of the array w.r.t. the start of recording. Useful for reading subsets of an array when it represents something computed from long recordings. Ignored for non-temporal arrays.

Return type:

Union[Array, TemporalArray]

Returns:

A manifest of type Array or TemporalArray, depending on the input arguments.

class lhotse.features.io.NumpyFilesReader(storage_path, *args, **kwargs)[source]

Reads non-compressed numpy arrays from files in a directory on the local filesystem. storage_path corresponds to the directory path; storage_key for each utterance is the name of the file in that directory.

name = 'numpy_files'
__init__(storage_path, *args, **kwargs)[source]
read(key, left_offset_frames=0, right_offset_frames=None)[source]
Return type:

ndarray

class lhotse.features.io.NumpyFilesWriter(storage_path, *args, **kwargs)[source]

Writes non-compressed numpy arrays to files in a directory on the local filesystem. storage_path corresponds to the directory path; storage_key for each utterance is the name of the file in that directory.

name = 'numpy_files'
__init__(storage_path, *args, **kwargs)[source]
property storage_path: str
write(key, value)[source]
Return type:

str

store_array(key, value, frame_shift=None, temporal_dim=None, start=0)

Store a numpy array in the underlying storage and return a manifest describing how to retrieve the data.

If the array contains a temporal dimension (e.g. it represents the frame-level features, alignment, posteriors, etc. of an utterance) then temporal_dim and frame_shift may be specified to enable downstream padding, truncating, and partial reads of the array.

Parameters:
  • key (str) – An ID that uniquely identifies the array.

  • value (ndarray) – The array to be stored.

  • frame_shift (Optional[float]) – Optional float, when the array has a temporal dimension it indicates how much time has passed between the starts of consecutive frames (expressed in seconds).

  • temporal_dim (Optional[int]) – Optional int, when the array has a temporal dimension, it indicates which dim to interpret as temporal.

  • start (float) – Float, when the array is temporal, it indicates what is the offset of the array w.r.t. the start of recording. Useful for reading subsets of an array when it represents something computed from long recordings. Ignored for non-temporal arrays.

Return type:

Union[Array, TemporalArray]

Returns:

A manifest of type Array or TemporalArray, depending on the input arguments.

lhotse.features.io.check_h5py_installed()[source]
lhotse.features.io.lookup_cache_or_open(storage_path)[source]

Helper internal function used in HDF5 readers. It opens the HDF5 files and keeps their handles open in a global program cache, to avoid an excessive number of syscalls when the Reader class is instantiated and destroyed in a loop repeatedly (a frequent use-case).

The file handles can be freed at any time by calling close_cached_file_handles().

lhotse.features.io.lookup_chunk_size(h5_file_handle)[source]

Helper internal function to retrieve the chunk size from an HDF5 file. Helps avoid unnecessary repeated disk reads.

Return type:

int

lhotse.features.io.close_cached_file_handles()[source]

Closes the cached file handles in lookup_cache_or_open and lookup_reader_cache_or_open (see respective docs for more details).

Return type:

None

class lhotse.features.io.NumpyHdf5Reader(storage_path, *args, **kwargs)[source]

Reads non-compressed numpy arrays from a HDF5 file with a “flat” layout. Each array is stored as a separate HDF Dataset because their shapes (numbers of frames) may vary. storage_path corresponds to the HDF5 file path; storage_key for each utterance is the key corresponding to the array (i.e. HDF5 “Group” name).

name = 'numpy_hdf5'
__init__(storage_path, *args, **kwargs)[source]
read(key, left_offset_frames=0, right_offset_frames=None)[source]
Return type:

ndarray

class lhotse.features.io.NumpyHdf5Writer(storage_path, mode='w', *args, **kwargs)[source]

Writes non-compressed numpy arrays to a HDF5 file with a “flat” layout. Each array is stored as a separate HDF Dataset because their shapes (numbers of frames) may vary. storage_path corresponds to the HDF5 file path; storage_key for each utterance is the key corresponding to the array (i.e. HDF5 “Group” name).

Internally, this class opens the file lazily so that this object can be passed between processes without issues. This simplifies the parallel feature extraction code.

name = 'numpy_hdf5'
__init__(storage_path, mode='w', *args, **kwargs)[source]
Parameters:
  • storage_path (Union[Path, str]) – Path under which we’ll create the HDF5 file. We will add a .h5 suffix if it is not already in storage_path.

  • mode (str) – Modes supported by h5py: 'w' creates the file, truncating it if it exists (default); 'w-' or 'x' creates the file, failing if it exists; 'a' reads/writes if the file exists and creates it otherwise.

property storage_path: str
write(key, value)[source]
Return type:

str

close()[source]
Return type:

None

store_array(key, value, frame_shift=None, temporal_dim=None, start=0)

Store a numpy array in the underlying storage and return a manifest describing how to retrieve the data.

If the array contains a temporal dimension (e.g. it represents the frame-level features, alignment, posteriors, etc. of an utterance) then temporal_dim and frame_shift may be specified to enable downstream padding, truncating, and partial reads of the array.

Parameters:
  • key (str) – An ID that uniquely identifies the array.

  • value (ndarray) – The array to be stored.

  • frame_shift (Optional[float]) – Optional float, when the array has a temporal dimension it indicates how much time has passed between the starts of consecutive frames (expressed in seconds).

  • temporal_dim (Optional[int]) – Optional int, when the array has a temporal dimension, it indicates which dim to interpret as temporal.

  • start (float) – Float, when the array is temporal, it indicates what is the offset of the array w.r.t. the start of recording. Useful for reading subsets of an array when it represents something computed from long recordings. Ignored for non-temporal arrays.

Return type:

Union[Array, TemporalArray]

Returns:

A manifest of type Array or TemporalArray, depending on the input arguments.
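
A round-trip usage sketch (the file name is arbitrary; the writer appends the .h5 suffix as described above):

>>> import numpy as np
>>> from lhotse.features.io import NumpyHdf5Reader, NumpyHdf5Writer
>>> feats = np.random.randn(200, 80).astype(np.float32)
>>> with NumpyHdf5Writer('feats') as writer:  # creates feats.h5
...     key = writer.write('utt1', feats)
>>> reader = NumpyHdf5Reader('feats.h5')
>>> np.testing.assert_equal(reader.read(key), feats)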

class lhotse.features.io.LilcomHdf5Reader(storage_path, *args, **kwargs)[source]

Reads lilcom-compressed numpy arrays from an HDF5 file with a “flat” layout. Each array is stored as a separate HDF Dataset because their shapes (numbers of frames) may vary. storage_path corresponds to the HDF5 file path; storage_key for each utterance is the key corresponding to the array (i.e. HDF5 “Group” name).

name = 'lilcom_hdf5'
__init__(storage_path, *args, **kwargs)[source]
read(key, left_offset_frames=0, right_offset_frames=None)[source]
Return type:

ndarray

class lhotse.features.io.LilcomHdf5Writer(storage_path, tick_power=-5, mode='w', *args, **kwargs)[source]

Writes lilcom-compressed numpy arrays to an HDF5 file with a “flat” layout. Each array is stored as a separate HDF Dataset because their shapes (numbers of frames) may vary. storage_path corresponds to the HDF5 file path; storage_key for each utterance is the key corresponding to the array (i.e. HDF5 “Group” name).

name = 'lilcom_hdf5'
__init__(storage_path, tick_power=-5, mode='w', *args, **kwargs)[source]
Parameters:
  • storage_path (Union[Path, str]) – Path under which we’ll create the HDF5 file. We will add a .h5 suffix if it is not already in storage_path.

  • tick_power (int) – Determines the lilcom compression accuracy; the input will be compressed to integer multiples of 2^tick_power.

  • mode (str) – Modes supported by h5py: “w” – create file, truncate if exists (default); “w-” or “x” – create file, fail if exists; “a” – read/write if exists, create otherwise.

property storage_path: str
write(key, value)[source]
Return type:

str

close()[source]
Return type:

None

store_array(key, value, frame_shift=None, temporal_dim=None, start=0)

Store a numpy array in the underlying storage and return a manifest describing how to retrieve the data.

If the array contains a temporal dimension (e.g. it represents the frame-level features, alignment, posteriors, etc. of an utterance) then temporal_dim and frame_shift may be specified to enable downstream padding, truncating, and partial reads of the array.

Parameters:
  • key (str) – An ID that uniquely identifies the array.

  • value (ndarray) – The array to be stored.

  • frame_shift (Optional[float]) – Optional float, when the array has a temporal dimension it indicates how much time has passed between the starts of consecutive frames (expressed in seconds).

  • temporal_dim (Optional[int]) – Optional int, when the array has a temporal dimension, it indicates which dim to interpret as temporal.

  • start (float) – Float, when the array is temporal, it indicates what is the offset of the array w.r.t. the start of recording. Useful for reading subsets of an array when it represents something computed from long recordings. Ignored for non-temporal arrays.

Return type:

Union[Array, TemporalArray]

Returns:

A manifest of type Array or TemporalArray, depending on the input arguments.

class lhotse.features.io.ChunkedLilcomHdf5Reader(storage_path, *args, **kwargs)[source]

Reads lilcom-compressed numpy arrays from an HDF5 file with chunked lilcom storage. Each feature matrix is stored in an array of chunks - binary data compressed with lilcom. Upon reading, we check how many chunks need to be retrieved to avoid excessive I/O.

storage_path corresponds to the HDF5 file path; storage_key for each utterance is the key corresponding to the array (i.e. HDF5 “Group” name).

name = 'chunked_lilcom_hdf5'
__init__(storage_path, *args, **kwargs)[source]
read(key, left_offset_frames=0, right_offset_frames=None)[source]
Return type:

ndarray

class lhotse.features.io.ChunkedLilcomHdf5Writer(storage_path, tick_power=-5, chunk_size=100, mode='w', *args, **kwargs)[source]

Writes lilcom-compressed numpy arrays to an HDF5 file with chunked lilcom storage. Each feature matrix is stored in an array of chunks - binary data compressed with lilcom. Upon reading, we check how many chunks need to be retrieved to avoid excessive I/O.

storage_path corresponds to the HDF5 file path; storage_key for each utterance is the key corresponding to the array (i.e. HDF5 “Group” name).

name = 'chunked_lilcom_hdf5'
__init__(storage_path, tick_power=-5, chunk_size=100, mode='w', *args, **kwargs)[source]
Parameters:
  • storage_path (Union[Path, str]) – Path under which we’ll create the HDF5 file. We will add a .h5 suffix if it is not already in storage_path.

  • tick_power (int) – Determines the lilcom compression accuracy; the input will be compressed to integer multiples of 2^tick_power.

  • chunk_size (int) – How many frames to store per chunk. Too low a number will require many reads for long feature matrices; too high a number will require reading more redundant data.

  • mode (str) – Modes supported by h5py: “w” – create file, truncate if exists (default); “w-” or “x” – create file, fail if exists; “a” – read/write if exists, create otherwise.

property storage_path: str
write(key, value)[source]
Return type:

str

close()[source]
Return type:

None

store_array(key, value, frame_shift=None, temporal_dim=None, start=0)

Store a numpy array in the underlying storage and return a manifest describing how to retrieve the data.

If the array contains a temporal dimension (e.g. it represents the frame-level features, alignment, posteriors, etc. of an utterance) then temporal_dim and frame_shift may be specified to enable downstream padding, truncating, and partial reads of the array.

Parameters:
  • key (str) – An ID that uniquely identifies the array.

  • value (ndarray) – The array to be stored.

  • frame_shift (Optional[float]) – Optional float, when the array has a temporal dimension it indicates how much time has passed between the starts of consecutive frames (expressed in seconds).

  • temporal_dim (Optional[int]) – Optional int, when the array has a temporal dimension, it indicates which dim to interpret as temporal.

  • start (float) – Float, when the array is temporal, it indicates what is the offset of the array w.r.t. the start of recording. Useful for reading subsets of an array when it represents something computed from long recordings. Ignored for non-temporal arrays.

Return type:

Union[Array, TemporalArray]

Returns:

A manifest of type Array or TemporalArray, depending on the input arguments.

class lhotse.features.io.LilcomChunkyReader(storage_path, *args, **kwargs)[source]

Reads lilcom-compressed numpy arrays from a binary file with chunked lilcom storage. Each feature matrix is stored in an array of chunks - binary data compressed with lilcom. Upon reading, we check how many chunks need to be retrieved to avoid excessive I/O.

storage_path corresponds to the binary file path.

storage_key for each utterance is a comma-separated list of offsets in the file. The first number is the offset for the whole array, and the following numbers are relative offsets for each chunk. These offsets are relative to the previous chunk start.

name = 'lilcom_chunky'
CHUNK_SIZE = 500
__init__(storage_path, *args, **kwargs)[source]
read(key, left_offset_frames=0, right_offset_frames=None)[source]
Return type:

ndarray

class lhotse.features.io.LilcomChunkyWriter(storage_path, tick_power=-5, mode='wb', *args, **kwargs)[source]

Writes lilcom-compressed numpy arrays to a binary file with chunked lilcom storage. Each feature matrix is stored in an array of chunks - binary data compressed with lilcom. Upon reading, we check how many chunks need to be retrieved to avoid excessive I/O.

storage_path corresponds to the binary file path.

storage_key for each utterance is a comma-separated list of offsets in the file. The first number is the offset for the whole array, and the following numbers are relative offsets for each chunk. These offsets are relative to the previous chunk start.

name = 'lilcom_chunky'
CHUNK_SIZE = 500
__init__(storage_path, tick_power=-5, mode='wb', *args, **kwargs)[source]
Parameters:
  • storage_path (Union[Path, str]) – Path under which we’ll create the binary file.

  • tick_power (int) – Determines the lilcom compression accuracy; the input will be compressed to integer multiples of 2^tick_power.

  • chunk_size – How many frames to store per chunk. Too low a number will require many reads for long feature matrices; too high a number will require reading more redundant data.

  • mode (str) – Writing mode, one of: “w” (write) or “a” (append). “wb” and “ab” are also accepted; the “b” (binary) is implicit.

property storage_path: str
write(key, value)[source]
Return type:

str

close()[source]
Return type:

None

store_array(key, value, frame_shift=None, temporal_dim=None, start=0)

Store a numpy array in the underlying storage and return a manifest describing how to retrieve the data.

If the array contains a temporal dimension (e.g. it represents the frame-level features, alignment, posteriors, etc. of an utterance) then temporal_dim and frame_shift may be specified to enable downstream padding, truncating, and partial reads of the array.

Parameters:
  • key (str) – An ID that uniquely identifies the array.

  • value (ndarray) – The array to be stored.

  • frame_shift (Optional[float]) – Optional float, when the array has a temporal dimension it indicates how much time has passed between the starts of consecutive frames (expressed in seconds).

  • temporal_dim (Optional[int]) – Optional int, when the array has a temporal dimension, it indicates which dim to interpret as temporal.

  • start (float) – Float, when the array is temporal, it indicates what is the offset of the array w.r.t. the start of recording. Useful for reading subsets of an array when it represents something computed from long recordings. Ignored for non-temporal arrays.

Return type:

Union[Array, TemporalArray]

Returns:

A manifest of type Array or TemporalArray, depending on the input arguments.
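
A sketch of storing a temporal array and reading back only a slice of it. The file name and array are hypothetical, and we assume the returned TemporalArray manifest exposes load() with start/duration arguments:

>>> import numpy as np
>>> from lhotse.features.io import LilcomChunkyWriter
>>> posteriors = np.random.rand(300, 500).astype(np.float32)  # 300 frames at a 10ms shift
>>> with LilcomChunkyWriter('posteriors.lca') as writer:
...     manifest = writer.store_array(
...         key='utt1', value=posteriors, frame_shift=0.01, temporal_dim=0,
...     )
>>> chunk = manifest.load(start=1.0, duration=0.5)  # reads ~50 frames instead of the whole array
>>> chunk.shape
(50, 500)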

class lhotse.features.io.LilcomURLReader(storage_path, *args, **kwargs)[source]

Downloads Lilcom-compressed files from a URL (S3, GCP, Azure, HTTP, etc.). storage_path corresponds to the root URL (e.g. “s3://my-data-bucket”); storage_key will be concatenated with storage_path to form the full URL (e.g. “my-feature-file.llc”).

Caution

Requires smart_open to be installed (pip install smart_open).

name = 'lilcom_url'
__init__(storage_path, *args, **kwargs)[source]
read(key, left_offset_frames=0, right_offset_frames=None)[source]
Return type:

ndarray

class lhotse.features.io.LilcomURLWriter(storage_path, tick_power=-5, *args, **kwargs)[source]

Writes Lilcom-compressed files to a URL (S3, GCP, Azure, HTTP, etc.). storage_path corresponds to the root URL (e.g. “s3://my-data-bucket”); storage_key will be concatenated with storage_path to form the full URL (e.g. “my-feature-file.llc”).

Caution

Requires smart_open to be installed (pip install smart_open).

name = 'lilcom_url'
__init__(storage_path, tick_power=-5, *args, **kwargs)[source]
property storage_path: str
write(key, value)[source]
Return type:

str

store_array(key, value, frame_shift=None, temporal_dim=None, start=0)

Store a numpy array in the underlying storage and return a manifest describing how to retrieve the data.

If the array contains a temporal dimension (e.g. it represents the frame-level features, alignment, posteriors, etc. of an utterance) then temporal_dim and frame_shift may be specified to enable downstream padding, truncating, and partial reads of the array.

Parameters:
  • key (str) – An ID that uniquely identifies the array.

  • value (ndarray) – The array to be stored.

  • frame_shift (Optional[float]) – Optional float, when the array has a temporal dimension it indicates how much time has passed between the starts of consecutive frames (expressed in seconds).

  • temporal_dim (Optional[int]) – Optional int, when the array has a temporal dimension, it indicates which dim to interpret as temporal.

  • start (float) – Float, when the array is temporal, it indicates what is the offset of the array w.r.t. the start of recording. Useful for reading subsets of an array when it represents something computed from long recordings. Ignored for non-temporal arrays.

Return type:

Union[Array, TemporalArray]

Returns:

A manifest of type Array or TemporalArray, depending on the input arguments.
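
A usage sketch; the bucket URL is hypothetical, and smart_open plus valid cloud credentials are required:

>>> import numpy as np
>>> from lhotse.features.io import LilcomURLReader, LilcomURLWriter
>>> feats = np.random.randn(100, 80).astype(np.float32)
>>> writer = LilcomURLWriter('s3://my-data-bucket')
>>> key = writer.write('my-feature-file.llc', feats)
>>> reader = LilcomURLReader('s3://my-data-bucket')
>>> restored = reader.read(key)  # approximately equal to feats (lilcom is lossy)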

lhotse.features.io.check_kaldi_native_io_installed()[source]
lhotse.features.io.lookup_reader_cache_or_open(storage_path)[source]

Helper internal function used in KaldiReader. It opens kaldi scp files and keeps their handles open in a global program cache to avoid excessive amount of syscalls when the Reader class is instantiated and destroyed in a loop repeatedly (frequent use-case).

The file handles can be freed at any time by calling close_cached_file_handles().

class lhotse.features.io.KaldiReader(storage_path, *args, **kwargs)[source]

Reads Kaldi’s “feats.scp” file using kaldi_native_io. storage_path corresponds to the path to feats.scp. storage_key corresponds to the utterance-id in Kaldi.

Caution

Requires kaldi_native_io to be installed (pip install kaldi_native_io).

name = 'kaldiio'
__init__(storage_path, *args, **kwargs)[source]
read(key, left_offset_frames=0, right_offset_frames=None)[source]
Return type:

ndarray

class lhotse.features.io.KaldiWriter(storage_path, compression_method=1, *args, **kwargs)[source]

Write data to Kaldi’s “feats.scp” and “feats.ark” files using kaldi_native_io. storage_path corresponds to a directory where we’ll create “feats.scp” and “feats.ark” files. storage_key corresponds to the utterance-id in Kaldi.

The following compression_method values are supported by kaldi_native_io:

kAutomaticMethod = 1
kSpeechFeature = 2
kTwoByteAuto = 3
kTwoByteSignedInteger = 4
kOneByteAuto = 5
kOneByteUnsignedInteger = 6
kOneByteZeroOne = 7

Note

Setting compression_method works only with 2D arrays.

Example:

>>> import numpy as np
>>> from lhotse.features.io import KaldiReader, KaldiWriter
>>> data = np.random.randn(131, 80)
>>> with KaldiWriter('featdir') as w:
...     w.write('utt1', data)
>>> reader = KaldiReader('featdir/feats.scp')
>>> read_data = reader.read('utt1')
>>> np.testing.assert_equal(data, read_data)

Caution

Requires kaldi_native_io to be installed (pip install kaldi_native_io).

name = 'kaldiio'
__init__(storage_path, compression_method=1, *args, **kwargs)[source]
property storage_path: str
write(key, value)[source]
Return type:

str

close()[source]
Return type:

None

store_array(key, value, frame_shift=None, temporal_dim=None, start=0)

Store a numpy array in the underlying storage and return a manifest describing how to retrieve the data.

If the array contains a temporal dimension (e.g. it represents the frame-level features, alignment, posteriors, etc. of an utterance) then temporal_dim and frame_shift may be specified to enable downstream padding, truncating, and partial reads of the array.

Parameters:
  • key (str) – An ID that uniquely identifies the array.

  • value (ndarray) – The array to be stored.

  • frame_shift (Optional[float]) – Optional float, when the array has a temporal dimension it indicates how much time has passed between the starts of consecutive frames (expressed in seconds).

  • temporal_dim (Optional[int]) – Optional int, when the array has a temporal dimension, it indicates which dim to interpret as temporal.

  • start (float) – Float, when the array is temporal, it indicates what is the offset of the array w.r.t. the start of recording. Useful for reading subsets of an array when it represents something computed from long recordings. Ignored for non-temporal arrays.

Return type:

Union[Array, TemporalArray]

Returns:

A manifest of type Array or TemporalArray, depending on the input arguments.

lhotse.features.io.get_memory_writer(name)[source]
class lhotse.features.io.MemoryLilcomReader(*args, **kwargs)[source]
name = 'memory_lilcom'
__init__(*args, **kwargs)[source]
read(raw_data, left_offset_frames=0, right_offset_frames=None)[source]
Return type:

ndarray

class lhotse.features.io.MemoryLilcomWriter(*args, lilcom_tick_power=-5, **kwargs)[source]
name = 'memory_lilcom'
__init__(*args, lilcom_tick_power=-5, **kwargs)[source]
property storage_path: None
write(key, value)[source]
Return type:

bytes

close()[source]
Return type:

None

store_array(key, value, frame_shift=None, temporal_dim=None, start=0)

Store a numpy array in the underlying storage and return a manifest describing how to retrieve the data.

If the array contains a temporal dimension (e.g. it represents the frame-level features, alignment, posteriors, etc. of an utterance) then temporal_dim and frame_shift may be specified to enable downstream padding, truncating, and partial reads of the array.

Parameters:
  • key (str) – An ID that uniquely identifies the array.

  • value (ndarray) – The array to be stored.

  • frame_shift (Optional[float]) – Optional float, when the array has a temporal dimension it indicates how much time has passed between the starts of consecutive frames (expressed in seconds).

  • temporal_dim (Optional[int]) – Optional int, when the array has a temporal dimension, it indicates which dim to interpret as temporal.

  • start (float) – Float, when the array is temporal, it indicates what is the offset of the array w.r.t. the start of recording. Useful for reading subsets of an array when it represents something computed from long recordings. Ignored for non-temporal arrays.

Return type:

Union[Array, TemporalArray]

Returns:

A manifest of type Array or TemporalArray, depending on the input arguments.
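
An in-memory round-trip sketch; note that write() returns the compressed bytes themselves, and those bytes (not a key) are what the reader consumes:

>>> import numpy as np
>>> from lhotse.features.io import MemoryLilcomReader, MemoryLilcomWriter
>>> feats = np.random.randn(50, 40).astype(np.float32)
>>> writer = MemoryLilcomWriter()
>>> blob = writer.write('utt1', feats)  # lilcom-compressed bytes
>>> reader = MemoryLilcomReader()
>>> restored = reader.read(blob)  # approximately equal to feats (lilcom is lossy)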

class lhotse.features.io.MemoryRawReader(*args, **kwargs)[source]
name = 'memory_raw'
__init__(*args, **kwargs)[source]
read(raw_data, left_offset_frames=0, right_offset_frames=None)[source]
Return type:

ndarray

class lhotse.features.io.MemoryRawWriter(*args, **kwargs)[source]
name = 'memory_raw'
__init__(*args, **kwargs)[source]
property storage_path: None
write(key, value)[source]
Return type:

bytes

close()[source]
Return type:

None

store_array(key, value, frame_shift=None, temporal_dim=None, start=0)

Store a numpy array in the underlying storage and return a manifest describing how to retrieve the data.

If the array contains a temporal dimension (e.g. it represents the frame-level features, alignment, posteriors, etc. of an utterance) then temporal_dim and frame_shift may be specified to enable downstream padding, truncating, and partial reads of the array.

Parameters:
  • key (str) – An ID that uniquely identifies the array.

  • value (ndarray) – The array to be stored.

  • frame_shift (Optional[float]) – Optional float, when the array has a temporal dimension it indicates how much time has passed between the starts of consecutive frames (expressed in seconds).

  • temporal_dim (Optional[int]) – Optional int, when the array has a temporal dimension, it indicates which dim to interpret as temporal.

  • start (float) – Float, when the array is temporal, it indicates what is the offset of the array w.r.t. the start of recording. Useful for reading subsets of an array when it represents something computed from long recordings. Ignored for non-temporal arrays.

Return type:

Union[Array, TemporalArray]

Returns:

A manifest of type Array or TemporalArray, depending on the input arguments.

class lhotse.features.io.MemoryNpyReader(*args, **kwargs)[source]
name = 'memory_npy'
__init__(*args, **kwargs)[source]
read(raw_data, left_offset_frames=0, right_offset_frames=None)[source]
Return type:

ndarray

class lhotse.features.io.DummySharReader(*args, **kwargs)[source]
name = 'shar'
__init__(*args, **kwargs)[source]
read(*args, **kwargs)[source]
Return type:

ndarray

Feature-domain mixing

class lhotse.features.mixer.FeatureMixer(feature_extractor, base_feats, frame_shift, padding_value=-1000.0, reference_energy=None)[source]

Utility class to mix multiple feature matrices into a single one. It should be instantiated separately for each mixing session (i.e. each MixedCut will create a separate FeatureMixer to mix its tracks). It is initialized with a numpy array of features (typically float32) that represents the “reference” signal for the mix. Other signals can be mixed to it with different time offsets and SNRs using the add_to_mix method. The time offset is relative to the start of the reference signal (only positive values are supported). The SNR is relative to the energy of the signal used to initialize the FeatureMixer.

It relies on the FeatureExtractor to have defined mix and compute_energy methods, so that the FeatureMixer knows how to scale and add two feature matrices together.

__init__(feature_extractor, base_feats, frame_shift, padding_value=-1000.0, reference_energy=None)[source]

FeatureMixer’s constructor.

Parameters:
  • feature_extractor (FeatureExtractor) – The FeatureExtractor instance that specifies how to mix the features.

  • base_feats (ndarray) – The features used to initialize the FeatureMixer are a point of reference in terms of energy and offset for all features mixed into them.

  • frame_shift (float) – Required to correctly compute offset and padding during the mix.

  • padding_value (float) – The value used to pad the shorter features during the mix. This value is adequate only for log space features. For non-log space features, e.g. energies, use either 0 or a small positive value like 1e-5.

  • reference_energy (Optional[float]) – Optionally pass a reference energy value to compute SNRs against. This might be required when base_feats correspond to padding energies.

property num_features
property unmixed_feats: ndarray

Return a numpy ndarray with the shape (num_tracks, num_frames, num_features), where each track’s feature matrix is padded and scaled adequately to the offsets and SNRs used in the add_to_mix calls.

property mixed_feats: ndarray

Return a numpy ndarray with the shape (num_frames, num_features) - a mono mixed feature matrix of the tracks supplied with add_to_mix calls.

add_to_mix(feats, sampling_rate, snr=None, offset=0.0)[source]

Add the feature matrix of a new track into the mix.

Parameters:
  • feats (ndarray) – A 2D feature matrix to be mixed in.

  • sampling_rate (int) – The sampling rate of feats.

  • snr (Optional[float]) – Signal-to-noise ratio, assuming feats represents noise (positive SNR - lower feats energy, negative SNR - higher feats energy).

  • offset (float) – How many seconds to shift feats in time. For mixing, the signal will be padded before the start with low energy values.
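
A minimal mixing sketch (hypothetical values, assuming 80-bin log-Mel features at a 10 ms frame shift; the random matrices stand in for real log-space features):

>>> import numpy as np
>>> from lhotse import Fbank
>>> from lhotse.features.mixer import FeatureMixer
>>> extractor = Fbank()
>>> base = np.random.randn(300, 80).astype(np.float32)   # 3s reference track
>>> other = np.random.randn(100, 80).astype(np.float32)  # 1s track to mix in
>>> mixer = FeatureMixer(extractor, base, frame_shift=0.01)
>>> mixer.add_to_mix(other, sampling_rate=16000, snr=10, offset=1.0)
>>> mixer.mixed_feats.shape  # other occupies frames 100-200 of the mix
(300, 80)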

Augmentation

Cuts

Data structures and tools used to create training/testing examples.

The following is the hierarchy of imports in this module (to avoid circular imports):

  • __init__.py imports: mono.MonoCut, multi.MultiCut, mixed.MixedCut, padding.PaddingCut, and set.CutSet

  • mono.MonoCut and multi.MultiCut import data.DataCut

  • data.DataCut, mixed.MixedCut, and padding.PaddingCut import base.Cut

  • set.CutSet imports all of the cut types above

class lhotse.cut.Cut[source]

Caution

Cut is just an abstract class – the actual logic is implemented by its child classes (scroll down for references).

Cut is a base class for audio cuts. An “audio cut” is a subset of a Recording – it can also be thought of as a “view” or a pointer to a chunk of audio. It is not limited to audio data – cuts may also point to (sub-spans of) precomputed Features.

Cuts are different from SupervisionSegment in that they may be arbitrarily longer or shorter than supervisions; cuts may even contain multiple supervisions for creating contextual training data, and unsupervised regions that provide real or synthetic acoustic background context for the supervised segments.

The following example visualizes how a cut may represent a part of a single-channel recording with two utterances and some background noise in between:

                  Recording
|-------------------------------------------|
"Hey, Matt!"     "Yes?"        "Oh, nothing"
|----------|     |----|        |-----------|
           Cut1
|------------------------|

This scenario can be represented in code, using MonoCut, as:

>>> from lhotse import Recording, SupervisionSegment, MonoCut
>>> rec = Recording(id='rec1', duration=10.0, sampling_rate=8000, num_samples=80000, sources=[...])
>>> sups = [
...     SupervisionSegment(id='sup1', recording_id='rec1', start=0, duration=3.37, text='Hey, Matt!'),
...     SupervisionSegment(id='sup2', recording_id='rec1', start=4.5, duration=0.9, text='Yes?'),
...     SupervisionSegment(id='sup3', recording_id='rec1', start=6.9, duration=2.9, text='Oh, nothing'),
... ]
>>> cut = MonoCut(id='rec1-cut1', start=0.0, duration=6.0, channel=0, recording=rec,
...     supervisions=[sups[0], sups[1]])

Note

All Cut classes assume that the SupervisionSegment time boundaries are relative to the beginning of the cut. E.g. if the underlying Recording starts at 0s (always true), the cut starts at 100s, and the SupervisionSegment inside the cut starts at 3s, it really did start at the 103rd second of the recording. In some cases, the supervision might have a negative start, or a duration exceeding the duration of the cut; this means that the supervision in the recording extends beyond the cut.

Cut allows you to check for and read audio data or feature data:

>>> assert cut.has_recording
>>> samples = cut.load_audio()
>>> if cut.has_features:
...     feats = cut.load_features()

It can be visualized, and listened to, inside Jupyter Notebooks:

>>> cut.plot_audio()
>>> cut.play_audio()
>>> cut.plot_features()

Cuts can be used with Lhotse’s FeatureExtractor to compute features.

>>> from lhotse import Fbank
>>> feats = cut.compute_features(extractor=Fbank())

It is also possible to use a FeaturesWriter to store the features and attach their manifest to a copy of the cut:

>>> from lhotse import LilcomChunkyWriter
>>> with LilcomChunkyWriter('feats.lca') as storage:
...     cut_with_feats = cut.compute_and_store_features(
...         extractor=Fbank(),
...         storage=storage
...     )

Cuts have several methods that allow their manipulation, transformation, and mixing. Some examples (see the respective methods documentation for details):

>>> cut_2_to_4s = cut.truncate(offset=2, duration=2)
>>> cut_padded = cut.pad(duration=10.0)
>>> cut_extended = cut.extend_by(duration=5.0, direction='both')
>>> cut_mixed = cut.mix(other_cut, offset_other_by=5.0, snr=20)
>>> cut_append = cut.append(other_cut)
>>> cut_24k = cut.resample(24000)
>>> cut_sp = cut.perturb_speed(1.1)
>>> cut_vp = cut.perturb_volume(2.)
>>> cut_rvb = cut.reverb_rir(rir_recording)

Note

All cut transformations are performed lazily, on-the-fly, upon calling load_audio or load_features. The stored waveforms and features are untouched.

Caution

Operations on cuts are not mutating – they return modified copies of Cut objects, leaving the original object unmodified.

A Cut that contains multiple segments (SupervisionSegment) can be decomposed into smaller cuts that correspond directly to supervisions:

>>> smaller_cuts = cut.trim_to_supervisions()

Cuts can be detached from parts of their metadata:

>>> cut_no_feat = cut.drop_features()
>>> cut_no_rec = cut.drop_recording()
>>> cut_no_sup = cut.drop_supervisions()

Finally, cuts provide convenience methods to compute feature frame and audio sample masks for supervised regions:

>>> sup_frames = cut.supervisions_feature_mask()
>>> sup_samples = cut.supervisions_audio_mask()

See also:

id: str
start: float
duration: float
sampling_rate: int
supervisions: List[SupervisionSegment]
num_samples: Optional[int]
num_frames: Optional[int]
num_features: Optional[int]
frame_shift: Optional[float]
features_type: Optional[str]
has_recording: bool
has_features: bool
has_video: bool
video: Optional[VideoInfo]
from_dict: Callable[[Dict], Cut]
load_audio: Callable[[], ndarray]
load_video: Callable[[], Tuple[Tensor, Optional[Tensor]]]
load_features: Callable[[], ndarray]
compute_and_store_features: Callable
drop_features: Callable
drop_recording: Callable
drop_supervisions: Callable
truncate: Callable
pad: Callable
extend_by: Callable
resample: Callable
perturb_speed: Callable
perturb_tempo: Callable
perturb_volume: Callable
reverb_rir: Callable
map_supervisions: Callable
merge_supervisions: Callable
filter_supervisions: Callable
fill_supervision: Callable
with_features_path_prefix: Callable
with_recording_path_prefix: Callable
property end: float
to_dict()[source]
Return type:

dict

property has_overlapping_supervisions: bool
property trimmed_supervisions: List[SupervisionSegment]

Return the supervisions in this Cut that have modified time boundaries so as not to exceed the Cut’s start or end.

Note that when cut.supervisions is called, the supervisions may have negative start values that indicate the supervision actually begins before the cut, or end values that exceed the Cut’s duration (it means the supervision continued in the original recording after the Cut’s ending).

Caution

For some tasks such as speech recognition (ASR), trimmed supervisions could result in corrupted training data. This is because a part of the transcript might actually reside outside of the cut.

split(timestamp)[source]

Split a cut into two cuts at timestamp, which is measured from the start of the cut. For example, a [0s - 10s] cut split at 4s yields:

  • left cut [0s - 4s]

  • right cut [4s - 10s]

Return type:

Tuple[Cut, Cut]
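
For instance (a sketch assuming cut is a 10-second cut):

>>> left, right = cut.split(timestamp=4.0)
>>> (left.duration, right.duration)
(4.0, 6.0)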

mix(other, offset_other_by=0.0, allow_padding=False, snr=None, preserve_id=None)[source]

Refer to the lhotse.cut.mix() documentation.

Return type:

Cut

append(other, snr=None, preserve_id=None)[source]

Append the other Cut after the current Cut. Conceptually the same as mix, but with an offset matching the current cut’s length. Optionally scale down (positive SNR) or scale up (negative SNR) the other cut. Returns a MixedCut, which only keeps the information about the mix; actual mixing is performed during the call to load_features.

Parameters:

preserve_id (Optional[str]) – optional string (“left”, “right”). When specified, append will preserve the cut ID of the left- or right-hand side argument. Otherwise, a new random ID is generated.

Return type:

Cut
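
A short sketch, assuming cut and other_cut are two compatible cuts:

>>> joined = cut.append(other_cut, preserve_id='left')
>>> joined.duration == cut.duration + other_cut.duration
True
>>> joined.id == cut.id  # 'left' preserves the left-hand side ID
True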

compute_features(extractor, augment_fn=None)[source]

Compute the features from this cut. This cut has to be able to load audio.

Parameters:
  • extractor (FeatureExtractor) – a FeatureExtractor instance used to compute the features.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – optional WavAugmenter instance for audio augmentation.

Return type:

ndarray

Returns:

a numpy ndarray with the computed features.

plot_audio()[source]

Display a plot of the waveform. Requires matplotlib to be installed.

play_audio()[source]

Display a Jupyter widget that allows to listen to the waveform. Works only in Jupyter notebook/lab or similar (e.g. Colab).

plot_features()[source]

Display the feature matrix as an image. Requires matplotlib to be installed.

plot_alignment(alignment_type='word')[source]

Display the alignment on top of a spectrogram. Requires matplotlib to be installed.

trim_to_supervisions(keep_overlapping=True, min_duration=None, context_direction='center', keep_all_channels=False)[source]

Splits the current Cut into as many cuts as there are supervisions (SupervisionSegment). These cuts have start times and durations identical to those of the supervisions. When there are overlapping supervisions, they can be kept or discarded via the keep_overlapping flag.

For example, the following cut:

        Cut
|-----------------|
 Sup1
|----|  Sup2
   |-----------|

is transformed into two cuts:

 Cut1
|----|
 Sup1
|----|
   Sup2
   |-|
        Cut2
   |-----------|
   Sup1
   |-|
        Sup2
   |-----------|

For the case of a multi-channel cut with multiple supervisions, we can either trim while respecting the supervision channels (in which case the output cut has the same channels as the supervision) or ignore the channels (in which case the output cut has the same channels as the input cut).

Hint

If the resulting trimmed cut contains a single supervision, we set the cut id to the id of this supervision, for better compatibility with downstream tools, e.g. comparing the hypothesis of ASR with the reference in icefall.

Hint

If a MultiCut is trimmed and the resulting trimmed cut contains a single channel, we convert it to a MonoCut.

Parameters:
  • keep_overlapping (bool) – when False, it will discard parts of other supervisions that overlap with the main supervision. In the illustration above, it would discard Sup2 in Cut1 and Sup1 in Cut2. In this mode, we guarantee that there will always be exactly one supervision per cut.

  • min_duration (Optional[float]) – An optional duration in seconds; specifying this argument will extend the cuts that would have been shorter than min_duration with actual acoustic context in the recording/features. If there are supervisions present in the context, they are kept when keep_overlapping is True. If there is not enough context, the returned cut will be shorter than min_duration. If the supervision segment is longer than min_duration, the returned cut will be longer.

  • context_direction (Literal['center', 'left', 'right', 'random']) – Which direction should the cut be expanded towards to include context. The value of “center” implies equal expansion to left and right; random uniformly samples a value between “left” and “right”.

  • keep_all_channels (bool) – If True, the output cut will have the same channels as the input cut. By default, the trimmed cut will have the same channels as the supervision.

Return type:

CutSet

Returns:

a list of cuts.

trim_to_alignments(type, max_pause=None, max_segment_duration=None, delimiter=' ', keep_all_channels=False)[source]

Splits the current Cut into its constituent alignment items (AlignmentItem). These cuts have start times and durations identical to those of the alignment items. Additionally, the max_pause option can be used to merge alignment items that are separated by a pause shorter than max_pause. If max_segment_duration is specified, we will keep merging consecutive segments until the duration of the merged segment exceeds max_segment_duration.

For the case of a multi-channel cut with multiple alignments, we can either trim while respecting the supervision channels (in which case the output cut has the same channels as the supervision) or ignore the channels (in which case the output cut has the same channels as the input cut).

Hint

If the resulting trimmed cut contains a single supervision, we set the cut id to the id of this supervision, for better compatibility with downstream tools, e.g. comparing the hypothesis of ASR with the reference in icefall.

Hint

If a MultiCut is trimmed and the resulting trimmed cut contains a single channel, we convert it to a MonoCut.

Hint

If you have a Cut with multiple supervision segments and you want to trim it to the word-level alignment, you can use the Cut.merge_supervisions() method first to merge the supervisions into a single one, followed by the Cut.trim_to_alignments() method. For example:

>>> cut = cut.merge_supervisions(type='word', delimiter=' ')
>>> cut = cut.trim_to_alignments(type='word', max_pause=1.0)

Hint

The above technique can also be used to segment long cuts into roughly equal duration segments, while respecting alignment boundaries. For example, to split a Cut into 10s segments, you can do:

>>> cut = cut.merge_supervisions(type='word', delimiter=' ')
>>> cut = cut.trim_to_alignments(type='word', max_pause=10.0, max_segment_duration=10.0)
Parameters:
  • type (str) – The type of the alignment to trim to (e.g. “word”).

  • max_pause (Optional[float]) – The maximum pause allowed between the alignments to merge them. If None, no merging will be performed. [default: None]

  • delimiter (str) – The delimiter to use when joining the alignment items.

  • keep_all_channels (bool) – If True, the output cut will have the same channels as the input cut. By default, the trimmed cut will have the same channels as the supervision.

  • num_jobs – Number of parallel workers to process the cuts.

Return type:

CutSet

Returns:

a CutSet object.

trim_to_supervision_groups(max_pause=0.0)[source]

Return a new CutSet with Cuts based on supervision groups. A supervision group is a set of supervisions with no gaps between them (or gaps shorter than max_pause). This is similar to the concept of an utterance group as described in this paper: https://arxiv.org/abs/2211.00482

For example, the following cut:

Cut

[diagram: a single cut containing two supervision groups – “Hello this is John.” and “Hey, John. How are you?” with no gap between them, then a pause, then “Hi” and “What do you do?” with no gap between them]

is transformed into two cuts:

[diagram: Cut 1 contains “Hello this is John.” and “Hey, John. How are you?”; Cut 2 contains “Hi” and “What do you do?”]

For the case of a multi-channel cut with multiple supervisions, we keep all the channels in the recording.

Parameters:

max_pause (float) – An optional duration in seconds; if the gap between two supervisions is longer than this, they will be treated as separate groups. By default, this is set to 0.0, which means that no gaps are allowed between supervisions.

Return type:

CutSet

Returns:

a CutSet.
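
For example, to group supervisions separated by pauses of at most 2 seconds (a sketch):

>>> groups = cut.trim_to_supervision_groups(max_pause=2.0)
>>> for group in groups:
...     print(group.duration, len(group.supervisions))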

cut_into_windows(duration, hop=None, keep_excessive_supervisions=True)[source]

Return a list of shorter cuts, made by traversing this cut in windows of duration seconds by hop seconds.

The last window might have a shorter duration if there was not enough audio, so you might want to either filter out or pad the resulting cuts.

Parameters:
  • duration (float) – Desired duration of the new cuts in seconds.

  • hop (Optional[float]) – Shift between the windows in the new cuts in seconds.

  • keep_excessive_supervisions (bool) – bool. When a cut is truncated in the middle of a supervision segment, should the supervision be kept.

Return type:

CutSet

Returns:

a list of cuts made from shorter duration windows.
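
For instance, a 12-second cut traversed in 5-second windows (a sketch; with hop left unset, it defaults to duration):

>>> windows = cut.cut_into_windows(duration=5.0)
>>> [w.duration for w in windows]
[5.0, 5.0, 2.0]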

index_supervisions(index_mixed_tracks=False, keep_ids=None)[source]

Create a two-level index of supervision segments. It is a mapping from a Cut’s ID to an interval tree that contains the supervisions of that Cut.

The interval tree can be efficiently queried for overlapping and/or enveloping segments. It helps speed up some operations on Cuts of very long recordings (1h+) that contain many supervisions.

Parameters:
  • index_mixed_tracks (bool) – Should the tracks of MixedCut’s be indexed as additional, separate entries.

  • keep_ids (Optional[Set[str]]) – If specified, we will only index the supervisions with the specified IDs.

Return type:

Dict[str, IntervalTree]

Returns:

a mapping from Cut ID to an interval tree of SupervisionSegments.
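
A query sketch; the index values are IntervalTree objects from the intervaltree package, which support queries such as .overlap(begin, end):

>>> index = cut.index_supervisions()
>>> tree = index[cut.id]
>>> hits = tree.overlap(2.0, 5.0)  # supervisions overlapping the 2s-5s span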

save_audio(storage_path, format=None, encoding=None, augment_fn=None, **kwargs)[source]

Store this cut’s waveform as audio recording to disk.

Parameters:
  • storage_path (Union[Path, str]) – The path to location where we will store the audio recordings.

  • format (Optional[str]) – Audio format argument supported by torchaudio.save or soundfile.write. Please refer to the relevant library’s documentation depending on which audio backend you’re using.

  • encoding (Optional[str]) – Audio encoding argument supported by torchaudio.save or soundfile.write. Please refer to the relevant library’s documentation depending on which audio backend you’re using.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional callable used for audio augmentation. Be careful with the types of augmentations used: if they modify the start/end/duration times of the cut and its supervisions, you will end up with incorrect supervision information when using this API. E.g. for speed perturbation, use CutSet.perturb_speed() instead.

  • kwargs – additional arguments passed to Cut.load_audio(). For example, if saving a MixedCut, we can specify mono_downmix=True to downmix the tracks to mono before saving.

Return type:

Cut

Returns:

a new Cut instance.

speakers_feature_mask(min_speaker_dim=None, speaker_to_idx_map=None, use_alignment_if_exists=None)[source]

Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_frames), and its values are 0 for nonspeech frames and 1 for speech frames for each respective speaker.

This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272

Parameters:
  • min_speaker_dim (Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have less speakers than that).

  • speaker_to_idx_map (Optional[Dict[str, int]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2)

  • use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, we fall back on the supervision time spans.

Return type:

ndarray

speakers_audio_mask(min_speaker_dim=None, speaker_to_idx_map=None, use_alignment_if_exists=None)[source]

Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_samples), and its values are 0 for nonspeech samples and 1 for speech samples for each respective speaker.

This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272

Parameters:
  • min_speaker_dim (Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have less speakers than that).

  • speaker_to_idx_map (Optional[Dict[str, int]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2)

  • use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, we fall back on the supervision time spans.

Return type:

ndarray

supervisions_feature_mask(use_alignment_if_exists=None)[source]

Return a 1D numpy array with value 1 for frames covered by at least one supervision, and 0 for frames not covered by any supervision.

Parameters:

use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, we fall back on the supervision time spans.

Return type:

ndarray

supervisions_audio_mask(use_alignment_if_exists=None)[source]

Return a 1D numpy array with value 1 for samples covered by at least one supervision, and 0 for samples not covered by any supervision.

Parameters:

use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, we fall back on the supervision time spans.

Return type:

ndarray
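
For example, the fraction of supervised (speech) samples in a cut can be estimated as follows (a sketch):

>>> mask = cut.supervisions_audio_mask()
>>> mask.shape == (cut.num_samples,)
True
>>> speech_fraction = mask.mean()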

with_id(id_)[source]

Return a copy of the Cut with a new ID.

Return type:

Cut

class lhotse.cut.CutSet(cuts=None)[source]

CutSet represents a collection of cuts. CutSet ties together all types of data – audio, features and supervisions, and is suitable to represent training/dev/test sets.

CutSet can be either “lazy” (acts as an iterable) which is best for representing full datasets, or “eager” (acts as a list), which is best for representing individual mini-batches (and sometimes test/dev datasets). Almost all operations are available for both modes, but some of them are more efficient depending on the mode (e.g. indexing an “eager” manifest is O(1)).

Note

CutSet is the basic building block of PyTorch-style Datasets for speech/audio processing tasks.

When coming from Kaldi, there is really no good equivalent – the closest concept may be Kaldi’s “egs” for training neural networks, which are chunks of feature matrices and corresponding alignments used respectively as inputs and supervisions. CutSet is different because it provides you with all kinds of metadata, and you can select just the interesting bits to feed them to your models.

CutSet can be created from any combination of RecordingSet, SupervisionSet, and FeatureSet with lhotse.cut.CutSet.from_manifests():

>>> from lhotse import CutSet
>>> cuts = CutSet.from_manifests(recordings=my_recording_set)
>>> cuts2 = CutSet.from_manifests(features=my_feature_set)
>>> cuts3 = CutSet.from_manifests(
...     recordings=my_recording_set,
...     features=my_feature_set,
...     supervisions=my_supervision_set,
... )

When creating a CutSet with CutSet.from_manifests(), the resulting cuts will have the same duration as the input recordings or features. For long recordings this is not viable for training, so we provide several methods to transform the cuts into shorter ones.

Consider the following scenario:

                  Recording
|-------------------------------------------|
"Hey, Matt!"     "Yes?"        "Oh, nothing"
|----------|     |----|        |-----------|

.......... CutSet.from_manifests() ..........
                    Cut1
|-------------------------------------------|

............. Example CutSet A ..............
    Cut1          Cut2              Cut3
|----------|     |----|        |-----------|

............. Example CutSet B ..............
          Cut1                  Cut2
|---------------------||--------------------|

............. Example CutSet C ..............
             Cut1        Cut2
            |---|      |------|

The CutSet’s A, B and C can be created like:

>>> cuts_A = cuts.trim_to_supervisions()
>>> cuts_B = cuts.cut_into_windows(duration=5.0)
>>> cuts_C = cuts.trim_to_unsupervised_segments()

Note

Some operations support parallel execution via an optional num_jobs parameter. By default, all processing is single-threaded.

Caution

Operations on cut sets are not mutating – they return modified copies of CutSet objects, leaving the original object unmodified (and all of its cuts are also unmodified).

CutSet can be stored and read from JSON, JSONL, etc. and supports optional gzip compression:

>>> cuts.to_file('cuts.jsonl.gz')
>>> cuts4 = CutSet.from_file('cuts.jsonl.gz')

It behaves similarly to a dict:

>>> 'rec1-1-0' in cuts
True
>>> cut = cuts['rec1-1-0']
>>> for cut in cuts:
...     pass
>>> len(cuts)
127

CutSet has some convenience properties and methods to gather information about the dataset:

>>> ids = list(cuts.ids)
>>> speaker_id_set = cuts.speakers
>>> # The following prints a message:
>>> cuts.describe()
Cuts count: 547
Total duration (hours): 326.4
Speech duration (hours): 79.6 (24.4%)
***
Duration statistics (seconds):
mean    2148.0
std      870.9
min      477.0
25%     1523.0
50%     2157.0
75%     2423.0
max     5415.0
dtype: float64

Manipulation examples:

>>> longer_than_5s = cuts.filter(lambda c: c.duration > 5)
>>> first_100 = cuts.subset(first=100)
>>> split_into_4 = cuts.split(num_splits=4)
>>> shuffled = cuts.shuffle()
>>> random_sample = cuts.sample(n_cuts=10)
>>> new_ids = cuts.modify_ids(lambda c: c.id + '-newid')

These operations can be composed to implement more complex operations, e.g. bucketing by duration:

>>> buckets = cuts.sort_by_duration().split(num_splits=30)

Cuts in a CutSet can be detached from parts of their metadata:

>>> cuts_no_feat = cuts.drop_features()
>>> cuts_no_rec = cuts.drop_recordings()
>>> cuts_no_sup = cuts.drop_supervisions()

Sometimes specific sorting patterns are useful when a small CutSet represents a mini-batch:

>>> cuts = cuts.sort_by_duration(ascending=False)
>>> cuts = cuts.sort_like(other_cuts)

CutSet offers some batch processing operations:

>>> cuts = cuts.pad(num_frames=300)  # or duration=30.0
>>> cuts = cuts.truncate(max_duration=30.0, offset_type='start')  # truncate from start to 30.0s
>>> cuts = cuts.mix(other_cuts, snr=[10, 30], mix_prob=0.5)

CutSet supports lazy data augmentation/transformation methods which require adjusting some information in the manifest (e.g., num_samples or duration). Note that in the following examples, the audio is untouched – the operations are stored in the manifest, and executed upon reading the audio:

>>> cuts_sp = cuts.perturb_speed(factor=1.1)
>>> cuts_vp = cuts.perturb_volume(factor=2.)
>>> cuts_24k = cuts.resample(24000)
>>> cuts_rvb = cuts.reverb_rir(rir_recordings)

Caution

If the CutSet contained Features manifests, they will be detached after performing audio augmentations such as CutSet.perturb_speed(), CutSet.resample(), CutSet.perturb_volume(), or CutSet.reverb_rir().

CutSet offers parallel feature extraction capabilities (see CutSet.compute_and_store_features() for details), and can be used to estimate global mean and variance:

>>> from lhotse import Fbank
>>> cuts = CutSet()
>>> cuts = cuts.compute_and_store_features(
...     extractor=Fbank(),
...     storage_path='/data/feats',
...     num_jobs=4
... )
>>> mvn_stats = cuts.compute_global_feature_stats('/data/features/mvn_stats.pkl', max_cuts=10000)

See also:

__init__(cuts=None)[source]
property data: Iterable[Cut]

Alias property for self.cuts

property mixed_cuts: CutSet
property simple_cuts: CutSet
property multi_cuts: CutSet
property ids: Iterable[str]
property speakers: FrozenSet[str]
static from_files(paths, shuffle_iters=True, seed=None)[source]

Constructor that creates a single CutSet out of many manifest files. We will iterate sequentially over each of the files, and by default we will randomize the file order every time CutSet is iterated.

This is intended primarily for large datasets which are split into many small manifests, to ensure that the order in which data is seen during training can be properly randomized.

Parameters:
  • paths (List[Union[Path, str]]) – a list of paths to cut manifests.

  • shuffle_iters (bool) – bool, should we shuffle paths each time we iterate the returned CutSet (enabled by default).

  • seed (Optional[int]) – int, random seed controlling the shuffling RNG. By default, we’ll use Python’s global RNG so the order will be different on each script execution.

Return type:

CutSet

Returns:

a lazy CutSet instance.
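
A usage sketch with hypothetical manifest paths:

>>> cuts = CutSet.from_files(
...     paths=['manifests/cuts-000.jsonl.gz', 'manifests/cuts-001.jsonl.gz'],
...     shuffle_iters=True,
...     seed=0,
... )
>>> for cut in cuts:  # the file order is shuffled anew on each iteration
...     pass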

static from_cuts(cuts)[source]

Left for backward compatibility, where it implicitly created an “eager” CutSet.

Return type:

CutSet

static from_items(cuts)

Left for backward compatibility, where it implicitly created an “eager” CutSet.

Return type:

CutSet

static from_manifests(recordings=None, supervisions=None, features=None, output_path=None, random_ids=False, tolerance=0.001, lazy=False)[source]

Create a CutSet from any combination of supervision, feature and recording manifests. At least one of recordings or features is required.

The created cuts will be of type MonoCut, even when the recordings have multiple channels. The MonoCut boundaries correspond to those found in the features, when available, otherwise to those found in the recordings.

When supervisions are provided, we’ll search them for matching recording IDs and attach them to the created cuts, assuming they are fully within the cut’s time span.

Parameters:
  • recordings (Optional[RecordingSet]) – an optional RecordingSet manifest.

  • supervisions (Optional[SupervisionSet]) – an optional SupervisionSet manifest.

  • features (Optional[FeatureSet]) – an optional FeatureSet manifest.

  • output_path (Union[Path, str, None]) – an optional path where the CutSet is stored.

  • random_ids (bool) – boolean, should the cut IDs be randomized. By default, we use the recording ID with a loop index and a channel index, i.e. “{recording_id}-{idx}-{channel}”.

  • tolerance (float) – float, tolerance for supervision and feature segment boundary comparison. By default, it’s 1ms. Increasing this value can be helpful when importing Kaldi data directories with precomputed features (typically 0.02 - 0.1 should be sufficient).

  • lazy (bool) – boolean, when True, output_path must be provided

Return type:

CutSet

Returns:

a new CutSet instance.

static from_dicts(data)[source]
Return type:

CutSet

static from_webdataset(path, **wds_kwargs)[source]

Provides the ability to read Lhotse objects from a WebDataset tarball (or a collection of them, i.e., shards) sequentially, without reading the full contents into memory. It also supports passing a list of paths, or WebDataset-style pipes.

CutSets stored in this format are potentially much faster to read from due to sequential I/O (we observed speedups of 50-100x vs random-read mechanisms).

Since this mode does not support random access reads, some methods of CutSet might not work properly (e.g. len()).

The behaviour of the underlying WebDataset instance can be customized by providing its kwargs directly to the constructor of this class. For details, see lhotse.dataset.webdataset.mini_webdataset() documentation.

Examples

Read manifests and data from a single tarball:

>>> cuts = CutSet.from_webdataset("data/cuts-train.tar")

Read manifests and data from multiple tarball shards:

>>> cuts = CutSet.from_webdataset("data/shard-{000000..004126}.tar")
>>> # alternatively
>>> cuts = CutSet.from_webdataset(["data/shard-000000.tar", "data/shard-000001.tar", ...])

Read manifests and data from shards in cloud storage (here AWS S3 via AWS CLI):

>>> cuts = CutSet.from_webdataset("pipe:aws s3 cp data/shard-{000000..004126}.tar -")

Read manifests and data from shards which are split between PyTorch DistributedDataParallel nodes and dataloading workers, with shard-level shuffling enabled:

>>> cuts = CutSet.from_webdataset(
...     "data/shard-{000000..004126}.tar",
...     split_by_worker=True,
...     split_by_node=True,
...     shuffle_shards=True,
... )
Return type:

CutSet

static from_shar(fields=None, in_dir=None, split_for_dataloading=False, shuffle_shards=False, stateful_shuffle=True, seed=42, cut_map_fns=None)[source]

Reads cuts and their corresponding data from multiple shards, also recognized as the Lhotse Shar format. Each shard is numbered and represented as a collection of one text manifest and one or more binary tarfiles. Each tarfile contains a single type of data, e.g., recordings, features, or custom fields.

Given an example directory named some_dir, its expected layout is some_dir/cuts.000000.jsonl.gz, some_dir/recording.000000.tar, some_dir/features.000000.tar, and then the same names but numbered with 000001, etc. There may also be other files if the cuts have custom data attached to them.

The main idea behind Lhotse Shar format is to optimize dataloading with sequential reads, while keeping the data composition more flexible than e.g. WebDataset tar archives do. To achieve this, Lhotse Shar keeps each data type in a separate archive, along a single CutSet JSONL manifest. This way, the metadata can be investigated without iterating through the binary data. The format also allows iteration over a subset of fields, or extension of existing data with new fields.

As you iterate over cuts from LazySharIterator, it keeps a file handle open for the JSONL manifest and all of the tar files that correspond to the current shard. The tar files are read item by item together, and their binary data is attached to the cuts. It can be normally accessed using methods such as cut.load_audio().

We can simply load a directory created by SharWriter. Example:

>>> cuts = LazySharIterator(in_dir="some_dir")
... for cut in cuts:
...     print("Cut", cut.id, "has duration of", cut.duration)
...     audio = cut.load_audio()
...     fbank = cut.load_features()

LazySharIterator can also be initialized from a dict, where the keys indicate fields to be read, and the values point to actual shard locations. This is useful when only a subset of data is needed, or it is stored in different locations. Example:

>>> cuts = LazySharIterator({
...     "cuts": ["some_dir/cuts.000000.jsonl.gz"],
...     "recording": ["another_dir/recording.000000.tar"],
...     "features": ["yet_another_dir/features.000000.tar"],
... })
... for cut in cuts:
...     print("Cut", cut.id, "has duration of", cut.duration)
...     audio = cut.load_audio()
...     fbank = cut.load_features()

We also support providing shell commands as shard sources, inspired by WebDataset. The “cuts” field expects a .jsonl stream, while the other fields expect a .tar stream. Example:

>>> cuts = LazySharIterator({
...     "cuts": ["pipe:curl https://my.page/cuts.000000.jsonl"]
...     "recording": ["pipe:curl https://my.page/recording.000000.tar"],
... })
... for cut in cuts:
...     print("Cut", cut.id, "has duration of", cut.duration)
...     audio = cut.load_audio()

The shell command can also contain pipes, e.g., to decompress data on the fly. Example:

>>> cuts = LazySharIterator({
...     "cuts": ["pipe:curl https://my.page/cuts.000000.jsonl.gz | gunzip -c -"],
        (...)
... })

Finally, we allow specifying URLs or cloud storage URIs for the shard sources. We defer to smart_open library to handle those. Example:

>>> cuts = LazySharIterator({
...     "cuts": ["s3://my-bucket/cuts.000000.jsonl.gz"],
...     "recording": ["s3://my-bucket/recording.000000.tar"],
... })
... for cut in cuts:
...     print("Cut", cut.id, "has duration of", cut.duration)
...     audio = cut.load_audio()
Parameters:
  • fields (Optional[Dict[str, Sequence[Union[Path, str]]]]) – a dict whose keys specify which fields to load, and values are lists of shards (either paths or shell commands). The field “cuts” pointing to CutSet shards always has to be present.

  • in_dir (Union[Path, str, None]) – path to a directory created with SharWriter with all the shards in a single place. Can be used instead of fields.

  • split_for_dataloading (bool) – bool, by default False which does nothing. Setting it to True is intended for PyTorch training with multiple dataloader workers and possibly multiple DDP nodes. It results in each node+worker combination receiving a unique subset of shards from which to read data to avoid data duplication. This is mutually exclusive with seed='randomized'.

  • shuffle_shards (bool) – bool, by default False. When True, the shards are shuffled (in case of multi-node training, the shuffling is the same on each node given the same seed).

  • seed (Union[int, Literal['randomized', 'trng']]) – When shuffle_shards is True, we use this number to seed the RNG. Seed can be set to 'randomized' in which case we expect that the user provided lhotse.dataset.dataloading.worker_init_fn() as DataLoader’s worker_init_fn argument. It will cause the iterator to shuffle shards differently on each node and dataloading worker in PyTorch training. This is mutually exclusive with split_for_dataloading=True. Seed can also be set to 'trng' which, like 'randomized', shuffles the shards differently on each iteration, but is not possible to control (and is not reproducible). trng mode is mostly useful when the user has limited control over the training loop and may not be able to guarantee that the internal Shar epoch is being incremented, but needs randomness on each iteration (e.g. useful with PyTorch Lightning).

  • stateful_shuffle (bool) – bool, by default True. When True, every time this object is fully iterated, it increments an internal epoch counter and triggers shard reshuffling with RNG seeded by seed + epoch. Doesn’t have any effect when shuffle_shards is False.

  • cut_map_fns (Optional[Sequence[Callable[[Cut], Cut]]]) – optional sequence of callables that accept cuts and return cuts. It’s expected to have the same length as the number of shards, so each function corresponds to a specific shard. It can be used to attach shard-specific custom attributes to cuts.

Return type:

CutSet

See also: LazySharIterator, to_shar().

to_shar(output_dir, fields, shard_size=1000, warn_unused_fields=True, include_cuts=True, num_jobs=1, fault_tolerant=False, verbose=False)[source]

Writes cuts and their corresponding data into multiple shards, also known as the Lhotse Shar format. Each shard is numbered and represented as a collection of one text manifest and one or more binary tarfiles. Each tarfile contains a single type of data, e.g., recordings, features, or custom fields.

The main idea behind Lhotse Shar format is to optimize dataloading with sequential reads, while keeping the data composition more flexible than e.g. WebDataset tar archives do. To achieve this, Lhotse Shar keeps each data type in a separate archive, alongside a single CutSet JSONL manifest. This way, the metadata can be investigated without iterating through the binary data. The format also allows iteration over a subset of fields, or extension of existing data with new fields.

The user has to specify which fields should be saved, and what compression to use for each of them. Currently we support wav, flac, and mp3 compression for recording and custom audio fields, and lilcom or numpy for features and custom array fields.

Example:

>>> cuts = CutSet(...)  # cuts have 'recording' and 'features'
>>> output_paths = cuts.to_shar(
...     "some_dir", shard_size=100, fields={"recording": "mp3", "features": "lilcom"}
... )

It would create a directory some_dir with files such as some_dir/cuts.000000.jsonl.gz, some_dir/recording.000000.tar, some_dir/features.000000.tar, and then the same names but numbered with 000001, etc. The function returns a dict that maps field names to lists of saved shard paths.

When shard_size is set to None, we will disable automatic sharding and the shard number suffix will be omitted from the file names.

The option warn_unused_fields will emit a warning when cuts have some data attached to them (e.g., recording, features, or custom arrays) but saving it was not specified via fields.

The option include_cuts controls whether we store the cuts alongside fields (true by default). Turning it off is useful when extending an existing dataset with new fields/feature types, while the original cuts do not require any modification.

When num_jobs is greater than 1, we will first split the CutSet into shard CutSets, and then export the fields in parallel using multiple subprocesses. Enabling verbose will display a progress bar.

Return type:

Dict[str, List[str]]

Note

It is recommended not to set num_jobs too high on systems with slow disks, as the export will likely be bottlenecked by I/O speed in these cases. Try experimenting with 4-8 jobs first.

The option fault_tolerant will skip over audio files that failed to load with a warning. By default it is disabled.

See also: SharWriter, from_shar().

to_dicts()[source]
Return type:

Iterable[dict]

decompose(output_dir=None, verbose=False)[source]

Return a 3-tuple of unique (recordings, supervisions, features) found in this CutSet. Some manifest sets may also be None, e.g., if features were not extracted.

Note

MixedCut is iterated over its track cuts.

Parameters:
  • output_dir (Union[Path, str, None]) – directory where the manifests will be saved. The following files will be created: ‘recordings.jsonl.gz’, ‘supervisions.jsonl.gz’, ‘features.jsonl.gz’.

  • verbose (bool) – when True, shows a progress bar.

Return type:

Tuple[Optional[RecordingSet], Optional[SupervisionSet], Optional[FeatureSet]]

describe(full=False)[source]

Print a message describing details about the CutSet - the number of cuts and the duration statistics, including the total duration and the percentage of speech segments.

Parameters:

full (bool) – when True, prints the full duration statistics, including % of speech by speaker count.

Return type:

None

Example output (for AMI train set):

>>> cs.describe(full=True)

Cut statistics:
╒═══════════════════════════╤══════════╕
│ Cuts count:               │ 133      │
├───────────────────────────┼──────────┤
│ Total duration (hh:mm:ss) │ 79:23:03 │
├───────────────────────────┼──────────┤
│ mean                      │ 2148.7   │
├───────────────────────────┼──────────┤
│ std                       │ 867.4    │
├───────────────────────────┼──────────┤
│ min                       │ 477.9    │
├───────────────────────────┼──────────┤
│ 25%                       │ 1509.8   │
├───────────────────────────┼──────────┤
│ 50%                       │ 2181.7   │
├───────────────────────────┼──────────┤
│ 75%                       │ 2439.9   │
├───────────────────────────┼──────────┤
│ 99%                       │ 5300.7   │
├───────────────────────────┼──────────┤
│ 99.5%                     │ 5355.3   │
├───────────────────────────┼──────────┤
│ 99.9%                     │ 5403.2   │
├───────────────────────────┼──────────┤
│ max                       │ 5415.2   │
├───────────────────────────┼──────────┤
│ Recordings available:     │ 133      │
├───────────────────────────┼──────────┤
│ Features available:       │ 0        │
├───────────────────────────┼──────────┤
│ Supervisions available:   │ 102222   │
╘═══════════════════════════╧══════════╛
Speech duration statistics:
╒══════════════════════════════╤══════════╤═══════════════════════════╕
│ Total speech duration        │ 64:59:51 │ 81.88% of recording       │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Total speaking time duration │ 74:33:09 │ 93.91% of recording       │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Total silence duration       │ 14:23:12 │ 18.12% of recording       │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Single-speaker duration      │ 56:18:24 │ 70.93% (86.63% of speech) │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Overlapped speech duration   │ 08:41:28 │ 10.95% (13.37% of speech) │
╘══════════════════════════════╧══════════╧═══════════════════════════╛
Speech duration statistics by number of speakers:
╒══════════════════════╤═══════════════════════╤════════════════════════════╤═══════════════╤══════════════════════╕
│ Number of speakers   │ Duration (hh:mm:ss)   │ Speaking time (hh:mm:ss)   │ % of speech   │ % of speaking time   │
╞══════════════════════╪═══════════════════════╪════════════════════════════╪═══════════════╪══════════════════════╡
│ 1                    │ 56:18:24              │ 56:18:24                   │ 86.63%        │ 75.53%               │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ 2                    │ 07:51:44              │ 15:43:28                   │ 12.10%        │ 21.09%               │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ 3                    │ 00:47:36              │ 02:22:47                   │ 1.22%         │ 3.19%                │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ 4                    │ 00:02:08              │ 00:08:31                   │ 0.05%         │ 0.19%                │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ Total                │ 64:59:51              │ 74:33:09                   │ 100.00%       │ 100.00%              │
╘══════════════════════╧═══════════════════════╧════════════════════════════╧═══════════════╧══════════════════════╛

split(num_splits, shuffle=False, drop_last=False)[source]

Split the CutSet into num_splits pieces of equal size.

Parameters:
  • num_splits (int) – Requested number of splits.

  • shuffle (bool) – Optionally shuffle the cuts before splitting.

  • drop_last (bool) – determines how to handle splitting when len(seq) is not divisible by num_splits. When False (default), the splits might have unequal lengths. When True, it may discard the last element in some splits to ensure they are equally long.

Return type:

List[CutSet]

Returns:

A list of CutSet pieces.
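
Example (an illustrative sketch; cuts stands for an already-loaded CutSet):

>>> train, dev, test = cuts.split(num_splits=3, shuffle=True)
>>> # Each piece is itself a CutSet with roughly len(cuts) / 3 cuts.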

split_lazy(output_dir, chunk_size, prefix='', num_digits=8, start_idx=0)[source]

Splits a manifest (either lazily or eagerly opened) into chunks, each with chunk_size items (except for the last one, typically).

In order to be memory efficient, this implementation saves each chunk to disk in a .jsonl.gz format as the input manifest is sampled.

Note

For lowest memory usage, use load_manifest_lazy to open the input manifest for this method.

Parameters:
  • output_dir (Union[Path, str]) – directory where the split manifests are saved. Each manifest is saved at: {output_dir}/{prefix}.{split_idx}.jsonl.gz

  • chunk_size (int) – the number of items in each chunk.

  • prefix (str) – the prefix of each manifest.

  • num_digits (int) – the width of split_idx, which will be left padded with zeros to achieve it.

  • start_idx (int) – The split index to start counting from (default is 0).

Return type:

List[CutSet]

Returns:

a list of lazily opened chunk manifests.
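
Example (an illustrative sketch; the paths are hypothetical):

>>> cuts = CutSet.from_jsonl_lazy("data/cuts.jsonl.gz")
>>> chunks = cuts.split_lazy(output_dir="data/chunks", chunk_size=10000, prefix="cuts")
>>> # Chunks are saved as e.g. data/chunks/cuts.00000000.jsonl.gz and returned lazily opened.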

subset(*, supervision_ids=None, cut_ids=None, first=None, last=None)[source]

Return a new CutSet according to the selected subset criterion. Only a single argument to subset is supported at this time.

Example:
>>> cuts = CutSet.from_yaml('path/to/cuts')
>>> train_set = cuts.subset(supervision_ids=train_ids)
>>> test_set = cuts.subset(supervision_ids=test_ids)
Parameters:
  • supervision_ids (Optional[Iterable[str]]) – List of supervision IDs to keep.

  • cut_ids (Optional[Iterable[str]]) – List of cut IDs to keep. The returned CutSet preserves the order of cut_ids.

  • first (Optional[int]) – int, the number of first cuts to keep.

  • last (Optional[int]) – int, the number of last cuts to keep.

Return type:

CutSet

Returns:

a new CutSet with the subset results.
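
The first/last criteria can be sketched as follows (cuts stands for an already-loaded CutSet):

>>> head = cuts.subset(first=100)  # keep the first 100 cuts
>>> tail = cuts.subset(last=100)   # keep the last 100 cuts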

map(transform_fn, apply_fn=<function is_cut>)[source]

Apply transform_fn to each item in this manifest and return a new manifest. If the manifest is opened lazily, the transform is also applied lazily.

Parameters:

transform_fn (Callable[[TypeVar(T)], TypeVar(T)]) – A callable (function) that accepts a single item instance and returns a new (or the same) instance of the same type. E.g. with a CutSet, the callable accepts a Cut and returns a Cut.

Return type:

CutSet

Returns:

a new CutSet with transformed cuts.

filter_supervisions(predicate)[source]

Return a new CutSet with Cuts containing only SupervisionSegments satisfying predicate.

Cuts without supervisions are preserved.

Example:
>>> cuts = CutSet.from_yaml('path/to/cuts')
>>> at_least_five_second_supervisions = cuts.filter_supervisions(lambda s: s.duration >= 5)
Parameters:

predicate (Callable[[SupervisionSegment], bool]) – A callable that accepts SupervisionSegment and returns bool

Return type:

CutSet

Returns:

a CutSet with filtered supervisions

merge_supervisions(merge_policy='delimiter', custom_merge_fn=None)[source]

Return a copy of this CutSet where each cut has all of its supervisions merged into a single segment.

The new start is the start of the earliest supervision, and the new duration is a minimum spanning duration for all the supervisions. The text fields of all segments are concatenated with a whitespace.

Parameters:
  • merge_policy (str) – one of “keep_first” or “delimiter”. If “keep_first”, we keep only the first segment’s field value, otherwise all string fields (including IDs) are prefixed with “cat#” and concatenated with a hash symbol “#”. This is also applied to custom fields. Fields with a None value are omitted.

  • custom_merge_fn (Optional[Callable[[str, Iterable[Any]], Any]]) – a function that will be called to merge custom fields values. We expect custom_merge_fn to handle all possible custom keys. When not provided, we will treat all custom values as strings. It will be called roughly like: custom_merge_fn(custom_key, [s.custom[custom_key] for s in sups])

Return type:

CutSet
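
Example (an illustrative sketch; cuts stands for an already-loaded CutSet):

>>> merged = cuts.merge_supervisions(merge_policy="keep_first")
>>> # Each cut now has at most one supervision spanning all of its original segments.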

trim_to_supervisions(keep_overlapping=True, min_duration=None, context_direction='center', keep_all_channels=False, num_jobs=1)[source]

Return a new CutSet with Cuts that have identical spans as their supervisions.

For example, the following cut:

        Cut
|-----------------|
 Sup1
|----|  Sup2
   |-----------|

is transformed into two cuts:

 Cut1
|----|
 Sup1
|----|
   Sup2
   |-|
        Cut2
   |-----------|
   Sup1
   |-|
        Sup2
   |-----------|

For the case of a multi-channel cut with multiple supervisions, we can either trim while respecting the supervision channels (in which case output cut has the same channels as the supervision) or ignore the channels (in which case output cut has the same channels as the input cut).

Parameters:
  • keep_overlapping (bool) – when False, it will discard parts of other supervisions that overlap with the main supervision. In the illustration above, it would discard Sup2 in Cut1 and Sup1 in Cut2. In this mode, we guarantee that there will always be exactly one supervision per cut.

  • min_duration (Optional[float]) – An optional duration in seconds; specifying this argument will extend the cuts that would have been shorter than min_duration with actual acoustic context in the recording/features. If there are supervisions present in the context, they are kept when keep_overlapping is true. If there is not enough context, the returned cut will be shorter than min_duration. If the supervision segment is longer than min_duration, the return cut will be longer.

  • context_direction (Literal['center', 'left', 'right', 'random']) – Which direction should the cut be expanded towards to include context. The value of “center” implies equal expansion to left and right; random uniformly samples a value between “left” and “right”.

  • keep_all_channels (bool) – If True, the output cut will have the same channels as the input cut. By default, the trimmed cut will have the same channels as the supervision.

  • num_jobs (int) – Number of parallel workers to process the cuts.

Return type:

CutSet

Returns:

a CutSet.
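
Example (an illustrative sketch; cuts stands for an already-loaded CutSet):

>>> utterances = cuts.trim_to_supervisions(keep_overlapping=False, min_duration=2.0)
>>> # One supervision per cut; short cuts are extended with real context up to 2 s where available.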

trim_to_alignments(type, max_pause=0.0, max_segment_duration=None, delimiter=' ', keep_all_channels=False, num_jobs=1)[source]

Return a new CutSet with Cuts that have identical spans as the alignments of type type. An additional max_pause is allowed between the alignments to merge contiguous alignment items.

For the case of a multi-channel cut with multiple alignments, we can either trim while respecting the supervision channels (in which case output cut has the same channels as the supervision) or ignore the channels (in which case output cut has the same channels as the input cut).

Parameters:
  • type (str) – The type of the alignment to trim to (e.g. “word”).

  • max_pause (float) – The maximum pause allowed between the alignments to merge them.

  • delimiter (str) – The delimiter to use when concatenating the alignment items.

  • keep_all_channels (bool) – If True, the output cut will have the same channels as the input cut. By default, the trimmed cut will have the same channels as the supervision.

  • num_jobs (int) – Number of parallel workers to process the cuts.

Return type:

CutSet

Returns:

a CutSet.

trim_to_unsupervised_segments()[source]

Return a new CutSet with Cuts created from segments that have no supervisions (likely silence or noise).

Return type:

CutSet

Returns:

a CutSet.

trim_to_supervision_groups(max_pause=None, num_jobs=1)[source]

Return a new CutSet with Cuts based on supervision groups. A supervision group is a set of supervisions with no gaps between them (or gaps shorter than max_pause). This is similar to the concept of an utterance group as described in this paper: https://arxiv.org/abs/2211.00482

For example, the following cut:

Cut

╔═════════════════════════════════════════════════════════════════════════════════╗
║┌──────────────────────┐                              ┌────────┐                 ║
║│ Hello this is John.  │                              │   Hi   │                 ║
║└──────────────────────┘                              └────────┘                 ║
║            ┌──────────────────────────────────┐            ┌───────────────────┐║
║            │     Hey, John. How are you?      │            │  What do you do?  │║
║            └──────────────────────────────────┘            └───────────────────┘║
╚═════════════════════════════════════════════════════════════════════════════════╝

is transformed into two cuts:

Cut 1                                       Cut 2

╔════════════════════════════════════════════════╗    ╔═══════════════════════════╗
║┌──────────────────────┐                        ║    ║┌────────┐                 ║
║│ Hello this is John.  │                        ║    ║│   Hi   │                 ║
║└──────────────────────┘                        ║    ║└────────┘                 ║
║            ┌──────────────────────────────────┐║    ║      ┌───────────────────┐║
║            │     Hey, John. How are you?      │║    ║      │  What do you do?  │║
║            └──────────────────────────────────┘║    ║      └───────────────────┘║
╚════════════════════════════════════════════════╝    ╚═══════════════════════════╝

For the case of a multi-channel cut with multiple supervisions, we keep all the channels in the recording.

Parameters:
  • max_pause (Optional[float]) – An optional duration in seconds; if the gap between two supervisions is longer than this, they will be treated as separate groups.

  • num_jobs (int) – Number of parallel workers to process the cuts.

Return type:

CutSet

Returns:

a CutSet.

combine_same_recording_channels()[source]

Find cuts that come from the same recording and have matching start and end times, but represent different channels. Then, combine them together to form MultiCuts and return a new CutSet containing these MultiCuts. This is useful for processing microphone array recordings.

It is intended to be used as the first operation after creating a new CutSet (but might also work in other circumstances, e.g. if it was cut to windows first).

Return type:

CutSet

Example:
>>> ami = prepare_ami('path/to/ami')
>>> cut_set = CutSet.from_manifests(recordings=ami['train']['recordings'])
>>> multi_channel_cut_set = cut_set.combine_same_recording_channels()

In the AMI example, the multi_channel_cut_set will yield MultiCuts that hold all single-channel Cuts together.

sort_by_recording_id(ascending=True)[source]

Sort the CutSet alphabetically according to ‘recording_id’. Ascending by default.

This is advantageous before calling save_audios() on a trim_to_supervisions() processed CutSet; also make sure that set_caching_enabled(True) was called.

Return type:

CutSet

sort_by_duration(ascending=False)[source]

Sort the CutSet according to cut duration and return the result. Descending by default.

Return type:

CutSet

sort_like(other)[source]

Sort the CutSet according to the order of cut IDs in other and return the result.

Return type:

CutSet

index_supervisions(index_mixed_tracks=False, keep_ids=None)[source]

Create a two-level index of supervision segments. It is a mapping from a Cut’s ID to an interval tree that contains the supervisions of that Cut.

The interval tree can be efficiently queried for overlapping and/or enveloping segments. It helps speed up some operations on Cuts of very long recordings (1h+) that contain many supervisions.

Parameters:
  • index_mixed_tracks (bool) – Should the tracks of MixedCut’s be indexed as additional, separate entries.

  • keep_ids (Optional[Set[str]]) – If specified, we will only index the supervisions with the specified IDs.

Return type:

Dict[str, IntervalTree]

Returns:

a mapping from Cut ID to an interval tree of SupervisionSegments.

pad(duration=None, num_frames=None, num_samples=None, pad_feat_value=-23.025850929940457, direction='right', preserve_id=False, pad_value_dict=None)[source]

Return a new CutSet with Cuts padded to duration, num_frames or num_samples. Cuts longer than the specified argument will not be affected. By default, cuts will be padded to the right (i.e. after the signal).

When none of duration, num_frames, or num_samples is specified, we’ll try to determine the best way to pad to the longest cut based on whether features or recordings are available.

Parameters:
  • duration (Optional[float]) – The cut’s minimal duration after padding. When not specified, we’ll choose the duration of the longest cut in the CutSet.

  • num_frames (Optional[int]) – The cut’s total number of frames after padding.

  • num_samples (Optional[int]) – The cut’s total number of samples after padding.

  • pad_feat_value (float) – A float value that’s used for padding the features. By default we assume a log-energy floor of approx. -23 (1e-10 after exp).

  • direction (str) – string, ‘left’, ‘right’ or ‘both’. Determines whether the padding is added before or after the cut.

  • preserve_id (bool) – When True, preserves the cut ID from before padding. Otherwise, generates a new random ID (default).

  • pad_value_dict (Optional[Dict[str, Union[int, float]]]) – Optional dict that specifies what value should be used for padding arrays in custom attributes.

Return type:

CutSet

Returns:

A padded CutSet.
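
Example (an illustrative sketch; cuts stands for an already-loaded CutSet):

>>> padded = cuts.pad(duration=10.0, direction="right")
>>> # Cuts shorter than 10 s are padded after the signal; longer cuts are left unchanged.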

truncate(max_duration, offset_type, keep_excessive_supervisions=True, preserve_id=False, rng=None)[source]

Return a new CutSet with the Cuts truncated so that their durations are at most max_duration. Cuts shorter than max_duration will not be changed.

Parameters:
  • max_duration (float) – float, the maximum duration in seconds of a cut in the resulting manifest.

  • offset_type (str) – str, can be: ‘start’ => cuts are truncated from their start; ‘end’ => cuts are truncated from their end minus max_duration; ‘random’ => cuts are truncated randomly between their start and their end minus max_duration.

  • keep_excessive_supervisions (bool) – bool. When a cut is truncated in the middle of a supervision segment, should the supervision be kept.

  • preserve_id (bool) – bool. Should the truncated cut keep the same ID or get a new, random one.

  • rng (Optional[Random]) – optional random number generator to be used with a ‘random’ offset_type.

Return type:

CutSet

Returns:

a new CutSet instance with truncated cuts.
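
Example (an illustrative sketch; cuts stands for an already-loaded CutSet):

>>> import random
>>> truncated = cuts.truncate(max_duration=5.0, offset_type="random", rng=random.Random(0))
>>> # Every cut is now at most 5 s long; seeding the RNG makes the random offsets reproducible.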

extend_by(duration, direction='both', preserve_id=False, pad_silence=True)[source]

Returns a new CutSet with cuts extended by duration amount.

Parameters:
  • duration (float) – float (seconds), specifies the duration by which the CutSet is extended.

  • direction (str) – string, ‘left’, ‘right’ or ‘both’. Determines whether to extend on the left, right, or both sides. If ‘both’, extend on both sides by the same duration (equal to duration).

  • preserve_id (bool) – bool. Should the extended cut keep the same ID or get a new, random one.

  • pad_silence (bool) – bool. If True, the extended part of the cut will be padded with silence if required to match the specified duration.

Return type:

CutSet

Returns:

a new CutSet instance.
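
Example (an illustrative sketch; cuts stands for an already-loaded CutSet):

>>> extended = cuts.extend_by(duration=0.5, direction="both", pad_silence=True)
>>> # Each cut gains up to 0.5 s of context on both sides, padded with silence where the recording ends.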

cut_into_windows(duration, hop=None, keep_excessive_supervisions=True, num_jobs=1)[source]

Return a new CutSet, made by traversing each DataCut in windows of duration seconds, shifted by hop seconds, and creating new DataCuts out of them.

The last window might have a shorter duration if there was not enough audio, so you might want to use either .filter() or .pad() afterwards to obtain a uniform duration CutSet.

Parameters:
  • duration (float) – Desired duration of the new cuts in seconds.

  • hop (Optional[float]) – Shift between the windows in the new cuts in seconds.

  • keep_excessive_supervisions (bool) – bool. When a cut is truncated in the middle of a supervision segment, should the supervision be kept.

  • num_jobs (int) – The number of parallel workers.

Return type:

CutSet

Returns:

a new CutSet with cuts made from shorter duration windows.
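
Example (an illustrative sketch; cuts stands for an already-loaded CutSet):

>>> windows = cuts.cut_into_windows(duration=5.0, hop=2.5)
>>> uniform = windows.pad(duration=5.0)  # make the (possibly shorter) last windows uniform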

load_audio(collate=False, limit=1024)[source]

Reads the audio of all cuts in this CutSet into memory. Useful when this object represents a mini-batch.

Parameters:
  • collate (bool) – Should we collate the read audio into a single array. Shorter cuts will be padded. False by default.

  • limit (int) – Maximum number of read audio examples. By default it’s 1024 which covers most frequently encountered mini-batch sizes. If you are working with larger batch sizes, increase this limit.

Return type:

Union[List[ndarray], Tuple[ndarray, ndarray]]

Returns:

A list of numpy arrays, or a single array with batch size as the first dim.

sample(n_cuts=1)[source]

Randomly sample this CutSet and return n_cuts cuts. When n_cuts is 1, will return a single cut instance; otherwise will return a CutSet.

Return type:

Union[Cut, CutSet]

resample(sampling_rate, affix_id=False)[source]

Return a new CutSet that contains cuts resampled to the new sampling_rate. All cuts in the manifest must contain recording information. If the feature manifests are attached, they are dropped.

Parameters:
  • sampling_rate (int) – The new sampling rate.

  • affix_id (bool) – Should we modify the ID (useful if both versions of the same cut are going to be present in a single manifest).

Return type:

CutSet

Returns:

a modified copy of the CutSet.
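
Example (an illustrative sketch; cuts stands for an already-loaded CutSet):

>>> cuts_16k = cuts.resample(16000)
>>> # The conversion is recorded in the manifest and applied when the audio is loaded.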

perturb_speed(factor, affix_id=True)[source]

Return a new CutSet that contains speed perturbed cuts with a factor of factor. It requires the recording manifests to be present. If the feature manifests are attached, they are dropped. The supervision manifests are modified to reflect the speed perturbed start times and durations.

Parameters:
  • factor (float) – The resulting playback speed is factor times the original one.

  • affix_id (bool) – Should we modify the ID (useful if both versions of the same cut are going to be present in a single manifest).

Return type:

CutSet

Returns:

a modified copy of the CutSet.
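
Example (an illustrative sketch of the common 3-way speed perturbation recipe; cuts stands for an already-loaded CutSet):

>>> cuts_sp09 = cuts.perturb_speed(0.9)
>>> cuts_sp11 = cuts.perturb_speed(1.1)
>>> # With affix_id=True the perturbed copies get distinct IDs, so they can be
>>> # combined with the original cuts in a single manifest.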

perturb_tempo(factor, affix_id=True)[source]

Return a new CutSet that contains tempo perturbed cuts with a factor of factor.

Compared to speed perturbation, tempo preserves pitch. It requires the recording manifests to be present. If the feature manifests are attached, they are dropped. The supervision manifests are modified to reflect the tempo perturbed start times and durations.

Parameters:
  • factor (float) – The resulting playback tempo is factor times the original one.

  • affix_id (bool) – Should we modify the ID (useful if both versions of the same cut are going to be present in a single manifest).

Return type:

CutSet

Returns:

a modified copy of the CutSet.

perturb_volume(factor, affix_id=True)[source]

Return a new CutSet that contains volume perturbed cuts with a factor of factor. It requires the recording manifests to be present. If the feature manifests are attached, they are dropped. The supervision manifests remain the same.

Parameters:
  • factor (float) – The resulting playback volume is factor times the original one.

  • affix_id (bool) – Should we modify the ID (useful if both versions of the same cut are going to be present in a single manifest).

Return type:

CutSet

Returns:

a modified copy of the CutSet.

normalize_loudness(target, mix_first=True, affix_id=True)[source]

Return a new CutSet that will lazily apply loudness normalization to the desired target loudness (in dBFS).

Parameters:
  • target (float) – The target loudness in dBFS.

  • affix_id (bool) – When true, we will modify the Cut.id field by affixing it with “_ln{target}”.

Return type:

CutSet

Returns:

a modified copy of the current CutSet.
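
Example (an illustrative sketch; cuts stands for an already-loaded CutSet):

>>> normalized = cuts.normalize_loudness(target=-23.0)
>>> # The gain is applied lazily when the audio is loaded; with affix_id=True, IDs get an "_ln-23.0" affix.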

dereverb_wpe(affix_id=True)[source]

Return a new CutSet that will lazily apply WPE dereverberation.

Parameters:

affix_id (bool) – When true, we will modify the Cut.id field by affixing it with “_wpe”.

Return type:

CutSet

Returns:

a modified copy of the current CutSet.

reverb_rir(rir_recordings=None, normalize_output=True, early_only=False, affix_id=True, rir_channels=[0])[source]

Return a new CutSet that contains original cuts convolved with randomly chosen impulse responses from rir_recordings. It requires the recording manifests to be present. If the feature manifests are attached, they are dropped. The supervision manifests remain the same.

If no rir_recordings are provided, we will generate a set of impulse responses using a fast random generator (https://arxiv.org/abs/2208.04101).

Parameters:
  • rir_recordings (Optional[RecordingSet]) – RecordingSet containing the room impulse responses.

  • normalize_output (bool) – When true, the output will be normalized to have the same energy as the input.

  • early_only (bool) – When true, only the early reflections (first 50 ms) will be used.

  • affix_id (bool) – Should we modify the ID (useful if both versions of the same cut are going to be present in a single manifest).

  • rir_channels (List[int]) – The channels of the impulse response to use. By default, first channel will be used. If it is a multi-channel RIR, applying RIR will produce MixedCut. If no RIR is provided, we will generate one with as many channels as this argument specifies.

Return type:

CutSet

Returns:

a modified copy of the CutSet.

mix(cuts, duration=None, allow_padding=False, snr=20, preserve_id=None, mix_prob=1.0, seed=42, random_mix_offset=False)[source]

Mix cuts in this CutSet with randomly sampled cuts from another CutSet. A typical application would be data augmentation with noise, music, babble, etc.

Parameters:
  • cuts (CutSet) – a CutSet containing cuts to be mixed into this CutSet.

  • duration (Optional[float]) – an optional float in seconds. When None, we will preserve the duration of the cuts in self (i.e. we’ll truncate the mix if it exceeded the original duration). Otherwise, we will keep sampling cuts to mix in until we reach the specified duration (and truncate to that value, should it be exceeded).

  • allow_padding (bool) – an optional bool. When it is True, we will allow the offset to be larger than the reference cut by padding the reference cut.

  • snr (Union[float, Sequence[float], None]) – an optional float, or pair (range) of floats, in decibels. When it’s a single float, we will mix all cuts with this SNR level (where cuts in self are treated as signals, and cuts in cuts are treated as noise). When it’s a pair of floats, we will uniformly sample SNR values from that range. When None, we will mix the cuts without any level adjustment (could be too noisy for data augmentation).

  • preserve_id (Optional[str]) – optional string (“left”, “right”). When specified, the mix will preserve the cut ID of the left- or right-hand side argument. Otherwise, a new random ID is generated.

  • mix_prob (float) – an optional float in range [0, 1]. Specifies the probability of performing a mix. Values lower than 1.0 mean that some cuts in the output will be unchanged.

  • seed (Union[int, Literal['trng', 'randomized'], Random]) – an optional int or “trng”. Random seed for choosing the cuts to mix and the SNR. If “trng” is provided, we’ll use the secrets module for non-deterministic results on each iteration. You can also directly pass a random.Random instance here.

  • random_mix_offset (bool) – an optional bool. When True and the cut to be mixed in is longer than the original cut, a random sub-region of the mixed-in cut will be selected.

Return type:

CutSet

Returns:

a new CutSet with mixed cuts.
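
Example (an illustrative sketch of noise augmentation; clean_cuts and noise_cuts are hypothetical CutSets):

>>> noisy = clean_cuts.mix(cuts=noise_cuts, snr=(10, 20), mix_prob=0.5, seed=0)
>>> # About half of the cuts get noise mixed in at an SNR drawn uniformly from [10, 20] dB.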

drop_features()[source]

Return a new CutSet, where each Cut is copied and detached from its extracted features.

Return type:

CutSet

drop_recordings()[source]

Return a new CutSet, where each Cut is copied and detached from its recordings.

Return type:

CutSet

drop_supervisions()[source]

Return a new CutSet, where each Cut is copied and detached from its supervisions.

Return type:

CutSet

drop_alignments()[source]

Return a new CutSet, where each Cut is copied and detached from the alignments present in its supervisions.

Return type:

CutSet

compute_and_store_features(extractor, storage_path, num_jobs=None, augment_fn=None, storage_type=<class 'lhotse.features.io.LilcomChunkyWriter'>, executor=None, mix_eagerly=True, progress_bar=True)[source]

Extract features for all cuts, possibly in parallel, and store them using the specified storage object.

Examples:

Extract fbank features on one machine using 8 processes, store arrays partitioned in 8 archive files with lilcom compression:

>>> cuts = CutSet(...)
... cuts.compute_and_store_features(
...     extractor=Fbank(),
...     storage_path='feats',
...     num_jobs=8,
... )

Extract fbank features on one machine using 8 processes, store each array in a separate file with lilcom compression:

>>> cuts = CutSet(...)
... cuts.compute_and_store_features(
...     extractor=Fbank(),
...     storage_path='feats',
...     num_jobs=8,
...     storage_type=LilcomFilesWriter
... )

Extract fbank features on multiple machines using a Dask cluster with 80 jobs, store arrays partitioned in 80 archive files with lilcom compression:

>>> from distributed import Client
... cuts = CutSet(...)
... cuts.compute_and_store_features(
...     extractor=Fbank(),
...     storage_path='feats',
...     num_jobs=80,
...     executor=Client(...)
... )

Extract fbank features on one machine using 8 processes, store each array in an S3 bucket (requires smart_open):

>>> cuts = CutSet(...)
... cuts.compute_and_store_features(
...     extractor=Fbank(),
...     storage_path='s3://my-feature-bucket/my-corpus-features',
...     num_jobs=8,
...     storage_type=LilcomURLWriter
... )
Parameters:
  • extractor (FeatureExtractor) – A FeatureExtractor instance (either Lhotse’s built-in or a custom implementation).

  • storage_path (Union[Path, str]) – The path to location where we will store the features. The exact type and layout of stored files will be dictated by the storage_type argument.

  • num_jobs (Optional[int]) – The number of parallel processes used to extract the features. We will internally split the CutSet into this many chunks and process each chunk in parallel.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional callable used for audio augmentation. Be careful with the types of augmentations used: if they modify the start/end/duration times of the cut and its supervisions, you will end up with incorrect supervision information when using this API. E.g. for speed perturbation, use CutSet.perturb_speed() instead.

  • storage_type (Type[TypeVar(FW, bound= FeaturesWriter)]) – a FeaturesWriter subclass type. It determines how the features are stored to disk, e.g. separate file per array, HDF5 files with multiple arrays, etc.

  • executor (Optional[Executor]) – when provided, will be used to parallelize the feature extraction process. By default, we will instantiate a ProcessPoolExecutor. Learn more about the Executor API at https://lhotse.readthedocs.io/en/latest/parallelism.html

  • mix_eagerly (bool) – Related to how the features are extracted for MixedCut instances, if any are present. When False, extract and store the features for each track separately, and mix them dynamically when loading the features. When True, mix the audio first and store the mixed features, returning a new DataCut instance with the same ID. The returned DataCut will not have a Recording attached.

  • progress_bar (bool) – Should a progress bar be displayed (automatically turned off for parallel computation).

Return type:

CutSet

Returns:

Returns a new CutSet with Features manifests attached to the cuts.

compute_and_store_features_batch(extractor, storage_path, manifest_path=None, batch_duration=600.0, num_workers=4, collate=False, augment_fn=None, storage_type=<class 'lhotse.features.io.LilcomChunkyWriter'>, overwrite=False)[source]

Extract features for all cuts in batches. This method is intended for use with compatible feature extractors that implement an accelerated extract_batch() method. For example, kaldifeat extractors can be used this way (see, e.g., KaldifeatFbank or KaldifeatMfcc).

When a CUDA GPU is available and enabled for the feature extractor, this can be much faster than CutSet.compute_and_store_features(). Otherwise, the speed will be comparable to single-threaded extraction.

Example: extract fbank features on one GPU, using 4 dataloading workers for reading audio, and store the arrays in an archive file with lilcom compression:

>>> from lhotse import KaldifeatFbank, KaldifeatFbankConfig
>>> extractor = KaldifeatFbank(KaldifeatFbankConfig(device='cuda'))
>>> cuts = CutSet(...)
... cuts = cuts.compute_and_store_features_batch(
...     extractor=extractor,
...     storage_path='feats',
...     batch_duration=500,
...     num_workers=4,
... )
Parameters:
  • extractor (FeatureExtractor) – A FeatureExtractor instance, which should implement an accelerated extract_batch method.

  • storage_path (Union[Path, str]) – The path to location where we will store the features. The exact type and layout of stored files will be dictated by the storage_type argument.

  • manifest_path (Union[Path, str, None]) – Optional path where to write the CutSet manifest with attached feature manifests. If not specified, we will be keeping all manifests in memory.

  • batch_duration (float) – The maximum number of audio seconds in a batch. Determines batch size dynamically.

  • num_workers (int) – How many background dataloading workers should be used for reading the audio.

  • collate (bool) – If True, the waveforms will be collated into a single padded tensor before being passed to the feature extractor. Some extractors can be faster this way (e.g., see lhotse.features.kaldi.extractors). If you are using kaldifeat extractors, you should set this to False.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional callable used for audio augmentation. Be careful with the types of augmentations used: if they modify the start/end/duration times of the cut and its supervisions, you will end up with incorrect supervision information when using this API. E.g. for speed perturbation, use CutSet.perturb_speed() instead.

  • storage_type (Type[TypeVar(FW, bound= FeaturesWriter)]) – a FeaturesWriter subclass type. It determines how the features are stored to disk, e.g. separate file per array, HDF5 files with multiple arrays, etc.

  • overwrite (bool) – should we overwrite the manifest, HDF5 files, etc. By default, this method will append to these files if they exist.

Return type:

CutSet

Returns:

Returns a new CutSet with Features manifests attached to the cuts.

save_audios(storage_path, format='wav', encoding=None, num_jobs=None, executor=None, augment_fn=None, progress_bar=True, shuffle_on_split=True, **kwargs)[source]

Store waveforms of all cuts as audio recordings to disk.

Parameters:
  • storage_path (Union[Path, str]) – The path to location where we will store the audio recordings. For each cut, a sub-directory will be created that starts with the first 3 characters of the cut’s ID. The audio recording is then stored in the sub-directory using filename {cut.id}.{format}

  • format (str) – Audio format argument supported by torchaudio.save or soundfile.write. Tested values are: wav, flac, and opus.

  • encoding (Optional[str]) – Audio encoding argument supported by torchaudio.save or soundfile.write. Please refer to the documentation of the relevant library used in your audio backend.

  • num_jobs (Optional[int]) – The number of parallel processes used to store the audio recordings. We will internally split the CutSet into this many chunks and process each chunk in parallel.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional callable used for audio augmentation. Be careful with the types of augmentations used: if they modify the start/end/duration times of the cut and its supervisions, you will end up with incorrect supervision information when using this API. E.g. for speed perturbation, use CutSet.perturb_speed() instead.

  • executor (Optional[Executor]) – when provided, will be used to parallelize the process. By default, we will instantiate a ProcessPoolExecutor. Learn more about the Executor API at https://lhotse.readthedocs.io/en/latest/parallelism.html

  • progress_bar (bool) – Should a progress bar be displayed (automatically turned off for parallel computation).

  • shuffle_on_split (bool) – Shuffle the CutSet before splitting it for the parallel workers. It is active only when num_jobs > 1. The default is True.

  • kwargs – Deprecated arguments go here and are ignored.

Return type:

CutSet

Returns:

Returns a new CutSet.

compute_global_feature_stats(storage_path=None, max_cuts=None, extractor=None)[source]

Compute the global means and standard deviations for each feature bin in the manifest. It follows the implementation in scikit-learn: https://github.com/scikit-learn/scikit-learn/blob/0fb307bf39bbdacd6ed713c00724f8f871d60370/sklearn/utils/extmath.py#L715 which follows the paper: “Algorithms for computing the sample variance: analysis and recommendations”, by Chan, Golub, and LeVeque.

Parameters:
  • storage_path (Union[Path, str, None]) – an optional path to a file where the stats will be stored with pickle.

  • max_cuts (Optional[int]) – optionally, limit the number of cuts used for stats estimation. The cuts will be selected randomly in that case.

  • extractor (Optional[FeatureExtractor]) – optional FeatureExtractor, when provided, we ignore any pre-computed features.

Return type:

Dict[str, ndarray]

Returns:

a dict of {‘norm_means’: np.ndarray, ‘norm_stds’: np.ndarray} with the shape of the arrays equal to the number of feature bins in this manifest.

with_features_path_prefix(path)[source]
Return type:

CutSet

with_recording_path_prefix(path)[source]
Return type:

CutSet

copy_data(output_dir, verbose=True)[source]

Copies every data item referenced by this CutSet into a new directory. The structure is as follows:

  • output_dir
    ├── audio
    │   ├── rec1.flac
    │   └── …
    ├── custom
    │   ├── field1
    │   │   ├── arr1-1.npy
    │   │   └── …
    │   └── field2
    │       ├── arr2-1.npy
    │       └── …
    ├── features.lca
    └── cuts.jsonl.gz

Parameters:
  • output_dir (Union[Path, str]) – The root directory where we’ll store the copied data.

  • verbose (bool) – Show progress bar, enabled by default.

Return type:

CutSet

Returns:

CutSet manifest pointing to the new data.

copy_feats(writer, output_path=None)[source]

Save a copy of every feature matrix found in this CutSet using writer and return a new manifest with cuts referring to the new feature locations.

Parameters:
  • writer (FeaturesWriter) – a lhotse.features.io.FeaturesWriter instance.

  • output_path (Union[Path, str, None]) – optional path where the new manifest should be stored. It’s used to write the manifest incrementally and return a lazy manifest, otherwise the copy is stored in memory.

Return type:

CutSet

Returns:

a copy of the manifest.

modify_ids(transform_fn)[source]

Modify the IDs of cuts in this CutSet. Useful when combining multiple CutSets that were created from a single source, but contain features with different data augmentation techniques.

Parameters:

transform_fn (Callable[[str], str]) – A callable (function) that accepts a string (cut ID) and returns a new string (new cut ID).

Return type:

CutSet

Returns:

a new CutSet with cuts with modified IDs.

fill_supervisions(add_empty=True, shrink_ok=False)[source]

Fills the whole duration of each cut in a CutSet with a supervision segment.

If the cut has one supervision, its start is set to 0 and duration is set to cut.duration. Note: this may either expand a supervision that was shorter than a cut, or shrink a supervision that exceeds the cut.

If there are no supervisions, we will add an empty one when add_empty==True, otherwise we won’t change anything.

If there are two or more supervisions, we will raise an exception.

Parameters:
  • add_empty (bool) – should we add an empty supervision with identical time bounds as the cut.

  • shrink_ok (bool) – bool; when True, a supervision that exceeds the cut may be shrunk to fit it; when False, attempting such a shrink raises an error.

Return type:

CutSet

map_supervisions(transform_fn)[source]

Modify the SupervisionSegments by transform_fn in this CutSet.

Parameters:

transform_fn (Callable[[SupervisionSegment], SupervisionSegment]) – a function that accepts a supervision and returns a modified supervision.

Return type:

CutSet

Returns:

a new, modified CutSet.

transform_text(transform_fn)[source]

Return a copy of this CutSet with all SupervisionSegments text transformed with transform_fn. Useful for text normalization, phonetic transcription, etc.

Parameters:

transform_fn (Callable[[str], str]) – a function that accepts a string and returns a string.

Return type:

CutSet

Returns:

a new, modified CutSet.
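
Example (an illustrative sketch of a simple normalization; cuts stands for an already-loaded CutSet):

>>> lowercased = cuts.transform_text(str.lower)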

filter(predicate)

Return a new manifest containing only the items that satisfy predicate. If the manifest is lazy, the filtering will also be applied lazily.

Parameters:

predicate (Callable[[TypeVar(T)], bool]) – a function that takes a cut as an argument and returns bool.

Returns:

a filtered manifest.
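
Example (an illustrative sketch; cuts stands for an already-loaded CutSet):

>>> long_cuts = cuts.filter(lambda c: c.duration >= 5.0)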

classmethod from_file(path)
Return type:

Any

classmethod from_json(path)
Return type:

Any

classmethod from_jsonl(path)
Return type:

Any

classmethod from_jsonl_lazy(path)

Read a JSONL manifest in a lazy manner, which opens the file but does not read it immediately. It is only suitable for sequential reads and iteration.

Return type:

Any

Warning

Opening the manifest in this way might cause some methods that rely on random access to fail.

classmethod from_yaml(path)
Return type:

Any

classmethod infinite_mux(*manifests, weights=None, seed=0, max_open_streams=None)

Merges multiple manifest iterables into a new iterable by lazily multiplexing them during iteration time. Unlike mux(), this method allows limiting the number of open sub-iterators at any given time.

To enable this, it performs 2-stage sampling. First, it samples with replacement the set of iterators I to construct a subset I_sub of size max_open_streams. Then, for each iteration step, it samples an iterator i from I_sub, fetches the next item from it, and yields it. Once i becomes exhausted, it is replaced with a new iterator j sampled from I_sub.

Caution

Do not use this method with inputs that are infinitely iterable as they will silently break the multiplexing property by only using a subset of the input iterables.

Caution

This method is not recommended for multiplexing for a small amount of iterations, as it may be much less accurate than mux() depending on the number of open streams, iterable sizes, and the random seed.

Parameters:
  • manifests – iterables to be multiplexed. They can be either lazy or eager, but the resulting manifest will always be lazy.

  • weights (Optional[List[Union[int, float]]]) – an optional weight for each iterable, affects the probability of it being sampled. The weights are uniform by default. If lengths are known, it makes sense to pass them here for uniform distribution of items in the expectation.

  • seed (Union[int, Literal['trng', 'randomized']]) – the random seed, ensures deterministic order across multiple iterations.

  • max_open_streams (Optional[int]) – the number of iterables that can be open simultaneously at any given time.

property is_lazy: bool

Indicates whether this manifest was opened in lazy (read-on-the-fly) mode or not.

classmethod mux(*manifests, stop_early=False, weights=None, seed=0)

Merges multiple manifest iterables into a new iterable by lazily multiplexing them during iteration time. If one of the iterables is exhausted before the others, we will keep iterating until all iterables are exhausted. This behavior can be changed with stop_early parameter.

Parameters:
  • manifests – iterables to be multiplexed. They can be either lazy or eager, but the resulting manifest will always be lazy.

  • stop_early (bool) – should we stop the iteration as soon as we exhaust one of the manifests.

  • weights (Optional[List[Union[int, float]]]) – an optional weight for each iterable, affects the probability of it being sampled. The weights are uniform by default. If lengths are known, it makes sense to pass them here for uniform distribution of items in the expectation.

  • seed (Union[int, Literal['trng', 'randomized']]) – the random seed, ensures deterministic order across multiple iterations.
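
Example (an illustrative sketch; the two source CutSets are hypothetical):

>>> mixed = CutSet.mux(cuts_corpus_a, cuts_corpus_b, weights=[0.7, 0.3], seed=0)
>>> # Items are drawn lazily from both manifests, ~70% of the time from corpus A.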

classmethod open_writer(path, overwrite=True)

Open a sequential writer that allows storing the manifests one by one, without the necessity of keeping the whole manifest set in memory. Supports writing to JSONL format (.jsonl), with optional gzip compression (.jsonl.gz).

Return type:

Union[SequentialJsonlWriter, InMemoryWriter]

Note

when path is None, we will return an InMemoryWriter instead, which has the same API but stores the manifests in memory. It is convenient when you want to make disk saving optional.

Example:

>>> from lhotse import RecordingSet
... recordings = [...]
... with RecordingSet.open_writer('recordings.jsonl.gz') as writer:
...     for recording in recordings:
...         writer.write(recording)

This writer can be useful for continuing to write files that were previously stopped – it will open the existing file and scan it for item IDs to skip writing them later. It can also be queried for existing IDs so that the user code may skip preparing the corresponding manifests.

Example:

>>> from lhotse import RecordingSet, Recording
... with RecordingSet.open_writer('recordings.jsonl.gz', overwrite=False) as writer:
...     for path in Path('.').rglob('*.wav'):
...         recording_id = path.stem
...         if writer.contains(recording_id):
...             # Item already written previously - skip processing.
...             continue
...         # Item doesn't exist yet - run extra work to prepare the manifest
...         # and store it.
...         recording = Recording.from_file(path, recording_id=recording_id)
...         writer.write(recording)
repeat(times=None, preserve_id=False)

Return a new, lazily evaluated manifest that iterates over the original elements times number of times.

Parameters:
  • times (Optional[int]) – how many times to repeat (infinite by default).

  • preserve_id (bool) – when True, we won’t update the element ID with repeat number.

Returns:

a repeated manifest.

shuffle(rng=None, buffer_size=10000)

Shuffles the elements and returns a shuffled variant of self. If the manifest is opened lazily, performs shuffling on-the-fly with a fixed buffer size.

Parameters:

rng (Optional[Random]) – an optional instance of random.Random for precise control of randomness.

Returns:

a shuffled copy of self, or a manifest that is shuffled lazily.

to_eager()

Evaluates all lazy operations on this manifest, if any, and returns a copy that keeps all items in memory. If the manifest was “eager” already, this is a no-op and won’t copy anything.

to_file(path)
Return type:

None

to_json(path)
Return type:

None

to_jsonl(path)
Return type:

None

to_yaml(path)
Return type:

None

class lhotse.cut.MixedCut(id, tracks, transforms=None)[source]

MixedCut is a Cut that actually consists of multiple other cuts. Its primary purpose is to allow time-domain and feature-domain augmentation via mixing the training cuts with noise, music, and babble cuts. The actual mixing operations are performed on-the-fly.

Internally, MixedCut holds other cuts in multiple tracks (MixTrack), each with its own offset and SNR that is relative to the first track.

Please refer to the documentation of Cut to learn more about using cuts.

In addition to methods available in Cut, MixedCut provides the methods to read all of its tracks audio and features as separate channels:

>>> cut = MixedCut(...)
>>> mono_features = cut.load_features()
>>> assert len(mono_features.shape) == 2
>>> multi_features = cut.load_features(mixed=False)
>>> # Now, the first dimension is the channel.
>>> assert len(multi_features.shape) == 3

Note

MixedCut is different from MultiCut, which is intended to represent multi-channel recordings that share the same supervisions.

Note

Each track in a MixedCut can be either a MonoCut, MultiCut, or PaddingCut.

Note

The transforms field is a list of dictionaries that describe the transformations that should be applied to the track after mixing.

id: str
tracks: List[MixTrack]
transforms: Optional[List[Dict]] = None
property supervisions: List[SupervisionSegment]

Lists the supervisions of the underlying source cuts. Each segment start time will be adjusted by the track offset.

property start: float
property duration: float
property channel: int | List[int]
property has_features: bool
property has_recording: bool
property has_video: bool
has(field)[source]
Return type:

bool

property num_frames: int | None
property frame_shift: float | None
property sampling_rate: int | None
property num_samples: int | None
property num_features: int | None
property num_channels: int | None
property features_type: str | None
load_custom(name)[source]

Load custom data as numpy array. The custom data is expected to have been stored in the cut’s custom field as an Array or TemporalArray manifest.

Note

It works with Array manifests stored via attribute assignments, e.g.: cut.my_custom_data = Array(...).

Warning

For MixedCut, this will only work if the mixed cut consists of a single MonoCut and an arbitrary number of PaddingCuts. This is because it is generally undefined how to mix arbitrary arrays.

Parameters:

name (str) – name of the custom attribute.

Return type:

ndarray

Returns:

a numpy array with the data (after padding).

move_to_memory(audio_format='flac', load_audio=True, load_features=True, load_custom=True)[source]

Load data (audio, features, or custom arrays) into memory and attach them to a copy of the manifest. This is useful when you want to store cuts together with the actual data in some binary format that enables sequential data reads.

Audio is encoded with audio_format (compatible with torchaudio.save), floating point features are encoded with lilcom, and other arrays are pickled.

Return type:

MixedCut

to_mono(encoding='flac', **kwargs)[source]

Convert this MixedCut to a MonoCut by mixing all tracks and channels into a single one. The result audio array is stored in memory, and can be saved to disk by calling cut.save_audio(path, ...) on the result.

Hint

the resulting MonoCut will have its custom field populated with the custom value from the first track of the MixedCut.

Parameters:

encoding (str) – any of “wav”, “flac”, or “opus”.

Return type:

Cut

Returns:

a new MonoCut instance.
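
Example (a minimal sketch; mixed_cut is a hypothetical MixedCut and mixed.flac a hypothetical path):

>>> mono = mixed_cut.to_mono(encoding="flac")
>>> mono = mono.save_audio("mixed.flac")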

truncate(*, offset=0.0, duration=None, keep_excessive_supervisions=True, preserve_id=False, _supervisions_index=None)[source]

Returns a new MixedCut that is a sub-region of the current MixedCut. This method truncates the underlying Cuts and modifies their offsets in the mix, as needed. Tracks that do not fit in the truncated cut are removed.

Note that no operation is done on the actual features - it’s only during the call to load_features() when the actual changes happen (a subset of features is loaded).

Parameters:
  • offset (float) – float (seconds), controls the start of the new cut relative to the current MixedCut’s start.

  • duration (Optional[float]) – optional float (seconds), controls the duration of the resulting MixedCut. By default, the duration is (end of the cut before truncation) - (offset).

  • keep_excessive_supervisions (bool) – bool. Since trimming may happen inside a SupervisionSegment, the caller has an option to either keep or discard such supervisions.

  • preserve_id (bool) – bool. Should the truncated cut keep the same ID or get a new, random one.

Return type:

Cut

Returns:

a new MixedCut instance.
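
Example (a minimal sketch; mixed_cut is a hypothetical MixedCut at least 7 seconds long):

>>> sub = mixed_cut.truncate(offset=2.0, duration=5.0)
>>> sub.duration
5.0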

extend_by(*, duration, direction='both', preserve_id=False, pad_silence=True)[source]

This raises a ValueError since extending a MixedCut is not defined.

Parameters:
  • duration (float) – float (seconds), duration (in seconds) to extend the MixedCut.

  • direction (str) – string, ‘left’, ‘right’ or ‘both’. Determines whether to extend on the left, right, or both sides. If ‘both’, extend on both sides by the duration specified in duration.

  • preserve_id (bool) – bool. Should the extended cut keep the same ID or get a new, random one.

  • pad_silence (bool) – bool. See usage in lhotse.cut.MonoCut.extend_by.

Return type:

MixedCut

Returns:

a new MixedCut instance.

pad(duration=None, num_frames=None, num_samples=None, pad_feat_value=-23.025850929940457, direction='right', preserve_id=False, pad_value_dict=None)[source]

Return a new MixedCut, padded with zeros in the recording, and pad_feat_value in each feature bin.

The user can choose to pad either to a specific duration; a specific number of frames num_frames; or a specific number of samples num_samples. The three arguments are mutually exclusive.

Parameters:
  • duration (Optional[float]) – The cut’s minimal duration after padding.

  • num_frames (Optional[int]) – The cut’s total number of frames after padding.

  • num_samples (Optional[int]) – The cut’s total number of samples after padding.

  • pad_feat_value (float) – A float value that’s used for padding the features. By default we assume a log-energy floor of approx. -23 (1e-10 after exp).

  • direction (str) – string, ‘left’, ‘right’ or ‘both’. Determines whether the padding is added before or after the cut.

  • preserve_id (bool) – When True, preserves the cut ID from before padding. Otherwise, generates a new random ID (default).

  • pad_value_dict (Optional[Dict[str, Union[int, float]]]) – Optional dict that specifies what value should be used for padding arrays in custom attributes.

Return type:

Cut

Returns:

a padded MixedCut if duration is greater than this cut’s duration, otherwise self.
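
Example (a minimal sketch; cut is a hypothetical cut shorter than 10 seconds):

>>> padded = cut.pad(duration=10.0, direction="right")
>>> padded.duration
10.0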

resample(sampling_rate, affix_id=False)[source]

Return a new MixedCut that will lazily resample the audio while reading it. This operation will drop the feature manifest, if attached. It does not affect the supervision.

Parameters:
  • sampling_rate (int) – The new sampling rate.

  • affix_id (bool) – Should we modify the ID (useful if both versions of the same cut are going to be present in a single manifest).

Return type:

MixedCut

Returns:

a modified copy of the current MixedCut.

perturb_speed(factor, affix_id=True)[source]

Return a new MixedCut that will lazily perturb the speed while loading audio. The num_samples, start and duration fields of the underlying Cuts (and their Recordings and SupervisionSegments) are updated to reflect the shrinking/extending effect of speed. We are also updating the offsets of all underlying tracks.

Parameters:
  • factor (float) – The speed will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • affix_id (bool) – When true, we will modify the MixedCut.id field by affixing it with “_sp{factor}”.

Return type:

MixedCut

Returns:

a modified copy of the current MixedCut.
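
Example (a minimal sketch; cut is a hypothetical MixedCut):

>>> faster = cut.perturb_speed(factor=1.1)  # id gets the "_sp1.1" suffix
>>> audio = faster.load_audio()             # the speed change is applied lazily here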

perturb_tempo(factor, affix_id=True)[source]

Return a new MixedCut that will lazily perturb the tempo while loading audio.

Compared to speed perturbation, tempo preserves pitch. The num_samples, start and duration fields of the underlying Cuts (and their Recordings and SupervisionSegments) are updated to reflect the shrinking/extending effect of tempo. We are also updating the offsets of all underlying tracks.

Parameters:
  • factor (float) – The tempo will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • affix_id (bool) – When true, we will modify the MixedCut.id field by affixing it with “_tp{factor}”.

Return type:

MixedCut

Returns:

a modified copy of the current MixedCut.

perturb_volume(factor, affix_id=True)[source]

Return a new MixedCut that will lazily perturb the volume while loading audio. Recordings of the underlying Cuts are updated to reflect volume change.

Parameters:
  • factor (float) – The volume will be adjusted this many times (e.g. factor=1.1 means 1.1x louder).

  • affix_id (bool) – When true, we will modify the MixedCut.id field by affixing it with “_vp{factor}”.

Return type:

MixedCut

Returns:

a modified copy of the current MixedCut.

normalize_loudness(target, mix_first=True, affix_id=False)[source]

Return a new MixedCut that will lazily apply loudness normalization.

Parameters:
  • target (float) – The target loudness in dBFS.

  • mix_first (bool) – If true, we will mix the underlying cuts before applying loudness normalization. If false, we cannot guarantee that the resulting cut will have the target loudness.

  • affix_id (bool) – When true, we will modify the DataCut.id field by affixing it with “_ln{target}”.

Return type:

DataCut

Returns:

a modified copy of the current DataCut.

reverb_rir(rir_recording=None, normalize_output=True, early_only=False, affix_id=True, rir_channels=[0], room_rng_seed=None, source_rng_seed=None, mix_first=True)[source]

Return a new MixedCut that will convolve the audio with the provided impulse response. If no rir_recording is provided, we will generate an impulse response using a fast random generator (https://arxiv.org/abs/2208.04101).

Parameters:
  • rir_recording (Optional[Recording]) – The impulse response to use for convolving.

  • normalize_output (bool) – When true, the output will be normalized to have the same energy as the input.

  • early_only (bool) – When true, only the early reflections (first 50 ms) will be used.

  • affix_id (bool) – When true, we will modify the MixedCut.id field by affixing it with “_rvb”.

  • rir_channels (List[int]) – The channels of the impulse response to use. By default, first channel is used. If only one channel is specified, all tracks will be convolved with this channel. If a list is provided, it must contain as many channels as there are tracks such that each track will be convolved with one of the specified channels.

  • room_rng_seed (Optional[int]) – Seed for the room configuration.

  • source_rng_seed (Optional[int]) – Seed for the source position.

  • mix_first (bool) – When true, the mixing will be done first before convolving with the RIR. This effectively means that all tracks will be convolved with the same RIR. If you are simulating multi-speaker mixtures, you should set this to False.

Return type:

MixedCut

Returns:

a modified copy of the current MixedCut.
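
Example (a minimal sketch; mixed_cut is a hypothetical MixedCut and rir a hypothetical single-channel Recording of an impulse response):

>>> reverberant = mixed_cut.reverb_rir(rir_recording=rir, mix_first=False)
>>> simulated = mixed_cut.reverb_rir(room_rng_seed=0, source_rng_seed=0)  # no RIR needed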

load_features(mixed=True)[source]

Loads the features of the source cuts and mixes them on-the-fly.

Parameters:

mixed (bool) – when True (default), the features are mixed together (as defined in the mixing function for the extractor). This could result in either a 2D or 3D array. For example, if all underlying tracks are single-channel, the output will be a 2D array of shape (num_frames, num_features). If any of the tracks are multi-channel, the output may be a 3D array of shape (num_frames, num_features, num_channels).

Return type:

Optional[ndarray]

Returns:

A numpy ndarray with features and with shape (num_frames, num_features), or (num_tracks, num_frames, num_features)

load_audio(mixed=True, mono_downmix=False)[source]

Loads the audio of the source cuts and mixes them on-the-fly.

Parameters:
  • mixed (bool) – When True (default), returns a mix of the underlying tracks. This will return a numpy array with shape (num_channels, num_samples), where num_channels is determined by the num_channels property of the MixedCut. Otherwise returns a numpy array with the number of channels equal to the total number of channels across all tracks in the MixedCut. For example, if it contains a MultiCut with 2 channels and a MonoCut with 1 channel, the returned array will have shape (3, num_samples).

  • mono_downmix (bool) – If the MixedCut contains more than one channel (e.g. when one of its tracks is a MultiCut), this parameter controls whether the returned array will be down-mixed to a single channel. This down-mixing is done by summing the channels together.

Return type:

Optional[ndarray]

Returns:

A numpy ndarray with audio samples and with shape (num_channels, num_samples)
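
Example (a minimal sketch; mixed_cut is a hypothetical MixedCut):

>>> mix = mixed_cut.load_audio()                    # (num_channels, num_samples)
>>> per_track = mixed_cut.load_audio(mixed=False)   # one channel per source track
>>> mono = mixed_cut.load_audio(mono_downmix=True)  # channels summed into one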

property video: VideoInfo | None
load_video(with_audio=True, mixed=True, mono_downmix=False)[source]
Return type:

Optional[Tuple[Tensor, Optional[Tensor]]]

plot_tracks_features()[source]

Display the feature matrix as an image. Requires matplotlib to be installed.

plot_tracks_audio()[source]

Display plots of the individual tracks’ waveforms. Requires matplotlib to be installed.

drop_features()[source]

Return a copy of the current MixedCut, detached from features.

Return type:

MixedCut

drop_recording()[source]

Return a copy of the current MixedCut, detached from recording.

Return type:

MixedCut

drop_supervisions()[source]

Return a copy of the current MixedCut, detached from supervisions.

Return type:

MixedCut

drop_alignments()[source]

Return a copy of the current MixedCut, detached from alignments.

Return type:

MixedCut

compute_and_store_features(extractor, storage, augment_fn=None, mix_eagerly=True)[source]

Compute the features from this cut, store them on disk, and create a new MonoCut object with the feature manifest attached. This cut has to be able to load audio.

Parameters:
  • extractor (FeatureExtractor) – a FeatureExtractor instance used to compute the features.

  • storage (FeaturesWriter) – a FeaturesWriter instance used to store the features.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional callable used for audio augmentation.

  • mix_eagerly (bool) – when False, extract and store the features for each track separately, and mix them dynamically when loading the features. When True, mix the audio first and store the mixed features, returning a new MonoCut instance with the same ID. The returned MonoCut will not have a Recording attached.

Return type:

DataCut

Returns:

a new MonoCut instance if mix_eagerly is True, or returns self with each of the tracks containing the Features manifests.

fill_supervision(add_empty=True, shrink_ok=False)[source]

Fills the whole duration of a cut with a supervision segment.

If the cut has one supervision, its start is set to 0 and duration is set to cut.duration. Note: this may either expand a supervision that was shorter than a cut, or shrink a supervision that exceeds the cut.

If there are no supervisions, we will add an empty one when add_empty==True, otherwise we won’t change anything.

If there are two or more supervisions, we will raise an exception.

Note

For MixedCut, we expect that only one track contains a supervision. That supervision will be expanded to cover the full MixedCut’s duration.

Parameters:
  • add_empty (bool) – should we add an empty supervision with identical time bounds as the cut.

  • shrink_ok (bool) – should we raise an error if a supervision would be shrunk as a result of calling this method.

Return type:

MixedCut

map_supervisions(transform_fn)[source]

Return a copy of this MixedCut with its SupervisionSegments modified by transform_fn.

Parameters:

transform_fn (Callable[[SupervisionSegment], SupervisionSegment]) – a function that takes a supervision as an argument and returns its modified version.

Return type:

Cut

Returns:

a modified MixedCut.

merge_supervisions(merge_policy='delimiter', custom_merge_fn=None)[source]

Return a copy of the cut that has all of its supervisions merged into a single segment.

The new start is the start of the earliest supervision, and the new duration is the minimum spanning duration of all the supervisions. The text fields are concatenated with a whitespace.

Note

If you’re using individual tracks of a mixed cut, note that this transform drops all the supervisions in individual tracks and assigns the merged supervision in the first DataCut found in self.tracks.

Parameters:
  • merge_policy (str) – one of “keep_first” or “delimiter”. If “keep_first”, we keep only the first segment’s field value, otherwise all string fields (including IDs) are prefixed with “cat#” and concatenated with a hash symbol “#”. This is also applied to custom fields. Fields with a None value are omitted.

  • custom_merge_fn (Optional[Callable[[str, Iterable[Any]], Any]]) – a function that will be called to merge custom fields values. We expect custom_merge_fn to handle all possible custom keys. When not provided, we will treat all custom values as strings. It will be called roughly like: custom_merge_fn(custom_key, [s.custom[custom_key] for s in sups])

Return type:

MixedCut
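
Example (a minimal sketch; mixed_cut is a hypothetical MixedCut with several supervisions):

>>> merged = mixed_cut.merge_supervisions()
>>> len(merged.supervisions)
1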

filter_supervisions(predicate)[source]

Return a copy of the cut that only has supervisions accepted by predicate.

Example:
>>> cut = cut.filter_supervisions(lambda s: s.id in supervision_ids)
>>> cut = cut.filter_supervisions(lambda s: s.duration < 5.0)
>>> cut = cut.filter_supervisions(lambda s: s.text is not None)
Parameters:

predicate (Callable[[SupervisionSegment], bool]) – A callable that accepts SupervisionSegment and returns bool

Return type:

Cut

Returns:

a modified MixedCut

static from_dict(data)[source]
Return type:

MixedCut

with_features_path_prefix(path)[source]
Return type:

MixedCut

with_recording_path_prefix(path)[source]
Return type:

MixedCut

__init__(id, tracks, transforms=None)
append(other, snr=None, preserve_id=None)

Append the other Cut after the current Cut. Conceptually the same as mix, but with an offset matching the current cut’s length. Optionally scale down (positive SNR) or scale up (negative SNR) the other cut. Returns a MixedCut, which only keeps the information about the mix; the actual mixing is performed during the call to load_features.

Parameters:

preserve_id (Optional[str]) – optional string (“left”, “right”). When specified, append will preserve the cut ID of the left- or right-hand side argument. Otherwise, a new random ID is generated.

Return type:

Cut
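
Example (a minimal sketch; cut1 and cut2 are hypothetical cuts):

>>> joined = cut1.append(cut2)  # a MixedCut; cut2 starts where cut1 ends
>>> joined.duration == cut1.duration + cut2.duration
True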

compute_features(extractor, augment_fn=None)

Compute the features from this cut. This cut has to be able to load audio.

Parameters:
  • extractor (FeatureExtractor) – a FeatureExtractor instance used to compute the features.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – optional WavAugmenter instance for audio augmentation.

Return type:

ndarray

Returns:

a numpy ndarray with the computed features.

cut_into_windows(duration, hop=None, keep_excessive_supervisions=True)

Return a list of shorter cuts, made by traversing this cut in windows of duration seconds by hop seconds.

The last window might have a shorter duration if there was not enough audio, so you might want to either filter out or pad such cuts.

Parameters:
  • duration (float) – Desired duration of the new cuts in seconds.

  • hop (Optional[float]) – Shift between the windows in the new cuts in seconds.

  • keep_excessive_supervisions (bool) – bool. When a cut is truncated in the middle of a supervision segment, should the supervision be kept.

Return type:

CutSet

Returns:

a list of cuts made from shorter duration windows.
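
Example (a minimal sketch; cut is a hypothetical cut):

>>> windows = cut.cut_into_windows(duration=5.0, hop=2.5)
>>> all(w.duration <= 5.0 for w in windows)
True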

property end: float
property has_overlapping_supervisions: bool
index_supervisions(index_mixed_tracks=False, keep_ids=None)

Create a two-level index of supervision segments. It is a mapping from a Cut’s ID to an interval tree that contains the supervisions of that Cut.

The interval tree can be efficiently queried for overlapping and/or enveloping segments. It helps speed up some operations on Cuts of very long recordings (1h+) that contain many supervisions.

Parameters:
  • index_mixed_tracks (bool) – Should the tracks of MixedCut’s be indexed as additional, separate entries.

  • keep_ids (Optional[Set[str]]) – If specified, we will only index the supervisions with the specified IDs.

Return type:

Dict[str, IntervalTree]

Returns:

a mapping from Cut ID to an interval tree of SupervisionSegments.

mix(other, offset_other_by=0.0, allow_padding=False, snr=None, preserve_id=None)

Refer to the documentation of lhotse.cut.mix().

Return type:

Cut

play_audio()

Display a Jupyter widget that lets you listen to the waveform. Works only in Jupyter notebook/lab or similar (e.g. Colab).

plot_alignment(alignment_type='word')

Display the alignment on top of a spectrogram. Requires matplotlib to be installed.

plot_audio()

Display a plot of the waveform. Requires matplotlib to be installed.

plot_features()

Display the feature matrix as an image. Requires matplotlib to be installed.

save_audio(storage_path, format=None, encoding=None, augment_fn=None, **kwargs)

Store this cut’s waveform as audio recording to disk.

Parameters:
  • storage_path (Union[Path, str]) – The path to location where we will store the audio recordings.

  • format (Optional[str]) – Audio format argument supported by torchaudio.save or soundfile.write. Please refer to the relevant library’s documentation depending on which audio backend you’re using.

  • encoding (Optional[str]) – Audio encoding argument supported by torchaudio.save or soundfile.write. Please refer to the relevant library’s documentation depending on which audio backend you’re using.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional callable used for audio augmentation. Be careful with the types of augmentations used: if they modify the start/end/duration times of the cut and its supervisions, you will end up with incorrect supervision information when using this API. E.g. for speed perturbation, use CutSet.perturb_speed() instead.

  • kwargs – additional arguments passed to Cut.load_audio(). For example, when saving a MixedCut, we can specify mono_downmix=True to downmix the tracks to mono before saving.

Return type:

Cut

Returns:

a new Cut instance.
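
Example (a minimal sketch; cut.flac is a hypothetical path):

>>> stored = cut.save_audio("cut.flac")  # returns a new Cut backed by the saved file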

speakers_audio_mask(min_speaker_dim=None, speaker_to_idx_map=None, use_alignment_if_exists=None)

Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_samples), and its values are 0 for nonspeech samples and 1 for speech samples for each respective speaker.

This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272

Parameters:
  • min_speaker_dim (Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).

  • speaker_to_idx_map (Optional[Dict[str, int]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2).

  • use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, fall back on the supervision time spans.

Return type:

ndarray
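
Example (a minimal sketch; cut is a hypothetical cut whose supervisions have speaker labels):

>>> mask = cut.speakers_audio_mask()
>>> mask.shape[1] == cut.num_samples
True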

speakers_feature_mask(min_speaker_dim=None, speaker_to_idx_map=None, use_alignment_if_exists=None)

Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_frames), and its values are 0 for nonspeech frames and 1 for speech frames for each respective speaker.

This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272

Parameters:
  • min_speaker_dim (Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).

  • speaker_to_idx_map (Optional[Dict[str, int]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2).

  • use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, fall back on the supervision time spans.

Return type:

ndarray

split(timestamp)
Return type:

Tuple[Cut, Cut]

Split a cut into two cuts at timestamp, which is measured from the start of the cut. For example, a [0s - 10s] cut split at 4s yields:

  • left cut [0s - 4s]

  • right cut [4s - 10s]
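
Example (a minimal sketch; cut is a hypothetical 10-second cut):

>>> left, right = cut.split(4.0)
>>> (left.duration, right.duration)
(4.0, 6.0)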

supervisions_audio_mask(use_alignment_if_exists=None)

Return a 1D numpy array with value 1 for samples covered by at least one supervision, and 0 for samples not covered by any supervision.

Parameters:

use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, fall back on the supervision time spans.

Return type:

ndarray

supervisions_feature_mask(use_alignment_if_exists=None)

Return a 1D numpy array with value 1 for frames covered by at least one supervision, and 0 for frames not covered by any supervision.

Parameters:

use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, fall back on the supervision time spans.

Return type:

ndarray

to_dict()
Return type:

dict

trim_to_alignments(type, max_pause=None, max_segment_duration=None, delimiter=' ', keep_all_channels=False)

Splits the current Cut into its constituent alignment items (AlignmentItem). These cuts have the same start times and durations as the alignment items. Additionally, the max_pause option can be used to merge alignment items that are separated by a pause shorter than max_pause. If max_segment_duration is specified, we will keep merging consecutive segments until the duration of the merged segment exceeds max_segment_duration.

For the case of a multi-channel cut with multiple alignments, we can either trim while respecting the supervision channels (in which case output cut has the same channels as the supervision) or ignore the channels (in which case output cut has the same channels as the input cut).

Hint

If the resulting trimmed cut contains a single supervision, we set the cut id to the id of this supervision, for better compatibility with downstream tools, e.g. comparing the hypothesis of ASR with the reference in icefall.

Hint

If a MultiCut is trimmed and the resulting trimmed cut contains a single channel, we convert it to a MonoCut.

Hint

If you have a Cut with multiple supervision segments and you want to trim it to the word-level alignment, you can use the Cut.merge_supervisions() method first to merge the supervisions into a single one, followed by the Cut.trim_to_alignments() method. For example:

>>> cut = cut.merge_supervisions()
>>> cut = cut.trim_to_alignments(type='word', max_pause=1.0)

Hint

The above technique can also be used to segment long cuts into roughly equal duration segments, while respecting alignment boundaries. For example, to split a Cut into 10s segments, you can do:

>>> cut = cut.merge_supervisions()
>>> cut = cut.trim_to_alignments(type='word', max_pause=10.0, max_segment_duration=10.0)
Parameters:
  • type (str) – The type of the alignment to trim to (e.g. “word”).

  • max_pause (Optional[float]) – The maximum pause allowed between the alignments to merge them. If None, no merging will be performed. [default: None]

  • max_segment_duration (Optional[float]) – The maximum duration of a merged segment. Consecutive alignment items keep being merged until this duration would be exceeded. If None, no limit is applied. [default: None]

  • delimiter (str) – The delimiter to use when joining the alignment items.

  • keep_all_channels (bool) – If True, the output cut will have the same channels as the input cut. By default, the trimmed cut will have the same channels as the supervision.

Return type:

CutSet

Returns:

a CutSet object.

trim_to_supervision_groups(max_pause=0.0)

Return a new CutSet with Cuts based on supervision groups. A supervision group is a set of supervisions with no gaps between them (or gaps shorter than max_pause). This is similar to the concept of an utterance group as described in this paper: https://arxiv.org/abs/2211.00482

For example, the following cut:

Cut

╔═════════════════════════════════════════════════════════════════════════════════╗
║┌──────────────────────┐                              ┌────────┐                 ║
║│ Hello this is John.  │                              │   Hi   │                 ║
║└──────────────────────┘                              └────────┘                 ║
║            ┌──────────────────────────────────┐            ┌───────────────────┐║
║            │     Hey, John. How are you?      │            │  What do you do?  │║
║            └──────────────────────────────────┘            └───────────────────┘║
╚═════════════════════════════════════════════════════════════════════════════════╝

is transformed into two cuts:

Cut 1                                       Cut 2

╔════════════════════════════════════════════════╗ ╔══════════════════════════╗
║┌──────────────────────┐                        ║ ║┌────────┐                ║
║│ Hello this is John.  │                        ║ ║│   Hi   │                ║
║└──────────────────────┘                        ║ ║└────────┘                ║
║            ┌──────────────────────────────────┐║ ║     ┌───────────────────┐║
║            │     Hey, John. How are you?      │║ ║     │  What do you do?  │║
║            └──────────────────────────────────┘║ ║     └───────────────────┘║
╚════════════════════════════════════════════════╝ ╚══════════════════════════╝

For the case of a multi-channel cut with multiple supervisions, we keep all the channels in the recording.

Parameters:

max_pause (float) – An optional duration in seconds; if the gap between two supervisions is longer than this, they will be treated as separate groups. By default, this is set to 0.0, which means that no gaps are allowed between supervisions.

Return type:

CutSet

Returns:

a CutSet.

trim_to_supervisions(keep_overlapping=True, min_duration=None, context_direction='center', keep_all_channels=False)

Splits the current Cut into as many cuts as there are supervisions (SupervisionSegment). These cuts have the same start times and durations as the supervisions. When there are overlapping supervisions, they can be kept or discarded via the keep_overlapping flag.

For example, the following cut:

        Cut
|-----------------|
 Sup1
|----|  Sup2
   |-----------|

is transformed into two cuts:

 Cut1
|----|
 Sup1
|----|
   Sup2
   |-|
        Cut2
   |-----------|
   Sup1
   |-|
        Sup2
   |-----------|

For the case of a multi-channel cut with multiple supervisions, we can either trim while respecting the supervision channels (in which case output cut has the same channels as the supervision) or ignore the channels (in which case output cut has the same channels as the input cut).

Hint

If the resulting trimmed cut contains a single supervision, we set the cut id to the id of this supervision, for better compatibility with downstream tools, e.g. comparing the hypothesis of ASR with the reference in icefall.

Hint

If a MultiCut is trimmed and the resulting trimmed cut contains a single channel, we convert it to a MonoCut.

Parameters:
  • keep_overlapping (bool) – when False, it will discard parts of other supervisions that overlap with the main supervision. In the illustration above, it would discard Sup2 in Cut1 and Sup1 in Cut2. In this mode, we guarantee that there will always be exactly one supervision per cut.

  • min_duration (Optional[float]) – An optional duration in seconds; specifying this argument will extend the cuts that would have been shorter than min_duration with actual acoustic context in the recording/features. If there are supervisions present in the context, they are kept when keep_overlapping is true. If there is not enough context, the returned cut will be shorter than min_duration. If the supervision segment is longer than min_duration, the return cut will be longer.

  • context_direction (Literal['center', 'left', 'right', 'random']) – Which direction should the cut be expanded towards to include context. The value of “center” implies equal expansion to left and right; random uniformly samples a value between “left” and “right”.

  • keep_all_channels (bool) – If True, the output cut will have the same channels as the input cut. By default, the trimmed cut will have the same channels as the supervision.

Return type:

CutSet

Returns:

a list of cuts.
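
Example (a minimal sketch; cut is a hypothetical cut with several supervisions):

>>> per_utt = cut.trim_to_supervisions(keep_overlapping=False)
>>> all(len(c.supervisions) == 1 for c in per_utt)
True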

property trimmed_supervisions: List[SupervisionSegment]

Return the supervisions in this Cut that have modified time boundaries so as not to exceed the Cut’s start or end.

Note that when cut.supervisions is called, the supervisions may have negative start values that indicate the supervision actually begins before the cut, or end values that exceed the Cut’s duration (it means the supervision continued in the original recording after the Cut’s ending).

Caution

For some tasks such as speech recognition (ASR), trimmed supervisions could result in corrupted training data. This is because a part of the transcript might actually reside outside of the cut.

with_id(id_)

Return a copy of the Cut with a new ID.

Return type:

Cut

class lhotse.cut.MixTrack(cut, type=None, offset=0.0, snr=None)[source]

Represents a single track in a mix of Cuts. Points to a specific DataCut or PaddingCut and holds information on how to mix it with other Cuts, relative to the first track in a mix.

cut: Union[DataCut, PaddingCut]
type: str = None
offset: float = 0.0
snr: Optional[float] = None
static from_dict(data)[source]
__init__(cut, type=None, offset=0.0, snr=None)
class lhotse.cut.MonoCut(id, start, duration, channel, supervisions=<factory>, features=None, recording=None, custom=None)[source]

MonoCut is a Cut of a single channel of a Recording. In addition to Cut, it has a specified channel attribute. This is the most commonly used type of cut.

Please refer to the documentation of Cut to learn more about using cuts.


channel: int
property num_channels: int
load_features()[source]

Load the features from the underlying storage and cut them to the relevant [begin, duration] region of the current MonoCut.

Return type:

Optional[ndarray]

load_audio()[source]

Load the audio from the recording attached to this MonoCut. The audio is trimmed to the [begin, end] range specified by the MonoCut.

Return type:

Optional[ndarray]

Returns:

a numpy ndarray with audio samples, with shape (1 <channel>, N <samples>)

load_video(with_audio=True)[source]

Load the subset of video (and audio) from attached recording. The data is trimmed to the [begin, end] range specified by the MonoCut.

Parameters:

with_audio (bool) – bool, whether to load and return audio alongside video. True by default.

Return type:

Optional[Tuple[Tensor, Optional[Tensor]]]

Returns:

a tuple of video tensor and optionally audio tensor (or None), or None if this cut has no video.

reverb_rir(rir_recording=None, normalize_output=True, early_only=False, affix_id=True, rir_channels=[0], room_rng_seed=None, source_rng_seed=None)[source]

Return a new DataCut that will convolve the audio with the provided impulse response. If the rir_recording is multi-channel, the rir_channels argument determines which channels will be used. By default, we use the first channel and return a MonoCut. If we reverberate with a multi-channel RIR, we return a MultiCut.

If no rir_recording is provided, we will generate an impulse response using a fast random generator (https://arxiv.org/abs/2208.04101). Note that the generator only supports simulating reverberation with a single microphone, so we will return a MonoCut in this case.

Parameters:
  • rir_recording (Optional[Recording]) – The impulse response to use for convolving.

  • normalize_output (bool) – When true, output will be normalized to have energy as input.

  • early_only (bool) – When true, only the early reflections (first 50 ms) will be used.

  • affix_id (bool) – When true, we will modify the MonoCut.id field by affixing it with “_rvb”.

  • rir_channels (List[int]) – The channels of the impulse response to use. First channel is used by default. If multiple channels are specified, this will produce a MultiCut instead of a MonoCut.

  • room_rng_seed (Optional[int]) – The seed for the room configuration.

  • source_rng_seed (Optional[int]) – The seed for the source position.

Return type:

DataCut

Returns:

a modified copy of the current MonoCut.

merge_supervisions(merge_policy='delimiter', custom_merge_fn=None)[source]

Return a copy of the cut that has all of its supervisions merged into a single segment.

The new start is the start of the earliest supervision, and the new duration is the minimum spanning duration of all the supervisions. The text fields of all segments are concatenated with a whitespace.

Parameters:
  • merge_policy (str) – one of “keep_first” or “delimiter”. If “keep_first”, we keep only the first segment’s field value, otherwise all string fields (including IDs) are prefixed with “cat#” and concatenated with a hash symbol “#”. This is also applied to custom fields. Fields with a None value are omitted.

  • custom_merge_fn (Optional[Callable[[str, Iterable[Any]], Any]]) – a function that will be called to merge custom fields values. We expect custom_merge_fn to handle all possible custom keys. When not provided, we will treat all custom values as strings. It will be called roughly like: custom_merge_fn(custom_key, [s.custom[custom_key] for s in sups])

Return type:

MonoCut

static from_dict(data)[source]
Return type:

MonoCut

__init__(id, start, duration, channel, supervisions=<factory>, features=None, recording=None, custom=None)
append(other, snr=None, preserve_id=None)

Append the other Cut after the current Cut. Conceptually the same as mix, but with an offset matching the current cut’s length. Optionally scale down (positive SNR) or scale up (negative SNR) the other cut. Returns a MixedCut, which only keeps the information about the mix; the actual mixing is performed during the call to load_features.

Parameters:

preserve_id (Optional[str]) – optional string (“left”, “right”). When specified, append will preserve the cut ID of the left- or right-hand side argument. Otherwise, a new random ID is generated.

Return type:

Cut

attach_tensor(name, data, frame_shift=None, temporal_dim=None, compressed=False)

Attach a tensor to this MonoCut, described with an Array manifest. The attached data is stored in-memory for later use, and can be accessed by calling cut.load_<name>() or cut.load_custom().

This is useful if you want actions such as truncate/pad to propagate to the tensor, e.g.:

>>> cut = MonoCut(id="c1", start=2, duration=8, ...)
>>> cut = cut.attach_tensor(
...     "alignment",
...     torch.tensor([0, 0, 0, ...]),
...     frame_shift=0.1,
...     temporal_dim=0,
... )
>>> half_alignment = cut.truncate(duration=4.0).load_alignment()

Note

This object can’t be stored in JSON/JSONL manifests anymore.

Parameters:
  • name (str) – attribute under which the data can be found.

  • data (Union[ndarray, Tensor]) – PyTorch tensor or numpy array.

  • frame_shift (Optional[float]) – Optional float, when the array has a temporal dimension it indicates how much time has passed between the starts of consecutive frames (expressed in seconds).

  • temporal_dim (Optional[int]) – Optional int, when the array has a temporal dimension, it indicates which dim to interpret as temporal.

  • compressed (bool) – When True, we will apply lilcom compression to the array. Only applicable to arrays of floats.

Return type:

Cut

Returns:

a copy of the cut with the tensor attached (via an in-memory Array manifest).

compute_and_store_features(extractor, storage, augment_fn=None, *args, **kwargs)

Compute the features from this cut, store them on disk, and attach a feature manifest to this cut. This cut has to be able to load audio.

Parameters:
  • extractor (FeatureExtractor) – a FeatureExtractor instance used to compute the features.

  • storage (FeaturesWriter) – a FeaturesWriter instance used to write the features to a storage.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional callable used for audio augmentation.

Return type:

DataCut

Returns:

a new MonoCut instance with a Features manifest attached to it.

compute_features(extractor, augment_fn=None)

Compute the features from this cut. This cut has to be able to load audio.

Parameters:
  • extractor (FeatureExtractor) – a FeatureExtractor instance used to compute the features.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – optional WavAugmenter instance for audio augmentation.

Return type:

ndarray

Returns:

a numpy ndarray with the computed features.

custom: Optional[Dict[str, Any]] = None
cut_into_windows(duration, hop=None, keep_excessive_supervisions=True)

Return a list of shorter cuts, made by traversing this cut in windows of duration seconds by hop seconds.

The last window might have a shorter duration if there was not enough audio, so you might want to either filter out or pad such cuts.

Parameters:
  • duration (float) – Desired duration of the new cuts in seconds.

  • hop (Optional[float]) – Shift between the windows in the new cuts in seconds.

  • keep_excessive_supervisions (bool) – bool. When a cut is truncated in the middle of a supervision segment, should the supervision be kept.

Return type:

CutSet

Returns:

a list of cuts made from shorter duration windows.

dereverb_wpe(affix_id=True)

Return a new DataCut that will lazily apply WPE dereverberation.

Parameters:

affix_id (bool) – When true, we will modify the DataCut.id field by affixing it with “_wpe”.

Return type:

DataCut

Returns:

a modified copy of the current DataCut.

drop_alignments()

Return a copy of the current DataCut, detached from alignments.

Return type:

DataCut

drop_custom(name)
drop_features()

Return a copy of the current DataCut, detached from features.

Return type:

DataCut

drop_recording()

Return a copy of the current DataCut, detached from recording.

Return type:

DataCut

drop_supervisions()

Return a copy of the current DataCut, detached from supervisions.

Return type:

DataCut

property end: float
extend_by(*, duration, direction='both', preserve_id=False, pad_silence=True)

Returns a new Cut (DataCut or MixedCut) that extends the current DataCut by a fixed duration in the specified direction.

Note that no operation is done on the actual features or recording - it’s only during the call to DataCut.load_features() / DataCut.load_audio() when the actual changes happen (an extended version of features/audio is loaded).

Hint

This method extends a cut by a given duration, either to the left or to the right (or both), using the “real” content of the recording that the cut is part of. For example, a DataCut spanning the region from 2s to 5s in a recording, when extended by 2s to the right, will now span the region from 2s to 7s in the same recording (provided the recording length exceeds 7s). If the recording is shorter, additional silence will be padded to achieve the desired duration by default. This behavior can be changed by setting pad_silence=False. Also see DataCut.pad() which pads a cut “to” a specified length. To “truncate” a cut, use DataCut.truncate().

Hint

If pad_silence is set to False, then the cut will be extended only as much as allowed within the recording’s boundary.

Hint

If direction is “both”, the resulting cut will be extended by the specified duration in both directions. This is different from the usage in MonoCut.pad() where a padding equal to 0.5*duration is added to both sides.

Parameters:
  • duration (float) – float (seconds), specifies the duration by which the cut should be extended.

  • direction (str) – string, ‘left’, ‘right’ or ‘both’. Determines whether to extend on the left, right, or both sides. If ‘both’, extend on both sides by the duration specified in duration.

  • preserve_id (bool) – bool. Should the extended cut keep the same ID or get a new, random one.

  • pad_silence (bool) – bool. Should the cut be padded with silence if the recording is shorter than the desired duration. If False, the cut will be extended only as much as allowed within the recording’s boundary.

Return type:

Cut

Returns:

a new MonoCut instance.
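
Example (a minimal sketch; cut is a hypothetical DataCut whose recording extends at least 1 second past the cut):

>>> longer = cut.extend_by(duration=1.0, direction="right")
>>> longer.duration == cut.duration + 1.0
True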

features: Optional[Features] = None
property features_type: str | None
fill_supervision(add_empty=True, shrink_ok=False)

Fills the whole duration of a cut with a supervision segment.

If the cut has one supervision, its start is set to 0 and duration is set to cut.duration. Note: this may either expand a supervision that was shorter than a cut, or shrink a supervision that exceeds the cut.

If there are no supervisions, we will add an empty one when add_empty==True, otherwise we won’t change anything.

If there are two or more supervisions, we will raise an exception.

Parameters:
  • add_empty (bool) – should we add an empty supervision with identical time bounds as the cut.

  • shrink_ok (bool) – should we raise an error if a supervision would be shrank as a result of calling this method.

Return type:

DataCut

filter_supervisions(predicate)

Return a copy of the cut that only has supervisions accepted by predicate.

Example:

>>> cut = cut.filter_supervisions(lambda s: s.id in supervision_ids)
>>> cut = cut.filter_supervisions(lambda s: s.duration < 5.0)
>>> cut = cut.filter_supervisions(lambda s: s.text is not None)
Parameters:

predicate (Callable[[SupervisionSegment], bool]) – A callable that accepts SupervisionSegment and returns bool

Return type:

DataCut

Returns:

a modified MonoCut

property frame_shift: float | None
has(field)
Return type:

bool

has_custom(name)

Check if the Cut has a custom attribute with the given name.

Parameters:

name (str) – name of the custom attribute.

Return type:

bool

Returns:

a boolean.

property has_features: bool
property has_overlapping_supervisions: bool
property has_recording: bool
property has_video: bool
index_supervisions(index_mixed_tracks=False, keep_ids=None)

Create a two-level index of supervision segments. It is a mapping from a Cut’s ID to an interval tree that contains the supervisions of that Cut.

The interval tree can be efficiently queried for overlapping and/or enveloping segments. It helps speed up some operations on Cuts of very long recordings (1h+) that contain many supervisions.

Parameters:
  • index_mixed_tracks (bool) – Should the tracks of MixedCut’s be indexed as additional, separate entries.

  • keep_ids (Optional[Set[str]]) – If specified, we will only index the supervisions with the specified IDs.

Return type:

Dict[str, IntervalTree]

Returns:

a mapping from Cut ID to an interval tree of SupervisionSegments.

load_custom(name)

Load custom data as a numpy array. The custom data is expected to have been stored in the cut’s custom field as an Array or TemporalArray manifest.

Note

It works with Array manifests stored via attribute assignments, e.g.: cut.my_custom_data = Array(...).

Parameters:

name (str) – name of the custom attribute.

Return type:

ndarray

Returns:

a numpy array with the data.
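
Example (a minimal sketch; "embedding" is a hypothetical custom field attached in-memory via attach_tensor()):

>>> import numpy as np
>>> cut = cut.attach_tensor("embedding", np.zeros(192, dtype=np.float32))
>>> emb = cut.load_custom("embedding")  # also available as cut.load_embedding()
>>> emb.shape
(192,)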

map_supervisions(transform_fn)

Return a copy of the cut that has its supervisions transformed by transform_fn.

Parameters:

transform_fn (Callable[[SupervisionSegment], SupervisionSegment]) – a function that takes a supervision as an argument and returns its modified version.

Return type:

DataCut

Returns:

a modified MonoCut.

mix(other, offset_other_by=0.0, allow_padding=False, snr=None, preserve_id=None)

Refer to the documentation of lhotse.cut.mix().

Return type:

Cut

move_to_memory(audio_format='flac', load_audio=True, load_features=True, load_custom=True)

Load data (audio, features, or custom arrays) into memory and attach them to a copy of the manifest. This is useful when you want to store cuts together with the actual data in some binary format that enables sequential data reads.

Audio is encoded with audio_format (compatible with torchaudio.save), floating point features are encoded with lilcom, and other arrays are pickled.

Return type:

Cut

normalize_loudness(target, affix_id=False, **kwargs)

Return a new DataCut that will lazily apply loudness normalization.

Parameters:
  • target (float) – The target loudness in dBFS.

  • affix_id (bool) – When true, we will modify the DataCut.id field by affixing it with “_ln{target}”.

Return type:

DataCut

Returns:

a modified copy of the current DataCut.

property num_features: int | None
property num_frames: int | None
property num_samples: int | None
pad(duration=None, num_frames=None, num_samples=None, pad_feat_value=-23.025850929940457, direction='right', preserve_id=False, pad_value_dict=None)

Return a new MixedCut, padded with zeros in the recording, and pad_feat_value in each feature bin.

The user can choose to pad either to a specific duration; a specific number of frames num_frames; or a specific number of samples num_samples. The three arguments are mutually exclusive.

Parameters:
  • duration (Optional[float]) – The cut’s minimal duration after padding.

  • num_frames (Optional[int]) – The cut’s total number of frames after padding.

  • num_samples (Optional[int]) – The cut’s total number of samples after padding.

  • pad_feat_value (float) – A float value that’s used for padding the features. By default we assume a log-energy floor of approx. -23 (1e-10 after exp).

  • direction (str) – string, ‘left’, ‘right’ or ‘both’. Determines whether the padding is added before or after the cut.

  • preserve_id (bool) – When True, preserves the cut ID from before padding. Otherwise, a new random ID is generated for the padded cut (default).

  • pad_value_dict (Optional[Dict[str, Union[int, float]]]) – Optional dict that specifies what value should be used for padding arrays in custom attributes.

Return type:

Cut

Returns:

a padded MixedCut if duration is greater than this cut’s duration, otherwise self.

perturb_speed(factor, affix_id=True)

Return a new DataCut that will lazily perturb the speed while loading audio. The num_samples, start and duration fields are updated to reflect the shrinking/extending effect of speed. We are also updating the time markers of the underlying Recording and the supervisions.

Parameters:
  • factor (float) – The speed will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • affix_id (bool) – When true, we will modify the MonoCut.id field by affixing it with “_sp{factor}”.

Return type:

DataCut

Returns:

a modified copy of the current DataCut.

perturb_tempo(factor, affix_id=True)

Return a new DataCut that will lazily perturb the tempo while loading audio.

Compared to speed perturbation, tempo preserves pitch. The num_samples, start and duration fields are updated to reflect the shrinking/extending effect of tempo. We are also updating the time markers of the underlying Recording and the supervisions.

Parameters:
  • factor (float) – The tempo will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • affix_id (bool) – When true, we will modify the MonoCut.id field by affixing it with “_tp{factor}”.

Return type:

DataCut

Returns:

a modified copy of the current DataCut.

perturb_volume(factor, affix_id=True)

Return a new DataCut that will lazily perturb the volume while loading audio.

Parameters:
  • factor (float) – The volume will be adjusted this many times (e.g. factor=1.1 means 1.1x louder).

  • affix_id (bool) – When true, we will modify the DataCut.id field by affixing it with “_vp{factor}”.

Return type:

DataCut

Returns:

a modified copy of the current DataCut.

play_audio()

Display a Jupyter widget that lets you listen to the waveform. Works only in Jupyter notebook/lab or similar (e.g. Colab).

plot_alignment(alignment_type='word')

Display the alignment on top of a spectrogram. Requires matplotlib to be installed.

plot_audio()

Display a plot of the waveform. Requires matplotlib to be installed.

plot_features()

Display the feature matrix as an image. Requires matplotlib to be installed.

recording: Optional[Recording] = None
property recording_id: str
resample(sampling_rate, affix_id=False)

Return a new DataCut that will lazily resample the audio while reading it. This operation will drop the feature manifest, if attached. It does not affect the supervision.

Parameters:
  • sampling_rate (int) – The new sampling rate.

  • affix_id (bool) – Should we modify the ID (useful if both versions of the same cut are going to be present in a single manifest).

Return type:

DataCut

Returns:

a modified copy of the current DataCut.

property sampling_rate: int
save_audio(storage_path, format=None, encoding=None, augment_fn=None, **kwargs)

Store this cut’s waveform as audio recording to disk.

Parameters:
  • storage_path (Union[Path, str]) – The path to location where we will store the audio recordings.

  • format (Optional[str]) – Audio format argument supported by torchaudio.save or soundfile.write. Please refer to the relevant library’s documentation depending on which audio backend you’re using.

  • encoding (Optional[str]) – Audio encoding argument supported by torchaudio.save or soundfile.write. Please refer to the relevant library’s documentation depending on which audio backend you’re using.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional callable used for audio augmentation. Be careful with the types of augmentations used: if they modify the start/end/duration times of the cut and its supervisions, you will end up with incorrect supervision information when using this API. E.g. for speed perturbation, use CutSet.perturb_speed() instead.

  • kwargs – additional arguments passed to Cut.load_audio(). For example, when saving a MixedCut, we can specify mono_downmix=True to downmix the tracks to mono before saving.

Return type:

Cut

Returns:

a new Cut instance.

speakers_audio_mask(min_speaker_dim=None, speaker_to_idx_map=None, use_alignment_if_exists=None)

Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_samples), and its values are 0 for nonspeech samples and 1 for speech samples for each respective speaker.

This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272

Parameters:
  • min_speaker_dim (Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).

  • speaker_to_idx_map (Optional[Dict[str, int]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2).

  • use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, fall back on the supervision time spans.

Return type:

ndarray

speakers_feature_mask(min_speaker_dim=None, speaker_to_idx_map=None, use_alignment_if_exists=None)

Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_frames), and its values are 0 for nonspeech frames and 1 for speech frames for each respective speaker.

This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272

Parameters:
  • min_speaker_dim (Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).

  • speaker_to_idx_map (Optional[Dict[str, int]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2).

  • use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, fall back on the supervision time spans.

Return type:

ndarray

split(timestamp)
Return type:

Tuple[Cut, Cut]

Split a cut into two cuts at timestamp, which is measured from the start of the cut. For example, a [0s - 10s] cut split at 4s yields:

  • left cut [0s - 4s]

  • right cut [4s - 10s]

supervisions_audio_mask(use_alignment_if_exists=None)

Return a 1D numpy array with value 1 for samples covered by at least one supervision, and 0 for samples not covered by any supervision.

Parameters:

use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, fall back on the supervision time spans.

Return type:

ndarray

supervisions_feature_mask(use_alignment_if_exists=None)

Return a 1D numpy array with value 1 for frames covered by at least one supervision, and 0 for frames not covered by any supervision.

Parameters:

use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, fall back on the supervision time spans.

Return type:

ndarray

to_dict()
Return type:

dict

trim_to_alignments(type, max_pause=None, max_segment_duration=None, delimiter=' ', keep_all_channels=False)

Splits the current Cut into its constituent alignment items (AlignmentItem). These cuts have the same start times and durations as the alignment items. Additionally, the max_pause option can be used to merge alignment items that are separated by a pause shorter than max_pause. If max_segment_duration is specified, we will keep merging consecutive segments until the duration of the merged segment exceeds max_segment_duration.

For the case of a multi-channel cut with multiple alignments, we can either trim while respecting the supervision channels (in which case output cut has the same channels as the supervision) or ignore the channels (in which case output cut has the same channels as the input cut).

Hint

If the resulting trimmed cut contains a single supervision, we set the cut id to the id of this supervision, for better compatibility with downstream tools, e.g. comparing the hypothesis of ASR with the reference in icefall.

Hint

If a MultiCut is trimmed and the resulting trimmed cut contains a single channel, we convert it to a MonoCut.

Hint

If you have a Cut with multiple supervision segments and you want to trim it to the word-level alignment, you can use the Cut.merge_supervisions() method first to merge the supervisions into a single one, followed by the Cut.trim_to_alignments() method. For example:

>>> cut = cut.merge_supervisions()
>>> cut = cut.trim_to_alignments(type='word', max_pause=1.0)

Hint

The above technique can also be used to segment long cuts into roughly equal duration segments, while respecting alignment boundaries. For example, to split a Cut into 10s segments, you can do:

>>> cut = cut.merge_supervisions()
>>> cut = cut.trim_to_alignments(type='word', max_pause=10.0, max_segment_duration=10.0)
Parameters:
  • type (str) – The type of the alignment to trim to (e.g. “word”).

  • max_pause (Optional[float]) – The maximum pause allowed between the alignments to merge them. If None, no merging will be performed. [default: None]

  • max_segment_duration (Optional[float]) – The maximum duration of a merged segment. Consecutive alignment items keep being merged until this duration would be exceeded. If None, no limit is applied. [default: None]

  • delimiter (str) – The delimiter to use when joining the alignment items.

  • keep_all_channels (bool) – If True, the output cut will have the same channels as the input cut. By default, the trimmed cut will have the same channels as the supervision.

Return type:

CutSet

Returns:

a CutSet object.

trim_to_supervision_groups(max_pause=0.0)

Return a new CutSet with Cuts based on supervision groups. A supervision group is a set of supervisions with no gaps between them (or gaps shorter than max_pause). This is similar to the concept of an utterance group as described in this paper: https://arxiv.org/abs/2211.00482

For example, the following cut:

Cut

╔═══════════════════════════════════════════════════════════════════════════════════╗
║┌──────────────────────┐                               ┌────────┐                  ║
║│ Hello this is John.  │                               │   Hi   │                  ║
║└──────────────────────┘                               └────────┘                  ║
║            ┌──────────────────────────────────┐       ┌───────────────────┐       ║
║            │     Hey, John. How are you?      │       │  What do you do?  │       ║
║            └──────────────────────────────────┘       └───────────────────┘       ║
╚═══════════════════════════════════════════════════════════════════════════════════╝

is transformed into two cuts:

Cut 1                                       Cut 2

╔════════════════════════════════════════════════╗ ╔═══════════════════════════╗
║ ┌──────────────────────┐                       ║ ║ ┌────────┐                ║
║ │ Hello this is John.  │                       ║ ║ │   Hi   │                ║
║ └──────────────────────┘                       ║ ║ └────────┘                ║
║            ┌──────────────────────────────────┐║ ║ ┌───────────────────┐     ║
║            │     Hey, John. How are you?      │║ ║ │  What do you do?  │     ║
║            └──────────────────────────────────┘║ ║ └───────────────────┘     ║
╚════════════════════════════════════════════════╝ ╚═══════════════════════════╝

For the case of a multi-channel cut with multiple supervisions, we keep all the channels in the recording.

Parameters:

max_pause (float) – An optional duration in seconds; if the gap between two supervisions is longer than this, they will be treated as separate groups. By default, this is set to 0.0, which means that no gaps are allowed between supervisions.

Return type:

CutSet

Returns:

a CutSet.
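
For example (an illustrative sketch; pauses of up to 0.5 s are tolerated within a group):

>>> groups = cut.trim_to_supervision_groups(max_pause=0.5)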

trim_to_supervisions(keep_overlapping=True, min_duration=None, context_direction='center', keep_all_channels=False)

Splits the current Cut into as many cuts as there are supervisions (SupervisionSegment). The resulting cuts have the same start times and durations as the supervisions. When there are overlapping supervisions, they can be kept or discarded via the keep_overlapping flag.

For example, the following cut:

        Cut
|-----------------|
 Sup1
|----|  Sup2
   |-----------|

is transformed into two cuts:

 Cut1
|----|
 Sup1
|----|
   Sup2
   |-|
        Cut2
   |-----------|
   Sup1
   |-|
        Sup2
   |-----------|

For the case of a multi-channel cut with multiple supervisions, we can either trim while respecting the supervision channels (in which case output cut has the same channels as the supervision) or ignore the channels (in which case output cut has the same channels as the input cut).

Hint

If the resulting trimmed cut contains a single supervision, we set the cut id to the id of this supervision, for better compatibility with downstream tools, e.g. comparing the hypothesis of ASR with the reference in icefall.

Hint

If a MultiCut is trimmed and the resulting trimmed cut contains a single channel, we convert it to a MonoCut.

Parameters:
  • keep_overlapping (bool) – when False, it will discard parts of other supervisions that overlap with the main supervision. In the illustration above, it would discard Sup2 in Cut1 and Sup1 in Cut2. In this mode, we guarantee that there will always be exactly one supervision per cut.

  • min_duration (Optional[float]) – An optional duration in seconds; specifying this argument will extend the cuts that would have been shorter than min_duration with actual acoustic context in the recording/features. If there are supervisions present in the context, they are kept when keep_overlapping is true. If there is not enough context, the returned cut will be shorter than min_duration. If the supervision segment is longer than min_duration, the returned cut will be longer.

  • context_direction (Literal['center', 'left', 'right', 'random']) – The direction in which the cut should be expanded to include context. The value of “center” implies equal expansion to left and right; “random” uniformly samples a value between “left” and “right”.

  • keep_all_channels (bool) – If True, the output cut will have the same channels as the input cut. By default, the trimmed cut will have the same channels as the supervision.

Return type:

CutSet

Returns:

a list of cuts.
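
For example (an illustrative sketch; cut is assumed to contain several supervisions):

>>> cuts = cut.trim_to_supervisions(keep_overlapping=False)  # exactly one supervision per cut
>>> cuts = cut.trim_to_supervisions(min_duration=1.0)        # extend short cuts with acoustic context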

property trimmed_supervisions: List[SupervisionSegment]

Return the supervisions in this Cut that have modified time boundaries so as not to exceed the Cut’s start or end.

Note that when cut.supervisions is called, the supervisions may have negative start values that indicate the supervision actually begins before the cut, or end values that exceed the Cut’s duration (it means the supervision continued in the original recording after the Cut’s ending).

Caution

For some tasks such as speech recognition (ASR), trimmed supervisions could result in corrupted training data. This is because a part of the transcript might actually reside outside of the cut.

truncate(*, offset=0.0, duration=None, keep_excessive_supervisions=True, preserve_id=False, _supervisions_index=None)

Returns a new MonoCut that is a sub-region of the current DataCut.

Note that no operation is done on the actual features or recording - it’s only during the call to DataCut.load_features() / DataCut.load_audio() when the actual changes happen (a subset of features/audio is loaded).

Hint

To extend a cut by a fixed duration, use the DataCut.extend_by() method.

Parameters:
  • offset (float) – float (seconds), controls the start of the new cut relative to the current DataCut’s start. E.g., if the current DataCut starts at 10.0, and offset is 2.0, the new start is 12.0.

  • duration (Optional[float]) – optional float (seconds), controls the duration of the resulting DataCut. By default, the duration is (end of the cut before truncation) - (offset).

  • keep_excessive_supervisions (bool) – bool. Since trimming may happen inside a SupervisionSegment, the caller has an option to either keep or discard such supervisions.

  • preserve_id (bool) – bool. Should the truncated cut keep the same ID or get a new, random one.

  • _supervisions_index (Optional[Dict[str, IntervalTree]]) – an IntervalTree; when passed, allows speeding up processing of Cuts with a very large number of supervisions. Intended as an internal parameter.

Return type:

DataCut

Returns:

a new MonoCut instance. If the current DataCut is shorter than the duration, return None.
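
For example (illustrative values):

>>> sub_cut = cut.truncate(offset=2.0, duration=3.0)  # covers [cut.start + 2.0, cut.start + 5.0]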

property video: VideoInfo | None
with_custom(name, value)

Return a copy of this object with an extra custom field assigned to it.

with_features_path_prefix(path)
Return type:

DataCut

with_id(id_)

Return a copy of the Cut with a new ID.

Return type:

Cut

with_recording_path_prefix(path)
Return type:

DataCut

id: str
start: Seconds
duration: Seconds
supervisions: List[SupervisionSegment]
class lhotse.cut.MultiCut(id, start, duration, channel, supervisions=<factory>, features=None, recording=None, custom=None)[source]

MultiCut is a Cut that is analogous to the MonoCut. While MonoCut represents a single channel of a recording, MultiCut represents multi-channel recordings where supervisions may or may not be shared across channels. It is intended to be used to store, for example, segments of a microphone array recording. The following diagrams illustrate some examples of MultiCut usage:

>>> 2-channel telephone recording with 2 supervisions, one for each channel (e.g., Switchboard):

            ╔══════════════════════════════ MultiCut ═════════════════╗
            ║ ┌──────────────────────────┐                            ║
Channel 1 ──╬─│   Hello this is John.    │────────────────────────────╬────────
            ║ └──────────────────────────┘                            ║
            ║                             ┌──────────────────────────┐║
Channel 2 ──╬─────────────────────────────│ Hey, John. How are you?  │╬────────
            ║                             └──────────────────────────┘║
            ╚═════════════════════════════════════════════════════════╝

>>> Multi-array multi-microphone recording with shared supervisions (e.g., CHiME-6),
along with close-talk microphones (A and B are distant arrays, C is close-talk):

      ╔═══════════════════════════════════════════════════════════════════════════╗
      ║ ┌───────────────────┐                         ┌───────────────────┐       ║
A-1 ──╬─┤                   ├─────────────────────────┤                   ├───────╬─
      ║ │ What did you do?  │                         │I cleaned my room. │       ║
A-2 ──╬─┤                   ├─────────────────────────┤                   ├───────╬─
      ║ └───────────────────┘   ┌───────────────────┐ └───────────────────┘       ║
B-1 ──╬────────────────────────┤Yeah, we were going├──────────────────────────────╬─
      ║                        │   to the mall.    │                              ║
B-2 ──╬────────────────────────┤                   ├──────────────────────────────╬─
      ║                        └───────────────────┘      ┌───────────────────┐   ║
C   ──╬───────────────────────────────────────────────────┤      Right.       ├──╬─
      ║                                                    └───────────────────┘  ║
      ╚════════════════════════════════ MultiCut ══════════════════════════════════╝

By definition, a MultiCut has the same attributes as a MonoCut. The key difference is that the Recording object has multiple channels, and the Supervision objects may correspond to any of these channels. The channels of the MultiCut may be a subset of the Recording’s channels, but must be a superset of the Supervision channels.

channel: List[int]
property num_channels: int
load_features(channel=None)[source]

Load the features from the underlying storage and cut them to the relevant [begin, duration] region of the current MultiCut.

Parameters:

channel (Union[List[int], int, None]) – The channel to load the features for. If None, all channels will be loaded. This is useful for the case when we have features extracted for each channel of the multi-cut, and we want to selectively load them.

Return type:

Optional[ndarray]

load_audio(channel=None)[source]

Load the audio from the underlying Recording. The audio is trimmed to the [begin, end] range specified by the MultiCut.

Parameters:

channel (Union[List[int], int, None]) – optional int or list of int, the subset of channels to load (all by default).

Return type:

Optional[ndarray]

Returns:

a numpy ndarray with audio samples, with shape (C <channel>, N <samples>)
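
A usage sketch (multi_cut is an assumed MultiCut with at least two channels):

>>> samples = multi_cut.load_audio()                  # all channels, shape (C, N)
>>> first_two = multi_cut.load_audio(channel=[0, 1])  # shape (2, N)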

load_video(channel=None, with_audio=True)[source]

Load the subset of video (and audio) from the attached recording. The data is trimmed to the [begin, end] range specified by the MultiCut.

Parameters:
  • channel (Union[List[int], int, None]) – optional int or list of int, the subset of channels to load (all by default).

  • with_audio (bool) – bool, whether to load and return audio alongside video. True by default.

Return type:

Optional[Tuple[Tensor, Optional[Tensor]]]

Returns:

a tuple of video tensor and optionally audio tensor (or None), or None if this cut has no video.

reverb_rir(rir_recording=None, normalize_output=True, early_only=False, affix_id=True, rir_channels=[0], room_rng_seed=None, source_rng_seed=None)[source]

Return a new MultiCut that will convolve the audio with the provided impulse response. If the rir_recording is multi-channel, the rir_channels argument determines which channels will be used. This list must be of the same length as the number of channels in the MultiCut.

If no rir_recording is provided, we will generate an impulse response using a fast random generator (https://arxiv.org/abs/2208.04101), only if the MultiCut has exactly one channel. At the moment we do not support simulation of multi-channel impulse responses.

Parameters:
  • rir_recording (Optional[Recording]) – The impulse response to use for convolving.

  • normalize_output (bool) – When true, the output will be normalized to have the same energy as the input.

  • early_only (bool) – When true, only the early reflections (first 50 ms) will be used.

  • affix_id (bool) – When true, we will modify the MultiCut.id field by affixing it with “_rvb”.

  • rir_channels (List[int]) – The channels of the impulse response to use. First channel is used by default. If multiple channels are specified, this will produce a MixedCut instead of a MonoCut.

  • room_rng_seed (Optional[int]) – The seed for the room configuration.

  • source_rng_seed (Optional[int]) – The seed for the source positions.

Return type:

MultiCut

Returns:

a modified copy of the current MultiCut.

merge_supervisions(merge_policy='delimiter', merge_channels=True, custom_merge_fn=None)[source]

Return a copy of the cut that has all of its supervisions merged into a single segment. The channel attribute of all the segments in this case will be set to the union of all channels. If merge_channels is set to False, the supervisions will be merged into a single segment per channel group. The channel attribute will not change in this case.

The new start is the start of the earliest supervision, and the new duration is the minimum duration that spans all the supervisions. The text fields of all segments are concatenated with a whitespace.

Parameters:
  • merge_policy (str) – one of “keep_first” or “delimiter”. If “keep_first”, we keep only the first segment’s field value, otherwise all string fields (including IDs) are prefixed with “cat#” and concatenated with a hash symbol “#”. This is also applied to custom fields. Fields with a None value are omitted.

  • merge_channels (bool) – If true, we will merge all supervisions into a single segment. If false, we will merge supervisions per channel group. Default: True.

  • custom_merge_fn (Optional[Callable[[str, Iterable[Any]], Any]]) – a function that will be called to merge custom fields values. We expect custom_merge_fn to handle all possible custom keys. When not provided, we will treat all custom values as strings. It will be called roughly like: custom_merge_fn(custom_key, [s.custom[custom_key] for s in sups])

Return type:

MultiCut
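
For example (illustrative):

>>> merged = multi_cut.merge_supervisions()                         # one segment, channels = union
>>> per_group = multi_cut.merge_supervisions(merge_channels=False)  # one segment per channel group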

with_channels(channels)[source]

Select specified channels from this cut. Supports extending to other channels available in the underlying Recording.

If a single channel is provided, we’ll return a MonoCut; otherwise, we’ll return a MultiCut.

Return type:

DataCut

static from_mono(*cuts)[source]

Convert one or more MonoCuts to a MultiCut. If multiple mono cuts are provided, they must match in all fields except the channel. Each cut must have a distinct channel.

Parameters:

cuts (DataCut) – the input cut(s).

Return type:

MultiCut

Returns:

a MultiCut with a single track.
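
A sketch (cut_ch0 and cut_ch1 are assumed MonoCuts that are identical except for their channel):

>>> multi_cut = MultiCut.from_mono(cut_ch0, cut_ch1)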

to_mono(mono_downmix=False)[source]

Convert a MultiCut to either a list of MonoCuts (one per channel) or a single MonoCut obtained by downmixing all channels.

Parameters:

mono_downmix (bool) – If true, we will downmix all channels into a single MonoCut. If false, we will return a list of MonoCuts, one per channel.

Return type:

Union[DataCut, List[DataCut]]

Returns:

a list of MonoCuts or a single MonoCut.
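
For example (illustrative):

>>> mono_cuts = multi_cut.to_mono()                   # list of MonoCuts, one per channel
>>> downmixed = multi_cut.to_mono(mono_downmix=True)  # a single, downmixed MonoCut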

static from_dict(data)[source]
Return type:

MultiCut

__init__(id, start, duration, channel, supervisions=<factory>, features=None, recording=None, custom=None)
append(other, snr=None, preserve_id=None)

Append the other Cut after the current Cut. Conceptually the same as mix, but with an offset matching the current cut’s length. Optionally scale down (positive SNR) or scale up (negative SNR) the other cut. Returns a MixedCut, which only keeps the information about the mix; actual mixing is performed during the call to load_features.

Parameters:

preserve_id (Optional[str]) – optional string (“left”, “right”). When specified, append will preserve the cut ID of the left- or right-hand side argument. Otherwise, a new random ID is generated.

Return type:

Cut
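
For example (cut and other_cut are assumed cuts with matching sampling rates):

>>> mixed = cut.append(other_cut, snr=10.0)  # other_cut is mixed in 10 dB quieter
>>> audio = mixed.load_audio()               # the actual mixing happens lazily, here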

attach_tensor(name, data, frame_shift=None, temporal_dim=None, compressed=False)

Attach a tensor to this MonoCut, described with an Array manifest. The attached data is stored in-memory for later use, and can be accessed by calling cut.load_<name>() or cut.load_custom().

This is useful if you want actions such as truncate/pad to propagate to the tensor, e.g.:

>>> cut = MonoCut(id="c1", start=2, duration=8, ...)
>>> cut = cut.attach_tensor(
...     "alignment",
...     torch.tensor([0, 0, 0, ...]),
...     frame_shift=0.1,
...     temporal_dim=0,
... )
>>> half_alignment = cut.truncate(duration=4.0).load_alignment()

Note

This object can’t be stored in JSON/JSONL manifests anymore.

Parameters:
  • name (str) – attribute under which the data can be found.

  • data (Union[ndarray, Tensor]) – PyTorch tensor or numpy array.

  • frame_shift (Optional[float]) – Optional float, when the array has a temporal dimension it indicates how much time has passed between the starts of consecutive frames (expressed in seconds).

  • temporal_dim (Optional[int]) – Optional int, when the array has a temporal dimension, it indicates which dim to interpret as temporal.

  • compressed (bool) – When True, we will apply lilcom compression to the array. Only applicable to arrays of floats.

Return type:

Cut

Returns:

a copy of this cut with the tensor attached.

compute_and_store_features(extractor, storage, augment_fn=None, *args, **kwargs)

Compute the features from this cut, store them on disk, and attach a feature manifest to this cut. This cut has to be able to load audio.

Parameters:
  • extractor (FeatureExtractor) – a FeatureExtractor instance used to compute the features.

  • storage (FeaturesWriter) – a FeaturesWriter instance used to write the features to a storage.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional callable used for audio augmentation.

Return type:

DataCut

Returns:

a new MonoCut instance with a Features manifest attached to it.

compute_features(extractor, augment_fn=None)

Compute the features from this cut. This cut has to be able to load audio.

Parameters:
  • extractor (FeatureExtractor) – a FeatureExtractor instance used to compute the features.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – optional WavAugmenter instance for audio augmentation.

Return type:

ndarray

Returns:

a numpy ndarray with the computed features.

custom: Optional[Dict[str, Any]] = None
cut_into_windows(duration, hop=None, keep_excessive_supervisions=True)

Return a list of shorter cuts, made by traversing this cut in windows of duration seconds by hop seconds.

The last window might have a shorter duration if there was not enough audio, so you might want to either filter it out or pad the results.

Parameters:
  • duration (float) – Desired duration of the new cuts in seconds.

  • hop (Optional[float]) – Shift between the windows in the new cuts in seconds.

  • keep_excessive_supervisions (bool) – bool. When a cut is truncated in the middle of a supervision segment, should the supervision be kept.

Return type:

CutSet

Returns:

a list of cuts made from shorter duration windows.
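
For example (illustrative; yields windows with 50% overlap):

>>> windows = cut.cut_into_windows(duration=5.0, hop=2.5)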

dereverb_wpe(affix_id=True)

Return a new DataCut that will lazily apply WPE dereverberation.

Parameters:

affix_id (bool) – When true, we will modify the DataCut.id field by affixing it with “_wpe”.

Return type:

DataCut

Returns:

a modified copy of the current DataCut.

drop_alignments()

Return a copy of the current DataCut, detached from alignments.

Return type:

DataCut

drop_custom(name)
drop_features()

Return a copy of the current DataCut, detached from features.

Return type:

DataCut

drop_recording()

Return a copy of the current DataCut, detached from recording.

Return type:

DataCut

drop_supervisions()

Return a copy of the current DataCut, detached from supervisions.

Return type:

DataCut

property end: float
extend_by(*, duration, direction='both', preserve_id=False, pad_silence=True)

Returns a new Cut (DataCut or MixedCut) that extends the region of the current DataCut by a fixed duration in the specified direction.

Note that no operation is done on the actual features or recording - it’s only during the call to DataCut.load_features() / DataCut.load_audio() when the actual changes happen (an extended version of features/audio is loaded).

Hint

This method extends a cut by a given duration, either to the left or to the right (or both), using the “real” content of the recording that the cut is part of. For example, a DataCut spanning the region from 2s to 5s in a recording, when extended by 2s to the right, will now span the region from 2s to 7s in the same recording (provided the recording length exceeds 7s). If the recording is shorter, additional silence will be padded to achieve the desired duration by default. This behavior can be changed by setting pad_silence=False. Also see DataCut.pad() which pads a cut “to” a specified length. To “truncate” a cut, use DataCut.truncate().

Hint

If pad_silence is set to False, then the cut will be extended only as much as allowed within the recording’s boundary.

Hint

If direction is “both”, the resulting cut will be extended by the specified duration in both directions. This is different from the usage in MonoCut.pad() where a padding equal to 0.5*duration is added to both sides.

Parameters:
  • duration (float) – float (seconds), specifies the duration by which the cut should be extended.

  • direction (str) – string, ‘left’, ‘right’ or ‘both’. Determines whether to extend on the left, right, or both sides. If ‘both’, extend on both sides by the duration specified in duration.

  • preserve_id (bool) – bool. Should the extended cut keep the same ID or get a new, random one.

  • pad_silence (bool) – bool. Should the cut be padded with silence if the recording is shorter than the desired duration. If False, the cut will be extended only as much as allowed within the recording’s boundary.

Return type:

Cut

Returns:

a new MonoCut instance.
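
For example (illustrative values):

>>> extended = cut.extend_by(duration=2.0, direction='right', pad_silence=False)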

features: Optional[Features] = None
property features_type: str | None
fill_supervision(add_empty=True, shrink_ok=False)

Fills the whole duration of a cut with a supervision segment.

If the cut has one supervision, its start is set to 0 and duration is set to cut.duration. Note: this may either expand a supervision that was shorter than a cut, or shrink a supervision that exceeds the cut.

If there are no supervisions, we will add an empty one when add_empty==True, otherwise we won’t change anything.

If there are two or more supervisions, we will raise an exception.

Parameters:
  • add_empty (bool) – should we add an empty supervision with identical time bounds as the cut.

  • shrink_ok (bool) – should we raise an error if a supervision would be shrunk as a result of calling this method.

Return type:

DataCut

filter_supervisions(predicate)

Return a copy of the cut that only has supervisions accepted by predicate.

Example:

>>> cut = cut.filter_supervisions(lambda s: s.id in supervision_ids)
>>> cut = cut.filter_supervisions(lambda s: s.duration < 5.0)
>>> cut = cut.filter_supervisions(lambda s: s.text is not None)
Parameters:

predicate (Callable[[SupervisionSegment], bool]) – A callable that accepts SupervisionSegment and returns bool

Return type:

DataCut

Returns:

a modified MonoCut

property frame_shift: float | None
has(field)
Return type:

bool

has_custom(name)

Check if the Cut has a custom attribute with name name.

Parameters:

name (str) – name of the custom attribute.

Return type:

bool

Returns:

a boolean.

property has_features: bool
property has_overlapping_supervisions: bool
property has_recording: bool
property has_video: bool
index_supervisions(index_mixed_tracks=False, keep_ids=None)

Create a two-level index of supervision segments. It is a mapping from a Cut’s ID to an interval tree that contains the supervisions of that Cut.

The interval tree can be efficiently queried for overlapping and/or enveloping segments. It helps speed up some operations on Cuts of very long recordings (1h+) that contain many supervisions.

Parameters:
  • index_mixed_tracks (bool) – Should the tracks of MixedCut’s be indexed as additional, separate entries.

  • keep_ids (Optional[Set[str]]) – If specified, we will only index the supervisions with the specified IDs.

Return type:

Dict[str, IntervalTree]

Returns:

a mapping from Cut ID to an interval tree of SupervisionSegments.

load_custom(name)

Load custom data as a numpy array. The custom data is expected to have been stored in the cut’s custom field as an Array or TemporalArray manifest.

Note

It works with Array manifests stored via attribute assignments, e.g.: cut.my_custom_data = Array(...).

Parameters:

name (str) – name of the custom attribute.

Return type:

ndarray

Returns:

a numpy array with the data.

map_supervisions(transform_fn)

Return a copy of the cut that has its supervisions transformed by transform_fn.

Parameters:

transform_fn (Callable[[SupervisionSegment], SupervisionSegment]) – a function that accepts a supervision and returns its modified version.

Return type:

DataCut

Returns:

a modified MonoCut.

mix(other, offset_other_by=0.0, allow_padding=False, snr=None, preserve_id=None)

Refer to the documentation of lhotse.cut.mix().

Return type:

Cut

move_to_memory(audio_format='flac', load_audio=True, load_features=True, load_custom=True)

Load data (audio, features, or custom arrays) into memory and attach them to a copy of the manifest. This is useful when you want to store cuts together with the actual data in some binary format that enables sequential data reads.

Audio is encoded with audio_format (compatible with torchaudio.save), floating point features are encoded with lilcom, and other arrays are pickled.

Return type:

Cut

normalize_loudness(target, affix_id=False, **kwargs)

Return a new DataCut that will lazily apply loudness normalization.

Parameters:
  • target (float) – The target loudness in dBFS.

  • affix_id (bool) – When true, we will modify the DataCut.id field by affixing it with “_ln{target}”.

Return type:

DataCut

Returns:

a modified copy of the current DataCut.

property num_features: int | None
property num_frames: int | None
property num_samples: int | None
pad(duration=None, num_frames=None, num_samples=None, pad_feat_value=-23.025850929940457, direction='right', preserve_id=False, pad_value_dict=None)

Return a new MixedCut, padded with zeros in the recording, and pad_feat_value in each feature bin.

The user can choose to pad either to a specific duration; a specific number of frames num_frames; or a specific number of samples num_samples. The three arguments are mutually exclusive.

Parameters:
  • duration (Optional[float]) – The cut’s minimal duration after padding.

  • num_frames (Optional[int]) – The cut’s total number of frames after padding.

  • num_samples (Optional[int]) – The cut’s total number of samples after padding.

  • pad_feat_value (float) – A float value that’s used for padding the features. By default we assume a log-energy floor of approx. -23 (1e-10 after exp).

  • direction (str) – string, ‘left’, ‘right’ or ‘both’. Determines whether the padding is added before or after the cut.

  • preserve_id (bool) – When True, preserves the cut ID before padding. Otherwise, a new random ID is generated for the padded cut (default).

  • pad_value_dict (Optional[Dict[str, Union[int, float]]]) – Optional dict that specifies what value should be used for padding arrays in custom attributes.

Return type:

Cut

Returns:

a padded MixedCut if duration is greater than this cut’s duration, otherwise self.
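
For example (illustrative; remember the three size arguments are mutually exclusive):

>>> padded = cut.pad(duration=10.0)                        # pad on the right up to 10 s
>>> padded = cut.pad(num_samples=160000, direction='both')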

perturb_speed(factor, affix_id=True)

Return a new DataCut that will lazily perturb the speed while loading audio. The num_samples, start and duration fields are updated to reflect the shrinking/extending effect of speed. We are also updating the time markers of the underlying Recording and the supervisions.

Parameters:
  • factor (float) – The speed will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • affix_id (bool) – When true, we will modify the MonoCut.id field by affixing it with “_sp{factor}”.

Return type:

DataCut

Returns:

a modified copy of the current DataCut.
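
For example (illustrative):

>>> faster = cut.perturb_speed(1.1)  # ~10% shorter; the id gets the '_sp1.1' affix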

perturb_tempo(factor, affix_id=True)

Return a new DataCut that will lazily perturb the tempo while loading audio.

Compared to speed perturbation, tempo preserves pitch. The num_samples, start and duration fields are updated to reflect the shrinking/extending effect of speed. We are also updating the time markers of the underlying Recording and the supervisions.

Parameters:
  • factor (float) – The tempo will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • affix_id (bool) – When true, we will modify the MonoCut.id field by affixing it with “_tp{factor}”.

Return type:

DataCut

Returns:

a modified copy of the current DataCut.

perturb_volume(factor, affix_id=True)

Return a new DataCut that will lazily perturb the volume while loading audio.

Parameters:
  • factor (float) – The volume will be adjusted this many times (e.g. factor=1.1 means 1.1x louder).

  • affix_id (bool) – When true, we will modify the DataCut.id field by affixing it with “_vp{factor}”.

Return type:

DataCut

Returns:

a modified copy of the current DataCut.

play_audio()

Display a Jupyter widget that allows you to listen to the waveform. Works only in Jupyter notebook/lab or similar (e.g. Colab).

plot_alignment(alignment_type='word')

Display the alignment on top of a spectrogram. Requires matplotlib to be installed.

plot_audio()

Display a plot of the waveform. Requires matplotlib to be installed.

plot_features()

Display the feature matrix as an image. Requires matplotlib to be installed.

recording: Optional[Recording] = None
property recording_id: str
resample(sampling_rate, affix_id=False)

Return a new DataCut that will lazily resample the audio while reading it. This operation will drop the feature manifest, if attached. It does not affect the supervision.

Parameters:
  • sampling_rate (int) – The new sampling rate.

  • affix_id (bool) – Should we modify the ID (useful if both versions of the same cut are going to be present in a single manifest).

Return type:

DataCut

Returns:

a modified copy of the current DataCut.

property sampling_rate: int
save_audio(storage_path, format=None, encoding=None, augment_fn=None, **kwargs)

Store this cut’s waveform as an audio recording on disk.

Parameters:
  • storage_path (Union[Path, str]) – The path to location where we will store the audio recordings.

  • format (Optional[str]) – Audio format argument supported by torchaudio.save or soundfile.write. Please refer to the relevant library’s documentation depending on which audio backend you’re using.

  • encoding (Optional[str]) – Audio encoding argument supported by torchaudio.save or soundfile.write. Please refer to the relevant library’s documentation depending on which audio backend you’re using.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional callable used for audio augmentation. Be careful with the types of augmentations used: if they modify the start/end/duration times of the cut and its supervisions, you will end up with incorrect supervision information when using this API. E.g. for speed perturbation, use CutSet.perturb_speed() instead.

  • kwargs – additional arguments passed to Cut.load_audio(). For example, when saving a MixedCut, we can specify mono_downmix=True to downmix the tracks to mono before saving.

Return type:

Cut

Returns:

a new Cut instance.
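
A usage sketch (the output path is illustrative):

>>> stored = cut.save_audio('data/cut1.wav')  # returns a new Cut pointing at the stored file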

speakers_audio_mask(min_speaker_dim=None, speaker_to_idx_map=None, use_alignment_if_exists=None)

Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_samples), and its values are 0 for nonspeech samples and 1 for speech samples for each respective speaker.

This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272

Parameters:
  • min_speaker_dim (Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).

  • speaker_to_idx_map (Optional[Dict[str, int]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2)

  • use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, we fall back on the supervision time spans.

Return type:

ndarray
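
A usage sketch (the speaker names are hypothetical):

>>> mask = cut.speakers_audio_mask(speaker_to_idx_map={'spk-a': 0, 'spk-b': 1})
>>> mask.shape  # (num_speakers, cut.num_samples)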

speakers_feature_mask(min_speaker_dim=None, speaker_to_idx_map=None, use_alignment_if_exists=None)

Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_frames), and its values are 0 for nonspeech frames and 1 for speech frames for each respective speaker.

This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272

Parameters:
  • min_speaker_dim (Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).

  • speaker_to_idx_map (Optional[Dict[str, int]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2)

  • use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, we fall back on the supervision time spans.

Return type:

ndarray

split(timestamp)
Return type:

Tuple[Cut, Cut]

Split a cut into two cuts at timestamp, which is measured from the start of the cut. For example, a [0s - 10s] cut split at 4s yields:

  • left cut [0s - 4s]

  • right cut [4s - 10s]

supervisions_audio_mask(use_alignment_if_exists=None)

Return a 1D numpy array with value 1 for samples covered by at least one supervision, and 0 for samples not covered by any supervision.

Parameters:

use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, we fall back on the supervision time spans.

Return type:

ndarray

supervisions_feature_mask(use_alignment_if_exists=None)

Return a 1D numpy array with value 1 for frames covered by at least one supervision, and 0 for frames not covered by any supervision.

Parameters:

use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, we fall back on the supervision time spans.

Return type:

ndarray

to_dict()
Return type:

dict

trim_to_alignments(type, max_pause=None, max_segment_duration=None, delimiter=' ', keep_all_channels=False)

Splits the current Cut into its constituent alignment items (AlignmentItem). The resulting cuts have the same start times and durations as the alignment items. Additionally, the max_pause option can be used to merge alignment items that are separated by a pause shorter than max_pause. If max_segment_duration is specified, we will keep merging consecutive segments until the duration of the merged segment exceeds max_segment_duration.

For the case of a multi-channel cut with multiple alignments, we can either trim while respecting the supervision channels (in which case output cut has the same channels as the supervision) or ignore the channels (in which case output cut has the same channels as the input cut).

Hint

If the resulting trimmed cut contains a single supervision, we set the cut id to the id of this supervision, for better compatibility with downstream tools, e.g. comparing the hypothesis of ASR with the reference in icefall.

Hint

If a MultiCut is trimmed and the resulting trimmed cut contains a single channel, we convert it to a MonoCut.

Hint

If you have a Cut with multiple supervision segments and you want to trim it to the word-level alignment, you can use the Cut.merge_supervisions() method first to merge the supervisions into a single one, followed by the Cut.trim_to_alignments() method. For example:

>>> cut = cut.merge_supervisions(type='word', delimiter=' ')
>>> cut = cut.trim_to_alignments(type='word', max_pause=1.0)

Hint

The above technique can also be used to segment long cuts into roughly equal duration segments, while respecting alignment boundaries. For example, to split a Cut into 10s segments, you can do:

>>> cut = cut.merge_supervisions(type='word', delimiter=' ')
>>> cut = cut.trim_to_alignments(type='word', max_pause=10.0, max_segment_duration=10.0)
Parameters:
  • type (str) – The type of the alignment to trim to (e.g. “word”).

  • max_pause (Optional[float]) – The maximum pause allowed between the alignments to merge them. If None, no merging will be performed. [default: None]

  • delimiter (str) – The delimiter to use when joining the alignment items.

  • keep_all_channels (bool) – If True, the output cut will have the same channels as the input cut. By default, the trimmed cut will have the same channels as the supervision.

  • num_jobs – Number of parallel workers to process the cuts.

Return type:

CutSet

Returns:

a CutSet object.

trim_to_supervision_groups(max_pause=0.0)

Return a new CutSet with Cuts based on supervision groups. A supervision group is a set of supervisions with no gaps between them (or gaps shorter than max_pause). This is similar to the concept of an utterance group as described in this paper: https://arxiv.org/abs/2211.00482

For example, the following cut:

Cut

╔═══════════════════════════════════════════════════════════════════════════════════╗
║┌──────────────────────┐                               ┌────────┐                  ║
║│ Hello this is John.  │                               │   Hi   │                  ║
║└──────────────────────┘                               └────────┘                  ║
║            ┌──────────────────────────────────┐       ┌───────────────────┐       ║
║            │     Hey, John. How are you?      │       │  What do you do?  │       ║
║            └──────────────────────────────────┘       └───────────────────┘       ║
╚═══════════════════════════════════════════════════════════════════════════════════╝

is transformed into two cuts:

Cut 1                                       Cut 2

╔════════════════════════════════════════════════╗ ╔═══════════════════════════╗
║ ┌──────────────────────┐                       ║ ║ ┌────────┐                ║
║ │ Hello this is John.  │                       ║ ║ │   Hi   │                ║
║ └──────────────────────┘                       ║ ║ └────────┘                ║
║            ┌──────────────────────────────────┐║ ║ ┌───────────────────┐     ║
║            │     Hey, John. How are you?      │║ ║ │  What do you do?  │     ║
║            └──────────────────────────────────┘║ ║ └───────────────────┘     ║
╚════════════════════════════════════════════════╝ ╚═══════════════════════════╝

For the case of a multi-channel cut with multiple supervisions, we keep all the channels in the recording.

Parameters:

max_pause (float) – An optional duration in seconds; if the gap between two supervisions is longer than this, they will be treated as separate groups. By default, this is set to 0.0, which means that no gaps are allowed between supervisions.

Return type:

CutSet

Returns:

a CutSet.

trim_to_supervisions(keep_overlapping=True, min_duration=None, context_direction='center', keep_all_channels=False)

Splits the current Cut into as many cuts as there are supervisions (SupervisionSegment). The resulting cuts have the same start times and durations as the supervisions. When there are overlapping supervisions, they can be kept or discarded via the keep_overlapping flag.

For example, the following cut:

        Cut
|-----------------|
 Sup1
|----|  Sup2
   |-----------|

is transformed into two cuts:

 Cut1
|----|
 Sup1
|----|
   Sup2
   |-|
        Cut2
   |-----------|
   Sup1
   |-|
        Sup2
   |-----------|

For the case of a multi-channel cut with multiple supervisions, we can either trim while respecting the supervision channels (in which case output cut has the same channels as the supervision) or ignore the channels (in which case output cut has the same channels as the input cut).

Hint

If the resulting trimmed cut contains a single supervision, we set the cut id to the id of this supervision, for better compatibility with downstream tools, e.g. comparing the hypothesis of ASR with the reference in icefall.

Hint

If a MultiCut is trimmed and the resulting trimmed cut contains a single channel, we convert it to a MonoCut.

Parameters:
  • keep_overlapping (bool) – when False, it will discard parts of other supervisions that overlap with the main supervision. In the illustration above, it would discard Sup2 in Cut1 and Sup1 in Cut2. In this mode, we guarantee that there will always be exactly one supervision per cut.

  • min_duration (Optional[float]) – An optional duration in seconds; specifying this argument will extend the cuts that would have been shorter than min_duration with actual acoustic context in the recording/features. If there are supervisions present in the context, they are kept when keep_overlapping is true. If there is not enough context, the returned cut will be shorter than min_duration. If the supervision segment is longer than min_duration, the returned cut will be longer.

  • context_direction (Literal['center', 'left', 'right', 'random']) – The direction in which the cut should be expanded to include context. The value of “center” implies equal expansion to left and right; “random” uniformly samples a value between “left” and “right”.

  • keep_all_channels (bool) – If True, the output cut will have the same channels as the input cut. By default, the trimmed cut will have the same channels as the supervision.

Return type:

CutSet

Returns:

a list of cuts.

property trimmed_supervisions: List[SupervisionSegment]

Return the supervisions in this Cut that have modified time boundaries so as not to exceed the Cut’s start or end.

Note that when cut.supervisions is called, the supervisions may have negative start values that indicate the supervision actually begins before the cut, or end values that exceed the Cut’s duration (it means the supervision continued in the original recording after the Cut’s ending).

Caution

For some tasks such as speech recognition (ASR), trimmed supervisions could result in corrupted training data. This is because a part of the transcript might actually reside outside of the cut.

truncate(*, offset=0.0, duration=None, keep_excessive_supervisions=True, preserve_id=False, _supervisions_index=None)

Returns a new MonoCut that is a sub-region of the current DataCut.

Note that no operation is done on the actual features or recording - it’s only during the call to DataCut.load_features() / DataCut.load_audio() when the actual changes happen (a subset of features/audio is loaded).

Hint

To extend a cut by a fixed duration, use the DataCut.extend_by() method.

Parameters:
  • offset (float) – float (seconds), controls the start of the new cut relative to the current DataCut’s start. E.g., if the current DataCut starts at 10.0, and offset is 2.0, the new start is 12.0.

  • duration (Optional[float]) – optional float (seconds), controls the duration of the resulting DataCut. By default, the duration is (end of the cut before truncation) - (offset).

  • keep_excessive_supervisions (bool) – bool. Since trimming may happen inside a SupervisionSegment, the caller has an option to either keep or discard such supervisions.

  • preserve_id (bool) – bool. Should the truncated cut keep the same ID or get a new, random one.

  • _supervisions_index (Optional[Dict[str, IntervalTree]]) – an IntervalTree; when passed, allows speeding up processing of Cuts with a very large number of supervisions. Intended as an internal parameter.

Return type:

DataCut

Returns:

a new MonoCut instance. If the current DataCut is shorter than the duration, return None.

property video: VideoInfo | None
with_custom(name, value)

Return a copy of this object with an extra custom field assigned to it.

with_features_path_prefix(path)
Return type:

DataCut

with_id(id_)

Return a copy of the Cut with a new ID.

Return type:

Cut

with_recording_path_prefix(path)
Return type:

DataCut

id: str
start: Seconds
duration: Seconds
supervisions: List[SupervisionSegment]
class lhotse.cut.PaddingCut(id, duration, sampling_rate, feat_value, num_frames=None, num_features=None, frame_shift=None, num_samples=None, video=None, custom=None)[source]

PaddingCut is a dummy Cut that doesn’t refer to actual recordings or features – it simply returns zero samples in the time domain and a specified feature value in the feature domain. Its main role is to be appended to other cuts to make them evenly sized.

Please refer to the documentation of Cut to learn more about using cuts.
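
PaddingCuts are typically created indirectly, e.g. by Cut.pad(); a direct construction sketch with illustrative values:

>>> padding = PaddingCut(
...     id='padding-1', duration=2.0, sampling_rate=16000,
...     feat_value=-23.0, num_samples=32000,
... )
>>> audio = padding.load_audio()  # an array of zeros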

id: str
duration: float
sampling_rate: int
feat_value: float
num_frames: Optional[int] = None
num_features: Optional[int] = None
frame_shift: Optional[float] = None
num_samples: Optional[int] = None
video: Optional[VideoInfo] = None
custom: Optional[dict] = None
property start: float
property supervisions
property channel: int
property has_features: bool
property has_recording: bool
property has_video: bool
property num_channels: int
has(field)[source]
Return type:

bool

property recording_id: str
load_features(*args, **kwargs)[source]
Return type:

Optional[ndarray]

load_audio(*args, **kwargs)[source]
Return type:

Optional[ndarray]

load_video(with_audio=True)[source]
Return type:

Optional[Tuple[Tensor, Optional[Tensor]]]

truncate(*, offset=0.0, duration=None, keep_excessive_supervisions=True, preserve_id=False, **kwargs)[source]
Return type:

PaddingCut

extend_by(*, duration, direction='both', preserve_id=False, pad_silence=True)[source]

Return a new PaddingCut with its region extended by the specified duration.

Parameters:
  • duration (float) – The duration by which to extend the cut.

  • direction (str) – string, ‘left’, ‘right’ or ‘both’. Determines whether the cut should be extended to the left, right or both sides. By default, the cut is extended by the specified duration on both sides.

  • preserve_id (bool) – When True, preserves the cut ID from before padding. Otherwise, generates a new random ID (default).

  • pad_silence (bool) – See usage in lhotse.cut.MonoCut.extend_by(). It is ignored here.

Return type:

PaddingCut

Returns:

an extended PaddingCut.

pad(duration=None, num_frames=None, num_samples=None, pad_feat_value=-23.025850929940457, direction='right', preserve_id=False, pad_value_dict=None)[source]

Return a new MixedCut, padded with zeros in the recording, and pad_feat_value in each feature bin.

The user can choose to pad either to a specific duration; a specific number of frames num_frames; or a specific number of samples num_samples. The three arguments are mutually exclusive.

Parameters:
  • duration (Optional[float]) – The cut’s minimal duration after padding.

  • num_frames (Optional[int]) – The cut’s total number of frames after padding.

  • num_samples (Optional[int]) – The cut’s total number of samples after padding.

  • pad_feat_value (float) – A float value that’s used for padding the features. By default we assume a log-energy floor of approx. -23 (1e-10 after exp).

  • direction (str) – string, ‘left’, ‘right’ or ‘both’. Determines whether the padding is added before or after the cut.

  • preserve_id (bool) – When True, preserves the cut ID from before padding. Otherwise, generates a new random ID (default).

  • pad_value_dict (Optional[Dict[str, Union[int, float]]]) – Optional dict that specifies what value should be used for padding arrays in custom attributes.

Return type:

Cut

Returns:

a padded MixedCut if duration is greater than this cut’s duration, otherwise self.

resample(sampling_rate, affix_id=False)[source]

Return a new PaddingCut that will lazily resample the audio while reading it. This operation will drop the feature manifest, if attached. It does not affect the supervision.

Parameters:
  • sampling_rate (int) – The new sampling rate.

  • affix_id (bool) – Should we modify the ID (useful if both versions of the same cut are going to be present in a single manifest).

Return type:

PaddingCut

Returns:

a modified copy of the current PaddingCut.

perturb_speed(factor, affix_id=True)[source]

Return a new PaddingCut that will “mimic” the effect of speed perturbation on duration and num_samples.

Parameters:
  • factor (float) – The speed will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • affix_id (bool) – When true, we will modify the PaddingCut.id field by affixing it with “_sp{factor}”.

Return type:

PaddingCut

Returns:

a modified copy of the current PaddingCut.

perturb_tempo(factor, affix_id=True)[source]

Return a new PaddingCut that will “mimic” the effect of tempo perturbation on duration and num_samples.

Compared to speed perturbation, tempo preserves pitch.

Parameters:
  • factor (float) – The tempo will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • affix_id (bool) – When true, we will modify the PaddingCut.id field by affixing it with “_tp{factor}”.

Return type:

PaddingCut

Returns:

a modified copy of the current PaddingCut.

perturb_volume(factor, affix_id=True)[source]

Return a new PaddingCut that will “mimic” the effect of volume perturbation on amplitude of samples.

Parameters:
  • factor (float) – The volume will be adjusted this many times (e.g. factor=1.1 means 1.1x louder).

  • affix_id (bool) – When true, we will modify the PaddingCut.id field by affixing it with “_vp{factor}”.

Return type:

PaddingCut

Returns:

a modified copy of the current PaddingCut.

reverb_rir(rir_recording=None, normalize_output=True, early_only=False, affix_id=True, rir_channels=[0], room_rng_seed=None, source_rng_seed=None)[source]

Return a new PaddingCut that will “mimic” the effect of reverberation with impulse response on original samples.

Parameters:
  • rir_recording (Optional[Recording]) – The impulse response to use for convolving.

  • normalize_output (bool) – When true, the output will be normalized to have the same energy as the input.

  • early_only (bool) – When true, only the early reflections (first 50 ms) will be used.

  • affix_id (bool) – When true, we will modify the PaddingCut.id field by affixing it with “_rvb”.

  • rir_channels (List[int]) – The channels of the impulse response to use.

Return type:

PaddingCut

Returns:

a modified copy of the current PaddingCut.

normalize_loudness(target, affix_id=False, **kwargs)[source]

Return a new PaddingCut that will “mimic” the effect of loudness normalization.

Parameters:
  • target (float) – The target loudness in dBFS.

  • affix_id (bool) – When true, we will modify the DataCut.id field by affixing it with “_ln{target}”.

Return type:

PaddingCut

Returns:

a modified copy of the current PaddingCut.

drop_features()[source]

Return a copy of the current PaddingCut, detached from features.

Return type:

PaddingCut

drop_recording()[source]

Return a copy of the current PaddingCut, detached from recording.

Return type:

PaddingCut

drop_supervisions()[source]

Return a copy of the current PaddingCut, detached from supervisions.

Return type:

PaddingCut

drop_alignments()[source]

Return a copy of the current PaddingCut, detached from alignments.

Return type:

PaddingCut

compute_and_store_features(extractor, *args, **kwargs)[source]

Returns a new PaddingCut with updated information about the feature dimension and number of feature frames, depending on the extractor properties.

Return type:

Cut

fill_supervision(*args, **kwargs)[source]

Just for consistency with MonoCut and MixedCut.

Return type:

PaddingCut

move_to_memory(*args, **kwargs)[source]

Just for consistency with MonoCut and MixedCut.

Return type:

PaddingCut

map_supervisions(transform_fn)[source]

Just for consistency with MonoCut and MixedCut.

Parameters:

transform_fn (Callable[[Any], Any]) – a dummy function that is never actually called.

Return type:

PaddingCut

Returns:

the PaddingCut itself.

merge_supervisions(*args, **kwargs)[source]

Just for consistency with MonoCut and MixedCut.

Return type:

PaddingCut

Returns:

the PaddingCut itself.

filter_supervisions(predicate)[source]

Just for consistency with MonoCut and MixedCut.

Parameters:

predicate (Callable[[SupervisionSegment], bool]) – A callable that accepts SupervisionSegment and returns bool

Return type:

PaddingCut

Returns:

the PaddingCut itself.

static from_dict(data)[source]
Return type:

PaddingCut

with_features_path_prefix(path)[source]
Return type:

PaddingCut

with_recording_path_prefix(path)[source]
Return type:

PaddingCut

__init__(id, duration, sampling_rate, feat_value, num_frames=None, num_features=None, frame_shift=None, num_samples=None, video=None, custom=None)
append(other, snr=None, preserve_id=None)

Append the other Cut after the current Cut. Conceptually the same as mix, but with an offset matching the current cut’s length. Optionally scale down (positive SNR) or scale up (negative SNR) the other cut. Returns a MixedCut, which only keeps the information about the mix; actual mixing is performed during the call to load_features.

Parameters:

preserve_id (Optional[str]) – optional string (“left”, “right”). When specified, append will preserve the cut ID of the left- or right-hand side argument. Otherwise, a new random ID is generated.

Return type:

Cut

compute_features(extractor, augment_fn=None)

Compute the features from this cut. This cut has to be able to load audio.

Parameters:
  • extractor (FeatureExtractor) – a FeatureExtractor instance used to compute the features.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – optional WavAugmenter instance for audio augmentation.

Return type:

ndarray

Returns:

a numpy ndarray with the computed features.

cut_into_windows(duration, hop=None, keep_excessive_supervisions=True)

Return a list of shorter cuts, made by traversing this cut in windows of duration seconds by hop seconds.

The last window might have a shorter duration if there was not enough audio, so you might want to either filter it out or pad the results.

Parameters:
  • duration (float) – Desired duration of the new cuts in seconds.

  • hop (Optional[float]) – Shift between the windows in the new cuts in seconds.

  • keep_excessive_supervisions (bool) – bool. When a cut is truncated in the middle of a supervision segment, should the supervision be kept.

Return type:

CutSet

Returns:

a list of cuts made from shorter duration windows.

property end: float
property has_overlapping_supervisions: bool
index_supervisions(index_mixed_tracks=False, keep_ids=None)

Create a two-level index of supervision segments. It is a mapping from a Cut’s ID to an interval tree that contains the supervisions of that Cut.

The interval tree can be efficiently queried for overlapping and/or enveloping segments. It helps speed up some operations on Cuts of very long recordings (1h+) that contain many supervisions.

Parameters:
  • index_mixed_tracks (bool) – Should the tracks of MixedCut’s be indexed as additional, separate entries.

  • keep_ids (Optional[Set[str]]) – If specified, we will only index the supervisions with the specified IDs.

Return type:

Dict[str, IntervalTree]

Returns:

a mapping from Cut ID to an interval tree of SupervisionSegments.

mix(other, offset_other_by=0.0, allow_padding=False, snr=None, preserve_id=None)

Refer to the documentation of lhotse.cut.mix().

Return type:

Cut

play_audio()

Display a Jupyter widget that allows you to listen to the waveform. Works only in Jupyter notebook/lab or similar (e.g. Colab).

plot_alignment(alignment_type='word')

Display the alignment on top of a spectrogram. Requires matplotlib to be installed.

plot_audio()

Display a plot of the waveform. Requires matplotlib to be installed.

plot_features()

Display the feature matrix as an image. Requires matplotlib to be installed.

save_audio(storage_path, format=None, encoding=None, augment_fn=None, **kwargs)

Store this cut’s waveform as an audio recording on disk.

Parameters:
  • storage_path (Union[Path, str]) – The path to location where we will store the audio recordings.

  • format (Optional[str]) – Audio format argument supported by torchaudio.save or soundfile.write. Please refer to the relevant library’s documentation depending on which audio backend you’re using.

  • encoding (Optional[str]) – Audio encoding argument supported by torchaudio.save or soundfile.write. Please refer to the relevant library’s documentation depending on which audio backend you’re using.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional callable used for audio augmentation. Be careful with the types of augmentations used: if they modify the start/end/duration times of the cut and its supervisions, you will end up with incorrect supervision information when using this API. E.g. for speed perturbation, use CutSet.perturb_speed() instead.

  • kwargs – additional arguments passed to Cut.load_audio(). For example, when saving a MixedCut, we can specify mono_downmix=True to downmix the tracks to mono before saving.

Return type:

Cut

Returns:

a new Cut instance.
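
For example (a minimal sketch; the output path is hypothetical):

>>> saved_cut = cut.save_audio('data/cut1.flac')
>>> audio = saved_cut.load_audio()  # now reads from the stored file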

speakers_audio_mask(min_speaker_dim=None, speaker_to_idx_map=None, use_alignment_if_exists=None)

Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_samples), and its values are 0 for nonspeech samples and 1 for speech samples for each respective speaker.

This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272

Parameters:
  • min_speaker_dim (Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).

  • speaker_to_idx_map (Optional[Dict[str, int]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2)

  • use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, fall back on the supervision time spans.

Return type:

ndarray
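
For example, a sketch with a fixed speaker ordering (the speaker names are hypothetical):

>>> mask = cut.speakers_audio_mask(speaker_to_idx_map={'spk-a': 0, 'spk-b': 1})
>>> num_speakers, num_samples = mask.shape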

speakers_feature_mask(min_speaker_dim=None, speaker_to_idx_map=None, use_alignment_if_exists=None)

Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_frames), and its values are 0 for nonspeech frames and 1 for speech frames for each respective speaker.

This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272

Parameters:
  • min_speaker_dim (Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).

  • speaker_to_idx_map (Optional[Dict[str, int]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2)

  • use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, fall back on the supervision time spans.

Return type:

ndarray

split(timestamp)

Split a cut into two cuts at timestamp, which is measured from the start of the cut. For example, a [0s - 10s] cut split at 4s yields:

  • left cut [0s - 4s]

  • right cut [4s - 10s]

Return type:

Tuple[Cut, Cut]
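
A minimal sketch (assuming cut is a 10-second Cut):

>>> left, right = cut.split(4.0)
>>> (left.duration, right.duration)
(4.0, 6.0)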

supervisions_audio_mask(use_alignment_if_exists=None)

Return a 1D numpy array with value 1 for samples covered by at least one supervision, and 0 for samples not covered by any supervision.

Parameters:

use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, fall back on the supervision time spans.

Return type:

ndarray

supervisions_feature_mask(use_alignment_if_exists=None)

Return a 1D numpy array with value 1 for frames covered by at least one supervision, and 0 for frames not covered by any supervision.

Parameters:

use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, fall back on the supervision time spans.

Return type:

ndarray

to_dict()
Return type:

dict

trim_to_alignments(type, max_pause=None, max_segment_duration=None, delimiter=' ', keep_all_channels=False)

Splits the current Cut into its constituent alignment items (AlignmentItem). These cuts have the same start times and durations as the alignment items. Additionally, the max_pause option can be used to merge alignment items that are separated by a pause shorter than max_pause. If max_segment_duration is specified, we will keep merging consecutive segments until the duration of the merged segment exceeds max_segment_duration.

For the case of a multi-channel cut with multiple alignments, we can either trim while respecting the supervision channels (in which case output cut has the same channels as the supervision) or ignore the channels (in which case output cut has the same channels as the input cut).

Hint

If the resulting trimmed cut contains a single supervision, we set the cut id to the id of this supervision, for better compatibility with downstream tools, e.g. comparing the hypothesis of ASR with the reference in icefall.

Hint

If a MultiCut is trimmed and the resulting trimmed cut contains a single channel, we convert it to a MonoCut.

Hint

If you have a Cut with multiple supervision segments and you want to trim it to the word-level alignment, you can use the Cut.merge_supervisions() method first to merge the supervisions into a single one, followed by the Cut.trim_to_alignments() method. For example:

>>> cut = cut.merge_supervisions(type='word', delimiter=' ')
>>> cut = cut.trim_to_alignments(type='word', max_pause=1.0)

Hint

The above technique can also be used to segment long cuts into roughly equal duration segments, while respecting alignment boundaries. For example, to split a Cut into 10s segments, you can do:

>>> cut = cut.merge_supervisions(type='word', delimiter=' ')
>>> cut = cut.trim_to_alignments(type='word', max_pause=10.0, max_segment_duration=10.0)
Parameters:
  • type (str) – The type of the alignment to trim to (e.g. “word”).

  • max_pause (Optional[float]) – The maximum pause allowed between the alignments to merge them. If None, no merging will be performed. [default: None]

  • delimiter (str) – The delimiter to use when joining the alignment items.

  • keep_all_channels (bool) – If True, the output cut will have the same channels as the input cut. By default, the trimmed cut will have the same channels as the supervision.

  • num_jobs – Number of parallel workers to process the cuts.

Return type:

CutSet

Returns:

a CutSet object.

trim_to_supervision_groups(max_pause=0.0)

Return a new CutSet with Cuts based on supervision groups. A supervision group is a set of supervisions with no gaps between them (or gaps shorter than max_pause). This is similar to the concept of an utterance group as described in this paper: https://arxiv.org/abs/2211.00482

For example, the following cut:

Cut

╔═════════════════════════════════════════════════════════════════════════════════╗
║┌──────────────────────┐                              ┌────────┐                 ║
║│ Hello this is John.  │                              │   Hi   │                 ║
║└──────────────────────┘                              └────────┘                 ║
║            ┌──────────────────────────────────┐            ┌───────────────────┐║
║            │     Hey, John. How are you?      │            │  What do you do?  │║
║            └──────────────────────────────────┘            └───────────────────┘║
╚═════════════════════════════════════════════════════════════════════════════════╝

is transformed into two cuts:

Cut 1                                       Cut 2

╔════════════════════════════════════════════════╗ ╔═══════════════════════════╗
║┌──────────────────────┐                        ║ ║┌────────┐                 ║
║│ Hello this is John.  │                        ║ ║│   Hi   │                 ║
║└──────────────────────┘                        ║ ║└────────┘                 ║
║            ┌──────────────────────────────────┐║ ║      ┌───────────────────┐║
║            │     Hey, John. How are you?      │║ ║      │  What do you do?  │║
║            └──────────────────────────────────┘║ ║      └───────────────────┘║
╚════════════════════════════════════════════════╝ ╚═══════════════════════════╝

For the case of a multi-channel cut with multiple supervisions, we keep all the channels in the recording.

Parameters:

max_pause (float) – An optional duration in seconds; if the gap between two supervisions is longer than this, they will be treated as separate groups. By default, this is set to 0.0, which means that no gaps are allowed between supervisions.

Return type:

CutSet

Returns:

a CutSet.
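
A minimal sketch (assuming cut contains several supervisions):

>>> groups = cut.trim_to_supervision_groups(max_pause=0.5)
>>> [len(c.supervisions) for c in groups]  # number of supervisions per group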

trim_to_supervisions(keep_overlapping=True, min_duration=None, context_direction='center', keep_all_channels=False)

Splits the current Cut into as many cuts as there are supervisions (SupervisionSegment). These cuts have identical start times and durations as the supervisions. When there are overlapping supervisions, they can be kept or discarded via keep_overlapping flag.

For example, the following cut:

        Cut
|-----------------|
 Sup1
|----|  Sup2
   |-----------|

is transformed into two cuts:

 Cut1
|----|
 Sup1
|----|
   Sup2
   |-|
        Cut2
   |-----------|
   Sup1
   |-|
        Sup2
   |-----------|

For the case of a multi-channel cut with multiple supervisions, we can either trim while respecting the supervision channels (in which case output cut has the same channels as the supervision) or ignore the channels (in which case output cut has the same channels as the input cut).

Hint

If the resulting trimmed cut contains a single supervision, we set the cut id to the id of this supervision, for better compatibility with downstream tools, e.g. comparing the hypothesis of ASR with the reference in icefall.

Hint

If a MultiCut is trimmed and the resulting trimmed cut contains a single channel, we convert it to a MonoCut.

Parameters:
  • keep_overlapping (bool) – when False, it will discard parts of other supervisions that overlap with the main supervision. In the illustration above, it would discard Sup2 in Cut1 and Sup1 in Cut2. In this mode, we guarantee that there will always be exactly one supervision per cut.

  • min_duration (Optional[float]) – An optional duration in seconds; specifying this argument will extend the cuts that would have been shorter than min_duration with actual acoustic context in the recording/features. If there are supervisions present in the context, they are kept when keep_overlapping is true. If there is not enough context, the returned cut will be shorter than min_duration. If the supervision segment is longer than min_duration, the returned cut will be longer.

  • context_direction (Literal['center', 'left', 'right', 'random']) – Which direction the cut should be expanded towards to include context. The value “center” implies equal expansion to the left and right; “random” uniformly samples a value between “left” and “right”.

  • keep_all_channels (bool) – If True, the output cut will have the same channels as the input cut. By default, the trimmed cut will have the same channels as the supervision.

Return type:

CutSet

Returns:

a list of cuts.
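
For example (a minimal sketch; cut is any Cut with supervisions):

>>> cuts = cut.trim_to_supervisions(keep_overlapping=False)
>>> all(len(c.supervisions) == 1 for c in cuts)
True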

property trimmed_supervisions: List[SupervisionSegment]

Return the supervisions in this Cut that have modified time boundaries so as not to exceed the Cut’s start or end.

Note that when cut.supervisions is called, the supervisions may have negative start values that indicate the supervision actually begins before the cut, or end values that exceed the Cut’s duration (it means the supervision continued in the original recording after the Cut’s ending).

Caution

For some tasks such as speech recognition (ASR), trimmed supervisions could result in corrupted training data. This is because a part of the transcript might actually reside outside of the cut.

with_id(id_)

Return a copy of the Cut with a new ID.

Return type:

Cut

features_type: Optional[str]
lhotse.cut.create_cut_set_eager(recordings=None, supervisions=None, features=None, output_path=None, random_ids=False, tolerance=0.001)[source]

Create a CutSet from any combination of supervision, feature and recording manifests. At least one of recordings or features is required.

The created cuts will be of type DataCut (MonoCut for single-channel and MultiCut for multi-channel). The DataCut boundaries correspond to those found in the features, when available, otherwise to those found in the recordings.

When supervisions are provided, we’ll be searching them for matching recording IDs and attaching to created cuts, assuming they are fully within the cut’s time span.

Parameters:
  • recordings (Optional[RecordingSet]) – an optional RecordingSet manifest.

  • supervisions (Optional[SupervisionSet]) – an optional SupervisionSet manifest.

  • features (Optional[FeatureSet]) – an optional FeatureSet manifest.

  • output_path (Union[Path, str, None]) – an optional path where the CutSet is stored.

  • random_ids (bool) – boolean, should the cut IDs be randomized. By default, use the recording ID with a loop index and a channel idx, i.e. “{recording_id}-{idx}-{channel}”.

  • tolerance (float) – float, tolerance for supervision and feature segment boundary comparison. By default, it’s 1ms. Increasing this value can be helpful when importing Kaldi data directories with precomputed features.

Return type:

CutSet

Returns:

a new CutSet instance.
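
A minimal sketch (the manifest paths are hypothetical):

>>> from lhotse import RecordingSet, SupervisionSet
>>> from lhotse.cut import create_cut_set_eager
>>> cuts = create_cut_set_eager(
...     recordings=RecordingSet.from_file('recordings.jsonl.gz'),
...     supervisions=SupervisionSet.from_file('supervisions.jsonl.gz'),
... )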

lhotse.cut.create_cut_set_lazy(output_path, recordings=None, supervisions=None, features=None, random_ids=False, tolerance=0.001)[source]

Create a CutSet from any combination of supervision, feature and recording manifests. At least one of recordings or features is required.

This method is the “lazy” variant, which allows to create a CutSet with a minimal memory usage. It has some extra requirements:

  • The user must provide an output_path, where we will write the cuts as we create them. We’ll return a lazily-opened CutSet from that file.

  • recordings and features (if both provided) have to be of equal length and sorted by the recording_id attribute of their elements.

  • supervisions (if provided) have to be sorted by recording_id; note that there may be multiple supervisions with the same recording_id, which is allowed.

In addition, to prepare cuts in a fully memory-efficient way, make sure that:

  • All input manifests are stored in JSONL format and opened lazily with the <manifest_class>.from_jsonl_lazy(path) method.

For more details, see create_cut_set_eager().

Parameters:
  • output_path (Union[Path, str]) – path to which we will write the cuts.

  • recordings (Optional[RecordingSet]) – an optional RecordingSet manifest.

  • supervisions (Optional[SupervisionSet]) – an optional SupervisionSet manifest.

  • features (Optional[FeatureSet]) – an optional FeatureSet manifest.

  • random_ids (bool) – boolean, should the cut IDs be randomized. By default, use the recording ID with a loop index and a channel idx, i.e. “{recording_id}-{idx}-{channel}”.

  • tolerance (float) – float, tolerance for supervision and feature segment boundary comparison. By default, it’s 1ms. Increasing this value can be helpful when importing Kaldi data directories with precomputed features.

Return type:

CutSet

Returns:

a new CutSet instance.
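
A minimal sketch (the manifest paths are hypothetical):

>>> from lhotse import RecordingSet, SupervisionSet
>>> from lhotse.cut import create_cut_set_lazy
>>> cuts = create_cut_set_lazy(
...     'cuts.jsonl.gz',
...     recordings=RecordingSet.from_jsonl_lazy('recordings.jsonl.gz'),
...     supervisions=SupervisionSet.from_jsonl_lazy('supervisions.jsonl.gz'),
... )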

lhotse.cut.compute_supervisions_frame_mask(cut, frame_shift=None, use_alignment_if_exists=None)[source]

Compute a mask that indicates which frames in a cut are covered by supervisions.

Parameters:
  • cut (Cut) – a cut object.

  • frame_shift (Optional[float]) – optional frame shift in seconds; required when the cut does not have pre-computed features, otherwise ignored.

  • use_alignment_if_exists (Optional[str]) – optional str (key from alignment dict); use the specified alignment type for generating the mask

Returns:

a 1D numpy array with value 1 for frames covered by at least one supervision, and 0 for frames not covered by any supervision.

lhotse.cut.append_cuts(cuts)[source]

Return a MixedCut that consists of the input Cuts appended to each other as-is.

Return type:

Cut

lhotse.cut.mix_cuts(cuts)[source]

Return a MixedCut that consists of the input Cuts mixed with each other as-is.

Return type:

MixedCut
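
For example (a minimal sketch; cut1 and cut2 are any existing Cuts):

>>> from lhotse.cut import append_cuts, mix_cuts
>>> appended = append_cuts([cut1, cut2])  # played one after the other
>>> mixed = mix_cuts([cut1, cut2])        # overlaid in time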

Recipes

Convenience methods used to prepare recording and supervision manifests for standard corpora.

lhotse.recipes.download_adept(target_dir='.', force_download=False)[source]

Download and untar the ADEPT dataset.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • force_download (bool) – Bool, if True, download the tars even if they already exist.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_adept(corpus_dir, output_dir=None)[source]

Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the output_dir, it will simply read and return them.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

Returns:

a Dict whose key is the dataset part, and the value is Dicts with the keys ‘audio’ and ‘supervisions’.
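
A typical recipe workflow chains the download and prepare steps, e.g. (a minimal sketch; the paths are hypothetical):

>>> from lhotse.recipes import download_adept, prepare_adept
>>> corpus_dir = download_adept('/data/corpora')
>>> manifests = prepare_adept(corpus_dir, output_dir='/data/manifests')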

lhotse.recipes.download_aishell(target_dir='.', force_download=False, base_url='http://www.openslr.org/resources')[source]

Download and untar the dataset.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • force_download (bool) – Bool, if True, download the tars even if they already exist.

  • base_url (str) – str, the url of the OpenSLR resources.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_aishell(corpus_dir, output_dir=None)[source]

Returns the manifests which consist of the Recordings and Supervisions.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.

lhotse.recipes.download_aishell3(target_dir='.', force_download=False, base_url='http://www.openslr.org/resources')[source]

Download and untar the dataset

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • force_download (Optional[bool]) – Bool, if True, download the tars even if they already exist.

  • base_url (Optional[str]) – str, the url of the OpenSLR resources.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_aishell3(corpus_dir, output_dir=None)[source]

Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the output_dir, it will simply read and return them.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is the dataset part, and the value is Dicts with the keys ‘audio’ and ‘supervisions’.

lhotse.recipes.download_aishell4(target_dir='.', force_download=False, base_url='http://www.openslr.org/resources')[source]

Download and untar the dataset.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • force_download (Optional[bool]) – Bool, if True, download the tars even if they already exist.

  • base_url (Optional[str]) – str, the url of the OpenSLR resources.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_aishell4(corpus_dir, output_dir=None, normalize_text=False)[source]

Returns the manifests which consist of the Recordings and Supervisions.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.

lhotse.recipes.download_ali_meeting(target_dir='.', force_download=False, base_url='https://speech-lab-share-data.oss-cn-shanghai.aliyuncs.com/')[source]

Download and untar the dataset.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • force_download (Optional[bool]) – Bool, if True, download the tars even if they already exist.

  • base_url (Optional[str]) – str, the base url of the download site.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_ali_meeting(corpus_dir, output_dir=None, mic='far', normalize_text='none', save_mono=False)[source]

Returns the manifests which consist of the Recordings and Supervisions.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

  • mic (Optional[str]) – str, “near” or “far”, specifies whether to prepare the near-field or far-field data. May also specify “ihm”, “sdm”, “mdm” (similar to the AMI recipe), where “ihm” and “mdm” are the same as “near” and “far” respectively, and “sdm” is the same as “far” with a single channel.

  • normalize_text (str) – str, the text normalization type. Available options: “none”, “m2met”.

  • save_mono (bool) – bool, if True, save the mono recordings for the sdm mic. This can speed up feature extraction since not all channels need to be loaded.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.

lhotse.recipes.download_ami(target_dir='.', annotations=None, force_download=False, url='http://groups.inf.ed.ac.uk/ami', mic='ihm')[source]

Download AMI audio and annotations for provided microphone setting.

Example usage:

>>> # 1. Download AMI data for IHM mic setting:
>>> download_ami(mic='ihm')
>>> # 2. Download AMI data for IHM-mix mic setting, and use existing annotations:
>>> download_ami(mic='ihm-mix', annotations='/path/to/existing/annotations.zip')

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path to store the data.

  • annotations (Union[Path, str, None]) – Pathlike (default = None), path to save annotations zip file

  • force_download (Optional[bool]) – bool (default = False), if True, download even if file is present.

  • url (Optional[str]) – str (default = ‘http://groups.inf.ed.ac.uk/ami’), AMI download URL.

  • mic (Optional[str]) – str {‘ihm’,’ihm-mix’,’sdm’,’mdm’,’mdm8-bf’}, type of mic setting.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_ami(data_dir, annotations_dir=None, output_dir=None, mic='ihm', partition='full-corpus', normalize_text='kaldi', max_words_per_segment=None, merge_consecutive=False)[source]

Returns the manifests which consist of the Recordings and Supervisions.

Parameters:
  • data_dir (Union[Path, str]) – Pathlike, the path of the data dir.

  • annotations_dir (Union[Path, str, None]) – Pathlike, the path of the annotations dir or zip file.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

  • mic (Optional[str]) – str {‘ihm’,’ihm-mix’,’sdm’,’mdm’,’mdm8-bf’}, type of mic to use.

  • partition (Optional[str]) – str {‘full-corpus’,’full-corpus-asr’,’scenario-only’}, AMI official data split.

  • normalize_text (str) – str {‘none’, ‘upper’, ‘kaldi’}, normalization of text.

  • max_words_per_segment (Optional[int]) – int, maximum number of words per segment. If not None, we will split longer segments similarly to Kaldi’s data prep scripts, i.e., split on full-stop and comma.

  • merge_consecutive (bool) – bool, if True, merge consecutive segments split on full-stop. We will only merge segments if the number of words in the merged segment is less than max_words_per_segment.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is (‘train’, ‘dev’, ‘eval’), and the values are dicts of manifests under keys ‘recordings’ and ‘supervisions’.

Example usage:

>>> # 1. Prepare IHM-Mix data for ASR:
>>> manifests = prepare_ami('/path/to/ami-corpus', mic='ihm-mix', partition='full-corpus-asr')
>>> # 2. Prepare SDM data:
>>> manifests = prepare_ami('/path/to/ami-corpus', mic='sdm', partition='full-corpus')

lhotse.recipes.prepare_aspire(corpus_dir, output_dir=None, mic='single')[source]

Returns the manifests which consist of the Recordings and Supervisions.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the corpus dir (LDC2017S21).

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

  • mic (str) – str, the microphone type, either “single” or “multi”.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is the dataset part (‘dev’ and ‘dev_test’), and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.

lhotse.recipes.download_atcosim(target_dir='.', force_download=False)[source]
Return type:

Path

lhotse.recipes.prepare_atcosim(corpus_dir, output_dir=None, silence_sym='', breath_sym='', foreign_sym='<unk>', partial_sym='<unk>', unknown_sym='<unk>')[source]

Returns the manifests which consist of the Recordings and Supervisions

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

  • silence_sym (Optional[str]) – str, silence symbol

  • breath_sym (Optional[str]) – str, breath symbol

  • foreign_sym (Optional[str]) – str, foreign symbol.

  • partial_sym (Optional[str]) – str, partial symbol. When set to None, will output partial words

  • unknown_sym (Optional[str]) – str, unknown symbol

Return type:

Dict[str, Union[RecordingSet, SupervisionSet]]

Returns:

The RecordingSet and SupervisionSet with the keys ‘audio’ and ‘supervisions’.

lhotse.recipes.prepare_single_babel_language(corpus_dir, output_dir=None, no_eval_ok=False)[source]

Prepares manifests using a single BABEL LDC package.

This function works like the following:

  • first, it will scan corpus_dir for a directory named conversational; if there is more than one, it picks the first one (and emits a warning)

  • then, it will try to find dev, eval, and training splits inside (if any of them is not present, it will skip it with a warning)

  • finally, it scans the selected location for SPHERE audio files and transcripts.

Parameters:
  • corpus_dir (Union[Path, str]) – Path to the root of the LDC package with a BABEL language.

  • output_dir (Union[Path, str, None]) – Path where the manifests are stored.

  • no_eval_ok (bool) – When set to True, this function won’t emit a warning that the eval set was not found.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.

lhotse.recipes.prepare_bengaliai_speech(corpus_dir, output_dir=None, num_jobs=1)[source]

Returns the manifests which consist of the Recordings and Supervisions.

Parameters:
  • corpus_dir (Union[Path, str]) – Path to the Bengali.AI Speech dataset.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.

lhotse.recipes.prepare_broadcast_news(audio_dir, transcripts_dir, output_dir=None, absolute_paths=False)[source]

Prepare manifests for 1997 English Broadcast News corpus. We create three manifests: one with recordings, one with segments supervisions, and one with section supervisions. The latter can be used e.g. for topic segmentation.

Parameters:
  • audio_dir (Union[Path, str]) – Path to LDC98S71 package.

  • transcripts_dir (Union[Path, str]) – Path to LDC98T28 package.

  • output_dir (Union[Path, str, None]) – Directory where the manifests should be written. Can be omitted to avoid writing.

Return type:

Dict[str, Union[RecordingSet, SupervisionSet]]

Returns:

A dict with manifests. The keys are: {'recordings', 'sections', 'segments'}.

lhotse.recipes.download_but_reverb_db(target_dir='.', url='http://merlin.fit.vutbr.cz/ReverbDB/BUT_ReverbDB_rel_19_06_RIR-Only.tgz', force_download=False)[source]

Download and untar the BUT Reverb DB dataset.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • url (Optional[str]) – str, the url that downloads file called BUT_ReverbDB.tgz.

  • force_download (Optional[bool]) – bool, if True, download the archive even if it already exists.

Return type:

Path

lhotse.recipes.prepare_but_reverb_db(corpus_dir, output_dir=None, parts=('silence', 'rir'))[source]

Prepare the BUT Speech@FIT Reverb Database corpus.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • output_dir (Union[Path, str, None]) – Pathlike, the path of the dir to write the manifests.

Return type:

Dict[str, Dict[str, Union[RecordingSet, CutSet]]]

lhotse.recipes.download_bvcc(target_dir)[source]
Return type:

None

lhotse.recipes.prepare_bvcc(corpus_dir, output_dir=None, num_jobs=1)[source]
Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

lhotse.recipes.prepare_callhome_egyptian(audio_dir, transcript_dir, output_dir=None, absolute_paths=False)[source]

Prepare manifests for the Callhome Egyptian Arabic Corpus. We create two manifests: one with recordings, and the other one with text supervisions.

Parameters:
  • audio_dir (Union[Path, str]) – Path to LDC97S45 package.

  • transcript_dir (Union[Path, str]) – Path to the LDC97T19 content

  • output_dir (Union[Path, str, None]) – Directory where the manifests should be written. Can be omitted to avoid writing.

  • absolute_paths (bool) – Whether to return absolute or relative (to the corpus dir) paths for recordings.

Return type:

Dict[str, Union[RecordingSet, SupervisionSet]]

Returns:

A dict with manifests. The keys are: {'recordings', 'supervisions'}.

lhotse.recipes.prepare_callhome_english(audio_dir, rttm_dir=None, transcript_dir=None, output_dir=None, absolute_paths=False)[source]

Prepare manifests for the CallHome American English corpus. We create two manifests: one with recordings, and the other one with text supervisions.

Depending on the value of transcript_dir, will prepare either
  • data for ASR task (expected LDC corpora LDC97S42 and LDC97T14)

  • or the SRE task (expected corpus LDC2001S97)

Parameters:
  • audio_dir (Union[Path, str]) – Path to LDC97S42 or LDC2001S97 content

  • transcript_dir (Union[Path, str, None]) – Path to the LDC97T14 content

  • rttm_dir (Union[Path, str, None]) – Path to the RTTM files directory. If not provided, the RTTM files will be downloaded.

  • absolute_paths (bool) – Whether to return absolute or relative (to the corpus dir) paths for recordings.

Return type:

Dict[str, Union[RecordingSet, SupervisionSet]]

Returns:

A dict with manifests. The keys are: {'recordings', 'supervisions'}.

lhotse.recipes.download_chime6(target_dir='.')[source]

Download the original dataset. This cannot be done automatically because of the license agreement. Please visit the following URL and download the dataset manually: https://licensing.sheffield.ac.uk/product/chime5

Parameters:

target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_chime6(corpus_dir, output_dir=None, dataset_parts='all', mic='mdm', use_reference_array=False, perform_array_sync=False, verify_md5_checksums=False, num_jobs=1, num_threads_per_job=1, sox_path='/usr/bin/sox', normalize_text='kaldi', use_chime7_split=False)[source]

Returns the manifests which consist of the Recordings and Supervisions.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir, either the original CHiME-5 data or the synchronized CHiME-6 data. If the former, perform_array_sync must be True.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

  • mic (str) – str, the microphone type to use, choose from “ihm” (close-talk) or “mdm” (multi-microphone array) settings. For MDM, there are 6 array devices with 4 channels each, so the resulting recordings will have 24 channels.

  • use_reference_array (bool) – bool, if True, use the reference array for MDM setting. Only the supervision segments have the reference array information in the channel field. The recordings will still have all the channels in the array. Note that the train set does not have the reference array information.

  • perform_array_sync (bool) – Bool, if True, perform array synchronization based on: https://github.com/chimechallenge/chime6-synchronisation

  • num_jobs (int) – int, the number of jobs to run in parallel for array synchronization.

  • num_threads_per_job (int) – int, number of threads to use per job for clock drift correction. Large values may require more memory, so we recommend using a job scheduler.

  • sox_path (Union[Path, str]) – Pathlike, the path to the sox v14.4.2 binary. Note that different versions of sox may produce different results.

  • normalize_text (str) – str, the text normalization method, choose from “none”, “upper”, “kaldi”. The “kaldi” method is the same as Kaldi’s text normalization method for CHiME-6.

  • verify_md5_checksums (bool) – bool, if True, verify the md5 checksums of the audio files. Note that this step is slow so we recommend only doing it once. It can be sped up by using the num_jobs argument.

  • use_chime7_split (bool) – bool, if True, use the new split for CHiME-7 challenge.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is the dataset part (“train”, “dev” and “eval”), and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.

NOTE: If perform_array_sync is True, the synchronized data will be written to output_dir/CHiME6. This may take a long time and the output will occupy approximately 160G of storage. We will also create a temporary directory for processing, so the required storage in total will be approximately 300G.

lhotse.recipes.download_cmu_arctic(target_dir='.', speakers=('aew', 'ahw', 'aup', 'awb', 'axb', 'bdl', 'clb', 'eey', 'fem', 'gka', 'jmk', 'ksp', 'ljm', 'lnh', 'rms', 'rxr', 'slp', 'slt'), force_download=False, base_url='http://festvox.org/cmu_arctic/packed/')[source]

Download and untar the CMU Arctic dataset.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • speakers (Sequence[str]) – a list of speakers to download. By default, downloads all.

  • force_download (Optional[bool]) – Bool, if True, download the tars even if they already exist.

  • base_url (Optional[str]) – str, the url of CMU Arctic download site.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_cmu_arctic(corpus_dir, output_dir=None)[source]

Prepares and returns the CMU Arctic manifests, which consist of Recordings and Supervisions.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

Return type:

Dict[str, Union[RecordingSet, SupervisionSet]]

Returns:

a dict of {‘recordings’: …, ‘supervisions’: …}

lhotse.recipes.download_cmu_indic(target_dir='.', speakers=('ben_rm', 'guj_ad', 'guj_dp', 'guj_kt', 'hin_ab', 'kan_plv', 'mar_aup', 'mar_slp', 'pan_amp', 'tam_sdr', 'tel_kpn', 'tel_sk', 'tel_ss'), force_download=False, base_url='http://festvox.org/h2r_indic/')[source]

Download and untar the CMU Indic dataset.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • speakers (Sequence[str]) – a list of speakers to download. By default, downloads all.

  • force_download (Optional[bool]) – Bool, if True, download the tars even if they already exist.

  • base_url (Optional[str]) – str, the url of the CMU Indic download site.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_cmu_indic(corpus_dir, output_dir=None)[source]

Prepares and returns the CMU Indic manifests, which consist of Recordings and Supervisions.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

Return type:

Dict[str, Union[RecordingSet, SupervisionSet]]

Returns:

a dict of {‘recordings’: …, ‘supervisions’: …}

lhotse.recipes.prepare_cmu_kids(corpus_dir, output_dir=None, absolute_paths=True)[source]

Prepare manifests for CMU Kids corpus. The prepared supervisions contain the prompt text as the text. Additionally, in the custom tag, we provide the following data: speaker grade/age, population where the speaker came from (SIM95/FP), spoken transcript, and transcription bin (1/2).

Here, bin 1 means utterances where the speaker followed the prompt and no noise/mispronunciation is present, and 2 refers to noisy utterances.

The tag spoken_transcript is the transcription that was actually spoken. It contains noise tags and phone transcription in case the pronunciation differed from that in CMU Dict.

Parameters:
  • corpus_dir (Union[Path, str]) – Path to downloaded LDC corpus.

  • output_dir (Union[Path, str, None]) – Directory where the manifests should be written. Can be omitted to avoid writing.

  • absolute_paths (Optional[bool]) – Whether to write absolute paths to audio sources (default = True)

Return type:

Dict[str, Union[RecordingSet, SupervisionSet]]

Returns:

A dict with manifests. The keys are: {'recordings', 'supervisions'}.

lhotse.recipes.prepare_commonvoice(corpus_dir, output_dir, languages='auto', splits=('test', 'dev', 'train'), num_jobs=1)[source]

Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the output_dir, it will simply read and return them.

This function expects the input directory structure of:

>>> metadata_path = corpus_dir / language_code / "{train,dev,test}.tsv"
>>> # e.g. pl_train_metadata_path = "/path/to/cv-corpus-13.0-2023-03-09/pl/train.tsv"
>>> audio_path = corpus_dir / language_code / "clips"
>>> # e.g. pl_audio_path = "/path/to/cv-corpus-13.0-2023-03-09/pl/clips"

Returns a dict with 3-level structure (lang -> split -> manifest-type):

>>> {'en/fr/pl/...': {'train/dev/test': {'recordings/supervisions': manifest}}}
Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path to the downloaded corpus.

  • output_dir (Union[Path, str]) – Pathlike, the path where to write the manifests.

  • languages (Union[str, Sequence[str]]) – ‘auto’ (prepare all discovered data) or a list of language codes.

  • splits (Union[str, Sequence[str]]) – by default ['train', 'dev', 'test'], can also include 'validated', 'invalidated', and 'other'.

  • num_jobs (int) – How many concurrent workers to use for scanning of the audio files.

Return type:

Dict[str, Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]]

Returns:

a dict with manifests for all specified languages and their train/dev/test splits.
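
For example (a minimal sketch; the corpus path follows the layout shown above):

>>> manifests = prepare_commonvoice(
...     '/path/to/cv-corpus-13.0-2023-03-09', '/data/manifests', languages=['pl'],
... )
>>> pl_train_recordings = manifests['pl']['train']['recordings']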

lhotse.recipes.prepare_csj(corpus_dir, transcript_dir=None, manifest_dir=None, dataset_parts=None, nj=16)[source]
lhotse.recipes.prepare_cslu_kids(corpus_dir, output_dir=None, absolute_paths=True, normalize_text=True)[source]

Prepare manifests for CSLU Kids corpus. The supervision contains either the prompted text, or a transcription of the spontaneous speech, depending on whether the utterance was scripted or spontaneous.

Additionally, the following information is present in the custom tag: scripted/spontaneous utterance, and verification label (rating between 1 and 4) for scripted utterances (see https://catalog.ldc.upenn.edu/docs/LDC2007S18/verification-note.txt or top documentation in this script for more information).

Parameters:
  • corpus_dir (Union[Path, str]) – Path to downloaded LDC corpus.

  • output_dir (Union[Path, str, None]) – Directory where the manifests should be written. Can be omitted to avoid writing.

  • absolute_paths (Optional[bool]) – Whether to write absolute paths to audio sources (default = True)

  • normalize_text (Optional[bool]) – remove noise tags (<bn>, <bs>) from spontaneous speech transcripts (default = True)

Return type:

Dict[str, Union[RecordingSet, SupervisionSet]]

Returns:

A dict with manifests. The keys are: {'recordings', 'supervisions'}.

lhotse.recipes.download_daily_talk(target_dir, force_download=False)[source]

Downloads the DailyTalk data from Google Drive and extracts it.

Parameters:
  • target_dir (Union[Path, str]) – the directory where the DailyTalk data will be saved.

  • force_download (bool) – if True, it will download the DailyTalk data even if it is already present.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_daily_talk(corpus_dir, output_dir=None, num_jobs=1)[source]

Create RecordingSet and SupervisionSet manifests for DailyTalk from a raw corpus distribution.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path to the extracted corpus.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

Return type:

Tuple[RecordingSet, SupervisionSet]

Returns:

a tuple of (RecordingSet, SupervisionSet).

lhotse.recipes.prepare_dihard3(dev_audio_dir, eval_audio_dir, output_dir=None, uem_manifest=True, num_jobs=1)[source]

Prepare manifests for the DIHARD III corpus. We create two manifests: one with recordings, and the other one with supervisions containing speaker id and timestamps.

Parameters:
  • dev_audio_dir (Union[Path, str]) – Path to downloaded DIHARD III dev corpus (LDC2020E12), e.g. /data/corpora/LDC/LDC2020E12

  • eval_audio_dir (Union[Path, str]) – Path to downloaded DIHARD III eval corpus (LDC2021E02), e.g. /data/corpora/LDC/LDC2021E02

  • output_dir (Union[Path, str, None]) – Directory where the manifests should be written. Can be omitted to avoid writing.

  • uem_manifest (Optional[bool]) – If True, also return a SupervisionSet describing the UEM segments (see use in dataset.DiarizationDataset)

  • num_jobs (Optional[int]) – int (default = 1), number of jobs to scan corpus directory for recordings

Return type:

Dict[str, Union[RecordingSet, SupervisionSet]]

Returns:

A dict with manifests. The keys are: {'recordings', 'supervisions'}.

lhotse.recipes.download_dipco(target_dir='.', force_download=False)[source]

Download and untar the dataset.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • force_download (Optional[bool]) – Bool, if True, download the tars even if they already exist.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_dipco(corpus_dir, output_dir=None, mic='mdm', normalize_text='kaldi', use_chime7_offset=False)[source]

Returns the manifests which consist of the Recordings and Supervisions.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

  • mic (Optional[str]) – str, the microphone type to use, choose from “ihm” (close-talk) or “mdm” (multi-microphone array) settings. For MDM, there are 5 array devices with 7 channels each, so the resulting recordings will have 35 channels.

  • normalize_text (Optional[str]) – str, the text normalization to apply. Choose from “none”, “upper”, or “kaldi”. “kaldi” is the default and is the same normalization used in Kaldi’s CHiME-6 recipe.

  • use_chime7_offset (Optional[bool]) – bool, if True, offset session IDs (from CHiME-7 challenge).

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is the dataset part (“dev” and “eval”), and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.

lhotse.recipes.download_earnings21(target_dir='.', force_download=False, url='https://codeload.github.com/revdotcom/speech-datasets/zip/refs/heads/main')[source]

Download and untar the dataset. The extracted files are saved to target_dir/earnings21/. Please note that the github repository contains other additional datasets; using this call, you will be downloading all of them and then throwing the unused ones out.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • force_download (Optional[bool]) – Bool, if True, download the tar file even if it already exists.

  • url (Optional[str]) – str, the url to download the dataset.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_earnings21(corpus_dir, output_dir=None, normalize_text=False)[source]

Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the output_dir, it will simply read and return them.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir. The structure is expected to mimic the structure in the github repository, notably the mp3 files will be searched for in [corpus_dir]/media and transcriptions in the directory [corpus_dir]/transcripts/nlp_references

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

  • normalize_text (bool) – Bool, if True, normalize the text.

Return type:

Union[RecordingSet, SupervisionSet]

Returns:

(recordings, supervisions) pair

Caution

The normalize_text option removes all punctuation and converts all upper case to lower case. This includes removing possibly important punctuations such as dashes and apostrophes.

lhotse.recipes.download_earnings22(target_dir='.', force_download=False, url='https://github.com/revdotcom/speech-datasets')[source]

Download and untar the dataset. The extracted files are saved to target_dir/earnings22/. Please note that the github repository contains other additional datasets; using this call, you will be downloading all of them and then throwing the unused ones out.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • force_download (Optional[bool]) – Bool, if True, download the tar file even if it already exists.

  • url (Optional[str]) – str, the url to download the dataset.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_earnings22(corpus_dir, output_dir=None, normalize_text=False)[source]

Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the output_dir, it will simply read and return them.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir. The structure is expected to mimic the structure in the github repository, notably the mp3 files will be searched for in [corpus_dir]/media and transcriptions in the directory [corpus_dir]/transcripts/nlp_references

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

  • normalize_text (bool) – Bool, if True, normalize the text.

Return type:

Union[RecordingSet, SupervisionSet]

Returns:

(recordings, supervisions) pair

Caution

The normalize_text option removes all punctuation and converts all upper case to lower case. This includes removing possibly important punctuations such as dashes and apostrophes.

lhotse.recipes.download_edacc(target_dir='.', force_download=False, base_url='https://datashare.ed.ac.uk/download/')[source]

Download and extract the EDACC dataset.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • force_download (bool) – Bool, if True, download the data even if it exists.

  • base_url (str) – str, the url of the website used to fetch the archive from.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_edacc(corpus_dir, output_dir=None)[source]

Returns the manifests which consist of the Recordings and Supervisions.

Parameters:
  • corpus_dir (Union[Path, str]) – a path to the unzipped EDACC directory (has edacc_v1.0 inside).

  • output_dir (Union[Path, str, None]) – an optional path where to write the manifests.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a dict with structure {"dev|test": {"recordings|supervisions": <manifest>}}

lhotse.recipes.prepare_eval2000(corpus_dir, output_dir, transcript_path=None, absolute_paths=False, num_jobs=1)[source]

Prepares manifests for Eval2000.

Parameters:
  • corpus_dir (Union[Path, str]) – Path to the Eval2000 corpus dir.

  • output_dir (Union[Path, str]) – Directory where the manifests should be written. Can be omitted to avoid writing.

  • absolute_paths (bool) – Whether to return absolute or relative (to the corpus dir) paths for recordings.

Return type:

Dict[str, Union[RecordingSet, SupervisionSet]]

Returns:

A dict with manifests. The keys are: {'recordings', 'supervisions'}.

lhotse.recipes.prepare_fisher_english(corpus_dir, output_dir, audio_dirs=['LDC2004S13', 'LDC2005S13'], transcript_dirs=['LDC2004T19', 'LDC2005T19'], absolute_paths=False, num_jobs=1)[source]

Prepares manifests for Fisher English Part 1, 2. Script assumes that audio_dirs and transcript_dirs are in the corpus_path. We create two manifests: one with recordings, and the other one with text supervisions.

Parameters:
  • corpus_dir (Union[Path, str]) – Path to the Fisher corpus dir.

  • audio_dirs (List[str]) – List of dirs of audio corpora.

  • transcript_dirs (List[str]) – List of dirs of transcript corpora.

  • output_dir (Union[Path, str]) – Directory where the manifests should be written. Can be omitted to avoid writing.

  • absolute_paths (bool) – Whether to return absolute or relative (to the corpus dir) paths for recordings.

Return type:

Dict[str, Union[RecordingSet, SupervisionSet]]

Returns:

A dict with manifests. The keys are: {'recordings', 'supervisions'}.

lhotse.recipes.prepare_fisher_spanish(audio_dir_path, transcript_dir_path, output_dir=None, absolute_paths=False)[source]

Prepares manifests for Fisher Spanish. We create two manifests: one with recordings, and the other one with text supervisions.

Parameters:
  • audio_dir_path (Union[Path, str]) – Path to audio directory (usually LDC2010S01).

  • transcript_dir_path (Union[Path, str]) – Path to transcript directory (usually LDC2010T04).

  • output_dir (Union[Path, str, None]) – Directory where the manifests should be written. Can be omitted to avoid writing.

  • absolute_paths (bool) – Whether to return absolute or relative (to the corpus dir) paths for recordings.

Return type:

Dict[str, Union[RecordingSet, SupervisionSet]]

Returns:

A dict with manifests. The keys are: {'recordings', 'supervisions'}.

lhotse.recipes.prepare_gale_arabic(audio_dirs, transcript_dirs, output_dir=None, absolute_paths=True)[source]

Prepare manifests for GALE Arabic Broadcast speech corpus.

Parameters:
  • audio_dirs (List[Union[Path, str]]) – List of paths to audio corpora.

  • transcript_dirs (List[Union[Path, str]]) – List of paths to transcript corpora.

  • output_dir (Union[Path, str, None]) – Directory where the manifests should be written. Can be omitted to avoid writing.

Return type:

Dict[str, Union[RecordingSet, SupervisionSet]]

Returns:

A dict with manifests. The keys are: {'recordings', 'supervisions'}.

lhotse.recipes.prepare_gale_mandarin(audio_dirs, transcript_dirs, output_dir=None, absolute_paths=True, segment_words=False)[source]

Prepare manifests for GALE Mandarin Broadcast speech corpus.

Parameters:
  • audio_dirs (List[Union[Path, str]]) – List of paths to audio corpora.

  • transcript_dirs (List[Union[Path, str]]) – List of paths to transcript corpora.

  • output_dir (Union[Path, str, None]) – Directory where the manifests should be written. Can be omitted to avoid writing.

  • absolute_paths (Optional[bool]) – Whether to write absolute paths to audio sources (default = True)

  • segment_words (Optional[bool]) – Use jieba package to perform word segmentation (default = False)

Return type:

Dict[str, Union[RecordingSet, SupervisionSet]]

Returns:

A dict with manifests. The keys are: {'recordings', 'supervisions'}.

lhotse.recipes.prepare_gigaspeech(corpus_dir, output_dir, dataset_parts='auto', num_jobs=1)[source]
Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

lhotse.recipes.download_gigast(target_dir='.', languages='all', force_download=False)[source]

Download GigaST dataset

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • languages (Union[str, Sequence[str]]) – one of: ‘all’ (downloads all known languages); a single language code (e.g., ‘en’), or a list of language codes.

  • force_download (bool) – bool, if True, download the archive even if it already exists.

Return type:

Path

Returns:

the path to the downloaded data.

lhotse.recipes.prepare_gigast(corpus_dir, manifests_dir, output_dir, languages='auto', dataset_parts='auto')[source]

Returns the manifests which consist of the Recordings and Supervisions.

Parameters:
  • corpus_dir (Union[Path, str]) – Path to the GigaST dataset.

  • manifests_dir (Union[Path, str]) – Path to the GigaSpeech manifests.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

  • languages (Union[str, Sequence[str]]) – ‘auto’ (prepare all languages) or a list of language codes.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.

lhotse.recipes.download_grid(target_dir='.', force_download=False)[source]

Download and untar the GRID audio-visual speech corpus.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • force_download (bool) – Bool, if True, download the archive even if it already exists.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_grid(corpus_dir, output_dir=None, with_supervisions=True, num_jobs=1)[source]

Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the output_dir, it will simply read and return them.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

  • with_supervisions (bool) – bool, when False, we’ll only return recordings; when True, we’ll also return supervisions created from alignments, but might remove some recordings for which they are missing.

  • num_jobs (int) – int, number of parallel jobs.

Return type:

Dict[str, Union[RecordingSet, SupervisionSet]]

Returns:

a Dict whose key is the dataset part, and the value is Dicts with the keys ‘audio’ and ‘supervisions’.

lhotse.recipes.download_heroico(target_dir='.', force_download=False, url='http://www.openslr.org/resources/39')[source]
Return type:

Path

lhotse.recipes.prepare_heroico(speech_dir, transcript_dir, output_dir=None)[source]

Returns the manifests which consist of the Recordings and Supervisions

Parameters:
  • speech_dir (Union[Path, str]) – Pathlike, the path of the speech data dir.

  • transcript_dir (Union[Path, str]) – Pathlike, the path of the transcript data dir.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is the fold, and the value is Dicts with the keys ‘audio’ and ‘supervisions’.

lhotse.recipes.download_hifitts(target_dir='.', force_download=False, base_url='http://www.openslr.org/resources')[source]

Download and untar the HiFi TTS dataset.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • force_download (Optional[bool]) – Bool, if True, download the tars no matter if the tars exist.

  • base_url (Optional[str]) – str, the url of the OpenSLR resources.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_hifitts(corpus_dir, output_dir=None, num_jobs=1)[source]

Prepare manifests for the HiFiTTS dataset.

Parameters:
  • corpus_dir (Union[Path, str]) – Path or str, the path to the downloaded corpus main directory.

  • output_dir (Union[Path, str, None]) – Path or str, the path where to write the manifests.

  • num_jobs (int) – How many concurrent workers to use for preparing each dataset partition.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a dict with manifests for all the partitions (example query: manifests['92_clean_train']['recordings']).

lhotse.recipes.download_himia(target_dir='.', dataset_parts='auto', force_download=False, base_url='http://www.openslr.org/resources')[source]

Download and untar the HI_MIA and HI_MIA_CW datasets.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • dataset_parts (Union[str, Sequence[str], None]) – “auto”, “himia”, or a list of splits (e.g. “train”, “dev”, “test”, “cw_test”) to download.

  • force_download (bool) – Bool, if True, download the tars no matter if the tars exist.

  • base_url (str) – str, the url of the OpenSLR resources.

Return type:

Path

Returns:

the path to extracted directory with data.

lhotse.recipes.prepare_himia(corpus_dir, dataset_parts='auto', output_dir=None)[source]

Returns the manifests which consist of the Recordings and Supervisions.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.

  • dataset_parts (Union[str, Sequence[str]]) – “auto”, “himia”, or a list of splits (e.g. “train”, “dev”, “test”, “cw_test”) to prepare.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.

lhotse.recipes.prepare_icmcasr(corpus_dir, output_dir=None, mic='ihm', num_jobs=1)[source]

Returns the manifests which consist of the Recordings and Supervisions.

Parameters:
  • corpus_dir (Union[Path, str]) – Path to the ICMC-ASR dataset.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.

lhotse.recipes.download_icsi(target_dir='.', audio_dir=None, transcripts_dir=None, force_download=False, url='http://groups.inf.ed.ac.uk/ami', mic='ihm')[source]

Download ICSI audio and annotations for the provided microphone setting.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path in which the audio and transcripts dirs are created by default.

  • audio_dir (Union[Path, str, None]) – Pathlike (default = ‘<target_dir>/audio’), the path to store the audio data.

  • transcripts_dir (Union[Path, str, None]) – Pathlike (default = ‘<target_dir>/transcripts’), the path to store the transcripts data.

  • force_download (Optional[bool]) – bool (default = False), if True, download even if the file is present.

  • url (Optional[str]) – str (default = ‘http://groups.inf.ed.ac.uk/ami’), download URL.

  • mic (Optional[str]) – str {‘ihm’, ‘ihm-mix’, ‘sdm’, ‘mdm’}, type of mic setting.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_icsi(audio_dir, transcripts_dir=None, output_dir=None, mic='ihm', normalize_text='kaldi', save_to_wav=False)[source]

Returns the manifests which consist of the Recordings and Supervisions.

Parameters:
  • audio_dir (Union[Path, str]) – Pathlike, the path which holds the audio data.

  • transcripts_dir (Union[Path, str, None]) – Pathlike, the path which holds the transcripts data.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests – None means manifests aren’t stored on disk.

  • mic (Optional[str]) – str {‘ihm’, ‘ihm-mix’, ‘sdm’, ‘mdm’}, type of mic to use.

  • normalize_text (str) – str {‘none’, ‘upper’, ‘kaldi’}, normalization of the text.

  • save_to_wav (bool) – bool, whether to save the sph audio to wav format.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose keys are (‘train’, ‘dev’, ‘test’), and the values are dicts of manifests under the keys ‘recordings’ and ‘supervisions’.

lhotse.recipes.prepare_iwslt22_ta(corpus_dir, splits, output_dir=None, normalize_text=False, langs=['ta', 'eng'], num_jobs=1)[source]

Prepares manifests for the train, dev, and test1 splits.

Parameters:
  • corpus_dir (Union[Path, str]) – Path to the LDC2022E01 data dir.

  • splits (Union[Path, str]) – Path to splits from https://github.com/kevinduh/iwslt22-dialect

  • normalize_text (bool) – Bool, if True, Arabic text normalization is performed from https://aclanthology.org/2022.iwslt-1.29.pdf.

  • output_dir (Union[Path, str, None]) – Directory where the manifests should be written. Can be omitted to avoid writing.

  • langs (Optional[List[str]]) – str, list of language abbreviations for source and target languages.

  • num_jobs (int) – int, the number of jobs to use for parallel processing.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is the split, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.

lhotse.recipes.prepare_kespeech(corpus_dir, output_dir, dataset_parts='all', num_jobs=1)[source]
lhotse.recipes.prepare_l2_arctic(corpus_dir, output_dir=None)[source]

Prepares and returns the L2 Arctic manifests which consist of Recordings and Supervisions.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a dict with keys “read” and “spontaneous”. Each holds another dict of {‘recordings’: …, ‘supervisions’: …}.

lhotse.recipes.download_libricss(target_dir, force_download=False)[source]

Downloads the LibriCSS data from Google Drive and extracts it.

Parameters:
  • target_dir (Union[Path, str]) – the directory where the LibriCSS data will be saved.

  • force_download (bool) – if True, it will download the LibriCSS data even if it is already present.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_libricss(corpus_dir, output_dir=None, type='mdm', segmented_cuts=False)[source]

Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the output_dir, it will simply read and return them.

NOTE: The recordings contain all 7 channels. If you want to use only one channel, you can use either recording.load_audio(channel=0) or MonoCut(id=...,recording=recording,channel=0) while creating the CutSet.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path to the extracted corpus.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

  • type (str) – str, the type of data to prepare (‘mdm’, ‘sdm’, ‘ihm-mix’, or ‘ihm’). These settings are similar to the ones in AMI and ICSI recipes.

  • segmented_cuts (bool) – bool, if True, it will also return 1-minute segments (as described in the original paper) in the form of a CutSet. These are saved under the key segments in the returned Dict. May be useful for evaluating multi-talker ASR systems, e.g., in this paper: https://arxiv.org/abs/2109.08555.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet, CutSet]]]

Returns:

a Dict whose key is the dataset part, and the value is Dicts with the keys ‘audio’ and ‘supervisions’.
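
For illustration, a minimal sketch of the single-channel selection mentioned in the note above; the variable recording stands for any 7-channel LibriCSS Recording taken from the prepared manifests:

>>> from lhotse import MonoCut
>>> cut = MonoCut(id=recording.id, start=0.0, duration=recording.duration,
...               channel=0, recording=recording)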

lhotse.recipes.prepare_librilight(corpus_dir, output_dir=None, num_jobs=1)[source]

Returns the manifests which consist of the Recordings and Supervisions.

Parameters:
  • corpus_dir (Union[Path, str]) – Path to the LibriLight dataset.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.

lhotse.recipes.download_librimix(target_dir='.', force_download=False, url='https://zenodo.org/record/3871592/files/MiniLibriMix.zip')[source]
Return type:

Path

lhotse.recipes.prepare_librimix(librimix_csv, output_dir=None, with_precomputed_mixtures=False, sampling_rate=16000, min_segment_seconds=3.0)[source]
Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

lhotse.recipes.download_librispeech(target_dir='.', dataset_parts='mini_librispeech', force_download=False, alignments=False, base_url='http://www.openslr.org/resources', alignments_url='https://drive.google.com/uc?id=1WYfgr31T-PPwMcxuAq09XZfHQO5Mw8fE')[source]

Download and untar the dataset, supporting both LibriSpeech and MiniLibrispeech

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • dataset_parts (Union[str, Sequence[str], None]) – “librispeech”, “mini_librispeech”, or a list of splits (e.g. “dev-clean”) to download.

  • force_download (bool) – Bool, if True, download the tars no matter if the tars exist.

  • alignments (bool) – should we download the alignments. The original source is: https://github.com/CorentinJ/librispeech-alignments

  • base_url (str) – str, the url of the OpenSLR resources.

  • alignments_url (str) – str, the url of LibriSpeech word alignments

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_librispeech(corpus_dir, alignments_dir=None, dataset_parts='auto', output_dir=None, normalize_text='none', num_jobs=1)[source]

Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the output_dir, it will simply read and return them.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.

  • alignments_dir (Union[Path, str, None]) – Pathlike, the path of the alignments dir. By default, it is the same as corpus_dir.

  • dataset_parts (Union[str, Sequence[str]]) – string or sequence of strings representing dataset part names, e.g. ‘train-clean-100’, ‘train-clean-5’, ‘dev-clean’. By default we will infer which parts are available in corpus_dir.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

  • normalize_text (str) – str, “none” or “lower”, for “lower” the transcripts are converted to lower-case.

  • num_jobs (int) – int, number of parallel threads used for ‘parse_utterance’ calls.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is the dataset part, and the value is Dicts with the keys ‘audio’ and ‘supervisions’.
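
For orientation, a minimal sketch of the usual download-then-prepare flow; whether the returned path can be passed directly as corpus_dir, and the part name in the final lookup, are assumptions that depend on the extracted layout:

>>> from lhotse.recipes import download_librispeech, prepare_librispeech
>>> corpus_dir = download_librispeech('data', dataset_parts='mini_librispeech')
>>> manifests = prepare_librispeech(corpus_dir, output_dir='data/manifests')
>>> recordings = manifests['dev-clean-2']['recordings']  # assumed part name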

lhotse.recipes.download_libritts(target_dir='.', use_librittsr=False, dataset_parts='all', force_download=False, base_url='http://www.openslr.org/resources')[source]

Download and untar the LibriTTS dataset.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • use_librittsr (bool) – Bool, if True, we’ll download the LibriTTS-R dataset instead.

  • dataset_parts (Union[str, Sequence[str], None]) – “all”, or a list of splits (e.g. “dev-clean”) to download.

  • force_download (Optional[bool]) – Bool, if True, download the tars no matter if the tars exist.

  • base_url (Optional[str]) – str, the url of the OpenSLR resources.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.download_librittsr(target_dir='.', dataset_parts='all', force_download=False, base_url='http://www.openslr.org/resources')[source]

Download and untar the LibriTTS-R dataset.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • dataset_parts (Union[str, Sequence[str], None]) – “all”, or a list of splits (e.g. “dev-clean”) to download.

  • force_download (Optional[bool]) – Bool, if True, download the tars no matter if the tars exist.

  • base_url (Optional[str]) – str, the url of the OpenSLR resources.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_libritts(corpus_dir, dataset_parts='all', output_dir=None, num_jobs=1, link_previous_utt=False)[source]

Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the output_dir, it will simply read and return them.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.

  • dataset_parts (Union[str, Sequence[str]]) – string or sequence of strings representing dataset part names, e.g. ‘train-clean-100’, ‘train-clean-5’, ‘dev-clean’. By default we will infer which parts are available in corpus_dir.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

  • num_jobs (int) – the number of parallel workers parsing the data.

  • link_previous_utt (bool) – If True, adds the previous utterance ID to supervisions. Useful for reconstructing chains of utterances as they were read. If the previous utterance was skipped in the LibriTTS dataset, the previous_utt label is None.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is the dataset part, and the value is Dicts with the keys ‘audio’ and ‘supervisions’.

lhotse.recipes.prepare_librittsr(corpus_dir, dataset_parts='all', output_dir=None, num_jobs=1, link_previous_utt=False)

Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the output_dir, it will simply read and return them.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.

  • dataset_parts (Union[str, Sequence[str]]) – string or sequence of strings representing dataset part names, e.g. ‘train-clean-100’, ‘train-clean-5’, ‘dev-clean’. By default we will infer which parts are available in corpus_dir.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

  • num_jobs (int) – the number of parallel workers parsing the data.

  • link_previous_utt (bool) – If True, adds the previous utterance ID to supervisions. Useful for reconstructing chains of utterances as they were read. If the previous utterance was skipped in the LibriTTS dataset, the previous_utt label is None.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is the dataset part, and the value is Dicts with the keys ‘audio’ and ‘supervisions’.

lhotse.recipes.download_ljspeech(target_dir='.', force_download=False)[source]
Return type:

Path

lhotse.recipes.prepare_ljspeech(corpus_dir, output_dir=None)[source]

Returns the manifests which consist of the Recordings and Supervisions

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

Return type:

Dict[str, Union[RecordingSet, SupervisionSet]]

Returns:

The RecordingSet and SupervisionSet with the keys ‘audio’ and ‘supervisions’.

lhotse.recipes.download_magicdata(target_dir='.', force_download=False, base_url='http://www.openslr.org/resources')[source]

Download and untar the dataset.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • force_download (bool) – Bool, if True, download the tars no matter if the tars exist.

  • base_url (str) – str, the url of the OpenSLR resources.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_magicdata(corpus_dir, output_dir=None)[source]

Returns the manifests which consist of the Recordings and Supervisions.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.

lhotse.recipes.download_medical(target_dir='.', force_download=False)[source]

Download and unzip Medical dataset.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • force_download (bool) – bool, if True, download the archive even if it already exists.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_medical(corpus_dir, output_dir=None, num_jobs=1)[source]

Returns the manifests which consist of the Recordings and Supervisions.

Parameters:
  • corpus_dir (Union[Path, str]) – Path to the Medical dataset.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.

lhotse.recipes.prepare_mgb2(corpus_dir, output_dir, text_cleaning=True, buck_walter=False, num_jobs=1, mer_thresh=80)[source]

Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the output_dir, it will simply read and return them.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.

  • output_dir (Union[Path, str]) – Pathlike, the path where to write the manifests.

  • text_cleaning (bool) – Bool, if True, basic text cleaning is performed (similar to ESPNet recipe).

  • buck_walter (bool) – Bool, use BuckWalter transliteration

  • num_jobs (int) – int, the number of jobs to use for parallel processing.

  • mer_thresh (int) – int, filter out segments whose MER (Match Error Rate) exceeds this threshold.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is the dataset part, and the value is Dicts with the keys ‘audio’ and ‘supervisions’.

Note

Unlike other recipes, output_dir is not Optional here because we write the manifests to the output directory while processing to avoid OOM issues, since it is a large dataset.

Caution

The text_cleaning option removes all punctuation and diacritics.

lhotse.recipes.prepare_mls(corpus_dir, output_dir=None, opus=True, num_jobs=1)[source]

Prepare Multilingual LibriSpeech corpus.

Returns a dict structured like the following:

{
    'english': {
        'train': {'recordings': RecordingSet(...), 'supervisions': SupervisionSet(...)},
        'dev': ...,
        'test': ...
    },
    'polish': { ... },
    ...
}
Parameters:
  • corpus_dir (Union[Path, str]) – Path to the corpus root (directories with specific languages should be inside).

  • output_dir (Union[Path, str, None]) – Optional path where the manifests should be stored.

  • opus (bool) – Should we scan for OPUS files (otherwise we’ll look for FLAC files).

  • num_jobs (int) – How many jobs should be used for creating recording manifests.

Return type:

Dict[str, Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]]

Returns:

A dict with structure: d[language][split] = {recordings, supervisions}.
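
As a hedged usage sketch of the nested structure above (paths are placeholders):

>>> from lhotse.recipes import prepare_mls
>>> manifests = prepare_mls('/data/mls', output_dir='data/manifests', opus=True, num_jobs=4)
>>> english_train_recordings = manifests['english']['train']['recordings']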

lhotse.recipes.download_mobvoihotwords(target_dir='.', force_download=False, base_url='http://www.openslr.org/resources')[source]

Download and untar the dataset.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • force_download (Optional[bool]) – Bool, if True, download the tars no matter if the tars exist.

  • base_url (Optional[str]) – str, the url of the OpenSLR resources.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_mobvoihotwords(corpus_dir, output_dir=None)[source]

Returns the manifests which consist of the Recordings and Supervisions

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is the dataset part, and the value is Dicts with the keys ‘audio’ and ‘supervisions’.

lhotse.recipes.download_mtedx(target_dir='.', languages='all')[source]

Download and untar the dataset.

Parameters:
  • target_dir – Pathlike, the path of the directory where the mtedx_corpus directory will be created and to which data will be downloaded.

  • languages – A str or sequence of strings specifying which languages to download. The default ‘all’ downloads all available languages.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_mtedx(corpus_dir, output_dir=None, languages='all', num_jobs=1)[source]

Prepares manifests for all requested MTEDx languages.

Parameters:
  • corpus_dir (Union[Path, str]) – Path to the root where MTEDx data was downloaded. It should be called mtedx_corpus.

  • output_dir (Union[Path, str, None]) – Root directory where .json manifests are stored.

  • languages (Union[str, Sequence[str], None]) – str or str sequence specifying the languages to prepare. The str ‘all’ prepares all languages.

Return type:

Dict[str, Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]]

Returns:

a dict with structure: d[language][split] = {recordings, supervisions}.

lhotse.recipes.download_musan(target_dir='.', url='https://www.openslr.org/resources/17/musan.tar.gz', force_download=False)[source]

Download and untar the MUSAN corpus.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • url (Optional[str]) – str, the url that downloads file called “musan.tar.gz”.

  • force_download (Optional[bool]) – bool, if True, download the archive even if it already exists.

Return type:

Path

lhotse.recipes.prepare_musan(corpus_dir, output_dir=None, parts=('music', 'speech', 'noise'), use_vocals=True)[source]
Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]
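
For illustration, a hedged sketch of querying the prepared parts; the assumption is that the outer dict is keyed by the requested part names:

>>> from lhotse.recipes import prepare_musan
>>> musan = prepare_musan('/data/musan', parts=('music', 'noise'), use_vocals=False)
>>> noise_recordings = musan['noise']['recordings']  # assumed keying by part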

lhotse.recipes.prepare_nsc(corpus_dir, dataset_part='PART3_SameCloseMic', output_dir=None, num_jobs=1)[source]

Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the output_dir, it will simply read and return them.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path to the raw corpus distribution.

  • dataset_part (str) – str, name of the dataset part to be prepared.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

Return type:

Dict[str, Union[RecordingSet, SupervisionSet]]

Returns:

a Dict whose key is the dataset part, and the value is Dicts with the keys ‘audio’ and ‘supervisions’.

lhotse.recipes.prepare_peoples_speech(corpus_dir, output_dir=None, num_jobs=1)[source]

Prepare RecordingSet and SupervisionSet manifests for The People’s Speech.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the main data dir.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

Return type:

Dict[str, Union[RecordingSet, SupervisionSet]]

Returns:

a dict with keys “recordings” and “supervisions” with lazily opened manifests.

lhotse.recipes.download_rir_noise(target_dir='.', url='https://www.openslr.org/resources/28/rirs_noises.zip', force_download=False)[source]

Download and untar the RIR Noise corpus.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • url (Optional[str]) – str, the url that downloads file called “rirs_noises.zip”.

  • force_download (Optional[bool]) – bool, if True, download the archive even if it already exists.

Return type:

Path

lhotse.recipes.prepare_rir_noise(corpus_dir, output_dir=None, parts=('point_noise', 'iso_noise', 'real_rir', 'sim_rir'))[source]

Prepare the RIR Noise corpus.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • output_dir (Union[Path, str, None]) – Pathlike, the path of the dir to write the manifests.

  • parts (Sequence[str]) – Sequence[str], the parts of the dataset to prepare.

Return type:

Dict[str, Dict[str, Union[RecordingSet, CutSet]]]

The corpus contains four parts: point-source noises (point_noise), isotropic noises (iso_noise), real RIRs (real_rir), and simulated RIRs (sim_rir). We prepare these parts under the corresponding dict keys.

lhotse.recipes.prepare_slu(corpus_dir, output_dir=None)[source]
Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

lhotse.recipes.download_speechcommands(speechcommands_version='2', target_dir='.', force_download=False)[source]

Download and unzip the Speech Commands dataset.

Parameters:
  • speechcommands_version (str) – str, dataset version.

  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • force_download (bool) – bool, if True, download the archive even if it already exists.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_speechcommands(speechcommands_version, corpus_dir, output_dir=None)[source]

Returns the manifests which consist of the Recordings and Supervisions.

Parameters:
  • speechcommands_version (str) – str, dataset version.

  • corpus_dir (Union[Path, str]) – Path to the Speech Commands dataset.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.

lhotse.recipes.download_spgispeech(target_dir='.')[source]

Download and untar the dataset.

NOTE: This function just returns with a message since SPGISpeech is not available for direct download.

Parameters:

target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

Return type:

None

lhotse.recipes.prepare_spgispeech(corpus_dir, output_dir, normalize_text=True, num_jobs=1)[source]

Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the output_dir, it will simply read and return them.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.

  • output_dir (Union[Path, str]) – Pathlike, the path where to write the manifests.

  • normalize_text (bool) – Bool, if True, normalize the text (similar to ESPNet recipe).

  • num_jobs (int) – int, the number of jobs to use for parallel processing.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is the dataset part, and the value is Dicts with the keys ‘audio’ and ‘supervisions’.

Note

Unlike other recipes, output_dir is not Optional here because we write the manifests to the output directory while processing to avoid OOM issues, since it is a large dataset.

Caution

The normalize_text option removes all punctuation and converts all upper case to lower case. This includes removing possibly important punctuations such as dashes and apostrophes.

lhotse.recipes.download_stcmds(target_dir='.', force_download=False, base_url='http://www.openslr.org/resources')[source]

Download and untar the dataset.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • force_download (bool) – Bool, if True, download the tars no matter if the tars exist.

  • base_url (str) – str, the url of the OpenSLR resources.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_stcmds(corpus_dir, output_dir=None)[source]

Returns the manifests which consist of the Recordings and Supervisions.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.

lhotse.recipes.prepare_switchboard(audio_dir, transcripts_dir=None, sentiment_dir=None, output_dir=None, omit_silence=True, absolute_paths=False)[source]

Prepare manifests for the Switchboard corpus. We create two manifests: one with recordings, and the other one with text supervisions. When sentiment_dir is provided, we create another supervision manifest with sentiment annotations.

Parameters:
  • audio_dir (Union[Path, str]) – Path to LDC97S62 package.

  • transcripts_dir (Union[Path, str, None]) – Path to the transcripts directory (typically named “swb_ms98_transcriptions”). If not provided, the transcripts will be downloaded.

  • sentiment_dir (Union[Path, str, None]) – Optional path to LDC2020T14 package which contains sentiment annotations for SWBD segments.

  • output_dir (Union[Path, str, None]) – Directory where the manifests should be written. Can be omitted to avoid writing.

  • omit_silence (bool) – Whether supervision segments with [silence] token should be removed or kept.

  • absolute_paths (bool) – Whether to return absolute or relative (to the corpus dir) paths for recordings.

Return type:

Dict[str, Union[RecordingSet, SupervisionSet]]

Returns:

A dict with manifests. The keys are: {'recordings', 'supervisions'}.
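
For illustration, a minimal sketch (the LDC package path is a placeholder):

>>> from lhotse.recipes import prepare_switchboard
>>> swbd = prepare_switchboard('/data/LDC97S62', output_dir='data/manifests')
>>> recordings, supervisions = swbd['recordings'], swbd['supervisions']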

lhotse.recipes.download_tedlium(target_dir='.', force_download=False)[source]
Return type:

Path

lhotse.recipes.prepare_tedlium(tedlium_root, output_dir=None, dataset_parts=('train', 'dev', 'test'), num_jobs=1, normalize_text='none')[source]

Prepare manifests for the TED-LIUM v3 corpus.

The manifests are created in a dict with three splits: train, dev and test. Each split contains a RecordingSet and SupervisionSet in a dict under keys ‘recordings’ and ‘supervisions’.

Parameters:
  • tedlium_root (Union[Path, str]) – Path to the unpacked TED-LIUM data.

  • output_dir (Union[Path, str, None]) – Path where the manifests should be written.

  • dataset_parts (Union[str, Sequence[str]]) – Which parts of the dataset to prepare. By default, all parts are prepared.

  • num_jobs (int) – Number of parallel jobs to use.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

A dict with standard corpus splits containing the manifests.

lhotse.recipes.download_thchs_30(target_dir='.', force_download=False, base_url='http://www.openslr.org/resources')[source]

Download and untar the dataset.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • force_download (bool) – Bool, if True, download the tars no matter if the tars exist.

  • base_url (str) – str, the url of the OpenSLR resources.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_thchs_30(corpus_dir, output_dir=None)[source]

Returns the manifests which consist of the Recordings and Supervisions.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.

lhotse.recipes.download_this_american_life(target_dir='.', force_download=False, metadata_url='https://ipfs.io/ipfs/bafybeidyt3ch6t4dtu2ehdriod3jvuh34qu4pwjyoba2jrjpmqwckkr6q4/this_american_life.zip', website_url='https://thisamericanlife.org')[source]
lhotse.recipes.prepare_this_american_life(corpus_dir, output_dir=None)[source]
lhotse.recipes.download_timit(target_dir='.', force_download=False, base_url='https://data.deepai.org/timit.zip')[source]

Download and unzip the TIMIT dataset.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • force_download (bool) – bool, if True, download the zips no matter if the zips exist.

  • base_url (Optional[str]) – str, the URL of the TIMIT dataset to download.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_timit(corpus_dir, output_dir=None, num_phones=48, num_jobs=1)[source]

Returns the manifests which consist of the Recordings and Supervisions.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write and save the manifests.

  • num_phones (int) – int, the number of phones (60, 48 or 39) for modeling; 48 is the default value.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is the dataset part, and the value is Dicts with the keys ‘audio’ and ‘supervisions’.

lhotse.recipes.download_uwb_atcc(target_dir='.', force_download=False)[source]
Return type:

Path

lhotse.recipes.prepare_uwb_atcc(corpus_dir, output_dir=None, silence_sym='', breath_sym='', noise_sym='', foreign_sym='<unk>', partial_sym='<unk>', unintelligble_sym='<unk>', unknown_sym='<unk>')[source]

Returns the manifests which consist of the Recordings and Supervisions

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

  • silence_sym (Optional[str]) – str, silence symbol

  • breath_sym (Optional[str]) – str, breath symbol

  • noise_sym (Optional[str]) – str, noise symbol

  • foreign_sym (Optional[str]) – str, foreign symbol. When set to None, will output foreign words.

  • partial_sym (Optional[str]) – str, partial symbol. When set to None, will output partial words.

  • unintelligble_sym (Optional[str]) – str, unintelligible symbol. When set to None, will output unintelligible words.

  • unknown_sym (Optional[str]) – str, unknown symbol

Return type:

Dict[str, Union[RecordingSet, SupervisionSet]]

Returns:

The RecordingSet and SupervisionSet with the keys ‘audio’ and ‘supervisions’.

lhotse.recipes.download_vctk(target_dir='.', force_download=False, use_edinburgh_vctk_url=False, url='http://www.udialogue.org/download/VCTK-Corpus.tar.gz')[source]

Download and untar/unzip the VCTK dataset.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • force_download (Optional[bool]) – Bool, if True, download the tars no matter if the tars exist.

  • url (Optional[str]) – str, the url of tarred/zipped VCTK corpus.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_vctk(corpus_dir, output_dir=None, use_edinburgh_vctk_url=False, mic_id='mic2')[source]

Prepares and returns the VCTK manifests which consist of Recordings and Supervisions.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

  • use_edinburgh_vctk_url (Optional[bool]) – Bool, set it to True if the dataset was downloaded using the Edinburgh VCTK URL.

  • mic_id (Optional[str]) – str, the microphone ID to use; the default is mic2.

Return type:

Dict[str, Union[RecordingSet, SupervisionSet]]

Returns:

a dict with the keys ‘recordings’ and ‘supervisions’.

Note: when downloading the VCTK dataset from the Edinburgh URL, there are some points to be aware of:
  • All the speeches from speaker p315 will be skipped due to the lack of the corresponding text files.

  • All the speeches from speaker p280 will be skipped for mic_id="mic2" due to the lack of the audio files.

  • Some of the speeches from speaker p362 will be skipped due to the lack of the audio files.

  • See Also: https://datashare.is.ed.ac.uk/handle/10283/3443

lhotse.recipes.download_voxceleb1(target_dir='.', force_download=False)[source]

Download and unzip the VoxCeleb1 data.

Note

A “connection refused” error may occur if you are downloading without a password.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • force_download (Optional[bool]) – bool, if True, download the archive even if it already exists.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.download_voxceleb2(target_dir='.', force_download=False)[source]

Download and unzip the VoxCeleb2 data.

Note

A “connection refused” error may occur if you are downloading without a password.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • force_download (Optional[bool]) – bool, if True, download the archive even if it already exists.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_voxceleb(voxceleb1_root=None, voxceleb2_root=None, output_dir=None, num_jobs=1)[source]

Prepare manifests for the VoxCeleb v1 and v2 corpora.

The manifests are created in a dict with three splits: train, dev and test, for each of the two versions. Each split contains a RecordingSet and SupervisionSet in a dict under keys ‘recordings’ and ‘supervisions’.

Parameters:
  • voxceleb1_root (Union[Path, str, None]) – Path to the VoxCeleb v1 dataset.

  • voxceleb2_root (Union[Path, str, None]) – Path to the VoxCeleb v2 dataset.

  • output_dir (Union[Path, str, None]) – Path to the output directory.

  • num_jobs (int) – Number of parallel jobs to run.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

A dict with standard corpus splits (“train” and “test”) containing the manifests.

NOTE: We prepare the data using the Kaldi style split, i.e., the whole VoxCeleb2 (“dev” and “test”) and the training portion (“dev”) of VoxCeleb1 are put into the “train” split. The “test” split contains the “test” portion of VoxCeleb1. So if VoxCeleb1 is not provided, no “test” split is created in the output manifests.

Example usage:

>>> from lhotse.recipes.voxceleb import prepare_voxceleb
>>> manifests = prepare_voxceleb(voxceleb1_root='/path/to/voxceleb1',
...                              voxceleb2_root='/path/to/voxceleb2',
...                              output_dir='/path/to/output',
...                              num_jobs=4)

NOTE: If VoxCeleb1 is provided, we also prepare the trials file using the list provided in http://www.openslr.org/resources/49/voxceleb1_test_v2.txt. This file is used in the Kaldi recipes for VoxCeleb speaker verification. This is prepared as 2 tuples of the form (CutSet, CutSet) with identical IDs, one for each of positive pairs and negative pairs. These are stored in the dict under keys ‘pos_trials’ and ‘neg_trials’, respectively. For evaluation purposes, the lhotse.dataset.sampling.CutPairsSampler can be used to sample from this tuple.
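
Following the note above, a hedged sketch of accessing the trials (the keys are as documented; the roles of the two CutSets within a pair are an assumption):

>>> pos_trials, neg_trials = manifests['pos_trials'], manifests['neg_trials']
>>> first_cuts, second_cuts = pos_trials  # (CutSet, CutSet) with identical IDs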

lhotse.recipes.download_voxconverse(corpus_dir, force_download=False)[source]
lhotse.recipes.prepare_voxconverse(corpus_dir, output_dir=None, split_test=False)[source]
lhotse.recipes.download_voxpopuli(target_dir='.', subset='asr')[source]

Download and untar/unzip the VoxPopuli dataset.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • subset (Optional[str]) – str, the subset of the dataset to download, can be one of “400k”, “100k”, “10k”, “asr”, or any of the languages in LANGUAGES or LANGUAGES_V2.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_voxpopuli(corpus_dir, output_dir=None, task='asr', lang='en', source_lang=None, target_lang=None, num_jobs=1)[source]

Prepares and returns the VoxPopuli manifests which consist of Recordings and Supervisions.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

  • task (str) – str, the task to prepare the manifests for, can be one of “asr”, “s2s”, “lm”.

  • lang (str) – str, the language to prepare the manifests for, can be one of LANGUAGES or LANGUAGES_V2. This is used for “asr” and “lm” tasks.

  • source_lang (Optional[str]) – str, the source language for the s2s task, can be one of S2S_SRC_LANGUAGES.

  • target_lang (Optional[str]) – str, the target language for the s2s task, can be one of S2S_TGT_LANGUAGES.

  • num_jobs (int) – int, the number of parallel jobs to use for preparing the manifests.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]], the manifests.
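
For example, a hedged sketch of the s2s task (‘en’ and ‘de’ are assumed to be members of S2S_SRC_LANGUAGES and S2S_TGT_LANGUAGES, respectively):

>>> from lhotse.recipes import prepare_voxpopuli
>>> manifests = prepare_voxpopuli('/data/voxpopuli', task='s2s',
...                               source_lang='en', target_lang='de')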

lhotse.recipes.prepare_wenet_speech(corpus_dir, dataset_parts='all', output_dir=None, num_jobs=1)[source]

Returns the manifests which consist of the Recordings and Supervisions.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.

  • dataset_parts (Union[str, Sequence[str]]) – Which parts of the dataset to prepare, all for all the parts.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

  • num_jobs (int) – Number of workers to extract manifests.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.

lhotse.recipes.download_xbmu_amdo31(target_dir='.')[source]

Download and untar the dataset.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_xbmu_amdo31(corpus_dir, output_dir=None)[source]

Returns the manifests which consist of the Recordings and Supervisions.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.

lhotse.recipes.download_yesno(target_dir='.', force_download=False, url='http://www.openslr.org/resources/1/waves_yesno.tar.gz')[source]

Download and untar the dataset. The extracted files are saved to target_dir/waves_yesno/*.wav.

Parameters:
  • target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.

  • force_download (Optional[bool]) – Bool, if True, download the tar file no matter whether it exists or not.

  • url (Optional[str]) – str, the url to download the dataset.

Return type:

Path

Returns:

the path to downloaded and extracted directory with data.

lhotse.recipes.prepare_yesno(corpus_dir, output_dir=None)[source]

Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the output_dir, it will simply read and return them.

Parameters:
  • corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir. It’s expected to contain wave files with the pattern x_x_x_x_x_x_x_x.wav, where there are 8 x’s and each x is either 1 or 0.

  • output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.

Return type:

Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

Returns:

a Dict whose key is either “train” or “test”, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.
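
As a minimal end-to-end sketch (whether the returned path already points at the waves_yesno subdirectory is an assumption; adjust as needed):

>>> from lhotse.recipes import download_yesno, prepare_yesno
>>> download_yesno('data')  # extracts to data/waves_yesno
>>> manifests = prepare_yesno('data/waves_yesno')
>>> train_recordings = manifests['train']['recordings']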

Kaldi conversion

Convenience methods used to interact with Kaldi data directories.

lhotse.kaldi.floor_duration_to_milliseconds(duration)[source]

Floor the duration to multiples of 0.001 seconds. This is to avoid float precision problems with workflows like:

lhotse kaldi import …
lhotse fix …
./local/compute_fbank_imported.py (from icefall)
lhotse cut trim-to-supervisions …
./local/validate_manifest.py … (from icefall)

Without flooring, there were different lengths:

Supervision end time 1093.33995833 is larger than cut end time 1093.3399375

This is still within the 2 ms tolerance in K2SpeechRecognitionDataset::validate_for_asr():

https://github.com/lhotse-speech/lhotse/blob/master/lhotse/dataset/speech_recognition.py#L201

Return type:

float
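
For illustration (the printed value assumes standard float repr):

>>> from lhotse.kaldi import floor_duration_to_milliseconds
>>> floor_duration_to_milliseconds(1093.33995833)  # floored to a 1 ms grid
1093.339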

lhotse.kaldi.get_duration(path)[source]

Read an audio file; supports both Kaldi-style pipe (pipeline) wave paths and regular waveform files.

Parameters:

path (Union[Path, str]) – Path to an audio file or a Kaldi-style pipe.

Return type:

Optional[float]

Returns:

duration of the recording in seconds, or None in case of a read error.

lhotse.kaldi.load_kaldi_data_dir(path, sampling_rate, frame_shift=None, map_string_to_underscores=None, use_reco2dur=True, num_jobs=1, feature_type='kaldi-fbank')[source]

Load a Kaldi data directory and convert it to Lhotse RecordingSet and SupervisionSet manifests. For this to work, at least the wav.scp file must exist. A SupervisionSet is created only when a segments file exists. reco2dur is used by default when it exists (to force reading the durations from the audio files themselves, set use_reco2dur=False). All the other files (text, utt2spk, etc.) are optional, and some of them might not be handled yet. In particular, feats.scp files are ignored.

Parameters:
  • path (Union[Path, str]) – Path to the Kaldi data directory.

  • sampling_rate (int) – Sampling rate of the recordings.

  • frame_shift (Optional[float]) – Optional, if specified, we will create a Features manifest and store the frame_shift value in it.

  • map_string_to_underscores (Optional[str]) – optional string; when specified, we will replace all instances of this string in SupervisionSegment IDs with underscores. This is to help with handling underscores in Kaldi (see export_to_kaldi()). This is also done for speaker IDs.

  • use_reco2dur (bool) – If True, we will use the reco2dur file to read the durations of the recordings. If False, we will read the durations from the audio files themselves.

  • num_jobs (int) – Number of parallel jobs to use when reading the audio files.

Return type:

Tuple[RecordingSet, Optional[SupervisionSet], Optional[FeatureSet]]
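
A hedged sketch of importing a Kaldi data dir and saving the result (the to_file serialization call is assumed from Lhotse’s standard manifest API):

>>> from lhotse.kaldi import load_kaldi_data_dir
>>> recordings, supervisions, features = load_kaldi_data_dir('data/train', sampling_rate=16000)
>>> recordings.to_file('manifests/recordings.jsonl.gz')
>>> if supervisions is not None:  # only present when a segments file existed
...     supervisions.to_file('manifests/supervisions.jsonl.gz')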

lhotse.kaldi.export_to_kaldi(recordings, supervisions, output_dir, map_underscores_to=None, prefix_spk_id=False)[source]

Export a pair of RecordingSet and SupervisionSet to a Kaldi data directory. It supports recordings that have multiple channels, but each recording must still have a single AudioSource.

The RecordingSet and SupervisionSet must be compatible, i.e. it must be possible to create a CutSet out of them.

Parameters:
  • recordings (RecordingSet) – a RecordingSet manifest.

  • supervisions (SupervisionSet) – a SupervisionSet manifest.

  • output_dir (Union[Path, str]) – path where the Kaldi-style data directory will be created.

  • map_underscores_to (Optional[str]) – optional string with which we will replace all underscores. This helps avoid issues with Kaldi data dir sorting.

  • prefix_spk_id (Optional[bool]) – add speaker_id as a prefix of utterance_id (this is to ensure correct sorting inside files which is required by Kaldi)

Note

If you export a RecordingSet with multiple channels, then the resulting Kaldi data directory may not be back-compatible with Lhotse (i.e. you won’t be able to import it back to Lhotse in the same form). This is because Kaldi does not inherently support multi-channel recordings, so we have to break them down into single-channel recordings.
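
For illustration, a minimal sketch (recordings and supervisions are placeholders for a compatible pair of manifests):

>>> from lhotse.kaldi import export_to_kaldi
>>> export_to_kaldi(recordings, supervisions,
...                 output_dir='data/train_kaldi', prefix_spk_id=True)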

lhotse.kaldi.load_start_and_duration(segments_path=None, feats_path=None, frame_shift=None)[source]

Load start times from the segments file and durations from feats.scp, when both are available.

Return type:

Dict[Tuple, None]

lhotse.kaldi.load_kaldi_text_mapping(path, must_exist=False, float_vals=False)[source]

Load Kaldi files such as utt2spk, spk2gender, text, etc. as a dict.

Return type:

Dict[str, Optional[str]]
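
For example (paths are placeholders):

>>> from lhotse.kaldi import load_kaldi_text_mapping
>>> utt2spk = load_kaldi_text_mapping('data/train/utt2spk', must_exist=True)
>>> utt2dur = load_kaldi_text_mapping('data/train/utt2dur', float_vals=True)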

lhotse.kaldi.save_kaldi_text_mapping(data, path)[source]

Save flat dicts to Kaldi files such as utt2spk, spk2gender, text, etc.

lhotse.kaldi.make_wavscp_channel_string_map(source, sampling_rate, transforms=None)[source]
Return type:

Dict[int, str]

Others

Helper methods used throughout the codebase.

lhotse.manipulation.combine(*manifests)[source]

Combine multiple manifests of the same type into one.

Return type:

TypeVar(Manifest, RecordingSet, SupervisionSet, FeatureSet, CutSet)

Examples:
>>> # Pass several arguments
>>> combine(recording_set1, recording_set2, recording_set3)
>>> # Or pass a single list/tuple of manifests
>>> combine([supervision_set1, supervision_set2])
lhotse.manipulation.split_parallelize_combine(num_jobs, manifest, fn, *args, **kwargs)[source]

Convenience wrapper that parallelizes the execution of functions that transform manifests. It splits the manifests into num_jobs pieces, applies the function to each split, and then combines the splits.

This function is used internally in Lhotse to implement some parallel ops.

Example:

>>> from lhotse import CutSet, split_parallelize_combine
>>> cuts = CutSet(...)
>>> window_cuts = split_parallelize_combine(
...     16,
...     cuts,
...     CutSet.cut_into_windows,
...     duration=30.0
... )
Parameters:
  • num_jobs (int) – The number of parallel jobs.

  • manifest (TypeVar(Manifest, RecordingSet, SupervisionSet, FeatureSet, CutSet)) – The manifest to be processed.

  • fn (Callable) – Function or method that transforms the manifest; the first parameter has to be the manifest (for methods, they have to be methods on that manifest’s type, e.g. CutSet.cut_into_windows).

  • args – positional arguments to fn.

  • kwargs – keyword arguments to fn.

Return type:

TypeVar(Manifest, RecordingSet, SupervisionSet, FeatureSet, CutSet)

lhotse.manipulation.to_manifest(items)[source]

Take an iterable of Lhotse data types such as Recording, SupervisionSegment or Cut, and create a manifest of the corresponding type. When the iterable is empty, returns None.

Return type:

Optional[TypeVar(Manifest, RecordingSet, SupervisionSet, FeatureSet, CutSet)]
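
A hedged sketch (Recording.from_file is assumed available as in Lhotse’s standard API; paths are placeholders):

>>> from lhotse import Recording
>>> from lhotse.manipulation import to_manifest
>>> recording_set = to_manifest(Recording.from_file(p) for p in ('a.wav', 'b.wav'))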