API Reference

This page contains a comprehensive list of all classes and functions within lhotse.

Recording manifests

Data structures used for describing audio recordings in a dataset.

class lhotse.audio.AudioSource(type: str, channels: List[int], source: str)

AudioSource represents audio data that can be retrieved from somewhere. Supported sources of audio are currently:

  • ‘file’ (formats supported by soundfile, possibly multi-channel);

  • ‘command’ [unix pipe] (must be WAVE, possibly multi-channel);

  • ‘url’ (any URL type that is supported by the “smart_open” library, e.g. http/https/s3/gcp/azure/etc.).

type: str
channels: List[int]
source: str
load_audio(offset=0.0, duration=None)

Load the AudioSource (from files, commands, or URLs) with soundfile, accounting for many audio formats and multi-channel inputs. Returns numpy array with shapes: (n_samples,) for single-channel, (n_channels, n_samples) for multi-channel.

Note: The elements in the returned array are in the range [-1.0, 1.0] and are of dtype np.float32.

Return type


static from_dict(data)
Return type


__init__(type, channels, source)

Initialize self. See help(type(self)) for accurate signature.

class lhotse.audio.Recording(id: str, sources: List[lhotse.audio.AudioSource], sampling_rate: int, num_samples: int, duration: float, transforms: Optional[List[Dict]] = None)

Recording represents an AudioSource along with some metadata.

id: str
sources: List[AudioSource]
sampling_rate: int
num_samples: int
duration: Seconds
transforms: Optional[List[Dict]] = None
static from_file(path, recording_id=None, relative_path_depth=None)

Read an audio file’s header and create the corresponding Recording. Suitable to use when each physical file represents a separate recording session.

If a recording session consists of multiple files (e.g. one per channel), it is advisable to create the Recording object manually, with each file represented as a separate AudioSource object.

  • path (Union[Path, str]) – Path to an audio file supported by libsoundfile (pysoundfile).

  • recording_id (Optional[str]) – recording id; when not specified, the filename’s stem is used (“x.wav” -> “x”).

  • relative_path_depth (Optional[int]) – optional int specifying how many last parts of the file path should be retained in the AudioSource. By default writes the path as is.

Return type



a new Recording instance pointing to the audio file.
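The two defaults described above (an ID taken from the filename's stem, and a path shortened to its last few parts) can be illustrated with a small sketch. It does not call lhotse: `default_recording_id` and `retain_last_parts` are hypothetical helper names, and the exact path handling inside from_file is an assumption.

```python
from pathlib import PurePosixPath
from typing import Optional


def default_recording_id(path: str) -> str:
    """Hypothetical helper: derive an ID from the filename's stem."""
    return PurePosixPath(path).stem


def retain_last_parts(path: str, depth: Optional[int] = None) -> str:
    """Hypothetical helper: keep only the last `depth` parts of a path."""
    p = PurePosixPath(path)
    if depth is None or depth <= 0:
        return str(p)  # default: keep the path as is
    return str(PurePosixPath(*p.parts[-depth:]))
```

With these helpers, "/data/corpus/train/x.wav" yields the ID "x", and a depth of 2 keeps "train/x.wav".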

property num_channels
property channel_ids
load_audio(channels=None, offset=0.0, duration=None)
Return type


perturb_speed(factor, affix_id=True)

Return a new Recording that will lazily perturb the speed while loading audio. The num_samples and duration fields are updated to reflect the shrinking/extending effect of speed.

  • factor (float) – The speed will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • affix_id (bool) – When true, we will modify the Recording.id field by affixing it with “_sp{factor}”.

Return type



a modified copy of the current Recording.
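A minimal sketch of how the metadata might change under speed perturbation. `RecordingInfo` and `perturb_speed_metadata` are hypothetical names, and the rounding of the sample count is an assumption rather than lhotse's documented behaviour.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RecordingInfo:
    """Hypothetical stand-in holding only the fields affected by speed perturbation."""
    id: str
    sampling_rate: int
    num_samples: int
    duration: float


def perturb_speed_metadata(rec: RecordingInfo, factor: float,
                           affix_id: bool = True) -> RecordingInfo:
    # Playing audio `factor` times faster leaves 1/factor of the samples.
    new_num_samples = round(rec.num_samples / factor)
    return replace(
        rec,
        id=f"{rec.id}_sp{factor}" if affix_id else rec.id,
        num_samples=new_num_samples,
        # Derive the duration from the sample count so both stay consistent.
        duration=new_num_samples / rec.sampling_rate,
    )
```

For a one-second recording at 16 kHz, a factor of 2.0 gives 8000 samples, a duration of 0.5 s, and an ID ending in “_sp2.0”.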


Return a new Recording that will be lazily resampled while loading audio.

sampling_rate (int) – The new sampling rate.

Return type

Recording

A resampled Recording.

static from_dict(data)
Return type


__init__(id, sources, sampling_rate, num_samples, duration, transforms=None)

Initialize self. See help(type(self)) for accurate signature.

class lhotse.audio.RecordingSet(*args, **kwds)

RecordingSet represents a dataset of recordings. It does not contain any annotation - just the information needed to retrieve a recording (possibly multi-channel, from files or from shell commands and pipes) and some metadata for each of them.

It also supports (de)serialization to/from YAML and takes care of mapping between rich Python classes and YAML primitives during conversion.

recordings: Dict[str, Recording]
static from_recordings(recordings)
Return type


static from_dir(path, pattern, num_jobs=1)
static from_dicts(data)
Return type


Return a new RecordingSet with the Recordings that satisfy the predicate.


predicate (Callable[[Recording], bool]) – a function that takes a recording as an argument and returns bool.

Return type



a filtered RecordingSet.

split(num_splits, shuffle=False)

Split the RecordingSet into num_splits pieces of equal size.

  • num_splits (int) – Requested number of splits.

  • shuffle (bool) – Optionally shuffle the recordings order first.

Return type



A list of RecordingSet pieces.
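The splitting behaviour can be sketched on a plain list. `split_evenly` is a hypothetical helper, and how a remainder is distributed when the items do not divide evenly is an assumption (here the first pieces receive one extra item).

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def split_evenly(items: Sequence[T], num_splits: int, shuffle: bool = False,
                 seed: Optional[int] = None) -> List[List[T]]:
    """Hypothetical helper: split items into num_splits near-equal pieces."""
    if not 0 < num_splits <= len(items):
        raise ValueError("num_splits must be between 1 and len(items)")
    pool = list(items)
    if shuffle:
        random.Random(seed).shuffle(pool)
    base, extra = divmod(len(pool), num_splits)
    pieces, start = [], 0
    for i in range(num_splits):
        size = base + (1 if i < extra else 0)  # first `extra` pieces get one more
        pieces.append(pool[start:start + size])
        start += size
    return pieces
```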

subset(first=None, last=None)

Return a new RecordingSet according to the selected subset criterion. Only a single argument to subset is supported at this time.

  • first (Optional[int]) – int, the number of first recordings to keep.

  • last (Optional[int]) – int, the number of last recordings to keep.

Return type



a new RecordingSet with the subset results.
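A sketch of the first/last selection on a plain list, assuming that passing both arguments (or neither) is rejected. `subset` here is a hypothetical free function, not the lhotse method.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def subset(items: Sequence[T], first: Optional[int] = None,
           last: Optional[int] = None) -> List[T]:
    """Hypothetical helper mirroring the first/last selection described above."""
    if (first is None) == (last is None):
        raise ValueError("Specify exactly one of `first` or `last`.")
    if first is not None:
        return list(items[:first])
    # Compute the start index explicitly: items[-0:] would return everything.
    return list(items[max(0, len(items) - last):])
```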

load_audio(recording_id, channels=None, offset_seconds=0.0, duration_seconds=None)
Return type


perturb_speed(factor, affix_id=True)

Return a new RecordingSet that will lazily perturb the speed while loading audio. The num_samples and duration fields are updated to reflect the shrinking/extending effect of speed.

  • factor (float) – The speed will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • affix_id (bool) – When true, we will modify the Recording.id field by affixing it with “_sp{factor}”.

Return type



a RecordingSet containing the perturbed Recording objects.


Apply resampling to all recordings in the RecordingSet and return a new RecordingSet.

sampling_rate (int) – The new sampling rate.

Return type

RecordingSet

a new RecordingSet with lazily resampled Recording objects.


Initialize self. See help(type(self)) for accurate signature.

class lhotse.audio.AudioMixer(base_audio, sampling_rate)

Utility class to mix multiple waveforms into a single one. It should be instantiated separately for each mixing session (i.e. each MixedCut will create a separate AudioMixer to mix its tracks). It is initialized with a numpy array of audio samples (typically float32 in [-1, 1] range) that represents the “reference” signal for the mix. Other signals can be mixed to it with different time offsets and SNRs using the add_to_mix method. The time offset is relative to the start of the reference signal (only positive values are supported). The SNR is relative to the energy of the signal used to initialize the AudioMixer.
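The offset and SNR handling described above can be sketched with plain Python lists. This is not the AudioMixer implementation: `energy` and `mix` are hypothetical helpers, energy is assumed to be the mean squared amplitude, the mixed-in signal is assumed to be non-silent, and padding uses exact zeros for simplicity.

```python
import math
from typing import List, Optional


def energy(signal: List[float]) -> float:
    """Mean squared amplitude (an assumed definition of 'energy')."""
    return sum(x * x for x in signal) / len(signal)


def mix(reference: List[float], audio: List[float], sampling_rate: int,
        snr: Optional[float] = None, offset: float = 0.0) -> List[float]:
    """Hypothetical helper: add `audio` to `reference` at `offset` seconds and `snr` dB."""
    gain = 1.0
    if snr is not None:
        # Scale `audio` so that energy(reference) / energy(scaled audio) == 10^(snr/10).
        target_energy = energy(reference) / (10 ** (snr / 10))
        gain = math.sqrt(target_energy / energy(audio))
    pad = round(offset * sampling_rate)  # offset expressed in samples
    length = max(len(reference), pad + len(audio))
    mixed = list(reference) + [0.0] * (length - len(reference))
    for i, x in enumerate(audio):
        mixed[pad + i] += gain * x
    return mixed
```

A positive SNR lowers the energy of the mixed-in signal relative to the reference, and a negative SNR raises it.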

__init__(base_audio, sampling_rate)
  • base_audio (ndarray) – A numpy array with the audio samples for the base signal (all the other signals will be mixed to it).

  • sampling_rate (int) – Sampling rate of the audio.

property unmixed_audio

Return a numpy ndarray with the shape (num_tracks, num_samples), where each track is zero padded and scaled adequately to the offsets and SNR used in add_to_mix call.

Return type


property mixed_audio

Return a numpy ndarray with the shape (1, num_samples) - a mono mix of the tracks supplied with add_to_mix calls.

Return type


add_to_mix(audio, snr=None, offset=0.0)

Add audio (only mono-channel audio is supported) of a new track into the mix.

  • audio (ndarray) – An array of audio samples to be mixed in.

  • snr (Optional[float]) – Signal-to-noise ratio, assuming audio represents noise (positive SNR - lower audio energy, negative SNR - higher audio energy).

  • offset (float) – How many seconds to shift audio in time. For mixing, the signal will be padded before the start with low energy values.

Return type


lhotse.audio.read_audio(path_or_fd, offset=0.0, duration=None)
Return type

Tuple[ndarray, int]

Supervision manifests

Data structures used for describing supervisions in a dataset.

class lhotse.supervision.SupervisionSegment(id: str, recording_id: str, start: float, duration: float, channel: int = 0, text: Union[str, NoneType] = None, language: Union[str, NoneType] = None, speaker: Union[str, NoneType] = None, gender: Union[str, NoneType] = None, custom: Union[Dict[str, Any], NoneType] = None)
id: str
recording_id: str
start: Seconds
duration: Seconds
channel: int = 0
text: Optional[str] = None
language: Optional[str] = None
speaker: Optional[str] = None
gender: Optional[str] = None
custom: Optional[Dict[str, Any]] = None
property end
Return type



Return an identical SupervisionSegment, but with the offset added to the start field.

Return type


perturb_speed(factor, sampling_rate, affix_id=True)

Return a SupervisionSegment that has time boundaries matching the recording/cut perturbed with the same factor.

  • factor (float) – The speed will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • sampling_rate (int) – The sampling rate is necessary to accurately perturb the start and duration (going through the sample counts).

  • affix_id (bool) – When true, we will modify the id and recording_id fields by affixing it with “_sp{factor}”.

Return type



a modified copy of the current SupervisionSegment.


Return an identical SupervisionSegment, but ensure that self.start is not negative (in which case it’s set to 0) and self.end does not exceed the end parameter.

This method is useful for ensuring that the supervision does not exceed a cut’s bounds, in which case pass cut.duration as the end argument, since supervision times are relative to the cut.
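The clipping rule can be sketched as follows. `Segment` and `trim` are hypothetical stand-ins that carry only the time fields; they are not the lhotse classes.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in with only the time fields of a supervision."""
    start: float
    duration: float

    @property
    def end(self) -> float:
        return self.start + self.duration


def trim(segment: Segment, end: float) -> Segment:
    new_start = max(0.0, segment.start)  # clip a negative start to zero
    new_end = min(end, segment.end)      # do not run past the given end
    return replace(segment, start=new_start, duration=new_end - new_start)
```

A segment starting at -1.0 s and lasting 3.0 s, trimmed to an end of 1.5 s, becomes a segment starting at 0.0 s with a duration of 1.5 s.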

Return type



Return a copy of the current segment, transformed with transform_fn.


transform_fn (Callable[[SupervisionSegment], SupervisionSegment]) – a function that takes a segment as input, transforms it and returns a new segment.

Return type



a modified SupervisionSegment.


Return a copy of the current segment with transformed text field. Useful for text normalization, phonetic transcription, etc.


transform_fn (Callable[[str], str]) – a function that accepts a string and returns a string.

Return type



a SupervisionSegment with adjusted text.

static from_dict(data)
Return type


__init__(id, recording_id, start, duration, channel=0, text=None, language=None, speaker=None, gender=None, custom=None)

Initialize self. See help(type(self)) for accurate signature.

class lhotse.supervision.SupervisionSet(*args, **kwds)

SupervisionSet represents a collection of segments containing some supervision information. The only required fields are the ID of the segment, ID of the corresponding recording, and the start and duration of the segment in seconds. All other fields, such as text, language or speaker, are deliberately optional to support a wide range of tasks, as well as adding more supervision types in the future, while retaining backwards compatibility.

segments: Dict[str, SupervisionSegment]
static from_segments(segments)
Return type


static from_dicts(data)
Return type


split(num_splits, shuffle=False)

Split the SupervisionSet into num_splits pieces of equal size.

  • num_splits (int) – Requested number of splits.

  • shuffle (bool) – Optionally shuffle the supervisions order first.

Return type



A list of SupervisionSet pieces.

subset(first=None, last=None)

Return a new SupervisionSet according to the selected subset criterion. Only a single argument to subset is supported at this time.

  • first (Optional[int]) – int, the number of first supervisions to keep.

  • last (Optional[int]) – int, the number of last supervisions to keep.

Return type



a new SupervisionSet with the subset results.


Return a new SupervisionSet with the SupervisionSegments that satisfy the predicate.


predicate (Callable[[SupervisionSegment], bool]) – a function that takes a supervision as an argument and returns bool.

Return type



a filtered SupervisionSet.


Map a transform_fn to the SupervisionSegments and return a new SupervisionSet.


transform_fn (Callable[[SupervisionSegment], SupervisionSegment]) – a function that takes a supervision as an argument and returns a modified supervision.

Return type



a new SupervisionSet with modified segments.


Return a copy of the current SupervisionSet with the segments having a transformed text field. Useful for text normalization, phonetic transcription, etc.


transform_fn (Callable[[str], str]) – a function that accepts a string and returns a string.

Return type



a SupervisionSet with adjusted text.

find(recording_id, channel=None, start_after=0, end_before=None, adjust_offset=False, tolerance=0.001)

Return an iterable of segments that match the provided recording_id.

  • recording_id (str) – Desired recording ID.

  • channel (Optional[int]) – When specified, return supervisions in that channel - otherwise, in all channels.

  • start_after (float) – When specified, return segments that start after the given value.

  • end_before (Optional[float]) – When specified, return segments that end before the given value.

  • adjust_offset (bool) – When true, return segments as if the recordings had started at start_after. This is useful for creating Cuts. From a user perspective, when dealing with a Cut, it is no longer helpful to know when the supervision starts in a recording - instead, it’s useful to know when the supervision starts relative to the start of the Cut. In the anticipated use-case, start_after and end_before would be the beginning and end of a cut; this option converts the times to be relative to the start of the cut.

  • tolerance (float) – Additional margin to account for floating point rounding errors when comparing segment boundaries.

Return type



An iterator over supervision segments satisfying all criteria.
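The interplay of the filters, the tolerance and adjust_offset can be sketched on a plain list of segments. `Segment` and this `find` function are hypothetical stand-ins, and the exact boundary comparisons used by lhotse are an assumption.

```python
from dataclasses import dataclass, replace
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for a supervision segment."""
    id: str
    recording_id: str
    start: float
    duration: float
    channel: int = 0

    @property
    def end(self) -> float:
        return self.start + self.duration


def find(segments: Iterable[Segment], recording_id: str, channel: Optional[int] = None,
         start_after: float = 0.0, end_before: Optional[float] = None,
         adjust_offset: bool = False, tolerance: float = 0.001) -> Iterator[Segment]:
    for seg in segments:
        if seg.recording_id != recording_id:
            continue
        if channel is not None and seg.channel != channel:
            continue
        # The tolerance widens the window slightly on both sides.
        if seg.start < start_after - tolerance:
            continue
        if end_before is not None and seg.end > end_before + tolerance:
            continue
        # Optionally re-express the start relative to start_after (e.g. a cut's start).
        yield replace(seg, start=seg.start - start_after) if adjust_offset else seg
```

With start_after=1.5 and adjust_offset=True, a segment that starts at 2.0 s in the recording is returned with a start of 0.5 s.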

__init__(segments, _segments_by_recording_id=None)

Initialize self. See help(type(self)) for accurate signature.

Feature extraction and manifests

Data structures and tools used for feature extraction and description.

Features API - extractor and manifests

class lhotse.features.base.FeatureExtractor(config=None)

The base class for all feature extractors in Lhotse. It is initialized with a config object, specific to a particular feature extraction method. The config is expected to be a dataclass so that it can be easily serialized.

All derived feature extractors must implement at least the following:

  • a name class attribute (what these features are called, e.g. ‘mfcc’)

  • a config_type class attribute that points to the configuration dataclass type

  • the extract method,

  • the frame_shift property.

Feature extractors that support feature-domain mixing should additionally specify two static methods:

  • compute_energy, and

  • mix.

By itself, the FeatureExtractor offers the following high-level methods that are not intended for overriding:

  • extract_from_samples_and_store

  • extract_from_recording_and_store

These methods run a larger feature extraction pipeline that involves data augmentation and disk storage.
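The subclassing contract listed above can be sketched with a stand-in base class. `ExtractorSketch`, `FrameEnergyConfig` and `FrameEnergy` are hypothetical and use plain lists instead of numpy arrays; the real base class offers more (storage, augmentation, serialization).

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Sequence


class ExtractorSketch(ABC):
    """Hypothetical stand-in for the base class; not lhotse's implementation."""
    name = None
    config_type = None

    def __init__(self, config=None):
        # Fall back to a default-constructed config dataclass.
        self.config = config if config is not None else self.config_type()

    @abstractmethod
    def extract(self, samples: Sequence[float], sampling_rate: int) -> List[List[float]]:
        ...

    @property
    @abstractmethod
    def frame_shift(self) -> float:
        ...


@dataclass
class FrameEnergyConfig:
    frame_shift: float = 0.01  # seconds


class FrameEnergy(ExtractorSketch):
    name = "frame-energy"
    config_type = FrameEnergyConfig

    @property
    def frame_shift(self) -> float:
        return self.config.frame_shift

    def extract(self, samples, sampling_rate):
        hop = round(self.frame_shift * sampling_rate)  # samples per frame
        # One feature per frame: the sum of squared samples in that frame.
        return [[sum(x * x for x in samples[i:i + hop])]
                for i in range(0, len(samples) - hop + 1, hop)]
```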

name = None
config_type = None

Initialize self. See help(type(self)) for accurate signature.

abstract extract(samples, sampling_rate)

Defines how to extract features using a numpy ndarray of audio samples and the sampling rate.

Return type



a numpy ndarray representing the feature matrix.

abstract property frame_shift
Return type


abstract feature_dim(sampling_rate)
Return type


static mix(features_a, features_b, energy_scaling_factor_b)

Perform feature-domain mix of two signals, a and b, and return the mixed signal.

  • features_a (ndarray) – Left-hand side (reference) signal.

  • features_b (ndarray) – Right-hand side (mixed-in) signal.

  • energy_scaling_factor_b (float) – A scaling factor for features_b energy. It is used to achieve a specific SNR. E.g. to mix with an SNR of 10dB when both features_a and features_b energies are 100, the features_b signal energy needs to be scaled by 0.1. Since different features (e.g. spectrogram, fbank, MFCC) require different combination of transformations (e.g. exp, log, sqrt, pow) to allow mixing of two signals, the exact place where to apply energy_scaling_factor_b to the signal is determined by the implementer.

Return type



A mixed feature matrix.
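The SNR example from the parameter description (both energies equal to 100, target SNR of 10 dB, scaling factor 0.1) can be reproduced with a short sketch. Both functions are hypothetical helpers, and the log-domain mix assumes features that are natural-log energies; actual extractors decide for themselves where the factor is applied.

```python
import math
from typing import List


def energy_scaling_factor(energy_a: float, energy_b: float, snr_db: float) -> float:
    """Hypothetical helper: factor for b's energy so that a/b reaches snr_db."""
    target_energy_b = energy_a / (10 ** (snr_db / 10))
    return target_energy_b / energy_b


def mix_log_energies(features_a: List[float], features_b: List[float],
                     energy_scaling_factor_b: float) -> List[float]:
    """Assumed log-energy features: leave the log domain, add, and return to it."""
    return [math.log(math.exp(a) + energy_scaling_factor_b * math.exp(b))
            for a, b in zip(features_a, features_b)]
```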

static compute_energy(features)

Compute the total energy of a feature matrix. How the energy is computed depends on a particular type of features. It is expected that when implemented, compute_energy will never return zero.


features (ndarray) – A feature matrix.

Return type



A positive float value of the signal energy.

extract_from_samples_and_store(samples, storage, sampling_rate, offset=0, channel=None, augment_fn=None)

Extract the features from an array of audio samples in a full pipeline:

  • optional audio augmentation;

  • extract the features;

  • save them to disk in a specified directory;

  • return a Features object with a description of the extracted features.

Note, unlike in extract_from_recording_and_store, the returned Features object might not be suitable to store in a FeatureSet, as it does not reference any particular Recording. Instead, this method is useful when extracting features from cuts - especially MixedCut instances, which may be created from multiple recordings and channels.

  • samples (ndarray) – a numpy ndarray with the audio samples.

  • sampling_rate (int) – integer sampling rate of samples.

  • storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices.

  • offset (float) – an offset in seconds for where to start reading the recording - when used for Cut feature extraction, must be equal to Cut.start.

  • channel (Optional[int]) – an optional channel number to insert into Features manifest.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.

Return type



a Features manifest item for the extracted feature matrix (it is not written to disk).

extract_from_recording_and_store(recording, storage, offset=0, duration=None, channels=None, augment_fn=None)

Extract the features from a Recording in a full pipeline:

  • load audio from disk;

  • optionally, perform audio augmentation;

  • extract the features;

  • save them to disk in a specified directory;

  • return a Features object with a description of the extracted features and the source data used.

  • recording (Recording) – a Recording that specifies what’s the input audio.

  • storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices.

  • offset (float) – an optional offset in seconds for where to start reading the recording.

  • duration (Optional[float]) – an optional duration specifying how much audio to load from the recording.

  • channels (Union[int, List[int], None]) – an optional int or list of ints, specifying the channels; by default, all channels will be used.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.

Return type



a Features manifest item for the extracted feature matrix.

classmethod from_dict(data)
Return type


classmethod from_yaml(path)
Return type



Return the feature extractor type corresponding to the given name.


name (str) – specifies which feature extractor should be used.

Return type



A feature extractor type.


Create a feature extractor object with a default configuration.


name (str) – specifies which feature extractor should be used.

Return type



A new feature extractor instance.


This decorator is used to register feature extractor classes in Lhotse so they can be easily created just by knowing their name.

An example of usage:

    @register_extractor
    class MyFeatureExtractor:
        …


cls – A type (class) that is being registered.


Registered type.
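The registration pattern can be sketched as a decorator that stores each class in a dictionary keyed by its name attribute. This is an illustrative re-implementation, not lhotse's code: the `_EXTRACTORS` registry and the `extractor_type_for` lookup are hypothetical names.

```python
from typing import Dict, Type

# Hypothetical registry mapping extractor names to classes.
_EXTRACTORS: Dict[str, Type] = {}


def register_extractor(cls):
    """Record the class under its `name` attribute and return it unchanged."""
    _EXTRACTORS[cls.name] = cls
    return cls


def extractor_type_for(name: str) -> Type:
    """Hypothetical lookup helper."""
    return _EXTRACTORS[name]


@register_extractor
class MyFeatureExtractor:
    name = "my-features"
```

Because the decorator returns the class unchanged, registered classes can still be used directly as well as looked up by name.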

class lhotse.features.base.TorchaudioFeatureExtractor(config=None)

Common abstract base class for all torchaudio based feature extractors.

feature_fn = None
extract(samples, sampling_rate)

Defines how to extract features using a numpy ndarray of audio samples and the sampling rate.

Return type



a numpy ndarray representing the feature matrix.

property frame_shift
Return type


class lhotse.features.base.Features(type: str, num_frames: int, num_features: int, frame_shift: float, sampling_rate: int, start: float, duration: float, storage_type: str, storage_path: str, storage_key: str, recording_id: Optional[str] = None, channels: Optional[Union[int, List[int]]] = None)

Represents features extracted for some particular time range in a given recording and channel. It contains metadata about how it’s stored: storage_type describes “how to read it”, for now it supports numpy arrays serialized with np.save, as well as arrays compressed with lilcom; storage_path is the path to the file on the local filesystem.

type: str
num_frames: int
num_features: int
frame_shift: Seconds
sampling_rate: int
start: Seconds
duration: Seconds
storage_type: str
storage_path: str
storage_key: str
recording_id: Optional[str] = None
channels: Optional[Union[int, List[int]]] = None
property end
Return type


load(start=None, duration=None)
Return type


static from_dict(data)
Return type


__init__(type, num_frames, num_features, frame_shift, sampling_rate, start, duration, storage_type, storage_path, storage_key, recording_id=None, channels=None)

Initialize self. See help(type(self)) for accurate signature.

class lhotse.features.base.FeatureSet(*args, **kwds)

Represents a feature manifest, and allows reading features for given recordings within particular channels and time ranges. It also keeps information about the feature extractor parameters used to obtain this set. When a given recording/time-range/channel is unavailable, raises a KeyError.

features: List[Features]
static from_features(features)
Return type


static from_dicts(data)
Return type


split(num_splits, shuffle=False)

Split the FeatureSet into num_splits pieces of equal size.

  • num_splits (int) – Requested number of splits.

  • shuffle (bool) – Optionally shuffle the features order first.

Return type



A list of FeatureSet pieces.

subset(first=None, last=None)

Return a new FeatureSet according to the selected subset criterion. Only a single argument to subset is supported at this time.

  • first (Optional[int]) – int, the number of first features to keep.

  • last (Optional[int]) – int, the number of last features to keep.

Return type



a new FeatureSet with the subset results.

find(recording_id, channel_id=0, start=0.0, duration=None, leeway=0.05)

Find and return a Features object that best satisfies the search criteria. Raise a KeyError when no such object is available.

  • recording_id (str) – str, requested recording ID.

  • channel_id (int) – int, requested channel.

  • start (float) – float, requested start time in seconds for the feature chunk.

  • duration (Optional[float]) – optional float, requested duration in seconds for the feature chunk. By default, return everything from the start.

  • leeway (float) – float, controls how strictly we have to match the requested start and duration criteria. It is necessary to keep a small positive value here (default 0.05s), as there might be differences between the duration of recording/supervision segment, and the duration of features. The latter one is constrained to be a multiple of frame_shift, while the former can be arbitrary.

Return type



a Features object satisfying the search criteria.
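Why a small leeway is needed can be shown with a hypothetical coverage check. `satisfies` is not a lhotse function, and the comparison it performs is an assumption about how the matching might work.

```python
def satisfies(feat_start: float, feat_duration: float,
              start: float, duration: float, leeway: float = 0.05) -> bool:
    """Hypothetical check: does a feature chunk cover the requested time range?"""
    feat_end = feat_start + feat_duration
    return feat_start - leeway <= start and start + duration <= feat_end + leeway
```

Features lasting 1.23 s (a multiple of a 0.01 s frame shift) do not strictly cover a 1.237 s request, so the check fails with a leeway of 0.0 and passes with the default of 0.05.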

load(recording_id, channel_id=0, start=0.0, duration=None)

Find a Features object that best satisfies the search criteria and load the features as a numpy ndarray. Raise a KeyError when no such object is available.

Return type



Compute the global means and standard deviations for each feature bin in the manifest. It follows the implementation in scikit-learn: https://github.com/scikit-learn/scikit-learn/blob/0fb307bf39bbdacd6ed713c00724f8f871d60370/sklearn/utils/extmath.py#L715 which follows the paper: “Algorithms for computing the sample variance: analysis and recommendations”, by Chan, Golub, and LeVeque.


storage_path (Union[Path, str, None]) – an optional path to a file where the stats will be stored with pickle.

Return a dict of {‘norm_means’: np.ndarray, ‘norm_stds’: np.ndarray} with the shape of the arrays equal to the number of feature bins in this manifest.

Return type

Dict[str, ndarray]

__init__(features=<factory>, _features_by_recording_id=None)

Initialize self. See help(type(self)) for accurate signature.

class lhotse.features.base.FeatureSetBuilder(feature_extractor, storage, augment_fn=None)

An extended constructor for the FeatureSet. Think of it as a class wrapper for a feature extraction script. It consumes an iterable of Recordings, extracts the features specified by the FeatureExtractor config, and stores them on disk.

Eventually, we plan to extend it with the capability to extract only the features in specified regions of recordings and to perform some time-domain data augmentation.

__init__(feature_extractor, storage, augment_fn=None)

Initialize self. See help(type(self)) for accurate signature.

process_and_store_recordings(recordings, output_manifest=None, num_jobs=1)
Return type


lhotse.features.base.store_feature_array(feats, storage)

Store feats array on disk, using lilcom compression by default.

  • feats (ndarray) – a numpy ndarray containing features.

  • storage (FeaturesWriter) – a FeaturesWriter object to use for array storage.

Return type



a path to the file containing the stored array.

lhotse.features.base.compute_global_stats(feature_manifests, storage_path=None)

Compute the global means and standard deviations for each feature bin in the manifest. It performs only a single pass over the data and iteratively updates the estimate of the means and variances.

We follow the implementation in scikit-learn: https://github.com/scikit-learn/scikit-learn/blob/0fb307bf39bbdacd6ed713c00724f8f871d60370/sklearn/utils/extmath.py#L715 which follows the paper: “Algorithms for computing the sample variance: analysis and recommendations”, by Chan, Golub, and LeVeque.

  • feature_manifests (Iterable[Features]) – an iterable of Features objects.

  • storage_path (Union[Path, str, None]) – an optional path to a file where the stats will be stored with pickle.

Return a dict of {‘norm_means’: np.ndarray, ‘norm_stds’: np.ndarray} with the shape of the arrays equal to the number of feature bins in this manifest.

Return type

Dict[str, ndarray]
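A single-pass computation in the spirit of the pairwise update by Chan, Golub, and LeVeque can be sketched with plain lists. `global_stats` is a hypothetical helper, feature matrices are lists of frames, and the use of the population variance (dividing by the frame count) is an assumption.

```python
import math
from typing import Dict, Iterable, List

Matrix = List[List[float]]  # frames x feature bins


def global_stats(matrices: Iterable[Matrix]) -> Dict[str, List[float]]:
    # Running frame count, per-bin means, per-bin sums of squared deviations.
    count, means, m2 = 0, [], []
    for mat in matrices:
        if not mat:
            continue
        n_b, num_bins = len(mat), len(mat[0])
        if count == 0:
            means, m2 = [0.0] * num_bins, [0.0] * num_bins
        total = count + n_b
        for j in range(num_bins):
            column = [frame[j] for frame in mat]
            mean_b = sum(column) / n_b
            m2_b = sum((x - mean_b) ** 2 for x in column)
            delta = mean_b - means[j]
            # Pairwise update: merge the chunk's statistics into the running ones.
            m2[j] += m2_b + delta * delta * count * n_b / total
            means[j] += delta * n_b / total
        count = total
    return {"norm_means": means, "norm_stds": [math.sqrt(v / count) for v in m2]}
```

Each matrix is visited once, so the full set of features never has to be held in memory at the same time.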

Torchaudio feature extractors

class lhotse.features.fbank.FbankConfig(dither: float = 0.0, window_type: str = 'povey', frame_length: float = 0.025, frame_shift: float = 0.01, remove_dc_offset: bool = True, round_to_power_of_two: bool = True, energy_floor: float = 1e-10, min_duration: float = 0.0, preemphasis_coefficient: float = 0.97, raw_energy: bool = True, low_freq: float = 20.0, high_freq: float = - 400.0, num_mel_bins: int = 40, use_energy: bool = False, vtln_low: float = 100.0, vtln_high: float = - 500.0, vtln_warp: float = 1.0)
dither: float = 0.0
window_type: str = 'povey'
frame_length: float = 0.025
frame_shift: float = 0.01
remove_dc_offset: bool = True
round_to_power_of_two: bool = True
energy_floor: float = 1e-10
min_duration: float = 0.0
preemphasis_coefficient: float = 0.97
raw_energy: bool = True
low_freq: float = 20.0
high_freq: float = -400.0
num_mel_bins: int = 40
use_energy: bool = False
vtln_low: float = 100.0
vtln_high: float = -500.0
vtln_warp: float = 1.0
__init__(dither=0.0, window_type='povey', frame_length=0.025, frame_shift=0.01, remove_dc_offset=True, round_to_power_of_two=True, energy_floor=1e-10, min_duration=0.0, preemphasis_coefficient=0.97, raw_energy=True, low_freq=20.0, high_freq=- 400.0, num_mel_bins=40, use_energy=False, vtln_low=100.0, vtln_high=- 500.0, vtln_warp=1.0)

Initialize self. See help(type(self)) for accurate signature.

class lhotse.features.fbank.Fbank(config=None)

Log Mel energy filter bank feature extractor based on torchaudio.compliance.kaldi.fbank function.

name = 'fbank'

alias of FbankConfig

Return type


static mix(features_a, features_b, energy_scaling_factor_b)

Perform feature-domain mix of two signals, a and b, and return the mixed signal.

  • features_a (ndarray) – Left-hand side (reference) signal.

  • features_b (ndarray) – Right-hand side (mixed-in) signal.

  • energy_scaling_factor_b (float) – A scaling factor for features_b energy. It is used to achieve a specific SNR. E.g. to mix with an SNR of 10dB when both features_a and features_b energies are 100, the features_b signal energy needs to be scaled by 0.1. Since different features (e.g. spectrogram, fbank, MFCC) require different combination of transformations (e.g. exp, log, sqrt, pow) to allow mixing of two signals, the exact place where to apply energy_scaling_factor_b to the signal is determined by the implementer.

Return type



A mixed feature matrix.

static compute_energy(features)

Compute the total energy of a feature matrix. How the energy is computed depends on a particular type of features. It is expected that when implemented, compute_energy will never return zero.


features (ndarray) – A feature matrix.

Return type



A positive float value of the signal energy.

class lhotse.features.mfcc.MfccConfig(dither: float = 0.0, window_type: str = 'povey', frame_length: float = 0.025, frame_shift: float = 0.01, remove_dc_offset: bool = True, round_to_power_of_two: bool = True, energy_floor: float = 1e-10, min_duration: float = 0.0, preemphasis_coefficient: float = 0.97, raw_energy: bool = True, low_freq: float = 20.0, high_freq: float = 0.0, num_mel_bins: int = 23, use_energy: bool = False, vtln_low: float = 100.0, vtln_high: float = - 500.0, vtln_warp: float = 1.0, cepstral_lifter: float = 22.0, num_ceps: int = 13)
dither: float = 0.0
window_type: str = 'povey'
frame_length: float = 0.025
frame_shift: float = 0.01
remove_dc_offset: bool = True
round_to_power_of_two: bool = True
energy_floor: float = 1e-10
min_duration: float = 0.0
preemphasis_coefficient: float = 0.97
raw_energy: bool = True
low_freq: float = 20.0
high_freq: float = 0.0
num_mel_bins: int = 23
use_energy: bool = False
vtln_low: float = 100.0
vtln_high: float = -500.0
vtln_warp: float = 1.0
cepstral_lifter: float = 22.0
num_ceps: int = 13
__init__(dither=0.0, window_type='povey', frame_length=0.025, frame_shift=0.01, remove_dc_offset=True, round_to_power_of_two=True, energy_floor=1e-10, min_duration=0.0, preemphasis_coefficient=0.97, raw_energy=True, low_freq=20.0, high_freq=0.0, num_mel_bins=23, use_energy=False, vtln_low=100.0, vtln_high=- 500.0, vtln_warp=1.0, cepstral_lifter=22.0, num_ceps=13)

Initialize self. See help(type(self)) for accurate signature.

class lhotse.features.mfcc.Mfcc(config=None)

MFCC feature extractor based on torchaudio.compliance.kaldi.mfcc function.

name = 'mfcc'

alias of MfccConfig

Return type


class lhotse.features.spectrogram.SpectrogramConfig(dither: float = 0.0, window_type: str = 'povey', frame_length: float = 0.025, frame_shift: float = 0.01, remove_dc_offset: bool = True, round_to_power_of_two: bool = True, energy_floor: float = 1e-10, min_duration: float = 0.0, preemphasis_coefficient: float = 0.97, raw_energy: bool = True)
dither: float = 0.0
window_type: str = 'povey'
frame_length: float = 0.025
frame_shift: float = 0.01
remove_dc_offset: bool = True
round_to_power_of_two: bool = True
energy_floor: float = 1e-10
min_duration: float = 0.0
preemphasis_coefficient: float = 0.97
raw_energy: bool = True
__init__(dither=0.0, window_type='povey', frame_length=0.025, frame_shift=0.01, remove_dc_offset=True, round_to_power_of_two=True, energy_floor=1e-10, min_duration=0.0, preemphasis_coefficient=0.97, raw_energy=True)

Initialize self. See help(type(self)) for accurate signature.

class lhotse.features.spectrogram.Spectrogram(config=None)

Log spectrogram feature extractor based on torchaudio.compliance.kaldi.spectrogram function.

name = 'spectrogram'

alias of SpectrogramConfig

Return type


static mix(features_a, features_b, energy_scaling_factor_b)

Perform feature-domain mix of two signals, a and b, and return the mixed signal.

  • features_a (ndarray) – Left-hand side (reference) signal.

  • features_b (ndarray) – Right-hand side (mixed-in) signal.

  • energy_scaling_factor_b (float) – A scaling factor for features_b energy. It is used to achieve a specific SNR. E.g. to mix with an SNR of 10dB when both features_a and features_b energies are 100, the features_b signal energy needs to be scaled by 0.1. Since different features (e.g. spectrogram, fbank, MFCC) require different combination of transformations (e.g. exp, log, sqrt, pow) to allow mixing of two signals, the exact place where to apply energy_scaling_factor_b to the signal is determined by the implementer.

Return type



A mixed feature matrix.
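
As a rough illustration of the arithmetic behind energy_scaling_factor_b, the sketch below derives the factor from a target SNR in dB. The helper name energy_scaling_factor is hypothetical and not part of lhotse; it only assumes the standard definition SNR = 10 * log10(energy_a / energy_b).

```python
def energy_scaling_factor(energy_a: float, energy_b: float, snr_db: float) -> float:
    """Hypothetical helper (not a lhotse function).

    Returns the factor by which the energy of signal b must be multiplied
    so that the ratio energy_a / scaled_energy_b matches the requested SNR.
    """
    if energy_a <= 0 or energy_b <= 0:
        raise ValueError("Both energies must be positive.")
    # Energy that b should have after scaling: energy_a / 10^(SNR / 10).
    target_energy_b = energy_a / (10.0 ** (snr_db / 10.0))
    return target_energy_b / energy_b


# Both signals have energy 100 and the target SNR is 10 dB.
factor = energy_scaling_factor(100.0, 100.0, snr_db=10.0)
print(factor)  # 0.1, as in the example above
```

Where exactly the factor is applied (before or after exp, log, sqrt, etc.) still depends on the feature type, as noted above.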

static compute_energy(features)

Compute the total energy of a feature matrix. How the energy is computed depends on a particular type of features. It is expected that when implemented, compute_energy will never return zero.


features (ndarray) – A feature matrix.

Return type



A positive float value of the signal energy.

Feature storage

class lhotse.features.io.FeaturesWriter

FeaturesWriter defines the interface of how to store numpy arrays in a particular storage backend. This backend could either be:

  • separate files on a local filesystem;

  • a single file with multiple arrays;

  • cloud storage;

  • etc.

Each class inheriting from FeaturesWriter must define:

  • the write() method, which defines the storing operation (accepts a key used to place the value array in the storage);

  • the storage_path() property, which is either a common directory for the files, the name of the file storing multiple arrays, name of the cloud bucket, etc.

  • the name() property that is unique to this particular storage mechanism - it is stored in the features manifests (metadata) and used to automatically deduce the backend when loading the features.

Each FeaturesWriter can also be used as a context manager, as some implementations might need to free a resource after the writing is finalized. By default nothing happens in the context manager functions, and this can be modified by the inheriting subclasses.

with MyWriter('some/path') as storage:
    extractor.extract_from_recording_and_store(recording, storage)

The features loading must be defined separately in a class inheriting from FeaturesReader.
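
The interface described above can be sketched without lhotse installed. In the example below, WriterInterface is a stand-in that mirrors the documented members (name, storage_path, write, and a no-op context manager), and InMemoryWriter is a hypothetical backend; neither class exists in lhotse, and returning the key from write() is an assumption.

```python
from abc import ABC, abstractmethod
from typing import Dict, List

Matrix = List[List[float]]  # rows are frames, columns are feature bins


class WriterInterface(ABC):
    """Stand-in mirroring the documented FeaturesWriter interface."""

    @property
    @abstractmethod
    def name(self) -> str: ...

    @property
    @abstractmethod
    def storage_path(self) -> str: ...

    @abstractmethod
    def write(self, key: str, value: Matrix) -> str: ...

    # By default nothing happens when entering or leaving the context.
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        return None


class InMemoryWriter(WriterInterface):
    """Hypothetical backend that keeps arrays in a dictionary."""

    def __init__(self, storage_path: str):
        self._storage_path = storage_path
        self.arrays: Dict[str, Matrix] = {}

    @property
    def name(self) -> str:
        return "in_memory"

    @property
    def storage_path(self) -> str:
        return self._storage_path

    def write(self, key: str, value: Matrix) -> str:
        self.arrays[key] = value
        return key  # assumption: the key is returned to the caller


with InMemoryWriter("some/path") as storage:
    stored_key = storage.write("utt-1", [[0.1, 0.2], [0.3, 0.4]])
```

A backend that holds a real resource (for example an open file) would override __exit__ to release it.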

abstract property name
Return type


abstract property storage_path
Return type


abstract write(key, value)
Return type


class lhotse.features.io.FeaturesReader

FeaturesReader defines the interface of how to load numpy arrays from a particular storage backend. This backend could either be:

  • separate files on a local filesystem;

  • a single file with multiple arrays;

  • cloud storage;

  • etc.

Each class inheriting from FeaturesReader must define:

  • the read() method, which defines the loading operation (accepts the key to locate the array in the storage and returns it). The read method should support selecting only a subset of the feature matrix, with the bounds expressed as arguments left_offset_frames and right_offset_frames. It’s up to the Reader implementation to load only the required part or trim it to that range only after loading. It is assumed that the time dimension is always the first one.

  • the name() property that is unique to this particular storage mechanism - it is stored in the features manifests (metadata) and used to automatically deduce the backend when loading the features.

The features writing must be defined separately in a class inheriting from FeaturesWriter.
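
The frame-range selection can be illustrated with plain Python lists standing in for numpy arrays. The helper trim_frames is hypothetical, and treating right_offset_frames as an exclusive end index counted from the start of the matrix is an assumption of this sketch.

```python
from typing import List, Optional

Matrix = List[List[float]]  # first dimension is time (frames)


def trim_frames(
    features: Matrix,
    left_offset_frames: int = 0,
    right_offset_frames: Optional[int] = None,
) -> Matrix:
    """Hypothetical helper: keep only frames in [left, right) after loading."""
    # Slicing along the first dimension; None means "until the last frame".
    return features[left_offset_frames:right_offset_frames]


# Five frames with two feature bins each.
features = [[float(t), float(t) + 0.5] for t in range(5)]
window = trim_frames(features, left_offset_frames=1, right_offset_frames=3)
full = trim_frames(features)
```

A reader backed by a format that supports partial reads could apply the same bounds while loading instead of trimming afterwards.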

abstract property name
Return type


abstract read(key, left_offset_frames=0, right_offset_frames=None)
Return type


Return type



Decorator used to add a new FeaturesReader to Lhotse’s registry.


@register_reader
class MyFeatureReader(FeaturesReader):
    ...


Decorator used to add a new FeaturesWriter to Lhotse’s registry.


@register_writer
class MyFeatureWriter(FeaturesWriter):
    ...


Find a FeaturesReader sub-class that corresponds to the provided name and return its type.


reader_type = get_reader("lilcom_files")
reader = reader_type("/storage/features/")

Return type



Find a FeaturesWriter sub-class that corresponds to the provided name and return its type.


writer_type = get_writer("lilcom_files")
writer = writer_type("/storage/features/")
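
A registry of this kind is commonly built as a dictionary keyed by each class’s name attribute. The sketch below shows that general pattern only; it is not lhotse’s actual code, and READERS, register, lookup and ToyReader are hypothetical names.

```python
from typing import Dict, Type

# Hypothetical registry: maps a backend name to the class implementing it.
READERS: Dict[str, Type] = {}


def register(cls: Type) -> Type:
    """Class decorator that records the class under its `name` attribute."""
    READERS[cls.name] = cls
    return cls


def lookup(name: str) -> Type:
    """Return the registered class for `name`, or fail with a clear message."""
    if name not in READERS:
        raise KeyError(f"No reader registered under the name '{name}'.")
    return READERS[name]


@register
class ToyReader:
    name = "toy_files"

    def __init__(self, storage_path: str):
        self.storage_path = storage_path


reader_type = lookup("toy_files")
reader = reader_type("/storage/features/")
```

Storing the name in the features manifest is what allows the matching class to be found again at load time.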

Return type


class lhotse.features.io.LilcomFilesReader(storage_path, *args, **kwargs)

Reads Lilcom-compressed files from a directory on the local filesystem. storage_path corresponds to the directory path; storage_key for each utterance is the name of the file in that directory.

name = 'lilcom_files'
__init__(storage_path, *args, **kwargs)

Initialize self. See help(type(self)) for accurate signature.

read(key, left_offset_frames=0, right_offset_frames=None)
Return type


class lhotse.features.io.LilcomFilesWriter(storage_path, tick_power=-5, *args, **kwargs)

Writes Lilcom-compressed files to a directory on the local filesystem. storage_path corresponds to the directory path; storage_key for each utterance is the name of the file in that directory.

name = 'lilcom_files'
__init__(storage_path, tick_power=-5, *args, **kwargs)

Initialize self. See help(type(self)) for accurate signature.

property storage_path
Return type


write(key, value)
Return type


class lhotse.features.io.NumpyFilesReader(storage_path, *args, **kwargs)

Reads non-compressed numpy arrays from files in a directory on the local filesystem. storage_path corresponds to the directory path; storage_key for each utterance is the name of the file in that directory.

name = 'numpy_files'
__init__(storage_path, *args, **kwargs)

Initialize self. See help(type(self)) for accurate signature.

read(key, left_offset_frames=0, right_offset_frames=None)
Return type


class lhotse.features.io.NumpyFilesWriter(storage_path, *args, **kwargs)

Writes non-compressed numpy arrays to files in a directory on the local filesystem. storage_path corresponds to the directory path; storage_key for each utterance is the name of the file in that directory.

name = 'numpy_files'
__init__(storage_path, *args, **kwargs)

Initialize self. See help(type(self)) for accurate signature.

property storage_path
Return type


write(key, value)
Return type



Helper internal function used in HDF5 readers. It opens the HDF files and keeps their handles open in a global program cache to avoid an excessive number of syscalls when the *Reader class is instantiated and destroyed in a loop repeatedly (a frequent use case).

The file handles can be freed at any time by calling close_cached_file_handles().


Closes the cached file handles in lookup_cache_or_open (see its docs for more details).

Return type


class lhotse.features.io.NumpyHdf5Reader(storage_path, *args, **kwargs)

Reads non-compressed numpy arrays from a HDF5 file with a “flat” layout. Each array is stored as a separate HDF Dataset because their shapes (numbers of frames) may vary. storage_path corresponds to the HDF5 file path; storage_key for each utterance is the key corresponding to the array (i.e. HDF5 “Group” name).

name = 'numpy_hdf5'
__init__(storage_path, *args, **kwargs)

Initialize self. See help(type(self)) for accurate signature.

read(key, left_offset_frames=0, right_offset_frames=None)
Return type


class lhotse.features.io.NumpyHdf5Writer(storage_path, *args, **kwargs)

Writes non-compressed numpy arrays to a HDF5 file with a “flat” layout. Each array is stored as a separate HDF Dataset because their shapes (numbers of frames) may vary. storage_path corresponds to the HDF5 file path; storage_key for each utterance is the key corresponding to the array (i.e. HDF5 “Group” name).

Internally, this class opens the file lazily so that this object can be passed between processes without issues. This simplifies the parallel feature extraction code.

name = 'numpy_hdf5'
__init__(storage_path, *args, **kwargs)

Initialize self. See help(type(self)) for accurate signature.

property storage_path
Return type


write(key, value)
Return type


Return type


class lhotse.features.io.LilcomHdf5Reader(storage_path, *args, **kwargs)

Reads lilcom-compressed numpy arrays from a HDF5 file with a “flat” layout. Each array is stored as a separate HDF Dataset because their shapes (numbers of frames) may vary. storage_path corresponds to the HDF5 file path; storage_key for each utterance is the key corresponding to the array (i.e. HDF5 “Group” name).

name = 'lilcom_hdf5'
__init__(storage_path, *args, **kwargs)

Initialize self. See help(type(self)) for accurate signature.

read(key, left_offset_frames=0, right_offset_frames=None)
Return type


class lhotse.features.io.LilcomHdf5Writer(storage_path, tick_power=-5, *args, **kwargs)

Writes lilcom-compressed numpy arrays to a HDF5 file with a “flat” layout. Each array is stored as a separate HDF Dataset because their shapes (numbers of frames) may vary. storage_path corresponds to the HDF5 file path; storage_key for each utterance is the key corresponding to the array (i.e. HDF5 “Group” name).

name = 'lilcom_hdf5'
__init__(storage_path, tick_power=-5, *args, **kwargs)

Initialize self. See help(type(self)) for accurate signature.

property storage_path
Return type


write(key, value)
Return type


Return type


Feature-domain mixing

class lhotse.features.mixer.FeatureMixer(feature_extractor, base_feats, frame_shift, padding_value=-1000.0)

Utility class to mix multiple feature matrices into a single one. It should be instantiated separately for each mixing session (i.e. each MixedCut will create a separate FeatureMixer to mix its tracks). It is initialized with a numpy array of features (typically float32) that represents the “reference” signal for the mix. Other signals can be mixed to it with different time offsets and SNRs using the add_to_mix method. The time offset is relative to the start of the reference signal (only positive values are supported). The SNR is relative to the energy of the signal used to initialize the FeatureMixer.

It relies on the FeatureExtractor to have defined mix and compute_energy methods, so that the FeatureMixer knows how to scale and add two feature matrices together.

__init__(feature_extractor, base_feats, frame_shift, padding_value=-1000.0)
  • feature_extractor (FeatureExtractor) – The FeatureExtractor instance that specifies how to mix the features.

  • base_feats (ndarray) – The features used to initialize the FeatureMixer are a point of reference in terms of energy and offset for all features mixed into them.

  • frame_shift (float) – Required to correctly compute offset and padding during the mix.

  • padding_value (float) – The value used to pad the shorter features during the mix. This value is adequate only for log space features. For non-log space features, e.g. energies, use either 0 or a small positive value like 1e-5.

property num_features
property unmixed_feats

Return a numpy ndarray with the shape (num_tracks, num_frames, num_features), where each track’s feature matrix is padded and scaled adequately to the offsets and SNR used in add_to_mix call.

Return type


property mixed_feats

Return a numpy ndarray with the shape (num_frames, num_features) - a mono mixed feature matrix of the tracks supplied with add_to_mix calls.

Return type


add_to_mix(feats, sampling_rate, snr=None, offset=0.0)

Add feature matrix of a new track into the mix.

  • feats (ndarray) – A 2D feature matrix to be mixed in.

  • sampling_rate (int) – The sampling rate of feats.

  • snr (Optional[float]) – Signal-to-noise ratio, assuming feats represents noise (positive SNR - lower feats energy, negative SNR - higher feats energy).

  • offset (float) – How many seconds to shift feats in time. For mixing, the signal will be padded before the start with low energy values.
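
The role of offset and padding_value can be sketched with plain lists in place of numpy arrays. The function pad_for_mix is hypothetical, and converting the offset to frames by rounding offset / frame_shift is an assumption of this sketch rather than a statement about the FeatureMixer internals.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # shape: (num_frames, num_features)


def pad_for_mix(
    reference: Matrix,
    feats: Matrix,
    offset: float,
    frame_shift: float,
    padding_value: float = -1000.0,
) -> Tuple[Matrix, Matrix]:
    """Hypothetical helper: align two log-space matrices before mixing."""
    num_features = len(reference[0])
    # Assumption: the time offset maps to a whole number of frames.
    offset_frames = round(offset / frame_shift)

    def padding(num_rows: int) -> Matrix:
        return [[padding_value] * num_features for _ in range(num_rows)]

    # Pad the new track at the start so that it begins `offset` seconds in.
    shifted = padding(offset_frames) + [list(row) for row in feats]
    # Pad whichever matrix is shorter at the end so both have equal length.
    total = max(len(reference), len(shifted))
    padded_reference = [list(row) for row in reference] + padding(total - len(reference))
    padded_shifted = shifted + padding(total - len(shifted))
    return padded_reference, padded_shifted


reference = [[-1.0, -2.0] for _ in range(4)]
track = [[-3.0, -4.0] for _ in range(3)]
ref_padded, track_padded = pad_for_mix(reference, track, offset=0.02, frame_shift=0.01)
```

Here the new track starts two frames into the reference, so both matrices end up five frames long.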

exception lhotse.features.mixer.NonPositiveEnergyError



Data structures and tools used to create training/testing examples.

class lhotse.cut.CutUtilsMixin

A mixin class for cuts which contains all the methods that share common implementations.

Note: Ideally, this would’ve been an abstract base class specifying the common interface, but ABCs do not mix well with dataclasses in Python. It is possible we’ll ditch the dataclass for cuts in the future and make this an ABC instead.

property trimmed_supervisions

Return the supervisions in this Cut that have modified time boundaries so as not to exceed the Cut’s start or end.

Note that when cut.supervisions is called, the supervisions may have negative start values that indicate the supervision actually begins before the cut, or end values that exceed the Cut’s duration (it means the supervision continued in the original recording after the Cut’s ending).
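
The trimming described above amounts to clamping each supervision to the interval [0, cut duration]. The sketch below uses a hypothetical Segment dataclass as a stand-in for a supervision; it is not the lhotse SupervisionSegment class.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Segment:
    """Stand-in for a supervision; `start` is relative to the cut's start."""

    start: float
    duration: float

    @property
    def end(self) -> float:
        return self.start + self.duration


def trim_to_cut(segment: Segment, cut_duration: float) -> Segment:
    """Clamp the segment so that it lies within [0, cut_duration]."""
    start = max(0.0, segment.start)
    end = min(cut_duration, segment.end)
    return replace(segment, start=start, duration=end - start)


# Begins 1.5 s before the cut and ends 2.5 s into it.
early = trim_to_cut(Segment(start=-1.5, duration=4.0), cut_duration=5.0)
# Begins 3 s into the cut and continues 2 s past its end.
late = trim_to_cut(Segment(start=3.0, duration=4.0), cut_duration=5.0)
```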

Return type


mix(other, offset_other_by=0.0, snr=None)

Refer to mix() documentation.

Return type


append(other, snr=None)

Append the other Cut after the current Cut. Conceptually the same as mix but with an offset matching the current cut’s length. Optionally scale down (positive SNR) or scale up (negative SNR) the other cut. Returns a MixedCut, which only keeps the information about the mix; actual mixing is performed during the call to load_features.

Return type


compute_features(extractor, augment_fn=None)

Compute the features from this cut. This cut has to be able to load audio.

  • extractor (FeatureExtractor) – a FeatureExtractor instance used to compute the features.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – optional WavAugmenter instance for audio augmentation.

Return type



a numpy ndarray with the computed features.


Display a plot of the waveform. Requires matplotlib to be installed.


Display a Jupyter widget that lets you listen to the waveform. Works only in Jupyter notebook/lab or similar (e.g. Colab).


Display the feature matrix as an image. Requires matplotlib to be installed.

speakers_feature_mask(min_speaker_dim=None, speaker_to_idx_map=None)

Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_frames), and its values are 0 for nonspeech frames and 1 for speech frames for each respective speaker.

This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272

  • min_speaker_dim (Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).

  • speaker_to_idx_map (Optional[Dict[str, int]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2)
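
How such a mask might be assembled is sketched below with nested lists instead of a numpy array. The function speaker_activity_mask, the (speaker, start, end) tuples, the rounding of times to frame indices, and the alphabetical default speaker order are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

Supervision = Tuple[str, float, float]  # (speaker, start in s, end in s)


def speaker_activity_mask(
    supervisions: List[Supervision],
    num_frames: int,
    frame_shift: float,
    min_speaker_dim: Optional[int] = None,
    speaker_to_idx_map: Optional[Dict[str, int]] = None,
) -> List[List[int]]:
    """Hypothetical helper returning a (num_speakers, num_frames) 0/1 mask."""
    if speaker_to_idx_map is None:
        # Assumption: without a map, speakers are indexed alphabetically.
        speakers = sorted({speaker for speaker, _, _ in supervisions})
        speaker_to_idx_map = {spk: idx for idx, spk in enumerate(speakers)}
    num_speakers = max(speaker_to_idx_map.values(), default=-1) + 1
    if min_speaker_dim is not None:
        num_speakers = max(num_speakers, min_speaker_dim)

    mask = [[0] * num_frames for _ in range(num_speakers)]
    for speaker, start, end in supervisions:
        # Clamp to the cut, since supervisions may extend beyond it.
        first = max(0, round(start / frame_shift))
        last = min(num_frames, round(end / frame_shift))
        for frame in range(first, last):
            mask[speaker_to_idx_map[speaker]][frame] = 1
    return mask


mask = speaker_activity_mask(
    [("alice", 0.0, 0.03), ("bob", 0.02, 0.05)],
    num_frames=5,
    frame_shift=0.01,
    min_speaker_dim=4,
)
```

With min_speaker_dim=4 the result has four rows even though only two speakers are present; the extra rows stay all-zero.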

Return type


speakers_audio_mask(min_speaker_dim=None, speaker_to_idx_map=None)

Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_samples), and its values are 0 for nonspeech samples and 1 for speech samples for each respective speaker.

This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272

  • min_speaker_dim (Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).

  • speaker_to_idx_map (Optional[Dict[str, int]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2)

Return type



Return a 1D numpy array with value 1 for frames covered by at least one supervision, and 0 for frames not covered by any supervision.

Return type



Return a 1D numpy array with value 1 for samples covered by at least one supervision, and 0 for samples not covered by any supervision.

Return type



Return a copy of the Cut with a new ID.

Return type

Union[Cut, MixedCut, PaddingCut]

class lhotse.cut.Cut(id: str, start: float, duration: float, channel: int, supervisions: List[lhotse.supervision.SupervisionSegment] = <factory>, features: Optional[lhotse.features.base.Features] = None, recording: Optional[lhotse.audio.Recording] = None)

A Cut is a single “segment” that we’ll train on. It contains the features corresponding to a piece of a recording, with zero or more SupervisionSegments.

The SupervisionSegments indicate which time spans of the Cut contain some kind of supervision information: e.g. transcript, speaker, language, etc. The regions without a corresponding SupervisionSegment may contain anything - usually we assume it’s either silence or some kind of noise.

Note: The SupervisionSegment time boundaries are relative to the beginning of the cut. E.g. if the underlying Recording starts at 0s (always true), the Cut starts at 100s, and the SupervisionSegment starts at 3s, it means that in the Recording the supervision actually started at 103s. In some cases, the supervision might have a negative start, or a duration exceeding the duration of the Cut; this means that the supervision in the recording extends beyond the Cut.
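
The time arithmetic in the note above can be written out as a tiny worked example. The function name is hypothetical; it only adds the cut’s start to the cut-relative supervision start.

```python
def supervision_start_in_recording(cut_start: float, supervision_start: float) -> float:
    """Convert a cut-relative supervision start to recording time (seconds)."""
    return cut_start + supervision_start


# The cut starts at 100 s and the supervision starts 3 s into the cut.
print(supervision_start_in_recording(100.0, 3.0))  # 103.0

# A negative start means the supervision began before the cut.
print(supervision_start_in_recording(100.0, -2.0))  # 98.0
```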

id: str
start: Seconds
duration: Seconds
channel: int
supervisions: List[SupervisionSegment]
features: Optional[lhotse.features.base.Features] = None
recording: Optional[lhotse.audio.Recording] = None
property recording_id
Return type


property end
Return type


property has_features
Return type


property has_recording
Return type


property frame_shift
Return type


property num_frames
Return type


property num_samples
Return type


property num_features
Return type


property features_type
Return type


property sampling_rate
Return type



Load the features from the underlying storage and cut them to the relevant [begin, duration] region of the current Cut.

Return type



Load the audio by locating the appropriate recording in the supplied RecordingSet. The audio is trimmed to the [begin, end] range specified by the Cut.

Return type



a numpy ndarray with audio samples, with shape (1 <channel>, N <samples>)


Return a copy of the current Cut, detached from features.

Return type


compute_and_store_features(extractor, storage, augment_fn=None, *args, **kwargs)

Compute the features from this cut, store them on disk, and attach a feature manifest to this cut. This cut has to be able to load audio.

  • extractor (FeatureExtractor) – a FeatureExtractor instance used to compute the features.

  • storage – the FeaturesWriter instance used to store the computed features.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional callable used for audio augmentation.

Return type

Union[Cut, MixedCut, PaddingCut]


a new Cut instance with a Features manifest attached to it.

truncate(*, offset=0.0, duration=None, keep_excessive_supervisions=True, preserve_id=False, _supervisions_index=None)

Returns a new Cut that is a sub-region of the current Cut.

Note that no operation is done on the actual features - it’s only during the call to load_features() when the actual changes happen (a subset of features is loaded).

  • offset (float) – float (seconds), controls the start of the new cut relative to the current Cut’s start. E.g., if the current Cut starts at 10.0, and offset is 2.0, the new start is 12.0.

  • duration (Optional[float]) – optional float (seconds), controls the duration of the resulting Cut. By default, the duration is (end of the cut before truncation) - (offset).

  • keep_excessive_supervisions (bool) – bool. Since trimming may happen inside a SupervisionSegment, the caller has an option to either keep or discard such supervisions.

  • preserve_id (bool) – bool. Should the truncated cut keep the same ID or get a new, random one.

  • _supervisions_index (Optional[Dict[str, IntervalTree]]) – when passed, speeds up processing of Cuts with a very large number of supervisions. Intended as an internal parameter.

Return type



a new Cut instance. If the current Cut is shorter than the duration, return None.
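
The effect of offset and duration on the time bounds can be sketched as follows. The helper truncated_bounds is hypothetical and covers only the arithmetic stated in the parameter descriptions, not the handling of supervisions or IDs.

```python
from typing import Optional, Tuple


def truncated_bounds(
    cut_start: float,
    cut_duration: float,
    offset: float = 0.0,
    duration: Optional[float] = None,
) -> Tuple[float, float]:
    """Return (new_start, new_duration) of the truncated region, in seconds."""
    new_start = cut_start + offset
    if duration is None:
        # Default: keep everything from the offset to the original end.
        new_duration = cut_duration - offset
    else:
        new_duration = duration
    return new_start, new_duration


# A cut starting at 10.0 s and lasting 8.0 s, truncated with offset 2.0.
print(truncated_bounds(10.0, 8.0, offset=2.0))  # (12.0, 6.0)
print(truncated_bounds(10.0, 8.0, offset=2.0, duration=3.0))  # (12.0, 3.0)
```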

pad(duration=None, num_frames=None, num_samples=None, pad_feat_value=-23.025850929940457, direction='right')

Return a new MixedCut, padded with zeros in the recording, and pad_feat_value in each feature bin.

The user can choose to pad either to a specific duration; a specific number of frames num_frames; or a specific number of samples num_samples. The three arguments are mutually exclusive.

  • cut – Cut to be padded.

  • duration (Optional[float]) – The cut’s minimal duration after padding.

  • num_frames (Optional[int]) – The cut’s total number of frames after padding.

  • num_samples (Optional[int]) – The cut’s total number of samples after padding.

  • pad_feat_value (float) – A float value that’s used for padding the features. By default we assume a log-energy floor of approx. -23 (1e-10 after exp).

  • direction (str) – string, ‘left’, ‘right’ or ‘both’. Determines whether the padding is added before or after the cut.

Return type

Union[Cut, MixedCut, PaddingCut]


a padded MixedCut if duration is greater than this cut’s duration, otherwise self.
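
Two details of pad() can be checked with the standard library alone: the default pad_feat_value is the natural logarithm of 1e-10, and the three size arguments must not be combined. The function check_pad_args is a hypothetical illustration of the mutual-exclusivity rule, not the lhotse implementation.

```python
import math
from typing import Optional

# The default padding value corresponds to a log-energy floor of 1e-10.
pad_feat_value = math.log(1e-10)
print(pad_feat_value)


def check_pad_args(
    duration: Optional[float] = None,
    num_frames: Optional[int] = None,
    num_samples: Optional[int] = None,
) -> str:
    """Hypothetical check: return the name of the single argument in use."""
    given = {
        "duration": duration,
        "num_frames": num_frames,
        "num_samples": num_samples,
    }
    used = [name for name, value in given.items() if value is not None]
    if len(used) > 1:
        raise ValueError(f"Mutually exclusive arguments given together: {used}")
    if not used:
        raise ValueError("One of duration, num_frames, num_samples is required.")
    return used[0]


chosen = check_pad_args(num_frames=300)
```

Treating a call with none of the three arguments as an error is an assumption of this sketch.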

perturb_speed(factor, affix_id=True)

Return a new Cut that will lazily perturb the speed while loading audio. The num_samples, start and duration fields are updated to reflect the shrinking/extending effect of speed. We are also updating the time markers of the underlying Recording and the supervisions.

  • factor (float) – The speed will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • affix_id (bool) – When true, we will modify the Cut.id field by affixing it with “_sp{factor}”.

Return type



a modified copy of the current Cut.
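
The bookkeeping implied by speed perturbation can be sketched as below. Dividing the time markers by factor follows from factor=1.1 meaning 1.1x faster; the helper names are hypothetical, and the exact rounding lhotse applies to num_samples is not shown here.

```python
from typing import Tuple


def perturbed_timing(start: float, duration: float, factor: float) -> Tuple[float, float]:
    """Scale start and duration (seconds) for playback `factor` times faster."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return start / factor, duration / factor


def perturbed_id(cut_id: str, factor: float) -> str:
    """Affix the ID with "_sp{factor}", as described for affix_id=True."""
    return f"{cut_id}_sp{factor}"


# Twice as fast: the cut starts earlier and lasts half as long.
new_start, new_duration = perturbed_timing(start=2.0, duration=10.0, factor=2.0)
new_id = perturbed_id("cut-1", 1.1)
```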


Modify the SupervisionSegments by transform_fn of this Cut.


transform_fn (Callable[[SupervisionSegment], SupervisionSegment]) – a function that takes a supervision as an argument and returns the modified supervision.

Return type

Union[Cut, MixedCut, PaddingCut]


a modified Cut.


Modify cut to store only supervisions accepted by predicate

>>> cut = cut.filter_supervisions(lambda s: s.id in supervision_ids)
>>> cut = cut.filter_supervisions(lambda s: s.duration < 5.0)
>>> cut = cut.filter_supervisions(lambda s: s.text is not None)

predicate (Callable[[SupervisionSegment], bool]) – A callable that accepts SupervisionSegment and returns bool

Return type

Union[Cut, MixedCut, PaddingCut]


a modified Cut

static from_dict(data)
Return type


Return type


Return type


__init__(id, start, duration, channel, supervisions=<factory>, features=None, recording=None)

Initialize self. See help(type(self)) for accurate signature.

class lhotse.cut.PaddingCut(id: str, duration: float, sampling_rate: int, feat_value: float, num_frames: Optional[int] = None, num_features: Optional[int] = None, frame_shift: Optional[float] = None, num_samples: Optional[int] = None)

Represents a cut filled with zeroes in the time domain, or some specified value in the frequency domain. It’s used to make training samples evenly sized (same duration/number of frames).

id: str
duration: Seconds
sampling_rate: int
feat_value: float
num_frames: Optional[int] = None
num_features: Optional[int] = None
frame_shift: Optional[float] = None
num_samples: Optional[int] = None
property start
Return type


property end
Return type


property supervisions
property has_features
Return type


property has_recording
Return type


load_features(*args, **kwargs)
Return type


load_audio(*args, **kwargs)
Return type


truncate(*, offset=0.0, duration=None, keep_excessive_supervisions=True, preserve_id=False, **kwargs)
Return type


pad(duration=None, num_frames=None, num_samples=None, pad_feat_value=-23.025850929940457, direction='right')

Return a new MixedCut, padded with zeros in the recording, and pad_feat_value in each feature bin.

The user can choose to pad either to a specific duration; a specific number of frames num_frames; or a specific number of samples num_samples. The three arguments are mutually exclusive.

  • duration (Optional[float]) – The cut’s minimal duration after padding.

  • num_frames (Optional[int]) – The cut’s total number of frames after padding.

  • num_samples (Optional[int]) – The cut’s total number of samples after padding.

  • pad_feat_value (float) – A float value that’s used for padding the features. By default we assume a log-energy floor of approx. -23 (1e-10 after exp).

  • direction (str) – string, ‘left’, ‘right’ or ‘both’. Determines whether the padding is added before or after the cut.

Return type

Union[Cut, MixedCut, PaddingCut]


a padded MixedCut if duration is greater than this cut’s duration, otherwise self.

perturb_speed(factor, affix_id=True)

Return a new PaddingCut that will “mimic” the effect of speed perturbation on duration and num_samples.

  • factor (float) – The speed will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • affix_id (bool) – When true, we will modify the PaddingCut.id field by affixing it with “_sp{factor}”.

Return type



a modified copy of the current PaddingCut.


Return a copy of the current PaddingCut, detached from features.

Return type


compute_and_store_features(extractor, *args, **kwargs)

Returns a new PaddingCut with updated information about the feature dimension and number of feature frames, depending on the extractor properties.

Return type

Union[Cut, MixedCut, PaddingCut]


Just for consistency with Cut and MixedCut.


transform_fn (Callable[[Any], Any]) – a dummy function that is never actually called.

Return type

Union[Cut, MixedCut, PaddingCut]


the PaddingCut itself.


Just for consistency with Cut and MixedCut.


predicate (Callable[[SupervisionSegment], bool]) – A callable that accepts SupervisionSegment and returns bool

Return type

Union[Cut, MixedCut, PaddingCut]


a modified Cut

static from_dict(data)
Return type


Return type


Return type


__init__(id, duration, sampling_rate, feat_value, num_frames=None, num_features=None, frame_shift=None, num_samples=None)

Initialize self. See help(type(self)) for accurate signature.

class lhotse.cut.MixTrack(cut: Union[lhotse.cut.Cut, lhotse.cut.PaddingCut], offset: float = 0.0, snr: Optional[float] = None)

Represents a single track in a mix of Cuts. Points to a specific Cut and holds information on how to mix it with other Cuts, relative to the first track in a mix.

cut: Union[Cut, PaddingCut]
offset: float = 0.0
snr: Optional[float] = None
static from_dict(data)
__init__(cut, offset=0.0, snr=None)

Initialize self. See help(type(self)) for accurate signature.

class lhotse.cut.MixedCut(id: str, tracks: List[lhotse.cut.MixTrack])

Represents a Cut that’s created from other Cuts via mix or append operations. The actual mixing operations are performed upon loading the features into memory. In order to load the features, it needs to access the CutSet object that holds the “ingredient” cuts, as it only holds their IDs (“pointers”). The SNR and offset of all the tracks are specified relative to the first track.

id: str
tracks: List[MixTrack]
property supervisions

Lists the supervisions of the underlying source cuts. Each segment start time will be adjusted by the track offset.

Return type


property start
Return type


property end
Return type


property duration
Return type


property has_features
Return type


property has_recording
Return type


property num_frames
Return type


property frame_shift
Return type


property sampling_rate
Return type


property num_samples
Return type


property num_features
Return type


property features_type
Return type


truncate(*, offset=0.0, duration=None, keep_excessive_supervisions=True, preserve_id=False, _supervisions_index=None)

Returns a new MixedCut that is a sub-region of the current MixedCut. This method truncates the underlying Cuts and modifies their offsets in the mix, as needed. Tracks that do not fit in the truncated cut are removed.

Note that no operation is done on the actual features - it’s only during the call to load_features() when the actual changes happen (a subset of features is loaded).

  • offset (float) – float (seconds), controls the start of the new cut relative to the current MixedCut’s start.

  • duration (Optional[float]) – optional float (seconds), controls the duration of the resulting MixedCut. By default, the duration is (end of the cut before truncation) - (offset).

  • keep_excessive_supervisions (bool) – bool. Since trimming may happen inside a SupervisionSegment, the caller has an option to either keep or discard such supervisions.

  • preserve_id (bool) – bool. Should the truncated cut keep the same ID or get a new, random one.

Return type

Union[Cut, MixedCut, PaddingCut]


a new MixedCut instance.

pad(duration=None, num_frames=None, num_samples=None, pad_feat_value=-23.025850929940457, direction='right')

Return a new MixedCut, padded with zeros in the recording, and pad_feat_value in each feature bin.

The user can choose to pad either to a specific duration; a specific number of frames num_frames; or a specific number of samples num_samples. The three arguments are mutually exclusive.

  • duration (Optional[float]) – The cut’s minimal duration after padding.

  • num_frames (Optional[int]) – The cut’s total number of frames after padding.

  • num_samples (Optional[int]) – The cut’s total number of samples after padding.

  • pad_feat_value (float) – A float value that’s used for padding the features. By default we assume a log-energy floor of approx. -23 (1e-10 after exp).

  • direction (str) – string, ‘left’, ‘right’ or ‘both’. Determines whether the padding is added before or after the cut.

Return type

Union[Cut, MixedCut, PaddingCut]


a padded MixedCut if duration is greater than this cut’s duration, otherwise self.

perturb_speed(factor, affix_id=True)

Return a new MixedCut that will lazily perturb the speed while loading audio. The num_samples, start and duration fields of the underlying Cuts (and their Recordings and SupervisionSegments) are updated to reflect the shrinking/extending effect of speed. We are also updating the offsets of all underlying tracks.

  • factor (float) – The speed will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • affix_id (bool) – When true, we will modify the MixedCut.id field by affixing it with “_sp{factor}”.

Return type



a modified copy of the current MixedCut.


Loads the features of the source cuts and mixes them on-the-fly.


mixed (bool) – when True (default), returns a 2D array of features mixed in the feature domain. Otherwise returns a 3D array with the first dimension equal to the number of tracks.

Return type



A numpy ndarray with features and with shape (num_frames, num_features), or (num_tracks, num_frames, num_features)


Loads the audio of the source cuts and mixes it on-the-fly.


mixed (bool) – When True (default), returns a mono mix of the underlying tracks. Otherwise returns a numpy array with the number of channels equal to the number of tracks.

Return type



A numpy ndarray with audio samples and with shape (num_channels, num_samples)


Display the feature matrix as an image. Requires matplotlib to be installed.


Display plots of the individual tracks’ waveforms. Requires matplotlib to be installed.


Return a copy of the current MixedCut, detached from features.

Return type


compute_and_store_features(extractor, storage, augment_fn=None, mix_eagerly=True)

Compute the features from this cut, store them on disk, and create a new Cut object with the feature manifest attached. This cut has to be able to load audio.

  • extractor (FeatureExtractor) – a FeatureExtractor instance used to compute the features.

  • storage (FeaturesWriter) – a FeaturesWriter instance used to store the features.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional callable used for audio augmentation.

  • mix_eagerly (bool) – when False, extract and store the features for each track separately, and mix them dynamically when loading the features. When True, mix the audio first and store the mixed features, returning a new Cut instance with the same ID. The returned Cut will not have a Recording attached.

Return type

Union[Cut, MixedCut, PaddingCut]


a new Cut instance if mix_eagerly is True, or returns self with each of the tracks containing the Features manifests.


Modify the SupervisionSegments by transform_fn of this MixedCut.


transform_fn (Callable[[SupervisionSegment], SupervisionSegment]) – a function that takes a supervision as an argument and returns the modified supervision.

Return type

Union[Cut, MixedCut, PaddingCut]


a modified MixedCut.


Modify cut to store only supervisions accepted by predicate

>>> cut = cut.filter_supervisions(lambda s: s.id in supervision_ids)
>>> cut = cut.filter_supervisions(lambda s: s.duration < 5.0)
>>> cut = cut.filter_supervisions(lambda s: s.text is not None)

predicate (Callable[[SupervisionSegment], bool]) – A callable that accepts SupervisionSegment and returns bool

Return type

Union[Cut, MixedCut, PaddingCut]


a modified Cut

static from_dict(data)
Return type


__init__(id, tracks)

Initialize self. See help(type(self)) for accurate signature.

class lhotse.cut.CutSet(*args, **kwds)

CutSet combines features with their corresponding supervisions. It may have wider span than the actual supervisions, provided the features for the whole span exist. It is the basic building block of PyTorch-style Datasets for speech/audio processing tasks.

cuts: Dict[str, AnyCut]
property mixed_cuts
Return type

Dict[str, MixedCut]

property simple_cuts
Return type

Dict[str, Cut]

property ids
Return type


property speakers
Return type


static from_cuts(cuts)
Return type


static from_manifests(recordings=None, supervisions=None, features=None, random_ids=False)

Create a CutSet from any combination of supervision, feature and recording manifests. At least one of recordings or features is required. The Cut boundaries correspond to those found in features, when available, otherwise to those found in recordings. When supervisions is provided, we’ll attach to the Cut all supervisions that have a matching recording ID and are fully contained in the Cut’s boundaries.

  • recordings (Optional[RecordingSet]) – a RecordingSet manifest.

  • supervisions (Optional[SupervisionSet]) – a SupervisionSet manifest.

  • features (Optional[FeatureSet]) – a FeatureSet manifest.

  • random_ids (bool) – whether the cut IDs should be randomized. By default, use the recording ID with a loop index and a channel index, i.e. “{recording_id}-{idx}-{channel}”.

Return type



a new CutSet instance.
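The matching rule described above (same recording ID, fully contained in the cut's boundaries) and the default ID pattern can be sketched as follows. The tuple layouts and the attach_supervisions helper are hypothetical and only illustrate the rule; this is not how lhotse stores manifests.

```python
from typing import Dict, List, Tuple

Segment = Tuple[str, float, float]      # (recording_id, start, end), hypothetical layout
CutSpan = Tuple[str, int, float, float]  # (recording_id, channel, start, end)


def attach_supervisions(cuts: List[CutSpan], supervisions: List[Segment]) -> Dict[str, List[Segment]]:
    attached = {}
    for idx, (rec_id, channel, start, end) in enumerate(cuts):
        cut_id = f"{rec_id}-{idx}-{channel}"  # default ID pattern quoted above
        attached[cut_id] = [
            s for s in supervisions
            # Keep only supervisions of the same recording that fit entirely inside the cut.
            if s[0] == rec_id and start <= s[1] and s[2] <= end
        ]
    return attached


result = attach_supervisions(
    cuts=[("rec1", 0, 0.0, 10.0)],
    supervisions=[("rec1", 1.0, 4.0), ("rec1", 8.0, 12.0), ("rec2", 1.0, 2.0)],
)
```

Here the second supervision is dropped because it extends past the cut's end, and the third because it belongs to a different recording.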

static from_dicts(data)
Return type


Return type



Print a message describing details about the CutSet - the number of cuts and the duration statistics, including the total duration and the percentage of speech segments.

Example output:

Cuts count: 547
Total duration (hours): 326.4
Speech duration (hours): 79.6 (24.4%)
***
Duration statistics (seconds):
mean    2148.0
std      870.9
min      477.0
25%     1523.0
50%     2157.0
75%     2423.0
max     5415.0
dtype: float64

Return type


split(num_splits, shuffle=False)

Split the CutSet into num_splits pieces of equal size.

  • num_splits (int) – Requested number of splits.

  • shuffle (bool) – Optionally shuffle the cuts order first.

Return type



A list of CutSet pieces.
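A minimal sketch of an equal-size split over a list of cut IDs is shown below. The split_evenly helper is hypothetical; in particular, how lhotse distributes the remainder when the number of cuts is not divisible by num_splits is not stated here, so the ceiling-sized chunks are an assumption.

```python
import math
import random
from typing import List, Sequence


def split_evenly(items: Sequence, num_splits: int, shuffle: bool = False, seed: int = 0) -> List[list]:
    items = list(items)
    if shuffle:
        # A fixed seed is used only to keep this sketch reproducible.
        random.Random(seed).shuffle(items)
    chunk = math.ceil(len(items) / num_splits)  # assumption: remainder goes to the earlier pieces
    return [items[i * chunk:(i + 1) * chunk] for i in range(num_splits)]


pieces = split_evenly(["c1", "c2", "c3", "c4", "c5", "c6"], num_splits=3)
shuffled = split_evenly(["c1", "c2", "c3", "c4", "c5", "c6"], num_splits=3, shuffle=True)
```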

subset(*, supervision_ids=None, cut_ids=None, first=None, last=None)

Return a new CutSet according to the selected subset criterion. Only a single argument to subset is supported at this time.

>>> cuts = CutSet.from_yaml('path/to/cuts')
>>> train_set = cuts.subset(supervision_ids=train_ids)
>>> test_set = cuts.subset(supervision_ids=test_ids)
  • supervision_ids (Optional[Iterable[str]]) – List of supervision IDs to keep.

  • cut_ids (Optional[Iterable[str]]) – List of cut IDs to keep.

  • first (Optional[int]) – int, the number of first cuts to keep.

  • last (Optional[int]) – int, the number of last cuts to keep.

Return type



a new CutSet with the subset results.
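The "only a single argument" rule can be sketched as below for three of the criteria, operating on an ordered list of cut IDs. The toy_subset function is a hypothetical illustration; whether lhotse raises an error or behaves differently when several criteria are passed is an assumption of this sketch.

```python
from typing import Iterable, List, Optional


def toy_subset(
    ordered_ids: List[str],
    *,
    cut_ids: Optional[Iterable[str]] = None,
    first: Optional[int] = None,
    last: Optional[int] = None,
) -> List[str]:
    given = [arg for arg in (cut_ids, first, last) if arg is not None]
    if len(given) != 1:
        # Assumption: reject zero or multiple criteria outright.
        raise ValueError("Exactly one selection criterion is supported.")
    if first is not None:
        return ordered_ids[:first]
    if last is not None:
        return ordered_ids[max(0, len(ordered_ids) - last):]
    keep = set(cut_ids)
    return [cid for cid in ordered_ids if cid in keep]


ids = ["a", "b", "c", "d"]
head = toy_subset(ids, first=2)
tail = toy_subset(ids, last=1)
picked = toy_subset(ids, cut_ids=["d", "b"])
```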


Return a new CutSet with Cuts containing only the SupervisionSegments that satisfy predicate.

Cuts without supervisions are preserved.

>>> cuts = CutSet.from_yaml('path/to/cuts')
>>> at_least_five_second_supervisions = cuts.filter_supervisions(lambda s: s.duration >= 5)

predicate (Callable[[SupervisionSegment], bool]) – A callable that accepts SupervisionSegment and returns bool

Return type



a CutSet with filtered supervisions
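The behavior described above, where every cut survives and only its supervisions are filtered, can be sketched with a plain dictionary. The data layout and the toy_filter_supervisions helper are hypothetical and stand in for real Cut and SupervisionSegment objects.

```python
from typing import Callable, Dict, List

Supervision = Dict[str, float]  # hypothetical layout, e.g. {"duration": 6.0}


def toy_filter_supervisions(
    cuts: Dict[str, List[Supervision]],
    predicate: Callable[[Supervision], bool],
) -> Dict[str, List[Supervision]]:
    # Every cut is kept; only its list of supervisions is filtered.
    return {cut_id: [s for s in sups if predicate(s)] for cut_id, sups in cuts.items()}


cuts = {"a": [{"duration": 6.0}, {"duration": 2.0}], "b": []}
filtered = toy_filter_supervisions(cuts, lambda s: s["duration"] >= 5)
```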


Return a new CutSet with the Cuts that satisfy the predicate.


predicate (Callable[[Union[Cut, MixedCut, PaddingCut]], bool]) – a function that takes a cut as an argument and returns bool.

Return type



a filtered CutSet.


Return a new CutSet with Cuts that have identical spans as their supervisions.

Return type



a CutSet.


Return a new CutSet with Cuts created from segments that have no supervisions (likely silence or noise).

Return type



a CutSet.


Find cuts that come from the same recording and have matching start and end times, but represent different channels. Then, mix them together (in matching groups) and return a new CutSet that contains their mixes. This is useful for processing microphone array recordings.

It is intended to be used as the first operation after creating a new CutSet (but might also work in other circumstances, e.g. if it was cut to windows first).

>>> ami = prepare_ami('path/to/ami')
>>> cut_set = CutSet.from_manifests(recordings=ami['train']['recordings'])
>>> multi_channel_cut_set = cut_set.mix_same_recording_channels()

In the AMI example, the multi_channel_cut_set will yield MixedCuts that hold all single-channel Cuts together.
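The grouping step can be pictured as collecting cuts under a (recording ID, start, end) key, with each group later becoming one mix. The tuple layout and group_same_recording_channels are hypothetical and only illustrate the matching criterion described above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

ToyCut = Tuple[str, str, int, float, float]  # (cut_id, recording_id, channel, start, end)


def group_same_recording_channels(cuts: List[ToyCut]) -> Dict[Tuple[str, float, float], List[str]]:
    groups = defaultdict(list)
    for cut_id, recording_id, _channel, start, end in cuts:
        # Cuts that share a recording and a time span differ only by channel.
        groups[(recording_id, start, end)].append(cut_id)
    return dict(groups)


groups = group_same_recording_channels([
    ("c0", "meeting1", 0, 0.0, 30.0),
    ("c1", "meeting1", 1, 0.0, 30.0),
    ("c2", "meeting2", 0, 0.0, 30.0),
])
```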

Return type



Sort the CutSet according to cut duration and return the result. Descending by default.

Return type



Sort the CutSet according to the order of cut IDs in other and return the result.

Return type



Create a two-level index of supervision segments. It is a mapping from a Cut’s ID to an interval tree that contains the supervisions of that Cut.

The interval tree can be efficiently queried for overlapping and/or enveloping segments. It helps speed up some operations on Cuts of very long recordings (1h+) that contain many supervisions.


index_mixed_tracks (bool) – whether the tracks of MixedCuts should be indexed as additional, separate entries.

Return type

Dict[str, IntervalTree]


a mapping from Cut ID to an interval tree of SupervisionSegments.

pad(duration=None, num_frames=None, num_samples=None, pad_feat_value=-23.025850929940457, direction='right')

Return a new CutSet with Cuts padded to duration, num_frames or num_samples. Cuts longer than the specified argument will not be affected. By default, cuts will be padded to the right (i.e. after the signal).

When none of duration, num_frames, or num_samples is specified, we’ll try to determine the best way to pad to the longest cut based on whether features or recordings are available.

  • duration (Optional[float]) – The cut’s minimal duration after padding. When not specified, we’ll choose the duration of the longest cut in the CutSet.

  • num_frames (Optional[int]) – The cut’s total number of frames after padding.

  • num_samples (Optional[int]) – The cut’s total number of samples after padding.

  • pad_feat_value (float) – A float value that’s used for padding the features. By default we assume a log-energy floor of approx. -23 (1e-10 after exp).

  • direction (str) – string, ‘left’, ‘right’ or ‘both’. Determines whether the padding is added before or after the cut.

Return type



A padded CutSet.
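The sketch below pads a one-dimensional list of frame values to illustrate the direction argument and the default padding value, which equals the natural logarithm of 1e-10. The pad_frames helper is hypothetical, and splitting the padding evenly for direction ‘both’ is an assumption of this sketch.

```python
import math
from typing import List

LOG_ENERGY_FLOOR = math.log(1e-10)  # approx. -23.0259, matches the default pad_feat_value


def pad_frames(frames: List[float], num_frames: int, direction: str = "right",
               pad_value: float = LOG_ENERGY_FLOOR) -> List[float]:
    missing = num_frames - len(frames)
    if missing <= 0:
        return list(frames)  # inputs that are already long enough are left untouched
    if direction == "right":
        left = 0
    elif direction == "left":
        left = missing
    elif direction == "both":
        left = missing // 2  # even split is an assumption of this sketch
    else:
        raise ValueError(f"Unknown direction: {direction}")
    return [pad_value] * left + list(frames) + [pad_value] * (missing - left)


right_padded = pad_frames([1.0, 2.0], num_frames=4, pad_value=0.0)
left_padded = pad_frames([1.0, 2.0], num_frames=4, direction="left", pad_value=0.0)
```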

truncate(max_duration, offset_type, keep_excessive_supervisions=True, preserve_id=False)

Return a new CutSet with the Cuts truncated so that their durations are at most max_duration. Cuts shorter than max_duration will not be changed.

  • max_duration (float) – float, the maximum duration in seconds of a cut in the resulting manifest.

  • offset_type (str) – str, can be: ‘start’ => cuts are truncated from their start; ‘end’ => cuts are truncated from their end minus max_duration; ‘random’ => cuts are truncated randomly between their start and their end minus max_duration.

  • keep_excessive_supervisions (bool) – bool. When a cut is truncated in the middle of a supervision segment, should the supervision be kept.

  • preserve_id (bool) – bool. Should the truncated cut keep the same ID or get a new, random one.

Return type

CutSet

a new CutSet instance with truncated cuts.
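The three offset_type choices amount to picking where the retained window of max_duration seconds begins inside each cut. A minimal sketch follows, with the offset measured from the cut's own start; truncation_offset is a hypothetical helper and not part of lhotse.

```python
import random
from typing import Optional


def truncation_offset(duration: float, max_duration: float, offset_type: str,
                      rng: Optional[random.Random] = None) -> float:
    slack = duration - max_duration
    if slack <= 0:
        return 0.0  # the cut is already short enough, nothing is truncated
    if offset_type == "start":
        return 0.0  # keep the first max_duration seconds
    if offset_type == "end":
        return slack  # keep the last max_duration seconds
    if offset_type == "random":
        return (rng or random).uniform(0.0, slack)  # any window that still fits
    raise ValueError(f"Unknown offset_type: {offset_type}")


from_start = truncation_offset(10.0, 4.0, "start")
from_end = truncation_offset(10.0, 4.0, "end")
at_random = truncation_offset(10.0, 4.0, "random", rng=random.Random(0))
```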

cut_into_windows(duration, keep_excessive_supervisions=True)

Return a new CutSet, made by traversing each Cut in windows of duration seconds and creating a new Cut out of each window.

The last window might have a shorter duration if there was not enough audio, so you might want to use either .filter() or .pad() afterwards to obtain a uniform duration CutSet.

  • duration (float) – Desired duration of the new cuts in seconds.

  • keep_excessive_supervisions (bool) – when a cut is truncated in the middle of a supervision segment, whether the supervision should be kept.

Return type



a new CutSet with cuts made from shorter duration windows.


Randomly sample this CutSet and return n_cuts cuts. When n_cuts is 1, will return a single cut instance; otherwise will return a CutSet.

Return type

Union[Cut, MixedCut, PaddingCut, CutSet]

perturb_speed(factor, affix_id=True)
Return type


mix(cuts, duration=None, snr=20, mix_prob=1.0)

Mix cuts in this CutSet with randomly sampled cuts from another CutSet. A typical application would be data augmentation with noise, music, babble, etc.

  • cuts (CutSet) – a CutSet containing cuts to be mixed into this CutSet.

  • duration (Optional[float]) – an optional float in seconds. When None, we will preserve the duration of the cuts in self (i.e. we’ll truncate the mix if it exceeded the original duration). Otherwise, we will keep sampling cuts to mix in until we reach the specified duration (and truncate to that value, should it be exceeded).

  • snr (Union[float, Sequence[float], None]) – an optional float, or pair (range) of floats, in decibels. When it’s a single float, we will mix all cuts with this SNR level (where cuts in self are treated as signals, and cuts in cuts are treated as noise). When it’s a pair of floats, we will uniformly sample SNR values from that range. When None, we will mix the cuts without any level adjustment (could be too noisy for data augmentation).

  • mix_prob (float) – an optional float in range [0, 1]. Specifies the probability of performing a mix. Values lower than 1.0 mean that some cuts in the output will be unchanged.

Return type



a new CutSet with mixed cuts.
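How the snr and mix_prob arguments could drive the per-cut decisions is sketched below. Both helpers are hypothetical; they only mirror the parameter descriptions above (fixed value, uniformly sampled range, or no level adjustment) and are not lhotse's implementation.

```python
import random
from typing import Optional, Sequence, Union


def pick_snr(snr: Union[float, Sequence[float], None], rng: random.Random) -> Optional[float]:
    if snr is None:
        return None  # mix without any level adjustment
    if isinstance(snr, (int, float)):
        return float(snr)  # the same SNR for every cut
    low, high = snr
    return rng.uniform(low, high)  # a fresh SNR sampled from the range


def should_mix(mix_prob: float, rng: random.Random) -> bool:
    # random() is in [0, 1), so mix_prob=1.0 always mixes and 0.0 never does.
    return rng.random() < mix_prob


rng = random.Random(0)
fixed = pick_snr(20, rng)
sampled = pick_snr((10, 20), rng)
```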


Return a new CutSet, where each Cut is copied and detached from its extracted features.

Return type


compute_and_store_features(extractor, storage_path, num_jobs=None, augment_fn=None, storage_type=<class 'lhotse.features.io.LilcomHdf5Writer'>, executor=None, mix_eagerly=True, progress_bar=True)

Extract features for all cuts, possibly in parallel, and store them using the specified storage object.


Extract fbank features on one machine using 8 processes, store arrays partitioned in 8 HDF5 files with lilcom compression:

>>> cuts = CutSet(...)
>>> cuts.compute_and_store_features(
...     extractor=Fbank(),
...     storage_path='feats',
...     num_jobs=8,
... )

Extract fbank features on one machine using 8 processes, store each array in a separate file with lilcom compression:

>>> cuts = CutSet(...)
>>> cuts.compute_and_store_features(
...     extractor=Fbank(),
...     storage_path='feats',
...     num_jobs=8,
...     storage_type=LilcomFilesWriter
... )

Extract fbank features on multiple machines using a Dask cluster with 80 jobs, store arrays partitioned in 80 HDF5 files with lilcom compression:

>>> from distributed import Client
>>> cuts = CutSet(...)
>>> cuts.compute_and_store_features(
...     extractor=Fbank(),
...     storage_path='feats',
...     num_jobs=80,
...     executor=Client(...)
... )
  • extractor (FeatureExtractor) – A FeatureExtractor instance (either Lhotse’s built-in or a custom implementation).

  • storage_path (Union[Path, str]) – The path to location where we will store the features. The exact type and layout of stored files will be dictated by the storage_type argument.

  • num_jobs (Optional[int]) – The number of parallel processes used to extract the features. We will internally split the CutSet into this many chunks and process each chunk in parallel.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional callable used for audio augmentation. Be careful with the types of augmentations used: if they modify the start/end/duration times of the cut and its supervisions, you will end up with incorrect supervision information when using this API. E.g. for speed perturbation, use CutSet.perturb_speed() instead.

  • storage_type (Type[~FW]) – a FeaturesWriter subclass type. It determines how the features are stored to disk, e.g. separate file per array, HDF5 files with multiple arrays, etc.

  • executor (Optional[Executor]) – when provided, will be used to parallelize the feature extraction process. By default, we will instantiate a ProcessPoolExecutor. Learn more about the Executor API at https://lhotse.readthedocs.io/en/latest/parallelism.html

  • mix_eagerly (bool) – Related to how the features are extracted for MixedCut instances, if any are present. When False, extract and store the features for each track separately, and mix them dynamically when loading the features. When True, mix the audio first and store the mixed features, returning a new Cut instance with the same ID. The returned Cut will not have a Recording attached.

  • progress_bar (bool) – Should a progress bar be displayed (automatically turned off for parallel computation).

Return type



Returns a new CutSet with Features manifests attached to the cuts.

compute_global_feature_stats(storage_path=None, max_cuts=None)

Compute the global means and standard deviations for each feature bin in the manifest. It follows the implementation in scikit-learn: https://github.com/scikit-learn/scikit-learn/blob/0fb307bf39bbdacd6ed713c00724f8f871d60370/sklearn/utils/extmath.py#L715 which follows the paper: “Algorithms for computing the sample variance: analysis and recommendations”, by Chan, Golub, and LeVeque.

  • storage_path (Union[Path, str, None]) – an optional path to a file where the stats will be stored with pickle.

  • max_cuts (Optional[int]) – optionally, limit the number of cuts used for stats estimation. The cuts will be selected randomly in that case.

Return a dict of {‘norm_means’: np.ndarray, ‘norm_stds’: np.ndarray} with the shape of the arrays equal to the number of feature bins in this manifest.

Return type

Dict[str, ndarray]
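The pairwise update from Chan, Golub, and LeVeque merges the count, mean, and sum of squared deviations of two chunks without revisiting the data. The sketch below applies it to a single feature bin with plain floats; chunk_stats and merge_stats are hypothetical names, and the use of the population (rather than sample) standard deviation is an assumption.

```python
import math
from typing import List, Tuple

Stats = Tuple[int, float, float]  # (count, mean, sum of squared deviations from the mean)


def chunk_stats(values: List[float]) -> Stats:
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return n, mean, m2


def merge_stats(a: Stats, b: Stats) -> Stats:
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    # The cross term accounts for the distance between the two chunk means.
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


n, mean, m2 = merge_stats(chunk_stats([1.0, 2.0, 3.0]), chunk_stats([4.0, 5.0, 6.0, 7.0]))
std = math.sqrt(m2 / n)  # population standard deviation (assumption)
```

Merging the chunks this way gives the same mean and standard deviation as a single pass over the values 1 through 7.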


Modify the cuts in this CutSet and return a new CutSet.


transform_fn (Callable[[Union[Cut, MixedCut, PaddingCut]], Union[Cut, MixedCut, PaddingCut]]) – A callable (function) that accepts a single cut instance and returns a single cut instance.

Return type



a new CutSet with modified cuts.


Modify the IDs of cuts in this CutSet. Useful when combining multiple CutSets that were created from a single source, but contain features with different data augmentation techniques.


transform_fn (Callable[[str], str]) – A callable (function) that accepts a string (cut ID) and returns a new string (new cut ID).

Return type

CutSet

a new CutSet with cuts with modified IDs.


Modify the SupervisionSegments by transform_fn in this CutSet.


transform_fn (Callable[[SupervisionSegment], SupervisionSegment]) – a function that takes a supervision as an argument and returns the modified supervision.

Return type



a new, modified CutSet.


Return a copy of this CutSet with the text of all SupervisionSegments transformed with transform_fn. Useful for text normalization, phonetic transcription, etc.


transform_fn (Callable[[str], str]) – a function that accepts a string and returns a string.

Return type



a new, modified CutSet.


Initialize self. See help(type(self)) for accurate signature.

lhotse.cut.make_windowed_cuts_from_features(feature_set, cut_duration, cut_shift=None, keep_shorter_windows=False)

Converts a FeatureSet to a CutSet by traversing each Features object in - possibly overlapping - windows, and creating a Cut out of that area. By default, the last window in traversal will be discarded if it cannot satisfy the cut_duration requirement.

  • feature_set (FeatureSet) – a FeatureSet object.

  • cut_duration (float) – float, duration of created Cuts in seconds.

  • cut_shift (Optional[float]) – optional float, specifies how many seconds are in between the starts of consecutive windows. Equals cut_duration by default.

  • keep_shorter_windows (bool) – bool, when True, the last window will be used to create a Cut even if its duration is shorter than cut_duration.

Return type



a CutSet object.
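The window traversal can be sketched as a function that returns (start, end) spans for one Features object of a given total duration. The window_spans helper is hypothetical; stopping after the first window that no longer fits, and keeping it only when keep_shorter_windows is True, is this sketch's reading of the description above rather than lhotse's exact logic.

```python
from typing import List, Optional, Tuple


def window_spans(total: float, cut_duration: float, cut_shift: Optional[float] = None,
                 keep_shorter_windows: bool = False) -> List[Tuple[float, float]]:
    shift = cut_duration if cut_shift is None else cut_shift
    spans = []
    i = 0
    while i * shift < total:
        start = i * shift
        end = start + cut_duration
        if end > total:
            # The window does not fit: keep a shortened one only on request, then stop.
            if keep_shorter_windows:
                spans.append((start, total))
            break
        spans.append((start, end))
        i += 1
    return spans


overlapping = window_spans(9.0, cut_duration=4.0, cut_shift=2.0)
with_tail = window_spans(9.0, cut_duration=4.0, cut_shift=2.0, keep_shorter_windows=True)
```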

lhotse.cut.mix(reference_cut, mixed_in_cut, offset=0, snr=None)

Overlay, or mix, two cuts. Optionally the mixed_in_cut may be shifted by offset seconds and scaled down (positive SNR) or scaled up (negative SNR). Returns a MixedCut, which contains both cuts and the mix information. The actual feature mixing is performed during the call to MixedCut.load_features().

  • reference_cut (Union[Cut, MixedCut, PaddingCut]) – The reference cut for the mix - offset and snr are specified w.r.t this cut.

  • mixed_in_cut (Union[Cut, MixedCut, PaddingCut]) – The mixed-in cut - it will be offset and rescaled to match the offset and snr parameters.

  • offset (float) – How many seconds to shift the mixed_in_cut w.r.t. the reference_cut.

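  • snr (Optional[float]) – Desired SNR of the mix, with the reference_cut treated as the signal and the mixed_in_cut as the noise. When None, the cuts are mixed without any level adjustment.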

Return type



A MixedCut instance.
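The scaling implied by an SNR target can be worked out in the sample domain: the mixed-in signal is multiplied by a gain chosen so that the energy ratio between the reference and the scaled signal matches the target. The sketch below assumes the SNR is expressed in decibels and uses mean squared amplitude as the energy; both helpers are hypothetical, and lhotse may compute energies differently (for example in the feature domain).

```python
import math
from typing import List


def energy(samples: List[float]) -> float:
    # Mean squared amplitude, used here as a simple energy estimate.
    return sum(s * s for s in samples) / len(samples)


def mixed_in_gain(reference: List[float], mixed_in: List[float], snr_db: float) -> float:
    target_ratio = 10 ** (snr_db / 10)  # decibels to linear energy ratio
    return math.sqrt(energy(reference) / (energy(mixed_in) * target_ratio))


reference = [1.0, -1.0, 1.0, -1.0]
noise = [1.0, -1.0, 1.0, -1.0]
gain = mixed_in_gain(reference, noise, snr_db=20.0)
scaled = [gain * s for s in noise]
achieved_db = 10 * math.log10(energy(reference) / energy(scaled))
```

With equally loud inputs, a positive SNR yields a gain below 1 (the mixed-in cut is scaled down) and a negative SNR yields a gain above 1.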

lhotse.cut.pad(cut, duration=None, num_frames=None, num_samples=None, pad_feat_value=-23.025850929940457, direction='right')

Return a new MixedCut, padded with zeros in the recording, and pad_feat_value in each feature bin.

The user can choose to pad to a specific duration, a specific number of frames num_frames, or a specific number of samples num_samples. The three arguments are mutually exclusive.

  • cut (Union[Cut, MixedCut, PaddingCut]) – Cut to be padded.

  • duration (Optional[float]) – The cut’s minimal duration after padding.

  • num_frames (Optional[int]) – The cut’s total number of frames after padding.

  • num_samples (Optional[int]) – The cut’s total number of samples after padding.

  • pad_feat_value (float) – A float value that’s used for padding the features. By default we assume a log-energy floor of approx. -23 (1e-10 after exp).

  • direction (str) – string, ‘left’, ‘right’ or ‘both’. Determines whether the padding is added before or after the cut.

Return type

Union[Cut, MixedCut, PaddingCut]


a padded MixedCut if duration is greater than this cut’s duration, otherwise self.

lhotse.cut.append(left_cut, right_cut, snr=None)

Helper method for functional-style appending of Cuts.

Return type



Return a MixedCut that consists of the input Cuts mixed with each other as-is.

Return type



Return a MixedCut that consists of the input Cuts appended to each other as-is.

Return type

Union[Cut, MixedCut, PaddingCut]

lhotse.cut.compute_supervisions_frame_mask(cut, frame_shift=None)

Compute a mask that indicates which frames in a cut are covered by supervisions.

  • cut (Union[Cut, MixedCut, PaddingCut]) – a cut object.

  • frame_shift (Optional[float]) – optional frame shift in seconds; required when the cut does not have pre-computed features, otherwise ignored.

Returns a 1D numpy array with value 1 for frames covered by at least one supervision, and 0 for frames not covered by any supervision.
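A frame mask of this kind can be sketched with plain lists, given the cut duration, the frame shift, and the supervisions as (start, end) pairs in seconds relative to the cut. The supervisions_frame_mask helper below is hypothetical, and rounding the boundaries to the nearest frame is an assumption.

```python
from typing import List, Tuple


def supervisions_frame_mask(cut_duration: float, supervisions: List[Tuple[float, float]],
                            frame_shift: float) -> List[int]:
    num_frames = round(cut_duration / frame_shift)
    mask = [0] * num_frames
    for start, end in supervisions:
        # Clamp to the cut so that supervisions sticking out do not overflow the mask.
        first = max(0, round(start / frame_shift))
        last = min(num_frames, round(end / frame_shift))
        for i in range(first, last):
            mask[i] = 1
    return mask


single = supervisions_frame_mask(1.0, [(0.25, 0.75)], frame_shift=0.25)
overlapping = supervisions_frame_mask(1.0, [(0.0, 0.5), (0.25, 0.75)], frame_shift=0.25)
```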


Convenience methods used to prepare recording and supervision manifests for standard corpora.

Kaldi conversion

Convenience methods used to interact with Kaldi data directories.

lhotse.kaldi.load_kaldi_data_dir(path, sampling_rate)

Load a Kaldi data directory and convert it to Lhotse RecordingSet and SupervisionSet manifests. For this to work, at least the wav.scp file must exist. The SupervisionSet is created only when a segments file exists. All the other files (text, utt2spk, etc.) are optional, and some of them might not be handled yet. In particular, feats.scp files are ignored.

Return type

Tuple[RecordingSet, Optional[SupervisionSet]]

lhotse.kaldi.export_to_kaldi(recordings, supervisions, output_dir)

Export a pair of RecordingSet and SupervisionSet to a Kaldi data directory. Currently, it only supports single-channel recordings that have a single AudioSource.

The RecordingSet and SupervisionSet must be compatible, i.e. it must be possible to create a CutSet out of them.

  • recordings (RecordingSet) – a RecordingSet manifest.

  • supervisions (SupervisionSet) – a SupervisionSet manifest.

  • output_dir (Union[Path, str]) – path where the Kaldi-style data directory will be created.

lhotse.kaldi.load_kaldi_text_mapping(path, must_exist=False)

Load Kaldi files such as utt2spk, spk2gender, text, etc. as a dict.

Return type

Dict[str, Optional[str]]
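Kaldi mapping files such as utt2spk hold one entry per line, with a key followed by a space and the value. A minimal reader for that layout is sketched below; load_text_mapping is a hypothetical re-implementation for illustration, and returning an empty dict for a missing file when must_exist is False is an assumption about the intended behavior.

```python
import tempfile
from pathlib import Path
from typing import Dict, Optional, Union


def load_text_mapping(path: Union[Path, str], must_exist: bool = False) -> Dict[str, Optional[str]]:
    path = Path(path)
    if not path.is_file():
        if must_exist:
            raise FileNotFoundError(path)
        return {}  # assumption: a missing optional file yields an empty mapping
    mapping: Dict[str, Optional[str]] = {}
    for line in path.read_text().splitlines():
        if not line.strip():
            continue
        # The first token is the key; the rest of the line is the value.
        key, _, value = line.strip().partition(" ")
        mapping[key] = value.strip() or None
    return mapping


with tempfile.TemporaryDirectory() as tmp:
    utt2spk = Path(tmp) / "utt2spk"
    utt2spk.write_text("utt1 spk1\nutt2 spk2\n")
    mapping = load_text_mapping(utt2spk)
    missing = load_text_mapping(Path(tmp) / "spk2gender")
```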

lhotse.kaldi.save_kaldi_text_mapping(data, path)

Save flat dicts to Kaldi files such as utt2spk, spk2gender, text, etc.


Helper methods used throughout the codebase.


Combine multiple manifests of the same type into one.

>>> # Pass several arguments
>>> combine(recording_set1, recording_set2, recording_set3)
>>> # Or pass a single list/tuple of manifests
>>> combine([supervision_set1, supervision_set2])
Return type



Take an iterable of data types in Lhotse such as Recording, SupervisionSegment or Cut, and create the manifest of the corresponding type. When the iterable is empty, returns None.

Return type
