API Reference
This page contains a comprehensive list of all classes and functions within lhotse.
Audio loading, saving, and manifests
Data structures and utilities used for describing and manipulating audio recordings.
- class lhotse.audio.AudioSource(type, channels, source, video=None)[source]
AudioSource represents audio data that can be retrieved from somewhere.
-
type:
str The type of audio source. Supported types are:
  - ‘file’ (supports most standard audio encodings, possibly multi-channel)
  - ‘command’ [unix pipe] (supports most standard audio encodings, possibly multi-channel)
  - ‘url’ (any URL type that is supported by the “smart_open” library, e.g. http/https/s3/gcp/azure/etc.)
  - ‘memory’ (any format, read from a binary string attached to the ‘source’ member of AudioSource)
  - ‘shar’ (indicates a placeholder that will be filled later when using the Lhotse Shar data format)
-
channels:
List[int] A list of integer channel IDs available in this AudioSource.
-
source:
Union[str,bytes] The actual source to read from. The contents depend on the type field, but in general it can be a path, a URL, or the encoded binary data itself.
-
video:
Optional[VideoInfo] = None Optional information about the video contained in this source, if any.
- property has_video: bool
- property format: str
- load_audio(offset=0.0, duration=None, force_opus_sampling_rate=None)[source]
Load the AudioSource (from files, commands, or URLs) with soundfile, accounting for many audio formats and multi-channel inputs. Returns a numpy array of shape (n_samples,) for single-channel audio, or (n_channels, n_samples) for multi-channel audio.
Note: The elements in the returned array are in the range [-1.0, 1.0] and are of dtype np.float32.
- Parameters:
force_opus_sampling_rate (Optional[int]) – This parameter is only used when we detect an OPUS file. It will tell ffmpeg to resample OPUS to this sampling rate.
- Return type:
ndarray
- load_video(offset=0.0, duration=None, with_audio=True)[source]
- Return type:
Tuple[Tensor,Optional[Tensor]]
- __init__(type, channels, source, video=None)
- class lhotse.audio.Recording(id, sources, sampling_rate, num_samples, duration, channel_ids=None, transforms=None)[source]
The Recording manifest describes the recordings in a given corpus. It contains information about the recording, such as its path(s), duration, the number of samples, etc. It can represent multiple channels coming from one or more files.
This manifest does not specify any segmentation information or supervision such as the transcript or the speaker – we use SupervisionSegment for that.
Note that Recording can represent both a single utterance (e.g., in LibriSpeech) and a 1-hour session with multiple channels and speakers (e.g., in AMI). In the latter case, it is partitioned into data suitable for model training using Cut.
Internally, Lhotse supports multiple audio backends to read audio files. By default, we try to use libsoundfile, then torchaudio (with FFMPEG integration starting with torchaudio 2.1), and then audioread (which is an ffmpeg CLI wrapper). For sphere files we prefer to use the sph2pipe binary as it can work with certain unique encodings such as “shorten”.
Audio backends in Lhotse are configurable. See:
available_audio_backends(), audio_backend(), get_current_audio_backend(), set_current_audio_backend(), get_default_audio_backend()
Examples
A Recording can be simply created from a local audio file:
>>> from lhotse import RecordingSet, Recording, AudioSource
>>> recording = Recording.from_file('meeting.wav')
>>> recording
Recording(
    id='meeting',
    sources=[AudioSource(type='file', channels=[0], source='meeting.wav')],
    sampling_rate=16000,
    num_samples=57600000,
    duration=3600.0,
    transforms=None
)
This manifest can be easily converted to a Python dict and serialized to JSON/JSONL/YAML/etc:
>>> recording.to_dict()
{'id': 'meeting',
 'sources': [{'type': 'file', 'channels': [0], 'source': 'meeting.wav'}],
 'sampling_rate': 16000,
 'num_samples': 57600000,
 'duration': 3600.0}
Recordings can also be created programmatically, e.g. when they refer to URLs stored in S3 or somewhere else:
>>> # get_num_samples() and get_duration() are assumed to be user-defined helpers.
>>> s3_audio_files = ['s3://my-bucket/123-5678.flac', ...]
>>> recs = RecordingSet.from_recordings(
...     Recording(
...         id=url.split('/')[-1].replace('.flac', ''),
...         sources=[AudioSource(type='url', source=url, channels=[0])],
...         sampling_rate=16000,
...         num_samples=get_num_samples(url),
...         duration=get_duration(url)
...     )
...     for url in s3_audio_files
... )
It allows reading a subset of the audio samples as a numpy array:
>>> samples = recording.load_audio()
>>> assert samples.shape == (1, 16000)
>>> samples2 = recording.load_audio(offset=0.5)
>>> assert samples2.shape == (1, 8000)
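The shape assertions above follow from simple arithmetic on the sampling rate. The sketch below is not Lhotse code; expected_num_samples is a hypothetical helper, and it assumes offsets and durations are rounded to the nearest sample, which may differ from the library's exact rounding.

```python
def expected_num_samples(total_samples, sampling_rate, offset=0.0, duration=None):
    """Number of samples a read starting at `offset` seconds should return.

    Hypothetical helper for illustration; assumes nearest-sample rounding.
    """
    start = round(offset * sampling_rate)
    if duration is None:
        end = total_samples
    else:
        end = min(total_samples, start + round(duration * sampling_rate))
    return max(0, end - start)


# A 1-second recording at 16 kHz, as in the assertions above.
print(expected_num_samples(16000, 16000))              # 16000
print(expected_num_samples(16000, 16000, offset=0.5))  # 8000
```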
-
id:
str
-
sources:
List[AudioSource]
-
sampling_rate:
int
-
num_samples:
int
-
duration:
float
-
channel_ids:
Optional[List[int]] = None
-
transforms:
Optional[List[Union[AudioTransform,Dict]]] = None
- property has_video: bool
- property is_in_memory: bool
- property is_placeholder: bool
- property num_channels: int
- property source_format: str
Infer format of the audio sources. If all sources have the same format, return it. If sources have different formats, raise an error.
- static from_file(path, recording_id=None, relative_path_depth=None, force_opus_sampling_rate=None, force_read_audio=False)[source]
Read an audio file’s header and create the corresponding Recording. Suitable to use when each physical file represents a separate recording session.
Caution
If a recording session consists of multiple files (e.g. one per channel), it is advisable to create the Recording object manually, with each file represented as a separate AudioSource object.
- Parameters:
path (Union[Path,str]) – Path to an audio file supported by libsoundfile (pysoundfile).
recording_id (Union[str,Callable[[Path],str],None]) – recording id; when not specified, the filename’s stem is used (“x.wav” -> “x”). It can be specified as a string or a function that takes the recording path and returns a string.
relative_path_depth (Optional[int]) – optional int specifying how many last parts of the file path should be retained in the AudioSource. By default writes the path as is.
force_opus_sampling_rate (Optional[int]) – when specified, this value will be used as the sampling rate instead of the one we read from the manifest. This is useful for OPUS files that always have 48kHz rate and need to be resampled to the real one – we will perform that operation “under-the-hood”. For non-OPUS files this input is undefined.
force_read_audio (bool) – Set it to True for audio files that do not have any metadata in their headers (e.g., “The People’s Speech” FLAC files).
- Return type:
Recording
- Returns:
a new Recording instance pointing to the audio file.
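To illustrate what retaining the last parts of a path means for relative_path_depth, here is a minimal sketch. It is not Lhotse's implementation; keep_last_parts is a hypothetical helper, and it assumes POSIX-style paths.

```python
from pathlib import PurePosixPath
from typing import Optional


def keep_last_parts(path: str, depth: Optional[int] = None) -> str:
    """Keep only the last `depth` components of `path` (hypothetical helper)."""
    if depth is None or depth <= 0:
        # Mirrors the documented default: the path is written as is.
        return path
    parts = PurePosixPath(path).parts
    return str(PurePosixPath(*parts[-depth:]))


print(keep_last_parts("/data/corpus/session1/a.wav"))     # /data/corpus/session1/a.wav
print(keep_last_parts("/data/corpus/session1/a.wav", 2))  # session1/a.wav
```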
- static from_bytes(data, recording_id)[source]
Like Recording.from_file(), but creates a manifest for a byte string with raw encoded audio data. This data is first decoded to obtain info such as the sampling rate, number of channels, etc. Then, the binary data is attached to the manifest. Calling Recording.load_audio() does not perform any I/O and instead decodes the byte string contents in memory.
Note
Intended use of this method is for packing Recordings into archives where metadata and data should be available together (e.g., in WebDataset style tarballs).
Caution
Manifest created with this method cannot be stored as JSON because JSON doesn’t allow serializing binary data.
- Parameters:
data (bytes) – bytes, byte string containing encoded audio contents.
recording_id (str) – recording id, unique string identifier.
- Return type:
Recording
- Returns:
a new Recording instance that owns the byte string data.
- move_to_memory(channels=None, offset=None, duration=None, format=None)[source]
Read audio data and return a copy of the manifest with binary data attached. Calling Recording.load_audio() on that copy will not trigger I/O.
If all arguments are left as defaults, we won’t decode the audio; we attach the bytes read from disk or another source as-is. If channels, duration, or offset are specified, we’ll decode the audio and re-encode it into format before attaching. The default format is FLAC; other formats compatible with torchaudio.save are also accepted.
- Return type:
Recording
- to_cut()[source]
Create a Cut out of this recording — MonoCut or MultiCut, depending on the number of channels.
- load_audio(channels=None, offset=0.0, duration=None)[source]
Read the audio samples from the underlying audio source (path, URL, unix pipe/command).
- Parameters:
channels (Union[int,List[int],None]) – int or iterable of ints, a subset of channel IDs to read (reads all by default).
offset (float) – seconds, where to start reading the audio (at offset 0 by default). Note that it is only efficient for local filesystem files, i.e. URLs and commands will read all the samples first and discard the unneeded ones afterwards.
duration (Optional[float]) – seconds, indicates the total audio time to read (starting from offset).
- Return type:
ndarray
- Returns:
a numpy array of audio samples with shape (num_channels, num_samples).
- load_video(channels=None, offset=0.0, duration=None, with_audio=True, force_consistent_duration=True)[source]
Read the video frames and audio samples from the underlying source (path, URL, unix pipe/command).
- Parameters:
channels (Union[int,List[int],None]) – int or iterable of ints, a subset of channel IDs to read (reads all by default).
offset (float) – seconds, where to start reading the video (at offset 0 by default). Note that it is only efficient for local filesystem files, i.e. URLs and commands will read all the samples first and discard the unneeded ones afterwards.
duration (Optional[float]) – seconds, indicates the total video time to read (starting from offset).
with_audio (bool) – bool, whether to load and return audio alongside video. True by default.
force_consistent_duration (bool) – bool, if audio duration is different than video duration (as counted by num_frames / fps), we’ll either truncate or pad the audio with zeros. True by default.
- Return type:
Tuple[Tensor,Optional[Tensor]]
- Returns:
a tuple of video tensor and optional audio tensor (or None).
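The truncate-or-pad behaviour described for force_consistent_duration can be pictured with plain Python lists. This is only a sketch under stated assumptions: fit_audio_to_video is a hypothetical helper, the target length is assumed to be num_frames / fps seconds rounded to the nearest sample, and padding is assumed to be appended at the end.

```python
def fit_audio_to_video(samples, sampling_rate, num_frames, fps):
    """Truncate or zero-pad `samples` to the video duration (hypothetical helper)."""
    # Video duration in seconds is num_frames / fps; convert it to a sample count.
    target = round(num_frames / fps * sampling_rate)
    if len(samples) >= target:
        return samples[:target]
    return samples + [0.0] * (target - len(samples))


# 25 frames at 25 fps is 1 second of video; at 8 Hz that is 8 audio samples.
too_long = fit_audio_to_video([0.1] * 10, sampling_rate=8, num_frames=25, fps=25.0)
too_short = fit_audio_to_video([0.1] * 5, sampling_rate=8, num_frames=25, fps=25.0)
print(len(too_long), len(too_short))  # 8 8
```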
- perturb_speed(factor, affix_id=True)[source]
Return a new Recording that will lazily perturb the speed while loading audio. The num_samples and duration fields are updated to reflect the shrinking/extending effect of speed.
- Parameters:
factor (float) – The speed will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).
affix_id (bool) – When true, we will modify the Recording.id field by affixing it with “_sp{factor}”.
- Return type:
Recording
- Returns:
a modified copy of the current Recording.
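As a rough illustration of how num_samples and duration change under speed perturbation, the sketch below divides the sample count by the factor. perturbed_length is a hypothetical helper; rounding to the nearest sample is an assumption, and Lhotse's exact rounding rule may differ.

```python
def perturbed_length(num_samples, sampling_rate, factor):
    """Approximate manifest fields after a speed change (hypothetical helper)."""
    # Playing audio `factor` times faster shortens it by the same factor.
    new_num_samples = round(num_samples / factor)
    new_duration = new_num_samples / sampling_rate
    return new_num_samples, new_duration


print(perturbed_length(16000, 16000, 2.0))  # (8000, 0.5)
print(perturbed_length(16000, 16000, 1.1))  # (14545, 0.9090625)
```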
- perturb_tempo(factor, affix_id=True)[source]
Return a new Recording that will lazily perturb the tempo while loading audio.
Compared to speed perturbation, tempo preserves pitch. The num_samples and duration fields are updated to reflect the shrinking/extending effect of tempo.
- Parameters:
factor (float) – The tempo will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).
affix_id (bool) – When true, we will modify the Recording.id field by affixing it with “_tp{factor}”.
- Return type:
Recording
- Returns:
a modified copy of the current Recording.
- perturb_volume(factor, affix_id=True)[source]
Return a new Recording that will lazily perturb the volume while loading audio.
- Parameters:
factor (float) – The volume scale to be applied (e.g. factor=1.1 means 1.1x louder).
affix_id (bool) – When true, we will modify the Recording.id field by affixing it with “_vp{factor}”.
- Return type:
Recording
- Returns:
a modified copy of the current Recording.
- narrowband(codec, restore_orig_sr=True, affix_id=True)[source]
Return a new Recording that will lazily apply a narrowband effect while loading audio. When affix_id is true, the Recording.id field is modified by affixing it with “_nb_{codec}”.
- Return type:
Recording
- Returns:
a modified copy of the current Recording.
- normalize_loudness(target, affix_id=False)[source]
Return a new Recording that will lazily apply loudness normalization while loading audio.
- Parameters:
target (float) – The target loudness (in dB) to normalize to.
affix_id (bool) – When true, we will modify the Recording.id field by affixing it with “_ln{target}”.
- Return type:
Recording
- Returns:
a modified copy of the current Recording.
- dereverb_wpe(affix_id=True)[source]
Return a new Recording that will lazily apply WPE dereverberation.
- Parameters:
affix_id (bool) – When true, we will modify the Recording.id field by affixing it with “_wpe”.
- Return type:
Recording
- Returns:
a modified copy of the current Recording.
- reverb_rir(rir_recording=None, normalize_output=True, early_only=False, affix_id=True, rir_channels=None, room_rng_seed=None, source_rng_seed=None)[source]
Return a new Recording that will lazily apply reverberation based on the provided impulse response while loading audio. If no impulse response is provided, we will generate an RIR using a fast random generator (https://arxiv.org/abs/2208.04101).
- Parameters:
rir_recording (Optional[Recording]) – The impulse response to be used.
normalize_output (bool) – When true, output will be normalized to have the same energy as the input.
early_only (bool) – When true, only the early reflections (first 50 ms) will be used.
affix_id (bool) – When true, we will modify the Recording.id field by affixing it with “_rvb”.
rir_channels (Optional[Sequence[int]]) – The channels of the impulse response to be used (in case of multi-channel impulse responses). By default, only the first channel is used. If no RIR is provided, we will generate one with as many channels as this argument specifies.
room_rng_seed (Optional[int]) – The seed to be used for the room configuration.
source_rng_seed (Optional[int]) – The seed to be used for the source position.
- Return type:
Recording
- Returns:
the perturbed Recording.
- resample(sampling_rate)[source]
Return a new Recording that will be lazily resampled while loading audio.
- Parameters:
sampling_rate (int) – The new sampling rate.
- Return type:
Recording
- Returns:
A resampled Recording.
- clip_amplitude(hard=False, gain_db=0.0, normalize=True, oversampling=4, affix_id=False)[source]
Return a new Recording that will lazily apply a clipping effect while loading audio. Saturates input signal in [-1, 1] range.
- Parameters:
hard (bool) – If True, apply hard clipping (sharp cutoff); otherwise, apply soft clipping (saturation).
gain_db (float) – The amount of gain in decibels to apply before clipping (and to revert back to original level after).
normalize (bool) – If True, normalize the input signal to 0 dBFS before applying clipping.
oversampling (Optional[int]) – If provided, we will oversample the input signal by the given integer factor before applying saturation and then downsample back to the original sampling rate.
affix_id (bool) – When true, we will modify the Recording.id field by affixing it with “_cl{gain_db}”.
- Return type:
Recording
- Returns:
a modified copy of the current Recording with the saturation transform applied.
- compress(codec='opus', compression_level=0.99)[source]
Return a new Recording that will lazily apply audio compression while loading audio.
- Parameters:
codec (Literal['opus','mp3','vorbis','gsm']) – The codec to use for compression. Supported codecs are “opus”, “mp3”, “vorbis”, “gsm”.
compression_level (float) – The compression level between 0.0 and 1.0 (higher means more compression).
- Return type:
Recording
- Returns:
a modified copy of the current Recording.
- __init__(id, sources, sampling_rate, num_samples, duration, channel_ids=None, transforms=None)
- class lhotse.audio.RecordingSet(recordings=None)[source]
RecordingSet represents a collection of recordings. It does not contain any annotation such as the transcript or the speaker identity – just the information needed to retrieve a recording such as its path, URL, number of channels, and some recording metadata (duration, number of samples).
It also supports (de)serialization to/from YAML/JSON/etc. and takes care of mapping between rich Python classes and YAML/JSON/etc. primitives during conversion.
When coming from Kaldi, think of it as wav.scp on steroids: RecordingSet also has the information from reco2dur and reco2num_samples, is able to represent multi-channel recordings and read a specified subset of channels, and supports reading audio files directly, via a unix pipe, or downloading them on-the-fly from a URL (HTTPS/S3/Azure/GCP/etc.).
Examples:
RecordingSet can be created from an iterable of Recording objects:
>>> from lhotse import Recording, RecordingSet
>>> audio_paths = ['123-5678.wav', ...]
>>> recs = RecordingSet.from_recordings(Recording.from_file(p) for p in audio_paths)
As well as from a directory, which will be scanned recursively for files with parallel processing:
>>> recs2 = RecordingSet.from_dir('/data/audio', pattern='*.flac', num_jobs=4)
It behaves similarly to a dict:
>>> '123-5678' in recs
True
>>> recording = recs['123-5678']
>>> for recording in recs:
...     pass
>>> len(recs)
127
It also provides some utilities for I/O:
>>> recs.to_file('recordings.jsonl')
>>> recs.to_file('recordings.json.gz')  # auto-compression
>>> recs2 = RecordingSet.from_file('recordings.jsonl')
Manipulation:
>>> longer_than_5s = recs.filter(lambda r: r.duration > 5)
>>> first_100 = recs.subset(first=100)
>>> split_into_4 = recs.split(num_splits=4)
>>> shuffled = recs.shuffle()
And lazy data augmentation/transformation, which requires adjusting some information in the manifest (e.g., num_samples or duration). Note that in the following examples, the audio is untouched – the operations are stored in the manifest, and executed upon reading the audio:
>>> recs_sp = recs.perturb_speed(factor=1.1)
>>> recs_vp = recs.perturb_volume(factor=2.)
>>> recs_rvb = recs.reverb_rir(rir_recs)
>>> recs_24k = recs.resample(24000)
- property ids: Iterable[str]
- static from_items(recordings)
Function to be implemented by every sub-class of this mixin. It’s expected to create a sub-class instance out of an iterable of items that are held by the sub-class (e.g., CutSet.from_items(iterable_of_cuts)).
- Return type:
RecordingSet
- static from_dir(path, pattern, num_jobs=1, force_opus_sampling_rate=None, recording_id=None, exclude_pattern=None)[source]
Recursively scan a directory path for audio files that match the given pattern and create a RecordingSet manifest for them. Suitable to use when each physical file represents a separate recording session.
Caution
If a recording session consists of multiple files (e.g. one per channel), it is advisable to create each Recording object manually, with each file represented as a separate AudioSource object, and then a RecordingSet that contains all the recordings.
- Parameters:
path (Union[Path,str]) – Path to a directory of audio files (possibly with sub-directories).
pattern (str) – A bash-like pattern specifying allowed filenames, e.g. *.wav or session1-*.flac.
num_jobs (int) – The number of parallel workers for reading audio files to get their metadata.
force_opus_sampling_rate (Optional[int]) – when specified, this value will be used as the sampling rate instead of the one we read from the manifest. This is useful for OPUS files that always have 48kHz rate and need to be resampled to the real one – we will perform that operation “under-the-hood”. For non-OPUS files this input does nothing.
recording_id (Optional[Callable[[Path],str]]) – A function which takes the audio file path and returns the recording ID. If not specified, the filename will be used as the recording ID.
exclude_pattern (Optional[str]) – optional regex string for identifying file name patterns to exclude. There has to be a full regex match to trigger exclusion.
- Returns:
a new RecordingSet with a Recording for each matching audio file.
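The interplay between the bash-like pattern and the regex exclude_pattern can be sketched with the standard library. This is not Lhotse's implementation; select_files is a hypothetical helper, and it assumes both patterns are tested against the bare file name rather than the full path.

```python
import fnmatch
import re


def select_files(filenames, pattern, exclude_pattern=None):
    """Keep names matching the glob `pattern` unless the regex fully matches them."""
    exclude = re.compile(exclude_pattern) if exclude_pattern is not None else None
    selected = []
    for name in filenames:
        if not fnmatch.fnmatchcase(name, pattern):
            continue
        # Only a full regex match triggers exclusion, as the parameter docs state.
        if exclude is not None and exclude.fullmatch(name):
            continue
        selected.append(name)
    return selected


names = ["a.wav", "b.flac", "session1-x.wav", "session1-y.flac"]
print(select_files(names, "*.wav"))                     # ['a.wav', 'session1-x.wav']
print(select_files(names, "*.wav", r"session1-.*"))     # ['a.wav']
print(select_files(names, "*.wav", r"session1"))        # ['a.wav', 'session1-x.wav']
```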
- split(num_splits, shuffle=False, drop_last=False)[source]
Split the RecordingSet into num_splits pieces of equal size.
- Parameters:
num_splits (int) – Requested number of splits.
shuffle (bool) – Optionally shuffle the recordings order first.
drop_last (bool) – determines how to handle splitting when len(seq) is not divisible by num_splits. When False (default), the splits might have unequal lengths. When True, it may discard the last element in some splits to ensure they are equally long.
- Return type:
List[RecordingSet]
- Returns:
A list of RecordingSet pieces.
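The effect of drop_last is easiest to see on a plain list. The sketch below is one plausible way to implement the documented behaviour, not Lhotse's code; split_evenly is a hypothetical helper, and which items get discarded when drop_last=True is an assumption.

```python
def split_evenly(items, num_splits, drop_last=False):
    """Split `items` into `num_splits` contiguous pieces (hypothetical helper)."""
    items = list(items)
    base, remainder = divmod(len(items), num_splits)
    splits, start = [], 0
    for i in range(num_splits):
        # Without drop_last, the first `remainder` splits take one extra item.
        size = base if drop_last else base + (1 if i < remainder else 0)
        splits.append(items[start:start + size])
        start += size
    return splits


print([len(s) for s in split_evenly(range(10), 3)])                  # [4, 3, 3]
print([len(s) for s in split_evenly(range(10), 3, drop_last=True)])  # [3, 3, 3]
```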
- split_lazy(output_dir, chunk_size, prefix='')[source]
Splits a manifest (either lazily or eagerly opened) into chunks, each with chunk_size items (except for the last one, typically).
In order to be memory efficient, this implementation saves each chunk to disk in a .jsonl.gz format as the input manifest is sampled.
Note
For lowest memory usage, use load_manifest_lazy to open the input manifest for this method.
- Parameters:
output_dir (Union[Path,str]) – directory where the split manifests are saved. Each manifest is saved at: {output_dir}/{prefix}.{split_idx}.jsonl.gz
chunk_size (int) – the number of items in each chunk.
prefix (str) – the prefix of each manifest.
- Return type:
List[RecordingSet]
- Returns:
a list of lazily opened chunk manifests.
- subset(first=None, last=None)[source]
Return a new RecordingSet according to the selected subset criterion. Only a single argument to subset is supported at this time.
- Parameters:
first (Optional[int]) – int, the number of first recordings to keep.
last (Optional[int]) – int, the number of last recordings to keep.
- Return type:
RecordingSet
- Returns:
a new RecordingSet with the subset results.
- load_audio(recording_id, channels=None, offset_seconds=0.0, duration_seconds=None)[source]
- Return type:
ndarray
- perturb_speed(factor, affix_id=True)[source]
Return a new RecordingSet that will lazily perturb the speed while loading audio. The num_samples and duration fields are updated to reflect the shrinking/extending effect of speed.
- Parameters:
factor (float) – The speed will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).
affix_id (bool) – When true, we will modify the Recording.id field by affixing it with “_sp{factor}”.
- Return type:
RecordingSet
- Returns:
a RecordingSet containing the perturbed Recording objects.
- perturb_tempo(factor, affix_id=True)[source]
Return a new RecordingSet that will lazily perturb the tempo while loading audio. The num_samples and duration fields are updated to reflect the shrinking/extending effect of tempo.
- Parameters:
factor (float) – The tempo will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).
affix_id (bool) – When true, we will modify the Recording.id field by affixing it with “_tp{factor}”.
- Return type:
RecordingSet
- Returns:
a RecordingSet containing the perturbed Recording objects.
- perturb_volume(factor, affix_id=True)[source]
Return a new RecordingSet that will lazily perturb the volume while loading audio.
- Parameters:
factor (float) – The volume scale to be applied (e.g. factor=1.1 means 1.1x louder).
affix_id (bool) – When true, we will modify the Recording.id field by affixing it with “_vp{factor}”.
- Return type:
RecordingSet
- Returns:
a RecordingSet containing the perturbed Recording objects.
- reverb_rir(rir_recordings=None, normalize_output=True, early_only=False, affix_id=True, rir_channels=[0], room_rng_seed=None, source_rng_seed=None)[source]
Return a new RecordingSet that will lazily apply reverberation based on the provided impulse responses while loading audio. If no rir_recordings are provided, we will generate a set of impulse responses using a fast random generator (https://arxiv.org/abs/2208.04101).
- Parameters:
rir_recordings (Optional[RecordingSet]) – The impulse responses to be used.
normalize_output (bool) – When true, output will be normalized to have the same energy as the input.
early_only (bool) – When true, only the early reflections (first 50 ms) will be used.
affix_id (bool) – When true, we will modify the Recording.id field by affixing it with “_rvb”.
rir_channels (List[int]) – The channels to be used for the RIRs (if multi-channel). Uses first channel by default. If no RIR is provided, we will generate one with as many channels as this argument specifies.
room_rng_seed (Optional[int]) – The seed to be used for the room configuration.
source_rng_seed (Optional[int]) – The seed to be used for the source positions.
- Return type:
RecordingSet
- Returns:
a RecordingSet containing the perturbed Recording objects.
- resample(sampling_rate)[source]
Apply resampling to all recordings in the RecordingSet and return a new RecordingSet.
- Parameters:
sampling_rate (int) – The new sampling rate.
- Return type:
RecordingSet
- Returns:
a new RecordingSet with lazily resampled Recording objects.
- filter(predicate)
Return a new manifest containing only the items that satisfy predicate. If the manifest is lazy, the filtering will also be applied lazily.
- Parameters:
predicate (Callable[[TypeVar(T)],bool]) – a function that takes an item of the manifest (e.g., a cut) as an argument and returns bool.
- Returns:
a filtered manifest.
- classmethod from_file(path)
- Return type:
Any
- classmethod from_json(path)
- Return type:
Any
- classmethod from_jsonl(path)
- Return type:
Any
- classmethod from_jsonl_lazy(path)
Read a JSONL manifest in a lazy manner, which opens the file but does not read it immediately. It is only suitable for sequential reads and iteration.
- Return type:
Any
Warning
Opening the manifest in this way might cause some methods that rely on random access to fail.
- classmethod from_yaml(path)
- Return type:
Any
- classmethod infinite_mux(*manifests, weights=None, seed=0, max_open_streams=None)
Merges multiple manifest iterables into a new iterable by lazily multiplexing them during iteration time. Unlike mux(), this method allows limiting the number of open sub-iterators at any given time.
To enable this, it performs 2-stage sampling. First, it samples with replacement the set of iterators I to construct a subset I_sub of size max_open_streams. Then, for each iteration step, it samples an iterator i from I_sub, fetches the next item from it, and yields it. Once i becomes exhausted, it is replaced with a new iterator j sampled from I_sub.
Caution
Do not use this method with inputs that are infinitely iterable as they will silently break the multiplexing property by only using a subset of the input iterables.
Caution
This method is not recommended for multiplexing over a small number of iterations, as it may be much less accurate than mux() depending on the number of open streams, iterable sizes, and the random seed.
- Parameters:
manifests – iterables to be multiplexed. They can be either lazy or eager, but the resulting manifest will always be lazy.
weights (Optional[List[Union[int,float]]]) – an optional weight for each iterable, affects the probability of it being sampled. The weights are uniform by default. If lengths are known, it makes sense to pass them here for uniform distribution of items in the expectation.
seed (Union[int,Literal['trng','randomized']]) – the random seed, ensures deterministic order across multiple iterations.
max_open_streams (Optional[int]) – the number of iterables that can be open simultaneously at any given time.
- property is_lazy: bool
Indicates whether this manifest was opened in lazy (read-on-the-fly) mode or not.
- map(transform_fn)
Apply transform_fn to each item in this manifest and return a new manifest. If the manifest is opened lazy, the transform is also applied lazily.
- Parameters:
transform_fn (Callable[[TypeVar(T)],TypeVar(T)]) – A callable (function) that accepts a single item instance and returns a new (or the same) instance of the same type. E.g. with CutSet, the callable accepts a Cut and also returns a Cut.
- Returns:
a new manifest with transformed items (e.g., a CutSet with transformed cuts).
- classmethod mux(*manifests, stop_early=False, weights=None, seed=0)
Merges multiple manifest iterables into a new iterable by lazily multiplexing them during iteration time. If one of the iterables is exhausted before the others, we will keep iterating until all iterables are exhausted. This behavior can be changed with the stop_early parameter.
- Parameters:
manifests – iterables to be multiplexed. They can be either lazy or eager, but the resulting manifest will always be lazy.
stop_early (bool) – should we stop the iteration as soon as we exhaust one of the manifests.
weights (Optional[List[Union[int,float]]]) – an optional weight for each iterable, affects the probability of it being sampled. The weights are uniform by default. If lengths are known, it makes sense to pass them here for uniform distribution of items in the expectation.
seed (Union[int,Literal['trng','randomized']]) – the random seed, ensures deterministic order across multiple iterations.
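The multiplexing idea described for mux() can be sketched in a few lines of standard-library Python. This is an illustration under assumptions, not Lhotse's implementation: mux_sketch is a hypothetical name, only integer seeds are handled, and exhaustion is detected only when an exhausted iterator happens to be sampled.

```python
import random


def mux_sketch(*iterables, weights=None, seed=0, stop_early=False):
    """Lazily interleave `iterables`, sampling the next source by weight."""
    rng = random.Random(seed)
    iterators = [iter(it) for it in iterables]
    w = list(weights) if weights is not None else [1.0] * len(iterators)
    active = list(range(len(iterators)))
    while active:
        idx = rng.choices(active, weights=[w[i] for i in active], k=1)[0]
        try:
            yield next(iterators[idx])
        except StopIteration:
            if stop_early:
                return
            # Keep going with the remaining sources until all are exhausted.
            active.remove(idx)


items = list(mux_sketch([1, 2, 3], ["a", "b"], seed=0))
# Every item appears exactly once, and order within each source is preserved.
print(sorted(items, key=str))  # [1, 2, 3, 'a', 'b']
```

Passing the source lengths as weights makes longer sources proportionally more likely to be sampled, which is the rationale the weights parameter gives for a uniform distribution of items in expectation.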
- classmethod open_writer(path, overwrite=True)
Open a sequential writer that allows storing the manifests one by one, without the necessity of storing the whole manifest set in-memory. Supports writing to JSONL format (.jsonl), with optional gzip compression (.jsonl.gz).
- Return type:
Union[SequentialJsonlWriter,InMemoryWriter]
Note
When path is None, we will return an InMemoryWriter instead, which has the same API but stores the manifests in memory. It is convenient when you want to make disk saving optional.
Example:
>>> from lhotse import RecordingSet
>>> recordings = [...]
>>> with RecordingSet.open_writer('recordings.jsonl.gz') as writer:
...     for recording in recordings:
...         writer.write(recording)
This writer can be useful for resuming a write that was previously interrupted – it will open the existing file and scan it for item IDs to skip writing them later. It can also be queried for existing IDs so that the user code may skip preparing the corresponding manifests.
Example:
>>> from pathlib import Path
>>> from lhotse import RecordingSet, Recording
>>> with RecordingSet.open_writer('recordings.jsonl.gz', overwrite=False) as writer:
...     for path in Path('.').rglob('*.wav'):
...         recording_id = path.stem
...         if writer.contains(recording_id):
...             # Item already written previously - skip processing.
...             continue
...         # Item doesn't exist yet - run extra work to prepare the manifest
...         # and store it.
...         recording = Recording.from_file(path, recording_id=recording_id)
...         writer.write(recording)
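The resume behaviour described above (scan the existing file for IDs, then skip them) can be sketched without Lhotse. ResumableJsonlWriter is a hypothetical class written for illustration; it stores plain dicts in uncompressed JSONL and is not the library's SequentialJsonlWriter.

```python
import json
import os
import tempfile


class ResumableJsonlWriter:
    """Append-only JSONL writer that remembers previously written item IDs."""

    def __init__(self, path, overwrite=True):
        self.path = path
        self.seen = set()
        if not overwrite and os.path.exists(path):
            # Scan the existing file so already-written items can be skipped.
            with open(path) as f:
                for line in f:
                    if line.strip():
                        self.seen.add(json.loads(line)["id"])
        self._mode = "w" if overwrite else "a"

    def __enter__(self):
        self._file = open(self.path, self._mode)
        return self

    def __exit__(self, *exc):
        self._file.close()

    def contains(self, item_id):
        return item_id in self.seen

    def write(self, item):
        if item["id"] in self.seen:
            return False
        self._file.write(json.dumps(item) + "\n")
        self.seen.add(item["id"])
        return True


path = os.path.join(tempfile.mkdtemp(), "recordings.jsonl")
with ResumableJsonlWriter(path) as writer:
    writer.write({"id": "rec-1", "duration": 1.0})
# Re-open without overwriting: rec-1 is detected and only rec-2 is added.
with ResumableJsonlWriter(path, overwrite=False) as writer:
    skipped = writer.contains("rec-1")
    duplicate_written = writer.write({"id": "rec-1", "duration": 1.0})
    writer.write({"id": "rec-2", "duration": 2.0})
```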
- repeat(times=None, preserve_id=False)
Return a new, lazily evaluated manifest that iterates over the original elements times number of times.
- Parameters:
times (Optional[int]) – how many times to repeat (infinite by default).
preserve_id (bool) – when True, we won’t update the element ID with repeat number.
- Returns:
a repeated manifest.
- shuffle(rng=None, buffer_size=10000)
Shuffles the elements and returns a shuffled variant of self. If the manifest is opened lazily, performs shuffling on-the-fly with a fixed buffer size.
- Parameters:
rng (Optional[Random]) – an optional instance of random.Random for precise control of randomness.
- Returns:
a shuffled copy of self, or a manifest that is shuffled lazily.
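Shuffling a lazily opened manifest with a fixed buffer can be done with a common streaming technique, sketched below. This is an assumption about the general approach, not a copy of Lhotse's implementation; buffered_shuffle is a hypothetical name.

```python
import random


def buffered_shuffle(items, buffer_size=10000, rng=None):
    """Approximately shuffle a stream using at most `buffer_size` items of memory."""
    if rng is None:
        rng = random.Random()
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        # Buffer is full: emit a random buffered item and store the new one.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: flush whatever remains in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(100), buffer_size=10, rng=random.Random(0)))
print(sorted(shuffled) == list(range(100)))  # True
```

A smaller buffer uses less memory but keeps items closer to their original positions, so the result is only approximately uniform.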
- to_eager()
Evaluates all lazy operations on this manifest, if any, and returns a copy that keeps all items in memory. If the manifest was “eager” already, this is a no-op and won’t copy anything.
- to_file(path)
- Return type:
None
- to_json(path)
- Return type:
None
- to_jsonl(path)
- Return type:
None
- to_yaml(path)
- Return type:
None
- exception lhotse.audio.AudioLoadingError[source]
- __init__(*args, **kwargs)
- args
- with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- exception lhotse.audio.DurationMismatchError[source]
- __init__(*args, **kwargs)
- args
- with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- class lhotse.audio.VideoInfo(fps, num_frames, height, width)[source]
Metadata about video content in a Recording.
-
fps:
float Video frame rate (frames per second). It’s a float because some standard FPS are fractional (e.g. 59.94)
-
num_frames:
int Number of video frames.
-
height:
int Height in pixels.
-
width:
int Width in pixels.
- property duration: float
- property frame_length: float
- __init__(fps, num_frames, height, width)
- lhotse.audio.audio_backend(backend)[source]
Context manager that sets Lhotse’s audio backend to the specified value and restores the previous audio backend at the end of its scope.
Example:
>>> with audio_backend("LibsndfileBackend"):
...     some_audio_loading_fn()
- Return type:
Generator[AudioBackend,None,None]
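The set-then-restore pattern behind this context manager can be sketched with contextlib. The names below (use_backend, get_backend, and the module-level _current_backend) are hypothetical stand-ins and do not reflect how Lhotse stores its backend internally.

```python
from contextlib import contextmanager

_current_backend = "default"  # hypothetical global setting


def get_backend():
    return _current_backend


@contextmanager
def use_backend(name):
    """Temporarily switch the backend and restore the previous one on exit."""
    global _current_backend
    previous = _current_backend
    _current_backend = name
    try:
        yield name
    finally:
        # Restore even if the body raised an exception.
        _current_backend = previous


with use_backend("LibsndfileBackend"):
    inside = get_backend()
after = get_backend()
print(inside, after)  # LibsndfileBackend default
```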
- lhotse.audio.available_audio_backends()[source]
Return a list of names of available audio backends, including “default”.
- Return type:
List[str]
- lhotse.audio.get_current_audio_backend()[source]
Return the audio backend currently set by the user, or default.
- Return type:
AudioBackend
- lhotse.audio.get_default_audio_backend()[source]
Return a backend that can be used to read all audio formats supported by Lhotse.
It first looks for special cases that need very specific handling (such as: opus, sphere/shorten, in-memory buffers) and tries to match them against relevant audio backends.
Then, it tries to use several audio loading libraries (torchaudio, soundfile, audioread). In case the first fails, it tries the next one, and so on.
- Return type:
AudioBackend
- lhotse.audio.get_audio_duration_mismatch_tolerance()[source]
Retrieve the current audio duration mismatch tolerance in seconds.
- Return type:
float
- lhotse.audio.get_ffmpeg_torchaudio_info_enabled()[source]
Return FFMPEG_TORCHAUDIO_INFO_ENABLED, which is Lhotse’s global setting for whether to use ffmpeg-torchaudio to compute the duration of audio files.
Example:
>>> import lhotse
>>> lhotse.get_ffmpeg_torchaudio_info_enabled()
- Return type:
bool
- lhotse.audio.info(path, force_opus_sampling_rate=None, force_read_audio=False)[source]
- Return type:
LibsndfileCompatibleAudioInfo
- lhotse.audio.read_audio(path_or_fd, offset=0.0, duration=None, force_opus_sampling_rate=None)[source]
- Return type:
Tuple[ndarray,int]
- lhotse.audio.save_audio(dest, src, sampling_rate, format=None, encoding=None)[source]
- Return type:
None
- lhotse.audio.set_current_audio_backend(backend)[source]
Force Lhotse to use a specific audio backend to read every audio file, overriding the default behaviour of educated guessing + trial-and-error.
Example forcing Lhotse to use the audioread library for every audio loading operation:
>>> set_current_audio_backend(AudioreadBackend())
- Return type:
AudioBackend
- lhotse.audio.set_audio_duration_mismatch_tolerance(delta)[source]
Override Lhotse’s global threshold for allowed audio duration mismatch between the manifest and the actual data.
Some scenarios when a mismatch can happen:
- the Recording manifest duration is rounded off too much (i.e., bad user input, but too inconvenient to go back and fix the manifests)
- data augmentation changes the number of samples slightly, in a way that is difficult to predict
When there is a mismatch, Lhotse will either trim the excess samples or replicate samples to fill the shortfall, so that the result matches the duration found in the Recording manifest.
Note
We don’t recommend setting this lower than the default value, as it could break some data augmentation transforms.
Example:
>>> import lhotse
>>> lhotse.set_audio_duration_mismatch_tolerance(0.01)  # 10ms tolerance
- Parameters:
delta (float) – New tolerance in seconds.
- Return type:
None
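The trim-or-replicate behaviour described above can be sketched in plain Python. This is only an illustration: `fix_num_samples` and its arguments are hypothetical names that are not part of Lhotse, and Lhotse's own implementation works on numpy arrays and may differ in detail.

```python
from typing import List


def fix_num_samples(
    samples: List[float], expected: int, sampling_rate: int, tolerance: float
) -> List[float]:
    """Hypothetical helper: make ``samples`` exactly ``expected`` items long."""
    diff = expected - len(samples)
    if diff == 0:
        return samples
    # The tolerance is expressed in seconds; convert it to a sample count.
    if abs(diff) > tolerance * sampling_rate:
        raise ValueError(f"Mismatch of {abs(diff)} samples exceeds the tolerance.")
    if diff < 0:
        # Too many samples: trim the excess from the end.
        return samples[:expected]
    # Too few samples: replicate the last ``diff`` samples to fill the gap
    # (this sketch assumes ``diff`` is not larger than ``len(samples)``).
    return samples + samples[-diff:]
```

With a 10 ms tolerance at 1000 Hz, a mismatch of up to 10 samples is repaired silently, while anything larger raises an error.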
- lhotse.audio.set_ffmpeg_torchaudio_info_enabled(enabled)[source]
Override Lhotse’s global setting for whether to use ffmpeg-torchaudio to compute the duration of audio files. If disabled, we fall back to using a different backend such as sox_io or soundfile.
Note
See this issue for more details: https://github.com/lhotse-speech/lhotse/issues/1026
Example:
>>> import lhotse
>>> lhotse.set_ffmpeg_torchaudio_info_enabled(False)  # don't use ffmpeg-torchaudio
- Parameters:
enabled (bool) – Whether to use torchaudio to compute audio file duration.
- Return type:
None
- lhotse.audio.null_result_on_audio_loading_error(func)[source]
This is a decorator that makes a function return None when reading audio with Lhotse fails.
Example:
>>> @null_result_on_audio_loading_error
... def func_loading_audio(rec):
...     audio = rec.load_audio()  # if this fails, will return None instead
...     return other_func(audio)
Another example:
>>> # crashes on loading audio
>>> audio = load_audio(cut)
>>> # does not crash on loading audio, return None instead
>>> maybe_audio: Optional = null_result_on_audio_loading_error(load_audio)(cut)
- Return type:
Callable
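For readers unfamiliar with the pattern, a decorator of this kind can be sketched as follows. The names `LoadingError` and `null_result_on_error` are hypothetical stand-ins used only for illustration; which exception types the real decorator catches is not stated here.

```python
import functools
from typing import Any, Callable, Optional


class LoadingError(Exception):
    """Hypothetical stand-in for an audio loading failure."""


def null_result_on_error(func: Callable[..., Any]) -> Callable[..., Optional[Any]]:
    """Wrap ``func`` so that a LoadingError yields None instead of propagating."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Optional[Any]:
        try:
            return func(*args, **kwargs)
        except LoadingError:
            # Only the loading failure is swallowed; other errors still propagate.
            return None

    return wrapper


@null_result_on_error
def load(path: str) -> str:
    if path.endswith(".broken"):
        raise LoadingError(path)
    return f"audio from {path}"
```

Calling code then has to handle the `None` case explicitly, for example by skipping the item.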
Supervision manifests
Data structures used for describing supervisions in a dataset.
- class lhotse.supervision.AlignmentItem(symbol: str, start: float, duration: float, score: float | None = None)[source]
This class contains an alignment item, for example a word, along with its start time (w.r.t. the start of recording) and duration. It can potentially be used to store other kinds of alignment items, such as subwords, pdfid’s etc.
-
symbol:
str Alias for field number 0
-
start:
float Alias for field number 1
-
duration:
float Alias for field number 2
-
score:
Optional[float] Alias for field number 3
- property end: float
- with_offset(offset)[source]
Return an identical AlignmentItem, but with the offset added to the start field.
- Return type:
AlignmentItem
- perturb_speed(factor, sampling_rate)[source]
Return an AlignmentItem that has time boundaries matching the recording/cut perturbed with the same factor. See SupervisionSegment.perturb_speed() for details.
- Return type:
AlignmentItem
- trim(end, start=0)[source]
See SupervisionSegment.trim().
- Return type:
AlignmentItem
- transform(transform_fn)[source]
Perform specified transformation on the alignment content.
- Return type:
AlignmentItem
- count(value, /)
Return number of occurrences of value.
- index(value, start=0, stop=9223372036854775807, /)
Return first index of value.
Raises ValueError if the value is not present.
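The "Alias for field number N" entries and the inherited count/index methods indicate a named-tuple structure. A minimal sketch of such a type is shown below; `Item` is a hypothetical re-creation for illustration, not the Lhotse class, and the rounding to 8 digits is an assumption made here to keep float arithmetic tidy.

```python
from typing import NamedTuple, Optional


class Item(NamedTuple):
    """Hypothetical named tuple mirroring the documented fields."""

    symbol: str
    start: float
    duration: float
    score: Optional[float] = None

    @property
    def end(self) -> float:
        return round(self.start + self.duration, 8)

    def with_offset(self, offset: float) -> "Item":
        # Named tuples are immutable, so a modified copy is returned.
        return self._replace(start=round(self.start + offset, 8))


item = Item(symbol="AY0", start=33.0, duration=0.6)
shifted = item.with_offset(-33.0)
```

Because the fields are positional, `item[0]` and `item.symbol` refer to the same value.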
- class lhotse.supervision.SupervisionSegment(id, recording_id, start, duration, channel=0, text=None, language=None, speaker=None, gender=None, custom=None, alignment=None)[source]
SupervisionSegment represents a time interval (segment) annotated with some supervision labels and/or metadata, such as the transcription, the speaker identity, the language, etc.
Each supervision has a unique id and always refers to a specific recording (via recording_id) and one or more channels (via channel; by default, 0). Note that multiple channels of the recording may share the same supervision, in which case the channel field will be a list of integers.
It’s also characterized by the start time (relative to the beginning of a Recording or a Cut) and a duration, both expressed in seconds.
The remaining fields are all optional, and their availability depends on specific corpora. Since it is difficult to predict all possible types of metadata, the custom field (a dict) can be used to insert types of supervisions that are not supported out of the box.
SupervisionSegment may contain multiple types of alignments. The alignment field is a dict, indexed by alignment’s type (e.g., word or phone), and contains a list of AlignmentItem objects – simple structures that contain a given symbol and its time interval. Alignments can be read from CTM files or created programmatically.
Examples
A simple segment with no supervision information:
>>> from lhotse import SupervisionSegment
>>> sup0 = SupervisionSegment(
...     id='rec00001-sup00000', recording_id='rec00001',
...     start=0.5, duration=5.0, channel=0
... )
Typical supervision containing transcript, speaker ID, gender, and language:
>>> sup1 = SupervisionSegment(
...     id='rec00001-sup00001', recording_id='rec00001',
...     start=5.5, duration=3.0, channel=0,
...     text='transcript of the second segment',
...     speaker='Norman Dyhrentfurth', language='English', gender='M'
... )
Two supervisions denoting overlapping speech on two separate channels in a microphone array/multiple headsets (pay attention to start, duration, and channel):
>>> sup2 = SupervisionSegment(
...     id='rec00001-sup00002', recording_id='rec00001',
...     start=15.0, duration=5.0, channel=0,
...     text="i have incredibly good news for you",
...     speaker='Norman Dyhrentfurth', language='English', gender='M'
... )
>>> sup3 = SupervisionSegment(
...     id='rec00001-sup00003', recording_id='rec00001',
...     start=18.0, duration=3.0, channel=1,
...     text="say what",
...     speaker='Hervey Arman', language='English', gender='M'
... )
A supervision with a phone alignment:
>>> from lhotse.supervision import AlignmentItem
>>> sup4 = SupervisionSegment(
...     id='rec00001-sup00004', recording_id='rec00001',
...     start=33.0, duration=1.0, channel=0,
...     text="ice",
...     speaker='Maryla Zechariah', language='English', gender='F',
...     alignment={
...         'phone': [
...             AlignmentItem(symbol='AY0', start=33.0, duration=0.6),
...             AlignmentItem(symbol='S', start=33.6, duration=0.4)
...         ]
...     }
... )
A supervision shared across multiple channels of a recording (e.g. a microphone array):
>>> sup5 = SupervisionSegment(
...     id='rec00001-sup00005', recording_id='rec00001',
...     start=33.0, duration=1.0, channel=[0, 1],
...     text="ice",
...     speaker='Maryla Zechariah',
... )
Converting SupervisionSegment to a dict:
>>> sup0.to_dict()
{'id': 'rec00001-sup00000', 'recording_id': 'rec00001', 'start': 0.5, 'duration': 5.0, 'channel': 0}
-
id:
str
-
recording_id:
str
-
start:
float
-
duration:
float
-
channel:
Union[int,List[int]] = 0
-
text:
Optional[str] = None
-
language:
Optional[str] = None
-
speaker:
Optional[str] = None
-
gender:
Optional[str] = None
-
custom:
Optional[Dict[str,Any]] = None
-
alignment:
Optional[Dict[str,List[AlignmentItem]]] = None
- property end: float
- with_offset(offset)[source]
Return an identical SupervisionSegment, but with the offset added to the start field.
- Return type:
SupervisionSegment
- perturb_speed(factor, sampling_rate, affix_id=True)[source]
Return a SupervisionSegment that has time boundaries matching the recording/cut perturbed with the same factor.
- Parameters:
factor (float) – The speed will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).
sampling_rate (int) – The sampling rate is necessary to accurately perturb the start and duration (going through the sample counts).
affix_id (bool) – When true, we will modify the id and recording_id fields by affixing them with “_sp{factor}”.
- Return type:
SupervisionSegment
- Returns:
a modified copy of the current SupervisionSegment.
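"Going through the sample counts" means that a time value is first converted to a number of samples, the sample count is scaled by the factor, and the result is converted back to seconds. A simplified sketch follows; `perturb_seconds` is a hypothetical helper, and the rounding used here (Python's built-in round) is an assumption that may differ from Lhotse's actual rounding rules.

```python
def perturb_seconds(seconds: float, factor: float, sampling_rate: int) -> float:
    """Hypothetical helper: map a time value onto the speed-perturbed timeline."""
    # Convert seconds to a whole number of samples.
    num_samples = round(seconds * sampling_rate)
    # A factor above 1 speeds the audio up, so fewer samples remain.
    perturbed_num_samples = round(num_samples / factor)
    # Convert back to seconds on the perturbed recording.
    return perturbed_num_samples / sampling_rate


new_duration = perturb_seconds(5.0, factor=1.1, sampling_rate=16000)
```

Working in sample counts keeps the supervision boundaries aligned with the number of samples that the perturbed audio can actually contain.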
- perturb_tempo(factor, sampling_rate, affix_id=True)[source]
Return a SupervisionSegment that has time boundaries matching the recording/cut perturbed with the same factor.
- Parameters:
factor (float) – The tempo will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).
sampling_rate (int) – The sampling rate is necessary to accurately perturb the start and duration (going through the sample counts).
affix_id (bool) – When true, we will modify the id and recording_id fields by affixing them with “_tp{factor}”.
- Return type:
SupervisionSegment
- Returns:
a modified copy of the current SupervisionSegment.
- perturb_volume(factor, affix_id=True)[source]
Return a SupervisionSegment with modified ids.
- Parameters:
factor (float) – The volume will be adjusted this many times (e.g. factor=1.1 means 1.1x louder).
affix_id (bool) – When true, we will modify the id and recording_id fields by affixing them with “_vp{factor}”.
- Return type:
SupervisionSegment
- Returns:
a modified copy of the current SupervisionSegment.
- narrowband(codec, affix_id=True)[source]
Return a SupervisionSegment with modified ids.
- Parameters:
codec (str) – Codec name.
affix_id (bool) – When true, we will modify the id and recording_id fields by affixing them with “_nb_{codec}”.
- Return type:
SupervisionSegment
- Returns:
a modified copy of the current SupervisionSegment.
- reverb_rir(affix_id=True, channel=None)[source]
Return a SupervisionSegment with modified ids.
- Parameters:
affix_id (bool) – When true, we will modify the id and recording_id fields by affixing them with “_rvb”.
- Return type:
SupervisionSegment
- Returns:
a modified copy of the current SupervisionSegment.
- trim(end, start=0)[source]
Return an identical SupervisionSegment, but ensure that self.start is not negative (in which case it’s set to 0) and self.end does not exceed the end parameter. If a start is optionally provided, the supervision is trimmed from the left (note that start should be relative to the cut times).
This method is useful for ensuring that the supervision does not exceed a cut’s bounds, in which case pass cut.duration as the end argument, since supervision times are relative to the cut.
- Return type:
SupervisionSegment
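The trimming rule can be illustrated with plain interval arithmetic. The function below is a hypothetical sketch operating on a (start, duration) pair rather than on a real SupervisionSegment.

```python
from typing import Tuple


def trim_interval(
    sup_start: float, sup_duration: float, end: float, start: float = 0.0
) -> Tuple[float, float]:
    """Hypothetical helper: clip [sup_start, sup_start + sup_duration] to [start, end]."""
    sup_end = sup_start + sup_duration
    # The start may not precede the left bound (0 by default).
    new_start = max(start, sup_start)
    # The end may not exceed the right bound, e.g. the cut's duration.
    new_end = min(end, sup_end)
    # Guard against an interval that lies entirely outside the bounds.
    return new_start, max(0.0, new_end - new_start)


# A supervision starting 1s before the cut and ending 1s after a 3s cut.
trimmed = trim_interval(-1.0, 5.0, end=3.0)
```

Here the supervision is clipped on both sides, so the result spans the whole 3-second cut.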
- map(transform_fn)[source]
Return a copy of the current segment, transformed with transform_fn.
- Parameters:
transform_fn (Callable[[SupervisionSegment],SupervisionSegment]) – a function that takes a segment as input, transforms it and returns a new segment.
- Return type:
SupervisionSegment
- Returns:
a modified SupervisionSegment.
- transform_text(transform_fn)[source]
Return a copy of the current segment with transformed text field. Useful for text normalization, phonetic transcription, etc.
- Parameters:
transform_fn (Callable[[str],str]) – a function that accepts a string and returns a string.
- Return type:
SupervisionSegment
- Returns:
a SupervisionSegment with adjusted text.
- transform_alignment(transform_fn, type='word')[source]
Return a copy of the current segment with transformed alignment field. Useful for text normalization, phonetic transcription, etc.
- Parameters:
type (Optional[str]) – alignment type to transform (key for alignment dict).
transform_fn (Callable[[str],str]) – a function that accepts a string and returns a string.
- Return type:
SupervisionSegment
- Returns:
a SupervisionSegment with adjusted alignments.
- __init__(id, recording_id, start, duration, channel=0, text=None, language=None, speaker=None, gender=None, custom=None, alignment=None)
- drop_custom(name)
- has_custom(name)
Check if the Cut has a custom attribute with name name.
- Parameters:
name (str) – name of the custom attribute.
- Return type:
bool
- Returns:
a boolean.
- load_custom(name, **kwargs)
Load custom data as a numpy array. The custom data is expected to have been stored in the cut’s custom field as an Array, TemporalArray, or Image manifest.
Note
It works with Array/Image manifests stored via attribute assignments, e.g.: cut.my_custom_data = Array(...) or cut = cut.attach_image('img', ...).
- Parameters:
name (str) – name of the custom attribute.
- Return type:
ndarray
- Returns:
a numpy array with the data.
- with_custom(name, value)
Return a copy of this object with an extra custom field assigned to it.
- class lhotse.supervision.SupervisionSet(segments=None)[source]
SupervisionSet represents a collection of segments containing some supervision information (see SupervisionSegment).
It acts as a Python list, extended with an efficient find operation that indexes and caches the supervision segments in an interval tree. This makes it possible to quickly find supervision segments that correspond to a specific time interval. However, it can also work with lazy iterables.
When coming from Kaldi, think of SupervisionSet as a segments file on steroids, that may also contain text, utt2spk, utt2gender, utt2dur, etc.
Examples
Building a SupervisionSet:
>>> from lhotse import SupervisionSet, SupervisionSegment
>>> sups = SupervisionSet.from_segments([SupervisionSegment(...), ...])
Writing/reading a SupervisionSet:
>>> sups.to_file('supervisions.jsonl.gz')
>>> sups2 = SupervisionSet.from_file('supervisions.jsonl.gz')
Using SupervisionSet like a dict:
>>> 'rec00001-sup00000' in sups
True
>>> sups['rec00001-sup00000']
SupervisionSegment(id='rec00001-sup00000', recording_id='rec00001', start=0.5, ...)
>>> for segment in sups:
...     pass
Searching by recording_id and time interval:
>>> matched_segments = sups.find(recording_id='rec00001', start_after=17.0, end_before=25.0)
Manipulation:
>>> longer_than_5s = sups.filter(lambda s: s.duration > 5)
>>> first_100 = sups.subset(first=100)
>>> split_into_4 = sups.split(num_splits=4)
>>> shuffled = sups.shuffle()
- property data: Dict[str, SupervisionSegment] | Iterable[SupervisionSegment]
Alias property for self.segments
- property ids: Iterable[str]
- static from_items(segments)
Function to be implemented by every sub-class of this mixin. It’s expected to create a sub-class instance out of an iterable of items that are held by the sub-class (e.g., CutSet.from_items(iterable_of_cuts)).
- Return type:
SupervisionSet
- static from_rttm(path)[source]
Read an RTTM file located at path (or an iterable of such paths) and create a SupervisionSet manifest from it. Can be used to create supervisions from custom RTTM files (see, for example, lhotse.dataset.DiarizationDataset).
>>> from pathlib import Path
>>> from lhotse import SupervisionSet
>>> sup1 = SupervisionSet.from_rttm('/path/to/rttm_file')
>>> sup2 = SupervisionSet.from_rttm(Path('/path/to/rttm_dir').rglob('ref_*'))
The following description is taken from the [dscore](https://github.com/nryant/dscore#rttm) toolkit:
Rich Transcription Time Marked (RTTM) files are space-delimited text files containing one turn per line, each line containing ten fields:
- Type – segment type; should always be SPEAKER
- File ID – file name; basename of the recording minus extension (e.g., rec1_a)
- Channel ID – channel (1-indexed) that turn is on; should always be 1
- Turn Onset – onset of turn in seconds from beginning of recording
- Turn Duration – duration of turn in seconds
- Orthography Field – should always be <NA>
- Speaker Type – should always be <NA>
- Speaker Name – name of speaker of turn; should be unique within scope of each file
- Confidence Score – system confidence (probability) that information is correct; should always be <NA>
- Signal Lookahead Time – should always be <NA>
For instance:
SPEAKER CMU_20020319-1400_d01_NONE 1 130.430000 2.350 <NA> <NA> juliet <NA> <NA>
SPEAKER CMU_20020319-1400_d01_NONE 1 157.610000 3.060 <NA> <NA> tbc <NA> <NA>
SPEAKER CMU_20020319-1400_d01_NONE 1 130.490000 0.450 <NA> <NA> chek <NA> <NA>
- Parameters:
path (Union[Path,str,Iterable[Union[Path,str]]]) – Path to RTTM file or an iterator of paths to RTTM files.
- Return type:
SupervisionSet
- Returns:
a new SupervisionSet instance containing segments from the RTTM file.
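To make the ten-field layout concrete, the sketch below parses a single RTTM line into the fields that carry information. `RttmTurn` and `parse_rttm_line` are hypothetical names used for illustration; this is not how from_rttm is implemented.

```python
from typing import NamedTuple


class RttmTurn(NamedTuple):
    """Hypothetical container for the informative RTTM fields."""

    file_id: str
    channel: int
    onset: float
    duration: float
    speaker: str


def parse_rttm_line(line: str) -> RttmTurn:
    fields = line.split()
    if len(fields) != 10 or fields[0] != "SPEAKER":
        raise ValueError(f"Not a valid RTTM SPEAKER line: {line!r}")
    # Fields 5, 6, 8 and 9 are expected to be <NA> and are ignored here.
    return RttmTurn(
        file_id=fields[1],
        channel=int(fields[2]),
        onset=float(fields[3]),
        duration=float(fields[4]),
        speaker=fields[7],
    )


turn = parse_rttm_line(
    "SPEAKER CMU_20020319-1400_d01_NONE 1 130.430000 2.350 <NA> <NA> juliet <NA> <NA>"
)
```

Each parsed turn carries what a supervision needs: a recording identifier, a channel, a start time, a duration, and a speaker label.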
- with_alignment_from_ctm(ctm_file, type='word', match_channel=False, verbose=False)[source]
Add alignments from CTM file to the supervision set.
- Parameters:
ctm_file – Path to CTM file.
type (str) – Alignment type (optional, default = word).
match_channel (bool) – if True, also match channel between CTM and SupervisionSegment.
verbose (bool) – if True, show progress bar.
- Return type:
SupervisionSet
- Returns:
A new SupervisionSet with AlignmentItem objects added to the segments.
- write_alignment_to_ctm(ctm_file, type='word')[source]
Write alignments to CTM file.
- Parameters:
ctm_file (Union[Path,str]) – Path to output CTM file (will be created if it does not exist).
type (str) – Alignment type to write (default = word).
- Return type:
None
- split(num_splits, shuffle=False, drop_last=False)[source]
Split the SupervisionSet into num_splits pieces of equal size.
- Parameters:
num_splits (int) – Requested number of splits.
shuffle (bool) – Optionally shuffle the order of the supervisions first.
drop_last (bool) – determines how to handle splitting when len(seq) is not divisible by num_splits. When False (default), the splits might have unequal lengths. When True, it may discard the last element in some splits to ensure they are equally long.
- Return type:
List[SupervisionSet]
- Returns:
A list of SupervisionSet pieces.
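The effect of drop_last can be shown on a plain list. The helper below is a hypothetical sketch that follows the description above; exactly which elements Lhotse assigns to each split, or discards, is an assumption and may differ.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_sequence(
    seq: Sequence[T], num_splits: int, drop_last: bool = False
) -> List[List[T]]:
    """Hypothetical helper: split ``seq`` into ``num_splits`` contiguous pieces."""
    items = list(seq)
    base, remainder = divmod(len(items), num_splits)
    splits = []
    begin = 0
    for i in range(num_splits):
        # The first ``remainder`` splits receive one extra element.
        size = base + (1 if i < remainder else 0)
        chunk = items[begin : begin + size]
        begin += size
        # With drop_last, the extra element is discarded so all splits match.
        splits.append(chunk[:base] if drop_last else chunk)
    return splits


uneven = split_sequence(range(10), num_splits=3)
even = split_sequence(range(10), num_splits=3, drop_last=True)
```

With ten items and three splits, the default yields sizes 4, 3 and 3, whereas drop_last=True yields three splits of size 3 and discards one item.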
- split_lazy(output_dir, chunk_size, prefix='')[source]
Splits a manifest (either lazily or eagerly opened) into chunks, each with chunk_size items (except for the last one, typically).
In order to be memory efficient, this implementation saves each chunk to disk in a .jsonl.gz format as the input manifest is sampled.
Note
For lowest memory usage, use load_manifest_lazy to open the input manifest for this method.
- Parameters:
it – any iterable of Lhotse manifests.
output_dir (Union[Path,str]) – directory where the split manifests are saved. Each manifest is saved at: {output_dir}/{prefix}.{split_idx}.jsonl.gz
chunk_size (int) – the number of items in each chunk.
prefix (str) – the prefix of each manifest.
- Return type:
List[SupervisionSet]
- Returns:
a list of lazily opened chunk manifests.
- subset(first=None, last=None)[source]
Return a new SupervisionSet according to the selected subset criterion. Only a single argument to subset is supported at this time.
- Parameters:
first (Optional[int]) – int, the number of first supervisions to keep.
last (Optional[int]) – int, the number of last supervisions to keep.
- Return type:
SupervisionSet
- Returns:
a new SupervisionSet with the subset results.
- transform_text(transform_fn)[source]
Return a copy of the current SupervisionSet with the segments having a transformed text field. Useful for text normalization, phonetic transcription, etc.
- Parameters:
transform_fn (Callable[[str],str]) – a function that accepts a string and returns a string.
- Return type:
SupervisionSet
- Returns:
a SupervisionSet with adjusted text.
- transform_alignment(transform_fn, type='word')[source]
Return a copy of the current SupervisionSet with the segments having a transformed alignment field. Useful for text normalization, phonetic transcription, etc.
- Parameters:
transform_fn (Callable[[str],str]) – a function that accepts a string and returns a string.
type (str) – alignment type to transform (key for alignment dict).
- Return type:
SupervisionSet
- Returns:
a SupervisionSet with adjusted alignments.
- find(recording_id, channel=None, start_after=0, end_before=None, adjust_offset=False, tolerance=0.001)[source]
Return an iterable of segments that match the provided recording_id.
- Parameters:
recording_id (str) – Desired recording ID.
channel (Optional[int]) – When specified, return supervisions in that channel - otherwise, in all channels.
start_after (float) – When specified, return segments that start after the given value.
end_before (Optional[float]) – When specified, return segments that end before the given value.
adjust_offset (bool) – When true, return segments as if the recordings had started at start_after. This is useful for creating Cuts. From a user perspective, when dealing with a Cut, it is no longer helpful to know when the supervision starts in a recording - instead, it’s useful to know when the supervision starts relative to the start of the Cut. In the anticipated use-case, start_after and end_before would be the beginning and end of a cut; this option converts the times to be relative to the start of the cut.
tolerance (float) – Additional margin to account for floating point rounding errors when comparing segment boundaries.
- Return type:
Iterable[SupervisionSegment]
- Returns:
An iterator over supervision segments satisfying all criteria.
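The selection criteria can be sketched with a simple linear scan. `Segment` and `find_segments` below are hypothetical stand-ins; the real method relies on an interval-tree index, handles channels, and may apply the tolerance differently.

```python
from dataclasses import dataclass, replace
from typing import Iterable, Iterator, Optional


@dataclass
class Segment:
    """Hypothetical minimal stand-in for a supervision segment."""

    id: str
    recording_id: str
    start: float
    duration: float

    @property
    def end(self) -> float:
        return self.start + self.duration


def find_segments(
    segments: Iterable[Segment],
    recording_id: str,
    start_after: float = 0.0,
    end_before: Optional[float] = None,
    adjust_offset: bool = False,
    tolerance: float = 0.001,
) -> Iterator[Segment]:
    for seg in segments:
        if seg.recording_id != recording_id:
            continue
        # The tolerance widens the window slightly on both sides.
        if seg.start < start_after - tolerance:
            continue
        if end_before is not None and seg.end > end_before + tolerance:
            continue
        # Optionally express the start relative to ``start_after``.
        yield replace(seg, start=seg.start - start_after) if adjust_offset else seg


segs = [
    Segment("a", "rec1", 0.5, 5.0),
    Segment("b", "rec1", 18.0, 3.0),
    Segment("c", "rec2", 18.0, 3.0),
]
found = list(find_segments(segs, "rec1", start_after=17.0, end_before=25.0))
relative = list(
    find_segments(segs, "rec1", start_after=17.0, end_before=25.0, adjust_offset=True)
)
```

With adjust_offset=True, segment "b" is reported as starting 1.0 second into the window instead of 18.0 seconds into the recording.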
- filter(predicate)
Return a new manifest containing only the items that satisfy predicate. If the manifest is lazy, the filtering will also be applied lazily.
- Parameters:
predicate (Callable[[TypeVar(T)],bool]) – a function that takes a cut as an argument and returns bool.
- Returns:
a filtered manifest.
- classmethod from_file(path)
- Return type:
Any
- classmethod from_json(path)
- Return type:
Any
- classmethod from_jsonl(path)
- Return type:
Any
- classmethod from_jsonl_lazy(path)
Read a JSONL manifest in a lazy manner, which opens the file but does not read it immediately. It is only suitable for sequential reads and iteration.
- Return type:
Any
Warning
Opening the manifest in this way might cause some methods that rely on random access to fail.
- classmethod from_yaml(path)
- Return type:
Any
- classmethod infinite_mux(*manifests, weights=None, seed=0, max_open_streams=None)
Merges multiple manifest iterables into a new iterable by lazily multiplexing them during iteration time. Unlike mux(), this method allows limiting the maximum number of open sub-iterators at any given time.
To enable this, it performs 2-stage sampling. First, it samples with replacement the set of iterators I to construct a subset I_sub of size max_open_streams. Then, for each iteration step, it samples an iterator i from I_sub, fetches the next item from it, and yields it. Once i becomes exhausted, it is replaced with a new iterator j sampled from I_sub.
Caution
Do not use this method with inputs that are infinitely iterable as they will silently break the multiplexing property by only using a subset of the input iterables.
Caution
This method is not recommended for multiplexing for a small number of iterations, as it may be much less accurate than mux() depending on the number of open streams, iterable sizes, and the random seed.
- Parameters:
manifests – iterables to be multiplexed. They can be either lazy or eager, but the resulting manifest will always be lazy.
weights (Optional[List[Union[int,float]]]) – an optional weight for each iterable, affects the probability of it being sampled. The weights are uniform by default. If lengths are known, it makes sense to pass them here for uniform distribution of items in the expectation.
seed (Union[int,Literal['trng','randomized']]) – the random seed, ensures deterministic order across multiple iterations.
max_open_streams (Optional[int]) – the number of iterables that can be open simultaneously at any given time.
- property is_lazy: bool
Indicates whether this manifest was opened in lazy (read-on-the-fly) mode or not.
- map(transform_fn)
Apply transform_fn to each item in this manifest and return a new manifest. If the manifest is opened lazily, the transform is also applied lazily.
- Parameters:
transform_fn (Callable[[TypeVar(T)],TypeVar(T)]) – A callable (function) that accepts a single item instance and returns a new (or the same) instance of the same type. E.g. with CutSet, the callable accepts a Cut and also returns a Cut.
- Returns:
a new CutSet with transformed cuts.
- classmethod mux(*manifests, stop_early=False, weights=None, seed=0)
Merges multiple manifest iterables into a new iterable by lazily multiplexing them during iteration time. If one of the iterables is exhausted before the others, we will keep iterating until all iterables are exhausted. This behavior can be changed with the stop_early parameter.
- Parameters:
manifests – iterables to be multiplexed. They can be either lazy or eager, but the resulting manifest will always be lazy.
stop_early (bool) – should we stop the iteration as soon as we exhaust one of the manifests.
weights (Optional[List[Union[int,float]]]) – an optional weight for each iterable, affects the probability of it being sampled. The weights are uniform by default. If lengths are known, it makes sense to pass them here for uniform distribution of items in the expectation.
seed (Union[int,Literal['trng','randomized']]) – the random seed, ensures deterministic order across multiple iterations.
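Weighted lazy multiplexing can be sketched with a generator that repeatedly samples which input stream to advance. `mux_sketch` is a hypothetical illustration of the idea, not Lhotse's implementation, and it only supports integer seeds.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def mux_sketch(
    *iterables: Iterable[T],
    weights: Optional[List[float]] = None,
    seed: int = 0,
    stop_early: bool = False,
) -> Iterator[T]:
    """Hypothetical sketch: interleave items by sampling a source at each step."""
    rng = random.Random(seed)
    iterators = [iter(it) for it in iterables]
    all_weights = list(weights) if weights is not None else [1.0] * len(iterators)
    active = list(range(len(iterators)))
    while active:
        # Sample one of the still-active streams according to its weight.
        idx = rng.choices(active, weights=[all_weights[i] for i in active])[0]
        try:
            yield next(iterators[idx])
        except StopIteration:
            if stop_early:
                return
            # Keep going with the remaining streams.
            active.remove(idx)


a, b = [1, 2, 3], [10, 20]
mixed = list(mux_sketch(a, b, seed=0))
```

Every item is yielded exactly once, items from the same source keep their relative order, and a fixed seed reproduces the same interleaving.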
- classmethod open_writer(path, overwrite=True)
Open a sequential writer that allows storing the manifests one by one, without the necessity of storing the whole manifest set in-memory. Supports writing to JSONL format (.jsonl), with optional gzip compression (.jsonl.gz).
- Return type:
Union[SequentialJsonlWriter,InMemoryWriter]
Note
When path is None, we will return an InMemoryWriter instead, which has the same API but stores the manifests in memory. It is convenient when you want to make disk saving optional.
Example:
>>> from lhotse import RecordingSet
... recordings = [...]
... with RecordingSet.open_writer('recordings.jsonl.gz') as writer:
...     for recording in recordings:
...         writer.write(recording)
This writer can be useful for continuing to write files that were previously stopped – it will open the existing file and scan it for item IDs to skip writing them later. It can also be queried for existing IDs so that the user code may skip preparing the corresponding manifests.
Example:
>>> from pathlib import Path
>>> from lhotse import RecordingSet, Recording
... with RecordingSet.open_writer('recordings.jsonl.gz', overwrite=False) as writer:
...     for path in Path('.').rglob('*.wav'):
...         recording_id = path.stem
...         if writer.contains(recording_id):
...             # Item already written previously - skip processing.
...             continue
...         # Item doesn't exist yet - run extra work to prepare the manifest
...         # and store it.
...         recording = Recording.from_file(path, recording_id=recording_id)
...         writer.write(recording)
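The resume behaviour described above (scan the existing file for IDs, then append only unseen items) can be sketched with the standard library. `ResumableJsonlWriter` is a hypothetical class written for illustration; it handles plain dicts with an "id" key and uncompressed JSONL only, unlike the writer returned by open_writer.

```python
import json
from pathlib import Path
from typing import Any, Dict, Set


class ResumableJsonlWriter:
    """Hypothetical sketch of a sequential JSONL writer that can resume."""

    def __init__(self, path: str, overwrite: bool = True) -> None:
        self.path = Path(path)
        self.seen_ids: Set[str] = set()
        if not overwrite and self.path.is_file():
            # Scan the existing file so already-written IDs can be skipped.
            with self.path.open() as f:
                for line in f:
                    if line.strip():
                        self.seen_ids.add(json.loads(line)["id"])
        self.mode = "w" if overwrite else "a"
        self.file = None

    def __enter__(self) -> "ResumableJsonlWriter":
        self.file = self.path.open(self.mode)
        return self

    def __exit__(self, *exc_info: Any) -> None:
        self.file.close()

    def contains(self, item_id: str) -> bool:
        return item_id in self.seen_ids

    def write(self, item: Dict[str, Any]) -> bool:
        """Write one item as a JSON line; return False if its ID was seen before."""
        if item["id"] in self.seen_ids:
            return False
        print(json.dumps(item), file=self.file)
        self.seen_ids.add(item["id"])
        return True
```

Checking contains() before doing expensive preparation work is what makes an interrupted run cheap to resume.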
- repeat(times=None, preserve_id=False)
Return a new, lazily evaluated manifest that iterates over the original elements times number of times.
- Parameters:
times (Optional[int]) – how many times to repeat (infinite by default).
preserve_id (bool) – when True, we won’t update the element ID with repeat number.
- Returns:
a repeated manifest.
- shuffle(rng=None, buffer_size=10000)
Shuffles the elements and returns a shuffled variant of self. If the manifest is opened lazily, performs shuffling on-the-fly with a fixed buffer size.
- Parameters:
rng (Optional[Random]) – an optional instance of random.Random for precise control of randomness.
- Returns:
a shuffled copy of self, or a manifest that is shuffled lazily.
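Shuffling on-the-fly with a fixed buffer size is commonly done with a reservoir-style buffer: items are read into a buffer, and once it is full, each new item replaces a randomly chosen buffered item, which is emitted. The generator below is a hypothetical sketch of that general technique and is not taken from Lhotse's source.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def buffered_shuffle(
    items: Iterable[T], buffer_size: int, rng: Optional[random.Random] = None
) -> Iterator[T]:
    """Hypothetical sketch: approximate shuffle of a stream in bounded memory."""
    rng = rng if rng is not None else random.Random()
    buffer: List[T] = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered item and put the new one in its place.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # The input is exhausted: flush the remaining items in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(100), buffer_size=10, rng=random.Random(0)))
```

A larger buffer gives a result closer to a full shuffle at the cost of memory; an item can never be emitted before the buffer has filled up to its position in the stream.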
- to_eager()
Evaluates all lazy operations on this manifest, if any, and returns a copy that keeps all items in memory. If the manifest was “eager” already, this is a no-op and won’t copy anything.
- to_file(path)
- Return type:
None
- to_json(path)
- Return type:
None
- to_jsonl(path)
- Return type:
None
- to_yaml(path)
- Return type:
None
Lhotse Shar – sequential storage
Documentation for Lhotse Shar multi-tarfile sequential I/O format.
Lhotse Shar readers
- class lhotse.shar.readers.LazySharIterator(fields=None, in_dir=None, split_for_dataloading=False, shuffle_shards=False, stateful_shuffle=True, seed=42, cut_map_fns=None, slice_length=None)[source]
LazySharIterator reads cuts and their corresponding data from multiple shards, also recognized as the Lhotse Shar format. Each shard is numbered and represented as a collection of one text manifest and one or more binary tarfiles. Each tarfile contains a single type of data, e.g., recordings, features, or custom fields.
Given an example directory named some_dir, its expected layout is some_dir/cuts.000000.jsonl.gz, some_dir/recording.000000.tar, some_dir/features.000000.tar, and then the same names but numbered with 000001, etc. There may also be other files if the cuts have custom data attached to them.
The main idea behind Lhotse Shar format is to optimize dataloading with sequential reads, while keeping the data composition more flexible than e.g. WebDataset tar archives do. To achieve this, Lhotse Shar keeps each data type in a separate archive, alongside a single CutSet JSONL manifest. This way, the metadata can be investigated without iterating through the binary data. The format also allows iteration over a subset of fields, or extension of existing data with new fields.
As you iterate over cuts from LazySharIterator, it keeps a file handle open for the JSONL manifest and all of the tar files that correspond to the current shard. The tar files are read item by item together, and their binary data is attached to the cuts. It can be normally accessed using methods such as cut.load_audio().
We can simply load a directory created by SharWriter. Example:
>>> cuts = LazySharIterator(in_dir="some_dir")
... for cut in cuts:
...     print("Cut", cut.id, "has duration of", cut.duration)
...     audio = cut.load_audio()
...     fbank = cut.load_features()
LazySharIterator can also be initialized from a dict, where the keys indicate fields to be read, and the values point to actual shard locations. This is useful when only a subset of data is needed, or it is stored in different directories. Example:
>>> cuts = LazySharIterator({
...     "cuts": ["some_dir/cuts.000000.jsonl.gz"],
...     "recording": ["another_dir/recording.000000.tar"],
...     "features": ["yet_another_dir/features.000000.tar"],
... })
... for cut in cuts:
...     print("Cut", cut.id, "has duration of", cut.duration)
...     audio = cut.load_audio()
...     fbank = cut.load_features()
We also support providing shell commands as shard sources, inspired by WebDataset. Example:
>>> cuts = LazySharIterator({
...     "cuts": ["pipe:curl https://my.page/cuts.000000.jsonl.gz"],
...     "recording": ["pipe:curl https://my.page/recording.000000.tar"],
... })
... for cut in cuts:
...     print("Cut", cut.id, "has duration of", cut.duration)
...     audio = cut.load_audio()
Finally, we allow specifying URLs or cloud storage URIs for the shard sources. We defer to the smart_open library to handle those. Example:
>>> cuts = LazySharIterator({
...     "cuts": ["s3://my-bucket/cuts.000000.jsonl.gz"],
...     "recording": ["s3://my-bucket/recording.000000.tar"],
... })
... for cut in cuts:
...     print("Cut", cut.id, "has duration of", cut.duration)
...     audio = cut.load_audio()
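The relationship between the directory layout and the fields dict can be sketched by grouping shard file names by their leading field name. `group_shards` is a hypothetical helper for illustration; how LazySharIterator discovers shards inside in_dir is not described here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_shards(filenames: Iterable[str]) -> Dict[str, List[str]]:
    """Hypothetical helper: map each field name to its sorted list of shards."""
    fields: Dict[str, List[str]] = defaultdict(list)
    # Sorting relies on the zero-padded shard index to keep shards aligned.
    for name in sorted(filenames):
        # "recording.000000.tar" -> field "recording"
        field = name.split(".")[0]
        fields[field].append(name)
    return dict(fields)


fields = group_shards(
    [
        "cuts.000001.jsonl.gz",
        "recording.000000.tar",
        "cuts.000000.jsonl.gz",
        "recording.000001.tar",
    ]
)
```

The resulting dict has the same shape as the fields argument shown in the examples above, with a "cuts" entry plus one entry per data type.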
- Parameters:
fields (Optional[Dict[str,Sequence[Union[Path,str]]]]) – a dict whose keys specify which fields to load, and values are lists of shards (either paths or shell commands). The field “cuts” pointing to CutSet shards always has to be present.
in_dir (Union[Path,str,None]) – path to a directory created with SharWriter with all the shards in a single place. Can be used instead of fields.
split_for_dataloading (bool) – bool, by default False which does nothing. Setting it to True is intended for PyTorch training with multiple dataloader workers and possibly multiple DDP nodes. It results in each node+worker combination receiving a unique subset of shards from which to read data to avoid data duplication. This is mutually exclusive with seed='randomized'.
shuffle_shards (bool) – bool, by default False. When True, the shards are shuffled (in case of multi-node training, the shuffling is the same on each node given the same seed).
seed (Union[int,Literal['randomized'],Literal['trng']]) – When shuffle_shards is True, we use this number to seed the RNG. Seed can be set to 'randomized' in which case we expect that the user provided lhotse.dataset.dataloading.worker_init_fn() as DataLoader’s worker_init_fn argument. It will cause the iterator to shuffle shards differently on each node and dataloading worker in PyTorch training. This is mutually exclusive with split_for_dataloading=True. Seed can be set to 'trng' which, like 'randomized', shuffles the shards differently on each iteration, but is not possible to control (and is not reproducible). trng mode is mostly useful when the user has limited control over the training loop and may not be able to guarantee that the internal Shar epoch is being incremented, but needs randomness on each iteration (e.g. useful with PyTorch Lightning).
stateful_shuffle (bool) – bool, by default True. When True, every time this object is fully iterated, it increments an internal epoch counter and triggers shard reshuffling with RNG seeded by seed+epoch. Doesn’t have any effect when shuffle_shards is False.
cut_map_fns (Optional[Sequence[Callable[[Cut],Cut]]]) – optional sequence of callables that accept cuts and return cuts. It’s expected to have the same length as the number of shards, so each function corresponds to a specific shard. It can be used to attach shard-specific custom attributes to cuts.
slice_length (Optional[int]) – optional int, when set enables random slicing of shards that may improve sampling randomness for many-dataset-with-many-large-shards setups at the cost of efficiency. In this mode, we randomly select K, skip the first K examples, and read only slice_length examples from each shard, then move to the next one.
See also:
SharWriter
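The interaction of shuffle_shards, seed, stateful_shuffle, and split_for_dataloading described above can be pictured with a small standalone sketch. It does not import Lhotse: the helper names shard_order and shards_for_worker are hypothetical, and the round-robin assignment of shards to node+worker combinations is an assumption made for illustration, not necessarily how LazySharIterator partitions shards.

```python
import random
from typing import List, Sequence


def shard_order(shards: Sequence[str], seed: int, epoch: int) -> List[str]:
    """Return the shards shuffled with an RNG seeded by seed + epoch."""
    rng = random.Random(seed + epoch)
    order = list(shards)
    rng.shuffle(order)
    return order


def shards_for_worker(shards: Sequence[str], rank: int, world_size: int) -> List[str]:
    """Assumed round-robin split: each node+worker rank reads a disjoint subset."""
    return list(shards[rank::world_size])


shards = [f"recording.{i:06d}.tar" for i in range(8)]

# The same seed and epoch always reproduce the same order;
# bumping the epoch reseeds the RNG for the next pass.
epoch0 = shard_order(shards, seed=42, epoch=0)
epoch1 = shard_order(shards, seed=42, epoch=1)
print(epoch0)
print(epoch1)

# Four node+worker combinations, each reading two distinct shards.
parts = [shards_for_worker(shards, rank, 4) for rank in range(4)]
print(parts)
```

Seeding with seed + epoch keeps every node in agreement about the order within an epoch while still changing the order between epochs.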
- class lhotse.shar.readers.TarIterator(source)[source]
TarIterator is a convenience class for reading arrays/audio stored in Lhotse Shar tar files. It is specific to Lhotse Shar format and expects the tar file to have the following structure:
each file is stored in a separate tar member
the file name is the key of the array
- every array has two corresponding files:
the metadata: the file extension is .json and the file contains a Lhotse manifest (Recording, Array, TemporalArray, Features) for the data item.
the data: the file extension is the format of the array, and the file contents are the serialized array (possibly compressed).
the data file can be empty in case some cut did not contain that field. In that case, the data file has extension .nodata and the manifest file has extension .nometa.
these files are saved one after another, the data is first, and the metadata follows.
Iterating over TarIterator yields tuples of (Optional[manifest], filename) where manifest is a Lhotse manifest with binary data attached to it, and filename is the name of the data file inside the tar archive.
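The layout described above (data member first, metadata member second, with .nodata/.nometa placeholders) can be illustrated with the standard library alone. This is a sketch and not Lhotse code: the member names, the plain-dict metadata standing in for a Lhotse manifest, and the helpers add_member and iter_pairs are all hypothetical.

```python
import io
import json
import tarfile


def add_member(tar: tarfile.TarFile, name: str, payload: bytes) -> None:
    """Append one in-memory file to the archive."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


# Build a tiny in-memory archive: each data member is followed by its metadata.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    add_member(tar, "cut-1.npy", b"\x00\x01\x02")  # stand-in for a serialized array
    add_member(tar, "cut-1.json", json.dumps({"id": "cut-1"}).encode())
    add_member(tar, "cut-2.nodata", b"")  # this cut had no such field
    add_member(tar, "cut-2.nometa", b"")


def iter_pairs(fileobj):
    """Yield (metadata_or_None, data_file_name), reading members two at a time."""
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        members = tar.getmembers()
        for data, meta in zip(members[0::2], members[1::2]):
            if data.name.endswith(".nodata"):
                yield None, data.name
            else:
                yield json.loads(tar.extractfile(meta).read()), data.name


buffer.seek(0)
pairs = list(iter_pairs(buffer))
print(pairs)
```

Keeping the two members adjacent means a reader never has to seek: one sequential pass over the archive is enough to pair every data file with its description.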
Lhotse Shar writers
- class lhotse.shar.writers.ArrayTarWriter(pattern, shard_size=1000, compression='numpy', lilcom_tick_power=-5, shard_offset=0)[source]
ArrayTarWriter writes numpy arrays or PyTorch tensors into a tar archive that is automatically sharded.
For floating point tensors, we support the option to use lilcom compression. Note that lilcom is only suitable for log-space features such as log-Mel filter banks.
Example:
>>> with ArrayTarWriter("some_dir/fbank.%06d.tar", shard_size=100, compression="lilcom") as w:
...     w.write("fbank1", fbank1_array)
...     w.write("fbank2", fbank2_array)  # etc.
It would create files such as some_dir/fbank.000000.tar, some_dir/fbank.000001.tar, etc. The starting shard offset can be set using the shard_offset parameter. The writer starts from 0 by default.
It’s also possible to use ArrayTarWriter with automatic sharding disabled:
>>> with ArrayTarWriter("some_dir/fbank.tar", shard_size=None, compression="numpy") as w:
...     w.write("fbank1", fbank1_array)
...     w.write("fbank2", fbank2_array)  # etc.
See also:
TarWriter, AudioTarWriter
- __init__(pattern, shard_size=1000, compression='numpy', lilcom_tick_power=-5, shard_offset=0)[source]
- property output_paths: List[str]
- class lhotse.shar.writers.AudioTarWriter(pattern, shard_size=1000, format='flac', shard_offset=0)[source]
AudioTarWriter writes audio examples in numpy arrays or PyTorch tensors into a tar archive that is automatically sharded.
It is different from ArrayTarWriter in that it supports audio-specific compression mechanisms, such as flac, opus, mp3, or wav.
Example:
>>> with AudioTarWriter("some_dir/audio.%06d.tar", shard_size=100, format="mp3") as w:
...     w.write("audio1", audio1_array)
...     w.write("audio2", audio2_array)  # etc.
It would create files such as some_dir/audio.000000.tar, some_dir/audio.000001.tar, etc. The starting shard offset can be set using the shard_offset parameter. The writer starts from 0 by default.
It’s also possible to use AudioTarWriter with automatic sharding disabled:
>>> with AudioTarWriter("some_dir/audio.tar", shard_size=None, format="flac") as w:
...     w.write("audio1", audio1_array)
...     w.write("audio2", audio2_array)  # etc.
See also:
TarWriter, ArrayTarWriter
- property output_paths: List[str]
- class lhotse.shar.writers.JsonlShardWriter(pattern, shard_size=1000, shard_offset=0)[source]
JsonlShardWriter writes Cuts or dicts into multiple JSONL file shards. The JSONL can be compressed with gzip if the file extension ends with .gz.
Example:
>>> with JsonlShardWriter("some_dir/cuts.%06d.jsonl.gz", shard_size=100) as w:
...     for cut in ...:
...         w.write(cut)
It would create files such as some_dir/cuts.000000.jsonl.gz, some_dir/cuts.000001.jsonl.gz, etc. The starting shard offset can be set using the shard_offset parameter. The writer starts from 0 by default.
See also:
TarWriter
- property sharding_enabled: bool
- property output_paths: List[str]
- class lhotse.shar.writers.SharWriter(output_dir, fields, shard_size=1000, warn_unused_fields=True, include_cuts=True, shard_suffix=None, shard_offset=0)[source]
SharWriter writes cuts and their corresponding data into multiple shards, also recognized as the Lhotse Shar format. Each shard is numbered and represented as a collection of one text manifest and one or more binary tarfiles. Each tarfile contains a single type of data, e.g., recordings, features, or custom fields.
The main idea behind the Lhotse Shar format is to optimize dataloading with sequential reads, while keeping the data composition more flexible than e.g. WebDataset tar archives do. To achieve this, Lhotse Shar keeps each data type in a separate archive, alongside a single CutSet JSONL manifest. This way, the metadata can be investigated without iterating through the binary data. The format also allows iteration over a subset of fields, or extension of existing data with new fields.
The user has to specify which fields should be saved, and what compression to use for each of them. Currently we support wav, flac, opus, and mp3 compression for recording and custom audio fields, and lilcom or numpy for features and custom array fields.
Example:
>>> cuts = CutSet(...)  # cuts have 'recording' and 'features'
>>> with SharWriter("some_dir", shard_size=100, fields={"recording": "opus", "features": "lilcom"}) as w:
...     for cut in cuts:
...         w.write(cut)
Note
Different audio backends in Lhotse may use different encoders for the same audio formats. It is advisable to use the same audio backend for saving and loading audio data in Shar and other formats. See: lhotse.audio.recording.Recording.
It would create a directory some_dir with files such as some_dir/cuts.000000.jsonl.gz, some_dir/recording.000000.tar, some_dir/features.000000.tar, and then the same names but numbered with 000001, etc. The starting shard offset can be set using the shard_offset parameter. The writer starts from 0 by default.
When shard_size is set to None, we will disable automatic sharding and the shard number suffix will be omitted from the file names.
The option warn_unused_fields will emit a warning when cuts have some data attached to them (e.g., recording, features, or custom arrays) but saving it was not specified via fields.
The option include_cuts controls whether we store the cuts alongside fields (true by default). Turning it off is useful when extending an existing dataset with new fields/feature types, but the original cuts do not require any modification.
- See also:
TarWriter, AudioTarWriter
- __init__(output_dir, fields, shard_size=1000, warn_unused_fields=True, include_cuts=True, shard_suffix=None, shard_offset=0)[source]
- property sharding_enabled: bool
- property output_paths: Dict[str, List[str]]
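The naming scheme described for SharWriter can be summarized in a short standalone sketch. It does not call Lhotse; expected_shar_files is a hypothetical helper, and the unsuffixed names it returns for shard_size=None are an assumption based on the statement that the shard number suffix is omitted.

```python
from typing import Dict, List, Optional


def expected_shar_files(
    output_dir: str,
    fields: Dict[str, str],
    num_shards: int,
    shard_size: Optional[int] = 1000,
    shard_offset: int = 0,
    include_cuts: bool = True,
) -> Dict[str, List[str]]:
    """List the file names one would expect for each field (illustrative only)."""
    names = (["cuts"] if include_cuts else []) + list(fields)
    paths: Dict[str, List[str]] = {}
    for name in names:
        # Cuts go to a gzipped JSONL manifest; every other field goes to a tar file.
        ext = "jsonl.gz" if name == "cuts" else "tar"
        if shard_size is None:
            # Sharding disabled: assume a single file without a numeric suffix.
            paths[name] = [f"{output_dir}/{name}.{ext}"]
        else:
            paths[name] = [
                f"{output_dir}/{name}.{shard_offset + i:06d}.{ext}"
                for i in range(num_shards)
            ]
    return paths


files = expected_shar_files(
    "some_dir", {"recording": "opus", "features": "lilcom"}, num_shards=2, shard_size=100
)
for field, field_paths in files.items():
    print(field, field_paths)
```

The returned mapping mirrors the shape of the output_paths property (a dict from field name to a list of paths).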
- class lhotse.shar.writers.TarWriter(pattern, shard_size=1000, shard_offset=0)[source]
TarWriter is a convenience wrapper over tarfile.TarFile that allows writing binary data into tar files that are automatically segmented. Each segment is a separate tar file called a “shard.”
Shards are useful in training of deep learning models that require a substantial amount of data. Each shard can be read sequentially, which allows faster reads from magnetic disks, NFS, or otherwise slow storage.
Example:
>>> with TarWriter("some_dir/data.%06d.tar", shard_size=100) as w:
...     w.write("blob1", binary_blob1)
...     w.write("blob2", binary_blob2)  # etc.
It would create files such as some_dir/data.000000.tar, some_dir/data.000001.tar, etc. The starting shard offset can be set using the shard_offset parameter. The writer starts from 0 by default.
It’s also possible to use TarWriter with automatic sharding disabled:
>>> with TarWriter("some_dir/data.tar", shard_size=None) as w:
...     w.write("blob1", binary_blob1)
...     w.write("blob2", binary_blob2)  # etc.
This class is heavily inspired by the WebDataset library: https://github.com/webdataset/webdataset
- property sharding_enabled: bool
- property output_paths: List[str]
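To make the relationship between pattern, shard_size, and shard_offset concrete, the sketch below computes which shard file a given item would land in. It is a standalone illustration, not TarWriter's code: shard_path is a hypothetical helper, and it assumes items are assigned to shards purely by their write order.

```python
from typing import Optional


def shard_path(
    pattern: str, item_index: int, shard_size: Optional[int], shard_offset: int = 0
) -> str:
    """Return the file that the item written at position item_index would go to."""
    if shard_size is None:
        # Sharding disabled: the pattern is used verbatim as a single file name.
        return pattern
    # Every shard_size items start a new shard; numbering begins at shard_offset.
    return pattern % (shard_offset + item_index // shard_size)


print(shard_path("some_dir/data.%06d.tar", 0, shard_size=100))
print(shard_path("some_dir/data.%06d.tar", 100, shard_size=100))
print(shard_path("some_dir/data.%06d.tar", 250, shard_size=100, shard_offset=5))
print(shard_path("some_dir/data.tar", 250, shard_size=None))
```

A non-zero shard_offset is convenient when several jobs write into the same directory and each must own a distinct range of shard numbers.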
Feature extraction and manifests
Data structures and tools used for feature extraction and description.
Features API - extractor and manifests
- class lhotse.features.base.FeatureExtractor(config=None)[source]
The base class for all feature extractors in Lhotse. It is initialized with a config object, specific to a particular feature extraction method. The config is expected to be a dataclass so that it can be easily serialized.
All derived feature extractors must implement at least the following:
a name class attribute (what these features are called, e.g. ‘mfcc’)
a config_type class attribute that points to the configuration dataclass type
the extract method,
the frame_shift property.
Feature extractors that support feature-domain mixing should additionally specify two static methods:
compute_energy, and
mix.
By itself, the FeatureExtractor offers the following high-level methods that are not intended for overriding:
extract_from_samples_and_store
extract_from_recording_and_store
These methods run a larger feature extraction pipeline that involves data augmentation and disk storage.
- name = None
- config_type = None
- abstract extract(samples, sampling_rate)[source]
Defines how to extract features using a numpy ndarray of audio samples and the sampling rate.
- Return type:
ndarray
- Returns:
a numpy ndarray representing the feature matrix.
- abstract property frame_shift: float
- property device: str | device
- static mix(features_a, features_b, energy_scaling_factor_b)[source]
Perform feature-domain mix of two signals, a and b, and return the mixed signal.
- Parameters:
features_a (ndarray) – Left-hand side (reference) signal.
features_b (ndarray) – Right-hand side (mixed-in) signal.
energy_scaling_factor_b (float) – A scaling factor for features_b energy. It is used to achieve a specific SNR. E.g. to mix with an SNR of 10dB when both features_a and features_b energies are 100, the features_b signal energy needs to be scaled by 0.1. Since different features (e.g. spectrogram, fbank, MFCC) require different combinations of transformations (e.g. exp, log, sqrt, pow) to allow mixing of two signals, the exact place where to apply energy_scaling_factor_b to the signal is determined by the implementer.
- Return type:
ndarray
- Returns:
A mixed feature matrix.
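The SNR example in the description of energy_scaling_factor_b follows from the definition of SNR in decibels, 10·log10(energy_a / energy_b). The short sketch below reproduces that arithmetic; the function name is hypothetical and this is not the code Lhotse uses to derive the factor.

```python
def scaling_factor_for_snr(energy_a: float, energy_b: float, snr_db: float) -> float:
    """Factor by which the energy of b must be multiplied to reach snr_db."""
    # An SNR of snr_db means energy_a / target_energy_b == 10 ** (snr_db / 10).
    target_energy_b = energy_a / (10.0 ** (snr_db / 10.0))
    return target_energy_b / energy_b


# Both signals have energy 100; a 10 dB SNR requires b to carry a tenth of that.
factor = scaling_factor_for_snr(100.0, 100.0, snr_db=10.0)
print(factor)  # 0.1
```

With snr_db=0 the same function returns 1.0 for equal energies, meaning the mixed-in signal is left at its original level.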
- static compute_energy(features)[source]
Compute the total energy of a feature matrix. How the energy is computed depends on a particular type of features. It is expected that when implemented, compute_energy will never return zero.
- Parameters:
features (ndarray) – A feature matrix.
- Return type:
float
- Returns:
A positive float value of the signal energy.
- static scale(features, energy_scaling_factor)[source]
Scale a single feature matrix by the provided energy factor.
- Parameters:
features (ndarray) – A feature matrix.
energy_scaling_factor (float) – The energy scaling factor to apply.
- Return type:
ndarray
- Returns:
A scaled feature matrix.
- extract_batch(samples, sampling_rate, lengths=None)[source]
Performs batch extraction. It is not guaranteed to be faster than FeatureExtractor.extract() – it depends on whether the implementation of a particular feature extractor supports accelerated batch computation. If lengths is provided, it is assumed that the input is a batch of padded sequences, so we will not perform any further collation.
- Return type:
Union[ndarray,Tensor,List[ndarray],List[Tensor]]
Note
Unless overridden by child classes, it defaults to sequentially calling FeatureExtractor.extract() on the inputs.
Note
This method should support variable length inputs.
- extract_from_samples_and_store(samples, storage, sampling_rate, offset=0, channel=None, augment_fn=None)[source]
Extract the features from an array of audio samples in a full pipeline:
optional audio augmentation;
extract the features;
save them to disk in a specified directory;
return a Features object with a description of the extracted features.
Note, unlike in extract_from_recording_and_store, the returned Features object might not be suitable to store in a FeatureSet, as it does not reference any particular Recording. Instead, this method is useful when extracting features from cuts - especially MixedCut instances, which may be created from multiple recordings and channels.
- Parameters:
samples (ndarray) – a numpy ndarray with the audio samples.
sampling_rate (int) – integer sampling rate of samples.
storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices. When the optional lilcom dependency is installed and on-disk size matters, LilcomChunkyWriter is the preferred backend.
offset (float) – an offset in seconds for where to start reading the recording - when used for Cut feature extraction, must be equal to Cut.start.
channel (Union[int,List[int],None]) – an optional channel number(s) to insert into the Features manifest.
augment_fn (Optional[Callable[[ndarray,int],ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.
- Return type:
- Returns:
a Features manifest item for the extracted feature matrix (it is not written to disk).
- extract_from_recording_and_store(recording, storage, offset=0, duration=None, channels=None, augment_fn=None)[source]
Extract the features from a Recording in a full pipeline:
load audio from disk;
optionally, perform audio augmentation;
extract the features;
save them to disk in a specified directory;
return a Features object with a description of the extracted features and the source data used.
- Parameters:
recording (Recording) – a Recording that specifies the input audio.
storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices. When the optional lilcom dependency is installed and on-disk size matters, LilcomChunkyWriter is the preferred backend.
offset (float) – an optional offset in seconds for where to start reading the recording.
duration (Optional[float]) – an optional duration specifying how much audio to load from the recording.
channels (Union[int,List[int],None]) – an optional int or list of ints, specifying the channels; by default, all channels will be used.
augment_fn (Optional[Callable[[ndarray,int],ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.
- Return type:
- Returns:
a Features manifest item for the extracted feature matrix.
- lhotse.features.base.get_extractor_type(name)[source]
Return the feature extractor type corresponding to the given name.
- Parameters:
name (str) – specifies which feature extractor should be used.
- Return type:
Type
- Returns:
A feature extractor’s type.
- lhotse.features.base.create_default_feature_extractor(name)[source]
Create a feature extractor object with a default configuration.
- Parameters:
name (str) – specifies which feature extractor should be used.
- Return type:
Optional[FeatureExtractor]
- Returns:
A new feature extractor instance.
- lhotse.features.base.register_extractor(cls)[source]
This decorator is used to register feature extractor classes in Lhotse so they can be easily created just by knowing their name.
An example of usage:
@register_extractor
class MyFeatureExtractor:
    …
- Parameters:
cls – A type (class) that is being registered.
- Returns:
Registered type.
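The idea behind a name-based registry (register a class once, then look it up or instantiate it by name) can be sketched in a few lines of plain Python. This is a generic illustration of the pattern and not Lhotse's implementation: _REGISTRY, register, get_type, create_default, and ToyExtractor are all hypothetical names.

```python
from typing import Dict, Type

# Hypothetical module-level registry mapping extractor names to classes.
_REGISTRY: Dict[str, Type] = {}


def register(cls: Type) -> Type:
    """Class decorator: store the class under its `name` attribute and return it."""
    _REGISTRY[cls.name] = cls
    return cls


def get_type(name: str) -> Type:
    """Look up a registered class by name."""
    return _REGISTRY[name]


def create_default(name: str):
    """Instantiate a registered class with its default configuration."""
    return get_type(name)()


@register
class ToyExtractor:
    name = "toy"

    def __init__(self, config=None):
        self.config = config


extractor = create_default("toy")
print(type(extractor).__name__)  # ToyExtractor
```

Because the decorator returns the class unchanged, registration has no effect on how the class is used directly.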
- class lhotse.features.base.TorchaudioFeatureExtractor(config=None)[source]
Common abstract base class for all torchaudio based feature extractors.
- extract(samples, sampling_rate)[source]
Defines how to extract features using a numpy ndarray of audio samples and the sampling rate.
- Return type:
ndarray
- Returns:
a numpy ndarray representing the feature matrix.
- property frame_shift: float
- __init__(config=None)
- static compute_energy(features)
Compute the total energy of a feature matrix. How the energy is computed depends on a particular type of features. It is expected that when implemented, compute_energy will never return zero.
- Parameters:
features (ndarray) – A feature matrix.
- Return type:
float
- Returns:
A positive float value of the signal energy.
- config_type = None
- property device: str | device
- extract_batch(samples, sampling_rate, lengths=None)
Performs batch extraction. It is not guaranteed to be faster than FeatureExtractor.extract() – it depends on whether the implementation of a particular feature extractor supports accelerated batch computation. If lengths is provided, it is assumed that the input is a batch of padded sequences, so we will not perform any further collation.
- Return type:
Union[ndarray,Tensor,List[ndarray],List[Tensor]]
Note
Unless overridden by child classes, it defaults to sequentially calling FeatureExtractor.extract() on the inputs.
Note
This method should support variable length inputs.
- extract_from_recording_and_store(recording, storage, offset=0, duration=None, channels=None, augment_fn=None)
Extract the features from a Recording in a full pipeline:
load audio from disk;
optionally, perform audio augmentation;
extract the features;
save them to disk in a specified directory;
return a Features object with a description of the extracted features and the source data used.
- Parameters:
recording (Recording) – a Recording that specifies the input audio.
storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices. When the optional lilcom dependency is installed and on-disk size matters, LilcomChunkyWriter is the preferred backend.
offset (float) – an optional offset in seconds for where to start reading the recording.
duration (Optional[float]) – an optional duration specifying how much audio to load from the recording.
channels (Union[int,List[int],None]) – an optional int or list of ints, specifying the channels; by default, all channels will be used.
augment_fn (Optional[Callable[[ndarray,int],ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.
- Return type:
- Returns:
a Features manifest item for the extracted feature matrix.
- extract_from_samples_and_store(samples, storage, sampling_rate, offset=0, channel=None, augment_fn=None)
Extract the features from an array of audio samples in a full pipeline:
optional audio augmentation;
extract the features;
save them to disk in a specified directory;
return a Features object with a description of the extracted features.
Note, unlike in extract_from_recording_and_store, the returned Features object might not be suitable to store in a FeatureSet, as it does not reference any particular Recording. Instead, this method is useful when extracting features from cuts - especially MixedCut instances, which may be created from multiple recordings and channels.
- Parameters:
samples (ndarray) – a numpy ndarray with the audio samples.
sampling_rate (int) – integer sampling rate of samples.
storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices. When the optional lilcom dependency is installed and on-disk size matters, LilcomChunkyWriter is the preferred backend.
offset (float) – an offset in seconds for where to start reading the recording - when used for Cut feature extraction, must be equal to Cut.start.
channel (Union[int,List[int],None]) – an optional channel number(s) to insert into the Features manifest.
augment_fn (Optional[Callable[[ndarray,int],ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.
- Return type:
- Returns:
a Features manifest item for the extracted feature matrix (it is not written to disk).
- abstract feature_dim(sampling_rate)
- Return type:
int
- classmethod from_dict(data)
- Return type:
- classmethod from_yaml(path)
- Return type:
- static mix(features_a, features_b, energy_scaling_factor_b)
Perform feature-domain mix of two signals, a and b, and return the mixed signal.
- Parameters:
features_a (ndarray) – Left-hand side (reference) signal.
features_b (ndarray) – Right-hand side (mixed-in) signal.
energy_scaling_factor_b (float) – A scaling factor for features_b energy. It is used to achieve a specific SNR. E.g. to mix with an SNR of 10dB when both features_a and features_b energies are 100, the features_b signal energy needs to be scaled by 0.1. Since different features (e.g. spectrogram, fbank, MFCC) require different combinations of transformations (e.g. exp, log, sqrt, pow) to allow mixing of two signals, the exact place where to apply energy_scaling_factor_b to the signal is determined by the implementer.
- Return type:
ndarray
- Returns:
A mixed feature matrix.
- name = None
- static scale(features, energy_scaling_factor)
Scale a single feature matrix by the provided energy factor.
- Parameters:
features (ndarray) – A feature matrix.
energy_scaling_factor (float) – The energy scaling factor to apply.
- Return type:
ndarray
- Returns:
A scaled feature matrix.
- to_dict()
- Return type:
Dict[str,Any]
- to_yaml(path)
- class lhotse.features.base.Features(type, num_frames, num_features, frame_shift, sampling_rate, start, duration, storage_type, storage_path, storage_key, recording_id=None, channels=None)[source]
Represents features extracted for some particular time range in a given recording and channel. It contains metadata about how it’s stored: storage_type describes “how to read it”, for now it supports numpy arrays serialized with np.save, as well as arrays compressed with lilcom; storage_path is the path to the file on the local filesystem.
-
type:
str
-
num_frames:
int
-
num_features:
int
-
frame_shift:
float
-
sampling_rate:
int
-
start:
float
-
duration:
float
-
storage_type:
str
-
storage_path:
str
-
storage_key:
Union[str,bytes]
-
recording_id:
Optional[str] = None
-
channels:
Union[int,List[int],None] = None
- property end: float
- property is_in_memory: bool
- property is_placeholder: bool
- copy_feats(writer)[source]
Read the referenced feature array and save it using writer. Returns a copy of the manifest with updated fields related to the feature storage.
- Return type:
- __init__(type, num_frames, num_features, frame_shift, sampling_rate, start, duration, storage_type, storage_path, storage_key, recording_id=None, channels=None)
- class lhotse.features.base.FeatureSet(features=None)[source]
Represents a feature manifest, and allows reading features for given recordings within particular channels and time ranges. It also keeps information about the feature extractor parameters used to obtain this set. When a given recording/time-range/channel is unavailable, raises a KeyError.
- static from_items(features)
Function to be implemented by every sub-class of this mixin. It’s expected to create a sub-class instance out of an iterable of items that are held by the sub-class (e.g., CutSet.from_items(iterable_of_cuts)).
- Return type:
- split(num_splits, shuffle=False, drop_last=False)[source]
Split the FeatureSet into num_splits pieces of equal size.
- Parameters:
num_splits (int) – Requested number of splits.
shuffle (bool) – Optionally shuffle the recordings order first.
drop_last (bool) – determines how to handle splitting when len(seq) is not divisible by num_splits. When False (default), the splits might have unequal lengths. When True, it may discard the last element in some splits to ensure they are equally long.
- Return type:
List[FeatureSet]
- Returns:
A list of FeatureSet pieces.
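The effect of drop_last is easiest to see on a plain list. The sketch below is a standalone illustration rather than Lhotse's code: split_evenly is a hypothetical helper, and it assumes that without drop_last the leftover items are given one each to the first splits, while with drop_last the leftover items at the end are discarded.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_evenly(seq: Sequence[T], num_splits: int, drop_last: bool = False) -> List[List[T]]:
    """Split seq into num_splits contiguous pieces."""
    items = list(seq)
    if num_splits > len(items):
        raise ValueError("Cannot split into more pieces than there are items.")
    base, remainder = divmod(len(items), num_splits)
    splits, start = [], 0
    for i in range(num_splits):
        # Without drop_last, the first `remainder` splits take one extra item.
        size = base if drop_last or i >= remainder else base + 1
        splits.append(items[start:start + size])
        start += size
    return splits


unequal = split_evenly(range(10), 3)
equal = split_evenly(range(10), 3, drop_last=True)
print(unequal)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(equal)    # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
```

Equal-length splits matter mostly for distributed jobs in which every worker must process the same number of items.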
- split_lazy(output_dir, chunk_size, prefix='')[source]
Splits a manifest (either lazily or eagerly opened) into chunks, each with chunk_size items (except for the last one, typically).
In order to be memory efficient, this implementation saves each chunk to disk in a .jsonl.gz format as the input manifest is sampled.
Note
For lowest memory usage, use load_manifest_lazy to open the input manifest for this method.
- Parameters:
it – any iterable of Lhotse manifests.
output_dir (Union[Path,str]) – directory where the split manifests are saved. Each manifest is saved at: {output_dir}/{prefix}.{split_idx}.jsonl.gz
chunk_size (int) – the number of items in each chunk.
prefix (str) – the prefix of each manifest.
- Return type:
List[FeatureSet]
- Returns:
a list of lazily opened chunk manifests.
- shuffle(*args, **kwargs)[source]
Shuffles the elements and returns a shuffled variant of self. If the manifest is opened lazily, performs shuffling on-the-fly with a fixed buffer size.
- Parameters:
rng – an optional instance of random.Random for precise control of randomness.
- Returns:
a shuffled copy of self, or a manifest that is shuffled lazily.
- subset(first=None, last=None)[source]
Return a new FeatureSet according to the selected subset criterion. Only a single argument to subset is supported at this time.
- Parameters:
first (Optional[int]) – int, the number of first items to keep.
last (Optional[int]) – int, the number of last items to keep.
- Return type:
- Returns:
a new FeatureSet with the subset results.
- find(recording_id, channel_id=0, start=0.0, duration=None, leeway=0.05)[source]
Find and return a Features object that best satisfies the search criteria. Raise a KeyError when no such object is available.
- Parameters:
recording_id (str) – str, requested recording ID.
channel_id (Union[int,List[int]]) – int, requested channel.
start (float) – float, requested start time in seconds for the feature chunk.
duration (Optional[float]) – optional float, requested duration in seconds for the feature chunk. By default, return everything from the start.
leeway (float) – float, controls how strictly we have to match the requested start and duration criteria. It is necessary to keep a small positive value here (default 0.05s), as there might be differences between the duration of recording/supervision segment, and the duration of features. The latter one is constrained to be a multiple of frame_shift, while the former can be arbitrary.
- Return type:
- Returns:
a Features object satisfying the search criteria.
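The need for leeway can be shown with a small numeric sketch. This is not the matching logic used by FeatureSet.find; covers is a hypothetical predicate that assumes a feature chunk matches when it spans the requested time range up to the given tolerance on both ends.

```python
def covers(
    feat_start: float,
    feat_duration: float,
    req_start: float,
    req_duration: float,
    leeway: float = 0.05,
) -> bool:
    """True if the feature chunk spans the requested range within `leeway` seconds."""
    feat_end = feat_start + feat_duration
    req_end = req_start + req_duration
    return feat_start - leeway <= req_start and req_end <= feat_end + leeway


# A 3.004 s recording with a 0.01 s frame shift yields features lasting 3.0 s,
# so a request for the full recording duration overshoots the features by 4 ms.
print(covers(0.0, 3.0, req_start=0.0, req_duration=3.004))              # True
print(covers(0.0, 3.0, req_start=0.0, req_duration=3.004, leeway=0.0))  # False
```

A tolerance of a few frames is enough to absorb the rounding to a multiple of frame_shift without matching the wrong chunk.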
- load(recording_id, channel_id=0, start=0.0, duration=None)[source]
Find a Features object that best satisfies the search criteria and load the features as a numpy ndarray. Raise a KeyError when no such object is available.
- Return type:
ndarray
- copy_feats(writer)[source]
For each manifest in this FeatureSet, read the referenced feature array and save it using writer. Returns a copy of the manifest with updated fields related to the feature storage.
- Return type:
- compute_global_stats(storage_path=None)[source]
Compute the global means and standard deviations for each feature bin in the manifest. It follows the implementation in scikit-learn: https://github.com/scikit-learn/scikit-learn/blob/0fb307bf39bbdacd6ed713c00724f8f871d60370/sklearn/utils/extmath.py#L715 which follows the paper: “Algorithms for computing the sample variance: analysis and recommendations”, by Chan, Golub, and LeVeque.
- Parameters:
storage_path (Union[Path,str,None]) – an optional path to a file where the stats will be stored with pickle.
- Returns:
a dict of {‘norm_means’: np.ndarray, ‘norm_stds’: np.ndarray} with the shape of the arrays equal to the number of feature bins in this manifest.
- Return type:
Dict[str,ndarray]
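The referenced approach by Chan, Golub, and LeVeque combines per-chunk statistics without revisiting earlier data. The sketch below shows that merge step for a single feature bin in plain Python. It is an illustration under assumptions, not the Lhotse or scikit-learn code: the helper names are hypothetical, and it reports the population standard deviation.

```python
import math
from typing import List, Sequence, Tuple

Stats = Tuple[int, float, float]  # (count, mean, sum of squared deviations)


def chunk_stats(values: Sequence[float]) -> Stats:
    """Statistics of one non-empty chunk, e.g. one bin of one feature matrix."""
    n = len(values)
    mean = sum(values) / n
    return n, mean, sum((v - mean) ** 2 for v in values)


def merge_stats(a: Stats, b: Stats) -> Stats:
    """Combine two partial results using the pairwise update formulas."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


def global_mean_std(chunks: List[Sequence[float]]) -> Tuple[float, float]:
    total = chunk_stats(chunks[0])
    for chunk in chunks[1:]:
        total = merge_stats(total, chunk_stats(chunk))
    n, mean, m2 = total
    return mean, math.sqrt(m2 / n)


mean, std = global_mean_std([[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]])
print(mean, std)
```

Only three numbers per bin are carried between chunks, so the whole manifest never has to be held in memory at once.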
- filter(predicate)
Return a new manifest containing only the items that satisfy predicate. If the manifest is lazy, the filtering will also be applied lazily.
- Parameters:
predicate (Callable[[TypeVar(T)],bool]) – a function that takes a cut as an argument and returns bool.
- Returns:
a filtered manifest.
- classmethod from_file(path)
- Return type:
Any
- classmethod from_json(path)
- Return type:
Any
- classmethod from_jsonl(path)
- Return type:
Any
- classmethod from_jsonl_lazy(path)
Read a JSONL manifest in a lazy manner, which opens the file but does not read it immediately. It is only suitable for sequential reads and iteration.
- Return type:
Any
Warning
Opening the manifest in this way might cause some methods that rely on random access to fail.
- classmethod from_yaml(path)
- Return type:
Any
- classmethod infinite_mux(*manifests, weights=None, seed=0, max_open_streams=None)
Merges multiple manifest iterables into a new iterable by lazily multiplexing them during iteration time. Unlike mux(), this method allows limiting the maximum number of open sub-iterators at any given time.
To enable this, it performs 2-stage sampling. First, it samples with replacement the set of iterators I to construct a subset I_sub of size max_open_streams. Then, for each iteration step, it samples an iterator i from I_sub, fetches the next item from it, and yields it. Once i becomes exhausted, it is replaced with a new iterator j sampled from I_sub.
Caution
Do not use this method with inputs that are infinitely iterable as they will silently break the multiplexing property by only using a subset of the input iterables.
Caution
This method is not recommended for multiplexing over a small number of iterations, as it may be much less accurate than mux() depending on the number of open streams, iterable sizes, and the random seed.
- Parameters:
manifests – iterables to be multiplexed. They can be either lazy or eager, but the resulting manifest will always be lazy.
weights (Optional[List[Union[int,float]]]) – an optional weight for each iterable, affects the probability of it being sampled. The weights are uniform by default. If lengths are known, it makes sense to pass them here for uniform distribution of items in the expectation.
seed (Union[int,Literal['trng','randomized']]) – the random seed, ensures deterministic order across multiple iterations.
max_open_streams (Optional[int]) – the number of iterables that can be open simultaneously at any given time.
- property is_lazy: bool
Indicates whether this manifest was opened in lazy (read-on-the-fly) mode or not.
- map(transform_fn)
Apply transform_fn to each item in this manifest and return a new manifest. If the manifest is opened lazily, the transform is also applied lazily.
- Parameters:
transform_fn (Callable[[TypeVar(T)],TypeVar(T)]) – A callable (function) that accepts a single item instance and returns a new (or the same) instance of the same type. E.g. with CutSet, the callable accepts a Cut and also returns a Cut.
- Returns:
a new CutSet with transformed cuts.
- classmethod mux(*manifests, stop_early=False, weights=None, seed=0)
Merges multiple manifest iterables into a new iterable by lazily multiplexing them during iteration time. If one of the iterables is exhausted before the others, we will keep iterating until all iterables are exhausted. This behavior can be changed with the stop_early parameter.
- Parameters:
manifests – iterables to be multiplexed. They can be either lazy or eager, but the resulting manifest will always be lazy.
stop_early (bool) – should we stop the iteration as soon as we exhaust one of the manifests.
weights (Optional[List[Union[int,float]]]) – an optional weight for each iterable, affects the probability of it being sampled. The weights are uniform by default. If lengths are known, it makes sense to pass them here for uniform distribution of items in the expectation.
seed (Union[int,Literal['trng','randomized']]) – the random seed, ensures deterministic order across multiple iterations.
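The weighted multiplexing described for mux() can be sketched with the standard library alone. The helper weighted_mux below is hypothetical and only approximates the documented behavior; in particular, with stop_early=True it stops the first time the sampled stream turns out to be exhausted.

```python
import random


def weighted_mux(*iterables, weights=None, stop_early=False, seed=0):
    """Lazily interleave several iterables by weighted random choice (illustrative only)."""
    rng = random.Random(seed)
    streams = [iter(it) for it in iterables]
    weights = list(weights) if weights is not None else [1.0] * len(streams)
    active = list(range(len(streams)))
    while active:
        # Sample one of the streams that are still open, proportionally to its weight.
        idx = rng.choices(active, weights=[weights[i] for i in active], k=1)[0]
        try:
            yield next(streams[idx])
        except StopIteration:
            if stop_early:
                return
            # Keep going with the remaining streams until all are exhausted.
            active.remove(idx)


merged = list(weighted_mux([1, 2, 3], [10, 20], weights=[3, 2], seed=0))
early = list(weighted_mux([1, 2, 3], [10, 20], stop_early=True, seed=0))
```

Every input item appears exactly once in merged, and the relative order within each source is preserved; only the interleaving is random.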
- classmethod open_writer(path, overwrite=True)
Open a sequential writer that stores the manifests one by one, without the necessity of storing the whole manifest set in-memory. Supports writing to JSONL format (.jsonl), with optional gzip compression (.jsonl.gz).
- Return type:
Union[SequentialJsonlWriter,InMemoryWriter]
Note
When path is None, we will return an InMemoryWriter instead, which has the same API but stores the manifests in memory. It is convenient when you want to make disk saving optional.
Example:
>>> from lhotse import RecordingSet
... recordings = [...]
... with RecordingSet.open_writer('recordings.jsonl.gz') as writer:
...     for recording in recordings:
...         writer.write(recording)
This writer can be useful for continuing to write files that were previously stopped – it will open the existing file and scan it for item IDs to skip writing them later. It can also be queried for existing IDs so that the user code may skip preparing the corresponding manifests.
Example:
>>> from pathlib import Path
... from lhotse import RecordingSet, Recording
... with RecordingSet.open_writer('recordings.jsonl.gz', overwrite=False) as writer:
...     for path in Path('.').rglob('*.wav'):
...         recording_id = path.stem
...         if writer.contains(recording_id):
...             # Item already written previously - skip processing.
...             continue
...         # Item doesn't exist yet - run extra work to prepare the manifest
...         # and store it.
...         recording = Recording.from_file(path, recording_id=recording_id)
...         writer.write(recording)
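The resume behavior described above (scan the existing file for IDs, then skip them) can be sketched as follows. ResumableJsonlWriter is a hypothetical stand-in written for illustration; it is not Lhotse's SequentialJsonlWriter, and it assumes each item is a dict with an "id" key.

```python
import json
import os
import tempfile


class ResumableJsonlWriter:
    """Illustrative JSONL writer that remembers IDs already present on disk."""

    def __init__(self, path, overwrite=True):
        self.path = path
        self.seen_ids = set()
        if not overwrite and os.path.exists(path):
            # Scan the existing file once to learn which items were already written.
            with open(path) as f:
                for line in f:
                    self.seen_ids.add(json.loads(line)["id"])
        self.mode = "w" if overwrite else "a"

    def __enter__(self):
        self.file = open(self.path, self.mode)
        return self

    def __exit__(self, *exc):
        self.file.close()

    def contains(self, item_id):
        return item_id in self.seen_ids

    def write(self, item):
        if item["id"] in self.seen_ids:
            return False  # already stored, nothing to do
        self.file.write(json.dumps(item) + "\n")
        self.seen_ids.add(item["id"])
        return True


path = os.path.join(tempfile.mkdtemp(), "items.jsonl")
with ResumableJsonlWriter(path) as writer:
    writer.write({"id": "rec-1"})
with ResumableJsonlWriter(path, overwrite=False) as writer:
    already_there = writer.contains("rec-1")
    writer.write({"id": "rec-1"})  # skipped on resume
    writer.write({"id": "rec-2"})
with open(path) as f:
    stored_ids = [json.loads(line)["id"] for line in f]
```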
- repeat(times=None, preserve_id=False)
Return a new, lazily evaluated manifest that iterates over the original elements times number of times.
- Parameters:
times (Optional[int]) – how many times to repeat (infinite by default).
preserve_id (bool) – when True, we won’t update the element ID with repeat number.
- Returns:
a repeated manifest.
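A lazy repeat with optional ID rewriting can be sketched as a generator. The helper repeat_items is hypothetical, items are modeled as plain dicts, and the "_repeat{n}" suffix is an assumption chosen for illustration rather than a documented format.

```python
from itertools import count, islice


def repeat_items(items, times=None, preserve_id=False):
    """Yield the items `times` times (forever when times is None)."""
    epochs = count() if times is None else range(times)
    for epoch in epochs:
        for item in items:
            if preserve_id:
                yield item
            else:
                # Make IDs unique across repeats by appending the repeat number.
                yield {**item, "id": f"{item['id']}_repeat{epoch}"}


items = [{"id": "a"}, {"id": "b"}]
twice = [item["id"] for item in repeat_items(items, times=2)]
endless_prefix = list(islice(repeat_items(items), 7))
```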
- to_eager()
Evaluates all lazy operations on this manifest, if any, and returns a copy that keeps all items in memory. If the manifest was “eager” already, this is a no-op and won’t copy anything.
- to_file(path)
- Return type:
None
- to_json(path)
- Return type:
None
- to_jsonl(path)
- Return type:
None
- to_yaml(path)
- Return type:
None
- class lhotse.features.base.FeatureSetBuilder(feature_extractor, storage, augment_fn=None)[source]
An extended constructor for the FeatureSet. Think of it as a class wrapper for a feature extraction script. It consumes an iterable of Recordings, extracts the features specified by the FeatureExtractor config, and stores them on disk.
Eventually, we plan to extend it with the capability to extract only the features in specified regions of recordings and to perform some time-domain data augmentation.
- lhotse.features.base.store_feature_array(feats, storage)[source]
Store feats array on disk, using lilcom compression by default.
- Parameters:
feats (ndarray) – a numpy ndarray containing features.
storage (FeaturesWriter) – a FeaturesWriter object to use for array storage.
- Return type:
str
- Returns:
a path to the file containing the stored array.
- lhotse.features.base.compute_global_stats(feature_manifests, storage_path=None)[source]
Compute the global means and standard deviations for each feature bin in the manifest. It performs only a single pass over the data and iteratively updates the estimate of the means and variances.
We follow the implementation in scikit-learn: https://github.com/scikit-learn/scikit-learn/blob/0fb307bf39bbdacd6ed713c00724f8f871d60370/sklearn/utils/extmath.py#L715 which follows the paper: “Algorithms for computing the sample variance: analysis and recommendations”, by Chan, Golub, and LeVeque.
- Parameters:
feature_manifests (Iterable[Features]) – an iterable of Features objects.
storage_path (Union[Path,str,None]) – an optional path to a file where the stats will be stored with pickle.
- Returns:
a dict of {'norm_means': np.ndarray, 'norm_stds': np.ndarray} with the shape of the arrays equal to the number of feature bins in this manifest.
- Return type:
Dict[str,ndarray]
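The single-pass update referenced above (Chan, Golub, and LeVeque) can be sketched in plain Python. The helpers merge_stats and global_stats are hypothetical, feature matrices are modeled as lists of rows, and the population standard deviation (dividing by the frame count) is an assumption of this sketch.

```python
import math


def merge_stats(count_a, mean_a, m2_a, chunk):
    """Merge running per-bin stats with a new chunk of frames (list of equal-length rows)."""
    count_b = len(chunk)
    num_bins = len(chunk[0])
    mean_b = [sum(row[k] for row in chunk) / count_b for k in range(num_bins)]
    m2_b = [sum((row[k] - mean_b[k]) ** 2 for row in chunk) for k in range(num_bins)]
    count = count_a + count_b
    mean, m2 = [], []
    for k in range(num_bins):
        delta = mean_b[k] - mean_a[k]
        mean.append(mean_a[k] + delta * count_b / count)
        # Sum of squared deviations of the union of both sets.
        m2.append(m2_a[k] + m2_b[k] + delta ** 2 * count_a * count_b / count)
    return count, mean, m2


def global_stats(feature_matrices):
    count, mean, m2 = 0, None, None
    for feats in feature_matrices:
        if mean is None:
            mean = [0.0] * len(feats[0])
            m2 = [0.0] * len(feats[0])
        count, mean, m2 = merge_stats(count, mean, m2, feats)
    stds = [math.sqrt(v / count) for v in m2]
    return {"norm_means": mean, "norm_stds": stds}


# Two "feature matrices" with 2 bins each: 2 frames and 1 frame.
stats = global_stats([[[1.0, 10.0], [3.0, 30.0]], [[5.0, 50.0]]])
```

Each matrix is visited once and only the running count, mean, and sum of squared deviations are kept, so memory use does not grow with the number of frames.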
Lhotse’s feature extractors
- class lhotse.features.kaldi.extractors.Fbank(config=None)[source]
- name = 'kaldi-fbank'
- config_type
alias of
FbankConfig
- property device: str | device
- property frame_shift: float
- extract(samples, sampling_rate)[source]
Defines how to extract features using a numpy ndarray of audio samples and the sampling rate.
- Return type:
Union[ndarray,Tensor]
- Returns:
a numpy ndarray representing the feature matrix.
- extract_batch(samples, sampling_rate, lengths=None)[source]
Performs batch extraction. It is not guaranteed to be faster than FeatureExtractor.extract() – it depends on whether the implementation of a particular feature extractor supports accelerated batch computation. If lengths is provided, it is assumed that the input is a batch of padded sequences, so we will not perform any further collation.
Note
Unless overridden by child classes, it defaults to sequentially calling FeatureExtractor.extract() on the inputs.
Note
This method should support variable length inputs.
- Return type:
Union[ndarray,Tensor,List[ndarray],List[Tensor]]
- static mix(features_a, features_b, energy_scaling_factor_b)[source]
Perform feature-domain mix of two signals, a and b, and return the mixed signal.
- Parameters:
features_a (ndarray) – Left-hand side (reference) signal.
features_b (ndarray) – Right-hand side (mixed-in) signal.
energy_scaling_factor_b (float) – A scaling factor for features_b energy. It is used to achieve a specific SNR. E.g. to mix with an SNR of 10dB when both features_a and features_b energies are 100, the features_b signal energy needs to be scaled by 0.1. Since different features (e.g. spectrogram, fbank, MFCC) require different combinations of transformations (e.g. exp, log, sqrt, pow) to allow mixing of two signals, the exact place where to apply energy_scaling_factor_b to the signal is determined by the implementer.
- Return type:
ndarray
- Returns:
A mixed feature matrix.
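The SNR example in the description of energy_scaling_factor_b can be checked with a few lines of arithmetic. The function name below is hypothetical; it only restates the definition SNR = 10 * log10(energy_a / scaled energy_b).

```python
def energy_scaling_factor(energy_a, energy_b, snr_db):
    """Factor by which energy_b must be multiplied to reach the requested SNR."""
    # SNR in dB is 10 * log10(energy_a / target_energy_b), solved for target_energy_b.
    target_energy_b = energy_a / (10 ** (snr_db / 10))
    return target_energy_b / energy_b


# Both signals have energy 100 and the desired SNR is 10 dB.
factor = energy_scaling_factor(100.0, 100.0, 10.0)
```

With equal energies of 100 and a 10 dB target, the mixed-in signal must end up with energy 10, hence the factor of 0.1 quoted above.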
- static compute_energy(features)[source]
Compute the total energy of a feature matrix. How the energy is computed depends on a particular type of features. It is expected that when implemented, compute_energy will never return zero.
- Parameters:
features (ndarray) – A feature matrix.
- Return type:
float
- Returns:
A positive float value of the signal energy.
- class lhotse.features.kaldi.extractors.Mfcc(config=None)[source]
- name = 'kaldi-mfcc'
- config_type
alias of
MfccConfig
- property device: str | device
- property frame_shift: float
- extract(samples, sampling_rate)[source]
Defines how to extract features using a numpy ndarray of audio samples and the sampling rate.
- Return type:
Union[ndarray,Tensor]
- Returns:
a numpy ndarray representing the feature matrix.
- extract_batch(samples, sampling_rate, lengths=None)[source]
Performs batch extraction. It is not guaranteed to be faster than FeatureExtractor.extract() – it depends on whether the implementation of a particular feature extractor supports accelerated batch computation. If lengths is provided, it is assumed that the input is a batch of padded sequences, so we will not perform any further collation.
Note
Unless overridden by child classes, it defaults to sequentially calling FeatureExtractor.extract() on the inputs.
Note
This method should support variable length inputs.
- Return type:
Union[ndarray,Tensor,List[ndarray],List[Tensor]]
Kaldi feature extractors as network layers
- Copyright 2019 Johns Hopkins University (Author: Jesus Villalba)
2021 Johns Hopkins University (Author: Piotr Żelasko)
Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0)
This whole module is authored and contributed by Jesus Villalba, with minor changes by Piotr Żelasko to make it more consistent with Lhotse.
It contains a PyTorch implementation of feature extractors that is very close to Kaldi’s – notably, it differs in that the preemphasis and DC offset removal are applied in the time, rather than frequency domain. This should not significantly affect any results, as confirmed by Jesus.
This implementation works well with autograd and batching, and can be used as neural network layers.
Update January 2022: These modules now expose a new API function called “online_inference” that may be used to compute the features when the audio is streaming. The implementation is stateless, and passes the waveform remainders back to the user to feed them to the modules once new data becomes available. The implementation is compatible with JIT scripting via TorchScript.
- class lhotse.features.kaldi.layers.Wav2Win(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, pad_length=None, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, return_log_energy=False)[source]
Apply standard Kaldi preprocessing (dithering, removing DC offset, pre-emphasis, etc.) on the input waveforms and partition them into overlapping frames (of audio samples). Note: no feature extraction happens in here, the output is still a time-domain signal.
Example:
>>> x = torch.randn(1, 16000, dtype=torch.float32)
>>> x.shape
torch.Size([1, 16000])
>>> t = Wav2Win()
>>> t(x).shape
torch.Size([1, 100, 400])
The input is a tensor of shape (batch_size, num_samples). The output is a tensor of shape (batch_size, num_frames, window_length). When return_log_energy==True, returns a tuple where the second element is a log-energy tensor of shape (batch_size, num_frames).
- __init__(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, pad_length=None, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, return_log_energy=False)[source]
Initialize internal Module state, shared by both nn.Module and ScriptModule.
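The output shape in the Wav2Win example, (1, 100, 400) for one second of 16 kHz audio, can be reproduced with simple arithmetic. The sketch below assumes Kaldi's usual framing conventions for snip_edges; the helper framing_shape is hypothetical and is not part of this module.

```python
def framing_shape(num_samples, sampling_rate=16000, frame_length=0.025,
                  frame_shift=0.01, snip_edges=False):
    """Return (num_frames, window_length) under Kaldi-style framing (assumed)."""
    window = int(round(frame_length * sampling_rate))  # 0.025 s -> 400 samples
    shift = int(round(frame_shift * sampling_rate))    # 0.010 s -> 160 samples
    if snip_edges:
        # Only frames that fit entirely inside the signal are kept.
        num_frames = 0 if num_samples < window else 1 + (num_samples - window) // shift
    else:
        # The signal is padded so that the frame count depends only on the shift.
        num_frames = (num_samples + shift // 2) // shift
    return num_frames, window


default_shape = framing_shape(16000)                   # matches the example above
snipped_shape = framing_shape(16000, snip_edges=True)
```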
- forward(x)[source]
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- Return type:
Tuple[Tensor,Optional[Tensor]]
- online_inference(x, context=None)[source]
The same as the forward() method, except it accepts an extra argument with the remainder waveform from the previous call of online_inference(), and returns a tuple of ((frames, log_energy), remainder).
- Return type:
Tuple[Tuple[Tensor,Optional[Tensor]],Tensor]
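The stateless streaming pattern, where the caller passes the leftover samples back in with the next chunk, can be sketched without PyTorch. The function frame_chunk is hypothetical, works on plain lists, keeps only complete frames, and returns (frames, remainder) rather than the full ((frames, log_energy), remainder) tuple.

```python
def frame_chunk(chunk, remainder=None, window=4, shift=2):
    """Frame one chunk of samples; return the frames and the unconsumed tail."""
    samples = (remainder or []) + list(chunk)
    frames = []
    start = 0
    while start + window <= len(samples):
        frames.append(samples[start:start + window])
        start += shift
    # Samples from `start` onwards are needed by the next call.
    return frames, samples[start:]


signal = list(range(11))

# Offline: frame the whole signal at once.
offline_frames, offline_rest = frame_chunk(signal)

# Streaming: feed two chunks, passing the remainder between calls.
frames_1, rest = frame_chunk(signal[:5])
frames_2, rest = frame_chunk(signal[5:], remainder=rest)
streamed_frames = frames_1 + frames_2
```

No state lives inside the function, so the same call can serve many independent streams as long as each caller keeps its own remainder.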
- T_destination = ~T_destination
- add_module(name, module)
Add a child module to the current module.
The module can be accessed as an attribute using the given name.
- Return type:
None
- Args:
name (str): name of the child module. The child module can be accessed from this module using the given name
module (Module): child module to be added to the module.
- apply(fn)
Apply fn recursively to every submodule (as returned by .children()) as well as self.
Typical use includes initializing the parameters of a model (see also nn-init-doc).
- Return type:
Self
- Args:
fn (Module -> None): function to be applied to each submodule
- Returns:
Module: self
Example:
>>> @torch.no_grad()
>>> def init_weights(m):
>>>     print(m)
>>>     if type(m) is nn.Linear:
>>>         m.weight.fill_(1.0)
>>>         print(m.weight)
>>> net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
>>> net.apply(init_weights)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
- bfloat16()
Casts all floating point parameters and buffers to bfloat16 datatype.
Note
This method modifies the module in-place.
- Return type:
Self
- Returns:
Module: self
- buffers(recurse=True)
Return an iterator over module buffers.
- Return type:
Iterator[Tensor]
- Args:
recurse (bool): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module.
- Yields:
torch.Tensor: module buffer
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> for buf in model.buffers():
>>>     print(type(buf), buf.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
- call_super_init: bool = False
- children()
Return an iterator over immediate children modules.
- Return type:
Iterator[Module]
- Yields:
Module: a child module
- compile(*args, **kwargs)
Compile this Module’s forward using torch.compile().
This Module’s __call__ method is compiled and all arguments are passed as-is to torch.compile().
See torch.compile() for details on the arguments for this function.
- Return type:
None
- cpu()
Move all model parameters and buffers to the CPU.
Note
This method modifies the module in-place.
- Return type:
Self
- Returns:
Module: self
- cuda(device=None)
Move all model parameters and buffers to the GPU.
This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on GPU while being optimized.
Note
This method modifies the module in-place.
- Return type:
Self
- Args:
device (int, optional): if specified, all parameters will be copied to that device
- Returns:
Module: self
- double()
Casts all floating point parameters and buffers to double datatype.
Note
This method modifies the module in-place.
- Return type:
Self
- Returns:
Module: self
- dump_patches: bool = False
- eval()
Set the module in evaluation mode.
This has an effect only on certain modules. See the documentation of particular modules for details of their behaviors in training/evaluation mode, i.e. whether they are affected, e.g. Dropout, BatchNorm, etc.
This is equivalent to self.train(False).
See locally-disable-grad-doc for a comparison between .eval() and several similar mechanisms that may be confused with it.
- Return type:
Self
- Returns:
Module: self
- extra_repr()
Return the extra representation of the module.
To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.
- Return type:
str
- float()
Casts all floating point parameters and buffers to float datatype.
Note
This method modifies the module in-place.
- Return type:
Self
- Returns:
Module: self
- get_buffer(target)
Return the buffer given by target if it exists, otherwise throw an error.
See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.
- Return type:
Tensor
- Args:
target: The fully-qualified string name of the buffer to look for. (See get_submodule for how to specify a fully-qualified string.)
- Returns:
torch.Tensor: The buffer referenced by target
- Raises:
AttributeError: If the target string references an invalid path or resolves to something that is not a buffer
- get_extra_state()
Return any extra state to include in the module’s state_dict.
Implement this and a corresponding
set_extra_state()for your module if you need to store extra state. This function is called when building the module’s state_dict().Note that extra state should be picklable to ensure working serialization of the state_dict. We only provide backwards compatibility guarantees for serializing Tensors; other objects may break backwards compatibility if their serialized pickled form changes.
- Return type:
Any
- Returns:
object: Any extra state to store in the module’s state_dict
- get_parameter(target)
Return the parameter given by target if it exists, otherwise throw an error.
See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.
- Return type:
Parameter
- Args:
target: The fully-qualified string name of the Parameter to look for. (See get_submodule for how to specify a fully-qualified string.)
- Returns:
torch.nn.Parameter: The Parameter referenced by target
- Raises:
AttributeError: If the target string references an invalid path or resolves to something that is not an nn.Parameter
- get_submodule(target)
Return the submodule given by target if it exists, otherwise throw an error.
For example, let’s say you have an nn.Module A that looks like this:
A(
    (net_b): Module(
        (net_c): Module(
            (conv): Conv2d(16, 33, kernel_size=(3, 3), stride=(2, 2))
        )
        (linear): Linear(in_features=100, out_features=200, bias=True)
    )
)
(The diagram shows an nn.Module A. A has a nested submodule net_b, which itself has two submodules net_c and linear. net_c then has a submodule conv.)
To check whether or not we have the linear submodule, we would call get_submodule("net_b.linear"). To check whether we have the conv submodule, we would call get_submodule("net_b.net_c.conv").
The runtime of get_submodule is bounded by the degree of module nesting in target. A query against named_modules achieves the same result, but it is O(N) in the number of transitive modules. So, for a simple check to see if some submodule exists, get_submodule should always be used.
- Return type:
Module
- Args:
target: The fully-qualified string name of the submodule to look for. (See above example for how to specify a fully-qualified string.)
- Returns:
torch.nn.Module: The submodule referenced by target
- Raises:
AttributeError: If at any point along the path resulting from the target string the (sub)path resolves to a non-existent attribute name or an object that is not an instance of nn.Module.
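The claim that the lookup cost is bounded by the nesting depth of target follows from resolving the dotted path one attribute at a time. The sketch below illustrates that idea on plain objects; resolve is a hypothetical helper, not PyTorch's implementation, and it does not check that each step is an nn.Module.

```python
from types import SimpleNamespace


def resolve(root, target):
    """Walk a dotted path such as 'net_b.net_c.conv', one attribute per step."""
    if target == "":
        return root
    obj = root
    for name in target.split("."):
        if not hasattr(obj, name):
            raise AttributeError(f"{type(obj).__name__} has no attribute '{name}'")
        obj = getattr(obj, name)
    return obj


# Plain objects standing in for the module tree from the example above.
conv = SimpleNamespace(kind="conv")
linear = SimpleNamespace(kind="linear")
a = SimpleNamespace(net_b=SimpleNamespace(net_c=SimpleNamespace(conv=conv), linear=linear))

found_conv = resolve(a, "net_b.net_c.conv")
found_linear = resolve(a, "net_b.linear")
```

The loop runs once per path component, regardless of how many other submodules the tree contains.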
- half()
Casts all floating point parameters and buffers to half datatype.
Note
This method modifies the module in-place.
- Return type:
Self
- Returns:
Module: self
- ipu(device=None)
Move all model parameters and buffers to the IPU.
This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on IPU while being optimized.
Note
This method modifies the module in-place.
- Return type:
Self
- Arguments:
device (int, optional): if specified, all parameters will be copied to that device
- Returns:
Module: self
- load_state_dict(state_dict, strict=True, assign=False)
Copy parameters and buffers from state_dict into this module and its descendants.
If strict is True, then the keys of state_dict must exactly match the keys returned by this module’s state_dict() function.
Warning
If assign is True the optimizer must be created after the call to load_state_dict unless get_swap_module_params_on_conversion() is True.
- Args:
state_dict (dict): a dict containing parameters and persistent buffers.
strict (bool, optional): whether to strictly enforce that the keys in state_dict match the keys returned by this module’s state_dict() function. Default: True
assign (bool, optional): When set to False, the properties of the tensors in the current module are preserved whereas setting it to True preserves properties of the Tensors in the state dict. The only exception is the requires_grad field of Parameter for which the value from the module is preserved. Default: False
- Returns:
NamedTuple with missing_keys and unexpected_keys fields:
missing_keys is a list of str containing any keys that are expected by this module but missing from the provided state_dict.
unexpected_keys is a list of str containing the keys that are not expected by this module but present in the provided state_dict.
- Note:
If a parameter or buffer is registered as None and its corresponding key exists in state_dict, load_state_dict() will raise a RuntimeError.
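The relationship between strict, missing_keys, and unexpected_keys can be illustrated with plain dictionaries. The helper check_keys and the IncompatibleKeys tuple below are hypothetical names used for this sketch; it models only the key comparison, not tensor copying.

```python
from collections import namedtuple

IncompatibleKeys = namedtuple("IncompatibleKeys", ["missing_keys", "unexpected_keys"])


def check_keys(expected_keys, state_dict, strict=True):
    """Compare the keys a module expects with the keys found in state_dict."""
    missing = [k for k in expected_keys if k not in state_dict]
    unexpected = [k for k in state_dict if k not in expected_keys]
    if strict and (missing or unexpected):
        raise RuntimeError(f"missing keys: {missing}, unexpected keys: {unexpected}")
    return IncompatibleKeys(missing, unexpected)


expected = ["weight", "bias"]
loaded = {"weight": 1.0, "extra": 2.0}

# Non-strict mode reports the mismatch instead of raising.
report = check_keys(expected, loaded, strict=False)
```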
- modules(remove_duplicate=True)
Return an iterator over all modules in the network.
- Return type:
Iterator[Module]
- Args:
remove_duplicate: whether to remove the duplicated module instances in the result or not.
- Yields:
Module: a module in the network
- Note:
Duplicate modules are returned only once by default. In the following example, l will be returned only once.
Example:
>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.modules()):
...     print(idx, '->', m)
0 -> Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
1 -> Linear(in_features=2, out_features=2, bias=True)
- mtia(device=None)
Move all model parameters and buffers to the MTIA.
This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on MTIA while being optimized.
Note
This method modifies the module in-place.
- Return type:
Self
- Arguments:
device (int, optional): if specified, all parameters will be copied to that device
- Returns:
Module: self
- named_buffers(prefix='', recurse=True, remove_duplicate=True)
Return an iterator over module buffers, yielding both the name of the buffer as well as the buffer itself.
- Return type:
Iterator[tuple[str,Tensor]]
- Args:
prefix (str): prefix to prepend to all buffer names.
recurse (bool, optional): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module. Defaults to True.
remove_duplicate (bool, optional): whether to remove the duplicated buffers in the result. Defaults to True.
- Yields:
(str, torch.Tensor): Tuple containing the name and buffer
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> for name, buf in self.named_buffers():
>>>     if name in ['running_var']:
>>>         print(buf.size())
- named_children()
Return an iterator over immediate children modules, yielding both the name of the module as well as the module itself.
- Return type:
Iterator[tuple[str, Module]]
- Yields:
(str, Module): Tuple containing a name and child module
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> for name, module in model.named_children():
>>>     if name in ['conv4', 'conv5']:
>>>         print(module)
- named_modules(memo=None, prefix='', remove_duplicate=True)
Return an iterator over all modules in the network, yielding both the name of the module as well as the module itself.
- Args:
memo: a memo to store the set of modules already added to the result
prefix: a prefix that will be added to the name of the module
remove_duplicate: whether to remove the duplicated module instances in the result or not
- Yields:
(str, Module): Tuple of name and module
- Note:
Duplicate modules are returned only once. In the following example, l will be returned only once.
Example:
>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.named_modules()):
...     print(idx, '->', m)
0 -> ('', Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
))
1 -> ('0', Linear(in_features=2, out_features=2, bias=True))
- named_parameters(prefix='', recurse=True, remove_duplicate=True)
Return an iterator over module parameters, yielding both the name of the parameter as well as the parameter itself.
- Return type:
Iterator[tuple[str,Parameter]]
- Args:
prefix (str): prefix to prepend to all parameter names.
recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.
remove_duplicate (bool, optional): whether to remove the duplicated parameters in the result. Defaults to True.
- Yields:
(str, Parameter): Tuple containing the name and parameter
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> for name, param in self.named_parameters():
>>>     if name in ['bias']:
>>>         print(param.size())
- parameters(recurse=True)
Return an iterator over module parameters.
This is typically passed to an optimizer.
- Return type:
Iterator[Parameter]
- Args:
recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.
- Yields:
Parameter: module parameter
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> for param in model.parameters():
>>>     print(type(param), param.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
- register_backward_hook(hook)
Register a backward hook on the module.
This function is deprecated in favor of register_full_backward_hook() and the behavior of this function will change in future versions.
- Return type:
RemovableHandle
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
- register_buffer(name, tensor, persistent=True)
Add a buffer to the module.
This is typically used to register a buffer that should not be considered a model parameter. For example, BatchNorm’s running_mean is not a parameter, but is part of the module’s state. Buffers, by default, are persistent and will be saved alongside parameters. This behavior can be changed by setting persistent to False. The only difference between a persistent buffer and a non-persistent buffer is that the latter will not be a part of this module’s state_dict.
Buffers can be accessed as attributes using given names.
- Return type:
None
- Args:
name (str): name of the buffer. The buffer can be accessed from this module using the given name
tensor (Tensor or None): buffer to be registered. If None, then operations that run on buffers, such as cuda, are ignored. If None, the buffer is not included in the module’s state_dict.
persistent (bool): whether the buffer is part of this module’s state_dict.
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> self.register_buffer('running_mean', torch.zeros(num_features))
- register_forward_hook(hook, *, prepend=False, with_kwargs=False, always_call=False)
Register a forward hook on the module.
The hook will be called every time after forward() has computed an output.
If with_kwargs is False or not specified, the input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks and only to the forward. The hook can modify the output. It can modify the input inplace but it will not have effect on forward since this is called after forward() is called. The hook should have the following signature:
hook(module, args, output) -> None or modified output
If with_kwargs is True, the forward hook will be passed the kwargs given to the forward function and be expected to return the output possibly modified. The hook should have the following signature:
hook(module, args, kwargs, output) -> None or modified output
- Return type:
RemovableHandle
- Args:
hook (Callable): The user defined hook to be registered.
prepend (bool): If True, the provided hook will be fired before all existing forward hooks on this torch.nn.Module. Otherwise, the provided hook will be fired after all existing forward hooks on this torch.nn.Module. Note that global forward hooks registered with register_module_forward_hook() will fire before all hooks registered by this method. Default: False
with_kwargs (bool): If True, the hook will be passed the kwargs given to the forward function. Default: False
always_call (bool): If True the hook will be run regardless of whether an exception is raised while calling the Module. Default: False
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
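The forward-hook contract (hooks run after forward(), may replace the output, and are skipped when forward() is called directly) can be modeled in a few lines. TinyModule is a hypothetical toy class written for this sketch; it is not torch.nn.Module and ignores with_kwargs and always_call.

```python
class TinyModule:
    """Minimal stand-in for the hook-calling logic of a module."""

    def __init__(self):
        self._forward_hooks = []

    def register_forward_hook(self, hook, prepend=False):
        if prepend:
            self._forward_hooks.insert(0, hook)
        else:
            self._forward_hooks.append(hook)
        # Return a callable that removes the hook (stands in for handle.remove()).
        return lambda: self._forward_hooks.remove(hook)

    def forward(self, x):
        return x * 2

    def __call__(self, *args):
        output = self.forward(*args)
        for hook in list(self._forward_hooks):
            result = hook(self, args, output)
            if result is not None:  # a hook may replace the output
                output = result
        return output


module = TinyModule()
remove = module.register_forward_hook(lambda mod, args, out: out + 1)
with_hook = module(3)        # forward() gives 6, the hook turns it into 7
direct = module.forward(3)   # calling forward() directly skips the hooks
remove()
without_hook = module(3)
```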
- register_forward_pre_hook(hook, *, prepend=False, with_kwargs=False)
Register a forward pre-hook on the module.
The hook will be called every time before forward() is invoked.
If with_kwargs is false or not specified, the input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks and only to the forward. The hook can modify the input. User can either return a tuple or a single modified value in the hook. We will wrap the value into a tuple if a single value is returned (unless that value is already a tuple). The hook should have the following signature:
hook(module, args) -> None or modified input
If with_kwargs is true, the forward pre-hook will be passed the kwargs given to the forward function. And if the hook modifies the input, both the args and kwargs should be returned. The hook should have the following signature:
hook(module, args, kwargs) -> None or a tuple of modified input and kwargs
- Return type:
RemovableHandle
- Args:
hook (Callable): The user defined hook to be registered.
prepend (bool): If true, the provided hook will be fired before all existing forward_pre hooks on this torch.nn.Module. Otherwise, the provided hook will be fired after all existing forward_pre hooks on this torch.nn.Module. Note that global forward_pre hooks registered with register_module_forward_pre_hook() will fire before all hooks registered by this method. Default: False
with_kwargs (bool): If true, the hook will be passed the kwargs given to the forward function. Default: False
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
- register_full_backward_hook(hook, prepend=False)
Register a backward hook on the module.
The hook will be called every time the gradients with respect to a module are computed, and its firing rules are as follows: :rtype:
RemovableHandleOrdinarily, the hook fires when the gradients are computed with respect to the module inputs.
If none of the module inputs require gradients, the hook will fire when the gradients are computed with respect to module outputs.
If none of the module outputs require gradients, then the hooks will not fire.
The hook should have the following signature:
hook(module, grad_input, grad_output) -> tuple(Tensor) or None
The
grad_inputandgrad_outputare tuples that contain the gradients with respect to the inputs and outputs respectively. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the input that will be used in place ofgrad_inputin subsequent computations.grad_inputwill only correspond to the inputs given as positional arguments and all kwarg arguments are ignored. Entries ingrad_inputandgrad_outputwill beNonefor all non-Tensor arguments.For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly the caller will receive a view of each Tensor returned by the Module’s forward function.
Warning
Modifying inputs or outputs inplace is not allowed when using backward hooks and will raise an error.
- Args:
hook (Callable): The user-defined hook to be registered.
prepend (bool): If true, the provided hook will be fired before all existing backward hooks on this torch.nn.Module. Otherwise, the provided hook will be fired after all existing backward hooks on this torch.nn.Module. Note that global backward hooks registered with register_module_full_backward_hook() will fire before all hooks registered by this method.
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
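A minimal sketch of the signature and firing rules described for register_full_backward_hook(), assuming a toy nn.Linear layer. The hook name record_grads and the seen dictionary are illustrative only; the hook returns None, so the gradients are left unchanged.

```python
import torch
from torch import nn

layer = nn.Linear(3, 2)
seen = {}

def record_grads(module, grad_input, grad_output):
    # Both arguments are tuples. Store shapes only and return None
    # so that the gradients are used as computed.
    seen["grad_input"] = [None if g is None else tuple(g.shape) for g in grad_input]
    seen["grad_output"] = [None if g is None else tuple(g.shape) for g in grad_output]

handle = layer.register_full_backward_hook(record_grads)

# The input requires gradients, so the hook fires when the gradients
# with respect to the module inputs are computed.
x = torch.randn(4, 3, requires_grad=True)
layer(x).sum().backward()

handle.remove()  # the hook no longer fires after this call
```

Keeping the returned handle and calling handle.remove() once the inspection is done avoids leaving the hook attached during later training steps.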
- register_full_backward_pre_hook(hook, prepend=False)
Register a backward pre-hook on the module.
The hook will be called every time the gradients for the module are computed. The hook should have the following signature:
hook(module, grad_output) -> tuple[Tensor, ...], Tensor or None
The grad_output is a tuple. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the output that will be used in place of grad_output in subsequent computations. Entries in grad_output will be None for all non-Tensor arguments.
For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly the caller will receive a view of each Tensor returned by the Module’s forward function.
Warning
Modifying inputs inplace is not allowed when using backward hooks and will raise an error.
- Return type:
RemovableHandle
- Args:
hook (Callable): The user-defined hook to be registered.
prepend (bool): If true, the provided hook will be fired before all existing backward_pre hooks on this torch.nn.Module. Otherwise, the provided hook will be fired after all existing backward_pre hooks on this torch.nn.Module. Note that global backward_pre hooks registered with register_module_full_backward_pre_hook() will fire before all hooks registered by this method.
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
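The following sketch illustrates how a backward pre-hook can replace grad_output, assuming a toy nn.Linear layer. The hook name halve_grad_output and the variable names are hypothetical; the hook returns a new tuple rather than modifying its arguments.

```python
import torch
from torch import nn

layer = nn.Linear(3, 2)

def halve_grad_output(module, grad_output):
    # Return a new tuple instead of modifying grad_output in place.
    return tuple(g * 0.5 for g in grad_output)

handle = layer.register_full_backward_pre_hook(halve_grad_output)

x = torch.ones(1, 3, requires_grad=True)
layer(x).sum().backward()
scaled_bias_grad = layer.bias.grad.clone()

# Repeat without the hook to obtain the unscaled gradient.
handle.remove()
layer.zero_grad()
layer(x).sum().backward()
plain_bias_grad = layer.bias.grad.clone()
```

Because the replacement gradient is used in subsequent computations, the parameter gradients computed with the hook attached should be half of those computed without it.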
- register_load_state_dict_post_hook(hook)
Register a post-hook to be run after module’s load_state_dict() is called.
It should have the following signature:
hook(module, incompatible_keys) -> None
The module argument is the current module that this hook is registered on, and the incompatible_keys argument is a NamedTuple consisting of attributes missing_keys and unexpected_keys. missing_keys is a list of str containing the missing keys and unexpected_keys is a list of str containing the unexpected keys.
The given incompatible_keys can be modified inplace if needed.
Note that the checks performed when calling load_state_dict() with strict=True are affected by modifications the hook makes to missing_keys or unexpected_keys, as expected. Additions to either set of keys will result in an error being thrown when strict=True, and clearing out both missing and unexpected keys will avoid an error.
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
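As an illustration of how in-place changes to incompatible_keys interact with strict=True, here is a small sketch. The hook name ignore_missing_bias and the partial checkpoint are made up for this example.

```python
import torch
from torch import nn

model = nn.Linear(2, 2)

def ignore_missing_bias(module, incompatible_keys):
    # Modify the list in place; rebinding the attribute would have no effect.
    if "bias" in incompatible_keys.missing_keys:
        incompatible_keys.missing_keys.remove("bias")

handle = model.register_load_state_dict_post_hook(ignore_missing_bias)

# A checkpoint without "bias" would normally raise under strict=True.
partial = {"weight": torch.zeros(2, 2)}
result = model.load_state_dict(partial, strict=True)
handle.remove()
```

The weight is loaded from the checkpoint while the bias keeps its previous value, so a hook like this should be used only when a missing key is genuinely expected.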
- register_load_state_dict_pre_hook(hook)
Register a pre-hook to be run before module’s load_state_dict() is called.
It should have the following signature:
hook(module, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs) -> None
- Arguments:
- hook (Callable): Callable hook that will be invoked before loading the state dict.
- register_module(name, module)
Alias for add_module().
- Return type:
None
- register_parameter(name, param)
Add a parameter to the module.
The parameter can be accessed as an attribute using given name.
- Return type:
None
- Args:
- name (str): name of the parameter. The parameter can be accessed from this module using the given name
- param (Parameter or None): parameter to be added to the module. If None, then operations that run on parameters, such as cuda, are ignored. If None, the parameter is not included in the module’s state_dict.
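A short sketch of the None behaviour described above, assuming a hypothetical Scale module with an optional bias. The class and attribute names are illustrative.

```python
import torch
from torch import nn

class Scale(nn.Module):
    def __init__(self, use_bias: bool):
        super().__init__()
        self.register_parameter("gain", nn.Parameter(torch.ones(1)))
        # Registering None reserves the attribute name but adds no tensor.
        bias = nn.Parameter(torch.zeros(1)) if use_bias else None
        self.register_parameter("bias", bias)

    def forward(self, x):
        y = x * self.gain
        return y if self.bias is None else y + self.bias

with_bias = Scale(use_bias=True)
without_bias = Scale(use_bias=False)
```

Registering None keeps the forward code uniform, since self.bias always exists as an attribute, while the state_dict of without_bias contains only gain.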
- register_state_dict_post_hook(hook)
Register a post-hook for the state_dict() method.
It should have the following signature:
hook(module, state_dict, prefix, local_metadata) -> None
The registered hooks can modify the state_dict inplace.
- register_state_dict_pre_hook(hook)
Register a pre-hook for the state_dict() method.
It should have the following signature:
hook(module, prefix, keep_vars) -> None
The registered hooks can be used to perform pre-processing before the state_dict call is made.
- requires_grad_(requires_grad=True)
Change if autograd should record operations on parameters in this module.
This method sets the parameters’ requires_grad attributes in-place.
This method is helpful for freezing part of the module for finetuning or training parts of a model individually (e.g., GAN training).
See locally-disable-grad-doc for a comparison between .requires_grad_() and several similar mechanisms that may be confused with it.
- Return type:
Self
- Args:
- requires_grad (bool): whether autograd should record operations on parameters in this module. Default: True.
- Returns:
Module: self
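A minimal sketch of freezing part of a model with requires_grad_(), assuming a small nn.Sequential stack chosen only for illustration.

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Freeze the first layer; requires_grad_ returns the module itself.
frozen = model[0].requires_grad_(False)

# Only the parameters of the last layer remain trainable.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
```

When fine-tuning, the remaining trainable parameters can be selected for the optimizer with the same requires_grad filter.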
- set_extra_state(state)
Set extra state contained in the loaded state_dict.
This function is called from load_state_dict() to handle any extra state found within the state_dict. Implement this function and a corresponding get_extra_state() for your module if you need to store extra state within its state_dict.
- Return type:
None
- Args:
state (dict): Extra state from the state_dict
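The pairing of get_extra_state() and set_extra_state() can be sketched as follows. The Normalizer class and its sampling_rate attribute are hypothetical and serve only to show a non-tensor value travelling through the state_dict.

```python
import torch
from torch import nn

class Normalizer(nn.Module):
    def __init__(self, sampling_rate: int = 16000):
        super().__init__()
        self.sampling_rate = sampling_rate  # plain Python attribute, not a tensor

    def get_extra_state(self):
        # Called while building the state_dict.
        return {"sampling_rate": self.sampling_rate}

    def set_extra_state(self, state):
        # Called from load_state_dict() with the value stored above.
        self.sampling_rate = state["sampling_rate"]

source = Normalizer(sampling_rate=8000)
target = Normalizer()
target.load_state_dict(source.state_dict())
```

The extra state is stored under the _extra_state key of the state_dict, next to the parameters and buffers.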
- set_submodule(target, module, strict=False)
Set the submodule given by target if it exists, otherwise throw an error.
Note
If strict is set to False (default), the method will replace an existing submodule or create a new submodule if the parent module exists. If strict is set to True, the method will only attempt to replace an existing submodule and throw an error if the submodule does not exist.
For example, let’s say you have an nn.Module A that looks like this:
A(
    (net_b): Module(
        (net_c): Module(
            (conv): Conv2d(3, 3, 3)
        )
        (linear): Linear(3, 3)
    )
)
(The diagram shows an nn.Module A. A has a nested submodule net_b, which itself has two submodules net_c and linear. net_c then has a submodule conv.)
To override the Conv2d with a new submodule Linear, you could call set_submodule("net_b.net_c.conv", nn.Linear(1, 1)) where strict could be True or False.
To add a new submodule Conv2d to the existing net_b module, you would call set_submodule("net_b.conv", nn.Conv2d(1, 1, 1)).
In the above if you set strict=True and call set_submodule("net_b.conv", nn.Conv2d(1, 1, 1), strict=True), an AttributeError will be raised because net_b does not have a submodule named conv.
- Return type:
None
- Args:
- target: The fully-qualified string name of the submodule to look for. (See above example for how to specify a fully-qualified string.)
- module: The module to set the submodule to.
- strict: If False, the method will replace an existing submodule or create a new submodule if the parent module exists. If True, the method will only attempt to replace an existing submodule and throw an error if the submodule doesn’t already exist.
- Raises:
ValueError: If the target string is empty or if module is not an instance of nn.Module.
AttributeError: If at any point along the path resulting from the target string the (sub)path resolves to a non-existent attribute name or an object that is not an instance of nn.Module.
- share_memory()
See torch.Tensor.share_memory_().
- Return type:
Self
- state_dict(*args, destination=None, prefix='', keep_vars=False)
Return a dictionary containing references to the whole state of the module.
Both parameters and persistent buffers (e.g. running averages) are included. Keys are corresponding parameter and buffer names. Parameters and buffers set to None are not included.
Note
The returned object is a shallow copy. It contains references to the module’s parameters and buffers.
Warning
Currently state_dict() also accepts positional arguments for destination, prefix and keep_vars in order. However, this is being deprecated and keyword arguments will be enforced in future releases.
Warning
Please avoid the use of argument destination as it is not designed for end-users.
- Args:
- destination (dict, optional): If provided, the state of module will be updated into the dict and the same object is returned. Otherwise, an OrderedDict will be created and returned. Default: None.
- prefix (str, optional): a prefix added to parameter and buffer names to compose the keys in state_dict. Default: ''.
- keep_vars (bool, optional): by default the Tensors returned in the state dict are detached from autograd. If it’s set to True, detaching will not be performed. Default: False.
- Returns:
- dict:
a dictionary containing a whole state of the module
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> module.state_dict().keys()
['bias', 'weight']
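To make the effect of keep_vars concrete, here is a small sketch on a toy nn.Linear layer. The variable names are illustrative.

```python
import torch
from torch import nn

layer = nn.Linear(2, 1)

detached = layer.state_dict()                # default: keep_vars=False
attached = layer.state_dict(keep_vars=True)  # entries are the parameters themselves

# The default entries are detached tensors that still share storage
# with the parameters, which is why the result is a shallow copy.
shares_storage = detached["weight"].data_ptr() == layer.weight.data_ptr()
```

Since the tensors share storage, an in-place change to a returned tensor also changes the module; clone the tensors first if an independent snapshot is needed.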
- to(*args, **kwargs)
Move and/or cast the parameters and buffers.
This can be called as
- to(device=None, dtype=None, non_blocking=False)
- to(dtype, non_blocking=False)
- to(tensor, non_blocking=False)
- to(memory_format=torch.channels_last)
Its signature is similar to torch.Tensor.to(), but only accepts floating point or complex dtypes. In addition, this method will only cast the floating point or complex parameters and buffers to dtype (if given). The integral parameters and buffers will be moved to device, if that is given, but with dtypes unchanged. When non_blocking is set, it tries to convert/move asynchronously with respect to the host if possible, e.g., moving CPU Tensors with pinned memory to CUDA devices.
See below for examples.
Note
This method modifies the module in-place.
- Args:
- device (torch.device): the desired device of the parameters and buffers in this module
- dtype (torch.dtype): the desired floating point or complex dtype of the parameters and buffers in this module
- tensor (torch.Tensor): Tensor whose dtype and device are the desired dtype and device for all parameters and buffers in this module
- memory_format (torch.memory_format): the desired memory format for 4D parameters and buffers in this module (keyword only argument)
- Returns:
Module: self
Examples:
>>> # xdoctest: +IGNORE_WANT("non-deterministic")
>>> linear = nn.Linear(2, 2)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]])
>>> linear.to(torch.double)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]], dtype=torch.float64)
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_CUDA1)
>>> gpu1 = torch.device("cuda:1")
>>> linear.to(gpu1, dtype=torch.half, non_blocking=True)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16, device='cuda:1')
>>> cpu = torch.device("cpu")
>>> linear.to(cpu)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16)
>>> linear = nn.Linear(2, 2, bias=None).to(torch.cdouble)
>>> linear.weight
Parameter containing:
tensor([[ 0.3741+0.j,  0.2382+0.j],
        [ 0.5593+0.j, -0.4443+0.j]], dtype=torch.complex128)
>>> linear(torch.ones(3, 2, dtype=torch.cdouble))
tensor([[0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j]], dtype=torch.complex128)
- to_empty(*, device, recurse=True)
Move the parameters and buffers to the specified device without copying storage.
- Return type:
Self
- Args:
- device (torch.device): The desired device of the parameters and buffers in this module.
- recurse (bool): Whether parameters and buffers of submodules should be recursively moved to the specified device.
- Returns:
Module: self
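One common use of to_empty() is materializing a module that was created on the meta device. The sketch below assumes that workflow; the explicit fill_ calls stand in for whatever initialization or checkpoint loading would follow in practice.

```python
import torch
from torch import nn

# Parameters on the "meta" device have shapes and dtypes but no data.
layer = nn.Linear(4, 2, device="meta")

# Allocate uninitialized storage on the CPU; values are arbitrary until set.
layer = layer.to_empty(device="cpu")

with torch.no_grad():
    layer.weight.fill_(0.0)
    layer.bias.fill_(0.0)
```

Because no storage is copied, the parameters must be initialized or loaded from a state_dict before the module is used.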
- train(mode=True)
Set the module in training mode.
This has an effect only on certain modules. See the documentation of particular modules for details of their behaviors in training/evaluation mode, i.e., whether they are affected, e.g. Dropout, BatchNorm, etc.
- Return type:
Self
- Args:
- mode (bool): whether to set training mode (True) or evaluation mode (False). Default: True.
- Returns:
Module: self
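A brief sketch of switching modes, assuming a toy model with a Dropout layer so that the mode has a visible effect.

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(2, 4)

model.train(False)  # same effect as model.eval()
eval_flags = [m.training for m in model.modules()]
eval_a, eval_b = model(x), model(x)  # dropout is inactive, outputs match

model.train()  # dropout is active again
train_flags = [m.training for m in model.modules()]
```

The flag is propagated to every submodule, so calling train() or eval() on the top-level module is normally sufficient.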
- type(dst_type)
Casts all parameters and buffers to dst_type.
Note
This method modifies the module in-place.
- Return type:
Self
- Args:
dst_type (type or string): the desired type
- Returns:
Module: self
- xpu(device=None)
Move all model parameters and buffers to the XPU.
This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on XPU while being optimized.
Note
This method modifies the module in-place.
- Return type:
Self
- Arguments:
- device (int, optional): if specified, all parameters will be copied to that device
- Returns:
Module: self
- zero_grad(set_to_none=True)
Reset gradients of all model parameters.
See similar function under torch.optim.Optimizer for more context.
- Return type:
None
- Args:
- set_to_none (bool): instead of setting to zero, set the grads to None. See torch.optim.Optimizer.zero_grad() for details.
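The difference between the two settings of set_to_none can be sketched as follows, assuming the default shown in the signature above (set_to_none=True).

```python
import torch
from torch import nn

layer = nn.Linear(2, 1)
layer(torch.ones(1, 2)).sum().backward()

layer.zero_grad()  # default: gradients become None
grad_after_default = layer.weight.grad

layer(torch.ones(1, 2)).sum().backward()
layer.zero_grad(set_to_none=False)  # gradients are kept as zero tensors
grad_after_zeroing = layer.weight.grad
```

Code that reads .grad after zero_grad() therefore has to handle None unless set_to_none=False is passed.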
- training: bool
- class lhotse.features.kaldi.layers.Wav2FFT(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, round_to_power_of_two=True, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=True)[source]
Apply standard Kaldi preprocessing (dithering, removing DC offset, pre-emphasis, etc.) on the input waveforms and compute their Short-Time Fourier Transform (STFT). The output is a complex-valued tensor.
Example:
>>> x = torch.randn(1, 16000, dtype=torch.float32)
>>> x.shape
torch.Size([1, 16000])
>>> t = Wav2FFT()
>>> t(x).shape
torch.Size([1, 100, 257])
The input is a tensor of shape (batch_size, num_samples). The output is a tensor of shape (batch_size, num_frames, num_fft_bins) with dtype torch.complex64.
- __init__(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, round_to_power_of_two=True, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=True)[source]
Initialize internal Module state, shared by both nn.Module and ScriptModule.
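The output shape in the Wav2FFT example can be related to the constructor defaults with a little arithmetic. The sketch below is not lhotse code: expected_stft_shape and next_power_of_two are hypothetical helpers, and the frame-count rule is an assumption based on the usual Kaldi convention for snip_edges=False.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()

def expected_stft_shape(num_samples, sampling_rate=16000, frame_length=0.025,
                        frame_shift=0.01, round_to_power_of_two=True):
    """Hypothetical helper: estimate (num_frames, num_fft_bins) for snip_edges=False."""
    window = round(sampling_rate * frame_length)  # 400 samples by default
    shift = round(sampling_rate * frame_shift)    # 160 samples by default
    fft_size = next_power_of_two(window) if round_to_power_of_two else window
    # Assumed Kaldi rule for snip_edges=False: round num_samples / shift.
    num_frames = (num_samples + shift // 2) // shift
    # A one-sided spectrum of a real signal has fft_size // 2 + 1 bins.
    num_fft_bins = fft_size // 2 + 1
    return num_frames, num_fft_bins

shape = expected_stft_shape(16000)
```

With the defaults, a 400-sample window is padded to a 512-point FFT, which gives 257 bins, and one second of audio at a 10 ms shift gives 100 frames, matching the shape printed in the example.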
- property sampling_rate: int
- property frame_length: float
- property frame_shift: float
- property remove_dc_offset: bool
- property preemph_coeff: float
- property window_type: str
- property dither: float
- forward(x)[source]
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- Return type:
Tensor
- T_destination = ~T_destination
- add_module(name, module)
Add a child module to the current module.
The module can be accessed as an attribute using the given name.
- Return type:
None
- Args:
- name (str): name of the child module. The child module can be accessed from this module using the given name
- module (Module): child module to be added to the module.
- apply(fn)
Apply fn recursively to every submodule (as returned by .children()) as well as self.
Typical use includes initializing the parameters of a model (see also nn-init-doc).
- Return type:
Self
- Args:
fn (Module -> None): function to be applied to each submodule
- Returns:
Module: self
Example:
>>> @torch.no_grad()
>>> def init_weights(m):
>>>     print(m)
>>>     if type(m) is nn.Linear:
>>>         m.weight.fill_(1.0)
>>>         print(m.weight)
>>> net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
>>> net.apply(init_weights)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
- bfloat16()
Casts all floating point parameters and buffers to bfloat16 datatype.
Note
This method modifies the module in-place.
- Return type:
Self
- Returns:
Module: self
- buffers(recurse=True)
Return an iterator over module buffers.
- Return type:
Iterator[Tensor]
- Args:
- recurse (bool): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module.
- Yields:
torch.Tensor: module buffer
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> for buf in model.buffers():
>>>     print(type(buf), buf.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
- call_super_init: bool = False
- children()
Return an iterator over immediate children modules.
- Return type:
Iterator[Module]
- Yields:
Module: a child module
- compile(*args, **kwargs)
Compile this Module’s forward using torch.compile().
This Module’s __call__ method is compiled and all arguments are passed as-is to torch.compile().
See torch.compile() for details on the arguments for this function.
- Return type:
None
- cpu()
Move all model parameters and buffers to the CPU.
Note
This method modifies the module in-place.
- Return type:
Self
- Returns:
Module: self
- cuda(device=None)
Move all model parameters and buffers to the GPU.
This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on GPU while being optimized.
Note
This method modifies the module in-place.
- Return type:
Self
- Args:
- device (int, optional): if specified, all parameters will be copied to that device
- Returns:
Module: self
- double()
Casts all floating point parameters and buffers to double datatype.
Note
This method modifies the module in-place.
- Return type:
Self
- Returns:
Module: self
- dump_patches: bool = False
- eval()
Set the module in evaluation mode.
This has an effect only on certain modules. See the documentation of particular modules for details of their behaviors in training/evaluation mode, i.e. whether they are affected, e.g. Dropout, BatchNorm, etc.
This is equivalent to self.train(False).
See locally-disable-grad-doc for a comparison between .eval() and several similar mechanisms that may be confused with it.
- Return type:
Self
- Returns:
Module: self
- extra_repr()
Return the extra representation of the module.
To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.
- Return type:
str
- float()
Casts all floating point parameters and buffers to float datatype.
Note
This method modifies the module in-place.
- Return type:
Self
- Returns:
Module: self
- get_buffer(target)
Return the buffer given by target if it exists, otherwise throw an error.
See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.
- Return type:
Tensor
- Args:
- target: The fully-qualified string name of the buffer to look for. (See get_submodule for how to specify a fully-qualified string.)
- Returns:
torch.Tensor: The buffer referenced by target
- Raises:
- AttributeError: If the target string references an invalid path or resolves to something that is not a buffer
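A small sketch of looking up a buffer by its fully-qualified name, assuming a toy model whose BatchNorm1d layer provides a running_mean buffer. The second lookup shows the AttributeError raised when the target is not a buffer.

```python
import torch
from torch import nn

model = nn.Sequential(nn.Conv1d(1, 4, 3), nn.BatchNorm1d(4))

# "1" is the child name inside the Sequential, "running_mean" the buffer name.
running_mean = model.get_buffer("1.running_mean")

try:
    model.get_buffer("1.weight")  # a parameter, not a buffer
    error = None
except AttributeError as exc:
    error = exc
```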
- get_extra_state()
Return any extra state to include in the module’s state_dict.
Implement this and a corresponding set_extra_state() for your module if you need to store extra state. This function is called when building the module’s state_dict().
Note that extra state should be picklable to ensure working serialization of the state_dict. We only provide backwards compatibility guarantees for serializing Tensors; other objects may break backwards compatibility if their serialized pickled form changes.
- Return type:
Any
- Returns:
object: Any extra state to store in the module’s state_dict
- get_parameter(target)
Return the parameter given by target if it exists, otherwise throw an error.
See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.
- Return type:
Parameter
- Args:
- target: The fully-qualified string name of the Parameter to look for. (See get_submodule for how to specify a fully-qualified string.)
- Returns:
torch.nn.Parameter: The Parameter referenced by target
- Raises:
- AttributeError: If the target string references an invalid path or resolves to something that is not an nn.Parameter
- get_submodule(target)
Return the submodule given by target if it exists, otherwise throw an error.
For example, let’s say you have an nn.Module A that looks like this:
A(
    (net_b): Module(
        (net_c): Module(
            (conv): Conv2d(16, 33, kernel_size=(3, 3), stride=(2, 2))
        )
        (linear): Linear(in_features=100, out_features=200, bias=True)
    )
)
(The diagram shows an nn.Module A. A has a nested submodule net_b, which itself has two submodules net_c and linear. net_c then has a submodule conv.)
To check whether or not we have the linear submodule, we would call get_submodule("net_b.linear"). To check whether we have the conv submodule, we would call get_submodule("net_b.net_c.conv").
The runtime of get_submodule is bounded by the degree of module nesting in target. A query against named_modules achieves the same result, but it is O(N) in the number of transitive modules. So, for a simple check to see if some submodule exists, get_submodule should always be used.
- Return type:
Module
- Args:
- target: The fully-qualified string name of the submodule to look for. (See above example for how to specify a fully-qualified string.)
- Returns:
torch.nn.Module: The submodule referenced by target
- Raises:
- AttributeError: If at any point along the path resulting from the target string the (sub)path resolves to a non-existent attribute name or an object that is not an instance of nn.Module.
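The nested module from the description above can be reproduced in a few lines. In this sketch the has_submodule helper is hypothetical and shows one way to turn the AttributeError into a boolean existence check.

```python
from torch import nn

class A(nn.Module):
    def __init__(self):
        super().__init__()
        self.net_b = nn.Module()
        self.net_b.net_c = nn.Module()
        self.net_b.net_c.conv = nn.Conv2d(16, 33, kernel_size=3, stride=2)
        self.net_b.linear = nn.Linear(100, 200)

a = A()
linear = a.get_submodule("net_b.linear")
conv = a.get_submodule("net_b.net_c.conv")

def has_submodule(module: nn.Module, target: str) -> bool:
    """Hypothetical helper built on get_submodule."""
    try:
        module.get_submodule(target)
        return True
    except AttributeError:
        return False
```

For example, has_submodule(a, "net_b.linear") follows the dotted path directly instead of iterating over named_modules().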
- half()
Casts all floating point parameters and buffers to half datatype.
Note
This method modifies the module in-place.
- Return type:
Self
- Returns:
Module: self
- ipu(device=None)
Move all model parameters and buffers to the IPU.
This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on IPU while being optimized.
Note
This method modifies the module in-place.
- Return type:
Self
- Arguments:
- device (int, optional): if specified, all parameters will be copied to that device
- Returns:
Module: self
- load_state_dict(state_dict, strict=True, assign=False)
Copy parameters and buffers from state_dict into this module and its descendants.
If strict is True, then the keys of state_dict must exactly match the keys returned by this module’s state_dict() function.
Warning
If assign is True the optimizer must be created after the call to load_state_dict unless get_swap_module_params_on_conversion() is True.
- Args:
- state_dict (dict): a dict containing parameters and persistent buffers.
- strict (bool, optional): whether to strictly enforce that the keys in state_dict match the keys returned by this module’s state_dict() function. Default: True
- assign (bool, optional): When set to False, the properties of the tensors in the current module are preserved whereas setting it to True preserves properties of the Tensors in the state dict. The only exception is the requires_grad field of Parameter for which the value from the module is preserved. Default: False
- Returns:
NamedTuple with missing_keys and unexpected_keys fields:
missing_keys is a list of str containing any keys that are expected by this module but missing from the provided state_dict.
unexpected_keys is a list of str containing the keys that are not expected by this module but present in the provided state_dict.
- Note:
If a parameter or buffer is registered as None and its corresponding key exists in state_dict, load_state_dict() will raise a RuntimeError.
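The returned NamedTuple is most useful with strict=False. A minimal sketch, assuming a deliberately mismatched checkpoint dictionary made up for this example:

```python
import torch
from torch import nn

model = nn.Linear(2, 2)

# "bias" is absent and "extra" does not belong to the module.
checkpoint = {"weight": torch.zeros(2, 2), "extra": torch.zeros(1)}

result = model.load_state_dict(checkpoint, strict=False)
missing, unexpected = result.missing_keys, result.unexpected_keys
```

Inspecting both lists after a non-strict load is a simple way to confirm that only the intended keys were skipped.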
- modules(remove_duplicate=True)
Return an iterator over all modules in the network.
- Return type:
Iterator[Module]
- Args:
- remove_duplicate: whether to remove the duplicated module instances in the result or not.
- Yields:
Module: a module in the network
- Note:
Duplicate modules are returned only once by default. In the following example, l will be returned only once.
Example:
>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.modules()):
...     print(idx, '->', m)
0 -> Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
1 -> Linear(in_features=2, out_features=2, bias=True)
- mtia(device=None)
Move all model parameters and buffers to the MTIA.
This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on MTIA while being optimized.
Note
This method modifies the module in-place.
- Return type:
Self
- Arguments:
- device (int, optional): if specified, all parameters will be copied to that device
- Returns:
Module: self
- named_buffers(prefix='', recurse=True, remove_duplicate=True)
Return an iterator over module buffers, yielding both the name of the buffer as well as the buffer itself.
- Return type:
Iterator[tuple[str,Tensor]]
- Args:
prefix (str): prefix to prepend to all buffer names.
recurse (bool, optional): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module. Defaults to True.
remove_duplicate (bool, optional): whether to remove the duplicated buffers in the result. Defaults to True.
- Yields:
(str, torch.Tensor): Tuple containing the name and buffer
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> for name, buf in self.named_buffers():
>>>     if name in ['running_var']:
>>>         print(buf.size())
- named_children()
Return an iterator over immediate children modules, yielding both the name of the module as well as the module itself.
- Return type:
Iterator[tuple[str, Module]]
- Yields:
(str, Module): Tuple containing a name and child module
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> for name, module in model.named_children():
>>>     if name in ['conv4', 'conv5']:
>>>         print(module)
- named_modules(memo=None, prefix='', remove_duplicate=True)
Return an iterator over all modules in the network, yielding both the name of the module as well as the module itself.
- Args:
memo: a memo to store the set of modules already added to the result
prefix: a prefix that will be added to the name of the module
remove_duplicate: whether to remove the duplicated module instances in the result or not
- Yields:
(str, Module): Tuple of name and module
- Note:
Duplicate modules are returned only once. In the following example, l will be returned only once.
Example:
>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.named_modules()):
...     print(idx, '->', m)
0 -> ('', Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
))
1 -> ('0', Linear(in_features=2, out_features=2, bias=True))
- named_parameters(prefix='', recurse=True, remove_duplicate=True)
Return an iterator over module parameters, yielding both the name of the parameter as well as the parameter itself.
- Return type:
Iterator[tuple[str,Parameter]]
- Args:
- prefix (str): prefix to prepend to all parameter names.
- recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.
- remove_duplicate (bool, optional): whether to remove the duplicated parameters in the result. Defaults to True.
- Yields:
(str, Parameter): Tuple containing the name and parameter
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> for name, param in self.named_parameters():
>>>     if name in ['bias']:
>>>         print(param.size())
- parameters(recurse=True)
Return an iterator over module parameters.
This is typically passed to an optimizer.
- Return type:
Iterator[Parameter]
- Args:
- recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.
- Yields:
Parameter: module parameter
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> for param in model.parameters():
>>>     print(type(param), param.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
- register_backward_hook(hook)
Register a backward hook on the module.
This function is deprecated in favor of register_full_backward_hook() and the behavior of this function will change in future versions.
- Return type:
RemovableHandle
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
- register_buffer(name, tensor, persistent=True)
Add a buffer to the module.
This is typically used to register a buffer that should not be considered a model parameter. For example, BatchNorm’s running_mean is not a parameter, but is part of the module’s state. Buffers, by default, are persistent and will be saved alongside parameters. This behavior can be changed by setting persistent to False. The only difference between a persistent buffer and a non-persistent buffer is that the latter will not be a part of this module’s state_dict.
Buffers can be accessed as attributes using given names.
- Return type:
None
- Args:
- name (str): name of the buffer. The buffer can be accessed from this module using the given name
- tensor (Tensor or None): buffer to be registered. If None, then operations that run on buffers, such as cuda, are ignored. If None, the buffer is not included in the module’s state_dict.
- persistent (bool): whether the buffer is part of this module’s state_dict.
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> self.register_buffer('running_mean', torch.zeros(num_features))
- register_forward_hook(hook, *, prepend=False, with_kwargs=False, always_call=False)
Register a forward hook on the module.
The hook will be called every time after forward() has computed an output.
If with_kwargs is False or not specified, the input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks and only to the forward. The hook can modify the output. It can also modify the input inplace, but this has no effect on forward because the hook runs after forward() has been called. The hook should have the following signature:
hook(module, args, output) -> None or modified output
If with_kwargs is True, the forward hook will be passed the kwargs given to the forward function and be expected to return the output possibly modified. The hook should have the following signature:
hook(module, args, kwargs, output) -> None or modified output
- Return type:
RemovableHandle
- Args:
- hook (Callable): The user defined hook to be registered.
- prepend (bool): If True, the provided hook will be fired before all existing forward hooks on this torch.nn.Module. Otherwise, the provided hook will be fired after all existing forward hooks on this torch.nn.Module. Note that global forward hooks registered with register_module_forward_hook() will fire before all hooks registered by this method. Default: False
- with_kwargs (bool): If True, the hook will be passed the kwargs given to the forward function. Default: False
- always_call (bool): If True the hook will be run regardless of whether an exception is raised while calling the Module. Default: False
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
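A common use of forward hooks is capturing intermediate activations. The following is a minimal sketch with the default with_kwargs=False; the hook name save_output and the activations dictionary are illustrative.

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
activations = {}

def save_output(module, args, output):
    # Returning None keeps the module output unchanged.
    activations["relu"] = output.detach()

handle = model[1].register_forward_hook(save_output)
y = model(torch.randn(3, 4))
handle.remove()
```

Detaching the stored tensor avoids keeping the autograd graph alive longer than needed.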
- register_forward_pre_hook(hook, *, prepend=False, with_kwargs=False)
Register a forward pre-hook on the module.
The hook will be called every time before forward() is invoked.
If with_kwargs is false or not specified, the input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks and only to the forward. The hook can modify the input. The hook can return either a tuple or a single modified value; a single value is wrapped into a tuple (unless that value is already a tuple). The hook should have the following signature:
hook(module, args) -> None or modified input
If with_kwargs is true, the forward pre-hook will be passed the kwargs given to the forward function. If the hook modifies the input, both the args and kwargs should be returned. The hook should have the following signature:
hook(module, args, kwargs) -> None or a tuple of modified input and kwargs
- Return type:
RemovableHandle
- Args:
hook (Callable): The user-defined hook to be registered.
prepend (bool): If true, the provided hook will be fired before all existing forward_pre hooks on this torch.nn.Module. Otherwise, the provided hook will be fired after all existing forward_pre hooks on this torch.nn.Module. Note that global forward_pre hooks registered with register_module_forward_pre_hook() will fire before all hooks registered by this method. Default: False
with_kwargs (bool): If true, the hook will be passed the kwargs given to the forward function. Default: False
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
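The input-modifying behaviour described above can be sketched as follows. The double_input hook is a hypothetical name; nn.Identity is used only because it returns its input unchanged, which makes the effect of the hook easy to see.

```python
import torch
from torch import nn


def double_input(module, args):
    # args is the tuple of positional inputs given to the module.
    (x,) = args
    # A single returned value is wrapped into a tuple before forward() runs.
    return x * 2


layer = nn.Identity()
handle = layer.register_forward_pre_hook(double_input)
out_hooked = layer(torch.tensor([1.0, 2.0]))
handle.remove()
out_plain = layer(torch.tensor([1.0, 2.0]))
```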
- register_full_backward_hook(hook, prepend=False)
Register a backward hook on the module.
The hook will be called every time the gradients with respect to a module are computed, and its firing rules are as follows:
Ordinarily, the hook fires when the gradients are computed with respect to the module inputs.
If none of the module inputs require gradients, the hook will fire when the gradients are computed with respect to module outputs.
If none of the module outputs require gradients, then the hooks will not fire.
The hook should have the following signature:
hook(module, grad_input, grad_output) -> tuple(Tensor) or None
The grad_input and grad_output are tuples that contain the gradients with respect to the inputs and outputs respectively. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the input that will be used in place of grad_input in subsequent computations. grad_input will only correspond to the inputs given as positional arguments and all kwarg arguments are ignored. Entries in grad_input and grad_output will be None for all non-Tensor arguments.
For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly the caller will receive a view of each Tensor returned by the Module’s forward function.
Warning
Modifying inputs or outputs inplace is not allowed when using backward hooks and will raise an error.
- Return type:
RemovableHandle
- Args:
hook (Callable): The user-defined hook to be registered.
prepend (bool): If true, the provided hook will be fired before all existing backward hooks on this torch.nn.Module. Otherwise, the provided hook will be fired after all existing backward hooks on this torch.nn.Module. Note that global backward hooks registered with register_module_full_backward_hook() will fire before all hooks registered by this method.
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
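A small sketch of a backward hook that only inspects gradients may help make the grad_input / grad_output tuples concrete. The save_grads hook and the captured dictionary are hypothetical names; the weights are set to a constant so the gradients are predictable.

```python
import torch
from torch import nn

captured = {}


def save_grads(module, grad_input, grad_output):
    # Both arguments are tuples; returning None leaves the gradients unchanged.
    captured["grad_input"] = grad_input[0].clone()
    captured["grad_output"] = grad_output[0].clone()


layer = nn.Linear(3, 1, bias=False)
with torch.no_grad():
    layer.weight.fill_(2.0)

handle = layer.register_full_backward_hook(save_grads)
x = torch.ones(1, 3, requires_grad=True)  # the input requires grad, so the hook fires on grad_input
layer(x).sum().backward()
handle.remove()
```

Here the gradient with respect to the output is all ones (from the sum), and the gradient with respect to the input equals the weight row.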
- register_full_backward_pre_hook(hook, prepend=False)
Register a backward pre-hook on the module.
The hook will be called every time the gradients for the module are computed. The hook should have the following signature:
hook(module, grad_output) -> tuple[Tensor, ...], Tensor or None
The grad_output is a tuple. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the output that will be used in place of grad_output in subsequent computations. Entries in grad_output will be None for all non-Tensor arguments.
For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly the caller will receive a view of each Tensor returned by the Module’s forward function.
Warning
Modifying inputs inplace is not allowed when using backward hooks and will raise an error.
- Return type:
RemovableHandle
- Args:
hook (Callable): The user-defined hook to be registered.
prepend (bool): If true, the provided hook will be fired before all existing backward_pre hooks on this torch.nn.Module. Otherwise, the provided hook will be fired after all existing backward_pre hooks on this torch.nn.Module. Note that global backward_pre hooks registered with register_module_full_backward_pre_hook() will fire before all hooks registered by this method.
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
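The following sketch illustrates a backward pre-hook that replaces grad_output before the module's own backward computation runs. The halve_grad hook is a hypothetical name, and the example assumes a PyTorch version that provides register_full_backward_pre_hook.

```python
import torch
from torch import nn


def halve_grad(module, grad_output):
    # Return a tuple of the same length to replace grad_output.
    return tuple(g * 0.5 for g in grad_output)


layer = nn.Linear(2, 1, bias=False)
with torch.no_grad():
    layer.weight.fill_(1.0)

handle = layer.register_full_backward_pre_hook(halve_grad)
x = torch.ones(1, 2, requires_grad=True)
layer(x).sum().backward()
handle.remove()
```

Because the replaced grad_output feeds the rest of the backward pass, both the input gradient and the weight gradient are scaled by the same factor.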
- register_load_state_dict_post_hook(hook)
Register a post-hook to be run after module’s load_state_dict() is called.
It should have the following signature:
hook(module, incompatible_keys) -> None
The module argument is the current module that this hook is registered on, and the incompatible_keys argument is a NamedTuple consisting of attributes missing_keys and unexpected_keys. missing_keys is a list of str containing the missing keys and unexpected_keys is a list of str containing the unexpected keys.
The given incompatible_keys can be modified inplace if needed.
Note that the checks performed when calling load_state_dict() with strict=True are affected by modifications the hook makes to missing_keys or unexpected_keys, as expected. Additions to either set of keys will result in an error being thrown when strict=True, and clearing out both missing and unexpected keys will avoid an error.
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
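As an illustration of how in-place edits to incompatible_keys interact with strict=True, the sketch below drops a key that the caller has decided is optional. The ignore_missing_bias hook and the choice of "bias" as the optional key are assumptions made for the example.

```python
import torch
from torch import nn


def ignore_missing_bias(module, incompatible_keys):
    # Edit the list in place so the strict check sees the filtered result.
    incompatible_keys.missing_keys[:] = [
        key for key in incompatible_keys.missing_keys if key != "bias"
    ]


layer = nn.Linear(2, 2)
handle = layer.register_load_state_dict_post_hook(ignore_missing_bias)

weights_only = {"weight": torch.zeros(2, 2)}
# Without the hook, strict=True would raise because "bias" is missing.
result = layer.load_state_dict(weights_only, strict=True)
handle.remove()
```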
- register_load_state_dict_pre_hook(hook)
Register a pre-hook to be run before module’s load_state_dict() is called.
It should have the following signature:
hook(module, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs) -> None
- Arguments:
hook (Callable): Callable hook that will be invoked before loading the state dict.
- register_module(name, module)
Alias for add_module().
- Return type:
None
- register_parameter(name, param)
Add a parameter to the module.
The parameter can be accessed as an attribute using given name.
- Return type:
None
- Args:
name (str): name of the parameter. The parameter can be accessed from this module using the given name
param (Parameter or None): parameter to be added to the module. If None, then operations that run on parameters, such as cuda, are ignored. If None, the parameter is not included in the module’s state_dict.
- register_state_dict_post_hook(hook)
Register a post-hook for the state_dict() method.
It should have the following signature:
hook(module, state_dict, prefix, local_metadata) -> None
The registered hooks can modify the state_dict inplace.
- register_state_dict_pre_hook(hook)
Register a pre-hook for the state_dict() method.
It should have the following signature:
hook(module, prefix, keep_vars) -> None
The registered hooks can be used to perform pre-processing before the state_dict call is made.
- requires_grad_(requires_grad=True)
Change if autograd should record operations on parameters in this module.
This method sets the parameters’ requires_grad attributes in-place.
This method is helpful for freezing part of the module for finetuning or training parts of a model individually (e.g., GAN training).
See the PyTorch notes on locally disabling gradient computation for a comparison between .requires_grad_() and several similar mechanisms that may be confused with it.
- Return type:
Self
- Args:
requires_grad (bool): whether autograd should record operations on parameters in this module. Default: True.
- Returns:
Module: self
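The freezing use case mentioned above might look like the following sketch, where the first layer of a small, hypothetical model is frozen and only the remaining parameters are left trainable.

```python
from torch import nn

# Hypothetical three-layer model used only for illustration.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Freeze the first layer in place; the call returns the same module.
model[0].requires_grad_(False)

# Collect the names of the parameters an optimizer would still update.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
```

Passing only the trainable parameters to the optimizer keeps optimizer state from being allocated for the frozen layer.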
- set_extra_state(state)
Set extra state contained in the loaded state_dict.
This function is called from load_state_dict() to handle any extra state found within the state_dict. Implement this function and a corresponding get_extra_state() for your module if you need to store extra state within its state_dict.
- Return type:
None
- Args:
state (dict): Extra state from the state_dict
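A minimal sketch of the get_extra_state() / set_extra_state() pair is shown below. The Counter module and its steps attribute are hypothetical; the point is only that non-tensor state can travel through state_dict() and load_state_dict().

```python
from torch import nn


class Counter(nn.Module):
    """Toy module (hypothetical) that tracks a plain Python counter."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)
        self.steps = 0

    def get_extra_state(self):
        # Called while building state_dict(); the result must be picklable.
        return {"steps": self.steps}

    def set_extra_state(self, state):
        # Called from load_state_dict() with the value stored above.
        self.steps = state["steps"]


src = Counter()
src.steps = 7
dst = Counter()
dst.load_state_dict(src.state_dict())
```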
- set_submodule(target, module, strict=False)
Set the submodule given by target if it exists, otherwise throw an error.
Note
If strict is set to False (default), the method will replace an existing submodule or create a new submodule if the parent module exists. If strict is set to True, the method will only attempt to replace an existing submodule and throw an error if the submodule does not exist.
For example, let’s say you have an nn.Module A that looks like this:
A(
    (net_b): Module(
        (net_c): Module(
            (conv): Conv2d(3, 3, 3)
        )
        (linear): Linear(3, 3)
    )
)
(The diagram shows an nn.Module A. A has a nested submodule net_b, which itself has two submodules net_c and linear. net_c then has a submodule conv.)
To override the Conv2d with a new submodule Linear, you could call set_submodule("net_b.net_c.conv", nn.Linear(1, 1)) where strict could be True or False.
To add a new submodule Conv2d to the existing net_b module, you would call set_submodule("net_b.conv", nn.Conv2d(1, 1, 1)).
In the above, if you set strict=True and call set_submodule("net_b.conv", nn.Conv2d(1, 1, 1), strict=True), an AttributeError will be raised because net_b does not have a submodule named conv.
- Return type:
None
- Args:
target: The fully-qualified string name of the submodule to look for. (See above example for how to specify a fully-qualified string.)
module: The module to set the submodule to.
strict: If False, the method will replace an existing submodule or create a new submodule if the parent module exists. If True, the method will only attempt to replace an existing submodule and throw an error if the submodule doesn’t already exist.
- Raises:
ValueError: If the target string is empty or if module is not an instance of nn.Module.
AttributeError: If at any point along the path resulting from the target string the (sub)path resolves to a non-existent attribute name or an object that is not an instance of nn.Module.
- share_memory()
See torch.Tensor.share_memory_().
- Return type:
Self
- state_dict(*args, destination=None, prefix='', keep_vars=False)
Return a dictionary containing references to the whole state of the module.
Both parameters and persistent buffers (e.g. running averages) are included. Keys are corresponding parameter and buffer names. Parameters and buffers set to None are not included.
Note
The returned object is a shallow copy. It contains references to the module’s parameters and buffers.
Warning
Currently state_dict() also accepts positional arguments for destination, prefix and keep_vars in order. However, this is being deprecated and keyword arguments will be enforced in future releases.
Warning
Please avoid the use of argument destination as it is not designed for end-users.
- Args:
destination (dict, optional): If provided, the state of module will be updated into the dict and the same object is returned. Otherwise, an OrderedDict will be created and returned. Default: None.
prefix (str, optional): a prefix added to parameter and buffer names to compose the keys in state_dict. Default: ''.
keep_vars (bool, optional): by default the Tensors returned in the state dict are detached from autograd. If it’s set to True, detaching will not be performed. Default: False.
- Returns:
dict: a dictionary containing a whole state of the module
Example:
>>> module.state_dict().keys()
['bias', 'weight']
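The shallow-copy and keep_vars behaviour described above can be checked with a short sketch. The variable names are illustrative only.

```python
from torch import nn

layer = nn.Linear(2, 1)

detached = layer.state_dict()
keys = list(detached.keys())

# The entries reference the module's storage rather than copying it.
shares_storage = detached["weight"].data_ptr() == layer.weight.data_ptr()

# By default the tensors are detached from autograd; keep_vars=True skips that.
default_requires_grad = detached["weight"].requires_grad
kept_requires_grad = layer.state_dict(keep_vars=True)["weight"].requires_grad
```

Since the tensors share storage with the module, a checkpoint that must stay fixed while training continues needs an explicit copy (for example via clone()).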
- to(*args, **kwargs)
Move and/or cast the parameters and buffers.
This can be called as
- to(device=None, dtype=None, non_blocking=False)
- to(dtype, non_blocking=False)
- to(tensor, non_blocking=False)
- to(memory_format=torch.channels_last)
Its signature is similar to torch.Tensor.to(), but only accepts floating point or complex dtypes. In addition, this method will only cast the floating point or complex parameters and buffers to dtype (if given). The integral parameters and buffers will be moved to device, if that is given, but with dtypes unchanged. When non_blocking is set, it tries to convert/move asynchronously with respect to the host if possible, e.g., moving CPU Tensors with pinned memory to CUDA devices.
See below for examples.
Note
This method modifies the module in-place.
- Args:
device (torch.device): the desired device of the parameters and buffers in this module
dtype (torch.dtype): the desired floating point or complex dtype of the parameters and buffers in this module
tensor (torch.Tensor): Tensor whose dtype and device are the desired dtype and device for all parameters and buffers in this module
memory_format (torch.memory_format): the desired memory format for 4D parameters and buffers in this module (keyword only argument)
- Returns:
Module: self
Examples:
>>> linear = nn.Linear(2, 2)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]])
>>> linear.to(torch.double)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]], dtype=torch.float64)
>>> gpu1 = torch.device("cuda:1")
>>> linear.to(gpu1, dtype=torch.half, non_blocking=True)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16, device='cuda:1')
>>> cpu = torch.device("cpu")
>>> linear.to(cpu)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16)
>>> linear = nn.Linear(2, 2, bias=None).to(torch.cdouble)
>>> linear.weight
Parameter containing:
tensor([[ 0.3741+0.j,  0.2382+0.j],
        [ 0.5593+0.j, -0.4443+0.j]], dtype=torch.complex128)
>>> linear(torch.ones(3, 2, dtype=torch.cdouble))
tensor([[0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j]], dtype=torch.complex128)
- to_empty(*, device, recurse=True)
Move the parameters and buffers to the specified device without copying storage.
- Return type:
Self
- Args:
device (torch.device): The desired device of the parameters and buffers in this module.
recurse (bool): Whether parameters and buffers of submodules should be recursively moved to the specified device.
- Returns:
Module: self
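One common pattern, sketched here under the assumption that the module was first built on the "meta" device, is to allocate real storage with to_empty() and then initialize it. Calling reset_parameters() afterwards is an illustrative choice, not something to_empty() does by itself.

```python
import torch
from torch import nn

# On the "meta" device the layer has shapes and dtypes but no storage.
layer = nn.Linear(4, 2, device="meta")

# Allocate uninitialized storage on the CPU; the values are arbitrary at this point.
layer.to_empty(device="cpu")

# Illustrative choice: re-run the layer's default initialization before use.
layer.reset_parameters()
out = layer(torch.ones(1, 4))
```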
- train(mode=True)
Set the module in training mode.
This has an effect only on certain modules. See the documentation of particular modules for details of their behaviors in training/evaluation mode, i.e., whether they are affected, e.g. Dropout, BatchNorm, etc.
- Return type:
Self
- Args:
mode (bool): whether to set training mode (True) or evaluation mode (False). Default: True.
- Returns:
Module: self
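Dropout is one of the modules whose behaviour depends on the mode, so it makes a convenient illustration. The tensor size and dropout probability below are arbitrary.

```python
import torch
from torch import nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train(False)   # same effect as drop.eval()
y_eval = drop(x)    # dropout is a no-op in evaluation mode

drop.train()        # back to training mode
y_train = drop(x)   # entries are either zeroed or scaled by 1 / (1 - p)
```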
- type(dst_type)
Casts all parameters and buffers to dst_type.
Note
This method modifies the module in-place.
- Return type:
Self
- Args:
dst_type (type or string): the desired type
- Returns:
Module: self
- xpu(device=None)
Move all model parameters and buffers to the XPU.
This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on XPU while being optimized.
Note
This method modifies the module in-place.
- Return type:
Self
- Arguments:
device (int, optional): if specified, all parameters will be copied to that device
- Returns:
Module: self
- zero_grad(set_to_none=True)
Reset gradients of all model parameters.
See similar function under torch.optim.Optimizer for more context.
- Return type:
None
- Args:
set_to_none (bool): instead of setting to zero, set the grads to None. See torch.optim.Optimizer.zero_grad() for details.
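A short sketch of the set_to_none behaviour: after a backward pass the gradients exist, and zero_grad(set_to_none=True) replaces them with None rather than with zero tensors. Variable names are illustrative.

```python
import torch
from torch import nn

layer = nn.Linear(2, 1)
layer(torch.ones(1, 2)).sum().backward()
had_grad = layer.weight.grad is not None

# Passed explicitly here so the example does not depend on the default value.
layer.zero_grad(set_to_none=True)
cleared = layer.weight.grad is None
```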
- training: bool
- class lhotse.features.kaldi.layers.Wav2Spec(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, round_to_power_of_two=True, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=True, use_fft_mag=False)[source]
Apply standard Kaldi preprocessing (dithering, removing DC offset, pre-emphasis, etc.) on the input waveforms and compute their Short-Time Fourier Transform (STFT). The STFT is transformed either to a magnitude spectrum (use_fft_mag=True) or a power spectrum (use_fft_mag=False).
Example:
>>> x = torch.randn(1, 16000, dtype=torch.float32)
>>> x.shape
torch.Size([1, 16000])
>>> t = Wav2Spec()
>>> t(x).shape
torch.Size([1, 100, 257])
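The output shape in the example above can be reproduced by hand from the default arguments. The sketch below is plain arithmetic and does not call lhotse; it assumes the usual Kaldi conventions (FFT size rounded up to a power of two, and with snip_edges=False the number of frames is the number of samples divided by the frame shift, rounded to nearest). The helper next_power_of_two is a hypothetical name.

```python
def next_power_of_two(n: int) -> int:
    # Smallest power of two that is >= n.
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


sampling_rate = 16000
frame_length = 0.025   # seconds
frame_shift = 0.01     # seconds
num_samples = 16000    # one second of audio, as in the example above

window_size = int(round(frame_length * sampling_rate))  # samples per frame
shift = int(round(frame_shift * sampling_rate))         # samples between frames

# round_to_power_of_two=True pads the window up to the next power of two.
fft_size = next_power_of_two(window_size)
num_fft_bins = fft_size // 2 + 1

# Assumed Kaldi rule for snip_edges=False: round(num_samples / shift).
num_frames = (num_samples + shift // 2) // shift
```

Under these assumptions the 25 ms window is 400 samples, which is padded to a 512-point FFT, giving 257 bins and 100 frames for one second of 16 kHz audio.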
The input is a tensor of shape (batch_size, num_samples). The output is a tensor of shape (batch_size, num_frames, num_fft_bins).
- __init__(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, round_to_power_of_two=True, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=True, use_fft_mag=False)[source]
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- T_destination = ~T_destination
- add_module(name, module)
Add a child module to the current module.
The module can be accessed as an attribute using the given name.
- Return type:
None
- Args:
name (str): name of the child module. The child module can be accessed from this module using the given name
module (Module): child module to be added to the module.
- apply(fn)
Apply fn recursively to every submodule (as returned by .children()) as well as self.
Typical use includes initializing the parameters of a model (see also torch.nn.init).
- Return type:
Self
- Args:
fn (Module -> None): function to be applied to each submodule
- Returns:
Module: self
Example:
>>> @torch.no_grad()
>>> def init_weights(m):
>>>     print(m)
>>>     if type(m) is nn.Linear:
>>>         m.weight.fill_(1.0)
>>>         print(m.weight)
>>> net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
>>> net.apply(init_weights)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
- bfloat16()
Casts all floating point parameters and buffers to bfloat16 datatype.
Note
This method modifies the module in-place.
- Return type:
Self
- Returns:
Module: self
- buffers(recurse=True)
Return an iterator over module buffers.
- Return type:
Iterator[Tensor]
- Args:
recurse (bool): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module.
- Yields:
torch.Tensor: module buffer
Example:
>>> for buf in model.buffers():
>>>     print(type(buf), buf.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
- call_super_init: bool = False
- children()
Return an iterator over immediate children modules.
- Return type:
Iterator[Module]
- Yields:
Module: a child module
- compile(*args, **kwargs)
Compile this Module’s forward using torch.compile().
This Module’s __call__ method is compiled and all arguments are passed as-is to torch.compile().
See torch.compile() for details on the arguments for this function.
- Return type:
None
- cpu()
Move all model parameters and buffers to the CPU.
Note
This method modifies the module in-place.
- Return type:
Self
- Returns:
Module: self
- cuda(device=None)
Move all model parameters and buffers to the GPU.
This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on GPU while being optimized.
Note
This method modifies the module in-place.
- Return type:
Self
- Args:
device (int, optional): if specified, all parameters will be copied to that device
- Returns:
Module: self
- property dither: float
- double()
Casts all floating point parameters and buffers to double datatype.
Note
This method modifies the module in-place.
- Return type:
Self
- Returns:
Module: self
- dump_patches: bool = False
- eval()
Set the module in evaluation mode.
This has an effect only on certain modules. See the documentation of particular modules for details of their behaviors in training/evaluation mode, i.e. whether they are affected, e.g. Dropout, BatchNorm, etc.
This is equivalent to self.train(False).
See the PyTorch notes on locally disabling gradient computation for a comparison between .eval() and several similar mechanisms that may be confused with it.
- Return type:
Self
- Returns:
Module: self
- extra_repr()
Return the extra representation of the module.
To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.
- Return type:
str
- float()
Casts all floating point parameters and buffers to float datatype.
Note
This method modifies the module in-place.
- Return type:
Self
- Returns:
Module: self
- forward(x)
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- Return type:
Tensor
- property frame_length: float
- property frame_shift: float
- get_buffer(target)
Return the buffer given by target if it exists, otherwise throw an error.
See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.
- Return type:
Tensor
- Args:
target: The fully-qualified string name of the buffer to look for. (See get_submodule for how to specify a fully-qualified string.)
- Returns:
torch.Tensor: The buffer referenced by target
- Raises:
AttributeError: If the target string references an invalid path or resolves to something that is not a buffer
- get_extra_state()
Return any extra state to include in the module’s state_dict.
Implement this and a corresponding set_extra_state() for your module if you need to store extra state. This function is called when building the module’s state_dict().
Note that extra state should be picklable to ensure working serialization of the state_dict. We only provide backwards compatibility guarantees for serializing Tensors; other objects may break backwards compatibility if their serialized pickled form changes.
- Return type:
Any
- Returns:
object: Any extra state to store in the module’s state_dict
- get_parameter(target)
Return the parameter given by target if it exists, otherwise throw an error.
See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.
- Return type:
Parameter
- Args:
target: The fully-qualified string name of the Parameter to look for. (See get_submodule for how to specify a fully-qualified string.)
- Returns:
torch.nn.Parameter: The Parameter referenced by target
- Raises:
AttributeError: If the target string references an invalid path or resolves to something that is not an nn.Parameter
- get_submodule(target)
Return the submodule given by target if it exists, otherwise throw an error.
For example, let’s say you have an nn.Module A that looks like this:
A(
    (net_b): Module(
        (net_c): Module(
            (conv): Conv2d(16, 33, kernel_size=(3, 3), stride=(2, 2))
        )
        (linear): Linear(in_features=100, out_features=200, bias=True)
    )
)
(The diagram shows an nn.Module A. A has a nested submodule net_b, which itself has two submodules net_c and linear. net_c then has a submodule conv.)
To check whether or not we have the linear submodule, we would call get_submodule("net_b.linear"). To check whether we have the conv submodule, we would call get_submodule("net_b.net_c.conv").
The runtime of get_submodule is bounded by the degree of module nesting in target. A query against named_modules achieves the same result, but it is O(N) in the number of transitive modules. So, for a simple check to see if some submodule exists, get_submodule should always be used.
- Return type:
Module
- Args:
target: The fully-qualified string name of the submodule to look for. (See above example for how to specify a fully-qualified string.)
- Returns:
torch.nn.Module: The submodule referenced by target
- Raises:
AttributeError: If at any point along the path resulting from the target string the (sub)path resolves to a non-existent attribute name or an object that is not an instance of nn.Module.
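The module tree from the description above can be built and queried as in the following sketch. The class A mirrors the diagram; the "net_b.missing" path is a hypothetical name used to trigger the AttributeError case.

```python
from torch import nn


class A(nn.Module):
    """Reconstruction of the nested layout shown in the diagram above."""

    def __init__(self):
        super().__init__()
        self.net_b = nn.Module()
        self.net_b.net_c = nn.Module()
        self.net_b.net_c.conv = nn.Conv2d(16, 33, kernel_size=3, stride=2)
        self.net_b.linear = nn.Linear(100, 200)


a = A()
conv = a.get_submodule("net_b.net_c.conv")
weight = a.get_parameter("net_b.linear.weight")

try:
    a.get_submodule("net_b.missing")
    has_missing = True
except AttributeError:
    has_missing = False
```

The same dotted-path syntax is shared by get_submodule, get_parameter, and get_buffer.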
- half()
Casts all floating point parameters and buffers to half datatype.
Note
This method modifies the module in-place.
- Return type:
Self
- Returns:
Module: self
- ipu(device=None)
Move all model parameters and buffers to the IPU.
This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on IPU while being optimized.
Note
This method modifies the module in-place.
- Return type:
Self
- Arguments:
device (int, optional): if specified, all parameters will be copied to that device
- Returns:
Module: self
- load_state_dict(state_dict, strict=True, assign=False)
Copy parameters and buffers from state_dict into this module and its descendants.
If strict is True, then the keys of state_dict must exactly match the keys returned by this module’s state_dict() function.
Warning
If assign is True the optimizer must be created after the call to load_state_dict unless get_swap_module_params_on_conversion() is True.
- Args:
state_dict (dict): a dict containing parameters and persistent buffers.
strict (bool, optional): whether to strictly enforce that the keys in state_dict match the keys returned by this module’s state_dict() function. Default: True
assign (bool, optional): When set to False, the properties of the tensors in the current module are preserved whereas setting it to True preserves properties of the Tensors in the state dict. The only exception is the requires_grad field of Parameter for which the value from the module is preserved. Default: False
- Returns:
NamedTuple with missing_keys and unexpected_keys fields:
missing_keys is a list of str containing any keys that are expected by this module but missing from the provided state_dict.
unexpected_keys is a list of str containing the keys that are not expected by this module but present in the provided state_dict.
- Note:
If a parameter or buffer is registered as None and its corresponding key exists in state_dict, load_state_dict() will raise a RuntimeError.
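To make the returned NamedTuple concrete, the sketch below loads a deliberately mismatched state dict with strict=False. The "extra" key is a made-up name used to populate unexpected_keys.

```python
import torch
from torch import nn

layer = nn.Linear(2, 2)

# "bias" is left out on purpose and "extra" does not belong to the module.
partial = {"weight": torch.ones(2, 2), "extra": torch.zeros(1)}

# strict=False reports the mismatches instead of raising.
result = layer.load_state_dict(partial, strict=False)
missing = list(result.missing_keys)
unexpected = list(result.unexpected_keys)
```

With strict=True the same call would raise a RuntimeError listing both kinds of keys.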
- modules(remove_duplicate=True)
Return an iterator over all modules in the network.
- Return type:
Iterator[Module]
- Args:
remove_duplicate: whether to remove the duplicated module instances in the result or not.
- Yields:
Module: a module in the network
- Note:
Duplicate modules are returned only once by default. In the following example, l will be returned only once.
Example:
>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.modules()):
...     print(idx, '->', m)
0 -> Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
1 -> Linear(in_features=2, out_features=2, bias=True)
- mtia(device=None)
Move all model parameters and buffers to the MTIA.
This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on MTIA while being optimized.
Note
This method modifies the module in-place.
- Return type:
Self
- Arguments:
device (int, optional): if specified, all parameters will be copied to that device
- Returns:
Module: self
- named_buffers(prefix='', recurse=True, remove_duplicate=True)
Return an iterator over module buffers, yielding both the name of the buffer as well as the buffer itself.
- Return type:
Iterator[tuple[str,Tensor]]
- Args:
prefix (str): prefix to prepend to all buffer names.
recurse (bool, optional): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module. Defaults to True.
remove_duplicate (bool, optional): whether to remove the duplicated buffers in the result. Defaults to True.
- Yields:
(str, torch.Tensor): Tuple containing the name and buffer
Example:
>>> for name, buf in self.named_buffers():
>>>     if name in ['running_var']:
>>>         print(buf.size())
- named_children()
Return an iterator over immediate children modules, yielding both the name of the module as well as the module itself.
- Return type:
Iterator[tuple[str, Module]]
- Yields:
(str, Module): Tuple containing a name and child module
Example:
>>> for name, module in model.named_children():
>>>     if name in ['conv4', 'conv5']:
>>>         print(module)
- named_modules(memo=None, prefix='', remove_duplicate=True)
Return an iterator over all modules in the network, yielding both the name of the module as well as the module itself.
- Args:
memo: a memo to store the set of modules already added to the result
prefix: a prefix that will be added to the name of the module
remove_duplicate: whether to remove the duplicated module instances in the result or not
- Yields:
(str, Module): Tuple of name and module
- Note:
Duplicate modules are returned only once. In the following example, l will be returned only once.
Example:
>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.named_modules()):
...     print(idx, '->', m)
0 -> ('', Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
))
1 -> ('0', Linear(in_features=2, out_features=2, bias=True))
- named_parameters(prefix='', recurse=True, remove_duplicate=True)
Return an iterator over module parameters, yielding both the name of the parameter as well as the parameter itself.
- Return type:
Iterator[tuple[str,Parameter]]
- Args:
prefix (str): prefix to prepend to all parameter names.
recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.
remove_duplicate (bool, optional): whether to remove the duplicated parameters in the result. Defaults to True.
- Yields:
(str, Parameter): Tuple containing the name and parameter
Example:
>>> for name, param in self.named_parameters():
>>>     if name in ['bias']:
>>>         print(param.size())
- online_inference(x, context=None)
- Return type:
Tuple[Tensor,Tensor]
- parameters(recurse=True)
Return an iterator over module parameters.
This is typically passed to an optimizer.
- Return type:
Iterator[Parameter]
- Args:
recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.
- Yields:
Parameter: module parameter
Example:
>>> for param in model.parameters():
>>>     print(type(param), param.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
- property preemph_coeff: float
- register_backward_hook(hook)
Register a backward hook on the module.
This function is deprecated in favor of register_full_backward_hook() and the behavior of this function will change in future versions.
- Return type:
RemovableHandle
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
- register_buffer(name, tensor, persistent=True)
Add a buffer to the module.
This is typically used to register a buffer that should not be considered a model parameter. For example, BatchNorm’s running_mean is not a parameter, but is part of the module’s state. Buffers, by default, are persistent and will be saved alongside parameters. This behavior can be changed by setting persistent to False. The only difference between a persistent buffer and a non-persistent buffer is that the latter will not be a part of this module’s state_dict.
Buffers can be accessed as attributes using given names.
- Return type:
None
- Args:
name (str): name of the buffer. The buffer can be accessed from this module using the given name
tensor (Tensor or None): buffer to be registered. If None, then operations that run on buffers, such as cuda, are ignored. If None, the buffer is not included in the module’s state_dict.
persistent (bool): whether the buffer is part of this module’s state_dict.
Example:
>>> self.register_buffer('running_mean', torch.zeros(num_features))
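The difference between persistent and non-persistent buffers can be seen in a short sketch. The Norm module and the buffer name "scratch" are hypothetical.

```python
import torch
from torch import nn


class Norm(nn.Module):
    """Toy module (hypothetical) with one persistent and one non-persistent buffer."""

    def __init__(self):
        super().__init__()
        self.register_buffer("running_mean", torch.zeros(3))
        self.register_buffer("scratch", torch.zeros(3), persistent=False)


m = Norm()
saved_keys = list(m.state_dict().keys())            # only persistent buffers are saved
buffer_names = [name for name, _ in m.named_buffers()]  # both are still buffers
```

Both buffers follow the module across devices and dtypes; only the persistent one is written to checkpoints.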
- register_forward_hook(hook, *, prepend=False, with_kwargs=False, always_call=False)
Register a forward hook on the module.
The hook will be called every time after forward() has computed an output.
If with_kwargs is False or not specified, the input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks and only to the forward. The hook can modify the output. It can modify the input inplace but it will not have effect on forward since this is called after forward() is called. The hook should have the following signature:
hook(module, args, output) -> None or modified output
If with_kwargs is True, the forward hook will be passed the kwargs given to the forward function and be expected to return the output possibly modified. The hook should have the following signature:
hook(module, args, kwargs, output) -> None or modified output
- Return type:
RemovableHandle
- Args:
- hook (Callable): The user defined hook to be registered.
- prepend (bool): If True, the provided hook will be fired before all existing forward hooks on this torch.nn.Module. Otherwise, the provided hook will be fired after all existing forward hooks on this torch.nn.Module. Note that global forward hooks registered with register_module_forward_hook() will fire before all hooks registered by this method. Default: False
- with_kwargs (bool): If True, the hook will be passed the kwargs given to the forward function. Default: False
- always_call (bool): If True, the hook will be run regardless of whether an exception is raised while calling the Module. Default: False
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
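As a rough sketch of the default (with_kwargs=False) signature, the hook below records the output shape and returns None, which leaves the output untouched. The names save_output_shape and captured are assumptions made for this example.

```python
import torch
from torch import nn

captured = {}


def save_output_shape(module, args, output):
    # Returning None keeps the module output unchanged.
    captured["shape"] = tuple(output.shape)


layer = nn.Linear(3, 2)
handle = layer.register_forward_hook(save_output_shape)

layer(torch.zeros(5, 3))
captured_while_registered = dict(captured)

# After removal the hook no longer fires.
handle.remove()
captured.clear()
layer(torch.zeros(7, 3))
print(captured_while_registered, captured)
```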
- register_forward_pre_hook(hook, *, prepend=False, with_kwargs=False)
Register a forward pre-hook on the module.
The hook will be called every time before forward() is invoked.
If with_kwargs is false or not specified, the input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks and only to the forward. The hook can modify the input. User can either return a tuple or a single modified value in the hook. We will wrap the value into a tuple if a single value is returned (unless that value is already a tuple). The hook should have the following signature:
hook(module, args) -> None or modified input
If with_kwargs is true, the forward pre-hook will be passed the kwargs given to the forward function. And if the hook modifies the input, both the args and kwargs should be returned. The hook should have the following signature:
hook(module, args, kwargs) -> None or a tuple of modified input and kwargs
- Return type:
RemovableHandle
- Args:
- hook (Callable): The user defined hook to be registered.
- prepend (bool): If true, the provided hook will be fired before all existing forward_pre hooks on this torch.nn.Module. Otherwise, the provided hook will be fired after all existing forward_pre hooks on this torch.nn.Module. Note that global forward_pre hooks registered with register_module_forward_pre_hook() will fire before all hooks registered by this method. Default: False
- with_kwargs (bool): If true, the hook will be passed the kwargs given to the forward function. Default: False
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
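A small sketch of a pre-hook that rewrites the positional input before forward() runs. nn.Identity is used so the effect is visible directly in the output; the hook name double_input is hypothetical.

```python
import torch
from torch import nn


def double_input(module, args):
    # args is the tuple of positional inputs; return a tuple of replacements.
    (x,) = args
    return (x * 2,)


identity = nn.Identity()
handle = identity.register_forward_pre_hook(double_input)
hooked_out = identity(torch.ones(2))

handle.remove()
plain_out = identity(torch.ones(2))
print(hooked_out, plain_out)
```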
- register_full_backward_hook(hook, prepend=False)
Register a backward hook on the module.
The hook will be called every time the gradients with respect to a module are computed, and its firing rules are as follows:
Ordinarily, the hook fires when the gradients are computed with respect to the module inputs.
If none of the module inputs require gradients, the hook will fire when the gradients are computed with respect to module outputs.
If none of the module outputs require gradients, then the hooks will not fire.
The hook should have the following signature:
hook(module, grad_input, grad_output) -> tuple(Tensor) or None
The grad_input and grad_output are tuples that contain the gradients with respect to the inputs and outputs respectively. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the input that will be used in place of grad_input in subsequent computations. grad_input will only correspond to the inputs given as positional arguments and all kwarg arguments are ignored. Entries in grad_input and grad_output will be None for all non-Tensor arguments.
For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly the caller will receive a view of each Tensor returned by the Module’s forward function.
Warning
Modifying inputs or outputs inplace is not allowed when using backward hooks and will raise an error.
- Return type:
RemovableHandle
- Args:
- hook (Callable): The user-defined hook to be registered.
- prepend (bool): If true, the provided hook will be fired before all existing backward hooks on this torch.nn.Module. Otherwise, the provided hook will be fired after all existing backward hooks on this torch.nn.Module. Note that global backward hooks registered with register_module_full_backward_hook() will fire before all hooks registered by this method.
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
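The signature above can be exercised with a minimal sketch that only observes grad_output and returns None. The names record_grad and grads are assumptions for this example; the input is created with requires_grad=True so that the hook fires under the first rule listed above.

```python
import torch
from torch import nn

grads = []


def record_grad(module, grad_input, grad_output):
    # Observe the gradient w.r.t. the module output; returning None changes nothing.
    grads.append(grad_output[0].clone())


layer = nn.Linear(2, 1)
handle = layer.register_full_backward_hook(record_grad)

x = torch.ones(4, 2, requires_grad=True)
layer(x).sum().backward()
handle.remove()
print(grads[0].shape)
```

Because the loss is a plain sum over the outputs, the recorded gradient with respect to the output is a tensor of ones with the output's shape.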
- register_full_backward_pre_hook(hook, prepend=False)
Register a backward pre-hook on the module.
The hook will be called every time the gradients for the module are computed. The hook should have the following signature:
hook(module, grad_output) -> tuple[Tensor, ...], Tensor or None
The grad_output is a tuple. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the output that will be used in place of grad_output in subsequent computations. Entries in grad_output will be None for all non-Tensor arguments.
For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly the caller will receive a view of each Tensor returned by the Module’s forward function.
Warning
Modifying inputs inplace is not allowed when using backward hooks and will raise an error.
- Return type:
RemovableHandle
- Args:
- hook (Callable): The user-defined hook to be registered.
- prepend (bool): If true, the provided hook will be fired before all existing backward_pre hooks on this torch.nn.Module. Otherwise, the provided hook will be fired after all existing backward_pre hooks on this torch.nn.Module. Note that global backward_pre hooks registered with register_module_full_backward_pre_hook() will fire before all hooks registered by this method.
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
- register_load_state_dict_post_hook(hook)
Register a post-hook to be run after module’s load_state_dict() is called.
It should have the following signature:
hook(module, incompatible_keys) -> None
The module argument is the current module that this hook is registered on, and the incompatible_keys argument is a NamedTuple consisting of attributes missing_keys and unexpected_keys. missing_keys is a list of str containing the missing keys and unexpected_keys is a list of str containing the unexpected keys.
The given incompatible_keys can be modified inplace if needed.
Note that the checks performed when calling load_state_dict() with strict=True are affected by modifications the hook makes to missing_keys or unexpected_keys, as expected. Additions to either set of keys will result in an error being thrown when strict=True, and clearing out both missing and unexpected keys will avoid an error.
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
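The in-place modification described above might be used as in the following sketch, where a hook (hypothetically named ignore_missing_bias) removes one entry from missing_keys so that a strict load of a partial state dict does not raise. This is only an illustration of the mechanism, not a recommended way to load checkpoints.

```python
import torch
from torch import nn


def ignore_missing_bias(module, incompatible_keys):
    # Editing the list in place affects the strict=True check that follows.
    if "bias" in incompatible_keys.missing_keys:
        incompatible_keys.missing_keys.remove("bias")


layer = nn.Linear(2, 2)
handle = layer.register_load_state_dict_post_hook(ignore_missing_bias)

partial_state = {"weight": torch.zeros(2, 2)}  # no "bias" entry
result = layer.load_state_dict(partial_state, strict=True)
handle.remove()
print(result.missing_keys, result.unexpected_keys)
```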
- register_load_state_dict_pre_hook(hook)
Register a pre-hook to be run before module’s load_state_dict() is called.
It should have the following signature:
hook(module, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs) -> None
- Arguments:
- hook (Callable): Callable hook that will be invoked before loading the state dict.
- register_module(name, module)
Alias for add_module().
- Return type:
None
- register_parameter(name, param)
Add a parameter to the module.
The parameter can be accessed as an attribute using given name.
- Return type:
None
- Args:
- name (str): name of the parameter. The parameter can be accessed from this module using the given name
- param (Parameter or None): parameter to be added to the module. If None, then operations that run on parameters, such as cuda, are ignored. If None, the parameter is not included in the module’s state_dict.
- register_state_dict_post_hook(hook)
Register a post-hook for the state_dict() method.
It should have the following signature:
hook(module, state_dict, prefix, local_metadata) -> None
The registered hooks can modify the state_dict inplace.
- register_state_dict_pre_hook(hook)
Register a pre-hook for the state_dict() method.
It should have the following signature:
hook(module, prefix, keep_vars) -> None
The registered hooks can be used to perform pre-processing before the state_dict call is made.
- property remove_dc_offset: bool
- requires_grad_(requires_grad=True)
Change if autograd should record operations on parameters in this module.
This method sets the parameters’ requires_grad attributes in-place.
This method is helpful for freezing part of the module for finetuning or training parts of a model individually (e.g., GAN training).
See locally-disable-grad-doc for a comparison between .requires_grad_() and several similar mechanisms that may be confused with it.
- Return type:
Self
- Args:
- requires_grad (bool): whether autograd should record operations on parameters in this module. Default: True.
- Returns:
Module: self
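Freezing part of a model for fine-tuning, as mentioned above, could look like the sketch below. The two-layer nn.Sequential is an arbitrary stand-in chosen for the example.

```python
from torch import nn

model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))

# Freeze the first layer; the call returns the module itself.
returned = model[0].requires_grad_(False)

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)
```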
- property sampling_rate: int
- set_extra_state(state)
Set extra state contained in the loaded state_dict.
This function is called from load_state_dict() to handle any extra state found within the state_dict. Implement this function and a corresponding get_extra_state() for your module if you need to store extra state within its state_dict.
- Return type:
None
- Args:
state (dict): Extra state from the state_dict
- set_submodule(target, module, strict=False)
Set the submodule given by target if it exists, otherwise throw an error.
Note
If strict is set to False (default), the method will replace an existing submodule or create a new submodule if the parent module exists. If strict is set to True, the method will only attempt to replace an existing submodule and throw an error if the submodule does not exist.
For example, let’s say you have an nn.Module A that looks like this:
A(
    (net_b): Module(
        (net_c): Module(
            (conv): Conv2d(3, 3, 3)
        )
        (linear): Linear(3, 3)
    )
)
(The diagram shows an nn.Module A. A has a nested submodule net_b, which itself has two submodules net_c and linear. net_c then has a submodule conv.)
To override the Conv2d with a new submodule Linear, you could call set_submodule("net_b.net_c.conv", nn.Linear(1, 1)) where strict could be True or False.
To add a new submodule Conv2d to the existing net_b module, you would call set_submodule("net_b.conv", nn.Conv2d(1, 1, 1)).
In the above, if you set strict=True and call set_submodule("net_b.conv", nn.Conv2d(1, 1, 1), strict=True), an AttributeError will be raised because net_b does not have a submodule named conv.
- Return type:
None
- Args:
- target: The fully-qualified string name of the submodule to look for. (See above example for how to specify a fully-qualified string.)
- module: The module to set the submodule to.
- strict: If False, the method will replace an existing submodule or create a new submodule if the parent module exists. If True, the method will only attempt to replace an existing submodule and throw an error if the submodule doesn’t already exist.
- Raises:
- ValueError: If the target string is empty or if module is not an instance of nn.Module.
- AttributeError: If at any point along the path resulting from the target string the (sub)path resolves to a non-existent attribute name or an object that is not an instance of nn.Module.
- share_memory()
See torch.Tensor.share_memory_().
- Return type:
Self
- state_dict(*args, destination=None, prefix='', keep_vars=False)
Return a dictionary containing references to the whole state of the module.
Both parameters and persistent buffers (e.g. running averages) are included. Keys are corresponding parameter and buffer names. Parameters and buffers set to None are not included.
Note
The returned object is a shallow copy. It contains references to the module’s parameters and buffers.
Warning
Currently state_dict() also accepts positional arguments for destination, prefix and keep_vars in order. However, this is being deprecated and keyword arguments will be enforced in future releases.
Warning
Please avoid the use of argument destination as it is not designed for end-users.
- Args:
- destination (dict, optional): If provided, the state of module will be updated into the dict and the same object is returned. Otherwise, an OrderedDict will be created and returned. Default: None.
- prefix (str, optional): a prefix added to parameter and buffer names to compose the keys in state_dict. Default: ''.
- keep_vars (bool, optional): by default the Tensors returned in the state dict are detached from autograd. If it’s set to True, detaching will not be performed. Default: False.
- Returns:
- dict: a dictionary containing a whole state of the module
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> module.state_dict().keys()
['bias', 'weight']
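The effect of prefix and keep_vars can be seen in a short sketch. The prefix string "enc." is an arbitrary value chosen for illustration.

```python
from torch import nn

layer = nn.Linear(2, 1)

plain = layer.state_dict()
prefixed = layer.state_dict(prefix="enc.")
attached = layer.state_dict(keep_vars=True)

print(list(plain.keys()))
print(list(prefixed.keys()))
# By default tensors are detached; keep_vars=True returns the parameters themselves.
print(plain["weight"].requires_grad, attached["weight"].requires_grad)
```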
- to(*args, **kwargs)
Move and/or cast the parameters and buffers.
This can be called as
- to(device=None, dtype=None, non_blocking=False)
- to(dtype, non_blocking=False)
- to(tensor, non_blocking=False)
- to(memory_format=torch.channels_last)
Its signature is similar to torch.Tensor.to(), but only accepts floating point or complex dtypes. In addition, this method will only cast the floating point or complex parameters and buffers to dtype (if given). The integral parameters and buffers will be moved to device, if that is given, but with dtypes unchanged. When non_blocking is set, it tries to convert/move asynchronously with respect to the host if possible, e.g., moving CPU Tensors with pinned memory to CUDA devices.
See below for examples.
Note
This method modifies the module in-place.
- Args:
- device (torch.device): the desired device of the parameters and buffers in this module
- dtype (torch.dtype): the desired floating point or complex dtype of the parameters and buffers in this module
- tensor (torch.Tensor): Tensor whose dtype and device are the desired dtype and device for all parameters and buffers in this module
- memory_format (torch.memory_format): the desired memory format for 4D parameters and buffers in this module (keyword only argument)
- Returns:
Module: self
Examples:
>>> # xdoctest: +IGNORE_WANT("non-deterministic")
>>> linear = nn.Linear(2, 2)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]])
>>> linear.to(torch.double)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]], dtype=torch.float64)
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_CUDA1)
>>> gpu1 = torch.device("cuda:1")
>>> linear.to(gpu1, dtype=torch.half, non_blocking=True)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16, device='cuda:1')
>>> cpu = torch.device("cpu")
>>> linear.to(cpu)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16)
>>> linear = nn.Linear(2, 2, bias=None).to(torch.cdouble)
>>> linear.weight
Parameter containing:
tensor([[ 0.3741+0.j,  0.2382+0.j],
        [ 0.5593+0.j, -0.4443+0.j]], dtype=torch.complex128)
>>> linear(torch.ones(3, 2, dtype=torch.cdouble))
tensor([[0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j]], dtype=torch.complex128)
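The statement that only floating point or complex tensors are cast, while integral ones keep their dtype, can be checked with a small sketch. The module WithCounter and its steps buffer are hypothetical names used only here.

```python
import torch
from torch import nn


class WithCounter(nn.Module):  # hypothetical module, for illustration only
    def __init__(self) -> None:
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(2))
        # Integral buffer: should keep its dtype under .to(dtype).
        self.register_buffer("steps", torch.zeros(1, dtype=torch.long))


module = WithCounter()
returned = module.to(torch.float64)
print(module.weight.dtype, module.steps.dtype)
```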
- to_empty(*, device, recurse=True)
Move the parameters and buffers to the specified device without copying storage.
- Return type:
Self
- Args:
- device (torch.device): The desired device of the parameters and buffers in this module.
- recurse (bool): Whether parameters and buffers of submodules should be recursively moved to the specified device.
- Returns:
Module: self
- train(mode=True)
Set the module in training mode.
This has an effect only on certain modules. See the documentation of particular modules for details of their behaviors in training/evaluation mode, i.e., whether they are affected, e.g. Dropout, BatchNorm, etc.
- Return type:
Self
- Args:
- mode (bool): whether to set training mode (True) or evaluation mode (False). Default: True.
- Returns:
Module: self
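A minimal sketch of how the mode flag propagates to submodules; the nn.Sequential with a Dropout layer is an arbitrary example model.

```python
from torch import nn

model = nn.Sequential(nn.Linear(2, 2), nn.Dropout(p=0.5))

model.train(False)  # evaluation mode
flags_eval = (model.training, model[1].training)

model.train()  # back to training mode (mode=True is the default)
flags_train = (model.training, model[1].training)
print(flags_eval, flags_train)
```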
- type(dst_type)
Casts all parameters and buffers to dst_type.
Note
This method modifies the module in-place.
- Return type:
Self
- Args:
dst_type (type or string): the desired type
- Returns:
Module: self
- property window_type: str
- xpu(device=None)
Move all model parameters and buffers to the XPU.
This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on XPU while being optimized.
Note
This method modifies the module in-place.
- Return type:
Self
- Arguments:
- device (int, optional): if specified, all parameters will be
copied to that device
- Returns:
Module: self
- zero_grad(set_to_none=True)
Reset gradients of all model parameters.
See similar function under torch.optim.Optimizer for more context.
- Return type:
None
- Args:
- set_to_none (bool): instead of setting to zero, set the grads to None. See torch.optim.Optimizer.zero_grad() for details.
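The two settings of set_to_none behave as in the sketch below. The argument is passed explicitly in both calls so the example does not depend on the default.

```python
import torch
from torch import nn

layer = nn.Linear(2, 1)

layer(torch.ones(1, 2)).sum().backward()
layer.zero_grad(set_to_none=True)
grad_after_none = layer.weight.grad  # gradients are dropped entirely

layer(torch.ones(1, 2)).sum().backward()
layer.zero_grad(set_to_none=False)
grad_after_zero = layer.weight.grad  # gradients are kept but filled with zeros
print(grad_after_none, grad_after_zero)
```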
- training: bool
- class lhotse.features.kaldi.layers.Wav2LogSpec(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, round_to_power_of_two=True, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=True, use_fft_mag=False)[source]
Apply standard Kaldi preprocessing (dithering, removing DC offset, pre-emphasis, etc.) on the input waveforms and compute their Short-Time Fourier Transform (STFT). The STFT is transformed either to a log-magnitude spectrum (use_fft_mag=True) or a log-power spectrum (use_fft_mag=False).
Example:
>>> x = torch.randn(1, 16000, dtype=torch.float32)
>>> x.shape
torch.Size([1, 16000])
>>> t = Wav2LogSpec()
>>> t(x).shape
torch.Size([1, 100, 257])
The input is a tensor of shape (batch_size, num_samples). The output is a tensor of shape (batch_size, num_frames, num_fft_bins).
- __init__(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, round_to_power_of_two=True, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=True, use_fft_mag=False)[source]
Initialize internal Module state, shared by both nn.Module and ScriptModule.
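The output shape (1, 100, 257) in the class example can be reproduced with simple arithmetic. The sketch below assumes the usual Kaldi conventions (window padded to the next power of two, and with snip_edges=False the frame count is the number of samples divided by the frame shift, rounded to nearest); it is not taken from the lhotse implementation, and the helper names are hypothetical.

```python
def next_power_of_two(n: int) -> int:
    # Smallest power of two that is >= n.
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def expected_shape(
    num_samples: int,
    sampling_rate: int = 16000,
    frame_length: float = 0.025,
    frame_shift: float = 0.01,
    round_to_power_of_two: bool = True,
) -> tuple:
    window = round(sampling_rate * frame_length)  # samples per frame
    shift = round(sampling_rate * frame_shift)  # samples between frame starts
    fft_size = next_power_of_two(window) if round_to_power_of_two else window
    # Assumed Kaldi rule for snip_edges=False: round(num_samples / shift).
    num_frames = (num_samples + shift // 2) // shift
    num_fft_bins = fft_size // 2 + 1
    return num_frames, num_fft_bins


shape = expected_shape(16000)
print(shape)
```

With the default arguments, a 25 ms window at 16 kHz is 400 samples, padded to a 512-point FFT, which gives 257 bins; a 10 ms shift is 160 samples, which gives 100 frames for one second of audio.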
- T_destination = ~T_destination
- add_module(name, module)
Add a child module to the current module.
The module can be accessed as an attribute using the given name.
- Return type:
None
- Args:
- name (str): name of the child module. The child module can be
accessed from this module using the given name
module (Module): child module to be added to the module.
- apply(fn)
Apply fn recursively to every submodule (as returned by .children()) as well as self.
Typical use includes initializing the parameters of a model (see also nn-init-doc).
- Return type:
Self
- Args:
- fn (Module -> None): function to be applied to each submodule
- Returns:
Module: self
Example:
>>> @torch.no_grad()
>>> def init_weights(m):
>>>     print(m)
>>>     if type(m) is nn.Linear:
>>>         m.weight.fill_(1.0)
>>>         print(m.weight)
>>> net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
>>> net.apply(init_weights)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
- bfloat16()
Casts all floating point parameters and buffers to bfloat16 datatype.
Note
This method modifies the module in-place.
- Return type:
Self
- Returns:
Module: self
- buffers(recurse=True)
Return an iterator over module buffers.
- Return type:
Iterator[Tensor]
- Args:
- recurse (bool): if True, then yields buffers of this module
and all submodules. Otherwise, yields only buffers that are direct members of this module.
- Yields:
torch.Tensor: module buffer
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> for buf in model.buffers():
>>>     print(type(buf), buf.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
- call_super_init: bool = False
- children()
Return an iterator over immediate children modules.
- Return type:
Iterator[Module]
- Yields:
Module: a child module
- compile(*args, **kwargs)
Compile this Module’s forward using torch.compile().
This Module’s __call__ method is compiled and all arguments are passed as-is to torch.compile().
See torch.compile() for details on the arguments for this function.
- Return type:
None
- cpu()
Move all model parameters and buffers to the CPU.
Note
This method modifies the module in-place.
- Return type:
Self
- Returns:
Module: self
- cuda(device=None)
Move all model parameters and buffers to the GPU.
This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on GPU while being optimized.
Note
This method modifies the module in-place.
- Return type:
Self
- Args:
- device (int, optional): if specified, all parameters will be
copied to that device
- Returns:
Module: self
- property dither: float
- double()
Casts all floating point parameters and buffers to double datatype.
Note
This method modifies the module in-place.
- Return type:
Self
- Returns:
Module: self
- dump_patches: bool = False
- eval()
Set the module in evaluation mode.
This has an effect only on certain modules. See the documentation of particular modules for details of their behaviors in training/evaluation mode, i.e. whether they are affected, e.g. Dropout, BatchNorm, etc.
This is equivalent to self.train(False).
See locally-disable-grad-doc for a comparison between .eval() and several similar mechanisms that may be confused with it.
- Return type:
Self
- Returns:
Module: self
- extra_repr()
Return the extra representation of the module.
To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.
- Return type:
str
- float()
Casts all floating point parameters and buffers to float datatype.
Note
This method modifies the module in-place.
- Return type:
Self
- Returns:
Module: self
- forward(x)
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- Return type:
Tensor
- property frame_length: float
- property frame_shift: float
- get_buffer(target)
Return the buffer given by target if it exists, otherwise throw an error.
See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.
- Return type:
Tensor
- Args:
- target: The fully-qualified string name of the buffer to look for. (See get_submodule for how to specify a fully-qualified string.)
- Returns:
torch.Tensor: The buffer referenced by target
- Raises:
- AttributeError: If the target string references an invalid path or resolves to something that is not a buffer
- get_extra_state()
Return any extra state to include in the module’s state_dict.
Implement this and a corresponding set_extra_state() for your module if you need to store extra state. This function is called when building the module’s state_dict().
Note that extra state should be picklable to ensure working serialization of the state_dict. We only provide backwards compatibility guarantees for serializing Tensors; other objects may break backwards compatibility if their serialized pickled form changes.
- Return type:
Any
- Returns:
object: Any extra state to store in the module’s state_dict
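A minimal sketch of a get_extra_state() / set_extra_state() pair that carries a plain Python value through the state_dict. The class Versioned and its version attribute are hypothetical and exist only for this illustration.

```python
from torch import nn


class Versioned(nn.Module):  # hypothetical module, for illustration only
    def __init__(self, version: int = 0) -> None:
        super().__init__()
        self.version = version

    def get_extra_state(self):
        # Called when building state_dict(); the value must be picklable.
        return {"version": self.version}

    def set_extra_state(self, state) -> None:
        # Called from load_state_dict() with the value stored above.
        self.version = state["version"]


source = Versioned(version=3)
target = Versioned()
target.load_state_dict(source.state_dict())
print(list(source.state_dict().keys()), target.version)
```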
- get_parameter(target)
Return the parameter given by target if it exists, otherwise throw an error.
See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.
- Return type:
Parameter
- Args:
- target: The fully-qualified string name of the Parameter to look for. (See get_submodule for how to specify a fully-qualified string.)
- Returns:
torch.nn.Parameter: The Parameter referenced by target
- Raises:
- AttributeError: If the target string references an invalid path or resolves to something that is not an nn.Parameter
- get_submodule(target)
Return the submodule given by target if it exists, otherwise throw an error.
For example, let’s say you have an nn.Module A that looks like this:
A(
    (net_b): Module(
        (net_c): Module(
            (conv): Conv2d(16, 33, kernel_size=(3, 3), stride=(2, 2))
        )
        (linear): Linear(in_features=100, out_features=200, bias=True)
    )
)
(The diagram shows an nn.Module A. A has a nested submodule net_b, which itself has two submodules net_c and linear. net_c then has a submodule conv.)
To check whether or not we have the linear submodule, we would call get_submodule("net_b.linear"). To check whether we have the conv submodule, we would call get_submodule("net_b.net_c.conv").
The runtime of get_submodule is bounded by the degree of module nesting in target. A query against named_modules achieves the same result, but it is O(N) in the number of transitive modules. So, for a simple check to see if some submodule exists, get_submodule should always be used.
- Return type:
Module
- Args:
- target: The fully-qualified string name of the submodule to look for. (See above example for how to specify a fully-qualified string.)
- Returns:
torch.nn.Module: The submodule referenced by target
- Raises:
- AttributeError: If at any point along the path resulting from the target string the (sub)path resolves to a non-existent attribute name or an object that is not an instance of nn.Module.
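The nested module A described above can be built and queried as in the following sketch. The way A is constructed here (plain nn.Module containers with attributes assigned in __init__) is just one possible arrangement that yields the same hierarchy.

```python
from torch import nn


class A(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.net_b = nn.Module()
        self.net_b.net_c = nn.Module()
        self.net_b.net_c.conv = nn.Conv2d(16, 33, kernel_size=3, stride=2)
        self.net_b.linear = nn.Linear(100, 200)


a = A()
conv = a.get_submodule("net_b.net_c.conv")
linear = a.get_submodule("net_b.linear")

# A non-existent path raises AttributeError.
try:
    a.get_submodule("net_b.missing")
    missing_raised = False
except AttributeError:
    missing_raised = True
print(type(conv).__name__, type(linear).__name__, missing_raised)
```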
- half()
Casts all floating point parameters and buffers to half datatype.
Note
This method modifies the module in-place.
- Return type:
Self
- Returns:
Module: self
- ipu(device=None)
Move all model parameters and buffers to the IPU.
This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on IPU while being optimized.
Note
This method modifies the module in-place.
- Return type:
Self
- Arguments:
- device (int, optional): if specified, all parameters will be
copied to that device
- Returns:
Module: self
- load_state_dict(state_dict, strict=True, assign=False)
Copy parameters and buffers from state_dict into this module and its descendants.
If strict is True, then the keys of state_dict must exactly match the keys returned by this module’s state_dict() function.
Warning
If assign is True the optimizer must be created after the call to load_state_dict unless get_swap_module_params_on_conversion() is True.
- Args:
- state_dict (dict): a dict containing parameters and persistent buffers.
- strict (bool, optional): whether to strictly enforce that the keys in state_dict match the keys returned by this module’s state_dict() function. Default: True
- assign (bool, optional): When set to False, the properties of the tensors in the current module are preserved whereas setting it to True preserves properties of the Tensors in the state dict. The only exception is the requires_grad field of Parameter for which the value from the module is preserved. Default: False
- Returns:
NamedTuple with missing_keys and unexpected_keys fields:
- missing_keys is a list of str containing any keys that are expected by this module but missing from the provided state_dict.
- unexpected_keys is a list of str containing the keys that are not expected by this module but present in the provided state_dict.
- Note:
If a parameter or buffer is registered as None and its corresponding key exists in state_dict, load_state_dict() will raise a RuntimeError.
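The returned NamedTuple is easiest to see with strict=False, where mismatched keys are reported instead of raising. The key name "extra" below is made up for the example.

```python
import torch
from torch import nn

layer = nn.Linear(2, 2)
state = {"weight": torch.ones(2, 2), "extra": torch.zeros(1)}  # no "bias", one stray key

result = layer.load_state_dict(state, strict=False)
print(result.missing_keys, result.unexpected_keys)

# With strict=True the same mismatch raises a RuntimeError.
try:
    layer.load_state_dict(state, strict=True)
    strict_raised = False
except RuntimeError:
    strict_raised = True
```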
- modules(remove_duplicate=True)
Return an iterator over all modules in the network.
- Return type:
Iterator[Module]
- Args:
- remove_duplicate: whether to remove the duplicated module instances in the result
or not.
- Yields:
Module: a module in the network
- Note:
Duplicate modules are returned only once by default. In the following example, l will be returned only once.
Example:
>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.modules()):
...     print(idx, '->', m)
0 -> Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
1 -> Linear(in_features=2, out_features=2, bias=True)
- mtia(device=None)
Move all model parameters and buffers to the MTIA.
This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on MTIA while being optimized.
Note
This method modifies the module in-place.
- Return type:
Self
- Arguments:
- device (int, optional): if specified, all parameters will be
copied to that device
- Returns:
Module: self
- named_buffers(prefix='', recurse=True, remove_duplicate=True)
Return an iterator over module buffers, yielding both the name of the buffer as well as the buffer itself.
- Return type:
Iterator[tuple[str,Tensor]]
- Args:
- prefix (str): prefix to prepend to all buffer names.
- recurse (bool, optional): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module. Defaults to True.
- remove_duplicate (bool, optional): whether to remove the duplicated buffers in the result. Defaults to True.
- Yields:
(str, torch.Tensor): Tuple containing the name and buffer
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> for name, buf in self.named_buffers():
>>>     if name in ['running_var']:
>>>         print(buf.size())
- named_children()
Return an iterator over immediate children modules, yielding both the name of the module as well as the module itself.
- Return type:
Iterator[tuple[str, Module]]
- Yields:
(str, Module): Tuple containing a name and child module
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> for name, module in model.named_children():
>>>     if name in ['conv4', 'conv5']:
>>>         print(module)
- named_modules(memo=None, prefix='', remove_duplicate=True)
Return an iterator over all modules in the network, yielding both the name of the module as well as the module itself.
- Args:
- memo: a memo to store the set of modules already added to the result
- prefix: a prefix that will be added to the name of the module
- remove_duplicate: whether to remove the duplicated module instances in the result or not
- Yields:
(str, Module): Tuple of name and module
- Note:
Duplicate modules are returned only once. In the following example, l will be returned only once.
Example:
>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.named_modules()):
...     print(idx, '->', m)
0 -> ('', Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
))
1 -> ('0', Linear(in_features=2, out_features=2, bias=True))
- named_parameters(prefix='', recurse=True, remove_duplicate=True)
Return an iterator over module parameters, yielding both the name of the parameter as well as the parameter itself.
- Return type:
Iterator[tuple[str,Parameter]]
- Args:
- prefix (str): prefix to prepend to all parameter names.
- recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.
- remove_duplicate (bool, optional): whether to remove the duplicated parameters in the result. Defaults to True.
- Yields:
(str, Parameter): Tuple containing the name and parameter
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> for name, param in self.named_parameters():
>>>     if name in ['bias']:
>>>         print(param.size())
- online_inference(x, context=None)
- Return type:
Tuple[Tensor,Tensor]
- parameters(recurse=True)
Return an iterator over module parameters.
This is typically passed to an optimizer.
- Return type:
Iterator[Parameter]
- Args:
- recurse (bool): if True, then yields parameters of this module
and all submodules. Otherwise, yields only parameters that are direct members of this module.
- Yields:
Parameter: module parameter
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> for param in model.parameters():
>>>     print(type(param), param.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
- property preemph_coeff: float
- register_backward_hook(hook)
Register a backward hook on the module.
This function is deprecated in favor of register_full_backward_hook(), and the behavior of this function will change in future versions.
- Return type:
RemovableHandle
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
- register_buffer(name, tensor, persistent=True)
Add a buffer to the module.
This is typically used to register a buffer that should not be considered a model parameter. For example, BatchNorm’s running_mean is not a parameter, but is part of the module’s state. Buffers, by default, are persistent and will be saved alongside parameters. This behavior can be changed by setting persistent to False. The only difference between a persistent buffer and a non-persistent buffer is that the latter will not be a part of this module’s state_dict.
Buffers can be accessed as attributes using given names.
- Return type:
None
- Args:
- name (str): name of the buffer. The buffer can be accessed from this module using the given name
- tensor (Tensor or None): buffer to be registered. If None, then operations that run on buffers, such as cuda, are ignored. If None, the buffer is not included in the module’s state_dict.
- persistent (bool): whether the buffer is part of this module’s state_dict.
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> self.register_buffer('running_mean', torch.zeros(num_features))
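The difference between persistent and non-persistent buffers can be sketched as follows. The class name RunningStats and the buffer name scratch are hypothetical and exist only for illustration; the calls themselves are the standard torch.nn.Module API.

```python
import torch
from torch import nn


class RunningStats(nn.Module):  # hypothetical module, for illustration only
    def __init__(self, num_features: int):
        super().__init__()
        # Persistent buffer: saved in state_dict alongside parameters.
        self.register_buffer("running_mean", torch.zeros(num_features))
        # Non-persistent buffer: still a buffer of the module,
        # but excluded from state_dict.
        self.register_buffer("scratch", torch.ones(num_features), persistent=False)


m = RunningStats(4)
state_keys = list(m.state_dict().keys())
buffer_names = [name for name, _ in m.named_buffers()]
print(state_keys)
print(buffer_names)
```

Both buffers follow the module through calls such as to() or cuda(); only the persistent one is written to a checkpoint.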
- register_forward_hook(hook, *, prepend=False, with_kwargs=False, always_call=False)
Register a forward hook on the module.
The hook will be called every time after forward() has computed an output.
If with_kwargs is False or not specified, the input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks and only to the forward. The hook can modify the output. It can modify the input in place, but this has no effect on forward because the hook is called after forward() has run. The hook should have the following signature:
hook(module, args, output) -> None or modified output
If with_kwargs is True, the forward hook will be passed the kwargs given to the forward function and be expected to return the output possibly modified. The hook should have the following signature:
hook(module, args, kwargs, output) -> None or modified output
- Return type:
RemovableHandle
- Args:
- hook (Callable): The user defined hook to be registered.
- prepend (bool): If True, the provided hook will be fired before all existing forward hooks on this torch.nn.Module. Otherwise, the provided hook will be fired after all existing forward hooks on this torch.nn.Module. Note that global forward hooks registered with register_module_forward_hook() will fire before all hooks registered by this method. Default: False
- with_kwargs (bool): If True, the hook will be passed the kwargs given to the forward function. Default: False
- always_call (bool): If True, the hook will be run regardless of whether an exception is raised while calling the Module. Default: False
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
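A minimal sketch of both hook signatures described above, assuming a PyTorch version that supports the with_kwargs argument. The hook names save_output_shape and double_output are hypothetical.

```python
import torch
from torch import nn

layer = nn.Linear(3, 2)
captured = {}


def save_output_shape(module, args, output):
    # Returning None leaves the output unchanged.
    captured["shape"] = tuple(output.shape)


def double_output(module, args, kwargs, output):
    # With with_kwargs=True the hook also receives the keyword arguments.
    # The returned value replaces the module output.
    return output * 2


h1 = layer.register_forward_hook(save_output_shape)
h2 = layer.register_forward_hook(double_output, with_kwargs=True)

x = torch.ones(4, 3)
hooked = layer(x)

# Remove both hooks through their handles, then run the plain module.
h1.remove()
h2.remove()
plain = layer(x)
```

After the handles are removed, the module behaves as if no hook had ever been registered.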
- register_forward_pre_hook(hook, *, prepend=False, with_kwargs=False)
Register a forward pre-hook on the module.
The hook will be called every time before forward() is invoked.
If with_kwargs is false or not specified, the input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks and only to the forward. The hook can modify the input. User can either return a tuple or a single modified value in the hook. We will wrap the value into a tuple if a single value is returned (unless that value is already a tuple). The hook should have the following signature:
hook(module, args) -> None or modified input
If with_kwargs is true, the forward pre-hook will be passed the kwargs given to the forward function. And if the hook modifies the input, both the args and kwargs should be returned. The hook should have the following signature:
hook(module, args, kwargs) -> None or a tuple of modified input and kwargs
- Return type:
RemovableHandle
- Args:
- hook (Callable): The user defined hook to be registered.
- prepend (bool): If true, the provided hook will be fired before all existing forward_pre hooks on this torch.nn.Module. Otherwise, the provided hook will be fired after all existing forward_pre hooks on this torch.nn.Module. Note that global forward_pre hooks registered with register_module_forward_pre_hook() will fire before all hooks registered by this method. Default: False
- with_kwargs (bool): If true, the hook will be passed the kwargs given to the forward function. Default: False
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
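As an illustration of the input-modifying behavior described above, the following sketch registers a pre-hook that replaces the input with zeros. The hook name zero_input is hypothetical.

```python
import torch
from torch import nn

layer = nn.Linear(3, 2)


def zero_input(module, args):
    # args is the tuple of positional inputs; a single returned value
    # is wrapped into a tuple before forward() is invoked.
    (x,) = args
    return x * 0.0


handle = layer.register_forward_pre_hook(zero_input)
out = layer(torch.ones(4, 3))
handle.remove()
```

Because the hook zeroes the input, every row of out reduces to the bias of the layer.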
- register_full_backward_hook(hook, prepend=False)
Register a backward hook on the module.
The hook will be called every time the gradients with respect to a module are computed, and its firing rules are as follows:
Ordinarily, the hook fires when the gradients are computed with respect to the module inputs.
If none of the module inputs require gradients, the hook will fire when the gradients are computed with respect to module outputs.
If none of the module outputs require gradients, then the hooks will not fire.
The hook should have the following signature:
hook(module, grad_input, grad_output) -> tuple(Tensor) or None
The grad_input and grad_output are tuples that contain the gradients with respect to the inputs and outputs respectively. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the input that will be used in place of grad_input in subsequent computations. grad_input will only correspond to the inputs given as positional arguments and all kwarg arguments are ignored. Entries in grad_input and grad_output will be None for all non-Tensor arguments.
For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly the caller will receive a view of each Tensor returned by the Module’s forward function.
Warning
Modifying inputs or outputs inplace is not allowed when using backward hooks and will raise an error.
- Return type:
RemovableHandle
- Args:
- hook (Callable): The user-defined hook to be registered.
- prepend (bool): If true, the provided hook will be fired before all existing backward hooks on this torch.nn.Module. Otherwise, the provided hook will be fired after all existing backward hooks on this torch.nn.Module. Note that global backward hooks registered with register_module_full_backward_hook() will fire before all hooks registered by this method.
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
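The hook signature above can be exercised with a small sketch that only records gradient shapes. The hook name record_grads is hypothetical; the input requires gradients, so the hook fires when gradients with respect to the module inputs are computed.

```python
import torch
from torch import nn

layer = nn.Linear(3, 2)
seen = {}


def record_grads(module, grad_input, grad_output):
    # Both arguments are tuples; entries may be None for non-Tensor inputs.
    seen["grad_input"] = [None if g is None else tuple(g.shape) for g in grad_input]
    seen["grad_output"] = [None if g is None else tuple(g.shape) for g in grad_output]
    # Returning None keeps grad_input unchanged.


handle = layer.register_full_backward_hook(record_grads)
x = torch.ones(4, 3, requires_grad=True)
layer(x).sum().backward()
handle.remove()
```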
- register_full_backward_pre_hook(hook, prepend=False)
Register a backward pre-hook on the module.
The hook will be called every time the gradients for the module are computed. The hook should have the following signature:
hook(module, grad_output) -> tuple[Tensor, ...], Tensor or None
The grad_output is a tuple. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the output that will be used in place of grad_output in subsequent computations. Entries in grad_output will be None for all non-Tensor arguments.
For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly the caller will receive a view of each Tensor returned by the Module’s forward function.
Warning
Modifying inputs inplace is not allowed when using backward hooks and will raise an error.
- Return type:
RemovableHandle
- Args:
- hook (Callable): The user-defined hook to be registered.
- prepend (bool): If true, the provided hook will be fired before all existing backward_pre hooks on this torch.nn.Module. Otherwise, the provided hook will be fired after all existing backward_pre hooks on this torch.nn.Module. Note that global backward_pre hooks registered with register_module_full_backward_pre_hook() will fire before all hooks registered by this method.
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
- register_load_state_dict_post_hook(hook)
Register a post-hook to be run after module’s load_state_dict() is called.
It should have the following signature:
hook(module, incompatible_keys) -> None
The module argument is the current module that this hook is registered on, and the incompatible_keys argument is a NamedTuple consisting of attributes missing_keys and unexpected_keys. missing_keys is a list of str containing the missing keys and unexpected_keys is a list of str containing the unexpected keys.
The given incompatible_keys can be modified inplace if needed.
Note that the checks performed when calling load_state_dict() with strict=True are affected by modifications the hook makes to missing_keys or unexpected_keys, as expected. Additions to either set of keys will result in an error being thrown when strict=True, and clearing out both missing and unexpected keys will avoid an error.
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
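The interaction with strict=True can be sketched as follows. The hook name ignore_missing_bias is hypothetical; it edits missing_keys in place so that a state dict without a bias entry loads without an error.

```python
import torch
from torch import nn

model = nn.Linear(2, 2)


def ignore_missing_bias(module, incompatible_keys):
    # Modify the key list in place; the strict check runs afterwards.
    if "bias" in incompatible_keys.missing_keys:
        incompatible_keys.missing_keys.remove("bias")


handle = model.register_load_state_dict_post_hook(ignore_missing_bias)

partial_state = {"weight": torch.zeros(2, 2)}  # deliberately lacks "bias"
result = model.load_state_dict(partial_state, strict=True)
handle.remove()
```

Without the hook, the same call would raise a RuntimeError because of the missing bias key.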
- register_load_state_dict_pre_hook(hook)
Register a pre-hook to be run before module’s load_state_dict() is called.
It should have the following signature:
hook(module, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs) -> None
- Arguments:
- hook (Callable): Callable hook that will be invoked before loading the state dict.
- register_module(name, module)
Alias for add_module().
- Return type:
None
- register_parameter(name, param)
Add a parameter to the module.
The parameter can be accessed as an attribute using given name.
- Return type:
None
- Args:
- name (str): name of the parameter. The parameter can be accessed from this module using the given name
- param (Parameter or None): parameter to be added to the module. If None, then operations that run on parameters, such as cuda, are ignored. If None, the parameter is not included in the module’s state_dict.
- register_state_dict_post_hook(hook)
Register a post-hook for the state_dict() method.
It should have the following signature:
hook(module, state_dict, prefix, local_metadata) -> None
The registered hooks can modify the state_dict inplace.
- register_state_dict_pre_hook(hook)
Register a pre-hook for the state_dict() method.
It should have the following signature:
hook(module, prefix, keep_vars) -> None
The registered hooks can be used to perform pre-processing before the state_dict call is made.
- property remove_dc_offset: bool
- requires_grad_(requires_grad=True)
Change if autograd should record operations on parameters in this module.
This method sets the parameters’ requires_grad attributes in-place.
This method is helpful for freezing part of the module for finetuning or training parts of a model individually (e.g., GAN training).
See locally-disable-grad-doc for a comparison between .requires_grad_() and several similar mechanisms that may be confused with it.
- Return type:
Self
- Args:
- requires_grad (bool): whether autograd should record operations on
parameters in this module. Default:
True.
- Returns:
Module: self
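Freezing part of a model, as mentioned above, can be sketched like this. The two-layer nn.Sequential is an arbitrary stand-in for a real model.

```python
from torch import nn

model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))

# Freeze the first layer; only the second layer stays trainable.
model[0].requires_grad_(False)

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)
```

An optimizer built afterwards can be given only the parameters that still require gradients.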
- property sampling_rate: int
- set_extra_state(state)
Set extra state contained in the loaded state_dict.
This function is called from load_state_dict() to handle any extra state found within the state_dict. Implement this function and a corresponding get_extra_state() for your module if you need to store extra state within its state_dict.
- Return type:
None
- Args:
state (dict): Extra state from the state_dict
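A minimal sketch of a module that round-trips extra state through its state_dict. The class WithExtraState and its num_updates counter are hypothetical and serve only as an example.

```python
from torch import nn


class WithExtraState(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 2)
        self.num_updates = 0  # plain Python state, not a parameter or buffer

    def get_extra_state(self):
        # Called while building state_dict(); should return a picklable object.
        return {"num_updates": self.num_updates}

    def set_extra_state(self, state):
        # Called from load_state_dict() with the stored object.
        self.num_updates = state["num_updates"]


src = WithExtraState()
src.num_updates = 7
dst = WithExtraState()
dst.load_state_dict(src.state_dict())
```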
- set_submodule(target, module, strict=False)
Set the submodule given by target if it exists, otherwise throw an error.
Note
If strict is set to False (default), the method will replace an existing submodule or create a new submodule if the parent module exists. If strict is set to True, the method will only attempt to replace an existing submodule and throw an error if the submodule does not exist.
For example, let’s say you have an nn.Module A that looks like this:
A(
    (net_b): Module(
        (net_c): Module(
            (conv): Conv2d(3, 3, 3)
        )
        (linear): Linear(3, 3)
    )
)
(The diagram shows an nn.Module A. A has a nested submodule net_b, which itself has two submodules net_c and linear. net_c then has a submodule conv.)
To override the Conv2d with a new submodule Linear, you could call set_submodule("net_b.net_c.conv", nn.Linear(1, 1)) where strict could be True or False.
To add a new submodule Conv2d to the existing net_b module, you would call set_submodule("net_b.conv", nn.Conv2d(1, 1, 1)).
In the above if you set strict=True and call set_submodule("net_b.conv", nn.Conv2d(1, 1, 1), strict=True), an AttributeError will be raised because net_b does not have a submodule named conv.
- Return type:
None
- Args:
- target: The fully-qualified string name of the submodule to look for. (See above example for how to specify a fully-qualified string.)
- module: The module to set the submodule to.
- strict: If False, the method will replace an existing submodule or create a new submodule if the parent module exists. If True, the method will only attempt to replace an existing submodule and throw an error if the submodule doesn’t already exist.
- Raises:
- ValueError: If the target string is empty or if module is not an instance of nn.Module.
- AttributeError: If at any point along the path resulting from the target string the (sub)path resolves to a non-existent attribute name or an object that is not an instance of nn.Module.
- share_memory()
See torch.Tensor.share_memory_().
- Return type:
Self
- state_dict(*args, destination=None, prefix='', keep_vars=False)
Return a dictionary containing references to the whole state of the module.
Both parameters and persistent buffers (e.g. running averages) are included. Keys are corresponding parameter and buffer names. Parameters and buffers set to None are not included.
Note
The returned object is a shallow copy. It contains references to the module’s parameters and buffers.
Warning
Currently state_dict() also accepts positional arguments for destination, prefix and keep_vars in order. However, this is being deprecated and keyword arguments will be enforced in future releases.
Warning
Please avoid the use of argument destination as it is not designed for end-users.
- Args:
- destination (dict, optional): If provided, the state of module will be updated into the dict and the same object is returned. Otherwise, an OrderedDict will be created and returned. Default: None.
- prefix (str, optional): a prefix added to parameter and buffer names to compose the keys in state_dict. Default: ''.
- keep_vars (bool, optional): by default the Tensors returned in the state dict are detached from autograd. If it’s set to True, detaching will not be performed. Default: False.
- Returns:
- dict:
a dictionary containing a whole state of the module
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> module.state_dict().keys()
['bias', 'weight']
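The shallow-copy note and the keep_vars argument can be illustrated with a short sketch; nn.Linear(2, 2) is an arbitrary example module.

```python
from torch import nn

module = nn.Linear(2, 2)

detached = module.state_dict()                # tensors detached from autograd
attached = module.state_dict(keep_vars=True)  # the Parameter objects themselves

# Even the detached tensors share storage with the live parameters,
# which is what "shallow copy" means here.
shares_storage = detached["weight"].data_ptr() == module.weight.data_ptr()
```

Because the storage is shared, an in-place change to a tensor in the returned dict is also visible in the module; clone the tensors first if an independent snapshot is needed.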
- to(*args, **kwargs)
Move and/or cast the parameters and buffers.
This can be called as
- to(device=None, dtype=None, non_blocking=False)
- to(dtype, non_blocking=False)
- to(tensor, non_blocking=False)
- to(memory_format=torch.channels_last)
Its signature is similar to torch.Tensor.to(), but only accepts floating point or complex dtypes. In addition, this method will only cast the floating point or complex parameters and buffers to dtype (if given). The integral parameters and buffers will be moved to device, if that is given, but with dtypes unchanged. When non_blocking is set, it tries to convert/move asynchronously with respect to the host if possible, e.g., moving CPU Tensors with pinned memory to CUDA devices.
See below for examples.
Note
This method modifies the module in-place.
- Args:
- device (torch.device): the desired device of the parameters and buffers in this module
- dtype (torch.dtype): the desired floating point or complex dtype of the parameters and buffers in this module
- tensor (torch.Tensor): Tensor whose dtype and device are the desired dtype and device for all parameters and buffers in this module
- memory_format (torch.memory_format): the desired memory format for 4D parameters and buffers in this module (keyword only argument)
- Returns:
Module: self
Examples:
>>> # xdoctest: +IGNORE_WANT("non-deterministic")
>>> linear = nn.Linear(2, 2)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]])
>>> linear.to(torch.double)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]], dtype=torch.float64)
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_CUDA1)
>>> gpu1 = torch.device("cuda:1")
>>> linear.to(gpu1, dtype=torch.half, non_blocking=True)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16, device='cuda:1')
>>> cpu = torch.device("cpu")
>>> linear.to(cpu)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16)
>>> linear = nn.Linear(2, 2, bias=None).to(torch.cdouble)
>>> linear.weight
Parameter containing:
tensor([[ 0.3741+0.j,  0.2382+0.j],
        [ 0.5593+0.j, -0.4443+0.j]], dtype=torch.complex128)
>>> linear(torch.ones(3, 2, dtype=torch.cdouble))
tensor([[0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j]], dtype=torch.complex128)
- to_empty(*, device, recurse=True)
Move the parameters and buffers to the specified device without copying storage.
- Return type:
Self
- Args:
- device (torch.device): The desired device of the parameters and buffers in this module.
- recurse (bool): Whether parameters and buffers of submodules should be recursively moved to the specified device.
- Returns:
Module: self
- train(mode=True)
Set the module in training mode.
This has an effect only on certain modules. See the documentation of particular modules for details of their behaviors in training/evaluation mode, i.e., whether they are affected, e.g. Dropout, BatchNorm, etc.
- Return type:
Self
- Args:
- mode (bool): whether to set training mode (True) or evaluation mode (False). Default: True.
- Returns:
Module: self
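The effect of the mode switch on a mode-dependent module can be sketched with nn.Dropout, used here only as a familiar example of such a module.

```python
import torch
from torch import nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.eval()            # same as drop.train(False)
eval_out = drop(x)     # dropout is a no-op in evaluation mode

drop.train()           # back to training mode
train_out = drop(x)    # elements are zeroed at random, survivors scaled by 1 / (1 - p)
```

In training mode the surviving elements equal 2.0 because of the 1 / (1 - p) rescaling with p = 0.5.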
- type(dst_type)
Casts all parameters and buffers to dst_type.
Note
This method modifies the module in-place.
- Return type:
Self
- Args:
dst_type (type or string): the desired type
- Returns:
Module: self
- property window_type: str
- xpu(device=None)
Move all model parameters and buffers to the XPU.
This also makes associated parameters and buffers different objects. So it should be called before constructing optimizer if the module will live on XPU while being optimized.
Note
This method modifies the module in-place.
- Return type:
Self
- Arguments:
- device (int, optional): if specified, all parameters will be
copied to that device
- Returns:
Module: self
- zero_grad(set_to_none=True)
Reset gradients of all model parameters.
See similar function under torch.optim.Optimizer for more context.
- Return type:
None
- Args:
- set_to_none (bool): instead of setting to zero, set the grads to None. See torch.optim.Optimizer.zero_grad() for details.
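The two behaviors of set_to_none can be compared in a short sketch, assuming a PyTorch version where set_to_none defaults to True as in the signature above.

```python
import torch
from torch import nn

model = nn.Linear(2, 1)

model(torch.ones(3, 2)).sum().backward()
had_grad = model.weight.grad is not None

model.zero_grad()                    # default: gradients become None
grad_after_default = model.weight.grad

model(torch.ones(3, 2)).sum().backward()
model.zero_grad(set_to_none=False)   # gradients are kept but filled with zeros
grad_after_zeroing = model.weight.grad
```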
- training: bool
- class lhotse.features.kaldi.layers.Wav2LogFilterBank(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, round_to_power_of_two=True, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=False, use_fft_mag=False, low_freq=20.0, high_freq=-400.0, num_filters=80, norm_filters=False, torchaudio_compatible_mel_scale=True)[source]
Apply standard Kaldi preprocessing (dithering, removing DC offset, pre-emphasis, etc.) on the input waveforms and compute their log-Mel filter bank energies (also known as “fbank”).
Example:
>>> x = torch.randn(1, 16000, dtype=torch.float32)
>>> x.shape
torch.Size([1, 16000])
>>> t = Wav2LogFilterBank()
>>> t(x).shape
torch.Size([1, 100, 80])
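The 100 frames in the example above are consistent with the usual Kaldi framing arithmetic, sketched below in plain Python. This is an assumption about the convention being followed, not a copy of the lhotse implementation, and the helper name expected_num_frames is hypothetical.

```python
def expected_num_frames(
    num_samples: int,
    sampling_rate: int = 16000,
    frame_length: float = 0.025,
    frame_shift: float = 0.01,
    snip_edges: bool = False,
) -> int:
    """Kaldi-style frame count (assumed convention, for illustration only)."""
    window = int(round(frame_length * sampling_rate))  # 400 samples at 16 kHz
    shift = int(round(frame_shift * sampling_rate))    # 160 samples at 16 kHz
    if snip_edges:
        # Only frames that fit entirely inside the signal are kept.
        if num_samples < window:
            return 0
        return 1 + (num_samples - window) // shift
    # Otherwise the count depends only on the shift (rounded to nearest).
    return (num_samples + shift // 2) // shift


print(expected_num_frames(16000))                   # one second of 16 kHz audio
print(expected_num_frames(16000, snip_edges=True))
```

With the defaults shown in the class signature (snip_edges=False), one second of 16 kHz audio yields 100 frames, matching the output shape in the example.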
The input is a tensor of shape (batch_size, num_samples). The output is a tensor of shape (batch_size, num_frames, num_filters).
- __init__(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, round_to_power_of_two=True, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=False, use_fft_mag=False, low_freq=20.0, high_freq=-400.0, num_filters=80, norm_filters=False, torchaudio_compatible_mel_scale=True)[source]
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- T_destination = ~T_destination
- add_module(name, module)
Add a child module to the current module.
The module can be accessed as an attribute using the given name.
- Return type:
None
- Args:
- name (str): name of the child module. The child module can be
accessed from this module using the given name
module (Module): child module to be added to the module.
- apply(fn)
Apply fn recursively to every submodule (as returned by .children()) as well as self.
Typical use includes initializing the parameters of a model (see also nn-init-doc).
- Return type:
Self
- Args:
fn (Module -> None): function to be applied to each submodule
- Returns:
Module: self
Example:
>>> @torch.no_grad()
>>> def init_weights(m):
>>>     print(m)
>>>     if type(m) is nn.Linear:
>>>         m.weight.fill_(1.0)
>>>         print(m.weight)
>>> net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
>>> net.apply(init_weights)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
- bfloat16()
Casts all floating point parameters and buffers to bfloat16 datatype.
Note
This method modifies the module in-place.
- Return type:
Self
- Returns:
Module: self
- buffers(recurse=True)
Return an iterator over module buffers.
- Return type:
Iterator[Tensor]
- Args:
- recurse (bool): if True, then yields buffers of this module
and all submodules. Otherwise, yields only buffers that are direct members of this module.
- Yields:
torch.Tensor: module buffer
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> for buf in model.buffers():
>>>     print(type(buf), buf.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
- call_super_init: bool = False
- children()
Return an iterator over immediate children modules.
- Return type:
Iterator[Module]
- Yields:
Module: a child module
- compile(*args, **kwargs)
Compile this Module’s forward using torch.compile().
This Module’s __call__ method is compiled and all arguments are passed as-is to torch.compile().
See torch.compile() for details on the arguments for this function.
- Return type:
None
- cpu()
Move all model parameters and buffers to the CPU.
Note
This method modifies the module in-place.
- Return type:
Self
- Returns:
Module: self
- cuda(device=None)
Move all model parameters and buffers to the GPU.
This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on GPU while being optimized.
Note
This method modifies the module in-place.
- Return type:
Self
- Args:
- device (int, optional): if specified, all parameters will be
copied to that device
- Returns:
Module: self
- property dither: float
- double()
Casts all floating point parameters and buffers to double datatype.
Note
This method modifies the module in-place.
- Return type:
Self
- Returns:
Module: self
- dump_patches: bool = False
- eval()
Set the module in evaluation mode.
This has an effect only on certain modules. See the documentation of particular modules for details of their behaviors in training/evaluation mode, i.e. whether they are affected, e.g. Dropout, BatchNorm, etc.
This is equivalent to self.train(False).
See locally-disable-grad-doc for a comparison between .eval() and several similar mechanisms that may be confused with it.
- Return type:
Self
- Returns:
Module: self
- extra_repr()
Return the extra representation of the module.
To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.
- Return type:
str
- float()
Casts all floating point parameters and buffers to float datatype.
Note
This method modifies the module in-place.
- Return type:
Self
- Returns:
Module: self
- forward(x)
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- Return type:
Tensor
- property frame_length: float
- property frame_shift: float
- get_buffer(target)
Return the buffer given by target if it exists, otherwise throw an error.
See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.
- Return type:
Tensor
- Args:
- target: The fully-qualified string name of the buffer to look for. (See get_submodule for how to specify a fully-qualified string.)
- Returns:
torch.Tensor: The buffer referenced by target
- Raises:
- AttributeError: If the target string references an invalid path or resolves to something that is not a buffer
- get_extra_state()
Return any extra state to include in the module’s state_dict.
Implement this and a corresponding set_extra_state() for your module if you need to store extra state. This function is called when building the module’s state_dict().
Note that extra state should be picklable to ensure working serialization of the state_dict. We only provide backwards compatibility guarantees for serializing Tensors; other objects may break backwards compatibility if their serialized pickled form changes.
- Return type:
Any
- Returns:
object: Any extra state to store in the module’s state_dict
- get_parameter(target)
Return the parameter given by target if it exists, otherwise throw an error.
See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.
- Return type:
Parameter
- Args:
- target: The fully-qualified string name of the Parameter to look for. (See get_submodule for how to specify a fully-qualified string.)
- Returns:
torch.nn.Parameter: The Parameter referenced by target
- Raises:
- AttributeError: If the target string references an invalid path or resolves to something that is not an nn.Parameter
- get_submodule(target)
Return the submodule given by target if it exists, otherwise throw an error.
For example, let’s say you have an nn.Module A that looks like this:
A(
    (net_b): Module(
        (net_c): Module(
            (conv): Conv2d(16, 33, kernel_size=(3, 3), stride=(2, 2))
        )
        (linear): Linear(in_features=100, out_features=200, bias=True)
    )
)
(The diagram shows an nn.Module A. A has a nested submodule net_b, which itself has two submodules net_c and linear. net_c then has a submodule conv.)
To check whether or not we have the linear submodule, we would call get_submodule("net_b.linear"). To check whether we have the conv submodule, we would call get_submodule("net_b.net_c.conv").
The runtime of get_submodule is bounded by the degree of module nesting in target. A query against named_modules achieves the same result, but it is O(N) in the number of transitive modules. So, for a simple check to see if some submodule exists, get_submodule should always be used.
- Return type:
Module
- Args:
- target: The fully-qualified string name of the submodule to look for. (See above example for how to specify a fully-qualified string.)
- Returns:
torch.nn.Module: The submodule referenced by target
- Raises:
- AttributeError: If at any point along the path resulting from the target string the (sub)path resolves to a non-existent attribute name or an object that is not an instance of nn.Module.
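The lookups described above can be tried on a small module tree that mirrors the diagram. Building the tree from bare nn.Module instances is only a convenience for this sketch.

```python
from torch import nn

# Recreate the nesting A -> net_b -> {net_c -> conv, linear}.
net_c = nn.Module()
net_c.add_module("conv", nn.Conv2d(16, 33, kernel_size=3, stride=2))
net_b = nn.Module()
net_b.add_module("net_c", net_c)
net_b.add_module("linear", nn.Linear(100, 200))
a = nn.Module()
a.add_module("net_b", net_b)

linear = a.get_submodule("net_b.linear")
conv = a.get_submodule("net_b.net_c.conv")

# A path that does not exist raises AttributeError.
try:
    a.get_submodule("net_b.missing")
    found_missing = True
except AttributeError:
    found_missing = False
```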
- half()
Casts all floating point parameters and buffers to half datatype.
Note
This method modifies the module in-place.
- Return type:
Self
- Returns:
Module: self
- ipu(device=None)
Move all model parameters and buffers to the IPU.
This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on IPU while being optimized.
Note
This method modifies the module in-place.
- Return type:
Self
- Arguments:
- device (int, optional): if specified, all parameters will be
copied to that device
- Returns:
Module: self
- load_state_dict(state_dict, strict=True, assign=False)
Copy parameters and buffers from state_dict into this module and its descendants.
If strict is True, then the keys of state_dict must exactly match the keys returned by this module’s state_dict() function.
Warning
If assign is True the optimizer must be created after the call to load_state_dict unless get_swap_module_params_on_conversion() is True.
- Args:
- state_dict (dict): a dict containing parameters and persistent buffers.
- strict (bool, optional): whether to strictly enforce that the keys in state_dict match the keys returned by this module’s state_dict() function. Default: True
- assign (bool, optional): When set to False, the properties of the tensors in the current module are preserved whereas setting it to True preserves properties of the Tensors in the state dict. The only exception is the requires_grad field of Parameter for which the value from the module is preserved. Default: False
- Returns:
NamedTuple with missing_keys and unexpected_keys fields:
- missing_keys is a list of str containing any keys that are expected by this module but missing from the provided state_dict.
- unexpected_keys is a list of str containing the keys that are not expected by this module but present in the provided state_dict.
- Note:
If a parameter or buffer is registered as None and its corresponding key exists in state_dict, load_state_dict() will raise a RuntimeError.
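The returned NamedTuple can be inspected when loading with strict=False, as in the sketch below. The key "extra" is a made-up entry used to trigger the unexpected_keys report.

```python
import torch
from torch import nn

model = nn.Linear(2, 2)

# "bias" is absent and "extra" does not belong to the module.
state = {"weight": torch.ones(2, 2), "extra": torch.zeros(1)}

result = model.load_state_dict(state, strict=False)
print(result.missing_keys)
print(result.unexpected_keys)
```

With strict=True the same call would raise a RuntimeError listing both problems instead of returning them.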
- modules(remove_duplicate=True)
Return an iterator over all modules in the network.
- Return type:
Iterator[Module]
- Args:
- remove_duplicate: whether to remove the duplicated module instances in the result
or not.
- Yields:
Module: a module in the network
- Note:
Duplicate modules are returned only once by default. In the following example, l will be returned only once.
Example:
>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.modules()):
...     print(idx, '->', m)
0 -> Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
1 -> Linear(in_features=2, out_features=2, bias=True)
- mtia(device=None)
Move all model parameters and buffers to the MTIA.
This also makes associated parameters and buffers different objects, so it should be called before constructing the optimizer if the module will live on MTIA while being optimized.
- Return type:
Self
Note
This method modifies the module in-place.
- Arguments:
- device (int, optional): if specified, all parameters will be
copied to that device
- Returns:
Module: self
- named_buffers(prefix='', recurse=True, remove_duplicate=True)
Return an iterator over module buffers, yielding both the name of the buffer as well as the buffer itself.
- Return type:
Iterator[tuple[str,Tensor]]
- Args:
- prefix (str): prefix to prepend to all buffer names.
- recurse (bool, optional): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module. Defaults to True.
- remove_duplicate (bool, optional): whether to remove the duplicated buffers in the result. Defaults to True.
- Yields:
(str, torch.Tensor): Tuple containing the name and buffer
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> for name, buf in self.named_buffers():
>>>     if name in ['running_var']:
>>>         print(buf.size())
- named_children()
Return an iterator over immediate children modules, yielding both the name of the module as well as the module itself.
- Return type:
Iterator[tuple[str, Module]]
- Yields:
(str, Module): Tuple containing a name and child module
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> for name, module in model.named_children():
>>>     if name in ['conv4', 'conv5']:
>>>         print(module)
- named_modules(memo=None, prefix='', remove_duplicate=True)
Return an iterator over all modules in the network, yielding both the name of the module as well as the module itself.
- Args:
- memo: a memo to store the set of modules already added to the result
- prefix: a prefix that will be added to the name of the module
- remove_duplicate: whether to remove the duplicated module instances in the result or not
- Yields:
(str, Module): Tuple of name and module
- Note:
Duplicate modules are returned only once. In the following example, l will be returned only once.
Example:
>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.named_modules()):
...     print(idx, '->', m)
0 -> ('', Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
))
1 -> ('0', Linear(in_features=2, out_features=2, bias=True))
- named_parameters(prefix='', recurse=True, remove_duplicate=True)
Return an iterator over module parameters, yielding both the name of the parameter as well as the parameter itself.
- Return type:
Iterator[tuple[str,Parameter]]
- Args:
- prefix (str): prefix to prepend to all parameter names.
- recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.
- remove_duplicate (bool, optional): whether to remove the duplicated parameters in the result. Defaults to True.
- Yields:
(str, Parameter): Tuple containing the name and parameter
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> for name, param in self.named_parameters():
>>>     if name in ['bias']:
>>>         print(param.size())
- online_inference(x, context=None)
- Return type:
Tuple[Tensor,Tensor]
- parameters(recurse=True)
Return an iterator over module parameters.
This is typically passed to an optimizer.
- Return type:
Iterator[Parameter]
- Args:
- recurse (bool): if True, then yields parameters of this module
and all submodules. Otherwise, yields only parameters that are direct members of this module.
- Yields:
Parameter: module parameter
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> for param in model.parameters():
>>>     print(type(param), param.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
- property preemph_coeff: float
- register_backward_hook(hook)
Register a backward hook on the module.
This function is deprecated in favor of register_full_backward_hook() and the behavior of this function will change in future versions.
- Return type:
RemovableHandle
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
- register_buffer(name, tensor, persistent=True)
Add a buffer to the module.
This is typically used to register a buffer that should not be considered a model parameter. For example, BatchNorm’s running_mean is not a parameter, but is part of the module’s state. Buffers, by default, are persistent and will be saved alongside parameters. This behavior can be changed by setting persistent to False. The only difference between a persistent buffer and a non-persistent buffer is that the latter will not be a part of this module’s state_dict.
Buffers can be accessed as attributes using given names.
- Return type:
None
- Args:
- name (str): name of the buffer. The buffer can be accessed from this module using the given name
- tensor (Tensor or None): buffer to be registered. If None, then operations that run on buffers, such as cuda, are ignored. If None, the buffer is not included in the module’s state_dict.
- persistent (bool): whether the buffer is part of this module’s state_dict.
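The difference between persistent and non-persistent buffers can be checked with a small module. This is a sketch assuming PyTorch is installed; the class RunningStats and the buffer name scratch are hypothetical.

```python
import torch
from torch import nn


class RunningStats(nn.Module):  # hypothetical module, for illustration only
    def __init__(self, num_features: int):
        super().__init__()
        # Persistent by default: saved in state_dict.
        self.register_buffer("running_mean", torch.zeros(num_features))
        # Non-persistent: still a buffer, but excluded from state_dict.
        self.register_buffer("scratch", torch.zeros(num_features), persistent=False)


stats = RunningStats(4)
state_keys = list(stats.state_dict().keys())
buffer_names = [name for name, _ in stats.named_buffers()]
print(state_keys)
print(buffer_names)
```

Both buffers follow the module across devices and dtype casts; only the persistent one is serialized.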
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> self.register_buffer('running_mean', torch.zeros(num_features))
- register_forward_hook(hook, *, prepend=False, with_kwargs=False, always_call=False)
Register a forward hook on the module.
The hook will be called every time after forward() has computed an output.
If with_kwargs is False or not specified, the input contains only the positional arguments given to the module. Keyword arguments are passed only to forward, not to the hooks. The hook can modify the output. It can modify the input in place, but this has no effect on forward because the hook is called after forward() has run. The hook should have the following signature:
hook(module, args, output) -> None or modified output
If with_kwargs is True, the forward hook will be passed the kwargs given to the forward function and is expected to return the output, possibly modified. The hook should have the following signature:
hook(module, args, kwargs, output) -> None or modified output
- Return type:
RemovableHandle
- Args:
- hook (Callable): The user defined hook to be registered.
- prepend (bool): If True, the provided hook will be fired before all existing forward hooks on this torch.nn.Module. Otherwise, the provided hook will be fired after all existing forward hooks on this torch.nn.Module. Note that global forward hooks registered with register_module_forward_hook() will fire before all hooks registered by this method. Default: False
- with_kwargs (bool): If True, the hook will be passed the kwargs given to the forward function. Default: False
- always_call (bool): If True, the hook will be run regardless of whether an exception is raised while calling the Module. Default: False
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
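A forward hook that returns a value replaces the module output, as described above. The sketch below assumes PyTorch is installed; the hook name double_output is hypothetical.

```python
import torch
from torch import nn

layer = nn.Linear(2, 2)


def double_output(module, args, output):
    # Returning a value replaces the output seen by the caller.
    return output * 2


x = torch.ones(1, 2)
with torch.no_grad():
    plain = layer(x)
    handle = layer.register_forward_hook(double_output)
    hooked = layer(x)
    handle.remove()  # the hook no longer fires after removal
    restored = layer(x)
```

Keeping the returned handle is the only way to unregister the hook later.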
- register_forward_pre_hook(hook, *, prepend=False, with_kwargs=False)
Register a forward pre-hook on the module.
The hook will be called every time before forward() is invoked.
If with_kwargs is false or not specified, the input contains only the positional arguments given to the module. Keyword arguments are passed only to forward, not to the hooks. The hook can modify the input. The hook can return either a tuple or a single modified value; a single value is wrapped into a tuple (unless that value is already a tuple). The hook should have the following signature:
hook(module, args) -> None or modified input
If with_kwargs is true, the forward pre-hook will be passed the kwargs given to the forward function. If the hook modifies the input, both the args and kwargs should be returned. The hook should have the following signature:
hook(module, args, kwargs) -> None or a tuple of modified input and kwargs
- Return type:
RemovableHandle
- Args:
- hook (Callable): The user defined hook to be registered.
- prepend (bool): If true, the provided hook will be fired before all existing forward_pre hooks on this torch.nn.Module. Otherwise, the provided hook will be fired after all existing forward_pre hooks on this torch.nn.Module. Note that global forward_pre hooks registered with register_module_forward_pre_hook() will fire before all hooks registered by this method. Default: False
- with_kwargs (bool): If true, the hook will be passed the kwargs given to the forward function. Default: False
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
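The input-rewriting behavior described above can be sketched as follows, assuming PyTorch is installed. The hook name add_one is hypothetical, and nn.Identity is used only so that the modified input is visible in the output.

```python
import torch
from torch import nn

layer = nn.Identity()


def add_one(module, args):
    # args is the tuple of positional inputs; a single returned value
    # is wrapped back into a tuple before forward() is called.
    (x,) = args
    return x + 1


handle = layer.register_forward_pre_hook(add_one)
out = layer(torch.zeros(3))
handle.remove()
```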
- register_full_backward_hook(hook, prepend=False)
Register a backward hook on the module.
The hook will be called every time the gradients with respect to a module are computed, and its firing rules are as follows:
- Ordinarily, the hook fires when the gradients are computed with respect to the module inputs.
- If none of the module inputs require gradients, the hook will fire when the gradients are computed with respect to module outputs.
- If none of the module outputs require gradients, then the hooks will not fire.
The hook should have the following signature:
hook(module, grad_input, grad_output) -> tuple(Tensor) or None
The grad_input and grad_output are tuples that contain the gradients with respect to the inputs and outputs respectively. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the input that will be used in place of grad_input in subsequent computations. grad_input will only correspond to the inputs given as positional arguments and all kwarg arguments are ignored. Entries in grad_input and grad_output will be None for all non-Tensor arguments.
For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly the caller will receive a view of each Tensor returned by the Module’s forward function.
Warning
Modifying inputs or outputs inplace is not allowed when using backward hooks and will raise an error.
- Return type:
RemovableHandle
- Args:
- hook (Callable): The user-defined hook to be registered.
- prepend (bool): If true, the provided hook will be fired before all existing backward hooks on this torch.nn.Module. Otherwise, the provided hook will be fired after all existing backward hooks on this torch.nn.Module. Note that global backward hooks registered with register_module_full_backward_hook() will fire before all hooks registered by this method.
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
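As a rough illustration of the signature above, the sketch below records the gradients passed to a full backward hook. It assumes PyTorch is installed; the hook name save_grads and the captured dictionary are hypothetical.

```python
import torch
from torch import nn

layer = nn.Linear(3, 1)
captured = {}


def save_grads(module, grad_input, grad_output):
    # Only observe the gradients; returning None leaves them unchanged.
    captured["grad_input"] = grad_input
    captured["grad_output"] = grad_output


handle = layer.register_full_backward_hook(save_grads)
x = torch.ones(2, 3, requires_grad=True)
layer(x).sum().backward()
handle.remove()

print(tuple(captured["grad_output"][0].shape))
print(tuple(captured["grad_input"][0].shape))
```

Because the loss is a plain sum, each row of the input gradient equals the layer's weight row.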
- register_full_backward_pre_hook(hook, prepend=False)
Register a backward pre-hook on the module.
The hook will be called every time the gradients for the module are computed. The hook should have the following signature:
hook(module, grad_output) -> tuple[Tensor, ...], Tensor or None
The grad_output is a tuple. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the output that will be used in place of grad_output in subsequent computations. Entries in grad_output will be None for all non-Tensor arguments.
For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly the caller will receive a view of each Tensor returned by the Module’s forward function.
Warning
Modifying inputs inplace is not allowed when using backward hooks and will raise an error.
- Return type:
RemovableHandle
- Args:
- hook (Callable): The user-defined hook to be registered.
- prepend (bool): If true, the provided hook will be fired before all existing backward_pre hooks on this torch.nn.Module. Otherwise, the provided hook will be fired after all existing backward_pre hooks on this torch.nn.Module. Note that global backward_pre hooks registered with register_module_full_backward_pre_hook() will fire before all hooks registered by this method.
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
- register_load_state_dict_post_hook(hook)
Register a post-hook to be run after module’s load_state_dict() is called.
It should have the following signature:
hook(module, incompatible_keys) -> None
The module argument is the current module that this hook is registered on, and the incompatible_keys argument is a NamedTuple consisting of attributes missing_keys and unexpected_keys. missing_keys is a list of str containing the missing keys and unexpected_keys is a list of str containing the unexpected keys.
The given incompatible_keys can be modified inplace if needed.
Note that the checks performed when calling load_state_dict() with strict=True are affected by modifications the hook makes to missing_keys or unexpected_keys, as expected. Additions to either set of keys will result in an error being thrown when strict=True, and clearing out both missing and unexpected keys will avoid an error.
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
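The interaction with strict=True described above can be sketched as follows, assuming PyTorch is installed. The hook name ignore_missing_bias is hypothetical; it removes one entry from missing_keys in place so that the strict check passes.

```python
import torch
from torch import nn

model = nn.Linear(2, 2)


def ignore_missing_bias(module, incompatible_keys):
    # Edit the list in place; the strict check runs after the hook.
    if "bias" in incompatible_keys.missing_keys:
        incompatible_keys.missing_keys.remove("bias")


handle = model.register_load_state_dict_post_hook(ignore_missing_bias)

# The state dict deliberately omits "bias".
state = {"weight": torch.zeros(2, 2)}
result = model.load_state_dict(state, strict=True)
handle.remove()
```

Without the hook, the same call would raise a RuntimeError because bias is missing.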
- register_load_state_dict_pre_hook(hook)
Register a pre-hook to be run before module’s load_state_dict() is called.
It should have the following signature:
hook(module, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs) -> None
- Arguments:
- hook (Callable): Callable hook that will be invoked before loading the state dict.
- register_module(name, module)
Alias for add_module().
- Return type:
None
- register_parameter(name, param)
Add a parameter to the module.
The parameter can be accessed as an attribute using given name.
- Return type:
None
- Args:
- name (str): name of the parameter. The parameter can be accessed from this module using the given name
- param (Parameter or None): parameter to be added to the module. If None, then operations that run on parameters, such as cuda, are ignored. If None, the parameter is not included in the module’s state_dict.
- register_state_dict_post_hook(hook)
Register a post-hook for the state_dict() method.
It should have the following signature:
hook(module, state_dict, prefix, local_metadata) -> None
The registered hooks can modify the state_dict inplace.
- register_state_dict_pre_hook(hook)
Register a pre-hook for the state_dict() method.
It should have the following signature:
hook(module, prefix, keep_vars) -> None
The registered hooks can be used to perform pre-processing before the state_dict call is made.
- property remove_dc_offset: bool
- requires_grad_(requires_grad=True)
Change if autograd should record operations on parameters in this module.
This method sets the parameters’ requires_grad attributes in-place.
This method is helpful for freezing part of the module for finetuning or training parts of a model individually (e.g., GAN training).
See locally-disable-grad-doc for a comparison between .requires_grad_() and several similar mechanisms that may be confused with it.
- Return type:
Self
- Args:
- requires_grad (bool): whether autograd should record operations on parameters in this module. Default: True.
- Returns:
Module: self
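Freezing part of a model for finetuning, as mentioned above, might look like the following sketch. It assumes PyTorch is installed; the two-layer model is hypothetical.

```python
from torch import nn

# Hypothetical model: freeze the first layer, keep the second trainable.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
model[0].requires_grad_(False)

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)
```

Passing only the trainable parameters to the optimizer avoids keeping optimizer state for the frozen ones.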
- property sampling_rate: int
- set_extra_state(state)
Set extra state contained in the loaded state_dict.
This function is called from load_state_dict() to handle any extra state found within the state_dict. Implement this function and a corresponding get_extra_state() for your module if you need to store extra state within its state_dict.
- Return type:
None
- Args:
state (dict): Extra state from the state_dict
- set_submodule(target, module, strict=False)
Set the submodule given by target if it exists, otherwise throw an error.
- Return type:
None
Note
If strict is set to False (default), the method will replace an existing submodule or create a new submodule if the parent module exists. If strict is set to True, the method will only attempt to replace an existing submodule and throw an error if the submodule does not exist.
For example, let’s say you have an nn.Module A that looks like this:
A(
    (net_b): Module(
        (net_c): Module(
            (conv): Conv2d(3, 3, 3)
        )
        (linear): Linear(3, 3)
    )
)
(The diagram shows an nn.Module A. A has a nested submodule net_b, which itself has two submodules net_c and linear. net_c then has a submodule conv.)
To override the Conv2d with a new submodule Linear, you could call set_submodule("net_b.net_c.conv", nn.Linear(1, 1)) where strict could be True or False.
To add a new submodule Conv2d to the existing net_b module, you would call set_submodule("net_b.conv", nn.Conv2d(1, 1, 1)).
In the above, if you set strict=True and call set_submodule("net_b.conv", nn.Conv2d(1, 1, 1), strict=True), an AttributeError will be raised because net_b does not have a submodule named conv.
- Args:
- target: The fully-qualified string name of the submodule to look for. (See above example for how to specify a fully-qualified string.)
- module: The module to set the submodule to.
- strict: If False, the method will replace an existing submodule or create a new submodule if the parent module exists. If True, the method will only attempt to replace an existing submodule and throw an error if the submodule doesn’t already exist.
- Raises:
- ValueError: If the target string is empty or if module is not an instance of nn.Module.
- AttributeError: If at any point along the path resulting from the target string the (sub)path resolves to a non-existent attribute name or an object that is not an instance of nn.Module.
- share_memory()
See torch.Tensor.share_memory_().
- Return type:
Self
- state_dict(*args, destination=None, prefix='', keep_vars=False)
Return a dictionary containing references to the whole state of the module.
Both parameters and persistent buffers (e.g. running averages) are included. Keys are corresponding parameter and buffer names. Parameters and buffers set to None are not included.
Note
The returned object is a shallow copy. It contains references to the module’s parameters and buffers.
Warning
Currently state_dict() also accepts positional arguments for destination, prefix and keep_vars in order. However, this is being deprecated and keyword arguments will be enforced in future releases.
Warning
Please avoid the use of argument destination as it is not designed for end-users.
- Args:
- destination (dict, optional): If provided, the state of module will be updated into the dict and the same object is returned. Otherwise, an OrderedDict will be created and returned. Default: None.
- prefix (str, optional): a prefix added to parameter and buffer names to compose the keys in state_dict. Default: ''.
- keep_vars (bool, optional): by default the Tensors returned in the state dict are detached from autograd. If it’s set to True, detaching will not be performed. Default: False.
- Returns:
dict: a dictionary containing a whole state of the module
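The shallow-copy note above means the returned tensors share storage with the live parameters. A minimal sketch, assuming PyTorch is installed:

```python
import torch
from torch import nn

layer = nn.Linear(2, 2)
state = layer.state_dict()
keys = list(state.keys())

# The dict holds references: an in-place change to the parameter
# is visible through the previously returned state dict.
with torch.no_grad():
    layer.weight.zero_()
shares_storage = bool((state["weight"] == 0).all())

print(keys)
print(shares_storage)
```

Code that needs an independent snapshot should copy the tensors, for example with copy.deepcopy on the returned dict.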
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> module.state_dict().keys()
['bias', 'weight']
- to(*args, **kwargs)
Move and/or cast the parameters and buffers.
This can be called as
- to(device=None, dtype=None, non_blocking=False)
- to(dtype, non_blocking=False)
- to(tensor, non_blocking=False)
- to(memory_format=torch.channels_last)
Its signature is similar to torch.Tensor.to(), but only accepts floating point or complex dtypes. In addition, this method will only cast the floating point or complex parameters and buffers to dtype (if given). The integral parameters and buffers will be moved to device, if that is given, but with dtypes unchanged. When non_blocking is set, it tries to convert/move asynchronously with respect to the host if possible, e.g., moving CPU Tensors with pinned memory to CUDA devices.
See below for examples.
Note
This method modifies the module in-place.
- Args:
- device (torch.device): the desired device of the parameters and buffers in this module
- dtype (torch.dtype): the desired floating point or complex dtype of the parameters and buffers in this module
- tensor (torch.Tensor): Tensor whose dtype and device are the desired dtype and device for all parameters and buffers in this module
- memory_format (torch.memory_format): the desired memory format for 4D parameters and buffers in this module (keyword only argument)
- Returns:
Module: self
Examples:
>>> # xdoctest: +IGNORE_WANT("non-deterministic")
>>> linear = nn.Linear(2, 2)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]])
>>> linear.to(torch.double)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]], dtype=torch.float64)
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_CUDA1)
>>> gpu1 = torch.device("cuda:1")
>>> linear.to(gpu1, dtype=torch.half, non_blocking=True)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16, device='cuda:1')
>>> cpu = torch.device("cpu")
>>> linear.to(cpu)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16)
>>> linear = nn.Linear(2, 2, bias=None).to(torch.cdouble)
>>> linear.weight
Parameter containing:
tensor([[ 0.3741+0.j,  0.2382+0.j],
        [ 0.5593+0.j, -0.4443+0.j]], dtype=torch.complex128)
>>> linear(torch.ones(3, 2, dtype=torch.cdouble))
tensor([[0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j]], dtype=torch.complex128)
- to_empty(*, device, recurse=True)
Move the parameters and buffers to the specified device without copying storage.
- Return type:
Self
- Args:
- device (torch.device): The desired device of the parameters and buffers in this module.
- recurse (bool): Whether parameters and buffers of submodules should be recursively moved to the specified device.
- Returns:
Module: self
- train(mode=True)
Set the module in training mode.
This has an effect only on certain modules. See the documentation of particular modules for details of their behaviors in training/evaluation mode, i.e., whether they are affected, e.g. Dropout, BatchNorm, etc.
- Return type:
Self
- Args:
- mode (bool): whether to set training mode (True) or evaluation mode (False). Default: True.
- Returns:
Module: self
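Dropout is one of the modules whose behavior depends on the mode. The sketch below, assuming PyTorch is installed, shows that it is the identity in evaluation mode and zeroes or rescales elements in training mode.

```python
import torch
from torch import nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train(False)  # same as drop.eval()
eval_out = drop(x)  # identity in evaluation mode

drop.train()  # back to training mode
train_out = drop(x)  # each element is either dropped or scaled by 1 / (1 - p)
values = set(train_out.unique().tolist())
```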
- type(dst_type)
Casts all parameters and buffers to dst_type.
- Return type:
Self
Note
This method modifies the module in-place.
- Args:
dst_type (type or string): the desired type
- Returns:
Module: self
- property window_type: str
- xpu(device=None)
Move all model parameters and buffers to the XPU.
This also makes associated parameters and buffers different objects, so it should be called before constructing the optimizer if the module will live on XPU while being optimized.
- Return type:
Self
Note
This method modifies the module in-place.
- Arguments:
- device (int, optional): if specified, all parameters will be
copied to that device
- Returns:
Module: self
- zero_grad(set_to_none=True)
Reset gradients of all model parameters.
See similar function under torch.optim.Optimizer for more context.
- Return type:
None
- Args:
- set_to_none (bool): instead of setting to zero, set the grads to None. See torch.optim.Optimizer.zero_grad() for details.
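The effect of set_to_none can be sketched as follows, assuming a PyTorch version where the default is True, as in the signature above.

```python
import torch
from torch import nn

layer = nn.Linear(2, 1)

# Default (set_to_none=True): gradients are released and become None.
layer(torch.ones(1, 2)).sum().backward()
layer.zero_grad()
after_default = layer.weight.grad

# set_to_none=False: gradient tensors are kept and filled with zeros.
layer(torch.ones(1, 2)).sum().backward()
layer.zero_grad(set_to_none=False)
after_zero = layer.weight.grad
```

Code that reads .grad after zero_grad() therefore has to handle None under the default setting.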
- training: bool
- class lhotse.features.kaldi.layers.Wav2MFCC(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, round_to_power_of_two=True, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=False, use_fft_mag=False, low_freq=20.0, high_freq=-400.0, num_filters=23, norm_filters=False, num_ceps=13, cepstral_lifter=22, torchaudio_compatible_mel_scale=True)[source]
Apply standard Kaldi preprocessing (dithering, removing DC offset, pre-emphasis, etc.) on the input waveforms and compute their Mel-Frequency Cepstral Coefficients (MFCC).
Example:
>>> x = torch.randn(1, 16000, dtype=torch.float32)
>>> x.shape
torch.Size([1, 16000])
>>> t = Wav2MFCC()
>>> t(x).shape
torch.Size([1, 100, 13])
The input is a tensor of shape (batch_size, num_samples). The output is a tensor of shape (batch_size, num_frames, num_ceps).
- __init__(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, round_to_power_of_two=True, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=False, use_fft_mag=False, low_freq=20.0, high_freq=-400.0, num_filters=23, norm_filters=False, num_ceps=13, cepstral_lifter=22, torchaudio_compatible_mel_scale=True)[source]
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- static make_lifter(N, Q)[source]
Makes the liftering function
- Args:
- N: Number of cepstral coefficients.
- Q: Liftering parameter.
- Returns:
Liftering vector.
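For orientation, the conventional Kaldi liftering weights are 1 + (Q / 2) * sin(pi * n / Q) for n = 0, ..., N - 1. The sketch below implements that formula in plain Python; it is an assumption that make_lifter follows the same convention, and the function name kaldi_lifter_sketch is hypothetical rather than part of lhotse.

```python
import math
from typing import List


def kaldi_lifter_sketch(num_ceps: int, lifter: float) -> List[float]:
    """Kaldi-style liftering weights (assumed convention, not lhotse's code)."""
    if lifter <= 0:
        raise ValueError("this sketch expects a positive liftering parameter")
    return [
        1.0 + 0.5 * lifter * math.sin(math.pi * n / lifter)
        for n in range(num_ceps)
    ]


# Defaults from the Wav2MFCC signature: num_ceps=13, cepstral_lifter=22.
weights = kaldi_lifter_sketch(13, 22)
```

The first weight is 1, so the zeroth cepstral coefficient is left unscaled, while higher-order coefficients are boosted.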
- T_destination = ~T_destination
- add_module(name, module)
Add a child module to the current module.
The module can be accessed as an attribute using the given name.
- Return type:
None
- Args:
- name (str): name of the child module. The child module can be accessed from this module using the given name
- module (Module): child module to be added to the module.
- apply(fn)
Apply fn recursively to every submodule (as returned by .children()) as well as self.
Typical use includes initializing the parameters of a model (see also nn-init-doc).
- Return type:
Self
- Args:
- fn (Module -> None): function to be applied to each submodule
- Returns:
Module: self
Example:
>>> @torch.no_grad()
>>> def init_weights(m):
>>>     print(m)
>>>     if type(m) is nn.Linear:
>>>         m.weight.fill_(1.0)
>>>         print(m.weight)
>>> net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
>>> net.apply(init_weights)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
- bfloat16()
Casts all floating point parameters and buffers to bfloat16 datatype.
- Return type:
Self
Note
This method modifies the module in-place.
- Returns:
Module: self
- buffers(recurse=True)
Return an iterator over module buffers.
- Return type:
Iterator[Tensor]
- Args:
- recurse (bool): if True, then yields buffers of this module
and all submodules. Otherwise, yields only buffers that are direct members of this module.
- Yields:
torch.Tensor: module buffer
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> for buf in model.buffers():
>>>     print(type(buf), buf.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
- call_super_init: bool = False
- children()
Return an iterator over immediate children modules.
- Return type:
Iterator[Module]
- Yields:
Module: a child module
- compile(*args, **kwargs)
Compile this Module’s forward using torch.compile().
This Module’s __call__ method is compiled and all arguments are passed as-is to torch.compile().
See torch.compile() for details on the arguments for this function.
- Return type:
None
- cpu()
Move all model parameters and buffers to the CPU.
- Return type:
Self
Note
This method modifies the module in-place.
- Returns:
Module: self
- cuda(device=None)
Move all model parameters and buffers to the GPU.
This also makes associated parameters and buffers different objects, so it should be called before constructing the optimizer if the module will live on GPU while being optimized.
- Return type:
Self
Note
This method modifies the module in-place.
- Args:
- device (int, optional): if specified, all parameters will be
copied to that device
- Returns:
Module: self
- property dither: float
- double()
Casts all floating point parameters and buffers to double datatype.
- Return type:
Self
Note
This method modifies the module in-place.
- Returns:
Module: self
- dump_patches: bool = False
- eval()
Set the module in evaluation mode.
This has an effect only on certain modules. See the documentation of particular modules for details of their behaviors in training/evaluation mode, i.e. whether they are affected, e.g. Dropout, BatchNorm, etc.
This is equivalent to self.train(False).
See locally-disable-grad-doc for a comparison between .eval() and several similar mechanisms that may be confused with it.
- Return type:
Self
- Returns:
Module: self
- extra_repr()
Return the extra representation of the module.
To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.
- Return type:
str
- float()
Casts all floating point parameters and buffers to float datatype.
- Return type:
Self
Note
This method modifies the module in-place.
- Returns:
Module: self
- forward(x)
Define the computation performed at every call.
Should be overridden by all subclasses.
- Return type:
Tensor
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- property frame_length: float
- property frame_shift: float
- get_buffer(target)
Return the buffer given by target if it exists, otherwise throw an error.
See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.
- Return type:
Tensor
- Args:
- target: The fully-qualified string name of the buffer to look for. (See get_submodule for how to specify a fully-qualified string.)
- Returns:
torch.Tensor: The buffer referenced by target
- Raises:
- AttributeError: If the target string references an invalid path or resolves to something that is not a buffer
- get_extra_state()
Return any extra state to include in the module’s state_dict.
Implement this and a corresponding set_extra_state() for your module if you need to store extra state. This function is called when building the module’s state_dict().
Note that extra state should be picklable to ensure working serialization of the state_dict. We only provide backwards compatibility guarantees for serializing Tensors; other objects may break backwards compatibility if their serialized pickled form changes.
- Return type:
Any
- Returns:
object: Any extra state to store in the module’s state_dict
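A matching pair of get_extra_state() and set_extra_state() lets non-tensor state travel with the state_dict. The sketch below assumes PyTorch is installed; the class Versioned and its version attribute are hypothetical.

```python
from torch import nn


class Versioned(nn.Module):  # hypothetical module, for illustration only
    def __init__(self, version: int = 0):
        super().__init__()
        self.version = version
        self.linear = nn.Linear(1, 1)

    def get_extra_state(self):
        # Called while building state_dict(); the result must be picklable.
        return {"version": self.version}

    def set_extra_state(self, state):
        # Called from load_state_dict() with the stored extra state.
        self.version = state["version"]


src = Versioned(version=3)
dst = Versioned(version=0)
saved = src.state_dict()
dst.load_state_dict(saved)
```

The extra state is stored under a reserved key, which in current PyTorch releases is named _extra_state.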
- get_parameter(target)
Return the parameter given by target if it exists, otherwise throw an error.
See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.
- Return type:
Parameter
- Args:
- target: The fully-qualified string name of the Parameter to look for. (See get_submodule for how to specify a fully-qualified string.)
- Returns:
torch.nn.Parameter: The Parameter referenced by target
- Raises:
- AttributeError: If the target string references an invalid path or resolves to something that is not an nn.Parameter
- get_submodule(target)
Return the submodule given by target if it exists, otherwise throw an error.
For example, let’s say you have an nn.Module A that looks like this:
A(
    (net_b): Module(
        (net_c): Module(
            (conv): Conv2d(16, 33, kernel_size=(3, 3), stride=(2, 2))
        )
        (linear): Linear(in_features=100, out_features=200, bias=True)
    )
)
(The diagram shows an nn.Module A. A has a nested submodule net_b, which itself has two submodules net_c and linear. net_c then has a submodule conv.)
To check whether or not we have the linear submodule, we would call get_submodule("net_b.linear"). To check whether we have the conv submodule, we would call get_submodule("net_b.net_c.conv").
The runtime of get_submodule is bounded by the degree of module nesting in target. A query against named_modules achieves the same result, but it is O(N) in the number of transitive modules. So, for a simple check to see if some submodule exists, get_submodule should always be used.
- Return type:
Module
- Args:
- target: The fully-qualified string name of the submodule to look for. (See above example for how to specify a fully-qualified string.)
- Returns:
torch.nn.Module: The submodule referenced by target
- Raises:
- AttributeError: If at any point along the path resulting from the target string the (sub)path resolves to a non-existent attribute name or an object that is not an instance of nn.Module.
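The lookups described above can be tried on a small hand-built module tree. This is a minimal sketch: the tree mirrors the A / net_b / net_c layout from the docstring, and has_submodule is a hypothetical helper written for this illustration, not part of torch or lhotse.

```python
import torch
from torch import nn

# Build a nested module tree shaped like the example in the docstring.
net_c = nn.Module()
net_c.conv = nn.Conv2d(16, 33, kernel_size=(3, 3), stride=(2, 2))
net_b = nn.Module()
net_b.net_c = net_c
net_b.linear = nn.Linear(100, 200)
a = nn.Module()
a.net_b = net_b

# Fully-qualified names are dot-separated attribute paths.
linear = a.get_submodule("net_b.linear")
conv_weight = a.get_parameter("net_b.net_c.conv.weight")


def has_submodule(module: nn.Module, target: str) -> bool:
    """Hypothetical helper: existence check built on get_submodule."""
    try:
        module.get_submodule(target)
        return True
    except AttributeError:
        return False


print(has_submodule(a, "net_b.net_c.conv"), has_submodule(a, "net_b.missing"))
```

Catching AttributeError is the documented failure mode for both methods, so the helper does not need to scan named_modules.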
- half()
Casts all floating point parameters and buffers to half datatype.
- Return type:
Self
Note
This method modifies the module in-place.
- Returns:
Module: self
- ipu(device=None)
Move all model parameters and buffers to the IPU.
This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on IPU while being optimized.
- Return type:
Self
Note
This method modifies the module in-place.
- Arguments:
- device (int, optional): if specified, all parameters will be copied to that device
- Returns:
Module: self
- load_state_dict(state_dict, strict=True, assign=False)
Copy parameters and buffers from state_dict into this module and its descendants.
If strict is True, then the keys of state_dict must exactly match the keys returned by this module’s state_dict() function.
Warning
If assign is True the optimizer must be created after the call to load_state_dict unless get_swap_module_params_on_conversion() is True.
- Args:
- state_dict (dict): a dict containing parameters and persistent buffers.
- strict (bool, optional): whether to strictly enforce that the keys in state_dict match the keys returned by this module’s state_dict() function. Default: True
- assign (bool, optional): When set to False, the properties of the tensors in the current module are preserved whereas setting it to True preserves properties of the Tensors in the state dict. The only exception is the requires_grad field of Parameter for which the value from the module is preserved. Default: False
- Returns:
NamedTuple with missing_keys and unexpected_keys fields:
missing_keys is a list of str containing any keys that are expected by this module but missing from the provided state_dict.
unexpected_keys is a list of str containing the keys that are not expected by this module but present in the provided state_dict.
- Note:
If a parameter or buffer is registered as None and its corresponding key exists in state_dict, load_state_dict() will raise a RuntimeError.
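To illustrate the strict flag and the returned NamedTuple, here is a short sketch using a plain nn.Linear. The "checkpoint" is built in memory and the key name "extra" is made up for the example.

```python
import torch
from torch import nn

model = nn.Linear(4, 2)

# Simulate a checkpoint that lacks "bias" and carries one unrelated key.
partial = {k: v.clone() for k, v in model.state_dict().items() if k != "bias"}
partial["extra"] = torch.zeros(1)

# strict=False loads what matches and reports the rest.
result = model.load_state_dict(partial, strict=False)
print(result.missing_keys, result.unexpected_keys)

# strict=True (the default) turns the same mismatch into a RuntimeError.
try:
    model.load_state_dict(partial, strict=True)
    strict_failed = False
except RuntimeError:
    strict_failed = True
```

Inspecting missing_keys and unexpected_keys after a non-strict load is a cheap way to confirm that only the expected keys were skipped.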
- modules(remove_duplicate=True)
Return an iterator over all modules in the network.
- Return type:
Iterator[Module]
- Args:
- remove_duplicate: whether to remove the duplicated module instances in the result or not.
- Yields:
Module: a module in the network
- Note:
Duplicate modules are returned only once by default. In the following example, l will be returned only once.
Example:
>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.modules()):
...     print(idx, '->', m)
0 -> Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
1 -> Linear(in_features=2, out_features=2, bias=True)
- mtia(device=None)
Move all model parameters and buffers to the MTIA.
This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on MTIA while being optimized.
- Return type:
Self
Note
This method modifies the module in-place.
- Arguments:
- device (int, optional): if specified, all parameters will be copied to that device
- Returns:
Module: self
- named_buffers(prefix='', recurse=True, remove_duplicate=True)
Return an iterator over module buffers, yielding both the name of the buffer as well as the buffer itself.
- Return type:
Iterator[tuple[str,Tensor]]
- Args:
- prefix (str): prefix to prepend to all buffer names.
- recurse (bool, optional): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module. Defaults to True.
- remove_duplicate (bool, optional): whether to remove the duplicated buffers in the result. Defaults to True.
- Yields:
(str, torch.Tensor): Tuple containing the name and buffer
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> for name, buf in self.named_buffers():
>>>     if name in ['running_var']:
>>>         print(buf.size())
- named_children()
Return an iterator over immediate children modules, yielding both the name of the module as well as the module itself.
- Return type:
Iterator[tuple[str, Module]]
- Yields:
(str, Module): Tuple containing a name and child module
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> for name, module in model.named_children():
>>>     if name in ['conv4', 'conv5']:
>>>         print(module)
- named_modules(memo=None, prefix='', remove_duplicate=True)
Return an iterator over all modules in the network, yielding both the name of the module as well as the module itself.
- Args:
- memo: a memo to store the set of modules already added to the result
- prefix: a prefix that will be added to the name of the module
- remove_duplicate: whether to remove the duplicated module instances in the result or not
- Yields:
(str, Module): Tuple of name and module
- Note:
Duplicate modules are returned only once. In the following example, l will be returned only once.
Example:
>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.named_modules()):
...     print(idx, '->', m)
0 -> ('', Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
))
1 -> ('0', Linear(in_features=2, out_features=2, bias=True))
- named_parameters(prefix='', recurse=True, remove_duplicate=True)
Return an iterator over module parameters, yielding both the name of the parameter as well as the parameter itself.
- Return type:
Iterator[tuple[str,Parameter]]
- Args:
- prefix (str): prefix to prepend to all parameter names.
- recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.
- remove_duplicate (bool, optional): whether to remove the duplicated parameters in the result. Defaults to True.
- Yields:
(str, Parameter): Tuple containing the name and parameter
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> for name, param in self.named_parameters():
>>>     if name in ['bias']:
>>>         print(param.size())
- online_inference(x, context=None)
- Return type:
Tuple[Tensor,Tensor]
- parameters(recurse=True)
Return an iterator over module parameters.
This is typically passed to an optimizer.
- Return type:
Iterator[Parameter]
- Args:
- recurse (bool): if True, then yields parameters of this module
and all submodules. Otherwise, yields only parameters that are direct members of this module.
- Yields:
Parameter: module parameter
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> for param in model.parameters():
>>>     print(type(param), param.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
- property preemph_coeff: float
- register_backward_hook(hook)
Register a backward hook on the module.
This function is deprecated in favor of register_full_backward_hook() and the behavior of this function will change in future versions.
- Return type:
RemovableHandle
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
- register_buffer(name, tensor, persistent=True)
Add a buffer to the module.
This is typically used to register a buffer that should not be considered a model parameter. For example, BatchNorm’s running_mean is not a parameter, but is part of the module’s state. Buffers, by default, are persistent and will be saved alongside parameters. This behavior can be changed by setting persistent to False. The only difference between a persistent buffer and a non-persistent buffer is that the latter will not be a part of this module’s state_dict.
Buffers can be accessed as attributes using given names.
- Return type:
None
- Args:
- name (str): name of the buffer. The buffer can be accessed from this module using the given name
- tensor (Tensor or None): buffer to be registered. If None, then operations that run on buffers, such as cuda, are ignored. If None, the buffer is not included in the module’s state_dict.
- persistent (bool): whether the buffer is part of this module’s state_dict.
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> self.register_buffer('running_mean', torch.zeros(num_features))
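The difference between persistent and non-persistent buffers is easiest to see by comparing state_dict() with named_buffers(). The following is a small sketch; the class RunningStats and the buffer name scratch are made up for illustration.

```python
import torch
from torch import nn


class RunningStats(nn.Module):
    """Illustrative module holding one persistent and one non-persistent buffer."""

    def __init__(self, num_features: int):
        super().__init__()
        # Saved in state_dict (persistent=True is the default).
        self.register_buffer("running_mean", torch.zeros(num_features))
        # Follows the module across devices, but is left out of state_dict.
        self.register_buffer("scratch", torch.zeros(num_features), persistent=False)


stats = RunningStats(3)
saved_keys = list(stats.state_dict().keys())
buffer_names = [name for name, _ in stats.named_buffers()]
print(saved_keys, buffer_names)
```

Both buffers are reachable as attributes (stats.running_mean, stats.scratch); only the persistent one is serialized.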
- register_forward_hook(hook, *, prepend=False, with_kwargs=False, always_call=False)
Register a forward hook on the module.
The hook will be called every time after forward() has computed an output.
If with_kwargs is False or not specified, the input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks and only to the forward. The hook can modify the output. It can modify the input inplace but it will not have effect on forward since this is called after forward() is called. The hook should have the following signature:
hook(module, args, output) -> None or modified output
If with_kwargs is True, the forward hook will be passed the kwargs given to the forward function and be expected to return the output possibly modified. The hook should have the following signature:
hook(module, args, kwargs, output) -> None or modified output
- Return type:
RemovableHandle
- Args:
- hook (Callable): The user defined hook to be registered.
- prepend (bool): If True, the provided hook will be fired before all existing forward hooks on this torch.nn.Module. Otherwise, the provided hook will be fired after all existing forward hooks on this torch.nn.Module. Note that global forward hooks registered with register_module_forward_hook() will fire before all hooks registered by this method. Default: False
- with_kwargs (bool): If True, the hook will be passed the kwargs given to the forward function. Default: False
- always_call (bool): If True the hook will be run regardless of whether an exception is raised while calling the Module. Default: False
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
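As a concrete illustration of the two return conventions (None versus a modified output), here is a minimal sketch with two forward hooks on an nn.Linear. The hook names record_output and double_output are invented for this example.

```python
import torch
from torch import nn

layer = nn.Linear(3, 2)
captured = []


def record_output(module, args, output):
    # Returning None leaves the output untouched.
    captured.append(tuple(output.shape))


def double_output(module, args, output):
    # Returning a value replaces the module's output.
    return output * 2


x = torch.ones(1, 3)
h1 = layer.register_forward_hook(record_output)
baseline = layer(x)

h2 = layer.register_forward_hook(double_output)
doubled = layer(x)

# Handles undo the registration.
h1.remove()
h2.remove()
after_removal = layer(x)
```

Keeping the returned handles around is what makes temporary instrumentation safe to undo.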
- register_forward_pre_hook(hook, *, prepend=False, with_kwargs=False)
Register a forward pre-hook on the module.
The hook will be called every time before forward() is invoked.
If with_kwargs is false or not specified, the input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks and only to the forward. The hook can modify the input. User can either return a tuple or a single modified value in the hook. We will wrap the value into a tuple if a single value is returned (unless that value is already a tuple). The hook should have the following signature:
hook(module, args) -> None or modified input
If with_kwargs is true, the forward pre-hook will be passed the kwargs given to the forward function. And if the hook modifies the input, both the args and kwargs should be returned. The hook should have the following signature:
hook(module, args, kwargs) -> None or a tuple of modified input and kwargs
- Return type:
RemovableHandle
- Args:
- hook (Callable): The user defined hook to be registered.
- prepend (bool): If true, the provided hook will be fired before all existing forward_pre hooks on this torch.nn.Module. Otherwise, the provided hook will be fired after all existing forward_pre hooks on this torch.nn.Module. Note that global forward_pre hooks registered with register_module_forward_pre_hook() will fire before all hooks registered by this method. Default: False
- with_kwargs (bool): If true, the hook will be passed the kwargs given to the forward function. Default: False
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
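The two pre-hook signatures can be compared side by side. This is a sketch under simple assumptions: Scaler, halve_input and override_factor are hypothetical names defined only for this example.

```python
import torch
from torch import nn


class Scaler(nn.Module):
    """Toy module whose forward takes a keyword argument."""

    def forward(self, x, factor=1.0):
        return x * factor


def halve_input(module, args):
    # A single returned value is wrapped into a tuple of positional args.
    return args[0] * 0.5


def override_factor(module, args, kwargs):
    # With with_kwargs=True, return both args and kwargs when modifying them.
    return args, dict(kwargs, factor=3.0)


m = Scaler()

h_plain = m.register_forward_pre_hook(halve_input)
halved = m(torch.ones(2))
h_plain.remove()

h_kwargs = m.register_forward_pre_hook(override_factor, with_kwargs=True)
tripled = m(torch.ones(2), factor=1.0)
h_kwargs.remove()
```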
- register_full_backward_hook(hook, prepend=False)
Register a backward hook on the module.
The hook will be called every time the gradients with respect to a module are computed, and its firing rules are as follows:
Ordinarily, the hook fires when the gradients are computed with respect to the module inputs.
If none of the module inputs require gradients, the hook will fire when the gradients are computed with respect to module outputs.
If none of the module outputs require gradients, then the hooks will not fire.
The hook should have the following signature:
hook(module, grad_input, grad_output) -> tuple(Tensor) or None
The grad_input and grad_output are tuples that contain the gradients with respect to the inputs and outputs respectively. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the input that will be used in place of grad_input in subsequent computations. grad_input will only correspond to the inputs given as positional arguments and all kwarg arguments are ignored. Entries in grad_input and grad_output will be None for all non-Tensor arguments.
For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly the caller will receive a view of each Tensor returned by the Module’s forward function.
Warning
Modifying inputs or outputs inplace is not allowed when using backward hooks and will raise an error.
- Return type:
RemovableHandle
- Args:
- hook (Callable): The user-defined hook to be registered.
- prepend (bool): If true, the provided hook will be fired before all existing backward hooks on this torch.nn.Module. Otherwise, the provided hook will be fired after all existing backward hooks on this torch.nn.Module. Note that global backward hooks registered with register_module_full_backward_hook() will fire before all hooks registered by this method.
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
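A short sketch of a full backward hook that only inspects gradient shapes may help make grad_input and grad_output concrete. The hook name log_grads and the dictionary seen are invented for this example.

```python
import torch
from torch import nn

layer = nn.Linear(3, 1)
seen = {}


def log_grads(module, grad_input, grad_output):
    # Both arguments are tuples; entries may be None for non-Tensor inputs.
    seen["grad_input"] = [None if g is None else tuple(g.shape) for g in grad_input]
    seen["grad_output"] = [None if g is None else tuple(g.shape) for g in grad_output]
    # Returning None keeps grad_input as computed.


handle = layer.register_full_backward_hook(log_grads)

# The input requires grad, so the hook fires when its gradient is computed.
x = torch.ones(2, 3, requires_grad=True)
layer(x).sum().backward()
handle.remove()

print(seen)
```

Here grad_input has one entry (the gradient with respect to x) and grad_output has one entry (the gradient with respect to the layer output).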
- register_full_backward_pre_hook(hook, prepend=False)
Register a backward pre-hook on the module.
The hook will be called every time the gradients for the module are computed. The hook should have the following signature:
hook(module, grad_output) -> tuple[Tensor, ...], Tensor or None
The grad_output is a tuple. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the output that will be used in place of grad_output in subsequent computations. Entries in grad_output will be None for all non-Tensor arguments.
For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly the caller will receive a view of each Tensor returned by the Module’s forward function.
Warning
Modifying inputs inplace is not allowed when using backward hooks and will raise an error.
- Return type:
RemovableHandle
- Args:
- hook (Callable): The user-defined hook to be registered.
- prepend (bool): If true, the provided hook will be fired before all existing backward_pre hooks on this torch.nn.Module. Otherwise, the provided hook will be fired after all existing backward_pre hooks on this torch.nn.Module. Note that global backward_pre hooks registered with register_module_full_backward_pre_hook() will fire before all hooks registered by this method.
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
- register_load_state_dict_post_hook(hook)
Register a post-hook to be run after module’s load_state_dict() is called.
It should have the following signature:
hook(module, incompatible_keys) -> None
The module argument is the current module that this hook is registered on, and the incompatible_keys argument is a NamedTuple consisting of attributes missing_keys and unexpected_keys. missing_keys is a list of str containing the missing keys and unexpected_keys is a list of str containing the unexpected keys.
The given incompatible_keys can be modified inplace if needed.
Note that the checks performed when calling load_state_dict() with strict=True are affected by modifications the hook makes to missing_keys or unexpected_keys, as expected. Additions to either set of keys will result in an error being thrown when strict=True, and clearing out both missing and unexpected keys will avoid an error.
- Returns:
torch.utils.hooks.RemovableHandle: a handle that can be used to remove the added hook by calling handle.remove()
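The in-place modification described above can be sketched as follows. The scenario (a checkpoint that intentionally omits the bias) and the hook name ignore_missing_bias are assumptions made for this illustration.

```python
import torch
from torch import nn

model = nn.Linear(4, 2)


def ignore_missing_bias(module, incompatible_keys):
    # Edit the list in place; strict checking runs on the modified lists.
    if "bias" in incompatible_keys.missing_keys:
        incompatible_keys.missing_keys.remove("bias")


handle = model.register_load_state_dict_post_hook(ignore_missing_bias)

# A checkpoint without "bias" would normally fail under strict=True.
weights_only = {"weight": torch.zeros(2, 4)}
result = model.load_state_dict(weights_only, strict=True)
handle.remove()

print(result.missing_keys, result.unexpected_keys)
```

The bias keeps its previous values, since nothing was loaded for it; only the error is suppressed.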
- register_load_state_dict_pre_hook(hook)
Register a pre-hook to be run before module’s load_state_dict() is called.
It should have the following signature:
hook(module, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs) -> None
- Arguments:
- hook (Callable): Callable hook that will be invoked before loading the state dict.
- register_module(name, module)
Alias for add_module().
- Return type:
None
- register_parameter(name, param)
Add a parameter to the module.
The parameter can be accessed as an attribute using given name.
- Return type:
None
- Args:
- name (str): name of the parameter. The parameter can be accessed from this module using the given name
- param (Parameter or None): parameter to be added to the module. If None, then operations that run on parameters, such as cuda, are ignored. If None, the parameter is not included in the module’s state_dict.
- register_state_dict_post_hook(hook)
Register a post-hook for the state_dict() method.
It should have the following signature:
hook(module, state_dict, prefix, local_metadata) -> None
The registered hooks can modify the state_dict inplace.
- register_state_dict_pre_hook(hook)
Register a pre-hook for the state_dict() method.
It should have the following signature:
hook(module, prefix, keep_vars) -> None
The registered hooks can be used to perform pre-processing before the state_dict call is made.
- property remove_dc_offset: bool
- requires_grad_(requires_grad=True)
Change if autograd should record operations on parameters in this module.
This method sets the parameters’ requires_grad attributes in-place.
This method is helpful for freezing part of the module for finetuning or training parts of a model individually (e.g., GAN training).
See locally-disable-grad-doc for a comparison between .requires_grad_() and several similar mechanisms that may be confused with it.
- Return type:
Self
- Args:
- requires_grad (bool): whether autograd should record operations on parameters in this module. Default: True.
- Returns:
Module: self
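The freezing use case mentioned above might look like this in practice. It is only a sketch: the two-layer model and the choice of SGD are arbitrary.

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Freeze the first layer; autograd stops tracking its parameters.
model[0].requires_grad_(False)

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)

# Pass only the trainable parameters to the optimizer.
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=0.1
)

model(torch.ones(1, 4)).sum().backward()
```

After backward(), the frozen layer's parameters have no gradient, while the last layer's do.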
- property sampling_rate: int
- set_extra_state(state)
Set extra state contained in the loaded state_dict.
This function is called from load_state_dict() to handle any extra state found within the state_dict. Implement this function and a corresponding get_extra_state() for your module if you need to store extra state within its state_dict.
- Return type:
None
- Args:
state (dict): Extra state from the state_dict
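A matching get_extra_state() / set_extra_state() pair can be sketched as below. The module WithCounter and its steps attribute are hypothetical, chosen only to show the round trip through state_dict.

```python
import torch
from torch import nn


class WithCounter(nn.Module):
    """Illustrative module that stores a plain Python counter as extra state."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 2)
        self.steps = 0

    def get_extra_state(self):
        # Must be picklable; a small dict of builtins is a safe choice.
        return {"steps": self.steps}

    def set_extra_state(self, state):
        self.steps = state["steps"]


src = WithCounter()
src.steps = 7

dst = WithCounter()
dst.load_state_dict(src.state_dict())
print(dst.steps, sorted(src.state_dict().keys()))
```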
- set_submodule(target, module, strict=False)
Set the submodule given by target if it exists, otherwise throw an error.
- Return type:
None
Note
If strict is set to False (default), the method will replace an existing submodule or create a new submodule if the parent module exists. If strict is set to True, the method will only attempt to replace an existing submodule and throw an error if the submodule does not exist.
For example, let’s say you have an nn.Module A that looks like this:
A(
    (net_b): Module(
        (net_c): Module(
            (conv): Conv2d(3, 3, 3)
        )
        (linear): Linear(3, 3)
    )
)
(The diagram shows an nn.Module A. A has a nested submodule net_b, which itself has two submodules net_c and linear. net_c then has a submodule conv.)
To override the Conv2d with a new submodule Linear, you could call set_submodule("net_b.net_c.conv", nn.Linear(1, 1)) where strict could be True or False.
To add a new submodule Conv2d to the existing net_b module, you would call set_submodule("net_b.conv", nn.Conv2d(1, 1, 1)).
In the above if you set strict=True and call set_submodule("net_b.conv", nn.Conv2d(1, 1, 1), strict=True), an AttributeError will be raised because net_b does not have a submodule named conv.
- Args:
- target: The fully-qualified string name of the submodule to look for. (See above example for how to specify a fully-qualified string.)
- module: The module to set the submodule to.
- strict: If False, the method will replace an existing submodule or create a new submodule if the parent module exists. If True, the method will only attempt to replace an existing submodule and throw an error if the submodule doesn’t already exist.
- Raises:
- ValueError: If the target string is empty or if module is not an instance of nn.Module.
- AttributeError: If at any point along the path resulting from the target string the (sub)path resolves to a non-existent attribute name or an object that is not an instance of nn.Module.
- share_memory()
See torch.Tensor.share_memory_().
- Return type:
Self
- state_dict(*args, destination=None, prefix='', keep_vars=False)
Return a dictionary containing references to the whole state of the module.
Both parameters and persistent buffers (e.g. running averages) are included. Keys are corresponding parameter and buffer names. Parameters and buffers set to None are not included.
Note
The returned object is a shallow copy. It contains references to the module’s parameters and buffers.
Warning
Currently state_dict() also accepts positional arguments for destination, prefix and keep_vars in order. However, this is being deprecated and keyword arguments will be enforced in future releases.
Warning
Please avoid the use of argument destination as it is not designed for end-users.
- Args:
- destination (dict, optional): If provided, the state of module will be updated into the dict and the same object is returned. Otherwise, an OrderedDict will be created and returned. Default: None.
- prefix (str, optional): a prefix added to parameter and buffer names to compose the keys in state_dict. Default: ''.
- keep_vars (bool, optional): by default the Tensors returned in the state dict are detached from autograd. If it’s set to True, detaching will not be performed. Default: False.
- Returns:
- dict:
a dictionary containing a whole state of the module
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> module.state_dict().keys()
['bias', 'weight']
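Because the returned dictionary holds references rather than copies, later in-place updates to the module show up in it. The sketch below contrasts that with an explicit snapshot; the variable names are illustrative.

```python
import torch
from torch import nn

model = nn.Linear(2, 2)

sd = model.state_dict()
# Entries share storage with the live parameters.
shares_storage = sd["weight"].data_ptr() == model.weight.data_ptr()

# Clone the tensors to take an independent snapshot.
snapshot = {k: v.clone() for k, v in sd.items()}

with torch.no_grad():
    model.weight.fill_(1.0)

live_changed = bool(torch.all(sd["weight"] == 1.0))
snapshot_changed = bool(torch.all(snapshot["weight"] == 1.0))
print(shares_storage, live_changed, snapshot_changed)
```

Cloning (or copy.deepcopy) is a common way to keep a "best so far" checkpoint in memory during training.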
- to(*args, **kwargs)
Move and/or cast the parameters and buffers.
This can be called as
- to(device=None, dtype=None, non_blocking=False)
- to(dtype, non_blocking=False)
- to(tensor, non_blocking=False)
- to(memory_format=torch.channels_last)
Its signature is similar to torch.Tensor.to(), but only accepts floating point or complex dtypes. In addition, this method will only cast the floating point or complex parameters and buffers to dtype (if given). The integral parameters and buffers will be moved to device, if that is given, but with dtypes unchanged. When non_blocking is set, it tries to convert/move asynchronously with respect to the host if possible, e.g., moving CPU Tensors with pinned memory to CUDA devices.
See below for examples.
Note
This method modifies the module in-place.
- Args:
- device (torch.device): the desired device of the parameters and buffers in this module
- dtype (torch.dtype): the desired floating point or complex dtype of the parameters and buffers in this module
- tensor (torch.Tensor): Tensor whose dtype and device are the desired dtype and device for all parameters and buffers in this module
- memory_format (torch.memory_format): the desired memory format for 4D parameters and buffers in this module (keyword only argument)
- Returns:
Module: self
Examples:
>>> # xdoctest: +IGNORE_WANT("non-deterministic")
>>> linear = nn.Linear(2, 2)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]])
>>> linear.to(torch.double)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]], dtype=torch.float64)
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_CUDA1)
>>> gpu1 = torch.device("cuda:1")
>>> linear.to(gpu1, dtype=torch.half, non_blocking=True)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16, device='cuda:1')
>>> cpu = torch.device("cpu")
>>> linear.to(cpu)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16)
>>> linear = nn.Linear(2, 2, bias=None).to(torch.cdouble)
>>> linear.weight
Parameter containing:
tensor([[ 0.3741+0.j,  0.2382+0.j],
        [ 0.5593+0.j, -0.4443+0.j]], dtype=torch.complex128)
>>> linear(torch.ones(3, 2, dtype=torch.cdouble))
tensor([[0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j]], dtype=torch.complex128)
- to_empty(*, device, recurse=True)
Move the parameters and buffers to the specified device without copying storage.
- Return type:
Self
- Args:
- device (torch.device): The desired device of the parameters and buffers in this module.
- recurse (bool): Whether parameters and buffers of submodules should be recursively moved to the specified device.
- Returns:
Module: self
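One common use of to_empty() is materializing a module that was first created on the "meta" device, where tensors have shapes but no storage. The sketch below assumes a PyTorch version in which nn.Linear accepts a device argument; re-initializing with reset_parameters() is specific to layers that provide it.

```python
import torch
from torch import nn

# Construct without allocating real storage.
model = nn.Linear(4, 2, device="meta")
was_meta = model.weight.is_meta

# Allocate uninitialized storage on the CPU; values are arbitrary at this point.
model = model.to_empty(device="cpu")

# Re-initialize before use (nn.Linear provides reset_parameters()).
model.reset_parameters()

out = model(torch.ones(1, 4))
print(was_meta, model.weight.device, tuple(out.shape))
```

Since no storage is copied, the parameters must be initialized or loaded from a checkpoint afterwards.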
- train(mode=True)
Set the module in training mode.
This has an effect only on certain modules. See the documentation of particular modules for details of their behaviors in training/evaluation mode, i.e., whether they are affected, e.g. Dropout, BatchNorm, etc.
- Return type:
Self
- Args:
- mode (bool): whether to set training mode (True) or evaluation mode (False). Default: True.
- Returns:
Module: self
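Dropout is a convenient module for seeing the effect of the training flag. This is a minimal sketch; the probability 0.5 and the input size are arbitrary.

```python
import torch
from torch import nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

# Evaluation mode: dropout is the identity.
drop.train(False)
eval_out = drop(x)

# Training mode: elements are zeroed at random, survivors are scaled by 1 / (1 - p).
drop.train(True)
train_out = drop(x)

distinct_values = set(train_out.unique().tolist())
print(drop.training, distinct_values)
```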
- type(dst_type)
Casts all parameters and buffers to dst_type.
- Return type:
Self
Note
This method modifies the module in-place.
- Args:
dst_type (type or string): the desired type
- Returns:
Module: self
- property window_type: str
- xpu(device=None)
Move all model parameters and buffers to the XPU.
This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on XPU while being optimized.
- Return type:
Self
Note
This method modifies the module in-place.
- Arguments:
- device (int, optional): if specified, all parameters will be copied to that device
- Returns:
Module: self
- zero_grad(set_to_none=True)
Reset gradients of all model parameters.
See similar function under torch.optim.Optimizer for more context.
- Return type:
None
- Args:
- set_to_none (bool): instead of setting to zero, set the grads to None. See torch.optim.Optimizer.zero_grad() for details.
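The two behaviors of set_to_none can be compared directly. In this sketch the flag is passed explicitly in both calls so that the result does not depend on the default of the installed PyTorch version.

```python
import torch
from torch import nn

model = nn.Linear(2, 1)

# First pass: gradients are populated, then dropped entirely.
model(torch.ones(1, 2)).sum().backward()
had_grad = model.weight.grad is not None
model.zero_grad(set_to_none=True)
grad_is_none = model.weight.grad is None

# Second pass: gradients are kept as tensors but filled with zeros.
model(torch.ones(1, 2)).sum().backward()
model.zero_grad(set_to_none=False)
grad_is_zero = bool(torch.all(model.weight.grad == 0))

print(had_grad, grad_is_none, grad_is_zero)
```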
- training: bool
- lhotse.features.kaldi.layers.create_mel_scale(num_filters, fft_length, sampling_rate, low_freq=0, high_freq=None, norm_filters=True)[source]
- Return type:
Tensor
- lhotse.features.kaldi.layers.create_frame_window(window_size, window_type='povey', blackman_coeff=0.42)[source]
Returns a window function with the given type and size
- lhotse.features.kaldi.layers.next_power_of_2(x)[source]
Returns the smallest power of 2 that is greater than or equal to x.
Original source: TorchAudio (torchaudio/compliance/kaldi.py)
- Return type:
int
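A pure-Python sketch of this rounding is shown below. It assumes the one-line bit_length formulation used in torchaudio's Kaldi-compliance code, which the docstring cites as the original source; the function name next_power_of_2_sketch is hypothetical, and the lhotse source should be consulted for the actual implementation. Under this formulation an exact power of two maps to itself.

```python
def next_power_of_2_sketch(x: int) -> int:
    """Round a non-negative integer up to a power of two (assumed formulation)."""
    # (x - 1).bit_length() is the exponent of the smallest power of two >= x.
    return 1 if x == 0 else 2 ** (x - 1).bit_length()


# A 25 ms window at 16 kHz has 400 samples, which pads to a 512-point FFT.
window_size = int(0.025 * 16000)
padded = next_power_of_2_sketch(window_size)
print(window_size, padded)
```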
- lhotse.features.kaldi.layers.get_mel_banks(num_bins, window_length_padded, sample_freq, low_freq, high_freq)[source]
- Return type:
Tuple[Tensor,Tensor]
- Returns:
(Tensor, Tensor): The tuple consists of bins (which is melbank of size (num_bins, num_fft_bins)) and center_freqs (which is center frequencies of bins of size (num_bins)).
Torchaudio feature extractors
- class lhotse.features.fbank.TorchaudioFbankConfig(dither=0.0, window_type='povey', frame_length=0.025, frame_shift=0.01, remove_dc_offset=True, round_to_power_of_two=True, energy_floor=1e-10, min_duration=0.0, preemphasis_coefficient=0.97, raw_energy=True, low_freq=20.0, high_freq=-400.0, num_mel_bins=80, use_energy=False, vtln_low=100.0, vtln_high=-500.0, vtln_warp=1.0)[source]
-
dither:
float= 0.0
-
window_type:
str= 'povey'
-
frame_length:
float= 0.025
-
frame_shift:
float= 0.01
-
remove_dc_offset:
bool= True
-
round_to_power_of_two:
bool= True
-
energy_floor:
float= 1e-10
-
min_duration:
float= 0.0
-
preemphasis_coefficient:
float= 0.97
-
raw_energy:
bool= True
-
low_freq:
float= 20.0
-
high_freq:
float= -400.0
-
num_mel_bins:
int= 80
-
use_energy:
bool= False
-
vtln_low:
float= 100.0
-
vtln_high:
float= -500.0
-
vtln_warp:
float= 1.0
- __init__(dither=0.0, window_type='povey', frame_length=0.025, frame_shift=0.01, remove_dc_offset=True, round_to_power_of_two=True, energy_floor=1e-10, min_duration=0.0, preemphasis_coefficient=0.97, raw_energy=True, low_freq=20.0, high_freq=-400.0, num_mel_bins=80, use_energy=False, vtln_low=100.0, vtln_high=-500.0, vtln_warp=1.0)
- class lhotse.features.fbank.TorchaudioFbank(config=None)[source]
Log Mel energy filter bank feature extractor based on the torchaudio.compliance.kaldi.fbank function.
- name = 'fbank'
- config_type
alias of TorchaudioFbankConfig
- static mix(features_a, features_b, energy_scaling_factor_b)[source]
Perform feature-domain mix of two signals, a and b, and return the mixed signal.
- Parameters:
features_a (ndarray) – Left-hand side (reference) signal.
features_b (ndarray) – Right-hand side (mixed-in) signal.
energy_scaling_factor_b (float) – A scaling factor for features_b energy. It is used to achieve a specific SNR. E.g. to mix with an SNR of 10dB when both features_a and features_b energies are 100, the features_b signal energy needs to be scaled by 0.1. Since different features (e.g. spectrogram, fbank, MFCC) require different combination of transformations (e.g. exp, log, sqrt, pow) to allow mixing of two signals, the exact place where to apply energy_scaling_factor_b to the signal is determined by the implementer.
- Return type:
ndarray
- Returns:
A mixed feature matrix.
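The 10 dB example in the parameter description can be checked with a few lines of arithmetic. This is a sketch, assuming energies are given in the linear domain and the SNR in decibels; the helper name energy_scaling_factor is hypothetical and not part of lhotse.

```python
def energy_scaling_factor(energy_a: float, energy_b: float, snr_db: float) -> float:
    """Factor by which to scale energy_b so that energy_a / scaled_b matches snr_db."""
    # SNR in dB is 10 * log10(energy_a / energy_b), so the target energy of b is:
    target_energy_b = energy_a / (10 ** (snr_db / 10))
    return target_energy_b / energy_b


# Both signals have energy 100; a 10 dB SNR needs b at one tenth of a's energy.
factor = energy_scaling_factor(100.0, 100.0, 10.0)
print(factor)
```

With equal energies and a target SNR of 0 dB the factor is 1.0, i.e. no scaling is needed.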
- static compute_energy(features)[source]
Compute the total energy of a feature matrix. How the energy is computed depends on a particular type of features. It is expected that when implemented, compute_energy will never return zero.
- Parameters:
features (ndarray) – A feature matrix.
- Return type:
float
- Returns:
A positive float value of the signal energy.
- static scale(features, energy_scaling_factor)[source]
Scale a single feature matrix by the provided energy factor.
- Parameters:
features (ndarray) – A feature matrix.
energy_scaling_factor (float) – The energy scaling factor to apply.
- Return type:
ndarray
- Returns:
A scaled feature matrix.
- __init__(config=None)
- property device: str | device
- extract(samples, sampling_rate)
Defines how to extract features using a numpy ndarray of audio samples and the sampling rate.
- Return type:
ndarray
- Returns:
a numpy ndarray representing the feature matrix.
- extract_batch(samples, sampling_rate, lengths=None)
Performs batch extraction. It is not guaranteed to be faster than FeatureExtractor.extract() – it depends on whether the implementation of a particular feature extractor supports accelerated batch computation. If lengths is provided, it is assumed that the input is a batch of padded sequences, so we will not perform any further collation.
- Return type:
Union[ndarray,Tensor,List[ndarray],List[Tensor]]
Note
Unless overridden by child classes, it defaults to sequentially calling FeatureExtractor.extract() on the inputs.
Note
This method should support variable length inputs.
- extract_from_recording_and_store(recording, storage, offset=0, duration=None, channels=None, augment_fn=None)
Extract the features from a Recording in a full pipeline:
load audio from disk;
optionally, perform audio augmentation;
extract the features;
save them to disk in a specified directory;
return a Features object with a description of the extracted features and the source data used.
- Parameters:
recording (Recording) – a Recording that specifies the input audio.
storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices. When the optional lilcom dependency is installed and on-disk size matters, LilcomChunkyWriter is the preferred backend.
offset (float) – an optional offset in seconds for where to start reading the recording.
duration (Optional[float]) – an optional duration specifying how much audio to load from the recording.
channels (Union[int, List[int], None]) – an optional int or list of ints, specifying the channels; by default, all channels will be used.
augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.
- Return type:
Features
- Returns:
a Features manifest item for the extracted feature matrix.
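The order of the pipeline steps, and the `(samples, sampling_rate) -> samples` shape of `augment_fn`, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `halve_gain`, `extract_and_store`, the dict used as storage, and the returned dictionary are hypothetical and do not reproduce Lhotse's code or its Features manifest.

```python
from typing import Callable, Dict, List, Optional

Samples = List[float]
AugmentFn = Callable[[Samples, int], Samples]


def halve_gain(samples: Samples, sampling_rate: int) -> Samples:
    # Hypothetical augmentation with the (samples, sampling_rate) -> samples shape.
    return [0.5 * s for s in samples]


def extract_and_store(
    samples: Samples,
    sampling_rate: int,
    storage: Dict[str, List[float]],
    key: str,
    augment_fn: Optional[AugmentFn] = None,
) -> dict:
    # 1. Audio loading is skipped here: `samples` is assumed to be loaded already.
    # 2. Optionally augment the waveform.
    if augment_fn is not None:
        samples = augment_fn(samples, sampling_rate)
    # 3. "Extract" features: the per-frame peak amplitude (a toy choice).
    frame = 4
    feats = [
        max(abs(s) for s in samples[i:i + frame])
        for i in range(0, len(samples) - frame + 1, frame)
    ]
    # 4. Store them under a key (a dict stands in for the on-disk backend).
    storage[key] = feats
    # 5. Return a small description of what was stored.
    return {"storage_key": key, "num_frames": len(feats), "sampling_rate": sampling_rate}


store: Dict[str, List[float]] = {}
manifest = extract_and_store([1.0] * 8, 16000, store, "rec-1", augment_fn=halve_gain)
```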
- extract_from_samples_and_store(samples, storage, sampling_rate, offset=0, channel=None, augment_fn=None)
Extract the features from an array of audio samples in a full pipeline:
optional audio augmentation;
extract the features;
save them to disk in a specified directory;
return a Features object with a description of the extracted features.
Note, unlike in extract_from_recording_and_store, the returned Features object might not be suitable to store in a FeatureSet, as it does not reference any particular Recording. Instead, this method is useful when extracting features from cuts - especially MixedCut instances, which may be created from multiple recordings and channels.
- Parameters:
samples (ndarray) – a numpy ndarray with the audio samples.
sampling_rate (int) – integer sampling rate of samples.
storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices. When the optional lilcom dependency is installed and on-disk size matters, LilcomChunkyWriter is the preferred backend.
offset (float) – an offset in seconds for where to start reading the recording - when used for Cut feature extraction, must be equal to Cut.start.
channel (Union[int, List[int], None]) – an optional channel number(s) to insert into Features manifest.
augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.
- Return type:
Features
- Returns:
a Features manifest item for the extracted feature matrix (it is not written to disk).
- property frame_shift: float
- classmethod from_dict(data)
- Return type:
- classmethod from_yaml(path)
- Return type:
- to_dict()
- Return type:
Dict[str,Any]
- to_yaml(path)
- class lhotse.features.mfcc.TorchaudioMfccConfig(dither=0.0, window_type='povey', frame_length=0.025, frame_shift=0.01, remove_dc_offset=True, round_to_power_of_two=True, energy_floor=1e-10, min_duration=0.0, preemphasis_coefficient=0.97, raw_energy=True, low_freq=20.0, high_freq=-400.0, num_mel_bins=23, use_energy=False, vtln_low=100.0, vtln_high=-500.0, vtln_warp=1.0, cepstral_lifter=22.0, num_ceps=13)[source]
-
dither:
float= 0.0
-
window_type:
str= 'povey'
-
frame_length:
float= 0.025
-
frame_shift:
float= 0.01
-
remove_dc_offset:
bool= True
-
round_to_power_of_two:
bool= True
-
energy_floor:
float= 1e-10
-
min_duration:
float= 0.0
-
preemphasis_coefficient:
float= 0.97
-
raw_energy:
bool= True
-
low_freq:
float= 20.0
-
high_freq:
float= -400.0
-
num_mel_bins:
int= 23
-
use_energy:
bool= False
-
vtln_low:
float= 100.0
-
vtln_high:
float= -500.0
-
vtln_warp:
float= 1.0
-
cepstral_lifter:
float= 22.0
-
num_ceps:
int= 13
- __init__(dither=0.0, window_type='povey', frame_length=0.025, frame_shift=0.01, remove_dc_offset=True, round_to_power_of_two=True, energy_floor=1e-10, min_duration=0.0, preemphasis_coefficient=0.97, raw_energy=True, low_freq=20.0, high_freq=-400.0, num_mel_bins=23, use_energy=False, vtln_low=100.0, vtln_high=-500.0, vtln_warp=1.0, cepstral_lifter=22.0, num_ceps=13)
- class lhotse.features.mfcc.TorchaudioMfcc(config=None)[source]
MFCC feature extractor based on the torchaudio.compliance.kaldi.mfcc function.
- name = 'mfcc'
- config_type
alias of TorchaudioMfccConfig
- __init__(config=None)
- static compute_energy(features)
Compute the total energy of a feature matrix. How the energy is computed depends on a particular type of features. It is expected that when implemented, compute_energy will never return zero.
- Parameters:
features (ndarray) – A feature matrix.
- Return type:
float
- Returns:
A positive float value of the signal energy.
- property device: str | device
- extract(samples, sampling_rate)
Defines how to extract features using a numpy ndarray of audio samples and the sampling rate.
- Return type:
ndarray
- Returns:
a numpy ndarray representing the feature matrix.
- extract_batch(samples, sampling_rate, lengths=None)
Performs batch extraction. It is not guaranteed to be faster than FeatureExtractor.extract(); that depends on whether the implementation of a particular feature extractor supports accelerated batch computation. If lengths is provided, it is assumed that the input is a batch of padded sequences, so we will not perform any further collation.
Note: Unless overridden by child classes, it defaults to sequentially calling FeatureExtractor.extract() on the inputs.
Note: This method should support variable length inputs.
- Return type:
Union[ndarray, Tensor, List[ndarray], List[Tensor]]
- extract_from_recording_and_store(recording, storage, offset=0, duration=None, channels=None, augment_fn=None)
Extract the features from a Recording in a full pipeline:
load audio from disk;
optionally, perform audio augmentation;
extract the features;
save them to disk in a specified directory;
return a Features object with a description of the extracted features and the source data used.
- Parameters:
recording (Recording) – a Recording that specifies the input audio.
storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices. When the optional lilcom dependency is installed and on-disk size matters, LilcomChunkyWriter is the preferred backend.
offset (float) – an optional offset in seconds for where to start reading the recording.
duration (Optional[float]) – an optional duration specifying how much audio to load from the recording.
channels (Union[int, List[int], None]) – an optional int or list of ints, specifying the channels; by default, all channels will be used.
augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.
- Return type:
Features
- Returns:
a Features manifest item for the extracted feature matrix.
- extract_from_samples_and_store(samples, storage, sampling_rate, offset=0, channel=None, augment_fn=None)
Extract the features from an array of audio samples in a full pipeline:
optional audio augmentation;
extract the features;
save them to disk in a specified directory;
return a Features object with a description of the extracted features.
Note, unlike in extract_from_recording_and_store, the returned Features object might not be suitable to store in a FeatureSet, as it does not reference any particular Recording. Instead, this method is useful when extracting features from cuts - especially MixedCut instances, which may be created from multiple recordings and channels.
- Parameters:
samples (ndarray) – a numpy ndarray with the audio samples.
sampling_rate (int) – integer sampling rate of samples.
storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices. When the optional lilcom dependency is installed and on-disk size matters, LilcomChunkyWriter is the preferred backend.
offset (float) – an offset in seconds for where to start reading the recording - when used for Cut feature extraction, must be equal to Cut.start.
channel (Union[int, List[int], None]) – an optional channel number(s) to insert into Features manifest.
augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.
- Return type:
Features
- Returns:
a Features manifest item for the extracted feature matrix (it is not written to disk).
- property frame_shift: float
- classmethod from_dict(data)
- Return type:
- classmethod from_yaml(path)
- Return type:
- static mix(features_a, features_b, energy_scaling_factor_b)
Perform feature-domain mix of two signals, a and b, and return the mixed signal.
- Parameters:
features_a (ndarray) – Left-hand side (reference) signal.
features_b (ndarray) – Right-hand side (mixed-in) signal.
energy_scaling_factor_b (float) – A scaling factor for features_b energy. It is used to achieve a specific SNR. E.g. to mix with an SNR of 10 dB when both features_a and features_b energies are 100, the features_b signal energy needs to be scaled by 0.1. Since different features (e.g. spectrogram, fbank, MFCC) require different combinations of transformations (e.g. exp, log, sqrt, pow) to allow mixing of two signals, the exact place where energy_scaling_factor_b is applied to the signal is determined by the implementer.
- Return type:
ndarray
- Returns:
A mixed feature matrix.
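The SNR arithmetic in the description of energy_scaling_factor_b can be checked with a short sketch. The helper name `energy_scaling_factor` is hypothetical; it only restates the standard decibel relation and is not taken from Lhotse.

```python
def energy_scaling_factor(energy_a: float, energy_b: float, snr_db: float) -> float:
    """Factor by which to scale signal b's energy so that a over b equals snr_db."""
    # SNR in dB is 10 * log10(energy_a / energy_b), so the target energy of b is:
    target_energy_b = energy_a / (10.0 ** (snr_db / 10.0))
    return target_energy_b / energy_b


# Both energies are 100 and the desired SNR is 10 dB, as in the example above.
factor = energy_scaling_factor(energy_a=100.0, energy_b=100.0, snr_db=10.0)
```

With these inputs the target energy of b is 10, which is one tenth of its original energy of 100.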
- static scale(features, energy_scaling_factor)
Scale a single feature matrix by the provided energy factor.
- Parameters:
features (ndarray) – A feature matrix.
energy_scaling_factor (float) – The energy scaling factor to apply.
- Return type:
ndarray
- Returns:
A scaled feature matrix.
- to_dict()
- Return type:
Dict[str,Any]
- to_yaml(path)
- class lhotse.features.spectrogram.TorchaudioSpectrogramConfig(dither=0.0, window_type='povey', frame_length=0.025, frame_shift=0.01, remove_dc_offset=True, round_to_power_of_two=True, energy_floor=1e-10, min_duration=0.0, preemphasis_coefficient=0.97, raw_energy=True)[source]
-
dither:
float= 0.0
-
window_type:
str= 'povey'
-
frame_length:
float= 0.025
-
frame_shift:
float= 0.01
-
remove_dc_offset:
bool= True
-
round_to_power_of_two:
bool= True
-
energy_floor:
float= 1e-10
-
min_duration:
float= 0.0
-
preemphasis_coefficient:
float= 0.97
-
raw_energy:
bool= True
- __init__(dither=0.0, window_type='povey', frame_length=0.025, frame_shift=0.01, remove_dc_offset=True, round_to_power_of_two=True, energy_floor=1e-10, min_duration=0.0, preemphasis_coefficient=0.97, raw_energy=True)
- class lhotse.features.spectrogram.TorchaudioSpectrogram(config=None)[source]
Log spectrogram feature extractor based on the torchaudio.compliance.kaldi.spectrogram function.
- name = 'spectrogram'
- config_type
alias of TorchaudioSpectrogramConfig
- static mix(features_a, features_b, energy_scaling_factor_b)[source]
Perform feature-domain mix of two signals, a and b, and return the mixed signal.
- Parameters:
features_a (ndarray) – Left-hand side (reference) signal.
features_b (ndarray) – Right-hand side (mixed-in) signal.
energy_scaling_factor_b (float) – A scaling factor for features_b energy. It is used to achieve a specific SNR. E.g. to mix with an SNR of 10 dB when both features_a and features_b energies are 100, the features_b signal energy needs to be scaled by 0.1. Since different features (e.g. spectrogram, fbank, MFCC) require different combinations of transformations (e.g. exp, log, sqrt, pow) to allow mixing of two signals, the exact place where energy_scaling_factor_b is applied to the signal is determined by the implementer.
- Return type:
ndarray
- Returns:
A mixed feature matrix.
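For log-domain features, one plausible placement of the scaling factor is to leave the log domain, add the energies, and take the log again. The sketch below assumes natural-log energies and a floor of 1e-10; both are assumptions for illustration, and `mix_log_energies` is a hypothetical helper working on flat lists rather than feature matrices.

```python
import math
from typing import List

EPS = 1e-10  # assumed floor that keeps the logarithm finite


def mix_log_energies(
    a: List[float], b: List[float], energy_scaling_factor_b: float
) -> List[float]:
    # Undo the log, add the energies linearly (b is scaled first),
    # then re-apply the log to return to the feature domain.
    return [
        math.log(max(EPS, math.exp(x) + energy_scaling_factor_b * math.exp(y)))
        for x, y in zip(a, b)
    ]


log_a = [math.log(100.0), math.log(50.0)]
log_b = [math.log(100.0), math.log(200.0)]
mixed = mix_log_energies(log_a, log_b, energy_scaling_factor_b=0.1)
```

Applying the factor before the exponentiation instead would shift the log values rather than scale the energy, which is why the placement is left to each extractor.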
- static compute_energy(features)[source]
Compute the total energy of a feature matrix. How the energy is computed depends on a particular type of features. It is expected that when implemented, compute_energy will never return zero.
- Parameters:
features (ndarray) – A feature matrix.
- Return type:
float
- Returns:
A positive float value of the signal energy.
- static scale(features, energy_scaling_factor)[source]
Scale a single feature matrix by the provided energy factor.
- Parameters:
features (ndarray) – A feature matrix.
energy_scaling_factor (float) – The energy scaling factor to apply.
- Return type:
ndarray
- Returns:
A scaled feature matrix.
- __init__(config=None)
- property device: str | device
- extract(samples, sampling_rate)
Defines how to extract features using a numpy ndarray of audio samples and the sampling rate.
- Return type:
ndarray
- Returns:
a numpy ndarray representing the feature matrix.
- extract_batch(samples, sampling_rate, lengths=None)
Performs batch extraction. It is not guaranteed to be faster than FeatureExtractor.extract(); that depends on whether the implementation of a particular feature extractor supports accelerated batch computation. If lengths is provided, it is assumed that the input is a batch of padded sequences, so we will not perform any further collation.
Note: Unless overridden by child classes, it defaults to sequentially calling FeatureExtractor.extract() on the inputs.
Note: This method should support variable length inputs.
- Return type:
Union[ndarray, Tensor, List[ndarray], List[Tensor]]
- extract_from_recording_and_store(recording, storage, offset=0, duration=None, channels=None, augment_fn=None)
Extract the features from a Recording in a full pipeline:
load audio from disk;
optionally, perform audio augmentation;
extract the features;
save them to disk in a specified directory;
return a Features object with a description of the extracted features and the source data used.
- Parameters:
recording (Recording) – a Recording that specifies the input audio.
storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices. When the optional lilcom dependency is installed and on-disk size matters, LilcomChunkyWriter is the preferred backend.
offset (float) – an optional offset in seconds for where to start reading the recording.
duration (Optional[float]) – an optional duration specifying how much audio to load from the recording.
channels (Union[int, List[int], None]) – an optional int or list of ints, specifying the channels; by default, all channels will be used.
augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.
- Return type:
Features
- Returns:
a Features manifest item for the extracted feature matrix.
- extract_from_samples_and_store(samples, storage, sampling_rate, offset=0, channel=None, augment_fn=None)
Extract the features from an array of audio samples in a full pipeline:
optional audio augmentation;
extract the features;
save them to disk in a specified directory;
return a Features object with a description of the extracted features.
Note, unlike in extract_from_recording_and_store, the returned Features object might not be suitable to store in a FeatureSet, as it does not reference any particular Recording. Instead, this method is useful when extracting features from cuts - especially MixedCut instances, which may be created from multiple recordings and channels.
- Parameters:
samples (ndarray) – a numpy ndarray with the audio samples.
sampling_rate (int) – integer sampling rate of samples.
storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices. When the optional lilcom dependency is installed and on-disk size matters, LilcomChunkyWriter is the preferred backend.
offset (float) – an offset in seconds for where to start reading the recording - when used for Cut feature extraction, must be equal to Cut.start.
channel (Union[int, List[int], None]) – an optional channel number(s) to insert into Features manifest.
augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.
- Return type:
Features
- Returns:
a Features manifest item for the extracted feature matrix (it is not written to disk).
- property frame_shift: float
- classmethod from_dict(data)
- Return type:
- classmethod from_yaml(path)
- Return type:
- to_dict()
- Return type:
Dict[str,Any]
- to_yaml(path)
Librosa filter-bank
- class lhotse.features.librosa_fbank.LibrosaFbankConfig(sampling_rate=22050, fft_size=1024, hop_size=256, win_length=None, window='hann', num_mel_bins=80, fmin=80, fmax=7600)[source]
Default librosa config with values consistent with various TTS projects.
This config is intended for use with popular TTS projects such as [ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN).
Warning: You may need to normalize your features.
-
sampling_rate:
int= 22050
-
fft_size:
int= 1024
-
hop_size:
int= 256
-
win_length:
int= None
-
window:
str= 'hann'
-
num_mel_bins:
int= 80
-
fmin:
int= 80
-
fmax:
int= 7600
- __init__(sampling_rate=22050, fft_size=1024, hop_size=256, win_length=None, window='hann', num_mel_bins=80, fmin=80, fmax=7600)
- lhotse.features.librosa_fbank.pad_or_truncate_features(feats, expected_num_frames, abs_tol=1, pad_value=-23.025850929940457)[source]
- lhotse.features.librosa_fbank.logmelfilterbank(audio, sampling_rate, fft_size=1024, hop_size=256, win_length=None, window='hann', num_mel_bins=80, fmin=80, fmax=7600, eps=1e-10)[source]
Compute log-Mel filterbank feature.
- Args:
audio (ndarray): Audio signal (T,).
sampling_rate (int): Sampling rate.
fft_size (int): FFT size.
hop_size (int): Hop size.
win_length (int): Window length. If set to None, it will be the same as fft_size.
window (str): Window function type.
num_mel_bins (int): Number of mel basis.
fmin (int): Minimum frequency in mel basis calculation.
fmax (int): Maximum frequency in mel basis calculation.
eps (float): Epsilon value to avoid inf in log calculation.
- Returns:
ndarray: Log Mel filterbank feature (#source_feats, num_mel_bins).
- class lhotse.features.librosa_fbank.LibrosaFbank(config=None)[source]
Librosa fbank feature extractor.
Differs from the Fbank extractor in that it uses the librosa backend for STFT and mel scale calculations. It can be easily configured to be compatible with existing speech-related projects that use librosa features.
- name = 'librosa-fbank'
- config_type
alias of LibrosaFbankConfig
- property frame_shift: float
- extract(samples, sampling_rate)[source]
Defines how to extract features using a numpy ndarray of audio samples and the sampling rate.
- Return type:
ndarray
- Returns:
a numpy ndarray representing the feature matrix.
- static mix(features_a, features_b, energy_scaling_factor_b)[source]
Perform feature-domain mix of two signals, a and b, and return the mixed signal.
- Parameters:
features_a (ndarray) – Left-hand side (reference) signal.
features_b (ndarray) – Right-hand side (mixed-in) signal.
energy_scaling_factor_b (float) – A scaling factor for features_b energy. It is used to achieve a specific SNR. E.g. to mix with an SNR of 10 dB when both features_a and features_b energies are 100, the features_b signal energy needs to be scaled by 0.1. Since different features (e.g. spectrogram, fbank, MFCC) require different combinations of transformations (e.g. exp, log, sqrt, pow) to allow mixing of two signals, the exact place where energy_scaling_factor_b is applied to the signal is determined by the implementer.
- Return type:
ndarray
- Returns:
A mixed feature matrix.
- static compute_energy(features)[source]
Compute the total energy of a feature matrix. How the energy is computed depends on a particular type of features. It is expected that when implemented, compute_energy will never return zero.
- Parameters:
features (ndarray) – A feature matrix.
- Return type:
float
- Returns:
A positive float value of the signal energy.
- static scale(features, energy_scaling_factor)[source]
Scale a single feature matrix by the provided energy factor.
- Parameters:
features (ndarray) – A feature matrix.
energy_scaling_factor (float) – The energy scaling factor to apply.
- Return type:
ndarray
- Returns:
A scaled feature matrix.
- __init__(config=None)
- property device: str | device
- extract_batch(samples, sampling_rate, lengths=None)
Performs batch extraction. It is not guaranteed to be faster than FeatureExtractor.extract(); that depends on whether the implementation of a particular feature extractor supports accelerated batch computation. If lengths is provided, it is assumed that the input is a batch of padded sequences, so we will not perform any further collation.
Note: Unless overridden by child classes, it defaults to sequentially calling FeatureExtractor.extract() on the inputs.
Note: This method should support variable length inputs.
- Return type:
Union[ndarray, Tensor, List[ndarray], List[Tensor]]
- extract_from_recording_and_store(recording, storage, offset=0, duration=None, channels=None, augment_fn=None)
Extract the features from a Recording in a full pipeline:
load audio from disk;
optionally, perform audio augmentation;
extract the features;
save them to disk in a specified directory;
return a Features object with a description of the extracted features and the source data used.
- Parameters:
recording (Recording) – a Recording that specifies the input audio.
storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices. When the optional lilcom dependency is installed and on-disk size matters, LilcomChunkyWriter is the preferred backend.
offset (float) – an optional offset in seconds for where to start reading the recording.
duration (Optional[float]) – an optional duration specifying how much audio to load from the recording.
channels (Union[int, List[int], None]) – an optional int or list of ints, specifying the channels; by default, all channels will be used.
augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.
- Return type:
Features
- Returns:
a Features manifest item for the extracted feature matrix.
- extract_from_samples_and_store(samples, storage, sampling_rate, offset=0, channel=None, augment_fn=None)
Extract the features from an array of audio samples in a full pipeline:
optional audio augmentation;
extract the features;
save them to disk in a specified directory;
return a Features object with a description of the extracted features.
Note, unlike in extract_from_recording_and_store, the returned Features object might not be suitable to store in a FeatureSet, as it does not reference any particular Recording. Instead, this method is useful when extracting features from cuts - especially MixedCut instances, which may be created from multiple recordings and channels.
- Parameters:
samples (ndarray) – a numpy ndarray with the audio samples.
sampling_rate (int) – integer sampling rate of samples.
storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices. When the optional lilcom dependency is installed and on-disk size matters, LilcomChunkyWriter is the preferred backend.
offset (float) – an offset in seconds for where to start reading the recording - when used for Cut feature extraction, must be equal to Cut.start.
channel (Union[int, List[int], None]) – an optional channel number(s) to insert into Features manifest.
augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.
- Return type:
Features
- Returns:
a Features manifest item for the extracted feature matrix (it is not written to disk).
- classmethod from_dict(data)
- Return type:
- classmethod from_yaml(path)
- Return type:
- to_dict()
- Return type:
Dict[str,Any]
- to_yaml(path)
Feature storage
- class lhotse.features.io.FeaturesWriter[source]
FeaturesWriter defines the interface of how to store numpy arrays in a particular storage backend. This backend could either be:
separate files on a local filesystem;
a single file with multiple arrays;
cloud storage;
etc.
Each class inheriting from FeaturesWriter must define:
the write() method, which defines the storing operation (accepts a key used to place the value array in the storage);
the storage_path() property, which is either a common directory for the files, the name of the file storing multiple arrays, name of the cloud bucket, etc.
the name() property that is unique to this particular storage mechanism - it is stored in the features manifests (metadata) and used to automatically deduce the backend when loading the features.
Each FeaturesWriter can also be used as a context manager, as some implementations might need to free a resource after the writing is finalized. By default nothing happens in the context manager functions, and this can be modified by the inheriting subclasses.
Example:
>>> with MyWriter('some/path') as storage:
...     extractor.extract_from_recording_and_store(recording, storage)
The features loading must be defined separately in a class inheriting from FeaturesReader.
- abstract property name: str
- abstract property storage_path: str
- store_array(key, value, frame_shift=None, temporal_dim=None, start=0)[source]
Store a numpy array in the underlying storage and return a manifest describing how to retrieve the data.
If the array contains a temporal dimension (e.g. it represents the frame-level features, alignment, posteriors, etc. of an utterance) then temporal_dim and frame_shift may be specified to enable downstream padding, truncating, and partial reads of the array.
- Parameters:
key (str) – An ID that uniquely identifies the array.
value (ndarray) – The array to be stored.
frame_shift (Optional[float]) – Optional float, when the array has a temporal dimension it indicates how much time has passed between the starts of consecutive frames (expressed in seconds).
temporal_dim (Optional[int]) – Optional int, when the array has a temporal dimension, it indicates which dim to interpret as temporal.
start (float) – Float, when the array is temporal, it indicates what is the offset of the array w.r.t. the start of recording. Useful for reading subsets of an array when it represents something computed from long recordings. Ignored for non-temporal arrays.
- Return type:
Union[Array, TemporalArray]
- Returns:
A manifest of type Array or TemporalArray, depending on the input arguments.
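The three members a writer has to define, together with the optional context-manager hooks, can be sketched with a toy class. `InMemoryWriter` and its backend name are hypothetical; to stay self-contained the class does not subclass Lhotse's FeaturesWriter, so it illustrates the shape of the interface only.

```python
from typing import Dict, List


class InMemoryWriter:
    """Hypothetical writer sketch; it does not subclass Lhotse's FeaturesWriter."""

    name = "in_memory_sketch"  # made-up backend name, unique to this mechanism

    def __init__(self, storage_path: str):
        self._storage_path = storage_path
        self.arrays: Dict[str, List[float]] = {}
        self.closed = False

    @property
    def storage_path(self) -> str:
        # A directory, a file name, or a bucket name, depending on the backend.
        return self._storage_path

    def write(self, key: str, value: List[float]) -> str:
        # Place the value in the storage under the key and return that key.
        self.arrays[key] = list(value)
        return key

    def __enter__(self) -> "InMemoryWriter":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # A real backend might flush buffers or close a file handle here.
        self.closed = True


with InMemoryWriter("some/path") as writer:
    stored_key = writer.write("utt-1", [0.1, 0.2, 0.3])
```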
- class lhotse.features.io.FeaturesReader[source]
FeaturesReader defines the interface of how to load numpy arrays from a particular storage backend. This backend could either be:
separate files on a local filesystem;
a single file with multiple arrays;
cloud storage;
etc.
Each class inheriting from FeaturesReader must define:
the read() method, which defines the loading operation (accepts the key to locate the array in the storage and return it). The read method should support selecting only a subset of the feature matrix, with the bounds expressed as arguments left_offset_frames and right_offset_frames. It's up to the Reader implementation to load only the required part or trim it to that range only after loading. It is assumed that the time dimension is always the first one.
the name() property that is unique to this particular storage mechanism - it is stored in the features manifests (metadata) and used to automatically deduce the backend when loading the features.
The features writing must be defined separately in a class inheriting from FeaturesWriter.
- abstract property name: str
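The "trim after loading" option for read() can be sketched as below. `InMemoryReader` is a hypothetical class, not part of Lhotse, and treating right_offset_frames as an exclusive end index is an assumption made for this illustration.

```python
from typing import Dict, List, Optional

Matrix = List[List[float]]  # rows are frames, so time is the first dimension


class InMemoryReader:
    """Hypothetical reader sketch; not a Lhotse class."""

    name = "in_memory_sketch"  # made-up backend name

    def __init__(self, arrays: Dict[str, Matrix]):
        self.arrays = arrays

    def read(
        self,
        key: str,
        left_offset_frames: int = 0,
        right_offset_frames: Optional[int] = None,
    ) -> Matrix:
        # Load the whole matrix, then trim along the time (first) dimension.
        # A smarter backend could read only the requested rows instead.
        full = self.arrays[key]
        return full[left_offset_frames:right_offset_frames]


reader = InMemoryReader({"utt-1": [[float(t), float(t) * 2] for t in range(10)]})
chunk = reader.read("utt-1", left_offset_frames=2, right_offset_frames=5)
whole = reader.read("utt-1")
```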
- class lhotse.features.io.StorageBackendInfo(name, available, install_hint)[source]
-
name:
str Alias for field number 0
-
available:
bool Alias for field number 1
-
install_hint:
Optional[str] Alias for field number 2
- count(value, /)
Return number of occurrences of value.
- index(value, start=0, stop=9223372036854775807, /)
Return first index of value.
Raises ValueError if the value is not present.
- lhotse.features.io.available_storage_backends()[source]
Return the names of all currently available feature/array storage backends.
The result depends on optional dependencies installed in the environment. To inspect all known backends together with availability status and install hints, call storage_backend_statuses() or run lhotse list-storage-backends.
- Return type:
List[str]
- lhotse.features.io.storage_backend_statuses()[source]
Return status information for all known feature/array storage backends.
Unavailable backends include a short install hint when one is known. For a CLI equivalent, run lhotse list-storage-backends.
- Return type:
List[StorageBackendInfo]
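A list of (name, available, install_hint) tuples such as the one described above might be consumed as in the following sketch. `BackendInfo` is a local stand-in for StorageBackendInfo, and the two status entries, including the install hint string, are made up for illustration; real values depend on the environment.

```python
from typing import List, NamedTuple, Optional


class BackendInfo(NamedTuple):
    """Stand-in mirroring the (name, available, install_hint) fields."""

    name: str
    available: bool
    install_hint: Optional[str]


# Made-up statuses for illustration only.
statuses: List[BackendInfo] = [
    BackendInfo("numpy_files", True, None),
    BackendInfo("lilcom_files", False, "pip install lilcom"),
]

# Names of usable backends, and hints for the ones that are missing.
available_names = [s.name for s in statuses if s.available]
hints = {s.name: s.install_hint for s in statuses if not s.available}
```

Because the record is a named tuple, fields can also be read by position, which is what the "Alias for field number" entries refer to.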
- lhotse.features.io.default_features_storage_backend()[source]
- Return type:
Type[FeaturesWriter]
- lhotse.features.io.register_reader(cls)[source]
Decorator used to add a new FeaturesReader to Lhotse's registry.
Example:
@register_reader
class MyFeatureReader(FeaturesReader):
    ...
- lhotse.features.io.register_writer(cls)[source]
Decorator used to add a new FeaturesWriter to Lhotse's registry.
Example:
@register_writer
class MyFeatureWriter(FeaturesWriter):
    ...
- lhotse.features.io.get_reader(name)[source]
Find a FeaturesReader sub-class that corresponds to the provided name and return its type.
Example:
reader_type = get_reader("lilcom_files")
reader = reader_type("/storage/features/")
- Return type:
Type[FeaturesReader]
- lhotse.features.io.get_writer(name)[source]
Find a FeaturesWriter sub-class that corresponds to the provided name and return its type.
Example:
writer_type = get_writer("lilcom_files")
writer = writer_type("/storage/features/")
- Return type:
Type[FeaturesWriter]
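The register-then-look-up pattern behind these functions can be sketched with a plain dictionary keyed by each class's name attribute. The names `WRITERS`, `register_writer_sketch`, `get_writer_sketch`, and `MyWriter` are hypothetical; how Lhotse stores its registry internally is not described here.

```python
from typing import Dict

WRITERS: Dict[str, type] = {}  # hypothetical registry, keyed by each class's `name`


def register_writer_sketch(cls: type) -> type:
    # Record the class under its `name` and return it unchanged,
    # which is what lets the function be used as a class decorator.
    WRITERS[cls.name] = cls
    return cls


def get_writer_sketch(name: str) -> type:
    if name not in WRITERS:
        raise ValueError(f"No writer registered under name: {name!r}")
    return WRITERS[name]


@register_writer_sketch
class MyWriter:
    name = "my_backend"  # made-up backend name

    def __init__(self, storage_path: str):
        self.storage_path = storage_path


writer_type = get_writer_sketch("my_backend")
writer = writer_type("/storage/features/")
```

Storing the name in the features manifest is what allows the matching reader or writer type to be found again at load time.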
- class lhotse.features.io.FileIO(storage_path)[source]
Helper util for opening a file object for reading or writing in a directory on the local filesystem, or at a URL to a supported object store (S3, AIStore, etc.).
storage_path corresponds to the directory path or base URL prefix; storage_key for each utterance is the name of the file in that directory.
- open_fileobj(key, mode, add_subdir=False)[source]
Open a file for reading or writing on local disk or at a URL to an object store. Arg "key" should contain the extension for the file. Mode is either "r" or "w". Arg "add_subdir" can be set to True, in which case on the local filesystem it will create an extra subdirectory of self.storage_path named with the first three letters of key, preventing big datasets from exhausting the filesystem with one big directory. This arg is ignored for URLs.
Yields a tuple of (open_file_object, path_or_url).
- Return type:
Generator[tuple, None, None]
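The effect of add_subdir on a local path can be sketched as follows. `local_path_for_key` is a hypothetical helper and the key with its `.llc` extension is an arbitrary example; the sketch only computes the path and does not open or create anything.

```python
from pathlib import PurePosixPath


def local_path_for_key(storage_path: str, key: str, add_subdir: bool = False) -> str:
    base = PurePosixPath(storage_path)
    if add_subdir:
        # The first three letters of the key name an extra subdirectory,
        # which spreads a large dataset over many smaller directories.
        base = base / key[:3]
    return str(base / key)


flat = local_path_for_key("/storage/features", "abc12345.llc")
sharded = local_path_for_key("/storage/features", "abc12345.llc", add_subdir=True)
```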
- class lhotse.features.io.LilcomFilesReader(storage_path, *args, **kwargs)[source]
Reads Lilcom-compressed files from a directory on the local filesystem, or a URL to supported object store (S3, AIStore, etc.).
storage_path corresponds to the directory path; storage_key for each utterance is the name of the file in that directory.
- name = 'lilcom_files'
- class lhotse.features.io.LilcomFilesWriter(storage_path, tick_power=-5, *args, **kwargs)[source]
Writes Lilcom-compressed files to a directory on the local filesystem, or a URL to supported object store (S3, AIStore, etc.).
storage_path corresponds to the directory path; storage_key for each utterance is the name of the file in that directory.
- name = 'lilcom_files'
- property storage_path: str
- store_array(key, value, frame_shift=None, temporal_dim=None, start=0)
Store a numpy array in the underlying storage and return a manifest describing how to retrieve the data.
If the array contains a temporal dimension (e.g. it represents the frame-level features, alignment, posteriors, etc. of an utterance) then temporal_dim and frame_shift may be specified to enable downstream padding, truncating, and partial reads of the array.
- Parameters:
key (str) – An ID that uniquely identifies the array.
value (ndarray) – The array to be stored.
frame_shift (Optional[float]) – Optional float, when the array has a temporal dimension it indicates how much time has passed between the starts of consecutive frames (expressed in seconds).
temporal_dim (Optional[int]) – Optional int, when the array has a temporal dimension, it indicates which dim to interpret as temporal.
start (float) – Float, when the array is temporal, it indicates what is the offset of the array w.r.t. the start of recording. Useful for reading subsets of an array when it represents something computed from long recordings. Ignored for non-temporal arrays.
- Return type:
Union[Array, TemporalArray]
- Returns:
A manifest of type Array or TemporalArray, depending on the input arguments.
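How frame_shift and start make partial reads possible can be illustrated by converting a time span into frame indices. `frame_range` is a hypothetical helper and the rounding rule is an assumption; Lhotse's own conversion may differ in detail.

```python
from typing import Tuple


def frame_range(
    start: float, frame_shift: float, read_from: float, read_to: float
) -> Tuple[int, int]:
    """Map an absolute time span (in seconds) to frame indices of a stored array.

    `start` is the array's offset w.r.t. the beginning of the recording.
    """
    first = round((read_from - start) / frame_shift)
    end = round((read_to - start) / frame_shift)
    # Clamp to zero in case the requested span begins before the array does.
    return max(first, 0), max(end, 0)


# An array that starts 5 s into the recording, with a 10 ms frame shift.
first_frame, end_frame = frame_range(
    start=5.0, frame_shift=0.01, read_from=6.0, read_to=7.5
)
```

Here the span from 6.0 s to 7.5 s of the recording corresponds to 1.0 s through 2.5 s of the stored array, i.e. frames 100 up to 250.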
- class lhotse.features.io.NumpyFilesReader(storage_path, *args, **kwargs)[source]
Reads non-compressed numpy arrays from files in a directory on the local filesystem, or a URL to supported object store (S3, AIStore, etc.).
storage_path corresponds to the directory path; storage_key for each utterance is the name of the file in that directory.
- name = 'numpy_files'
- class lhotse.features.io.NumpyFilesWriter(storage_path, *args, **kwargs)[source]
Writes non-compressed numpy arrays to files in a directory on the local filesystem, or a URL to supported object store (S3, AIStore, etc.).
storage_path corresponds to the directory path; storage_key for each utterance is the name of the file in that directory.
- name = 'numpy_files'
- property storage_path: str
- store_array(key, value, frame_shift=None, temporal_dim=None, start=0)
Store a numpy array in the underlying storage and return a manifest describing how to retrieve the data.
If the array contains a temporal dimension (e.g. it represents the frame-level features, alignment, posteriors, etc. of an utterance) then temporal_dim and frame_shift may be specified to enable downstream padding, truncating, and partial reads of the array.
- Parameters:
key (str) – An ID that uniquely identifies the array.
value (ndarray) – The array to be stored.
frame_shift (Optional[float]) – Optional float; when the array has a temporal dimension, it indicates how much time passes between the starts of consecutive frames (in seconds).
temporal_dim (Optional[int]) – Optional int; when the array has a temporal dimension, it indicates which dim to interpret as temporal.
start (float) – Float; when the array is temporal, it indicates the offset of the array w.r.t. the start of the recording. Useful for reading subsets of an array when it represents something computed from long recordings. Ignored for non-temporal arrays.
- Return type:
Union[Array, TemporalArray]
- Returns:
A manifest of type Array or TemporalArray, depending on the input arguments.
- lhotse.features.io.lookup_cache_or_open(storage_path)[source]
Internal helper function used in HDF5 readers. It opens the HDF5 files and keeps their handles open in a global program cache to avoid an excessive number of syscalls when the Reader class is repeatedly instantiated and destroyed in a loop (a frequent use case).
The file handles can be freed at any time by calling close_cached_file_handles().
- lhotse.features.io.lookup_chunk_size(h5_file_handle)[source]
Helper internal function to retrieve the chunk size from an HDF5 file. Helps avoid unnecessary repeated disk reads.
- Return type:
int
- lhotse.features.io.close_cached_file_handles()[source]
Closes the cached file handles in lookup_cache_or_open and lookup_reader_cache_or_open (see respective docs for more details).
- Return type:
None
- class lhotse.features.io.NumpyHdf5Reader(storage_path, *args, **kwargs)[source]
Reads non-compressed numpy arrays from an HDF5 file with a “flat” layout. Each array is stored as a separate HDF Dataset because their shapes (numbers of frames) may vary.
storage_path corresponds to the HDF5 file path; storage_key for each utterance is the key corresponding to the array (i.e. HDF5 “Group” name).
- name = 'numpy_hdf5'
- class lhotse.features.io.NumpyHdf5Writer(storage_path, mode='w', *args, **kwargs)[source]
Writes non-compressed numpy arrays to an HDF5 file with a “flat” layout. Each array is stored as a separate HDF Dataset because their shapes (numbers of frames) may vary.
storage_path corresponds to the HDF5 file path; storage_key for each utterance is the key corresponding to the array (i.e. HDF5 “Group” name).
Internally, this class opens the file lazily so that this object can be passed between processes without issues. This simplifies the parallel feature extraction code.
- name = 'numpy_hdf5'
- __init__(storage_path, mode='w', *args, **kwargs)[source]
- Parameters:
storage_path (Union[Path, str]) – Path under which we’ll create the HDF5 file. We will add a .h5 suffix if it is not already in storage_path.
mode (str) – Modes supported by h5py:
- w: Create file, truncate if exists (default)
- w- or x: Create file, fail if exists
- a: Read/write if exists, create otherwise
- property storage_path: str
- store_array(key, value, frame_shift=None, temporal_dim=None, start=0)
Store a numpy array in the underlying storage and return a manifest describing how to retrieve the data.
If the array contains a temporal dimension (e.g. it represents the frame-level features, alignment, posteriors, etc. of an utterance) then temporal_dim and frame_shift may be specified to enable downstream padding, truncating, and partial reads of the array.
- Parameters:
key (str) – An ID that uniquely identifies the array.
value (ndarray) – The array to be stored.
frame_shift (Optional[float]) – Optional float; when the array has a temporal dimension, it indicates how much time passes between the starts of consecutive frames (in seconds).
temporal_dim (Optional[int]) – Optional int; when the array has a temporal dimension, it indicates which dim to interpret as temporal.
start (float) – Float; when the array is temporal, it indicates the offset of the array w.r.t. the start of the recording. Useful for reading subsets of an array when it represents something computed from long recordings. Ignored for non-temporal arrays.
- Return type:
Union[Array, TemporalArray]
- Returns:
A manifest of type Array or TemporalArray, depending on the input arguments.
- class lhotse.features.io.LilcomHdf5Reader(storage_path, *args, **kwargs)[source]
Reads lilcom-compressed numpy arrays from an HDF5 file with a “flat” layout. Each array is stored as a separate HDF Dataset because their shapes (numbers of frames) may vary.
storage_path corresponds to the HDF5 file path; storage_key for each utterance is the key corresponding to the array (i.e. HDF5 “Group” name).
- name = 'lilcom_hdf5'
- class lhotse.features.io.LilcomHdf5Writer(storage_path, tick_power=-5, mode='w', *args, **kwargs)[source]
Writes lilcom-compressed numpy arrays to an HDF5 file with a “flat” layout. Each array is stored as a separate HDF Dataset because their shapes (numbers of frames) may vary.
storage_path corresponds to the HDF5 file path; storage_key for each utterance is the key corresponding to the array (i.e. HDF5 “Group” name).
- name = 'lilcom_hdf5'
- __init__(storage_path, tick_power=-5, mode='w', *args, **kwargs)[source]
- Parameters:
storage_path (Union[Path, str]) – Path under which we’ll create the HDF5 file. We will add a .h5 suffix if it is not already in storage_path.
tick_power (int) – Determines the lilcom compression accuracy; the input will be compressed to integer multiples of 2^tick_power.
mode (str) – Modes supported by h5py:
- w: Create file, truncate if exists (default)
- w- or x: Create file, fail if exists
- a: Read/write if exists, create otherwise
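To make the effect of tick_power concrete, the sketch below rounds a value to the nearest integer multiple of 2^tick_power. It only illustrates the quantization grid named in the parameter description; the actual lilcom codec does considerably more than this, and the quantize helper is hypothetical.

```python
def quantize(value, tick_power=-5):
    """Round ``value`` to the nearest integer multiple of 2**tick_power."""
    tick = 2.0 ** tick_power
    return round(value / tick) * tick


tick = 2.0 ** -5          # grid spacing for the default tick_power=-5
approx = quantize(0.1)    # nearest grid point to 0.1
max_error = tick / 2      # worst-case rounding error on this grid
```

A more negative tick_power gives a finer grid, and therefore a smaller worst-case error, at the cost of less compression.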
- property storage_path: str
- store_array(key, value, frame_shift=None, temporal_dim=None, start=0)
Store a numpy array in the underlying storage and return a manifest describing how to retrieve the data.
If the array contains a temporal dimension (e.g. it represents the frame-level features, alignment, posteriors, etc. of an utterance) then temporal_dim and frame_shift may be specified to enable downstream padding, truncating, and partial reads of the array.
- Parameters:
key (str) – An ID that uniquely identifies the array.
value (ndarray) – The array to be stored.
frame_shift (Optional[float]) – Optional float; when the array has a temporal dimension, it indicates how much time passes between the starts of consecutive frames (in seconds).
temporal_dim (Optional[int]) – Optional int; when the array has a temporal dimension, it indicates which dim to interpret as temporal.
start (float) – Float; when the array is temporal, it indicates the offset of the array w.r.t. the start of the recording. Useful for reading subsets of an array when it represents something computed from long recordings. Ignored for non-temporal arrays.
- Return type:
Union[Array, TemporalArray]
- Returns:
A manifest of type Array or TemporalArray, depending on the input arguments.
- class lhotse.features.io.ChunkedLilcomHdf5Reader(storage_path, *args, **kwargs)[source]
Reads lilcom-compressed numpy arrays from an HDF5 file with chunked lilcom storage. Each feature matrix is stored in an array of chunks - binary data compressed with lilcom. Upon reading, we check how many chunks need to be retrieved to avoid excessive I/O.
storage_path corresponds to the HDF5 file path; storage_key for each utterance is the key corresponding to the array (i.e. HDF5 “Group” name).
- name = 'chunked_lilcom_hdf5'
- class lhotse.features.io.ChunkedLilcomHdf5Writer(storage_path, tick_power=-5, chunk_size=100, mode='w', *args, **kwargs)[source]
Writes lilcom-compressed numpy arrays to an HDF5 file with chunked lilcom storage. Each feature matrix is stored in an array of chunks - binary data compressed with lilcom. Upon reading, we check how many chunks need to be retrieved to avoid excessive I/O.
storage_path corresponds to the HDF5 file path; storage_key for each utterance is the key corresponding to the array (i.e. HDF5 “Group” name).
- name = 'chunked_lilcom_hdf5'
- __init__(storage_path, tick_power=-5, chunk_size=100, mode='w', *args, **kwargs)[source]
- Parameters:
storage_path (Union[Path, str]) – Path under which we’ll create the HDF5 file. We will add a .h5 suffix if it is not already in storage_path.
tick_power (int) – Determines the lilcom compression accuracy; the input will be compressed to integer multiples of 2^tick_power.
chunk_size (int) – How many frames to store per chunk. Too low a number will require many reads for long feature matrices; too high a number will require reading more redundant data.
mode (str) – Modes supported by h5py:
- w: Create file, truncate if exists (default)
- w- or x: Create file, fail if exists
- a: Read/write if exists, create otherwise
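The chunk_size trade-off can be seen by computing which chunks cover a requested frame range. The helper below is a hypothetical sketch of that bookkeeping and assumes chunks hold consecutive, non-overlapping blocks of chunk_size frames; it is not taken from the lhotse source.

```python
import math


def chunks_to_read(left_frame, right_frame, chunk_size=100):
    """Indices of the chunks that cover frames [left_frame, right_frame)."""
    first_chunk = left_frame // chunk_size
    last_chunk = math.ceil(right_frame / chunk_size)  # exclusive upper bound
    return list(range(first_chunk, last_chunk))


# Reading frames 250..420 with the default chunk size touches three chunks,
# i.e. 300 frames are decoded to serve a 170-frame request.
coarse = chunks_to_read(250, 420, chunk_size=100)
# A small chunk size wastes fewer frames but needs many more reads.
fine = chunks_to_read(250, 420, chunk_size=10)
```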
- property storage_path: str
- store_array(key, value, frame_shift=None, temporal_dim=None, start=0)
Store a numpy array in the underlying storage and return a manifest describing how to retrieve the data.
If the array contains a temporal dimension (e.g. it represents the frame-level features, alignment, posteriors, etc. of an utterance) then temporal_dim and frame_shift may be specified to enable downstream padding, truncating, and partial reads of the array.
- Parameters:
key (str) – An ID that uniquely identifies the array.
value (ndarray) – The array to be stored.
frame_shift (Optional[float]) – Optional float; when the array has a temporal dimension, it indicates how much time passes between the starts of consecutive frames (in seconds).
temporal_dim (Optional[int]) – Optional int; when the array has a temporal dimension, it indicates which dim to interpret as temporal.
start (float) – Float; when the array is temporal, it indicates the offset of the array w.r.t. the start of the recording. Useful for reading subsets of an array when it represents something computed from long recordings. Ignored for non-temporal arrays.
- Return type:
Union[Array, TemporalArray]
- Returns:
A manifest of type Array or TemporalArray, depending on the input arguments.
- class lhotse.features.io.LilcomChunkyReader(storage_path, *args, **kwargs)[source]
Reads lilcom-compressed numpy arrays from a binary file with chunked lilcom storage. Each feature matrix is stored in an array of chunks - binary data compressed with lilcom. Upon reading, we check how many chunks need to be retrieved to avoid excessive I/O.
storage_path corresponds to the binary file path.
storage_key for each utterance is a comma-separated list of offsets in the file. The first number is the offset for the whole array, and the following numbers are relative offsets for each chunk. These offsets are relative to the previous chunk start.
- name = 'lilcom_chunky'
- CHUNK_SIZE = 500
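One way to read the storage_key layout described above is as a running sum: the first number locates the array in the file, and each following number advances from the previous chunk start to the next boundary. The sketch below decodes a made-up key under that interpretation; chunk_byte_ranges is a hypothetical helper, and the key value is invented for illustration.

```python
from itertools import accumulate


def chunk_byte_ranges(storage_key):
    """Turn 'array_offset,rel_1,rel_2,...' into absolute (begin, end) byte ranges."""
    numbers = [int(x) for x in storage_key.split(",")]
    # Running sum: the array offset, then each boundary relative to the previous one.
    boundaries = list(accumulate(numbers))
    return list(zip(boundaries[:-1], boundaries[1:]))


# Made-up key: the array starts at byte 1000 and has three chunks.
ranges = chunk_byte_ranges("1000,50,60,40")
```

With absolute ranges in hand, a reader can seek directly to the first chunk it needs and skip the rest of the array.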
- class lhotse.features.io.LilcomChunkyWriter(storage_path, tick_power=-5, mode='wb', *args, **kwargs)[source]
Writes lilcom-compressed numpy arrays to a binary file with chunked lilcom storage. Each feature matrix is stored in an array of chunks - binary data compressed with lilcom. Upon reading, we check how many chunks need to be retrieved to avoid excessive I/O.
storage_path corresponds to the binary file path.
storage_key for each utterance is a comma-separated list of offsets in the file. The first number is the offset for the whole array, and the following numbers are relative offsets for each chunk. These offsets are relative to the previous chunk start.
- name = 'lilcom_chunky'
- CHUNK_SIZE = 500
- __init__(storage_path, tick_power=-5, mode='wb', *args, **kwargs)[source]
- Parameters:
storage_path (Union[Path, str]) – Path under which we’ll create the binary file.
tick_power (int) – Determines the lilcom compression accuracy; the input will be compressed to integer multiples of 2^tick_power.
chunk_size – How many frames to store per chunk. Too low a number will require many reads for long feature matrices; too high a number will require reading more redundant data.
mode (str) – One of “w” (write) or “a” (append); “wb” and “ab” are also accepted, since “b” is implicit.
- property storage_path: str
- store_array(key, value, frame_shift=None, temporal_dim=None, start=0)
Store a numpy array in the underlying storage and return a manifest describing how to retrieve the data.
If the array contains a temporal dimension (e.g. it represents the frame-level features, alignment, posteriors, etc. of an utterance) then temporal_dim and frame_shift may be specified to enable downstream padding, truncating, and partial reads of the array.
- Parameters:
key (str) – An ID that uniquely identifies the array.
value (ndarray) – The array to be stored.
frame_shift (Optional[float]) – Optional float; when the array has a temporal dimension, it indicates how much time passes between the starts of consecutive frames (in seconds).
temporal_dim (Optional[int]) – Optional int; when the array has a temporal dimension, it indicates which dim to interpret as temporal.
start (float) – Float; when the array is temporal, it indicates the offset of the array w.r.t. the start of the recording. Useful for reading subsets of an array when it represents something computed from long recordings. Ignored for non-temporal arrays.
- Return type:
Union[Array, TemporalArray]
- Returns:
A manifest of type Array or TemporalArray, depending on the input arguments.
- class lhotse.features.io.LilcomURLReader(*args, **kwargs)[source]
Downloads Lilcom-compressed files from a URL (S3, GCP, Azure, HTTP, etc.).
storage_path corresponds to the root URL (e.g. “s3://my-data-bucket”).
storage_key will be concatenated to storage_path to form a full URL (e.g. “my-feature-file.llc”).
Caution
Requires smart_open to be installed (pip install smart_open).
- name = 'lilcom_url'
- class lhotse.features.io.LilcomURLWriter(*args, **kwargs)[source]
Writes Lilcom-compressed files to a URL (S3, GCP, Azure, HTTP, etc.).
storage_path corresponds to the root URL (e.g. “s3://my-data-bucket”).
storage_key will be concatenated to storage_path to form a full URL (e.g. “my-feature-file.llc”).
Caution
Requires smart_open to be installed (pip install smart_open).
- name = 'lilcom_url'
- property storage_path: str
- store_array(key, value, frame_shift=None, temporal_dim=None, start=0)
Store a numpy array in the underlying storage and return a manifest describing how to retrieve the data.
If the array contains a temporal dimension (e.g. it represents the frame-level features, alignment, posteriors, etc. of an utterance) then temporal_dim and frame_shift may be specified to enable downstream padding, truncating, and partial reads of the array.
- Parameters:
key (str) – An ID that uniquely identifies the array.
value (ndarray) – The array to be stored.
frame_shift (Optional[float]) – Optional float; when the array has a temporal dimension, it indicates how much time passes between the starts of consecutive frames (in seconds).
temporal_dim (Optional[int]) – Optional int; when the array has a temporal dimension, it indicates which dim to interpret as temporal.
start (float) – Float; when the array is temporal, it indicates the offset of the array w.r.t. the start of the recording. Useful for reading subsets of an array when it represents something computed from long recordings. Ignored for non-temporal arrays.
- Return type:
Union[Array, TemporalArray]
- Returns:
A manifest of type Array or TemporalArray, depending on the input arguments.
- lhotse.features.io.lookup_reader_cache_or_open(storage_path)[source]
Internal helper function used in KaldiReader. It opens Kaldi scp files and keeps their handles open in a global program cache to avoid an excessive number of syscalls when the Reader class is repeatedly instantiated and destroyed in a loop (a frequent use case).
The file handles can be freed at any time by calling close_cached_file_handles().
- class lhotse.features.io.KaldiReader(storage_path, *args, **kwargs)[source]
Reads Kaldi’s “feats.scp” file using kaldi_native_io.
storage_path corresponds to the path to feats.scp.
storage_key corresponds to the utterance-id in Kaldi.
Caution
Requires kaldi_native_io to be installed (pip install kaldi_native_io).
- name = 'kaldiio'
- class lhotse.features.io.KaldiWriter(storage_path, compression_method=1, *args, **kwargs)[source]
Writes data to Kaldi’s “feats.scp” and “feats.ark” files using kaldi_native_io.
storage_path corresponds to a directory where we’ll create “feats.scp” and “feats.ark” files.
storage_key corresponds to the utterance-id in Kaldi.
The following compression_method values are supported by kaldi_native_io:
kAutomaticMethod = 1
kSpeechFeature = 2
kTwoByteAuto = 3
kTwoByteSignedInteger = 4
kOneByteAuto = 5
kOneByteUnsignedInteger = 6
kOneByteZeroOne = 7
Note
Setting compression_method works only with 2D arrays.
Example:
>>> import numpy as np
>>> from lhotse.features.io import KaldiReader, KaldiWriter
>>> data = np.random.randn(131, 80)
>>> with KaldiWriter('featdir') as w:
...     w.write('utt1', data)
>>> reader = KaldiReader('featdir/feats.scp')
>>> read_data = reader.read('utt1')
>>> np.testing.assert_equal(data, read_data)
Caution
Requires kaldi_native_io to be installed (pip install kaldi_native_io).
- name = 'kaldiio'
- property storage_path: str
- store_array(key, value, frame_shift=None, temporal_dim=None, start=0)
Store a numpy array in the underlying storage and return a manifest describing how to retrieve the data.
If the array contains a temporal dimension (e.g. it represents the frame-level features, alignment, posteriors, etc. of an utterance) then temporal_dim and frame_shift may be specified to enable downstream padding, truncating, and partial reads of the array.
- Parameters:
key (str) – An ID that uniquely identifies the array.
value (ndarray) – The array to be stored.
frame_shift (Optional[float]) – Optional float; when the array has a temporal dimension, it indicates how much time passes between the starts of consecutive frames (in seconds).
temporal_dim (Optional[int]) – Optional int; when the array has a temporal dimension, it indicates which dim to interpret as temporal.
start (float) – Float; when the array is temporal, it indicates the offset of the array w.r.t. the start of the recording. Useful for reading subsets of an array when it represents something computed from long recordings. Ignored for non-temporal arrays.
- Return type:
Union[Array, TemporalArray]
- Returns:
A manifest of type Array or TemporalArray, depending on the input arguments.
- class lhotse.features.io.MemoryLilcomWriter(*args, lilcom_tick_power=-5, **kwargs)[source]
- name = 'memory_lilcom'
- property storage_path: None
- store_array(key, value, frame_shift=None, temporal_dim=None, start=0)
Store a numpy array in the underlying storage and return a manifest describing how to retrieve the data.
If the array contains a temporal dimension (e.g. it represents the frame-level features, alignment, posteriors, etc. of an utterance) then temporal_dim and frame_shift may be specified to enable downstream padding, truncating, and partial reads of the array.
- Parameters:
key (str) – An ID that uniquely identifies the array.
value (ndarray) – The array to be stored.
frame_shift (Optional[float]) – Optional float; when the array has a temporal dimension, it indicates how much time passes between the starts of consecutive frames (in seconds).
temporal_dim (Optional[int]) – Optional int; when the array has a temporal dimension, it indicates which dim to interpret as temporal.
start (float) – Float; when the array is temporal, it indicates the offset of the array w.r.t. the start of the recording. Useful for reading subsets of an array when it represents something computed from long recordings. Ignored for non-temporal arrays.
- Return type:
Union[Array, TemporalArray]
- Returns:
A manifest of type Array or TemporalArray, depending on the input arguments.
- class lhotse.features.io.MemoryRawWriter(*args, **kwargs)[source]
- name = 'memory_raw'
- property storage_path: None
- store_array(key, value, frame_shift=None, temporal_dim=None, start=0)
Store a numpy array in the underlying storage and return a manifest describing how to retrieve the data.
If the array contains a temporal dimension (e.g. it represents the frame-level features, alignment, posteriors, etc. of an utterance) then temporal_dim and frame_shift may be specified to enable downstream padding, truncating, and partial reads of the array.
- Parameters:
key (str) – An ID that uniquely identifies the array.
value (ndarray) – The array to be stored.
frame_shift (Optional[float]) – Optional float; when the array has a temporal dimension, it indicates how much time passes between the starts of consecutive frames (in seconds).
temporal_dim (Optional[int]) – Optional int; when the array has a temporal dimension, it indicates which dim to interpret as temporal.
start (float) – Float; when the array is temporal, it indicates the offset of the array w.r.t. the start of the recording. Useful for reading subsets of an array when it represents something computed from long recordings. Ignored for non-temporal arrays.
- Return type:
Union[Array, TemporalArray]
- Returns:
A manifest of type Array or TemporalArray, depending on the input arguments.
Feature-domain mixing
- class lhotse.features.mixer.FeatureMixer(feature_extractor, base_feats, frame_shift, padding_value=-1000.0, reference_energy=None)[source]
Utility class to mix multiple feature matrices into a single one. It should be instantiated separately for each mixing session (i.e. each MixedCut will create a separate FeatureMixer to mix its tracks). It is initialized with a numpy array of features (typically float32) that represents the “reference” signal for the mix. Other signals can be mixed into it with different time offsets and SNRs using the add_to_mix method. The time offset is relative to the start of the reference signal (only positive values are supported). The SNR is relative to the energy of the signal used to initialize the FeatureMixer.
It relies on the FeatureExtractor to have defined mix and compute_energy methods, so that the FeatureMixer knows how to scale and add two feature matrices together.
- __init__(feature_extractor, base_feats, frame_shift, padding_value=-1000.0, reference_energy=None)[source]
FeatureMixer’s constructor.
- Parameters:
feature_extractor (FeatureExtractor) – The FeatureExtractor instance that specifies how to mix the features.
base_feats (ndarray) – The features used to initialize the FeatureMixer are a point of reference in terms of energy and offset for all features mixed into them.
frame_shift (float) – Required to correctly compute offset and padding during the mix.
padding_value (float) – The value used to pad the shorter features during the mix. This value is adequate only for log space features. For non-log space features, e.g. energies, use either 0 or a small positive value like 1e-5.
reference_energy (Optional[float]) – Optionally pass a reference energy value to compute SNRs against. This might be required when base_feats correspond to padding energies.
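The remark about padding_value can be illustrated numerically: in the log domain, a value of -1000 corresponds to an energy that underflows to zero, so padded frames add nothing to the mix. The log_add helper below is a hypothetical stand-in for log-domain addition and is not claimed to match any FeatureExtractor.mix implementation.

```python
import math

LOG_PAD = -1000.0  # the default padding_value shown in the signature above


def log_add(a, b):
    """Add two energies given in the log domain: log(exp(a) + exp(b))."""
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


frame = -3.2                                  # some log-energy value
mixed_with_padding = log_add(frame, LOG_PAD)  # padding leaves the frame unchanged
pad_energy = math.exp(LOG_PAD)                # underflows to exactly 0.0
```

For linear-domain features the same role is played by 0 (or a small positive value), since -1000 would then be read as a large negative energy.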
- property num_features
- property unmixed_feats: ndarray
Return a numpy ndarray with the shape (num_tracks, num_frames, num_features), where each track’s feature matrix is padded and scaled adequately to the offsets and SNR used in the add_to_mix call.
- property mixed_feats: ndarray
Return a numpy ndarray with the shape (num_frames, num_features) - a mono mixed feature matrix of the tracks supplied with add_to_mix calls.
- add_to_mix(feats, sampling_rate, snr=None, offset=0.0)[source]
Add the feature matrix of a new track into the mix.
- Parameters:
feats (ndarray) – A 2D feature matrix to be mixed in.
sampling_rate (int) – The sampling rate of feats.
snr (Optional[float]) – Signal-to-noise ratio, assuming feats represents noise (positive SNR: lower feats energy; negative SNR: higher feats energy).
offset (float) – How many seconds to shift feats in time. For mixing, the signal will be padded before the start with low energy values.
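The snr and offset arguments can be read through the usual decibel definition of SNR. The sketch below computes a linear energy gain for the added track and the number of padding frames implied by an offset; both helpers are hypothetical, and the formula is the standard SNR definition rather than a quote from the lhotse source.

```python
def snr_gain(reference_energy, added_energy, snr_db):
    """Linear energy gain that places the added track ``snr_db`` dB below the reference."""
    target_energy = reference_energy * 10.0 ** (-snr_db / 10.0)
    return target_energy / added_energy


def offset_in_frames(offset_seconds, frame_shift):
    """Number of padding frames placed before a track shifted by ``offset_seconds``."""
    return round(offset_seconds / frame_shift)


quieter = snr_gain(reference_energy=4.0, added_energy=2.0, snr_db=10.0)   # positive SNR
louder = snr_gain(reference_energy=1.0, added_energy=1.0, snr_db=-10.0)   # negative SNR
pad_frames = offset_in_frames(0.5, frame_shift=0.01)
```

A positive SNR yields a gain that lowers the added track's energy relative to the reference, and a negative SNR raises it, matching the parameter description above.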
Augmentation
Cuts
Data structures and tools used to create training/testing examples.
The following is the hierarchy of imports in this module (to avoid circular imports):
  ┌─────────────┐
  │ __init__.py │─────────────┬────────────────────────────────────────────┐
  └─────────────┘             │                                            │
         │                    │                                            │
         │                    ▼                                            │
         │           ┌────────────────┐                                    │
         ├──────────▶│  mono.MonoCut  │────────────────────┐               │
         │           └────────────────┘                    │               │
         │                                                 ▼               │
         │           ┌────────────────┐           ┌────────────────┐       │
         ├──────────▶│ multi.MultiCut │──────────▶│  data.DataCut  │───────┤
         │           └────────────────┘           └────────────────┘       │
         │                                                 ▲               ▼
         │           ┌────────────────────┐                │        ┌─────────────┐
         ├──────────▶│   mixed.MixedCut   │────────────────┴───────▶│  base.Cut   │
         │           └────────────────────┘                         └─────────────┘
         │                      │                                          ▲
         │                      │                                          │
         │                      │         ┌────────────────────┐           │
         ├──────────────────────┴────────▶│ padding.PaddingCut │───────────┤
         │                                └────────────────────┘           │
┌────────────────┐                                   ▲                     │
│   set.CutSet   │───────────────────────────────────┴─────────────────────┘
└────────────────┘
- class lhotse.cut.Cut[source]
Caution
Cut is just an abstract class – the actual logic is implemented by its child classes (scroll down for references).
Cut is a base class for audio cuts. An “audio cut” is a subset of a Recording – it can also be thought of as a “view” or a pointer to a chunk of audio. It is not limited to audio data – cuts may also point to (sub-spans of) precomputed Features.
Cuts are different from SupervisionSegment in that they may be arbitrarily longer or shorter than supervisions; cuts may even contain multiple supervisions for creating contextual training data, and unsupervised regions that provide real or synthetic acoustic background context for the supervised segments.
The following example visualizes how a cut may represent a part of a single-channel recording with two utterances and some background noise in between:
Recording
|-------------------------------------------|
"Hey, Matt!"     "Yes?"        "Oh, nothing"
|----------|     |----|        |-----------|
Cut1
|------------------------|
This scenario can be represented in code, using MonoCut, as:
>>> from lhotse import Recording, SupervisionSegment, MonoCut
>>> rec = Recording(id='rec1', duration=10.0, sampling_rate=8000, num_samples=80000, sources=[...])
>>> sups = [
...     SupervisionSegment(id='sup1', recording_id='rec1', start=0, duration=3.37, text='Hey, Matt!'),
...     SupervisionSegment(id='sup2', recording_id='rec1', start=4.5, duration=0.9, text='Yes?'),
...     SupervisionSegment(id='sup3', recording_id='rec1', start=6.9, duration=2.9, text='Oh, nothing'),
... ]
>>> cut = MonoCut(id='rec1-cut1', start=0.0, duration=6.0, channel=0, recording=rec,
...     supervisions=[sups[0], sups[1]])
Note
All Cut classes assume that the SupervisionSegment time boundaries are relative to the beginning of the cut. E.g. if the underlying Recording starts at 0s (always true), the cut starts at 100s, and the SupervisionSegment inside the cut starts at 3s, it really did start at the 103rd second of the recording. In some cases, the supervision might have a negative start, or a duration exceeding the duration of the cut; this means that the supervision in the recording extends beyond the cut.
Cut allows checking for and reading audio data or feature data:
>>> assert cut.has_recording
>>> samples = cut.load_audio()
>>> if cut.has_features:
...     feats = cut.load_features()
It can be visualized, and listened to, inside Jupyter Notebooks:
>>> cut.plot_audio()
>>> cut.play_audio()
>>> cut.plot_features()
Cuts can be used with Lhotse’s FeatureExtractor to compute features.
>>> from lhotse import Fbank
>>> feats = cut.compute_features(extractor=Fbank())
It is also possible to use a FeaturesWriter to store the features and attach their manifest to a copy of the cut. For best storage efficiency, prefer LilcomChunkyWriter when the optional lilcom dependency is installed:
>>> from lhotse import LilcomChunkyWriter
>>> with LilcomChunkyWriter('feats.lca') as storage:
...     cut_with_feats = cut.compute_and_store_features(
...         extractor=Fbank(),
...         storage=storage
...     )
Cuts have several methods that allow their manipulation, transformation, and mixing. Some examples (see the respective methods documentation for details):
>>> cut_2_to_4s = cut.truncate(offset=2, duration=2)
>>> cut_padded = cut.pad(duration=10.0)
>>> cut_extended = cut.extend_by(duration=5.0, direction='both')
>>> cut_mixed = cut.mix(other_cut, offset_other_by=5.0, snr=20)
>>> cut_append = cut.append(other_cut)
>>> cut_24k = cut.resample(24000)
>>> cut_sp = cut.perturb_speed(1.1)
>>> cut_vp = cut.perturb_volume(2.)
>>> cut_rvb = cut.reverb_rir(rir_recording)
Note
All cut transformations are performed lazily, on-the-fly, upon calling load_audio or load_features. The stored waveforms and features are untouched.
Caution
Operations on cuts are not mutating – they return modified copies of Cut objects, leaving the original object unmodified.
A Cut that contains multiple segments (SupervisionSegment) can be decayed into smaller cuts that correspond directly to supervisions:
>>> smaller_cuts = cut.trim_to_supervisions()
Cuts can be detached from parts of their metadata:
>>> cut_no_feat = cut.drop_features()
>>> cut_no_rec = cut.drop_recording()
>>> cut_no_sup = cut.drop_supervisions()
Finally, cuts provide convenience methods to compute feature frame and audio sample masks for supervised regions:
>>> sup_frames = cut.supervisions_feature_mask()
>>> sup_samples = cut.supervisions_audio_mask()
See also:
- id: str
- start: float
- duration: float
- sampling_rate: int
- supervisions: List[SupervisionSegment]
- num_samples: Optional[int]
- num_frames: Optional[int]
- num_features: Optional[int]
- frame_shift: Optional[float]
- features_type: Optional[str]
- has_recording: bool
- has_features: bool
- has_video: bool
- load_audio: Callable[[], ndarray]
- load_video: Callable[[], Tuple[Tensor, Optional[Tensor]]]
- load_features: Callable[[], ndarray]
- compute_and_store_features: Callable
- drop_features: Callable
- drop_recording: Callable
- drop_supervisions: Callable
- drop_alignments: Callable
- drop_in_memory_data: Callable
- iter_data: Callable
- truncate: Callable
- pad: Callable
- extend_by: Callable
- resample: Callable
- perturb_speed: Callable
- perturb_tempo: Callable
- perturb_volume: Callable
- phone: Callable
- reverb_rir: Callable
- map_supervisions: Callable
- merge_supervisions: Callable
- filter_supervisions: Callable
- fill_supervision: Callable
- with_features_path_prefix: Callable
- with_recording_path_prefix: Callable
- property end: float
- copy(**replace_attrs)[source]
Returns a shallow copy of self, with specified attributes overwritten.
- Example:
>>> cut = MonoCut(id="old-id", ...)
>>> cut2 = cut.copy(id="new-id")
>>> assert cut.id == "old-id"
>>> assert cut2.id == "new-id"
- property has_overlapping_supervisions: bool
- property trimmed_supervisions: List[SupervisionSegment]
Return the supervisions in this Cut that have modified time boundaries so as not to exceed the Cut’s start or end.
Note that when cut.supervisions is accessed, the supervisions may have negative start values that indicate the supervision actually begins before the cut, or end values that exceed the Cut’s duration (meaning the supervision continued in the original recording after the Cut’s end).
Caution
For some tasks such as speech recognition (ASR), trimmed supervisions could result in corrupted training data. This is because a part of the transcript might actually reside outside of the cut.
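The trimming described for this property amounts to clipping each supervision's span to the cut's own time range. The sketch below shows that arithmetic on bare numbers, with supervision times expressed relative to the start of the cut; trim_span is a hypothetical helper and does not operate on real SupervisionSegment objects.

```python
def trim_span(sup_start, sup_duration, cut_duration):
    """Clip a supervision span (relative to the cut start) to [0, cut_duration]."""
    start = max(0.0, sup_start)
    end = min(cut_duration, sup_start + sup_duration)
    return start, max(0.0, end - start)


# Supervision that begins 1.5 s before a 6 s cut and lasts 4 s.
starts_early = trim_span(-1.5, 4.0, cut_duration=6.0)
# Supervision that runs 2 s past the end of the cut.
ends_late = trim_span(5.0, 3.0, cut_duration=6.0)
# Supervision fully inside the cut is returned unchanged.
inside = trim_span(1.0, 2.0, cut_duration=6.0)
```

Only the time boundaries change; the transcript attached to a trimmed supervision would still describe the full original span, which is the source of the caution above.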
- split(timestamp)[source]
Split a cut into two cuts at timestamp, which is measured from the start of the cut. For example, a [0s - 10s] cut split at 4s yields:
- left cut [0s - 4s]
- right cut [4s - 10s]
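The same arithmetic applies when the cut does not start at 0 s in its recording. The sketch below splits a (start, duration) pair at a timestamp measured from the start of the cut. split_span is a hypothetical helper; requiring the timestamp to fall strictly inside the cut is an assumption made here, and supervision handling is left out.

```python
def split_span(start, duration, timestamp):
    """Split [start, start + duration) at ``timestamp`` seconds after ``start``.

    Returns two (start, duration) pairs: the left and the right part.
    """
    if not 0 < timestamp < duration:
        raise ValueError("timestamp must fall strictly inside the cut")
    left = (start, timestamp)
    right = (start + timestamp, duration - timestamp)
    return left, right


# The [0s - 10s] example from the text, split at 4 s.
left, right = split_span(0.0, 10.0, 4.0)
# A cut starting at 100 s in its recording, split 4 s after its own start.
left_100, right_100 = split_span(100.0, 10.0, 4.0)
```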
- unmix(tag=None)[source]
Return this cut as a single-item list.
This is a compatibility no-op for cut types that are not MixedCut, so callers can uniformly invoke cut.unmix() regardless of the concrete cut type.
- Parameters:
tag (Optional[str]) – Ignored for non-mixed cuts.
- Return type:
List[Cut]
- Returns:
A single-item list containing self.
- mix(other, offset_other_by=0.0, allow_padding=False, snr=None, preserve_id=None, tag=None)[source]
Refer to the lhotse.cut.mix() documentation.
- Return type:
- append(other, snr=None, preserve_id=None)[source]
Append the other Cut after the current Cut. Conceptually the same as mix but with an offset matching the current cut’s length. Optionally scale down (positive SNR) or scale up (negative SNR) the other cut. Returns a MixedCut, which only keeps the information about the mix; actual mixing is performed during the call to load_features.
- Parameters:
preserve_id (Optional[str]) – optional string (“left”, “right”). When specified, append will preserve the cut ID of the left- or right-hand side argument. Otherwise, a new random ID is generated.
- Return type:
- compute_features(extractor, augment_fn=None)[source]
Compute the features from this cut. This cut has to be able to load audio.
- Parameters:
extractor (FeatureExtractor) – a FeatureExtractor instance used to compute the features.
augment_fn (Optional[Callable[[ndarray,int],ndarray]]) – optional WavAugmenter instance for audio augmentation.
- Return type:
ndarray
- Returns:
a numpy ndarray with the computed features.
- plot_audio(ax=None, **kwargs)[source]
Display a plot of the waveform. Requires matplotlib to be installed.
- play_audio()[source]
Display a Jupyter widget for listening to the waveform. Works only in Jupyter notebook/lab or similar (e.g. Colab).
- plot_features()[source]
Display the feature matrix as an image. Requires matplotlib to be installed.
- plot_alignment(alignment_type='word')[source]
Display the alignment on top of a spectrogram. Requires matplotlib to be installed.
- trim_to_supervisions(keep_overlapping=True, min_duration=None, context_direction='center', keep_all_channels=False)[source]
Splits the current Cut into as many cuts as there are supervisions (SupervisionSegment). These cuts have identical start times and durations as the supervisions. When there are overlapping supervisions, they can be kept or discarded via the keep_overlapping flag.
For example, the following cut:
Cut
|-----------------|
Sup1
|----|  Sup2
   |-----------|
is transformed into two cuts:
Cut1
|----|
Sup1
|----|
   Sup2
   |-|
   Cut2
   |-----------|
   Sup1
   |-|
   Sup2
   |-----------|
For the case of a multi-channel cut with multiple supervisions, we can either trim while respecting the supervision channels (in which case output cut has the same channels as the supervision) or ignore the channels (in which case output cut has the same channels as the input cut).
Hint
If the resulting trimmed cut contains a single supervision, we set the cut id to the id of this supervision, for better compatibility with downstream tools, e.g. comparing the hypothesis of ASR with the reference in icefall.
Hint
If a MultiCut is trimmed and the resulting trimmed cut contains a single channel, we convert it to a MonoCut.
- Parameters:
keep_overlapping (bool) – when False, it will discard parts of other supervisions that overlap with the main supervision. In the illustration above, it would discard Sup2 in Cut1 and Sup1 in Cut2. In this mode, we guarantee that there will always be exactly one supervision per cut.
min_duration (Optional[float]) – An optional duration in seconds; specifying this argument will extend the cuts that would have been shorter than min_duration with actual acoustic context in the recording/features. If there are supervisions present in the context, they are kept when keep_overlapping is true. If there is not enough context, the returned cut will be shorter than min_duration. If the supervision segment is longer than min_duration, the returned cut will be longer.
context_direction (Literal['center','left','right','random']) – Which direction should the cut be expanded towards to include context. The value of “center” implies equal expansion to left and right; “random” uniformly samples a value between “left” and “right”.
keep_all_channels (bool) – If True, the output cut will have the same channels as the input cut. By default, the trimmed cut will have the same channels as the supervision.
- Return type:
CutSet
- Returns:
a CutSet containing the trimmed cuts.
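The effect of keep_overlapping can be sketched with bare time spans. The helper `trim_to_spans` below is hypothetical and is not Lhotse's implementation; it follows the Sup1/Sup2 illustration, and clipping the overlapping supervision to the new cut is a simplification made for this sketch.

```python
from typing import Dict, List, Tuple

Span = Tuple[float, float]  # (start, end) in seconds within the parent cut


def trim_to_spans(sups: Dict[str, Span], keep_overlapping: bool = True) -> List[dict]:
    """For every supervision, emit a cut with the same span plus the supervisions inside it."""
    cuts = []
    for main_id, (cut_start, cut_end) in sups.items():
        kept = {}
        for sup_id, (start, end) in sups.items():
            overlaps = start < cut_end and end > cut_start
            if sup_id == main_id or (keep_overlapping and overlaps):
                # Clip to the new cut and express the times relative to its start.
                kept[sup_id] = (max(start, cut_start) - cut_start, min(end, cut_end) - cut_start)
        cuts.append({"span": (cut_start, cut_end), "supervisions": kept})
    return cuts


sups = {"Sup1": (0.0, 5.0), "Sup2": (3.0, 15.0)}
with_overlaps = trim_to_spans(sups, keep_overlapping=True)
without_overlaps = trim_to_spans(sups, keep_overlapping=False)
```

With keep_overlapping=False every output cut carries exactly one supervision, which is the guarantee stated in the parameter description.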
- trim_to_alignments(type, max_pause=None, max_segment_duration=None, delimiter=' ', keep_all_channels=False)[source]
Splits the current Cut into its constituent alignment items (AlignmentItem). These cuts have identical start times and durations as the alignment item. Additionally, the max_pause option can be used to merge alignment items that are separated by a pause shorter than max_pause. If max_segment_duration is specified, we will keep merging consecutive segments until the duration of the merged segment exceeds max_segment_duration.
For the case of a multi-channel cut with multiple alignments, we can either trim while respecting the supervision channels (in which case output cut has the same channels as the supervision) or ignore the channels (in which case output cut has the same channels as the input cut).
Hint
If the resulting trimmed cut contains a single supervision, we set the cut id to the id of this supervision, for better compatibility with downstream tools, e.g. comparing the hypothesis of ASR with the reference in icefall.
Hint
If a MultiCut is trimmed and the resulting trimmed cut contains a single channel, we convert it to a MonoCut.
Hint
If you have a Cut with multiple supervision segments and you want to trim it to the word-level alignment, you can use the Cut.merge_supervisions() method first to merge the supervisions into a single one, followed by the Cut.trim_to_alignments() method. For example:
>>> cut = cut.merge_supervisions(type='word', delimiter=' ')
>>> cut = cut.trim_to_alignments(type='word', max_pause=1.0)
Hint
The above technique can also be used to segment long cuts into roughly equal duration segments, while respecting alignment boundaries. For example, to split a Cut into 10s segments, you can do:
>>> cut = cut.merge_supervisions(type='word', delimiter=' ')
>>> cut = cut.trim_to_alignments(type='word', max_pause=10.0, max_segment_duration=10.0)
- Parameters:
type (str) – The type of the alignment to trim to (e.g. “word”).
max_pause (Optional[float]) – The maximum pause allowed between the alignments to merge them. If None, no merging will be performed. [default: None]
delimiter (str) – The delimiter to use when joining the alignment items.
keep_all_channels (bool) – If True, the output cut will have the same channels as the input cut. By default, the trimmed cut will have the same channels as the supervision.
num_jobs – Number of parallel workers to process the cuts.
- Return type:
CutSet
- Returns:
a CutSet object.
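The merging controlled by max_pause and max_segment_duration can be pictured as a greedy pass over time-sorted alignment items. The sketch below is one possible reading of the description, not Lhotse's implementation: `merge_items` is a hypothetical helper, items are plain `(symbol, start, end)` tuples, and treating max_segment_duration as an upper bound that a merged segment may not exceed is an assumption.

```python
from typing import List, Optional, Tuple

Item = Tuple[str, float, float]  # (symbol, start, end) in seconds


def merge_items(
    items: List[Item],
    max_pause: Optional[float] = None,
    max_segment_duration: Optional[float] = None,
    delimiter: str = " ",
) -> List[Item]:
    """Greedily merge consecutive items separated by a pause of at most `max_pause`."""
    if max_pause is None:
        return list(items)  # no merging requested
    merged: List[Item] = []
    for symbol, start, end in sorted(items, key=lambda item: item[1]):
        if merged:
            prev_symbol, prev_start, prev_end = merged[-1]
            short_pause = start - prev_end <= max_pause
            fits = max_segment_duration is None or end - prev_start <= max_segment_duration
            if short_pause and fits:
                merged[-1] = (prev_symbol + delimiter + symbol, prev_start, end)
                continue
        merged.append((symbol, start, end))
    return merged


words = [("hello", 0.0, 0.5), ("world", 0.5, 1.0), ("again", 3.0, 3.5)]
segments = merge_items(words, max_pause=1.0)
capped = merge_items(words, max_pause=10.0, max_segment_duration=2.0)
```

In both calls "again" stays separate: in the first because the 2s pause exceeds max_pause, in the second because merging it would produce a 3.5s segment.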
- trim_to_supervision_groups(max_pause=0.0)[source]
Return a new CutSet with Cuts based on supervision groups. A supervision group is a set of supervisions with no gaps between them (or gaps shorter than max_pause). This is similar to the concept of an utterance group as described in this paper: https://arxiv.org/abs/2211.00482
For example, the following cut:
Cut╔═════════════════════════════════════════════════════════════════════════════════╗ ║┌──────────────────────┐ ┌────────┐ ║ ║│ Hello this is John. │ │ Hi │ ║ ║└──────────────────────┘ └────────┘ ║ ║ ┌──────────────────────────────────┐ ┌───────────────────┐║ ║ │ Hey, John. How are you? │ │ What do you do? │║ ║ └──────────────────────────────────┘ └───────────────────┘║ ╚═════════════════════════════════════════════════════════════════════════════════╝
is transformed into two cuts:
Cut 1 Cut 2
╔════════════════════════════════════════════════╗ ╔═══════════════════════════╗ ║┌──────────────────────┐ ║ ║┌────────┐ ║ ║│ Hello this is John. │ ║ ║│ Hi │ ║ ║└──────────────────────┘ ║ ║└────────┘ ║ ║ ┌──────────────────────────────────┐║ ║ ┌───────────────────┐║ ║ │ Hey, John. How are you? │║ ║ │ What do you do? │║ ║ └──────────────────────────────────┘║ ║ └───────────────────┘║ ╚════════════════════════════════════════════════╝ ╚═══════════════════════════╝
For the case of a multi-channel cut with multiple supervisions, we keep all the channels in the recording.
- Parameters:
max_pause (float) – An optional duration in seconds; if the gap between two supervisions is longer than this, they will be treated as separate groups. By default, this is set to 0.0, which means that no gaps are allowed between supervisions.
- Return type:
CutSet
- Returns:
a CutSet.
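The grouping rule can be sketched as a single sweep over supervisions sorted by start time. The helper `group_spans` is hypothetical and operates on plain `(start, end)` tuples; it is only meant to illustrate the rule, not to reproduce Lhotse's code.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def group_spans(spans: List[Span], max_pause: float = 0.0) -> List[List[Span]]:
    """Start a new group whenever the gap to the current group exceeds `max_pause`."""
    groups: List[List[Span]] = []
    group_end = 0.0
    for span in sorted(spans):
        start, end = span
        if groups and start - group_end <= max_pause:
            groups[-1].append(span)
            group_end = max(group_end, end)
        else:
            groups.append([span])
            group_end = end
    return groups


# Two overlapping pairs of utterances separated by a 1s gap, as in the illustration above.
spans = [(0.0, 4.0), (2.0, 9.0), (10.0, 12.0), (11.0, 15.0)]
groups = group_spans(spans)
single_group = group_spans(spans, max_pause=1.0)
```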
- cut_into_windows_balanced(min_duration, max_duration, overlap=0.0, keep_excessive_supervisions=True)[source]
Return a list of shorter cuts made by splitting this cut into overlapping windows whose size is chosen within [min_duration, max_duration] to maximise the duration of the final (potentially shorter) window, thereby minimising padding.
Each resulting sub-cut carries two extra entries in its custom dict:
"source_cut_id" – the id of this (parent) cut.
"source_cut_start" – the start time of this cut within its recording. Downstream code can use this to detect whether the parent was the first window of a recording (source_cut_start == 0) or a later continuation.
- Parameters:
min_duration (float) – Minimum desired window duration in seconds.
max_duration (float) – Maximum desired window duration in seconds.
overlap (float) – Overlap between consecutive windows in seconds (default: 0).
keep_excessive_supervisions (bool) – When a window is truncated mid-supervision, should the supervision be kept.
- Return type:
CutSet
- Returns:
a CutSet of overlapping sub-cuts.
- cut_into_windows(duration, hop=None, keep_excessive_supervisions=True)[source]
Return a list of shorter cuts, made by traversing this cut in windows of duration seconds by hop seconds.
The last window might have a shorter duration if there was not enough audio, so you might want to either filter or pad the results.
- Parameters:
duration (float) – Desired duration of the new cuts in seconds.
hop (Optional[float]) – Shift between the windows in the new cuts in seconds.
keep_excessive_supervisions (bool) – When a cut is truncated in the middle of a supervision segment, should the supervision be kept.
- Return type:
CutSet
- Returns:
a CutSet of cuts made from shorter duration windows.
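The window arithmetic can be sketched as follows. `window_spans` is a hypothetical helper, not a Lhotse function; defaulting hop to the window duration when it is not given is an assumption of this sketch.

```python
import math
from typing import List, Optional, Tuple


def window_spans(total: float, duration: float, hop: Optional[float] = None) -> List[Tuple[float, float]]:
    """Return (start, end) spans of windows `duration` long, shifted by `hop` seconds."""
    hop = duration if hop is None else hop  # assumption: hop defaults to the window duration
    if total <= duration:
        return [(0.0, total)]
    n_windows = math.ceil((total - duration) / hop) + 1
    # The last window is clipped to the end of the audio, so it may be shorter.
    return [(i * hop, min(i * hop + duration, total)) for i in range(n_windows)]


spans = window_spans(total=12.0, duration=5.0)
overlapping = window_spans(total=12.0, duration=5.0, hop=2.5)
```

The final span of the first call is only 2s long, which is the case where filtering or padding the results becomes relevant.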
- index_supervisions(index_mixed_tracks=False, keep_ids=None)[source]
Create a two-level index of supervision segments. It is a mapping from a Cut’s ID to an interval tree that contains the supervisions of that Cut.
The interval tree can be efficiently queried for overlapping and/or enveloping segments. It helps speed up some operations on Cuts of very long recordings (1h+) that contain many supervisions.
- Parameters:
index_mixed_tracks (bool) – Should the tracks of MixedCut’s be indexed as additional, separate entries.
keep_ids (Optional[Set[str]]) – If specified, we will only index the supervisions with the specified IDs.
- Return type:
Dict[str,IntervalTree]
- Returns:
a mapping from Cut ID to an interval tree of SupervisionSegments.
- save_audio(storage_path, format=None, encoding=None, augment_fn=None, **kwargs)[source]
Store this cut’s waveform as an audio recording on disk.
- Parameters:
storage_path (Union[Path,str]) – The path to the location where we will store the audio recordings.
format (Optional[str]) – Audio format argument supported by torchaudio.save or soundfile.write. Please refer to the relevant library’s documentation depending on which audio backend you’re using.
encoding (Optional[str]) – Audio encoding argument supported by torchaudio.save or soundfile.write. Please refer to the relevant library’s documentation depending on which audio backend you’re using.
augment_fn (Optional[Callable[[ndarray,int],ndarray]]) – an optional callable used for audio augmentation. Be careful with the types of augmentations used: if they modify the start/end/duration times of the cut and its supervisions, you will end up with incorrect supervision information when using this API. E.g. for speed perturbation, use CutSet.perturb_speed() instead.
kwargs – additional arguments passed to Cut.load_audio(). For example, if saving a MixedCut, we can specify mono_downmix=True to downmix the tracks to mono before saving.
- Return type:
- Returns:
a new Cut instance.
- speakers_feature_mask(min_speaker_dim=None, speaker_to_idx_map=None, use_alignment_if_exists=None)[source]
Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_frames), and its values are 0 for nonspeech frames and 1 for speech frames for each respective speaker.
This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272
- Parameters:
min_speaker_dim (Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).
speaker_to_idx_map (Optional[Dict[str,int]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2).
use_alignment_if_exists (Optional[str]) – optional str, key for alignment type to use for generating the mask. If it does not exist, fall back on supervision time spans.
- Return type:
ndarray
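The shape and contents of such a mask can be sketched with nested lists. This is an illustration only: `speaker_activity_mask` is a hypothetical helper, the real method returns a numpy array, and the `frame_shift` argument and the rounding of times to frame indices are assumptions made for the sketch.

```python
from typing import Dict, List, Optional, Tuple

Supervision = Tuple[str, float, float]  # (speaker, start, end) in seconds, relative to the cut


def speaker_activity_mask(
    supervisions: List[Supervision],
    cut_duration: float,
    frame_shift: float,
    min_speaker_dim: Optional[int] = None,
    speaker_to_idx_map: Optional[Dict[str, int]] = None,
) -> List[List[int]]:
    """Build a (num_speakers, num_frames) matrix with 1 where a speaker is active."""
    if speaker_to_idx_map is None:
        speakers = sorted({spk for spk, _, _ in supervisions})
        speaker_to_idx_map = {spk: idx for idx, spk in enumerate(speakers)}
    highest_idx = max(speaker_to_idx_map.values(), default=-1)
    num_speakers = max(highest_idx + 1, min_speaker_dim or 0)
    num_frames = round(cut_duration / frame_shift)
    mask = [[0] * num_frames for _ in range(num_speakers)]
    for spk, start, end in supervisions:
        first = max(0, round(start / frame_shift))
        last = min(num_frames, round(end / frame_shift))
        for frame in range(first, last):
            mask[speaker_to_idx_map[spk]][frame] = 1
    return mask


mask = speaker_activity_mask(
    [("alice", 0.0, 2.0), ("bob", 1.0, 4.0)],
    cut_duration=4.0,
    frame_shift=1.0,
    min_speaker_dim=3,
)
```

The third row stays all zeros: it exists only because min_speaker_dim asks for at least three speaker rows.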
- speakers_audio_mask(min_speaker_dim=None, speaker_to_idx_map=None, use_alignment_if_exists=None)[source]
Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_samples), and its values are 0 for nonspeech samples and 1 for speech samples for each respective speaker.
This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272
- Parameters:
min_speaker_dim (Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).
speaker_to_idx_map (Optional[Dict[str,int]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2).
use_alignment_if_exists (Optional[str]) – optional str, key for alignment type to use for generating the mask. If it does not exist, fall back on supervision time spans.
- Return type:
ndarray
- supervisions_feature_mask(use_alignment_if_exists=None)[source]
Return a 1D numpy array with value 1 for frames covered by at least one supervision, and 0 for frames not covered by any supervision.
- Parameters:
use_alignment_if_exists (Optional[str]) – optional str, key for alignment type to use for generating the mask. If it does not exist, fall back on supervision time spans.
- Return type:
ndarray
- supervisions_audio_mask(use_alignment_if_exists=None)[source]
Return a 1D numpy array with value 1 for samples covered by at least one supervision, and 0 for samples not covered by any supervision.
- Parameters:
use_alignment_if_exists (Optional[str]) – optional str, key for alignment type to use for generating the mask. If it does not exist, fall back on supervision time spans.
- Return type:
ndarray
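A sample-level supervision mask can be sketched the same way, here with a plain list instead of a numpy array. `supervision_sample_mask` is a hypothetical helper; rounding times to sample indices and clipping spans that extend beyond the cut are assumptions of the sketch.

```python
from typing import List, Tuple


def supervision_sample_mask(
    spans: List[Tuple[float, float]], num_samples: int, sampling_rate: int
) -> List[int]:
    """Return a 1D mask with 1 for samples covered by at least one supervision span."""
    mask = [0] * num_samples
    for start, end in spans:
        first = max(0, round(start * sampling_rate))
        last = min(num_samples, round(end * sampling_rate))
        for sample in range(first, last):
            mask[sample] = 1
    return mask


# A 1s cut at a toy sampling rate of 8 Hz; the first span starts before the cut.
sample_mask = supervision_sample_mask([(-0.5, 0.25), (0.5, 0.75)], num_samples=8, sampling_rate=8)
```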
-
id:
- class lhotse.cut.CutSet(cuts=None)[source]
CutSet represents a collection of cuts. CutSet ties together all types of data – audio, features and supervisions, and is suitable to represent training/dev/test sets.
CutSet can be either “lazy” (acts as an iterable) which is best for representing full datasets, or “eager” (acts as a list), which is best for representing individual mini-batches (and sometimes test/dev datasets). Almost all operations are available for both modes, but some of them are more efficient depending on the mode (e.g. indexing an “eager” manifest is O(1)).
Note
CutSet is the basic building block of PyTorch-style Datasets for speech/audio processing tasks.
When coming from Kaldi, there is really no good equivalent – the closest concept may be Kaldi’s “egs” for training neural networks, which are chunks of feature matrices and corresponding alignments used respectively as inputs and supervisions.
CutSet is different because it provides you with all kinds of metadata, and you can select just the interesting bits to feed them to your models.
CutSet can be created from any combination of RecordingSet, SupervisionSet, and FeatureSet with lhotse.cut.CutSet.from_manifests():
>>> from lhotse import CutSet
>>> cuts = CutSet.from_manifests(recordings=my_recording_set)
>>> cuts2 = CutSet.from_manifests(features=my_feature_set)
>>> cuts3 = CutSet.from_manifests(
...     recordings=my_recording_set,
...     features=my_feature_set,
...     supervisions=my_supervision_set,
... )
When creating a CutSet with CutSet.from_manifests(), the resulting cuts will have the same duration as the input recordings or features. For long recordings, such cuts are not practical for training. We provide several methods to transform the cuts into shorter ones.
Consider the following scenario:
                  Recording
|-------------------------------------------|
"Hey, Matt!"     "Yes?"         "Oh, nothing"
|----------|     |----|         |-----------|
.......... CutSet.from_manifests() ..........
                    Cut1
|-------------------------------------------|
............. Example CutSet A ..............
    Cut1          Cut2              Cut3
|----------|     |----|         |-----------|
............. Example CutSet B ..............
         Cut1                   Cut2
|---------------------||--------------------|
............. Example CutSet C ..............
            Cut1       Cut2
            |---|      |------|
The CutSet’s A, B and C can be created like:
>>> cuts_A = cuts.trim_to_supervisions()
>>> cuts_B = cuts.cut_into_windows(duration=5.0)
>>> cuts_C = cuts.trim_to_unsupervised_segments()
Note
Some operations support parallel execution via an optional num_jobs parameter. By default, all processing is single-threaded.
Caution
Operations on cut sets are not mutating – they return modified copies of CutSet objects, leaving the original object unmodified (and all of its cuts are also unmodified).
CutSet can be stored and read from JSON, JSONL, etc. and supports optional gzip compression:
>>> cuts.to_file('cuts.jsonl.gz')
>>> cuts4 = CutSet.from_file('cuts.jsonl.gz')
It behaves similarly to a dict:
>>> 'rec1-1-0' in cuts
True
>>> cut = cuts['rec1-1-0']
>>> for cut in cuts:
...     pass
>>> len(cuts)
127
CutSet has some convenience properties and methods to gather information about the dataset:
>>> ids = list(cuts.ids)
>>> speaker_id_set = cuts.speakers
>>> # The following prints a message:
>>> cuts.describe()
Cuts count: 547
Total duration (hours): 326.4
Speech duration (hours): 79.6 (24.4%)
***
Duration statistics (seconds):
mean    2148.0
std      870.9
min      477.0
25%     1523.0
50%     2157.0
75%     2423.0
max     5415.0
dtype: float64
Manipulation examples:
>>> longer_than_5s = cuts.filter(lambda c: c.duration > 5)
>>> first_100 = cuts.subset(first=100)
>>> split_into_4 = cuts.split(num_splits=4)
>>> shuffled = cuts.shuffle()
>>> random_sample = cuts.sample(n_cuts=10)
>>> new_ids = cuts.modify_ids(lambda c: c.id + '-newid')
These operations can be composed to implement more complex operations, e.g. bucketing by duration:
>>> buckets = cuts.sort_by_duration().split(num_splits=30)
Cuts in a CutSet can be detached from parts of their metadata:
>>> cuts_no_feat = cuts.drop_features()
>>> cuts_no_rec = cuts.drop_recordings()
>>> cuts_no_sup = cuts.drop_supervisions()
Sometimes specific sorting patterns are useful when a small CutSet represents a mini-batch:
>>> cuts = cuts.sort_by_duration(ascending=False)
>>> cuts = cuts.sort_like(other_cuts)
CutSet offers some batch processing operations:
>>> cuts = cuts.pad(num_frames=300)  # or duration=30.0
>>> cuts = cuts.truncate(max_duration=30.0, offset_type='start')  # truncate from start to 30.0s
>>> cuts = cuts.mix(other_cuts, snr=[10, 30], mix_prob=0.5)
CutSet supports lazy data augmentation/transformation methods which require adjusting some information in the manifest (e.g., num_samples or duration). Note that in the following examples, the audio is untouched – the operations are stored in the manifest, and executed upon reading the audio:
>>> cuts_sp = cuts.perturb_speed(factor=1.1)
>>> cuts_vp = cuts.perturb_volume(factor=2.)
>>> cuts_24k = cuts.resample(24000)
>>> cuts_rvb = cuts.reverb_rir(rir_recordings)
Caution
If the CutSet contained Features manifests, they will be detached after performing audio augmentations such as CutSet.perturb_speed(), CutSet.resample(), CutSet.perturb_volume(), or CutSet.reverb_rir().
CutSet offers parallel feature extraction capabilities (see CutSet.compute_and_store_features() for details), and can be used to estimate global mean and variance:
>>> from lhotse import Fbank
>>> cuts = CutSet()
>>> # This uses the default backend (numpy_files unless overridden with
>>> # LHOTSE_FEATURES_STORAGE_BACKEND). If lilcom is installed, prefer
>>> # storage_type=LilcomChunkyWriter for better storage efficiency.
>>> cuts = cuts.compute_and_store_features(
...     extractor=Fbank(),
...     storage_path='/data/feats',
...     num_jobs=4
... )
>>> mvn_stats = cuts.compute_global_feature_stats('/data/features/mvn_stats.pkl', max_cuts=10000)
See also:
- property ids: Iterable[str]
- property speakers: FrozenSet[str]
- static from_files(paths, shuffle_iters=True, seed=None)[source]
Constructor that creates a single CutSet out of many manifest files. We will iterate sequentially over each of the files, and by default we will randomize the file order every time CutSet is iterated.
This is intended primarily for large datasets which are split into many small manifests, to ensure that the order in which data is seen during training can be properly randomized.
- Parameters:
paths (List[Union[Path,str]]) – a list of paths to cut manifests.
shuffle_iters (bool) – bool, should we shuffle paths each time we iterate the returned CutSet (enabled by default).
seed (Optional[int]) – int, random seed controlling the shuffling RNG. By default, we’ll use Python’s global RNG so the order will be different on each script execution.
- Return type:
- Returns:
a lazy CutSet instance.
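The iteration pattern described above (reshuffle the file order on every pass, then read each file sequentially) can be sketched without touching any files. `ShuffledChain` is a hypothetical class that uses in-memory lists in place of manifest files; it illustrates the idea and is not Lhotse's implementation.

```python
import random
from typing import Iterator, Optional, Sequence


class ShuffledChain:
    """Iterate over several sources one after another, reshuffling their order per pass."""

    def __init__(
        self,
        sources: Sequence[Sequence[str]],
        shuffle_iters: bool = True,
        seed: Optional[int] = None,
    ) -> None:
        self.sources = list(sources)
        self.shuffle_iters = shuffle_iters
        # seed=None falls back to Python's global RNG, so the order differs between runs.
        self._shuffle = random.shuffle if seed is None else random.Random(seed).shuffle

    def __iter__(self) -> Iterator[str]:
        order = list(self.sources)
        if self.shuffle_iters:
            self._shuffle(order)
        for source in order:
            yield from source  # items within one source keep their original order


manifests = [["a1", "a2"], ["b1"], ["c1", "c2"]]
first_pass = list(ShuffledChain(manifests, seed=0))
```

Only the order of the sources changes between passes; items inside one source are always read sequentially.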
- static from_cuts(cuts)[source]
Left for backward compatibility, where it implicitly created an “eager” CutSet.
- Return type:
- static from_items(cuts)
Left for backward compatibility, where it implicitly created an “eager” CutSet.
- Return type:
- static from_manifests(recordings=None, supervisions=None, features=None, output_path=None, random_ids=False, tolerance=0.001, lazy=False)[source]
Create a CutSet from any combination of supervision, feature and recording manifests. At least one of recordings or features is required.
The created cuts will be of type MonoCut, even when the recordings have multiple channels. The MonoCut boundaries correspond to those found in the features, when available, otherwise to those found in the recordings.
When supervisions are provided, we’ll be searching them for matching recording IDs and attaching to created cuts, assuming they are fully within the cut’s time span.
- Parameters:
recordings (Optional[RecordingSet]) – an optional RecordingSet manifest.
supervisions (Optional[SupervisionSet]) – an optional SupervisionSet manifest.
features (Optional[FeatureSet]) – an optional FeatureSet manifest.
output_path (Union[Path,str,None]) – an optional path where the CutSet is stored.
random_ids (bool) – boolean, should the cut IDs be randomized. By default, use the recording ID with a loop index and a channel idx, i.e. “{recording_id}-{idx}-{channel}”.
tolerance (float) – float, tolerance for supervision and feature segment boundary comparison. By default, it’s 1ms. Increasing this value can be helpful when importing Kaldi data directories with precomputed features (typically 0.02 - 0.1 should be sufficient).
lazy (bool) – boolean, when True, output_path must be provided.
- Return type:
- Returns:
a new CutSet instance.
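The default ID pattern “{recording_id}-{idx}-{channel}” can be illustrated with a small helper. `make_cut_ids` is hypothetical, and reading the loop index as the position of the recording in the input is an assumption; only the pattern itself comes from the parameter description above.

```python
import uuid
from typing import List, Sequence, Tuple


def make_cut_ids(
    recordings: Sequence[Tuple[str, Sequence[int]]], random_ids: bool = False
) -> List[str]:
    """Produce one ID per (recording, channel) pair, mirroring the documented pattern."""
    ids = []
    for idx, (recording_id, channels) in enumerate(recordings):
        for channel in channels:
            if random_ids:
                ids.append(str(uuid.uuid4()))
            else:
                ids.append(f"{recording_id}-{idx}-{channel}")
    return ids


# A stereo recording followed by a mono one: one ID per channel.
cut_ids = make_cut_ids([("rec1", [0, 1]), ("rec2", [0])])
```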
- static from_webdataset(path, **wds_kwargs)[source]
Provides the ability to read Lhotse objects from a WebDataset tarball (or a collection of them, i.e., shards) sequentially, without reading the full contents into memory. It also supports passing a list of paths, or WebDataset-style pipes.
CutSets stored in this format are potentially much faster to read from due to sequential I/O (we observed speedups of 50-100x vs random-read mechanisms).
Since this mode does not support random access reads, some methods of CutSet might not work properly (e.g. len()).
The behaviour of the underlying WebDataset instance can be customized by providing its kwargs directly to the constructor of this class. For details, see lhotse.dataset.webdataset.mini_webdataset() documentation.
Examples
Read manifests and data from a single tarball:
>>> cuts = CutSet.from_webdataset("data/cuts-train.tar")
Read manifests and data from multiple tarball shards:
>>> cuts = CutSet.from_webdataset("data/shard-{000000..004126}.tar")
>>> # alternatively
>>> cuts = CutSet.from_webdataset(["data/shard-000000.tar", "data/shard-000001.tar", ...])
Read manifests and data from shards in cloud storage (here AWS S3 via AWS CLI):
>>> cuts = CutSet.from_webdataset("pipe:aws s3 cp data/shard-{000000..004126}.tar -")
Read manifests and data from shards which are split between PyTorch DistributedDataParallel nodes and dataloading workers, with shard-level shuffling enabled:
>>> cuts = CutSet.from_webdataset(
...     "data/shard-{000000..004126}.tar",
...     split_by_worker=True,
...     split_by_node=True,
...     shuffle_shards=True,
... )
- Return type:
- static from_shar(fields=None, in_dir=None, split_for_dataloading=False, shuffle_shards=False, stateful_shuffle=True, seed=42, cut_map_fns=None, slice_length=None)[source]
Reads cuts and their corresponding data from multiple shards, also recognized as the Lhotse Shar format. Each shard is numbered and represented as a collection of one text manifest and one or more binary tarfiles. Each tarfile contains a single type of data, e.g., recordings, features, or custom fields.
Given an example directory named some_dir, its expected layout is some_dir/cuts.000000.jsonl.gz, some_dir/recording.000000.tar, some_dir/features.000000.tar, and then the same names but numbered with 000001, etc. There may also be other files if the cuts have custom data attached to them.
The main idea behind Lhotse Shar format is to optimize dataloading with sequential reads, while keeping the data composition more flexible than e.g. WebDataset tar archives do. To achieve this, Lhotse Shar keeps each data type in a separate archive, along a single CutSet JSONL manifest. This way, the metadata can be investigated without iterating through the binary data. The format also allows iteration over a subset of fields, or extension of existing data with new fields.
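The directory layout described above follows a simple naming scheme: a six-digit shard index, a .jsonl.gz manifest for the cuts and one .tar archive per data field. The helper `shar_layout` below is hypothetical and only builds the expected path strings; it does not read or write anything.

```python
from typing import Dict, List, Sequence


def shar_layout(shar_dir: str, fields: Sequence[str], num_shards: int) -> Dict[str, List[str]]:
    """Map "cuts" and every data field to its per-shard file paths."""
    layout = {"cuts": [f"{shar_dir}/cuts.{idx:06d}.jsonl.gz" for idx in range(num_shards)]}
    for field in fields:
        layout[field] = [f"{shar_dir}/{field}.{idx:06d}.tar" for idx in range(num_shards)]
    return layout


layout = shar_layout("some_dir", fields=["recording", "features"], num_shards=2)
```

The resulting dict has the same shape as the fields argument: field names as keys, lists of shard locations as values.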
As you iterate over cuts from LazySharIterator, it keeps a file handle open for the JSONL manifest and all of the tar files that correspond to the current shard. The tar files are read item by item together, and their binary data is attached to the cuts. It can be normally accessed using methods such as cut.load_audio().
We can simply load a directory created by SharWriter. Example:
>>> cuts = LazySharIterator(in_dir="some_dir")
>>> for cut in cuts:
...     print("Cut", cut.id, "has duration of", cut.duration)
...     audio = cut.load_audio()
...     fbank = cut.load_features()
LazySharIterator can also be initialized from a dict, where the keys indicate fields to be read, and the values point to actual shard locations. This is useful when only a subset of data is needed, or it is stored in different locations. Example:
>>> cuts = LazySharIterator({
...     "cuts": ["some_dir/cuts.000000.jsonl.gz"],
...     "recording": ["another_dir/recording.000000.tar"],
...     "features": ["yet_another_dir/features.000000.tar"],
... })
>>> for cut in cuts:
...     print("Cut", cut.id, "has duration of", cut.duration)
...     audio = cut.load_audio()
...     fbank = cut.load_features()
We also support providing shell commands as shard sources, inspired by WebDataset. The “cuts” field expects a .jsonl stream, while the other fields expect a .tar stream. Example:
>>> cuts = LazySharIterator({
...     "cuts": ["pipe:curl https://my.page/cuts.000000.jsonl"],
...     "recording": ["pipe:curl https://my.page/recording.000000.tar"],
... })
>>> for cut in cuts:
...     print("Cut", cut.id, "has duration of", cut.duration)
...     audio = cut.load_audio()
The shell command can also contain pipes, which can be used e.g. for decompression. Example:
>>> cuts = LazySharIterator({
...     "cuts": ["pipe:curl https://my.page/cuts.000000.jsonl.gz | gunzip -c -"],
...     (...)
... })
Finally, we allow specifying URLs or cloud storage URIs for the shard sources. We defer to the smart_open library to handle those. Example:
>>> cuts = LazySharIterator({
...     "cuts": ["s3://my-bucket/cuts.000000.jsonl.gz"],
...     "recording": ["s3://my-bucket/recording.000000.tar"],
... })
>>> for cut in cuts:
...     print("Cut", cut.id, "has duration of", cut.duration)
...     audio = cut.load_audio()
- Parameters:
fields (Optional[Dict[str,Sequence[Union[Path,str]]]]) – a dict whose keys specify which fields to load, and values are lists of shards (either paths or shell commands). The field “cuts” pointing to CutSet shards always has to be present.
in_dir (Union[Path,str,None]) – path to a directory created with SharWriter with all the shards in a single place. Can be used instead of fields.
split_for_dataloading (bool) – bool, by default False which does nothing. Setting it to True is intended for PyTorch training with multiple dataloader workers and possibly multiple DDP nodes. It results in each node+worker combination receiving a unique subset of shards from which to read data to avoid data duplication. This is mutually exclusive with seed='randomized'.
shuffle_shards (bool) – bool, by default False. When True, the shards are shuffled (in case of multi-node training, the shuffling is the same on each node given the same seed).
seed (Union[int,Literal['randomized']]) – When shuffle_shards is True, we use this number to seed the RNG. Seed can be set to 'randomized' in which case we expect that the user provided lhotse.dataset.dataloading.worker_init_fn() as DataLoader’s worker_init_fn argument. It will cause the iterator to shuffle shards differently on each node and dataloading worker in PyTorch training. This is mutually exclusive with split_for_dataloading=True. Seed can be set to 'trng' which, like 'randomized', shuffles the shards differently on each iteration, but is not possible to control (and is not reproducible). trng mode is mostly useful when the user has limited control over the training loop and may not be able to guarantee internal Shar epoch is being incremented, but needs randomness on each iteration (e.g. useful with PyTorch Lightning).
stateful_shuffle (bool) – bool, by default True. When True, every time this object is fully iterated, it increments an internal epoch counter and triggers shard reshuffling with RNG seeded by seed+epoch. Doesn’t have any effect when shuffle_shards is False.
cut_map_fns (Optional[Sequence[Callable[[Cut],Cut]]]) – optional sequence of callables that accept cuts and return cuts. It’s expected to have the same length as the number of shards, so each function corresponds to a specific shard. It can be used to attach shard-specific custom attributes to cuts.
slice_length (Optional[int]) – optional int, when set enables random slicing of shards that may improve sampling randomness for many-dataset-with-many-large-shards setups at the cost of efficiency. In this mode, we randomly select K to skip first K examples and read only slice_length examples from each shard, then move to the next one.
- Return type:
- See also:
LazySharIterator, to_shar().
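The interaction between shuffle_shards, seed and stateful_shuffle can be sketched as an iterable that reseeds its RNG with seed+epoch on every full pass. `EpochShuffledShards` is a hypothetical class written for illustration; it is not how LazySharIterator is implemented and it only handles integer seeds.

```python
import random
from typing import Iterator, Sequence


class EpochShuffledShards:
    """Yield shard names, reshuffled on each full pass with an RNG seeded by seed + epoch."""

    def __init__(
        self,
        shards: Sequence[str],
        seed: int = 42,
        shuffle_shards: bool = True,
        stateful_shuffle: bool = True,
    ) -> None:
        self.shards = list(shards)
        self.seed = seed
        self.shuffle_shards = shuffle_shards
        self.stateful_shuffle = stateful_shuffle
        self.epoch = 0

    def __iter__(self) -> Iterator[str]:
        order = list(self.shards)
        if self.shuffle_shards:
            random.Random(self.seed + self.epoch).shuffle(order)
        yield from order
        if self.stateful_shuffle:
            self.epoch += 1  # reached only after the pass has been fully consumed


shards = [f"shard-{idx:06d}" for idx in range(8)]
iterator = EpochShuffledShards(shards, seed=42)
first_epoch = list(iterator)
second_epoch = list(iterator)
```

Because the seed for each pass is derived from seed+epoch, a run can be reproduced exactly as long as the epoch counter is restored.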
- to_shar(output_dir, fields, shard_size=1000, shard_offset=0, warn_unused_fields=True, include_cuts=True, num_jobs=1, fault_tolerant=False, verbose=False)[source]
Writes cuts and their corresponding data into multiple shards, also recognized as the Lhotse Shar format. Each shard is numbered and represented as a collection of one text manifest and one or more binary tarfiles. Each tarfile contains a single type of data, e.g., recordings, features, or custom fields.
The main idea behind Lhotse Shar format is to optimize dataloading with sequential reads, while keeping the data composition more flexible than e.g. WebDataset tar archives do. To achieve this, Lhotse Shar keeps each data type in a separate archive, alongside a single CutSet JSONL manifest. This way, the metadata can be investigated without iterating through the binary data. The format also allows iteration over a subset of fields, or extension of existing data with new fields.
The user has to specify which fields should be saved, and what compression to use for each of them. Currently we support wav, flac, and mp3 compression for recording and custom audio fields, and lilcom or numpy for features and custom array fields.
Example:
>>> cuts = CutSet(...)  # cuts have 'recording' and 'features'
>>> output_paths = cuts.to_shar(
...     "some_dir", shard_size=100, fields={"recording": "mp3", "features": "lilcom"}
... )
It would create a directory some_dir with files such as some_dir/cuts.000000.jsonl.gz, some_dir/recording.000000.tar, some_dir/features.000000.tar, and then the same names but numbered with 000001, etc. The starting shard offset can be set using the shard_offset parameter. The writer starts from 0 by default. The function returns a dict that maps field names to lists of saved shard paths.
When shard_size is set to None, we will disable automatic sharding and the shard number suffix will be omitted from the file names.
The option warn_unused_fields will emit a warning when cuts have some data attached to them (e.g., recording, features, or custom arrays) but saving it was not specified via fields.
The option include_cuts controls whether we store the cuts alongside fields (true by default). Turning it off is useful when extending an existing dataset with new fields/feature types, but the original cuts do not require any modification.
When num_jobs is greater than 1, we will first split the CutSet into shard CutSets, and then export the fields in parallel using multiple subprocesses. Enabling verbose will display a progress bar.
Note
It is recommended not to set num_jobs too high on systems with slow disks, as the export will likely be bottlenecked by I/O speed in these cases. Try experimenting with 4-8 jobs first.
The option fault_tolerant will skip over audio files that failed to load with a warning. By default it is disabled.
- Return type:
Dict[str,List[str]]
- See also:
SharWriter, to_shar().
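The file layout described above can be sketched without Lhotse. The helper expected_shard_paths is hypothetical; it assumes the six-digit suffix seen in the example names, a shard count of ceil(num_cuts / shard_size), and that the cuts manifest is listed under a "cuts" key. None of these is confirmed as the library's exact behavior.

```python
from typing import Dict, List


def expected_shard_paths(
    output_dir: str,
    fields: List[str],
    num_cuts: int,
    shard_size: int = 1000,
    shard_offset: int = 0,
) -> Dict[str, List[str]]:
    """List the shard file names one would expect for a dataset (illustrative only)."""
    num_shards = -(-num_cuts // shard_size)  # ceiling division (assumption)
    paths: Dict[str, List[str]] = {"cuts": []}
    for name in fields:
        paths[name] = []
    for idx in range(shard_offset, shard_offset + num_shards):
        # One text manifest per shard ...
        paths["cuts"].append(f"{output_dir}/cuts.{idx:06d}.jsonl.gz")
        # ... and one tar file per data field.
        for name in fields:
            paths[name].append(f"{output_dir}/{name}.{idx:06d}.tar")
    return paths


paths = expected_shard_paths("some_dir", ["recording", "features"], num_cuts=250, shard_size=100)
```

With 250 cuts and shard_size=100 this sketch lists three shards per field, numbered 000000 to 000002.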
- decompose(output_dir=None, verbose=False)[source]
Return a 3-tuple of unique (recordings, supervisions, features) found in this CutSet. Some manifest sets may also be None, e.g., if features were not extracted.
Note
MixedCut is iterated over its track cuts.
- Parameters:
output_dir (Union[Path,str,None]) – directory where the manifests will be saved. The following files will be created: ‘recordings.jsonl.gz’, ‘supervisions.jsonl.gz’, ‘features.jsonl.gz’.
verbose (bool) – when True, shows a progress bar.
- Return type:
Tuple[Optional[RecordingSet],Optional[SupervisionSet],Optional[FeatureSet]]
- describe(full=False)[source]
Print a message describing details about the CutSet: the number of cuts and the duration statistics, including the total duration and the percentage of speech segments.
- Parameters:
full (bool) – when True, prints the full duration statistics, including % of speech by speaker count.
- Return type:
None
Example output (for AMI train set):
>>> cs.describe(full=True)
Cut statistics:
╒═══════════════════════════╤══════════╕
│ Cuts count:               │ 133      │
├───────────────────────────┼──────────┤
│ Total duration (hh:mm:ss) │ 79:23:03 │
├───────────────────────────┼──────────┤
│ mean                      │ 2148.7   │
├───────────────────────────┼──────────┤
│ std                       │ 867.4    │
├───────────────────────────┼──────────┤
│ min                       │ 477.9    │
├───────────────────────────┼──────────┤
│ 25%                       │ 1509.8   │
├───────────────────────────┼──────────┤
│ 50%                       │ 2181.7   │
├───────────────────────────┼──────────┤
│ 75%                       │ 2439.9   │
├───────────────────────────┼──────────┤
│ 99%                       │ 5300.7   │
├───────────────────────────┼──────────┤
│ 99.5%                     │ 5355.3   │
├───────────────────────────┼──────────┤
│ 99.9%                     │ 5403.2   │
├───────────────────────────┼──────────┤
│ max                       │ 5415.2   │
├───────────────────────────┼──────────┤
│ Recordings available:     │ 133      │
├───────────────────────────┼──────────┤
│ Features available:       │ 0        │
├───────────────────────────┼──────────┤
│ Supervisions available:   │ 102222   │
╘═══════════════════════════╧══════════╛
Speech duration statistics:
╒══════════════════════════════╤══════════╤═══════════════════════════╕
│ Total speech duration        │ 64:59:51 │ 81.88% of recording       │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Total speaking time duration │ 74:33:09 │ 93.91% of recording       │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Total silence duration       │ 14:23:12 │ 18.12% of recording       │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Single-speaker duration      │ 56:18:24 │ 70.93% (86.63% of speech) │
├──────────────────────────────┼──────────┼───────────────────────────┤
│ Overlapped speech duration   │ 08:41:28 │ 10.95% (13.37% of speech) │
╘══════════════════════════════╧══════════╧═══════════════════════════╛
Speech duration statistics by number of speakers:
╒══════════════════════╤═══════════════════════╤════════════════════════════╤═══════════════╤══════════════════════╕
│ Number of speakers   │ Duration (hh:mm:ss)   │ Speaking time (hh:mm:ss)   │ % of speech   │ % of speaking time   │
╞══════════════════════╪═══════════════════════╪════════════════════════════╪═══════════════╪══════════════════════╡
│ 1                    │ 56:18:24              │ 56:18:24                   │ 86.63%        │ 75.53%               │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ 2                    │ 07:51:44              │ 15:43:28                   │ 12.10%        │ 21.09%               │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ 3                    │ 00:47:36              │ 02:22:47                   │ 1.22%         │ 3.19%                │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ 4                    │ 00:02:08              │ 00:08:31                   │ 0.05%         │ 0.19%                │
├──────────────────────┼───────────────────────┼────────────────────────────┼───────────────┼──────────────────────┤
│ Total                │ 64:59:51              │ 74:33:09                   │ 100.00%       │ 100.00%              │
╘══════════════════════╧═══════════════════════╧════════════════════════════╧═══════════════╧══════════════════════╛
- split(num_splits, shuffle=False, drop_last=False)[source]
Split the CutSet into num_splits pieces of equal size.
- Parameters:
num_splits (int) – Requested number of splits.
shuffle (bool) – Optionally shuffle the recordings order first.
drop_last (bool) – determines how to handle splitting when len(seq) is not divisible by num_splits. When False (default), the splits might have unequal lengths. When True, it may discard the last element in some splits to ensure they are equally long.
- Return type:
List[CutSet]
- Returns:
A list of CutSet pieces.
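The drop_last behavior can be illustrated on a plain list. This is a minimal sketch, not the library's implementation: split_sketch is a hypothetical helper, and which items end up discarded (here, the trailing remainder) is an assumption.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_sketch(seq: Sequence[T], num_splits: int, drop_last: bool = False) -> List[List[T]]:
    """Split `seq` into `num_splits` contiguous pieces (illustrative only)."""
    items = list(seq)
    base, remainder = divmod(len(items), num_splits)
    splits: List[List[T]] = []
    start = 0
    for i in range(num_splits):
        # Without drop_last, the first `remainder` splits take one extra item;
        # with drop_last, every split has exactly `base` items.
        size = base if (drop_last or i >= remainder) else base + 1
        splits.append(items[start:start + size])
        start += size
    return splits


uneven = split_sketch(range(10), 3)
even = split_sketch(range(10), 3, drop_last=True)
```

Ten items split three ways give pieces of sizes 4, 3, 3 by default, and three pieces of size 3 with drop_last=True (one item is left out).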
- split_lazy(output_dir, chunk_size, prefix='', num_digits=8, start_idx=0)[source]
Splits a manifest (either lazily or eagerly opened) into chunks, each with chunk_size items (except for the last one, typically).
In order to be memory efficient, this implementation saves each chunk to disk in a .jsonl.gz format as the input manifest is sampled.
Note
For lowest memory usage, use load_manifest_lazy to open the input manifest for this method.
- Parameters:
it – any iterable of Lhotse manifests.
output_dir (Union[Path,str]) – directory where the split manifests are saved. Each manifest is saved at: {output_dir}/{prefix}.{split_idx}.jsonl.gz
chunk_size (int) – the number of items in each chunk.
prefix (str) – the prefix of each manifest.
num_digits (int) – the width of split_idx, which will be left padded with zeros to achieve it.
start_idx (int) – The split index to start counting from (default is 0).
- Return type:
List[CutSet]
- Returns:
a list of lazily opened chunk manifests.
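The chunking and naming scheme can be sketched as a generator over any iterable. The helper chunk_with_names is hypothetical and writes nothing to disk; it only pairs each chunk with the path pattern {output_dir}/{prefix}.{split_idx}.jsonl.gz given above, with split_idx zero-padded to num_digits.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def chunk_with_names(
    items: Iterable,
    output_dir: str,
    chunk_size: int,
    prefix: str = "",
    num_digits: int = 8,
    start_idx: int = 0,
) -> Iterator[Tuple[str, List]]:
    """Yield (path, chunk) pairs while consuming `items` lazily (illustrative only)."""
    it = iter(items)
    split_idx = start_idx
    while True:
        chunk = list(islice(it, chunk_size))  # read at most chunk_size items
        if not chunk:
            break
        name = f"{output_dir}/{prefix}.{split_idx:0{num_digits}d}.jsonl.gz"
        yield name, chunk
        split_idx += 1


chunks = list(chunk_with_names(range(5), "out", chunk_size=2, prefix="cuts"))
```

Only one chunk is held in memory at a time, which is the property the note about load_manifest_lazy relies on.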
- subset(*, supervision_ids=None, cut_ids=None, first=None, last=None)[source]
Return a new CutSet according to the selected subset criterion. Only a single argument to subset is supported at this time.
- Example:
>>> cuts = CutSet.from_yaml('path/to/cuts')
>>> train_set = cuts.subset(supervision_ids=train_ids)
>>> test_set = cuts.subset(supervision_ids=test_ids)
- Parameters:
supervision_ids (Optional[Iterable[str]]) – List of supervision IDs to keep.
cut_ids (Optional[Iterable[str]]) – List of cut IDs to keep. The returned CutSet preserves the order of cut_ids.
first (Optional[int]) – int, the number of first cuts to keep.
last (Optional[int]) – int, the number of last cuts to keep.
- Return type:
CutSet
- Returns:
a new CutSet with the subset results.
- map(transform_fn, apply_fn=<function is_cut>)[source]
Apply transform_fn to each item in this manifest and return a new manifest. If the manifest is opened lazily, the transform is also applied lazily.
- Parameters:
transform_fn (Callable[[TypeVar(T)],TypeVar(T)]) – A callable (function) that accepts a single item instance and returns a new (or the same) instance of the same type. E.g. with CutSet, the callable accepts a Cut and also returns a Cut.
- Return type:
CutSet
- Returns:
a new CutSet with transformed cuts.
- filter_supervisions(predicate)[source]
Return a new CutSet with Cuts containing only SupervisionSegments satisfying predicate.
Cuts without supervisions are preserved.
- Example:
>>> cuts = CutSet.from_yaml('path/to/cuts')
>>> at_least_five_second_supervisions = cuts.filter_supervisions(lambda s: s.duration >= 5)
- Parameters:
predicate (Callable[[SupervisionSegment],bool]) – A callable that accepts SupervisionSegment and returns bool.
- Return type:
CutSet
- Returns:
a CutSet with filtered supervisions.
- merge_supervisions(merge_policy='delimiter', custom_merge_fn=None)[source]
Return a copy of the cut that has all of its supervisions merged into a single segment.
The new start is the start of the earliest supervision, and the new duration is a minimum spanning duration for all the supervisions. The text fields of all segments are concatenated with a whitespace.
- Parameters:
merge_policy (str) – one of “keep_first” or “delimiter”. If “keep_first”, we keep only the first segment’s field value; otherwise all string fields (including IDs) are prefixed with “cat#” and concatenated with a hash symbol “#”. This is also applied to custom fields. Fields with a None value are omitted.
custom_merge_fn (Optional[Callable[[str,Iterable[Any]],Any]]) – a function that will be called to merge custom fields values. We expect custom_merge_fn to handle all possible custom keys. When not provided, we will treat all custom values as strings. It will be called roughly like: custom_merge_fn(custom_key, [s.custom[custom_key] for s in sups])
- Return type:
CutSet
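The two merge policies can be sketched for a single string field. The helper merge_string_field is hypothetical and is not Lhotse's code; in particular, dropping None values before applying “keep_first” is an assumption.

```python
from typing import List, Optional


def merge_string_field(values: List[Optional[str]], merge_policy: str = "delimiter") -> Optional[str]:
    """Merge one string field across several supervisions (illustrative only)."""
    present = [v for v in values if v is not None]  # None values are omitted
    if not present:
        return None
    if merge_policy == "keep_first":
        return present[0]
    if merge_policy == "delimiter":
        # Prefix with "cat#" and join the values with "#".
        return "cat#" + "#".join(present)
    raise ValueError(f"Unknown merge_policy: {merge_policy}")


merged_ids = merge_string_field(["sup-1", "sup-2", None])
first_id = merge_string_field(["sup-1", "sup-2", None], merge_policy="keep_first")
```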
- trim_to_supervisions(keep_overlapping=True, min_duration=None, context_direction='center', keep_all_channels=False, num_jobs=1)[source]
Return a new CutSet with Cuts that have identical spans as their supervisions.
For example, the following cut:
Cut   |-----------------|
Sup1  |----|
Sup2     |-----------|
is transformed into two cuts:
Cut1  |----|
Sup1  |----|
Sup2     |-|

Cut2  |-----------|
Sup1  |-|
Sup2  |-----------|
For the case of a multi-channel cut with multiple supervisions, we can either trim while respecting the supervision channels (in which case output cut has the same channels as the supervision) or ignore the channels (in which case output cut has the same channels as the input cut).
- Parameters:
keep_overlapping (bool) – when False, it will discard parts of other supervisions that overlap with the main supervision. In the illustration above, it would discard Sup2 in Cut1 and Sup1 in Cut2. In this mode, we guarantee that there will always be exactly one supervision per cut.
min_duration (Optional[float]) – An optional duration in seconds; specifying this argument will extend the cuts that would have been shorter than min_duration with actual acoustic context in the recording/features. If there are supervisions present in the context, they are kept when keep_overlapping is true. If there is not enough context, the returned cut will be shorter than min_duration. If the supervision segment is longer than min_duration, the returned cut will be longer.
context_direction (Literal['center','left','right','random']) – Which direction should the cut be expanded towards to include context. The value of “center” implies equal expansion to left and right; random uniformly samples a value between “left” and “right”.
keep_all_channels (bool) – If True, the output cut will have the same channels as the input cut. By default, the trimmed cut will have the same channels as the supervision.
num_jobs (int) – Number of parallel workers to process the cuts.
- Return type:
CutSet
- Returns:
a CutSet.
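The trimming rule illustrated above can be sketched on plain time spans. Span and trim_to_spans are hypothetical stand-ins for supervisions and the method itself; the sketch ignores min_duration, channels, and the clipping of overlapping supervisions to the new cut boundaries.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Span:
    id: str
    start: float
    end: float


def trim_to_spans(sups: List[Span], keep_overlapping: bool = True) -> List[Dict]:
    """Create one cut per supervision, spanning exactly that supervision (illustrative only)."""
    cuts = []
    for main in sups:
        kept = [main.id]
        if keep_overlapping:
            # Keep every other supervision that intersects the main one.
            kept += [
                o.id for o in sups
                if o is not main and o.start < main.end and o.end > main.start
            ]
        cuts.append({"start": main.start, "end": main.end, "supervisions": kept})
    return cuts


sups = [Span("sup1", 0.0, 5.0), Span("sup2", 3.0, 15.0)]
with_overlaps = trim_to_spans(sups)
single_sup = trim_to_spans(sups, keep_overlapping=False)
```

With keep_overlapping=False every resulting cut lists exactly one supervision, matching the guarantee stated for that mode.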
- trim_to_alignments(type, max_pause=0.0, max_segment_duration=None, delimiter=' ', keep_all_channels=False, num_jobs=1)[source]
Return a new CutSet with Cuts that have identical spans as the alignments of type type. An additional max_pause is allowed between the alignments to merge contiguous alignment items.
For the case of a multi-channel cut with multiple alignments, we can either trim while respecting the supervision channels (in which case output cut has the same channels as the supervision) or ignore the channels (in which case output cut has the same channels as the input cut).
- Parameters:
type (str) – The type of the alignment to trim to (e.g. “word”).
max_pause (float) – The maximum pause allowed between the alignments to merge them.
delimiter (str) – The delimiter to use when concatenating the alignment items.
keep_all_channels (bool) – If True, the output cut will have the same channels as the input cut. By default, the trimmed cut will have the same channels as the supervision.
num_jobs (int) – Number of parallel workers to process the cuts.
- Return type:
CutSet
- Returns:
a CutSet.
- trim_to_unsupervised_segments()[source]
Return a new CutSet with Cuts created from segments that have no supervisions (likely silence or noise).
- Return type:
- Returns:
a
CutSet.
- trim_to_supervision_groups(max_pause=None, num_jobs=1)[source]
Return a new CutSet with Cuts based on supervision groups. A supervision group is a set of supervisions with no gaps between them (or gaps shorter than max_pause). This is similar to the concept of an utterance group as described in this paper: https://arxiv.org/abs/2211.00482
For example, the following cut:
Cut
╔═════════════════════════════════════════════════════════════════════════════════╗
║┌──────────────────────┐                              ┌────────┐                 ║
║│ Hello this is John.  │                              │   Hi   │                 ║
║└──────────────────────┘                              └────────┘                 ║
║            ┌──────────────────────────────────┐            ┌───────────────────┐║
║            │     Hey, John. How are you?      │            │  What do you do?  │║
║            └──────────────────────────────────┘            └───────────────────┘║
╚═════════════════════════════════════════════════════════════════════════════════╝
is transformed into two cuts:
Cut 1
╔════════════════════════════════════════════════╗
║┌──────────────────────┐                        ║
║│ Hello this is John.  │                        ║
║└──────────────────────┘                        ║
║            ┌──────────────────────────────────┐║
║            │     Hey, John. How are you?      │║
║            └──────────────────────────────────┘║
╚════════════════════════════════════════════════╝
Cut 2
╔═══════════════════════════╗
║┌────────┐                 ║
║│   Hi   │                 ║
║└────────┘                 ║
║      ┌───────────────────┐║
║      │  What do you do?  │║
║      └───────────────────┘║
╚═══════════════════════════╝
For the case of a multi-channel cut with multiple supervisions, we keep all the channels in the recording.
- Parameters:
max_pause (Optional[float]) – An optional duration in seconds; if the gap between two supervisions is longer than this, they will be treated as separate groups.
num_jobs (int) – Number of parallel workers to process the cuts.
- Return type:
CutSet
- Returns:
a CutSet.
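The grouping rule can be sketched on (start, end) intervals. The helper group_by_pause is hypothetical, and treating a missing max_pause as 0.0 is an assumption; only the rule that a gap longer than max_pause starts a new group comes from the description above.

```python
from typing import List, Optional, Tuple

Interval = Tuple[float, float]


def group_by_pause(sups: List[Interval], max_pause: float = 0.0) -> List[List[Interval]]:
    """Group intervals so that gaps longer than `max_pause` separate groups (illustrative only)."""
    groups: List[List[Interval]] = []
    group_end: Optional[float] = None
    for start, end in sorted(sups):
        if group_end is None or start - group_end > max_pause:
            groups.append([])      # gap too long: open a new group
            group_end = end
        groups[-1].append((start, end))
        group_end = max(group_end, end)  # the group extends to its latest end time
    return groups


sups = [(0.0, 4.0), (3.0, 9.0), (11.0, 12.0), (11.5, 15.0)]
two_groups = group_by_pause(sups)
one_group = group_by_pause(sups, max_pause=2.0)
```

Each resulting group would then become one cut spanning from the earliest start to the latest end in that group.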
- combine_same_recording_channels()[source]
Find cuts that come from the same recording and have matching start and end times, but represent different channels. Then, combine them together to form MultiCuts and return a new CutSet containing these MultiCuts. This is useful for processing microphone array recordings.
It is intended to be used as the first operation after creating a new CutSet (but might also work in other circumstances, e.g. if it was cut to windows first).
- Return type:
CutSet
- Example:
>>> ami = prepare_ami('path/to/ami')
>>> cut_set = CutSet.from_manifests(recordings=ami['train']['recordings'])
>>> multi_channel_cut_set = cut_set.combine_same_recording_channels()
In the AMI example, the multi_channel_cut_set will yield MultiCuts that hold all single-channel Cuts together.
- sort_by_recording_id(ascending=True)[source]
Sort the CutSet alphabetically according to ‘recording_id’. Ascending by default.
This is advantageous before calling save_audios() on a CutSet processed with trim_to_supervisions(); also make sure that set_caching_enabled(True) was called.
- Return type:
- sort_by_duration(ascending=False)[source]
Sort the CutSet according to cuts duration and return the result. Descending by default.
- Return type:
- sort_like(other)[source]
Sort the CutSet according to the order of cut IDs in other and return the result.
- Return type:
CutSet
- index_supervisions(index_mixed_tracks=False, keep_ids=None)[source]
Create a two-level index of supervision segments. It is a mapping from a Cut’s ID to an interval tree that contains the supervisions of that Cut.
The interval tree can be efficiently queried for overlapping and/or enveloping segments. It helps speed up some operations on Cuts of very long recordings (1h+) that contain many supervisions.
- Parameters:
index_mixed_tracks (bool) – Should the tracks of MixedCuts be indexed as additional, separate entries.
keep_ids (Optional[Set[str]]) – If specified, we will only index the supervisions with the specified IDs.
- Return type:
Dict[str,IntervalTree]
- Returns:
a mapping from Cut ID to an interval tree of SupervisionSegments.
- pad(duration=None, num_frames=None, num_samples=None, pad_feat_value=-23.025850929940457, direction='right', preserve_id=False, pad_value_dict=None)[source]
Return a new CutSet with Cuts padded to duration, num_frames or num_samples. Cuts longer than the specified argument will not be affected. By default, cuts will be padded to the right (i.e. after the signal).
When none of duration, num_frames, or num_samples is specified, we’ll try to determine the best way to pad to the longest cut based on whether features or recordings are available.
- Parameters:
duration (Optional[float]) – The cuts minimal duration after padding. When not specified, we’ll choose the duration of the longest cut in the CutSet.
num_frames (Optional[int]) – The cut’s total number of frames after padding.
num_samples (Optional[int]) – The cut’s total number of samples after padding.
pad_feat_value (float) – A float value that’s used for padding the features. By default we assume a log-energy floor of approx. -23 (1e-10 after exp).
direction (str) – string, ‘left’, ‘right’ or ‘both’. Determines whether the padding is added before the cut, after it, or on both sides.
preserve_id (bool) – When True, preserves the cut ID from before padding. Otherwise, generates a new random ID (default).
pad_value_dict (Optional[Dict[str,Union[int,float]]]) – Optional dict that specifies what value should be used for padding arrays in custom attributes.
- Return type:
CutSet
- Returns:
A padded CutSet.
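The default pad_feat_value is the natural logarithm of 1e-10, which the short sketch below checks numerically. The helper pad_values is hypothetical and works on plain lists; splitting the padding evenly for ‘both’ is an assumption, not confirmed library behavior.

```python
import math
from typing import List

# Natural log of the 1e-10 energy floor, approximately -23.0259.
LOG_FLOOR = math.log(1e-10)


def pad_values(values: List[float], target_len: int, pad_value: float, direction: str = "right") -> List[float]:
    """Pad a sequence to `target_len`; longer sequences are returned unchanged (illustrative only)."""
    missing = max(target_len - len(values), 0)
    if direction == "right":
        return values + [pad_value] * missing
    if direction == "left":
        return [pad_value] * missing + values
    if direction == "both":
        left = missing // 2  # even split is an assumption
        return [pad_value] * left + values + [pad_value] * (missing - left)
    raise ValueError(f"Unknown direction: {direction}")


padded = pad_values([1.0, 2.0], target_len=4, pad_value=LOG_FLOOR)
```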
- truncate(max_duration, offset_type, keep_excessive_supervisions=True, preserve_id=False, rng=None)[source]
Return a new CutSet with the Cuts truncated so that their durations are at most max_duration. Cuts shorter than max_duration will not be changed.
- Parameters:
max_duration (float) – float, the maximum duration in seconds of a cut in the resulting manifest.
offset_type (str) – str, can be: ‘start’ => cuts are truncated from their start; ‘end’ => cuts are truncated from their end minus max_duration; ‘random’ => cuts are truncated randomly between their start and their end minus max_duration.
keep_excessive_supervisions (bool) – bool. When a cut is truncated in the middle of a supervision segment, should the supervision be kept.
preserve_id (bool) – bool. Should the truncated cut keep the same ID or get a new, random one.
rng (Optional[Random]) – optional random number generator to be used with a ‘random’ offset_type.
- Return type:
CutSet
- Returns:
a new CutSet instance with truncated cuts.
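The three offset_type choices can be sketched as a function that picks where the retained region starts inside a cut. The helper truncation_offset is hypothetical; drawing the random offset uniformly is an assumption.

```python
import random
from typing import Optional


def truncation_offset(
    duration: float,
    max_duration: float,
    offset_type: str,
    rng: Optional[random.Random] = None,
) -> float:
    """Return the start offset (seconds) of the region kept after truncation (illustrative only)."""
    if duration <= max_duration:
        return 0.0  # shorter cuts are left unchanged
    slack = duration - max_duration
    if offset_type == "start":
        return 0.0
    if offset_type == "end":
        return slack  # keep the final max_duration seconds
    if offset_type == "random":
        rng = rng or random.Random()
        return rng.uniform(0.0, slack)  # uniform draw is an assumption
    raise ValueError(f"Unknown offset_type: {offset_type}")


start_offset = truncation_offset(10.0, 4.0, "start")
end_offset = truncation_offset(10.0, 4.0, "end")
random_offset = truncation_offset(10.0, 4.0, "random", rng=random.Random(0))
```

Passing a seeded random.Random as rng makes the ‘random’ choice repeatable across runs.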
- extend_by(duration, direction='both', preserve_id=False, pad_silence=True)[source]
Returns a new CutSet with cuts extended by duration amount.
- Parameters:
duration (float) – float (seconds), specifies the duration by which the CutSet is extended.
direction (str) – string, ‘left’, ‘right’ or ‘both’. Determines whether to extend on the left, right, or both sides. If ‘both’, extend on both sides by the same duration (equal to duration).
preserve_id (bool) – bool. Should the extended cut keep the same ID or get a new, random one.
pad_silence (bool) – bool. If True, the extended part of the cut will be padded with silence if required to match the specified duration.
- Return type:
CutSet
- Returns:
a new CutSet instance.
- cut_into_windows(duration, hop=None, keep_excessive_supervisions=True, num_jobs=1)[source]
Return a new CutSet, made by traversing each DataCut in windows of duration seconds by hop seconds and creating new DataCut out of them.
The last window might have a shorter duration if there was not enough audio, so you might want to use either .filter() or .pad() afterwards to obtain a uniform duration CutSet.
- Parameters:
duration (float) – Desired duration of the new cuts in seconds.
hop (Optional[float]) – Shift between the windows in the new cuts in seconds.
keep_excessive_supervisions (bool) – bool. When a cut is truncated in the middle of a supervision segment, should the supervision be kept.
num_jobs (int) – The number of parallel workers.
- Return type:
CutSet
- Returns:
a new CutSet with cuts made from shorter duration windows.
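The windowing can be sketched as a computation of (start, duration) pairs. The helper window_spans is hypothetical; defaulting hop to duration when it is not given, and stopping once a window reaches the end of the audio, are assumptions about the behavior rather than the library's code.

```python
from typing import List, Optional, Tuple


def window_spans(total: float, duration: float, hop: Optional[float] = None) -> List[Tuple[float, float]]:
    """Return (start, length) of each window over `total` seconds (illustrative only)."""
    hop = duration if hop is None else hop  # assumption: non-overlapping by default
    if hop <= 0 or duration <= 0:
        raise ValueError("duration and hop must be positive")
    spans = []
    i = 0
    while True:
        start = i * hop
        if start >= total:
            break
        # The last window may be shorter when not enough audio is left.
        spans.append((start, min(duration, total - start)))
        if start + duration >= total:
            break
        i += 1
    return spans


plain = window_spans(10.0, 4.0)
overlapping = window_spans(10.0, 4.0, hop=2.0)
```

The shorter final window in the first case is the one that .filter() or .pad() would be used to handle.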
- cut_into_windows_balanced(min_duration, max_duration, overlap=0.0, keep_excessive_supervisions=True, num_jobs=1)[source]
Return a new CutSet by splitting every cut into overlapping windows whose duration is chosen in [min_duration, max_duration] to maximise the last chunk length.
Each sub-cut has custom["source_cut_id"] and custom["source_cut_start"] set so that downstream merging logic can group sub-cuts from the same parent.
Cuts whose duration is already <= max_duration are returned unchanged (as a single element in the output stream).
- Parameters:
min_duration (float) – Minimum window duration in seconds.
max_duration (float) – Maximum window duration in seconds.
overlap (float) – Overlap between consecutive windows in seconds (default: 0).
keep_excessive_supervisions (bool) – Whether to keep supervisions that extend beyond the window boundary.
num_jobs (int) – Number of parallel workers (default: 1).
- Return type:
CutSet
- Returns:
a new CutSet with overlapping sub-cuts (flat, not grouped).
- load_audio(collate=False, limit=1024)[source]
Reads the audio of all cuts in this CutSet into memory. Useful when this object represents a mini-batch.
- Parameters:
collate (bool) – Should we collate the read audio into a single array. Shorter cuts will be padded. False by default.
limit (int) – Maximum number of read audio examples. By default it’s 1024 which covers most frequently encountered mini-batch sizes. If you are working with larger batch sizes, increase this limit.
- Return type:
Union[List[ndarray],Tuple[ndarray,ndarray]]
- Returns:
A list of numpy arrays, or a single array with batch size as the first dim.
- sample(n_cuts=1)[source]
Randomly sample this CutSet and return n_cuts cuts. When n_cuts is 1, will return a single cut instance; otherwise will return a CutSet.
- resample(sampling_rate, affix_id=False, recording_field=None)[source]
Return a new CutSet that contains cuts resampled to the new sampling_rate. All cuts in the manifest must contain recording information. If the feature manifests are attached, they are dropped.
- Parameters:
sampling_rate (int) – The new sampling rate.
affix_id (bool) – Should we modify the ID (useful if both versions of the same cut are going to be present in a single manifest).
recording_field (Optional[str]) – which recording field to resample.
- Return type:
CutSet
- Returns:
a modified copy of the CutSet.
- perturb_speed(factor, affix_id=True)[source]
Return a new CutSet that contains speed perturbed cuts with a factor of factor. It requires the recording manifests to be present. If the feature manifests are attached, they are dropped. The supervision manifests are modified to reflect the speed perturbed start times and durations.
- Parameters:
factor (float) – The resulting playback speed is factor times the original one.
affix_id (bool) – Should we modify the ID (useful if both versions of the same cut are going to be present in a single manifest).
- Return type:
CutSet
- Returns:
a modified copy of the CutSet.
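The effect on supervision times can be approximated with simple arithmetic: playing audio factor times faster divides start times and durations by factor. The helper perturbed_times is hypothetical, and the sketch ignores any rounding to whole samples that an actual implementation may perform.

```python
from typing import Tuple


def perturbed_times(start: float, duration: float, factor: float) -> Tuple[float, float]:
    """Approximate start and duration after speed perturbation (illustrative only)."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    # Faster playback (factor > 1) shortens the timeline; slower playback stretches it.
    return start / factor, duration / factor


new_start, new_duration = perturbed_times(start=2.0, duration=10.0, factor=2.0)
```

For instance, a factor of 0.9 lengthens a 9-second supervision to roughly 10 seconds.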
- perturb_tempo(factor, affix_id=True)[source]
Return a new CutSet that contains tempo perturbed cuts with a factor of factor.
Compared to speed perturbation, tempo preserves pitch. It requires the recording manifests to be present. If the feature manifests are attached, they are dropped. The supervision manifests are modified to reflect the tempo perturbed start times and durations.
- Parameters:
factor (float) – The resulting playback tempo is factor times the original one.
affix_id (bool) – Should we modify the ID (useful if both versions of the same cut are going to be present in a single manifest).
- Return type:
CutSet
- Returns:
a modified copy of the CutSet.
- perturb_volume(factor, affix_id=True)[source]
Return a new CutSet that contains volume perturbed cuts with a factor of factor. It requires the recording manifests to be present. If the feature manifests are attached, they are dropped. The supervision manifests remain the same.
- Parameters:
factor (float) – The resulting playback volume is factor times the original one.
affix_id (bool) – Should we modify the ID (useful if both versions of the same cut are going to be present in a single manifest).
- Return type:
CutSet
- Returns:
a modified copy of the CutSet.
- narrowband(codec, restore_orig_sr=True, affix_id=True)[source]
Return a new CutSet that contains narrowband effect cuts. It requires the recording manifests to be present. If the feature manifests are attached, they are dropped. The supervision manifests remain the same.
- Parameters:
codec (str) – Codec name.
restore_orig_sr (bool) – Restore original sampling rate.
affix_id (bool) – Should we modify the ID (useful if both versions of the same cut are going to be present in a single manifest).
- Return type:
CutSet
- Returns:
a modified copy of the CutSet.
- normalize_loudness(target, mix_first=True, affix_id=True)[source]
Return a new CutSet that will lazily apply loudness normalization to the desired target loudness (in dBFS).
- Parameters:
target (float) – The target loudness in dBFS.
affix_id (bool) – When true, we will modify the Cut.id field by affixing it with “_ln{target}”.
- Return type:
CutSet
- Returns:
a modified copy of the current CutSet.
- dereverb_wpe(affix_id=True)[source]
Return a new CutSet that will lazily apply WPE dereverberation.
- Parameters:
affix_id (bool) – When true, we will modify the Cut.id field by affixing it with “_wpe”.
- Return type:
CutSet
- Returns:
a modified copy of the current CutSet.
- reverb_rir(rir_recordings=None, normalize_output=True, early_only=False, affix_id=True, rir_channels=[0])[source]
Return a new CutSet that contains original cuts convolved with randomly chosen impulse responses from rir_recordings. It requires the recording manifests to be present. If the feature manifests are attached, they are dropped. The supervision manifests remain the same.
If no rir_recordings are provided, we will generate a set of impulse responses using a fast random generator (https://arxiv.org/abs/2208.04101).
- Parameters:
rir_recordings (Optional[RecordingSet]) – RecordingSet containing the room impulse responses.
normalize_output (bool) – When true, output will be normalized to have the same energy as the input.
early_only (bool) – When true, only the early reflections (first 50 ms) will be used.
affix_id (bool) – Should we modify the ID (useful if both versions of the same cut are going to be present in a single manifest).
rir_channels (List[int]) – The channels of the impulse response to use. By default, first channel will be used. If it is a multi-channel RIR, applying RIR will produce MixedCut. If no RIR is provided, we will generate one with as many channels as this argument specifies.
- Return type:
CutSet
- Returns:
a modified copy of the CutSet.
- mix(cuts, duration=None, allow_padding=False, snr=20, preserve_id=None, mix_prob=1.0, seed=42, random_mix_offset=False, tag=None)[source]
Mix cuts in this CutSet with randomly sampled cuts from another CutSet. A typical application would be data augmentation with noise, music, babble, etc.
- Parameters:
cuts (CutSet) – a CutSet containing cuts to be mixed into this CutSet.
duration (Optional[float]) – an optional float in seconds. When None, we will preserve the duration of the cuts in self (i.e. we’ll truncate the mix if it exceeded the original duration). Otherwise, we will keep sampling cuts to mix in until we reach the specified duration (and truncate to that value, should it be exceeded).
allow_padding (bool) – an optional bool. When it is True, we will allow the offset to be larger than the reference cut by padding the reference cut.
snr (Union[float,Sequence[float],None]) – an optional float, or pair (range) of floats, in decibels. When it’s a single float, we will mix all cuts with this SNR level (where cuts in self are treated as signals, and cuts in cuts are treated as noise). When it’s a pair of floats, we will uniformly sample SNR values from that range. When None, we will mix the cuts without any level adjustment (could be too noisy for data augmentation).
preserve_id (Optional[str]) – optional string (“left”, “right”). When specified, the cut ID of the left- or right-hand side argument is preserved. Otherwise, a new random ID is generated.
mix_prob (float) – an optional float in range [0, 1]. Specifies the probability of performing a mix. Values lower than 1.0 mean that some cuts in the output will be unchanged.
seed (Union[int,Literal['trng','randomized'],Random]) – an optional int or “trng”. Random seed for choosing the cuts to mix and the SNR. If “trng” is provided, we’ll use the secrets module for non-deterministic results on each iteration. You can also directly pass a random.Random instance here.
random_mix_offset (bool) – an optional bool. When True and the duration of the cut to be mixed in is longer than the original cut, select a random sub-region from the cut to be mixed in.
tag (Optional[str]) – Optional label attached to the mixed-in tracks.
- Return type:
CutSet
- Returns:
a new CutSet with mixed cuts.
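The role of snr can be illustrated with the standard energy-based definition of signal-to-noise ratio. This is a generic sketch on plain lists of samples, not Lhotse's mixing code; energy, noise_gain_for_snr and mix_at_snr are hypothetical helpers.

```python
import math
from typing import List


def energy(x: List[float]) -> float:
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in x) / len(x)


def noise_gain_for_snr(signal: List[float], noise: List[float], snr_db: float) -> float:
    """Gain to apply to `noise` so that signal/noise energy ratio equals `snr_db`."""
    target_noise_energy = energy(signal) / (10 ** (snr_db / 10))
    return math.sqrt(target_noise_energy / energy(noise))


def mix_at_snr(signal: List[float], noise: List[float], snr_db: float) -> List[float]:
    """Add scaled noise to the signal, sample by sample (illustrative only)."""
    gain = noise_gain_for_snr(signal, noise, snr_db)
    return [s + gain * n for s, n in zip(signal, noise)]


signal = [1.0, -1.0] * 4
noise = [0.5, 0.5, -0.5, -0.5] * 2
gain = noise_gain_for_snr(signal, noise, snr_db=20.0)
mixed = mix_at_snr(signal, noise, snr_db=20.0)
```

A higher snr value yields a smaller gain, so the mixed-in cut is quieter relative to the original.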
- drop_features()[source]
Return a new CutSet, where each Cut is copied and detached from its extracted features.
- Return type:
CutSet
- drop_recordings()[source]
Return a new CutSet, where each Cut is copied and detached from its recordings.
- Return type:
CutSet
- drop_supervisions()[source]
Return a new CutSet, where each Cut is copied and detached from its supervisions.
- Return type:
CutSet
- drop_alignments()[source]
Return a new CutSet, where each Cut is copied and detached from the alignments present in its supervisions.
- Return type:
CutSet
- drop_in_memory_data()[source]
Return a new CutSet, where each Cut is copied and detached from any in-memory data it held. The manifests for in-memory data are converted into placeholders that can still be looked up for metadata, but will fail on attempts to load the data.
- Return type:
CutSet
- compute_and_store_features(extractor, storage_path, num_jobs=None, augment_fn=None, storage_type=None, executor=None, mix_eagerly=True, progress_bar=True)[source]
Extract features for all cuts, possibly in parallel, and store them using the specified storage object.
When storage_type is not provided, Lhotse uses the backend selected by LHOTSE_FEATURES_STORAGE_BACKEND and falls back to numpy_files. If the optional lilcom dependency is installed, prefer LilcomChunkyWriter for better storage efficiency. To inspect the currently usable choices, call lhotse.available_storage_backends(). For a full list that also marks unavailable backends with install hints, use lhotse.storage_backend_statuses() or run lhotse list-storage-backends.
Examples:
Extract fbank features on one machine using 8 processes, store arrays partitioned in 8 archive files with lilcom compression (recommended when lilcom is installed):
>>> from lhotse import LilcomChunkyWriter
>>> cuts = CutSet(...)
>>> cuts.compute_and_store_features(
...     extractor=Fbank(),
...     storage_path='feats',
...     num_jobs=8,
...     storage_type=LilcomChunkyWriter,
... )
Extract fbank features on one machine using 8 processes, store each array in a separate file with lilcom compression:
>>> from lhotse import LilcomFilesWriter
>>> cuts = CutSet(...)
>>> cuts.compute_and_store_features(
...     extractor=Fbank(),
...     storage_path='feats',
...     num_jobs=8,
...     storage_type=LilcomFilesWriter
... )
Extract fbank features on multiple machines using a Dask cluster with 80 jobs, store arrays partitioned in 80 archive files with lilcom compression:
>>> from lhotse import LilcomChunkyWriter
>>> from distributed import Client
>>> cuts = CutSet(...)
>>> cuts.compute_and_store_features(
...     extractor=Fbank(),
...     storage_path='feats',
...     num_jobs=80,
...     storage_type=LilcomChunkyWriter,
...     executor=Client(...)
... )
Extract fbank features on one machine using 8 processes, store each array in an S3 bucket (requires smart_open):
>>> from lhotse import LilcomURLWriter
>>> cuts = CutSet(...)
>>> cuts.compute_and_store_features(
...     extractor=Fbank(),
...     storage_path='s3://my-feature-bucket/my-corpus-features',
...     num_jobs=8,
...     storage_type=LilcomURLWriter
... )
- Parameters:
extractor (
FeatureExtractor) – A FeatureExtractor instance (either Lhotse’s built-in or a custom implementation).
storage_path (
Union[Path,str]) – The path to the location where we will store the features. The exact type and layout of stored files will be dictated by the storage_type argument.
num_jobs (
Optional[int]) – The number of parallel processes used to extract the features. We will internally split the CutSet into this many chunks and process each chunk in parallel.
augment_fn (
Optional[Callable[[ndarray,int],ndarray]]) – an optional callable used for audio augmentation. Be careful with the types of augmentations used: if they modify the start/end/duration times of the cut and its supervisions, you will end up with incorrect supervision information when using this API. E.g. for speed perturbation, use CutSet.perturb_speed() instead.
storage_type (
Optional[Type[TypeVar(FW, bound=FeaturesWriter)]]) – a FeaturesWriter subclass type. It determines how the features are stored to disk, e.g. separate file per array, HDF5 files with multiple arrays, etc. When omitted, Lhotse uses LHOTSE_FEATURES_STORAGE_BACKEND or numpy_files by default. If lilcom is installed, LilcomChunkyWriter remains the preferred choice for storage efficiency.
executor (
Optional[Executor]) – when provided, will be used to parallelize the feature extraction process. By default, we will instantiate a ProcessPoolExecutor. Learn more about the Executor API at https://lhotse.readthedocs.io/en/latest/parallelism.html
mix_eagerly (
bool) – Related to how the features are extracted for MixedCut instances, if any are present. When False, extract and store the features for each track separately, and mix them dynamically when loading the features. When True, mix the audio first and store the mixed features, returning a new DataCut instance with the same ID. The returned DataCut will not have a Recording attached.
progress_bar (
bool) – Whether a progress bar should be displayed (automatically turned off for parallel computation).
- Return type:
CutSet
- Returns:
a new CutSet with Features manifests attached to the cuts.
- compute_and_store_features_batch(extractor, storage_path, manifest_path=None, batch_duration=600.0, num_workers=4, collate=False, augment_fn=None, storage_type=None, overwrite=False)[source]
Extract features for all cuts in batches. This method is intended for use with compatible feature extractors that implement an accelerated extract_batch() method. For example, kaldifeat extractors can be used this way (see, e.g., KaldifeatFbank or KaldifeatMfcc).
When a CUDA GPU is available and enabled for the feature extractor, this can be much faster than CutSet.compute_and_store_features(). Otherwise, the speed will be comparable to single-threaded extraction.
When storage_type is not provided, Lhotse uses the backend selected by LHOTSE_FEATURES_STORAGE_BACKEND and falls back to numpy_files. If the optional lilcom dependency is installed, prefer LilcomChunkyWriter for better storage efficiency. To inspect the currently usable choices, call lhotse.available_storage_backends(). For a full list that also marks unavailable backends with install hints, use lhotse.storage_backend_statuses() or run lhotse list-storage-backends.
Example: extract fbank features on one GPU, using 4 dataloading workers for reading audio, and store the arrays in an archive file with lilcom compression:
>>> from lhotse import CutSet, LilcomChunkyWriter
>>> from lhotse import KaldifeatFbank, KaldifeatFbankConfig
>>> extractor = KaldifeatFbank(KaldifeatFbankConfig(device='cuda'))
>>> cuts = CutSet(...)
>>> cuts = cuts.compute_and_store_features_batch(
...     extractor=extractor,
...     storage_path='feats',
...     batch_duration=500,
...     num_workers=4,
...     storage_type=LilcomChunkyWriter,
... )
- Parameters:
extractor (
FeatureExtractor) – A FeatureExtractor instance, which should implement an accelerated extract_batch method.
storage_path (
Union[Path,str]) – The path to the location where we will store the features. The exact type and layout of stored files will be dictated by the storage_type argument.
manifest_path (
Union[Path,str,None]) – Optional path where to write the CutSet manifest with attached feature manifests. If not specified, we will keep all manifests in memory.
batch_duration (
float) – The maximum number of audio seconds in a batch. Determines batch size dynamically.
num_workers (
int) – How many background dataloading workers should be used for reading the audio.
collate (
bool) – If True, the waveforms will be collated into a single padded tensor before being passed to the feature extractor. Some extractors can be faster this way (e.g., see lhotse.features.kaldi.extractors). If you are using kaldifeat extractors, you should set this to False.
augment_fn (
Optional[Callable[[ndarray,int],ndarray]]) – an optional callable used for audio augmentation. Be careful with the types of augmentations used: if they modify the start/end/duration times of the cut and its supervisions, you will end up with incorrect supervision information when using this API. E.g. for speed perturbation, use CutSet.perturb_speed() instead.
storage_type (
Optional[Type[TypeVar(FW, bound=FeaturesWriter)]]) – a FeaturesWriter subclass type. It determines how the features are stored to disk, e.g. separate file per array, HDF5 files with multiple arrays, etc. When omitted, Lhotse uses LHOTSE_FEATURES_STORAGE_BACKEND or numpy_files by default. If lilcom is installed, LilcomChunkyWriter remains the preferred choice for storage efficiency.
overwrite (
bool) – Whether to overwrite the manifest, HDF5 files, etc. By default, this method will append to these files if they exist.
- Return type:
CutSet
- Returns:
a new CutSet with Features manifests attached to the cuts.
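The batch_duration parameter caps the total audio seconds per batch, so the number of cuts per batch varies with their lengths. A minimal sketch of that idea, assuming a simple greedy grouping over (cut ID, duration) pairs; the helper name batch_by_duration is hypothetical, and Lhotse's own sampler may order and group cuts differently:

```python
from typing import Iterable, Iterator, List, Tuple


def batch_by_duration(
    items: Iterable[Tuple[str, float]], max_duration: float
) -> Iterator[List[str]]:
    """Greedily group (cut_id, seconds) pairs so each batch stays within max_duration."""
    batch: List[str] = []
    total = 0.0
    for cut_id, seconds in items:
        # Close the current batch if adding this cut would exceed the cap.
        if batch and total + seconds > max_duration:
            yield batch
            batch, total = [], 0.0
        batch.append(cut_id)
        total += seconds
    if batch:
        yield batch


durations = [("a", 300.0), ("b", 250.0), ("c", 100.0), ("d", 400.0)]
batches = list(batch_by_duration(durations, max_duration=600.0))
```

With a 600-second cap, the first two cuts (550 s) form one batch and the remaining two (500 s) form the next.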
- save_audios(storage_path, format='wav', encoding=None, num_jobs=None, executor=None, augment_fn=None, progress_bar=True, shuffle_on_split=True, **kwargs)[source]
Store waveforms of all cuts as audio recordings to disk.
- Parameters:
storage_path (
Union[Path,str]) – The path to the location where we will store the audio recordings. For each cut, a sub-directory will be created that starts with the first 3 characters of the cut’s ID. The audio recording is then stored in the sub-directory using the filename {cut.id}.{format}
format (
str) – Audio format argument supported by torchaudio.save or soundfile.write. Tested values are: wav, flac, and opus.
encoding (
Optional[str]) – Audio encoding argument supported by torchaudio.save or soundfile.write. Please refer to the documentation of the relevant library used in your audio backend.
num_jobs (
Optional[int]) – The number of parallel processes used to store the audio recordings. We will internally split the CutSet into this many chunks and process each chunk in parallel.
augment_fn (
Optional[Callable[[ndarray,int],ndarray]]) – an optional callable used for audio augmentation. Be careful with the types of augmentations used: if they modify the start/end/duration times of the cut and its supervisions, you will end up with incorrect supervision information when using this API. E.g. for speed perturbation, use CutSet.perturb_speed() instead.
executor (
Optional[Executor]) – when provided, will be used to parallelize the process. By default, we will instantiate a ProcessPoolExecutor. Learn more about the Executor API at https://lhotse.readthedocs.io/en/latest/parallelism.html
progress_bar (
bool) – Whether a progress bar should be displayed (automatically turned off for parallel computation).
shuffle_on_split (
bool) – Shuffle the CutSet before splitting it for the parallel workers. It is active only when num_jobs > 1. The default is True.
kwargs – Deprecated arguments go here and are ignored.
- Return type:
CutSet
- Returns:
a new CutSet.
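The storage layout described for storage_path can be pictured with a small path helper. This is only an illustration of the documented naming rule (sub-directory from the first 3 characters of the cut ID, file named {cut.id}.{format}); the function audio_path_for is hypothetical and not part of Lhotse:

```python
from pathlib import Path


def audio_path_for(storage_path: str, cut_id: str, fmt: str = "wav") -> Path:
    """Build <storage_path>/<first 3 chars of ID>/<cut_id>.<format>."""
    return Path(storage_path) / cut_id[:3] / f"{cut_id}.{fmt}"


path = audio_path_for("audio_out", "1089-134686-0001", "flac")
```

Spreading files over prefix-based sub-directories keeps any single directory from growing too large when a corpus has many cuts.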
- compute_global_feature_stats(storage_path=None, max_cuts=None, extractor=None)[source]
Compute the global means and standard deviations for each feature bin in the manifest. It follows the implementation in scikit-learn: https://github.com/scikit-learn/scikit-learn/blob/0fb307bf39bbdacd6ed713c00724f8f871d60370/sklearn/utils/extmath.py#L715 which follows the paper: “Algorithms for computing the sample variance: analysis and recommendations”, by Chan, Golub, and LeVeque.
- Parameters:
storage_path (
Union[Path,str,None]) – an optional path to a file where the stats will be stored with pickle.
max_cuts (
Optional[int]) – optionally, limit the number of cuts used for stats estimation. The cuts will be selected randomly in that case.
extractor (
Optional[FeatureExtractor]) – an optional FeatureExtractor; when provided, we ignore any pre-computed features.
- Returns:
a dict of {‘norm_means’: np.ndarray, ‘norm_stds’: np.ndarray} with the shape of the arrays equal to the number of feature bins in this manifest.
- Return type:
Dict[str,ndarray]
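The pairwise update from Chan, Golub, and LeVeque combines per-batch statistics without revisiting earlier data. The sketch below applies it to a single feature bin in pure Python; the function names are hypothetical, and the use of the population standard deviation (dividing by n) is an assumption of this example rather than a statement about Lhotse's implementation:

```python
import math
from typing import List, Sequence, Tuple

Stats = Tuple[int, float, float]  # (count, mean, sum of squared deviations M2)


def batch_stats(values: Sequence[float]) -> Stats:
    """Statistics of one batch, computed directly."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return n, mean, m2


def merge_stats(a: Stats, b: Stats) -> Stats:
    """Combine two partial results with the pairwise update formula."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2


def global_mean_std(batches: List[Sequence[float]]) -> Tuple[float, float]:
    total = batch_stats(batches[0])
    for batch in batches[1:]:
        total = merge_stats(total, batch_stats(batch))
    n, mean, m2 = total
    return mean, math.sqrt(m2 / n)  # population standard deviation (assumption)


mean, std = global_mean_std([[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]])
```

For multi-bin features the same update would be applied independently to each bin, which is why the returned arrays have one entry per feature bin.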
- copy_data(output_dir, verbose=True)[source]
Copies every data item referenced by this CutSet into a new directory. The structure is as follows:
output_dir
├── audio
|   ├── rec1.flac
|   └── …
├── custom
|   ├── field1
|   |   ├── arr1-1.npy
|   |   └── …
|   └── field2
|       ├── arr2-1.npy
|       └── …
├── features.lca or features/
└── cuts.jsonl.gz
- Parameters:
output_dir (
Union[Path,str]) – The root directory where we’ll store the copied data.
verbose (
bool) – Show progress bar, enabled by default.
- Return type:
CutSet
- Returns:
CutSet manifest pointing to the new data.
- copy_feats(writer, output_path=None)[source]
Save a copy of every feature matrix found in this CutSet using writer and return a new manifest with cuts referring to the new feature locations.
- Parameters:
writer (
FeaturesWriter) – a lhotse.features.io.FeaturesWriter instance.
output_path (
Union[Path,str,None]) – optional path where the new manifest should be stored. It’s used to write the manifest incrementally and return a lazy manifest, otherwise the copy is stored in memory.
- Return type:
CutSet
- Returns:
a copy of the manifest.
- modify_ids(transform_fn)[source]
Modify the IDs of cuts in this CutSet. Useful when combining multiple CutSets that were created from a single source, but contain features with different data augmentation techniques.
- Parameters:
transform_fn (
Callable[[str],str]) – A callable (function) that accepts a string (cut ID) and returns a new string (new cut ID).
- Return type:
CutSet
- Returns:
a new CutSet with cuts with modified IDs.
- fill_supervisions(add_empty=True, shrink_ok=False)[source]
Fills the whole duration of each cut in a CutSet with a supervision segment.
If the cut has one supervision, its start is set to 0 and duration is set to cut.duration. Note: this may either expand a supervision that was shorter than a cut, or shrink a supervision that exceeds the cut.
If there are no supervisions, we will add an empty one when add_empty==True, otherwise we won’t change anything.
If there are two or more supervisions, we will raise an exception.
- Parameters:
add_empty (
bool) – should we add an empty supervision with identical time bounds as the cut.
shrink_ok (
bool) – when False, raise an error if a supervision would be shrunk as a result of calling this method.
- Return type:
CutSet
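The rules above can be summarized as a small decision procedure. The sketch below restates them with a stand-in dataclass; Sup and fill_supervision are hypothetical names used for illustration only, and the exact condition Lhotse uses to detect shrinking is assumed here to be "the supervision extends past the cut bounds":

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Sup:  # hypothetical stand-in for a supervision segment
    start: float
    duration: float
    text: str = ""


def fill_supervision(
    cut_duration: float,
    sups: List[Sup],
    add_empty: bool = True,
    shrink_ok: bool = False,
) -> List[Sup]:
    if len(sups) > 1:
        raise ValueError("Cannot fill a cut with two or more supervisions.")
    if not sups:
        # No supervision: optionally add an empty one spanning the whole cut.
        return [Sup(start=0.0, duration=cut_duration)] if add_empty else []
    sup = sups[0]
    exceeds_cut = sup.start < 0 or sup.start + sup.duration > cut_duration
    if exceeds_cut and not shrink_ok:
        raise ValueError("Supervision would be shrunk; pass shrink_ok=True to allow it.")
    # Exactly one supervision: stretch (or shrink) it to the cut bounds.
    return [replace(sup, start=0.0, duration=cut_duration)]


expanded = fill_supervision(10.0, [Sup(start=2.0, duration=3.0, text="hi")])
empty = fill_supervision(10.0, [])
```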
- map_supervisions(transform_fn)[source]
Modify the SupervisionSegments by transform_fn in this CutSet.
- Parameters:
transform_fn (
Callable[[SupervisionSegment],SupervisionSegment]) – a function that accepts a supervision as an argument and returns the modified supervision.
- Return type:
CutSet
- Returns:
a new, modified CutSet.
- transform_text(transform_fn)[source]
Return a copy of this CutSet with all SupervisionSegments text transformed with transform_fn. Useful for text normalization, phonetic transcription, etc.
- Parameters:
transform_fn (
Callable[[str],str]) – a function that accepts a string and returns a string.
- Return type:
CutSet
- Returns:
a new, modified CutSet.
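As an illustration of what a transform_fn for text normalization might look like, here is a small str-to-str function. The normalization rules (uppercase, keep letters, digits and apostrophes) are an arbitrary choice made for this example:

```python
import re


def normalize_text(text: str) -> str:
    """Uppercase, drop punctuation except apostrophes, and collapse whitespace."""
    text = text.upper()
    text = re.sub(r"[^A-Z0-9' ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


normalized = normalize_text("Hello,   world -- it's 9 o'clock!")
```

Such a function could then be passed as cuts.transform_text(normalize_text). Defining it at module level, rather than as a lambda, also avoids pickling problems when the CutSet is later processed in worker processes.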
- prefetch(buffer_size=10)[source]
Pre-fetches the CutSet elements in a background process. Useful for enabling concurrent reading/processing/writing in ETL-style tasks.
- Return type:
CutSet
Caution
This method internally uses a PyTorch DataLoader with a single worker. It is not suitable for use in typical PyTorch training scripts.
Caution
If you run into pickling issues when using this method, you’re also likely using .filter/.map methods with a lambda function. Please call lhotse.set_dill_enabled(True) to resolve these issues, or convert lambdas to regular functions + functools.partial.
- to_huggingface_dataset()[source]
Converts a CutSet to a HuggingFace Dataset. Currently, only MonoCut with one recording source is supported. Other cut types will be supported in the future.
- Currently, two formats are supported:
If each cut has one supervision (e.g. LibriSpeech), each cut is represented as a single row (entry) in the HuggingFace dataset with all the supervision information stored along the cut information. The final HuggingFace dataset format is:
╔═══════════════╦════════════════════════╗
║ Feature       ║ Type                   ║
╠═══════════════╬════════════════════════╣
║ id            ║ Value(dtype='string')  ║
║ audio         ║ Audio()                ║
║ duration      ║ Value(dtype='float32') ║
║ num_channels  ║ Value(dtype='uint16')  ║
║ text          ║ Value(dtype='string')  ║
║ speaker       ║ Value(dtype='string')  ║
║ language      ║ Value(dtype='string')  ║
║ {x}_alignment ║ Sequence(Alignment)    ║
╚═══════════════╩════════════════════════╝
where x stands for the alignment type (commonly used: “word”, “phoneme”).
- Alignment is represented as:
╔═══════════════╦════════════════════════╗
║ Feature       ║ Type                   ║
╠═══════════════╬════════════════════════╣
║ symbol        ║ Value(dtype='string')  ║
║ start         ║ Value(dtype='float32') ║
║ end           ║ Value(dtype='float32') ║
╚═══════════════╩════════════════════════╝
If each cut has multiple supervisions (e.g. AMI), each cut is represented as a single row (entry) while all the supervisions are stored in a separate list of dictionaries under the ‘segments’ key. The final HuggingFace dataset format is:
╔═══════════════╦════════════════════════╗
║ Feature       ║ Type                   ║
╠═══════════════╬════════════════════════╣
║ id            ║ Value(dtype='string')  ║
║ audio         ║ Audio()                ║
║ duration      ║ Value(dtype='float32') ║
║ num_channels  ║ Value(dtype='uint16')  ║
║ segments      ║ Sequence(Segment)      ║
╚═══════════════╩════════════════════════╝
- where one Segment is represented as:
╔═══════════════╦════════════════════════╗
║ Feature       ║ Type                   ║
╠═══════════════╬════════════════════════╣
║ text          ║ Value(dtype='string')  ║
║ start         ║ Value(dtype='float32') ║
║ end           ║ Value(dtype='float32') ║
║ channel       ║ Value(dtype='string')  ║
║ speaker       ║ Value(dtype='string')  ║
║ language      ║ Value(dtype='string')  ║
║ {x}_alignment ║ Sequence(Alignment)    ║
╚═══════════════╩════════════════════════╝
- Returns:
A HuggingFace Dataset.
- static from_huggingface_dataset(*dataset_args, audio_key='audio', text_key='sentence', lang_key='language', gender_key='gender', **dataset_kwargs)[source]
Initializes a Lhotse CutSet from an existing HF dataset, or args/kwargs passed on to datasets.load_dataset().
Use the audio_key, text_key, lang_key and gender_key options to indicate which keys in dict examples returned from HF Dataset should be looked up for audio, transcript, language, and gender respectively. The remaining keys in HF dataset examples will be stored inside the cut.custom dictionary.
Example with existing HF dataset:
>>> import datasets
>>> from lhotse import CutSet
>>> dataset = datasets.load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="test")
>>> dataset = dataset.map(some_transform)
>>> cuts = CutSet.from_huggingface_dataset(dataset)
>>> for cut in cuts:
...     pass
Example providing HF dataset init args/kwargs:
>>> import datasets
>>> from lhotse import CutSet
>>> cuts = CutSet.from_huggingface_dataset("mozilla-foundation/common_voice_11_0", "hi", split="test")
>>> for cut in cuts:
...     pass
- filter(predicate)
Return a new manifest containing only the items that satisfy predicate. If the manifest is lazy, the filtering will also be applied lazily.
- Parameters:
predicate (
Callable[[TypeVar(T)],bool]) – a function that takes a cut as an argument and returns bool.
- Returns:
a filtered manifest.
- classmethod from_file(path)
- Return type:
Any
- classmethod from_json(path)
- Return type:
Any
- classmethod from_jsonl(path)
- Return type:
Any
- classmethod from_jsonl_lazy(path)
Read a JSONL manifest in a lazy manner, which opens the file but does not read it immediately. It is only suitable for sequential reads and iteration.
- Return type:
Any
Warning
Opening the manifest in this way might cause some methods that rely on random access to fail.
- classmethod from_yaml(path)
- Return type:
Any
- classmethod infinite_mux(*manifests, weights=None, seed=0, max_open_streams=None)
Merges multiple manifest iterables into a new iterable by lazily multiplexing them during iteration time. Unlike mux(), this method allows limiting the maximum number of open sub-iterators at any given time.
To enable this, it performs 2-stage sampling. First, it samples with replacement the set of iterators I to construct a subset I_sub of size max_open_streams. Then, for each iteration step, it samples an iterator i from I_sub, fetches the next item from it, and yields it. Once i becomes exhausted, it is replaced with a new iterator j sampled from I_sub.
Caution
Do not use this method with inputs that are infinitely iterable as they will silently break the multiplexing property by only using a subset of the input iterables.
Caution
This method is not recommended for multiplexing over a small number of iterations, as it may be much less accurate than mux() depending on the number of open streams, iterable sizes, and the random seed.
- Parameters:
manifests – iterables to be multiplexed. They can be either lazy or eager, but the resulting manifest will always be lazy.
weights (
Optional[List[Union[int,float]]]) – an optional weight for each iterable, affects the probability of it being sampled. The weights are uniform by default. If lengths are known, it makes sense to pass them here for uniform distribution of items in the expectation.
seed (
Union[int,Literal['trng','randomized']]) – the random seed, ensures deterministic order across multiple iterations.
max_open_streams (
Optional[int]) – the number of iterables that can be open simultaneously at any given time.
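A rough sketch of the 2-stage sampling idea follows. It is a simplified illustration, not Lhotse's implementation: the name infinite_mux_sketch is hypothetical, the second stage picks uniformly among the open streams, and an exhausted stream is replaced by re-sampling from all inputs, both of which are assumptions made for brevity:

```python
import random
from itertools import islice
from typing import Iterable, Iterator, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def infinite_mux_sketch(
    sources: Sequence[Iterable[T]],
    weights: Optional[Sequence[float]] = None,
    seed: int = 0,
    max_open_streams: Optional[int] = None,
) -> Iterator[T]:
    rng = random.Random(seed)
    w = list(weights) if weights is not None else [1.0] * len(sources)
    n_open = max_open_streams or len(sources)

    def open_stream() -> Iterator[T]:
        # Stage 1: sample (with replacement) which source to open.
        (src,) = rng.choices(sources, weights=w, k=1)
        return iter(src)

    streams: List[Iterator[T]] = [open_stream() for _ in range(n_open)]
    while True:
        # Stage 2: pick one of the currently open streams.
        idx = rng.randrange(len(streams))
        try:
            yield next(streams[idx])
        except StopIteration:
            # Exhausted: replace it with a freshly sampled stream.
            streams[idx] = open_stream()


inputs = [[1, 2, 3], ["a", "b"]]
items = list(islice(infinite_mux_sketch(inputs, seed=0, max_open_streams=2), 20))
```

Because the generator never terminates on its own, the caller bounds it (here with islice); the fixed seed makes the order reproducible across runs.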
- property is_lazy: bool
Indicates whether this manifest was opened in lazy (read-on-the-fly) mode or not.
- classmethod mux(*manifests, stop_early=False, weights=None, seed=0)
Merges multiple manifest iterables into a new iterable by lazily multiplexing them during iteration time. If one of the iterables is exhausted before the others, we will keep iterating until all iterables are exhausted. This behavior can be changed with the stop_early parameter.
- Parameters:
manifests – iterables to be multiplexed. They can be either lazy or eager, but the resulting manifest will always be lazy.
stop_early (
bool) – should we stop the iteration as soon as we exhaust one of the manifests.
weights (
Optional[List[Union[int,float]]]) – an optional weight for each iterable, affects the probability of it being sampled. The weights are uniform by default. If lengths are known, it makes sense to pass them here for uniform distribution of items in the expectation.
seed (
Union[int,Literal['trng','randomized']]) – the random seed, ensures deterministic order across multiple iterations.
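The multiplexing behaviour described above can be sketched in a few lines of plain Python. This is an illustrative approximation under the assumption that each step draws one source according to the weights; mux_sketch is a hypothetical name, not the library function:

```python
import random
from typing import Iterable, Iterator, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def mux_sketch(
    *sources: Iterable[T],
    weights: Optional[Sequence[float]] = None,
    stop_early: bool = False,
    seed: int = 0,
) -> Iterator[T]:
    rng = random.Random(seed)
    iters: List[Iterator[T]] = [iter(s) for s in sources]
    w = list(weights) if weights is not None else [1.0] * len(iters)
    active = list(range(len(iters)))  # indices of sources not yet exhausted
    while active:
        (idx,) = rng.choices(active, weights=[w[i] for i in active], k=1)
        try:
            yield next(iters[idx])
        except StopIteration:
            if stop_early:
                return
            active.remove(idx)


merged = list(mux_sketch([1, 2, 3], ["a", "b"], weights=[3, 2], seed=0))
```

Passing the lengths as weights, as suggested in the parameter description, makes each source likely to run out at roughly the same time.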
- classmethod open_writer(path, overwrite=True)
Open a sequential writer that allows storing the manifests one by one, without the necessity of storing the whole manifest set in-memory. Supports writing to JSONL format (.jsonl), with optional gzip compression (.jsonl.gz).
- Return type:
Union[SequentialJsonlWriter,InMemoryWriter]
Note
When path is None, we will return an InMemoryWriter instead, which has the same API but stores the manifests in memory. It is convenient when you want to make disk saving optional.
Example:
>>> from lhotse import RecordingSet
>>> recordings = [...]
>>> with RecordingSet.open_writer('recordings.jsonl.gz') as writer:
...     for recording in recordings:
...         writer.write(recording)
This writer can be useful for continuing to write files that were previously stopped – it will open the existing file and scan it for item IDs to skip writing them later. It can also be queried for existing IDs so that the user code may skip preparing the corresponding manifests.
Example:
>>> from pathlib import Path
>>> from lhotse import RecordingSet, Recording
>>> with RecordingSet.open_writer('recordings.jsonl.gz', overwrite=False) as writer:
...     for path in Path('.').rglob('*.wav'):
...         recording_id = path.stem
...         if writer.contains(recording_id):
...             # Item already written previously - skip processing.
...             continue
...         # Item doesn't exist yet - run extra work to prepare the manifest
...         # and store it.
...         recording = Recording.from_file(path, recording_id=recording_id)
...         writer.write(recording)
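To make the resume behaviour concrete, here is a standard-library sketch of a writer that scans an existing gzipped JSONL file for IDs and skips them on later writes. The class ResumableJsonlWriter is hypothetical and only mirrors the contains/write interface described above; it is not Lhotse's SequentialJsonlWriter:

```python
import gzip
import json
import os
import tempfile
from typing import Dict, Set


class ResumableJsonlWriter:
    """Hypothetical sketch: append-only JSONL writer that remembers written IDs."""

    def __init__(self, path: str, overwrite: bool = True) -> None:
        self.path = path
        self.ids: Set[str] = set()
        if not overwrite and os.path.exists(path):
            # Scan the existing file so already-written items can be skipped.
            with gzip.open(path, "rt") as f:
                self.ids = {json.loads(line)["id"] for line in f if line.strip()}
        self.mode = "wt" if overwrite else "at"

    def __enter__(self) -> "ResumableJsonlWriter":
        self.file = gzip.open(self.path, self.mode)
        return self

    def __exit__(self, *exc) -> None:
        self.file.close()

    def contains(self, item_id: str) -> bool:
        return item_id in self.ids

    def write(self, item: Dict) -> bool:
        if item["id"] in self.ids:
            return False  # already stored by an earlier run
        self.file.write(json.dumps(item) + "\n")
        self.ids.add(item["id"])
        return True


path = os.path.join(tempfile.mkdtemp(), "items.jsonl.gz")
with ResumableJsonlWriter(path) as writer:
    writer.write({"id": "rec-1"})
with ResumableJsonlWriter(path, overwrite=False) as writer:
    skipped = not writer.write({"id": "rec-1"})
    added = writer.write({"id": "rec-2"})
with gzip.open(path, "rt") as f:
    stored_ids = [json.loads(line)["id"] for line in f]
```

Appending in gzip mode produces a multi-member archive, which the gzip module reads back as one continuous stream.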
- repeat(times=None, preserve_id=False)
Return a new, lazily evaluated manifest that iterates over the original elements times number of times.
- Parameters:
times (
Optional[int]) – how many times to repeat (infinite by default).
preserve_id (
bool) – when True, we won’t update the element ID with repeat number.
- Returns:
a repeated manifest.
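A minimal sketch of lazy repetition with ID updates is shown below. Items are plain dicts here, and the "_repeat{n}" suffix is an assumption chosen for this example; the actual suffix Lhotse appends may differ:

```python
from itertools import count, islice
from typing import Dict, Iterable, Iterator, Optional


def repeat_sketch(
    items: Iterable[Dict], times: Optional[int] = None, preserve_id: bool = False
) -> Iterator[Dict]:
    items = list(items)  # the sketch keeps items in memory so it can re-iterate
    epochs = count() if times is None else range(times)
    for epoch in epochs:
        for item in items:
            if preserve_id:
                yield item
            else:
                # Suffix format is an assumption made for this example.
                yield {**item, "id": f"{item['id']}_repeat{epoch}"}


twice = [it["id"] for it in repeat_sketch([{"id": "a"}, {"id": "b"}], times=2)]
forever = list(islice(repeat_sketch([{"id": "a"}], preserve_id=True), 3))
```

Updating the ID on every pass keeps IDs unique when the repeated items end up in the same manifest or batch.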
- shuffle(rng=None, buffer_size=10000)
Shuffles the elements and returns a shuffled variant of self. If the manifest is opened lazily, performs shuffling on-the-fly with a fixed buffer size.
- Parameters:
rng (
Optional[Random]) – an optional instance of random.Random for precise control of randomness.
- Returns:
a shuffled copy of self, or a manifest that is shuffled lazily.
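For lazily opened manifests, shuffling with a fixed buffer can be pictured as follows. This is a generic streaming-shuffle sketch under the assumption that a full buffer emits one random element per incoming item; buffered_shuffle is a hypothetical helper, not the library code:

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def buffered_shuffle(
    items: Iterable[T], buffer_size: int = 10000, rng: Optional[random.Random] = None
) -> Iterator[T]:
    rng = rng or random.Random()
    buffer: List[T] = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new item in its place.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: flush whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(100), buffer_size=10, rng=random.Random(0)))
```

The trade-off is locality: an item can move arbitrarily late but never more than buffer_size positions earlier, so a small buffer gives only a partial shuffle.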
- to_eager()
Evaluates all lazy operations on this manifest, if any, and returns a copy that keeps all items in memory. If the manifest was “eager” already, this is a no-op and won’t copy anything.
- to_file(path)
- Return type:
None
- to_json(path)
- Return type:
None
- to_jsonl(path)
- Return type:
None
- to_yaml(path)
- Return type:
None
- class lhotse.cut.MixedCut(id, tracks, transforms=None)[source]
MixedCut is a Cut that actually consists of multiple other cuts. Its primary purpose is to allow time-domain and feature-domain augmentation via mixing the training cuts with noise, music, and babble cuts. The actual mixing operations are performed on-the-fly.
Internally, MixedCut holds other cuts in multiple tracks (MixTrack), each with its own offset and SNR that is relative to the first track.
Please refer to the documentation of Cut to learn more about using cuts.
In addition to methods available in Cut, MixedCut provides the methods to read all of its tracks’ audio and features as separate channels:
>>> cut = MixedCut(...)
>>> mono_features = cut.load_features()
>>> assert len(mono_features.shape) == 2
>>> multi_features = cut.load_features(mixed=False)
>>> # Now, the first dimension is the channel.
>>> assert len(multi_features.shape) == 3
Note
MixedCut is different from MultiCut, which is intended to represent multi-channel recordings that share the same supervisions.
Note
Each track in a MixedCut can be either a MonoCut, MultiCut, or PaddingCut.
Note
The transforms field is a list of dictionaries that describe the transformations that should be applied to the track after mixing.
-
id:
str
-
transforms:
Optional[List[AudioTransform]] = None
- property supervisions: List[SupervisionSegment]
Lists the supervisions of the underlying source cuts. Each segment start time will be adjusted by the track offset.
- property start: float
- property duration: float
- property channel: int | List[int]
- property has_features: bool
- property has_recording: bool
- property has_video: bool
- property is_in_memory: bool
- property num_frames: int | None
- property frame_shift: float | None
- property sampling_rate: int | None
- property num_samples: int | None
- property num_features: int | None
- property num_channels: int | None
- property features_type: str | None
- unmix(tag=None)[source]
Split this mixed cut into time-aligned constituent cuts.
When tag is None, this returns one cut per non-padding audible track. Each returned cut preserves the original offsets and overall duration, so the loaded audio/features can be summed to reconstruct the original mix.
When tag is provided, this returns exactly two cuts in order: [without_tag, with_tag]. Tracks are grouped by whether their MixTrack.tag matches tag. For exact SNR preservation, the grouped outputs may carry an internal muted SNR-reference track that is ignored by the public track views but retained for mixing math.
- Parameters:
tag (
Optional[str]) – Optional track-group label to split on.
- Return type:
List[Cut]
- Returns:
A list of one cut per track, or two grouped cuts when tag is provided.
- iter_data()[source]
Iterate over each data piece attached to this cut. Returns a generator yielding tuples of (key, manifest), where key is the name of the attribute under which manifest is found. manifest is of type Recording, Features, TemporalArray, Array, or Image.
For example, if key is recording, then manifest is self.recording.
- load_custom(name)[source]
Load custom data as a numpy array. The custom data is expected to have been stored in the cut’s custom field as an Array or TemporalArray manifest.
Note
It works with Array manifests stored via attribute assignments, e.g.: cut.my_custom_data = Array(...).
Note
For MixedCut with Recording-type custom attributes, this supports multiple non-overlapping tracks (e.g. from Cut.append()). The audio from each track’s custom Recording is loaded and placed at the correct offset in the output buffer, similar to how load_audio() works for the main recording.
Warning
For Array (non-temporal) and TemporalArray custom attributes, this will only work if the mixed cut has a single non-padding track with that attribute.
- Parameters:
name (
str) – name of the custom attribute.
- Return type:
ndarray
- Returns:
a numpy array with the data (after padding).
- move_to_memory(audio_format='flac', load_audio=True, load_features=True, load_custom=True)[source]
Load data (audio, features, or custom arrays) into memory and attach them to a copy of the manifest. This is useful when you want to store cuts together with the actual data in some binary format that enables sequential data reads.
Audio is encoded with audio_format (compatible with torchaudio.save), floating point features are encoded with lilcom, and other arrays are pickled.
- Return type:
- to_mono(encoding='flac', **kwargs)[source]
Convert this MixedCut to a MonoCut by mixing all tracks and channels into a single one. The resulting audio array is stored in memory, and can be saved to disk by calling cut.save_audio(path, ...) on the result.
Hint
The resulting MonoCut will have its custom field populated with the custom value from the first track of the MixedCut.
- Parameters:
encoding (
str) – any of “wav”, “flac”, or “opus”.
- Return type:
- Returns:
a new MonoCut instance.
- truncate(*, offset=0.0, duration=None, keep_excessive_supervisions=True, preserve_id=False, _supervisions_index=None)[source]
Returns a new MixedCut that is a sub-region of the current MixedCut. This method truncates the underlying Cuts and modifies their offsets in the mix, as needed. Tracks that do not fit in the truncated cut are removed.
Note that no operation is done on the actual features - it’s only during the call to load_features() when the actual changes happen (a subset of features is loaded).
- Parameters:
offset (
float) – float (seconds), controls the start of the new cut relative to the current MixedCut’s start.
duration (
Optional[float]) – optional float (seconds), controls the duration of the resulting MixedCut. By default, the duration is (end of the cut before truncation) - (offset).
keep_excessive_supervisions (
bool) – bool. Since trimming may happen inside a SupervisionSegment, the caller has an option to either keep or discard such supervisions.
preserve_id (
bool) – bool. Should the truncated cut keep the same ID or get a new, random one.
- Return type:
- Returns:
a new MixedCut instance.
- extend_by(*, duration, direction='both', preserve_id=False, pad_silence=True)[source]
This raises a ValueError since extending a MixedCut is not defined.
- Parameters:
duration (
float) – float (seconds), duration (in seconds) to extend the MixedCut.
direction (
str) – string, ‘left’, ‘right’ or ‘both’. Determines whether to extend on the left, right, or both sides. If ‘both’, extend on both sides by the duration specified in duration.
preserve_id (
bool) – bool. Should the extended cut keep the same ID or get a new, random one.
pad_silence (
bool) – bool. See usage in lhotse.cut.MonoCut.extend_by.
- Return type:
- Returns:
a new MixedCut instance.
- pad(duration=None, num_frames=None, num_samples=None, pad_feat_value=-23.025850929940457, direction='right', preserve_id=False, pad_value_dict=None)[source]
Return a new MixedCut, padded with zeros in the recording, and pad_feat_value in each feature bin.
The user can choose to pad either to a specific duration; a specific number of frames num_frames; or a specific number of samples num_samples. The three arguments are mutually exclusive.
- Parameters:
duration (
Optional[float]) – The cut’s minimal duration after padding.
num_frames (
Optional[int]) – The cut’s total number of frames after padding.
num_samples (
Optional[int]) – The cut’s total number of samples after padding.
pad_feat_value (
float) – A float value that’s used for padding the features. By default we assume a log-energy floor of approx. -23 (1e-10 after exp).
direction (
str) – string, ‘left’, ‘right’ or ‘both’. Determines whether the padding is added before or after the cut.
preserve_id (
bool) – When True, preserves the cut ID from before padding. Otherwise, generates a new random ID (default).
pad_value_dict (
Optional[Dict[str,Union[int,float]]]) – Optional dict that specifies what value should be used for padding arrays in custom attributes.
- Return type:
- Returns:
a padded MixedCut if duration is greater than this cut’s duration, otherwise
self.
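The padding rules above can be sketched in plain Python. This illustrates the arithmetic only and is not lhotse's implementation: the helper name `padded_duration` and the `frame_shift` and `sampling_rate` defaults are assumptions made for the example.

```python
import math

# Default feature padding value: the log of a 1e-10 energy floor.
PAD_FEAT_VALUE = math.log(1e-10)


def padded_duration(cut_duration, *, duration=None, num_frames=None,
                    num_samples=None, frame_shift=0.01, sampling_rate=16000):
    """Return the duration a cut would have after padding (hypothetical helper)."""
    given = [a for a in (duration, num_frames, num_samples) if a is not None]
    if len(given) != 1:
        # The three targets are mutually exclusive.
        raise ValueError("Specify exactly one of duration, num_frames, num_samples.")
    if duration is not None:
        target = duration
    elif num_frames is not None:
        target = num_frames * frame_shift
    else:
        target = num_samples / sampling_rate
    # Padding never shortens a cut: a target below the current duration is a no-op.
    return max(cut_duration, target)


print(padded_duration(3.0, duration=5.0))       # padded up to 5 seconds
print(padded_duration(3.0, num_samples=80000))  # 80000 samples at 16 kHz
print(padded_duration(3.0, duration=2.0))       # already longer: unchanged
```

The last call mirrors the documented return value: when the requested duration does not exceed the cut's duration, the cut is returned as-is.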
- resample(sampling_rate, affix_id=False, recording_field=None)[source]
Return a new
MixedCutthat will lazily resample the audio while reading it. This operation will drop the feature manifest, if attached. It does not affect the supervision.- Parameters:
sampling_rate (
int) – The new sampling rate.affix_id (
bool) – Should we modify the ID (useful if both versions of the same cut are going to be present in a single manifest).recording_field (
Optional[str]) – which recording field to resample.
- Return type:
- Returns:
a modified copy of the current
MixedCut.
- compress(codec='opus', compression_level=0.99, compress_custom_fields=False)[source]
Return a copy of this Cut that has Recordings in its sub-Cuts processed by a lossy encoding.
- Parameters:
codec (
Literal['opus','mp3','vorbis','gsm']) – The codec to use for compression. Supported codecs are “opus”, “mp3”, “vorbis”, “gsm”.compression_level (
float) – The level of compression (from 0.0 to 1.0, higher values correspond to higher compression).compress_custom_fields (
bool) – Whether to also compress any custom recording fields in sub-Cuts.
- Returns:
A modified
MixedCutcontaining audio processed by a codec
- perturb_speed(factor, affix_id=True)[source]
Return a new MixedCut that will lazily perturb the speed while loading audio. The num_samples, start and duration fields of the underlying Cuts (and their Recordings and SupervisionSegments) are updated to reflect the shrinking/extending effect of speed. We are also updating the offsets of all underlying tracks.
- Parameters:
factor (
float) – The speed will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).affix_id (
bool) – When true, we will modify theMixedCut.idfield by affixing it with “_sp{factor}”.
- Return type:
- Returns:
a modified copy of the current
MixedCut.
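As a rough illustration of how speed perturbation changes the manifest fields, the sketch below divides the sample count by the factor and derives the new duration and ID. The function name is hypothetical, and the exact rounding used by lhotse may differ from Python's `round`.

```python
def speed_perturbed_fields(cut_id, duration, sampling_rate, factor, affix_id=True):
    """Approximate the fields after speed perturbation (illustrative only)."""
    num_samples = round(duration * sampling_rate)
    # A factor above 1.0 plays the audio faster, so the cut gets shorter.
    new_num_samples = round(num_samples / factor)
    new_duration = new_num_samples / sampling_rate
    # The ID is affixed with "_sp{factor}" when affix_id is true.
    new_id = f"{cut_id}_sp{factor}" if affix_id else cut_id
    return new_id, new_num_samples, new_duration


new_id, new_num_samples, new_duration = speed_perturbed_fields(
    "cut-1", duration=10.0, sampling_rate=16000, factor=1.1
)
print(new_id, new_num_samples, new_duration)
```

Offsets of the tracks in a mix would be scaled the same way, which is why the documentation notes that they are updated as well.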
- perturb_tempo(factor, affix_id=True)[source]
Return a new MixedCut that will lazily perturb the tempo while loading audio.
Compared to speed perturbation, tempo preserves pitch. The num_samples, start and duration fields of the underlying Cuts (and their Recordings and SupervisionSegments) are updated to reflect the shrinking/extending effect of tempo. We are also updating the offsets of all underlying tracks.
- Parameters:
factor (
float) – The tempo will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).affix_id (
bool) – When true, we will modify theMixedCut.idfield by affixing it with “_tp{factor}”.
- Return type:
- Returns:
a modified copy of the current
MixedCut.
- perturb_volume(factor, affix_id=True)[source]
Return a new
MixedCutthat will lazily perturb the volume while loading audio. Recordings of the underlying Cuts are updated to reflect volume change.- Parameters:
factor (
float) – The volume will be adjusted this many times (e.g. factor=1.1 means 1.1x louder).affix_id (
bool) – When true, we will modify theMixedCut.idfield by affixing it with “_vp{factor}”.
- Return type:
- Returns:
a modified copy of the current
MixedCut.
- clip_amplitude(hard=False, gain_db=0.0, normalize=True, oversampling=2, affix_id=True)[source]
Return a new
MixedCutthat will lazily apply clipping while loading audio. Recordings of the underlying Cuts are updated to reflect clipping change.- Parameters:
hard (
bool) – If True, apply hard clipping (sharp cutoff); otherwise, apply soft clipping (saturation).gain_db (
float) – The amount of gain in decibels to apply before clipping.normalize (
bool) – If True, normalize the input signal to 0 dBFS before applying clipping.oversampling (
Optional[int]) – If provided, we will oversample the input signal by the given integer factor before applying saturation and then downsample back to the original sampling rate.affix_id (
bool) – When true, we will modify theMixedCut.idfield by affixing it with “_cl{gain_db}”.
- Return type:
- Returns:
a modified copy of the current
MixedCut.
- normalize_loudness(target, mix_first=True, affix_id=False)[source]
Return a new
MixedCutthat will lazily apply loudness normalization.- Parameters:
target (
float) – The target loudness in dBFS.mix_first (
bool) – If true, we will mix the underlying cuts before applying loudness normalization. If false, we cannot guarantee that the resulting cut will have the target loudness.affix_id (
bool) – When true, we will modify theDataCut.idfield by affixing it with “_ln{target}”.
- Return type:
DataCut- Returns:
a modified copy of the current
DataCut.
- reverb_rir(rir_recording=None, normalize_output=True, early_only=False, affix_id=True, rir_channels=[0], room_rng_seed=None, source_rng_seed=None, mix_first=True)[source]
Return a new MixedCut that will convolve the audio with the provided impulse response. If no rir_recording is provided, we will generate an impulse response using a fast random generator (https://arxiv.org/abs/2208.04101).
- Parameters:
rir_recording (
Optional[Recording]) – The impulse response to use for convolving.normalize_output (
bool) – When true, output will be normalized to have the same energy as the input.early_only (
bool) – When true, only the early reflections (first 50 ms) will be used.affix_id (
bool) – When true, we will modify theMixedCut.idfield by affixing it with “_rvb”.rir_channels (
List[int]) – The channels of the impulse response to use. By default, first channel is used. If only one channel is specified, all tracks will be convolved with this channel. If a list is provided, it must contain as many channels as there are tracks such that each track will be convolved with one of the specified channels.room_rng_seed (
Optional[int]) – Seed for the room configuration.source_rng_seed (
Optional[int]) – Seed for the source position.mix_first (
bool) – When true, the mixing will be done first before convolving with the RIR. This effectively means that all tracks will be convolved with the same RIR. If you are simulating multi-speaker mixtures, you should set this to False.
- Return type:
- Returns:
a modified copy of the current
MixedCut.
- load_features(mixed=True)[source]
Loads the features of the source cuts and mixes them on-the-fly.
- Parameters:
mixed (
bool) – when True (default), the features are mixed together (as defined in the mixing function for the extractor). This could result in either a 2D or 3D array. For example, if all underlying tracks are single-channel, the output will be a 2D array of shape (num_frames, num_features). If any of the tracks are multi-channel, the output may be a 3D array of shape (num_frames, num_features, num_channels).- Return type:
Optional[ndarray]- Returns:
A numpy ndarray with features and with shape
(num_frames, num_features), or(num_tracks, num_frames, num_features)
- load_audio(mixed=True, mono_downmix=False)[source]
Loads the audio of the source cuts and mixes it on-the-fly.
- Parameters:
mixed (
bool) – When True (default), returns a mix of the underlying tracks. This will return a numpy array with shape (num_channels, num_samples), where num_channels is determined by the num_channels property of the MixedCut. Otherwise returns a numpy array with the number of channels equal to the total number of channels across all tracks in the MixedCut. For example, if it contains a MultiCut with 2 channels and a MonoCut with 1 channel, the returned array will have shape (3, num_samples).mono_downmix (
bool) – If the MixedCut contains more than one channel (e.g. when one of its tracks is a MultiCut), this parameter controls whether the returned array will be down-mixed to a single channel. This down-mixing is done by summing the channels together.
- Return type:
Optional[ndarray]- Returns:
A numpy ndarray with audio samples and with shape
(num_channels, num_samples)
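The shape rules for the mixed and mono_downmix flags can be summarised with a small stdlib-only sketch. Both helpers are hypothetical and only predict shapes or sum plain lists; treating mono_downmix as relevant only when mixed is True is an assumption of this sketch.

```python
def loaded_audio_shape(track_channels, mix_num_channels, num_samples,
                       mixed=True, mono_downmix=False):
    """Predict the shape of the returned array (hypothetical helper).

    track_channels: number of channels of each track, e.g. [2, 1].
    mix_num_channels: value of the MixedCut's num_channels property.
    """
    if not mixed:
        # Unmixed: the channels of all tracks are stacked.
        return (sum(track_channels), num_samples)
    if mono_downmix:
        # Assumption: down-mixing applies to the mixed signal only.
        return (1, num_samples)
    return (mix_num_channels, num_samples)


def downmix_to_mono(channels):
    """Sum equal-length channels sample by sample into a single channel."""
    return [sum(samples) for samples in zip(*channels)]


# A MultiCut with 2 channels mixed with a MonoCut with 1 channel:
print(loaded_audio_shape([2, 1], 2, 16000))               # mixed
print(loaded_audio_shape([2, 1], 2, 16000, mixed=False))  # 3 stacked channels
print(downmix_to_mono([[1, 2], [3, 4]]))                  # channel-wise sum
```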
- load_video(with_audio=True, mixed=True, mono_downmix=False)[source]
- Return type:
Optional[Tuple[Tensor,Optional[Tensor]]]
- plot_tracks_features()[source]
Display the feature matrix as an image. Requires matplotlib to be installed.
- plot_tracks_audio()[source]
Display plots of the individual tracks’ waveforms. Requires matplotlib to be installed.
- drop_features()[source]
Return a copy of the current
MixedCut, detached fromfeatures.- Return type:
- drop_recording()[source]
Return a copy of the current
MixedCut, detached fromrecording.- Return type:
- drop_supervisions()[source]
Return a copy of the current
MixedCut, detached fromsupervisions.- Return type:
- drop_alignments()[source]
Return a copy of the current MixedCut, detached from alignments.
- Return type:
- drop_in_memory_data()[source]
Return a copy of the current
MixedCut, which doesn’t contain any in-memory data.- Return type:
- compute_and_store_features(extractor, storage, augment_fn=None, mix_eagerly=True)[source]
Compute the features from this cut, store them on disk, and create a new MonoCut object with the feature manifest attached. This cut has to be able to load audio.
- Parameters:
extractor (
FeatureExtractor) – aFeatureExtractorinstance used to compute the features.storage (
FeaturesWriter) – aFeaturesWriterinstance used to store the features. When the optionallilcomdependency is installed and on-disk size matters,LilcomChunkyWriteris the preferred backend.augment_fn (
Optional[Callable[[ndarray,int],ndarray]]) – an optional callable used for audio augmentation.mix_eagerly (
bool) – when False, extract and store the features for each track separately, and mix them dynamically when loading the features. When True, mix the audio first and store the mixed features, returning a new MonoCut instance with the same ID. The returned MonoCut will not have a Recording attached.
- Return type:
DataCut- Returns:
a new MonoCut instance if mix_eagerly is True; otherwise self, with each of the tracks containing the Features manifests.
- fill_supervision(add_empty=True, shrink_ok=False)[source]
Fills the whole duration of a cut with a supervision segment.
If the cut has one supervision, its start is set to 0 and duration is set to cut.duration. Note: this may either expand a supervision that was shorter than a cut, or shrink a supervision that exceeds the cut.
If there are no supervisions, we will add an empty one when add_empty==True, otherwise we won’t change anything.
If there are two or more supervisions, we will raise an exception.
Note
For
MixedCut, we expect that only one track contains a supervision. That supervision will be expanded to cover the full MixedCut’s duration.- Parameters:
add_empty (
bool) – should we add an empty supervision with identical time bounds as the cut.shrink_ok (
bool) – when False (default), raise an error if a supervision would be shrunk as a result of calling this method.
- Return type:
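The decision rules of fill_supervision can be sketched on plain (start, duration) tuples. This is an illustration of the documented behaviour, not lhotse's code; the function name and the tuple representation are assumptions.

```python
def fill_supervision_sketch(cut_duration, supervisions, add_empty=True, shrink_ok=False):
    """Apply the documented rules to a list of (start, duration) tuples."""
    if len(supervisions) > 1:
        raise ValueError("Cannot fill: the cut has two or more supervisions.")
    if not supervisions:
        # No supervision: optionally add an empty one spanning the cut.
        return [(0.0, cut_duration)] if add_empty else []
    start, duration = supervisions[0]
    # The supervision would shrink if it sticks out of the cut on either side.
    would_shrink = start < 0 or start + duration > cut_duration
    if would_shrink and not shrink_ok:
        raise ValueError("Supervision exceeds the cut; pass shrink_ok=True to allow it.")
    return [(0.0, cut_duration)]


print(fill_supervision_sketch(10.0, [(2.0, 3.0)]))       # expanded to the full cut
print(fill_supervision_sketch(10.0, []))                 # empty supervision added
print(fill_supervision_sketch(10.0, [], add_empty=False))
```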
- map_supervisions(transform_fn)[source]
Modify the SupervisionSegments of this MixedCut using transform_fn.
- Parameters:
transform_fn (
Callable[[SupervisionSegment],SupervisionSegment]) – a function that takes a supervision as an argument and returns the modified supervision.
- Return type:
- Returns:
a modified MixedCut.
- merge_supervisions(merge_policy='delimiter', custom_merge_fn=None)[source]
Return a copy of the cut that has all of its supervisions merged into a single segment.
The new start is the start of the earliest supervision, and the new duration is a minimum spanning duration for all the supervisions. The text fields are concatenated with a whitespace.
Note
If you’re using individual tracks of a mixed cut, note that this transform drops all the supervisions in individual tracks and assigns the merged supervision in the first
DataCutfound inself.tracks.- Parameters:
merge_policy (
str) – one of “keep_first” or “delimiter”. If “keep_first”, we keep only the first segment’s field value, otherwise all string fields (including IDs) are prefixed with “cat#” and concatenated with a hash symbol “#”. This is also applied tocustomfields. Fields with aNonevalue are omitted.custom_merge_fn (
Optional[Callable[[str,Iterable[Any]],Any]]) – a function that will be called to merge custom fields values. We expectcustom_merge_fnto handle all possible custom keys. When not provided, we will treat all custom values as strings. It will be called roughly like:custom_merge_fn(custom_key, [s.custom[custom_key] for s in sups])
- Return type:
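The merge policies and the spanning duration described above can be illustrated as follows. Both helpers are hypothetical and operate on plain Python values; the text field, which is joined with whitespace, is left out of the sketch.

```python
def merge_field(values, merge_policy="delimiter"):
    """Merge one string field across supervisions (illustrative only)."""
    values = [v for v in values if v is not None]  # None values are omitted
    if not values:
        return None
    if merge_policy == "keep_first":
        return values[0]
    # "delimiter": prefix with "cat#" and join with a hash symbol.
    return "cat#" + "#".join(values)


def merged_span(segments):
    """Return (start, duration) of the minimal span covering (start, duration) segments."""
    start = min(s for s, _ in segments)
    end = max(s + d for s, d in segments)
    return start, end - start


print(merge_field(["sup1", "sup2"]))
print(merge_field(["spk-a", None, "spk-b"], merge_policy="keep_first"))
print(merged_span([(1.0, 2.0), (2.5, 4.0)]))
```

A custom_merge_fn would replace `merge_field` for custom keys, receiving the key and the list of values to merge.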
- filter_supervisions(predicate)[source]
Modify cut to store only supervisions accepted by predicate
- Example:
>>> cut = cut.filter_supervisions(lambda s: s.id in supervision_ids)
>>> cut = cut.filter_supervisions(lambda s: s.duration < 5.0)
>>> cut = cut.filter_supervisions(lambda s: s.text is not None)
- Parameters:
predicate (
Callable[[SupervisionSegment],bool]) – A callable that accepts SupervisionSegment and returns bool- Return type:
- Returns:
a modified MixedCut
- property first_non_padding_cut: DataCut
- __init__(id, tracks, transforms=None)
- append(other, snr=None, preserve_id=None)
Append the other Cut after the current Cut. Conceptually the same as mix but with an offset matching the current cut’s length. Optionally scale down (positive SNR) or scale up (negative SNR) the other cut. Returns a MixedCut, which only keeps the information about the mix; actual mixing is performed during the call to load_features.
- Parameters:
preserve_id (
Optional[str]) – optional string (“left”, “right”). When specified, append will preserve the cut ID of the left- or right-hand side argument. Otherwise, a new random ID is generated.- Return type:
- compute_features(extractor, augment_fn=None)
Compute the features from this cut. This cut has to be able to load audio.
- Parameters:
extractor (
FeatureExtractor) – aFeatureExtractorinstance used to compute the features.augment_fn (
Optional[Callable[[ndarray,int],ndarray]]) – optional WavAugmenter instance for audio augmentation.
- Return type:
ndarray- Returns:
a numpy ndarray with the computed features.
- copy(**replace_attrs)
Returns a shallow copy of self, with specified attributes overwritten.
- Example:
>>> cut = MonoCut(id="old-id", ...)
>>> cut2 = cut.copy(id="new-id")
>>> assert cut.id == "old-id"
>>> assert cut2.id == "new-id"
- cut_into_windows(duration, hop=None, keep_excessive_supervisions=True)
Return a list of shorter cuts, made by traversing this cut in windows of duration seconds by hop seconds.
The last window might have a shorter duration if there was not enough audio, so you might want to either filter or pad the results.
- Parameters:
duration (
float) – Desired duration of the new cuts in seconds.hop (
Optional[float]) – Shift between the windows in the new cuts in seconds.keep_excessive_supervisions (
bool) – bool. When a cut is truncated in the middle of a supervision segment, should the supervision be kept.
- Return type:
CutSet
- Returns:
a list of cuts made from shorter duration windows.
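A minimal sketch of the windowing arithmetic, assuming that hop defaults to duration when it is None and that traversal stops as soon as a window reaches the end of the cut. The helper name is hypothetical and the code works on plain numbers, not on Cut objects.

```python
def window_bounds(total, duration, hop=None):
    """Return (start, end) pairs for windows over [0, total] (illustrative only)."""
    hop = duration if hop is None else hop  # assumption: non-overlapping by default
    if duration <= 0 or hop <= 0:
        raise ValueError("duration and hop must be positive")
    bounds = []
    start = 0.0
    while start < total:
        end = min(start + duration, total)  # the last window may be shorter
        bounds.append((start, end))
        if end >= total:
            break  # this window already reached the end of the cut
        start += hop
    return bounds


print(window_bounds(10.0, 4.0))           # last window is only 2 seconds long
print(window_bounds(10.0, 4.0, hop=2.0))  # overlapping windows
```

In the first call the final window covers only [8, 10], which is the case where filtering or padding the results becomes relevant.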
- cut_into_windows_balanced(min_duration, max_duration, overlap=0.0, keep_excessive_supervisions=True)
Return a list of shorter cuts made by splitting this cut into overlapping windows whose size is chosen within [min_duration, max_duration] to maximise the duration of the final (potentially shorter) window, thereby minimising padding.
Each resulting sub-cut carries two extra entries in its custom dict:
- "source_cut_id" – the id of this (parent) cut.
- "source_cut_start" – the start time of this cut within its recording. Downstream code can use this to detect whether the parent was the first window of a recording (source_cut_start == 0) or a later continuation.
- Parameters:
min_duration (
float) – Minimum desired window duration in seconds.max_duration (
float) – Maximum desired window duration in seconds.overlap (
float) – Overlap between consecutive windows in seconds (default: 0).keep_excessive_supervisions (
bool) – When a window is truncated mid-supervision, should the supervision be kept.
- Return type:
CutSet
- Returns:
a
CutSetof overlapping sub-cuts.
- property end: float
- property has_overlapping_supervisions: bool
- index_supervisions(index_mixed_tracks=False, keep_ids=None)
Create a two-level index of supervision segments. It is a mapping from a Cut’s ID to an interval tree that contains the supervisions of that Cut.
The interval tree can be efficiently queried for overlapping and/or enveloping segments. It helps speed up some operations on Cuts of very long recordings (1h+) that contain many supervisions.
- Parameters:
index_mixed_tracks (
bool) – Should the tracks of MixedCut’s be indexed as additional, separate entries.keep_ids (
Optional[Set[str]]) – If specified, we will only index the supervisions with the specified IDs.
- Return type:
Dict[str,IntervalTree]- Returns:
a mapping from Cut ID to an interval tree of SupervisionSegments.
- mix(other, offset_other_by=0.0, allow_padding=False, snr=None, preserve_id=None, tag=None)
Refer to the documentation of lhotse.cut.mix().
- Return type:
- play_audio()
Display a Jupyter widget that lets you listen to the waveform. Works only in Jupyter notebook/lab or similar (e.g. Colab).
- plot_alignment(alignment_type='word')
Display the alignment on top of a spectrogram. Requires matplotlib to be installed.
- plot_audio(ax=None, **kwargs)
Display a plot of the waveform. Requires matplotlib to be installed.
- plot_features()
Display the feature matrix as an image. Requires matplotlib to be installed.
- save_audio(storage_path, format=None, encoding=None, augment_fn=None, **kwargs)
Store this cut’s waveform as audio recording to disk.
- Parameters:
storage_path (
Union[Path,str]) – The path to location where we will store the audio recordings.format (
Optional[str]) – Audio format argument supported bytorchaudio.saveorsoundfile.write. Please refer to the relevant library’s documentation depending on which audio backend you’re using.encoding (
Optional[str]) – Audio encoding argument supported bytorchaudio.saveorsoundfile.write. Please refer to the relevant library’s documentation depending on which audio backend you’re using.augment_fn (
Optional[Callable[[ndarray,int],ndarray]]) – an optional callable used for audio augmentation. Be careful with the types of augmentations used: if they modify the start/end/duration times of the cut and its supervisions, you will end up with incorrect supervision information when using this API. E.g. for speed perturbation, useCutSet.perturb_speed()instead.kwargs – additional arguments passed to
Cut.load_audio(). Example, if saving a MixedCut, we can specify mono_downmix=True to downmix the tracks to mono before saving.
- Return type:
- Returns:
a new Cut instance.
- speakers_audio_mask(min_speaker_dim=None, speaker_to_idx_map=None, use_alignment_if_exists=None)
Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_samples), and its values are 0 for nonspeech samples and 1 for speech samples for each respective speaker.
This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272
- Parameters:
min_speaker_dim (
Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).speaker_to_idx_map (
Optional[Dict[str,int]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2).use_alignment_if_exists (
Optional[str]) – optional str, key for alignment type to use for generating the mask. If no such alignment exists, fall back on supervision time spans.
- Return type:
ndarray
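The mask layout can be illustrated with nested lists instead of a numpy array. The function below is a hypothetical sketch: it takes (speaker, start, end) tuples, and indexing speakers in sorted name order when no map is given is an assumption of the example.

```python
def speaker_activity_mask(supervisions, num_samples, sampling_rate,
                          min_speaker_dim=None, speaker_to_idx_map=None):
    """Build a (num_speakers, num_samples) 0/1 mask from (speaker, start, end) tuples."""
    if speaker_to_idx_map is None:
        # Assumption for this sketch: speakers are indexed in sorted name order.
        speakers = sorted({spk for spk, _, _ in supervisions})
        speaker_to_idx_map = {spk: i for i, spk in enumerate(speakers)}
    num_speakers = max(speaker_to_idx_map.values(), default=-1) + 1
    if min_speaker_dim is not None:
        # Enforce a minimum number of rows, e.g. 4 for CHiME 6.
        num_speakers = max(num_speakers, min_speaker_dim)
    mask = [[0] * num_samples for _ in range(num_speakers)]
    for spk, start, end in supervisions:
        row = mask[speaker_to_idx_map[spk]]
        first = max(0, round(start * sampling_rate))
        last = min(num_samples, round(end * sampling_rate))
        for i in range(first, last):
            row[i] = 1  # this speaker is active at sample i
    return mask


sups = [("A", 0.0, 0.5), ("B", 0.25, 1.0)]
mask = speaker_activity_mask(sups, num_samples=8, sampling_rate=8)
for row in mask:
    print(row)
```

The feature-level variant follows the same idea with frames in place of samples.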
- speakers_feature_mask(min_speaker_dim=None, speaker_to_idx_map=None, use_alignment_if_exists=None)
Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_frames), and its values are 0 for nonspeech frames and 1 for speech frames for each respective speaker.
This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272
- Parameters:
min_speaker_dim (
Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).speaker_to_idx_map (
Optional[Dict[str,int]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2).use_alignment_if_exists (
Optional[str]) – optional str, key for alignment type to use for generating the mask. If no such alignment exists, fall back on supervision time spans.
- Return type:
ndarray
- split(timestamp)
-
Split a cut into two cuts at
timestamp, which is measured from the start of the cut. For example, a [0s - 10s] cut split at 4s yields:left cut [0s - 4s]
right cut [4s - 10s]
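The example above can be written out as simple interval arithmetic. The helper is hypothetical and returns (start, duration) pairs; requiring the timestamp to fall strictly inside the cut is a choice made for this sketch.

```python
def split_bounds(start, duration, timestamp):
    """Split a cut at `timestamp`, measured from the start of the cut."""
    if not 0 < timestamp < duration:
        raise ValueError("timestamp must fall strictly inside the cut")
    left = (start, timestamp)                          # (start, duration)
    right = (start + timestamp, duration - timestamp)  # (start, duration)
    return left, right


left, right = split_bounds(start=0.0, duration=10.0, timestamp=4.0)
print(left)   # left cut: starts at 0 s, lasts 4 s
print(right)  # right cut: starts at 4 s, lasts 6 s
```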
- supervisions_audio_mask(use_alignment_if_exists=None)
Return a 1D numpy array with value 1 for samples covered by at least one supervision, and 0 for samples not covered by any supervision.
- Parameters:
use_alignment_if_exists (
Optional[str]) – optional str, key for alignment type to use for generating the mask. If no such alignment exists, fall back on supervision time spans.
- Return type:
ndarray
- supervisions_feature_mask(use_alignment_if_exists=None)
Return a 1D numpy array with value 1 for frames covered by at least one supervision, and 0 for frames not covered by any supervision.
- Parameters:
use_alignment_if_exists (
Optional[str]) – optional str, key for alignment type to use for generating the mask. If no such alignment exists, fall back on supervision time spans.
- Return type:
ndarray
- trim_to_alignments(type, max_pause=None, max_segment_duration=None, delimiter=' ', keep_all_channels=False)
Splits the current Cut into its constituent alignment items (AlignmentItem). These cuts have identical start times and durations as the alignment item. Additionally, the max_pause option can be used to merge alignment items that are separated by a pause shorter than max_pause. If max_segment_duration is specified, we will keep merging consecutive segments until the duration of the merged segment exceeds max_segment_duration.
For the case of a multi-channel cut with multiple alignments, we can either trim while respecting the supervision channels (in which case output cut has the same channels as the supervision) or ignore the channels (in which case output cut has the same channels as the input cut).
Hint
If the resulting trimmed cut contains a single supervision, we set the cut id to the id of this supervision, for better compatibility with downstream tools, e.g. comparing the hypothesis of ASR with the reference in icefall.
Hint
If a MultiCut is trimmed and the resulting trimmed cut contains a single channel, we convert it to a MonoCut.
Hint
If you have a Cut with multiple supervision segments and you want to trim it to the word-level alignment, you can use the Cut.merge_supervisions() method first to merge the supervisions into a single one, followed by the Cut.trim_to_alignments() method. For example:
>>> cut = cut.merge_supervisions()
>>> cut = cut.trim_to_alignments(type='word', max_pause=1.0)
Hint
The above technique can also be used to segment long cuts into roughly equal duration segments, while respecting alignment boundaries. For example, to split a Cut into 10s segments, you can do:
>>> cut = cut.merge_supervisions()
>>> cut = cut.trim_to_alignments(type='word', max_pause=10.0, max_segment_duration=10.0)
- Parameters:
type (
str) – The type of the alignment to trim to (e.g. “word”).max_pause (
Optional[float]) – The maximum pause allowed between the alignments to merge them. IfNone, no merging will be performed. [default: None]delimiter (
str) – The delimiter to use when joining the alignment items.keep_all_channels (
bool) – IfTrue, the output cut will have the same channels as the input cut. By default, the trimmed cut will have the same channels as the supervision.num_jobs – Number of parallel workers to process the cuts.
- Return type:
CutSet
- Returns:
a CutSet object.
- trim_to_supervision_groups(max_pause=0.0)
Return a new CutSet with Cuts based on supervision groups. A supervision group is a set of supervisions with no gaps between them (or gaps shorter than max_pause). This is similar to the concept of an utterance group as described in this paper: https://arxiv.org/abs/2211.00482
For example, the following cut:
                                        Cut
╔═════════════════════════════════════════════════════════════════════════════════╗
║┌──────────────────────┐                              ┌────────┐                 ║
║│ Hello this is John.  │                              │   Hi   │                 ║
║└──────────────────────┘                              └────────┘                 ║
║            ┌──────────────────────────────────┐            ┌───────────────────┐║
║            │     Hey, John. How are you?      │            │  What do you do?  │║
║            └──────────────────────────────────┘            └───────────────────┘║
╚═════════════════════════════════════════════════════════════════════════════════╝
is transformed into two cuts:
                      Cut 1                                       Cut 2
╔════════════════════════════════════════════════╗    ╔═══════════════════════════╗
║┌──────────────────────┐                        ║    ║┌────────┐                 ║
║│ Hello this is John.  │                        ║    ║│   Hi   │                 ║
║└──────────────────────┘                        ║    ║└────────┘                 ║
║            ┌──────────────────────────────────┐║    ║      ┌───────────────────┐║
║            │     Hey, John. How are you?      │║    ║      │  What do you do?  │║
║            └──────────────────────────────────┘║    ║      └───────────────────┘║
╚════════════════════════════════════════════════╝    ╚═══════════════════════════╝
For the case of a multi-channel cut with multiple supervisions, we keep all the channels in the recording.
- Parameters:
max_pause (
float) – An optional duration in seconds; if the gap between two supervisions is longer than this, they will be treated as separate groups. By default, this is set to 0.0, which means that no gaps are allowed between supervisions.- Return type:
CutSet
- Returns:
a
CutSet.
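The grouping rule can be sketched on plain (start, end) intervals. The function is hypothetical; treating a gap exactly equal to max_pause as belonging to the same group is an assumption, chosen so that touching supervisions are grouped when max_pause is 0.0.

```python
def supervision_groups(segments, max_pause=0.0):
    """Group (start, end) segments whose gaps do not exceed max_pause (sketch)."""
    groups = []
    group_end = None
    for start, end in sorted(segments):
        if group_end is not None and start - group_end <= max_pause:
            # Overlapping, touching, or separated by a short enough gap.
            groups[-1].append((start, end))
            group_end = max(group_end, end)
        else:
            groups.append([(start, end)])
            group_end = end
    return groups


# Two speakers with overlapping turns, then silence, then two more turns.
segments = [(0.0, 5.0), (3.0, 9.0), (12.0, 14.0), (13.0, 18.0)]
print(supervision_groups(segments))                 # two groups
print(supervision_groups(segments, max_pause=3.0))  # the 3 s gap is bridged
```

Each resulting group would correspond to one output cut spanning from the earliest start to the latest end in the group.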
- trim_to_supervisions(keep_overlapping=True, min_duration=None, context_direction='center', keep_all_channels=False)
Splits the current Cut into as many cuts as there are supervisions (SupervisionSegment). These cuts have identical start times and durations as the supervisions. When there are overlapping supervisions, they can be kept or discarded via the keep_overlapping flag.
For example, the following cut:
        Cut
|-----------------|
 Sup1
|----|  Sup2
   |-----------|
is transformed into two cuts:
 Cut1
|----|
 Sup1
|----|
   Sup2
   |-|

        Cut2
   |-----------|
   Sup1
   |-|
    Sup2
   |-----------|
For the case of a multi-channel cut with multiple supervisions, we can either trim while respecting the supervision channels (in which case output cut has the same channels as the supervision) or ignore the channels (in which case output cut has the same channels as the input cut).
Hint
If the resulting trimmed cut contains a single supervision, we set the cut id to the id of this supervision, for better compatibility with downstream tools, e.g. comparing the hypothesis of ASR with the reference in icefall.
Hint
If a MultiCut is trimmed and the resulting trimmed cut contains a single channel, we convert it to a MonoCut.
- Parameters:
keep_overlapping (
bool) – whenFalse, it will discard parts of other supervisions that overlap with the main supervision. In the illustration above, it would discardSup2inCut1andSup1inCut2. In this mode, we guarantee that there will always be exactly one supervision per cut.min_duration (
Optional[float]) – An optional duration in seconds; specifying this argument will extend the cuts that would have been shorter than min_duration with actual acoustic context in the recording/features. If there are supervisions present in the context, they are kept when keep_overlapping is true. If there is not enough context, the returned cut will be shorter than min_duration. If the supervision segment is longer than min_duration, the returned cut will be longer.context_direction (
Literal['center','left','right','random']) – Which direction should the cut be expanded towards to include context. The value of “center” implies equal expansion to left and right; random uniformly samples a value between “left” and “right”.keep_all_channels (
bool) – IfTrue, the output cut will have the same channels as the input cut. By default, the trimmed cut will have the same channels as the supervision.
- Return type:
CutSet
- Returns:
a list of cuts.
- property trimmed_supervisions: List[SupervisionSegment]
Return the supervisions in this Cut that have modified time boundaries so as not to exceed the Cut’s start or end.
Note that when cut.supervisions is called, the supervisions may have negative start values that indicate the supervision actually begins before the cut, or end values that exceed the Cut’s duration (it means the supervision continued in the original recording after the Cut’s ending).
Caution
For some tasks such as speech recognition (ASR), trimmed supervisions could result in corrupted training data. This is because a part of the transcript might actually reside outside of the cut.
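The boundary clipping performed for trimmed supervisions amounts to intersecting each supervision with the interval [0, cut duration]. The helper below is a hypothetical illustration on plain numbers, with the supervision start given relative to the start of the cut.

```python
def trim_supervision(start, duration, cut_duration):
    """Clip a supervision to [0, cut_duration]; returns (start, duration)."""
    end = start + duration
    new_start = max(0.0, start)       # a negative start means it began before the cut
    new_end = min(cut_duration, end)  # an end past the cut is cut off
    return new_start, new_end - new_start


print(trim_supervision(-2.0, 5.0, cut_duration=10.0))  # began 2 s before the cut
print(trim_supervision(8.0, 5.0, cut_duration=10.0))   # continues 3 s past the cut
print(trim_supervision(1.0, 2.0, cut_duration=10.0))   # fully inside: unchanged
```

Only the time boundaries change; the transcript is not shortened, which is the reason for the caution above.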
-
phone:
Callable
-
id:
- class lhotse.cut.MixTrack(cut, type=None, offset=0.0, snr=None, tag=None, is_snr_reference=False, mute=False)[source]
Represents a single track in a mix of Cuts. Points to a specific DataCut or PaddingCut and holds information on how to mix it with other Cuts, relative to the first track in a mix.
-
type:
str= None
-
offset:
float= 0.0
-
snr:
Optional[float] = None
-
tag:
Optional[str] = None
-
is_snr_reference:
bool= False
-
mute:
bool= False
- __init__(cut, type=None, offset=0.0, snr=None, tag=None, is_snr_reference=False, mute=False)
-
type:
- class lhotse.cut.MonoCut(id, start, duration, channel, supervisions=<factory>, features=None, recording=None, custom=None)[source]
MonoCut is a Cut of a single channel of a Recording. In addition to Cut, it has a specified channel attribute. This is the most commonly used type of cut.
Please refer to the documentation of Cut to learn more about using cuts.
See also:
-
channel:
int
- property num_channels: int
- load_features()[source]
Load the features from the underlying storage and cut them to the relevant [begin, duration] region of the current MonoCut.
- Return type:
Optional[ndarray]
- load_audio()[source]
Load the audio by locating the appropriate recording in the supplied RecordingSet. The audio is trimmed to the [begin, end] range specified by the MonoCut.
- Return type:
Optional[ndarray]- Returns:
a numpy ndarray with audio samples, with shape (1 <channel>, N <samples>)
- load_video(with_audio=True)[source]
Load the subset of video (and audio) from attached recording. The data is trimmed to the [begin, end] range specified by the MonoCut.
- Parameters:
with_audio (
bool) – bool, whether to load and return audio alongside video. True by default.- Return type:
Optional[Tuple[Tensor,Optional[Tensor]]]- Returns:
a tuple of video tensor and optionally audio tensor (or None), or None if this cut has no video.
- with_channels(channels)[source]
Select specified channels from this cut. Supports extending to other channels available in the underlying Recording. If a single channel is provided, we’ll return a MonoCut, otherwise we’ll return a MultiCut.
- Return type:
DataCut
- reverb_rir(rir_recording=None, normalize_output=True, early_only=False, affix_id=True, rir_channels=(0,), room_rng_seed=None, source_rng_seed=None)[source]
Return a new DataCut that will convolve the audio with the provided impulse response. If the rir_recording is multi-channel, the rir_channels argument determines which channels will be used. By default, we use the first channel and return a MonoCut. If we reverberate with a multi-channel RIR, we return a MultiCut.
If no rir_recording is provided, we will generate an impulse response using a fast random generator (https://arxiv.org/abs/2208.04101). Note that the generator only supports simulating reverberation with a single microphone, so we will return a MonoCut in this case.
- Parameters:
rir_recording (
Union[Recording,DataCut,None]) – The impulse response to use for convolving.normalize_output (
bool) – When true, output will be normalized to have the same energy as the input.early_only (
bool) – When true, only the early reflections (first 50 ms) will be used.affix_id (
bool) – When true, we will modify theMonoCut.idfield by affixing it with “_rvb”.rir_channels (
Sequence[int]) – The channels of the impulse response to use. First channel is used by default. If multiple channels are specified, this will produce a MultiCut instead of a MonoCut.room_rng_seed (
Optional[int]) – The seed for the room configuration.source_rng_seed (
Optional[int]) – The seed for the source position.
- Return type:
DataCut- Returns:
a modified copy of the current
MonoCut.
- merge_supervisions(merge_policy='delimiter', custom_merge_fn=None)[source]
Return a copy of the cut that has all of its supervisions merged into a single segment.
The new start is the start of the earliest supervision, and the new duration is the minimum duration spanning all the supervisions. The text fields of all segments are concatenated with a whitespace.
- Parameters:
merge_policy (
str) – one of “keep_first” or “delimiter”. If “keep_first”, we keep only the first segment’s field value, otherwise all string fields (including IDs) are prefixed with “cat#” and concatenated with a hash symbol “#”. This is also applied tocustomfields. Fields with aNonevalue are omitted.custom_merge_fn (
Optional[Callable[[str,Iterable[Any]],Any]]) – a function that will be called to merge custom fields values. We expectcustom_merge_fnto handle all possible custom keys. When not provided, we will treat all custom values as strings. It will be called roughly like:custom_merge_fn(custom_key, [s.custom[custom_key] for s in sups])
- Return type:
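As a rough illustration of the merging rule described above (not lhotse's actual implementation), the sketch below computes the merged span and applies the "delimiter" policy to the IDs. The `Sup` dataclass and the `merge_spans` helper are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sup:
    """Hypothetical stand-in for a supervision segment."""
    id: str
    start: float
    duration: float
    text: str


def merge_spans(sups: List[Sup]) -> Sup:
    # The merged segment starts at the earliest supervision...
    start = min(s.start for s in sups)
    # ...and ends where the latest supervision ends.
    end = max(s.start + s.duration for s in sups)
    return Sup(
        # "delimiter" policy: prefix with "cat#" and join with "#".
        id="cat#" + "#".join(s.id for s in sups),
        start=start,
        duration=end - start,
        # Text fields are joined with a single whitespace.
        text=" ".join(s.text for s in sups),
    )


merged = merge_spans([Sup("a", 1.0, 2.0, "hello"), Sup("b", 4.0, 3.0, "world")])
```

Here the merged segment covers the gap between the two inputs as well, since the duration is the span from the earliest start to the latest end.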
- __init__(id, start, duration, channel, supervisions=<factory>, features=None, recording=None, custom=None)
- append(other, snr=None, preserve_id=None)
Append the
other Cut after the current Cut. Conceptually the same as mix but with an offset matching the current cut’s length. Optionally scale down (positive SNR) or scale up (negative SNR) the other cut. Returns a MixedCut, which only keeps the information about the mix; actual mixing is performed during the call to load_features.- Parameters:
preserve_id (
Optional[str]) – optional string (“left”, “right”). When specified, append will preserve the cut ID of the left- or right-hand side argument. Otherwise, a new random ID is generated.- Return type:
- attach_image(key, path_or_object)
Attach an image to this cut, wrapped in an Image class and stored under key in the custom dict.
The image can be specified as: - A path to an image file - A numpy array with shape (height, width, channels) - Raw bytes of an image file
Example:
>>> cut = cut.attach_image('thumbnail', 'path/to/image.jpg')
>>> # Access the image later
>>> img_array = cut.load_thumbnail()  # Returns numpy array
- Parameters:
key (
str) – The key to store the image under in the custom dict.path_or_object (
Union[str,ndarray,bytes]) – The image as a path, numpy array, or bytes.
- Return type:
DataCut- Returns:
A new DataCut with the image attached.
- attach_tensor(name, data, frame_shift=None, temporal_dim=None, compressed=False)
Attach a tensor to this MonoCut, described with an
Arraymanifest. The attached data is stored in-memory for later use, and can be accessed by callingcut.load_<name>()orcut.load_custom().This is useful if you want actions such as truncate/pad to propagate to the tensor, e.g.:
>>> cut = MonoCut(id="c1", start=2, duration=8, ...)
>>> cut = cut.attach_tensor(
...     "alignment",
...     torch.tensor([0, 0, 0, ...]),
...     frame_shift=0.1,
...     temporal_dim=0,
... )
>>> half_alignment = cut.truncate(duration=4.0).load_alignment()
Note
This object can’t be stored in JSON/JSONL manifests anymore.
- Parameters:
name (
str) – attribute under which the data can be found.data (
Union[ndarray,Tensor]) – PyTorch tensor or numpy array.frame_shift (
Optional[float]) – Optional float, when the array has a temporal dimension it indicates how much time has passed between the starts of consecutive frames (expressed in seconds).temporal_dim (
Optional[int]) – Optional int, when the array has a temporal dimension, it indicates which dim to interpret as temporal.compressed (
bool) – When True, we will apply lilcom compression to the array. Only applicable to arrays of floats.
- Return type:
- Returns:
- clip_amplitude(hard=False, gain_db=0.0, normalize=True, oversampling=2, affix_id=True)
Return a new
DataCutthat will lazily apply clipping while loading audio.- Parameters:
hard (
bool) – If True, apply hard clipping (sharp cutoff); otherwise, apply soft clipping (saturation).gain_db (
float) – The amount of gain in decibels to apply before clipping.normalize (
bool) – If True, normalize the input signal to 0 dBFS before applying clipping.oversampling (
Optional[int]) – If provided, we will oversample the input signal by the given integer factor before applying saturation and then downsample back to the original sampling rate.affix_id (
bool) – When true, we will modify theDataCut.idfield by affixing it with “_cl{gain_db}”.
- Return type:
DataCut- Returns:
a modified copy of the current
DataCut.
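The difference between hard and soft clipping can be sketched on plain Python floats. The tanh curve below is only an assumed example of a saturation function, and the helper names are hypothetical; the sketch does not reproduce lhotse's normalization or oversampling steps.

```python
import math


def db_to_gain(gain_db: float) -> float:
    # Convert decibels to a linear amplitude factor.
    return 10.0 ** (gain_db / 20.0)


def hard_clip(x: float) -> float:
    # Sharp cutoff at the [-1, 1] boundary.
    return max(-1.0, min(1.0, x))


def soft_clip(x: float) -> float:
    # Assumed saturation curve: tanh approaches +/-1 smoothly.
    return math.tanh(x)


samples = [0.2, 0.6, -0.9]
gain = db_to_gain(6.0)  # roughly doubles the amplitude
hard = [hard_clip(s * gain) for s in samples]
soft = [soft_clip(s * gain) for s in samples]
```

With hard clipping, any boosted sample beyond the boundary is flattened to exactly 1.0 or -1.0, whereas the soft variant compresses it gradually and never reaches the boundary.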
- compress(codec='opus', compression_level=0.99, compress_custom_fields=False)
Return a copy of this Cut that has its Recordings processed by a lossy audio encoder.
- Parameters:
codec (
Literal['opus','mp3','vorbis','gsm']) – The codec to use for compression. Supported codecs are “opus”, “mp3”, “vorbis”, “gsm”.compression_level (
float) – The level of compression (from 0.0 to 1.0, higher values correspond to higher compression).compress_custom_fields (
bool) – Whether to also compress any custom recording fields in the Cut.
- Return type:
DataCut- Returns:
A modified
DataCutcontaining audio processed by a codec
- compute_and_store_features(extractor, storage, augment_fn=None, *args, **kwargs)
Compute the features from this cut, store them on disk, and attach a feature manifest to this cut. This cut has to be able to load audio.
- Parameters:
extractor (
FeatureExtractor) – aFeatureExtractorinstance used to compute the features.storage (
FeaturesWriter) – aFeaturesWriterinstance used to write the features to a storage. When the optionallilcomdependency is installed and on-disk size matters,LilcomChunkyWriteris the preferred backend.augment_fn (
Optional[Callable[[ndarray,int],ndarray]]) – an optional callable used for audio augmentation.
- Return type:
DataCut- Returns:
a new
MonoCutinstance with aFeaturesmanifest attached to it.
- compute_features(extractor, augment_fn=None)
Compute the features from this cut. This cut has to be able to load audio.
- Parameters:
extractor (
FeatureExtractor) – aFeatureExtractorinstance used to compute the features.augment_fn (
Optional[Callable[[ndarray,int],ndarray]]) – an optional WavAugmenter instance for audio augmentation.
- Return type:
ndarray- Returns:
a numpy ndarray with the computed features.
- copy(**replace_attrs)
Returns a shallow copy of self, with specified attributes overwritten.
- Example:
>>> cut = MonoCut(id="old-id", ...)
>>> cut2 = cut.copy(id="new-id")
>>> assert cut.id == "old-id"
>>> assert cut2.id == "new-id"
- custom: Optional[Dict[str, Any]] = None
- cut_into_windows(duration, hop=None, keep_excessive_supervisions=True)
Return a list of shorter cuts, made by traversing this cut in windows of
duration seconds by hop seconds.
The last window might have a shorter duration if there was not enough audio, so you might want to either filter or pad the results.
- Parameters:
duration (
float) – Desired duration of the new cuts in seconds.hop (
Optional[float]) – Shift between the windows in the new cuts in seconds.keep_excessive_supervisions (
bool) – bool. When a cut is truncated in the middle of a supervision segment, should the supervision be kept.
- Return type:
CutSet
- Returns:
a list of cuts made from shorter duration windows.
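The windowing can be pictured with the following sketch, which only computes window boundaries. It assumes that hop defaults to duration and that the number of windows is just enough to reach the end of the cut; `window_bounds` is a hypothetical helper and lhotse's own computation may differ in edge cases.

```python
import math
from typing import List, Optional, Tuple


def window_bounds(
    total: float, duration: float, hop: Optional[float] = None
) -> List[Tuple[float, float]]:
    # Without an explicit hop, windows are laid back to back (assumption).
    hop = duration if hop is None else hop
    # Assumed window count: enough hops to reach the end of the cut.
    n = math.ceil(max(total - duration, 0.0) / hop) + 1
    bounds = []
    for i in range(n):
        start = i * hop
        # The last window is cut short when the audio runs out.
        bounds.append((start, min(start + duration, total)))
    return bounds


windows = window_bounds(total=10.0, duration=4.0)
overlapping = window_bounds(total=10.0, duration=4.0, hop=2.0)
```

In the first call the final window is only 2 seconds long, which is the case where filtering or padding the results becomes relevant.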
- cut_into_windows_balanced(min_duration, max_duration, overlap=0.0, keep_excessive_supervisions=True)
Return a list of shorter cuts made by splitting this cut into overlapping windows whose size is chosen within
[min_duration, max_duration]to maximise the duration of the final (potentially shorter) window, thereby minimising padding.Each resulting sub-cut carries two extra entries in its
customdict:"source_cut_id"– theidof this (parent) cut."source_cut_start"– thestarttime of this cut within its recording. Downstream code can use this to detect whether the parent was the first window of a recording (source_cut_start == 0) or a later continuation.
- Parameters:
min_duration (
float) – Minimum desired window duration in seconds.max_duration (
float) – Maximum desired window duration in seconds.overlap (
float) – Overlap between consecutive windows in seconds (default: 0).keep_excessive_supervisions (
bool) – When a window is truncated mid-supervision, should the supervision be kept.
- Return type:
CutSet
- Returns:
a
CutSetof overlapping sub-cuts.
- dereverb_wpe(affix_id=True)
Return a new
DataCutthat will lazily apply WPE dereverberation.- Parameters:
affix_id (
bool) – When true, we will modify theDataCut.idfield by affixing it with “_wpe”.- Return type:
DataCut- Returns:
a modified copy of the current
DataCut.
- drop_alignments()
Return a copy of the current
DataCut, detached fromalignments.- Return type:
DataCut
- drop_custom(name)
- drop_features()
Return a copy of the current
DataCut, detached fromfeatures.- Return type:
DataCut
- drop_in_memory_data()
Return a copy of the current
DataCut, detached from any in-memory data. The manifests for in-memory data are converted into placeholders that can still be looked up for metadata, but will fail on attempts to load the data.- Return type:
DataCut
- drop_recording()
Return a copy of the current
DataCut, detached fromrecording.- Return type:
DataCut
- drop_supervisions()
Return a copy of the current
DataCut, detached fromsupervisions.- Return type:
DataCut
- property end: float
- extend_by(*, duration, direction='both', preserve_id=False, pad_silence=True)
Returns a new Cut (DataCut or MixedCut) that is an extended region of the current DataCut by extending the cut by a fixed duration in the specified direction.
Note that no operation is done on the actual features or recording - it’s only during the call to
DataCut.load_features()/DataCut.load_audio()when the actual changes happen (an extended version of features/audio is loaded).Hint
This method extends a cut by a given duration, either to the left or to the right (or both), using the “real” content of the recording that the cut is part of. For example, a DataCut spanning the region from 2s to 5s in a recording, when extended by 2s to the right, will now span the region from 2s to 7s in the same recording (provided the recording length exceeds 7s). If the recording is shorter, additional silence will be padded to achieve the desired duration by default. This behavior can be changed by setting
pad_silence=False. Also seeDataCut.pad()which pads a cut “to” a specified length. To “truncate” a cut, useDataCut.truncate().Hint
If pad_silence is set to False, then the cut will be extended only as much as allowed within the recording’s boundary.
Hint
If direction is “both”, the resulting cut will be extended by the specified duration in both directions. This is different from the usage in
MonoCut.pad()where a padding equal to 0.5*duration is added to both sides.- Parameters:
duration (
float) – float (seconds), specifies the duration by which the cut should be extended.direction (
str) – string, ‘left’, ‘right’ or ‘both’. Determines whether to extend on the left, right, or both sides. If ‘both’, extend on both sides by the duration specified in duration.preserve_id (
bool) – bool. Should the extended cut keep the same ID or get a new, random one.pad_silence (
bool) – bool. Should the cut be padded with silence if the recording is shorter than the desired duration. If False, the cut will be extended only as much as allowed within the recording’s boundary.
- Return type:
- Returns:
a new MonoCut instance.
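The interaction between the extension and the recording boundary can be sketched as simple interval arithmetic. `extend_span` is a hypothetical helper that returns the extended region inside the recording together with the amount that would have to be filled with silence on each side when pad_silence is enabled; it is an illustration under those assumptions, not the library's code.

```python
from typing import Tuple


def extend_span(
    start: float, end: float, rec_duration: float, by: float, direction: str = "both"
) -> Tuple[float, float, float, float]:
    # Requested boundaries before looking at the recording limits.
    new_start = start - by if direction in ("left", "both") else start
    new_end = end + by if direction in ("right", "both") else end
    # Only "real" audio inside [0, rec_duration] can be used.
    clamped_start = max(0.0, new_start)
    clamped_end = min(rec_duration, new_end)
    # Whatever is missing would be silence padding (if pad_silence=True).
    pad_left = clamped_start - new_start
    pad_right = new_end - clamped_end
    return clamped_start, clamped_end, pad_left, pad_right


# A cut spanning 2s to 5s, extended by 2s to the right in a 10s recording.
inside = extend_span(2.0, 5.0, 10.0, 2.0, "right")
# The same request against a 6s recording leaves 1s to be padded.
short = extend_span(2.0, 5.0, 6.0, 2.0, "right")
```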
- features: Optional[Features] = None
- property features_type: str | None
- fill_supervision(add_empty=True, shrink_ok=False)
Fills the whole duration of a cut with a supervision segment.
If the cut has one supervision, its start is set to 0 and duration is set to
cut.duration. Note: this may either expand a supervision that was shorter than a cut, or shrink a supervision that exceeds the cut.If there are no supervisions, we will add an empty one when
add_empty==True, otherwise we won’t change anything.If there are two or more supervisions, we will raise an exception.
- Parameters:
add_empty (
bool) – should we add an empty supervision with identical time bounds as the cut.shrink_ok (
bool) – when False, we raise an error if a supervision would be shrunk as a result of calling this method.
- Return type:
DataCut
- filter_supervisions(predicate)
Return a copy of the cut that only has supervisions accepted by
predicate.Example:
>>> cut = cut.filter_supervisions(lambda s: s.id in supervision_ids)
>>> cut = cut.filter_supervisions(lambda s: s.duration < 5.0)
>>> cut = cut.filter_supervisions(lambda s: s.text is not None)
- Parameters:
predicate (
Callable[[SupervisionSegment],bool]) – A callable that accepts SupervisionSegment and returns bool- Return type:
DataCut- Returns:
a modified MonoCut
- property frame_shift: float | None
- has(field)
- Return type:
bool
- has_custom(name)
Check if the Cut has a custom attribute with name
name.- Parameters:
name (
str) – name of the custom attribute.- Return type:
bool- Returns:
a boolean.
- property has_features: bool
- property has_overlapping_supervisions: bool
- property has_recording: bool
- property has_video: bool
- index_supervisions(index_mixed_tracks=False, keep_ids=None)
Create a two-level index of supervision segments. It is a mapping from a Cut’s ID to an interval tree that contains the supervisions of that Cut.
The interval tree can be efficiently queried for overlapping and/or enveloping segments. It helps speed up some operations on Cuts of very long recordings (1h+) that contain many supervisions.
- Parameters:
index_mixed_tracks (
bool) – Should the tracks of MixedCut’s be indexed as additional, separate entries.keep_ids (
Optional[Set[str]]) – If specified, we will only index the supervisions with the specified IDs.
- Return type:
Dict[str,IntervalTree]- Returns:
a mapping from Cut ID to an interval tree of SupervisionSegments.
- property is_in_memory: bool
- iter_data()
Iterate over each data piece attached to this cut. Returns a generator yielding tuples of
(key, manifest), wherekeyis the name of the attribute under whichmanifestis found.manifestis of typeRecording,Features,TemporalArray,Array, orImage.For example, if
keyisrecording, thenmanifestisself.recording.
- load_custom(name, **kwargs)
Load custom data as numpy array. The custom data is expected to have been stored in cuts
customfield as anArray,TemporalArray, orImagemanifest.Note
It works with Array/Image manifests stored via attribute assignments, e.g.:
cut.my_custom_data = Array(...)orcut = cut.attach_image('img', ...).- Parameters:
name (
str) – name of the custom attribute.- Return type:
ndarray- Returns:
a numpy array with the data.
- map_supervisions(transform_fn)
Return a copy of the cut that has its supervisions transformed by
transform_fn.- Parameters:
transform_fn (
Callable[[SupervisionSegment],SupervisionSegment]) – a function that takes a supervision as an argument and returns a modified supervision.- Return type:
DataCut- Returns:
a modified MonoCut.
- mix(other, offset_other_by=0.0, allow_padding=False, snr=None, preserve_id=None, tag=None)
Refer to the lhotse.cut.mix() documentation.
- Return type:
- move_to_memory(audio_format='flac', load_audio=True, load_features=True, load_custom=True)
Load data (audio, features, or custom arrays) into memory and attach them to a copy of the manifest. This is useful when you want to store cuts together with the actual data in some binary format that enables sequential data reads.
Audio is encoded with
audio_format(compatible withtorchaudio.save), floating point features are encoded with lilcom, and other arrays are pickled.- Return type:
- narrowband(codec, restore_orig_sr=True, affix_id=True)
Return a new
DataCutthat will lazily apply narrowband effect.- Parameters:
codec (
str) – Codec name.restore_orig_sr (
bool) – Restore original sampling rate.affix_id (
bool) – When true, we will modify theDataCut.idfield by affixing it with “_nb_{codec}”.
- Return type:
DataCut- Returns:
a modified copy of the current
DataCut.
- normalize_loudness(target, affix_id=False, **kwargs)
Return a new
DataCutthat will lazily apply loudness normalization.- Parameters:
target (
float) – The target loudness in dBFS.affix_id (
bool) – When true, we will modify theDataCut.idfield by affixing it with “_ln{target}”.
- Return type:
DataCut- Returns:
a modified copy of the current
DataCut.
- property num_features: int | None
- property num_frames: int | None
- property num_samples: int | None
- pad(duration=None, num_frames=None, num_samples=None, pad_feat_value=-23.025850929940457, direction='right', preserve_id=False, pad_value_dict=None)
Return a new MixedCut, padded with zeros in the recording, and
pad_feat_valuein each feature bin.The user can choose to pad either to a specific duration; a specific number of frames num_frames; or a specific number of samples num_samples. The three arguments are mutually exclusive.
- Parameters:
duration (
Optional[float]) – The cut’s minimal duration after padding.num_frames (
Optional[int]) – The cut’s total number of frames after padding.num_samples (
Optional[int]) – The cut’s total number of samples after padding.pad_feat_value (
float) – A float value that’s used for padding the features. By default we assume a log-energy floor of approx. -23 (1e-10 after exp).direction (
str) – string, ‘left’, ‘right’ or ‘both’. Determines whether the padding is added before or after the cut.preserve_id (
bool) – WhenTrue, preserves the cut ID before padding. Otherwise, a new random ID is generated for the padded cut (default).pad_value_dict (
Optional[Dict[str,Union[int,float]]]) – Optional dict that specifies what value should be used for padding arrays in custom attributes.
- Return type:
- Returns:
a padded MixedCut if duration is greater than this cut’s duration, otherwise
self.
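Two details of padding can be checked with a few lines of arithmetic: the default pad_feat_value is the natural logarithm of 1e-10, and the amount of padding depends on the direction. The even split for "both" follows the 0.5*duration description given for extend_by above; `pad_amounts` is a hypothetical helper written for illustration only.

```python
import math
from typing import Tuple

# Log-energy floor matching the default pad_feat_value.
LOG_FLOOR = math.log(1e-10)


def pad_amounts(current: float, target: float, direction: str = "right") -> Tuple[float, float]:
    # Nothing is added when the cut is already long enough.
    missing = max(target - current, 0.0)
    if direction == "left":
        return missing, 0.0
    if direction == "both":
        # Assumed even split between the two sides.
        return missing / 2, missing / 2
    return 0.0, missing


left_pad, right_pad = pad_amounts(4.0, 10.0, "both")
```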
- perturb_speed(factor, affix_id=True)
Return a new
DataCutthat will lazily perturb the speed while loading audio. Thenum_samples,startanddurationfields are updated to reflect the shrinking/extending effect of speed. We are also updating the time markers of the underlyingRecordingand the supervisions.- Parameters:
factor (
float) – The speed will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).affix_id (
bool) – When true, we will modify theMonoCut.idfield by affixing it with “_sp{factor}”.
- Return type:
DataCut- Returns:
a modified copy of the current
DataCut.
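The effect of the speed factor on the timing fields can be sketched as a division by the factor. The rounding of the sample count is an assumption of this example, and `perturbed_timing` is a hypothetical helper rather than a lhotse function.

```python
from typing import Tuple


def perturbed_timing(
    start: float, duration: float, num_samples: int, factor: float
) -> Tuple[float, float, int]:
    # Playing audio `factor` times faster shrinks every time marker by `factor`.
    return start / factor, duration / factor, round(num_samples / factor)


new_start, new_duration, new_num_samples = perturbed_timing(
    start=2.0, duration=8.0, num_samples=128000, factor=2.0
)
```

A factor below 1.0 works the same way and stretches the cut instead of shrinking it.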
- perturb_tempo(factor, affix_id=True)
Return a new
DataCutthat will lazily perturb the tempo while loading audio.Compared to speed perturbation, tempo preserves pitch. The
num_samples,startanddurationfields are updated to reflect the shrinking/extending effect of speed. We are also updating the time markers of the underlyingRecordingand the supervisions.- Parameters:
factor (
float) – The tempo will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).affix_id (
bool) – When true, we will modify theMonoCut.idfield by affixing it with “_tp{factor}”.
- Return type:
DataCut- Returns:
a modified copy of the current
DataCut.
- perturb_volume(factor, affix_id=True)
Return a new
DataCutthat will lazily perturb the volume while loading audio.- Parameters:
factor (
float) – The volume will be adjusted this many times (e.g. factor=1.1 means 1.1x louder).affix_id (
bool) – When true, we will modify theDataCut.idfield by affixing it with “_vp{factor}”.
- Return type:
DataCut- Returns:
a modified copy of the current
DataCut.
- play_audio()
Display a Jupyter widget that lets you listen to the waveform. Works only in Jupyter notebook/lab or similar (e.g. Colab).
- plot_alignment(alignment_type='word')
Display the alignment on top of a spectrogram. Requires matplotlib to be installed.
- plot_audio(ax=None, **kwargs)
Display a plot of the waveform. Requires matplotlib to be installed.
- plot_features()
Display the feature matrix as an image. Requires matplotlib to be installed.
- recording: Optional[Recording] = None
- property recording_id: str
- resample(sampling_rate, affix_id=False, recording_field=None)
Return a new
DataCutthat will lazily resample the audio while reading it. This operation will drop the feature manifest, if attached. It does not affect the supervision.- Parameters:
sampling_rate (
int) – The new sampling rate.affix_id (
bool) – Should we modify the ID (useful if both versions of the same cut are going to be present in a single manifest).recording_field (
Optional[str]) – which recording field to resample.
- Return type:
DataCut- Returns:
a modified copy of the current
DataCut.
- property sampling_rate: int
- save_audio(storage_path, format=None, encoding=None, augment_fn=None, **kwargs)
Store this cut’s waveform as audio recording to disk.
- Parameters:
storage_path (
Union[Path,str]) – The path to location where we will store the audio recordings.format (
Optional[str]) – Audio format argument supported bytorchaudio.saveorsoundfile.write. Please refer to the relevant library’s documentation depending on which audio backend you’re using.encoding (
Optional[str]) – Audio encoding argument supported bytorchaudio.saveorsoundfile.write. Please refer to the relevant library’s documentation depending on which audio backend you’re using.augment_fn (
Optional[Callable[[ndarray,int],ndarray]]) – an optional callable used for audio augmentation. Be careful with the types of augmentations used: if they modify the start/end/duration times of the cut and its supervisions, you will end up with incorrect supervision information when using this API. E.g. for speed perturbation, useCutSet.perturb_speed()instead.kwargs – additional arguments passed to
Cut.load_audio(). For example, when saving a MixedCut, we can specify mono_downmix=True to downmix the tracks to mono before saving.
- Return type:
- Returns:
a new Cut instance.
- speakers_audio_mask(min_speaker_dim=None, speaker_to_idx_map=None, use_alignment_if_exists=None)
Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_samples), and its values are 0 for nonspeech samples and 1 for speech samples for each respective speaker.
This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272
- Parameters:
min_speaker_dim (
Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).speaker_to_idx_map (
Optional[Dict[str,int]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2).use_alignment_if_exists (
Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, we fall back on supervision time spans.
- Return type:
ndarray
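The shape and contents of the per-speaker mask can be illustrated with nested lists instead of a numpy array. This is a simplified sketch: it assigns speaker rows in sorted order and rounds times to the nearest sample, both of which are assumptions, and `speaker_mask` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def speaker_mask(
    sups: List[Tuple[str, float, float]],  # (speaker, start, duration) in seconds
    num_samples: int,
    sampling_rate: int,
) -> Tuple[Dict[str, int], List[List[int]]]:
    # Assign row indices to speakers in sorted order (an assumption).
    speaker_to_idx = {spk: i for i, spk in enumerate(sorted({s[0] for s in sups}))}
    mask = [[0] * num_samples for _ in speaker_to_idx]
    for speaker, start, duration in sups:
        # Clamp the supervision to the cut boundaries.
        first = max(0, round(start * sampling_rate))
        last = min(num_samples, round((start + duration) * sampling_rate))
        for n in range(first, last):
            mask[speaker_to_idx[speaker]][n] = 1
    return speaker_to_idx, mask


idx, mask = speaker_mask(
    [("A", 0.0, 0.5), ("B", 0.25, 0.75)], num_samples=8, sampling_rate=8
)
```

Passing a fixed speaker-to-index mapping instead of deriving it from the cut would keep the row order stable across cuts, which is the purpose of speaker_to_idx_map.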
- speakers_feature_mask(min_speaker_dim=None, speaker_to_idx_map=None, use_alignment_if_exists=None)
Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_frames), and its values are 0 for nonspeech frames and 1 for speech frames for each respective speaker.
This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272
- Parameters:
min_speaker_dim (
Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).speaker_to_idx_map (
Optional[Dict[str,int]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2).use_alignment_if_exists (
Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, we fall back on supervision time spans.
- Return type:
ndarray
- split(timestamp)
Split a cut into two cuts at
timestamp, which is measured from the start of the cut. For example, a [0s - 10s] cut split at 4s yields:
- left cut [0s - 4s]
- right cut [4s - 10s]
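Since the timestamp is relative to the cut rather than to the recording, a cut that starts later in its recording yields shifted boundaries. The sketch below shows this arithmetic; `split_span` is a hypothetical helper, and rejecting timestamps outside the cut is an assumption of the example.

```python
from typing import Tuple

Span = Tuple[float, float]  # (start in recording, duration)


def split_span(cut: Span, timestamp: float) -> Tuple[Span, Span]:
    start, duration = cut
    # Assumption of this sketch: the split point must lie inside the cut.
    if not 0.0 < timestamp < duration:
        raise ValueError("timestamp must fall strictly inside the cut")
    # `timestamp` is relative to the cut, not to the recording.
    left = (start, timestamp)
    right = (start + timestamp, duration - timestamp)
    return left, right


# A 10s cut that starts at 5s in its recording, split 4s into the cut.
left, right = split_span((5.0, 10.0), 4.0)
```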
- supervisions_audio_mask(use_alignment_if_exists=None)
Return a 1D numpy array with value 1 for samples covered by at least one supervision, and 0 for samples not covered by any supervision.
- Parameters:
use_alignment_if_exists (
Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, we fall back on supervision time spans.- Return type:
ndarray
- supervisions_feature_mask(use_alignment_if_exists=None)
Return a 1D numpy array with value 1 for frames covered by at least one supervision, and 0 for frames not covered by any supervision.
- Parameters:
use_alignment_if_exists (
Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, we fall back on supervision time spans.- Return type:
ndarray
- to_dict()
- Return type:
dict
- trim_to_alignments(type, max_pause=None, max_segment_duration=None, delimiter=' ', keep_all_channels=False)
Splits the current
Cutinto its constituent alignment items (AlignmentItem). These cuts have identical start times and durations as the alignment item. Additionally, the max_pause option can be used to merge alignment items that are separated by a pause shorter than max_pause. If max_segment_duration is specified, we will keep merging consecutive segments until the duration of the merged segment exceeds max_segment_duration.For the case of a multi-channel cut with multiple alignments, we can either trim while respecting the supervision channels (in which case output cut has the same channels as the supervision) or ignore the channels (in which case output cut has the same channels as the input cut).
Hint
If the resulting trimmed cut contains a single supervision, we set the cut id to the
idof this supervision, for better compatibility with downstream tools, e.g. comparing the hypothesis of ASR with the reference in icefall.Hint
If a MultiCut is trimmed and the resulting trimmed cut contains a single channel, we convert it to a MonoCut.
Hint
If you have a Cut with multiple supervision segments and you want to trim it to the word-level alignment, you can use the
Cut.merge_supervisions() method first to merge the supervisions into a single one, followed by the Cut.trim_to_alignments() method. For example:
>>> cut = cut.merge_supervisions()
>>> cut = cut.trim_to_alignments(type='word', max_pause=1.0)
Hint
The above technique can also be used to segment long cuts into roughly equal duration segments, while respecting alignment boundaries. For example, to split a Cut into 10s segments, you can do:
>>> cut = cut.merge_supervisions()
>>> cut = cut.trim_to_alignments(type='word', max_pause=10.0, max_segment_duration=10.0)
- Parameters:
type (
str) – The type of the alignment to trim to (e.g. “word”).max_pause (
Optional[float]) – The maximum pause allowed between the alignments to merge them. IfNone, no merging will be performed. [default: None]delimiter (
str) – The delimiter to use when joining the alignment items.keep_all_channels (
bool) – IfTrue, the output cut will have the same channels as the input cut. By default, the trimmed cut will have the same channels as the supervision.num_jobs – Number of parallel workers to process the cuts.
- Return type:
CutSet
- Returns:
a CutSet object.
- trim_to_supervision_groups(max_pause=0.0)
Return a new CutSet with Cuts based on supervision groups. A supervision group is a set of supervisions with no gaps between them (or gaps shorter than
max_pause). This is similar to the concept of an utterance group as described in this paper: https://arxiv.org/abs/2211.00482For example, the following cut:
Cut
╔═════════════════════════════════════════════════════════════════════════════════╗
║┌──────────────────────┐                              ┌────────┐                 ║
║│ Hello this is John.  │                              │   Hi   │                 ║
║└──────────────────────┘                              └────────┘                 ║
║            ┌──────────────────────────────────┐            ┌───────────────────┐║
║            │     Hey, John. How are you?      │            │  What do you do?  │║
║            └──────────────────────────────────┘            └───────────────────┘║
╚═════════════════════════════════════════════════════════════════════════════════╝
is transformed into two cuts:
Cut 1                                                 Cut 2
╔════════════════════════════════════════════════╗    ╔═══════════════════════════╗
║┌──────────────────────┐                        ║    ║┌────────┐                 ║
║│ Hello this is John.  │                        ║    ║│   Hi   │                 ║
║└──────────────────────┘                        ║    ║└────────┘                 ║
║            ┌──────────────────────────────────┐║    ║      ┌───────────────────┐║
║            │     Hey, John. How are you?      │║    ║      │  What do you do?  │║
║            └──────────────────────────────────┘║    ║      └───────────────────┘║
╚════════════════════════════════════════════════╝    ╚═══════════════════════════╝
For the case of a multi-channel cut with multiple supervisions, we keep all the channels in the recording.
- Parameters:
max_pause (
float) – An optional duration in seconds; if the gap between two supervisions is longer than this, they will be treated as separate groups. By default, this is set to 0.0, which means that no gaps are allowed between supervisions.- Return type:
CutSet
- Returns:
a
CutSet.
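The grouping rule can be sketched as a single pass over supervisions sorted by start time: a new group begins whenever the gap to the running group end exceeds max_pause. This is a minimal illustration on (start, end) tuples, with `group_supervisions` as a hypothetical helper rather than lhotse's implementation.

```python
from typing import List, Optional, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def group_supervisions(spans: List[Span], max_pause: float = 0.0) -> List[List[Span]]:
    groups: List[List[Span]] = []
    group_end: Optional[float] = None
    for start, end in sorted(spans):
        if group_end is None or start - group_end > max_pause:
            # The gap is too long: open a new group.
            groups.append([(start, end)])
            group_end = end
        else:
            # Overlapping or close enough: extend the current group.
            groups[-1].append((start, end))
            group_end = max(group_end, end)
    return groups


spans = [(0.0, 3.0), (2.0, 7.0), (9.0, 10.0), (9.5, 12.0)]
strict = group_supervisions(spans)
relaxed = group_supervisions(spans, max_pause=2.0)
```

With the default max_pause the 2-second silence separates two groups, mirroring the diagram above; allowing a 2-second pause merges everything into one group.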
- trim_to_supervisions(keep_overlapping=True, min_duration=None, context_direction='center', keep_all_channels=False)
Splits the current
Cutinto as many cuts as there are supervisions (SupervisionSegment). These cuts have identical start times and durations as the supervisions. When there are overlapping supervisions, they can be kept or discarded viakeep_overlappingflag.For example, the following cut:
Cut
|-----------------|
Sup1
|----|  Sup2
   |-----------|
is transformed into two cuts:
Cut1
|----|
Sup1
|----|
   Sup2
   |-|
   Cut2
   |-----------|
   Sup1
   |-|
   Sup2
   |-----------|
For the case of a multi-channel cut with multiple supervisions, we can either trim while respecting the supervision channels (in which case output cut has the same channels as the supervision) or ignore the channels (in which case output cut has the same channels as the input cut).
Hint
If the resulting trimmed cut contains a single supervision, we set the cut id to the
idof this supervision, for better compatibility with downstream tools, e.g. comparing the hypothesis of ASR with the reference in icefall.Hint
If a MultiCut is trimmed and the resulting trimmed cut contains a single channel, we convert it to a MonoCut.
- Parameters:
keep_overlapping (
bool) – whenFalse, it will discard parts of other supervisions that overlap with the main supervision. In the illustration above, it would discardSup2inCut1andSup1inCut2. In this mode, we guarantee that there will always be exactly one supervision per cut.min_duration (
Optional[float]) – An optional duration in seconds; specifying this argument will extend the cuts that would have been shorter thanmin_durationwith actual acoustic context in the recording/features. If there are supervisions present in the context, they are kept whenkeep_overlappingis true. If there is not enough context, the returned cut will be shorter thanmin_duration. If the supervision segment is longer thanmin_duration, the return cut will be longer.context_direction (
Literal['center','left','right','random']) – Which direction should the cut be expanded towards to include context. The value of “center” implies equal expansion to left and right; random uniformly samples a value between “left” and “right”.keep_all_channels (
bool) – IfTrue, the output cut will have the same channels as the input cut. By default, the trimmed cut will have the same channels as the supervision.
- Return type:
CutSet
- Returns:
a list of cuts.
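The effect of keep_overlapping can be sketched by listing which supervisions end up in each trimmed cut. The example works on a plain dict of (start, end) spans; `trim_to_spans` is a hypothetical helper that ignores min_duration, context expansion, and channels.

```python
from typing import Dict, List, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def trim_to_spans(
    sups: Dict[str, Span], keep_overlapping: bool = True
) -> List[Tuple[Span, List[str]]]:
    cuts = []
    for main_id, (start, end) in sups.items():
        kept = [
            sup_id
            for sup_id, (s, e) in sups.items()
            # Keep the main supervision, plus any that overlap it if requested.
            if sup_id == main_id or (keep_overlapping and s < end and e > start)
        ]
        # Each new cut takes the time span of its main supervision.
        cuts.append(((start, end), kept))
    return cuts


sups = {"Sup1": (0.0, 5.0), "Sup2": (3.0, 15.0)}
with_overlap = trim_to_spans(sups)
without_overlap = trim_to_spans(sups, keep_overlapping=False)
```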
- property trimmed_supervisions: List[SupervisionSegment]
Return the supervisions in this Cut that have modified time boundaries so as not to exceed the Cut’s start or end.
Note that when cut.supervisions is called, the supervisions may have negative start values that indicate the supervision actually begins before the cut, or end values that exceed the Cut’s duration (it means the supervision continued in the original recording after the Cut’s ending).
Caution
For some tasks such as speech recognition (ASR), trimmed supervisions could result in corrupted training data. This is because a part of the transcript might actually reside outside of the cut.
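The boundary adjustment described for trimmed_supervisions can be illustrated with a small standalone sketch. The helper trim_supervision is hypothetical and works on plain numbers (seconds relative to the cut) rather than on SupervisionSegment objects; it shows the clamping idea only.

```python
def trim_supervision(sup_start, sup_duration, cut_duration):
    """Clamp a supervision's time span to [0, cut_duration].

    Hypothetical helper; times are in seconds relative to the cut start.
    """
    start = max(0.0, sup_start)                        # drop the part before the cut
    end = min(cut_duration, sup_start + sup_duration)  # drop the part after the cut
    return start, end - start
```

A supervision that starts 1.5 s before a 10 s cut and lasts 4 s would be reported as starting at 0 s with a duration of 2.5 s; its transcript, however, still covers the full 4 s, which is the source of the caution above.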
- truncate(*, offset=0.0, duration=None, keep_excessive_supervisions=True, preserve_id=False, _supervisions_index=None)
Returns a new MonoCut that is a sub-region of the current DataCut.
Note that no operation is done on the actual features or recording - it’s only during the call to DataCut.load_features() / DataCut.load_audio() when the actual changes happen (a subset of features/audio is loaded).
Hint
To extend a cut by a fixed duration, use the DataCut.extend_by() method.
- Parameters:
offset (float) – float (seconds), controls the start of the new cut relative to the current DataCut’s start. E.g., if the current DataCut starts at 10.0, and offset is 2.0, the new start is 12.0.
duration (Optional[float]) – optional float (seconds), controls the duration of the resulting DataCut. By default, the duration is (end of the cut before truncation) - (offset).
keep_excessive_supervisions (bool) – bool. Since trimming may happen inside a SupervisionSegment, the caller has an option to either keep or discard such supervisions.
preserve_id (bool) – bool. Should the truncated cut keep the same ID or get a new, random one.
_supervisions_index (Optional[Dict[str,IntervalTree]]) – an IntervalTree; when passed, it speeds up processing of Cuts with a very large number of supervisions. Intended as an internal parameter.
- Return type:
DataCut
- Returns:
a new MonoCut instance. If the current DataCut is shorter than the duration, return None.
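The offset and duration arithmetic of truncate can be sketched as follows. This is a simplified illustration in plain Python, not Lhotse's code: truncate_bounds is a hypothetical helper, it ignores supervisions and IDs, and it does not model the case where None is returned.

```python
def truncate_bounds(cut_start, cut_duration, offset=0.0, duration=None):
    """Return (new_start, new_duration) of a truncated cut, in seconds.

    Hypothetical helper; cut_start is measured within the recording.
    """
    new_start = cut_start + offset
    cut_end = cut_start + cut_duration
    # Without an explicit duration, keep everything up to the original end.
    new_duration = cut_end - new_start if duration is None else duration
    return new_start, new_duration
```

With a cut starting at 10.0 s and lasting 8.0 s, an offset of 2.0 gives a new start of 12.0 s and, by default, a duration of 6.0 s.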
- unmix(tag=None)
Return this cut as a single-item list.
This is a compatibility no-op for cut types that are not MixedCut, so callers can uniformly invoke cut.unmix() regardless of the concrete cut type.
- Parameters:
tag (Optional[str]) – Ignored for non-mixed cuts.
- Return type:
List[Cut]
- Returns:
A single-item list containing self.
- with_custom(name, value)
Return a copy of this object with an extra custom field assigned to it.
- with_features_path_prefix(path)
- Return type:
DataCut
- with_recording_path_prefix(path)
- Return type:
DataCut
- id: str
- start: Seconds
- duration: Seconds
- supervisions: List[SupervisionSegment]
- phone: Callable
-
channel:
- class lhotse.cut.MultiCut(id, start, duration, channel, supervisions=<factory>, features=None, recording=None, custom=None)[source]
MultiCut is a Cut that is analogous to the MonoCut. While MonoCut represents a single channel of a recording, MultiCut represents multi-channel recordings where supervisions may or may not be shared across channels. It is intended to be used to store, for example, segments of a microphone array recording. The following diagrams illustrate some examples for MultiCut usage:
>>> 2-channel telephone recording with 2 supervisions, one for each channel (e.g., Switchboard):
             ╔══════════════════════════════  MultiCut  ═════════════════╗
             ║ ┌──────────────────────────┐                              ║
Channel 1  ──╬─│   Hello this is John.    │──────────────────────────────╬────────
             ║ └──────────────────────────┘                              ║
             ║                               ┌──────────────────────────┐║
Channel 2  ──╬───────────────────────────────│ Hey, John. How are you?  │╠────────
             ║                               └──────────────────────────┘║
             ╚═══════════════════════════════════════════════════════════╝
>>> Multi-array multi-microphone recording with shared supervisions (e.g., CHiME-6), along with close-talk microphones (A and B are distant arrays, C is close-talk):
      ╔═══════════════════════════════════════════════════════════════════════════╗
      ║ ┌───────────────────┐                         ┌───────────────────┐       ║
A-1 ──╬─┤                   ├─────────────────────────┤                   ├───────╬─
      ║ │ What did you do?  │                         │I cleaned my room. │       ║
A-2 ──╬─┤                   ├─────────────────────────┤                   ├───────╬─
      ║ └───────────────────┘  ┌───────────────────┐  └───────────────────┘       ║
B-1 ──╬────────────────────────┤Yeah, we were going├──────────────────────────────╬─
      ║                        │   to the mall.    │                              ║
B-2 ──╬────────────────────────┤                   ├──────────────────────────────╬─
      ║                        └───────────────────┘        ┌───────────────────┐ ║
  C ──╬─────────────────────────────────────────────────────┤      Right.       ├─╬─
      ║                                                     └───────────────────┘ ║
      ╚════════════════════════════════  MultiCut  ═══════════════════════════════╝
By definition, a MultiCut has the same attributes as a MonoCut. The key difference is that the Recording object has multiple channels, and the Supervision objects may correspond to any of these channels. The channels of the MultiCut can be a subset of the Recording channels, but must be a superset of the Supervision channels.
See also:
-
channel:
List[int]
- property num_channels: int
- load_features(channel=None)[source]
Load the features from the underlying storage and cut them to the relevant [begin, duration] region of the current MultiCut.
- Parameters:
channel (Union[int,List[int],None]) – The channel to load the features for. If None, all channels will be loaded. This is useful for the case when we have features extracted for each channel of the multi-cut, and we want to selectively load them.
- Return type:
Optional[ndarray]
- load_audio(channel=None)[source]
Load the audio by locating the appropriate recording in the supplied Recording. The audio is trimmed to the [begin, end] range specified by the MultiCut.
- Parameters:
channel (Union[int,List[int],None]) – optional int or list of int, the subset of channels to load (all by default).
- Return type:
Optional[ndarray]
- Returns:
a numpy ndarray with audio samples, with shape (C <channel>, N <samples>)
- load_video(channel=None, with_audio=True)[source]
Load the subset of video (and audio) from attached recording. The data is trimmed to the [begin, end] range specified by the MultiCut.
- Parameters:
channel (Union[int,List[int],None]) – optional int or list of int, the subset of channels to load (all by default).
with_audio (bool) – bool, whether to load and return audio alongside video. True by default.
- Return type:
Optional[Tuple[Tensor,Optional[Tensor]]]
- Returns:
a tuple of video tensor and optionally audio tensor (or None), or None if this cut has no video.
- reverb_rir(rir_recording=None, normalize_output=True, early_only=False, affix_id=True, rir_channels=(0,), room_rng_seed=None, source_rng_seed=None)[source]
Return a new MultiCut that will convolve the audio with the provided impulse response. If the rir_recording is multi-channel, the rir_channels argument determines which channels will be used. This list must be of the same length as the number of channels in the MultiCut.
If no rir_recording is provided, we will generate an impulse response using a fast random generator (https://arxiv.org/abs/2208.04101), only if the MultiCut has exactly one channel. At the moment we do not support simulation of multi-channel impulse responses.
- Parameters:
rir_recording (Union[Recording,DataCut,None]) – The impulse response to use for convolving.
normalize_output (bool) – When true, output will be normalized to have the same energy as the input.
early_only (bool) – When true, only the early reflections (first 50 ms) will be used.
affix_id (bool) – When true, we will modify the MultiCut.id field by affixing it with “_rvb”.
rir_channels (Sequence[int]) – The channels of the impulse response to use. First channel is used by default. If multiple channels are specified, this will produce a MixedCut instead of a MonoCut.
room_rng_seed (Optional[int]) – The seed for the room configuration.
source_rng_seed (Optional[int]) – The seed for the source positions.
- Return type:
- Returns:
a modified copy of the current MultiCut.
- merge_supervisions(merge_policy='delimiter', merge_channels=True, custom_merge_fn=None)[source]
Return a copy of the cut that has all of its supervisions merged into a single segment. The channel attribute of all the segments in this case will be set to the union of all channels. If merge_channels is set to False, the supervisions will be merged into a single segment per channel group. The channel attribute will not change in this case.
The new start is the start of the earliest supervision, and the new duration is a minimum spanning duration for all the supervisions. The text fields of all segments are concatenated with a whitespace.
- Parameters:
merge_policy (str) – one of “keep_first” or “delimiter”. If “keep_first”, we keep only the first segment’s field value, otherwise all string fields (including IDs) are prefixed with “cat#” and concatenated with a hash symbol “#”. This is also applied to custom fields. Fields with a None value are omitted.
merge_channels (bool) – If true, we will merge all supervisions into a single segment. If false, we will merge supervisions per channel group. Default: True.
custom_merge_fn (Optional[Callable[[str,Iterable[Any]],Any]]) – a function that will be called to merge custom field values. We expect custom_merge_fn to handle all possible custom keys. When not provided, we will treat all custom values as strings. It will be called roughly like: custom_merge_fn(custom_key, [s.custom[custom_key] for s in sups])
- Return type:
- with_channels(channels)[source]
Select specified channels from this cut. Supports extending to other channels available in the underlying Recording. If a single channel is provided, we’ll return a MonoCut, otherwise we’ll return a MultiCut.
- Return type:
DataCut
- static from_mono(*cuts)[source]
Convert one or more MonoCut to a MultiCut. If multiple mono cuts are provided, they must match in all fields except the channel. Each cut must have a distinct channel.
- Parameters:
cuts (DataCut) – the input cut(s).
- Return type:
- Returns:
a MultiCut with a single track.
- to_mono(mono_downmix=False)[source]
Convert a MultiCut to either a list of MonoCuts (one per channel) or a single MonoCut obtained by downmixing all channels.
- Parameters:
mono_downmix (bool) – If true, we will downmix all channels into a single MonoCut. If false, we will return a list of MonoCuts, one per channel.
- Return type:
Union[DataCut,List[DataCut]]
- Returns:
a list of MonoCuts or a single MonoCut.
- __init__(id, start, duration, channel, supervisions=<factory>, features=None, recording=None, custom=None)
- append(other, snr=None, preserve_id=None)
Append the other Cut after the current Cut. Conceptually the same as mix but with an offset matching the current cut’s length. Optionally scale down (positive SNR) or scale up (negative SNR) the other cut. Returns a MixedCut, which only keeps the information about the mix; actual mixing is performed during the call to load_features.
- Parameters:
preserve_id (Optional[str]) – optional string (“left”, “right”). When specified, append will preserve the cut ID of the left- or right-hand side argument. Otherwise, a new random ID is generated.
- Return type:
- attach_image(key, path_or_object)
Attach an image to this cut, wrapped in an Image class and stored under key in the custom dict.
The image can be specified as:
- A path to an image file
- A numpy array with shape (height, width, channels)
- Raw bytes of an image file
Example:
>>> cut = cut.attach_image('thumbnail', 'path/to/image.jpg')
>>> # Access the image later
>>> img_array = cut.load_thumbnail()  # Returns numpy array
- Parameters:
key (str) – The key to store the image under in the custom dict.
path_or_object (Union[str,ndarray,bytes]) – The image as a path, numpy array, or bytes.
- Return type:
DataCut
- Returns:
A new DataCut with the image attached.
- attach_tensor(name, data, frame_shift=None, temporal_dim=None, compressed=False)
Attach a tensor to this MonoCut, described with an Array manifest. The attached data is stored in-memory for later use, and can be accessed by calling cut.load_<name>() or cut.load_custom().
This is useful if you want actions such as truncate/pad to propagate to the tensor, e.g.:
>>> cut = MonoCut(id="c1", start=2, duration=8, ...)
>>> cut = cut.attach_tensor(
...     "alignment",
...     torch.tensor([0, 0, 0, ...]),
...     frame_shift=0.1,
...     temporal_dim=0,
... )
>>> half_alignment = cut.truncate(duration=4.0).load_alignment()
Note
This object can’t be stored in JSON/JSONL manifests anymore.
- Parameters:
name (str) – attribute under which the data can be found.
data (Union[ndarray,Tensor]) – PyTorch tensor or numpy array.
frame_shift (Optional[float]) – Optional float, when the array has a temporal dimension it indicates how much time has passed between the starts of consecutive frames (expressed in seconds).
temporal_dim (Optional[int]) – Optional int, when the array has a temporal dimension, it indicates which dim to interpret as temporal.
compressed (bool) – When True, we will apply lilcom compression to the array. Only applicable to arrays of floats.
- Return type:
- Returns:
- clip_amplitude(hard=False, gain_db=0.0, normalize=True, oversampling=2, affix_id=True)
Return a new DataCut that will lazily apply clipping while loading audio.
- Parameters:
hard (bool) – If True, apply hard clipping (sharp cutoff); otherwise, apply soft clipping (saturation).
gain_db (float) – The amount of gain in decibels to apply before clipping.
normalize (bool) – If True, normalize the input signal to 0 dBFS before applying clipping.
oversampling (Optional[int]) – If provided, we will oversample the input signal by the given integer factor before applying saturation and then downsample back to the original sampling rate.
affix_id (bool) – When true, we will modify the DataCut.id field by affixing it with “_cl{gain_db}”.
- Return type:
DataCut
- Returns:
a modified copy of the current DataCut.
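The difference between hard and soft clipping after applying gain_db can be illustrated on a single sample. This is a rough sketch under stated assumptions, not the transform Lhotse applies: clip_sample is a hypothetical helper, tanh is used as one common saturation curve (the text above does not say which curve is used), and the normalize and oversampling steps are omitted.

```python
import math


def clip_sample(x, gain_db=0.0, hard=False):
    """Apply gain (in dB) to one sample in [-1, 1], then clip it.

    Hypothetical helper for illustration only.
    """
    gained = x * 10.0 ** (gain_db / 20.0)   # convert dB to a linear amplitude factor
    if hard:
        return max(-1.0, min(1.0, gained))  # sharp cutoff at full scale
    return math.tanh(gained)                # assumed soft saturation curve
```

With gain_db=20.0 (a factor of 10), a sample of 0.5 is hard-clipped to exactly 1.0, while the soft variant approaches but never reaches full scale.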
- compress(codec='opus', compression_level=0.99, compress_custom_fields=False)
Return a copy of this Cut that has its Recordings processed by a lossy audio encoder.
- Parameters:
codec (Literal['opus','mp3','vorbis','gsm']) – The codec to use for compression. Supported codecs are “opus”, “mp3”, “vorbis”, “gsm”.
compression_level (float) – The level of compression (from 0.0 to 1.0, higher values correspond to higher compression).
compress_custom_fields (bool) – Whether to also compress any custom recording fields in the Cut.
- Return type:
DataCut
- Returns:
A modified DataCut containing audio processed by a codec
- compute_and_store_features(extractor, storage, augment_fn=None, *args, **kwargs)
Compute the features from this cut, store them on disk, and attach a feature manifest to this cut. This cut has to be able to load audio.
- Parameters:
extractor (FeatureExtractor) – a FeatureExtractor instance used to compute the features.
storage (FeaturesWriter) – a FeaturesWriter instance used to write the features to a storage. When the optional lilcom dependency is installed and on-disk size matters, LilcomChunkyWriter is the preferred backend.
augment_fn (Optional[Callable[[ndarray,int],ndarray]]) – an optional callable used for audio augmentation.
- Return type:
DataCut
- Returns:
a new MonoCut instance with a Features manifest attached to it.
- compute_features(extractor, augment_fn=None)
Compute the features from this cut. This cut has to be able to load audio.
- Parameters:
extractor (FeatureExtractor) – a FeatureExtractor instance used to compute the features.
augment_fn (Optional[Callable[[ndarray,int],ndarray]]) – optional WavAugmenter instance for audio augmentation.
- Return type:
ndarray
- Returns:
a numpy ndarray with the computed features.
- copy(**replace_attrs)
Returns a shallow copy of self, with specified attributes overwritten.
- Example:
>>> cut = MonoCut(id="old-id", ...)
>>> cut2 = cut.copy(id="new-id")
>>> assert cut.id == "old-id"
>>> assert cut2.id == "new-id"
- custom: Optional[Dict[str, Any]] = None
- cut_into_windows(duration, hop=None, keep_excessive_supervisions=True)
Return a list of shorter cuts, made by traversing this cut in windows of duration seconds by hop seconds.
The last window might have a shorter duration if there was not enough audio, so you might want to either filter or pad the results.
- Parameters:
duration (float) – Desired duration of the new cuts in seconds.
hop (Optional[float]) – Shift between the windows in the new cuts in seconds.
keep_excessive_supervisions (bool) – bool. When a cut is truncated in the middle of a supervision segment, should the supervision be kept.
- Return type:
CutSet
- Returns:
a list of cuts made from shorter duration windows.
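The windowing described above can be sketched as a computation of (offset, duration) pairs. This is an illustrative sketch in plain Python rather than Lhotse's implementation: window_bounds is a hypothetical helper, and it assumes that hop defaults to the window duration when it is not given.

```python
def window_bounds(cut_duration, duration, hop=None):
    """Return a list of (offset, window_duration) pairs, in seconds.

    Hypothetical helper; offsets are relative to the start of the cut.
    """
    hop = duration if hop is None else hop  # assumption: non-overlapping by default
    if hop <= 0 or duration <= 0:
        raise ValueError("duration and hop must be positive")
    windows = []
    i = 0
    while i * hop < cut_duration:
        offset = i * hop
        # The last window may be shorter if there is not enough audio left.
        windows.append((offset, min(duration, cut_duration - offset)))
        if offset + duration >= cut_duration:
            break  # this window already reaches the end of the cut
        i += 1
    return windows
```

For a 10 s cut and 4 s windows this yields windows at 0 s, 4 s and 8 s, where the last one is only 2 s long and is a candidate for filtering or padding.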
- cut_into_windows_balanced(min_duration, max_duration, overlap=0.0, keep_excessive_supervisions=True)
Return a list of shorter cuts made by splitting this cut into overlapping windows whose size is chosen within [min_duration, max_duration] to maximise the duration of the final (potentially shorter) window, thereby minimising padding.
Each resulting sub-cut carries two extra entries in its custom dict:
- "source_cut_id" – the id of this (parent) cut.
- "source_cut_start" – the start time of this cut within its recording. Downstream code can use this to detect whether the parent was the first window of a recording (source_cut_start == 0) or a later continuation.
- Parameters:
min_duration (float) – Minimum desired window duration in seconds.
max_duration (float) – Maximum desired window duration in seconds.
overlap (float) – Overlap between consecutive windows in seconds (default: 0).
keep_excessive_supervisions (bool) – When a window is truncated mid-supervision, should the supervision be kept.
- Return type:
CutSet
- Returns:
a CutSet of overlapping sub-cuts.
- dereverb_wpe(affix_id=True)
Return a new DataCut that will lazily apply WPE dereverberation.
- Parameters:
affix_id (bool) – When true, we will modify the DataCut.id field by affixing it with “_wpe”.
- Return type:
DataCut
- Returns:
a modified copy of the current DataCut.
- drop_alignments()
Return a copy of the current DataCut, detached from alignments.
- Return type:
DataCut
- drop_custom(name)
- drop_features()
Return a copy of the current DataCut, detached from features.
- Return type:
DataCut
- drop_in_memory_data()
Return a copy of the current DataCut, detached from any in-memory data. The manifests for in-memory data are converted into placeholders that can still be looked up for metadata, but will fail on attempts to load the data.
- Return type:
DataCut
- drop_recording()
Return a copy of the current DataCut, detached from recording.
- Return type:
DataCut
- drop_supervisions()
Return a copy of the current DataCut, detached from supervisions.
- Return type:
DataCut
- property end: float
- extend_by(*, duration, direction='both', preserve_id=False, pad_silence=True)
Returns a new Cut (DataCut or MixedCut) that is an extended region of the current DataCut by extending the cut by a fixed duration in the specified direction.
Note that no operation is done on the actual features or recording - it’s only during the call to DataCut.load_features() / DataCut.load_audio() when the actual changes happen (an extended version of features/audio is loaded).
Hint
This method extends a cut by a given duration, either to the left or to the right (or both), using the “real” content of the recording that the cut is part of. For example, a DataCut spanning the region from 2s to 5s in a recording, when extended by 2s to the right, will now span the region from 2s to 7s in the same recording (provided the recording length exceeds 7s). If the recording is shorter, additional silence will be padded to achieve the desired duration by default. This behavior can be changed by setting pad_silence=False. Also see DataCut.pad() which pads a cut “to” a specified length. To “truncate” a cut, use DataCut.truncate().
Hint
If pad_silence is set to False, then the cut will be extended only as much as allowed within the recording’s boundary.
Hint
If direction is “both”, the resulting cut will be extended by the specified duration in both directions. This is different from the usage in MonoCut.pad() where a padding equal to 0.5*duration is added to both sides.
- Parameters:
duration (float) – float (seconds), specifies the duration by which the cut should be extended.
direction (str) – string, ‘left’, ‘right’ or ‘both’. Determines whether to extend on the left, right, or both sides. If ‘both’, extend on both sides by the duration specified in duration.
preserve_id (bool) – bool. Should the extended cut keep the same ID or get a new, random one.
pad_silence (bool) – bool. Should the cut be padded with silence if the recording is shorter than the desired duration. If False, the cut will be extended only as much as allowed within the recording’s boundary.
- Return type:
- Returns:
a new MonoCut instance.
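The example from the hint above (a cut from 2s to 5s extended by 2s) can be worked through with a short sketch. It is a simplified illustration, not Lhotse's implementation: extend_bounds is a hypothetical helper operating on plain numbers, and it assumes that silence padding may be needed on either side when the requested span leaves the recording.

```python
def extend_bounds(start, duration, recording_duration, extend,
                  direction="both", pad_silence=True):
    """Return (new_start, new_end, pad_left, pad_right) in seconds.

    Hypothetical helper; start/end are positions within the recording,
    pad_left/pad_right are the amounts of silence that would be added.
    """
    left = extend if direction in ("left", "both") else 0.0
    right = extend if direction in ("right", "both") else 0.0
    wanted_start = start - left
    wanted_end = start + duration + right
    # Real recording content can only be taken from within its boundaries.
    new_start = max(0.0, wanted_start)
    new_end = min(recording_duration, wanted_end)
    if pad_silence:
        pad_left = new_start - wanted_start   # silence before the recording
        pad_right = wanted_end - new_end      # silence after the recording
    else:
        pad_left = pad_right = 0.0
    return new_start, new_end, pad_left, pad_right
```

In a 10 s recording, extending the 2s to 5s cut to the right by 2 s gives the 2s to 7s region with no padding; in a 6 s recording the same call would stop at 6 s and, with pad_silence=True, account for 1 s of padded silence.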
- features: Optional[Features] = None
- property features_type: str | None
- fill_supervision(add_empty=True, shrink_ok=False)
Fills the whole duration of a cut with a supervision segment.
If the cut has one supervision, its start is set to 0 and duration is set to cut.duration. Note: this may either expand a supervision that was shorter than a cut, or shrink a supervision that exceeds the cut.
If there are no supervisions, we will add an empty one when add_empty==True, otherwise we won’t change anything.
If there are two or more supervisions, we will raise an exception.
- Parameters:
add_empty (bool) – should we add an empty supervision with identical time bounds as the cut.
shrink_ok (bool) – whether shrinking a supervision is acceptable; when False, we raise an error if a supervision would be shrunk as a result of calling this method.
- Return type:
DataCut
- filter_supervisions(predicate)
Return a copy of the cut that only has supervisions accepted by predicate.
Example:
>>> cut = cut.filter_supervisions(lambda s: s.id in supervision_ids)
>>> cut = cut.filter_supervisions(lambda s: s.duration < 5.0)
>>> cut = cut.filter_supervisions(lambda s: s.text is not None)
- Parameters:
predicate (Callable[[SupervisionSegment],bool]) – A callable that accepts SupervisionSegment and returns bool
- Return type:
DataCut
- Returns:
a modified MonoCut
- property frame_shift: float | None
- has(field)
- Return type:
bool
- has_custom(name)
Check if the Cut has a custom attribute with name name.
- Parameters:
name (str) – name of the custom attribute.
- Return type:
bool
- Returns:
a boolean.
- property has_features: bool
- property has_overlapping_supervisions: bool
- property has_recording: bool
- property has_video: bool
- index_supervisions(index_mixed_tracks=False, keep_ids=None)
Create a two-level index of supervision segments. It is a mapping from a Cut’s ID to an interval tree that contains the supervisions of that Cut.
The interval tree can be efficiently queried for overlapping and/or enveloping segments. It helps speed up some operations on Cuts of very long recordings (1h+) that contain many supervisions.
- Parameters:
index_mixed_tracks (bool) – Should the tracks of MixedCuts be indexed as additional, separate entries.
keep_ids (Optional[Set[str]]) – If specified, we will only index the supervisions with the specified IDs.
- Return type:
Dict[str,IntervalTree]
- Returns:
a mapping from Cut ID to an interval tree of SupervisionSegments.
- property is_in_memory: bool
- iter_data()
Iterate over each data piece attached to this cut. Returns a generator yielding tuples of (key, manifest), where key is the name of the attribute under which manifest is found. manifest is of type Recording, Features, TemporalArray, Array, or Image.
For example, if key is recording, then manifest is self.recording.
- load_custom(name, **kwargs)
Load custom data as numpy array. The custom data is expected to have been stored in the cut’s custom field as an Array, TemporalArray, or Image manifest.
Note
It works with Array/Image manifests stored via attribute assignments, e.g.: cut.my_custom_data = Array(...) or cut = cut.attach_image('img', ...).
- Parameters:
name (str) – name of the custom attribute.
- Return type:
ndarray
- Returns:
a numpy array with the data.
- map_supervisions(transform_fn)
Return a copy of the cut that has its supervisions transformed by transform_fn.
- Parameters:
transform_fn (Callable[[SupervisionSegment],SupervisionSegment]) – a function that takes a supervision as an argument and returns the modified supervision.
- Return type:
DataCut
- Returns:
a modified MonoCut.
- mix(other, offset_other_by=0.0, allow_padding=False, snr=None, preserve_id=None, tag=None)
Refer to the lhotse.cut.mix() documentation.
- Return type:
- move_to_memory(audio_format='flac', load_audio=True, load_features=True, load_custom=True)
Load data (audio, features, or custom arrays) into memory and attach them to a copy of the manifest. This is useful when you want to store cuts together with the actual data in some binary format that enables sequential data reads.
Audio is encoded with audio_format (compatible with torchaudio.save), floating point features are encoded with lilcom, and other arrays are pickled.
- Return type:
- narrowband(codec, restore_orig_sr=True, affix_id=True)
Return a new DataCut that will lazily apply a narrowband effect.
- Parameters:
codec (str) – Codec name.
restore_orig_sr (bool) – Restore original sampling rate.
affix_id (bool) – When true, we will modify the DataCut.id field by affixing it with “_nb_{codec}”.
- Return type:
DataCut
- Returns:
a modified copy of the current DataCut.
- normalize_loudness(target, affix_id=False, **kwargs)
Return a new DataCut that will lazily apply loudness normalization.
- Parameters:
target (float) – The target loudness in dBFS.
affix_id (bool) – When true, we will modify the DataCut.id field by affixing it with “_ln{target}”.
- Return type:
DataCut
- Returns:
a modified copy of the current DataCut.
- property num_features: int | None
- property num_frames: int | None
- property num_samples: int | None
- pad(duration=None, num_frames=None, num_samples=None, pad_feat_value=-23.025850929940457, direction='right', preserve_id=False, pad_value_dict=None)
Return a new MixedCut, padded with zeros in the recording, and pad_feat_value in each feature bin.
The user can choose to pad either to a specific duration; a specific number of frames num_frames; or a specific number of samples num_samples. The three arguments are mutually exclusive.
- Parameters:
duration (Optional[float]) – The cut’s minimal duration after padding.
num_frames (Optional[int]) – The cut’s total number of frames after padding.
num_samples (Optional[int]) – The cut’s total number of samples after padding.
pad_feat_value (float) – A float value that’s used for padding the features. By default we assume a log-energy floor of approx. -23 (1e-10 after exp).
direction (str) – string, ‘left’, ‘right’ or ‘both’. Determines whether the padding is added before the cut, after the cut, or on both sides.
preserve_id (bool) – When True, preserves the cut ID before padding. Otherwise, a new random ID is generated for the padded cut (default).
pad_value_dict (Optional[Dict[str,Union[int,float]]]) – Optional dict that specifies what value should be used for padding arrays in custom attributes.
- Return type:
- Returns:
a padded MixedCut if duration is greater than this cut’s duration, otherwise self.
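The default pad_feat_value of -23.025850929940457 is the natural logarithm of 1e-10, as the parameter description notes. The following standalone sketch shows that relationship and pads a toy feature matrix. It is an illustration only: pad_frames is a hypothetical helper working on non-empty lists of lists, and for direction “both” it assumes that an odd leftover frame goes to the right side.

```python
import math

# Log-energy floor used as the default padding value: log(1e-10).
LOG_FLOOR = math.log(1e-10)


def pad_frames(frames, num_frames, pad_feat_value=LOG_FLOOR, direction="right"):
    """Pad a non-empty list of feature frames (lists of floats) to num_frames.

    Hypothetical helper for illustration only.
    """
    missing = num_frames - len(frames)
    if missing <= 0:
        return frames  # already long enough, nothing to pad
    num_bins = len(frames[0])
    if direction == "right":
        left = 0
    elif direction == "left":
        left = missing
    elif direction == "both":
        left = missing // 2  # assumption: an odd extra frame is added on the right
    else:
        raise ValueError(f"unsupported direction: {direction}")

    def filler(n):
        # n frames where every feature bin holds the padding value
        return [[pad_feat_value] * num_bins for _ in range(n)]

    return filler(left) + frames + filler(missing - left)
```

Applying exp to the padded bins recovers a value of about 1e-10, that is, near-silent energy rather than zeros in the log domain.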
- perturb_speed(factor, affix_id=True)
Return a new DataCut that will lazily perturb the speed while loading audio. The num_samples, start and duration fields are updated to reflect the shrinking/extending effect of speed. We are also updating the time markers of the underlying Recording and the supervisions.
- Parameters:
factor (float) – The speed will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).
affix_id (bool) – When true, we will modify the MonoCut.id field by affixing it with “_sp{factor}”.
- Return type:
DataCut
- Returns:
a modified copy of the current DataCut.
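How the time fields shrink or extend under speed perturbation can be sketched as a simple division by factor. This is a simplified illustration and not Lhotse's code: perturb_speed_times is a hypothetical helper, and it ignores the rounding to whole samples that an actual implementation would need.

```python
def perturb_speed_times(cut_id, start, duration, factor, affix_id=True):
    """Return (new_id, new_start, new_duration) after changing the speed.

    Hypothetical helper; times are in seconds.
    """
    # Playing the audio `factor` times faster divides every time marker by factor.
    new_id = f"{cut_id}_sp{factor}" if affix_id else cut_id
    return new_id, start / factor, duration / factor
```

For example, a 10 s cut starting at 4 s becomes a 5 s cut starting at 2 s with factor=2.0, and a 20 s cut starting at 8 s with factor=0.5.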
- perturb_tempo(factor, affix_id=True)
Return a new DataCut that will lazily perturb the tempo while loading audio.
Compared to speed perturbation, tempo preserves pitch. The num_samples, start and duration fields are updated to reflect the shrinking/extending effect of speed. We are also updating the time markers of the underlying Recording and the supervisions.
- Parameters:
factor (float) – The tempo will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).
affix_id (bool) – When true, we will modify the MonoCut.id field by affixing it with “_tp{factor}”.
- Return type:
DataCut
- Returns:
a modified copy of the current DataCut.
- perturb_volume(factor, affix_id=True)
Return a new DataCut that will lazily perturb the volume while loading audio.
- Parameters:
factor (float) – The volume will be adjusted this many times (e.g. factor=1.1 means 1.1x louder).
affix_id (bool) – When true, we will modify the DataCut.id field by affixing it with “_vp{factor}”.
- Return type:
DataCut
- Returns:
a modified copy of the current DataCut.
- play_audio()
Display a Jupyter widget that lets you listen to the waveform. Works only in Jupyter notebook/lab or similar (e.g. Colab).
- plot_alignment(alignment_type='word')
Display the alignment on top of a spectrogram. Requires matplotlib to be installed.
- plot_audio(ax=None, **kwargs)
Display a plot of the waveform. Requires matplotlib to be installed.
- plot_features()
Display the feature matrix as an image. Requires matplotlib to be installed.
- recording: Optional[Recording] = None
- property recording_id: str
- resample(sampling_rate, affix_id=False, recording_field=None)
Return a new DataCut that will lazily resample the audio while reading it. This operation will drop the feature manifest, if attached. It does not affect the supervision.
- Parameters:
sampling_rate (int) – The new sampling rate.
affix_id (bool) – Should we modify the ID (useful if both versions of the same cut are going to be present in a single manifest).
recording_field (Optional[str]) – which recording field to resample.
- Return type:
DataCut
- Returns:
a modified copy of the current DataCut.
- property sampling_rate: int
- save_audio(storage_path, format=None, encoding=None, augment_fn=None, **kwargs)
Store this cut’s waveform as audio recording to disk.
- Parameters:
storage_path (Union[Path,str]) – The path to the location where we will store the audio recordings.
format (Optional[str]) – Audio format argument supported by torchaudio.save or soundfile.write. Please refer to the relevant library’s documentation depending on which audio backend you’re using.
encoding (Optional[str]) – Audio encoding argument supported by torchaudio.save or soundfile.write. Please refer to the relevant library’s documentation depending on which audio backend you’re using.
augment_fn (Optional[Callable[[ndarray,int],ndarray]]) – an optional callable used for audio augmentation. Be careful with the types of augmentations used: if they modify the start/end/duration times of the cut and its supervisions, you will end up with incorrect supervision information when using this API. E.g. for speed perturbation, use CutSet.perturb_speed() instead.
kwargs – additional arguments passed to Cut.load_audio(). For example, if saving a MixedCut, we can specify mono_downmix=True to downmix the tracks to mono before saving.
- Return type:
- Returns:
a new Cut instance.
- speakers_audio_mask(min_speaker_dim=None, speaker_to_idx_map=None, use_alignment_if_exists=None)
Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_samples), and its values are 0 for nonspeech samples and 1 for speech samples for each respective speaker.
This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272
- Parameters:
min_speaker_dim (Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).
speaker_to_idx_map (Optional[Dict[str,int]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2).
use_alignment_if_exists (Optional[str]) – optional str, key for alignment type to use for generating the mask. If it does not exist, fall back on supervision time spans.
- Return type:
ndarray
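The mask layout described above can be illustrated with a small standard-library sketch. This is not lhotse's implementation: the helper speaker_activity_mask, the (speaker, start, end) tuples, and the rounding of times to sample indices are assumptions made for illustration, and plain lists stand in for the ndarray.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a supervision: (speaker, start, end) in seconds.
Segment = Tuple[str, float, float]


def speaker_activity_mask(
    segments: List[Segment],
    duration: float,
    sampling_rate: int,
    speaker_to_idx: Dict[str, int],
    min_speaker_dim: int = 0,
) -> List[List[int]]:
    """Build a (num_speakers, num_samples) 0/1 matrix as nested lists."""
    num_samples = round(duration * sampling_rate)
    # Enforce at least min_speaker_dim rows, even if fewer speakers are present.
    num_rows = max(min_speaker_dim, max(speaker_to_idx.values()) + 1)
    mask = [[0] * num_samples for _ in range(num_rows)]
    for speaker, start, end in segments:
        row = mask[speaker_to_idx[speaker]]
        first = max(0, round(start * sampling_rate))
        last = min(num_samples, round(end * sampling_rate))
        for i in range(first, last):
            row[i] = 1
    return mask


mask = speaker_activity_mask(
    segments=[("A", 0.0, 0.5), ("B", 0.25, 1.0)],
    duration=1.0,
    sampling_rate=8,
    speaker_to_idx={"A": 0, "B": 1},
    min_speaker_dim=4,
)
```

Rows 2 and 3 stay all-zero here: they exist only because min_speaker_dim=4 forces the matrix to have four rows.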
- speakers_feature_mask(min_speaker_dim=None, speaker_to_idx_map=None, use_alignment_if_exists=None)
Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_frames), and its values are 0 for nonspeech frames and 1 for speech frames for each respective speaker.
This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272
- Parameters:
min_speaker_dim (
Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).
speaker_to_idx_map (
Optional[Dict[str,int]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2).
use_alignment_if_exists (
Optional[str]) – optional str, key for alignment type to use for generating the mask. If it does not exist, falls back to the supervision time spans.
- Return type:
ndarray
- split(timestamp)
-
Split a cut into two cuts at timestamp, which is measured from the start of the cut. For example, a [0s - 10s] cut split at 4s yields:
left cut [0s - 4s]
right cut [4s - 10s]
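The interval arithmetic behind the example above can be sketched without lhotse. The function split_interval and the (start, end) tuple are hypothetical stand-ins for a cut, and rejecting a timestamp outside the interval is an assumption rather than documented behaviour.

```python
from typing import Tuple

# Hypothetical stand-in for a cut: (start, end) in seconds.
Interval = Tuple[float, float]


def split_interval(interval: Interval, timestamp: float) -> Tuple[Interval, Interval]:
    """Split an interval at `timestamp`, measured from the interval's start."""
    start, end = interval
    split_point = start + timestamp
    # Assumption: a split point on or outside the boundaries is an error.
    if not start < split_point < end:
        raise ValueError("timestamp must fall strictly inside the interval")
    return (start, split_point), (split_point, end)


left, right = split_interval((0.0, 10.0), 4.0)
```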
- supervisions_audio_mask(use_alignment_if_exists=None)
Return a 1D numpy array with value 1 for samples covered by at least one supervision, and 0 for samples not covered by any supervision.
- Parameters:
use_alignment_if_exists (
Optional[str]) – optional str, key for alignment type to use for generating the mask. If it does not exist, falls back to the supervision time spans.
- Return type:
ndarray
- supervisions_feature_mask(use_alignment_if_exists=None)
Return a 1D numpy array with value 1 for frames covered by at least one supervision, and 0 for frames not covered by any supervision.
- Parameters:
use_alignment_if_exists (
Optional[str]) – optional str, key for alignment type to use for generating the mask. If it does not exist, falls back to the supervision time spans.
- Return type:
ndarray
- to_dict()
- Return type:
dict
- trim_to_alignments(type, max_pause=None, max_segment_duration=None, delimiter=' ', keep_all_channels=False)
Splits the current Cut into its constituent alignment items (AlignmentItem). These cuts have identical start times and durations as the alignment item. Additionally, the max_pause option can be used to merge alignment items that are separated by a pause shorter than max_pause. If max_segment_duration is specified, we will keep merging consecutive segments until the duration of the merged segment exceeds max_segment_duration.
For the case of a multi-channel cut with multiple alignments, we can either trim while respecting the supervision channels (in which case output cut has the same channels as the supervision) or ignore the channels (in which case output cut has the same channels as the input cut).
Hint
If the resulting trimmed cut contains a single supervision, we set the cut id to the id of this supervision, for better compatibility with downstream tools, e.g. comparing the hypothesis of ASR with the reference in icefall.
Hint
If a MultiCut is trimmed and the resulting trimmed cut contains a single channel, we convert it to a MonoCut.
Hint
If you have a Cut with multiple supervision segments and you want to trim it to the word-level alignment, you can use the Cut.merge_supervisions() method first to merge the supervisions into a single one, followed by the Cut.trim_to_alignments() method. For example:
>>> cut = cut.merge_supervisions(type='word', delimiter=' ')
>>> cut = cut.trim_to_alignments(type='word', max_pause=1.0)
Hint
The above technique can also be used to segment long cuts into roughly equal duration segments, while respecting alignment boundaries. For example, to split a Cut into 10s segments, you can do:
>>> cut = cut.merge_supervisions(type='word', delimiter=' ')
>>> cut = cut.trim_to_alignments(type='word', max_pause=10.0, max_segment_duration=10.0)
- Parameters:
type (
str) – The type of the alignment to trim to (e.g. “word”).
max_pause (
Optional[float]) – The maximum pause allowed between the alignments to merge them. If None, no merging will be performed. [default: None]
delimiter (
str) – The delimiter to use when joining the alignment items.
keep_all_channels (
bool) – If True, the output cut will have the same channels as the input cut. By default, the trimmed cut will have the same channels as the supervision.
num_jobs – Number of parallel workers to process the cuts.
- Return type:
CutSet
- Returns:
a CutSet object.
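The interplay of max_pause and max_segment_duration can be sketched as a greedy merge over alignment items. This is a plain-Python illustration, not lhotse's algorithm: merge_alignment_items and the (symbol, start, end) tuples are hypothetical, and the sketch assumes that a merge is skipped whenever it would push the segment past max_segment_duration.

```python
from typing import List, Optional, Tuple

# Hypothetical stand-in for an alignment item: (symbol, start, end) in seconds.
Item = Tuple[str, float, float]


def merge_alignment_items(
    items: List[Item],
    max_pause: Optional[float] = None,
    max_segment_duration: Optional[float] = None,
    delimiter: str = " ",
) -> List[Item]:
    """Greedily merge consecutive items separated by short pauses."""
    merged: List[Item] = []
    for symbol, start, end in items:
        if merged and max_pause is not None:
            prev_symbol, prev_start, prev_end = merged[-1]
            short_pause = start - prev_end <= max_pause
            # Assumed rule: do not grow a segment beyond the duration limit.
            fits = (
                max_segment_duration is None
                or end - prev_start <= max_segment_duration
            )
            if short_pause and fits:
                merged[-1] = (prev_symbol + delimiter + symbol, prev_start, end)
                continue
        merged.append((symbol, start, end))
    return merged


words = [("hello", 0.0, 0.5), ("world", 0.75, 1.0), ("again", 3.0, 3.5)]
unmerged = merge_alignment_items(words)                # max_pause=None: no merging
by_pause = merge_alignment_items(words, max_pause=1.0)
```

With max_pause=1.0 the first two words are joined (the pause is 0.25 s), while "again" starts 2 s later and stays on its own.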
- trim_to_supervision_groups(max_pause=0.0)
Return a new CutSet with Cuts based on supervision groups. A supervision group is a set of supervisions with no gaps between them (or gaps shorter than max_pause). This is similar to the concept of an utterance group as described in this paper: https://arxiv.org/abs/2211.00482
For example, the following cut:
Cut
╔═════════════════════════════════════════════════════════════════════════════════╗
║┌──────────────────────┐                              ┌────────┐                 ║
║│ Hello this is John.  │                              │   Hi   │                 ║
║└──────────────────────┘                              └────────┘                 ║
║            ┌──────────────────────────────────┐            ┌───────────────────┐║
║            │     Hey, John. How are you?      │            │  What do you do?  │║
║            └──────────────────────────────────┘            └───────────────────┘║
╚═════════════════════════════════════════════════════════════════════════════════╝
is transformed into two cuts:
Cut 1                                                 Cut 2
╔════════════════════════════════════════════════╗    ╔═══════════════════════════╗
║┌──────────────────────┐                        ║    ║┌────────┐                 ║
║│ Hello this is John.  │                        ║    ║│   Hi   │                 ║
║└──────────────────────┘                        ║    ║└────────┘                 ║
║            ┌──────────────────────────────────┐║    ║      ┌───────────────────┐║
║            │     Hey, John. How are you?      │║    ║      │  What do you do?  │║
║            └──────────────────────────────────┘║    ║      └───────────────────┘║
╚════════════════════════════════════════════════╝    ╚═══════════════════════════╝
For the case of a multi-channel cut with multiple supervisions, we keep all the channels in the recording.
- Parameters:
max_pause (
float) – An optional duration in seconds; if the gap between two supervisions is longer than this, they will be treated as separate groups. By default, this is set to 0.0, which means that no gaps are allowed between supervisions.
- Return type:
CutSet
- Returns:
a CutSet.
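The grouping rule can be sketched over plain (start, end) spans. The helper group_spans is hypothetical and not part of lhotse; treating a gap exactly equal to max_pause as still belonging to the same group is an assumption, since the description only states what happens for longer gaps.

```python
from typing import List, Tuple

# Hypothetical stand-in for a supervision: (start, end) in seconds.
Span = Tuple[float, float]


def group_spans(spans: List[Span], max_pause: float = 0.0) -> List[List[Span]]:
    """Group spans whose gap to the running group end is at most max_pause."""
    groups: List[List[Span]] = []
    group_end = 0.0
    for span in sorted(spans):
        if groups and span[0] - group_end <= max_pause:
            groups[-1].append(span)
            # Overlapping spans may end earlier than the group already does.
            group_end = max(group_end, span[1])
        else:
            groups.append([span])
            group_end = span[1]
    return groups


spans = [(0.0, 3.0), (2.0, 6.0), (8.0, 9.0), (8.5, 11.0)]
groups = group_spans(spans)
```

The four spans mirror the diagram: two overlapping supervisions, a 2 s silence, then two more overlapping supervisions, which yields two groups with the default max_pause=0.0.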
- trim_to_supervisions(keep_overlapping=True, min_duration=None, context_direction='center', keep_all_channels=False)
Splits the current Cut into as many cuts as there are supervisions (SupervisionSegment). These cuts have identical start times and durations as the supervisions. When there are overlapping supervisions, they can be kept or discarded via keep_overlapping flag.
For example, the following cut:
Cut   |-----------------|
Sup1  |----|
Sup2     |-----------|
is transformed into two cuts:
Cut1  |----|
Sup1  |----|
Sup2     |-|
Cut2     |-----------|
Sup1     |-|
Sup2     |-----------|
For the case of a multi-channel cut with multiple supervisions, we can either trim while respecting the supervision channels (in which case output cut has the same channels as the supervision) or ignore the channels (in which case output cut has the same channels as the input cut).
Hint
If the resulting trimmed cut contains a single supervision, we set the cut id to the id of this supervision, for better compatibility with downstream tools, e.g. comparing the hypothesis of ASR with the reference in icefall.
Hint
If a MultiCut is trimmed and the resulting trimmed cut contains a single channel, we convert it to a MonoCut.
- Parameters:
keep_overlapping (
bool) – when False, it will discard parts of other supervisions that overlap with the main supervision. In the illustration above, it would discard Sup2 in Cut1 and Sup1 in Cut2. In this mode, we guarantee that there will always be exactly one supervision per cut.
min_duration (
Optional[float]) – An optional duration in seconds; specifying this argument will extend the cuts that would have been shorter than min_duration with actual acoustic context in the recording/features. If there are supervisions present in the context, they are kept when keep_overlapping is true. If there is not enough context, the returned cut will be shorter than min_duration. If the supervision segment is longer than min_duration, the returned cut will be longer.
context_direction (
Literal['center','left','right','random']) – Which direction should the cut be expanded towards to include context. The value of “center” implies equal expansion to left and right; “random” uniformly samples a value between “left” and “right”.
keep_all_channels (
bool) – If True, the output cut will have the same channels as the input cut. By default, the trimmed cut will have the same channels as the supervision.
- Return type:
CutSet
- Returns:
a list of cuts.
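How min_duration and context_direction interact can be sketched on a single (start, end) span. The function expand_to_min_duration is hypothetical and omits the 'random' direction; clamping to the recording boundaries is an assumption that matches the statement that a cut may stay shorter than min_duration when there is not enough context.

```python
from typing import Tuple


def expand_to_min_duration(
    start: float,
    end: float,
    min_duration: float,
    recording_duration: float,
    direction: str = "center",
) -> Tuple[float, float]:
    """Extend a span with surrounding context up to min_duration."""
    missing = min_duration - (end - start)
    if missing <= 0:
        return start, end  # already long enough, left untouched
    if direction == "center":
        left = right = missing / 2
    elif direction == "left":
        left, right = missing, 0.0
    elif direction == "right":
        left, right = 0.0, missing
    else:
        raise ValueError(f"unsupported direction in this sketch: {direction}")
    # Clamp to the recording; the result may be shorter than min_duration.
    return max(0.0, start - left), min(recording_duration, end + right)


centered = expand_to_min_duration(4.0, 6.0, min_duration=4.0, recording_duration=10.0)
no_context = expand_to_min_duration(0.0, 2.0, 4.0, 10.0, direction="left")
```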
- property trimmed_supervisions: List[SupervisionSegment]
Return the supervisions in this Cut that have modified time boundaries so as not to exceed the Cut’s start or end.
Note that when cut.supervisions is called, the supervisions may have negative start values that indicate the supervision actually begins before the cut, or end values that exceed the Cut’s duration (it means the supervision continued in the original recording after the Cut’s ending).
Caution
For some tasks such as speech recognition (ASR), trimmed supervisions could result in corrupted training data. This is because a part of the transcript might actually reside outside of the cut.
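The boundary adjustment described above amounts to clipping each supervision to the cut's extent. A minimal sketch, assuming supervision times are expressed relative to the cut start; clip_to_cut is a hypothetical helper, not a lhotse function.

```python
from typing import Tuple


def clip_to_cut(start: float, end: float, cut_duration: float) -> Tuple[float, float]:
    """Clip a supervision span (relative to the cut) to [0, cut_duration]."""
    return max(0.0, start), min(cut_duration, end)


# A supervision that begins 1.5 s before the cut starts.
begins_early = clip_to_cut(-1.5, 3.0, cut_duration=10.0)
# A supervision that continues 2.5 s past the end of the cut.
ends_late = clip_to_cut(8.0, 12.5, cut_duration=10.0)
```

In both cases part of the transcript belongs to audio outside the clipped span, which is the source of the risk noted in the caution.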
- truncate(*, offset=0.0, duration=None, keep_excessive_supervisions=True, preserve_id=False, _supervisions_index=None)
Returns a new MonoCut that is a sub-region of the current DataCut.
Note that no operation is done on the actual features or recording - it’s only during the call to DataCut.load_features() / DataCut.load_audio() when the actual changes happen (a subset of features/audio is loaded).
Hint
To extend a cut by a fixed duration, use the DataCut.extend_by() method.
- Parameters:
offset (
float) – float (seconds), controls the start of the new cut relative to the current DataCut’s start. E.g., if the current DataCut starts at 10.0, and offset is 2.0, the new start is 12.0.
duration (
Optional[float]) – optional float (seconds), controls the duration of the resulting DataCut. By default, the duration is (end of the cut before truncation) - (offset).
keep_excessive_supervisions (
bool) – bool. Since trimming may happen inside a SupervisionSegment, the caller has an option to either keep or discard such supervisions.
preserve_id (
bool) – bool. Should the truncated cut keep the same ID or get a new, random one.
_supervisions_index (
Optional[Dict[str,IntervalTree]]) – an IntervalTree; when passed, it speeds up processing of Cuts with a very large number of supervisions. Intended as an internal parameter.
- Return type:
DataCut
- Returns:
a new MonoCut instance. If the current DataCut is shorter than the duration, return None.
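The offset and duration arithmetic can be checked with a small sketch. The function truncated_region is hypothetical; it only computes the new start and duration and performs none of the supervision handling.

```python
from typing import Optional, Tuple


def truncated_region(
    cut_start: float,
    cut_duration: float,
    offset: float = 0.0,
    duration: Optional[float] = None,
) -> Tuple[float, float]:
    """Return (new_start, new_duration) of the truncated sub-region."""
    new_start = cut_start + offset
    cut_end = cut_start + cut_duration
    # Default: keep everything from the new start to the old end.
    new_duration = cut_end - new_start if duration is None else duration
    return new_start, new_duration


# A cut starting at 10.0 s and lasting 5.0 s, truncated with offset=2.0.
default_length = truncated_region(10.0, 5.0, offset=2.0)
fixed_length = truncated_region(10.0, 5.0, offset=2.0, duration=1.5)
```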
- unmix(tag=None)
Return this cut as a single-item list.
This is a compatibility no-op for cut types that are not MixedCut, so callers can uniformly invoke cut.unmix() regardless of the concrete cut type.
- Parameters:
tag (
Optional[str]) – Ignored for non-mixed cuts.
- Return type:
List[Cut]
- Returns:
A single-item list containing self.
- with_custom(name, value)
Return a copy of this object with an extra custom field assigned to it.
- with_features_path_prefix(path)
- Return type:
DataCut
- with_recording_path_prefix(path)
- Return type:
DataCut
- id: str
- start: Seconds
- duration: Seconds
- supervisions: List[SupervisionSegment]
- phone: Callable
- class lhotse.cut.PaddingCut(id, duration, sampling_rate, feat_value, num_frames=None, num_features=None, frame_shift=None, num_samples=None, video=None, custom=None)[source]
PaddingCut is a dummy Cut that doesn’t refer to actual recordings or features – it simply returns zero samples in the time domain and a specified features value in the feature domain. Its main role is to be appended to other cuts to make them evenly sized.
Please refer to the documentation of Cut to learn more about using cuts.
See also:
-
id:
str
-
duration:
float
-
sampling_rate:
int
-
feat_value:
float
-
num_frames:
Optional[int] = None
-
num_features:
Optional[int] = None
-
frame_shift:
Optional[float] = None
-
num_samples:
Optional[int] = None
-
custom:
Optional[dict] = None
- property start: float
- property supervisions
- property channel: int
- property has_features: bool
- property has_recording: bool
- property has_video: bool
- property num_channels: int
- property is_in_memory: bool
- property recording_id: str
- truncate(*, offset=0.0, duration=None, keep_excessive_supervisions=True, preserve_id=False, **kwargs)[source]
- Return type:
- extend_by(*, duration, direction='both', preserve_id=False, pad_silence=True)[source]
Return a new PaddingCut with region extended by the specified duration.
- Parameters:
duration (
float) – The duration by which to extend the cut.
direction (
str) – string, ‘left’, ‘right’ or ‘both’. Determines whether the cut should be extended to the left, right or both sides. By default, the cut is extended by the specified duration on both sides.
preserve_id (
bool) – When True, preserves the cut ID from before padding. Otherwise, generates a new random ID (default).
pad_silence (
bool) – See usage in lhotse.cut.MonoCut.extend_by(). It is ignored here.
- Return type:
- Returns:
an extended PaddingCut.
- pad(duration=None, num_frames=None, num_samples=None, pad_feat_value=-23.025850929940457, direction='right', preserve_id=False, pad_value_dict=None)[source]
Return a new MixedCut, padded with zeros in the recording, and pad_feat_value in each feature bin.
The user can choose to pad either to a specific duration; a specific number of frames num_frames; or a specific number of samples num_samples. The three arguments are mutually exclusive.
- Parameters:
duration (
Optional[float]) – The cut’s minimal duration after padding.
num_frames (
Optional[int]) – The cut’s total number of frames after padding.
num_samples (
Optional[int]) – The cut’s total number of samples after padding.
pad_feat_value (
float) – A float value that’s used for padding the features. By default we assume a log-energy floor of approx. -23 (1e-10 after exp).
direction (
str) – string, ‘left’, ‘right’ or ‘both’. Determines whether the padding is added before or after the cut.
preserve_id (
bool) – When True, preserves the cut ID from before padding. Otherwise, generates a new random ID (default).
pad_value_dict (
Optional[Dict[str,Union[int,float]]]) – Optional dict that specifies what value should be used for padding arrays in custom attributes.
- Return type:
- Returns:
a padded MixedCut if duration is greater than this cut’s duration, otherwise self.
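The default pad_feat_value of -23.025850929940457 is the natural logarithm of 1e-10, which is what "1e-10 after exp" refers to. The sketch below checks that relationship and pads a toy feature matrix. The names ENERGY_FLOOR, PAD_FEAT_VALUE and pad_frames are hypothetical, nested lists stand in for the feature array, and only the 'left' and 'right' directions are covered.

```python
import math
from typing import List

ENERGY_FLOOR = 1e-10  # assumed floor, taken from the "1e-10 after exp" remark
PAD_FEAT_VALUE = math.log(ENERGY_FLOOR)


def pad_frames(
    frames: List[List[float]],
    num_frames: int,
    pad_value: float = PAD_FEAT_VALUE,
    direction: str = "right",
) -> List[List[float]]:
    """Pad a (num_frames, num_features) matrix with constant-valued frames."""
    missing = num_frames - len(frames)
    if missing <= 0:
        return frames  # nothing to pad, return the input unchanged
    num_features = len(frames[0])
    padding = [[pad_value] * num_features for _ in range(missing)]
    return padding + frames if direction == "left" else frames + padding


frames = [[-1.0, -2.0], [-3.0, -4.0]]
padded = pad_frames(frames, num_frames=4)
```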
- resample(sampling_rate, affix_id=False, recording_field=None)[source]
Return a new PaddingCut that will “mimic” the effect of resampling on sampling_rate, duration, and num_samples.
- Parameters:
sampling_rate (
int) – The new sampling rate.
affix_id (
bool) – Should we modify the ID (useful if both versions of the same cut are going to be present in a single manifest).
recording_field (
Optional[str]) – which recording field to resample. Ignored, present for interface compatibility.
- Return type:
- Returns:
a modified copy of the current PaddingCut.
- perturb_speed(factor, affix_id=True)[source]
Return a new PaddingCut that will “mimic” the effect of speed perturbation on duration and num_samples.
- Parameters:
factor (
float) – The speed will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).
affix_id (
bool) – When true, we will modify the PaddingCut.id field by affixing it with “_sp{factor}”.
- Return type:
- Returns:
a modified copy of the current PaddingCut.
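What "mimicking" speed perturbation means for duration and num_samples can be sketched with simple arithmetic. This is an assumption-laden illustration, not lhotse's code: perturbed_length is a hypothetical helper, and it assumes the sample count is divided by factor and rounded half up, with the duration derived from the new sample count.

```python
from typing import Tuple


def perturbed_length(
    num_samples: int, sampling_rate: int, factor: float
) -> Tuple[int, float]:
    """Return (num_samples, duration) after a speed change by `factor`."""
    # Assumed rounding rule: round half up to the nearest whole sample.
    new_num_samples = int(num_samples / factor + 0.5)
    new_duration = new_num_samples / sampling_rate
    return new_num_samples, new_duration


# One second of padding at 16 kHz, played back twice as fast.
new_num_samples, new_duration = perturbed_length(16000, 16000, factor=2.0)
```

A factor above 1 shortens the cut and a factor below 1 lengthens it; the sampling rate itself is unchanged.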
- perturb_tempo(factor, affix_id=True)[source]
Return a new PaddingCut that will “mimic” the effect of tempo perturbation on duration and num_samples.
Compared to speed perturbation, tempo preserves pitch.
- Parameters:
factor (
float) – The tempo will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).
affix_id (
bool) – When true, we will modify the PaddingCut.id field by affixing it with “_tp{factor}”.
- Return type:
- Returns:
a modified copy of the current PaddingCut.
- perturb_volume(factor, affix_id=True)[source]
Return a new PaddingCut that will “mimic” the effect of volume perturbation on amplitude of samples.
- Parameters:
factor (
float) – The volume will be adjusted this many times (e.g. factor=1.1 means 1.1x louder).
affix_id (
bool) – When true, we will modify the PaddingCut.id field by affixing it with “_vp{factor}”.
- Return type:
- Returns:
a modified copy of the current PaddingCut.
- reverb_rir(rir_recording=None, normalize_output=True, early_only=False, affix_id=True, rir_channels=[0], room_rng_seed=None, source_rng_seed=None)[source]
Return a new PaddingCut that will “mimic” the effect of reverberation with impulse response on original samples.
- Parameters:
rir_recording (
Optional[Recording]) – The impulse response to use for convolving.
normalize_output (
bool) – When true, output will be normalized to have the same energy as the input.
early_only (
bool) – When true, only the early reflections (first 50 ms) will be used.
affix_id (
bool) – When true, we will modify the PaddingCut.id field by affixing it with “_rvb”.
rir_channels (
List[int]) – The channels of the impulse response to use.
- Return type:
- Returns:
a modified copy of the current PaddingCut.
- normalize_loudness(target, affix_id=False, **kwargs)[source]
Return a new PaddingCut that will “mimic” the effect of loudness normalization.
- Parameters:
target (
float) – The target loudness in dBFS.
affix_id (
bool) – When true, we will modify the DataCut.id field by affixing it with “_ln{target}”.
- Return type:
- Returns:
a modified copy of the current DataCut.
- drop_features()[source]
Return a copy of the current PaddingCut, detached from features.
- Return type:
- drop_recording()[source]
Return a copy of the current PaddingCut, detached from recording.
- Return type:
- compute_and_store_features(extractor, *args, **kwargs)[source]
Returns a new PaddingCut with updated information about the feature dimension and number of feature frames, depending on the extractor properties.
- Return type:
- fill_supervision(*args, **kwargs)[source]
Just for consistency with MonoCut and MixedCut.
- Return type:
- move_to_memory(*args, **kwargs)[source]
Just for consistency with MonoCut and MixedCut.
- Return type:
- map_supervisions(transform_fn)[source]
Just for consistency with MonoCut and MixedCut.
- Parameters:
transform_fn (
Callable[[Any],Any]) – a dummy function that is never actually called.
- Return type:
- Returns:
the PaddingCut itself.
- merge_supervisions(*args, **kwargs)[source]
Just for consistency with MonoCut and MixedCut.
- Return type:
- Returns:
the PaddingCut itself.
- filter_supervisions(predicate)[source]
Just for consistency with MonoCut and MixedCut.
- Parameters:
predicate (
Callable[[SupervisionSegment],bool]) – A callable that accepts SupervisionSegment and returns bool.
- Return type:
- Returns:
a modified MonoCut
- __init__(id, duration, sampling_rate, feat_value, num_frames=None, num_features=None, frame_shift=None, num_samples=None, video=None, custom=None)
- append(other, snr=None, preserve_id=None)
Append the other Cut after the current Cut. Conceptually the same as mix but with an offset matching the current cut’s length. Optionally scale down (positive SNR) or scale up (negative SNR) the other cut. Returns a MixedCut, which only keeps the information about the mix; actual mixing is performed during the call to load_features.
- Parameters:
preserve_id (
Optional[str]) – optional string (“left”, “right”). When specified, append will preserve the cut ID of the left- or right-hand side argument. Otherwise, a new random ID is generated.
- Return type:
- compute_features(extractor, augment_fn=None)
Compute the features from this cut. This cut has to be able to load audio.
- Parameters:
extractor (
FeatureExtractor) – a FeatureExtractor instance used to compute the features.
augment_fn (
Optional[Callable[[ndarray,int],ndarray]]) – optional WavAugmenter instance for audio augmentation.
- Return type:
ndarray
- Returns:
a numpy ndarray with the computed features.
- copy(**replace_attrs)
Returns a shallow copy of self, with specified attributes overwritten.
- Example:
>>> cut = MonoCut(id="old-id", ...)
>>> cut2 = cut.copy(id="new-id")
>>> assert cut.id == "old-id"
>>> assert cut2.id == "new-id"
- cut_into_windows(duration, hop=None, keep_excessive_supervisions=True)
Return a list of shorter cuts, made by traversing this cut in windows of duration seconds by hop seconds.
The last window might have a shorter duration if there was not enough audio, so you might want to either filter or pad the results.
- Parameters:
duration (
float) – Desired duration of the new cuts in seconds.
hop (
Optional[float]) – Shift between the windows in the new cuts in seconds.
keep_excessive_supervisions (
bool) – bool. When a cut is truncated in the middle of a supervision segment, should the supervision be kept.
- Return type:
CutSet
- Returns:
a list of cuts made from shorter duration windows.
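The windowing can be sketched as a list of (start, end) spans. This is an illustration under stated assumptions, not lhotse's implementation: window_spans is a hypothetical helper, hop defaults to duration when omitted, and the number of windows is assumed to be the smallest count that covers the whole cut.

```python
import math
from typing import List, Optional, Tuple


def window_spans(
    total: float, duration: float, hop: Optional[float] = None
) -> List[Tuple[float, float]]:
    """List (start, end) windows of `duration` seconds, shifted by `hop`."""
    hop = duration if hop is None else hop
    if total <= duration:
        num_windows = 1
    else:
        # Assumed rule: just enough windows for the last one to reach the end.
        num_windows = math.ceil((total - duration) / hop) + 1
    return [
        (i * hop, min(i * hop + duration, total)) for i in range(num_windows)
    ]


non_overlapping = window_spans(10.0, 4.0)
overlapping = window_spans(10.0, 4.0, hop=2.0)
```

In the non-overlapping case the last window covers only 8 s to 10 s, which is the shorter final window that the description suggests filtering or padding.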
- cut_into_windows_balanced(min_duration, max_duration, overlap=0.0, keep_excessive_supervisions=True)
Return a list of shorter cuts made by splitting this cut into overlapping windows whose size is chosen within [min_duration, max_duration] to maximise the duration of the final (potentially shorter) window, thereby minimising padding.
Each resulting sub-cut carries two extra entries in its custom dict:
"source_cut_id" – the id of this (parent) cut.
"source_cut_start" – the start time of this cut within its recording. Downstream code can use this to detect whether the parent was the first window of a recording (source_cut_start == 0) or a later continuation.
- Parameters:
min_duration (
float) – Minimum desired window duration in seconds.
max_duration (
float) – Maximum desired window duration in seconds.
overlap (
float) – Overlap between consecutive windows in seconds (default: 0).
keep_excessive_supervisions (
bool) – When a window is truncated mid-supervision, should the supervision be kept.
- Return type:
CutSet
- Returns:
a CutSet of overlapping sub-cuts.
- property end: float
- property has_overlapping_supervisions: bool
- index_supervisions(index_mixed_tracks=False, keep_ids=None)
Create a two-level index of supervision segments. It is a mapping from a Cut’s ID to an interval tree that contains the supervisions of that Cut.
The interval tree can be efficiently queried for overlapping and/or enveloping segments. It helps speed up some operations on Cuts of very long recordings (1h+) that contain many supervisions.
- Parameters:
index_mixed_tracks (
bool) – Should the tracks of MixedCuts be indexed as additional, separate entries.
keep_ids (
Optional[Set[str]]) – If specified, we will only index the supervisions with the specified IDs.
- Return type:
Dict[str,IntervalTree]
- Returns:
a mapping from Cut ID to an interval tree of SupervisionSegments.
- mix(other, offset_other_by=0.0, allow_padding=False, snr=None, preserve_id=None, tag=None)
Refer to the lhotse.cut.mix() documentation.
- Return type:
- play_audio()
Display a Jupyter widget that allows to listen to the waveform. Works only in Jupyter notebook/lab or similar (e.g. Colab).
- plot_alignment(alignment_type='word')
Display the alignment on top of a spectrogram. Requires matplotlib to be installed.
- plot_audio(ax=None, **kwargs)
Display a plot of the waveform. Requires matplotlib to be installed.
- plot_features()
Display the feature matrix as an image. Requires matplotlib to be installed.
- save_audio(storage_path, format=None, encoding=None, augment_fn=None, **kwargs)
Store this cut’s waveform as audio recording to disk.
- Parameters:
storage_path (
Union[Path,str]) – The path to location where we will store the audio recordings.
format (
Optional[str]) – Audio format argument supported by torchaudio.save or soundfile.write. Please refer to the relevant library’s documentation depending on which audio backend you’re using.
encoding (
Optional[str]) – Audio encoding argument supported by torchaudio.save or soundfile.write. Please refer to the relevant library’s documentation depending on which audio backend you’re using.
augment_fn (
Optional[Callable[[ndarray,int],ndarray]]) – an optional callable used for audio augmentation. Be careful with the types of augmentations used: if they modify the start/end/duration times of the cut and its supervisions, you will end up with incorrect supervision information when using this API. E.g. for speed perturbation, use CutSet.perturb_speed() instead.
kwargs – additional arguments passed to Cut.load_audio(). For example, when saving a MixedCut, we can specify mono_downmix=True to downmix the tracks to mono before saving.
- Return type:
- Returns:
a new Cut instance.
- speakers_audio_mask(min_speaker_dim=None, speaker_to_idx_map=None, use_alignment_if_exists=None)
Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_samples), and its values are 0 for nonspeech samples and 1 for speech samples for each respective speaker.
This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272
- Parameters:
min_speaker_dim (
Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).
speaker_to_idx_map (
Optional[Dict[str,int]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2).
use_alignment_if_exists (
Optional[str]) – optional str, key for alignment type to use for generating the mask. If it does not exist, falls back to the supervision time spans.
- Return type:
ndarray
- speakers_feature_mask(min_speaker_dim=None, speaker_to_idx_map=None, use_alignment_if_exists=None)
Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_frames), and its values are 0 for nonspeech frames and 1 for speech frames for each respective speaker.
This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272
- Parameters:
min_speaker_dim (
Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).
speaker_to_idx_map (
Optional[Dict[str,int]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2).
use_alignment_if_exists (
Optional[str]) – optional str, key for alignment type to use for generating the mask. If it does not exist, falls back to the supervision time spans.
- Return type:
ndarray
- split(timestamp)
-
Split a cut into two cuts at timestamp, which is measured from the start of the cut. For example, a [0s - 10s] cut split at 4s yields:
left cut [0s - 4s]
right cut [4s - 10s]
- supervisions_audio_mask(use_alignment_if_exists=None)
Return a 1D numpy array with value 1 for samples covered by at least one supervision, and 0 for samples not covered by any supervision.
- Parameters:
use_alignment_if_exists (
Optional[str]) – optional str, key for alignment type to use for generating the mask. If it does not exist, falls back to the supervision time spans.
- Return type:
ndarray
- supervisions_feature_mask(use_alignment_if_exists=None)
Return a 1D numpy array with value 1 for frames covered by at least one supervision, and 0 for frames not covered by any supervision.
- Parameters:
use_alignment_if_exists (
Optional[str]) – optional str, key for alignment type to use for generating the mask. If it does not exist, falls back to the supervision time spans.
- Return type:
ndarray
- to_dict()
- Return type:
dict
- trim_to_alignments(type, max_pause=None, max_segment_duration=None, delimiter=' ', keep_all_channels=False)
Splits the current Cut into its constituent alignment items (AlignmentItem). These cuts have identical start times and durations as the alignment item. Additionally, the max_pause option can be used to merge alignment items that are separated by a pause shorter than max_pause. If max_segment_duration is specified, we will keep merging consecutive segments until the duration of the merged segment exceeds max_segment_duration.
For the case of a multi-channel cut with multiple alignments, we can either trim while respecting the supervision channels (in which case output cut has the same channels as the supervision) or ignore the channels (in which case output cut has the same channels as the input cut).
Hint
If the resulting trimmed cut contains a single supervision, we set the cut id to the id of this supervision, for better compatibility with downstream tools, e.g. comparing the hypothesis of ASR with the reference in icefall.
Hint
If a MultiCut is trimmed and the resulting trimmed cut contains a single channel, we convert it to a MonoCut.
Hint
If you have a Cut with multiple supervision segments and you want to trim it to the word-level alignment, you can use the Cut.merge_supervisions() method first to merge the supervisions into a single one, followed by the Cut.trim_to_alignments() method. For example:
>>> cut = cut.merge_supervisions(type='word', delimiter=' ')
>>> cut = cut.trim_to_alignments(type='word', max_pause=1.0)
Hint
The above technique can also be used to segment long cuts into roughly equal duration segments, while respecting alignment boundaries. For example, to split a Cut into 10s segments, you can do:
>>> cut = cut.merge_supervisions(type='word', delimiter=' ')
>>> cut = cut.trim_to_alignments(type='word', max_pause=10.0, max_segment_duration=10.0)
- Parameters:
type (
str) – The type of the alignment to trim to (e.g. “word”).
max_pause (
Optional[float]) – The maximum pause allowed between the alignments to merge them. If None, no merging will be performed. [default: None]
delimiter (
str) – The delimiter to use when joining the alignment items.
keep_all_channels (
bool) – If True, the output cut will have the same channels as the input cut. By default, the trimmed cut will have the same channels as the supervision.
num_jobs – Number of parallel workers to process the cuts.
- Return type:
CutSet
- Returns:
a CutSet object.
- trim_to_supervision_groups(max_pause=0.0)
Return a new CutSet with Cuts based on supervision groups. A supervision group is a set of supervisions with no gaps between them (or gaps shorter than max_pause). This is similar to the concept of an utterance group as described in this paper: https://arxiv.org/abs/2211.00482
For example, the following cut:
Cut
╔═════════════════════════════════════════════════════════════════════════════════╗
║┌──────────────────────┐                              ┌────────┐                 ║
║│ Hello this is John.  │                              │   Hi   │                 ║
║└──────────────────────┘                              └────────┘                 ║
║            ┌──────────────────────────────────┐            ┌───────────────────┐║
║            │     Hey, John. How are you?      │            │  What do you do?  │║
║            └──────────────────────────────────┘            └───────────────────┘║
╚═════════════════════════════════════════════════════════════════════════════════╝
is transformed into two cuts:
Cut 1                                                 Cut 2
╔════════════════════════════════════════════════╗    ╔═══════════════════════════╗
║┌──────────────────────┐                        ║    ║┌────────┐                 ║
║│ Hello this is John.  │                        ║    ║│   Hi   │                 ║
║└──────────────────────┘                        ║    ║└────────┘                 ║
║            ┌──────────────────────────────────┐║    ║      ┌───────────────────┐║
║            │     Hey, John. How are you?      │║    ║      │  What do you do?  │║
║            └──────────────────────────────────┘║    ║      └───────────────────┘║
╚════════════════════════════════════════════════╝    ╚═══════════════════════════╝
For the case of a multi-channel cut with multiple supervisions, we keep all the channels in the recording.
- Parameters:
max_pause (float) – An optional duration in seconds; if the gap between two supervisions is longer than this, they will be treated as separate groups. By default, this is set to 0.0, which means that no gaps are allowed between supervisions.
- Return type:
CutSet
- Returns:
a CutSet.
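The grouping rule can be illustrated with a small, library-independent sketch. It is not lhotse's implementation: supervisions are represented as plain (start, end) tuples, and the helper name group_supervisions is hypothetical. Following the max_pause description above, a new group starts only when the gap is longer than max_pause.

```python
from typing import List, Optional, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds; a stand-in for a supervision


def group_supervisions(
    intervals: List[Interval], max_pause: float = 0.0
) -> List[List[Interval]]:
    """Group intervals so that no gap inside a group is longer than max_pause."""
    groups: List[List[Interval]] = []
    group_end: Optional[float] = None
    for start, end in sorted(intervals):
        if group_end is not None and start - group_end <= max_pause:
            # Overlapping or close enough: extend the current group.
            groups[-1].append((start, end))
            group_end = max(group_end, end)
        else:
            # Gap longer than max_pause (or first interval): open a new group.
            groups.append([(start, end)])
            group_end = end
    return groups


# Two overlapping pairs separated by a one-second pause, as in the diagram.
sups = [(0.0, 4.0), (2.0, 8.0), (9.0, 10.5), (10.0, 13.5)]
groups = group_supervisions(sups, max_pause=0.0)
spans = [(g[0][0], max(e for _, e in g)) for g in groups]
print(spans)
```

Each resulting span corresponds to one output cut; raising max_pause above the pause length merges the two groups into one.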
- trim_to_supervisions(keep_overlapping=True, min_duration=None, context_direction='center', keep_all_channels=False)
Splits the current Cut into as many cuts as there are supervisions (SupervisionSegment). These cuts have identical start times and durations as the supervisions. When there are overlapping supervisions, they can be kept or discarded via the keep_overlapping flag.
For example, the following cut:
    Cut
    |-----------------|
    Sup1
    |----|
       Sup2
       |-----------|
is transformed into two cuts:
    Cut1
    |----|
    Sup1
    |----|
       Sup2
       |-|

    Cut2
    |-----------|
    Sup1
    |-|
    Sup2
    |-----------|
For the case of a multi-channel cut with multiple supervisions, we can either trim while respecting the supervision channels (in which case output cut has the same channels as the supervision) or ignore the channels (in which case output cut has the same channels as the input cut).
Hint
If the resulting trimmed cut contains a single supervision, we set the cut id to the id of this supervision, for better compatibility with downstream tools, e.g. comparing the hypothesis of ASR with the reference in icefall.
Hint
If a MultiCut is trimmed and the resulting trimmed cut contains a single channel, we convert it to a MonoCut.
- Parameters:
keep_overlapping (bool) – when False, it will discard parts of other supervisions that overlap with the main supervision. In the illustration above, it would discard Sup2 in Cut1 and Sup1 in Cut2. In this mode, we guarantee that there will always be exactly one supervision per cut.
min_duration (Optional[float]) – An optional duration in seconds; specifying this argument will extend the cuts that would have been shorter than min_duration with actual acoustic context in the recording/features. If there are supervisions present in the context, they are kept when keep_overlapping is true. If there is not enough context, the returned cut will be shorter than min_duration. If the supervision segment is longer than min_duration, the returned cut will be longer.
context_direction (Literal['center', 'left', 'right', 'random']) – Which direction should the cut be expanded towards to include context. The value of "center" implies equal expansion to left and right; "random" uniformly samples a value between "left" and "right".
keep_all_channels (bool) – If True, the output cut will have the same channels as the input cut. By default, the trimmed cut will have the same channels as the supervision.
- Return type:
CutSet
- Returns:
a CutSet containing the trimmed cuts.
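The interaction between min_duration and context_direction can be sketched in plain Python. This is a simplified illustration under stated assumptions, not lhotse's code: the function name expand_to_min_duration and its arguments are hypothetical, times are seconds within a recording, and any context that falls outside the recording is simply clipped.

```python
import random
from typing import Optional, Tuple


def expand_to_min_duration(
    start: float,
    end: float,
    recording_duration: float,
    min_duration: float,
    direction: str = "center",
    rng: Optional[random.Random] = None,
) -> Tuple[float, float]:
    """Extend [start, end] with surrounding context until it spans min_duration."""
    missing = max(0.0, min_duration - (end - start))
    if direction == "center":
        left_share = 0.5  # equal expansion on both sides
    elif direction == "left":
        left_share = 1.0
    elif direction == "right":
        left_share = 0.0
    elif direction == "random":
        # A uniformly sampled point between pure "left" and pure "right".
        left_share = (rng or random).uniform(0.0, 1.0)
    else:
        raise ValueError(f"Unknown direction: {direction}")
    # Clip to the recording; the result may then be shorter than min_duration.
    new_start = max(0.0, start - missing * left_share)
    new_end = min(recording_duration, end + missing * (1.0 - left_share))
    return new_start, new_end


print(expand_to_min_duration(4.0, 6.0, 20.0, min_duration=5.0, direction="center"))
```

A segment that is already longer than min_duration is returned unchanged, which matches the parameter description above.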
- property trimmed_supervisions: List[SupervisionSegment]
Return the supervisions in this Cut that have modified time boundaries so as not to exceed the Cut’s start or end.
Note that when cut.supervisions is called, the supervisions may have negative start values that indicate the supervision actually begins before the cut, or end values that exceed the Cut’s duration (it means the supervision continued in the original recording after the Cut’s ending).
Caution
For some tasks such as speech recognition (ASR), trimmed supervisions could result in corrupted training data. This is because a part of the transcript might actually reside outside of the cut.
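As a minimal sketch of what trimming means here, assuming supervision times are expressed relative to the cut start, the boundaries are clamped to the range from 0 to the cut duration. The helper trim_to_cut is hypothetical and is not part of lhotse.

```python
from typing import Tuple


def trim_to_cut(sup_start: float, sup_end: float, cut_duration: float) -> Tuple[float, float]:
    """Clamp supervision boundaries so they do not exceed the cut's start or end."""
    return max(0.0, sup_start), min(cut_duration, sup_end)


# A supervision that began 1.5 s before the cut and ends 3 s into it.
print(trim_to_cut(-1.5, 3.0, cut_duration=10.0))
# A supervision that continues 2 s past the end of a 10 s cut.
print(trim_to_cut(8.0, 12.0, cut_duration=10.0))
```

Only the time boundaries change; the transcript text stays whole, which is the source of the caution above.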
- unmix(tag=None)
Return this cut as a single-item list.
This is a compatibility no-op for cut types that are not MixedCut, so callers can uniformly invoke cut.unmix() regardless of the concrete cut type.
- Parameters:
tag (Optional[str]) – Ignored for non-mixed cuts.
- Return type:
List[Cut]
- Returns:
A single-item list containing self.
-
features_type:
Optional[str]
-
phone:
Callable
-
id:
- lhotse.cut.create_cut_set_eager(recordings=None, supervisions=None, features=None, output_path=None, random_ids=False, tolerance=0.001)[source]
Create a CutSet from any combination of supervision, feature and recording manifests. At least one of recordings or features is required.
The created cuts will be of type DataCut (MonoCut for single-channel and MultiCut for multi-channel). The DataCut boundaries correspond to those found in the features, when available, otherwise to those found in the recordings.
When supervisions are provided, we’ll be searching them for matching recording IDs and attaching to created cuts, assuming they are fully within the cut’s time span.
- Parameters:
recordings (Optional[RecordingSet]) – an optional RecordingSet manifest.
supervisions (Optional[SupervisionSet]) – an optional SupervisionSet manifest.
features (Optional[FeatureSet]) – an optional FeatureSet manifest.
output_path (Union[Path, str, None]) – an optional path where the CutSet is stored.
random_ids (bool) – boolean, should the cut IDs be randomized. By default, use the recording ID with a loop index and a channel idx, i.e. "{recording_id}-{idx}-{channel}".
tolerance (float) – float, tolerance for supervision and feature segment boundary comparison. By default, it’s 1ms. Increasing this value can be helpful when importing Kaldi data directories with precomputed features.
- Return type:
CutSet
- Returns:
a new CutSet instance.
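The default ID pattern and the role of tolerance can be sketched as follows. Both helpers are hypothetical illustrations. In particular, how lhotse applies the tolerance internally is an assumption here; the sketch only shows why a 1 ms slack accepts boundaries that differ by rounding.

```python
def default_cut_id(recording_id: str, idx: int, channel: int) -> str:
    """Build an ID following the documented "{recording_id}-{idx}-{channel}" pattern."""
    return f"{recording_id}-{idx}-{channel}"


def is_within(
    sup_start: float,
    sup_end: float,
    cut_start: float,
    cut_end: float,
    tolerance: float = 0.001,
) -> bool:
    """Check that a supervision lies inside a cut, allowing `tolerance` seconds of slack."""
    return sup_start >= cut_start - tolerance and sup_end <= cut_end + tolerance


print(default_cut_id("rec1", 0, 0))
# A supervision ending 0.5 ms after the cut still counts as inside with 1 ms tolerance.
print(is_within(0.0, 10.0005, 0.0, 10.0))
```

Precomputed Kaldi features often have frame-quantized durations, which is one situation where a larger tolerance avoids dropping otherwise valid supervisions.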
- lhotse.cut.create_cut_set_lazy(output_path, recordings=None, supervisions=None, features=None, random_ids=False, tolerance=0.001)[source]
Create a CutSet from any combination of supervision, feature and recording manifests. At least one of recordings or features is required.
This method is the “lazy” variant, which allows creating a CutSet with minimal memory usage. It has some extra requirements:
- The user must provide an output_path, where we will write the cuts as we create them. We’ll return a lazily-opened CutSet from that file.
- recordings and features (if both provided) have to be of equal length and sorted by the recording_id attribute of their elements.
- supervisions (if provided) have to be sorted by recording_id; note that there may be multiple supervisions with the same recording_id, which is allowed.
In addition, to prepare cuts in a fully memory-efficient way, make sure that:
- All input manifests are stored in JSONL format and opened lazily with the <manifest_class>.from_jsonl_lazy(path) method.
For more details, see create_cut_set_eager().
- Parameters:
output_path (Union[Path, str]) – path to which we will write the cuts.
recordings (Optional[RecordingSet]) – an optional RecordingSet manifest.
supervisions (Optional[SupervisionSet]) – an optional SupervisionSet manifest.
features (Optional[FeatureSet]) – an optional FeatureSet manifest.
random_ids (bool) – boolean, should the cut IDs be randomized. By default, use the recording ID with a loop index and a channel idx, i.e. "{recording_id}-{idx}-{channel}".
tolerance (float) – float, tolerance for supervision and feature segment boundary comparison. By default, it’s 1ms. Increasing this value can be helpful when importing Kaldi data directories with precomputed features.
- Return type:
CutSet
- Returns:
a new CutSet instance.
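The sorting requirements exist because sorted inputs can be joined in a single streaming pass. The sketch below shows that idea with plain dicts standing in for manifest items; join_sorted is a hypothetical helper, not lhotse's implementation, and it assumes both streams use the same string ordering of recording IDs.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple


def join_sorted(
    recordings: Iterable[dict], supervisions: Iterable[dict]
) -> Iterator[Tuple[dict, List[dict]]]:
    """Yield (recording, its supervisions) while holding one recording's data at a time."""
    sup_groups = groupby(supervisions, key=lambda s: s["recording_id"])
    current = next(sup_groups, None)
    for rec in recordings:
        # Drop supervision groups that sort before this recording.
        while current is not None and current[0] < rec["id"]:
            current = next(sup_groups, None)
        if current is not None and current[0] == rec["id"]:
            # Materialize the group before advancing the groupby iterator.
            yield rec, list(current[1])
            current = next(sup_groups, None)
        else:
            yield rec, []


recs = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
sups = [
    {"recording_id": "a", "text": "x"},
    {"recording_id": "a", "text": "y"},
    {"recording_id": "c", "text": "z"},
]
joined = [(r["id"], [s["text"] for s in ss]) for r, ss in join_sorted(iter(recs), iter(sups))]
print(joined)
```

If either input were unsorted, matching items could already have been passed over, which is why the eager variant (which can index everything in memory) has no such requirement.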
- lhotse.cut.compute_supervisions_frame_mask(cut, frame_shift=None, use_alignment_if_exists=None)[source]
Compute a mask that indicates which frames in a cut are covered by supervisions.
- Parameters:
cut (Cut) – a cut object.
frame_shift (Optional[float]) – optional frame shift in seconds; required when the cut does not have pre-computed features, otherwise ignored.
use_alignment_if_exists (Optional[str]) – optional str (key from alignment dict); use the specified alignment type for generating the mask.
- Returns:
a 1D numpy array with value 1 for frames covered by at least one supervision, and 0 for frames not covered by any supervision.
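A rough sketch of how such a mask can be derived from supervision boundaries is shown below. It is an assumption-laden illustration: it returns a plain Python list rather than a numpy array, the function name supervisions_frame_mask is hypothetical, and converting times to frame indices with simple rounding may differ from what lhotse does.

```python
from typing import List, Tuple


def supervisions_frame_mask(
    duration: float, frame_shift: float, supervisions: List[Tuple[float, float]]
) -> List[int]:
    """Return 1 for frames covered by at least one (start, end) supervision, else 0."""
    num_frames = round(duration / frame_shift)
    mask = [0] * num_frames
    for start, end in supervisions:
        # Supervisions may start before or end after the cut, so clip the indices.
        first = max(0, round(start / frame_shift))
        last = min(num_frames, round(end / frame_shift))
        for i in range(first, last):
            mask[i] = 1
    return mask


# A 2 s cut with a 0.25 s frame shift and one supervision from 0.5 s to 1.0 s.
print(supervisions_frame_mask(2.0, 0.25, [(0.5, 1.0)]))
```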
Recipes
Convenience methods used to prepare recording and supervision manifests for standard corpora.
- lhotse.recipes.download_adept(target_dir='.', force_download=False)[source]
Download and untar the ADEPT dataset.
- Parameters:
target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.
force_download (bool) – Bool, if True, download the tars even if they already exist.
- Return type:
Path
- Returns:
the path to the downloaded and extracted directory with data.
- lhotse.recipes.prepare_adept(corpus_dir, output_dir=None)[source]
Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the output_dir, it will simply read and return them.
- Parameters:
corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.
dataset_parts – string or sequence of strings representing dataset part names, e.g. ‘train-clean-100’, ‘train-clean-5’, ‘dev-clean’. By default we will infer which parts are available in corpus_dir.
output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.
- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘audio’ and ‘supervisions’.
- lhotse.recipes.download_aishell(target_dir='.', force_download=False, base_url='http://www.openslr.org/resources')[source]
Download and untar the dataset.
- Parameters:
target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.
force_download (bool) – Bool, if True, download the tars even if they already exist.
base_url (str) – str, the url of the OpenSLR resources.
- Return type:
Path
- Returns:
the path to the downloaded and extracted directory with data.
- lhotse.recipes.prepare_aishell(corpus_dir, output_dir=None)[source]
Returns the manifests which consist of the Recordings and Supervisions.
- Parameters:
corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.
output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.
- Return type:
Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]
- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.
- lhotse.recipes.download_aishell3(target_dir='.', force_download=False, base_url='http://www.openslr.org/resources')[source]
Download and untar the dataset
- Parameters:
target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.
force_download (Optional[bool]) – Bool, if True, download the tars even if they already exist.
base_url (Optional[str]) – str, the url of the OpenSLR resources.
- Return type:
Path
- Returns:
the path to the downloaded and extracted directory with data.
- lhotse.recipes.prepare_aishell3(corpus_dir, output_dir=None)[source]
Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the output_dir, it will simply read and return them.
- Parameters:
corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.
output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.
- Return type:
Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]
- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘audio’ and ‘supervisions’.
- lhotse.recipes.download_aishell4(target_dir='.', force_download=False, base_url='http://www.openslr.org/resources')[source]
Download and untar the dataset.
- Parameters:
target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.
force_download (Optional[bool]) – Bool, if True, download the tars even if they already exist.
base_url (Optional[str]) – str, the url of the OpenSLR resources.
- Return type:
Path
- Returns:
the path to the downloaded and extracted directory with data.
- lhotse.recipes.prepare_aishell4(corpus_dir, output_dir=None, normalize_text=False)[source]
Returns the manifests which consist of the Recordings and Supervisions.
- Parameters:
corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.
output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.
- Return type:
Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]
- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.
- lhotse.recipes.download_ali_meeting(target_dir='.', force_download=False, base_url='https://speech-lab-share-data.oss-cn-shanghai.aliyuncs.com/')[source]
Download and untar the dataset.
- Parameters:
target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.
force_download (Optional[bool]) – Bool, if True, download the tars even if they already exist.
base_url (Optional[str]) – str, the base url to download the dataset from.
- Return type:
Path
- Returns:
the path to the downloaded and extracted directory with data.
- lhotse.recipes.prepare_ali_meeting(corpus_dir, output_dir=None, mic='far', normalize_text='none', save_mono=False)[source]
Returns the manifests which consist of the Recordings and Supervisions.
- Parameters:
corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.
output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.
mic (Optional[str]) – str, “near” or “far”, specifies whether to prepare the near-field or far-field data. May also specify “ihm”, “sdm”, “mdm” (similar to AMI recipe), where “ihm” and “mdm” are the same as “near” and “far” respectively, and “sdm” is the same as “far” with a single channel.
normalize_text (str) – str, the text normalization type. Available options: “none”, “m2met”.
save_mono (bool) – bool, if True, save the mono recordings for sdm mic. This can speed up feature extraction since all channels will not be loaded.
- Return type:
Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]
- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.
- lhotse.recipes.download_ami(target_dir='.', annotations=None, force_download=False, url='http://groups.inf.ed.ac.uk/ami', mic='ihm')[source]
Download AMI audio and annotations for provided microphone setting.
Example usage:
1. Download AMI data for IHM mic setting:
>>> download_ami(mic='ihm')
2. Download AMI data for IHM-mix mic setting, and use existing annotations:
>>> download_ami(mic='ihm-mix', annotations='/path/to/existing/annotations.zip')
- Parameters:
target_dir (Union[Path, str]) – Pathlike, the path to store the data.
annotations (Union[Path, str, None]) – Pathlike (default = None), path to save annotations zip file.
force_download (Optional[bool]) – bool (default = False), if True, download even if file is present.
url (Optional[str]) – str (default = 'http://groups.inf.ed.ac.uk/ami'), AMI download URL.
mic (Optional[str]) – str {'ihm', 'ihm-mix', 'sdm', 'mdm', 'mdm8-bf'}, type of mic setting.
- Return type:
Path
- Returns:
the path to the downloaded and extracted directory with data.
- lhotse.recipes.prepare_ami(data_dir, annotations_dir=None, output_dir=None, mic='ihm', partition='full-corpus', normalize_text='kaldi', max_words_per_segment=None, merge_consecutive=False, keep_punctuation=False)[source]
Returns the manifests which consist of the Recordings and Supervisions.
- Parameters:
data_dir (Union[Path, str]) – Pathlike, the path of the data dir.
annotations_dir – Pathlike, the path of the annotations dir or zip file.
output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.
mic (Optional[str]) – str {'ihm', 'ihm-mix', 'sdm', 'mdm', 'mdm8-bf'}, type of mic to use.
partition (Optional[str]) – str {'full-corpus', 'full-corpus-asr', 'scenario-only'}, AMI official data split.
normalize_text (str) – str {'none', 'upper', 'kaldi'}, normalization of text.
max_words_per_segment (Optional[int]) – int, maximum number of words per segment. If not None, we will split longer segments similar to Kaldi’s data prep scripts, i.e., split on full-stop and comma.
merge_consecutive (bool) – bool, if True, merge consecutive segments split on full-stop. We will only merge segments if the number of words in the merged segment is less than max_words_per_segment.
keep_punctuation (Optional[bool]) – bool, if True, keep punctuation marks.
- Return type:
Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]
- Returns:
a Dict whose key is (‘train’, ‘dev’, ‘eval’), and the values are dicts of manifests under keys ‘recordings’ and ‘supervisions’.
Example usage:
1. Prepare IHM-Mix data for ASR:
>>> manifests = prepare_ami('/path/to/ami-corpus', mic='ihm-mix', partition='full-corpus-asr')
2. Prepare SDM data:
>>> manifests = prepare_ami('/path/to/ami-corpus', mic='sdm', partition='full-corpus')
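The effect of max_words_per_segment on the transcript text can be sketched as below. This is only an illustration of splitting on full-stop and comma: split_long_segment is a hypothetical helper, it ignores the timing information that real segments carry, and it is not the logic used by lhotse or Kaldi.

```python
import re
from typing import List


def split_long_segment(text: str, max_words: int) -> List[str]:
    """Split a transcript at full stops and commas when it exceeds max_words."""
    if len(text.split()) <= max_words:
        return [text]
    # Split after "." or ",", keeping the punctuation with the left-hand piece.
    pieces = re.split(r"(?<=[.,])\s+", text)
    return [p for p in pieces if p]


print(split_long_segment("Okay, so we start now. Any questions?", max_words=4))
```

Segments at or below the limit are returned untouched, so the option only affects unusually long utterances.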
- lhotse.recipes.prepare_aspire(corpus_dir, output_dir=None, mic='single')[source]
Returns the manifests which consist of the Recordings and Supervisions.
- Parameters:
corpus_dir (Union[Path, str]) – Pathlike, the path of the corpus dir (LDC2017S21).
output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.
mic (str) – str, the microphone type, either “single” or “multi”.
- Return type:
Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]
- Returns:
a Dict whose key is the dataset part (‘dev’ and ‘dev_test’), and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.
- lhotse.recipes.prepare_atcosim(corpus_dir, output_dir=None, silence_sym='', breath_sym='', foreign_sym='<unk>', partial_sym='<unk>', unknown_sym='<unk>')[source]
Returns the manifests which consist of the Recordings and Supervisions
- Parameters:
corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.
output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.
silence_sym (Optional[str]) – str, silence symbol.
breath_sym (Optional[str]) – str, breath symbol.
foreign_sym (Optional[str]) – str, foreign symbol.
partial_sym (Optional[str]) – str, partial symbol. When set to None, will output partial words.
unknown_sym (Optional[str]) – str, unknown symbol.
- Return type:
Dict[str, Union[RecordingSet, SupervisionSet]]
- Returns:
The RecordingSet and SupervisionSet with the keys ‘audio’ and ‘supervisions’.
- lhotse.recipes.prepare_single_babel_language(corpus_dir, output_dir=None, no_eval_ok=False)[source]
Prepares manifests using a single BABEL LDC package.
This function works as follows:
- first, it will scan corpus_dir for a directory named conversational; if there is more than one, it picks the first one (and emits a warning)
- then, it will try to find dev, eval, and training splits inside (if any of them is not present, it will skip it with a warning)
- finally, it scans the selected location for SPHERE audio files and transcripts.
- Parameters:
corpus_dir (Union[Path, str]) – Path to the root of the LDC package with a BABEL language.
output_dir (Union[Path, str, None]) – Path where the manifests are stored.
no_eval_ok (bool) – When set to True, this function won’t emit a warning that the eval set was not found.
- Return type:
Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]
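The first two steps of that procedure can be sketched with pathlib. This is a hedged illustration only: find_babel_splits is a hypothetical helper, the recursive search and the sort order used to pick "the first" directory are assumptions, and the real recipe goes on to parse SPHERE audio and transcripts.

```python
import tempfile
import warnings
from pathlib import Path
from typing import Dict


def find_babel_splits(corpus_dir: Path) -> Dict[str, Path]:
    """Locate the 'conversational' directory and the splits present inside it."""
    candidates = sorted(p for p in corpus_dir.rglob("conversational") if p.is_dir())
    if not candidates:
        raise FileNotFoundError(f"No 'conversational' directory under {corpus_dir}")
    if len(candidates) > 1:
        warnings.warn(f"Found {len(candidates)} 'conversational' dirs; using {candidates[0]}")
    root = candidates[0]
    splits: Dict[str, Path] = {}
    for name in ("dev", "eval", "training"):
        if (root / name).is_dir():
            splits[name] = root / name
        else:
            warnings.warn(f"Split '{name}' not found; skipping it.")
    return splits


# Demo on a throwaway directory tree that has no 'eval' split.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "package" / "conversational"
    (base / "dev").mkdir(parents=True)
    (base / "training").mkdir()
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        demo_splits = sorted(find_babel_splits(Path(tmp)))
print(demo_splits)
```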
- lhotse.recipes.prepare_bengaliai_speech(corpus_dir, output_dir=None, num_jobs=1)[source]
Returns the manifests which consist of the Recordings and Supervisions.
- Parameters:
corpus_dir (Union[Path, str]) – Path to the Bengali.AI Speech dataset.
output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.
- Return type:
Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]
- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.
- lhotse.recipes.prepare_broadcast_news(audio_dir, transcripts_dir, output_dir=None, absolute_paths=False)[source]
Prepare manifests for 1997 English Broadcast News corpus. We create three manifests: one with recordings, one with segments supervisions, and one with section supervisions. The latter can be used e.g. for topic segmentation.
- Parameters:
audio_dir (Union[Path, str]) – Path to LDC98S71 package.
transcripts_dir (Union[Path, str]) – Path to LDC98T28 package.
output_dir (Union[Path, str, None]) – Directory where the manifests should be written. Can be omitted to avoid writing.
- Return type:
Dict[str, Union[RecordingSet, SupervisionSet]]
- Returns:
A dict with manifests. The keys are: {'recordings', 'sections', 'segments'}.
- lhotse.recipes.download_but_reverb_db(target_dir='.', url='http://merlin.fit.vutbr.cz/ReverbDB/BUT_ReverbDB_rel_19_06_RIR-Only.tgz', force_download=False)[source]
Download and untar the BUT Reverb DB dataset.
- Parameters:
target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.
url (Optional[str]) – str, the url from which the BUT_ReverbDB.tgz archive is downloaded.
force_download (Optional[bool]) – bool, if True, download the archive even if it already exists.
- Return type:
Path
- lhotse.recipes.prepare_but_reverb_db(corpus_dir, output_dir=None, parts=('silence', 'rir'))[source]
Prepare the BUT Speech@FIT Reverb Database corpus.
- Parameters:
corpus_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.
output_dir (Union[Path, str, None]) – Pathlike, the path of the dir to write the manifests.
- Return type:
Dict[str, Dict[str, Union[RecordingSet, CutSet]]]
- lhotse.recipes.prepare_bvcc(corpus_dir, output_dir=None, num_jobs=1)[source]
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]
- lhotse.recipes.prepare_callhome_egyptian(audio_dir, transcript_dir, output_dir=None, absolute_paths=False)[source]
Prepare manifests for the Callhome Egyptian Arabic Corpus. We create two manifests: one with recordings, and the other one with text supervisions.
- Parameters:
audio_dir (Union[Path, str]) – Path to LDC97S45 package.
transcript_dir (Union[Path, str]) – Path to the LDC97T19 content.
output_dir (Union[Path, str, None]) – Directory where the manifests should be written. Can be omitted to avoid writing.
absolute_paths (bool) – Whether to return absolute or relative (to the corpus dir) paths for recordings.
- Return type:
Dict[str, Union[RecordingSet, SupervisionSet]]
- Returns:
A dict with manifests. The keys are: {'recordings', 'supervisions'}.
- lhotse.recipes.prepare_callhome_english(audio_dir, rttm_dir=None, transcript_dir=None, output_dir=None, absolute_paths=False)[source]
Prepare manifests for the CallHome American English corpus. We create two manifests: one with recordings, and the other one with text supervisions.
Depending on the value of transcript_dir, will prepare either data for the ASR task (expected LDC corpora LDC97S42 and LDC97T14) or the SRE task (expected corpus LDC2001S97).
- Parameters:
audio_dir (Union[Path, str]) – Path to LDC97S42 or LDC2001S97 content.
transcript_dir (Union[Path, str, None]) – Path to the LDC97T14 content.
rttm_dir (Union[Path, str, None]) – Path to the directory with RTTM files. If not provided, they will be downloaded.
absolute_paths (bool) – Whether to return absolute or relative (to the corpus dir) paths for recordings.
- Return type:
Dict[str, Union[RecordingSet, SupervisionSet]]
- Returns:
A dict with manifests. The keys are: {'recordings', 'supervisions'}.
- lhotse.recipes.download_chime6(target_dir='.', force_download=False)[source]
Download the dataset.
- Parameters:
target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.
force_download (bool) – bool, if True, the data are downloaded even if present in the target_dir.
- Return type:
Path
- Returns:
the path to the downloaded and extracted directory with data.
- lhotse.recipes.prepare_chime6(corpus_dir, output_dir=None, dataset_parts='all', mic='mdm', use_reference_array=False, perform_array_sync=False, verify_md5_checksums=False, num_jobs=1, num_threads_per_job=1, sox_path='/usr/bin/sox', normalize_text='kaldi', use_chime7_split=False)[source]
Returns the manifests which consist of the Recordings and Supervisions.
- Parameters:
corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir, either the original CHiME-5 data or the synchronized CHiME-6 data. If the former, perform_array_sync must be True.
output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.
mic (str) – str, the microphone type to use, choose from “ihm” (close-talk) or “mdm” (multi-microphone array) settings. For MDM, there are 6 array devices with 4 channels each, so the resulting recordings will have 24 channels.
use_reference_array (bool) – bool, if True, use the reference array for MDM setting. Only the supervision segments have the reference array information in the channel field. The recordings will still have all the channels in the array. Note that the train set does not have the reference array information.
perform_array_sync (bool) – Bool, if True, perform array synchronization based on: https://github.com/chimechallenge/chime6-synchronisation
num_jobs (int) – int, the number of jobs to run in parallel for array synchronization.
num_threads_per_job (int) – int, number of threads to use per job for clock drift correction. Large values may require more memory, so we recommend using a job scheduler.
sox_path (Union[Path, str]) – Pathlike, the path to the sox v14.4.2 binary. Note that different versions of sox may produce different results.
normalize_text (str) – str, the text normalization method, choose from “none”, “upper”, “kaldi”. The “kaldi” method is the same as Kaldi’s text normalization method for CHiME-6.
verify_md5_checksums (bool) – bool, if True, verify the md5 checksums of the audio files. Note that this step is slow so we recommend only doing it once. It can be sped up by using the num_jobs argument.
use_chime7_split (bool) – bool, if True, use the new split for CHiME-7 challenge.
- Return type:
Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]
- Returns:
a Dict whose key is the dataset part (“train”, “dev” and “eval”), and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.
NOTE: If perform_array_sync is True, the synchronized data will be written to output_dir/CHiME6. This may take a long time and the output will occupy approximately 160G of storage. We will also create a temporary directory for processing, so the required storage in total will be approximately 300G.
- lhotse.recipes.download_cmu_arctic(target_dir='.', speakers=('aew', 'ahw', 'aup', 'awb', 'axb', 'bdl', 'clb', 'eey', 'fem', 'gka', 'jmk', 'ksp', 'ljm', 'lnh', 'rms', 'rxr', 'slp', 'slt'), force_download=False, base_url='http://festvox.org/cmu_arctic/packed/')[source]
Download and untar the CMU Arctic dataset.
- Parameters:
target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.
speakers (Sequence[str]) – a list of speakers to download. By default, downloads all.
force_download (Optional[bool]) – Bool, if True, download the tars even if they already exist.
base_url (Optional[str]) – str, the url of CMU Arctic download site.
- Return type:
Path
- Returns:
the path to the downloaded and extracted directory with data.
- lhotse.recipes.prepare_cmu_arctic(corpus_dir, output_dir=None)[source]
Prepares and returns the CMU Arctic manifests, which consist of Recordings and Supervisions.
- Parameters:
corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.
output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.
- Return type:
Dict[str, Union[RecordingSet, SupervisionSet]]
- Returns:
a dict of {'recordings': …, 'supervisions': …}
- lhotse.recipes.download_cmu_indic(target_dir='.', speakers=('ben_rm', 'guj_ad', 'guj_dp', 'guj_kt', 'hin_ab', 'kan_plv', 'mar_aup', 'mar_slp', 'pan_amp', 'tam_sdr', 'tel_kpn', 'tel_sk', 'tel_ss'), force_download=False, base_url='http://festvox.org/h2r_indic/')[source]
Download and untar the CMU Indic dataset.
- Parameters:
target_dir (Union[Path, str]) – Pathlike, the path of the dir to store the dataset.
speakers (Sequence[str]) – a list of speakers to download. By default, downloads all.
force_download (Optional[bool]) – Bool, if True, download the tars even if they already exist.
base_url (Optional[str]) – str, the url of CMU Indic download site.
- Return type:
Path
- Returns:
the path to the downloaded and extracted directory with data.
- lhotse.recipes.prepare_cmu_indic(corpus_dir, output_dir=None)[source]
Prepares and returns the CMU Indic manifests, which consist of Recordings and Supervisions.
- Parameters:
corpus_dir (Union[Path, str]) – Pathlike, the path of the data dir.
output_dir (Union[Path, str, None]) – Pathlike, the path where to write the manifests.
- Return type:
Dict[str, Union[RecordingSet, SupervisionSet]]
- Returns:
a dict of {'recordings': …, 'supervisions': …}
- lhotse.recipes.prepare_cmu_kids(corpus_dir, output_dir=None, absolute_paths=True)[source]
Prepare manifests for CMU Kids corpus. The prepared supervisions contain the prompt text as the text. Additionally, in the custom tag, we provide the following data: speaker grade/age, population where the speaker came from (SIM95/FP), spoken transcript, and transcription bin (1/2).
Here, bin 1 means utterances where the speaker followed the prompt and no noise/mispronunciation is present, and 2 refers to noisy utterances.
The tag spoken_transcript is the transcription that was actually spoken. It contains noise tags and phone transcription in case the pronunciation differed from that in CMU Dict.
- Parameters:
corpus_dir (Union[Path, str]) – Path to downloaded LDC corpus.
output_dir (Union[Path, str, None]) – Directory where the manifests should be written. Can be omitted to avoid writing.
absolute_paths (Optional[bool]) – Whether to write absolute paths to audio sources (default = True).
- Return type:
Dict[str, Union[RecordingSet, SupervisionSet]]
- Returns:
A dict with manifests. The keys are: {'recordings', 'supervisions'}.
- lhotse.recipes.prepare_commonvoice(corpus_dir, output_dir, languages='auto', splits=('test', 'dev', 'train'), num_jobs=1)[source]
Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the output_dir, it will simply read and return them.
This function expects the input directory structure of:
>>> metadata_path = corpus_dir / language_code / "{train,dev,test}.tsv"
>>> # e.g. pl_train_metadata_path = "/path/to/cv-corpus-13.0-2023-03-09/pl/train.tsv"
>>> audio_path = corpus_dir / language_code / "clips"
>>> # e.g. pl_audio_path = "/path/to/cv-corpus-13.0-2023-03-09/pl/clips"
Returns a dict with 3-level structure (lang -> split -> manifest-type):
>>> {'en/fr/pl/...': {'train/dev/test': {'recordings/supervisions': manifest}}}
- Parameters:
corpus_dir (Union[Path, str]) – Pathlike, the path to the downloaded corpus.
output_dir (Union[Path, str]) – Pathlike, the path where to write the manifests.
languages (Union[str, Sequence[str]]) – ‘auto’ (prepare all discovered data) or a list of language codes.
splits (Union[str, Sequence[str]]) – by default ['train', 'dev', 'test'], can also include 'validated', 'invalidated', and 'other'.
num_jobs (int) – How many concurrent workers to use for scanning of the audio files.
- Return type:
Dict[str, Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]]
- Returns:
a dict with manifests for all specified languages and their train/dev/test splits.
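To make the expected directory layout concrete, the following sketch builds the metadata and clip paths for a set of languages and splits. It only manipulates path objects and touches no files; expected_commonvoice_paths is a hypothetical helper and the corpus location is a placeholder.

```python
from pathlib import PurePosixPath
from typing import Dict, Sequence


def expected_commonvoice_paths(
    corpus_dir: str,
    languages: Sequence[str],
    splits: Sequence[str] = ("train", "dev", "test"),
) -> Dict[str, dict]:
    """Map each language code to the clip directory and per-split TSV paths."""
    root = PurePosixPath(corpus_dir)
    layout: Dict[str, dict] = {}
    for lang in languages:
        layout[lang] = {
            "clips": root / lang / "clips",
            "metadata": {split: root / lang / f"{split}.tsv" for split in splits},
        }
    return layout


layout = expected_commonvoice_paths("/path/to/cv-corpus", ["pl"])
print(layout["pl"]["metadata"]["train"])
print(layout["pl"]["clips"])
```

Checking that these paths exist before calling the recipe is a cheap way to catch a corpus_dir that points one directory level too high or too low.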
- lhotse.recipes.prepare_csj(corpus_dir, transcript_dir=None, manifest_dir=None, dataset_parts=None, nj=16)[source]
- lhotse.recipes.prepare_cslu_kids(corpus_dir, output_dir=None, absolute_paths=True, normalize_text=True)[source]
Prepare manifests for CSLU Kids corpus. The supervision contains either the prompted text, or a transcription of the spontaneous speech, depending on whether the utterance was scripted or spontaneous.
Additionally, the following information is present in the custom tag: scripted/spontaneous utterance, and verification label (rating between 1 and 4) for scripted utterances (see https://catalog.ldc.upenn.edu/docs/LDC2007S18/verification-note.txt or top documentation in this script for more information).
- Parameters:
corpus_dir (
Union[Path,str]) – Path to downloaded LDC corpus.output_dir (
Union[Path,str,None]) – Directory where the manifests should be written. Can be omitted to avoid writing.absolute_paths (
Optional[bool]) – Wheter to write absolute paths to audio sources (default = False)normalize_text (
Optional[bool]) – remove noise tags (<bn>, <bs>) from spontaneous speech transcripts (default = True)
- Return type:
Dict[str,Union[RecordingSet,SupervisionSet]]- Returns:
A dict with manifests. The keys are:
{'recordings', 'supervisions'}.
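As a rough illustration of what the normalize_text option described above does to a spontaneous-speech transcript, the sketch below strips the two noise tags named in the parameter description (<bn> and <bs>). It is not the recipe's actual implementation; `strip_noise_tags` is a hypothetical helper, and the real corpus may contain other tags that are handled differently.

```python
import re

# Only the two tags named in the documentation are covered here (assumption).
_NOISE_TAGS = re.compile(r"<(?:bn|bs)>")


def strip_noise_tags(text: str) -> str:
    """Remove <bn>/<bs> tags and collapse the whitespace left behind."""
    without_tags = _NOISE_TAGS.sub(" ", text)
    return " ".join(without_tags.split())


cleaned = strip_noise_tags("<bn> i like <bs> dogs")
print(cleaned)
```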
- lhotse.recipes.download_daily_talk(target_dir, force_download=False)[source]
Downloads the DailyTalk data from Google Drive and extracts it.
- Parameters:
target_dir (Union[Path,str]) – the directory where DailyTalk data will be saved.
force_download (bool) – if True, it will download the DailyTalk data even if it is already present.
- Return type:
Path
- Returns:
the path to downloaded and extracted directory with data.
- lhotse.recipes.prepare_daily_talk(corpus_dir, output_dir=None, num_jobs=1)[source]
Create RecordingSet and SupervisionSet manifests for DailyTalk from a raw corpus distribution.
- Parameters:
corpus_dir (
Union[Path,str]) – Pathlike, the path to the extracted corpus.output_dir (
Union[Path,str,None]) – Pathlike, the path where to write the manifests.
- Return type:
Tuple[RecordingSet,SupervisionSet]- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘audio’ and ‘supervisions’.
- lhotse.recipes.prepare_dihard3(dev_audio_dir, eval_audio_dir, output_dir=None, uem_manifest=True, num_jobs=1)[source]
Prepare manifests for the DIHARD III corpus. We create two manifests: one with recordings, and the other one with supervisions containing speaker id and timestamps.
- Parameters:
dev_audio_dir (
Union[Path,str]) – Path to downloaded DIHARD III dev corpus (LDC2020E12), e.g. /data/corpora/LDC/LDC2020E12eval_audio_dir (
Union[Path,str]) – Path to downloaded DIHARD III eval corpus (LDC2021E02), e.g. /data/corpora/LDC/LDC2021E02output_dir (
Union[Path,str,None]) – Directory where the manifests should be written. Can be omitted to avoid writing.uem_manifest (
Optional[bool]) – If True, also return a SupervisionSet describing the UEM segments (see use in dataset.DiarizationDataset)num_jobs (
Optional[int]) – int (default = 1), number of jobs to scan corpus directory for recordings
- Return type:
Dict[str,Union[RecordingSet,SupervisionSet]]- Returns:
A dict with manifests. The keys are:
{'recordings', 'supervisions'}.
- lhotse.recipes.download_dipco(target_dir='.', force_download=False)[source]
Download and untar the dataset.
- Parameters:
target_dir (Union[Path,str]) – Pathlike, the path of the dir to store the dataset.
force_download (Optional[bool]) – Bool, if True, download the tars even if they already exist.
- Return type:
Path
- Returns:
the path to downloaded and extracted directory with data.
- lhotse.recipes.prepare_dipco(corpus_dir, output_dir=None, mic='mdm', normalize_text='kaldi', use_chime7_offset=False)[source]
Returns the manifests which consist of the Recordings and Supervisions.
- Parameters:
corpus_dir (Union[Path,str]) – Pathlike, the path of the data dir.
output_dir (Union[Path,str,None]) – Pathlike, the path where to write the manifests.
mic (Optional[str]) – str, the microphone type to use, choose from “ihm” (close-talk) or “mdm” (multi-microphone array) settings. For MDM, there are 5 array devices with 7 channels each, so the resulting recordings will have 35 channels.
normalize_text (Optional[str]) – str, the text normalization to apply. Choose from “none”, “upper”, or “kaldi”. “kaldi” is the default and is the same normalization used in Kaldi’s CHiME-6 recipe.
use_chime7_offset (Optional[bool]) – bool, if True, offset session IDs (from CHiME-7 challenge).
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]- Returns:
a Dict whose key is the dataset part (“dev” and “eval”), and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.
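The 35-channel figure for the MDM setting follows from 5 array devices with 7 channels each. The sketch below works through that arithmetic and maps an (array, microphone) pair to a flat channel index. The array-major ordering is an assumption made for illustration; the channel order actually produced by the recipe may differ, and `flat_channel` is a hypothetical helper.

```python
NUM_ARRAYS = 5      # array devices in the MDM setting
MICS_PER_ARRAY = 7  # channels per array device


def flat_channel(array_idx: int, mic_idx: int) -> int:
    """Map zero-based (array, mic) indices to a flat channel index (array-major, assumed)."""
    if not 0 <= array_idx < NUM_ARRAYS:
        raise ValueError(f"array_idx must be in [0, {NUM_ARRAYS}), got {array_idx}")
    if not 0 <= mic_idx < MICS_PER_ARRAY:
        raise ValueError(f"mic_idx must be in [0, {MICS_PER_ARRAY}), got {mic_idx}")
    return array_idx * MICS_PER_ARRAY + mic_idx


total_channels = NUM_ARRAYS * MICS_PER_ARRAY
print(total_channels, flat_channel(0, 0), flat_channel(4, 6))
```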
- lhotse.recipes.download_earnings21(target_dir='.', force_download=False, url='https://codeload.github.com/revdotcom/speech-datasets/zip/refs/heads/main')[source]
Download and untar the dataset. The extracted files are saved to target_dir/earnings21/. Please note that the GitHub repository contains other additional datasets; this call downloads all of them and then discards the extra ones.
- Parameters:
target_dir (Union[Path,str]) – Pathlike, the path of the dir to store the dataset.
force_download (Optional[bool]) – Bool, if True, download the tar file no matter whether it exists or not.
url (Optional[str]) – str, the url to download the dataset.
- Return type:
Path- Returns:
the path to downloaded and extracted directory with data.
- lhotse.recipes.prepare_earnings21(corpus_dir, output_dir=None, normalize_text=False)[source]
Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the
output_dir, it will simply read and return them.- Parameters:
corpus_dir (
Union[Path,str]) – Pathlike, the path of the data dir. The structure is expected to mimic the structure in the github repository, notably the mp3 files will be searched for in [corpus_dir]/media and transcriptions in the directory [corpus_dir]/transcripts/nlp_referencesoutput_dir (
Union[Path,str,None]) – Pathlike, the path where to write the manifests.normalize_text (
bool) – Bool, if True, normalize the text.
- Return type:
Union[RecordingSet,SupervisionSet]- Returns:
(recordings, supervisions) pair
Caution
The normalize_text option removes all punctuation and converts all upper case to lower case. This includes removing possibly important punctuation marks such as dashes and apostrophes.
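To see why this caution matters, the following sketch applies the kind of normalization described (strip punctuation, then lower-case) to a short sentence. It is an illustration under assumptions, not the recipe's code: `strip_punct_and_lower` is a hypothetical helper, and it only covers ASCII punctuation from Python's `string.punctuation`.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punct_and_lower(text: str) -> str:
    """Remove ASCII punctuation and convert the text to lower case."""
    return text.translate(_PUNCT_TABLE).lower()


before = "We're up 12% year-over-year."
after = strip_punct_and_lower(before)
print(after)
```

Note how "We're" collapses into "were" and the hyphenated phrase fuses into a single token, which may change how the text is scored or tokenized downstream.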
- lhotse.recipes.download_earnings22(target_dir='.', force_download=False, url='https://github.com/revdotcom/speech-datasets')[source]
Download and untar the dataset. The extracted files are saved to target_dir/earnings22/. Please note that the GitHub repository contains other additional datasets; this call downloads all of them and then discards the extra ones.
- Parameters:
target_dir (Union[Path,str]) – Pathlike, the path of the dir to store the dataset.
force_download (Optional[bool]) – Bool, if True, download the tar file no matter whether it exists or not.
url (Optional[str]) – str, the url to download the dataset.
- Return type:
Path- Returns:
the path to downloaded and extracted directory with data.
- lhotse.recipes.prepare_earnings22(corpus_dir, output_dir=None, normalize_text=False)[source]
Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the
output_dir, it will simply read and return them.- Parameters:
corpus_dir (
Union[Path,str]) – Pathlike, the path of the data dir. The structure is expected to mimic the structure in the github repository, notably the mp3 files will be searched for in [corpus_dir]/media and transcriptions in the directory [corpus_dir]/transcripts/nlp_referencesoutput_dir (
Union[Path,str,None]) – Pathlike, the path where to write the manifests.normalize_text (
bool) – Bool, if True, normalize the text.
- Return type:
Union[RecordingSet,SupervisionSet]- Returns:
(recordings, supervisions) pair
Caution
The normalize_text option removes all punctuation and converts all upper case to lower case. This includes removing possibly important punctuation marks such as dashes and apostrophes.
- lhotse.recipes.download_ears(target_dir='.', force_download=False)[source]
Download and unzip the EARS dataset.
- Parameters:
target_dir (
Union[Path,str]) – Pathlike, the path of the dir to store the dataset.force_download (
bool) – Bool, if True, download the tars no matter if the tars exist.
- Return type:
Path- Returns:
the path to downloaded and extracted directory with data.
- lhotse.recipes.prepare_ears(corpus_dir, output_dir=None, num_jobs=1)[source]
Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the
output_dir, it will simply read and return them.- Parameters:
corpus_dir (
Union[Path,str]) – Pathlike, the path of the data dir.output_dir (
Union[Path,str,None]) – Pathlike, the path where to write the manifests.num_jobs (
int) – the number of parallel workers parsing the data.
- Return type:
Dict[str,Union[RecordingSet,SupervisionSet]]- Returns:
a Dict whose keys are ‘recordings’ and ‘supervisions’.
- lhotse.recipes.download_edacc(target_dir='.', force_download=False, base_url='https://datashare.ed.ac.uk/download/')[source]
Download and extract the EDACC dataset.
- Parameters:
target_dir (
Union[Path,str]) – Pathlike, the path of the dir to store the dataset.force_download (
bool) – Bool, if True, download the data even if it exists.base_url (
str) – str, the url of the website used to fetch the archive from.
- Return type:
Path- Returns:
the path to downloaded and extracted directory with data.
- lhotse.recipes.prepare_edacc(corpus_dir, output_dir=None)[source]
Returns the manifests which consist of the Recordings and Supervisions.
- Parameters:
corpus_dir (
Union[Path,str]) – a path to the unzipped EDACC directory (has edacc_v1.0 inside).output_dir (
Union[Path,str,None]) – an optional path where to write the manifests.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]- Returns:
a dict with structure
{"dev|test": {"recordings|supervisions": <manifest>}}
- lhotse.recipes.prepare_eval2000(corpus_dir, output_dir, transcript_path=None, absolute_paths=False, num_jobs=1)[source]
Prepares manifests for Eval2000.
- Parameters:
corpus_dir – Path to global corpus
output_dir (
Union[Path,str]) – Directory where the manifests should be written. Can be omitted to avoid writing.absolute_paths (
bool) – Whether to return absolute or relative (to the corpus dir) paths for recordings.
- Return type:
Dict[str,Union[RecordingSet,SupervisionSet]]- Returns:
A dict with manifests. The keys are:
{'recordings', 'supervisions'}.
- lhotse.recipes.prepare_fisher_english(corpus_dir, output_dir, audio_dirs=['LDC2004S13', 'LDC2005S13'], transcript_dirs=['LDC2004T19', 'LDC2005T19'], absolute_paths=False, num_jobs=1)[source]
Prepares manifests for Fisher English Part 1, 2. Script assumes that audio_dirs and transcript_dirs are in the corpus_dir. We create two manifests: one with recordings, and the other one with text supervisions.
- Parameters:
corpus_dir – Path to Fisher corpus
audio_dirs (
List[str]) – List of dirs of audio corpora.transcript_dirs – List of dirs of transcript corpora.
output_dir (
Union[Path,str]) – Directory where the manifests should be written. Can be omitted to avoid writing.absolute_paths (
bool) – Whether to return absolute or relative (to the corpus dir) paths for recordings.
- Return type:
Dict[str,Union[RecordingSet,SupervisionSet]]- Returns:
A dict with manifests. The keys are:
{'recordings', 'supervisions'}.
- lhotse.recipes.prepare_fisher_spanish(audio_dir_path, transcript_dir_path, output_dir=None, absolute_paths=False)[source]
Prepares manifests for Fisher Spanish. We create two manifests: one with recordings, and the other one with text supervisions.
- Parameters:
audio_dir_path (
Union[Path,str]) – Path to audio directory (usually LDC2010S01).transcript_dir_path (
Union[Path,str]) – Path to transcript directory (usually LDC2010T04).output_dir (
Union[Path,str,None]) – Directory where the manifests should be written. Can be omitted to avoid writing.absolute_paths (
bool) – Whether to return absolute or relative (to the corpus dir) paths for recordings.
- Return type:
Dict[str,Union[RecordingSet,SupervisionSet]]- Returns:
A dict with manifests. The keys are:
{'recordings', 'supervisions'}.
- lhotse.recipes.download_fleurs(target_dir='.', languages='all', force_download=False)[source]
Download the specified fleurs datasets.
- Parameters:
target_dir (Pathlike) – The path to which the corpus will be downloaded.
languages (Optional[Union[str, Sequence[str]]]) – Optional list of str or str specifying which languages to download. The str specifier for a language has the ISOCODE_COUNTRYCODE format, and is all lower case. By default this is set to “all”, which will download the entire set of languages.
force_download (bool) – Specifies whether to overwrite an existing archive.
- Returns:
The root path of the downloaded data
- Return type:
Path
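The languages argument accepts either the string “all”, a single language specifier, or a sequence of specifiers in lower-case ISOCODE_COUNTRYCODE form. The sketch below shows one way a caller might normalize such input before passing it on. It is not lhotse's internal logic: `as_language_list` is a hypothetical helper, and the example codes are illustrative.

```python
from typing import List, Sequence, Union


def as_language_list(
    languages: Union[str, Sequence[str]] = "all",
) -> Union[str, List[str]]:
    """Return "all" unchanged, otherwise a list of lower-cased language specifiers."""
    if languages == "all":
        return "all"
    codes = [languages] if isinstance(languages, str) else list(languages)
    normalized = []
    for code in codes:
        code = code.strip().lower()
        # Specifiers are expected to look like ISOCODE_COUNTRYCODE (underscore-separated).
        if "_" not in code:
            raise ValueError(f"expected ISOCODE_COUNTRYCODE, got {code!r}")
        normalized.append(code)
    return normalized


print(as_language_list("all"))
print(as_language_list("EN_US"))
print(as_language_list(["en_us", "pl_pl"]))
```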
- lhotse.recipes.prepare_fleurs(corpus_dir, output_dir=None, languages='all', num_jobs=1)[source]
Prepares the manifest for all of the FLEURS languages requested.
- Parameters:
corpus_dir (Pathlike) – Path to the root where the FLEURS data are stored.
output_dir (Pathlike) – The directory where the .jsonl.gz manifests will be written.
languages – str or str sequence specifying the languages to prepare. The str ‘all’ prepares all 102 languages.
- Returns:
The manifest
- Return type:
Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]
- lhotse.recipes.prepare_gale_arabic(audio_dirs, transcript_dirs, output_dir=None, absolute_paths=True)[source]
Prepare manifests for GALE Arabic Broadcast speech corpus.
- Parameters:
audio_dirs (
List[Union[Path,str]]) – List of paths to audio corpora.transcript_dirs – List of paths to transcript corpora.
output_dir (
Union[Path,str,None]) – Directory where the manifests should be written. Can be omitted to avoid writing.
- Return type:
Dict[str,Union[RecordingSet,SupervisionSet]]- Returns:
A dict with manifests. The keys are:
{'recordings', 'supervisions'}.
- lhotse.recipes.prepare_gale_mandarin(audio_dirs, transcript_dirs, output_dir=None, absolute_paths=True, segment_words=False)[source]
Prepare manifests for GALE Mandarin Broadcast speech corpus.
- Parameters:
audio_dirs (
List[Union[Path,str]]) – List of paths to audio corpora.transcript_dirs – List of paths to transcript corpora.
output_dir (
Union[Path,str,None]) – Directory where the manifests should be written. Can be omitted to avoid writing.absolute_paths (
Optional[bool]) – Whether to write absolute paths to audio sources (default = True)segment_words (
Optional[bool]) – Use jieba package to perform word segmentation (default = False)
- Return type:
Dict[str,Union[RecordingSet,SupervisionSet]]- Returns:
A dict with manifests. The keys are:
{'recordings', 'supervisions'}.
- lhotse.recipes.prepare_gigaspeech(corpus_dir, output_dir, dataset_parts='auto', num_jobs=1)[source]
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]
- lhotse.recipes.prepare_gigaspeech2(corpus_dir, output_dir=None, languages='auto', num_jobs=1)[source]
Returns the manifests which consist of the Recordings and Supervisions.
- Parameters:
corpus_dir (Union[Path,str]) – Path to the GigaSpeech 2 dataset.
output_dir (Union[Path,str,None]) – Pathlike, the path where to write the manifests.
languages (Union[str,Sequence[str]]) – ‘auto’ (prepare all discovered data) or a list of language codes.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]
- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.
- lhotse.recipes.download_gigast(target_dir='.', languages='all', force_download=False)[source]
Download GigaST dataset
- Parameters:
target_dir (
Union[Path,str]) – Pathlike, the path of the dir to store the dataset.languages (
Union[str,Sequence[str]]) – one of: ‘all’ (downloads all known languages); a single language code (e.g., ‘en’), or a list of language codes.force_download (
bool) – bool, if True, download the archive even if it already exists.
- Return type:
Path- Returns:
the path to the directory with the downloaded data.
- lhotse.recipes.prepare_gigast(corpus_dir, manifests_dir, output_dir, languages='auto', dataset_parts='auto')[source]
Returns the manifests which consist of the Recordings and Supervisions.
- Parameters:
corpus_dir (Union[Path,str]) – Path to the GigaST dataset.
manifests_dir (Union[Path,str]) – Path to the GigaSpeech manifests.
output_dir (Union[Path,str,None]) – Pathlike, the path where to write the manifests.
languages (Union[str,Sequence[str]]) – ‘auto’ (prepare all languages) or a list of language codes.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]
- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.
- lhotse.recipes.download_grid(target_dir='.', force_download=False)[source]
Download and extract the GRID dataset.
- Parameters:
target_dir (Union[Path,str]) – Pathlike, the path of the dir to store the dataset.
force_download (bool) – Bool, if True, download the archives even if they already exist.
- Return type:
Path- Returns:
the path to downloaded and extracted directory with data.
- lhotse.recipes.prepare_grid(corpus_dir, output_dir=None, with_supervisions=True, num_jobs=1)[source]
Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the
output_dir, it will simply read and return them.- Parameters:
corpus_dir (
Union[Path,str]) – Pathlike, the path of the data dir.output_dir (
Union[Path,str,None]) – Pathlike, the path where to write the manifests.with_supervisions (
bool) – bool, when False, we’ll only return recordings; when True, we’ll also return supervisions created from alignments, but might remove some recordings for which they are missing.num_jobs (
int) – int, number of parallel jobs.
- Return type:
Dict[str,Union[RecordingSet,SupervisionSet]]- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘audio’ and ‘supervisions’.
- lhotse.recipes.download_heroico(target_dir='.', force_download=False, url='http://www.openslr.org/resources/39')[source]
- Return type:
Path
- lhotse.recipes.prepare_heroico(speech_dir, transcript_dir, output_dir=None)[source]
Returns the manifests which consist of the Recordings and Supervisions
- Parameters:
speech_dir (
Union[Path,str]) – Pathlike, the path of the speech data dir.transcript_dir – Pathlike, the path of the transcript data dir.
output_dir (
Union[Path,str,None]) – Pathlike, the path where to write the manifests.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]- Returns:
a Dict whose key is the fold, and the value is Dicts with the keys ‘audio’ and ‘supervisions’.
- lhotse.recipes.download_hifitts(target_dir='.', force_download=False, base_url='http://www.openslr.org/resources')[source]
Download and untar the HiFi TTS dataset.
- Parameters:
target_dir (
Union[Path,str]) – Pathlike, the path of the dir to store the dataset.force_download (
Optional[bool]) – Bool, if True, download the tars no matter if the tars exist.base_url (
Optional[str]) – str, the url of the OpenSLR resources.
- Return type:
Path- Returns:
the path to downloaded and extracted directory with data.
- lhotse.recipes.prepare_hifitts(corpus_dir, output_dir=None, num_jobs=1)[source]
Prepare manifests for the HiFiTTS dataset.
- Parameters:
corpus_dir (
Union[Path,str]) – Path or str, the path to the downloaded corpus main directory.output_dir (
Union[Path,str,None]) – Path or str, the path where to write the manifests.num_jobs (
int) – How many concurrent workers to use for preparing each dataset partition.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]- Returns:
a dict with manifests for all the partitions (example query:
manifests['92_clean_train']['recordings']).
- lhotse.recipes.download_himia(target_dir='.', dataset_parts='auto', force_download=False, base_url='http://www.openslr.org/resources')[source]
Download and untar HI_MIA and HI_MIA_CW datasets.
- Parameters:
target_dir (Union[Path,str]) – Pathlike, the path of the dir to store the dataset.
dataset_parts (Union[str,Sequence[str],None]) – “auto”, “himia”, or a list of splits (e.g. “train”, “dev”, “test”, “cw_test”) to download.
force_download (bool) – Bool, if True, download the tars no matter if the tars exist.
base_url (str) – str, the url of the OpenSLR resources.
- Return type:
Path- Returns:
the path to extracted directory with data.
- lhotse.recipes.prepare_himia(corpus_dir, dataset_parts='auto', output_dir=None)[source]
Returns the manifests which consist of the Recordings and Supervisions.
- Parameters:
corpus_dir (Union[Path,str]) – Pathlike, the path of the data dir.
dataset_parts (Union[str,Sequence[str]]) – “auto”, “himia”, or a list of splits (e.g. “train”, “dev”, “test”, “cw_test”) to prepare.
output_dir (Union[Path,str,None]) – Pathlike, the path where to write the manifests.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]
- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.
- lhotse.recipes.prepare_icmcasr(corpus_dir, output_dir=None, mic='ihm', num_jobs=1)[source]
Returns the manifests which consist of the Recordings and Supervisions.
- Parameters:
corpus_dir (Union[Path,str]) – Path to the ICMC-ASR dataset.
output_dir (Union[Path,str,None]) – Pathlike, the path where to write the manifests.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]
- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.
- lhotse.recipes.download_icsi(target_dir='.', audio_dir=None, transcripts_dir=None, force_download=False, url='http://groups.inf.ed.ac.uk/ami', mic='ihm')[source]
Download ICSI audio and annotations for provided microphone setting.
- Parameters:
target_dir (Union[Path,str]) – Pathlike, the path in which audio and transcripts dir are created by default.
audio_dir (Union[Path,str,None]) – Pathlike (default = ‘<target_dir>/audio’), the path to store the audio data.
transcripts_dir (Union[Path,str,None]) – Pathlike (default = ‘<target_dir>/transcripts’), path to store the transcripts data.
force_download (Optional[bool]) – bool (default = False), if True, download even if file is present.
url (Optional[str]) – str (default = ‘http://groups.inf.ed.ac.uk/ami’), download URL.
mic (Optional[str]) – str {‘ihm’, ‘ihm-mix’, ‘sdm’, ‘mdm’}, type of mic setting.
- Return type:
Path
- Returns:
the path to downloaded and extracted directory with data.
- lhotse.recipes.prepare_icsi(audio_dir, transcripts_dir=None, output_dir=None, mic='ihm', normalize_text='kaldi', save_to_wav=False)[source]
Returns the manifests which consist of the Recordings and Supervisions.
- Parameters:
audio_dir (Union[Path,str]) – Pathlike, the path which holds the audio data.
transcripts_dir (Union[Path,str,None]) – Pathlike, the path which holds the transcripts data.
output_dir (Union[Path,str,None]) – Pathlike, the path where to write the manifests; None means manifests aren’t stored on disk.
mic (Optional[str]) – str {‘ihm’, ‘ihm-mix’, ‘sdm’, ‘mdm’}, type of mic to use.
normalize_text (str) – str {‘none’, ‘upper’, ‘kaldi’}, normalization of text.
save_to_wav (bool) – bool, whether to save the sph audio to wav format.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]
- Returns:
a Dict whose key is (‘train’, ‘dev’, ‘test’), and the values are dicts of manifests under keys ‘recordings’ and ‘supervisions’.
- lhotse.recipes.prepare_iwslt22_ta(corpus_dir, splits, output_dir=None, normalize_text=False, langs=['ta', 'eng'], num_jobs=1)[source]
Prepares manifests for the train, dev, and test1 splits.
- Parameters:
corpus_dir (Union[Path,str]) – Path to LDC2022E01, the path of the data dir.
splits (Union[Path,str]) – Path to splits from https://github.com/kevinduh/iwslt22-dialect
normalize_text (bool) – Bool, if True, Arabic text normalization is performed from https://aclanthology.org/2022.iwslt-1.29.pdf.
output_dir (Union[Path,str,None]) – Directory where the manifests should be written. Can be omitted to avoid writing.
langs (Optional[List[str]]) – str, list of language abbreviations for source and target languages.
num_jobs (int) – int, the number of jobs to use for parallel processing.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]- Returns:
A dict with manifests. The keys are:
{'recordings', 'supervisions'}.
- lhotse.recipes.prepare_ksponspeech(corpus_dir, dataset_parts='all', output_dir=None, num_jobs=1, normalize_text='default')[source]
Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the
output_dir, it will simply read and return them.- Parameters:
corpus_dir (
Union[Path,str]) – Pathlike, the path of the data dir.dataset_parts (
Union[str,Sequence[str]]) – string or sequence of strings representing dataset part names, e.g. ‘train’, ‘dev’. By default, we will infer all parts.output_dir (
Union[Path,str,None]) – Pathlike, the path where to write the manifests.num_jobs (
int) – int, number of parallel threads used for ‘parse_utterance’ calls.normalize_text (
str) – str, the text normalization type, “default” or “none”.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘audio’ and ‘supervisions’.
- lhotse.recipes.prepare_l2_arctic(corpus_dir, output_dir=None)[source]
Prepares and returns the L2 Arctic manifests which consist of Recordings and Supervisions.
- Parameters:
corpus_dir (
Union[Path,str]) – Pathlike, the path of the data dir.output_dir (
Union[Path,str,None]) – Pathlike, the path where to write the manifests.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]- Returns:
a dict with keys “read” and “spontaneous”. Each holds another dict of {‘recordings’: …, ‘supervisions’: …}
- lhotse.recipes.download_libricss(target_dir, force_download=False)[source]
Downloads the LibriCSS data from Google Drive and extracts it.
- Parameters:
target_dir (Union[Path,str]) – the directory where the LibriCSS data will be saved.
force_download (bool) – if True, it will download the LibriCSS data even if it is already present.
- Return type:
Path
- Returns:
the path to downloaded and extracted directory with data.
- lhotse.recipes.prepare_libricss(corpus_dir, output_dir=None, type='mdm', segmented_cuts=False)[source]
Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the
output_dir, it will simply read and return them.
NOTE: The recordings contain all 7 channels. If you want to use only one channel, you can use either recording.load_audio(channel=0) or MonoCut(id=..., recording=recording, channel=0) while creating the CutSet.
- Parameters:
corpus_dir (
Union[Path,str]) – Pathlike, the path to the extracted corpus.output_dir (
Union[Path,str,None]) – Pathlike, the path where to write the manifests.type (
str) – str, the type of data to prepare (‘mdm’, ‘sdm’, ‘ihm-mix’, or ‘ihm’). These settings are similar to the ones in AMI and ICSI recipes.segmented_cuts (
bool) – bool, if True, it will return 1-minute segments (as described in the original paper) in the form of a CutSet. These are saved under the index segments in the returned Dict. May be useful for evaluating multi-talker ASR systems, e.g., in this paper: https://arxiv.org/abs/2109.08555.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet,CutSet]]]- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘audio’ and ‘supervisions’.
- lhotse.recipes.prepare_librilight(corpus_dir, output_dir=None, num_jobs=1)[source]
Returns the manifests which consist of the Recordings and Supervisions.
- Parameters:
corpus_dir (Union[Path,str]) – Path to the LibriLight dataset.
output_dir (Union[Path,str,None]) – Pathlike, the path where to write the manifests.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]
- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.
- lhotse.recipes.download_librimix(target_dir='.')[source]
Download LibriMix metadata.
- Return type:
Path
- lhotse.recipes.prepare_librimix(librispeech_root_path, wham_recset_root_path, librimix_metadata_path, workdir, output_dir=None, n_src=2, num_jobs=1)[source]
Prepare LibriMix manifests for multi-speaker mixtures.
- Return type:
Dict[str,Dict[str,CutSet]]
- Args:
librispeech_root_path: Path to LibriSpeech manifests
wham_recset_root_path: Path to WHAM noise manifests
librimix_metadata_path: Path to LibriMix metadata
output_dir: Directory to save manifests
workdir: Working directory for temporary files
n_src: Number of sources for mixing
num_jobs: Number of parallel threads used for processing (default: 1)
- Returns:
Dict with keys for each split containing ‘cuts’ for both clean and noisy versions
- lhotse.recipes.download_librimix_mini(target_dir='.', force_download=False, url='https://zenodo.org/record/3871592/files/MiniLibriMix.zip')[source]
- Return type:
Path
- lhotse.recipes.prepare_librimix_mini(librimix_csv, output_dir=None, with_precomputed_mixtures=False, sampling_rate=16000, min_segment_seconds=3.0)[source]
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]
- lhotse.recipes.download_librispeech(target_dir='.', dataset_parts='mini_librispeech', force_download=False, alignments=False, base_url='http://www.openslr.org/resources', alignments_url='https://drive.google.com/uc?id=1WYfgr31T-PPwMcxuAq09XZfHQO5Mw8fE')[source]
Download and untar the dataset, supporting both LibriSpeech and MiniLibrispeech
- Parameters:
target_dir (Union[Path,str]) – Pathlike, the path of the dir to store the dataset.
dataset_parts (Union[str,Sequence[str],None]) – “librispeech”, “mini_librispeech”, or a list of splits (e.g. “dev-clean”) to download.
force_download (bool) – Bool, if True, download the tars no matter if the tars exist.
alignments (bool) – whether to download the alignments. The original source is: https://github.com/CorentinJ/librispeech-alignments
base_url (str) – str, the url of the OpenSLR resources.
alignments_url (str) – str, the url of LibriSpeech word alignments.
- Return type:
Path- Returns:
the path to downloaded and extracted directory with data.
- lhotse.recipes.prepare_librispeech(corpus_dir, alignments_dir=None, dataset_parts='auto', output_dir=None, normalize_text='none', num_jobs=1)[source]
Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the
output_dir, it will simply read and return them.- Parameters:
corpus_dir (Union[Path,str]) – Pathlike, the path of the data dir.
alignments_dir (Union[Path,str,None]) – Pathlike, the path of the alignments dir. By default, it is the same as corpus_dir.
dataset_parts (Union[str,Sequence[str]]) – string or sequence of strings representing dataset part names, e.g. ‘train-clean-100’, ‘train-clean-5’, ‘dev-clean’. By default we will infer which parts are available in corpus_dir.
output_dir (Union[Path,str,None]) – Pathlike, the path where to write the manifests.
normalize_text (str) – str, “none” or “lower”, for “lower” the transcripts are converted to lower-case.
num_jobs (int) – int, number of parallel threads used for ‘parse_utterance’ calls.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘audio’ and ‘supervisions’.
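A typical workflow chains the download and prepare functions documented above. The sketch below is a minimal example under assumptions: it treats the path returned by download_librispeech as a valid corpus_dir for prepare_librispeech, and the wrapper `download_and_prepare` with its `download_fn`/`prepare_fn` hooks is hypothetical, added only so the flow can be exercised with stubs when lhotse or the data is unavailable.

```python
from pathlib import Path
from typing import Any, Callable, Dict, Optional, Union


def download_and_prepare(
    target_dir: Union[Path, str],
    output_dir: Optional[Union[Path, str]] = None,
    num_jobs: int = 1,
    download_fn: Optional[Callable[..., Path]] = None,
    prepare_fn: Optional[Callable[..., Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Download Mini LibriSpeech, then build manifests for every discovered part."""
    if download_fn is None or prepare_fn is None:
        # Imported lazily so that lhotse is only required when no stubs are supplied.
        from lhotse.recipes import download_librispeech, prepare_librispeech

        download_fn = download_fn or download_librispeech
        prepare_fn = prepare_fn or prepare_librispeech

    # Assumption: the returned path can be passed on directly as corpus_dir.
    corpus_dir = download_fn(target_dir=target_dir, dataset_parts="mini_librispeech")
    return prepare_fn(
        corpus_dir=corpus_dir,
        dataset_parts="auto",
        output_dir=output_dir,
        num_jobs=num_jobs,
    )
```

With lhotse installed, calling `download_and_prepare("data", output_dir="data/manifests")` would be expected to return one entry per dataset part. Passing output_dir lets a later call reuse the manifests already written there, as described above.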
- lhotse.recipes.download_libritts(target_dir='.', use_librittsr=False, dataset_parts='all', force_download=False, base_url='http://www.openslr.org/resources')[source]
Download and untar the LibriTTS dataset.
- Parameters:
target_dir (Union[Path,str]) – Pathlike, the path of the dir to store the dataset.
use_librittsr (bool) – Bool, if True, we’ll download the LibriTTS-R dataset instead.
dataset_parts (Union[str,Sequence[str],None]) – “all”, or a list of splits (e.g. “dev-clean”) to download.
force_download (Optional[bool]) – Bool, if True, download the tars even if they already exist.
base_url (Optional[str]) – str, the url of the OpenSLR resources.
- Return type:
Path
- Returns:
the path to downloaded and extracted directory with data.
- lhotse.recipes.download_librittsr(target_dir='.', dataset_parts='all', force_download=False, base_url='http://www.openslr.org/resources')[source]
Download and untar the LibriTTS-R dataset.
- Parameters:
target_dir (Union[Path,str]) – Pathlike, the path of the dir to store the dataset.
dataset_parts (Union[str,Sequence[str],None]) – “all”, or a list of splits (e.g. “dev-clean”) to download.
force_download (Optional[bool]) – Bool, if True, download the tars even if they already exist.
base_url (Optional[str]) – str, the url of the OpenSLR resources.
- Return type:
Path
- Returns:
the path to downloaded and extracted directory with data.
- lhotse.recipes.prepare_libritts(corpus_dir, dataset_parts='all', output_dir=None, num_jobs=1, link_previous_utt=False)[source]
Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the output_dir, it will simply read and return them.
- Parameters:
corpus_dir (Union[Path,str]) – Pathlike, the path of the data dir.
dataset_parts (Union[str,Sequence[str]]) – string or sequence of strings representing dataset part names, e.g. ‘train-clean-100’, ‘train-clean-5’, ‘dev-clean’. By default we will infer which parts are available in corpus_dir.
output_dir (Union[Path,str,None]) – Pathlike, the path where to write the manifests.
num_jobs (int) – the number of parallel workers parsing the data.
link_previous_utt (bool) – If True, adds the previous utterance ID to supervisions. Useful for reconstructing chains of utterances as they were read. If the previous utterance was skipped from the LibriTTS datasets, the previous_utt label is None.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]
- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘audio’ and ‘supervisions’.
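The behaviour described for link_previous_utt can be illustrated with a minimal, self-contained sketch. The function name, the utterance IDs, and the idea of passing an ordered list plus a set of kept IDs are all assumptions made for illustration; the recipe may derive this information differently.

```python
from typing import Dict, List, Optional, Set


def link_previous_utterances(
    ordered_ids: List[str], kept_ids: Set[str]
) -> Dict[str, Optional[str]]:
    """Map each kept utterance ID to its predecessor in reading order.

    Assumption: ordered_ids lists utterances in the order they were read.
    If the predecessor was skipped (not in kept_ids), the link is None.
    """
    links: Dict[str, Optional[str]] = {}
    previous: Optional[str] = None
    for utt_id in ordered_ids:
        if utt_id in kept_ids:
            links[utt_id] = previous if previous in kept_ids else None
        previous = utt_id
    return links


# Hypothetical IDs: "u3" was skipped, so "u4" has no usable predecessor.
links = link_previous_utterances(["u1", "u2", "u3", "u4"], {"u1", "u2", "u4"})
print(links)
```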
- lhotse.recipes.prepare_librittsr(corpus_dir, dataset_parts='all', output_dir=None, num_jobs=1, link_previous_utt=False)
Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the output_dir, it will simply read and return them.
- Parameters:
corpus_dir (Union[Path,str]) – Pathlike, the path of the data dir.
dataset_parts (Union[str,Sequence[str]]) – string or sequence of strings representing dataset part names, e.g. ‘train-clean-100’, ‘train-clean-5’, ‘dev-clean’. By default we will infer which parts are available in corpus_dir.
output_dir (Union[Path,str,None]) – Pathlike, the path where to write the manifests.
num_jobs (int) – the number of parallel workers parsing the data.
link_previous_utt (bool) – If True, adds the previous utterance ID to supervisions. Useful for reconstructing chains of utterances as they were read. If the previous utterance was skipped from the LibriTTS datasets, the previous_utt label is None.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]
- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘audio’ and ‘supervisions’.
- lhotse.recipes.prepare_ljspeech(corpus_dir, output_dir=None)[source]
Returns the manifests which consist of the Recordings and Supervisions.
- Parameters:
corpus_dir (Union[Path,str]) – Pathlike, the path of the data dir.
output_dir (Union[Path,str,None]) – Pathlike, the path where to write the manifests.
- Return type:
Dict[str,Union[RecordingSet,SupervisionSet]]
- Returns:
The RecordingSet and SupervisionSet with the keys ‘audio’ and ‘supervisions’.
- lhotse.recipes.download_magicdata(target_dir='.', force_download=False, base_url='http://www.openslr.org/resources')[source]
Download and untar the dataset.
- Parameters:
target_dir (Union[Path,str]) – Pathlike, the path of the dir to store the dataset.
force_download (bool) – Bool, if True, download the tars even if they already exist.
base_url (str) – str, the url of the OpenSLR resources.
- Return type:
Path
- Returns:
the path to downloaded and extracted directory with data.
- lhotse.recipes.prepare_magicdata(corpus_dir, output_dir=None)[source]
Returns the manifests which consist of the Recordings and Supervisions.
- Parameters:
corpus_dir (Union[Path,str]) – Pathlike, the path of the data dir.
output_dir (Union[Path,str,None]) – Pathlike, the path where to write the manifests.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]
- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.
- lhotse.recipes.download_medical(target_dir='.', force_download=False)[source]
Download and unzip Medical dataset.
- Parameters:
target_dir (Union[Path,str]) – Pathlike, the path of the dir to store the dataset.
force_download (bool) – bool, if True, download the archive even if it already exists.
- Return type:
Path
- Returns:
the path to downloaded and extracted directory with data.
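The force_download flag shared by the download functions follows a common skip-if-present pattern. The sketch below shows that pattern with the standard library only; needs_download and the archive name are hypothetical, and no network access takes place.

```python
import tempfile
from pathlib import Path


def needs_download(archive_path: Path, force_download: bool = False) -> bool:
    """Return True when the archive should be fetched (hypothetical helper)."""
    return force_download or not archive_path.is_file()


with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "dataset.zip"  # hypothetical archive name
    first = needs_download(archive)  # nothing on disk yet
    archive.write_bytes(b"placeholder")  # pretend the download finished
    second = needs_download(archive)  # archive already exists
    forced = needs_download(archive, force_download=True)  # re-download anyway

print(first, second, forced)
```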
- lhotse.recipes.prepare_medical(corpus_dir, output_dir=None, num_jobs=1)[source]
Returns the manifests which consist of the Recordings and Supervisions.
- Parameters:
corpus_dir (Union[Path,str]) – Path to the Medical dataset.
output_dir (Union[Path,str,None]) – Pathlike, the path where to write the manifests.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]
- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.
- lhotse.recipes.prepare_mgb2(corpus_dir, output_dir, text_cleaning=True, buck_walter=False, num_jobs=1, mer_thresh=80)[source]
Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the output_dir, it will simply read and return them.
- Parameters:
corpus_dir (Union[Path,str]) – Pathlike, the path of the data dir.
output_dir (Union[Path,str]) – Pathlike, the path where to write the manifests.
text_cleaning (bool) – Bool, if True, basic text cleaning is performed (similar to ESPNet recipe).
buck_walter (bool) – Bool, use Buckwalter transliteration.
num_jobs (int) – int, the number of jobs to use for parallel processing.
mer_thresh (int) – int, filter out segments based on MER (Match Error Rate).
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]
- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘audio’ and ‘supervisions’.
Note
Unlike other recipes, output_dir is not Optional here because we write the manifests to the output directory while processing to avoid OOM issues, since it is a large dataset.
Caution
The text_cleaning option removes all punctuation and diacritics.
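To make the caution above concrete, the following sketch removes combining marks and punctuation using Unicode categories. It is an illustration of the general technique under the assumption that diacritics are encoded as combining marks (category Mn); it is not the cleaning code used by the recipe, and the function name is hypothetical.

```python
import unicodedata


def strip_punctuation_and_diacritics(text: str) -> str:
    """Drop combining marks (category Mn) and punctuation (categories P*)."""
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Mn" or category.startswith("P"):
            continue
        kept.append(ch)
    # Collapse whitespace left behind by removed characters.
    return " ".join("".join(kept).split())


# An Arabic word written with short-vowel marks, followed by an Arabic comma.
sample = "\u0645\u064e\u0631\u0652\u062d\u064e\u0628\u064b\u0627\u060c"
cleaned = strip_punctuation_and_diacritics(sample)
print(cleaned)
```

Only the base letters survive, which is why the option should be left off when punctuation or diacritics matter downstream.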
- lhotse.recipes.prepare_mls(corpus_dir, output_dir=None, opus=True, num_jobs=1)[source]
Prepare Multilingual LibriSpeech corpus.
Returns a dict structured like the following:
{
    'english': {
        'train': {'recordings': RecordingSet(...), 'supervisions': SupervisionSet(...)},
        'dev': ...,
        'test': ...
    },
    'polish': { ... },
    ...
}
- Parameters:
corpus_dir (Union[Path,str]) – Path to the corpus root (directories with specific languages should be inside).
output_dir (Union[Path,str,None]) – Optional path where the manifests should be stored.
opus (bool) – Should we scan for OPUS files (otherwise we’ll look for FLAC files).
num_jobs (int) – How many jobs should be used for creating recording manifests.
- Return type:
Dict[str,Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]]
- Returns:
A dict with structure: d[language][split] = {recordings, supervisions}.
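The nested d[language][split] layout can be walked with two loops. In this sketch, placeholder strings stand in for the RecordingSet and SupervisionSet objects so that it runs without lhotse or the corpus; iter_splits and the sample dictionary are hypothetical.

```python
from typing import Dict, Iterator, Tuple

# Placeholder strings stand in for RecordingSet / SupervisionSet objects.
manifests: Dict[str, Dict[str, Dict[str, str]]] = {
    "english": {
        "train": {"recordings": "<RecordingSet>", "supervisions": "<SupervisionSet>"},
        "dev": {"recordings": "<RecordingSet>", "supervisions": "<SupervisionSet>"},
    },
    "polish": {
        "test": {"recordings": "<RecordingSet>", "supervisions": "<SupervisionSet>"},
    },
}


def iter_splits(
    d: Dict[str, Dict[str, Dict[str, str]]]
) -> Iterator[Tuple[str, str, Dict[str, str]]]:
    """Yield (language, split, manifests) for every language/split pair."""
    for language, splits in d.items():
        for split, parts in splits.items():
            yield language, split, parts


for language, split, parts in iter_splits(manifests):
    print(language, split, sorted(parts))
```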
- lhotse.recipes.download_mobvoihotwords(target_dir='.', force_download=False, base_url='http://www.openslr.org/resources')[source]
Download and untar the dataset.
- Parameters:
target_dir (Union[Path,str]) – Pathlike, the path of the dir to store the dataset.
force_download (Optional[bool]) – Bool, if True, download the tars even if they already exist.
base_url (Optional[str]) – str, the url of the OpenSLR resources.
- Return type:
Path
- Returns:
the path to downloaded and extracted directory with data.
- lhotse.recipes.prepare_mobvoihotwords(corpus_dir, output_dir=None)[source]
Returns the manifests which consist of the Recordings and Supervisions.
- Parameters:
corpus_dir (Union[Path,str]) – Pathlike, the path of the data dir.
output_dir (Union[Path,str,None]) – Pathlike, the path where to write the manifests.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]
- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘audio’ and ‘supervisions’.
- lhotse.recipes.download_mtedx(target_dir='.', languages='all')[source]
Download and untar the dataset.
- Parameters:
target_dir – Pathlike, the path of the directory where the mtedx_corpus directory will be created and to which data will be downloaded.
languages – A str or sequence of strings specifying which languages to download. The default, ‘all’, downloads all available languages.
- Return type:
Path
- Returns:
the path to downloaded and extracted directory with data.
- lhotse.recipes.prepare_mtedx(corpus_dir, output_dir=None, languages='all', num_jobs=1)[source]
Prepares manifest of all MTEDx languages requested.
- Parameters:
corpus_dir (Union[Path,str]) – Path to the root where MTEDx data was downloaded. It should be called mtedx_corpus.
output_dir (Union[Path,str,None]) – Root directory where .json manifests are stored.
languages (Union[str,Sequence[str],None]) – str or str sequence specifying the languages to prepare. The str ‘all’ prepares all languages.
- Return type:
Dict[str,Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]]
- lhotse.recipes.download_musan(target_dir='.', url='https://www.openslr.org/resources/17/musan.tar.gz', force_download=False)[source]
Download and untar the MUSAN corpus.
- Parameters:
target_dir (Union[Path,str]) – Pathlike, the path of the dir to store the dataset.
url (Optional[str]) – str, the url that downloads file called “musan.tar.gz”.
force_download (Optional[bool]) – bool, if True, download the archive even if it already exists.
- Return type:
Path
- lhotse.recipes.prepare_musan(corpus_dir, output_dir=None, parts=('music', 'speech', 'noise'), use_vocals=True)[source]
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]
- lhotse.recipes.prepare_nsc(corpus_dir, dataset_part='PART3_SameCloseMic', output_dir=None, num_jobs=1)[source]
Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the output_dir, it will simply read and return them.
- Parameters:
corpus_dir (Union[Path,str]) – Pathlike, the path to the raw corpus distribution.
dataset_part (str) – str, name of the dataset part to be prepared.
output_dir (Union[Path,str,None]) – Pathlike, the path where to write the manifests.
- Return type:
Dict[str,Union[RecordingSet,SupervisionSet]]
- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘audio’ and ‘supervisions’.
- lhotse.recipes.prepare_peoples_speech(corpus_dir, output_dir=None, num_jobs=1)[source]
Prepare RecordingSet and SupervisionSet manifests for The People’s Speech.
- Parameters:
corpus_dir (Union[Path,str]) – Pathlike, the path of the main data dir.
output_dir (Union[Path,str,None]) – Pathlike, the path where to write the manifests.
- Return type:
Dict[str,Union[RecordingSet,SupervisionSet]]
- Returns:
a dict with keys “recordings” and “supervisions” with lazily opened manifests.
- lhotse.recipes.download_reazonspeech(target_dir='.', dataset_parts='auto', num_jobs=1)[source]
Download the ReazonSpeech dataset.
- Parameters:
target_dir (Union[Path,str]) – Pathlike, the path of the dir to store the dataset.
dataset_parts (Union[str,Sequence[str],None]) – the parts of the dataset to download (e.g. small, medium, or large).
num_jobs (int) – the number of processes to download and format.
- Return type:
Path
- Returns:
the path to downloaded data and the JSON file.
- lhotse.recipes.prepare_reazonspeech(corpus_dir, output_dir, num_jobs=1)[source]
Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the output_dir, it will simply read and return them.
- Parameters:
corpus_dir (Union[Path,str]) – Pathlike, the path of the data dir.
output_dir (Union[Path,str,None]) – Pathlike, the path where to write the manifests.
num_jobs (int) – int, number of parallel threads used for ‘parse_utterance’ calls.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]
- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘audio’ and ‘supervisions’.
- lhotse.recipes.prepare_radio(corpus_dir, output_dir=None, min_segment_duration=0.5, num_jobs=4)[source]
Return the manifests which consist of recordings and supervisions.
- Parameters:
corpus_dir (Union[Path,str]) – Path to the collected radio samples.
output_dir (Union[Path,str,None]) – Pathlike, the path where manifests are written.
- Return type:
Dict[str,Union[RecordingSet,SupervisionSet]]
- Returns:
A Dict whose key is the dataset part and the value is a Dict with keys ‘recordings’ and ‘supervisions’.
- lhotse.recipes.download_rir_noise(target_dir='.', url='https://www.openslr.org/resources/28/rirs_noises.zip', force_download=False)[source]
Download and untar the RIR Noise corpus.
- Parameters:
target_dir (Union[Path,str]) – Pathlike, the path of the dir to store the dataset.
url (Optional[str]) – str, the url that downloads file called “rirs_noises.zip”.
force_download (Optional[bool]) – bool, if True, download the archive even if it already exists.
- Return type:
Path
- lhotse.recipes.prepare_rir_noise(corpus_dir, output_dir=None, parts=('point_noise', 'iso_noise', 'real_rir', 'sim_rir'))[source]
Prepare the RIR Noise corpus.
- Parameters:
corpus_dir (Union[Path,str]) – Pathlike, the path of the dir to store the dataset.
output_dir (Union[Path,str,None]) – Pathlike, the path of the dir to write the manifests.
parts (Sequence[str]) – Sequence[str], the parts of the dataset to prepare.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,CutSet]]]
The corpus contains 4 things: point-source noises (point_noise), isotropic noises (iso_noise), real RIRs (real_rir), and simulated RIRs (sim_rir). We will prepare these parts in the corresponding dict keys.
- lhotse.recipes.prepare_slu(corpus_dir, output_dir=None)[source]
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]
- lhotse.recipes.download_speechcommands(speechcommands_version='2', target_dir='.', force_download=False)[source]
Download and unzip the Speech Commands dataset.
- Parameters:
speechcommands_version (str) – str, dataset version.
target_dir (Union[Path,str]) – Pathlike, the path of the dir to store the dataset.
force_download (bool) – bool, if True, download the archive even if it already exists.
- Return type:
Path
- Returns:
the path to downloaded and extracted directory with data.
- lhotse.recipes.prepare_speechcommands(speechcommands_version, corpus_dir, output_dir=None)[source]
Returns the manifests which consist of the Recordings and Supervisions.
- Parameters:
speechcommands_version (str) – str, dataset version.
corpus_dir (Union[Path,str]) – Path to the Speech Commands dataset.
output_dir (Union[Path,str,None]) – Pathlike, the path where to write the manifests.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]
- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.
- lhotse.recipes.download_spgispeech(target_dir='.')[source]
Download and untar the dataset.
NOTE: This function just returns with a message since SPGISpeech is not available for direct download.
- Parameters:
target_dir (Union[Path,str]) – Pathlike, the path of the dir to store the dataset.
- Return type:
None
- lhotse.recipes.prepare_spgispeech(corpus_dir, output_dir, normalize_text=True, num_jobs=1)[source]
Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the output_dir, it will simply read and return them.
- Parameters:
corpus_dir (Union[Path,str]) – Pathlike, the path of the data dir.
output_dir (Union[Path,str]) – Pathlike, the path where to write the manifests.
normalize_text (bool) – Bool, if True, normalize the text (similar to ESPNet recipe).
num_jobs (int) – int, the number of jobs to use for parallel processing.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]
- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘audio’ and ‘supervisions’.
Note
Unlike other recipes, output_dir is not Optional here because we write the manifests to the output directory while processing to avoid OOM issues, since it is a large dataset.
Caution
The normalize_text option removes all punctuation and converts all upper case to lower case. This includes removing possibly important punctuations such as dashes and apostrophes.
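The caution above is easy to see on a short example. The sketch below strips ASCII punctuation and lower-cases the text with the standard library; it is an approximation written for illustration (the function name is hypothetical) and not the normalization code of the recipe.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_like_caution(text: str) -> str:
    """Remove ASCII punctuation and lower-case the text (illustrative only)."""
    return text.translate(_PUNCT_TABLE).lower()


normalized = normalize_like_caution("Year-over-year, that's up 5%.")
print(normalized)
```

Note how the dashes and the apostrophe disappear, merging “Year-over-year” into a single token.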
- lhotse.recipes.download_stcmds(target_dir='.', force_download=False, base_url='http://www.openslr.org/resources')[source]
Download and untar the dataset.
- Parameters:
target_dir (Union[Path,str]) – Pathlike, the path of the dir to store the dataset.
force_download (bool) – Bool, if True, download the tars even if they already exist.
base_url (str) – str, the url of the OpenSLR resources.
- Return type:
Path
- Returns:
the path to downloaded and extracted directory with data.
- lhotse.recipes.prepare_stcmds(corpus_dir, output_dir=None)[source]
Returns the manifests which consist of the Recordings and Supervisions.
- Parameters:
corpus_dir (Union[Path,str]) – Pathlike, the path of the data dir.
output_dir (Union[Path,str,None]) – Pathlike, the path where to write the manifests.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]
- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.
- lhotse.recipes.prepare_switchboard(audio_dir, transcripts_dir=None, sentiment_dir=None, output_dir=None, omit_silence=True, absolute_paths=False)[source]
Prepare manifests for the Switchboard corpus. We create two manifests: one with recordings, and the other one with text supervisions. When sentiment_dir is provided, we create another supervision manifest with sentiment annotations.
- Parameters:
audio_dir (Union[Path,str]) – Path to LDC97S62 package.
transcripts_dir (Union[Path,str,None]) – Path to the transcripts directory (typically named “swb_ms98_transcriptions”). If not provided, the transcripts will be downloaded.
sentiment_dir (Union[Path,str,None]) – Optional path to LDC2020T14 package which contains sentiment annotations for SWBD segments.
output_dir (Union[Path,str,None]) – Directory where the manifests should be written. Can be omitted to avoid writing.
omit_silence (bool) – Whether supervision segments with [silence] token should be removed or kept.
absolute_paths (bool) – Whether to return absolute or relative (to the corpus dir) paths for recordings.
- Return type:
Dict[str,Union[RecordingSet,SupervisionSet]]
- Returns:
A dict with manifests. The keys are: {'recordings', 'supervisions'}.
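The omit_silence option can be thought of as a filter over supervision segments. The sketch below uses plain dictionaries instead of lhotse supervision objects and assumes that a silence segment's text is exactly “[silence]”; both the data layout and the function name are hypothetical.

```python
from typing import Dict, List


def filter_silence(
    segments: List[Dict[str, str]], omit_silence: bool = True
) -> List[Dict[str, str]]:
    """Drop segments whose text is the [silence] token when omit_silence is set."""
    if not omit_silence:
        return list(segments)
    return [seg for seg in segments if seg["text"] != "[silence]"]


# Hypothetical segments; real supervisions carry timing and speaker fields too.
segments = [
    {"id": "seg-1", "text": "hello there"},
    {"id": "seg-2", "text": "[silence]"},
    {"id": "seg-3", "text": "how are you"},
]
print([seg["id"] for seg in filter_silence(segments)])
```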
- lhotse.recipes.prepare_tedlium(tedlium_root, output_dir=None, dataset_parts=('train', 'dev', 'test'), num_jobs=1, normalize_text='none')[source]
Prepare manifests for the TED-LIUM v3 corpus.
The manifests are created in a dict with three splits: train, dev and test. Each split contains a RecordingSet and SupervisionSet in a dict under keys ‘recordings’ and ‘supervisions’.
- Parameters:
tedlium_root (Union[Path,str]) – Path to the unpacked TED-LIUM data.
output_dir (Union[Path,str,None]) – Path where the manifests should be written.
dataset_parts (Union[str,Sequence[str]]) – Which parts of the dataset to prepare. By default, all parts are prepared.
num_jobs (int) – Number of parallel jobs to use.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]
- Returns:
A dict with standard corpus splits containing the manifests.
- lhotse.recipes.download_thchs_30(target_dir='.', force_download=False, base_url='http://www.openslr.org/resources')[source]
Download and untar the dataset.
- Parameters:
target_dir (Union[Path,str]) – Pathlike, the path of the dir to store the dataset.
force_download (bool) – Bool, if True, download the tars even if they already exist.
base_url (str) – str, the url of the OpenSLR resources.
- Return type:
Path
- Returns:
the path to downloaded and extracted directory with data.
- lhotse.recipes.prepare_thchs_30(corpus_dir, output_dir=None)[source]
Returns the manifests which consist of the Recordings and Supervisions.
- Parameters:
corpus_dir (Union[Path,str]) – Pathlike, the path of the data dir.
output_dir (Union[Path,str,None]) – Pathlike, the path where to write the manifests.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]
- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.
- lhotse.recipes.download_this_american_life(target_dir='.', force_download=False, metadata_url='https://ipfs.io/ipfs/bafybeidyt3ch6t4dtu2ehdriod3jvuh34qu4pwjyoba2jrjpmqwckkr6q4/this_american_life.zip', website_url='https://thisamericanlife.org')[source]
- lhotse.recipes.download_timit(target_dir='.', force_download=False, base_url='https://data.deepai.org/timit.zip')[source]
Download and unzip the TIMIT dataset.
- Parameters:
target_dir (Union[Path,str]) – Pathlike, the path of the dir to store the dataset.
force_download (bool) – bool, if True, download the zips even if they already exist.
base_url (Optional[str]) – str, the URL of the TIMIT dataset to download.
- Return type:
Path
- Returns:
the path to downloaded and extracted directory with data.
- lhotse.recipes.prepare_timit(corpus_dir, output_dir=None, num_phones=48, num_jobs=1)[source]
Returns the manifests which consist of the Recordings and Supervisions.
- Parameters:
corpus_dir (Union[Path,str]) – Pathlike, the path of the data dir.
output_dir (Union[Path,str,None]) – Pathlike, the path where to write and save the manifests.
supervision_lvl – str=’phone’, the level of the supervision, ‘phone’, ‘word’ or ‘text’.
num_phones (int) – int=48, the number of phones (60, 48 or 39) for modeling; 48 is the default value.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]
- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘audio’ and ‘supervisions’.
- lhotse.recipes.prepare_uwb_atcc(corpus_dir, output_dir=None, silence_sym='', breath_sym='', noise_sym='', foreign_sym='<unk>', partial_sym='<unk>', unintelligble_sym='<unk>', unknown_sym='<unk>')[source]
Returns the manifests which consist of the Recordings and Supervisions.
- Parameters:
corpus_dir (Union[Path,str]) – Pathlike, the path of the data dir.
output_dir (Union[Path,str,None]) – Pathlike, the path where to write the manifests.
silence_sym (Optional[str]) – str, silence symbol.
breath_sym (Optional[str]) – str, breath symbol.
noise_sym (Optional[str]) – str, noise symbol.
foreign_sym (Optional[str]) – str, foreign symbol. When set to None, will output foreign words.
partial_sym (Optional[str]) – str, partial symbol. When set to None, will output partial words.
unintelligble_sym (Optional[str]) – str, unintelligible symbol. When set to None, will output unintelligible words.
unknown_sym (Optional[str]) – str, unknown symbol.
- Return type:
Dict[str,Union[RecordingSet,SupervisionSet]]
- Returns:
The RecordingSet and SupervisionSet with the keys ‘audio’ and ‘supervisions’.
- lhotse.recipes.download_vctk(target_dir='.', force_download=False, use_edinburgh_vctk_url=False, url='http://www.udialogue.org/download/VCTK-Corpus.tar.gz')[source]
Download and untar/unzip the VCTK dataset.
- Parameters:
target_dir (Union[Path,str]) – Pathlike, the path of the dir to store the dataset.
force_download (Optional[bool]) – Bool, if True, download the tars even if they already exist.
url (Optional[str]) – str, the url of tarred/zipped VCTK corpus.
- Return type:
Path
- Returns:
the path to downloaded and extracted directory with data.
- lhotse.recipes.prepare_vctk(corpus_dir, output_dir=None, use_edinburgh_vctk_url=False, mic_id='mic2')[source]
Prepares and returns the VCTK manifests which consist of Recordings and Supervisions.
- Parameters:
corpus_dir (Union[Path,str]) – Pathlike, the path of the data dir.
output_dir (Union[Path,str,None]) – Pathlike, the path where to write the manifests.
use_edinburgh_vctk_url (Optional[bool]) – Bool, set it to True if the dataset was downloaded using the Edinburgh VCTK URL.
mic_id (Optional[str]) – str, the microphone ID; the default is mic2.
- Return type:
Dict[str,Union[RecordingSet,SupervisionSet]]
- Returns:
a dict with keys ‘recordings’ and ‘supervisions’, matching the return type above.
- Note: when downloading the VCTK dataset from the Edinburgh URL, keep the following in mind:
All the speeches from speaker p315 will be skipped due to the lack of the corresponding text files.
All the speeches from speaker p280 will be skipped for mic_id="mic2" due to the lack of the audio files.
Some of the speeches from speaker p362 will be skipped due to the lack of the audio files.
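The speaker-level cases from the note can be summarized in a small predicate. This is a sketch for illustration only: the function name is hypothetical, and the partial p362 case cannot be decided per speaker, so it is deliberately left out.

```python
def is_skipped_speaker(speaker: str, mic_id: str = "mic2") -> bool:
    """Return True for speakers that the note says are skipped entirely.

    p362 is only partially affected (some audio files are missing),
    so it is not covered by this speaker-level check.
    """
    if speaker == "p315":
        return True  # no corresponding text files
    if speaker == "p280" and mic_id == "mic2":
        return True  # no audio files for this microphone
    return False


print(is_skipped_speaker("p280", "mic2"), is_skipped_speaker("p280", "mic1"))
```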
- lhotse.recipes.download_voxceleb1(target_dir='.', force_download=False)[source]
Download and unzip the VoxCeleb1 data.
Note
A “connection refused” error may occur if you are downloading without a password.
- Parameters:
target_dir (Union[Path,str]) – Pathlike, the path of the dir to store the dataset.
force_download (Optional[bool]) – bool, if True, download the archive even if it already exists.
- Return type:
Path
- Returns:
the path to downloaded and extracted directory with data.
- lhotse.recipes.download_voxceleb2(target_dir='.', force_download=False)[source]
Download and unzip the VoxCeleb2 data.
Note
A “connection refused” error may occur if you are downloading without a password.
- Parameters:
target_dir (Union[Path,str]) – Pathlike, the path of the dir to store the dataset.
force_download (Optional[bool]) – bool, if True, download the archive even if it already exists.
- Return type:
Path
- Returns:
the path to downloaded and extracted directory with data.
- lhotse.recipes.prepare_voxceleb(voxceleb1_root=None, voxceleb2_root=None, output_dir=None, num_jobs=1)[source]
Prepare manifests for the VoxCeleb v1 and v2 corpora.
The manifests are created in a dict with three splits: train, dev and test, for each of the two versions. Each split contains a RecordingSet and SupervisionSet in a dict under keys ‘recordings’ and ‘supervisions’.
- Parameters:
voxceleb1_root (Union[Path,str,None]) – Path to the VoxCeleb v1 dataset.
voxceleb2_root (Union[Path,str,None]) – Path to the VoxCeleb v2 dataset.
output_dir (Union[Path,str,None]) – Path to the output directory.
num_jobs (int) – Number of parallel jobs to run.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]
- Returns:
A dict with standard corpus splits (“train” and “test”) containing the manifests.
NOTE: We prepare the data using the Kaldi style split, i.e., the whole VoxCeleb2 (“dev” and “test”) and the training portion (“dev”) of VoxCeleb1 are put into the “train” split. The “test” split contains the “test” portion of VoxCeleb1. So if VoxCeleb1 is not provided, no “test” split is created in the output manifests.
Example usage:
>>> from lhotse.recipes.voxceleb import prepare_voxceleb
>>> manifests = prepare_voxceleb(voxceleb1_root='/path/to/voxceleb1',
...                              voxceleb2_root='/path/to/voxceleb2',
...                              output_dir='/path/to/output',
...                              num_jobs=4)
NOTE: If VoxCeleb1 is provided, we also prepare the trials file using the list provided in http://www.openslr.org/resources/49/voxceleb1_test_v2.txt. This file is used in the Kaldi recipes for VoxCeleb speaker verification. The trials are prepared as 2 tuples of the form (CutSet, CutSet) with identical IDs, one for the positive pairs and one for the negative pairs. These are stored in the dict under keys ‘pos_trials’ and ‘neg_trials’, respectively. For evaluation purposes, lhotse.dataset.sampling.CutPairsSampler can be used to sample from these tuples.
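The Kaldi-style split described in the note for prepare_voxceleb can be written down as a small mapping. The function name and the string labels for the corpora are assumptions made for this sketch; only the assignment rule itself comes from the note.

```python
def kaldi_style_split(corpus: str, part: str) -> str:
    """Map a (corpus, original part) pair to the output split named in the note."""
    if corpus == "voxceleb2" and part in ("dev", "test"):
        return "train"  # the whole of VoxCeleb2 goes to training
    if corpus == "voxceleb1" and part == "dev":
        return "train"  # the training portion of VoxCeleb1
    if corpus == "voxceleb1" and part == "test":
        return "test"  # the only source of the "test" split
    raise ValueError(f"Unknown corpus/part combination: {corpus!r}, {part!r}")


for corpus, part in [("voxceleb1", "dev"), ("voxceleb1", "test"), ("voxceleb2", "test")]:
    print(corpus, part, "->", kaldi_style_split(corpus, part))
```

Since only VoxCeleb1 contributes to “test”, omitting it leaves the output without a “test” split, as the note states.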
- lhotse.recipes.download_voxpopuli(target_dir='.', subset='asr')[source]
Download and untar/unzip the VoxPopuli dataset.
- Parameters:
target_dir (Union[Path,str]) – Pathlike, the path of the dir to store the dataset.
subset (Optional[str]) – str, the subset of the dataset to download, can be one of “400k”, “100k”, “10k”, “asr”, or any of the languages in LANGUAGES or LANGUAGES_V2.
- Return type:
Path
- Returns:
the path to downloaded and extracted directory with data.
- lhotse.recipes.prepare_voxpopuli(corpus_dir, output_dir=None, task='asr', lang='en', source_lang=None, target_lang=None, num_jobs=1)[source]
Prepares and returns the VoxPopuli manifests which consist of Recordings and Supervisions.
- Parameters:
corpus_dir (Union[Path,str]) – Pathlike, the path of the data dir.
output_dir (Union[Path,str,None]) – Pathlike, the path where to write the manifests.
task (str) – str, the task to prepare the manifests for, can be one of “asr”, “s2s”, “lm”.
lang (str) – str, the language to prepare the manifests for, can be one of LANGUAGES or LANGUAGES_V2. This is used for “asr” and “lm” tasks.
source_lang (Optional[str]) – str, the source language for the s2s task, can be one of S2S_SRC_LANGUAGES.
target_lang (Optional[str]) – str, the target language for the s2s task, can be one of S2S_TGT_LANGUAGES.
num_jobs (int) – int, the number of parallel jobs to use for preparing the manifests.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]
- Returns:
Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]], the manifests.
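Which language arguments matter depends on the task. The sketch below encodes only what the parameter descriptions state: lang applies to “asr” and “lm”, while source_lang and target_lang apply to “s2s”. The function is hypothetical and does not reflect how the recipe validates its input.

```python
from typing import Tuple


def relevant_language_args(task: str) -> Tuple[str, ...]:
    """Return the names of the language arguments used by the given task."""
    if task in ("asr", "lm"):
        return ("lang",)
    if task == "s2s":
        return ("source_lang", "target_lang")
    raise ValueError(f"Unsupported task: {task!r}")


for task in ("asr", "s2s", "lm"):
    print(task, relevant_language_args(task))
```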
- lhotse.recipes.prepare_wenet_speech(corpus_dir, dataset_parts='all', output_dir=None, num_jobs=1)[source]
Returns the manifests which consist of the Recordings and Supervisions.
- Parameters:
corpus_dir (Union[Path,str]) – Pathlike, the path of the data dir.
dataset_parts (Union[str,Sequence[str]]) – Which parts of the dataset to prepare; ‘all’ for all the parts.
output_dir (Union[Path,str,None]) – Pathlike, the path where to write the manifests.
num_jobs – Number of workers to extract manifests.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]
- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.
- lhotse.recipes.download_wham(target_dir='.', url='https://my-bucket-a8b4b49c25c811ee9a7e8bba05fa24c7.s3.amazonaws.com/wham_noise.zip', force_download=False)[source]
Download and untar the WHAM corpus.
- Parameters:
target_dir (
Union[Path,str]) – Pathlike, the path of the dir to store the dataset.
url (Optional[str]) – str, the url that downloads file called “wham_noise.zip”.
force_download (Optional[bool]) – bool, if True, download the archive even if it already exists.
- Return type:
Path
- lhotse.recipes.prepare_wham(corpus_dir, output_dir=None)[source]
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]
- lhotse.recipes.download_xbmu_amdo31(target_dir='.')[source]
Download and untar the dataset.
- Parameters:
target_dir (Union[Path,str]) – Pathlike, the path of the dir to store the dataset.
- Return type:
Path
- Returns:
the path to downloaded and extracted directory with data.
- lhotse.recipes.prepare_xbmu_amdo31(corpus_dir, output_dir=None)[source]
Returns the manifests which consist of the Recordings and Supervisions.
- Parameters:
corpus_dir (Union[Path,str]) – Pathlike, the path of the data dir.
output_dir (Union[Path,str,None]) – Pathlike, the path where to write the manifests.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]
- Returns:
a Dict whose key is the dataset part, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.
- lhotse.recipes.download_yesno(target_dir='.', force_download=False, url='http://www.openslr.org/resources/1/waves_yesno.tar.gz')[source]
Download and untar the dataset.
- Parameters:
target_dir (Union[Path,str]) – Pathlike, the path of the dir to store the dataset. The extracted files are saved to target_dir/waves_yesno/*.wav
force_download (Optional[bool]) – Bool, if True, download the tar file no matter whether it exists or not.
url (Optional[str]) – str, the url to download the dataset.
- Return type:
Path
- Returns:
the path to the downloaded and extracted directory with data.
- lhotse.recipes.prepare_yesno(corpus_dir, output_dir=None)[source]
Returns the manifests which consist of the Recordings and Supervisions. When all the manifests are available in the output_dir, it will simply read and return them.
- Parameters:
corpus_dir (Union[Path,str]) – Pathlike, the path of the data dir. It’s expected to contain wave files with the pattern x_x_x_x_x_x_x_x.wav, where there are 8 x’s and each x is either 1 or 0.
output_dir (Union[Path,str,None]) – Pathlike, the path where to write the manifests.
- Return type:
Dict[str,Dict[str,Union[RecordingSet,SupervisionSet]]]
- Returns:
a Dict whose key is either “train” or “test”, and the value is Dicts with the keys ‘recordings’ and ‘supervisions’.
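The file-name pattern expected by prepare_yesno can be illustrated with a small standalone sketch. It is not the recipe's implementation: the helper name decode_yesno_filename is hypothetical, and the mapping of 1 to “YES” and 0 to “NO” is an assumption about the corpus convention rather than something stated on this page.

```python
from pathlib import Path

# Assumption: 1 stands for "yes" and 0 for "no"; the labels are illustrative.
WORDS = {"0": "NO", "1": "YES"}


def decode_yesno_filename(path):
    """Turn a name such as '0_1_0_0_1_1_0_1.wav' into a list of eight words."""
    digits = Path(path).stem.split("_")
    # The pattern requires exactly eight fields, each being 0 or 1.
    if len(digits) != 8 or any(d not in WORDS for d in digits):
        raise ValueError(f"Unexpected yesno file name: {path}")
    return [WORDS[d] for d in digits]


words = decode_yesno_filename("waves_yesno/0_1_0_0_1_1_0_1.wav")
print(" ".join(words))
```

A file whose name does not match the pattern raises ValueError here, which is one possible way to surface unexpected files early; the actual recipe may handle them differently.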
Kaldi conversion
Convenience methods used to interact with Kaldi data directories.
- lhotse.kaldi.floor_duration_to_milliseconds(duration)[source]
- Return type:
float
Floor the duration to multiples of 0.001 seconds. This is to avoid float precision problems with workflows like:
lhotse kaldi import …
lhotse fix …
./local/compute_fbank_imported.py (from icefall)
lhotse cut trim-to-supervisions …
./local/validate_manifest.py … (from icefall)
- Without flooring, there were different lengths:
Supervision end time 1093.33995833 is larger than cut end time 1093.3399375
- This is still within the 2ms tolerance in K2SpeechRecognitionDataset::validate_for_asr():
https://github.com/lhotse-speech/lhotse/blob/master/lhotse/dataset/speech_recognition.py#L201
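The effect of flooring can be sketched on the two values quoted above. This is a minimal illustration that assumes flooring is done by scaling to milliseconds and applying math.floor; the helper name floor_to_milliseconds is hypothetical and the library's actual implementation may differ.

```python
import math


def floor_to_milliseconds(duration: float) -> float:
    """Round a duration in seconds down to a multiple of 0.001 s."""
    return math.floor(duration * 1000) / 1000


# The two end times from the validation message above.
supervision_end = floor_to_milliseconds(1093.33995833)
cut_end = floor_to_milliseconds(1093.3399375)

# After flooring, both end times agree (1093.339), so the supervision
# no longer appears to extend past the cut.
print(supervision_end, cut_end)
```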
- lhotse.kaldi.get_duration(path)[source]
Read an audio file and return its duration. Both regular paths to wave files and Kaldi-style pipe commands are supported.
- Parameters:
path (Union[Path,str]) – Path to an audio file or a Kaldi-style pipe.
- Return type:
Optional[float]
- Returns:
float duration of the recording in seconds, or None in case of a read error.
- lhotse.kaldi.load_kaldi_data_dir(path, sampling_rate, frame_shift=None, map_string_to_underscores=None, use_reco2dur=True, num_jobs=1, feature_type='kaldi-fbank')[source]
Load a Kaldi data directory and convert it to Lhotse RecordingSet and SupervisionSet manifests. For this to work, at least the wav.scp file must exist. SupervisionSet is created only when a segments file exists. reco2dur is used by default when it exists (to enforce reading the duration from the audio files themselves, please set use_reco2dur = False). All the other files (text, utt2spk, etc.) are optional, and some of them might not be handled yet. In particular, feats.scp files are ignored.
- Parameters:
path (Union[Path,str]) – Path to the Kaldi data directory.
sampling_rate (int) – Sampling rate of the recordings.
frame_shift (Optional[float]) – Optional, if specified, we will create a Features manifest and store the frame_shift value in it.
map_string_to_underscores (Optional[str]) – optional string, when specified, we will replace all instances of this string in SupervisionSegment IDs with underscores. This is to help with handling underscores in Kaldi (see export_to_kaldi()). This is also done for speaker IDs.
use_reco2dur (bool) – If True, we will use the reco2dur file to read the durations of the recordings. If False, we will read the durations from the audio files themselves.
num_jobs (int) – Number of parallel jobs to use when reading the audio files.
- Return type:
Tuple[RecordingSet,Optional[SupervisionSet],Optional[FeatureSet]]
- lhotse.kaldi.export_to_kaldi(recordings, supervisions, output_dir, map_underscores_to=None, prefix_spk_id=False)[source]
Export a pair of RecordingSet and SupervisionSet to a Kaldi data directory. It even supports recordings that have multiple channels, but each recording must still have a single AudioSource.
The RecordingSet and SupervisionSet must be compatible, i.e. it must be possible to create a CutSet out of them.
- Parameters:
recordings (RecordingSet) – a RecordingSet manifest.
supervisions (SupervisionSet) – a SupervisionSet manifest.
output_dir (Union[Path,str]) – path where the Kaldi-style data directory will be created.
map_underscores_to (Optional[str]) – optional string with which we will replace all underscores. This helps avoid issues with Kaldi data dir sorting.
prefix_spk_id (Optional[bool]) – add speaker_id as a prefix of utterance_id (this is to ensure correct sorting inside files which is required by Kaldi)
Note
If you export a RecordingSet with multiple channels, then the resulting Kaldi data directory may not be back-compatible with Lhotse (i.e. you won’t be able to import it back to Lhotse in the same form). This is because Kaldi does not inherently support multi-channel recordings, so we have to break them down into single-channel recordings.
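The intent of map_underscores_to and prefix_spk_id can be sketched with plain strings. This is an illustration only: the helper kaldi_utt_id is hypothetical, and both the “-” separator between speaker and utterance ID and the choice to rewrite speaker IDs as well are assumptions, not the documented output format of export_to_kaldi.

```python
from typing import Optional


def kaldi_utt_id(
    utt_id: str,
    spk_id: str,
    map_underscores_to: Optional[str] = None,
    prefix_spk_id: bool = False,
) -> str:
    """Build an utterance ID that sorts consistently with its speaker ID."""
    if map_underscores_to is not None:
        # Replace every underscore in both IDs.
        utt_id = utt_id.replace("_", map_underscores_to)
        spk_id = spk_id.replace("_", map_underscores_to)
    if prefix_spk_id:
        # Assumed separator: a single dash.
        utt_id = f"{spk_id}-{utt_id}"
    return utt_id


pairs = [("utt_b", "spk_1"), ("utt_a", "spk_2")]
ids = sorted(
    kaldi_utt_id(utt, spk, map_underscores_to="-", prefix_spk_id=True)
    for utt, spk in pairs
)
print(ids)
```

With the prefix in place, sorting by utterance ID also groups utterances by speaker, which is the ordering property the prefix_spk_id description refers to.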
- lhotse.kaldi.load_start_and_duration(segments_path=None, feats_path=None, frame_shift=None)[source]
Load start time from segments and duration from feats, when both segments and feats.scp are available.
- Return type:
Dict[Tuple,None]
- lhotse.kaldi.load_kaldi_text_file(path, allow_empty_ref=True)[source]
Load the Kaldi text file as a dict. Entries with an empty reference text are allowed by default.
- lhotse.kaldi.load_kaldi_text_mapping(path, must_exist=False, float_vals=False)[source]
Load Kaldi files such as utt2spk, spk2gender, etc. as a dict.
- Return type:
Dict[str,Optional[str]]
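Kaldi mapping files such as utt2spk hold one entry per line: a key, a space, and a value. A minimal standalone reader can be sketched as follows. The function read_kaldi_mapping is hypothetical and only approximates what a loader like load_kaldi_text_mapping might do; in particular, mapping a key without a value to None and the handling of float_vals are assumptions.

```python
import tempfile
from pathlib import Path
from typing import Dict, Optional, Union


def read_kaldi_mapping(
    path: Union[str, Path], float_vals: bool = False
) -> Dict[str, Optional[Union[str, float]]]:
    """Parse lines of the form '<key> <value>' into a dict."""
    mapping: Dict[str, Optional[Union[str, float]]] = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split(maxsplit=1)
            if not parts:
                continue  # skip blank lines
            key = parts[0]
            # A key with nothing after it maps to None (assumed behavior).
            value = parts[1] if len(parts) > 1 else None
            if float_vals and value is not None:
                value = float(value)
            mapping[key] = value
    return mapping


with tempfile.TemporaryDirectory() as tmp:
    utt2spk = Path(tmp) / "utt2spk"
    utt2spk.write_text("utt1 spkA\nutt2 spkB\n", encoding="utf-8")
    reco2dur = Path(tmp) / "reco2dur"
    reco2dur.write_text("rec1 12.5\n", encoding="utf-8")

    speakers = read_kaldi_mapping(utt2spk)
    durations = read_kaldi_mapping(reco2dur, float_vals=True)

print(speakers)
print(durations)
```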
Others
Helper methods used throughout the codebase.
- lhotse.manipulation.combine(*manifests)[source]
Combine multiple manifests of the same type into one.
- Return type:
TypeVar(Manifest,RecordingSet,SupervisionSet,FeatureSet,CutSet)
- Examples:
>>> # Pass several arguments
>>> combine(recording_set1, recording_set2, recording_set3)
>>> # Or pass a single list/tuple of manifests
>>> combine([supervision_set1, supervision_set2])
- lhotse.manipulation.split_parallelize_combine(num_jobs, manifest, fn, *args, **kwargs)[source]
Convenience wrapper that parallelizes the execution of functions that transform manifests. It splits the manifests into num_jobs pieces, applies the function to each split, and then combines the splits.
This function is used internally in Lhotse to implement some parallel ops.
Example:
>>> from lhotse import CutSet, split_parallelize_combine
>>> cuts = CutSet(...)
>>> window_cuts = split_parallelize_combine(
...     16,
...     cuts,
...     CutSet.cut_into_windows,
...     duration=30.0
... )
- Parameters:
num_jobs (int) – The number of parallel jobs.
manifest (TypeVar(Manifest,RecordingSet,SupervisionSet,FeatureSet,CutSet)) – The manifest to be processed.
fn (Callable) – Function or method that transforms the manifest; the first parameter has to be manifest (for methods, they have to be methods on that manifest’s type, e.g. CutSet.cut_into_windows).
args – positional arguments to fn.
kwargs – keyword arguments to fn.
- Return type:
TypeVar(Manifest,RecordingSet,SupervisionSet,FeatureSet,CutSet)
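The split, apply, and combine steps can be illustrated with plain Python lists standing in for manifests. This is a generic sketch of the pattern, not Lhotse's implementation: the names split_evenly, split_apply_combine, and scale are hypothetical, and the use of a thread pool is an assumption made to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def split_evenly(items: Sequence[T], num_splits: int) -> List[List[T]]:
    """Split items into num_splits contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), num_splits)
    chunks, start = [], 0
    for i in range(num_splits):
        # The first `extra` chunks receive one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(list(items[start:end]))
        start = end
    return chunks


def split_apply_combine(num_jobs: int, items: Sequence[T], fn: Callable, *args, **kwargs) -> list:
    """Apply fn to each chunk in parallel and concatenate the results in order."""
    chunks = split_evenly(items, num_jobs)
    with ThreadPoolExecutor(max_workers=num_jobs) as pool:
        futures = [pool.submit(fn, chunk, *args, **kwargs) for chunk in chunks]
        results = [future.result() for future in futures]
    combined: list = []
    for part in results:
        combined.extend(part)
    return combined


def scale(values, factor=1.0):
    """Example transform: the first parameter is the chunk to process."""
    return [v * factor for v in values]


scaled = split_apply_combine(3, list(range(10)), scale, factor=2.0)
print(scaled)
```

As in the documented function, the transform receives its piece of the data as the first argument, and any extra positional or keyword arguments are forwarded unchanged.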
- lhotse.manipulation.to_manifest(items)[source]
Take an iterable of data types in Lhotse such as Recording, SupervisionSegment or Cut, and create the manifest of the corresponding type. When the iterable is empty, returns None.
- Return type:
Optional[TypeVar(Manifest,RecordingSet,SupervisionSet,FeatureSet,CutSet)]