API Reference

This page contains a comprehensive list of all classes and functions within lhotse.

Recording manifests

Data structures used for describing audio recordings in a dataset.

class lhotse.audio.AudioSource(type, channels, source)[source]

AudioSource represents audio data that can be retrieved from somewhere. Supported sources of audio are currently:

  • ‘file’ – formats supported by soundfile, possibly multi-channel

  • ‘command’ – unix pipe; must output WAVE, possibly multi-channel

  • ‘url’ – any URL type that is supported by the “smart_open” library, e.g. http/https/s3/gcp/azure/etc.

type: str
channels: List[int]
source: str
load_audio(offset=0.0, duration=None, force_opus_sampling_rate=None)[source]

Load the AudioSource (from files, commands, or URLs) with soundfile, accounting for many audio formats and multi-channel inputs. Returns numpy array with shapes: (n_samples,) for single-channel, (n_channels, n_samples) for multi-channel.

Note: The elements in the returned array are in the range [-1.0, 1.0] and are of dtype np.float32.

Parameters

force_opus_sampling_rate (Optional[int]) – This parameter is only used when we detect an OPUS file. It will tell ffmpeg to resample OPUS to this sampling rate.

Return type

ndarray
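
Example (a minimal sketch; the file path is hypothetical and assumed to point at a mono WAVE file):

>>> source = AudioSource(type='file', channels=[0], source='audio/utt1.wav')
>>> samples = source.load_audio(offset=0.5, duration=2.0)
>>> # float32 samples in [-1.0, 1.0], shape (n_samples,) for mono input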

with_path_prefix(path)[source]
Return type

AudioSource

to_dict()[source]
Return type

dict

static from_dict(data)[source]
Return type

AudioSource

__init__(type, channels, source)
class lhotse.audio.Recording(id, sources, sampling_rate, num_samples, duration, transforms=None)[source]

The Recording manifest describes the recordings in a given corpus. It contains information about the recording, such as its path(s), duration, the number of samples, etc. It allows representing multiple channels coming from one or more files.

This manifest does not specify any segmentation information or supervision such as the transcript or the speaker – we use SupervisionSegment for that.

Note that Recording can represent both a single utterance (e.g., in LibriSpeech) and a 1-hour session with multiple channels and speakers (e.g., in AMI). In the latter case, it is partitioned into data suitable for model training using Cut.

Hint

Lhotse reads audio recordings using pysoundfile and audioread, similarly to librosa, to support multiple audio formats. For OPUS files we require ffmpeg to be installed.

Hint

Since we support importing Kaldi data dirs, if wav.scp contains unix pipes, Recording will also handle them correctly.

Examples

A Recording can be simply created from a local audio file:

>>> from lhotse import RecordingSet, Recording, AudioSource
>>> recording = Recording.from_file('meeting.wav')
>>> recording
Recording(
    id='meeting',
    sources=[AudioSource(type='file', channels=[0], source='meeting.wav')],
    sampling_rate=16000,
    num_samples=57600000,
    duration=3600.0,
    transforms=None
)

This manifest can be easily converted to a Python dict and serialized to JSON/JSONL/YAML/etc:

>>> recording.to_dict()
{'id': 'meeting',
 'sources': [{'type': 'file',
   'channels': [0],
   'source': 'meeting.wav'}],
 'sampling_rate': 16000,
 'num_samples': 57600000,
 'duration': 3600.0}

Recordings can also be created programmatically, e.g. when they refer to URLs stored in S3 or somewhere else:

>>> s3_audio_files = ['s3://my-bucket/123-5678.flac', ...]
>>> recs = RecordingSet.from_recordings(
...     Recording(
...         id=url.split('/')[-1].replace('.flac', ''),
...         sources=[AudioSource(type='url', source=url, channels=[0])],
...         sampling_rate=16000,
...         num_samples=get_num_samples(url),
...         duration=get_duration(url)
...     )
...     for url in s3_audio_files
... )
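
Unix pipes are expressed as ‘command’ audio sources that output WAVE to stdout. A minimal sketch (the sph2pipe invocation and file path are hypothetical):

>>> rec = Recording(
...     id='sw02001',
...     sources=[AudioSource(
...         type='command',
...         channels=[0],
...         source='sph2pipe -f wav /data/swbd/sw02001.sph'
...     )],
...     sampling_rate=8000,
...     num_samples=2400000,
...     duration=300.0
... )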

It allows reading a subset of the audio samples as a numpy array:

>>> samples = recording.load_audio()
>>> assert samples.shape == (1, 16000)
>>> samples2 = recording.load_audio(offset=0.5)
>>> assert samples2.shape == (1, 8000)
id: str
sources: List[lhotse.audio.AudioSource]
sampling_rate: int
num_samples: int
duration: float
transforms: Optional[List[Dict]] = None
static from_file(path, recording_id=None, relative_path_depth=None, force_opus_sampling_rate=None)[source]

Read an audio file’s header and create the corresponding Recording. Suitable to use when each physical file represents a separate recording session.

Caution

If a recording session consists of multiple files (e.g. one per channel), it is advisable to create the Recording object manually, with each file represented as a separate AudioSource object.

Parameters
  • path (Union[Path, str]) – Path to an audio file supported by libsoundfile (pysoundfile).

  • recording_id (Optional[str]) – recording id; when not specified, the filename’s stem is used (“x.wav” -> “x”).

  • relative_path_depth (Optional[int]) – optional int specifying how many last parts of the file path should be retained in the AudioSource. By default writes the path as is.

  • force_opus_sampling_rate (Optional[int]) – when specified, this value will be used as the sampling rate instead of the one we read from the manifest. This is useful for OPUS files that always have 48kHz rate and need to be resampled to the real one – we will perform that operation “under-the-hood”. For non-OPUS files this input is undefined.

Return type

Recording

Returns

a new Recording instance pointing to the audio file.

to_dict()[source]
Return type

dict

property num_channels
property channel_ids
load_audio(channels=None, offset=0.0, duration=None)[source]

Read the audio samples from the underlying audio source (path, URL, unix pipe/command).

Parameters
  • channels (Union[int, List[int], None]) – int or iterable of ints, a subset of channel IDs to read (reads all by default).

  • offset (float) – seconds, where to start reading the audio (at offset 0 by default). Note that it is only efficient for local filesystem files, i.e. URLs and commands will read all the samples first and discard the unneeded ones afterwards.

  • duration (Optional[float]) – seconds, indicates the total audio time to read (starting from offset).

Return type

ndarray

Returns

a numpy array of audio samples with shape (num_channels, num_samples).
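
For example, a sketch of reading a two-second chunk of the first channel (assuming recording exists and has a known sampling rate):

>>> samples = recording.load_audio(channels=0, offset=10.0, duration=2.0)
>>> assert samples.shape == (1, int(2.0 * recording.sampling_rate))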

with_path_prefix(path)[source]
Return type

Recording

perturb_speed(factor, affix_id=True)[source]

Return a new Recording that will lazily perturb the speed while loading audio. The num_samples and duration fields are updated to reflect the shrinking/extending effect of speed.

Parameters
  • factor (float) – The speed will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • affix_id (bool) – When true, we will modify the Recording.id field by affixing it with “_sp{factor}”.

Return type

Recording

Returns

a modified copy of the current Recording.
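
A short usage sketch (continuing the meeting example above); no audio is processed until load_audio() is called:

>>> rec_sp = recording.perturb_speed(factor=1.1)
>>> rec_sp.id
'meeting_sp1.1'
>>> # duration and num_samples shrink roughly by the factor 1.1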

perturb_tempo(factor, affix_id=True)[source]

Return a new Recording that will lazily perturb the tempo while loading audio.

Compared to speed perturbation, tempo preserves pitch. The num_samples and duration fields are updated to reflect the shrinking/extending effect of tempo.

Parameters
  • factor (float) – The tempo will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • affix_id (bool) – When true, we will modify the Recording.id field by affixing it with “_tp{factor}”.

Return type

Recording

Returns

a modified copy of the current Recording.

perturb_volume(factor, affix_id=True)[source]

Return a new Recording that will lazily perturb the volume while loading audio.

Parameters
  • factor (float) – The volume scale to be applied (e.g. factor=1.1 means 1.1x louder).

  • affix_id (bool) – When true, we will modify the Recording.id field by affixing it with “_vp{factor}”.

Return type

Recording

Returns

a modified copy of the current Recording.

resample(sampling_rate)[source]

Return a new Recording that will be lazily resampled while loading audio.

Parameters

sampling_rate (int) – The new sampling rate.

Return type

Recording

Returns

a resampled Recording.
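
A usage sketch (continuing the meeting example above):

>>> rec_8k = recording.resample(8000)
>>> assert rec_8k.sampling_rate == 8000
>>> samples = rec_8k.load_audio()  # the actual resampling happens here, lazily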

static from_dict(data)[source]
Return type

Recording

__init__(id, sources, sampling_rate, num_samples, duration, transforms=None)
class lhotse.audio.RecordingSet(recordings=None)[source]

RecordingSet represents a collection of recordings, indexed by recording IDs. It does not contain any annotation such as the transcript or the speaker identity – just the information needed to retrieve a recording such as its path, URL, number of channels, and some recording metadata (duration, number of samples).

It also supports (de)serialization to/from YAML/JSON/etc. and takes care of mapping between rich Python classes and YAML/JSON/etc. primitives during conversion.

When coming from Kaldi, think of it as wav.scp on steroids: RecordingSet also has the information from reco2dur and reco2num_samples, is able to represent multi-channel recordings and read a specified subset of channels, and supports reading audio files directly, via a unix pipe, or downloading them on-the-fly from a URL (HTTPS/S3/Azure/GCP/etc.).

Examples:

RecordingSet can be created from an iterable of Recording objects:

>>> from lhotse import RecordingSet
>>> audio_paths = ['123-5678.wav', ...]
>>> recs = RecordingSet.from_recordings(Recording.from_file(p) for p in audio_paths)

As well as from a directory, which will be scanned recursively for files with parallel processing:

>>> recs2 = RecordingSet.from_dir('/data/audio', pattern='*.flac', num_jobs=4)

It behaves similarly to a dict:

>>> '123-5678' in recs
True
>>> recording = recs['123-5678']
>>> for recording in recs:
...     pass
>>> len(recs)
127

It also provides some utilities for I/O:

>>> recs.to_file('recordings.jsonl')
>>> recs.to_file('recordings.json.gz')  # auto-compression
>>> recs2 = RecordingSet.from_file('recordings.jsonl')

Manipulation:

>>> longer_than_5s = recs.filter(lambda r: r.duration > 5)
>>> first_100 = recs.subset(first=100)
>>> split_into_4 = recs.split(num_splits=4)
>>> shuffled = recs.shuffle()

And lazy data augmentation/transformation that requires adjusting some information in the manifest (e.g., num_samples or duration). Note that in the following examples, the audio is untouched – the operations are stored in the manifest and executed upon reading the audio:

>>> recs_sp = recs.perturb_speed(factor=1.1)
>>> recs_vp = recs.perturb_volume(factor=2.)
>>> recs_24k = recs.resample(24000)
__init__(recordings=None)[source]
property is_lazy: bool

Indicates whether this manifest was opened in lazy (read-on-the-fly) mode or not.

Return type

bool

property ids: Iterable[str]
Return type

Iterable[str]

static from_recordings(recordings)[source]
Return type

RecordingSet

static from_dir(path, pattern, num_jobs=1, force_opus_sampling_rate=None)[source]

Recursively scan a directory path for audio files that match the given pattern and create a RecordingSet manifest for them. Suitable to use when each physical file represents a separate recording session.

Caution

If a recording session consists of multiple files (e.g. one per channel), it is advisable to create each Recording object manually, with each file represented as a separate AudioSource object, and then a RecordingSet that contains all the recordings.

Parameters
  • path (Union[Path, str]) – Path to a directory of audio files (possibly with sub-directories).

  • pattern (str) – A bash-like pattern specifying allowed filenames, e.g. *.wav or session1-*.flac.

  • num_jobs (int) – The number of parallel workers for reading audio files to get their metadata.

  • force_opus_sampling_rate (Optional[int]) – when specified, this value will be used as the sampling rate instead of the one we read from the manifest. This is useful for OPUS files that always have 48kHz rate and need to be resampled to the real one – we will perform that operation “under-the-hood”. For non-OPUS files this input does nothing.

Returns

a new RecordingSet instance pointing to the audio files.

static from_dicts(data)[source]
Return type

RecordingSet

to_dicts()[source]
Return type

Iterable[dict]

filter(predicate)[source]

Return a new RecordingSet with the Recordings that satisfy the predicate.

Parameters

predicate (Callable[[Recording], bool]) – a function that takes a recording as an argument and returns bool.

Return type

RecordingSet

Returns

a filtered RecordingSet.

shuffle(rng=None)[source]

Shuffle the recording IDs in the current RecordingSet and return a shuffled copy of self.

Parameters

rng (Optional[Random]) – an optional instance of random.Random for precise control of randomness.

Return type

RecordingSet

Returns

a shuffled copy of self.

split(num_splits, shuffle=False, drop_last=False)[source]

Split the RecordingSet into num_splits pieces of equal size.

Parameters
  • num_splits (int) – Requested number of splits.

  • shuffle (bool) – Optionally shuffle the recordings order first.

  • drop_last (bool) – determines how to handle splitting when len(seq) is not divisible by num_splits. When False (default), the splits might have unequal lengths. When True, it may discard the last element in some splits to ensure they are equally long.

Return type

List[RecordingSet]

Returns

A list of RecordingSet pieces.

subset(first=None, last=None)[source]

Return a new RecordingSet according to the selected subset criterion. Only a single argument to subset is supported at this time.

Parameters
  • first (Optional[int]) – int, the number of first recordings to keep.

  • last (Optional[int]) – int, the number of last recordings to keep.

Return type

RecordingSet

Returns

a new RecordingSet with the subset results.

load_audio(recording_id, channels=None, offset_seconds=0.0, duration_seconds=None)[source]
Return type

ndarray

with_path_prefix(path)[source]
Return type

RecordingSet

num_channels(recording_id)[source]
Return type

int

sampling_rate(recording_id)[source]
Return type

int

num_samples(recording_id)[source]
Return type

int

duration(recording_id)[source]
Return type

float

perturb_speed(factor, affix_id=True)[source]

Return a new RecordingSet that will lazily perturb the speed while loading audio. The num_samples and duration fields are updated to reflect the shrinking/extending effect of speed.

Parameters
  • factor (float) – The speed will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • affix_id (bool) – When true, we will modify the Recording.id field by affixing it with “_sp{factor}”.

Return type

RecordingSet

Returns

a RecordingSet containing the perturbed Recording objects.

perturb_tempo(factor, affix_id=True)[source]

Return a new RecordingSet that will lazily perturb the tempo while loading audio. The num_samples and duration fields are updated to reflect the shrinking/extending effect of tempo.

Parameters
  • factor (float) – The tempo will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • affix_id (bool) – When true, we will modify the Recording.id field by affixing it with “_tp{factor}”.

Return type

RecordingSet

Returns

a RecordingSet containing the perturbed Recording objects.

perturb_volume(factor, affix_id=True)[source]

Return a new RecordingSet that will lazily perturb the volume while loading audio.

Parameters
  • factor (float) – The volume scale to be applied (e.g. factor=1.1 means 1.1x louder).

  • affix_id (bool) – When true, we will modify the Recording.id field by affixing it with “_vp{factor}”.

Return type

RecordingSet

Returns

a RecordingSet containing the perturbed Recording objects.

resample(sampling_rate)[source]

Apply resampling to all recordings in the RecordingSet and return a new RecordingSet.

Parameters

sampling_rate (int) – The new sampling rate.

Return type

RecordingSet

Returns

a new RecordingSet with lazily resampled Recording objects.

count(value) → integer -- return number of occurrences of value
classmethod from_file(path)
Return type

Any

classmethod from_json(path)
Return type

Any

classmethod from_jsonl(path)
Return type

Any

classmethod from_jsonl_lazy(path)

Read a JSONL manifest in a lazy manner, which opens the file but does not read it immediately. It is only suitable for sequential reads and iteration.

Warning

Opening the manifest in this way might cause some methods that rely on random access to fail.

Return type

Any

classmethod from_yaml(path)
Return type

Any

index(value[, start[, stop]]) → integer -- return first index of value.

Raises ValueError if the value is not present.

Supporting start and stop arguments is optional, but recommended.

classmethod open_writer(path, overwrite=True)

Open a sequential writer that allows storing the manifests one by one, without the necessity of keeping the whole manifest set in memory. Supports writing to JSONL format (.jsonl), with optional gzip compression (.jsonl.gz).

Example:

>>> from lhotse import RecordingSet
... recordings = [...]
... with RecordingSet.open_writer('recordings.jsonl.gz') as writer:
...     for recording in recordings:
...         writer.write(recording)

This writer can be useful for resuming a writing session that was previously interrupted – it will open the existing file and scan it for item IDs, so that they are skipped when writing later. It can also be queried for existing IDs so that the user code may skip preparing the corresponding manifests.

Example:

>>> from lhotse import RecordingSet, Recording
... with RecordingSet.open_writer('recordings.jsonl.gz', overwrite=False) as writer:
...     for path in Path('.').rglob('*.wav'):
...         recording_id = path.stem
...         if writer.contains(recording_id):
...             # Item already written previously - skip processing.
...             continue
...         # Item doesn't exist yet - run extra work to prepare the manifest
...         # and store it.
...         recording = Recording.from_file(path, recording_id=recording_id)
...         writer.write(recording)
Return type

SequentialJsonlWriter

to_file(path)
Return type

None

to_json(path)
Return type

None

to_jsonl(path)
Return type

None

to_yaml(path)
Return type

None

class lhotse.audio.AudioMixer(base_audio, sampling_rate)[source]

Utility class to mix multiple waveforms into a single one. It should be instantiated separately for each mixing session (i.e. each MixedCut will create a separate AudioMixer to mix its tracks). It is initialized with a numpy array of audio samples (typically float32 in [-1, 1] range) that represents the “reference” signal for the mix. Other signals can be mixed into it with different time offsets and SNRs using the add_to_mix method. The time offset is relative to the start of the reference signal (only positive values are supported). The SNR is relative to the energy of the signal used to initialize the AudioMixer.

__init__(base_audio, sampling_rate)[source]
Parameters
  • base_audio (ndarray) – A numpy array with the audio samples for the base signal (all the other signals will be mixed to it).

  • sampling_rate (int) – Sampling rate of the audio.

property unmixed_audio: numpy.ndarray

Return a numpy ndarray with the shape (num_tracks, num_samples), where each track is zero-padded and scaled adequately to the offsets and SNRs used in the add_to_mix calls.

Return type

ndarray

property mixed_audio: numpy.ndarray

Return a numpy ndarray with the shape (1, num_samples) - a mono mix of the tracks supplied with add_to_mix calls.

Return type

ndarray

add_to_mix(audio, snr=None, offset=0.0)[source]

Add audio (mono-channel only) of a new track into the mix.

Parameters
  • audio (ndarray) – An array of audio samples to be mixed in.

  • snr (Optional[float]) – Signal-to-noise ratio, assuming audio represents noise (positive SNR – lower audio energy, negative SNR – higher audio energy).

  • offset (float) – How many seconds to shift audio in time. For mixing, the signal will be padded before the start with low energy values.
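
A minimal mixing sketch with synthetic signals (the array shapes follow the unmixed_audio/mixed_audio conventions above; exact shape expectations may vary between versions):

>>> import numpy as np
>>> from lhotse.audio import AudioMixer
>>> base = (0.1 * np.random.randn(1, 16000)).astype(np.float32)
>>> noise = (0.1 * np.random.randn(1, 8000)).astype(np.float32)
>>> mixer = AudioMixer(base_audio=base, sampling_rate=16000)
>>> mixer.add_to_mix(noise, snr=10, offset=0.25)
>>> mix = mixer.mixed_audio       # mono mix, shape (1, num_samples)
>>> tracks = mixer.unmixed_audio  # zero-padded tracks, shape (num_tracks, num_samples)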

lhotse.audio.audio_energy(audio)[source]
Return type

float

lhotse.audio.read_audio(path_or_fd, offset=0.0, duration=None, force_opus_sampling_rate=None)[source]
Return type

Tuple[ndarray, int]

class lhotse.audio.LibsndfileCompatibleAudioInfo(channels, frames, samplerate, duration)[source]
property channels

Alias for field number 0

property frames

Alias for field number 1

property samplerate

Alias for field number 2

property duration

Alias for field number 3

count(value, /)

Return number of occurrences of value.

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.

lhotse.audio.info(path, force_opus_sampling_rate=None)[source]
Return type

LibsndfileCompatibleAudioInfo

lhotse.audio.torchaudio_info(path)[source]

Return an audio info data structure that’s a compatible subset of pysoundfile.info() that we need to create a Recording manifest.

Return type

LibsndfileCompatibleAudioInfo

lhotse.audio.torchaudio_load(path_or_fd, offset=0, duration=None)[source]
Return type

Tuple[ndarray, int]

lhotse.audio.soundfile_load(path_or_fd, offset=0, duration=None)[source]
Return type

Tuple[ndarray, int]

lhotse.audio.audioread_info(path)[source]

Return an audio info data structure that’s a compatible subset of pysoundfile.info() that we need to create a Recording manifest.

Return type

LibsndfileCompatibleAudioInfo

lhotse.audio.audioread_load(path_or_file, offset=0.0, duration=None, dtype=<class 'numpy.float32'>)[source]

Load an audio buffer using audioread. This loads one block at a time, and then concatenates the results.

This function is based on librosa: https://github.com/librosa/librosa/blob/main/librosa/core/audio.py#L180

lhotse.audio.assert_and_maybe_fix_num_samples(audio, offset, duration, recording)[source]
Return type

ndarray

lhotse.audio.opus_info(path, force_opus_sampling_rate=None)[source]
Return type

LibsndfileCompatibleAudioInfo

lhotse.audio.read_opus(path, offset=0.0, duration=None, force_opus_sampling_rate=None)[source]

Reads OPUS files either using torchaudio or ffmpeg. Torchaudio is faster, but if unavailable for some reason, we fall back to a slower ffmpeg-based implementation.

Return type

Tuple[ndarray, int]

Returns

a tuple of audio samples and the sampling rate.

lhotse.audio.read_opus_torchaudio(path, offset=0.0, duration=None, force_opus_sampling_rate=None)[source]

Reads OPUS files using torchaudio. This is just running torchaudio.load(), but we take care of extra resampling if needed.

Return type

Tuple[ndarray, int]

Returns

a tuple of audio samples and the sampling rate.

lhotse.audio.read_opus_ffmpeg(path, offset=0.0, duration=None, force_opus_sampling_rate=None)[source]

Reads OPUS files using ffmpeg in a shell subprocess. Unlike audioread, correctly supports offsets and durations for reading short chunks. Optionally, we can force ffmpeg to resample to the true sampling rate (if we know it up-front).

Return type

Tuple[ndarray, int]

Returns

a tuple of audio samples and the sampling rate.

lhotse.audio.parse_channel_from_ffmpeg_output(ffmpeg_stderr)[source]
Return type

str

lhotse.audio.sph_info(path)[source]
Return type

LibsndfileCompatibleAudioInfo

lhotse.audio.read_sph(sph_path, offset=0.0, duration=None)[source]

Reads SPH files using sph2pipe in a shell subprocess. Unlike audioread, correctly supports offsets and durations for reading short chunks.

Return type

Tuple[ndarray, int]

Returns

a tuple of audio samples and the sampling rate.

Supervision manifests

Data structures used for describing supervisions in a dataset.

class lhotse.supervision.AlignmentItem(symbol, start, duration)[source]

This class contains an alignment item, for example a word, along with its start time (w.r.t. the start of the recording) and duration. It can potentially be used to store other kinds of alignment items, such as subwords, pdf-ids, etc.

We use dataclasses instead of namedtuples (even though they are potentially slower) because of a serialization bug in nested namedtuples and dataclasses in Python 3.7 (see this: https://alexdelorenzo.dev/programming/2018/08/09/bug-in-dataclass.html). We can revert to namedtuples if we bump up the Python requirement to 3.8+.
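
Example (values are illustrative):

>>> from lhotse.supervision import AlignmentItem
>>> item = AlignmentItem(symbol='hello', start=1.2, duration=0.4)
>>> item.end   # 1.6
>>> shifted = item.with_offset(2.0)  # 'hello' now starts at 3.2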

symbol: str
start: float
duration: float
property end: float
Return type

float

with_offset(offset)[source]

Return an identical AlignmentItem, but with the offset added to the start field.

Return type

AlignmentItem

perturb_speed(factor, sampling_rate)[source]

Return an AlignmentItem that has time boundaries matching the recording/cut perturbed with the same factor. See SupervisionSegment.perturb_speed() for details.

Return type

AlignmentItem

trim(end, start=0)[source]

See SupervisionSegment.trim().

Return type

AlignmentItem

transform(transform_fn)[source]

Perform specified transformation on the alignment content.

Return type

AlignmentItem

__init__(symbol, start, duration)
class lhotse.supervision.SupervisionSegment(id, recording_id, start, duration, channel=0, text=None, language=None, speaker=None, gender=None, custom=None, alignment=None)[source]

SupervisionSegment represents a time interval (segment) annotated with some supervision labels and/or metadata, such as the transcription, the speaker identity, the language, etc.

Each supervision has unique id and always refers to a specific recording (via recording_id) and a specific channel (by default, 0). It’s also characterized by the start time (relative to the beginning of a Recording or a Cut) and a duration, both expressed in seconds.

The remaining fields are all optional, and their availability depends on specific corpora. Since it is difficult to predict all possible types of metadata, the custom field (a dict) can be used to insert types of supervisions that are not supported out of the box.

SupervisionSegment may contain multiple types of alignments. The alignment field is a dict, indexed by alignment’s type (e.g., word or phone), and contains a list of AlignmentItem objects – simple structures that contain a given symbol and its time interval. Alignments can be read from CTM files or created programmatically.

Examples

A simple segment with no supervision information:

>>> from lhotse import SupervisionSegment
>>> sup0 = SupervisionSegment(
...     id='rec00001-sup00000', recording_id='rec00001',
...     start=0.5, duration=5.0, channel=0
... )

Typical supervision containing transcript, speaker ID, gender, and language:

>>> sup1 = SupervisionSegment(
...     id='rec00001-sup00001', recording_id='rec00001',
...     start=5.5, duration=3.0, channel=0,
...     text='transcript of the second segment',
...     speaker='Norman Dyhrentfurth', language='English', gender='M'
... )

Two supervisions denoting overlapping speech on two separate channels in a microphone array/multiple headsets (pay attention to start, duration, and channel):

>>> sup2 = SupervisionSegment(
...     id='rec00001-sup00002', recording_id='rec00001',
...     start=15.0, duration=5.0, channel=0,
...     text="i have incredibly good news for you",
...     speaker='Norman Dyhrentfurth', language='English', gender='M'
... )
>>> sup3 = SupervisionSegment(
...     id='rec00001-sup00003', recording_id='rec00001',
...     start=18.0, duration=3.0, channel=1,
...     text="say what",
...     speaker='Hervey Arman', language='English', gender='M'
... )

A supervision with a phone alignment:

>>> from lhotse.supervision import AlignmentItem
>>> sup4 = SupervisionSegment(
...     id='rec00001-sup00004', recording_id='rec00001',
...     start=33.0, duration=1.0, channel=0,
...     text="ice",
...     speaker='Maryla Zechariah', language='English', gender='F',
...     alignment={
...         'phone': [
...             AlignmentItem(symbol='AY0', start=33.0, duration=0.6),
...             AlignmentItem(symbol='S', start=33.6, duration=0.4)
...         ]
...     }
... )

Converting SupervisionSegment to a dict:

>>> sup0.to_dict()
{'id': 'rec00001-sup00000', 'recording_id': 'rec00001', 'start': 0.5, 'duration': 5.0, 'channel': 0}
id: str
recording_id: str
start: float
duration: float
channel: int = 0
text: Optional[str] = None
language: Optional[str] = None
speaker: Optional[str] = None
gender: Optional[str] = None
custom: Optional[Dict[str, Any]] = None
alignment: Optional[Dict[str, List[lhotse.supervision.AlignmentItem]]] = None
property end: float
Return type

float

with_offset(offset)[source]

Return an identical SupervisionSegment, but with the offset added to the start field.

Return type

SupervisionSegment

perturb_speed(factor, sampling_rate, affix_id=True)[source]

Return a SupervisionSegment that has time boundaries matching the recording/cut perturbed with the same factor.

Parameters
  • factor (float) – The speed will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • sampling_rate (int) – The sampling rate is necessary to accurately perturb the start and duration (going through the sample counts).

  • affix_id (bool) – When true, we will modify the id and recording_id fields by affixing it with “_sp{factor}”.

Return type

SupervisionSegment

Returns

a modified copy of the current SupervisionSegment.

perturb_tempo(factor, sampling_rate, affix_id=True)[source]

Return a SupervisionSegment that has time boundaries matching the recording/cut perturbed with the same factor.

Parameters
  • factor (float) – The tempo will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • sampling_rate (int) – The sampling rate is necessary to accurately perturb the start and duration (going through the sample counts).

  • affix_id (bool) – When true, we will modify the id and recording_id fields by affixing it with “_tp{factor}”.

Return type

SupervisionSegment

Returns

a modified copy of the current SupervisionSegment.

perturb_volume(factor, affix_id=True)[source]

Return a SupervisionSegment with modified ids.

Parameters
  • factor (float) – The volume will be adjusted this many times (e.g. factor=1.1 means 1.1x louder).

  • affix_id (bool) – When true, we will modify the id and recording_id fields by affixing it with “_vp{factor}”.

Return type

SupervisionSegment

Returns

a modified copy of the current SupervisionSegment.

trim(end, start=0)[source]

Return an identical SupervisionSegment, but ensure that self.start is not negative (in which case it’s set to 0) and self.end does not exceed the end parameter. If a start is optionally provided, the supervision is trimmed from the left (note that start should be relative to the cut times).

This method is useful for ensuring that the supervision does not exceed a cut’s bounds, in which case pass cut.duration as the end argument, since supervision times are relative to the cut.

Return type

SupervisionSegment
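
A sketch of keeping a supervision within a 10 second cut’s bounds (values are illustrative):

>>> sup = SupervisionSegment(id='s1', recording_id='r1', start=8.0, duration=5.0)
>>> trimmed = sup.trim(end=10.0)
>>> # trimmed.duration == 2.0, so that trimmed.end does not exceed 10.0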

map(transform_fn)[source]

Return a copy of the current segment, transformed with transform_fn.

Parameters

transform_fn (Callable[[SupervisionSegment], SupervisionSegment]) – a function that takes a segment as input, transforms it and returns a new segment.

Return type

SupervisionSegment

Returns

a modified SupervisionSegment.

transform_text(transform_fn)[source]

Return a copy of the current segment with transformed text field. Useful for text normalization, phonetic transcription, etc.

Parameters

transform_fn (Callable[[str], str]) – a function that accepts a string and returns a string.

Return type

SupervisionSegment

Returns

a SupervisionSegment with adjusted text.
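
For example, uppercasing the transcript of sup1 defined in the class examples above:

>>> normalized = sup1.transform_text(lambda text: text.upper())
>>> normalized.text
'TRANSCRIPT OF THE SECOND SEGMENT'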

transform_alignment(transform_fn, type='word')[source]

Return a copy of the current segment with transformed alignment field. Useful for text normalization, phonetic transcription, etc.

Parameters
  • type (Optional[str]) – alignment type to transform (key for alignment dict).

  • transform_fn (Callable[[str], str]) – a function that accepts a string and returns a string.

Return type

SupervisionSegment

Returns

a SupervisionSegment with adjusted alignments.

to_dict()[source]
Return type

dict

static from_dict(data)[source]
Return type

SupervisionSegment

__init__(id, recording_id, start, duration, channel=0, text=None, language=None, speaker=None, gender=None, custom=None, alignment=None)
class lhotse.supervision.SupervisionSet(segments)[source]

SupervisionSet represents a collection of segments containing some supervision information (see SupervisionSegment), that are indexed by segment IDs.

It acts as a Python dict, extended with an efficient find operation that indexes and caches the supervision segments in an interval tree. It allows quickly finding supervision segments that correspond to a specific time interval.

When coming from Kaldi, think of SupervisionSet as a segments file on steroids, that may also contain text, utt2spk, utt2gender, utt2dur, etc.

Examples

Building a SupervisionSet:

>>> from lhotse import SupervisionSet, SupervisionSegment
>>> sups = SupervisionSet.from_segments([SupervisionSegment(...), ...])

Writing/reading a SupervisionSet:

>>> sups.to_file('supervisions.jsonl.gz')
>>> sups2 = SupervisionSet.from_file('supervisions.jsonl.gz')

Using SupervisionSet like a dict:

>>> 'rec00001-sup00000' in sups
True
>>> sups['rec00001-sup00000']
SupervisionSegment(id='rec00001-sup00000', recording_id='rec00001', start=0.5, ...)
>>> for segment in sups:
...     pass

Searching by recording_id and time interval:

>>> matched_segments = sups.find(recording_id='rec00001', start_after=17.0, end_before=25.0)

Manipulation:

>>> longer_than_5s = sups.filter(lambda s: s.duration > 5)
>>> first_100 = sups.subset(first=100)
>>> split_into_4 = sups.split(num_splits=4)
>>> shuffled = sups.shuffle()
__init__(segments)[source]
property is_lazy: bool

Indicates whether this manifest was opened in lazy (read-on-the-fly) mode or not.

Return type

bool

property ids: Iterable[str]
Return type

Iterable[str]

static from_segments(segments)[source]
Return type

SupervisionSet

static from_dicts(data)[source]
Return type

SupervisionSet

with_alignment_from_ctm(ctm_file, type='word', match_channel=False)[source]

Add alignments from CTM file to the supervision set.

Parameters
  • ctm_file – Path to CTM file.

  • type (str) – Alignment type (optional, default = word).

  • match_channel (bool) – if True, also match channel between CTM and SupervisionSegment

Return type

SupervisionSet

Returns

A new SupervisionSet with AlignmentItem objects added to the segments.
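
CTM is a plain-text format with one token per line: <recording_id> <channel> <start> <duration> <symbol>. A usage sketch (the file path is hypothetical):

>>> # alignments.ctm contains lines such as:
>>> # rec00001 0 33.00 1.00 ice
>>> sups_ali = sups.with_alignment_from_ctm('alignments.ctm', type='word')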

write_alignment_to_ctm(ctm_file, type='word')[source]

Write alignments to CTM file.

Parameters
  • ctm_file (Union[Path, str]) – Path to output CTM file (will be created if it does not exist)

  • type (str) – Alignment type to write (default = word)

Return type

None

to_dicts()[source]
Return type

Iterable[dict]

shuffle(rng=None)[source]

Shuffle the supervision IDs in the current SupervisionSet and return a shuffled copy of self.

Parameters

rng (Optional[Random]) – an optional instance of random.Random for precise control of randomness.

Return type

SupervisionSet

Returns

a shuffled copy of self.

split(num_splits, shuffle=False, drop_last=False)[source]

Split the SupervisionSet into num_splits pieces of equal size.

Parameters
  • num_splits (int) – Requested number of splits.

  • shuffle (bool) – Optionally shuffle the recordings order first.

  • drop_last (bool) – determines how to handle splitting when len(seq) is not divisible by num_splits. When False (default), the splits might have unequal lengths. When True, it may discard the last element in some splits to ensure they are equally long.

Return type

List[SupervisionSet]

Returns

A list of SupervisionSet pieces.

subset(first=None, last=None)[source]

Return a new SupervisionSet according to the selected subset criterion. Only a single argument to subset is supported at this time.

Parameters
  • first (Optional[int]) – int, the number of first supervisions to keep.

  • last (Optional[int]) – int, the number of last supervisions to keep.

Return type

SupervisionSet

Returns

a new SupervisionSet with the subset results.

filter(predicate)[source]

Return a new SupervisionSet with the SupervisionSegments that satisfy the predicate.

Parameters

predicate (Callable[[SupervisionSegment], bool]) – a function that takes a supervision as an argument and returns bool.

Return type

SupervisionSet

Returns

a filtered SupervisionSet.

map(transform_fn)[source]

Map a transform_fn to the SupervisionSegments and return a new SupervisionSet.

Parameters

transform_fn (Callable[[SupervisionSegment], SupervisionSegment]) – a function that modifies a supervision as an argument.

Return type

SupervisionSet

Returns

a new SupervisionSet with modified segments.

transform_text(transform_fn)[source]

Return a copy of the current SupervisionSet with the segments having a transformed text field. Useful for text normalization, phonetic transcription, etc.

Parameters

transform_fn (Callable[[str], str]) – a function that accepts a string and returns a string.

Return type

SupervisionSet

Returns

a SupervisionSet with adjusted text.

transform_alignment(transform_fn, type='word')[source]

Return a copy of the current SupervisionSet with the segments having a transformed alignment field. Useful for text normalization, phonetic transcription, etc.

Parameters
  • transform_fn (Callable[[str], str]) – a function that accepts a string and returns a string.

  • type (str) – alignment type to transform (key for alignment dict).

Return type

SupervisionSet

Returns

a SupervisionSet with adjusted alignments.

find(recording_id, channel=None, start_after=0, end_before=None, adjust_offset=False, tolerance=0.001)[source]

Return an iterable of segments that match the provided recording_id.

Parameters
  • recording_id (str) – Desired recording ID.

  • channel (Optional[int]) – When specified, return supervisions in that channel - otherwise, in all channels.

  • start_after (float) – When specified, return segments that start after the given value.

  • end_before (Optional[float]) – When specified, return segments that end before the given value.

  • adjust_offset (bool) – When true, return segments as if the recordings had started at start_after. This is useful for creating Cuts. From a user perspective, when dealing with a Cut, it is no longer helpful to know when the supervision starts in a recording – instead, it’s useful to know when the supervision starts relative to the start of the Cut. In the anticipated use-case, start_after and end_before would be the beginning and end of a cut; this option converts the times to be relative to the start of the cut.

  • tolerance (float) – Additional margin to account for floating point rounding errors when comparing segment boundaries.

Return type

Iterable[SupervisionSegment]

Returns

An iterator over supervision segments satisfying all criteria.
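
For example, a sketch of finding channel-0 segments that lie within a prospective cut spanning 17.0–25.0 seconds, with their times shifted to be cut-relative:

>>> segments = list(sups.find(
...     recording_id='rec00001',
...     channel=0,
...     start_after=17.0,
...     end_before=25.0,
...     adjust_offset=True
... ))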

count(value) → integer -- return number of occurrences of value
classmethod from_file(path)
Return type

Any

classmethod from_json(path)
Return type

Any

classmethod from_jsonl(path)
Return type

Any

classmethod from_jsonl_lazy(path)

Read a JSONL manifest in a lazy manner, which opens the file but does not read it immediately. It is only suitable for sequential reads and iteration.

Warning

Opening the manifest in this way might cause some methods that rely on random access to fail.

Return type

Any

classmethod from_yaml(path)
Return type

Any

index(value[, start[, stop]]) → integer -- return first index of value.

Raises ValueError if the value is not present.

Supporting start and stop arguments is optional, but recommended.

classmethod open_writer(path, overwrite=True)

Open a sequential writer that allows storing the manifests one by one, without the necessity of keeping the whole manifest set in memory. Supports writing to JSONL format (.jsonl), with optional gzip compression (.jsonl.gz).

Example:

>>> from lhotse import RecordingSet
... recordings = [...]
... with RecordingSet.open_writer('recordings.jsonl.gz') as writer:
...     for recording in recordings:
...         writer.write(recording)

This writer can be useful for resuming a writing session that was previously interrupted – it will open the existing file and scan it for item IDs, so that they are skipped when writing later. It can also be queried for existing IDs so that the user code may skip preparing the corresponding manifests.

Example:

>>> from lhotse import RecordingSet, Recording
... with RecordingSet.open_writer('recordings.jsonl.gz', overwrite=False) as writer:
...     for path in Path('.').rglob('*.wav'):
...         recording_id = path.stem
...         if writer.contains(recording_id):
...             # Item already written previously - skip processing.
...             continue
...         # Item doesn't exist yet - run extra work to prepare the manifest
...         # and store it.
...         recording = Recording.from_file(path, recording_id=recording_id)
...         writer.write(recording)
Return type

SequentialJsonlWriter

to_file(path)
Return type

None

to_json(path)
Return type

None

to_jsonl(path)
Return type

None

to_yaml(path)
Return type

None

Feature extraction and manifests

Data structures and tools used for feature extraction and description.

Features API - extractor and manifests

class lhotse.features.base.FeatureExtractor(config=None)[source]

The base class for all feature extractors in Lhotse. It is initialized with a config object, specific to a particular feature extraction method. The config is expected to be a dataclass so that it can be easily serialized.

All derived feature extractors must implement at least the following:

  • a name class attribute (the name of these features, e.g. ‘mfcc’)

  • a config_type class attribute that points to the configuration dataclass type

  • the extract method,

  • the frame_shift property.

Feature extractors that support feature-domain mixing should additionally specify two static methods:

  • compute_energy, and

  • mix.

By itself, the FeatureExtractor offers the following high-level methods that are not intended for overriding:

  • extract_from_samples_and_store

  • extract_from_recording_and_store

These methods run a larger feature extraction pipeline that involves data augmentation and disk storage.
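
A minimal sketch of a custom extractor satisfying these requirements (the config fields and the feature computation are placeholders, not a real recipe; it assumes the base class stores the config object as self.config):

from dataclasses import dataclass

import numpy as np

from lhotse.features.base import FeatureExtractor, register_extractor

@dataclass
class MyConfig:
    frame_shift: float = 0.01  # seconds

@register_extractor
class MyFeatureExtractor(FeatureExtractor):
    name = 'my-features'
    config_type = MyConfig

    @property
    def frame_shift(self) -> float:
        return self.config.frame_shift

    def feature_dim(self, sampling_rate: int) -> int:
        return 40  # placeholder feature dimensionality

    def extract(self, samples: np.ndarray, sampling_rate: int) -> np.ndarray:
        # Placeholder: a real extractor would compute e.g. filterbanks here.
        num_frames = int(samples.size / (self.frame_shift * sampling_rate))
        return np.zeros((num_frames, self.feature_dim(sampling_rate)), dtype=np.float32)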

name = None
config_type = None
__init__(config=None)[source]
abstract extract(samples, sampling_rate)[source]

Defines how to extract features using a numpy ndarray of audio samples and the sampling rate.

Return type

ndarray

Returns

a numpy ndarray representing the feature matrix.

abstract property frame_shift: float
Return type

float

abstract feature_dim(sampling_rate)[source]
Return type

int

property device: Union[str, torch.device]
Return type

Union[str, device]

static mix(features_a, features_b, energy_scaling_factor_b)[source]

Perform feature-domain mix of two signals, a and b, and return the mixed signal.

Parameters
  • features_a (ndarray) – Left-hand side (reference) signal.

  • features_b (ndarray) – Right-hand side (mixed-in) signal.

  • energy_scaling_factor_b (float) – A scaling factor for features_b energy. It is used to achieve a specific SNR. E.g. to mix with an SNR of 10dB when both features_a and features_b energies are 100, the features_b signal energy needs to be scaled by 0.1. Since different features (e.g. spectrogram, fbank, MFCC) require different combination of transformations (e.g. exp, log, sqrt, pow) to allow mixing of two signals, the exact place where to apply energy_scaling_factor_b to the signal is determined by the implementer.

Return type

ndarray

Returns

A mixed feature matrix.

static compute_energy(features)[source]

Compute the total energy of a feature matrix. How the energy is computed depends on a particular type of features. It is expected that when implemented, compute_energy will never return zero.

Parameters

features (ndarray) – A feature matrix.

Return type

float

Returns

A positive float value of the signal energy.

extract_batch(samples, sampling_rate)[source]

Performs batch extraction. It is not guaranteed to be faster than FeatureExtractor.extract() – it depends on whether the implementation of a particular feature extractor supports accelerated batch computation.

Note

Unless overridden by child classes, it defaults to sequentially calling FeatureExtractor.extract() on the inputs.

Note

This method should support variable length inputs.

Return type

Union[ndarray, List[ndarray]]

extract_from_samples_and_store(samples, storage, sampling_rate, offset=0, channel=None, augment_fn=None)[source]

Extract the features from an array of audio samples in a full pipeline:

  • optional audio augmentation;

  • extract the features;

  • save them to disk in a specified directory;

  • return a Features object with a description of the extracted features.

Note, unlike in extract_from_recording_and_store, the returned Features object might not be suitable to store in a FeatureSet, as it does not reference any particular Recording. Instead, this method is useful when extracting features from cuts - especially MixedCut instances, which may be created from multiple recordings and channels.

Parameters
  • samples (ndarray) – a numpy ndarray with the audio samples.

  • sampling_rate (int) – integer sampling rate of samples.

  • storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices.

  • offset (float) – an offset in seconds for where to start reading the recording - when used for Cut feature extraction, must be equal to Cut.start.

  • channel (Optional[int]) – an optional channel number to insert into Features manifest.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.

Return type

Features

Returns

a Features manifest item for the extracted feature matrix (it is not written to disk).

extract_from_recording_and_store(recording, storage, offset=0, duration=None, channels=None, augment_fn=None)[source]

Extract the features from a Recording in a full pipeline:

  • load audio from disk;

  • optionally, perform audio augmentation;

  • extract the features;

  • save them to disk in a specified directory;

  • return a Features object with a description of the extracted features and the source data used.

Parameters
  • recording (Recording) – a Recording that specifies what’s the input audio.

  • storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices.

  • offset (float) – an optional offset in seconds for where to start reading the recording.

  • duration (Optional[float]) – an optional duration specifying how much audio to load from the recording.

  • channels (Union[int, List[int], None]) – an optional int or list of ints, specifying the channels; by default, all channels will be used.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.

Return type

Features

Returns

a Features manifest item for the extracted feature matrix.

classmethod from_dict(data)[source]
Return type

FeatureExtractor

to_dict()[source]
Return type

Dict[str, Any]

classmethod from_yaml(path)[source]
Return type

FeatureExtractor

to_yaml(path)[source]
lhotse.features.base.get_extractor_type(name)[source]

Return the feature extractor type corresponding to the given name.

Parameters

name (str) – specifies which feature extractor should be used.

Return type

Type

Returns

A feature extractor type.

lhotse.features.base.create_default_feature_extractor(name)[source]

Create a feature extractor object with a default configuration.

Parameters

name (str) – specifies which feature extractor should be used.

Return type

Optional[FeatureExtractor]

Returns

A new feature extractor instance.

lhotse.features.base.register_extractor(cls)[source]

This decorator is used to register feature extractor classes in Lhotse so they can be easily created just by knowing their name.

An example of usage:

@register_extractor
class MyFeatureExtractor:
    ...

Parameters

cls – A type (class) that is being registered.

Returns

Registered type.

class lhotse.features.base.TorchaudioFeatureExtractor(config=None)[source]

Common abstract base class for all torchaudio based feature extractors.

extract(samples, sampling_rate)[source]

Defines how to extract features using a numpy ndarray of audio samples and the sampling rate.

Return type

ndarray

Returns

a numpy ndarray representing the feature matrix.

property frame_shift: float
Return type

float

__init__(config=None)
static compute_energy(features)

Compute the total energy of a feature matrix. How the energy is computed depends on a particular type of features. It is expected that when implemented, compute_energy will never return zero.

Parameters

features (ndarray) – A feature matrix.

Return type

float

Returns

A positive float value of the signal energy.

config_type = None
property device: Union[str, torch.device]
Return type

Union[str, device]

extract_batch(samples, sampling_rate)

Performs batch extraction. It is not guaranteed to be faster than FeatureExtractor.extract() – it depends on whether the implementation of a particular feature extractor supports accelerated batch computation.

Note

Unless overridden by child classes, it defaults to sequentially calling FeatureExtractor.extract() on the inputs.

Note

This method should support variable length inputs.

Return type

Union[ndarray, List[ndarray]]

extract_from_recording_and_store(recording, storage, offset=0, duration=None, channels=None, augment_fn=None)

Extract the features from a Recording in a full pipeline:

  • load audio from disk;

  • optionally, perform audio augmentation;

  • extract the features;

  • save them to disk in a specified directory;

  • return a Features object with a description of the extracted features and the source data used.

Parameters
  • recording (Recording) – a Recording that specifies what’s the input audio.

  • storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices.

  • offset (float) – an optional offset in seconds for where to start reading the recording.

  • duration (Optional[float]) – an optional duration specifying how much audio to load from the recording.

  • channels (Union[int, List[int], None]) – an optional int or list of ints, specifying the channels; by default, all channels will be used.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.

Return type

Features

Returns

a Features manifest item for the extracted feature matrix.

extract_from_samples_and_store(samples, storage, sampling_rate, offset=0, channel=None, augment_fn=None)

Extract the features from an array of audio samples in a full pipeline:

  • optional audio augmentation;

  • extract the features;

  • save them to disk in a specified directory;

  • return a Features object with a description of the extracted features.

Note, unlike in extract_from_recording_and_store, the returned Features object might not be suitable to store in a FeatureSet, as it does not reference any particular Recording. Instead, this method is useful when extracting features from cuts - especially MixedCut instances, which may be created from multiple recordings and channels.

Parameters
  • samples (ndarray) – a numpy ndarray with the audio samples.

  • sampling_rate (int) – integer sampling rate of samples.

  • storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices.

  • offset (float) – an offset in seconds for where to start reading the recording - when used for Cut feature extraction, must be equal to Cut.start.

  • channel (Optional[int]) – an optional channel number to insert into Features manifest.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.

Return type

Features

Returns

a Features manifest item for the extracted feature matrix (it is not written to disk).

abstract feature_dim(sampling_rate)
Return type

int

classmethod from_dict(data)
Return type

FeatureExtractor

classmethod from_yaml(path)
Return type

FeatureExtractor

static mix(features_a, features_b, energy_scaling_factor_b)

Perform feature-domain mix of two signals, a and b, and return the mixed signal.

Parameters
  • features_a (ndarray) – Left-hand side (reference) signal.

  • features_b (ndarray) – Right-hand side (mixed-in) signal.

  • energy_scaling_factor_b (float) – A scaling factor for features_b energy. It is used to achieve a specific SNR. E.g. to mix with an SNR of 10dB when both features_a and features_b energies are 100, the features_b signal energy needs to be scaled by 0.1. Since different features (e.g. spectrogram, fbank, MFCC) require different combination of transformations (e.g. exp, log, sqrt, pow) to allow mixing of two signals, the exact place where to apply energy_scaling_factor_b to the signal is determined by the implementer.

Return type

ndarray

Returns

A mixed feature matrix.

name = None
to_dict()
Return type

Dict[str, Any]

to_yaml(path)
class lhotse.features.base.Features(type, num_frames, num_features, frame_shift, sampling_rate, start, duration, storage_type, storage_path, storage_key, recording_id=None, channels=None)[source]

Represents features extracted for some particular time range in a given recording and channel. It contains metadata about how it’s stored: storage_type describes “how to read it” – for now, it supports numpy arrays serialized with np.save, as well as arrays compressed with lilcom; storage_path is the path to the file on the local filesystem.

type: str
num_frames: int
num_features: int
frame_shift: float
sampling_rate: int
start: float
duration: float
storage_type: str
storage_path: str
storage_key: str
recording_id: Optional[str] = None
channels: Optional[Union[int, List[int]]] = None
property end: float
Return type

float

load(start=None, duration=None)[source]
Return type

ndarray

with_path_prefix(path)[source]
Return type

Features

to_dict()[source]
Return type

dict

static from_dict(data)[source]
Return type

Features

__init__(type, num_frames, num_features, frame_shift, sampling_rate, start, duration, storage_type, storage_path, storage_key, recording_id=None, channels=None)
class lhotse.features.base.FeatureSet(features=None)[source]

Represents a feature manifest, and allows reading features for given recordings within particular channels and time ranges. It also keeps information about the feature extractor parameters used to obtain this set. When a given recording/time-range/channel is unavailable, raises a KeyError.

__init__(features=None)[source]
static from_features(features)[source]
Return type

FeatureSet

static from_dicts(data)[source]
Return type

FeatureSet

to_dicts()[source]
Return type

Iterable[dict]

with_path_prefix(path)[source]
Return type

FeatureSet

split(num_splits, shuffle=False, drop_last=False)[source]

Split the FeatureSet into num_splits pieces of equal size.

Parameters
  • num_splits (int) – Requested number of splits.

  • shuffle (bool) – Optionally shuffle the recordings order first.

  • drop_last (bool) – determines how to handle splitting when len(seq) is not divisible by num_splits. When False (default), the splits might have unequal lengths. When True, it may discard the last element in some splits to ensure they are equally long.

Return type

List[FeatureSet]

Returns

A list of FeatureSet pieces.

subset(first=None, last=None)[source]

Return a new FeatureSet according to the selected subset criterion. Only a single argument to subset is supported at this time.

Parameters
  • first (Optional[int]) – int, the number of first features to keep.

  • last (Optional[int]) – int, the number of last features to keep.

Return type

FeatureSet

Returns

a new FeatureSet with the subset results.

find(recording_id, channel_id=0, start=0.0, duration=None, leeway=0.05)[source]

Find and return a Features object that best satisfies the search criteria. Raise a KeyError when no such object is available.

Parameters
  • recording_id (str) – str, requested recording ID.

  • channel_id (int) – int, requested channel.

  • start (float) – float, requested start time in seconds for the feature chunk.

  • duration (Optional[float]) – optional float, requested duration in seconds for the feature chunk. By default, return everything from the start.

  • leeway (float) – float, controls how strictly we have to match the requested start and duration criteria. It is necessary to keep a small positive value here (default 0.05s), as there might be differences between the duration of recording/supervision segment, and the duration of features. The latter one is constrained to be a multiple of frame_shift, while the former can be arbitrary.

Return type

Features

Returns

a Features object satisfying the search criteria.
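
Example (the recording ID is hypothetical):

>>> f = feature_set.find('rec-001', channel_id=0, start=0.0, duration=10.0)
>>> f.num_frames  # e.g. 1000 frames at a 0.01s frame shift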

load(recording_id, channel_id=0, start=0.0, duration=None)[source]

Find a Features object that best satisfies the search criteria and load the features as a numpy ndarray. Raise a KeyError when no such object is available.

Return type

ndarray

compute_global_stats(storage_path=None)[source]

Compute the global means and standard deviations for each feature bin in the manifest. It follows the implementation in scikit-learn: https://github.com/scikit-learn/scikit-learn/blob/0fb307bf39bbdacd6ed713c00724f8f871d60370/sklearn/utils/extmath.py#L715 which follows the paper: “Algorithms for computing the sample variance: analysis and recommendations”, by Chan, Golub, and LeVeque.

Parameters

storage_path (Union[Path, str, None]) – an optional path to a file where the stats will be stored with pickle.

Returns a dict of ``{'norm_means': np.ndarray, 'norm_stds': np.ndarray}`` with the shape of the arrays equal to the number of feature bins in this manifest.

Return type

Dict[str, ndarray]
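
Example (a sketch of computing the stats and using them for mean/variance normalization; the recording ID is hypothetical):

>>> stats = feature_set.compute_global_stats(storage_path='global_stats.pkl')
>>> feats = feature_set.load('rec-001')
>>> normalized = (feats - stats['norm_means']) / stats['norm_stds']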

count(value) → integer -- return number of occurrences of value
classmethod from_file(path)
Return type

Any

classmethod from_json(path)
Return type

Any

classmethod from_jsonl(path)
Return type

Any

classmethod from_jsonl_lazy(path)

Read a JSONL manifest lazily: the file is opened, but its contents are not read immediately. It is only suitable for sequential reads and iteration.

Warning

Opening the manifest in this way might cause some methods that rely on random access to fail.

Return type

Any
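
Example (a sketch; the manifest path is illustrative):

>>> from lhotse.features.base import FeatureSet
>>> feature_set = FeatureSet.from_jsonl_lazy('feats.jsonl')
>>> for features in feature_set:  # sequential iteration only
...     print(features.recording_id)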

classmethod from_yaml(path)
Return type

Any

index(value[, start[, stop]]) → integer -- return first index of value.

Raises ValueError if the value is not present.

Supporting start and stop arguments is optional, but recommended.

classmethod open_writer(path, overwrite=True)

Open a sequential writer that allows storing the manifests one by one, without the need to hold the whole manifest set in memory. Supports writing to JSONL format (.jsonl), with optional gzip compression (.jsonl.gz).

Example:

>>> from lhotse import RecordingSet
... recordings = [...]
... with RecordingSet.open_writer('recordings.jsonl.gz') as writer:
...     for recording in recordings:
...         writer.write(recording)

This writer can be useful for resuming the writing of files that was previously stopped – it will open the existing file and scan it for item IDs, so that they are skipped on subsequent writes. It can also be queried for existing IDs so that the user code may skip preparing the corresponding manifests.

Example:

>>> from lhotse import RecordingSet, Recording
... with RecordingSet.open_writer('recordings.jsonl.gz', overwrite=False) as writer:
...     for path in Path('.').rglob('*.wav'):
...         recording_id = path.stem
...         if writer.contains(recording_id):
...             # Item already written previously - skip processing.
...             continue
...         # Item doesn't exist yet - run extra work to prepare the manifest
...         # and store it.
...         recording = Recording.from_file(path, recording_id=recording_id)
...         writer.write(recording)
Return type

SequentialJsonlWriter

to_file(path)
Return type

None

to_json(path)
Return type

None

to_jsonl(path)
Return type

None

to_yaml(path)
Return type

None

class lhotse.features.base.FeatureSetBuilder(feature_extractor, storage, augment_fn=None)[source]

An extended constructor for the FeatureSet. Think of it as a class wrapper for a feature extraction script. It consumes an iterable of Recordings, extracts the features specified by the FeatureExtractor config, and stores them on disk.

Eventually, we plan to extend it with the capability to extract only the features in specified regions of recordings and to perform some time-domain data augmentation.

__init__(feature_extractor, storage, augment_fn=None)[source]
process_and_store_recordings(recordings, output_manifest=None, num_jobs=1)[source]
Return type

FeatureSet
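
Example (a minimal end-to-end sketch, assuming the Fbank extractor and lilcom storage; the paths are illustrative):

>>> from lhotse import Fbank, RecordingSet
>>> from lhotse.features.base import FeatureSetBuilder
>>> from lhotse.features.io import LilcomFilesWriter
>>> recordings = RecordingSet.from_jsonl('recordings.jsonl')
>>> with LilcomFilesWriter('feats/') as storage:
...     builder = FeatureSetBuilder(feature_extractor=Fbank(), storage=storage)
...     feature_set = builder.process_and_store_recordings(recordings, num_jobs=4)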

lhotse.features.base.store_feature_array(feats, storage)[source]

Store feats array on disk, using lilcom compression by default.

Parameters
  • feats (ndarray) – a numpy ndarray containing features.

  • storage (FeaturesWriter) – a FeaturesWriter object to use for array storage.

Return type

str

Returns

a path to the file containing the stored array.
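
Example (a sketch, assuming the LilcomFilesWriter storage backend; the output directory is illustrative):

>>> import numpy as np
>>> from lhotse.features.io import LilcomFilesWriter
>>> feats = np.random.rand(1000, 80).astype(np.float32)
>>> with LilcomFilesWriter('feats/') as storage:
...     path = store_feature_array(feats, storage)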

lhotse.features.base.compute_global_stats(feature_manifests, storage_path=None)[source]

Compute the global means and standard deviations for each feature bin in the manifest. It performs only a single pass over the data and iteratively updates the estimate of the means and variances.

We follow the implementation in scikit-learn: https://github.com/scikit-learn/scikit-learn/blob/0fb307bf39bbdacd6ed713c00724f8f871d60370/sklearn/utils/extmath.py#L715 which follows the paper: “Algorithms for computing the sample variance: analysis and recommendations”, by Chan, Golub, and LeVeque.

Parameters
  • feature_manifests (Iterable[Features]) – an iterable of Features objects.

  • storage_path (Union[Path, str, None]) – an optional path to a file where the stats will be stored with pickle.

Returns a dict of ``{'norm_means': np.ndarray, 'norm_stds': np.ndarray}`` with the shape of the arrays equal to the number of feature bins in this manifest.

Return type

Dict[str, ndarray]

Lhotse’s feature extractors

class lhotse.features.kaldi.extractors.KaldiFbank(config=None)[source]
name = 'kaldi-fbank'
config_type

alias of lhotse.features.kaldi.extractors.KaldiFbankConfig

__init__(config=None)[source]
property frame_shift: float
Return type

float

feature_dim(sampling_rate)[source]
Return type

int

extract(samples, sampling_rate)[source]

Defines how to extract features using a numpy ndarray of audio samples and the sampling rate.

Return type

Union[ndarray, Tensor]

Returns

a numpy ndarray representing the feature matrix.
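
Example (a sketch; the shapes assume the default configuration, with a 0.01s frame shift and 80 mel bins):

>>> import numpy as np
>>> from lhotse.features.kaldi.extractors import KaldiFbank
>>> fbank = KaldiFbank()
>>> samples = np.random.randn(16000).astype(np.float32)  # 1 second of audio at 16kHz
>>> feats = fbank.extract(samples, sampling_rate=16000)
>>> feats.shape  # approximately (100, 80)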

static mix(features_a, features_b, energy_scaling_factor_b)[source]

Perform feature-domain mix of two signals, a and b, and return the mixed signal.

Parameters
  • features_a (ndarray) – Left-hand side (reference) signal.

  • features_b (ndarray) – Right-hand side (mixed-in) signal.

  • energy_scaling_factor_b (float) – A scaling factor for features_b energy. It is used to achieve a specific SNR. E.g. to mix with an SNR of 10dB when both features_a and features_b energies are 100, the features_b signal energy needs to be scaled by 0.1. Since different features (e.g. spectrogram, fbank, MFCC) require different combination of transformations (e.g. exp, log, sqrt, pow) to allow mixing of two signals, the exact place where to apply energy_scaling_factor_b to the signal is determined by the implementer.

Return type

ndarray

Returns

A mixed feature matrix.

static compute_energy(features)[source]

Compute the total energy of a feature matrix. How the energy is computed depends on the particular type of features. It is expected that when implemented, compute_energy will never return zero.

Parameters

features (ndarray) – A feature matrix.

Return type

float

Returns

A positive float value of the signal energy.
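
Example (a sketch of deriving energy_scaling_factor_b for a target SNR, consistent with the 10dB example above; features_a and features_b stand for two feature matrices of the same shape):

>>> energy_a = KaldiFbank.compute_energy(features_a)
>>> energy_b = KaldiFbank.compute_energy(features_b)
>>> target_snr = 10.0  # dB
>>> scaling = energy_a / (energy_b * 10 ** (target_snr / 10))
>>> mixed = KaldiFbank.mix(features_a, features_b, energy_scaling_factor_b=scaling)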

class lhotse.features.kaldi.extractors.KaldiMfcc(config=None)[source]
name = 'kaldi-mfcc'
config_type

alias of lhotse.features.kaldi.extractors.KaldiMfccConfig

__init__(config=None)[source]
property frame_shift: float
Return type

float

feature_dim(sampling_rate)[source]
Return type

int

extract(samples, sampling_rate)[source]

Defines how to extract features using a numpy ndarray of audio samples and the sampling rate.

Return type

Union[ndarray, Tensor]

Returns

a numpy ndarray representing the feature matrix.

Kaldi feature extractors as network layers

This whole module is authored and contributed by Jesus Villalba, with minor changes by Piotr Żelasko to make it more consistent with Lhotse.

It contains a PyTorch implementation of feature extractors that is very close to Kaldi’s – notably, it differs in that the preemphasis and DC offset removal are applied in the time domain rather than the frequency domain. This should not significantly affect any results, as confirmed by Jesus.

This implementation works well with autograd and batching, and can be used as neural network layers.

class lhotse.features.kaldi.layers.Wav2Win(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, pad_length=None, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, return_log_energy=False)[source]

Apply standard Kaldi preprocessing (dithering, removing DC offset, pre-emphasis, etc.) on the input waveforms and partition them into overlapping frames (of audio samples). Note: no feature extraction happens here; the output is still a time-domain signal.

Example:

>>> x = torch.randn(1, 16000, dtype=torch.float32)
>>> x.shape
torch.Size([1, 16000])
>>> t = Wav2Win()
>>> t(x).shape
torch.Size([1, 100, 400])

The input is a tensor of shape (batch_size, num_samples). The output is a tensor of shape (batch_size, num_frames, window_length). When return_log_energy==True, returns a tuple where the second element is a log-energy tensor of shape (batch_size, num_frames).
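
Example with log-energies enabled (a sketch; shapes follow the description above):

>>> import torch
>>> from lhotse.features.kaldi.layers import Wav2Win
>>> t = Wav2Win(return_log_energy=True)
>>> frames, log_energy = t(torch.randn(1, 16000, dtype=torch.float32))
>>> frames.shape, log_energy.shape
(torch.Size([1, 100, 400]), torch.Size([1, 100]))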

__init__(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, pad_length=None, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, return_log_energy=False)[source]

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Return type

Union[Tensor, Tuple[Tensor, Tensor]]

T_destination

alias of TypeVar(‘T_destination’, bound=Mapping[str, torch.Tensor])

add_module(name, module)

Adds a child module to the current module.

The module can be accessed as an attribute using the given name.

Args:
name (string): name of the child module. The child module can be

accessed from this module using the given name

module (Module): child module to be added to the module.

Return type

None

apply(fn)

Applies fn recursively to every submodule (as returned by .children()) as well as self. Typical use includes initializing the parameters of a model (see also nn-init-doc).

Args:

fn (Module -> None): function to be applied to each submodule

Returns:

Module: self

Example:

>>> @torch.no_grad()
>>> def init_weights(m):
>>>     print(m)
>>>     if type(m) == nn.Linear:
>>>         m.weight.fill_(1.0)
>>>         print(m.weight)
>>> net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
>>> net.apply(init_weights)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[ 1.,  1.],
        [ 1.,  1.]])
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[ 1.,  1.],
        [ 1.,  1.]])
Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
Return type

~T

bfloat16()

Casts all floating point parameters and buffers to bfloat16 datatype.

Note

This method modifies the module in-place.

Returns:

Module: self

Return type

~T

buffers(recurse=True)

Returns an iterator over module buffers.

Args:
recurse (bool): if True, then yields buffers of this module

and all submodules. Otherwise, yields only buffers that are direct members of this module.

Yields:

torch.Tensor: module buffer

Example:

>>> for buf in model.buffers():
>>>     print(type(buf), buf.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
Return type

Iterator[Tensor]

children()

Returns an iterator over immediate children modules.

Yields:

Module: a child module

Return type

Iterator[Module]

cpu()

Moves all model parameters and buffers to the CPU.

Note

This method modifies the module in-place.

Returns:

Module: self

Return type

~T

cuda(device=None)

Moves all model parameters and buffers to the GPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing optimizer if the module will live on GPU while being optimized.

Note

This method modifies the module in-place.

Args:
device (int, optional): if specified, all parameters will be

copied to that device

Returns:

Module: self

Return type

~T

double()

Casts all floating point parameters and buffers to double datatype.

Note

This method modifies the module in-place.

Returns:

Module: self

Return type

~T

dump_patches: bool = False

This allows better BC support for load_state_dict(). In state_dict(), the version number will be saved in the attribute _metadata of the returned state dict, and thus pickled. _metadata is a dictionary with keys that follow the naming convention of state dict. See _load_from_state_dict on how to use this information in loading.

If new parameters/buffers are added/removed from a module, this number shall be bumped, and the module’s _load_from_state_dict method can compare the version number and do appropriate changes if the state dict is from before the change.

eval()

Sets the module in evaluation mode.

This has an effect only on certain modules. See the documentation of particular modules for details of their behavior in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

This is equivalent to self.train(False).

See locally-disable-grad-doc for a comparison between .eval() and several similar mechanisms that may be confused with it.

Returns:

Module: self

Return type

~T

extra_repr()

Set the extra representation of the module.

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

Return type

str

float()

Casts all floating point parameters and buffers to float datatype.

Note

This method modifies the module in-place.

Returns:

Module: self

Return type

~T

get_buffer(target)

Returns the buffer given by target if it exists, otherwise throws an error.

See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.

Args:
target: The fully-qualified string name of the buffer

to look for. (See get_submodule for how to specify a fully-qualified string.)

Returns:

torch.Tensor: The buffer referenced by target

Raises:
AttributeError: If the target string references an invalid

path or resolves to something that is not a buffer

Return type

Tensor

get_extra_state()

Returns any extra state to include in the module’s state_dict. Implement this and a corresponding set_extra_state() for your module if you need to store extra state. This function is called when building the module’s state_dict().

Note that extra state should be pickleable to ensure working serialization of the state_dict. We only provide backwards compatibility guarantees for serializing Tensors; other objects may break backwards compatibility if their serialized pickled form changes.

Returns:

object: Any extra state to store in the module’s state_dict

Return type

Any

get_parameter(target)

Returns the parameter given by target if it exists, otherwise throws an error.

See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.

Args:
target: The fully-qualified string name of the Parameter

to look for. (See get_submodule for how to specify a fully-qualified string.)

Returns:

torch.nn.Parameter: The Parameter referenced by target

Raises:
AttributeError: If the target string references an invalid

path or resolves to something that is not an nn.Parameter

Return type

Parameter

get_submodule(target)

Returns the submodule given by target if it exists, otherwise throws an error.

For example, let’s say you have an nn.Module A that looks like this:

(The diagram shows an nn.Module A. A has a nested submodule net_b, which itself has two submodules net_c and linear. net_c then has a submodule conv.)

To check whether or not we have the linear submodule, we would call get_submodule("net_b.linear"). To check whether we have the conv submodule, we would call get_submodule("net_b.net_c.conv").

The runtime of get_submodule is bounded by the degree of module nesting in target. A query against named_modules achieves the same result, but it is O(N) in the number of transitive modules. So, for a simple check to see if some submodule exists, get_submodule should always be used.

Args:
target: The fully-qualified string name of the submodule

to look for. (See above example for how to specify a fully-qualified string.)

Returns:

torch.nn.Module: The submodule referenced by target

Raises:
AttributeError: If the target string references an invalid

path or resolves to something that is not an nn.Module

Return type

Module

half()

Casts all floating point parameters and buffers to half datatype.

Note

This method modifies the module in-place.

Returns:

Module: self

Return type

~T

load_state_dict(state_dict, strict=True)

Copies parameters and buffers from state_dict into this module and its descendants. If strict is True, then the keys of state_dict must exactly match the keys returned by this module’s state_dict() function.

Args:
state_dict (dict): a dict containing parameters and

persistent buffers.

strict (bool, optional): whether to strictly enforce that the keys

in state_dict match the keys returned by this module’s state_dict() function. Default: True

Returns:
NamedTuple with missing_keys and unexpected_keys fields:
  • missing_keys is a list of str containing the missing keys

  • unexpected_keys is a list of str containing the unexpected keys

Note:

If a parameter or buffer is registered as None and its corresponding key exists in state_dict, load_state_dict() will raise a RuntimeError.

modules()

Returns an iterator over all modules in the network.

Yields:

Module: a module in the network

Note:

Duplicate modules are returned only once. In the following example, l will be returned only once.

Example:

>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.modules()):
...     print(idx, '->', m)

0 -> Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
1 -> Linear(in_features=2, out_features=2, bias=True)
Return type

Iterator[Module]

named_buffers(prefix='', recurse=True)

Returns an iterator over module buffers, yielding both the name of the buffer as well as the buffer itself.

Args:

prefix (str): prefix to prepend to all buffer names.
recurse (bool): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module.

Yields:

(string, torch.Tensor): Tuple containing the name and buffer

Example:

>>> for name, buf in self.named_buffers():
>>>    if name in ['running_var']:
>>>        print(buf.size())
Return type

Iterator[Tuple[str, Tensor]]

named_children()

Returns an iterator over immediate children modules, yielding both the name of the module as well as the module itself.

Yields:

(string, Module): Tuple containing a name and child module

Example:

>>> for name, module in model.named_children():
>>>     if name in ['conv4', 'conv5']:
>>>         print(module)
Return type

Iterator[Tuple[str, Module]]

named_modules(memo=None, prefix='', remove_duplicate=True)

Returns an iterator over all modules in the network, yielding both the name of the module as well as the module itself.

Args:

memo: a memo to store the set of modules already added to the result
prefix: a prefix that will be added to the name of the module
remove_duplicate: whether to remove the duplicated module instances in the result or not

Yields:

(string, Module): Tuple of name and module

Note:

Duplicate modules are returned only once. In the following example, l will be returned only once.

Example:

>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.named_modules()):
...     print(idx, '->', m)

0 -> ('', Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
))
1 -> ('0', Linear(in_features=2, out_features=2, bias=True))
named_parameters(prefix='', recurse=True)

Returns an iterator over module parameters, yielding both the name of the parameter as well as the parameter itself.

Args:

prefix (str): prefix to prepend to all parameter names.
recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.

Yields:

(string, Parameter): Tuple containing the name and parameter

Example:

>>> for name, param in self.named_parameters():
>>>    if name in ['bias']:
>>>        print(param.size())
Return type

Iterator[Tuple[str, Parameter]]

parameters(recurse=True)

Returns an iterator over module parameters.

This is typically passed to an optimizer.

Args:
recurse (bool): if True, then yields parameters of this module

and all submodules. Otherwise, yields only parameters that are direct members of this module.

Yields:

Parameter: module parameter

Example:

>>> for param in model.parameters():
>>>     print(type(param), param.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
Return type

Iterator[Parameter]

register_backward_hook(hook)

Registers a backward hook on the module.

This function is deprecated in favor of register_full_backward_hook() and the behavior of this function will change in future versions.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

Return type

RemovableHandle

register_buffer(name, tensor, persistent=True)

Adds a buffer to the module.

This is typically used to register a buffer that should not be considered a model parameter. For example, BatchNorm’s running_mean is not a parameter, but is part of the module’s state. Buffers, by default, are persistent and will be saved alongside parameters. This behavior can be changed by setting persistent to False. The only difference between a persistent buffer and a non-persistent buffer is that the latter will not be a part of this module’s state_dict.

Buffers can be accessed as attributes using given names.

Args:
name (string): name of the buffer. The buffer can be accessed

from this module using the given name

tensor (Tensor or None): buffer to be registered. If None, then operations

that run on buffers, such as cuda, are ignored. If None, the buffer is not included in the module’s state_dict.

persistent (bool): whether the buffer is part of this module’s

state_dict.

Example:

>>> self.register_buffer('running_mean', torch.zeros(num_features))
Return type

None

register_forward_hook(hook)

Registers a forward hook on the module.

The hook will be called every time after forward() has computed an output. It should have the following signature:

hook(module, input, output) -> None or modified output

The input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks and only to the forward. The hook can modify the output. It can modify the input inplace but it will not have effect on forward since this is called after forward() is called.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

Return type

RemovableHandle

register_forward_pre_hook(hook)

Registers a forward pre-hook on the module.

The hook will be called every time before forward() is invoked. It should have the following signature:

hook(module, input) -> None or modified input

The input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks and only to the forward. The hook can modify the input. The user can either return a tuple or a single modified value in the hook. We will wrap the value into a tuple if a single value is returned (unless that value is already a tuple).

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

Return type

RemovableHandle

register_full_backward_hook(hook)

Registers a backward hook on the module.

The hook will be called every time the gradients with respect to module inputs are computed. The hook should have the following signature:

hook(module, grad_input, grad_output) -> tuple(Tensor) or None

The grad_input and grad_output are tuples that contain the gradients with respect to the inputs and outputs respectively. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the input that will be used in place of grad_input in subsequent computations. grad_input will only correspond to the inputs given as positional arguments and all kwarg arguments are ignored. Entries in grad_input and grad_output will be None for all non-Tensor arguments.

For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly the caller will receive a view of each Tensor returned by the Module’s forward function.

Warning

Modifying inputs or outputs inplace is not allowed when using backward hooks and will raise an error.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

Return type

RemovableHandle

register_parameter(name, param)

Adds a parameter to the module.

The parameter can be accessed as an attribute using given name.

Args:
name (string): name of the parameter. The parameter can be accessed

from this module using the given name

param (Parameter or None): parameter to be added to the module. If

None, then operations that run on parameters, such as cuda, are ignored. If None, the parameter is not included in the module’s state_dict.

Return type

None

requires_grad_(requires_grad=True)

Change if autograd should record operations on parameters in this module.

This method sets the parameters’ requires_grad attributes in-place.

This method is helpful for freezing part of the module for finetuning or training parts of a model individually (e.g., GAN training).

See locally-disable-grad-doc for a comparison between .requires_grad_() and several similar mechanisms that may be confused with it.

Args:
requires_grad (bool): whether autograd should record operations on

parameters in this module. Default: True.

Returns:

Module: self

Return type

~T

set_extra_state(state)

This function is called from load_state_dict() to handle any extra state found within the state_dict. Implement this function and a corresponding get_extra_state() for your module if you need to store extra state within its state_dict.

Args:

state (dict): Extra state from the state_dict

share_memory()

See torch.Tensor.share_memory_()

Return type

~T

state_dict(destination=None, prefix='', keep_vars=False)

Returns a dictionary containing a whole state of the module.

Both parameters and persistent buffers (e.g. running averages) are included. Keys are corresponding parameter and buffer names. Parameters and buffers set to None are not included.

Returns:
dict:

a dictionary containing a whole state of the module

Example:

>>> module.state_dict().keys()
['bias', 'weight']
to(*args, **kwargs)

Moves and/or casts the parameters and buffers.

This can be called as

to(device=None, dtype=None, non_blocking=False)
to(dtype, non_blocking=False)
to(tensor, non_blocking=False)
to(memory_format=torch.channels_last)

Its signature is similar to torch.Tensor.to(), but only accepts floating point or complex dtypes. In addition, this method will only cast the floating point or complex parameters and buffers to dtype (if given). The integral parameters and buffers will be moved to device, if that is given, but with dtypes unchanged. When non_blocking is set, it tries to convert/move asynchronously with respect to the host if possible, e.g., moving CPU Tensors with pinned memory to CUDA devices.

See below for examples.

Note

This method modifies the module in-place.

Args:
device (torch.device): the desired device of the parameters

and buffers in this module

dtype (torch.dtype): the desired floating point or complex dtype of

the parameters and buffers in this module

tensor (torch.Tensor): Tensor whose dtype and device are the desired

dtype and device for all parameters and buffers in this module

memory_format (torch.memory_format): the desired memory

format for 4D parameters and buffers in this module (keyword only argument)

Returns:

Module: self

Examples:

>>> linear = nn.Linear(2, 2)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]])
>>> linear.to(torch.double)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]], dtype=torch.float64)
>>> gpu1 = torch.device("cuda:1")
>>> linear.to(gpu1, dtype=torch.half, non_blocking=True)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16, device='cuda:1')
>>> cpu = torch.device("cpu")
>>> linear.to(cpu)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16)

>>> linear = nn.Linear(2, 2, bias=None).to(torch.cdouble)
>>> linear.weight
Parameter containing:
tensor([[ 0.3741+0.j,  0.2382+0.j],
        [ 0.5593+0.j, -0.4443+0.j]], dtype=torch.complex128)
>>> linear(torch.ones(3, 2, dtype=torch.cdouble))
tensor([[0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j]], dtype=torch.complex128)
to_empty(*, device)

Moves the parameters and buffers to the specified device without copying storage.

Args:
device (torch.device): The desired device of the parameters

and buffers in this module.

Returns:

Module: self

Return type

~T

train(mode=True)

Sets the module in training mode.

This has an effect only on certain modules. See the documentation of particular modules for details of their behavior in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

Args:
mode (bool): whether to set training mode (True) or evaluation

mode (False). Default: True.

Returns:

Module: self

Return type

~T

type(dst_type)

Casts all parameters and buffers to dst_type.

Note

This method modifies the module in-place.

Args:

dst_type (type or string): the desired type

Returns:

Module: self

Return type

~T

xpu(device=None)

Moves all model parameters and buffers to the XPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing optimizer if the module will live on XPU while being optimized.

Note

This method modifies the module in-place.

Arguments:
device (int, optional): if specified, all parameters will be

copied to that device

Returns:

Module: self

Return type

~T

zero_grad(set_to_none=False)

Sets gradients of all model parameters to zero. See similar function under torch.optim.Optimizer for more context.

Args:
set_to_none (bool): instead of setting to zero, set the grads to None.

See torch.optim.Optimizer.zero_grad() for details.

Return type

None

training: bool
class lhotse.features.kaldi.layers.Wav2FFT(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, fft_length=512, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=True)[source]

Apply standard Kaldi preprocessing (dithering, removing DC offset, pre-emphasis, etc.) on the input waveforms and compute their Short-Time Fourier Transform (STFT). The output is a complex-valued tensor.

Example:

>>> x = torch.randn(1, 16000, dtype=torch.float32)
>>> x.shape
torch.Size([1, 16000])
>>> t = Wav2FFT()
>>> t(x).shape
torch.Size([1, 100, 257])

The input is a tensor of shape (batch_size, num_samples). The output is a tensor of shape (batch_size, num_frames, num_fft_bins) with dtype torch.complex64.

__init__(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, fft_length=512, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=True)[source]

Initializes internal Module state, shared by both nn.Module and ScriptModule.

property sampling_rate: int
Return type

int

property frame_length: float
Return type

float

property frame_shift: float
Return type

float

property remove_dc_offset: bool
Return type

bool

property preemph_coeff: float
Return type

float

property window_type: str
Return type

str

property dither: float
Return type

float

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Return type

Tensor

T_destination

alias of TypeVar(‘T_destination’, bound=Mapping[str, torch.Tensor])

add_module(name, module)

Adds a child module to the current module.

The module can be accessed as an attribute using the given name.

Args:
name (string): name of the child module. The child module can be

accessed from this module using the given name

module (Module): child module to be added to the module.

Return type

None

apply(fn)

Applies fn recursively to every submodule (as returned by .children()) as well as self. Typical use includes initializing the parameters of a model (see also nn-init-doc).

Args:

fn (Module -> None): function to be applied to each submodule

Returns:

Module: self

Example:

>>> @torch.no_grad()
>>> def init_weights(m):
>>>     print(m)
>>>     if type(m) == nn.Linear:
>>>         m.weight.fill_(1.0)
>>>         print(m.weight)
>>> net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
>>> net.apply(init_weights)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[ 1.,  1.],
        [ 1.,  1.]])
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[ 1.,  1.],
        [ 1.,  1.]])
Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
Return type

~T

bfloat16()

Casts all floating point parameters and buffers to bfloat16 datatype.

Note

This method modifies the module in-place.

Returns:

Module: self

Return type

~T

buffers(recurse=True)

Returns an iterator over module buffers.

Args:
recurse (bool): if True, then yields buffers of this module

and all submodules. Otherwise, yields only buffers that are direct members of this module.

Yields:

torch.Tensor: module buffer

Example:

>>> for buf in model.buffers():
>>>     print(type(buf), buf.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
Return type

Iterator[Tensor]

children()

Returns an iterator over immediate children modules.

Yields:

Module: a child module

Return type

Iterator[Module]

cpu()

Moves all model parameters and buffers to the CPU.

Note

This method modifies the module in-place.

Returns:

Module: self

Return type

~T

cuda(device=None)

Moves all model parameters and buffers to the GPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing optimizer if the module will live on GPU while being optimized.

Note

This method modifies the module in-place.

Args:
device (int, optional): if specified, all parameters will be

copied to that device

Returns:

Module: self

Return type

~T

double()

Casts all floating point parameters and buffers to double datatype.

Note

This method modifies the module in-place.

Returns:

Module: self

Return type

~T

dump_patches: bool = False

This allows better BC support for load_state_dict(). In state_dict(), the version number will be saved in the attribute _metadata of the returned state dict, and thus pickled. _metadata is a dictionary with keys that follow the naming convention of state dict. See _load_from_state_dict on how to use this information in loading.

If new parameters/buffers are added/removed from a module, this number shall be bumped, and the module’s _load_from_state_dict method can compare the version number and do appropriate changes if the state dict is from before the change.

eval()

Sets the module in evaluation mode.

This has an effect only on certain modules. See the documentation of particular modules for details of their behavior in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

This is equivalent to self.train(False).

See locally-disable-grad-doc for a comparison between .eval() and several similar mechanisms that may be confused with it.

Returns:

Module: self

Return type

~T

extra_repr()

Set the extra representation of the module.

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

Return type

str

float()

Casts all floating point parameters and buffers to float datatype.

Note

This method modifies the module in-place.

Returns:

Module: self

Return type

~T

get_buffer(target)

Returns the buffer given by target if it exists, otherwise throws an error.

See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.

Args:
target: The fully-qualified string name of the buffer

to look for. (See get_submodule for how to specify a fully-qualified string.)

Returns:

torch.Tensor: The buffer referenced by target

Raises:
AttributeError: If the target string references an invalid

path or resolves to something that is not a buffer

Return type

Tensor

get_extra_state()

Returns any extra state to include in the module’s state_dict. Implement this and a corresponding set_extra_state() for your module if you need to store extra state. This function is called when building the module’s state_dict().

Note that extra state should be pickleable to ensure working serialization of the state_dict. We only provide backwards compatibility guarantees for serializing Tensors; other objects may break backwards compatibility if their serialized pickled form changes.

Returns:

object: Any extra state to store in the module’s state_dict

Return type

Any

get_parameter(target)

Returns the parameter given by target if it exists, otherwise throws an error.

See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.

Args:
target: The fully-qualified string name of the Parameter

to look for. (See get_submodule for how to specify a fully-qualified string.)

Returns:

torch.nn.Parameter: The Parameter referenced by target

Raises:
AttributeError: If the target string references an invalid

path or resolves to something that is not an nn.Parameter

Return type

Parameter

get_submodule(target)

Returns the submodule given by target if it exists, otherwise throws an error.

For example, let’s say you have an nn.Module A that looks like this:

(The diagram shows an nn.Module A. A has a nested submodule net_b, which itself has two submodules net_c and linear. net_c then has a submodule conv.)

To check whether or not we have the linear submodule, we would call get_submodule("net_b.linear"). To check whether we have the conv submodule, we would call get_submodule("net_b.net_c.conv").

The runtime of get_submodule is bounded by the degree of module nesting in target. A query against named_modules achieves the same result, but it is O(N) in the number of transitive modules. So, for a simple check to see if some submodule exists, get_submodule should always be used.

Args:
target: The fully-qualified string name of the submodule

to look for. (See above example for how to specify a fully-qualified string.)

Returns:

torch.nn.Module: The submodule referenced by target

Raises:
AttributeError: If the target string references an invalid

path or resolves to something that is not an nn.Module

Return type

Module

half()

Casts all floating point parameters and buffers to half datatype.

Note

This method modifies the module in-place.

Returns:

Module: self

Return type

~T

load_state_dict(state_dict, strict=True)

Copies parameters and buffers from state_dict into this module and its descendants. If strict is True, then the keys of state_dict must exactly match the keys returned by this module’s state_dict() function.

Args:
state_dict (dict): a dict containing parameters and

persistent buffers.

strict (bool, optional): whether to strictly enforce that the keys

in state_dict match the keys returned by this module’s state_dict() function. Default: True

Returns:
NamedTuple with missing_keys and unexpected_keys fields:
  • missing_keys is a list of str containing the missing keys

  • unexpected_keys is a list of str containing the unexpected keys

Note:

If a parameter or buffer is registered as None and its corresponding key exists in state_dict, load_state_dict() will raise a RuntimeError.

modules()

Returns an iterator over all modules in the network.

Yields:

Module: a module in the network

Note:

Duplicate modules are returned only once. In the following example, l will be returned only once.

Example:

>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.modules()):
...     print(idx, '->', m)

0 -> Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
1 -> Linear(in_features=2, out_features=2, bias=True)
Return type

Iterator[Module]

named_buffers(prefix='', recurse=True)

Returns an iterator over module buffers, yielding both the name of the buffer as well as the buffer itself.

Args:

prefix (str): prefix to prepend to all buffer names.
recurse (bool): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module.

Yields:

(string, torch.Tensor): Tuple containing the name and buffer

Example:

>>> for name, buf in self.named_buffers():
>>>    if name in ['running_var']:
>>>        print(buf.size())
Return type

Iterator[Tuple[str, Tensor]]

named_children()

Returns an iterator over immediate children modules, yielding both the name of the module as well as the module itself.

Yields:

(string, Module): Tuple containing a name and child module

Example:

>>> for name, module in model.named_children():
>>>     if name in ['conv4', 'conv5']:
>>>         print(module)
Return type

Iterator[Tuple[str, Module]]

named_modules(memo=None, prefix='', remove_duplicate=True)

Returns an iterator over all modules in the network, yielding both the name of the module as well as the module itself.

Args:

memo: a memo to store the set of modules already added to the result
prefix: a prefix that will be added to the name of the module
remove_duplicate: whether to remove the duplicated module instances in the result or not

Yields:

(string, Module): Tuple of name and module

Note:

Duplicate modules are returned only once. In the following example, l will be returned only once.

Example:

>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.named_modules()):
...     print(idx, '->', m)

0 -> ('', Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
))
1 -> ('0', Linear(in_features=2, out_features=2, bias=True))
named_parameters(prefix='', recurse=True)

Returns an iterator over module parameters, yielding both the name of the parameter as well as the parameter itself.

Args:

prefix (str): prefix to prepend to all parameter names.
recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.

Yields:

(string, Parameter): Tuple containing the name and parameter

Example:

>>> for name, param in self.named_parameters():
>>>    if name in ['bias']:
>>>        print(param.size())
Return type

Iterator[Tuple[str, Parameter]]

parameters(recurse=True)

Returns an iterator over module parameters.

This is typically passed to an optimizer.

Args:
recurse (bool): if True, then yields parameters of this module

and all submodules. Otherwise, yields only parameters that are direct members of this module.

Yields:

Parameter: module parameter

Example:

>>> for param in model.parameters():
>>>     print(type(param), param.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
Return type

Iterator[Parameter]

register_backward_hook(hook)

Registers a backward hook on the module.

This function is deprecated in favor of register_full_backward_hook() and the behavior of this function will change in future versions.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

Return type

RemovableHandle

register_buffer(name, tensor, persistent=True)

Adds a buffer to the module.

This is typically used to register a buffer that should not be considered a model parameter. For example, BatchNorm’s running_mean is not a parameter, but is part of the module’s state. Buffers, by default, are persistent and will be saved alongside parameters. This behavior can be changed by setting persistent to False. The only difference between a persistent buffer and a non-persistent buffer is that the latter will not be a part of this module’s state_dict.

Buffers can be accessed as attributes using given names.

Args:
name (string): name of the buffer. The buffer can be accessed

from this module using the given name

tensor (Tensor or None): buffer to be registered. If None, then operations

that run on buffers, such as cuda, are ignored. If None, the buffer is not included in the module’s state_dict.

persistent (bool): whether the buffer is part of this module’s

state_dict.

Example:

>>> self.register_buffer('running_mean', torch.zeros(num_features))
Return type

None

register_forward_hook(hook)

Registers a forward hook on the module.

The hook will be called every time after forward() has computed an output. It should have the following signature:

hook(module, input, output) -> None or modified output

The input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks and only to the forward. The hook can modify the output. It can modify the input inplace but it will not have effect on forward since this is called after forward() is called.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

Return type

RemovableHandle

register_forward_pre_hook(hook)

Registers a forward pre-hook on the module.

The hook will be called every time before forward() is invoked. It should have the following signature:

hook(module, input) -> None or modified input

The input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks and only to the forward. The hook can modify the input. The user can either return a tuple or a single modified value in the hook. We will wrap the value into a tuple if a single value is returned (unless that value is already a tuple).

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

Return type

RemovableHandle

register_full_backward_hook(hook)

Registers a backward hook on the module.

The hook will be called every time the gradients with respect to module inputs are computed. The hook should have the following signature:

hook(module, grad_input, grad_output) -> tuple(Tensor) or None

The grad_input and grad_output are tuples that contain the gradients with respect to the inputs and outputs respectively. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the input that will be used in place of grad_input in subsequent computations. grad_input will only correspond to the inputs given as positional arguments and all kwarg arguments are ignored. Entries in grad_input and grad_output will be None for all non-Tensor arguments.

For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly the caller will receive a view of each Tensor returned by the Module’s forward function.

Warning

Modifying inputs or outputs inplace is not allowed when using backward hooks and will raise an error.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

Return type

RemovableHandle

register_parameter(name, param)

Adds a parameter to the module.

The parameter can be accessed as an attribute using given name.

Args:
name (string): name of the parameter. The parameter can be accessed

from this module using the given name

param (Parameter or None): parameter to be added to the module. If

None, then operations that run on parameters, such as cuda, are ignored. If None, the parameter is not included in the module’s state_dict.

Return type

None

requires_grad_(requires_grad=True)

Change if autograd should record operations on parameters in this module.

This method sets the parameters’ requires_grad attributes in-place.

This method is helpful for freezing part of the module for finetuning or training parts of a model individually (e.g., GAN training).

See locally-disable-grad-doc for a comparison between .requires_grad_() and several similar mechanisms that may be confused with it.

Args:
requires_grad (bool): whether autograd should record operations on

parameters in this module. Default: True.

Returns:

Module: self

Return type

~T

set_extra_state(state)

This function is called from load_state_dict() to handle any extra state found within the state_dict. Implement this function and a corresponding get_extra_state() for your module if you need to store extra state within its state_dict.

Args:

state (dict): Extra state from the state_dict

share_memory()

See torch.Tensor.share_memory_()

Return type

~T

state_dict(destination=None, prefix='', keep_vars=False)

Returns a dictionary containing a whole state of the module.

Both parameters and persistent buffers (e.g. running averages) are included. Keys are corresponding parameter and buffer names. Parameters and buffers set to None are not included.

Returns:
dict:

a dictionary containing a whole state of the module

Example:

>>> module.state_dict().keys()
['bias', 'weight']
to(*args, **kwargs)

Moves and/or casts the parameters and buffers.

This can be called as

to(device=None, dtype=None, non_blocking=False)
to(dtype, non_blocking=False)
to(tensor, non_blocking=False)
to(memory_format=torch.channels_last)

Its signature is similar to torch.Tensor.to(), but only accepts floating point or complex dtypes. In addition, this method will only cast the floating point or complex parameters and buffers to dtype (if given). The integral parameters and buffers will be moved to device, if that is given, but with dtypes unchanged. When non_blocking is set, it tries to convert/move asynchronously with respect to the host if possible, e.g., moving CPU Tensors with pinned memory to CUDA devices.

See below for examples.

Note

This method modifies the module in-place.

Args:
device (torch.device): the desired device of the parameters

and buffers in this module

dtype (torch.dtype): the desired floating point or complex dtype of

the parameters and buffers in this module

tensor (torch.Tensor): Tensor whose dtype and device are the desired

dtype and device for all parameters and buffers in this module

memory_format (torch.memory_format): the desired memory

format for 4D parameters and buffers in this module (keyword only argument)

Returns:

Module: self

Examples:

>>> linear = nn.Linear(2, 2)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]])
>>> linear.to(torch.double)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]], dtype=torch.float64)
>>> gpu1 = torch.device("cuda:1")
>>> linear.to(gpu1, dtype=torch.half, non_blocking=True)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16, device='cuda:1')
>>> cpu = torch.device("cpu")
>>> linear.to(cpu)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16)

>>> linear = nn.Linear(2, 2, bias=None).to(torch.cdouble)
>>> linear.weight
Parameter containing:
tensor([[ 0.3741+0.j,  0.2382+0.j],
        [ 0.5593+0.j, -0.4443+0.j]], dtype=torch.complex128)
>>> linear(torch.ones(3, 2, dtype=torch.cdouble))
tensor([[0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j]], dtype=torch.complex128)
to_empty(*, device)

Moves the parameters and buffers to the specified device without copying storage.

Args:
device (torch.device): The desired device of the parameters

and buffers in this module.

Returns:

Module: self

Return type

~T

train(mode=True)

Sets the module in training mode.

This has an effect only on certain modules. See the documentation of particular modules for details of their behavior in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

Args:
mode (bool): whether to set training mode (True) or evaluation

mode (False). Default: True.

Returns:

Module: self

Return type

~T

type(dst_type)

Casts all parameters and buffers to dst_type.

Note

This method modifies the module in-place.

Args:

dst_type (type or string): the desired type

Returns:

Module: self

Return type

~T

xpu(device=None)

Moves all model parameters and buffers to the XPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing optimizer if the module will live on XPU while being optimized.

Note

This method modifies the module in-place.

Arguments:
device (int, optional): if specified, all parameters will be

copied to that device

Returns:

Module: self

Return type

~T

zero_grad(set_to_none=False)

Sets gradients of all model parameters to zero. See similar function under torch.optim.Optimizer for more context.

Args:
set_to_none (bool): instead of setting to zero, set the grads to None.

See torch.optim.Optimizer.zero_grad() for details.

Return type

None

training: bool
class lhotse.features.kaldi.layers.Wav2Spec(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, fft_length=512, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=True, use_fft_mag=False)[source]

Apply standard Kaldi preprocessing (dithering, removing DC offset, pre-emphasis, etc.) on the input waveforms and compute their Short-Time Fourier Transform (STFT). The STFT is transformed either to a magnitude spectrum (use_fft_mag=True) or a power spectrum (use_fft_mag=False).

Example:

>>> x = torch.randn(1, 16000, dtype=torch.float32)
>>> x.shape
torch.Size([1, 16000])
>>> t = Wav2Spec()
>>> t(x).shape
torch.Size([1, 100, 257])

The input is a tensor of shape (batch_size, num_samples). The output is a tensor of shape (batch_size, num_frames, num_fft_bins).

__init__(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, fft_length=512, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=True, use_fft_mag=False)[source]

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Return type

Tensor

T_destination

alias of TypeVar(‘T_destination’, bound=Mapping[str, torch.Tensor])

add_module(name, module)

Adds a child module to the current module.

The module can be accessed as an attribute using the given name.

Args:
name (string): name of the child module. The child module can be accessed from this module using the given name
module (Module): child module to be added to the module.

Return type

None

apply(fn)

Applies fn recursively to every submodule (as returned by .children()) as well as self. Typical use includes initializing the parameters of a model (see also torch.nn.init).

Args:

fn (Module -> None): function to be applied to each submodule

Returns:

Module: self

Example:

>>> @torch.no_grad()
>>> def init_weights(m):
>>>     print(m)
>>>     if type(m) == nn.Linear:
>>>         m.weight.fill_(1.0)
>>>         print(m.weight)
>>> net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
>>> net.apply(init_weights)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[ 1.,  1.],
        [ 1.,  1.]])
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[ 1.,  1.],
        [ 1.,  1.]])
Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
Return type

~T

bfloat16()

Casts all floating point parameters and buffers to bfloat16 datatype.

Note

This method modifies the module in-place.

Returns:

Module: self

Return type

~T

buffers(recurse=True)

Returns an iterator over module buffers.

Args:
recurse (bool): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module.

Yields:

torch.Tensor: module buffer

Example:

>>> for buf in model.buffers():
>>>     print(type(buf), buf.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
Return type

Iterator[Tensor]

children()

Returns an iterator over immediate children modules.

Yields:

Module: a child module

Return type

Iterator[Module]

cpu()

Moves all model parameters and buffers to the CPU.

Note

This method modifies the module in-place.

Returns:

Module: self

Return type

~T

cuda(device=None)

Moves all model parameters and buffers to the GPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing optimizer if the module will live on GPU while being optimized.

Note

This method modifies the module in-place.

Args:
device (int, optional): if specified, all parameters will be copied to that device

Returns:

Module: self

Return type

~T

property dither: float
Return type

float

double()

Casts all floating point parameters and buffers to double datatype.

Note

This method modifies the module in-place.

Returns:

Module: self

Return type

~T

dump_patches: bool = False

This allows better BC support for load_state_dict(). In state_dict(), the version number will be saved in the attribute _metadata of the returned state dict, and thus pickled. _metadata is a dictionary with keys that follow the naming convention of state dict. See _load_from_state_dict on how to use this information in loading.

If new parameters/buffers are added/removed from a module, this number shall be bumped, and the module’s _load_from_state_dict method can compare the version number and do appropriate changes if the state dict is from before the change.

eval()

Sets the module in evaluation mode.

This has an effect only on certain modules. See the documentation of particular modules for details of their behaviors in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

This is equivalent to self.train(False).

See the documentation on locally disabling gradient computation for a comparison between .eval() and several similar mechanisms that may be confused with it.

Returns:

Module: self

Return type

~T

extra_repr()

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.
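
Example (a minimal sketch; the Scaler class and its factor attribute are illustrative):

>>> import torch.nn as nn
>>> class Scaler(nn.Module):
>>>     def __init__(self, factor):
>>>         super().__init__()
>>>         self.factor = factor
>>>     def extra_repr(self):
>>>         # this string appears inside the parentheses of the module's repr
>>>         return f'factor={self.factor}'
>>> Scaler(2.0)
Scaler(factor=2.0)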

Return type

str

float()

Casts all floating point parameters and buffers to float datatype.

Note

This method modifies the module in-place.

Returns:

Module: self

Return type

~T

property frame_length: float
Return type

float

property frame_shift: float
Return type

float

get_buffer(target)

Returns the buffer given by target if it exists, otherwise throws an error.

See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.

Args:
target: The fully-qualified string name of the buffer to look for. (See get_submodule for how to specify a fully-qualified string.)

Returns:

torch.Tensor: The buffer referenced by target

Raises:
AttributeError: If the target string references an invalid path or resolves to something that is not a buffer

Return type

Tensor

get_extra_state()

Returns any extra state to include in the module’s state_dict. Implement this and a corresponding set_extra_state() for your module if you need to store extra state. This function is called when building the module’s state_dict().

Note that extra state should be pickleable to ensure working serialization of the state_dict. We only provide backwards compatibility guarantees for serializing Tensors; other objects may break backwards compatibility if their serialized pickled form changes.

Returns:

object: Any extra state to store in the module’s state_dict

Return type

Any

get_parameter(target)

Returns the parameter given by target if it exists, otherwise throws an error.

See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.

Args:
target: The fully-qualified string name of the Parameter to look for. (See get_submodule for how to specify a fully-qualified string.)

Returns:

torch.nn.Parameter: The Parameter referenced by target

Raises:
AttributeError: If the target string references an invalid path or resolves to something that is not an nn.Parameter

Return type

Parameter

get_submodule(target)

Returns the submodule given by target if it exists, otherwise throws an error.

For example, let’s say you have an nn.Module A that looks like this:

(The diagram shows an nn.Module A. A has a nested submodule net_b, which itself has two submodules net_c and linear. net_c then has a submodule conv.)

To check whether or not we have the linear submodule, we would call get_submodule("net_b.linear"). To check whether we have the conv submodule, we would call get_submodule("net_b.net_c.conv").

The runtime of get_submodule is bounded by the degree of module nesting in target. A query against named_modules achieves the same result, but it is O(N) in the number of transitive modules. So, for a simple check to see if some submodule exists, get_submodule should always be used.
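
As a sketch of the layout described above (the module names follow the description; the concrete layer types are illustrative):

>>> import torch.nn as nn
>>> net_c = nn.Sequential()
>>> net_c.add_module('conv', nn.Conv2d(16, 33, 3))
>>> net_b = nn.Sequential()
>>> net_b.add_module('net_c', net_c)
>>> net_b.add_module('linear', nn.Linear(100, 200))
>>> a = nn.Sequential()
>>> a.add_module('net_b', net_b)
>>> a.get_submodule('net_b.net_c.conv')
Conv2d(16, 33, kernel_size=(3, 3), stride=(1, 1))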

Args:
target: The fully-qualified string name of the submodule to look for. (See above example for how to specify a fully-qualified string.)

Returns:

torch.nn.Module: The submodule referenced by target

Raises:
AttributeError: If the target string references an invalid path or resolves to something that is not an nn.Module

Return type

Module

half()

Casts all floating point parameters and buffers to half datatype.

Note

This method modifies the module in-place.

Returns:

Module: self

Return type

~T

load_state_dict(state_dict, strict=True)

Copies parameters and buffers from state_dict into this module and its descendants. If strict is True, then the keys of state_dict must exactly match the keys returned by this module’s state_dict() function.

Args:
state_dict (dict): a dict containing parameters and persistent buffers.
strict (bool, optional): whether to strictly enforce that the keys in state_dict match the keys returned by this module’s state_dict() function. Default: True

Returns:
NamedTuple with missing_keys and unexpected_keys fields:
  • missing_keys is a list of str containing the missing keys

  • unexpected_keys is a list of str containing the unexpected keys
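
For example, a hedged sketch of loading with strict=False and inspecting the result (here 'model' and 'ckpt.pt' are illustrative stand-ins for your module and checkpoint):

>>> result = model.load_state_dict(torch.load('ckpt.pt'), strict=False)
>>> print(result.missing_keys)     # keys the module expects but ckpt lacks
>>> print(result.unexpected_keys)  # keys in ckpt that the module does not know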

Note:

If a parameter or buffer is registered as None and its corresponding key exists in state_dict, load_state_dict() will raise a RuntimeError.

modules()

Returns an iterator over all modules in the network.

Yields:

Module: a module in the network

Note:

Duplicate modules are returned only once. In the following example, l will be returned only once.

Example:

>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.modules()):
>>>     print(idx, '->', m)

0 -> Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
1 -> Linear(in_features=2, out_features=2, bias=True)
Return type

Iterator[Module]

named_buffers(prefix='', recurse=True)

Returns an iterator over module buffers, yielding both the name of the buffer as well as the buffer itself.

Args:
prefix (str): prefix to prepend to all buffer names.
recurse (bool): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module.

Yields:

(string, torch.Tensor): Tuple containing the name and buffer

Example:

>>> for name, buf in self.named_buffers():
>>>    if name in ['running_var']:
>>>        print(buf.size())
Return type

Iterator[Tuple[str, Tensor]]

named_children()

Returns an iterator over immediate children modules, yielding both the name of the module as well as the module itself.

Yields:

(string, Module): Tuple containing a name and child module

Example:

>>> for name, module in model.named_children():
>>>     if name in ['conv4', 'conv5']:
>>>         print(module)
Return type

Iterator[Tuple[str, Module]]

named_modules(memo=None, prefix='', remove_duplicate=True)

Returns an iterator over all modules in the network, yielding both the name of the module as well as the module itself.

Args:
memo: a memo to store the set of modules already added to the result
prefix: a prefix that will be added to the name of the module
remove_duplicate: whether to remove the duplicated module instances in the result or not

Yields:

(string, Module): Tuple of name and module

Note:

Duplicate modules are returned only once. In the following example, l will be returned only once.

Example:

>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.named_modules()):
>>>     print(idx, '->', m)

0 -> ('', Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
))
1 -> ('0', Linear(in_features=2, out_features=2, bias=True))
named_parameters(prefix='', recurse=True)

Returns an iterator over module parameters, yielding both the name of the parameter as well as the parameter itself.

Args:
prefix (str): prefix to prepend to all parameter names.
recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.

Yields:

(string, Parameter): Tuple containing the name and parameter

Example:

>>> for name, param in self.named_parameters():
>>>    if name in ['bias']:
>>>        print(param.size())
Return type

Iterator[Tuple[str, Parameter]]

parameters(recurse=True)

Returns an iterator over module parameters.

This is typically passed to an optimizer.

Args:
recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.

Yields:

Parameter: module parameter

Example:

>>> for param in model.parameters():
>>>     print(type(param), param.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
Return type

Iterator[Parameter]

property preemph_coeff: float
Return type

float

register_backward_hook(hook)

Registers a backward hook on the module.

This function is deprecated in favor of register_full_backward_hook() and the behavior of this function will change in future versions.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

Return type

RemovableHandle

register_buffer(name, tensor, persistent=True)

Adds a buffer to the module.

This is typically used to register a buffer that should not be considered a model parameter. For example, BatchNorm’s running_mean is not a parameter, but is part of the module’s state. Buffers, by default, are persistent and will be saved alongside parameters. This behavior can be changed by setting persistent to False. The only difference between a persistent buffer and a non-persistent buffer is that the latter will not be a part of this module’s state_dict.

Buffers can be accessed as attributes using given names.

Args:
name (string): name of the buffer. The buffer can be accessed from this module using the given name
tensor (Tensor or None): buffer to be registered. If None, then operations that run on buffers, such as cuda, are ignored. If None, the buffer is not included in the module’s state_dict.
persistent (bool): whether the buffer is part of this module’s state_dict.

Example:

>>> self.register_buffer('running_mean', torch.zeros(num_features))
Return type

None

register_forward_hook(hook)

Registers a forward hook on the module.

The hook will be called every time after forward() has computed an output. It should have the following signature:

hook(module, input, output) -> None or modified output

The input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks and only to the forward. The hook can modify the output. It can modify the input in-place, but that will not have an effect on forward, since this is called after forward() is called.
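
A minimal sketch of registering and removing such a hook (the module and hook names are illustrative):

>>> import torch
>>> import torch.nn as nn
>>> def shape_logger(module, input, output):
>>>     # runs after every forward(); returning None leaves the output unchanged
>>>     print(type(module).__name__, '->', tuple(output.shape))
>>> layer = nn.Linear(2, 3)
>>> handle = layer.register_forward_hook(shape_logger)
>>> _ = layer(torch.randn(4, 2))
Linear -> (4, 3)
>>> handle.remove()  # detach the hook when it is no longer needed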

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

Return type

RemovableHandle

register_forward_pre_hook(hook)

Registers a forward pre-hook on the module.

The hook will be called every time before forward() is invoked. It should have the following signature:

hook(module, input) -> None or modified input

The input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks and only to the forward. The hook can modify the input. The user can either return a tuple or a single modified value in the hook. We will wrap the value into a tuple if a single value is returned (unless that value is already a tuple).

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

Return type

RemovableHandle

register_full_backward_hook(hook)

Registers a backward hook on the module.

The hook will be called every time the gradients with respect to module inputs are computed. The hook should have the following signature:

hook(module, grad_input, grad_output) -> tuple(Tensor) or None

The grad_input and grad_output are tuples that contain the gradients with respect to the inputs and outputs respectively. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the input that will be used in place of grad_input in subsequent computations. grad_input will only correspond to the inputs given as positional arguments and all kwarg arguments are ignored. Entries in grad_input and grad_output will be None for all non-Tensor arguments.

For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly the caller will receive a view of each Tensor returned by the Module’s forward function.

Warning

Modifying inputs or outputs inplace is not allowed when using backward hooks and will raise an error.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

Return type

RemovableHandle

register_parameter(name, param)

Adds a parameter to the module.

The parameter can be accessed as an attribute using given name.

Args:
name (string): name of the parameter. The parameter can be accessed from this module using the given name
param (Parameter or None): parameter to be added to the module. If None, then operations that run on parameters, such as cuda, are ignored. If None, the parameter is not included in the module’s state_dict.

Return type

None

property remove_dc_offset: bool
Return type

bool

requires_grad_(requires_grad=True)

Change if autograd should record operations on parameters in this module.

This method sets the parameters’ requires_grad attributes in-place.

This method is helpful for freezing part of the module for finetuning or training parts of a model individually (e.g., GAN training).

See the documentation on locally disabling gradient computation for a comparison between .requires_grad_() and several similar mechanisms that may be confused with it.

Args:
requires_grad (bool): whether autograd should record operations on parameters in this module. Default: True.

Returns:

Module: self
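
Example (a minimal sketch of the freezing use case mentioned above; the model is illustrative):

>>> import torch.nn as nn
>>> model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 2))
>>> _ = model[0].requires_grad_(False)  # freeze the first layer for finetuning
>>> [p.requires_grad for p in model.parameters()]
[False, False, True, True]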

Return type

~T

property sampling_rate: int
Return type

int

set_extra_state(state)

This function is called from load_state_dict() to handle any extra state found within the state_dict. Implement this function and a corresponding get_extra_state() for your module if you need to store extra state within its state_dict.

Args:

state (dict): Extra state from the state_dict

share_memory()

See torch.Tensor.share_memory_()

Return type

~T

state_dict(destination=None, prefix='', keep_vars=False)

Returns a dictionary containing a whole state of the module.

Both parameters and persistent buffers (e.g. running averages) are included. Keys are corresponding parameter and buffer names. Parameters and buffers set to None are not included.

Returns:
dict:

a dictionary containing a whole state of the module

Example:

>>> module.state_dict().keys()
['bias', 'weight']
to(*args, **kwargs)

Moves and/or casts the parameters and buffers.

This can be called as

to(device=None, dtype=None, non_blocking=False)
to(dtype, non_blocking=False)
to(tensor, non_blocking=False)
to(memory_format=torch.channels_last)

Its signature is similar to torch.Tensor.to(), but only accepts floating point or complex dtypes. In addition, this method will only cast the floating point or complex parameters and buffers to dtype (if given). The integral parameters and buffers will be moved to device, if that is given, but with dtypes unchanged. When non_blocking is set, it tries to convert/move asynchronously with respect to the host if possible, e.g., moving CPU Tensors with pinned memory to CUDA devices.

See below for examples.

Note

This method modifies the module in-place.

Args:
device (torch.device): the desired device of the parameters and buffers in this module
dtype (torch.dtype): the desired floating point or complex dtype of the parameters and buffers in this module
tensor (torch.Tensor): Tensor whose dtype and device are the desired dtype and device for all parameters and buffers in this module
memory_format (torch.memory_format): the desired memory format for 4D parameters and buffers in this module (keyword only argument)

Returns:

Module: self

Examples:

>>> linear = nn.Linear(2, 2)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]])
>>> linear.to(torch.double)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]], dtype=torch.float64)
>>> gpu1 = torch.device("cuda:1")
>>> linear.to(gpu1, dtype=torch.half, non_blocking=True)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16, device='cuda:1')
>>> cpu = torch.device("cpu")
>>> linear.to(cpu)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16)

>>> linear = nn.Linear(2, 2, bias=None).to(torch.cdouble)
>>> linear.weight
Parameter containing:
tensor([[ 0.3741+0.j,  0.2382+0.j],
        [ 0.5593+0.j, -0.4443+0.j]], dtype=torch.complex128)
>>> linear(torch.ones(3, 2, dtype=torch.cdouble))
tensor([[0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j]], dtype=torch.complex128)
to_empty(*, device)

Moves the parameters and buffers to the specified device without copying storage.

Args:
device (torch.device): The desired device of the parameters and buffers in this module.

Returns:

Module: self

Return type

~T

train(mode=True)

Sets the module in training mode.

This has an effect only on certain modules. See the documentation of particular modules for details of their behaviors in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

Args:
mode (bool): whether to set training mode (True) or evaluation mode (False). Default: True.

Returns:

Module: self

Return type

~T

type(dst_type)

Casts all parameters and buffers to dst_type.

Note

This method modifies the module in-place.

Args:

dst_type (type or string): the desired type

Returns:

Module: self

Return type

~T

property window_type: str
Return type

str

xpu(device=None)

Moves all model parameters and buffers to the XPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing optimizer if the module will live on XPU while being optimized.

Note

This method modifies the module in-place.

Arguments:
device (int, optional): if specified, all parameters will be copied to that device

Returns:

Module: self

Return type

~T

zero_grad(set_to_none=False)

Sets gradients of all model parameters to zero. See similar function under torch.optim.Optimizer for more context.

Args:
set_to_none (bool): instead of setting to zero, set the grads to None. See torch.optim.Optimizer.zero_grad() for details.

Return type

None

training: bool
class lhotse.features.kaldi.layers.Wav2LogSpec(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, fft_length=512, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=True, use_fft_mag=False)[source]

Apply standard Kaldi preprocessing (dithering, removing DC offset, pre-emphasis, etc.) on the input waveforms and compute their Short-Time Fourier Transform (STFT). The STFT is transformed either to a log-magnitude spectrum (use_fft_mag=True) or a log-power spectrum (use_fft_mag=False).

Example:

>>> x = torch.randn(1, 16000, dtype=torch.float32)
>>> x.shape
torch.Size([1, 16000])
>>> t = Wav2LogSpec()
>>> t(x).shape
torch.Size([1, 100, 257])

The input is a tensor of shape (batch_size, num_samples). The output is a tensor of shape (batch_size, num_frames, num_fft_bins).
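
For instance, a hedged sketch (assuming torch and lhotse are importable; the variable names are illustrative) of computing a log-magnitude STFT:

>>> import torch
>>> from lhotse.features.kaldi.layers import Wav2LogSpec
>>> audio = torch.randn(3, 8000)  # three 0.5-second waveforms at the default 16 kHz
>>> log_spec = Wav2LogSpec(use_fft_mag=True)(audio)  # log-magnitude spectrum
>>> log_spec.shape
torch.Size([3, 50, 257])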

__init__(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, fft_length=512, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=True, use_fft_mag=False)[source]

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Return type

Tensor

T_destination

alias of TypeVar(‘T_destination’, bound=Mapping[str, torch.Tensor])

add_module(name, module)

Adds a child module to the current module.

The module can be accessed as an attribute using the given name.

Args:
name (string): name of the child module. The child module can be accessed from this module using the given name
module (Module): child module to be added to the module.

Return type

None

apply(fn)

Applies fn recursively to every submodule (as returned by .children()) as well as self. Typical use includes initializing the parameters of a model (see also torch.nn.init).

Args:

fn (Module -> None): function to be applied to each submodule

Returns:

Module: self

Example:

>>> @torch.no_grad()
>>> def init_weights(m):
>>>     print(m)
>>>     if type(m) == nn.Linear:
>>>         m.weight.fill_(1.0)
>>>         print(m.weight)
>>> net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
>>> net.apply(init_weights)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[ 1.,  1.],
        [ 1.,  1.]])
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[ 1.,  1.],
        [ 1.,  1.]])
Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
Return type

~T

bfloat16()

Casts all floating point parameters and buffers to bfloat16 datatype.

Note

This method modifies the module in-place.

Returns:

Module: self

Return type

~T

buffers(recurse=True)

Returns an iterator over module buffers.

Args:
recurse (bool): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module.

Yields:

torch.Tensor: module buffer

Example:

>>> for buf in model.buffers():
>>>     print(type(buf), buf.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
Return type

Iterator[Tensor]

children()

Returns an iterator over immediate children modules.

Yields:

Module: a child module

Return type

Iterator[Module]

cpu()

Moves all model parameters and buffers to the CPU.

Note

This method modifies the module in-place.

Returns:

Module: self

Return type

~T

cuda(device=None)

Moves all model parameters and buffers to the GPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing optimizer if the module will live on GPU while being optimized.

Note

This method modifies the module in-place.

Args:
device (int, optional): if specified, all parameters will be copied to that device

Returns:

Module: self

Return type

~T

property dither: float
Return type

float

double()

Casts all floating point parameters and buffers to double datatype.

Note

This method modifies the module in-place.

Returns:

Module: self

Return type

~T

dump_patches: bool = False

This allows better BC support for load_state_dict(). In state_dict(), the version number will be saved in the attribute _metadata of the returned state dict, and thus pickled. _metadata is a dictionary with keys that follow the naming convention of state dict. See _load_from_state_dict on how to use this information in loading.

If new parameters/buffers are added/removed from a module, this number shall be bumped, and the module’s _load_from_state_dict method can compare the version number and do appropriate changes if the state dict is from before the change.

eval()

Sets the module in evaluation mode.

This has an effect only on certain modules. See the documentation of particular modules for details of their behaviors in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

This is equivalent to self.train(False).

See the documentation on locally disabling gradient computation for a comparison between .eval() and several similar mechanisms that may be confused with it.

Returns:

Module: self

Return type

~T

extra_repr()

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

Return type

str

float()

Casts all floating point parameters and buffers to float datatype.

Note

This method modifies the module in-place.

Returns:

Module: self

Return type

~T

property frame_length: float
Return type

float

property frame_shift: float
Return type

float

get_buffer(target)

Returns the buffer given by target if it exists, otherwise throws an error.

See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.

Args:
target: The fully-qualified string name of the buffer to look for. (See get_submodule for how to specify a fully-qualified string.)

Returns:

torch.Tensor: The buffer referenced by target

Raises:
AttributeError: If the target string references an invalid path or resolves to something that is not a buffer

Return type

Tensor

get_extra_state()

Returns any extra state to include in the module’s state_dict. Implement this and a corresponding set_extra_state() for your module if you need to store extra state. This function is called when building the module’s state_dict().

Note that extra state should be pickleable to ensure working serialization of the state_dict. We only provide backwards compatibility guarantees for serializing Tensors; other objects may break backwards compatibility if their serialized pickled form changes.

Returns:

object: Any extra state to store in the module’s state_dict

Return type

Any

get_parameter(target)

Returns the parameter given by target if it exists, otherwise throws an error.

See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.

Args:
target: The fully-qualified string name of the Parameter to look for. (See get_submodule for how to specify a fully-qualified string.)

Returns:

torch.nn.Parameter: The Parameter referenced by target

Raises:
AttributeError: If the target string references an invalid path or resolves to something that is not an nn.Parameter

Return type

Parameter

get_submodule(target)

Returns the submodule given by target if it exists, otherwise throws an error.

For example, let’s say you have an nn.Module A that looks like this:

(The diagram shows an nn.Module A. A has a nested submodule net_b, which itself has two submodules net_c and linear. net_c then has a submodule conv.)

To check whether or not we have the linear submodule, we would call get_submodule("net_b.linear"). To check whether we have the conv submodule, we would call get_submodule("net_b.net_c.conv").

The runtime of get_submodule is bounded by the degree of module nesting in target. A query against named_modules achieves the same result, but it is O(N) in the number of transitive modules. So, for a simple check to see if some submodule exists, get_submodule should always be used.

Args:
target: The fully-qualified string name of the submodule to look for. (See above example for how to specify a fully-qualified string.)

Returns:

torch.nn.Module: The submodule referenced by target

Raises:
AttributeError: If the target string references an invalid path or resolves to something that is not an nn.Module

Return type

Module

half()

Casts all floating point parameters and buffers to half datatype.

Note

This method modifies the module in-place.

Returns:

Module: self

Return type

~T

load_state_dict(state_dict, strict=True)

Copies parameters and buffers from state_dict into this module and its descendants. If strict is True, then the keys of state_dict must exactly match the keys returned by this module’s state_dict() function.

Args:
state_dict (dict): a dict containing parameters and persistent buffers.
strict (bool, optional): whether to strictly enforce that the keys in state_dict match the keys returned by this module’s state_dict() function. Default: True

Returns:
NamedTuple with missing_keys and unexpected_keys fields:
  • missing_keys is a list of str containing the missing keys

  • unexpected_keys is a list of str containing the unexpected keys

Note:

If a parameter or buffer is registered as None and its corresponding key exists in state_dict, load_state_dict() will raise a RuntimeError.

modules()

Returns an iterator over all modules in the network.

Yields:

Module: a module in the network

Note:

Duplicate modules are returned only once. In the following example, l will be returned only once.

Example:

>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.modules()):
>>>     print(idx, '->', m)

0 -> Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
1 -> Linear(in_features=2, out_features=2, bias=True)
Return type

Iterator[Module]

named_buffers(prefix='', recurse=True)

Returns an iterator over module buffers, yielding both the name of the buffer as well as the buffer itself.

Args:
prefix (str): prefix to prepend to all buffer names.
recurse (bool): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module.

Yields:

(string, torch.Tensor): Tuple containing the name and buffer

Example:

>>> for name, buf in self.named_buffers():
>>>    if name in ['running_var']:
>>>        print(buf.size())
Return type

Iterator[Tuple[str, Tensor]]

named_children()

Returns an iterator over immediate children modules, yielding both the name of the module as well as the module itself.

Yields:

(string, Module): Tuple containing a name and child module

Example:

>>> for name, module in model.named_children():
>>>     if name in ['conv4', 'conv5']:
>>>         print(module)
Return type

Iterator[Tuple[str, Module]]

named_modules(memo=None, prefix='', remove_duplicate=True)

Returns an iterator over all modules in the network, yielding both the name of the module as well as the module itself.

Args:
memo: a memo to store the set of modules already added to the result
prefix: a prefix that will be added to the name of the module
remove_duplicate: whether to remove the duplicated module instances in the result or not

Yields:

(string, Module): Tuple of name and module

Note:

Duplicate modules are returned only once. In the following example, l will be returned only once.

Example:

>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.named_modules()):
>>>     print(idx, '->', m)

0 -> ('', Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
))
1 -> ('0', Linear(in_features=2, out_features=2, bias=True))
named_parameters(prefix='', recurse=True)

Returns an iterator over module parameters, yielding both the name of the parameter as well as the parameter itself.

Args:
prefix (str): prefix to prepend to all parameter names.
recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.

Yields:

(string, Parameter): Tuple containing the name and parameter

Example:

>>> for name, param in self.named_parameters():
>>>    if name in ['bias']:
>>>        print(param.size())
Return type

Iterator[Tuple[str, Parameter]]

parameters(recurse=True)

Returns an iterator over module parameters.

This is typically passed to an optimizer.

Args:
recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.

Yields:

Parameter: module parameter

Example:

>>> for param in model.parameters():
>>>     print(type(param), param.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
Return type

Iterator[Parameter]

property preemph_coeff: float
Return type

float

register_backward_hook(hook)

Registers a backward hook on the module.

This function is deprecated in favor of register_full_backward_hook() and the behavior of this function will change in future versions.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

Return type

RemovableHandle

register_buffer(name, tensor, persistent=True)

Adds a buffer to the module.

This is typically used to register a buffer that should not be considered a model parameter. For example, BatchNorm’s running_mean is not a parameter, but is part of the module’s state. Buffers, by default, are persistent and will be saved alongside parameters. This behavior can be changed by setting persistent to False. The only difference between a persistent buffer and a non-persistent buffer is that the latter will not be a part of this module’s state_dict.

Buffers can be accessed as attributes using given names.

Args:
name (string): name of the buffer. The buffer can be accessed from this module using the given name
tensor (Tensor or None): buffer to be registered. If None, then operations that run on buffers, such as cuda, are ignored. If None, the buffer is not included in the module’s state_dict.
persistent (bool): whether the buffer is part of this module’s state_dict.

Example:

>>> self.register_buffer('running_mean', torch.zeros(num_features))
Return type

None

register_forward_hook(hook)

Registers a forward hook on the module.

The hook will be called every time after forward() has computed an output. It should have the following signature:

hook(module, input, output) -> None or modified output

The input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks and only to the forward. The hook can modify the output. It can modify the input in-place, but that will not have an effect on forward, since this is called after forward() is called.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

Return type

RemovableHandle

register_forward_pre_hook(hook)

Registers a forward pre-hook on the module.

The hook will be called every time before forward() is invoked. It should have the following signature:

hook(module, input) -> None or modified input

The input contains only the positional arguments given to the module. Keyword arguments won’t be passed to the hooks and only to the forward. The hook can modify the input. The user can either return a tuple or a single modified value in the hook. We will wrap the value into a tuple if a single value is returned (unless that value is already a tuple).

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

Return type

RemovableHandle

register_full_backward_hook(hook)

Registers a backward hook on the module.

The hook will be called every time the gradients with respect to module inputs are computed. The hook should have the following signature:

hook(module, grad_input, grad_output) -> tuple(Tensor) or None

The grad_input and grad_output are tuples that contain the gradients with respect to the inputs and outputs respectively. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the input that will be used in place of grad_input in subsequent computations. grad_input will only correspond to the inputs given as positional arguments and all kwarg arguments are ignored. Entries in grad_input and grad_output will be None for all non-Tensor arguments.

For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly the caller will receive a view of each Tensor returned by the Module’s forward function.

Warning

Modifying inputs or outputs inplace is not allowed when using backward hooks and will raise an error.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

Return type

RemovableHandle

register_parameter(name, param)

Adds a parameter to the module.

The parameter can be accessed as an attribute using given name.

Args:
name (string): name of the parameter. The parameter can be accessed from this module using the given name
param (Parameter or None): parameter to be added to the module. If None, then operations that run on parameters, such as cuda, are ignored. If None, the parameter is not included in the module’s state_dict.

Return type

None

property remove_dc_offset: bool
Return type

bool

requires_grad_(requires_grad=True)

Change if autograd should record operations on parameters in this module.

This method sets the parameters’ requires_grad attributes in-place.

This method is helpful for freezing part of the module for finetuning or training parts of a model individually (e.g., GAN training).

See the documentation on locally disabling gradient computation for a comparison between .requires_grad_() and several similar mechanisms that may be confused with it.

Args:
requires_grad (bool): whether autograd should record operations on parameters in this module. Default: True.

Returns:

Module: self

Return type

~T

property sampling_rate: int
Return type

int

set_extra_state(state)

This function is called from load_state_dict() to handle any extra state found within the state_dict. Implement this function and a corresponding get_extra_state() for your module if you need to store extra state within its state_dict.

Args:

state (dict): Extra state from the state_dict

share_memory()

See torch.Tensor.share_memory_()

Return type

~T

state_dict(destination=None, prefix='', keep_vars=False)

Returns a dictionary containing a whole state of the module.

Both parameters and persistent buffers (e.g. running averages) are included. Keys are corresponding parameter and buffer names. Parameters and buffers set to None are not included.

Returns:
dict:

a dictionary containing a whole state of the module

Example:

>>> module.state_dict().keys()
['bias', 'weight']
to(*args, **kwargs)

Moves and/or casts the parameters and buffers.

This can be called as

to(device=None, dtype=None, non_blocking=False)
to(dtype, non_blocking=False)
to(tensor, non_blocking=False)
to(memory_format=torch.channels_last)

Its signature is similar to torch.Tensor.to(), but only accepts floating point or complex dtypes. In addition, this method will only cast the floating point or complex parameters and buffers to dtype (if given). The integral parameters and buffers will be moved to device, if that is given, but with dtypes unchanged. When non_blocking is set, it tries to convert/move asynchronously with respect to the host if possible, e.g., moving CPU Tensors with pinned memory to CUDA devices.

See below for examples.

Note

This method modifies the module in-place.

Args:
device (torch.device): the desired device of the parameters and buffers in this module
dtype (torch.dtype): the desired floating point or complex dtype of the parameters and buffers in this module
tensor (torch.Tensor): Tensor whose dtype and device are the desired dtype and device for all parameters and buffers in this module
memory_format (torch.memory_format): the desired memory format for 4D parameters and buffers in this module (keyword only argument)

Returns:

Module: self

Examples:

>>> linear = nn.Linear(2, 2)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]])
>>> linear.to(torch.double)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]], dtype=torch.float64)
>>> gpu1 = torch.device("cuda:1")
>>> linear.to(gpu1, dtype=torch.half, non_blocking=True)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16, device='cuda:1')
>>> cpu = torch.device("cpu")
>>> linear.to(cpu)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16)

>>> linear = nn.Linear(2, 2, bias=None).to(torch.cdouble)
>>> linear.weight
Parameter containing:
tensor([[ 0.3741+0.j,  0.2382+0.j],
        [ 0.5593+0.j, -0.4443+0.j]], dtype=torch.complex128)
>>> linear(torch.ones(3, 2, dtype=torch.cdouble))
tensor([[0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j]], dtype=torch.complex128)
to_empty(*, device)

Moves the parameters and buffers to the specified device without copying storage.

Args:
device (torch.device): The desired device of the parameters and buffers in this module.

Returns:

Module: self

Return type

~T

train(mode=True)

Sets the module in training mode.

This has an effect only on certain modules. See the documentation of particular modules for details of their behaviors in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

Args:
mode (bool): whether to set training mode (True) or evaluation mode (False). Default: True.

Returns:

Module: self

Return type

~T

type(dst_type)

Casts all parameters and buffers to dst_type.

Note

This method modifies the module in-place.

Args:

dst_type (type or string): the desired type

Returns:

Module: self

Return type

~T

property window_type: str
Return type

str

xpu(device=None)

Moves all model parameters and buffers to the XPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing optimizer if the module will live on XPU while being optimized.

Note

This method modifies the module in-place.

Arguments:
device (int, optional): if specified, all parameters will be copied to that device

Returns:

Module: self

Return type

~T

zero_grad(set_to_none=False)

Sets gradients of all model parameters to zero. See similar function under torch.optim.Optimizer for more context.

Args:
set_to_none (bool): instead of setting to zero, set the grads to None. See torch.optim.Optimizer.zero_grad() for details.

Return type

None

training: bool
class lhotse.features.kaldi.layers.Wav2LogFilterBank(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, fft_length=512, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=False, use_fft_mag=False, low_freq=20.0, high_freq=-400.0, num_filters=80, norm_filters=False)[source]

Apply standard Kaldi preprocessing (dithering, removing DC offset, pre-emphasis, etc.) on the input waveforms and compute their log-Mel filter bank energies (also known as “fbank”).

Example:

>>> x = torch.randn(1, 16000, dtype=torch.float32)
>>> x.shape
torch.Size([1, 16000])
>>> t = Wav2LogFilterBank()
>>> t(x).shape
torch.Size([1, 100, 80])

The input is a tensor of shape (batch_size, num_samples). The output is a tensor of shape (batch_size, num_frames, num_filters).
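
For instance, a hedged sketch (assuming torch and lhotse are importable; the variable names are illustrative) of computing 40-dimensional fbanks instead of the default 80:

>>> import torch
>>> from lhotse.features.kaldi.layers import Wav2LogFilterBank
>>> audio = torch.randn(1, 16000)  # one second of audio at the default 16 kHz
>>> fbank = Wav2LogFilterBank(num_filters=40)(audio)
>>> fbank.shape
torch.Size([1, 100, 40])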

__init__(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, fft_length=512, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=False, use_fft_mag=False, low_freq=20.0, high_freq=-400.0, num_filters=80, norm_filters=False)[source]

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

T_destination

alias of TypeVar(‘T_destination’, bound=Mapping[str, torch.Tensor])

add_module(name, module)

Adds a child module to the current module.

The module can be accessed as an attribute using the given name.

Args:
name (string): name of the child module. The child module can be accessed from this module using the given name
module (Module): child module to be added to the module.

Return type

None

apply(fn)

Applies fn recursively to every submodule (as returned by .children()) as well as self. Typical use includes initializing the parameters of a model (see also torch.nn.init).

Args:

fn (Module -> None): function to be applied to each submodule

Returns:

Module: self

Example:

>>> @torch.no_grad()
>>> def init_weights(m):
>>>     print(m)
>>>     if type(m) == nn.Linear:
>>>         m.weight.fill_(1.0)
>>>         print(m.weight)
>>> net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
>>> net.apply(init_weights)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[ 1.,  1.],
        [ 1.,  1.]])
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[ 1.,  1.],
        [ 1.,  1.]])
Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
Return type

~T

bfloat16()

Casts all floating point parameters and buffers to bfloat16 datatype.

Note

This method modifies the module in-place.

Returns:

Module: self

Return type

~T

buffers(recurse=True)

Returns an iterator over module buffers.

Args:
recurse (bool): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module.

Yields:

torch.Tensor: module buffer

Example:

>>> for buf in model.buffers():
>>>     print(type(buf), buf.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
Return type

Iterator[Tensor]

children()

Returns an iterator over immediate children modules.

Yields:

Module: a child module

Return type

Iterator[Module]

cpu()

Moves all model parameters and buffers to the CPU.

Note

This method modifies the module in-place.

Returns:

Module: self

Return type

~T

cuda(device=None)

Moves all model parameters and buffers to the GPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing optimizer if the module will live on GPU while being optimized.

Note

This method modifies the module in-place.

Args:
device (int, optional): if specified, all parameters will be copied to that device

Returns:

Module: self

Return type

~T

property dither: float
Return type

float

double()

Casts all floating point parameters and buffers to double datatype.

Note

This method modifies the module in-place.

Returns:

Module: self

Return type

~T

dump_patches: bool = False

This allows better BC support for load_state_dict(). In state_dict(), the version number will be saved in the attribute _metadata of the returned state dict, and thus pickled. _metadata is a dictionary with keys that follow the naming convention of state dict. See _load_from_state_dict on how to use this information in loading.

If new parameters/buffers are added/removed from a module, this number shall be bumped, and the module’s _load_from_state_dict method can compare the version number and do appropriate changes if the state dict is from before the change.

eval()

Sets the module in evaluation mode.

This has an effect only on certain modules. See the documentation of particular modules for details of their behaviors in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

This is equivalent to self.train(False).

See the documentation on locally disabling gradient computation for a comparison between .eval() and several similar mechanisms that may be confused with it.

Returns:

Module: self

Return type

~T

extra_repr()

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

Return type

str

float()

Casts all floating point parameters and buffers to float datatype.

Note

This method modifies the module in-place.

Returns:

Module: self

Return type

~T

property frame_length: float
Return type

float

property frame_shift: float
Return type

float

get_buffer(target)

Returns the buffer given by target if it exists, otherwise throws an error.

See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.

Args:
target: The fully-qualified string name of the buffer to look for. (See get_submodule for how to specify a fully-qualified string.)

Returns:

torch.Tensor: The buffer referenced by target

Raises:
AttributeError: If the target string references an invalid path or resolves to something that is not a buffer

Return type

Tensor

get_extra_state()

Returns any extra state to include in the module’s state_dict. Implement this and a corresponding set_extra_state() for your module if you need to store extra state. This function is called when building the module’s state_dict().

Note that extra state should be pickleable to ensure working serialization of the state_dict. We only provide backwards compatibility guarantees for serializing Tensors; other objects may break backwards compatibility if their serialized pickled form changes.

Returns:

object: Any extra state to store in the module’s state_dict

Return type

Any

get_parameter(target)

Returns the parameter given by target if it exists, otherwise throws an error.

See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.

Args:
target: The fully-qualified string name of the Parameter to look for. (See get_submodule for how to specify a fully-qualified string.)

Returns:

torch.nn.Parameter: The Parameter referenced by target

Raises:
AttributeError: If the target string references an invalid path or resolves to something that is not an nn.Parameter

Return type

Parameter

get_submodule(target)

Returns the submodule given by target if it exists, otherwise throws an error.

For example, let’s say you have an nn.Module A that looks like this:

(The diagram shows an nn.Module A. A has a nested submodule net_b, which itself has two submodules net_c and linear. net_c then has a submodule conv.)

To check whether or not we have the linear submodule, we would call get_submodule("net_b.linear"). To check whether we have the conv submodule, we would call get_submodule("net_b.net_c.conv").

The runtime of get_submodule is bounded by the degree of module nesting in target. A query against named_modules achieves the same result, but it is O(N) in the number of transitive modules. So, for a simple check to see if some submodule exists, get_submodule should always be used.

Args:
target: The fully-qualified string name of the submodule to look for. (See above example for how to specify a fully-qualified string.)

Returns:

torch.nn.Module: The submodule referenced by target

Raises:
AttributeError: If the target string references an invalid path or resolves to something that is not an nn.Module

Return type

Module

half()

Casts all floating point parameters and buffers to half datatype.

Note

This method modifies the module in-place.

Returns:

Module: self

Return type

~T

load_state_dict(state_dict, strict=True)

Copies parameters and buffers from state_dict into this module and its descendants. If strict is True, then the keys of state_dict must exactly match the keys returned by this module’s state_dict() function.

Args:
state_dict (dict): a dict containing parameters and persistent buffers.
strict (bool, optional): whether to strictly enforce that the keys in state_dict match the keys returned by this module’s state_dict() function. Default: True

Returns:
NamedTuple with missing_keys and unexpected_keys fields:
  • missing_keys is a list of str containing the missing keys

  • unexpected_keys is a list of str containing the unexpected keys

Note:

If a parameter or buffer is registered as None and its corresponding key exists in state_dict, load_state_dict() will raise a RuntimeError.

modules()

Returns an iterator over all modules in the network.

Yields:

Module: a module in the network

Note:

Duplicate modules are returned only once. In the following example, l will be returned only once.

Example:

>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.modules()):
...     print(idx, '->', m)

0 -> Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
1 -> Linear(in_features=2, out_features=2, bias=True)
Return type

Iterator[Module]

named_buffers(prefix='', recurse=True)

Returns an iterator over module buffers, yielding both the name of the buffer as well as the buffer itself.

Args:

prefix (str): prefix to prepend to all buffer names.

recurse (bool): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module.

Yields:

(string, torch.Tensor): Tuple containing the name and buffer

Example:

>>> for name, buf in self.named_buffers():
>>>    if name in ['running_var']:
>>>        print(buf.size())
Return type

Iterator[Tuple[str, Tensor]]

named_children()

Returns an iterator over immediate children modules, yielding both the name of the module as well as the module itself.

Yields:

(string, Module): Tuple containing a name and child module

Example:

>>> for name, module in model.named_children():
>>>     if name in ['conv4', 'conv5']:
>>>         print(module)
Return type

Iterator[Tuple[str, Module]]

named_modules(memo=None, prefix='', remove_duplicate=True)

Returns an iterator over all modules in the network, yielding both the name of the module as well as the module itself.

Args:

memo: a memo to store the set of modules already added to the result

prefix: a prefix that will be added to the name of the module

remove_duplicate: whether to remove the duplicated module instances in the result or not

Yields:

(string, Module): Tuple of name and module

Note:

Duplicate modules are returned only once. In the following example, l will be returned only once.

Example:

>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.named_modules()):
...     print(idx, '->', m)

0 -> ('', Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
))
1 -> ('0', Linear(in_features=2, out_features=2, bias=True))
named_parameters(prefix='', recurse=True)

Returns an iterator over module parameters, yielding both the name of the parameter as well as the parameter itself.

Args:

prefix (str): prefix to prepend to all parameter names.

recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.

Yields:

(string, Parameter): Tuple containing the name and parameter

Example:

>>> for name, param in self.named_parameters():
>>>    if name in ['bias']:
>>>        print(param.size())
Return type

Iterator[Tuple[str, Parameter]]

parameters(recurse=True)

Returns an iterator over module parameters.

This is typically passed to an optimizer.

Args:
recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.

Yields:

Parameter: module parameter

Example:

>>> for param in model.parameters():
>>>     print(type(param), param.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
Return type

Iterator[Parameter]

property preemph_coeff: float
Return type

float

register_backward_hook(hook)

Registers a backward hook on the module.

This function is deprecated in favor of register_full_backward_hook() and the behavior of this function will change in future versions.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

Return type

RemovableHandle

register_buffer(name, tensor, persistent=True)

Adds a buffer to the module.

This is typically used to register a buffer that should not be considered a model parameter. For example, BatchNorm’s running_mean is not a parameter, but is part of the module’s state. Buffers, by default, are persistent and will be saved alongside parameters. This behavior can be changed by setting persistent to False. The only difference between a persistent buffer and a non-persistent buffer is that the latter will not be a part of this module’s state_dict.

Buffers can be accessed as attributes using given names.

Args:
name (string): name of the buffer. The buffer can be accessed from this module using the given name

tensor (Tensor or None): buffer to be registered. If None, then operations that run on buffers, such as cuda, are ignored. If None, the buffer is not included in the module’s state_dict.

persistent (bool): whether the buffer is part of this module’s state_dict.

Example:

>>> self.register_buffer('running_mean', torch.zeros(num_features))
Return type

None

register_forward_hook(hook)

Registers a forward hook on the module.

The hook will be called every time after forward() has computed an output. It should have the following signature:

hook(module, input, output) -> None or modified output

The input contains only the positional arguments given to the module. Keyword arguments are passed only to forward, not to the hooks. The hook can modify the output. It can also modify the input in-place, but this will have no effect on forward, since the hook is called after forward() has run.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

Return type

RemovableHandle
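
A minimal sketch of a diagnostic hook (the model variable is an assumption; any nn.Module works):

>>> def shape_hook(module, input, output):
...     print(type(module).__name__, tuple(output.shape))
>>> handle = model.register_forward_hook(shape_hook)
>>> # ... run a forward pass, then detach the hook:
>>> handle.remove()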

register_forward_pre_hook(hook)

Registers a forward pre-hook on the module.

The hook will be called every time before forward() is invoked. It should have the following signature:

hook(module, input) -> None or modified input

The input contains only the positional arguments given to the module. Keyword arguments are passed only to forward, not to the hooks. The hook can modify the input. The user can return either a tuple or a single modified value from the hook. We will wrap the value into a tuple if a single value is returned (unless that value is already a tuple).

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

Return type

RemovableHandle

register_full_backward_hook(hook)

Registers a backward hook on the module.

The hook will be called every time the gradients with respect to module inputs are computed. The hook should have the following signature:

hook(module, grad_input, grad_output) -> tuple(Tensor) or None

The grad_input and grad_output are tuples that contain the gradients with respect to the inputs and outputs respectively. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the input that will be used in place of grad_input in subsequent computations. grad_input will only correspond to the inputs given as positional arguments and all kwarg arguments are ignored. Entries in grad_input and grad_output will be None for all non-Tensor arguments.

For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly the caller will receive a view of each Tensor returned by the Module’s forward function.

Warning

Modifying inputs or outputs inplace is not allowed when using backward hooks and will raise an error.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

Return type

RemovableHandle
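
For example, a hook that returns clamped input gradients in place of grad_input (a hedged sketch; the clamping range is arbitrary and model is an assumption):

>>> def clamp_grad_input(module, grad_input, grad_output):
...     return tuple(g.clamp(-1.0, 1.0) if g is not None else None
...                  for g in grad_input)
>>> handle = model.register_full_backward_hook(clamp_grad_input)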

register_parameter(name, param)

Adds a parameter to the module.

The parameter can be accessed as an attribute using given name.

Args:
name (string): name of the parameter. The parameter can be accessed from this module using the given name

param (Parameter or None): parameter to be added to the module. If None, then operations that run on parameters, such as cuda, are ignored. If None, the parameter is not included in the module’s state_dict.

Return type

None

property remove_dc_offset: bool
Return type

bool

requires_grad_(requires_grad=True)

Change if autograd should record operations on parameters in this module.

This method sets the parameters’ requires_grad attributes in-place.

This method is helpful for freezing part of the module for finetuning or training parts of a model individually (e.g., GAN training).

See locally-disable-grad-doc for a comparison between .requires_grad_() and several similar mechanisms that may be confused with it.

Args:
requires_grad (bool): whether autograd should record operations on parameters in this module. Default: True.

Returns:

Module: self

Return type

~T
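
A common pattern is freezing a pretrained trunk while training a new head (a sketch; the backbone and head attributes are assumptions for illustration):

>>> model.backbone.requires_grad_(False)  # freeze
>>> model.head.requires_grad_(True)       # keep trainable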

property sampling_rate: int
Return type

int

set_extra_state(state)

This function is called from load_state_dict() to handle any extra state found within the state_dict. Implement this function and a corresponding get_extra_state() for your module if you need to store extra state within its state_dict.

Args:

state (dict): Extra state from the state_dict

share_memory()

See torch.Tensor.share_memory_()

Return type

~T

state_dict(destination=None, prefix='', keep_vars=False)

Returns a dictionary containing a whole state of the module.

Both parameters and persistent buffers (e.g. running averages) are included. Keys are corresponding parameter and buffer names. Parameters and buffers set to None are not included.

Returns:
dict:

a dictionary containing a whole state of the module

Example:

>>> module.state_dict().keys()
['bias', 'weight']
to(*args, **kwargs)

Moves and/or casts the parameters and buffers.

This can be called as

to(device=None, dtype=None, non_blocking=False)
to(dtype, non_blocking=False)
to(tensor, non_blocking=False)
to(memory_format=torch.channels_last)

Its signature is similar to torch.Tensor.to(), but only accepts floating point or complex dtypes. In addition, this method will only cast the floating point or complex parameters and buffers to dtype (if given). The integral parameters and buffers will be moved to device, if that is given, but with dtypes unchanged. When non_blocking is set, it tries to convert/move asynchronously with respect to the host if possible, e.g., moving CPU Tensors with pinned memory to CUDA devices.

See below for examples.

Note

This method modifies the module in-place.

Args:
device (torch.device): the desired device of the parameters and buffers in this module

dtype (torch.dtype): the desired floating point or complex dtype of the parameters and buffers in this module

tensor (torch.Tensor): Tensor whose dtype and device are the desired dtype and device for all parameters and buffers in this module

memory_format (torch.memory_format): the desired memory format for 4D parameters and buffers in this module (keyword only argument)

Returns:

Module: self

Examples:

>>> linear = nn.Linear(2, 2)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]])
>>> linear.to(torch.double)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]], dtype=torch.float64)
>>> gpu1 = torch.device("cuda:1")
>>> linear.to(gpu1, dtype=torch.half, non_blocking=True)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16, device='cuda:1')
>>> cpu = torch.device("cpu")
>>> linear.to(cpu)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16)

>>> linear = nn.Linear(2, 2, bias=None).to(torch.cdouble)
>>> linear.weight
Parameter containing:
tensor([[ 0.3741+0.j,  0.2382+0.j],
        [ 0.5593+0.j, -0.4443+0.j]], dtype=torch.complex128)
>>> linear(torch.ones(3, 2, dtype=torch.cdouble))
tensor([[0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j]], dtype=torch.complex128)
to_empty(*, device)

Moves the parameters and buffers to the specified device without copying storage.

Args:
device (torch.device): The desired device of the parameters and buffers in this module.

Returns:

Module: self

Return type

~T

train(mode=True)

Sets the module in training mode.

This has an effect only on certain modules. See the documentation of particular modules for details of their behavior in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

Args:
mode (bool): whether to set training mode (True) or evaluation mode (False). Default: True.

Returns:

Module: self

Return type

~T

type(dst_type)

Casts all parameters and buffers to dst_type.

Note

This method modifies the module in-place.

Args:

dst_type (type or string): the desired type

Returns:

Module: self

Return type

~T

property window_type: str
Return type

str

xpu(device=None)

Moves all model parameters and buffers to the XPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing optimizer if the module will live on XPU while being optimized.

Note

This method modifies the module in-place.

Arguments:
device (int, optional): if specified, all parameters will be copied to that device

Returns:

Module: self

Return type

~T

zero_grad(set_to_none=False)

Sets gradients of all model parameters to zero. See similar function under torch.optim.Optimizer for more context.

Args:
set_to_none (bool): instead of setting to zero, set the grads to None. See torch.optim.Optimizer.zero_grad() for details.

Return type

None

training: bool
class lhotse.features.kaldi.layers.Wav2MFCC(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, fft_length=512, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=False, use_fft_mag=False, low_freq=20.0, high_freq=-400.0, num_filters=23, norm_filters=False, num_ceps=13, cepstral_lifter=22)[source]

Apply standard Kaldi preprocessing (dithering, removing DC offset, pre-emphasis, etc.) on the input waveforms and compute their Mel-Frequency Cepstral Coefficients (MFCC).

Example:

>>> x = torch.randn(1, 16000, dtype=torch.float32)
>>> x.shape
torch.Size([1, 16000])
>>> t = Wav2MFCC()
>>> t(x).shape
torch.Size([1, 100, 13])

The input is a tensor of shape (batch_size, num_samples). The output is a tensor of shape (batch_size, num_frames, num_ceps).

__init__(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, fft_length=512, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=False, use_fft_mag=False, low_freq=20.0, high_freq=-400.0, num_filters=23, norm_filters=False, num_ceps=13, cepstral_lifter=22)[source]

Initializes internal Module state, shared by both nn.Module and ScriptModule.

static make_lifter(N, Q)[source]

Makes the liftering function

Args:

N: Number of cepstral coefficients.

Q: Liftering parameter.

Returns:

Liftering vector.
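
Kaldi-compatible implementations typically compute the lifter as 1 + (Q/2) * sin(pi * n / Q) for n = 0..N-1; a NumPy sketch of that formula (the exact return type and constants of this method are an assumption):

>>> import numpy as np
>>> N, Q = 13, 22
>>> lifter = 1.0 + 0.5 * Q * np.sin(np.pi * np.arange(N) / Q)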

static make_dct_matrix(num_ceps, num_filters)[source]
forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

T_destination

alias of TypeVar(‘T_destination’, bound=Mapping[str, torch.Tensor])

add_module(name, module)

Adds a child module to the current module.

The module can be accessed as an attribute using the given name.

Args:
name (string): name of the child module. The child module can be accessed from this module using the given name

module (Module): child module to be added to the module.

Return type

None

apply(fn)

Applies fn recursively to every submodule (as returned by .children()) as well as self. Typical use includes initializing the parameters of a model (see also nn-init-doc).

Args:

fn (Module -> None): function to be applied to each submodule

Returns:

Module: self

Example:

>>> @torch.no_grad()
>>> def init_weights(m):
>>>     print(m)
>>>     if type(m) == nn.Linear:
>>>         m.weight.fill_(1.0)
>>>         print(m.weight)
>>> net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
>>> net.apply(init_weights)
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[ 1.,  1.],
        [ 1.,  1.]])
Linear(in_features=2, out_features=2, bias=True)
Parameter containing:
tensor([[ 1.,  1.],
        [ 1.,  1.]])
Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
Return type

~T

bfloat16()

Casts all floating point parameters and buffers to bfloat16 datatype.

Note

This method modifies the module in-place.

Returns:

Module: self

Return type

~T

buffers(recurse=True)

Returns an iterator over module buffers.

Args:
recurse (bool): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module.

Yields:

torch.Tensor: module buffer

Example:

>>> for buf in model.buffers():
>>>     print(type(buf), buf.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
Return type

Iterator[Tensor]

children()

Returns an iterator over immediate children modules.

Yields:

Module: a child module

Return type

Iterator[Module]

cpu()

Moves all model parameters and buffers to the CPU.

Note

This method modifies the module in-place.

Returns:

Module: self

Return type

~T

cuda(device=None)

Moves all model parameters and buffers to the GPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing optimizer if the module will live on GPU while being optimized.

Note

This method modifies the module in-place.

Args:
device (int, optional): if specified, all parameters will be copied to that device

Returns:

Module: self

Return type

~T

property dither: float
Return type

float

double()

Casts all floating point parameters and buffers to double datatype.

Note

This method modifies the module in-place.

Returns:

Module: self

Return type

~T

dump_patches: bool = False

This allows better BC support for load_state_dict(). In state_dict(), the version number will be saved in the attribute _metadata of the returned state dict, and thus pickled. _metadata is a dictionary with keys that follow the naming convention of state dict. See _load_from_state_dict on how to use this information in loading.

If new parameters/buffers are added/removed from a module, this number shall be bumped, and the module’s _load_from_state_dict method can compare the version number and do appropriate changes if the state dict is from before the change.

eval()

Sets the module in evaluation mode.

This has an effect only on certain modules. See the documentation of particular modules for details of their behavior in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

This is equivalent to self.train(False).

See locally-disable-grad-doc for a comparison between .eval() and several similar mechanisms that may be confused with it.

Returns:

Module: self

Return type

~T

extra_repr()

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

Return type

str

float()

Casts all floating point parameters and buffers to float datatype.

Note

This method modifies the module in-place.

Returns:

Module: self

Return type

~T

property frame_length: float
Return type

float

property frame_shift: float
Return type

float

get_buffer(target)

Returns the buffer given by target if it exists, otherwise throws an error.

See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.

Args:
target: The fully-qualified string name of the buffer to look for. (See get_submodule for how to specify a fully-qualified string.)

Returns:

torch.Tensor: The buffer referenced by target

Raises:
AttributeError: If the target string references an invalid path or resolves to something that is not a buffer

Return type

Tensor

get_extra_state()

Returns any extra state to include in the module’s state_dict. Implement this and a corresponding set_extra_state() for your module if you need to store extra state. This function is called when building the module’s state_dict().

Note that extra state should be pickleable to ensure working serialization of the state_dict. We only provide backwards compatibility guarantees for serializing Tensors; other objects may break backwards compatibility if their serialized pickled form changes.

Returns:

object: Any extra state to store in the module’s state_dict

Return type

Any

get_parameter(target)

Returns the parameter given by target if it exists, otherwise throws an error.

See the docstring for get_submodule for a more detailed explanation of this method’s functionality as well as how to correctly specify target.

Args:
target: The fully-qualified string name of the Parameter to look for. (See get_submodule for how to specify a fully-qualified string.)

Returns:

torch.nn.Parameter: The Parameter referenced by target

Raises:
AttributeError: If the target string references an invalid path or resolves to something that is not an nn.Parameter

Return type

Parameter

get_submodule(target)

Returns the submodule given by target if it exists, otherwise throws an error.

For example, let’s say you have an nn.Module A that looks like this:

(The diagram shows an nn.Module A. A has a nested submodule net_b, which itself has two submodules net_c and linear. net_c then has a submodule conv.)

To check whether or not we have the linear submodule, we would call get_submodule("net_b.linear"). To check whether we have the conv submodule, we would call get_submodule("net_b.net_c.conv").

The runtime of get_submodule is bounded by the degree of module nesting in target. A query against named_modules achieves the same result, but it is O(N) in the number of transitive modules. So, for a simple check to see if some submodule exists, get_submodule should always be used.

Args:
target: The fully-qualified string name of the submodule to look for. (See above example for how to specify a fully-qualified string.)

Returns:

torch.nn.Module: The submodule referenced by target

Raises:
AttributeError: If the target string references an invalid path or resolves to something that is not an nn.Module

Return type

Module

half()

Casts all floating point parameters and buffers to half datatype.

Note

This method modifies the module in-place.

Returns:

Module: self

Return type

~T

load_state_dict(state_dict, strict=True)

Copies parameters and buffers from state_dict into this module and its descendants. If strict is True, then the keys of state_dict must exactly match the keys returned by this module’s state_dict() function.

Args:
state_dict (dict): a dict containing parameters and persistent buffers.

strict (bool, optional): whether to strictly enforce that the keys in state_dict match the keys returned by this module’s state_dict() function. Default: True

Returns:
NamedTuple with missing_keys and unexpected_keys fields:
  • missing_keys is a list of str containing the missing keys

  • unexpected_keys is a list of str containing the unexpected keys

Note:

If a parameter or buffer is registered as None and its corresponding key exists in state_dict, load_state_dict() will raise a RuntimeError.

modules()

Returns an iterator over all modules in the network.

Yields:

Module: a module in the network

Note:

Duplicate modules are returned only once. In the following example, l will be returned only once.

Example:

>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.modules()):
...     print(idx, '->', m)

0 -> Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
)
1 -> Linear(in_features=2, out_features=2, bias=True)
Return type

Iterator[Module]

named_buffers(prefix='', recurse=True)

Returns an iterator over module buffers, yielding both the name of the buffer as well as the buffer itself.

Args:

prefix (str): prefix to prepend to all buffer names.

recurse (bool): if True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module.

Yields:

(string, torch.Tensor): Tuple containing the name and buffer

Example:

>>> for name, buf in self.named_buffers():
>>>    if name in ['running_var']:
>>>        print(buf.size())
Return type

Iterator[Tuple[str, Tensor]]

named_children()

Returns an iterator over immediate children modules, yielding both the name of the module as well as the module itself.

Yields:

(string, Module): Tuple containing a name and child module

Example:

>>> for name, module in model.named_children():
>>>     if name in ['conv4', 'conv5']:
>>>         print(module)
Return type

Iterator[Tuple[str, Module]]

named_modules(memo=None, prefix='', remove_duplicate=True)

Returns an iterator over all modules in the network, yielding both the name of the module as well as the module itself.

Args:

memo: a memo to store the set of modules already added to the result

prefix: a prefix that will be added to the name of the module

remove_duplicate: whether to remove the duplicated module instances in the result or not

Yields:

(string, Module): Tuple of name and module

Note:

Duplicate modules are returned only once. In the following example, l will be returned only once.

Example:

>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.named_modules()):
...     print(idx, '->', m)

0 -> ('', Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
))
1 -> ('0', Linear(in_features=2, out_features=2, bias=True))
named_parameters(prefix='', recurse=True)

Returns an iterator over module parameters, yielding both the name of the parameter as well as the parameter itself.

Args:

prefix (str): prefix to prepend to all parameter names.

recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.

Yields:

(string, Parameter): Tuple containing the name and parameter

Example:

>>> for name, param in self.named_parameters():
>>>    if name in ['bias']:
>>>        print(param.size())
Return type

Iterator[Tuple[str, Parameter]]

parameters(recurse=True)

Returns an iterator over module parameters.

This is typically passed to an optimizer.

Args:
recurse (bool): if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.

Yields:

Parameter: module parameter

Example:

>>> for param in model.parameters():
>>>     print(type(param), param.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
Return type

Iterator[Parameter]

property preemph_coeff: float
Return type

float

register_backward_hook(hook)

Registers a backward hook on the module.

This function is deprecated in favor of register_full_backward_hook() and the behavior of this function will change in future versions.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

Return type

RemovableHandle

register_buffer(name, tensor, persistent=True)

Adds a buffer to the module.

This is typically used to register a buffer that should not be considered a model parameter. For example, BatchNorm’s running_mean is not a parameter, but is part of the module’s state. Buffers, by default, are persistent and will be saved alongside parameters. This behavior can be changed by setting persistent to False. The only difference between a persistent buffer and a non-persistent buffer is that the latter will not be a part of this module’s state_dict.

Buffers can be accessed as attributes using given names.

Args:
name (string): name of the buffer. The buffer can be accessed from this module using the given name

tensor (Tensor or None): buffer to be registered. If None, then operations that run on buffers, such as cuda, are ignored. If None, the buffer is not included in the module’s state_dict.

persistent (bool): whether the buffer is part of this module’s state_dict.

Example:

>>> self.register_buffer('running_mean', torch.zeros(num_features))
Return type

None

register_forward_hook(hook)

Registers a forward hook on the module.

The hook will be called every time after forward() has computed an output. It should have the following signature:

hook(module, input, output) -> None or modified output

The input contains only the positional arguments given to the module. Keyword arguments are passed only to forward, not to the hooks. The hook can modify the output. It can also modify the input in-place, but this will have no effect on forward, since the hook is called after forward() has run.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

Return type

RemovableHandle

register_forward_pre_hook(hook)

Registers a forward pre-hook on the module.

The hook will be called every time before forward() is invoked. It should have the following signature:

hook(module, input) -> None or modified input

The input contains only the positional arguments given to the module. Keyword arguments are passed only to forward, not to the hooks. The hook can modify the input. The user can return either a tuple or a single modified value from the hook. We will wrap the value into a tuple if a single value is returned (unless that value is already a tuple).

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

Return type

RemovableHandle

register_full_backward_hook(hook)

Registers a backward hook on the module.

The hook will be called every time the gradients with respect to module inputs are computed. The hook should have the following signature:

hook(module, grad_input, grad_output) -> tuple(Tensor) or None

The grad_input and grad_output are tuples that contain the gradients with respect to the inputs and outputs respectively. The hook should not modify its arguments, but it can optionally return a new gradient with respect to the input that will be used in place of grad_input in subsequent computations. grad_input will only correspond to the inputs given as positional arguments and all kwarg arguments are ignored. Entries in grad_input and grad_output will be None for all non-Tensor arguments.

For technical reasons, when this hook is applied to a Module, its forward function will receive a view of each Tensor passed to the Module. Similarly the caller will receive a view of each Tensor returned by the Module’s forward function.

Warning

Modifying inputs or outputs inplace is not allowed when using backward hooks and will raise an error.

Returns:
torch.utils.hooks.RemovableHandle:

a handle that can be used to remove the added hook by calling handle.remove()

Return type

RemovableHandle

register_parameter(name, param)

Adds a parameter to the module.

The parameter can be accessed as an attribute using given name.

Args:
name (string): name of the parameter. The parameter can be accessed from this module using the given name

param (Parameter or None): parameter to be added to the module. If None, then operations that run on parameters, such as cuda, are ignored. If None, the parameter is not included in the module’s state_dict.

Return type

None

property remove_dc_offset: bool
Return type

bool

requires_grad_(requires_grad=True)

Change if autograd should record operations on parameters in this module.

This method sets the parameters’ requires_grad attributes in-place.

This method is helpful for freezing part of the module for finetuning or training parts of a model individually (e.g., GAN training).

See locally-disable-grad-doc for a comparison between .requires_grad_() and several similar mechanisms that may be confused with it.

Args:
requires_grad (bool): whether autograd should record operations on parameters in this module. Default: True.

Returns:

Module: self

Return type

~T

property sampling_rate: int
Return type

int

set_extra_state(state)

This function is called from load_state_dict() to handle any extra state found within the state_dict. Implement this function and a corresponding get_extra_state() for your module if you need to store extra state within its state_dict.

Args:

state (dict): Extra state from the state_dict

share_memory()

See torch.Tensor.share_memory_()

Return type

~T

state_dict(destination=None, prefix='', keep_vars=False)

Returns a dictionary containing a whole state of the module.

Both parameters and persistent buffers (e.g. running averages) are included. Keys are corresponding parameter and buffer names. Parameters and buffers set to None are not included.

Returns:
dict:

a dictionary containing a whole state of the module

Example:

>>> module.state_dict().keys()
['bias', 'weight']
to(*args, **kwargs)

Moves and/or casts the parameters and buffers.

This can be called as

to(device=None, dtype=None, non_blocking=False)
to(dtype, non_blocking=False)
to(tensor, non_blocking=False)
to(memory_format=torch.channels_last)

Its signature is similar to torch.Tensor.to(), but only accepts floating point or complex dtypes. In addition, this method will only cast the floating point or complex parameters and buffers to dtype (if given). The integral parameters and buffers will be moved to device, if that is given, but with dtypes unchanged. When non_blocking is set, it tries to convert/move asynchronously with respect to the host if possible, e.g., moving CPU Tensors with pinned memory to CUDA devices.

See below for examples.

Note

This method modifies the module in-place.

Args:
device (torch.device): the desired device of the parameters and buffers in this module

dtype (torch.dtype): the desired floating point or complex dtype of the parameters and buffers in this module

tensor (torch.Tensor): Tensor whose dtype and device are the desired dtype and device for all parameters and buffers in this module

memory_format (torch.memory_format): the desired memory format for 4D parameters and buffers in this module (keyword only argument)

Returns:

Module: self

Examples:

>>> linear = nn.Linear(2, 2)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]])
>>> linear.to(torch.double)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]], dtype=torch.float64)
>>> gpu1 = torch.device("cuda:1")
>>> linear.to(gpu1, dtype=torch.half, non_blocking=True)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16, device='cuda:1')
>>> cpu = torch.device("cpu")
>>> linear.to(cpu)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16)

>>> linear = nn.Linear(2, 2, bias=None).to(torch.cdouble)
>>> linear.weight
Parameter containing:
tensor([[ 0.3741+0.j,  0.2382+0.j],
        [ 0.5593+0.j, -0.4443+0.j]], dtype=torch.complex128)
>>> linear(torch.ones(3, 2, dtype=torch.cdouble))
tensor([[0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j]], dtype=torch.complex128)
to_empty(*, device)

Moves the parameters and buffers to the specified device without copying storage.

Args:
device (torch.device): The desired device of the parameters and buffers in this module.

Returns:

Module: self

Return type

~T

train(mode=True)

Sets the module in training mode.

This has an effect only on certain modules. See the documentation of particular modules for details of their behavior in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

Args:
mode (bool): whether to set training mode (True) or evaluation mode (False). Default: True.

Returns:

Module: self

Return type

~T

type(dst_type)

Casts all parameters and buffers to dst_type.

Note

This method modifies the module in-place.

Args:

dst_type (type or string): the desired type

Returns:

Module: self

Return type

~T

property window_type: str
Return type

str

xpu(device=None)

Moves all model parameters and buffers to the XPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing optimizer if the module will live on XPU while being optimized.

Note

This method modifies the module in-place.

Arguments:
device (int, optional): if specified, all parameters will be copied to that device

Returns:

Module: self

Return type

~T

zero_grad(set_to_none=False)

Sets gradients of all model parameters to zero. See similar function under torch.optim.Optimizer for more context.

Args:
set_to_none (bool): instead of setting to zero, set the grads to None. See torch.optim.Optimizer.zero_grad() for details.

Return type

None

training: bool
lhotse.features.kaldi.layers.create_mel_scale(num_filters, fft_length, sampling_rate, low_freq=0, high_freq=None, norm_filters=True)[source]
lhotse.features.kaldi.layers.available_windows()[source]
Return type

List[str]

lhotse.features.kaldi.layers.create_frame_window(window_size, window_type='povey', blackman_coeff=0.42)[source]

Returns a window function with the given type and size

lhotse.features.kaldi.layers.lin2mel(x)[source]
lhotse.features.kaldi.layers.mel2lin(x)[source]
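
These helpers presumably implement the standard Kaldi/HTK mel scale; a sketch of those formulas (the exact constants are an assumption):

>>> import math
>>> def lin2mel(hz):
...     return 1127.0 * math.log(1.0 + hz / 700.0)    # Hz -> mel
>>> def mel2lin(mel):
...     return 700.0 * (math.exp(mel / 1127.0) - 1.0) # mel -> Hz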

Torchaudio feature extractors

class lhotse.features.fbank.FbankConfig(dither=0.0, window_type='povey', frame_length=0.025, frame_shift=0.01, remove_dc_offset=True, round_to_power_of_two=True, energy_floor=1e-10, min_duration=0.0, preemphasis_coefficient=0.97, raw_energy=True, low_freq=20.0, high_freq=-400.0, num_mel_bins=80, use_energy=False, vtln_low=100.0, vtln_high=-500.0, vtln_warp=1.0)[source]
dither: float = 0.0
window_type: str = 'povey'
frame_length: float = 0.025
frame_shift: float = 0.01
remove_dc_offset: bool = True
round_to_power_of_two: bool = True
energy_floor: float = 1e-10
min_duration: float = 0.0
preemphasis_coefficient: float = 0.97
raw_energy: bool = True
low_freq: float = 20.0
high_freq: float = -400.0
num_mel_bins: int = 80
use_energy: bool = False
vtln_low: float = 100.0
vtln_high: float = -500.0
vtln_warp: float = 1.0
to_dict()[source]
Return type

Dict[str, Any]

static from_dict(data)[source]
Return type

FbankConfig

__init__(dither=0.0, window_type='povey', frame_length=0.025, frame_shift=0.01, remove_dc_offset=True, round_to_power_of_two=True, energy_floor=1e-10, min_duration=0.0, preemphasis_coefficient=0.97, raw_energy=True, low_freq=20.0, high_freq=-400.0, num_mel_bins=80, use_energy=False, vtln_low=100.0, vtln_high=-500.0, vtln_warp=1.0)
class lhotse.features.fbank.Fbank(config=None)[source]

Log Mel energy filter bank feature extractor based on torchaudio.compliance.kaldi.fbank function.
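
A minimal usage sketch (the random waveform is a stand-in for real audio):

>>> import numpy as np
>>> from lhotse import Fbank, FbankConfig
>>> extractor = Fbank(FbankConfig(num_mel_bins=80))
>>> samples = np.random.randn(16000).astype(np.float32)  # ~1 s at 16 kHz
>>> feats = extractor.extract(samples, sampling_rate=16000)  # -> (num_frames, 80)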

name = 'fbank'
config_type

alias of lhotse.features.fbank.FbankConfig

feature_dim(sampling_rate)[source]
Return type

int

static mix(features_a, features_b, energy_scaling_factor_b)[source]

Perform feature-domain mix of two signals, a and b, and return the mixed signal.

Parameters
  • features_a (ndarray) – Left-hand side (reference) signal.

  • features_b (ndarray) – Right-hand side (mixed-in) signal.

  • energy_scaling_factor_b (float) – A scaling factor for features_b energy. It is used to achieve a specific SNR. E.g. to mix with an SNR of 10dB when both features_a and features_b energies are 100, the features_b signal energy needs to be scaled by 0.1. Since different features (e.g. spectrogram, fbank, MFCC) require different combination of transformations (e.g. exp, log, sqrt, pow) to allow mixing of two signals, the exact place where to apply energy_scaling_factor_b to the signal is determined by the implementer.

Return type

ndarray

Returns

A mixed feature matrix.
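
The scaling factor for a desired SNR (in dB) can be derived from the two energies; a hedged sketch consistent with the 10 dB example above (features_a and features_b are assumed to be existing feature matrices; compute_energy is documented below):

>>> snr_db = 10
>>> factor = Fbank.compute_energy(features_a) / (
...     Fbank.compute_energy(features_b) * 10 ** (snr_db / 10))
>>> mixed = Fbank.mix(features_a, features_b, energy_scaling_factor_b=factor)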

static compute_energy(features)[source]

Compute the total energy of a feature matrix. How the energy is computed depends on a particular type of features. It is expected that when implemented, compute_energy will never return zero.

Parameters

features (ndarray) – A feature matrix.

Return type

float

Returns

A positive float value of the signal energy.

__init__(config=None)
property device: Union[str, torch.device]
Return type

Union[str, device]

extract(samples, sampling_rate)

Defines how to extract features using a numpy ndarray of audio samples and the sampling rate.

Return type

ndarray

Returns

a numpy ndarray representing the feature matrix.

extract_batch(samples, sampling_rate)

Performs batch extraction. It is not guaranteed to be faster than FeatureExtractor.extract() – it depends on whether the implementation of a particular feature extractor supports accelerated batch computation.

Note

Unless overridden by child classes, it defaults to sequentially calling FeatureExtractor.extract() on the inputs.

Note

This method should support variable length inputs.

Return type

Union[ndarray, List[ndarray]]

extract_from_recording_and_store(recording, storage, offset=0, duration=None, channels=None, augment_fn=None)

Extract the features from a Recording in a full pipeline:

  • load audio from disk;

  • optionally, perform audio augmentation;

  • extract the features;

  • save them to disk in a specified directory;

  • return a Features object with a description of the extracted features and the source data used.

Parameters
  • recording (Recording) – a Recording that specifies what’s the input audio.

  • storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices.

  • offset (float) – an optional offset in seconds for where to start reading the recording.

  • duration (Optional[float]) – an optional duration specifying how much audio to load from the recording.

  • channels (Union[int, List[int], None]) – an optional int or list of ints, specifying the channels; by default, all channels will be used.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.

Return type

Features

Returns

a Features manifest item for the extracted feature matrix.
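
A sketch of the full pipeline (the audio path and storage directory are assumptions for illustration):

>>> from lhotse import Fbank, Recording
>>> from lhotse.features.io import LilcomFilesWriter
>>> recording = Recording.from_file('audio.wav')
>>> extractor = Fbank()
>>> with LilcomFilesWriter('feats/') as storage:
...     features = extractor.extract_from_recording_and_store(recording, storage=storage)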

extract_from_samples_and_store(samples, storage, sampling_rate, offset=0, channel=None, augment_fn=None)

Extract the features from an array of audio samples in a full pipeline:

  • optional audio augmentation;

  • extract the features;

  • save them to disk in a specified directory;

  • return a Features object with a description of the extracted features.

Note that, unlike extract_from_recording_and_store, the returned Features object might not be suitable to store in a FeatureSet, as it does not reference any particular Recording. Instead, this method is useful when extracting features from cuts - especially MixedCut instances, which may be created from multiple recordings and channels.

Parameters
  • samples (ndarray) – a numpy ndarray with the audio samples.

  • sampling_rate (int) – integer sampling rate of samples.

  • storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices.

  • offset (float) – an offset in seconds for where to start reading the recording - when used for Cut feature extraction, must be equal to Cut.start.

  • channel (Optional[int]) – an optional channel number to insert into Features manifest.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.

Return type

Features

Returns

a Features manifest item for the extracted feature matrix (it is not written to disk).

property frame_shift: float
Return type

float

classmethod from_dict(data)
Return type

FeatureExtractor

classmethod from_yaml(path)
Return type

FeatureExtractor

to_dict()
Return type

Dict[str, Any]

to_yaml(path)
class lhotse.features.mfcc.MfccConfig(dither=0.0, window_type='povey', frame_length=0.025, frame_shift=0.01, remove_dc_offset=True, round_to_power_of_two=True, energy_floor=1e-10, min_duration=0.0, preemphasis_coefficient=0.97, raw_energy=True, low_freq=20.0, high_freq=-400.0, num_mel_bins=23, use_energy=False, vtln_low=100.0, vtln_high=-500.0, vtln_warp=1.0, cepstral_lifter=22.0, num_ceps=13)[source]
dither: float = 0.0
window_type: str = 'povey'
frame_length: float = 0.025
frame_shift: float = 0.01
remove_dc_offset: bool = True
round_to_power_of_two: bool = True
energy_floor: float = 1e-10
min_duration: float = 0.0
preemphasis_coefficient: float = 0.97
raw_energy: bool = True
low_freq: float = 20.0
high_freq: float = -400.0
num_mel_bins: int = 23
use_energy: bool = False
vtln_low: float = 100.0
vtln_high: float = -500.0
vtln_warp: float = 1.0
cepstral_lifter: float = 22.0
num_ceps: int = 13
to_dict()[source]
Return type

Dict[str, Any]

static from_dict(data)[source]
Return type

MfccConfig

__init__(dither=0.0, window_type='povey', frame_length=0.025, frame_shift=0.01, remove_dc_offset=True, round_to_power_of_two=True, energy_floor=1e-10, min_duration=0.0, preemphasis_coefficient=0.97, raw_energy=True, low_freq=20.0, high_freq=-400.0, num_mel_bins=23, use_energy=False, vtln_low=100.0, vtln_high=-500.0, vtln_warp=1.0, cepstral_lifter=22.0, num_ceps=13)
class lhotse.features.mfcc.Mfcc(config=None)[source]

MFCC feature extractor based on torchaudio.compliance.kaldi.mfcc function.

name = 'mfcc'
config_type

alias of lhotse.features.mfcc.MfccConfig

feature_dim(sampling_rate)[source]
Return type

int

__init__(config=None)
static compute_energy(features)

Compute the total energy of a feature matrix. How the energy is computed depends on a particular type of features. It is expected that when implemented, compute_energy will never return zero.

Parameters

features (ndarray) – A feature matrix.

Return type

float

Returns

A positive float value of the signal energy.

property device: Union[str, torch.device]
Return type

Union[str, device]

extract(samples, sampling_rate)

Defines how to extract features using a numpy ndarray of audio samples and the sampling rate.

Return type

ndarray

Returns

a numpy ndarray representing the feature matrix.

extract_batch(samples, sampling_rate)

Performs batch extraction. It is not guaranteed to be faster than FeatureExtractor.extract() – it depends on whether the implementation of a particular feature extractor supports accelerated batch computation.

Note

Unless overridden by child classes, it defaults to sequentially calling FeatureExtractor.extract() on the inputs.

Note

This method should support variable length inputs.

Return type

Union[ndarray, List[ndarray]]

extract_from_recording_and_store(recording, storage, offset=0, duration=None, channels=None, augment_fn=None)

Extract the features from a Recording in a full pipeline:

  • load audio from disk;

  • optionally, perform audio augmentation;

  • extract the features;

  • save them to disk in a specified directory;

  • return a Features object with a description of the extracted features and the source data used.

Parameters
  • recording (Recording) – a Recording that specifies what’s the input audio.

  • storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices.

  • offset (float) – an optional offset in seconds for where to start reading the recording.

  • duration (Optional[float]) – an optional duration specifying how much audio to load from the recording.

  • channels (Union[int, List[int], None]) – an optional int or list of ints, specifying the channels; by default, all channels will be used.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.

Return type

Features

Returns

a Features manifest item for the extracted feature matrix.

extract_from_samples_and_store(samples, storage, sampling_rate, offset=0, channel=None, augment_fn=None)

Extract the features from an array of audio samples in a full pipeline:

  • optional audio augmentation;

  • extract the features;

  • save them to disk in a specified directory;

  • return a Features object with a description of the extracted features.

Note that, unlike extract_from_recording_and_store, the returned Features object might not be suitable to store in a FeatureSet, as it does not reference any particular Recording. Instead, this method is useful when extracting features from cuts - especially MixedCut instances, which may be created from multiple recordings and channels.

Parameters
  • samples (ndarray) – a numpy ndarray with the audio samples.

  • sampling_rate (int) – integer sampling rate of samples.

  • storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices.

  • offset (float) – an offset in seconds for where to start reading the recording - when used for Cut feature extraction, must be equal to Cut.start.

  • channel (Optional[int]) – an optional channel number to insert into Features manifest.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.

Return type

Features

Returns

a Features manifest item for the extracted feature matrix (it is not written to disk).

property frame_shift: float
Return type

float

classmethod from_dict(data)
Return type

FeatureExtractor

classmethod from_yaml(path)
Return type

FeatureExtractor

static mix(features_a, features_b, energy_scaling_factor_b)

Perform feature-domain mix of two signals, a and b, and return the mixed signal.

Parameters
  • features_a (ndarray) – Left-hand side (reference) signal.

  • features_b (ndarray) – Right-hand side (mixed-in) signal.

  • energy_scaling_factor_b (float) – A scaling factor for features_b energy. It is used to achieve a specific SNR. E.g. to mix with an SNR of 10dB when both features_a and features_b energies are 100, the features_b signal energy needs to be scaled by 0.1. Since different features (e.g. spectrogram, fbank, MFCC) require different combination of transformations (e.g. exp, log, sqrt, pow) to allow mixing of two signals, the exact place where to apply energy_scaling_factor_b to the signal is determined by the implementer.

Return type

ndarray

Returns

A mixed feature matrix.

to_dict()
Return type

Dict[str, Any]

to_yaml(path)
class lhotse.features.spectrogram.SpectrogramConfig(dither=0.0, window_type='povey', frame_length=0.025, frame_shift=0.01, remove_dc_offset=True, round_to_power_of_two=True, energy_floor=1e-10, min_duration=0.0, preemphasis_coefficient=0.97, raw_energy=True)[source]
dither: float = 0.0
window_type: str = 'povey'
frame_length: float = 0.025
frame_shift: float = 0.01
remove_dc_offset: bool = True
round_to_power_of_two: bool = True
energy_floor: float = 1e-10
min_duration: float = 0.0
preemphasis_coefficient: float = 0.97
raw_energy: bool = True
to_dict()[source]
Return type

Dict[str, Any]

static from_dict(data)[source]
Return type

SpectrogramConfig

__init__(dither=0.0, window_type='povey', frame_length=0.025, frame_shift=0.01, remove_dc_offset=True, round_to_power_of_two=True, energy_floor=1e-10, min_duration=0.0, preemphasis_coefficient=0.97, raw_energy=True)
class lhotse.features.spectrogram.Spectrogram(config=None)[source]

Log spectrogram feature extractor based on torchaudio.compliance.kaldi.spectrogram function.

name = 'spectrogram'
config_type

alias of lhotse.features.spectrogram.SpectrogramConfig

feature_dim(sampling_rate)[source]
Return type

int

static mix(features_a, features_b, energy_scaling_factor_b)[source]

Perform feature-domain mix of two signals, a and b, and return the mixed signal.

Parameters
  • features_a (ndarray) – Left-hand side (reference) signal.

  • features_b (ndarray) – Right-hand side (mixed-in) signal.

  • energy_scaling_factor_b (float) – A scaling factor for features_b energy. It is used to achieve a specific SNR. E.g. to mix with an SNR of 10dB when both features_a and features_b energies are 100, the features_b signal energy needs to be scaled by 0.1. Since different features (e.g. spectrogram, fbank, MFCC) require different combination of transformations (e.g. exp, log, sqrt, pow) to allow mixing of two signals, the exact place where to apply energy_scaling_factor_b to the signal is determined by the implementer.

Return type

ndarray

Returns

A mixed feature matrix.

static compute_energy(features)[source]

Compute the total energy of a feature matrix. How the energy is computed depends on a particular type of features. It is expected that when implemented, compute_energy will never return zero.

Parameters

features (ndarray) – A feature matrix.

Return type

float

Returns

A positive float value of the signal energy.

__init__(config=None)
property device: Union[str, torch.device]
Return type

Union[str, device]

extract(samples, sampling_rate)

Defines how to extract features using a numpy ndarray of audio samples and the sampling rate.

Return type

ndarray

Returns

a numpy ndarray representing the feature matrix.

extract_batch(samples, sampling_rate)

Performs batch extraction. It is not guaranteed to be faster than FeatureExtractor.extract() – it depends on whether the implementation of a particular feature extractor supports accelerated batch computation.

Note

Unless overridden by child classes, it defaults to sequentially calling FeatureExtractor.extract() on the inputs.

Note

This method should support variable length inputs.

Return type

Union[ndarray, List[ndarray]]

extract_from_recording_and_store(recording, storage, offset=0, duration=None, channels=None, augment_fn=None)

Extract the features from a Recording in a full pipeline:

  • load audio from disk;

  • optionally, perform audio augmentation;

  • extract the features;

  • save them to disk in a specified directory;

  • return a Features object with a description of the extracted features and the source data used.

Parameters
  • recording (Recording) – a Recording that specifies what’s the input audio.

  • storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices.

  • offset (float) – an optional offset in seconds for where to start reading the recording.

  • duration (Optional[float]) – an optional duration specifying how much audio to load from the recording.

  • channels (Union[int, List[int], None]) – an optional int or list of ints, specifying the channels; by default, all channels will be used.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.

Return type

Features

Returns

a Features manifest item for the extracted feature matrix.

extract_from_samples_and_store(samples, storage, sampling_rate, offset=0, channel=None, augment_fn=None)

Extract the features from an array of audio samples in a full pipeline:

  • optional audio augmentation;

  • extract the features;

  • save them to disk in a specified directory;

  • return a Features object with a description of the extracted features.

Note that, unlike extract_from_recording_and_store, the returned Features object might not be suitable to store in a FeatureSet, as it does not reference any particular Recording. Instead, this method is useful when extracting features from cuts - especially MixedCut instances, which may be created from multiple recordings and channels.

Parameters
  • samples (ndarray) – a numpy ndarray with the audio samples.

  • sampling_rate (int) – integer sampling rate of samples.

  • storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices.

  • offset (float) – an offset in seconds for where to start reading the recording - when used for Cut feature extraction, must be equal to Cut.start.

  • channel (Optional[int]) – an optional channel number to insert into Features manifest.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.

Return type

Features

Returns

a Features manifest item for the extracted feature matrix (it is not written to disk).

property frame_shift: float
Return type

float

classmethod from_dict(data)
Return type

FeatureExtractor

classmethod from_yaml(path)
Return type

FeatureExtractor

to_dict()
Return type

Dict[str, Any]

to_yaml(path)

Librosa filter-bank

class lhotse.features.librosa_fbank.LibrosaFbankConfig(sampling_rate=22050, fft_size=1024, hop_size=256, win_length=None, window='hann', num_mel_bins=80, fmin=80, fmax=7600)[source]

Default librosa config with values consistent with various TTS projects.

This config is intended for use with popular TTS projects such as ParallelWaveGAN (https://github.com/kan-bayashi/ParallelWaveGAN).

Warning

You may need to normalize your features.

sampling_rate: int = 22050
fft_size: int = 1024
hop_size: int = 256
win_length: Optional[int] = None
window: str = 'hann'
num_mel_bins: int = 80
fmin: int = 80
fmax: int = 7600
to_dict()[source]
Return type

Dict[str, Any]

static from_dict(data)[source]
Return type

LibrosaFbankConfig

__init__(sampling_rate=22050, fft_size=1024, hop_size=256, win_length=None, window='hann', num_mel_bins=80, fmin=80, fmax=7600)
lhotse.features.librosa_fbank.pad_or_truncate_features(feats, expected_num_frames, abs_tol=1, pad_value=-23.025850929940457)[source]
lhotse.features.librosa_fbank.logmelfilterbank(audio, sampling_rate, fft_size=1024, hop_size=256, win_length=None, window='hann', num_mel_bins=80, fmin=80, fmax=7600, eps=1e-10)[source]

Compute log-Mel filterbank feature.

Parameters
  • audio (ndarray) – Audio signal (T,).

  • sampling_rate (int) – Sampling rate.

  • fft_size (int) – FFT size.

  • hop_size (int) – Hop size.

  • win_length (int) – Window length. If set to None, it will be the same as fft_size.

  • window (str) – Window function type.

  • num_mel_bins (int) – Number of mel basis.

  • fmin (int) – Minimum frequency in mel basis calculation.

  • fmax (int) – Maximum frequency in mel basis calculation.

  • eps (float) – Epsilon value to avoid inf in log calculation.

Returns

ndarray: Log Mel filterbank feature (#source_feats, num_mel_bins).

class lhotse.features.librosa_fbank.LibrosaFbank(config=None)[source]

Librosa fbank feature extractor

Differs from Fbank extractor in that it uses librosa backend for stft and mel scale calculations. It can be easily configured to be compatible with existing speech-related projects that use librosa features.
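For instance, it can be instantiated with a custom config and applied to raw samples (an illustrative sketch with random audio standing in for real data):

>>> import numpy as np
>>> from lhotse.features.librosa_fbank import LibrosaFbank, LibrosaFbankConfig
>>> extractor = LibrosaFbank(LibrosaFbankConfig(sampling_rate=22050, num_mel_bins=80))
>>> samples = np.random.uniform(-1, 1, 22050).astype(np.float32)  # 1 second of audio
>>> feats = extractor.extract(samples, sampling_rate=22050)  # shape: (num_frames, 80)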

name = 'librosa-fbank'
config_type

alias of lhotse.features.librosa_fbank.LibrosaFbankConfig

property frame_shift: float
Return type

float

feature_dim(sampling_rate)[source]
Return type

int

extract(samples, sampling_rate)[source]

Defines how to extract features using a numpy ndarray of audio samples and the sampling rate.

Return type

ndarray

Returns

a numpy ndarray representing the feature matrix.

static mix(features_a, features_b, energy_scaling_factor_b)[source]

Perform feature-domain mix of two signals, a and b, and return the mixed signal.

Parameters
  • features_a (ndarray) – Left-hand side (reference) signal.

  • features_b (ndarray) – Right-hand side (mixed-in) signal.

  • energy_scaling_factor_b (float) – A scaling factor for features_b energy. It is used to achieve a specific SNR. E.g. to mix with an SNR of 10dB when both features_a and features_b energies are 100, the features_b signal energy needs to be scaled by 0.1. Since different features (e.g. spectrogram, fbank, MFCC) require different combinations of transformations (e.g. exp, log, sqrt, pow) to allow mixing of two signals, the exact place to apply energy_scaling_factor_b to the signal is determined by the implementer.

Return type

ndarray

Returns

A mixed feature matrix.

static compute_energy(features)[source]

Compute the total energy of a feature matrix. How the energy is computed depends on a particular type of features. It is expected that when implemented, compute_energy will never return zero.

Parameters

features (ndarray) – A feature matrix.

Return type

float

Returns

A positive float value of the signal energy.

__init__(config=None)
property device: Union[str, torch.device]
Return type

Union[str, device]

extract_batch(samples, sampling_rate)

Performs batch extraction. It is not guaranteed to be faster than FeatureExtractor.extract() – it depends on whether the implementation of a particular feature extractor supports accelerated batch computation.

Note

Unless overridden by child classes, it defaults to sequentially calling FeatureExtractor.extract() on the inputs.

Note

This method should support variable length inputs.

Return type

Union[ndarray, List[ndarray]]

extract_from_recording_and_store(recording, storage, offset=0, duration=None, channels=None, augment_fn=None)

Extract the features from a Recording in a full pipeline:

  • load audio from disk;

  • optionally, perform audio augmentation;

  • extract the features;

  • save them to disk in a specified directory;

  • return a Features object with a description of the extracted features and the source data used.

Parameters
  • recording (Recording) – a Recording that specifies what’s the input audio.

  • storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices.

  • offset (float) – an optional offset in seconds for where to start reading the recording.

  • duration (Optional[float]) – an optional duration specifying how much audio to load from the recording.

  • channels (Union[int, List[int], None]) – an optional int or list of ints, specifying the channels; by default, all channels will be used.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.

Return type

Features

Returns

a Features manifest item for the extracted feature matrix.

extract_from_samples_and_store(samples, storage, sampling_rate, offset=0, channel=None, augment_fn=None)

Extract the features from an array of audio samples in a full pipeline:

  • optional audio augmentation;

  • extract the features;

  • save them to disk in a specified directory;

  • return a Features object with a description of the extracted features.

Note that unlike in extract_from_recording_and_store, the returned Features object might not be suitable to store in a FeatureSet, as it does not reference any particular Recording. Instead, this method is useful when extracting features from cuts - especially MixedCut instances, which may be created from multiple recordings and channels.

Parameters
  • samples (ndarray) – a numpy ndarray with the audio samples.

  • sampling_rate (int) – integer sampling rate of samples.

  • storage (FeaturesWriter) – a FeaturesWriter object that will handle storing the feature matrices.

  • offset (float) – an offset in seconds for where to start reading the recording - when used for Cut feature extraction, must be equal to Cut.start.

  • channel (Optional[int]) – an optional channel number to insert into Features manifest.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional WavAugmenter instance to modify the waveform before feature extraction.

Return type

Features

Returns

a Features manifest item for the extracted feature matrix (it is not written to disk).

classmethod from_dict(data)
Return type

FeatureExtractor

classmethod from_yaml(path)
Return type

FeatureExtractor

to_dict()
Return type

Dict[str, Any]

to_yaml(path)

Feature storage

class lhotse.features.io.FeaturesWriter[source]

FeaturesWriter defines the interface of how to store numpy arrays in a particular storage backend. This backend could either be:

  • separate files on a local filesystem;

  • a single file with multiple arrays;

  • cloud storage;

  • etc.

Each class inheriting from FeaturesWriter must define:

  • the write() method, which defines the storing operation (accepts a key used to place the value array in the storage);

  • the storage_path() property, which is either a common directory for the files, the name of the file storing multiple arrays, the name of the cloud bucket, etc.;

  • the name() property that is unique to this particular storage mechanism – it is stored in the features manifests (metadata) and used to automatically deduce the backend when loading the features.

Each FeaturesWriter can also be used as a context manager, as some implementations might need to free a resource after the writing is finalized. By default nothing happens in the context manager functions, and this can be modified by the inheriting subclasses.

Example:

with MyWriter('some/path') as storage:
    extractor.extract_from_recording_and_store(recording, storage)

The features loading must be defined separately in a class inheriting from FeaturesReader.
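As a minimal sketch of the interface (illustration only, not a built-in backend), a toy writer that keeps arrays in memory could look like this:

>>> from lhotse.features.io import FeaturesWriter
>>> class InMemoryWriter(FeaturesWriter):
...     name = 'in_memory'  # hypothetical backend name
...     def __init__(self, storage_path, *args, **kwargs):
...         self._path = storage_path
...         self.arrays = {}  # key -> numpy array
...     @property
...     def storage_path(self):
...         return self._path
...     def write(self, key, value):
...         self.arrays[key] = value
...         return key  # the storage key recorded in the Features manifest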

abstract property name: str
Return type

str

abstract property storage_path: str
Return type

str

abstract write(key, value)[source]
Return type

str

class lhotse.features.io.FeaturesReader[source]

FeaturesReader defines the interface of how to load numpy arrays from a particular storage backend. This backend could either be:

  • separate files on a local filesystem;

  • a single file with multiple arrays;

  • cloud storage;

  • etc.

Each class inheriting from FeaturesReader must define:

  • the read() method, which defines the loading operation (accepts the key to locate the array in the storage and returns it). The read method should support selecting only a subset of the feature matrix, with the bounds expressed as the arguments left_offset_frames and right_offset_frames. It's up to the Reader implementation to load only the required part, or to trim it to that range only after loading. It is assumed that the time dimension is always the first one.

  • the name() property that is unique to this particular storage mechanism – it is stored in the features manifests (metadata) and used to automatically deduce the backend when loading the features.

The features writing must be defined separately in a class inheriting from FeaturesWriter.

abstract property name: str
Return type

str

abstract read(key, left_offset_frames=0, right_offset_frames=None)[source]
Return type

ndarray

lhotse.features.io.available_storage_backends()[source]
Return type

List[str]

lhotse.features.io.register_reader(cls)[source]

Decorator used to add a new FeaturesReader to Lhotse’s registry.

Example:

@register_reader
class MyFeatureReader(FeaturesReader):
    ...
lhotse.features.io.register_writer(cls)[source]

Decorator used to add a new FeaturesWriter to Lhotse’s registry.

Example:

@register_writer
class MyFeatureWriter(FeaturesWriter):
    ...
lhotse.features.io.get_reader(name)[source]

Find a FeaturesReader sub-class that corresponds to the provided name and return its type.

Example:

reader_type = get_reader("lilcom_files")
reader = reader_type("/storage/features/")

Return type

Type[FeaturesReader]

lhotse.features.io.get_writer(name)[source]

Find a FeaturesWriter sub-class that corresponds to the provided name and return its type.

Example:

writer_type = get_writer("lilcom_files")
writer = writer_type("/storage/features/")

Return type

Type[FeaturesWriter]

class lhotse.features.io.LilcomFilesReader(storage_path, *args, **kwargs)[source]

Reads Lilcom-compressed files from a directory on the local filesystem. storage_path corresponds to the directory path; storage_key for each utterance is the name of the file in that directory.

name = 'lilcom_files'
__init__(storage_path, *args, **kwargs)[source]
read(key, left_offset_frames=0, right_offset_frames=None)[source]
Return type

ndarray

class lhotse.features.io.LilcomFilesWriter(storage_path, tick_power=-5, *args, **kwargs)[source]

Writes Lilcom-compressed files to a directory on the local filesystem. storage_path corresponds to the directory path; storage_key for each utterance is the name of the file in that directory.

name = 'lilcom_files'
__init__(storage_path, tick_power=-5, *args, **kwargs)[source]
property storage_path: str
Return type

str

write(key, value)[source]
Return type

str

class lhotse.features.io.NumpyFilesReader(storage_path, *args, **kwargs)[source]

Reads non-compressed numpy arrays from files in a directory on the local filesystem. storage_path corresponds to the directory path; storage_key for each utterance is the name of the file in that directory.

name = 'numpy_files'
__init__(storage_path, *args, **kwargs)[source]
read(key, left_offset_frames=0, right_offset_frames=None)[source]
Return type

ndarray

class lhotse.features.io.NumpyFilesWriter(storage_path, *args, **kwargs)[source]

Writes non-compressed numpy arrays to files in a directory on the local filesystem. storage_path corresponds to the directory path; storage_key for each utterance is the name of the file in that directory.

name = 'numpy_files'
__init__(storage_path, *args, **kwargs)[source]
property storage_path: str
Return type

str

write(key, value)[source]
Return type

str

lhotse.features.io.lookup_cache_or_open(storage_path)[source]

Helper internal function used in HDF5 readers. It opens the HDF5 files and keeps their handles open in a global program cache to avoid an excessive number of syscalls when the Reader class is instantiated and destroyed in a loop repeatedly (a frequent use-case).

The file handles can be freed at any time by calling close_cached_file_handles().

lhotse.features.io.lookup_chunk_size(h5_file_handle)[source]

Helper internal function to retrieve the chunk size from an HDF5 file. Helps avoid unnecessary repeated disk reads.

Return type

int

lhotse.features.io.close_cached_file_handles()[source]

Closes the cached file handles in lookup_cache_or_open (see its docs for more details).

Return type

None

class lhotse.features.io.NumpyHdf5Reader(storage_path, *args, **kwargs)[source]

Reads non-compressed numpy arrays from a HDF5 file with a “flat” layout. Each array is stored as a separate HDF Dataset because their shapes (numbers of frames) may vary. storage_path corresponds to the HDF5 file path; storage_key for each utterance is the key corresponding to the array (i.e. HDF5 “Group” name).

name = 'numpy_hdf5'
__init__(storage_path, *args, **kwargs)[source]
read(key, left_offset_frames=0, right_offset_frames=None)[source]
Return type

ndarray

class lhotse.features.io.NumpyHdf5Writer(storage_path, mode='w', *args, **kwargs)[source]

Writes non-compressed numpy arrays to a HDF5 file with a “flat” layout. Each array is stored as a separate HDF Dataset because their shapes (numbers of frames) may vary. storage_path corresponds to the HDF5 file path; storage_key for each utterance is the key corresponding to the array (i.e. HDF5 “Group” name).

Internally, this class opens the file lazily so that this object can be passed between processes without issues. This simplifies the parallel feature extraction code.

name = 'numpy_hdf5'
__init__(storage_path, mode='w', *args, **kwargs)[source]
Parameters
  • storage_path (Union[Path, str]) – Path under which we’ll create the HDF5 file. We will add a .h5 suffix if it is not already in storage_path.

  • mode (str) – Modes supported by h5py:

    w: Create file, truncate if exists (default)

    w- or x: Create file, fail if exists

    a: Read/write if exists, create otherwise

property storage_path: str
Return type

str

write(key, value)[source]
Return type

str

close()[source]
Return type

None
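A write-then-read round trip might look like this (illustrative sketch; 'feats.h5' is a hypothetical path):

>>> import numpy as np
>>> from lhotse.features.io import NumpyHdf5Writer, NumpyHdf5Reader
>>> feats = np.random.randn(300, 80).astype(np.float32)
>>> with NumpyHdf5Writer('feats.h5') as writer:
...     key = writer.write('utt1', feats)
>>> reader = NumpyHdf5Reader('feats.h5')
>>> middle = reader.read(key, left_offset_frames=100, right_offset_frames=200)
>>> middle.shape
(100, 80)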

class lhotse.features.io.LilcomHdf5Reader(storage_path, *args, **kwargs)[source]

Reads lilcom-compressed numpy arrays from a HDF5 file with a “flat” layout. Each array is stored as a separate HDF Dataset because their shapes (numbers of frames) may vary. storage_path corresponds to the HDF5 file path; storage_key for each utterance is the key corresponding to the array (i.e. HDF5 “Group” name).

name = 'lilcom_hdf5'
__init__(storage_path, *args, **kwargs)[source]
read(key, left_offset_frames=0, right_offset_frames=None)[source]
Return type

ndarray

class lhotse.features.io.LilcomHdf5Writer(storage_path, tick_power=-5, mode='w', *args, **kwargs)[source]

Writes lilcom-compressed numpy arrays to a HDF5 file with a “flat” layout. Each array is stored as a separate HDF Dataset because their shapes (numbers of frames) may vary. storage_path corresponds to the HDF5 file path; storage_key for each utterance is the key corresponding to the array (i.e. HDF5 “Group” name).

name = 'lilcom_hdf5'
__init__(storage_path, tick_power=-5, mode='w', *args, **kwargs)[source]
Parameters
  • storage_path (Union[Path, str]) – Path under which we’ll create the HDF5 file. We will add a .h5 suffix if it is not already in storage_path.

  • tick_power (int) – Determines the lilcom compression accuracy; the input will be compressed to integer multiples of 2^tick_power.

  • mode (str) – Modes supported by h5py:

    w: Create file, truncate if exists (default)

    w- or x: Create file, fail if exists

    a: Read/write if exists, create otherwise

property storage_path: str
Return type

str

write(key, value)[source]
Return type

str

close()[source]
Return type

None

class lhotse.features.io.ChunkedLilcomHdf5Reader(storage_path, *args, **kwargs)[source]

Reads lilcom-compressed numpy arrays from a HDF5 file with chunked lilcom storage. Each feature matrix is stored in an array of chunks - binary data compressed with lilcom. Upon reading, we check how many chunks need to be retrieved to avoid excessive I/O.

storage_path corresponds to the HDF5 file path; storage_key for each utterance is the key corresponding to the array (i.e. HDF5 “Group” name).

name = 'chunked_lilcom_hdf5'
__init__(storage_path, *args, **kwargs)[source]
read(key, left_offset_frames=0, right_offset_frames=None)[source]
Return type

ndarray

class lhotse.features.io.ChunkedLilcomHdf5Writer(storage_path, tick_power=-5, chunk_size=100, mode='w', *args, **kwargs)[source]

Writes lilcom-compressed numpy arrays to a HDF5 file with chunked lilcom storage. Each feature matrix is stored in an array of chunks - binary data compressed with lilcom. Upon reading, we check how many chunks need to be retrieved to avoid excessive I/O.

storage_path corresponds to the HDF5 file path; storage_key for each utterance is the key corresponding to the array (i.e. HDF5 “Group” name).

name = 'chunked_lilcom_hdf5'
__init__(storage_path, tick_power=-5, chunk_size=100, mode='w', *args, **kwargs)[source]
Parameters
  • storage_path (Union[Path, str]) – Path under which we’ll create the HDF5 file. We will add a .h5 suffix if it is not already in storage_path.

  • tick_power (int) – Determines the lilcom compression accuracy; the input will be compressed to integer multiples of 2^tick_power.

  • chunk_size (int) – How many frames to store per chunk. Too low a number will require many reads for long feature matrices; too high a number will cause more redundant data to be read.

  • mode (str) – Modes supported by h5py:

    w: Create file, truncate if exists (default)

    w- or x: Create file, fail if exists

    a: Read/write if exists, create otherwise

property storage_path: str
Return type

str

write(key, value)[source]
Return type

str

close()[source]
Return type

None

class lhotse.features.io.LilcomURLReader(storage_path, *args, **kwargs)[source]

Downloads Lilcom-compressed files from a URL (S3, GCP, Azure, HTTP, etc.). storage_path corresponds to the root URL (e.g. “s3://my-data-bucket”) storage_key will be concatenated to storage_path to form a full URL (e.g. “my-feature-file.llc”)

Caution

Requires smart_open to be installed (pip install smart_open).

name = 'lilcom_url'
__init__(storage_path, *args, **kwargs)[source]
read(key, left_offset_frames=0, right_offset_frames=None)[source]
Return type

ndarray

class lhotse.features.io.LilcomURLWriter(storage_path, tick_power=-5, *args, **kwargs)[source]

Writes Lilcom-compressed files to a URL (S3, GCP, Azure, HTTP, etc.). storage_path corresponds to the root URL (e.g. “s3://my-data-bucket”) storage_key will be concatenated to storage_path to form a full URL (e.g. “my-feature-file.llc”)

Caution

Requires smart_open to be installed (pip install smart_open).

name = 'lilcom_url'
__init__(storage_path, tick_power=-5, *args, **kwargs)[source]
property storage_path: str
Return type

str

write(key, value)[source]
Return type

str

class lhotse.features.io.KaldiReader(storage_path, *args, **kwargs)[source]

Reads Kaldi’s “feats.scp” file using kaldiio. storage_path corresponds to the path to feats.scp. storage_key corresponds to the utterance-id in Kaldi.

Caution

Requires kaldiio to be installed (pip install kaldiio).

name = 'kaldiio'
__init__(storage_path, *args, **kwargs)[source]
read(key, left_offset_frames=0, right_offset_frames=None)[source]
Return type

ndarray

Feature-domain mixing

class lhotse.features.mixer.FeatureMixer(feature_extractor, base_feats, frame_shift, padding_value=-1000.0)[source]

Utility class to mix multiple feature matrices into a single one. It should be instantiated separately for each mixing session (i.e. each MixedCut will create a separate FeatureMixer to mix its tracks). It is initialized with a numpy array of features (typically float32) that represents the “reference” signal for the mix. Other signals can be mixed into it with different time offsets and SNRs using the add_to_mix method. The time offset is relative to the start of the reference signal (only positive values are supported). The SNR is relative to the energy of the signal used to initialize the FeatureMixer.

It relies on the FeatureExtractor to have defined mix and compute_energy methods, so that the FeatureMixer knows how to scale and add two feature matrices together.

__init__(feature_extractor, base_feats, frame_shift, padding_value=-1000.0)[source]
Parameters
  • feature_extractor (FeatureExtractor) – The FeatureExtractor instance that specifies how to mix the features.

  • base_feats (ndarray) – The features used to initialize the FeatureMixer are a point of reference in terms of energy and offset for all features mixed into them.

  • frame_shift (float) – Required to correctly compute offset and padding during the mix.

  • padding_value (float) – The value used to pad the shorter features during the mix. This value is adequate only for log space features. For non-log space features, e.g. energies, use either 0 or a small positive value like 1e-5.

property num_features
property unmixed_feats: numpy.ndarray

Return a numpy ndarray with the shape (num_tracks, num_frames, num_features), where each track’s feature matrix is padded and scaled according to the offsets and SNRs used in the add_to_mix calls.

Return type

ndarray

property mixed_feats: numpy.ndarray

Return a numpy ndarray with the shape (num_frames, num_features) - a mono mixed feature matrix of the tracks supplied with add_to_mix calls.

Return type

ndarray

add_to_mix(feats, sampling_rate, snr=None, offset=0.0)[source]

Add feature matrix of a new track into the mix.

Parameters
  • feats (ndarray) – A 2D feature matrix to be mixed in.

  • sampling_rate (int) – The sampling rate of feats.

  • snr (Optional[float]) – Signal-to-noise ratio, assuming feats represents noise (positive SNR - lower feats energy, negative SNR - higher feats energy).

  • offset (float) – How many seconds to shift feats in time. For mixing, the signal will be padded before the start with low energy values.
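For example, background noise features can be mixed under speech features at a given SNR (an illustrative sketch with random matrices standing in for real log-mel features):

>>> import numpy as np
>>> from lhotse import Fbank
>>> from lhotse.features.mixer import FeatureMixer
>>> speech = np.random.randn(500, 80).astype(np.float32)
>>> noise = np.random.randn(200, 80).astype(np.float32)
>>> mixer = FeatureMixer(Fbank(), base_feats=speech, frame_shift=0.01)
>>> mixer.add_to_mix(noise, sampling_rate=16000, snr=20, offset=1.0)
>>> mixed = mixer.mixed_feats  # shape: (num_frames, 80)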

Augmentation

Cuts

Data structures and tools used to create training/testing examples.

class lhotse.cut.Cut[source]

Caution

Cut is just an abstract class – the actual logic is implemented by its child classes (scroll down for references).

Cut is a base class for audio cuts. An “audio cut” is a subset of a Recording – it can also be thought of as a “view” or a pointer to a chunk of audio. It is not limited to audio data – cuts may also point to (sub-spans of) precomputed Features.

Cuts are different from SupervisionSegment in that they may be arbitrarily longer or shorter than supervisions; cuts may even contain multiple supervisions for creating contextual training data, and unsupervised regions that provide real or synthetic acoustic background context for the supervised segments.

The following example visualizes how a cut may represent a part of a single-channel recording with two utterances and some background noise in between:

                  Recording
|-------------------------------------------|
"Hey, Matt!"     "Yes?"        "Oh, nothing"
|----------|     |----|        |-----------|
           Cut1
|------------------------|

This scenario can be represented in code, using MonoCut, as:

>>> from lhotse import Recording, SupervisionSegment, MonoCut
>>> rec = Recording(id='rec1', duration=10.0, sampling_rate=8000, num_samples=80000, sources=[...])
>>> sups = [
...     SupervisionSegment(id='sup1', recording_id='rec1', start=0, duration=3.37, text='Hey, Matt!'),
...     SupervisionSegment(id='sup2', recording_id='rec1', start=4.5, duration=0.9, text='Yes?'),
...     SupervisionSegment(id='sup3', recording_id='rec1', start=6.9, duration=2.9, text='Oh, nothing'),
... ]
>>> cut = MonoCut(id='rec1-cut1', start=0.0, duration=6.0, channel=0, recording=rec,
...     supervisions=[sups[0], sups[1]])

Note

All Cut classes assume that the SupervisionSegment time boundaries are relative to the beginning of the cut. E.g. if the underlying Recording starts at 0s (always true), the cut starts at 100s, and the SupervisionSegment inside the cut starts at 3s, it really did start at the 103rd second of the recording. In some cases, the supervision might have a negative start, or a duration exceeding the duration of the cut; this means that the supervision in the recording extends beyond the cut.

Cut allows checking and reading the audio data or features data:

>>> assert cut.has_recording
>>> samples = cut.load_audio()
>>> if cut.has_features:
...     feats = cut.load_features()

It can be visualized, and listened to, inside Jupyter Notebooks:

>>> cut.plot_audio()
>>> cut.play_audio()
>>> cut.plot_features()

Cuts can be used with Lhotse’s FeatureExtractor to compute features.

>>> from lhotse import Fbank
>>> feats = cut.compute_features(extractor=Fbank())

It is also possible to use a FeaturesWriter to store the features and attach their manifest to a copy of the cut:

>>> from lhotse import LilcomHdf5Writer
>>> with LilcomHdf5Writer('feats.h5') as storage:
...     cut_with_feats = cut.compute_and_store_features(
...         extractor=Fbank(),
...         storage=storage
...     )

Cuts have several methods that allow their manipulation, transformation, and mixing. Some examples (see the respective methods documentation for details):

>>> cut_2_to_4s = cut.truncate(offset=2, duration=2)
>>> cut_padded = cut.pad(duration=10.0)
>>> cut_mixed = cut.mix(other_cut, offset_other_by=5.0, snr=20)
>>> cut_append = cut.append(other_cut)
>>> cut_24k = cut.resample(24000)
>>> cut_sp = cut.perturb_speed(1.1)
>>> cut_vp = cut.perturb_volume(2.)

Note

All cut transformations are performed lazily, on-the-fly, upon calling load_audio or load_features. The stored waveforms and features are untouched.

Caution

Operations on cuts are not mutating – they return modified copies of Cut objects, leaving the original object unmodified.

A Cut that contains multiple segments (SupervisionSegment) can be split into smaller cuts that correspond directly to supervisions:

>>> smaller_cuts = cut.trim_to_supervisions()

Cuts can be detached from parts of their metadata:

>>> cut_no_feat = cut.drop_features()
>>> cut_no_rec = cut.drop_recording()
>>> cut_no_sup = cut.drop_supervisions()

Finally, cuts provide convenience methods to compute feature frame and audio sample masks for supervised regions:

>>> sup_frames = cut.supervisions_feature_mask()
>>> sup_samples = cut.supervisions_audio_mask()


id: str
start: float
duration: float
sampling_rate: int
supervisions: List[lhotse.supervision.SupervisionSegment]
num_samples: Optional[int]
num_frames: Optional[int]
num_features: Optional[int]
frame_shift: Optional[float]
features_type: Optional[str]
has_recording: bool
has_features: bool
from_dict: Callable[[Dict], lhotse.cut.Cut]
load_audio: Callable[[], numpy.ndarray]
load_features: Callable[[], numpy.ndarray]
compute_and_store_features: Callable
drop_features: Callable
drop_recording: Callable
drop_supervisions: Callable
truncate: Callable
pad: Callable
resample: Callable
perturb_speed: Callable
perturb_tempo: Callable
perturb_volume: Callable
map_supervisions: Callable
filter_supervisions: Callable
with_features_path_prefix: Callable
with_recording_path_prefix: Callable
to_dict()[source]
Return type

dict

property trimmed_supervisions: List[lhotse.supervision.SupervisionSegment]

Return the supervisions in this Cut that have modified time boundaries so as not to exceed the Cut’s start or end.

Note that when cut.supervisions is called, the supervisions may have negative start values that indicate the supervision actually begins before the cut, or end values that exceed the Cut’s duration (it means the supervision continued in the original recording after the Cut’s ending).

Caution

For some tasks such as speech recognition (ASR), trimmed supervisions could result in corrupted training data. This is because a part of the transcript might actually reside outside of the cut.

Return type

List[SupervisionSegment]

mix(other, offset_other_by=0.0, snr=None, preserve_id=None)[source]

Refer to the lhotse.cut.mix() documentation.

Return type

MixedCut

append(other, snr=None, preserve_id=None)[source]

Append the other Cut after the current Cut. Conceptually the same as mix, but with an offset matching the current cut’s length. Optionally scale down (positive SNR) or scale up (negative SNR) the other cut. Returns a MixedCut, which only keeps the information about the mix; actual mixing is performed during the call to load_features.

Parameters

preserve_id (Optional[str]) – optional string (“left”, “right”). When specified, append will preserve the cut ID of the left- or right-hand side argument. Otherwise, a new random ID is generated.

Return type

MixedCut

compute_features(extractor, augment_fn=None)[source]

Compute the features from this cut. This cut has to be able to load audio.

Parameters
  • extractor (FeatureExtractor) – a FeatureExtractor instance used to compute the features.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – optional WavAugmenter instance for audio augmentation.

Return type

ndarray

Returns

a numpy ndarray with the computed features.

plot_audio()[source]

Display a plot of the waveform. Requires matplotlib to be installed.

play_audio()[source]

Display a Jupyter widget that allows listening to the waveform. Works only in Jupyter notebook/lab or similar (e.g. Colab).

plot_features()[source]

Display the feature matrix as an image. Requires matplotlib to be installed.

plot_alignment(alignment_type='word')[source]

Display the alignment on top of a spectrogram. Requires matplotlib to be installed.

trim_to_supervisions(keep_overlapping=True, min_duration=None, context_direction='center')[source]

Splits the current Cut into as many cuts as there are supervisions (SupervisionSegment). These cuts have the same start times and durations as the supervisions. When there are overlapping supervisions, they can be kept or discarded via the keep_overlapping flag.

For example, the following cut:

        Cut
|-----------------|
 Sup1
|----|  Sup2
   |-----------|

is transformed into two cuts:

 Cut1
|----|
 Sup1
|----|
   Sup2
   |-|
        Cut2
   |-----------|
   Sup1
   |-|
        Sup2
   |-----------|
Parameters
  • keep_overlapping (bool) – when False, it will discard parts of other supervisions that overlap with the main supervision. In the illustration above, it would discard Sup2 in Cut1 and Sup1 in Cut2.

  • min_duration (Optional[float]) – An optional duration in seconds; specifying this argument will extend the cuts that would have been shorter than min_duration with actual acoustic context in the recording/features. If there are supervisions present in the context, they are kept when keep_overlapping is true. If there is not enough context, the returned cut will be shorter than min_duration. If the supervision segment is longer than min_duration, the returned cut will be longer.

  • context_direction (Literal[‘center’, ‘left’, ‘right’, ‘random’]) – Which direction should the cut be expanded towards to include context. The value of “center” implies equal expansion to left and right; random uniformly samples a value between “left” and “right”.

Return type

List[Cut]

Returns

a list of cuts.
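Continuing the MonoCut example from the Cut documentation above (a 6-second cut with two supervisions), a call might look like this (illustrative sketch):

>>> subcuts = cut.trim_to_supervisions(keep_overlapping=False, min_duration=2.0)
>>> for c in subcuts:
...     print(c.duration, [s.text for s in c.supervisions])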

index_supervisions(index_mixed_tracks=False, keep_ids=None)[source]

Create a two-level index of supervision segments. It is a mapping from a Cut’s ID to an interval tree that contains the supervisions of that Cut.

The interval tree can be efficiently queried for overlapping and/or enveloping segments. It helps speed up some operations on Cuts of very long recordings (1h+) that contain many supervisions.

Parameters
  • index_mixed_tracks (bool) – Should the tracks of MixedCut’s be indexed as additional, separate entries.

  • keep_ids (Optional[Set[str]]) – If specified, we will only index the supervisions with the specified IDs.

Return type

Dict[str, IntervalTree]

Returns

a mapping from Cut ID to an interval tree of SupervisionSegments.

compute_and_store_recording(storage_path, augment_fn=None)[source]

Store this cut’s waveform as audio recording to disk.

Parameters
  • storage_path (Union[Path, str]) – The path to location where we will store the audio recordings.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional callable used for audio augmentation. Be careful with the types of augmentations used: if they modify the start/end/duration times of the cut and its supervisions, you will end up with incorrect supervision information when using this API. E.g. for speed perturbation, use CutSet.perturb_speed() instead.

Return type

MonoCut

Returns

a new MonoCut instance.

speakers_feature_mask(min_speaker_dim=None, speaker_to_idx_map=None, use_alignment_if_exists=None)[source]

Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_frames), and its values are 0 for nonspeech frames and 1 for speech frames for each respective speaker.

This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272

Parameters
  • min_speaker_dim (Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).

  • speaker_to_idx_map (Optional[Dict[str, int]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2)

  • use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, falls back on the supervision time spans.

Return type

ndarray
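For example, a fixed speaker ordering can be enforced across cuts (illustrative sketch; the speaker names are hypothetical, and the cut is assumed to have features and supervisions with the speaker field set):

>>> spk2idx = {'alice': 0, 'bob': 1}
>>> mask = cut.speakers_feature_mask(speaker_to_idx_map=spk2idx)
>>> mask.shape  # (num_speakers, num_frames)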

speakers_audio_mask(min_speaker_dim=None, speaker_to_idx_map=None, use_alignment_if_exists=None)[source]

Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_samples), and its values are 0 for nonspeech samples and 1 for speech samples for each respective speaker.

This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272

Parameters
  • min_speaker_dim (Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).

  • speaker_to_idx_map (Optional[Dict[str, int]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2)

  • use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, falls back on the supervision time spans.

Return type

ndarray

supervisions_feature_mask(use_alignment_if_exists=None)[source]

Return a 1D numpy array with value 1 for frames covered by at least one supervision, and 0 for frames not covered by any supervision.

Parameters

use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, falls back on the supervision time spans.

Return type

ndarray

supervisions_audio_mask(use_alignment_if_exists=None)[source]

Return a 1D numpy array with value 1 for samples covered by at least one supervision, and 0 for samples not covered by any supervision.

Parameters

use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, falls back on the supervision time spans.

Return type

ndarray

with_id(id_)[source]

Return a copy of the Cut with a new ID.

Return type

Cut

class lhotse.cut.MonoCut(id, start, duration, channel, supervisions=<factory>, features=None, recording=None)[source]

MonoCut is a Cut of a single channel of a Recording. In addition to Cut, it has a specified channel attribute. This is the most commonly used type of cut.

Please refer to the documentation of Cut to learn more about using cuts.


id: str
start: float
duration: float
channel: int
supervisions: List[lhotse.supervision.SupervisionSegment]
features: Optional[lhotse.features.base.Features] = None
recording: Optional[lhotse.audio.Recording] = None
property recording_id: str
Return type

str

property end: float
Return type

float

property has_features: bool
Return type

bool

property has_recording: bool
Return type

bool

property frame_shift: Optional[float]
Return type

Optional[float]

property num_frames: Optional[int]
Return type

Optional[int]

property num_samples: Optional[int]
Return type

Optional[int]

property num_features: Optional[int]
Return type

Optional[int]

property features_type: Optional[str]
Return type

Optional[str]

property sampling_rate: int
Return type

int

load_features()[source]

Load the features from the underlying storage and cut them to the relevant [begin, duration] region of the current MonoCut.

Return type

Optional[ndarray]

load_audio()[source]

Load the audio from the recording attached to this cut. The audio is trimmed to the [begin, end] range specified by the MonoCut.

Return type

Optional[ndarray]

Returns

a numpy ndarray with audio samples, with shape (1 <channel>, N <samples>)

drop_features()[source]

Return a copy of the current MonoCut, detached from features.

Return type

MonoCut

drop_recording()[source]

Return a copy of the current MonoCut, detached from recording.

Return type

MonoCut

drop_supervisions()[source]

Return a copy of the current MonoCut, detached from supervisions.

Return type

MonoCut

compute_and_store_features(extractor, storage, augment_fn=None, *args, **kwargs)[source]

Compute the features from this cut, store them on disk, and attach a feature manifest to this cut. This cut has to be able to load audio.

Parameters
  • extractor (FeatureExtractor) – a FeatureExtractor instance used to compute the features.

  • storage (FeaturesWriter) – a FeaturesWriter instance used to write the features to a storage.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional callable used for audio augmentation.

Return type

Cut

Returns

a new MonoCut instance with a Features manifest attached to it.

truncate(*, offset=0.0, duration=None, keep_excessive_supervisions=True, preserve_id=False, _supervisions_index=None)[source]

Returns a new MonoCut that is a sub-region of the current MonoCut.

Note that no operation is done on the actual features or recording - it’s only during the call to MonoCut.load_features() / MonoCut.load_audio() when the actual changes happen (a subset of features/audio is loaded).

Parameters
  • offset (float) – float (seconds), controls the start of the new cut relative to the current MonoCut’s start. E.g., if the current MonoCut starts at 10.0, and offset is 2.0, the new start is 12.0.

  • duration (Optional[float]) – optional float (seconds), controls the duration of the resulting MonoCut. By default, the duration is (end of the cut before truncation) - (offset).

  • keep_excessive_supervisions (bool) – bool. Since trimming may happen inside a SupervisionSegment, the caller has an option to either keep or discard such supervisions.

  • preserve_id (bool) – bool. Should the truncated cut keep the same ID or get a new, random one.

  • _supervisions_index (Optional[Dict[str, IntervalTree]]) – an IntervalTree; when passed, allows speeding up processing of Cuts with a very large number of supervisions. Intended as an internal parameter.

Return type

MonoCut

Returns

a new MonoCut instance. If the current MonoCut is shorter than the duration, return None.

pad(duration=None, num_frames=None, num_samples=None, pad_feat_value=- 23.025850929940457, direction='right', preserve_id=False)[source]

Return a new MixedCut, padded with zeros in the recording, and pad_feat_value in each feature bin.

The user can choose to pad either to a specific duration; a specific number of frames num_frames; or a specific number of samples num_samples. The three arguments are mutually exclusive.

Parameters
  • duration (Optional[float]) – The cut’s minimal duration after padding.

  • num_frames (Optional[int]) – The cut’s total number of frames after padding.

  • num_samples (Optional[int]) – The cut’s total number of samples after padding.

  • pad_feat_value (float) – A float value that’s used for padding the features. By default we assume a log-energy floor of approx. -23 (1e-10 after exp).

  • direction (str) – string, ‘left’, ‘right’ or ‘both’. Determines whether the padding is added before or after the cut.

  • preserve_id (bool) – When True, preserves the cut ID before padding. Otherwise, a new random ID is generated for the padded cut (default).

Return type

Cut

Returns

a padded MixedCut if duration is greater than this cut’s duration, otherwise self.
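Continuing the example above, padding a 6-second cut to 10 seconds could look like this (illustrative sketch):

>>> padded = cut.pad(duration=10.0, direction='right')
>>> padded.duration
10.0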

resample(sampling_rate, affix_id=False)[source]

Return a new MonoCut that will lazily resample the audio while reading it. This operation will drop the feature manifest, if attached. It does not affect the supervision.

Parameters
  • sampling_rate (int) – The new sampling rate.

  • affix_id (bool) – Should we modify the ID (useful if both versions of the same cut are going to be present in a single manifest).

Return type

MonoCut

Returns

a modified copy of the current MonoCut.

perturb_speed(factor, affix_id=True)[source]

Return a new MonoCut that will lazily perturb the speed while loading audio. The num_samples, start and duration fields are updated to reflect the shrinking/extending effect of speed. We are also updating the time markers of the underlying Recording and the supervisions.

Parameters
  • factor (float) – The speed will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • affix_id (bool) – When true, we will modify the MonoCut.id field by affixing it with “_sp{factor}”.

Return type

MonoCut

Returns

a modified copy of the current MonoCut.

perturb_tempo(factor, affix_id=True)[source]

Return a new MonoCut that will lazily perturb the tempo while loading audio.

Compared to speed perturbation, tempo preserves pitch. The num_samples, start and duration fields are updated to reflect the shrinking/extending effect of speed. We are also updating the time markers of the underlying Recording and the supervisions.

Parameters
  • factor (float) – The tempo will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • affix_id (bool) – When true, we will modify the MonoCut.id field by affixing it with “_tp{factor}”.

Return type

MonoCut

Returns

a modified copy of the current MonoCut.

perturb_volume(factor, affix_id=True)[source]

Return a new MonoCut that will lazily perturb the volume while loading audio.

Parameters
  • factor (float) – The volume will be adjusted this many times (e.g. factor=1.1 means 1.1x louder).

  • affix_id (bool) – When true, we will modify the MonoCut.id field by affixing it with “_vp{factor}”.

Return type

MonoCut

Returns

a modified copy of the current MonoCut.

map_supervisions(transform_fn)[source]

Modify the SupervisionSegments of this MonoCut by applying transform_fn to each of them.

Parameters

transform_fn (Callable[[SupervisionSegment], SupervisionSegment]) – a function that takes a supervision as an argument and returns a modified supervision.

Return type

Cut

Returns

a modified MonoCut.

filter_supervisions(predicate)[source]

Return a copy of the cut, keeping only the supervisions accepted by predicate.

Example:
>>> cut = cut.filter_supervisions(lambda s: s.id in supervision_ids)
>>> cut = cut.filter_supervisions(lambda s: s.duration < 5.0)
>>> cut = cut.filter_supervisions(lambda s: s.text is not None)
Parameters

predicate (Callable[[SupervisionSegment], bool]) – A callable that accepts SupervisionSegment and returns bool

Return type

Cut

Returns

a modified MonoCut

static from_dict(data)[source]
Return type

MonoCut

with_features_path_prefix(path)[source]
Return type

MonoCut

with_recording_path_prefix(path)[source]
Return type

MonoCut

__init__(id, start, duration, channel, supervisions=<factory>, features=None, recording=None)
append(other, snr=None, preserve_id=None)

Append the other Cut after the current Cut. Conceptually the same as mix, but with an offset matching the current cut’s length. Optionally scale down (positive SNR) or scale up (negative SNR) the other cut. Returns a MixedCut, which only keeps the information about the mix; actual mixing is performed during the call to load_features.

Parameters

preserve_id (Optional[str]) – optional string (“left”, “right”). When specified, append will preserve the cut ID of the left- or right-hand side argument. Otherwise, a new random ID is generated.

Return type

MixedCut

compute_and_store_recording(storage_path, augment_fn=None)

Store this cut’s waveform as audio recording to disk.

Parameters
  • storage_path (Union[Path, str]) – The path to location where we will store the audio recordings.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional callable used for audio augmentation. Be careful with the types of augmentations used: if they modify the start/end/duration times of the cut and its supervisions, you will end up with incorrect supervision information when using this API. E.g. for speed perturbation, use CutSet.perturb_speed() instead.

Return type

MonoCut

Returns

a new MonoCut instance.

compute_features(extractor, augment_fn=None)

Compute the features from this cut. This cut has to be able to load audio.

Parameters
  • extractor (FeatureExtractor) – a FeatureExtractor instance used to compute the features.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – optional WavAugmenter instance for audio augmentation.

Return type

ndarray

Returns

a numpy ndarray with the computed features.

index_supervisions(index_mixed_tracks=False, keep_ids=None)

Create a two-level index of supervision segments. It is a mapping from a Cut’s ID to an interval tree that contains the supervisions of that Cut.

The interval tree can be efficiently queried for overlapping and/or enveloping segments. It helps speed up some operations on Cuts of very long recordings (1h+) that contain many supervisions.

Parameters
  • index_mixed_tracks (bool) – Should the tracks of MixedCut’s be indexed as additional, separate entries.

  • keep_ids (Optional[Set[str]]) – If specified, we will only index the supervisions with the specified IDs.

Return type

Dict[str, IntervalTree]

Returns

a mapping from Cut ID to an interval tree of SupervisionSegments.

mix(other, offset_other_by=0.0, snr=None, preserve_id=None)

Refer to the lhotse.cut.mix() documentation.

Return type

MixedCut

play_audio()

Display a Jupyter widget that allows listening to the waveform. Works only in Jupyter notebook/lab or similar (e.g. Colab).

plot_alignment(alignment_type='word')

Display the alignment on top of a spectrogram. Requires matplotlib to be installed.

plot_audio()

Display a plot of the waveform. Requires matplotlib to be installed.

plot_features()

Display the feature matrix as an image. Requires matplotlib to be installed.

speakers_audio_mask(min_speaker_dim=None, speaker_to_idx_map=None, use_alignment_if_exists=None)

Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_samples), and its values are 0 for nonspeech samples and 1 for speech samples for each respective speaker.

This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272

Parameters
  • min_speaker_dim (Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).

  • speaker_to_idx_map (Optional[Dict[str, int]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2)

  • use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, falls back on the supervision time spans.

Return type

ndarray

speakers_feature_mask(min_speaker_dim=None, speaker_to_idx_map=None, use_alignment_if_exists=None)

Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_frames), and its values are 0 for nonspeech frames and 1 for speech frames for each respective speaker.

This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272

Parameters
  • min_speaker_dim (Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).

  • speaker_to_idx_map (Optional[Dict[str, int]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2)

  • use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, falls back on the supervision time spans.

Return type

ndarray

supervisions_audio_mask(use_alignment_if_exists=None)

Return a 1D numpy array with value 1 for samples covered by at least one supervision, and 0 for samples not covered by any supervision.

Parameters

use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, falls back on the supervision time spans.

Return type

ndarray

supervisions_feature_mask(use_alignment_if_exists=None)

Return a 1D numpy array with value 1 for frames covered by at least one supervision, and 0 for frames not covered by any supervision.

Parameters

use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use for generating the mask. If it does not exist, falls back on the supervision time spans.

Return type

ndarray

to_dict()
Return type

dict

trim_to_supervisions(keep_overlapping=True, min_duration=None, context_direction='center')

Splits the current Cut into as many cuts as there are supervisions (SupervisionSegment). These cuts have the same start times and durations as the supervisions. When there are overlapping supervisions, they can be kept or discarded via the keep_overlapping flag.

For example, the following cut:

        Cut
|-----------------|
 Sup1
|----|  Sup2
   |-----------|

is transformed into two cuts:

 Cut1
|----|
 Sup1
|----|
   Sup2
   |-|
        Cut2
   |-----------|
   Sup1
   |-|
        Sup2
   |-----------|
Parameters
  • keep_overlapping (bool) – when False, it will discard parts of other supervisions that overlap with the main supervision. In the illustration above, it would discard Sup2 in Cut1 and Sup1 in Cut2.

  • min_duration (Optional[float]) – An optional duration in seconds; specifying this argument will extend the cuts that would have been shorter than min_duration with actual acoustic context in the recording/features. If there are supervisions present in the context, they are kept when keep_overlapping is true. If there is not enough context, the returned cut will be shorter than min_duration. If the supervision segment is longer than min_duration, the returned cut will be longer.

  • context_direction (Literal[‘center’, ‘left’, ‘right’, ‘random’]) – Which direction should the cut be expanded towards to include context. The value of “center” implies equal expansion to left and right; random uniformly samples a value between “left” and “right”.

Return type

List[Cut]

Returns

a list of cuts.

property trimmed_supervisions: List[lhotse.supervision.SupervisionSegment]

Return the supervisions in this Cut that have modified time boundaries so as not to exceed the Cut’s start or end.

Note that when cut.supervisions is called, the supervisions may have negative start values that indicate the supervision actually begins before the cut, or end values that exceed the Cut’s duration (it means the supervision continued in the original recording after the Cut’s ending).

Caution

For some tasks such as speech recognition (ASR), trimmed supervisions could result in corrupted training data. This is because a part of the transcript might actually reside outside of the cut.

Return type

List[SupervisionSegment]

with_id(id_)

Return a copy of the Cut with a new ID.

Return type

Cut

class lhotse.cut.PaddingCut(id, duration, sampling_rate, feat_value, num_frames=None, num_features=None, frame_shift=None, num_samples=None)[source]

PaddingCut is a dummy Cut that doesn’t refer to actual recordings or features – it simply returns zero samples in the time domain and a specified features value in the feature domain. Its main role is to be appended to other cuts to make them evenly sized.

Please refer to the documentation of Cut to learn more about using cuts.


id: str
duration: float
sampling_rate: int
feat_value: float
num_frames: Optional[int] = None
num_features: Optional[int] = None
frame_shift: Optional[float] = None
num_samples: Optional[int] = None
property start: float
Return type

float

property end: float
Return type

float

property supervisions
property has_features: bool
Return type

bool

property has_recording: bool
Return type

bool

load_features(*args, **kwargs)[source]
Return type

Optional[ndarray]

load_audio(*args, **kwargs)[source]
Return type

Optional[ndarray]

truncate(*, offset=0.0, duration=None, keep_excessive_supervisions=True, preserve_id=False, **kwargs)[source]
Return type

PaddingCut

pad(duration=None, num_frames=None, num_samples=None, pad_feat_value=- 23.025850929940457, direction='right', preserve_id=False)[source]

Return a new MixedCut, padded with zeros in the recording, and pad_feat_value in each feature bin.

The user can choose to pad either to a specific duration; a specific number of frames num_frames; or a specific number of samples num_samples. The three arguments are mutually exclusive.

Parameters
  • duration (Optional[float]) – The cut’s minimal duration after padding.

  • num_frames (Optional[int]) – The cut’s total number of frames after padding.

  • num_samples (Optional[int]) – The cut’s total number of samples after padding.

  • pad_feat_value (float) – A float value that’s used for padding the features. By default we assume a log-energy floor of approx. -23 (1e-10 after exp).

  • direction (str) – string, ‘left’, ‘right’ or ‘both’. Determines whether the padding is added before or after the cut.

  • preserve_id (bool) – When True, preserves the cut ID from before padding. Otherwise, generates a new random ID (default).

Return type

Cut

Returns

a padded MixedCut if duration is greater than this cut’s duration, otherwise self.

resample(sampling_rate, affix_id=False)[source]

Return a new MonoCut that will lazily resample the audio while reading it. This operation will drop the feature manifest, if attached. It does not affect the supervision.

Parameters
  • sampling_rate (int) – The new sampling rate.

  • affix_id (bool) – Should we modify the ID (useful if both versions of the same cut are going to be present in a single manifest).

Return type

PaddingCut

Returns

a modified copy of the current PaddingCut.

perturb_speed(factor, affix_id=True)[source]

Return a new PaddingCut that will “mimic” the effect of speed perturbation on duration and num_samples.

Parameters
  • factor (float) – The speed will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • affix_id (bool) – When true, we will modify the PaddingCut.id field by affixing it with “_sp{factor}”.

Return type

PaddingCut

Returns

a modified copy of the current PaddingCut.

perturb_tempo(factor, affix_id=True)[source]

Return a new PaddingCut that will “mimic” the effect of tempo perturbation on duration and num_samples.

Compared to speed perturbation, tempo preserves pitch.

Parameters
  • factor (float) – The tempo will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • affix_id (bool) – When true, we will modify the PaddingCut.id field by affixing it with “_tp{factor}”.

Return type

PaddingCut

Returns

a modified copy of the current PaddingCut.

perturb_volume(factor, affix_id=True)[source]

Return a new PaddingCut that will “mimic” the effect of volume perturbation on amplitude of samples.

Parameters
  • factor (float) – The volume will be adjusted this many times (e.g. factor=1.1 means 1.1x louder).

  • affix_id (bool) – When true, we will modify the PaddingCut.id field by affixing it with “_vp{factor}”.

Return type

PaddingCut

Returns

a modified copy of the current PaddingCut.

drop_features()[source]

Return a copy of the current PaddingCut, detached from features.

Return type

PaddingCut

drop_recording()[source]

Return a copy of the current PaddingCut, detached from recording.

Return type

PaddingCut

drop_supervisions()[source]

Return a copy of the current PaddingCut, detached from supervisions.

Return type

PaddingCut

compute_and_store_features(extractor, *args, **kwargs)[source]

Returns a new PaddingCut with updated information about the feature dimension and the number of feature frames, depending on the extractor properties.

Return type

Cut

map_supervisions(transform_fn)[source]

Just for consistency with MonoCut and MixedCut.

Parameters

transform_fn (Callable[[Any], Any]) – a dummy function that is never actually called.

Return type

Cut

Returns

the PaddingCut itself.

filter_supervisions(predicate)[source]

Just for consistency with MonoCut and MixedCut.

Parameters

predicate (Callable[[SupervisionSegment], bool]) – A callable that accepts SupervisionSegment and returns bool

Return type

Cut

Returns

the PaddingCut itself.

static from_dict(data)[source]
Return type

PaddingCut

with_features_path_prefix(path)[source]
Return type

PaddingCut

with_recording_path_prefix(path)[source]
Return type

PaddingCut

__init__(id, duration, sampling_rate, feat_value, num_frames=None, num_features=None, frame_shift=None, num_samples=None)
append(other, snr=None, preserve_id=None)

Append the other Cut after the current Cut. Conceptually the same as mix, but with an offset matching the current cut’s length. Optionally scale down (positive SNR) or scale up (negative SNR) the other cut. Returns a MixedCut, which only keeps the information about the mix; the actual mixing is performed during the call to load_features.

Parameters

preserve_id (Optional[str]) – optional string (“left”, “right”). When specified, append will preserve the cut ID of the left- or right-hand side argument. Otherwise, a new random ID is generated.

Return type

MixedCut
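
For illustration (cut and noise_cut are hypothetical):

>>> combined = cut.append(noise_cut, snr=10)  # scale noise_cut down by 10 dB
>>> # combined is a MixedCut; mixing happens lazily in load_audio()/load_features()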

compute_and_store_recording(storage_path, augment_fn=None)

Store this cut’s waveform as audio recording to disk.

Parameters
  • storage_path (Union[Path, str]) – The path to location where we will store the audio recordings.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional callable used for audio augmentation. Be careful with the types of augmentations used: if they modify the start/end/duration times of the cut and its supervisions, you will end up with incorrect supervision information when using this API. E.g. for speed perturbation, use CutSet.perturb_speed() instead.

Return type

MonoCut

Returns

a new MonoCut instance.

compute_features(extractor, augment_fn=None)

Compute the features from this cut. This cut has to be able to load audio.

Parameters
  • extractor (FeatureExtractor) – a FeatureExtractor instance used to compute the features.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – optional WavAugmenter instance for audio augmentation.

Return type

ndarray

Returns

a numpy ndarray with the computed features.

index_supervisions(index_mixed_tracks=False, keep_ids=None)

Create a two-level index of supervision segments. It is a mapping from a Cut’s ID to an interval tree that contains the supervisions of that Cut.

The interval tree can be efficiently queried for overlapping and/or enveloping segments. It helps speed up some operations on Cuts of very long recordings (1h+) that contain many supervisions.

Parameters
  • index_mixed_tracks (bool) – Should the tracks of MixedCut’s be indexed as additional, separate entries.

  • keep_ids (Optional[Set[str]]) – If specified, we will only index the supervisions with the specified IDs.

Return type

Dict[str, IntervalTree]

Returns

a mapping from Cut ID to an interval tree of SupervisionSegments.
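
A query sketch, assuming the interval trees follow the intervaltree package’s slice-query semantics:

>>> index = cut.index_supervisions()
>>> tree = index[cut.id]
>>> overlapping = tree[2.0:5.0]  # supervisions overlapping the 2.0-5.0 s span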

mix(other, offset_other_by=0.0, snr=None, preserve_id=None)

Refer to the lhotse.cut.mix() documentation.

Return type

MixedCut

play_audio()

Display a Jupyter widget that allows listening to the waveform. Works only in Jupyter notebook/lab or similar (e.g. Colab).

plot_alignment(alignment_type='word')

Display the alignment on top of a spectrogram. Requires matplotlib to be installed.

plot_audio()

Display a plot of the waveform. Requires matplotlib to be installed.

plot_features()

Display the feature matrix as an image. Requires matplotlib to be installed.

speakers_audio_mask(min_speaker_dim=None, speaker_to_idx_map=None, use_alignment_if_exists=None)

Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_samples), and its values are 0 for nonspeech samples and 1 for speech samples for each respective speaker.

This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272

Parameters
  • min_speaker_dim (Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).

  • speaker_to_idx_map (Optional[Dict[str, int]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2)

  • use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use when generating the mask. If the alignment is not present, falls back on supervision time spans.

Return type

ndarray

speakers_feature_mask(min_speaker_dim=None, speaker_to_idx_map=None, use_alignment_if_exists=None)

Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_frames), and its values are 0 for nonspeech frames and 1 for speech frames for each respective speaker.

This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272

Parameters
  • min_speaker_dim (Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).

  • speaker_to_idx_map (Optional[Dict[str, int]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2)

  • use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use when generating the mask. If the alignment is not present, falls back on supervision time spans.

Return type

ndarray
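
A usage sketch (the speaker names and the resulting shape below are hypothetical):

>>> mask = cut.speakers_feature_mask(
...     speaker_to_idx_map={'spk-a': 0, 'spk-b': 1},
... )
>>> mask.shape  # (num_speakers, num_frames)
(2, 500)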

supervisions_audio_mask(use_alignment_if_exists=None)

Return a 1D numpy array with value 1 for samples covered by at least one supervision, and 0 for samples not covered by any supervision.

Parameters

use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use when generating the mask. If the alignment is not present, falls back on supervision time spans.

Return type

ndarray

supervisions_feature_mask(use_alignment_if_exists=None)

Return a 1D numpy array with value 1 for frames covered by at least one supervision, and 0 for frames not covered by any supervision.

Parameters

use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use when generating the mask. If the alignment is not present, falls back on supervision time spans.

Return type

ndarray

to_dict()
Return type

dict

trim_to_supervisions(keep_overlapping=True, min_duration=None, context_direction='center')

Splits the current Cut into as many cuts as there are supervisions (SupervisionSegment). These cuts have the same start times and durations as the supervisions. When there are overlapping supervisions, they can be kept or discarded via the keep_overlapping flag.

For example, the following cut:

        Cut
|-----------------|
 Sup1
|----|  Sup2
   |-----------|

is transformed into two cuts:

 Cut1
|----|
 Sup1
|----|
   Sup2
   |-|
        Cut2
   |-----------|
   Sup1
   |-|
        Sup2
   |-----------|
Parameters
  • keep_overlapping (bool) – when False, it will discard parts of other supervisions that overlap with the main supervision. In the illustration above, it would discard Sup2 in Cut1 and Sup1 in Cut2.

  • min_duration (Optional[float]) – An optional duration in seconds; specifying this argument will extend the cuts that would have been shorter than min_duration with actual acoustic context in the recording/features. If there are supervisions present in the context, they are kept when keep_overlapping is true. If there is not enough context, the returned cut will be shorter than min_duration. If the supervision segment is longer than min_duration, the returned cut will be longer.

  • context_direction (Literal[‘center’, ‘left’, ‘right’, ‘random’]) – Which direction should the cut be expanded towards to include context. The value of “center” implies equal expansion to left and right; random uniformly samples a value between “left” and “right”.

Return type

List[Cut]

Returns

a list of cuts.
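
For illustration (the cut variable is hypothetical):

>>> sub_cuts = cut.trim_to_supervisions(keep_overlapping=False)
>>> len(sub_cuts) == len(cut.supervisions)
True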

property trimmed_supervisions: List[lhotse.supervision.SupervisionSegment]

Return the supervisions in this Cut that have modified time boundaries so as not to exceed the Cut’s start or end.

Note that when cut.supervisions is accessed, the supervisions may have negative start values (indicating that the supervision actually begins before the cut) or end values exceeding the Cut’s duration (indicating that the supervision continued in the original recording past the Cut’s end).

Caution

For some tasks such as speech recognition (ASR), trimmed supervisions could result in corrupted training data. This is because a part of the transcript might actually reside outside of the cut.

Return type

List[SupervisionSegment]

with_id(id_)

Return a copy of the Cut with a new ID.

Return type

Cut

features_type: Optional[str]
class lhotse.cut.MixTrack(cut, offset=0.0, snr=None)[source]

Represents a single track in a mix of Cuts. Points to a specific MonoCut and holds information on how to mix it with other Cuts, relative to the first track in a mix.

cut: Union[lhotse.cut.MonoCut, lhotse.cut.PaddingCut]
offset: float = 0.0
snr: Optional[float] = None
static from_dict(data)[source]
__init__(cut, offset=0.0, snr=None)
class lhotse.cut.MixedCut(id, tracks)[source]

MixedCut is a Cut that actually consists of multiple other cuts. It can be interpreted as a multi-channel cut, but its primary purpose is to allow time-domain and feature-domain augmentation via mixing the training cuts with noise, music, and babble cuts. The actual mixing operations are performed on-the-fly.

Internally, MixedCut holds other cuts in multiple tracks (MixTrack), each with its own offset and SNR that is relative to the first track.

Please refer to the documentation of Cut to learn more about using cuts.

In addition to the methods available in Cut, MixedCut provides methods to read the audio and features of all its tracks as separate channels:

>>> cut = MixedCut(...)
>>> mono_features = cut.load_features()
>>> assert len(mono_features.shape) == 2
>>> multi_features = cut.load_features(mixed=False)
>>> # Now, the first dimension is the channel.
>>> assert len(multi_features.shape) == 3


id: str
tracks: List[lhotse.cut.MixTrack]
property supervisions: List[lhotse.supervision.SupervisionSegment]

Lists the supervisions of the underlying source cuts. Each segment start time will be adjusted by the track offset.

Return type

List[SupervisionSegment]

property start: float
Return type

float

property end: float
Return type

float

property duration: float
Return type

float

property has_features: bool
Return type

bool

property has_recording: bool
Return type

bool

property num_frames: Optional[int]
Return type

Optional[int]

property frame_shift: Optional[float]
Return type

Optional[float]

property sampling_rate: Optional[int]
Return type

Optional[int]

property num_samples: Optional[int]
Return type

Optional[int]

property num_features: Optional[int]
Return type

Optional[int]

property features_type: Optional[str]
Return type

Optional[str]

truncate(*, offset=0.0, duration=None, keep_excessive_supervisions=True, preserve_id=False, _supervisions_index=None)[source]

Returns a new MixedCut that is a sub-region of the current MixedCut. This method truncates the underlying Cuts and modifies their offsets in the mix, as needed. Tracks that do not fit in the truncated cut are removed.

Note that no operation is done on the actual features – the actual changes happen only during the call to load_features(), when a subset of the features is loaded.

Parameters
  • offset (float) – float (seconds), controls the start of the new cut relative to the current MixedCut’s start.

  • duration (Optional[float]) – optional float (seconds), controls the duration of the resulting MixedCut. By default, the duration is (end of the cut before truncation) - (offset).

  • keep_excessive_supervisions (bool) – bool. Since trimming may happen inside a SupervisionSegment, the caller has an option to either keep or discard such supervisions.

  • preserve_id (bool) – bool. Should the truncated cut keep the same ID or get a new, random one.

Return type

Cut

Returns

a new MixedCut instance.

pad(duration=None, num_frames=None, num_samples=None, pad_feat_value=-23.025850929940457, direction='right', preserve_id=False)[source]

Return a new MixedCut, padded with zeros in the recording, and pad_feat_value in each feature bin.

The user can choose to pad either to a specific duration; a specific number of frames num_frames; or a specific number of samples num_samples. The three arguments are mutually exclusive.

Parameters
  • duration (Optional[float]) – The cut’s minimal duration after padding.

  • num_frames (Optional[int]) – The cut’s total number of frames after padding.

  • num_samples (Optional[int]) – The cut’s total number of samples after padding.

  • pad_feat_value (float) – A float value that’s used for padding the features. By default we assume a log-energy floor of approx. -23 (1e-10 after exp).

  • direction (str) – string, ‘left’, ‘right’ or ‘both’. Determines whether the padding is added before or after the cut.

  • preserve_id (bool) – When True, preserves the cut ID from before padding. Otherwise, generates a new random ID (default).

Return type

Cut

Returns

a padded MixedCut if duration is greater than this cut’s duration, otherwise self.

resample(sampling_rate, affix_id=False)[source]

Return a new MixedCut that will lazily resample the audio while reading it. This operation will drop the feature manifest, if attached. It does not affect the supervision.

Parameters
  • sampling_rate (int) – The new sampling rate.

  • affix_id (bool) – Should we modify the ID (useful if both versions of the same cut are going to be present in a single manifest).

Return type

MixedCut

Returns

a modified copy of the current MixedCut.

perturb_speed(factor, affix_id=True)[source]

Return a new MixedCut that will lazily perturb the speed while loading audio. The num_samples, start and duration fields of the underlying Cuts (and their Recordings and SupervisionSegments) are updated to reflect the shrinking/extending effect of speed. We are also updating the offsets of all underlying tracks.

Parameters
  • factor (float) – The speed will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • affix_id (bool) – When true, we will modify the MixedCut.id field by affixing it with “_sp{factor}”.

Return type

MixedCut

Returns

a modified copy of the current MixedCut.

perturb_tempo(factor, affix_id=True)[source]

Return a new MixedCut that will lazily perturb the tempo while loading audio.

Compared to speed perturbation, tempo preserves pitch. The num_samples, start and duration fields of the underlying Cuts (and their Recordings and SupervisionSegments) are updated to reflect the shrinking/extending effect of tempo. We are also updating the offsets of all underlying tracks.

Parameters
  • factor (float) – The tempo will be adjusted this many times (e.g. factor=1.1 means 1.1x faster).

  • affix_id (bool) – When true, we will modify the MixedCut.id field by affixing it with “_tp{factor}”.

Return type

MixedCut

Returns

a modified copy of the current MixedCut.

perturb_volume(factor, affix_id=True)[source]

Return a new MixedCut that will lazily perturb the volume while loading audio. Recordings of the underlying Cuts are updated to reflect volume change.

Parameters
  • factor (float) – The volume will be adjusted this many times (e.g. factor=1.1 means 1.1x louder).

  • affix_id (bool) – When true, we will modify the MixedCut.id field by affixing it with “_vp{factor}”.

Return type

MixedCut

Returns

a modified copy of the current MixedCut.

load_features(mixed=True)[source]

Loads the features of the source cuts and mixes them on-the-fly.

Parameters

mixed (bool) – when True (default), returns a 2D array of features mixed in the feature domain. Otherwise returns a 3D array with the first dimension equal to the number of tracks.

Return type

Optional[ndarray]

Returns

A numpy ndarray with features and with shape (num_frames, num_features), or (num_tracks, num_frames, num_features)

load_audio(mixed=True)[source]

Loads the audio of the source cuts and mixes them on-the-fly.

Parameters

mixed (bool) – When True (default), returns a mono mix of the underlying tracks. Otherwise returns a numpy array with the number of channels equal to the number of tracks.

Return type

Optional[ndarray]

Returns

A numpy ndarray with audio samples and with shape (num_channels, num_samples)
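
For illustration (mixed_cut is hypothetical):

>>> mix = mixed_cut.load_audio()                   # mono mix of all tracks
>>> per_track = mixed_cut.load_audio(mixed=False)  # one channel per track
>>> per_track.shape[0] == len(mixed_cut.tracks)
True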

plot_tracks_features()[source]

Display the feature matrix as an image. Requires matplotlib to be installed.

plot_tracks_audio()[source]

Display plots of the individual tracks’ waveforms. Requires matplotlib to be installed.

drop_features()[source]

Return a copy of the current MixedCut, detached from features.

Return type

MixedCut

drop_recording()[source]

Return a copy of the current MixedCut, detached from recording.

Return type

MixedCut

drop_supervisions()[source]

Return a copy of the current MixedCut, detached from supervisions.

Return type

MixedCut

compute_and_store_features(extractor, storage, augment_fn=None, mix_eagerly=True)[source]

Compute the features from this cut, store them on disk, and create a new MonoCut object with the feature manifest attached. This cut has to be able to load audio.

Parameters
  • extractor (FeatureExtractor) – a FeatureExtractor instance used to compute the features.

  • storage (FeaturesWriter) – a FeaturesWriter instance used to store the features.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional callable used for audio augmentation.

  • mix_eagerly (bool) – when False, extract and store the features for each track separately, and mix them dynamically when loading the features. When True, mix the audio first and store the mixed features, returning a new MonoCut instance with the same ID. The returned MonoCut will not have a Recording attached.

Return type

Cut

Returns

a new MonoCut instance if mix_eagerly is True, or returns self with each of the tracks containing the Features manifests.

map_supervisions(transform_fn)[source]

Modify the SupervisionSegments by transform_fn of this MixedCut.

Parameters

transform_fn (Callable[[SupervisionSegment], SupervisionSegment]) – a function that accepts a supervision segment and returns its modified version.

Return type

Cut

Returns

a modified MixedCut.

filter_supervisions(predicate)[source]

Modify the cut to store only the supervisions accepted by the predicate.

Example:
>>> cut = cut.filter_supervisions(lambda s: s.id in supervision_ids)
>>> cut = cut.filter_supervisions(lambda s: s.duration < 5.0)
>>> cut = cut.filter_supervisions(lambda s: s.text is not None)
Parameters

predicate (Callable[[SupervisionSegment], bool]) – A callable that accepts SupervisionSegment and returns bool

Return type

Cut

Returns

a modified MixedCut

static from_dict(data)[source]
Return type

MixedCut

with_features_path_prefix(path)[source]
Return type

MixedCut

with_recording_path_prefix(path)[source]
Return type

MixedCut

__init__(id, tracks)
append(other, snr=None, preserve_id=None)

Append the other Cut after the current Cut. Conceptually the same as mix, but with an offset matching the current cut’s length. Optionally scale down (positive SNR) or scale up (negative SNR) the other cut. Returns a MixedCut, which only keeps the information about the mix; the actual mixing is performed during the call to load_features.

Parameters

preserve_id (Optional[str]) – optional string (“left”, “right”). When specified, append will preserve the cut ID of the left- or right-hand side argument. Otherwise, a new random ID is generated.

Return type

MixedCut

compute_and_store_recording(storage_path, augment_fn=None)

Store this cut’s waveform as audio recording to disk.

Parameters
  • storage_path (Union[Path, str]) – The path to location where we will store the audio recordings.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional callable used for audio augmentation. Be careful with the types of augmentations used: if they modify the start/end/duration times of the cut and its supervisions, you will end up with incorrect supervision information when using this API. E.g. for speed perturbation, use CutSet.perturb_speed() instead.

Return type

MonoCut

Returns

a new MonoCut instance.

compute_features(extractor, augment_fn=None)

Compute the features from this cut. This cut has to be able to load audio.

Parameters
  • extractor (FeatureExtractor) – a FeatureExtractor instance used to compute the features.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – optional WavAugmenter instance for audio augmentation.

Return type

ndarray

Returns

a numpy ndarray with the computed features.

index_supervisions(index_mixed_tracks=False, keep_ids=None)

Create a two-level index of supervision segments. It is a mapping from a Cut’s ID to an interval tree that contains the supervisions of that Cut.

The interval tree can be efficiently queried for overlapping and/or enveloping segments. It helps speed up some operations on Cuts of very long recordings (1h+) that contain many supervisions.

Parameters
  • index_mixed_tracks (bool) – Should the tracks of MixedCut’s be indexed as additional, separate entries.

  • keep_ids (Optional[Set[str]]) – If specified, we will only index the supervisions with the specified IDs.

Return type

Dict[str, IntervalTree]

Returns

a mapping from Cut ID to an interval tree of SupervisionSegments.

mix(other, offset_other_by=0.0, snr=None, preserve_id=None)

Refer to the lhotse.cut.mix() documentation.

Return type

MixedCut

play_audio()

Display a Jupyter widget that allows listening to the waveform. Works only in Jupyter notebook/lab or similar (e.g. Colab).

plot_alignment(alignment_type='word')

Display the alignment on top of a spectrogram. Requires matplotlib to be installed.

plot_audio()

Display a plot of the waveform. Requires matplotlib to be installed.

plot_features()

Display the feature matrix as an image. Requires matplotlib to be installed.

speakers_audio_mask(min_speaker_dim=None, speaker_to_idx_map=None, use_alignment_if_exists=None)

Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_samples), and its values are 0 for nonspeech samples and 1 for speech samples for each respective speaker.

This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272

Parameters
  • min_speaker_dim (Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).

  • speaker_to_idx_map (Optional[Dict[str, int]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2)

  • use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use when generating the mask. If the alignment is not present, falls back on supervision time spans.

Return type

ndarray

speakers_feature_mask(min_speaker_dim=None, speaker_to_idx_map=None, use_alignment_if_exists=None)

Return a matrix of per-speaker activity in a cut. The matrix shape is (num_speakers, num_frames), and its values are 0 for nonspeech frames and 1 for speech frames for each respective speaker.

This is somewhat inspired by the TS-VAD setup: https://arxiv.org/abs/2005.07272

Parameters
  • min_speaker_dim (Optional[int]) – optional int, when specified it will enforce that the matrix shape is at least that value (useful for datasets like CHiME 6 where the number of speakers is always 4, but some cuts might have fewer speakers than that).

  • speaker_to_idx_map (Optional[Dict[str, int]]) – optional dict mapping speaker names (strings) to their global indices (ints). Useful when you want to preserve the order of the speakers (e.g. speaker XYZ is always mapped to index 2)

  • use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use when generating the mask. If the alignment is not present, falls back on supervision time spans.

Return type

ndarray

supervisions_audio_mask(use_alignment_if_exists=None)

Return a 1D numpy array with value 1 for samples covered by at least one supervision, and 0 for samples not covered by any supervision.

Parameters

use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use when generating the mask. If the alignment is not present, falls back on supervision time spans.

Return type

ndarray

supervisions_feature_mask(use_alignment_if_exists=None)

Return a 1D numpy array with value 1 for frames covered by at least one supervision, and 0 for frames not covered by any supervision.

Parameters

use_alignment_if_exists (Optional[str]) – optional str, key for the alignment type to use when generating the mask. If the alignment is not present, falls back on supervision time spans.

Return type

ndarray

to_dict()
Return type

dict

trim_to_supervisions(keep_overlapping=True, min_duration=None, context_direction='center')

Splits the current Cut into as many cuts as there are supervisions (SupervisionSegment). These cuts have the same start times and durations as the supervisions. When there are overlapping supervisions, they can be kept or discarded via the keep_overlapping flag.

For example, the following cut:

        Cut
|-----------------|
 Sup1
|----|  Sup2
   |-----------|

is transformed into two cuts:

 Cut1
|----|
 Sup1
|----|
   Sup2
   |-|
        Cut2
   |-----------|
   Sup1
   |-|
        Sup2
   |-----------|
Parameters
  • keep_overlapping (bool) – when False, it will discard parts of other supervisions that overlap with the main supervision. In the illustration above, it would discard Sup2 in Cut1 and Sup1 in Cut2.

  • min_duration (Optional[float]) – An optional duration in seconds; specifying this argument will extend the cuts that would have been shorter than min_duration with actual acoustic context in the recording/features. If there are supervisions present in the context, they are kept when keep_overlapping is true. If there is not enough context, the returned cut will be shorter than min_duration. If the supervision segment is longer than min_duration, the returned cut will be longer.

  • context_direction (Literal[‘center’, ‘left’, ‘right’, ‘random’]) – Which direction should the cut be expanded towards to include context. The value of “center” implies equal expansion to left and right; random uniformly samples a value between “left” and “right”.

Return type

List[Cut]

Returns

a list of cuts.

property trimmed_supervisions: List[lhotse.supervision.SupervisionSegment]

Return the supervisions in this Cut that have modified time boundaries so as not to exceed the Cut’s start or end.

Note that when cut.supervisions is accessed, the supervisions may have negative start values (indicating that the supervision actually begins before the cut) or end values exceeding the Cut’s duration (indicating that the supervision continued in the original recording past the Cut’s end).

Caution

For some tasks such as speech recognition (ASR), trimmed supervisions could result in corrupted training data. This is because a part of the transcript might actually reside outside of the cut.

Return type

List[SupervisionSegment]

with_id(id_)

Return a copy of the Cut with a new ID.

Return type

Cut

class lhotse.cut.CutSet(cuts=None)[source]

CutSet represents a collection of cuts, indexed by cut IDs. CutSet ties together all types of data – audio, features and supervisions – and is suitable for representing training/dev/test sets.

Note

CutSet is the basic building block of PyTorch-style Datasets for speech/audio processing tasks.

When coming from Kaldi, there is really no good equivalent – the closest concept may be Kaldi’s “egs” for training neural networks, which are chunks of feature matrices and corresponding alignments used respectively as inputs and supervisions. CutSet is different because it provides you with all kinds of metadata, and you can select just the interesting bits to feed them to your models.

CutSet can be created from any combination of RecordingSet, SupervisionSet, and FeatureSet with lhotse.cut.CutSet.from_manifests():

>>> from lhotse import CutSet
>>> cuts = CutSet.from_manifests(recordings=my_recording_set)
>>> cuts2 = CutSet.from_manifests(features=my_feature_set)
>>> cuts3 = CutSet.from_manifests(
...     recordings=my_recording_set,
...     features=my_feature_set,
...     supervisions=my_supervision_set,
... )

When creating a CutSet with CutSet.from_manifests(), the resulting cuts will have the same duration as the input recordings or features. For long recordings, this is not viable for training. We provide several methods to transform the cuts into shorter ones.

Consider the following scenario:

                  Recording
|-------------------------------------------|
"Hey, Matt!"     "Yes?"        "Oh, nothing"
|----------|     |----|        |-----------|

.......... CutSet.from_manifests() ..........
                    Cut1
|-------------------------------------------|

............. Example CutSet A ..............
    Cut1          Cut2              Cut3
|----------|     |----|        |-----------|

............. Example CutSet B ..............
          Cut1                  Cut2
|---------------------||--------------------|

............. Example CutSet C ..............
             Cut1        Cut2
            |---|      |------|

The CutSet’s A, B and C can be created like:

>>> cuts_A = cuts.trim_to_supervisions()
>>> cuts_B = cuts.cut_into_windows(duration=5.0)
>>> cuts_C = cuts.trim_to_unsupervised_segments()

Note

Some operations support parallel execution via an optional num_jobs parameter. By default, all processing is single-threaded.

Caution

Operations on cut sets are not mutating – they return modified copies of CutSet objects, leaving the original object unmodified (and all of its cuts are also unmodified).

CutSet can be stored and read from JSON, JSONL, etc. and supports optional gzip compression:

>>> cuts.to_file('cuts.jsonl.gz')
>>> cuts4 = CutSet.from_file('cuts.jsonl.gz')

It behaves similarly to a dict:

>>> 'rec1-1-0' in cuts
True
>>> cut = cuts['rec1-1-0']
>>> for cut in cuts:
>>>    pass
>>> len(cuts)
127

CutSet has some convenience properties and methods to gather information about the dataset:

>>> ids = list(cuts.ids)
>>> speaker_id_set = cuts.speakers
>>> # The following prints a message:
>>> cuts.describe()
Cuts count: 547
Total duration (hours): 326.4
Speech duration (hours): 79.6 (24.4%)
***
Duration statistics (seconds):
mean    2148.0
std      870.9
min      477.0
25%     1523.0
50%     2157.0
75%     2423.0
max     5415.0
dtype: float64

Manipulation examples:

>>> longer_than_5s = cuts.filter(lambda c: c.duration > 5)
>>> first_100 = cuts.subset(first=100)
>>> split_into_4 = cuts.split(num_splits=4)
>>> shuffled = cuts.shuffle()
>>> random_sample = cuts.sample(n_cuts=10)
>>> new_ids = cuts.modify_ids(lambda c: c.id + '-newid')

These operations can be composed to implement more complex operations, e.g. bucketing by duration:

>>> buckets = cuts.sort_by_duration().split(num_splits=30)

Cuts in a CutSet can be detached from parts of their metadata:

>>> cuts_no_feat = cuts.drop_features()
>>> cuts_no_rec = cuts.drop_recordings()
>>> cuts_no_sup = cuts.drop_supervisions()

Sometimes specific sorting patterns are useful when a small CutSet represents a mini-batch:

>>> cuts = cuts.sort_by_duration(ascending=False)
>>> cuts = cuts.sort_like(other_cuts)

CutSet offers some batch processing operations:

>>> cuts = cuts.pad(num_frames=300)  # or duration=30.0
>>> cuts = cuts.truncate(max_duration=30.0, offset_type='start')  # truncate from start to 30.0s
>>> cuts = cuts.mix(other_cuts, snr=[10, 30], mix_prob=0.5)

CutSet supports lazy data augmentation/transformation methods which require adjusting some information in the manifest (e.g., num_samples or duration). Note that in the following examples, the audio is untouched – the operations are stored in the manifest, and executed upon reading the audio:

>>> cuts_sp = cuts.perturb_speed(factor=1.1)
>>> cuts_vp = cuts.perturb_volume(factor=2.)
>>> cuts_24k = cuts.resample(24000)

Caution

If the CutSet contained Features manifests, they will be detached after performing audio augmentations such as CutSet.perturb_speed() or CutSet.resample() or CutSet.perturb_volume().

CutSet offers parallel feature extraction capabilities (see CutSet.compute_and_store_features() for details), and can be used to estimate global mean and variance:

>>> from lhotse import Fbank
>>> cuts = CutSet()
>>> cuts = cuts.compute_and_store_features(
...     extractor=Fbank(),
...     storage_path='/data/feats',
...     num_jobs=4
... )
>>> mvn_stats = cuts.compute_global_feature_stats('/data/features/mvn_stats.pkl', max_cuts=10000)


__init__(cuts=None)[source]
property is_lazy: bool

Indicates whether this manifest was opened in lazy (read-on-the-fly) mode or not.

Return type

bool

property mixed_cuts: Dict[str, lhotse.cut.MixedCut]
Return type

Dict[str, MixedCut]

property simple_cuts: Dict[str, lhotse.cut.MonoCut]
Return type

Dict[str, MonoCut]

property ids: Iterable[str]
Return type

Iterable[str]

property speakers: FrozenSet[str]
Return type

FrozenSet[str]

static from_cuts(cuts)[source]
Return type

CutSet

static from_manifests(recordings=None, supervisions=None, features=None, random_ids=False)[source]

Create a CutSet from any combination of supervision, feature and recording manifests. At least one of recordings or features is required.

The created cuts will be of type MonoCut, even when the recordings have multiple channels. The MonoCut boundaries correspond to those found in the features, when available, otherwise to those found in the recordings.

When supervisions are provided, we’ll search them for matching recording IDs and attach them to the created cuts, assuming they are fully contained within the cut’s time span.

Parameters
  • recordings (Optional[RecordingSet]) – an optional RecordingSet manifest.

  • supervisions (Optional[SupervisionSet]) – an optional SupervisionSet manifest.

  • features (Optional[FeatureSet]) – an optional FeatureSet manifest.

  • random_ids (bool) – boolean, should the cut IDs be randomized. By default, use the recording ID with a loop index and a channel idx, i.e. “{recording_id}-{idx}-{channel}”.

Return type

CutSet

Returns

a new CutSet instance.

static from_dicts(data)[source]
Return type

CutSet

to_dicts()[source]
Return type

Iterable[dict]

describe()[source]

Print a message describing details about the CutSet - the number of cuts and the duration statistics, including the total duration and the percentage of speech segments.

Example output:

Cuts count: 547
Total duration (hours): 326.4
Speech duration (hours): 79.6 (24.4%)
***
Duration statistics (seconds):
mean    2148.0
std      870.9
min      477.0
25%     1523.0
50%     2157.0
75%     2423.0
max     5415.0
dtype: float64

Return type

None

shuffle(rng=None)[source]

Shuffle the cut IDs in the current CutSet and return a shuffled copy of self.

Parameters

rng (Optional[Random]) – an optional instance of random.Random for precise control of randomness.

Return type

CutSet

Returns

a shuffled copy of self.
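
For reproducible shuffling, pass a seeded Random instance:

>>> import random
>>> shuffled = cuts.shuffle(rng=random.Random(42))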

split(num_splits, shuffle=False, drop_last=False)[source]

Split the CutSet into num_splits pieces of equal size.

Parameters
  • num_splits (int) – Requested number of splits.

  • shuffle (bool) – Optionally shuffle the recordings order first.

  • drop_last (bool) – determines how to handle splitting when len(seq) is not divisible by num_splits. When False (default), the splits might have unequal lengths. When True, it may discard the last element in some splits to ensure they are equally long.

Return type

List[CutSet]

Returns

A list of CutSet pieces.

subset(*, supervision_ids=None, cut_ids=None, first=None, last=None)[source]

Return a new CutSet according to the selected subset criterion. Only a single argument to subset is supported at this time.

Example:
>>> cuts = CutSet.from_yaml('path/to/cuts')
>>> train_set = cuts.subset(supervision_ids=train_ids)
>>> test_set = cuts.subset(supervision_ids=test_ids)
Parameters
  • supervision_ids (Optional[Iterable[str]]) – List of supervision IDs to keep.

  • cut_ids (Optional[Iterable[str]]) – List of cut IDs to keep. The returned CutSet preserves the order of cut_ids.

  • first (Optional[int]) – int, the number of first cuts to keep.

  • last (Optional[int]) – int, the number of last cuts to keep.

Return type

CutSet

Returns

a new CutSet with the subset results.

filter_supervisions(predicate)[source]

Return a new CutSet with Cuts containing only SupervisionSegments satisfying the predicate.

Cuts without supervisions are preserved.

Example:
>>> cuts = CutSet.from_yaml('path/to/cuts')
>>> at_least_five_second_supervisions = cuts.filter_supervisions(lambda s: s.duration >= 5)
Parameters

predicate (Callable[[SupervisionSegment], bool]) – A callable that accepts SupervisionSegment and returns bool

Return type

CutSet

Returns

a CutSet with filtered supervisions

filter(predicate)[source]

Return a new CutSet with the Cuts that satisfy the predicate.

Parameters

predicate (Callable[[Cut], bool]) – a function that takes a cut as an argument and returns bool.

Return type

CutSet

Returns

a filtered CutSet.

trim_to_supervisions(keep_overlapping=True, min_duration=None, context_direction='center', num_jobs=1)[source]

Return a new CutSet with Cuts that have the same spans as their supervisions.

For example, the following cut:

        Cut
|-----------------|
 Sup1
|----|  Sup2
   |-----------|

is transformed into two cuts:

 Cut1
|----|
 Sup1
|----|
   Sup2
   |-|
        Cut2
   |-----------|
   Sup1
   |-|
        Sup2
   |-----------|
Parameters
  • keep_overlapping (bool) – when False, it will discard parts of other supervisions that overlap with the main supervision. In the illustration above, it would discard Sup2 in Cut1 and Sup1 in Cut2.

  • min_duration (Optional[float]) – An optional duration in seconds; specifying this argument will extend the cuts that would have been shorter than min_duration with actual acoustic context in the recording/features. If there are supervisions present in the context, they are kept when keep_overlapping is true. If there is not enough context, the returned cut will be shorter than min_duration. If the supervision segment is longer than min_duration, the returned cut will be longer.

  • context_direction (Literal[‘center’, ‘left’, ‘right’, ‘random’]) – Which direction should the cut be expanded towards to include context. The value of “center” implies equal expansion to left and right; random uniformly samples a value between “left” and “right”.

  • num_jobs (int) – Number of parallel workers to process the cuts.

Return type

CutSet

Returns

a CutSet.

trim_to_unsupervised_segments()[source]

Return a new CutSet with Cuts created from segments that have no supervisions (likely silence or noise).

Return type

CutSet

Returns

a CutSet.

mix_same_recording_channels()[source]

Find cuts that come from the same recording and have matching start and end times, but represent different channels. Then, mix them together (in matching groups) and return a new CutSet that contains their mixes. This is useful for processing microphone array recordings.

It is intended to be used as the first operation after creating a new CutSet (but might also work in other circumstances, e.g. if it was cut to windows first).

Example:
>>> ami = prepare_ami('path/to/ami')
>>> cut_set = CutSet.from_manifests(recordings=ami['train']['recordings'])
>>> multi_channel_cut_set = cut_set.mix_same_recording_channels()

In the AMI example, the multi_channel_cut_set will yield MixedCuts that hold all single-channel Cuts together.

Return type

CutSet

sort_by_duration(ascending=False)[source]

Sort the CutSet by cut duration and return the result. Descending by default.

Return type

CutSet

sort_like(other)[source]

Sort the CutSet according to the order of cut IDs in other and return the result.

Return type

CutSet

index_supervisions(index_mixed_tracks=False, keep_ids=None)[source]

Create a two-level index of supervision segments. It is a mapping from a Cut’s ID to an interval tree that contains the supervisions of that Cut.

The interval tree can be efficiently queried for overlapping and/or enveloping segments. It helps speed up some operations on Cuts of very long recordings (1h+) that contain many supervisions.

Parameters
  • index_mixed_tracks (bool) – Should the tracks of MixedCut’s be indexed as additional, separate entries.

  • keep_ids (Optional[Set[str]]) – If specified, we will only index the supervisions with the specified IDs.

Return type

Dict[str, IntervalTree]

Returns

a mapping from MonoCut ID to an interval tree of SupervisionSegments.

pad(duration=None, num_frames=None, num_samples=None, pad_feat_value=-23.025850929940457, direction='right', preserve_id=False)[source]

Return a new CutSet with Cuts padded to duration, num_frames or num_samples. Cuts longer than the specified argument will not be affected. By default, cuts will be padded to the right (i.e. after the signal).

When none of duration, num_frames, or num_samples is specified, we’ll try to determine the best way to pad to the longest cut based on whether features or recordings are available.

Parameters
  • duration (Optional[float]) – The cuts’ minimal duration after padding. When not specified, we’ll choose the duration of the longest cut in the CutSet.

  • num_frames (Optional[int]) – The cut’s total number of frames after padding.

  • num_samples (Optional[int]) – The cut’s total number of samples after padding.

  • pad_feat_value (float) – A float value that’s used for padding the features. By default we assume a log-energy floor of approx. -23 (1e-10 after exp).

  • direction (str) – string, ‘left’, ‘right’ or ‘both’. Determines whether the padding is added before or after the cut.

  • preserve_id (bool) – When True, preserves the cut ID from before padding. Otherwise, generates a new random ID (default).

Return type

CutSet

Returns

A padded CutSet.

truncate(max_duration, offset_type, keep_excessive_supervisions=True, preserve_id=False)[source]

Return a new CutSet with the Cuts truncated so that their durations are at most max_duration. Cuts shorter than max_duration will not be changed.

Parameters
  • max_duration (float) – the maximum duration in seconds of a cut in the resulting manifest.

  • offset_type (str) – can be: ‘start’ => cuts are truncated from their start; ‘end’ => cuts are truncated from their end minus max_duration; ‘random’ => cuts are truncated randomly between their start and their end minus max_duration.

  • keep_excessive_supervisions (bool) – When a cut is truncated in the middle of a supervision segment, should the supervision be kept.

  • preserve_id (bool) – Should the truncated cut keep the same ID or get a new, random one.

Return type

CutSet

Returns

a new CutSet instance with truncated cuts.
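
A usage sketch (cuts is a hypothetical CutSet):

>>> truncated = cuts.truncate(max_duration=10.0, offset_type='random')
>>> all(c.duration <= 10.0 for c in truncated)
True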

cut_into_windows(duration, keep_excessive_supervisions=True, num_jobs=1)[source]

Return a new CutSet, made by traversing each MonoCut in windows of duration seconds and creating new MonoCuts out of them.

The last window might have a shorter duration if there was not enough audio, so you might want to use either .filter() or .pad() afterwards to obtain a uniform duration CutSet.

Parameters
  • duration (float) – Desired duration of the new cuts in seconds.

  • keep_excessive_supervisions (bool) – bool. When a cut is truncated in the middle of a supervision segment, should the supervision be kept.

  • num_jobs (int) – The number of parallel workers.

Return type

CutSet

Returns

a new CutSet with cuts made from shorter duration windows.
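
A usage sketch that also discards the possibly-shorter last windows (cuts is hypothetical):

>>> windows = cuts.cut_into_windows(duration=5.0)
>>> uniform = windows.filter(lambda c: c.duration == 5.0)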

sample(n_cuts=1)[source]

Randomly sample this CutSet and return n_cuts cuts. When n_cuts is 1, returns a single cut instance; otherwise returns a CutSet.

Return type

Union[Cut, CutSet]

resample(sampling_rate, affix_id=False)[source]

Return a new CutSet that contains cuts resampled to the new sampling_rate. All cuts in the manifest must contain recording information. If the feature manifests are attached, they are dropped.

Parameters
  • sampling_rate (int) – The new sampling rate.

  • affix_id (bool) – Should we modify the ID (useful if both versions of the same cut are going to be present in a single manifest).

Return type

CutSet

Returns

a modified copy of the CutSet.

perturb_speed(factor, affix_id=True)[source]

Return a new CutSet that contains speed perturbed cuts with a factor of factor. It requires the recording manifests to be present. If the feature manifests are attached, they are dropped. The supervision manifests are modified to reflect the speed perturbed start times and durations.

Parameters
  • factor (float) – The resulting playback speed is factor times the original one.

  • affix_id (bool) – Should we modify the ID (useful if both versions of the same cut are going to be present in a single manifest).

Return type

CutSet

Returns

a modified copy of the CutSet.

perturb_tempo(factor, affix_id=True)[source]

Return a new CutSet that contains tempo perturbed cuts with a factor of factor.

Compared to speed perturbation, tempo preserves pitch. It requires the recording manifests to be present. If the feature manifests are attached, they are dropped. The supervision manifests are modified to reflect the tempo perturbed start times and durations.

Parameters
  • factor (float) – The resulting playback tempo is factor times the original one.

  • affix_id (bool) – Should we modify the ID (useful if both versions of the same cut are going to be present in a single manifest).

Return type

CutSet

Returns

a modified copy of the CutSet.

perturb_volume(factor, affix_id=True)[source]

Return a new CutSet that contains volume perturbed cuts with a factor of factor. It requires the recording manifests to be present. If the feature manifests are attached, they are dropped. The supervision manifests remain unchanged.

Parameters
  • factor (float) – The resulting playback volume is factor times the original one.

  • affix_id (bool) – Should we modify the ID (useful if both versions of the same cut are going to be present in a single manifest).

Return type

CutSet

Returns

a modified copy of the CutSet.

mix(cuts, duration=None, snr=20, preserve_id=None, mix_prob=1.0)[source]

Mix cuts in this CutSet with randomly sampled cuts from another CutSet. A typical application would be data augmentation with noise, music, babble, etc.

Parameters
  • cuts (CutSet) – a CutSet containing cuts to be mixed into this CutSet.

  • duration (Optional[float]) – an optional float in seconds. When None, we will preserve the duration of the cuts in self (i.e. we’ll truncate the mix if it exceeded the original duration). Otherwise, we will keep sampling cuts to mix in until we reach the specified duration (and truncate to that value, should it be exceeded).

  • snr (Union[float, Sequence[float], None]) – an optional float, or pair (range) of floats, in decibels. When it’s a single float, we will mix all cuts with this SNR level (where cuts in self are treated as signals, and cuts in cuts are treated as noise). When it’s a pair of floats, we will uniformly sample SNR values from that range. When None, we will mix the cuts without any level adjustment (could be too noisy for data augmentation).

  • preserve_id (Optional[str]) – optional string (“left”, “right”). When specified, we will preserve the cut ID of the left- or right-hand side argument. Otherwise, a new random ID is generated.

  • mix_prob (float) – an optional float in range [0, 1]. Specifies the probability of performing a mix. Values lower than 1.0 mean that some cuts in the output will be unchanged.

Return type

CutSet

Returns

a new CutSet with mixed cuts.

drop_features()[source]

Return a new CutSet, where each Cut is copied and detached from its extracted features.

Return type

CutSet

drop_recordings()[source]

Return a new CutSet, where each Cut is copied and detached from its recordings.

Return type

CutSet

drop_supervisions()[source]

Return a new CutSet, where each Cut is copied and detached from its supervisions.

Return type

CutSet

compute_and_store_features(extractor, storage_path, num_jobs=None, augment_fn=None, storage_type=<class 'lhotse.features.io.LilcomHdf5Writer'>, executor=None, mix_eagerly=True, progress_bar=True)[source]

Extract features for all cuts, possibly in parallel, and store them using the specified storage object.

Examples:

Extract fbank features on one machine using 8 processes, store arrays partitioned in 8 HDF5 files with lilcom compression:

>>> cuts = CutSet(...)
... cuts.compute_and_store_features(
...     extractor=Fbank(),
...     storage_path='feats',
...     num_jobs=8,
... )

Extract fbank features on one machine using 8 processes, store each array in a separate file with lilcom compression:

>>> cuts = CutSet(...)
... cuts.compute_and_store_features(
...     extractor=Fbank(),
...     storage_path='feats',
...     num_jobs=8,
...     storage_type=LilcomFilesWriter
... )

Extract fbank features on multiple machines using a Dask cluster with 80 jobs, store arrays partitioned in 80 HDF5 files with lilcom compression:

>>> from distributed import Client
... cuts = CutSet(...)
... cuts.compute_and_store_features(
...     extractor=Fbank(),
...     storage_path='feats',
...     num_jobs=80,
...     executor=Client(...)
... )

Extract fbank features on one machine using 8 processes, store each array in an S3 bucket (requires smart_open):

>>> cuts = CutSet(...)
... cuts.compute_and_store_features(
...     extractor=Fbank(),
...     storage_path='s3://my-feature-bucket/my-corpus-features',
...     num_jobs=8,
...     storage_type=LilcomURLWriter
... )
Parameters
  • extractor (FeatureExtractor) – A FeatureExtractor instance (either Lhotse’s built-in or a custom implementation).

  • storage_path (Union[Path, str]) – The path to location where we will store the features. The exact type and layout of stored files will be dictated by the storage_type argument.

  • num_jobs (Optional[int]) – The number of parallel processes used to extract the features. We will internally split the CutSet into this many chunks and process each chunk in parallel.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional callable used for audio augmentation. Be careful with the types of augmentations used: if they modify the start/end/duration times of the cut and its supervisions, you will end up with incorrect supervision information when using this API. E.g. for speed perturbation, use CutSet.perturb_speed() instead.

  • storage_type (Type[~FW]) – a FeaturesWriter subclass type. It determines how the features are stored to disk, e.g. separate file per array, HDF5 files with multiple arrays, etc.

  • executor (Optional[Executor]) – when provided, will be used to parallelize the feature extraction process. By default, we will instantiate a ProcessPoolExecutor. Learn more about the Executor API at https://lhotse.readthedocs.io/en/latest/parallelism.html

  • mix_eagerly (bool) – Related to how the features are extracted for MixedCut instances, if any are present. When False, extract and store the features for each track separately, and mix them dynamically when loading the features. When True, mix the audio first and store the mixed features, returning a new MonoCut instance with the same ID. The returned MonoCut will not have a Recording attached.

  • progress_bar (bool) – Should a progress bar be displayed (automatically turned off for parallel computation).

Return type

CutSet

Returns

Returns a new CutSet with Features manifests attached to the cuts.

compute_and_store_features_batch(extractor, storage_path, batch_duration=600.0, num_workers=4, augment_fn=None, storage_type=<class 'lhotse.features.io.LilcomHdf5Writer'>)[source]

Extract features for all cuts in batches. This method is intended for use with compatible feature extractors that implement an accelerated extract_batch() method. For example, kaldifeat extractors can be used this way (see, e.g., KaldifeatFbank or KaldifeatMfcc).

When a CUDA GPU is available and enabled for the feature extractor, this can be much faster than CutSet.compute_and_store_features(). Otherwise, the speed will be comparable to single-threaded extraction.

Example: extract fbank features on one GPU, using 4 dataloading workers for reading audio, and store the arrays in an HDF5 file with lilcom compression:

>>> from lhotse import KaldifeatFbank, KaldifeatFbankConfig
>>> extractor = KaldifeatFbank(KaldifeatFbankConfig(device='cuda'))
>>> cuts = CutSet(...)
... cuts = cuts.compute_and_store_features_batch(
...     extractor=extractor,
...     storage_path='feats',
...     batch_duration=500,
...     num_workers=4,
... )
Parameters
  • extractor (FeatureExtractor) – A FeatureExtractor instance, which should implement an accelerated extract_batch method.

  • storage_path (Union[Path, str]) – The path to location where we will store the features. The exact type and layout of stored files will be dictated by the storage_type argument.

  • batch_duration (float) – The maximum number of audio seconds in a batch. Determines batch size dynamically.

  • num_workers (int) – How many background dataloading workers should be used for reading the audio.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional callable used for audio augmentation. Be careful with the types of augmentations used: if they modify the start/end/duration times of the cut and its supervisions, you will end up with incorrect supervision information when using this API. E.g. for speed perturbation, use CutSet.perturb_speed() instead.

  • storage_type (Type[~FW]) – a FeaturesWriter subclass type. It determines how the features are stored to disk, e.g. separate file per array, HDF5 files with multiple arrays, etc.

Return type

CutSet

Returns

Returns a new CutSet with Features manifests attached to the cuts.

compute_and_store_recordings(storage_path, num_jobs=None, executor=None, augment_fn=None, progress_bar=True)[source]

Store waveforms of all cuts as audio recordings to disk.

Parameters
  • storage_path (Union[Path, str]) – The path to location where we will store the audio recordings. For each cut, a sub-directory will be created that starts with the first 3 characters of the cut’s ID. The audio recording is then stored in the sub-directory using the cut ID as filename and ‘.flac’ as suffix.

  • num_jobs (Optional[int]) – The number of parallel processes used to store the audio recordings. We will internally split the CutSet into this many chunks and process each chunk in parallel.

  • augment_fn (Optional[Callable[[ndarray, int], ndarray]]) – an optional callable used for audio augmentation. Be careful with the types of augmentations used: if they modify the start/end/duration times of the cut and its supervisions, you will end up with incorrect supervision information when using this API. E.g. for speed perturbation, use CutSet.perturb_speed() instead.

  • executor (Optional[Executor]) – when provided, will be used to parallelize the process. By default, we will instantiate a ProcessPoolExecutor. Learn more about the Executor API at https://lhotse.readthedocs.io/en/latest/parallelism.html

  • progress_bar (bool) – Should a progress bar be displayed (automatically turned off for parallel computation).

Return type

CutSet

Returns

Returns a new CutSet.
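
Example (a minimal sketch; the storage path and job count are illustrative):

>>> cuts = cuts.compute_and_store_recordings(
...     storage_path='audio',
...     num_jobs=4,
... )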

compute_global_feature_stats(storage_path=None, max_cuts=None)[source]

Compute the global means and standard deviations for each feature bin in the manifest. It follows the implementation in scikit-learn: https://github.com/scikit-learn/scikit-learn/blob/0fb307bf39bbdacd6ed713c00724f8f871d60370/sklearn/utils/extmath.py#L715 which follows the paper: “Algorithms for computing the sample variance: analysis and recommendations”, by Chan, Golub, and LeVeque.

Parameters
  • storage_path (Union[Path, str, None]) – an optional path to a file where the stats will be stored with pickle.

  • max_cuts (Optional[int]) – optionally, limit the number of cuts used for stats estimation. The cuts will be selected randomly in that case.

Returns

a dict of ``{'norm_means': np.ndarray, 'norm_stds': np.ndarray}`` with the shape of the arrays equal to the number of feature bins in this manifest.

Return type

Dict[str, ndarray]
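
Example (a minimal sketch; the pickle path and cut limit are illustrative):

>>> stats = cuts.compute_global_feature_stats(
...     storage_path='feature-stats.pkl',
...     max_cuts=10000,
... )
>>> stats['norm_means'].shape  # (num_feature_bins,)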

with_features_path_prefix(path)[source]
Return type

CutSet

with_recording_path_prefix(path)[source]
Return type

CutSet

map(transform_fn)[source]

Apply transform_fn to the cuts in this CutSet and return a new CutSet.

Parameters

transform_fn (Callable[[Cut], Cut]) – A callable (function) that accepts a single cut instance and returns a single cut instance.

Return type

CutSet

Returns

a new CutSet with transformed cuts.
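
Example (a sketch; since Cut.pad returns a new cut, it is a valid Cut -> Cut transform):

>>> padded_cuts = cuts.map(lambda cut: cut.pad(duration=30.0))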

modify_ids(transform_fn)[source]

Modify the IDs of cuts in this CutSet. Useful when combining multiple ``CutSet``s that were created from a single source, but contain features with different data augmentation techniques.

Parameters

transform_fn (Callable[[str], str]) – A callable (function) that accepts a string (cut ID) and returns a new string (new cut ID).

Return type

CutSet

Returns

a new CutSet with cuts with modified IDs.
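
Example (a sketch; the ID suffix is illustrative):

>>> relabeled = cuts.modify_ids(lambda cut_id: cut_id + '_augmented')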

map_supervisions(transform_fn)[source]

Modify the SupervisionSegments by transform_fn in this CutSet.

Parameters

transform_fn (Callable[[SupervisionSegment], SupervisionSegment]) – a function that accepts a supervision and returns a modified supervision.

Return type

CutSet

Returns

a new, modified CutSet.
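
Example (a sketch that lowercases speaker labels; it assumes each supervision has a speaker set and relies on SupervisionSegment being a dataclass):

>>> from dataclasses import replace
>>> cuts = cuts.map_supervisions(
...     lambda sup: replace(sup, speaker=sup.speaker.lower())
... )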

transform_text(transform_fn)[source]

Return a copy of this CutSet with all SupervisionSegments text transformed with transform_fn. Useful for text normalization, phonetic transcription, etc.

Parameters

transform_fn (Callable[[str], str]) – a function that accepts a string and returns a string.

Return type

CutSet

Returns

a new, modified CutSet.
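
Example (a sketch; uppercasing stands in for any text normalization):

>>> normalized = cuts.transform_text(lambda text: text.upper())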

count(value) → integer -- return number of occurrences of value
classmethod from_file(path)
Return type

Any

classmethod from_json(path)
Return type

Any

classmethod from_jsonl(path)
Return type

Any

classmethod from_jsonl_lazy(path)

Read a JSONL manifest in a lazy manner, which opens the file but does not read it immediately. It is only suitable for sequential reads and iteration.

Warning

Opening the manifest in this way might cause some methods that rely on random access to fail.

Return type

Any
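
Example (a minimal sketch; the manifest path and the process function are illustrative):

>>> cuts = CutSet.from_jsonl_lazy('cuts.jsonl.gz')
>>> for cut in cuts:  # items are read one at a time
...     process(cut)  # hypothetical user-defined processing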

classmethod from_yaml(path)
Return type

Any

index(value[, start[, stop]]) → integer -- return first index of value.

Raises ValueError if the value is not present.

Supporting start and stop arguments is optional, but recommended.

classmethod open_writer(path, overwrite=True)

Open a sequential writer that allows storing the manifests one by one, without keeping the whole manifest set in memory. Supports writing to JSONL format (.jsonl), with optional gzip compression (.jsonl.gz).

Example:

>>> from lhotse import RecordingSet
... recordings = [...]
... with RecordingSet.open_writer('recordings.jsonl.gz') as writer:
...     for recording in recordings:
...         writer.write(recording)

This writer can be useful for resuming writing that was previously interrupted – it will open the existing file and scan it for item IDs, so that it can skip writing them later. It can also be queried for existing IDs so that the user code may skip preparing the corresponding manifests.

Example:

>>> from lhotse import RecordingSet, Recording
... with RecordingSet.open_writer('recordings.jsonl.gz', overwrite=False) as writer:
...     for path in Path('.').rglob('*.wav'):
...         recording_id = path.stem
...         if writer.contains(recording_id):
...             # Item already written previously - skip processing.
...             continue
...         # Item doesn't exist yet - run extra work to prepare the manifest
...         # and store it.
...         recording = Recording.from_file(path, recording_id=recording_id)
...         writer.write(recording)
Return type

SequentialJsonlWriter

to_file(path)
Return type

None

to_json(path)
Return type

None

to_jsonl(path)
Return type

None

to_yaml(path)
Return type

None

lhotse.cut.make_windowed_cuts_from_features(feature_set, cut_duration, cut_shift=None, keep_shorter_windows=False)[source]

Converts a FeatureSet to a CutSet by traversing each Features object in (possibly overlapping) windows, and creating a MonoCut out of each window. By default, the last window in traversal will be discarded if it cannot satisfy the cut_duration requirement.

Parameters
  • feature_set (FeatureSet) – a FeatureSet object.

  • cut_duration (float) – float, duration of created Cuts in seconds.

  • cut_shift (Optional[float]) – optional float, specifies how many seconds are in between the starts of consecutive windows. Equals cut_duration by default.

  • keep_shorter_windows (bool) – bool, when True, the last window will be used to create a MonoCut even if its duration is shorter than cut_duration.

Return type

CutSet

Returns

a CutSet object.
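
Example (a minimal sketch; the manifest path and window settings are illustrative):

>>> from lhotse import FeatureSet
>>> from lhotse.cut import make_windowed_cuts_from_features
>>> feature_set = FeatureSet.from_jsonl('feats.jsonl')
>>> cuts = make_windowed_cuts_from_features(
...     feature_set=feature_set,
...     cut_duration=5.0,
...     cut_shift=2.5,  # consecutive windows overlap by 50%
...     keep_shorter_windows=True,
... )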

lhotse.cut.mix(reference_cut, mixed_in_cut, offset=0, snr=None, preserve_id=None)[source]

Overlay, or mix, two cuts. Optionally the mixed_in_cut may be shifted by offset seconds and scaled down (positive SNR) or scaled up (negative SNR). Returns a MixedCut, which contains both cuts and the mix information. The actual feature mixing is performed during the call to load_features().

Parameters
  • reference_cut (Cut) – The reference cut for the mix - offset and snr are specified w.r.t this cut.

  • mixed_in_cut (Cut) – The mixed-in cut - it will be offset and rescaled to match the offset and snr parameters.

  • offset (float) – How many seconds to shift the mixed_in_cut w.r.t. the reference_cut.

  • snr (Optional[float]) – Desired SNR of the mixed_in_cut w.r.t. the reference_cut in the mix.

  • preserve_id (Optional[str]) – optional string (“left”, “right”). When specified, the mix will preserve the cut ID of the left- or right-hand side argument. Otherwise, a new random ID is generated.

Return type

MixedCut

Returns

A MixedCut instance.
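
Example (a minimal sketch; the two cuts are illustrative, and snr=10 scales the mixed-in cut down relative to the reference):

>>> from lhotse.cut import mix
>>> mixed = mix(
...     reference_cut=speech_cut,
...     mixed_in_cut=noise_cut,
...     offset=1.0,
...     snr=10,
...     preserve_id='left',
... )
>>> feats = mixed.load_features()  # the actual mixing happens here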

lhotse.cut.pad(cut, duration=None, num_frames=None, num_samples=None, pad_feat_value=-23.025850929940457, direction='right', preserve_id=False)[source]

Return a new MixedCut, padded with zeros in the recording, and pad_feat_value in each feature bin.

The user can choose to pad either to a specific duration; a specific number of frames num_frames; or a specific number of samples num_samples. These three arguments are mutually exclusive.

Parameters
  • cut (Cut) – MonoCut to be padded.

  • duration (Optional[float]) – The cut’s minimal duration after padding.

  • num_frames (Optional[int]) – The cut’s total number of frames after padding.

  • num_samples (Optional[int]) – The cut’s total number of samples after padding.

  • pad_feat_value (float) – A float value that’s used for padding the features. By default we assume a log-energy floor of approx. -23 (1e-10 after exp).

  • direction (str) – string, ‘left’, ‘right’ or ‘both’. Determines whether the padding is added before the cut, after it, or on both sides.

  • preserve_id (bool) – When True, preserves the cut ID before padding. Otherwise, a new random ID is generated for the padded cut (default).

Return type

Cut

Returns

a padded MixedCut if duration is greater than the cut’s duration; otherwise, the input cut unchanged.
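
Example (a minimal sketch; the cut and target duration are illustrative):

>>> from lhotse.cut import pad
>>> padded = pad(cut, duration=10.0, direction='right')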

lhotse.cut.append(left_cut, right_cut, snr=None, preserve_id=None)[source]

Helper method for functional-style appending of Cuts.

Return type

MixedCut

lhotse.cut.mix_cuts(cuts)[source]

Return a MixedCut that consists of the input Cuts mixed with each other as-is.

Return type

MixedCut

lhotse.cut.append_cuts(cuts)[source]

Return a MixedCut that consists of the input Cuts appended to each other as-is.

Return type

Cut
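
A combined sketch for the three helpers above (the cuts are illustrative):

>>> from lhotse.cut import append, append_cuts, mix_cuts
>>> pair = append(cut_a, cut_b)            # cut_b starts where cut_a ends
>>> sequence = append_cuts([cut_a, cut_b, cut_c])
>>> mixture = mix_cuts([cut_a, cut_b, cut_c])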

lhotse.cut.compute_supervisions_frame_mask(cut, frame_shift=None, use_alignment_if_exists=None)[source]

Compute a mask that indicates which frames in a cut are covered by supervisions.

Parameters
  • cut (Cut) – a cut object.

  • frame_shift (Optional[float]) – optional frame shift in seconds; required when the cut does not have pre-computed features, otherwise ignored.

  • use_alignment_if_exists (Optional[str]) – optional str (a key from the alignment dict); use the specified alignment type for generating the mask.

Returns

a 1D numpy array with value 1 for frames covered by at least one supervision, and 0 for frames not covered by any supervision.
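
Example (a minimal sketch; the cut and frame shift are illustrative):

>>> from lhotse.cut import compute_supervisions_frame_mask
>>> mask = compute_supervisions_frame_mask(cut, frame_shift=0.01)
>>> mask.shape  # (num_frames,)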

Recipes

Convenience methods used to prepare recording and supervision manifests for standard corpora.

Kaldi conversion

Convenience methods used to interact with Kaldi data directories.

lhotse.kaldi.get_duration(path)[source]

Read an audio file; supports both Kaldi-style pipe paths and regular audio files.

Parameters

path (Union[Path, str]) – Path to an audio file or a Kaldi-style pipe.

Return type

float

Returns

float duration of the recording, in seconds.
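
Example (a minimal sketch; both paths are illustrative):

>>> from lhotse.kaldi import get_duration
>>> get_duration('audio/utt1.wav')                           # a regular file
>>> get_duration('sox audio/utt1.wav -t wav - speed 0.9 |')  # a Kaldi-style pipe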

lhotse.kaldi.load_kaldi_data_dir(path, sampling_rate, frame_shift=None, map_string_to_underscores=None, num_jobs=1)[source]

Load a Kaldi data directory and convert it to Lhotse RecordingSet and SupervisionSet manifests. For this to work, at least the wav.scp file must exist. A SupervisionSet is created only when a segments file exists. All the other files (text, utt2spk, etc.) are optional, and some of them might not be handled yet. In particular, feats.scp files are ignored.

Parameters

map_string_to_underscores (Optional[str]) – optional string; when specified, we will replace all instances of this string in SupervisionSegment IDs with underscores. This is to help with handling underscores in Kaldi (see export_to_kaldi()). This is also done for speaker IDs.

Return type

Tuple[RecordingSet, Optional[SupervisionSet], Optional[FeatureSet]]
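
Example (a minimal sketch; the data directory path is illustrative):

>>> from lhotse.kaldi import load_kaldi_data_dir
>>> recordings, supervisions, features = load_kaldi_data_dir(
...     'data/train',
...     sampling_rate=16000,
... )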

lhotse.kaldi.export_to_kaldi(recordings, supervisions, output_dir, map_underscores_to=None)[source]

Export a pair of RecordingSet and SupervisionSet to a Kaldi data directory. Recordings with multiple channels are supported, as long as each recording has a single AudioSource.

The RecordingSet and SupervisionSet must be compatible, i.e. it must be possible to create a CutSet out of them.

Parameters
  • recordings (RecordingSet) – a RecordingSet manifest.

  • supervisions (SupervisionSet) – a SupervisionSet manifest.

  • output_dir (Union[Path, str]) – path where the Kaldi-style data directory will be created.

  • map_underscores_to (Optional[str]) – optional string with which we will replace all underscores. This helps avoid issues with Kaldi data dir sorting.
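
Example (a minimal sketch; the output directory is illustrative):

>>> from lhotse.kaldi import export_to_kaldi
>>> export_to_kaldi(
...     recordings=recordings,
...     supervisions=supervisions,
...     output_dir='data/train_kaldi',
... )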

lhotse.kaldi.load_kaldi_text_mapping(path, must_exist=False)[source]

Load Kaldi files such as utt2spk, spk2gender, text, etc. as a dict.

Return type

Dict[str, Optional[str]]

lhotse.kaldi.save_kaldi_text_mapping(data, path)[source]

Save flat dicts to Kaldi files such as utt2spk, spk2gender, text, etc.
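
A combined sketch for the two helpers above (the paths are illustrative):

>>> from pathlib import Path
>>> from lhotse.kaldi import load_kaldi_text_mapping, save_kaldi_text_mapping
>>> utt2spk = load_kaldi_text_mapping(Path('data/train/utt2spk'), must_exist=True)
>>> save_kaldi_text_mapping(utt2spk, Path('data/train_copy/utt2spk'))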

lhotse.kaldi.make_wavscp_channel_string_map(source, sampling_rate)[source]
Return type

Dict[int, str]

Others

Helper methods used throughout the codebase.

lhotse.manipulation.combine(*manifests)[source]

Combine multiple manifests of the same type into one.

Examples:
>>> # Pass several arguments
>>> combine(recording_set1, recording_set2, recording_set3)
>>> # Or pass a single list/tuple of manifests
>>> combine([supervision_set1, supervision_set2])
Return type

~Manifest

lhotse.manipulation.split_parallelize_combine(num_jobs, manifest, fn, *args, **kwargs)[source]

Convenience wrapper that parallelizes the execution of functions that transform manifests. It splits the manifests into num_jobs pieces, applies the function to each split, and then combines the splits.

This function is used internally in Lhotse to implement some parallel ops.

Example:

>>> from lhotse import CutSet, split_parallelize_combine
>>> cuts = CutSet(...)
>>> window_cuts = split_parallelize_combine(
...     16,
...     cuts,
...     CutSet.cut_into_windows,
...     duration=30.0
... )
Parameters
  • num_jobs (int) – The number of parallel jobs.

  • manifest (~Manifest) – The manifest to be processed.

  • fn (Callable) – Function or method that transforms the manifest; the first parameter has to be the manifest (for methods, they have to be defined on that manifest’s type, e.g. CutSet.cut_into_windows).

  • args – positional arguments to fn.

  • kwargs – keyword arguments to fn.

Return type

~Manifest

lhotse.manipulation.to_manifest(items)[source]

Take an iterable of Lhotse data types such as Recording, SupervisionSegment or Cut, and create a manifest of the corresponding type. When the iterable is empty, returns None.

Return type

Optional[~Manifest]
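
Example (a minimal sketch; the wav_paths list is illustrative):

>>> from lhotse import Recording
>>> from lhotse.manipulation import to_manifest
>>> recording_set = to_manifest(Recording.from_file(p) for p in wav_paths)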