Feature extraction

Feature extraction in Lhotse is currently based exclusively on the Torchaudio library. We support spectrograms, log-Mel energies (fbank) and MFCCs. Fbank features are the default. We also support custom-defined feature extractors via a Python API (these won’t be available in the CLI unless there is popular demand for that).

We strive for a simple relation between the audio duration, the number of frames, and the frame shift: you only need to know two of those values to compute the third, regardless of the frame length. This is equivalent to setting Kaldi’s snip_edges parameter to False.
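The relation between duration, frame count and frame shift can be sketched in a few lines of Python. The helper names below are hypothetical (they are not part of Lhotse's API), and rounding half up to the nearest whole frame is an assumption made for illustration.

```python
from decimal import ROUND_HALF_UP, Decimal


def num_frames_from(duration: float, frame_shift: float) -> int:
    """Number of frames for `duration` seconds (hypothetical helper)."""
    # Decimal arithmetic avoids float division landing slightly off a whole number.
    ratio = Decimal(str(duration)) / Decimal(str(frame_shift))
    # Assumption: round half up to the nearest whole frame.
    return int(ratio.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def duration_from(num_frames: int, frame_shift: float) -> float:
    """Duration in seconds covered by `num_frames` frames."""
    return float(Decimal(num_frames) * Decimal(str(frame_shift)))


def frame_shift_from(duration: float, num_frames: int) -> float:
    """Frame shift in seconds implied by a duration and a frame count."""
    return float(Decimal(str(duration)) / Decimal(num_frames))


# Note that the frame length does not appear anywhere in these formulas.
print(num_frames_from(16.04, 0.01))
print(duration_from(1604, 0.01))
print(frame_shift_from(16.04, 1604))
```

With a 0.01 s frame shift, a 16.04 s recording maps to 1604 frames, and each of the three values can be recovered from the other two.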

Storing features

Features in Lhotse are stored as numpy matrices with shape (num_frames, num_features). By default, we use lilcom for lossy compression, which reduces the size on disk by about 3x. The lilcom compression method uses a fixed precision that doesn’t depend on the magnitude of the values being compressed, so it’s better suited to log-energy features than to energy features. We currently support two kinds of storage:

  • HDF5 files with multiple feature matrices

  • a directory with one feature matrix per file
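To illustrate the remark above about fixed-precision compression, the sketch below rounds values to a fixed step and compares the relative error for raw energies and for log-energies. This is not lilcom's algorithm; the step size TICK and the helper names are assumptions chosen only to show why a magnitude-independent precision favours log-scale features.

```python
import math

TICK = 2.0 ** -5  # hypothetical fixed quantization step, for illustration only


def quantize(x: float, tick: float = TICK) -> float:
    """Round x to the nearest multiple of a fixed step."""
    return round(x / tick) * tick


def relative_error(true: float, approx: float) -> float:
    """Absolute error divided by the magnitude of the true value."""
    return abs(approx - true) / abs(true)


energies = [1e-4, 1e-2, 1.0, 100.0]

# Quantizing raw energies: values much smaller than the step collapse to zero.
linear_errors = [relative_error(e, quantize(e)) for e in energies]

# Quantizing log-energies, then mapping back to the energy domain:
# the relative error stays bounded regardless of the magnitude.
log_errors = [relative_error(e, math.exp(quantize(math.log(e)))) for e in energies]

print(max(linear_errors), max(log_errors))
```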

We retrieve the arrays by loading the whole feature matrix from disk and selecting the relevant region (e.g. specified by a cut). Therefore it makes sense to cut the recordings first, and then extract the features for them to avoid loading unnecessary data from disk (especially for very long recordings).
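A minimal sketch of that retrieval step, assuming the whole matrix is already in memory as a numpy array; select_region is a hypothetical helper, not a Lhotse function. It shows that the full matrix has to be loaded even when only a short region is needed.

```python
import numpy as np


def select_region(feats: np.ndarray, frame_shift: float, start: float, duration: float) -> np.ndarray:
    """Return the frames covering [start, start + duration) seconds (hypothetical helper)."""
    first = round(start / frame_shift)
    last = first + round(duration / frame_shift)
    # The time dimension is the first one: (num_frames, num_features).
    return feats[first:last]


# Stand-in for a feature matrix loaded from disk: 1604 frames, 23 features.
full = np.arange(1604 * 23, dtype=np.float32).reshape(1604, 23)
region = select_region(full, frame_shift=0.01, start=2.0, duration=3.5)
print(region.shape)  # (350, 23)
```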

There are two types of manifests:

  • one describing the feature extractor;

  • one describing the extracted feature matrices.

The feature extractor manifest is mapped to a Python configuration dataclass. An example for spectrogram:

dither: 0.0
energy_floor: 1e-10
frame_length: 0.025
frame_shift: 0.01
min_duration: 0.0
preemphasis_coefficient: 0.97
raw_energy: true
remove_dc_offset: true
round_to_power_of_two: true
window_type: povey
type: spectrogram

And the corresponding configuration class:

class lhotse.features.SpectrogramConfig(dither: float = 0.0, window_type: str = 'povey', frame_length: float = 0.025, frame_shift: float = 0.01, remove_dc_offset: bool = True, round_to_power_of_two: bool = True, energy_floor: float = 1e-10, min_duration: float = 0.0, preemphasis_coefficient: float = 0.97, raw_energy: bool = True)
dither: float = 0.0
window_type: str = 'povey'
frame_length: float = 0.025
frame_shift: float = 0.01
remove_dc_offset: bool = True
round_to_power_of_two: bool = True
energy_floor: float = 1e-10
min_duration: float = 0.0
preemphasis_coefficient: float = 0.97
raw_energy: bool = True

The feature matrices manifest is a list of documents. These documents contain the information necessary to tie the features to a particular recording: start, duration, channel and recording_id. They currently do not have their own IDs. They also provide some useful information, such as the type of features, number of frames and feature dimension. Finally, they specify how the feature matrix is stored with storage_type (currently numpy or lilcom), and where to find it with the storage_path. In the future there might be more storage types.

- channels: 0
  duration: 16.04
  num_features: 23
  num_frames: 1604
  recording_id: recording-1
  start: 0.0
  storage_path: test/fixtures/libri/storage/dc2e0952-f2f8-423c-9b8c-f5481652ee1d.llc
  storage_type: lilcom
  type: fbank

Creating custom feature extractor

There are two components needed to implement a custom feature extractor: a configuration and the extractor itself. We expect the configuration class to be a dataclass, so that it can be automatically mapped to dict and serialized. The feature extractor should inherit from FeatureExtractor, and implement a small number of methods/properties. The base class takes care of initialization (you need to pass a config object), serialization to YAML, etc. A minimal, complete example of adding a new feature extractor:

from dataclasses import dataclass

import numpy as np
from scipy.signal import stft

from lhotse.features import FeatureExtractor
from lhotse.utils import Seconds


@dataclass
class ExampleFeatureExtractorConfig:
    frame_len: Seconds = 0.025
    frame_shift: Seconds = 0.01


class ExampleFeatureExtractor(FeatureExtractor):
    """
    A minimal class example, showing how to implement a custom feature extractor in Lhotse.
    """
    name = 'example-feature-extractor'
    config_type = ExampleFeatureExtractorConfig

    def extract(self, samples: np.ndarray, sampling_rate: int) -> np.ndarray:
        frame_len_samples = round(self.config.frame_len * sampling_rate)
        frame_shift_samples = round(self.frame_shift * sampling_rate)
        f, t, Zxx = stft(
            samples,
            sampling_rate,
            nperseg=frame_len_samples,
            # scipy expects the number of overlapping samples, i.e. frame length minus frame shift.
            noverlap=frame_len_samples - frame_shift_samples
        )
        # Note: returning a magnitude of the STFT might interact badly with lilcom compression,
        # as it performs quantization of the float values and works best with log-scale quantities.
        # It's advised to turn lilcom compression off, or use log-scale, in such cases.
        return np.abs(Zxx)

    @property
    def frame_shift(self) -> Seconds:
        return self.config.frame_shift

    def feature_dim(self, sampling_rate: int) -> int:
        # Number of STFT frequency bins for a window of that many samples.
        return round(sampling_rate * self.config.frame_len) // 2 + 1

The overridden members include:

  • name for easy debuggability/automatic re-creation of an extractor;

  • config_type which specifies the complementary configuration class type;

  • extract() where the actual computation takes place;

  • frame_shift property, which is key to know the relationship between the duration and the number of frames;

  • feature_dim() method, which accepts the sampling_rate as its argument, as some types of features (e.g. spectrogram) will depend on that.

Additionally, there are two extra methods that, when overridden, make it possible to perform dynamic feature-space mixing (see Cuts):

def mix(features_a: np.ndarray, features_b: np.ndarray, gain_b: float) -> np.ndarray:
    raise ValueError(f'The feature extractor\'s "mix" operation is undefined.')

def compute_energy(features: np.ndarray) -> float:
    raise ValueError(f'The feature extractor\'s "compute_energy" is undefined.')

They are:

  • mix() which specifies how to mix two feature matrices to obtain a new feature matrix representing the sum of signals;

  • compute_energy() which specifies how to obtain a total energy of the feature matrix, which is needed to mix two signals with a specified SNR. E.g. for a power spectrogram, this could be the sum of every time-frequency bin. It is expected to never return a zero.

During the feature-domain mix with a specified signal-to-noise ratio (SNR), we assume that one of the signals is a reference signal - it is used to initialize the FeatureMixer class. We compute the energy of both signals and scale the non-reference signal by a gain factor, so that its energy satisfies the requested SNR. The excerpt below shows how the frame counts of the reference and the incoming features are determined before mixing:

reference_feats = self.tracks[0]
num_frames_offset = compute_num_frames(duration=offset, frame_shift=self.frame_shift)
current_num_frames = reference_feats.shape[0]
incoming_num_frames = feats.shape[0] + num_frames_offset
mix_num_frames = max(current_num_frames, incoming_num_frames)

feats_to_add = feats

Note that we interpret the energy and the SNR in a power quantity context (as opposed to root-power/field quantities).
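Under this power-quantity interpretation, the SNR in decibels is 10 * log10(reference_energy / scaled_energy), and the gain follows by solving for the scaled energy. The function below is a sketch of that arithmetic only; snr_gain and its parameter names are hypothetical and are not taken from the FeatureMixer implementation.

```python
import math


def snr_gain(reference_energy: float, added_energy: float, snr_db: float) -> float:
    """Energy gain for the non-reference signal so that the mix meets snr_db (hypothetical helper)."""
    # Energy the added signal should have for the requested SNR (power quantities: 10 * log10).
    target_energy = reference_energy / (10.0 ** (snr_db / 10.0))
    # added_energy must be non-zero, which is why compute_energy() is expected never to return zero.
    return target_energy / added_energy


gain = snr_gain(reference_energy=100.0, added_energy=50.0, snr_db=10.0)

# Check: after scaling, the energy ratio gives back the requested SNR.
achieved_snr_db = 10.0 * math.log10(100.0 / (gain * 50.0))
print(gain, achieved_snr_db)
```

How the gain is applied to the feature matrix itself depends on the feature type, which is what the extractor's mix() method is meant to define.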

Feature normalization

We will briefly discuss how to perform mean and variance normalization (a.k.a. CMVN) in Lhotse effectively. We compute and store unnormalized features, and it is up to the user to normalize them if they want to do so. There are three common ways to perform feature normalization:

  • Global normalization: we compute the means and variances using the whole data (FeatureSet or CutSet), and apply the same transform on every sample. The global statistics can be computed efficiently with FeatureSet.compute_global_stats() or CutSet.compute_global_feature_stats(). They use an iterative algorithm that does not require loading the whole dataset into memory.

  • Per-instance normalization: we compute the means and variances separately for each data sample (i.e. a single feature matrix). Each feature matrix undergoes a different transform. This approach seems to be common in computer vision modelling.

  • Sliding window (“online”) normalization: we compute the means and variances using a slice of the feature matrix with a specified duration, e.g. 3 seconds (a standard value in Kaldi). This is useful when we expect the model to work on incomplete inputs, e.g. streaming speech recognition. We currently recommend using Torchaudio CMVN for that.
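The three strategies can be sketched with plain numpy as follows. These helpers are hypothetical and do not reproduce the algorithms used by Lhotse, Kaldi or Torchaudio; in particular, the sliding-window variant assumes a simple causal window that looks only at past frames.

```python
import numpy as np

EPS = 1e-8  # guards against division by zero for constant feature dimensions


def global_stats(matrices):
    """Per-dimension mean and std over many (num_frames, num_features) arrays, visited one at a time."""
    total, total_sq, count = 0.0, 0.0, 0
    for feats in matrices:
        total = total + feats.sum(axis=0)
        total_sq = total_sq + (feats ** 2).sum(axis=0)
        count += feats.shape[0]
    mean = total / count
    std = np.sqrt(np.maximum(total_sq / count - mean ** 2, 0.0))
    return mean, std


def normalize(feats, mean, std):
    """Apply mean and variance normalization with the given statistics."""
    return (feats - mean) / (std + EPS)


def per_instance_cmvn(feats):
    """Normalize one feature matrix with its own statistics."""
    return normalize(feats, feats.mean(axis=0), feats.std(axis=0))


def sliding_cmvn(feats, window_frames):
    """Normalize each frame with statistics from the preceding `window_frames` frames (causal sketch)."""
    out = np.empty_like(feats, dtype=np.float64)
    for t in range(feats.shape[0]):
        window = feats[max(0, t - window_frames + 1): t + 1]
        out[t] = normalize(feats[t], window.mean(axis=0), window.std(axis=0))
    return out
```

With a frame shift of 0.01 s, a 3-second window corresponds to window_frames=300. For global normalization, the statistics would be computed once and then passed to normalize() for every feature matrix.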

Storage backend details

Lhotse can be extended with additional storage backends via two abstractions: FeaturesWriter and FeaturesReader. We currently implement the following writers (and their corresponding readers):

  • lhotse.features.io.LilcomFilesWriter

  • lhotse.features.io.NumpyFilesWriter

  • lhotse.features.io.LilcomHdf5Writer

  • lhotse.features.io.NumpyHdf5Writer

The FeaturesWriter and FeaturesReader API is as follows:

class lhotse.features.io.FeaturesWriter

FeaturesWriter defines the interface of how to store numpy arrays in a particular storage backend. This backend could either be:

  • separate files on a local filesystem;

  • a single file with multiple arrays;

  • cloud storage;

  • etc.

Each class inheriting from FeaturesWriter must define:

  • the write() method, which defines the storing operation (accepts a key used to place the value array in the storage);

  • the storage_path() property, which is either a common directory for the files, the name of the file storing multiple arrays, name of the cloud bucket, etc.

  • the name() property that is unique to this particular storage mechanism - it is stored in the features manifests (metadata) and used to automatically deduce the backend when loading the features.

Each FeaturesWriter can also be used as a context manager, as some implementations might need to free a resource after the writing is finalized. By default nothing happens in the context manager functions, and this can be modified by the inheriting subclasses.

with MyWriter('some/path') as storage:
    extractor.extract_from_recording_and_store(recording, storage)

The features loading must be defined separately in a class inheriting from FeaturesReader.

abstract property name

abstract property storage_path

abstract write(key, value)


class lhotse.features.io.FeaturesReader

FeaturesReader defines the interface of how to load numpy arrays from a particular storage backend. This backend could either be:

  • separate files on a local filesystem;

  • a single file with multiple arrays;

  • cloud storage;

  • etc.

Each class inheriting from FeaturesReader must define:

  • the read() method, which defines the loading operation (accepts the key to locate the array in the storage and return it). The read method should support selecting only a subset of the feature matrix, with the bounds expressed as arguments left_offset_frames and right_offset_frames. It’s up to the Reader implementation to load only the required part or trim it to that range only after loading. It is assumed that the time dimension is always the first one.

  • the name() property that is unique to this particular storage mechanism - it is stored in the features manifests (metadata) and used to automatically deduce the backend when loading the features.

The features writing must be defined separately in a class inheriting from FeaturesWriter.
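As an illustration of how a writer and reader pair fits together, here is a self-contained sketch of an in-memory backend. The classes are hypothetical and deliberately do not inherit from the Lhotse base classes so that the example runs on its own; a real backend would subclass FeaturesWriter and FeaturesReader. The name value, the storage_path value, and returning the key from write() are all assumptions.

```python
from typing import Dict, Optional

import numpy as np


class InMemoryWriter:
    """Hypothetical writer that keeps arrays in a dict instead of on disk."""

    def __init__(self, storage: Dict[str, np.ndarray]):
        self._storage = storage

    @property
    def name(self) -> str:
        return "in_memory_sketch"  # made-up backend identifier

    @property
    def storage_path(self) -> str:
        return "memory://sketch"  # made-up location string

    def write(self, key: str, value: np.ndarray) -> str:
        self._storage[key] = np.asarray(value)
        return key  # assumption: the key is what a manifest would record

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return None  # nothing to release for an in-memory backend


class InMemoryReader:
    """Hypothetical reader matching InMemoryWriter."""

    def __init__(self, storage: Dict[str, np.ndarray]):
        self._storage = storage

    @property
    def name(self) -> str:
        return "in_memory_sketch"

    def read(self, key: str, left_offset_frames: int = 0,
             right_offset_frames: Optional[int] = None) -> np.ndarray:
        # Trim along the first (time) dimension after "loading" the whole array.
        return self._storage[key][left_offset_frames:right_offset_frames]


shared: Dict[str, np.ndarray] = {}
with InMemoryWriter(shared) as writer:
    stored_key = writer.write("utt-1", np.ones((100, 23)))
print(InMemoryReader(shared).read(stored_key, 10, 40).shape)  # (30, 23)
```

The matching name on both classes mirrors the role described above: it is what would let the backend be deduced when loading features.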

abstract property name

abstract read(key, left_offset_frames=0, right_offset_frames=None)


Python usage

The feature manifest is represented by a FeatureSet object. Feature extractors have a class that represents both the extractor and its configuration, named FeatureExtractor. We provide a utility called FeatureSetBuilder that can process a RecordingSet in parallel, store the feature matrices on disk and generate a feature manifest.

For example:

from lhotse import RecordingSet, Fbank, FeatureSetBuilder, LilcomFilesWriter

# Read a RecordingSet from disk
recording_set = RecordingSet.from_yaml('audio.yml')
# Create a log Mel energy filter bank feature extractor with default settings
feature_extractor = Fbank()
# Create a feature set builder that uses this extractor and stores the results in a directory called 'features'
with LilcomFilesWriter('features') as storage:
    builder = FeatureSetBuilder(feature_extractor=feature_extractor, storage=storage)
    # Extract the features using 8 parallel processes, compress, and store them in the 'features/storage/' directory.
    # Then, return the feature manifest object, which is also compressed and
    # stored in 'features/feature_manifest.json.gz'
    feature_set = builder.process_and_store_recordings(
        recordings=recording_set,
        num_jobs=8
    )

CLI usage

An equivalent example using the terminal:

lhotse write-default-feature-config feat-config.yml
lhotse make-feats -j 8 --storage-type lilcom_files -f feat-config.yml audio.yml features/

Kaldi compatibility caveats

We rely on Torchaudio’s Kaldi compatibility module, so most of the spectrogram/fbank/mfcc parameters are the same as in Kaldi. However, we are not fully compatible: Kaldi computes energies from a signal scaled between -32,768 and 32,767, while Torchaudio scales the signal between -1.0 and 1.0. As a result, Kaldi energies are significantly greater than those computed in Lhotse. By default, we turn off dithering for deterministic feature extraction.
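The size of that difference can be estimated with a short calculation. The sketch below assumes the sample scaling is the only difference between the two pipelines (no dithering or energy flooring) and uses a toy log-energy function rather than either library's implementation. Scaling a signal by a constant c multiplies its energy by c squared, so log-energies shift by 2 * ln(c).

```python
import math

INT16_SCALE = 32768.0  # factor between the [-1.0, 1.0] range and the 16-bit integer range


def log_energy(samples):
    """Natural log of the sum of squared samples (toy definition for illustration)."""
    return math.log(sum(s * s for s in samples))


unit_range = [0.1, -0.25, 0.5, -0.05]                # samples as scaled by Torchaudio
int16_range = [s * INT16_SCALE for s in unit_range]  # the same samples on Kaldi's scale

offset = log_energy(int16_range) - log_energy(unit_range)
expected = 2.0 * math.log(INT16_SCALE)  # about 20.79
print(offset, expected)
```

Under these assumptions, log-domain features differ by a constant offset, which mean normalization would remove, while linear energies differ by a factor of roughly 10^9.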