Feature extraction
Lhotse provides the following feature extractor implementations:
- Log-Mel filter-bank (Fbank) and MFCC (Mfcc) PyTorch implementations. They are very close to Kaldi’s, and their underlying components are PyTorch modules that can be used as layers in neural networks (i.e., they support batching, GPUs, autograd, and TorchScript). These classes are found in lhotse.features.kaldi.layers (in particular, Wav2LogFilterBank and Wav2MFCC). We also provide online inference methods to support deployment in audio streaming applications.
- Torchaudio Kaldi-compatible extractors TorchaudioFbank, TorchaudioMfcc, and Spectrogram. They only support processing one utterance at a time (batching is not possible).
- Librosa-compatible filter-bank feature extractor LibrosaFbank (compatible with the one used in the ESPnet and ParallelWaveGAN projects for TTS and vocoders).
- kaldifeat: another Kaldi-compatible feature extraction implementation that can process batches of uneven lengths efficiently, implemented in C++ with Python wrappers.
- opensmile: a wrapper over a popular set of feature extractors, often used in modeling non-verbal aspects of speech (e.g., emotion recognition).
We also support custom-defined feature extractors via a Python API.
We are striving for a simple relation between the audio duration, the number of frames, and the frame shift (with a known sampling rate):
num_samples = round(duration * sampling_rate)
window_hop = round(frame_shift * sampling_rate)
num_frames = int((num_samples + window_hop // 2) // window_hop)
This is equivalent to setting Kaldi’s snip_edges parameter to False, and Lhotse expects every feature extractor to conform to that requirement.
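The relation above can be written as a small helper. This is a minimal sketch: the function name num_frames_for is hypothetical and is not claimed to be part of the Lhotse API; it only restates the three formulas.

```python
def num_frames_for(duration: float, frame_shift: float, sampling_rate: int) -> int:
    """Number of frames for a given duration, assuming snip_edges=False semantics.

    Hypothetical helper name; it mirrors the three formulas in the text.
    """
    num_samples = round(duration * sampling_rate)
    window_hop = round(frame_shift * sampling_rate)
    # Adding half a hop before the integer division rounds to the nearest frame.
    return (num_samples + window_hop // 2) // window_hop


if __name__ == "__main__":
    # 16.04 s of 16 kHz audio with a 10 ms frame shift
    print(num_frames_for(16.04, 0.01, 16000))
```

With these inputs, num_samples is 256640 and window_hop is 160, so the result is 1604 frames.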
Storing features
Features in Lhotse are stored as numpy matrices with shape (num_frames, num_features).
By default, we use lilcom for lossy compression, which reduces the size on disk by about 3x.
The lilcom compression method uses a fixed precision that doesn’t depend on the magnitude of the values being compressed, so it’s better suited to log-energy features than to energy features.
By default, we store these matrices in archives with our own custom format that allows efficient reads of chunks compressed with lilcom. Other options such as HDF5 are also available.
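To illustrate why a fixed precision favors log-energy features, here is a simplified sketch. It is not lilcom's actual algorithm: it only rounds values to a fixed step (the TICK constant and both function names are assumptions made for this illustration) and compares the effect on a small raw energy and on its logarithm.

```python
import math

# Hypothetical fixed quantization step, chosen arbitrarily for the illustration.
TICK = 2.0 ** -8


def quantize(x: float, tick: float = TICK) -> float:
    """Round x to the nearest multiple of tick (fixed absolute precision)."""
    return round(x / tick) * tick


def absolute_error(x: float, tick: float = TICK) -> float:
    """Absolute error introduced by quantize(); never exceeds tick / 2."""
    return abs(quantize(x, tick) - x)


if __name__ == "__main__":
    for energy in (1e-6, 1e-3, 1.0, 1e3):
        log_energy = math.log(energy)
        print(
            f"energy={energy:g} -> {quantize(energy):g}, "
            f"log-energy={log_energy:.4f} -> {quantize(log_energy):.4f}"
        )
```

In this sketch a raw energy of 1e-6 is rounded to zero and lost entirely, while its logarithm is stored with an absolute error of at most half a tick. Log-energies keep a roughly constant relative error in the energy domain, whatever the magnitude.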
There are two types of manifests:
one describing the feature extractor;
one describing the extracted feature matrices.
The feature extractor manifest is mapped to a Python configuration dataclass. An example for spectrogram:
dither: 0.0
energy_floor: 1e-10
frame_length: 0.025
frame_shift: 0.01
min_duration: 0.0
preemphasis_coefficient: 0.97
raw_energy: true
remove_dc_offset: true
round_to_power_of_two: true
window_type: povey
type: spectrogram
And the corresponding configuration class:
class lhotse.features.SpectrogramConfig(sampling_rate=16000, frame_length=0.025, frame_shift=0.01, round_to_power_of_two=True, remove_dc_offset=True, preemph_coeff=0.97, window_type='povey', dither=0.0, snip_edges=False, energy_floor=1e-10, raw_energy=True, use_energy=False, use_fft_mag=False, device='cpu')
- sampling_rate: int = 16000
- frame_length: float = 0.025
- frame_shift: float = 0.01
- round_to_power_of_two: bool = True
- remove_dc_offset: bool = True
- preemph_coeff: float = 0.97
- window_type: str = 'povey'
- dither: float = 0.0
- snip_edges: bool = False
- energy_floor: float = 1e-10
- raw_energy: bool = True
- use_energy: bool = False
- use_fft_mag: bool = False
- device: str = 'cpu'
- to_dict(): return type Dict[str, Any]
- static from_dict(data): return type SpectrogramConfig
The feature matrices manifest is a list of documents.
These documents contain the information necessary to tie the features to a particular recording: start, duration, channels, and recording_id.
They also provide some useful information, such as the type of features, the number of frames, and the feature dimension.
Finally, they specify how the feature matrix is stored with storage_type (currently numpy or lilcom), and where to find it with storage_path. In the future there might be more storage types.
- channels: 0
  duration: 16.04
  num_features: 23
  num_frames: 1604
  recording_id: recording-1
  start: 0.0
  storage_path: test/fixtures/libri/storage/dc2e0952-f2f8-423c-9b8c-f5481652ee1d.llc
  storage_type: lilcom
  type: fbank
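As a rough illustration of how the fields of such a document relate to each other, the sketch below loads the example entry into a plain dataclass. The FeatureEntry class and its properties are hypothetical and are not Lhotse's own manifest class; they only mirror the keys shown above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class FeatureEntry:
    """Hypothetical container mirroring the keys of one manifest document."""

    channels: int
    duration: float
    num_features: int
    num_frames: int
    recording_id: str
    start: float
    storage_path: str
    storage_type: str
    type: str

    @property
    def end(self) -> float:
        """End time (in seconds) of the features within the recording."""
        return self.start + self.duration

    @property
    def shape(self) -> Tuple[int, int]:
        """Expected shape of the stored matrix: (num_frames, num_features)."""
        return (self.num_frames, self.num_features)

    @property
    def frame_shift(self) -> float:
        """Frame shift (in seconds) implied by the duration and frame count."""
        return self.duration / self.num_frames


entry = FeatureEntry(
    channels=0,
    duration=16.04,
    num_features=23,
    num_frames=1604,
    recording_id="recording-1",
    start=0.0,
    storage_path="test/fixtures/libri/storage/dc2e0952-f2f8-423c-9b8c-f5481652ee1d.llc",
    storage_type="lilcom",
    type="fbank",
)

if __name__ == "__main__":
    print(entry.shape, entry.end, round(entry.frame_shift, 6))
```

For the example entry, the implied frame shift is 10 ms, which matches the default frame_shift of 0.01.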
Feature normalization
We will briefly discuss how to perform mean and variance normalization (a.k.a. CMVN) in Lhotse effectively. We compute and store unnormalized features, and it is up to the user to normalize them if they want to do so. There are three common ways to perform feature normalization:
- Global normalization: we compute the means and variances using the whole data (FeatureSet or CutSet), and apply the same transform on every sample. The global statistics can be computed efficiently with FeatureSet.compute_global_stats() or CutSet.compute_global_feature_stats(). They use an iterative algorithm that does not require loading the whole dataset into memory.
- Per-instance normalization: we compute the means and variances separately for each data sample (i.e., a single feature matrix). Each feature matrix undergoes a different transform. This approach seems to be common in computer vision modeling.
- Sliding window (“online”) normalization: we compute the means and variances using a slice of the feature matrix with a specified duration, e.g. 3 seconds (a standard value in Kaldi). This is useful when we expect the model to work on incomplete inputs, e.g. streaming speech recognition. We currently recommend using Torchaudio CMVN for that.
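The difference between the first two options can be sketched in plain Python. This is an illustration only and not how FeatureSet.compute_global_stats() is implemented: the function names are hypothetical, matrices are nested lists of shape (num_frames, num_features), and the running sums are a simple (numerically naive) way to avoid holding all matrices in memory at once.

```python
import math
from typing import Iterable, List, Tuple

Matrix = List[List[float]]  # (num_frames, num_features)


def accumulate_stats(matrices: Iterable[Matrix]) -> Tuple[List[float], List[float]]:
    """Per-feature means and standard deviations, visiting one matrix at a time."""
    count = 0
    total: List[float] = []
    total_sq: List[float] = []
    for matrix in matrices:
        for frame in matrix:
            if not total:
                total = [0.0] * len(frame)
                total_sq = [0.0] * len(frame)
            for i, value in enumerate(frame):
                total[i] += value
                total_sq[i] += value * value
            count += 1
    if count == 0:
        raise ValueError("no frames to compute statistics from")
    means = [s / count for s in total]
    # Clamp at zero to guard against tiny negative values from rounding.
    stds = [math.sqrt(max(sq / count - m * m, 0.0)) for sq, m in zip(total_sq, means)]
    return means, stds


def normalize(matrix: Matrix, means: List[float], stds: List[float], eps: float = 1e-8) -> Matrix:
    """Apply mean and variance normalization with the given statistics."""
    return [
        [(value - m) / (s + eps) for value, m, s in zip(frame, means, stds)]
        for frame in matrix
    ]


def normalize_per_instance(matrix: Matrix) -> Matrix:
    """Per-instance variant: statistics come from the matrix itself."""
    means, stds = accumulate_stats([matrix])
    return normalize(matrix, means, stds)


if __name__ == "__main__":
    a = [[0.0, 10.0], [2.0, 10.0]]
    b = [[4.0, 14.0], [6.0, 14.0]]
    global_means, global_stds = accumulate_stats([a, b])
    print(normalize(a, global_means, global_stds))  # same transform for a and b
    print(normalize_per_instance(a))                # transform specific to a
```

Global normalization reuses one pair of statistics for every matrix, whereas the per-instance variant recomputes them for each matrix, so two matrices with different offsets end up on the same scale.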
Python usage
Typically you’ll want to extract features from cuts. In case of long recordings, it is fine to extract the features for long-recording cuts, and cut those into shorter segments later. Our default feature storage mechanism is fairly efficient when reading chunks.
from lhotse import CutSet, Fbank

cuts = CutSet.from_file("data/cuts.jsonl.gz")

# Create a log Mel energy filter bank feature extractor with default settings
fbank = Fbank()

# Compute features for cuts with 8 parallel jobs and return a new CutSet which
# references those features.
cuts = cuts.compute_and_store_features(
    extractor=fbank,
    storage_path="data/fbank",
    num_jobs=8,
)
cuts.to_file("data/cuts_fbank.jsonl.gz")
CLI usage
An equivalent example using the terminal:
lhotse feat write-default-config feat-config.yml
lhotse feat extract-cuts -j 8 -f feat-config.yml \
data/cuts.jsonl.gz data/cuts_fbank.jsonl.gz data/fbank
Kaldi compatibility caveats
Most of the spectrogram/fbank/mfcc parameters are the same as in Kaldi. However, we are not fully compatible: Kaldi computes energies from a signal scaled between -32,768 and 32,767, while we scale the signal between -1.0 and 1.0. As a result, Kaldi energies are significantly greater than those in Lhotse. Also, by default, we turn off dithering for deterministic feature extraction.
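The size of that difference can be estimated with a simplified model. The sketch below assumes power-based (squared) energies and ignores dithering, energy flooring, and any other nonlinear processing steps; under those assumptions, scaling the signal by 32768 adds a constant of 2 * ln(32768) to every log-energy. The names used here are hypothetical and are not Lhotse or Kaldi APIs.

```python
import math
from typing import Sequence

KALDI_SCALE = 32768.0  # 16-bit integer range versus the [-1.0, 1.0] float range


def log_energy(samples: Sequence[float]) -> float:
    """Natural log of the sum of squared samples (simplified energy model)."""
    return math.log(sum(s * s for s in samples))


samples = [0.1, -0.25, 0.5, -0.05]  # arbitrary samples in the [-1.0, 1.0] range
float_scale = log_energy(samples)
int16_scale = log_energy([s * KALDI_SCALE for s in samples])

offset = int16_scale - float_scale
expected_offset = 2.0 * math.log(KALDI_SCALE)

if __name__ == "__main__":
    print(f"offset={offset:.4f}, expected={expected_offset:.4f}")
```

Under these assumptions the offset is about 20.79 in the natural-log domain, regardless of the signal content.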