Feature extraction¶
Feature extraction in Lhotse is currently based exclusively on the Torchaudio library. We support spectrograms, log-Mel energies (fbank), and MFCCs. Fbank is the default feature type. We also support custom feature extractors via a Python API (they are not available in the CLI, unless there is popular demand for that).
We strive for a simple relation between the audio duration, the number of frames, and the frame shift: you only need to know two of these values to compute the third, regardless of the frame length. This is equivalent to setting Kaldi's snip_edges parameter to False.
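The relation above can be sketched in a few lines. The helper name `num_frames_for` is hypothetical and not part of Lhotse's API; it only illustrates the contract:

```python
# A minimal sketch of the duration <-> frame count relation described above.
# `num_frames_for` is an illustrative helper, not a Lhotse function.

def num_frames_for(duration: float, frame_shift: float) -> int:
    """With snip_edges=False semantics, the frame count depends only on
    the duration and the frame shift, not on the frame length."""
    return round(duration / frame_shift)

# Knowing any two of (duration, num_frames, frame_shift) yields the third:
assert num_frames_for(duration=16.04, frame_shift=0.01) == 1604
assert abs(1604 * 0.01 - 16.04) < 1e-9  # duration recovered from frames and shift
```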
Storing features¶
Features in Lhotse are stored as numpy matrices with shape (num_frames, num_features). By default, we use lilcom for lossy compression, which reduces the size on disk by about 3x. The lilcom compression method uses a fixed precision that does not depend on the magnitude of the values being compressed, so it is better suited to log-energy features than to raw energy features.
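The intuition can be demonstrated with a toy quantizer (this is not lilcom itself; the step size is illustrative): quantizing with a fixed absolute precision destroys small raw energies in relative terms, while in log space all magnitudes suffer the same bounded error.

```python
import numpy as np

rng = np.random.default_rng(0)
energies = 10.0 ** rng.uniform(-4, 2, size=1000)  # wide dynamic range

step = 0.01  # hypothetical fixed quantization step

def quantize(x: np.ndarray, step: float) -> np.ndarray:
    # Round each value to the nearest multiple of `step`.
    return np.round(x / step) * step

# Relative error when quantizing raw energies:
raw_err = np.abs(quantize(energies, step) - energies) / energies

# Error (in the log domain) when quantizing log-energies:
log_e = np.log(energies)
log_err = np.abs(quantize(log_e, step) - log_e)

print(raw_err.max())  # large: small energies round all the way to zero
print(log_err.max())  # bounded by step / 2 for every magnitude
```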
We currently support two kinds of storage:

- HDF5 files with multiple feature matrices
- a directory with one feature matrix per file
We retrieve the arrays by loading the whole feature matrix from disk and selecting the relevant region (e.g. specified by a cut). Therefore it makes sense to cut the recordings first, and then extract the features for them to avoid loading unnecessary data from disk (especially for very long recordings).
There are two types of manifests:

- one describing the feature extractor;
- one describing the extracted feature matrices.
The feature extractor manifest is mapped to a Python configuration dataclass. An example for spectrogram:
```yaml
dither: 0.0
energy_floor: 1e-10
frame_length: 0.025
frame_shift: 0.01
min_duration: 0.0
preemphasis_coefficient: 0.97
raw_energy: true
remove_dc_offset: true
round_to_power_of_two: true
window_type: povey
type: spectrogram
```
And the corresponding configuration class:
```python
class lhotse.features.SpectrogramConfig(
    dither: float = 0.0,
    window_type: str = 'povey',
    frame_length: float = 0.025,
    frame_shift: float = 0.01,
    remove_dc_offset: bool = True,
    round_to_power_of_two: bool = True,
    energy_floor: float = 1e-10,
    min_duration: float = 0.0,
    preemphasis_coefficient: float = 0.97,
    raw_energy: bool = True,
)
```
The feature matrices manifest is a list of documents. These documents contain the information necessary to tie the features to a particular recording: start, duration, channel and recording_id. They currently do not have their own IDs. They also provide some useful information, such as the type of features, the number of frames, and the feature dimension. Finally, they specify how the feature matrix is stored with storage_type (currently numpy or lilcom), and where to find it with storage_path. In the future there might be more storage types.
```yaml
- channels: 0
  duration: 16.04
  num_features: 23
  num_frames: 1604
  recording_id: recording-1
  start: 0.0
  storage_path: test/fixtures/libri/storage/dc2e0952-f2f8-423c-9b8c-f5481652ee1d.llc
  storage_type: lilcom
  type: fbank
```
Creating a custom feature extractor¶
There are two components needed to implement a custom feature extractor: a configuration and the extractor itself.
We expect the configuration class to be a dataclass, so that it can be automatically mapped to dict and serialized.
The feature extractor should inherit from FeatureExtractor and implement a small number of methods and properties. The base class takes care of initialization (you need to pass a config object), serialization to YAML, etc.
A minimal, complete example of adding a new feature extractor:
```python
import numpy as np
from dataclasses import dataclass
from scipy.signal import stft

from lhotse.features import FeatureExtractor
from lhotse.utils import Seconds


@dataclass
class ExampleFeatureExtractorConfig:
    frame_len: Seconds = 0.025
    frame_shift: Seconds = 0.01


class ExampleFeatureExtractor(FeatureExtractor):
    """
    A minimal class example, showing how to implement a custom feature extractor in Lhotse.
    """
    name = 'example-feature-extractor'
    config_type = ExampleFeatureExtractorConfig

    def extract(self, samples: np.ndarray, sampling_rate: int) -> np.ndarray:
        f, t, Zxx = stft(
            samples,
            sampling_rate,
            nperseg=round(self.config.frame_len * sampling_rate),
            # In scipy's stft, noverlap is the number of *overlapping* samples,
            # i.e. frame length minus frame shift:
            noverlap=round((self.config.frame_len - self.config.frame_shift) * sampling_rate)
        )
        # Note: returning a magnitude of the STFT might interact badly with lilcom compression,
        # as it performs quantization of the float values and works best with log-scale quantities.
        # It's advised to turn lilcom compression off, or use log-scale, in such cases.
        return np.abs(Zxx)

    @property
    def frame_shift(self) -> Seconds:
        return self.config.frame_shift

    def feature_dim(self, sampling_rate: int) -> int:
        # Number of one-sided FFT bins; integer, not float.
        return round(self.config.frame_len * sampling_rate) // 2 + 1
```
The overridden members include:

- name for easy debuggability/automatic re-creation of an extractor;
- config_type which specifies the complementary configuration class type;
- extract() where the actual computation takes place;
- the frame_shift property, which is key to knowing the relationship between the duration and the number of frames;
- the feature_dim() method, which accepts the sampling_rate as its argument, as some types of features (e.g. spectrogram) will depend on it.
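The contract between frame_shift and feature_dim() can be checked with a numpy-only sketch. This is a simplified stand-in for an STFT-based extractor (no windowing or padding); all names and parameter values here are illustrative:

```python
import numpy as np

# Illustrative parameters matching the example config above.
frame_len, frame_shift, sampling_rate = 0.025, 0.01, 16000
nperseg = round(frame_len * sampling_rate)  # 400 samples per frame
hop = round(frame_shift * sampling_rate)    # 160 samples between frame starts

samples = np.random.default_rng(0).standard_normal(sampling_rate)  # 1 s of noise

# Frame the signal and take a one-sided magnitude FFT per frame
# (a simplified stand-in for scipy.signal.stft, without padding or windowing).
starts = np.arange(0, len(samples) - nperseg + 1, hop)
frames = np.stack([samples[s:s + nperseg] for s in starts])
feats = np.abs(np.fft.rfft(frames, axis=1))

# feature_dim() for a spectrogram is the number of one-sided FFT bins:
assert feats.shape[1] == nperseg // 2 + 1  # 201

# The frame count follows from duration and frame_shift (up to edge effects):
assert abs(feats.shape[0] - len(samples) / hop) <= nperseg // hop + 1
```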
Additionally, there are two extra methods that, when overridden, allow performing dynamic feature-space mixing (see Cuts):
```python
@staticmethod
def mix(features_a: np.ndarray, features_b: np.ndarray, gain_b: float) -> np.ndarray:
    raise ValueError('The feature extractor\'s "mix" operation is undefined.')

@staticmethod
def compute_energy(features: np.ndarray) -> float:
    raise ValueError('The feature extractor\'s "compute_energy" is undefined.')
```
They are:

- mix(), which specifies how to mix two feature matrices to obtain a new feature matrix representing the sum of signals;
- compute_energy(), which specifies how to obtain the total energy of a feature matrix; this is needed to mix two signals at a specified SNR. E.g. for a power spectrogram, this could be the sum of every time-frequency bin. It is expected to never return zero.
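For a power spectrogram, these two methods might look as follows. This is a hedged sketch (the class name and epsilon floor are illustrative, not Lhotse's actual implementation):

```python
import numpy as np

EPSILON = 1e-10  # illustrative floor so that compute_energy never returns zero

class PowerSpectrogramExtractor:
    """A hypothetical extractor showing feature-space mixing for power spectrograms."""

    @staticmethod
    def mix(features_a: np.ndarray, features_b: np.ndarray, gain_b: float) -> np.ndarray:
        # Power quantities are additive, so mixing is a weighted sum of bins.
        return features_a + gain_b * features_b

    @staticmethod
    def compute_energy(features: np.ndarray) -> float:
        # Total energy is the sum over all time-frequency bins, floored above zero.
        return max(EPSILON, float(features.sum()))

a = np.full((10, 5), 2.0)
b = np.full((10, 5), 1.0)
mixed = PowerSpectrogramExtractor.mix(a, b, gain_b=0.5)
print(PowerSpectrogramExtractor.compute_energy(mixed))  # 125.0
```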
During the feature-domain mix with a specified signal-to-noise ratio (SNR), we assume that one of the signals is a reference signal: it is used to initialize the FeatureMixer class. We compute the energy of both signals and scale the non-reference signal so that its energy satisfies the requested SNR. Part of the mixing code looks like this:

```python
reference_feats = self.tracks[0]
num_frames_offset = compute_num_frames(duration=offset, frame_shift=self.frame_shift)
current_num_frames = reference_feats.shape[0]
incoming_num_frames = feats.shape[0] + num_frames_offset
mix_num_frames = max(current_num_frames, incoming_num_frames)
feats_to_add = feats
```
Note that we interpret the energy and the SNR in a power quantity context (as opposed to root-power/field quantities).
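Under this power-quantity convention, the gain follows from the definition of SNR in decibels. This is a sketch of the relationship, with illustrative symbols: \(E_{\mathrm{ref}}\) and \(E_{\mathrm{in}}\) denote the compute_energy() values of the reference and incoming feature matrices, and \(g\) is the gain applied to the incoming matrix.

```latex
\mathrm{SNR}_{\mathrm{dB}} = 10 \log_{10} \frac{E_{\mathrm{ref}}}{g \cdot E_{\mathrm{in}}}
\qquad \Longrightarrow \qquad
g = \frac{E_{\mathrm{ref}}}{10^{\mathrm{SNR}_{\mathrm{dB}} / 10} \cdot E_{\mathrm{in}}}
```

Because the features are power quantities, \(g\) multiplies the feature matrix directly; no square root is involved, as it would be for amplitude-scale (root-power) signals.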
Feature normalization¶
We will briefly discuss how to perform mean and variance normalization (a.k.a. CMVN) effectively in Lhotse. We compute and store unnormalized features, and it is up to the user to normalize them if they want to do so. There are three common ways to perform feature normalization:

- Global normalization: we compute the means and variances using the whole data (FeatureSet or CutSet), and apply the same transform on every sample. The global statistics can be computed efficiently with FeatureSet.compute_global_stats() or CutSet.compute_global_feature_stats(). They use an iterative algorithm that does not require loading the whole dataset into memory.
- Per-instance normalization: we compute the means and variances separately for each data sample (i.e. a single feature matrix). Each feature matrix undergoes a different transform. This approach seems to be common in computer vision modelling.
- Sliding window ("online") normalization: we compute the means and variances using a slice of the feature matrix with a specified duration, e.g. 3 seconds (a standard value in Kaldi). This is useful when we expect the model to work on incomplete inputs, e.g. streaming speech recognition. We currently recommend using Torchaudio CMVN for that.
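Per-instance normalization is easy to apply yourself once the features are loaded. A minimal sketch, where `normalize` is an illustrative helper and not part of Lhotse's API:

```python
import numpy as np

def normalize(feats: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    # Statistics are computed per feature dimension, across the time axis.
    mean = feats.mean(axis=0, keepdims=True)
    std = feats.std(axis=0, keepdims=True)
    return (feats - mean) / (std + eps)

# E.g. a (num_frames, num_features) fbank matrix loaded from a Features manifest:
feats = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=(1604, 23))
normed = normalize(feats)
# Each feature dimension now has approximately zero mean and unit variance.
```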
Storage backend details¶
Lhotse can be extended with additional storage backends via two abstractions: FeaturesWriter and FeaturesReader. We currently implement the following writers (and their corresponding readers):

- lhotse.features.io.LilcomFilesWriter
- lhotse.features.io.NumpyFilesWriter
- lhotse.features.io.LilcomHdf5Writer
- lhotse.features.io.NumpyHdf5Writer

The FeaturesWriter and FeaturesReader API is as follows:
class lhotse.features.io.FeaturesWriter

FeaturesWriter defines the interface of how to store numpy arrays in a particular storage backend. This backend could be:

- separate files on a local filesystem;
- a single file with multiple arrays;
- cloud storage;
- etc.

Each class inheriting from FeaturesWriter must define:

- the write() method, which defines the storing operation (accepts a key used to place the value array in the storage);
- the storage_path() property, which is either a common directory for the files, the name of the file storing multiple arrays, the name of the cloud bucket, etc.;
- the name() property, which is unique to this particular storage mechanism; it is stored in the features manifests (metadata) and used to automatically deduce the backend when loading the features.

Each FeaturesWriter can also be used as a context manager, as some implementations might need to free a resource after the writing is finalized. By default nothing happens in the context manager functions, and this can be modified by the inheriting subclasses.

Example:

```python
with MyWriter('some/path') as storage:
    extractor.extract_from_recording_and_store(recording, storage)
```

The features loading must be defined separately in a class inheriting from FeaturesReader.

abstract property name
    Return type: str

abstract property storage_path
    Return type: str

abstract write(key, value)
    Return type: str
class lhotse.features.io.FeaturesReader

FeaturesReader defines the interface of how to load numpy arrays from a particular storage backend. This backend could be:

- separate files on a local filesystem;
- a single file with multiple arrays;
- cloud storage;
- etc.

Each class inheriting from FeaturesReader must define:

- the read() method, which defines the loading operation (accepts the key to locate the array in the storage and returns it). The read method should support selecting only a subset of the feature matrix, with the bounds expressed as arguments left_offset_frames and right_offset_frames. It is up to the Reader implementation to load only the required part, or to trim it to that range only after loading. It is assumed that the time dimension is always the first one.
- the name() property, which is unique to this particular storage mechanism; it is stored in the features manifests (metadata) and used to automatically deduce the backend when loading the features.

The features writing must be defined separately in a class inheriting from FeaturesWriter.

abstract property name
    Return type: str

abstract read(key, left_offset_frames=0, right_offset_frames=None)
    Return type: ndarray
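As an illustration of this interface, here is a sketch of an in-memory backend. To stay self-contained it does not inherit from Lhotse's actual base classes, and all names (InMemoryWriter, InMemoryReader) are illustrative:

```python
import numpy as np

class InMemoryWriter:
    """Stores arrays in a plain dict; a stand-in for a FeaturesWriter subclass."""

    name = 'in_memory'  # would be recorded as storage_type in the manifests

    def __init__(self, storage_path: str):
        self.storage_path = storage_path
        self.storage = {}

    def write(self, key: str, value: np.ndarray) -> str:
        self.storage[key] = value
        return key  # the key under which the array was stored

    # Context-manager hooks; nothing to clean up for an in-memory backend.
    def __enter__(self):
        return self

    def __exit__(self, *args):
        pass


class InMemoryReader:
    """Loads arrays back, honoring the frame-offset slicing convention."""

    name = 'in_memory'

    def __init__(self, storage: dict):
        self.storage = storage

    def read(self, key: str, left_offset_frames: int = 0,
             right_offset_frames: int = None) -> np.ndarray:
        # Time is the first dimension, so slicing selects a span of frames.
        return self.storage[key][left_offset_frames:right_offset_frames]


with InMemoryWriter('unused/path') as storage:
    storage.write('utt-1', np.zeros((100, 23)))

reader = InMemoryReader(storage.storage)
print(reader.read('utt-1', left_offset_frames=10, right_offset_frames=20).shape)  # (10, 23)
```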
Python usage¶
The feature manifest is represented by a FeatureSet object. Feature extractors have a class that represents both the extractor and its configuration, named FeatureExtractor. We provide a utility called FeatureSetBuilder that can process a RecordingSet in parallel, store the feature matrices on disk, and generate a feature manifest.
For example:
```python
from lhotse import RecordingSet, Fbank, LilcomFilesWriter, FeatureSetBuilder

# Read a RecordingSet from disk
recording_set = RecordingSet.from_yaml('audio.yml')

# Create a log-Mel energy filter bank feature extractor with default settings
feature_extractor = Fbank()

# Create a feature set builder that uses this extractor and stores the results
# in a directory called 'features'
with LilcomFilesWriter('features') as storage:
    builder = FeatureSetBuilder(feature_extractor=feature_extractor, storage=storage)
    # Extract the features using 8 parallel processes, compress,
    # and store them in the 'features/storage/' directory.
    # Then, return the feature manifest object, which is also compressed
    # and stored in 'features/feature_manifest.json.gz'
    feature_set = builder.process_and_store_recordings(
        recordings=recording_set,
        num_jobs=8
    )
```
CLI usage¶
An equivalent example using the terminal:
```
lhotse write-default-feature-config feat-config.yml
lhotse make-feats -j 8 --storage-type lilcom_files -f feat-config.yml audio.yml features/
```
Kaldi compatibility caveats¶
We rely on the Torchaudio Kaldi compatibility module, so most of the spectrogram/fbank/mfcc parameters are the same as in Kaldi. However, we are not fully compatible: Kaldi computes energies from a signal scaled between -32,768 and 32,767, while Torchaudio scales the signal between -1.0 and 1.0. This results in Kaldi energies being significantly greater than in Lhotse. By default, we turn off dithering for deterministic feature extraction.