Kaldi Interoperability
Data import/export
We support importing Kaldi data directories that contain at least the wav.scp file,
required to create the RecordingSet.
Other files, such as segments, utt2spk, etc. are used to create the SupervisionSet.
We also support converting feats.scp to a FeatureSet, and reading features
directly from Kaldi’s scp/ark files via the kaldi_native_io library (an optional Lhotse dependency).
A pair of RecordingSet and SupervisionSet can also be exported
to a Kaldi data directory.
We currently do not support the following (but may start doing so in the future):
- Exporting Lhotse-extracted features to Kaldi’s feats.scp
- Exporting Lhotse’s multi-channel recording sets to Kaldi
Kaldi feature extractors
We support Kaldi-compatible log-mel filter energies (“fbank”) and MFCCs. We provide a PyTorch implementation that is GPU-compatible and supports batching and backpropagation. To learn more about feature extraction in Lhotse, see Feature extraction.
Python
Python methods related to Kaldi support:
- lhotse.kaldi.floor_duration_to_milliseconds(duration)[source]
- Return type:
float
Floor the duration to multiples of 0.001 seconds. This is to avoid float precision problems with workflows like:
lhotse kaldi import …
lhotse fix …
./local/compute_fbank_imported.py (from icefall)
lhotse cut trim-to-supervisions …
./local/validate_manifest.py … (from icefall)
Without flooring, there were different lengths:
Supervision end time 1093.33995833 is larger than cut end time 1093.3399375
This is still within the 2ms tolerance in K2SpeechRecognitionDataset::validate_for_asr():
https://github.com/lhotse-speech/lhotse/blob/master/lhotse/dataset/speech_recognition.py#L201
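The effect of flooring on the two values from the message above can be reproduced in a few lines of Python. This is a minimal sketch, not the library's code: `floor_to_ms` is a hypothetical stand-in that assumes flooring amounts to truncating to whole milliseconds.

```python
import math


def floor_to_ms(duration: float) -> float:
    """Hypothetical stand-in: floor a duration (seconds) to a multiple of 0.001 s."""
    return math.floor(duration * 1000) / 1000


# The two end times from the validation message above.
supervision_end = 1093.33995833
cut_end = 1093.3399375

# Before flooring, the supervision appears to outlast the cut.
exceeds_before = supervision_end > cut_end

# After flooring, both values land on the same millisecond.
exceeds_after = floor_to_ms(supervision_end) > floor_to_ms(cut_end)
```

Flooring (rather than rounding) is the conservative choice here, since a floored duration can never exceed the original one.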
- lhotse.kaldi.get_duration(path)[source]
Read an audio file. Both Kaldi-style pipe commands and paths to actual audio files are supported.
- Parameters:
path (Union[Path,str]) – Path to an audio file or a Kaldi-style pipe.
- Return type:
Optional[float]
- Returns:
Duration of the recording in seconds as a float, or None in case of a read error.
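For context, a Kaldi wav.scp entry maps a recording ID either to a file path or to a command whose output is piped as audio; by Kaldi convention the latter ends with a `|` character. The sketch below only illustrates how the two cases can be told apart. `parse_wav_scp_line` and `WavScpEntry` are hypothetical names, not part of lhotse.kaldi, and the paths are made up.

```python
from typing import NamedTuple


class WavScpEntry(NamedTuple):
    recording_id: str
    source: str  # file path, or shell command without the trailing "|"
    is_pipe: bool


def parse_wav_scp_line(line: str) -> WavScpEntry:
    """Hypothetical helper: split one wav.scp line into ID and audio source."""
    recording_id, source = line.strip().split(maxsplit=1)
    is_pipe = source.endswith("|")
    if is_pipe:
        # Drop the trailing pipe marker and the whitespace before it.
        source = source[:-1].rstrip()
    return WavScpEntry(recording_id, source, is_pipe)


# Illustrative entries; the paths are placeholders.
plain = parse_wav_scp_line("rec1 /data/audio/rec1.wav")
piped = parse_wav_scp_line("rec2 sph2pipe -f wav -p /data/audio/rec2.sph |")
```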
- lhotse.kaldi.load_kaldi_data_dir(path, sampling_rate, frame_shift=None, map_string_to_underscores=None, use_reco2dur=True, num_jobs=1, feature_type='kaldi-fbank')[source]
Load a Kaldi data directory and convert it to Lhotse RecordingSet and SupervisionSet manifests. For this to work, at least the wav.scp file must exist. A SupervisionSet is created only when a segments file exists. reco2dur is used by default when it exists (to enforce reading the duration from the audio files themselves, set use_reco2dur = False). All the other files (text, utt2spk, etc.) are optional, and some of them might not be handled yet. In particular, feats.scp files are ignored.
- Parameters:
path (Union[Path,str]) – Path to the Kaldi data directory.
sampling_rate (int) – Sampling rate of the recordings.
frame_shift (Optional[float]) – Optional; if specified, we will create a Features manifest and store the frame_shift value in it.
map_string_to_underscores (Optional[str]) – Optional string; when specified, we will replace all instances of this string in SupervisionSegment IDs with underscores. This helps with handling underscores in Kaldi (see export_to_kaldi()). The same replacement is applied to speaker IDs.
use_reco2dur (bool) – If True, we will use the reco2dur file to read the durations of the recordings. If False, we will read the durations from the audio files themselves.
num_jobs (int) – Number of parallel jobs to use when reading the audio files.
- Return type:
Tuple[RecordingSet,Optional[SupervisionSet],Optional[FeatureSet]]
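As a usage sketch, the function can be wrapped so that the three returned manifests are saved to disk. `import_kaldi_dir` is a hypothetical helper, and the output file names are an assumption chosen to match the files listed in the Example section; the `to_file()` calls assume the standard Lhotse manifest serialization method.

```python
from pathlib import Path
from typing import Optional


def import_kaldi_dir(
    kaldi_dir: str,
    sampling_rate: int,
    output_dir: str,
    frame_shift: Optional[float] = None,
) -> None:
    """Hypothetical wrapper: load a Kaldi data dir and save the Lhotse manifests."""
    # Imported lazily so this module can be loaded without Lhotse installed.
    from lhotse.kaldi import load_kaldi_data_dir

    recordings, supervisions, features = load_kaldi_data_dir(
        kaldi_dir, sampling_rate=sampling_rate, frame_shift=frame_shift
    )

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    recordings.to_file(out / "recordings.jsonl.gz")
    # Supervisions and features are Optional in the return type, so guard both.
    if supervisions is not None:
        supervisions.to_file(out / "supervisions.jsonl.gz")
    if features is not None:
        features.to_file(out / "features.jsonl.gz")
```

For the yesno data used later on this page, a call could look like `import_kaldi_dir("data/train_yesno", 8000, "data/train_manifests", frame_shift=0.01)`.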
- lhotse.kaldi.export_to_kaldi(recordings, supervisions, output_dir, map_underscores_to=None, prefix_spk_id=False)[source]
Export a pair of RecordingSet and SupervisionSet to a Kaldi data directory. It even supports recordings that have multiple channels, but the recordings will still have to have a single AudioSource.
The RecordingSet and SupervisionSet must be compatible, i.e. it must be possible to create a CutSet out of them.
- Parameters:
recordings (RecordingSet) – a RecordingSet manifest.
supervisions (SupervisionSet) – a SupervisionSet manifest.
output_dir (Union[Path,str]) – path where the Kaldi-style data directory will be created.
map_underscores_to (Optional[str]) – optional string with which we will replace all underscores. This helps avoid issues with Kaldi data dir sorting.
prefix_spk_id (Optional[bool]) – add speaker_id as a prefix of utterance_id (this ensures the correct sorting inside files, which Kaldi requires).
Note
If you export a RecordingSet with multiple channels, then the resulting Kaldi data directory may not be back-compatible with Lhotse (i.e. you won’t be able to import it back to Lhotse in the same form). This is because Kaldi does not inherently support multi-channel recordings, so we have to break them down into single-channel recordings.
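The sorting issue that prefix_spk_id addresses can be seen on a toy example: Kaldi expects utt2spk, when sorted by utterance ID, to also be ordered by speaker ID. The sketch below is illustrative only. The helper names and the `-` separator are assumptions, and the actual ID format produced by export_to_kaldi() may differ.

```python
from typing import Dict


def speakers_sorted(utt2spk: Dict[str, str]) -> bool:
    """Check that ordering by utterance ID also orders the speaker IDs."""
    # Python compares str by code point, which matches C-locale sorting for ASCII.
    speakers = [utt2spk[utt] for utt in sorted(utt2spk)]
    return speakers == sorted(speakers)


def add_speaker_prefix(utt2spk: Dict[str, str], sep: str = "-") -> Dict[str, str]:
    """Hypothetical helper: prepend the speaker ID to every utterance ID."""
    return {f"{spk}{sep}{utt}": spk for utt, spk in utt2spk.items()}


# Utterance IDs whose order disagrees with the speaker order.
utt2spk = {"utt1": "spkB", "utt2": "spkA", "utt3": "spkB"}
prefixed = add_speaker_prefix(utt2spk)

ok_before = speakers_sorted(utt2spk)
ok_after = speakers_sorted(prefixed)
```

Here `ok_before` is False because the speakers come out as spkB, spkA, spkB, while `ok_after` is True once every utterance ID starts with its speaker ID.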
- lhotse.kaldi.load_start_and_duration(segments_path=None, feats_path=None, frame_shift=None)[source]
Load start time from segments and duration from feats, when both segments and feats.scp are available.
- Return type:
Dict[Tuple,None]
- lhotse.kaldi.load_kaldi_text_mapping(path, must_exist=False, float_vals=False)[source]
Load Kaldi files such as utt2spk, spk2gender, text, etc. as a dict.
- Return type:
Dict[str,Optional[str]]
- lhotse.kaldi.save_kaldi_text_mapping(data, path)[source]
Save flat dicts to Kaldi files such as utt2spk, spk2gender, text, etc.
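Files such as utt2spk or text hold one entry per line: a key, a space, then the value (which for text may itself contain spaces). A minimal standard-library sketch of that round trip is shown below. `write_mapping` and `read_mapping` are hypothetical names, not the functions documented above, and they ignore options such as must_exist and float_vals.

```python
import tempfile
from pathlib import Path
from typing import Dict


def write_mapping(data: Dict[str, str], path: Path) -> None:
    """Hypothetical helper: write 'key value' lines, sorted by key."""
    with open(path, "w", encoding="utf-8") as f:
        for key in sorted(data):
            f.write(f"{key} {data[key]}\n")


def read_mapping(path: Path) -> Dict[str, str]:
    """Hypothetical helper: split each line at the first space only."""
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            key, _, value = line.strip().partition(" ")
            mapping[key] = value
    return mapping


# A toy "text" file: the transcript keeps its internal spaces.
text = {"utt2": "NO YES NO", "utt1": "YES NO"}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "text"
    write_mapping(text, path)
    first_line = path.read_text(encoding="utf-8").splitlines()[0]
    restored = read_mapping(path)
```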
- lhotse.kaldi.make_wavscp_channel_string_map(source, sampling_rate, transforms=None)[source]
- Return type:
Dict[int,str]
CLI
Converting a Kaldi data directory called data/train, with recordings sampled at 16 kHz,
to a directory with Lhotse manifests called train_manifests, and exporting it back:
# Convert data/train to train_manifests/{recordings,supervisions}.json
lhotse kaldi import \
data/train \
16000 \
train_manifests
# Convert train_manifests/{recordings,supervisions}.json to data/train
lhotse kaldi export \
train_manifests/recordings.json \
train_manifests/supervisions.json \
data/train
Example
Hint
Before you continue, make sure you have run pip install kaldi-native-io;
otherwise, you won’t be able to get features.jsonl.gz below.
In the following, we demonstrate how to import a Kaldi data directory using
the yesno dataset.
Assume you have run the following commands with Kaldi:
cd kaldi/egs/yesno/s5
./run.sh
Take the data/train_yesno directory as an example:
ls data/train_yesno/
cmvn.scp conf feats.scp frame_shift spk2utt split1 text utt2dur utt2num_frames utt2spk wav.scp
You can use the following command to import it into Lhotse:
lhotse kaldi import \
--frame-shift 0.01 \
./data/train_yesno \
8000 \
./data/train_manifests/
Hint
You can use lhotse kaldi import --help to view the help information.
In the above, 8000 is the sampling rate for the yesno dataset.
It will generate the following files:
$ ls data/train_manifests/
features.jsonl.gz recordings.jsonl.gz supervisions.jsonl.gz
To create a CutSet from the above files, you can use:
lhotse cut simple \
-r ./data/train_manifests/recordings.jsonl.gz \
-f ./data/train_manifests/features.jsonl.gz \
-s ./data/train_manifests/supervisions.jsonl.gz \
./yesno_train.jsonl.gz
Now you can use ./yesno_train.jsonl.gz for training.
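The same step can be sketched in Python. This assumes that `lhotse cut simple` roughly corresponds to `CutSet.from_manifests()`; `build_cut_set` is a hypothetical helper, and the manifest file names are those generated by the import command above.

```python
from pathlib import Path


def build_cut_set(manifest_dir: str, output_path: str) -> None:
    """Hypothetical helper: combine the imported manifests into a CutSet."""
    # Imported lazily so this module can be loaded without Lhotse installed.
    from lhotse import CutSet, load_manifest

    manifests = Path(manifest_dir)
    cuts = CutSet.from_manifests(
        recordings=load_manifest(manifests / "recordings.jsonl.gz"),
        supervisions=load_manifest(manifests / "supervisions.jsonl.gz"),
        features=load_manifest(manifests / "features.jsonl.gz"),
    )
    cuts.to_file(output_path)
```

A call such as `build_cut_set("data/train_manifests", "yesno_train.jsonl.gz")` would then be expected to write the same kind of file as the command above.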