Kaldi Interoperability
Data import/export
We support importing Kaldi data directories that contain at least the wav.scp file, which is required to create the RecordingSet.
Other files, such as segments, utt2spk, etc., are used to create the SupervisionSet.
We also support converting feats.scp to a FeatureSet, and reading features directly from Kaldi’s scp/ark files via the kaldi_native_io library (an optional Lhotse dependency).
We also support exporting a pair of RecordingSet and SupervisionSet to a Kaldi data directory.
We currently do not support the following (but may start doing so in the future):
- Exporting Lhotse-extracted features to Kaldi’s feats.scp
- Exporting Lhotse’s multi-channel recording sets to Kaldi
Kaldi feature extractors
We support Kaldi-compatible log-mel filter energies (“fbank”) and MFCCs. We provide a PyTorch implementation that is GPU-compatible and supports batching and backpropagation. To learn more about feature extraction in Lhotse, see Feature extraction.
Python
Python methods related to Kaldi support:
- lhotse.kaldi.floor_duration_to_milliseconds(duration)
Floor the duration to a multiple of 0.001 seconds. This avoids float precision problems in workflows like:
lhotse kaldi import …
lhotse fix …
./local/compute_fbank_imported.py (from icefall)
lhotse cut trim-to-supervisions …
./local/validate_manifest.py … (from icefall)
Without flooring, the lengths differed:
Supervision end time 1093.33995833 is larger than cut end time 1093.3399375
This difference is still within the 2ms tolerance in K2SpeechRecognitionDataset::validate_for_asr():
https://github.com/lhotse-speech/lhotse/blob/master/lhotse/dataset/speech_recognition.py#L201
- Return type
float
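The flooring described above can be illustrated with a minimal standalone sketch. The helper name floor_to_milliseconds is hypothetical and the code is not taken from Lhotse's source; it only shows how the two mismatched end times from the error message collapse to the same millisecond.

```python
import math


def floor_to_milliseconds(duration: float) -> float:
    """Round a duration in seconds down to a multiple of 0.001 s.

    Hypothetical helper for illustration only; not Lhotse's own code.
    """
    return math.floor(duration * 1000) / 1000


# The two values from the error message above.
supervision_end = 1093.33995833
cut_end = 1093.3399375

# After flooring, both fall on the same millisecond,
# so the supervision no longer ends after the cut.
print(floor_to_milliseconds(supervision_end))
print(floor_to_milliseconds(cut_end))
```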
- lhotse.kaldi.get_duration(path)
Read an audio file and return its duration. Both Kaldi-style pipes and paths to real audio files are supported.
- Parameters
path (Union[Path, str]) – Path to an audio file or a Kaldi-style pipe.
- Return type
Optional[float]
- Returns
The duration of the recording in seconds as a float, or None in case of a read error.
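To make the two accepted input styles concrete, here is a rough sketch built only on the Python standard library. It assumes WAV audio and treats an entry ending with "|" as a Kaldi-style pipe whose command writes a WAV stream to stdout. The function name wav_duration is hypothetical, and this is not how Lhotse itself reads audio.

```python
import io
import subprocess
import wave
from typing import Optional


def wav_duration(path_or_pipe: str) -> Optional[float]:
    """Return the duration in seconds of a WAV file or Kaldi-style pipe.

    Hypothetical stdlib-only sketch; returns None on any read error.
    """
    try:
        entry = path_or_pipe.strip()
        if entry.endswith("|"):
            # Kaldi-style pipe: run the command and read WAV bytes from stdout.
            command = entry[:-1]
            raw = subprocess.run(
                command, shell=True, check=True, stdout=subprocess.PIPE
            ).stdout
            source = io.BytesIO(raw)
        else:
            # Regular path to an audio file on disk.
            source = entry
        with wave.open(source, "rb") as f:
            return f.getnframes() / f.getframerate()
    except (OSError, EOFError, wave.Error, subprocess.CalledProcessError):
        return None
```

Returning None instead of raising mirrors the documented behaviour, so a caller can skip unreadable recordings rather than abort the whole import.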
- lhotse.kaldi.load_kaldi_data_dir(path, sampling_rate, frame_shift=None, map_string_to_underscores=None, use_reco2dur=True, num_jobs=1, feature_type='kaldi-fbank')
Load a Kaldi data directory and convert it to Lhotse RecordingSet and SupervisionSet manifests. For this to work, at least the wav.scp file must exist. A SupervisionSet is created only when a segments file exists. reco2dur is used by default when it exists (to enforce reading the durations from the audio files themselves, set use_reco2dur=False). All the other files (text, utt2spk, etc.) are optional, and some of them might not be handled yet. In particular, feats.scp files are ignored.
- Parameters
path (Union[Path, str]) – Path to the Kaldi data directory.
sampling_rate (int) – Sampling rate of the recordings.
frame_shift (Optional[float]) – Optional; if specified, we will create a Features manifest and store the frame_shift value in it.
map_string_to_underscores (Optional[str]) – Optional string; when specified, we will replace all instances of this string in SupervisionSegment IDs with underscores. This helps with handling underscores in Kaldi (see export_to_kaldi()). This is also done for speaker IDs.
use_reco2dur (bool) – If True, we will use the reco2dur file to read the durations of the recordings. If False, we will read the durations from the audio files themselves.
num_jobs (int) – Number of parallel jobs to use when reading the audio files.
- Return type
Tuple[RecordingSet, Optional[SupervisionSet], Optional[FeatureSet]]
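As background for the files mentioned above, the following sketch parses the two standard Kaldi formats involved: wav.scp (recording ID followed by a path or pipe) and segments (utterance ID, recording ID, start and end in seconds). The helper names, the Segment type, and the sample paths are illustrative assumptions; this is not Lhotse's implementation.

```python
from typing import Dict, List, NamedTuple


class Segment(NamedTuple):
    """One line of a Kaldi segments file (hypothetical container type)."""
    utt_id: str
    reco_id: str
    start: float
    end: float


def parse_wav_scp(text: str) -> Dict[str, str]:
    """Map each recording ID to its audio path or Kaldi-style pipe."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split only once: the source itself may contain spaces (pipes).
        reco_id, source = line.split(maxsplit=1)
        entries[reco_id] = source
    return entries


def parse_segments(text: str) -> List[Segment]:
    """Parse '<utt-id> <reco-id> <start> <end>' lines."""
    segments = []
    for line in text.splitlines():
        if not line.strip():
            continue
        utt_id, reco_id, start, end = line.split()
        segments.append(Segment(utt_id, reco_id, float(start), float(end)))
    return segments


# Made-up sample content for illustration.
wav_scp = """\
reco1 /data/audio/reco1.wav
reco2 sox /data/audio/reco2.flac -t wav - |
"""
segments = """\
reco1-utt1 reco1 0.00 2.50
reco1-utt2 reco1 2.50 5.25
"""
recordings = parse_wav_scp(wav_scp)
supervisions = parse_segments(segments)
```

Each wav.scp entry corresponds to one recording, and each segments line carries the timing that a supervision needs, which is why supervisions depend on the segments file being present.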
- lhotse.kaldi.export_to_kaldi(recordings, supervisions, output_dir, map_underscores_to=None, prefix_spk_id=False)
Export a pair of RecordingSet and SupervisionSet to a Kaldi data directory. It even supports recordings that have multiple channels, but the recordings will still have to have a single AudioSource.
The RecordingSet and SupervisionSet must be compatible, i.e. it must be possible to create a CutSet out of them.
- Parameters
recordings (RecordingSet) – a RecordingSet manifest.
supervisions (SupervisionSet) – a SupervisionSet manifest.
output_dir (Union[Path, str]) – path where the Kaldi-style data directory will be created.
map_underscores_to (Optional[str]) – optional string with which we will replace all underscores. This helps avoid issues with Kaldi data dir sorting.
prefix_spk_id (Optional[bool]) – add speaker_id as a prefix of utterance_id (this is to ensure correct sorting inside files, which is required by Kaldi).
Note
If you export a RecordingSet with multiple channels, then the resulting Kaldi data directory may not be back-compatible with Lhotse (i.e. you won’t be able to import it back to Lhotse in the same form). This is because Kaldi does not inherently support multi-channel recordings, so we have to break them down into single-channel recordings.
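The effect of map_underscores_to and prefix_spk_id on sorting can be shown with a small sketch. The helper kaldi_ids, the "-" separator between speaker and utterance ID, and the choice to apply the underscore replacement to speaker IDs as well are all assumptions made for illustration; they are not taken from Lhotse's source.

```python
from typing import Optional, Tuple


def kaldi_ids(
    utt_id: str,
    speaker: str,
    map_underscores_to: Optional[str] = None,
    prefix_spk_id: bool = False,
) -> Tuple[str, str]:
    """Return (utterance ID, speaker ID) adjusted for Kaldi-style sorting.

    Hypothetical helper; the '-' separator is an assumption.
    """
    if map_underscores_to is not None:
        utt_id = utt_id.replace("_", map_underscores_to)
        speaker = speaker.replace("_", map_underscores_to)
    if prefix_spk_id:
        # Prefixing makes a sort by utterance ID also sort by speaker.
        utt_id = f"{speaker}-{utt_id}"
    return utt_id, speaker


pairs = [("a_1", "bob"), ("b_1", "alice")]

# Sorted by plain utterance ID, the speaker column is out of order.
plain = sorted(kaldi_ids(u, s) for u, s in pairs)

# With the speaker prefix, both columns are sorted consistently.
prefixed = sorted(kaldi_ids(u, s, prefix_spk_id=True) for u, s in pairs)
```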
- lhotse.kaldi.load_start_and_duration(segments_path=None, feats_path=None, frame_shift=None)
Load start time from segments and duration from feats, when both segments and feats.scp are available.
- Return type
Dict[Tuple, None]
- lhotse.kaldi.load_kaldi_text_mapping(path, must_exist=False, float_vals=False)
Load Kaldi files such as utt2spk, spk2gender, text, etc. as a dict.
- Return type
Dict[str, Optional[str]]
- lhotse.kaldi.save_kaldi_text_mapping(data, path)
Save flat dicts to Kaldi files such as utt2spk, spk2gender, text, etc.
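Files such as utt2spk and text share one simple layout: a key, a space, and the rest of the line as the value. The round trip below is a stdlib-only sketch of that layout. The function names are hypothetical, and writing entries in sorted key order is a choice made here because Kaldi expects sorted files, not something confirmed about Lhotse's behaviour.

```python
import tempfile
from pathlib import Path
from typing import Dict, Optional


def save_mapping(data: Dict[str, str], path: Path) -> None:
    """Write 'key value' lines, sorted by key (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for key, value in sorted(data.items()):
            f.write(f"{key} {value}\n")


def load_mapping(path: Path) -> Dict[str, Optional[str]]:
    """Read 'key value' lines; a key without a value maps to None."""
    mapping: Dict[str, Optional[str]] = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Everything after the first space is the value (e.g. a transcript).
            key, _, value = line.partition(" ")
            mapping[key] = value.strip() or None
    return mapping


# Made-up transcript entries for illustration.
text = {"utt2": "NO YES", "utt1": "YES NO YES"}
with tempfile.TemporaryDirectory() as tmp:
    text_path = Path(tmp) / "text"
    save_mapping(text, text_path)
    restored = load_mapping(text_path)
```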
- lhotse.kaldi.make_wavscp_channel_string_map(source, sampling_rate, transforms=None)
- Return type
Dict[int, str]
CLI
Converting a Kaldi data directory called data/train, with 16kHz sampling rate recordings, to a directory with Lhotse manifests called train_manifests:
# Convert data/train to train_manifests/{recordings,supervisions}.json
lhotse kaldi import \
data/train \
16000 \
train_manifests
# Convert train_manifests/{recordings,supervisions}.json to data/train
lhotse kaldi export \
train_manifests/recordings.json \
train_manifests/supervisions.json \
data/train
Example
In the following, we demonstrate how to import a Kaldi data directory using the yesno dataset.
Assume you have run the following commands with Kaldi:
cd kaldi/egs/yesno/s5
./run.sh
Take the data/train_yesno directory as an example:
ls data/train_yesno/
cmvn.scp conf feats.scp frame_shift spk2utt split1 text utt2dur utt2num_frames utt2spk wav.scp
You can use the following command to import it into Lhotse:
lhotse kaldi import \
--frame-shift 0.01 \
./data/train_yesno \
8000 \
./data/train_manifests/
Hint
You can use lhotse kaldi import --help to view the help information.
In the above, 8000 is the sampling rate for the yesno dataset.
It will generate the following files:
$ ls data/train_manifests/
features.jsonl.gz recordings.jsonl.gz supervisions.jsonl.gz
To create a CutSet from the above files, you can use:
lhotse cut simple \
-r ./data/train_manifests/recordings.jsonl.gz \
-f ./data/train_manifests/features.jsonl.gz \
-s ./data/train_manifests/supervisions.jsonl.gz \
./yesno_train.jsonl.gz
Now you can use ./yesno_train.jsonl.gz for training.