Command-line interface
lhotse
The shell entry point to Lhotse, a tool and a library for audio data manipulation in high altitudes.
lhotse [OPTIONS] COMMAND [ARGS]...
Options
- -s, --seed <seed>
Random seed.
combine
Load MANIFESTS, combine them into a single one, and write it to OUTPUT_MANIFEST.
lhotse combine [OPTIONS] [MANIFESTS]... OUTPUT_MANIFEST
Arguments
- MANIFESTS
Optional argument(s)
- OUTPUT_MANIFEST
Required argument
copy
Load INPUT_MANIFEST and store it to OUTPUT_MANIFEST. Useful for conversion between different serialization formats (e.g. JSON, JSONL, YAML). Automatically supports gzip compression when ‘.gz’ suffix is detected.
lhotse copy [OPTIONS] INPUT_MANIFEST OUTPUT_MANIFEST
Arguments
- INPUT_MANIFEST
Required argument
- OUTPUT_MANIFEST
Required argument
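The suffix-based gzip handling and format conversion described above can be illustrated with a minimal Python sketch. It is not Lhotse's implementation: the helper names open_manifest and json_to_jsonl are hypothetical, and the input is assumed to be a JSON file holding a list of items.

```python
import gzip
import json
from pathlib import Path


def open_manifest(path, mode):
    """Open a text file, using gzip transparently when the name ends with '.gz'."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    return opener(path, mode + "t", encoding="utf-8")


def json_to_jsonl(input_path, output_path):
    """Read a JSON list and write one JSON object per line (JSONL)."""
    with open_manifest(input_path, "r") as f:
        items = json.load(f)
    with open_manifest(output_path, "w") as f:
        for item in items:
            f.write(json.dumps(item) + "\n")
```

With these helpers, converting in.json to out.jsonl.gz produces a gzip-compressed file only because of the output suffix; the same call with out.jsonl writes plain text.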
copy-feats
Load INPUT_MANIFEST of type lhotse.FeatureSet or lhotse.CutSet, read every feature matrix using features.load() or cut.load_features(), save them in STORAGE_PATH, and save the updated manifest to OUTPUT_MANIFEST.
lhotse copy-feats [OPTIONS] INPUT_MANIFEST OUTPUT_MANIFEST STORAGE_PATH
Options
- -t, --storage-type <storage_type>
Which storage backend should we use for writing the copied features.
- Options:
chunked_lilcom_hdf5 | kaldiio | lilcom_chunky | lilcom_files | lilcom_hdf5 | lilcom_url | memory_lilcom | memory_raw | numpy_files | numpy_hdf5
- -j, --max-jobs <max_jobs>
Maximum number of parallel copying processes. By default, one process is spawned for every existing feature file in the INPUT_MANIFEST (e.g., if the features were extracted with 20 jobs, there will typically be 20 files).
Arguments
- INPUT_MANIFEST
Required argument
- OUTPUT_MANIFEST
Required argument
- STORAGE_PATH
Required argument
cut
Group of commands used to create CutSets.
lhotse cut [OPTIONS] COMMAND [ARGS]...
append
Create a new CutSet by appending the cuts in CUT_MANIFESTS. CUT_MANIFESTS are iterated position-wise (the cuts at the i-th position in each manifest are appended to each other). The cuts are appended in the order in which they appear in the input argument list. If CUT_MANIFESTS have different lengths, the script stops once the shortest CutSet is depleted.
lhotse cut append [OPTIONS] [CUT_MANIFESTS]... OUTPUT_CUT_MANIFEST
Arguments
- CUT_MANIFESTS
Optional argument(s)
- OUTPUT_CUT_MANIFEST
Required argument
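The position-wise iteration described above, including stopping at the shortest manifest, behaves like Python's zip(). The sketch below is only an illustration under simplifying assumptions: cuts are plain dicts with hypothetical "id" and "duration" keys, and the joined ID format is made up for the example.

```python
from typing import Dict, List


def append_position_wise(manifests: List[List[Dict]]) -> List[Dict]:
    """Append the cuts found at the same position in every manifest."""
    appended = []
    # zip() stops as soon as the shortest manifest runs out of cuts.
    for cuts in zip(*manifests):
        appended.append(
            {
                # Hypothetical ID scheme, used only for this illustration.
                "id": "+".join(cut["id"] for cut in cuts),
                # Appending places cuts one after another, so durations add up.
                "duration": sum(cut["duration"] for cut in cuts),
            }
        )
    return appended


first = [
    {"id": "a1", "duration": 1.5},
    {"id": "a2", "duration": 2.0},
    {"id": "a3", "duration": 4.0},
]
second = [{"id": "b1", "duration": 2.0}, {"id": "b2", "duration": 0.5}]
print(append_position_wise([first, second]))
```

Here the third cut of the first manifest has no partner, so only two appended cuts are produced.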
decompose
Decompose CUTSET into its recording, feature, and supervision manifests, written to OUTPUT. If any of these are not present in any of the cuts, the corresponding file will be empty.
lhotse cut decompose [OPTIONS] CUTSET OUTPUT
Arguments
- CUTSET
Required argument
- OUTPUT
Required argument
describe
Describe some statistics of CUTSET, such as the total speech and audio duration.
lhotse cut describe [OPTIONS] CUTSET
Arguments
- CUTSET
Required argument
estimate-bucket-bins
Estimate duration bins for dynamic bucketing. Prints a Python list of num_buckets-1 floats (seconds) which constitute the boundaries between buckets. The bins are estimated so that each bucket holds a roughly equal total duration of data.
lhotse cut estimate-bucket-bins [OPTIONS] CUTSET
Options
- -b, --num-buckets <num_buckets>
Desired number of buckets.
- -s, --sample <sample>
How many samples to use for estimation (first N, by default use full cutset).
Arguments
- CUTSET
Required argument
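The idea of choosing boundaries so that each bucket holds a similar total duration can be sketched in a few lines of Python. This is an illustration, not necessarily the exact procedure the command uses: the function name is hypothetical, and cut durations are passed in as a plain list of floats.

```python
from typing import List


def estimate_duration_bins(durations: List[float], num_buckets: int) -> List[float]:
    """Return up to num_buckets - 1 boundaries between duration buckets."""
    ordered = sorted(durations)
    # Amount of audio each bucket should hold if the data were split evenly.
    target = sum(ordered) / num_buckets
    bins: List[float] = []
    accumulated = 0.0
    for dur in ordered:
        # Once the current bucket has its share, start a new one at this duration.
        if accumulated >= target and len(bins) < num_buckets - 1:
            bins.append(dur)
            accumulated = 0.0
        accumulated += dur
    return bins


# Six short cuts hold 8 s in total, and so do the two 4 s cuts.
print(estimate_duration_bins([1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 4.0, 4.0], num_buckets=2))
# prints [4.0]
```

Note that with very skewed data a greedy pass like this can return fewer boundaries than requested.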
export-to-webdataset
Export CUTS into a WebDataset tarfile, or a collection of tarfile shards, as specified by WSPECIFIER.
The resulting CutSet contains audio/feature data in addition to metadata, and can be read in Python using ‘CutSet.from_webdataset’ API.
This function is useful for I/O intensive applications where random reads are too slow, and a one-time lengthy export step that enables fast sequential reading is preferable.
See the WebDataset project for more information: https://github.com/webdataset/webdataset
lhotse cut export-to-webdataset [OPTIONS] CUTSET WSPECIFIER
Options
- -s, --shard-size <shard_size>
Number of cuts per shard (sharding disabled if not defined).
- -f, --audio-format <audio_format>
Format in which the audio is encoded (uses torchaudio available formats).
- --audio, --no-audio
Should we load and add audio data.
- --features, --no-features
Should we load and add feature data.
- --custom, --no-custom
Should we load and add custom data.
- --fault-tolerant, --stop-on-fail
Should we omit the cuts for which loading data failed, or stop the execution.
Arguments
- CUTSET
Required argument
- WSPECIFIER
Required argument
mix-by-recording-id
Create a CutSet stored in OUTPUT_CUT_MANIFEST by matching the Cuts from CUT_MANIFESTS by their recording IDs and mixing them together.
lhotse cut mix-by-recording-id [OPTIONS] [CUT_MANIFESTS]... OUTPUT_CUT_MANIFEST
Arguments
- CUT_MANIFESTS
Optional argument(s)
- OUTPUT_CUT_MANIFEST
Required argument
mix-sequential
Create a CutSet stored in OUTPUT_CUT_MANIFEST by iterating jointly over CUT_MANIFESTS and mixing the Cuts at the same positions. E.g. the first output cut is created from the first cuts in each input manifest. The mix is performed by summing the features from all Cuts. If the CUT_MANIFESTS have a different number of Cuts, the mixing ends when the shortest manifest is depleted.
lhotse cut mix-sequential [OPTIONS] [CUT_MANIFESTS]... OUTPUT_CUT_MANIFEST
Arguments
- CUT_MANIFESTS
Optional argument(s)
- OUTPUT_CUT_MANIFEST
Required argument
pad
Create a new CutSet by padding the cuts in CUT_MANIFEST. The cuts will be right-padded, i.e. the padding is placed after the signal ends.
lhotse cut pad [OPTIONS] CUT_MANIFEST OUTPUT_CUT_MANIFEST
Options
- -d, --duration <duration>
Desired duration of cuts after padding. Cuts longer than this won’t be affected. By default, pad to the longest cut duration found in CUT_MANIFEST.
Arguments
- CUT_MANIFEST
Required argument
- OUTPUT_CUT_MANIFEST
Required argument
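The padding rule (right-pad to a target duration, leave longer cuts alone, default to the longest cut) can be expressed as a small computation. The sketch below only computes how much padding each cut would receive; the function name is hypothetical and durations are plain floats rather than Lhotse cuts.

```python
from typing import List, Optional


def right_padding_amounts(durations: List[float], duration: Optional[float] = None) -> List[float]:
    """Return the amount of padding appended after each cut, in seconds."""
    if not durations:
        return []
    # Without an explicit target, pad everything to the longest cut.
    target = max(durations) if duration is None else duration
    # Cuts that are already at least as long as the target get no padding.
    return [max(0.0, target - d) for d in durations]


print(right_padding_amounts([2.0, 5.0, 3.5]))
print(right_padding_amounts([2.0, 5.0, 3.5], duration=4.0))
```

In the second call the 5-second cut is longer than the requested 4 seconds, so it is left unchanged rather than shortened.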
simple
Create a CutSet stored in OUTPUT_CUT_MANIFEST. Depending on the provided options, it may contain any combination of recording, feature and supervision manifests. Either RECORDING_MANIFEST or FEATURE_MANIFEST has to be provided. When SUPERVISION_MANIFEST is provided, the cuts’ time span will correspond to that of the supervision segments. Otherwise, the time span corresponds to the one found in the features if available, and to the recordings if not.
Hint: --force-eager must be used when the RECORDING_MANIFEST is not sorted by recording ID.
lhotse cut simple [OPTIONS] OUTPUT_CUT_MANIFEST
Options
- -r, --recording-manifest <recording_manifest>
Optional recording manifest - will be used to attach the recordings to the cuts.
- -f, --feature-manifest <feature_manifest>
Optional feature manifest - will be used to attach the features to the cuts.
- -s, --supervision-manifest <supervision_manifest>
Optional supervision manifest - will be used to attach the supervisions to the cuts.
- --force-eager
Force reading full manifests into memory before creating the manifests (useful when you are not sure about the input manifest sorting).
Arguments
- OUTPUT_CUT_MANIFEST
Required argument
trim-to-alignments
Return a new CutSet with Cuts that have identical spans as the alignments of the type given by --type. An additional max_pause is allowed between the alignments to merge contiguous alignment items.
For the case of a multi-channel cut with multiple alignments, we can either trim while respecting the supervision channels (in which case output cut has the same channels as the supervision) or ignore the channels (in which case output cut has the same channels as the input cut).
lhotse cut trim-to-alignments [OPTIONS] CUTS OUTPUT_CUTS
Options
- --type <type>
Alignment type to use for trimming
- --max-pause <max_pause>
Merge alignments separated by a pause shorter than this value
- -d, --delimiter <delimiter>
Delimiter to use for concatenating alignment symbols for merging
- --keep-all-channels, --discard-extra-channels
If True, the output cut will have the same channels as the input cut. By default, the trimmed cut will have the same channels as the supervision.
Arguments
- CUTS
Required argument
- OUTPUT_CUTS
Required argument
trim-to-supervision-groups
Return a new CutSet with Cuts that have identical spans as the supervision groups. An additional max_pause is allowed to merge contiguous supervision groups.
A supervision group is defined as a set of supervisions that are overlapping or separated by a pause shorter than max_pause.
lhotse cut trim-to-supervision-groups [OPTIONS] CUTS OUTPUT_CUTS
Options
- --max-pause <max_pause>
Merge supervision groups separated by a pause shorter than this value
Arguments
- CUTS
Required argument
- OUTPUT_CUTS
Required argument
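The definition of a supervision group given above (supervisions that overlap or are separated by a pause shorter than max_pause) amounts to merging time intervals. The following Python sketch shows one way to compute the group spans; it is an illustration with a hypothetical function name, supervisions are reduced to (start, end) tuples, and details such as channels or exactly touching segments may be handled differently by Lhotse.

```python
from typing import List, Tuple


def supervision_group_spans(
    segments: List[Tuple[float, float]], max_pause: float = 0.0
) -> List[Tuple[float, float]]:
    """Merge (start, end) segments into group spans."""
    spans: List[List[float]] = []
    for start, end in sorted(segments):
        if spans:
            gap = start - spans[-1][1]  # negative when the segments overlap
            if gap < 0 or gap < max_pause:
                # Same group: extend the current span if this segment ends later.
                spans[-1][1] = max(spans[-1][1], end)
                continue
        spans.append([start, end])
    return [(s, e) for s, e in spans]


segments = [(0.0, 2.0), (1.5, 3.0), (3.25, 4.0), (6.0, 7.0)]
print(supervision_group_spans(segments, max_pause=0.5))
print(supervision_group_spans(segments, max_pause=0.0))
```

With max_pause=0.5 the first three segments form one group because the 0.25-second pause is short enough; with max_pause=0.0 only the overlapping pair is merged.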
trim-to-supervisions
Splits each input cut into as many cuts as there are supervisions. These cuts have identical start times and durations as the supervisions. When there are overlapping supervisions, they can be kept or discarded with options.
lhotse cut trim-to-supervisions [OPTIONS] CUTS OUTPUT_CUTS
Options
- --keep-overlapping, --discard-overlapping
When False, parts of other supervisions that overlap with the main supervision are discarded. In the illustration, it would discard Sup2 in Cut1 and Sup1 in Cut2.
- -d, --min-duration <min_duration>
An optional duration in seconds; specifying this argument will extend the cuts that would have been shorter than min_duration with actual acoustic context in the recording/features. If there are supervisions present in the context, they are kept when keep_overlapping is true. If there is not enough context, the returned cut will be shorter than min_duration. If the supervision segment is longer than min_duration, the returned cut will be longer.
- -c, --context-direction <context_direction>
Which direction should the cut be expanded towards to include context. The value of “center” implies equal expansion to left and right; random uniformly samples a value between “left” and “right”.
- Options:
center | left | right | random
- --keep-all-channels, --discard-extra-channels
If True, the output cut will have the same channels as the input cut. By default, the trimmed cut will have the same channels as the supervision.
Arguments
- CUTS
Required argument
- OUTPUT_CUTS
Required argument
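How --min-duration and --context-direction interact can be sketched as simple interval arithmetic. This is a hedged illustration rather than Lhotse's code: the function name and arguments are hypothetical, and it assumes that context missing at a recording boundary is not compensated on the other side, which matches the statement that the cut may end up shorter than min_duration.

```python
import random


def extend_with_context(start, end, min_duration, recording_duration,
                        direction="center", rng=None):
    """Extend the span [start, end] towards min_duration, clipped to the recording."""
    missing = min_duration - (end - start)
    if missing <= 0:
        return start, end  # already long enough: the span is left as it is
    if direction == "center":
        left = missing / 2
    elif direction == "left":
        left = missing
    elif direction == "right":
        left = 0.0
    elif direction == "random":
        # Sample how much of the missing context goes to the left side.
        left = (rng or random).uniform(0.0, missing)
    else:
        raise ValueError(f"Unknown direction: {direction}")
    new_start = max(0.0, start - left)
    new_end = min(recording_duration, end + (missing - left))
    return new_start, new_end


print(extend_with_context(4.0, 5.0, 3.0, 10.0, "center"))
print(extend_with_context(0.5, 1.5, 3.0, 10.0, "center"))
```

The second call starts close to the beginning of the recording, so the result covers less than the requested 3 seconds.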
truncate
Truncate the cuts in the CUT_MANIFEST and write them to OUTPUT_CUT_MANIFEST. Cuts shorter than MAX_DURATION will not be modified.
lhotse cut truncate [OPTIONS] CUT_MANIFEST OUTPUT_CUT_MANIFEST
Options
- --preserve-id
Should the cuts preserve IDs (by default, they will get new, random IDs)
- -d, --max-duration <max_duration>
Required. The maximum duration in seconds of a cut in the resulting manifest.
- -o, --offset-type <offset_type>
Where should the truncated cut start: “start” - at the start of the original cut, “end” - MAX_DURATION before the end of the original cut, “random” - randomly choose somewhere between “start” and “end” options.
- Options:
start | end | random
- --keep-overflowing-supervisions, --discard-overflowing-supervisions
When a cut is truncated in the middle of a supervision segment, should the supervision be kept.
Arguments
- CUT_MANIFEST
Required argument
- OUTPUT_CUT_MANIFEST
Required argument
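The three --offset-type choices differ only in where the truncated window starts inside the original cut. The sketch below computes that window as (offset, duration) relative to the start of the cut; the function name is hypothetical and it is an illustration of the rule stated above, not the library's implementation.

```python
import random


def truncated_window(cut_duration, max_duration, offset_type="start", rng=None):
    """Return (offset, duration) of the truncated cut relative to the original cut."""
    if cut_duration <= max_duration:
        return 0.0, cut_duration  # shorter cuts are not modified
    latest_start = cut_duration - max_duration
    if offset_type == "start":
        offset = 0.0
    elif offset_type == "end":
        offset = latest_start  # MAX_DURATION before the end of the cut
    elif offset_type == "random":
        offset = (rng or random).uniform(0.0, latest_start)
    else:
        raise ValueError(f"Unknown offset type: {offset_type}")
    return offset, max_duration


print(truncated_window(10.0, 4.0, "start"))
print(truncated_window(10.0, 4.0, "end"))
print(truncated_window(3.0, 4.0, "end"))
```

Passing a seeded random.Random instance as rng makes the "random" choice reproducible, in the same spirit as the top-level --seed option.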
download
Command group for downloading and extracting data.
lhotse download [OPTIONS] COMMAND [ARGS]...
adept
ADEPT prosody transfer evaluation corpus download.
lhotse download adept [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR
Required argument
aidatatang-200zh
aidatatang_200zh download. A directory named aidatatang_200zh is created inside TARGET_DIR to contain all downloaded/extracted files.
lhotse download aidatatang-200zh [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR
Required argument
aishell
Aishell download.
lhotse download aishell [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR
Required argument
aishell3
aishell3 download.
lhotse download aishell3 [OPTIONS] [TARGET_DIR]
Arguments
- TARGET_DIR
Optional argument
aishell4
AISHELL-4 download.
lhotse download aishell4 [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR
Required argument
ali-meeting
AliMeeting download.
lhotse download ali-meeting [OPTIONS] TARGET_DIR
Options
- --force-download
Arguments
- TARGET_DIR
Required argument
ami
AMI download.
lhotse download ami [OPTIONS] TARGET_DIR
Options
- --annotations <annotations>
To download annotations in a different directory than corpus.
- --mic <mic>
AMI microphone setting.
- Options:
ihm | ihm-mix | sdm | mdm | mdm8-bf
- --url <url>
AMI data downloading URL.
- --force-download <force_download>
If True, download even if file is present.
Arguments
- TARGET_DIR
Required argument
atcosim
ATCOSIM download.
lhotse download atcosim [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR
Required argument
audio-mnist
AudioMNIST dataset download.
lhotse download audio-mnist [OPTIONS] TARGET_DIR
Options
- --force-download <force_download>
If True, download even if file is present.
Arguments
- TARGET_DIR
Required argument
baker-zh
baker_zh download.
lhotse download baker-zh [OPTIONS] [TARGET_DIR]
Arguments
- TARGET_DIR
Optional argument
but-reverb-db
BUT Reverb DB download.
lhotse download but-reverb-db [OPTIONS] TARGET_DIR
Options
- --force-download <force_download>
If True, download even if file is present.
Arguments
- TARGET_DIR
Required argument
bvcc
BVCC/VoiceMOS challenge data cannot be downloaded.
See the following for information and instructions on how to obtain the BVCC dataset used for the VoiceMOS challenge:
- https://arxiv.org/abs/2105.02373
- https://nii-yamagishilab.github.io/ecooper-demo/VoiceMOS2022/index.html
- https://codalab.lisn.upsaclay.fr/competitions/695
lhotse download bvcc [OPTIONS]
chime6
CHiME-6 download.
lhotse download chime6 [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR
Required argument
cmu-arctic
CMU Arctic download.
lhotse download cmu-arctic [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR
Required argument
cmu-indic
CMU Indic download.
lhotse download cmu-indic [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR
Required argument
commonvoice
Commonvoice download.
lhotse download commonvoice [OPTIONS] TARGET_DIR
Options
- -l, --languages <languages>
Languages to prepare (scans CORPUS_DIR for language codes by default).
- -v, --release <release>
The name of the CommonVoice release (e.g., ‘cv-corpus-13.0-2023-03-09’). It is used as part of the download URL.
Arguments
- TARGET_DIR
Required argument
daily-talk
Download DailyTalk dataset.
lhotse download daily-talk [OPTIONS] TARGET_DIR
Options
- --force-download
Force download.
Arguments
- TARGET_DIR
Required argument
dipco
DiPCo download.
lhotse download dipco [OPTIONS] TARGET_DIR
Options
- --force-download <force_download>
If True, download even if file is present.
Arguments
- TARGET_DIR
Required argument
earnings21
Earnings21 dataset download.
lhotse download earnings21 [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR
Required argument
earnings22
Earnings22 dataset download.
lhotse download earnings22 [OPTIONS]
ears
EARS data download.
lhotse download ears [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR
Required argument
edacc
The Edinburgh International Accents of English Corpus (EDACC) download.
lhotse download edacc [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR
Required argument
fleurs
FLEURS download.
lhotse download fleurs [OPTIONS] TARGET_DIR
Options
- -l, --lang <lang>
Specify which languages to download by passing -l once per language, e.g., lhotse download fleurs . -l hi_in -l en_us
- --force-download
Specify whether to overwrite an existing archive
Arguments
- TARGET_DIR
Required argument
gigaspeech
Gigaspeech download.
lhotse download gigaspeech [OPTIONS] PASSWORD TARGET_DIR
Options
- --subset <subset>
Which parts of Gigaspeech to download (by default XL + DEV + TEST).
- Options:
auto | XL | L | M | S | XS | DEV | TEST
- --host <host>
Which host to download Gigaspeech from.
Arguments
- PASSWORD
Required argument
- TARGET_DIR
Required argument
gigast
GigaST download.
lhotse download gigast [OPTIONS] TARGET_DIR
Options
- -l, --languages <languages>
Languages to download. One of: ‘all’ (downloads all known languages) or a single language code (e.g., ‘en’).
- --force-download
Force download
Arguments
- TARGET_DIR
Required argument
grid
Grid audio-visual speech corpus download.
lhotse download grid [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR
Required argument
heroico
heroico download.
lhotse download heroico [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR
Required argument
hifitts
HiFiTTS data download.
lhotse download hifitts [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR
Required argument
himia
HI-MIA and HI_MIA_CW download.
lhotse download himia [OPTIONS] TARGET_DIR
Options
- -p, --dataset-parts <dataset_parts>
List of dataset parts to download. To download multiple parts, pass each with -p. Example: -p test -p cw_test. By default, both HI_MIA and HI_MIA_CW are downloaded. All possible data parts are train, dev, test, and cw_test.
Arguments
- TARGET_DIR
Required argument
icsi
ICSI data download.
lhotse download icsi [OPTIONS] AUDIO_DIR
Options
- --transcripts-dir <transcripts_dir>
To download annotations in a different directory than audio.
- --mic <mic>
ICSI microphone setting.
- Options:
ihm | ihm-mix | sdm | mdm
- --url <url>
ICSI data downloading URL.
- --force-download <force_download>
If True, download even if file is present.
Arguments
- AUDIO_DIR
Required argument
libricss
Download LibriCSS dataset.
lhotse download libricss [OPTIONS] TARGET_DIR
Options
- --force-download
Force download
Arguments
- TARGET_DIR
Required argument
librimix
Mini LibriMix download.
lhotse download librimix [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR
Required argument
librispeech
(Mini) Librispeech download.
lhotse download librispeech [OPTIONS] TARGET_DIR
Options
- --full, --mini
Download Librispeech [default] or mini Librispeech.
Arguments
- TARGET_DIR
Required argument
libritts
LibriTTS data download.
lhotse download libritts [OPTIONS] TARGET_DIR
Options
- -p, --dataset-parts <dataset_parts>
List of dataset parts to download. To download multiple parts, pass each with -p. Example: -p train-clean-360 -p dev-other
Arguments
- TARGET_DIR
Required argument
librittsr
LibriTTS-R data download.
lhotse download librittsr [OPTIONS] TARGET_DIR
Options
- -p, --dataset-parts <dataset_parts>
List of dataset parts to download. To download multiple parts, pass each with -p. Example: -p train-clean-360 -p dev-other
Arguments
- TARGET_DIR
Required argument
ljspeech
LJSpeech download.
lhotse download ljspeech [OPTIONS] [TARGET_DIR]
Arguments
- TARGET_DIR
Optional argument
magicdata
Magicdata download.
lhotse download magicdata [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR
Required argument
mdcc
MDCC download.
lhotse download mdcc [OPTIONS] TARGET_DIR
Options
- --force-download
If True, download the MDCC data even if it is already present.
Arguments
- TARGET_DIR
Required argument
medical
Medical download.
lhotse download medical [OPTIONS] TARGET_DIR
Options
- --force-download
Force download
Arguments
- TARGET_DIR
Required argument
mtedx
MTEDx download.
lhotse download mtedx [OPTIONS] TARGET_DIR
Options
- -l, --lang <lang>
Specify which languages to download by passing -l once per language, e.g., lhotse download mtedx . -l de -l fr -l es
Arguments
- TARGET_DIR
Required argument
musan
MUSAN download.
lhotse download musan [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR
Required argument
primewords
Primewords download.
lhotse download primewords [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR
Required argument
reazonspeech
ReazonSpeech download.
lhotse download reazonspeech [OPTIONS] TARGET_DIR
Options
- --subset <subset>
List of dataset parts to download (default: small-v1). To download multiple parts, pass each with --subset. Example: --subset all
- Options:
auto | tiny | small | medium | large | all | small-v1 | medium-v1 | all-v1
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
Arguments
- TARGET_DIR
Required argument
rir-noise
RIRS and noises download.
lhotse download rir-noise [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR
Required argument
sbcsae
SBCSAE download.
lhotse download sbcsae [OPTIONS] TARGET_DIR
Options
- --force-download
Force download.
Arguments
- TARGET_DIR
Required argument
spatial-librispeech
Spatial-LibriSpeech download.
lhotse download spatial-librispeech [OPTIONS] TARGET_DIR
Options
- -p, --dataset-parts <dataset_parts>
List of dataset parts to download. To download multiple parts, pass each with -p. Example: -p train -p test
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
Arguments
- TARGET_DIR
Required argument
speechcommands
Speech Commands v0.01 or v0.02 download.
lhotse download speechcommands [OPTIONS] SPEECHCOMMANDS_VERSION TARGET_DIR
Options
- --force-download
Force download
Arguments
- SPEECHCOMMANDS_VERSION
Required argument
- TARGET_DIR
Required argument
spgispeech
SPGISpeech download.
lhotse download spgispeech [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR
Required argument
stcmds
Stcmds download.
lhotse download stcmds [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR
Required argument
tedlium
TED-LIUM v2 download (approx. 35GB).
lhotse download tedlium [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR
Required argument
thchs-30
thchs_30 download.
lhotse download thchs-30 [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR
Required argument
this-american-life
This American Life dataset download.
lhotse download this-american-life [OPTIONS] TARGET_DIR
Options
- -f, --force-download
Arguments
- TARGET_DIR
Required argument
timit
TIMIT download.
lhotse download timit [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR
Required argument
uwb-atcc
UWB-ATCC download.
lhotse download uwb-atcc [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR
Required argument
vctk
VCTK download.
lhotse download vctk [OPTIONS] TARGET_DIR
Options
- --use-edinburgh-vctk-url <use_edinburgh_vctk_url>
Arguments
- TARGET_DIR
Required argument
voxceleb1
VoxCeleb1 download.
lhotse download voxceleb1 [OPTIONS] TARGET_DIR
Options
- --force-download
Force download
Arguments
- TARGET_DIR
Required argument
voxceleb2
VoxCeleb2 download.
lhotse download voxceleb2 [OPTIONS] TARGET_DIR
Options
- --force-download
Force download
Arguments
- TARGET_DIR
Required argument
voxconverse
VoxConverse dataset download.
lhotse download voxconverse [OPTIONS] TARGET_DIR
Options
- --force-download
Force download
Arguments
- TARGET_DIR
Required argument
voxpopuli
voxpopuli download.
lhotse download voxpopuli [OPTIONS] TARGET_DIR
Options
- --subset <subset>
- Options:
asr | 10k | 100k | 400k | en | de | fr | es | pl | it | ro | hu | cs | nl | fi | hr | sk | sl | et | lt | pt | bg | el | lv | mt | sv | da | en_v2 | de_v2 | fr_v2 | es_v2 | pl_v2 | it_v2 | ro_v2 | hu_v2 | cs_v2 | nl_v2 | fi_v2 | hr_v2 | sk_v2 | sl_v2 | et_v2 | lt_v2 | pt_v2 | bg_v2 | el_v2 | lv_v2 | mt_v2 | sv_v2 | da_v2
Arguments
- TARGET_DIR
Required argument
xbmu-amdo31
XBMU-AMDO31 download.
lhotse download xbmu-amdo31 [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR
Required argument
yesno
yes_no dataset download.
lhotse download yesno [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR
Required argument
feat
Feature extraction related commands.
lhotse feat [OPTIONS] COMMAND [ARGS]...
extract
Extract features for recordings in a given RECORDING_MANIFEST. The features are stored in OUTPUT_DIR, with one file per recording (or segment).
lhotse feat extract [OPTIONS] RECORDING_MANIFEST OUTPUT_DIR
Options
- -f, --feature-manifest <feature_manifest>
Optional manifest specifying feature extractor configuration.
- --storage-type <storage_type>
Select a storage backend for the feature matrices.
- Options:
chunked_lilcom_hdf5 | kaldiio | lilcom_chunky | lilcom_files | lilcom_hdf5 | lilcom_url | memory_lilcom | memory_raw | numpy_files | numpy_hdf5
- -t, --lilcom-tick-power <lilcom_tick_power>
Determines the compression accuracy; the input will be compressed to integer multiples of 2^tick_power
- -r, --root-dir <root_dir>
Root directory - all paths in the manifest will use this as prefix.
- -j, --num-jobs <num_jobs>
Number of parallel processes.
Arguments
- RECORDING_MANIFEST
Required argument
- OUTPUT_DIR
Required argument
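The phrase "integer multiples of 2^tick_power" in the --lilcom-tick-power description can be made concrete with a small quantization example. The sketch below only illustrates the resulting value grid and is not the lilcom codec itself; the function name is hypothetical and -5 is used purely as an example value.

```python
from typing import List


def quantize_to_ticks(values: List[float], tick_power: int = -5) -> List[float]:
    """Round each value to the nearest integer multiple of 2 ** tick_power."""
    tick = 2.0 ** tick_power
    return [round(v / tick) * tick for v in values]


tick = 2.0 ** -5  # 0.03125
original = [0.1, 1.0, -2.03]
quantized = quantize_to_ticks(original, tick_power=-5)
print(quantized)
# The rounding error never exceeds half a tick.
print(max(abs(a - b) for a, b in zip(original, quantized)) <= tick / 2)
```

A smaller (more negative) tick power gives a finer grid and therefore higher accuracy, while a larger one gives a coarser grid.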
extract-cuts
Extract features for cuts in a given CUTSET manifest. The features are stored in STORAGE_PATH, and the output manifest with features is stored in OUTPUT_CUTSET.
lhotse feat extract-cuts [OPTIONS] CUTSET OUTPUT_CUTSET STORAGE_PATH
Options
- -f, --feature-manifest <feature_manifest>
Optional manifest specifying feature extractor configuration.
- --storage-type <storage_type>
Select a storage backend for the feature matrices.
- Options:
chunked_lilcom_hdf5 | kaldiio | lilcom_chunky | lilcom_files | lilcom_hdf5 | lilcom_url | memory_lilcom | memory_raw | numpy_files | numpy_hdf5
- -j, --num-jobs <num_jobs>
Number of parallel processes.
Arguments
- CUTSET
Required argument
- OUTPUT_CUTSET
Required argument
- STORAGE_PATH
Required argument
extract-cuts-batch
Extract features for cuts in a given CUTSET manifest. The features are stored in STORAGE_PATH, and the output manifest with features is stored in OUTPUT_CUTSET.
This version enables CUDA acceleration for feature extractors that support it (e.g., kaldifeat extractors).
$ pip install kaldifeat  # note: ensure it's compiled with CUDA
$ lhotse feat write-default-config -f kaldifeat-fbank feat.yml
$ sed 's/device: cpu/device: cuda/' feat.yml > feat-cuda.yml
$ lhotse feat extract-cuts-batch -f feat-cuda.yml cuts.jsonl cuts_with_feats.jsonl feats.h5
lhotse feat extract-cuts-batch [OPTIONS] CUTSET OUTPUT_CUTSET STORAGE_PATH
Options
- -f, --feature-manifest <feature_manifest>
Optional manifest specifying feature extractor configuration. If you want to use CUDA, you should specify the device in this config.
- --storage-type <storage_type>
Select a storage backend for the feature matrices.
- Options:
chunked_lilcom_hdf5 | kaldiio | lilcom_chunky | lilcom_files | lilcom_hdf5 | lilcom_url | memory_lilcom | memory_raw | numpy_files | numpy_hdf5
- -j, --num-jobs <num_jobs>
Number of dataloader workers.
- -b, --batch-duration <batch_duration>
At most this many seconds of audio will be processed in each batch.
Arguments
- CUTSET
Required argument
- OUTPUT_CUTSET
Required argument
- STORAGE_PATH
Required argument
upload
Read an existing FEATURE_MANIFEST, upload the feature matrices it contains to a URL location, and save a new feature OUTPUT_MANIFEST that refers to the uploaded features.
The URL can refer to endpoints such as AWS S3, GCP, Azure, etc. For example: “s3://my-bucket/my-features” is a valid URL.
This script does not currently support credentials, and assumes that you have the write permissions.
lhotse feat upload [OPTIONS] FEATURE_MANIFEST URL OUTPUT_MANIFEST
Options
- -j, --num-jobs <num_jobs>
Arguments
- FEATURE_MANIFEST
Required argument
- URL
Required argument
- OUTPUT_MANIFEST
Required argument
write-default-config
Save a default feature extraction config to OUTPUT_CONFIG.
lhotse feat write-default-config [OPTIONS] OUTPUT_CONFIG
Options
- -f, --feature-type <feature_type>
Which feature extractor type to use.
- Options:
fbank | kaldi-fbank | kaldi-mfcc | kaldi-spectrogram | kaldi-log-spectrogram | kaldifeat-fbank | kaldifeat-mfcc | librosa-fbank | mfcc | opensmile-extractor | spectrogram | s3prl-ssl | whisper-fbank
Arguments
- OUTPUT_CONFIG
Required argument
filter
Filter a MANIFEST according to the rule specified in PREDICATE, and save the result to OUTPUT_MANIFEST. It is intended to work generically with most manifest types - it supports RecordingSet, SupervisionSet and CutSet.
It currently only supports comparison of numerical manifest item attributes, such as: start, duration, end, channel, num_frames, num_features, etc.
lhotse filter [OPTIONS] PREDICATE MANIFEST OUTPUT_MANIFEST
Arguments
- PREDICATE
Required argument
- MANIFEST
Required argument
- OUTPUT_MANIFEST
Required argument
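To illustrate what filtering on a numerical attribute means, here is a minimal Python sketch of a predicate evaluator. It is not Lhotse's parser: the predicate syntax (attribute name, comparison operator, number, for instance duration>4.5) and the set of operators are assumptions made for this example, and manifest items are represented as plain dicts.

```python
import operator
import re

# Assumed operator set for this illustration; longer symbols are listed first
# so that '<=' is not mistaken for '<'.
_OPS = {
    "<=": operator.le,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
    "=": operator.eq,
}
_PATTERN = re.compile(r"\s*(\w+)\s*(<=|>=|==|!=|<|>|=)\s*(-?\d+(?:\.\d+)?)\s*")


def parse_predicate(predicate):
    """Turn a string such as 'duration>4.5' into a function over manifest items."""
    match = _PATTERN.fullmatch(predicate)
    if match is None:
        raise ValueError(f"Cannot parse predicate: {predicate!r}")
    attribute, symbol, number = match.groups()
    compare = _OPS[symbol]
    threshold = float(number)
    return lambda item: compare(item[attribute], threshold)


items = [
    {"id": "a", "start": 0.0, "duration": 3.0},
    {"id": "b", "start": 1.0, "duration": 5.0},
]
keep = parse_predicate("duration>4.5")
print([item["id"] for item in items if keep(item)])
```

Only the second item has a duration above 4.5 seconds, so it is the only one that passes the filter.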
fix
Fix a pair of Lhotse RECORDINGS and SUPERVISIONS manifests. It removes supervisions without corresponding recordings and vice versa, trims the supervisions that exceed the recording, etc. Stores the output files in OUTPUT_DIR under the same names as the input files.
lhotse fix [OPTIONS] RECORDINGS SUPERVISIONS OUTPUT_DIR
Arguments
- RECORDINGS
Required argument
- SUPERVISIONS
Required argument
- OUTPUT_DIR
Required argument
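The kinds of repairs listed above (dropping unmatched items and trimming supervisions that run past the recording) can be sketched with plain Python data. This is an illustration under simplifying assumptions, not Lhotse's implementation: recordings are a dict from ID to duration, supervisions are dicts with hypothetical keys, and dropping a supervision that lies entirely outside its recording is an assumption of this sketch.

```python
def fix_manifests(recordings, supervisions):
    """Return (recordings, supervisions) with unmatched items removed and spans trimmed."""
    fixed_supervisions = []
    for sup in supervisions:
        rec_duration = recordings.get(sup["recording_id"])
        if rec_duration is None:
            continue  # supervision without a corresponding recording
        start = max(0.0, sup["start"])
        end = min(rec_duration, sup["start"] + sup["duration"])
        if end <= start:
            continue  # assumption: nothing is left inside the recording
        fixed_supervisions.append({**sup, "start": start, "duration": end - start})
    # Keep only the recordings that still have at least one supervision.
    supervised = {sup["recording_id"] for sup in fixed_supervisions}
    fixed_recordings = {rid: dur for rid, dur in recordings.items() if rid in supervised}
    return fixed_recordings, fixed_supervisions


recordings = {"r1": 10.0, "r2": 5.0}
supervisions = [
    {"id": "s1", "recording_id": "r1", "start": 8.0, "duration": 4.0},
    {"id": "s2", "recording_id": "r3", "start": 0.0, "duration": 1.0},
]
print(fix_manifests(recordings, supervisions))
```

In this example s1 is trimmed to end with recording r1, s2 is dropped because r3 does not exist, and r2 is dropped because no supervision refers to it.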
install-sph2pipe
Install the sph2pipe program to handle sphere (.sph) audio files with “shorten” codec compression (needed for older LDC data).
It downloads an archive and then decompresses and compiles the contents.
lhotse install-sph2pipe [OPTIONS]
Options
- --install-dir <install_dir>
Directory where sph2pipe will be downloaded and installed.
- --url <url>
URL from which to download sph2pipe.
kaldi
Kaldi import/export related commands.
lhotse kaldi [OPTIONS] COMMAND [ARGS]...
export
Convert a pair of RecordingSet and SupervisionSet manifests into a Kaldi-style data directory.
lhotse kaldi export [OPTIONS] RECORDINGS SUPERVISIONS OUTPUT_DIR
Options
- -u, --map-underscores-to <map_underscores_to>
Optional string with which we will replace all underscores. This helps avoid issues with Kaldi data dir sorting.
- -p, --prefix-spk-id
Prefix utterance ids with speaker ids. This helps avoid issues with Kaldi data dir sorting.
Arguments
- RECORDINGS
Required argument
- SUPERVISIONS
Required argument
- OUTPUT_DIR
Required argument
import
Convert a Kaldi data dir DATA_DIR into a directory MANIFEST_DIR of lhotse manifests. Ignores feats.scp. The SAMPLING_RATE has to be explicitly specified as it is not available to read from DATA_DIR.
lhotse kaldi import [OPTIONS] DATA_DIR SAMPLING_RATE MANIFEST_DIR
Options
- -f, --frame-shift <frame_shift>
Frame shift (in seconds) is required to support reading feats.scp.
- -u, --map-string-to-underscores <map_string_to_underscores>
When specified, we will replace all instances of this string in SupervisionSegment IDs with underscores. This is to help with handling underscores in Kaldi (see ‘export_to_kaldi’).
- -j, --num-jobs <num_jobs>
Number of jobs for computing recording durations.
- -t, --feature-type <feature_type>
Feature type when importing precomputed features from feats.scp
- Default:
kaldi-fbank
- Options:
kaldi-fbank | kaldi-mfcc
- -d, --compute-durations
Compute durations by reading the whole file instead of using reco2dur file
- Default:
False
Arguments
- DATA_DIR
Required argument
- SAMPLING_RATE
Required argument
- MANIFEST_DIR
Required argument
list-audio-backends
List the names of all available audio backends.
lhotse list-audio-backends [OPTIONS]
prepare
Command group with data preparation recipes.
lhotse prepare [OPTIONS] COMMAND [ARGS]...
adept
ADEPT prosody transfer evaluation corpus data preparation.
lhotse prepare adept [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
aidatatang-200zh
aidatatang_200zh ASR data preparation. CORPUS_DIR should contain a subdirectory “aidatatang_200zh”; OUTPUT_DIR is the output directory.
lhotse prepare aidatatang-200zh [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
aishell
Aishell ASR data preparation.
lhotse prepare aishell [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
aishell2
Aishell2 ASR data preparation.
lhotse prepare aishell2 [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
aishell3
aishell3 data preparation.
lhotse prepare aishell3 [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
aishell4
AISHELL-4 data preparation.
lhotse prepare aishell4 [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --normalize-text
Conduct text normalization (remove punctuation, uppercase, etc.)
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
ali-meeting
AliMeeting data preparation.
lhotse prepare ali-meeting [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --mic <mic>
- Options:
near | far | ihm | sdm | mdm
- --normalize-text <normalize_text>
Type of text normalization to apply (M2MeT style is from the official challenge)
- Options:
none | m2met
- --save-mono
If True and mic is sdm, extract first channel and save as new recording.
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
ami
AMI data preparation.
lhotse prepare ami [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --annotations <annotations>
Provide if annotations are downloaded in a different directory than the corpus.
- --mic <mic>
AMI microphone setting.
- Options:
ihm | ihm-mix | sdm | mdm | mdm8-bf
- --partition <partition>
Data partition to use (see http://groups.inf.ed.ac.uk/ami/corpus/datasets.shtml).
- Options:
scenario-only | full-corpus | full-corpus-asr
- --normalize-text <normalize_text>
Type of text normalization to apply (kaldi style, by default)
- Options:
none | upper | kaldi
- --max-words-per-segment <max_words_per_segment>
Maximum number of words per segment (similar to Kaldi-style segmentation). If None, no segmentation is performed.
- --merge-consecutive
Merge consecutive segments from the same speaker.
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
aspire
ASpIRE data preparation.
lhotse prepare aspire [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --mic <mic>
- Options:
single | multi
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
atcosim
ATCOSIM data preparation.
lhotse prepare atcosim [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --silence-sym <silence_sym>
- --breath-sym <breath_sym>
- --foreign-sym <foreign_sym>
- --partial-sym <partial_sym>
- --unknown-sym <unknown_sym>
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
audio-mnist
AudioMNIST corpus data preparation.
lhotse prepare audio-mnist [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
babel
This is a data preparation recipe for the IARPA BABEL corpus (see: https://www.iarpa.gov/index.php/research-programs/babel). It should support all of the languages available in BABEL. It will prepare the data from the “conversational” part of BABEL.
This script should be invoked separately for each language you want to prepare, e.g.:
$ lhotse prepare babel /export/corpora5/Babel/IARPA_BABEL_BP_101 data/cantonese
$ lhotse prepare babel /export/corpora5/Babel/BABEL_OP1_103 data/bengali
lhotse prepare babel [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
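Because the recipe runs once per language, a small driver script can generate the invocations. The following is a minimal sketch using only the Python standard library. It reuses the two example paths from the description above; the `LANGUAGES` mapping and the `babel_commands` helper are hypothetical names, not part of Lhotse. The script only prints the commands and does not run them.

```python
import shlex

# Corpus directory -> output directory, taken from the two examples above.
LANGUAGES = {
    "/export/corpora5/Babel/IARPA_BABEL_BP_101": "data/cantonese",
    "/export/corpora5/Babel/BABEL_OP1_103": "data/bengali",
}


def babel_commands(languages):
    """Return one `lhotse prepare babel` argument list per language."""
    return [
        ["lhotse", "prepare", "babel", corpus_dir, output_dir]
        for corpus_dir, output_dir in languages.items()
    ]


for cmd in babel_commands(LANGUAGES):
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Each argument list could also be handed to `subprocess.run(cmd, check=True)` once the corpus paths exist on the machine.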
baker-zh
baker_zh data preparation.
lhotse prepare baker-zh [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
bengaliai-speech
Bengali.AI Speech data preparation.
lhotse prepare bengaliai-speech [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
broadcast-news
English Broadcast News 1997 data preparation. It will output three manifests: for recordings, topic sections, and speech segments. It supports the following LDC distributions:
This data is not available for free - your institution needs to have an LDC subscription.
lhotse prepare broadcast-news [OPTIONS] AUDIO_DIR TRANSCRIPT_DIR OUTPUT_DIR
Arguments
- AUDIO_DIR
Required argument
- TRANSCRIPT_DIR
Required argument
- OUTPUT_DIR
Required argument
but-reverb-db
BUT Reverb DB data preparation.
lhotse prepare but-reverb-db [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -p, --parts <parts>
Parts to prepare.
- Default:
silence, rir
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
bvcc
BVCC data preparation.
CORPUS_DIR should contain the following directory structure:
./phase1-main/README
./phase1-main/DATA/sets/*
./phase1-main/DATA/wav/*
…
./phase1-ood/README
./phase1-ood/DATA/sets/
./phase1-ood/DATA/wav/
…
Check the READMEs for details.
See ‘lhotse download bvcc’ for links to instructions on how to obtain the corpus.
lhotse prepare bvcc [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -nj, --num_jobs <num_jobs>
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
callhome-egyptian
About the Callhome Egyptian Arabic Corpus
The CALLHOME Egyptian Arabic corpus of telephone speech consists of 120 unscripted telephone conversations between native speakers of Egyptian Colloquial Arabic (ECA), the spoken variety of Arabic found in Egypt. The dialect of ECA that this dictionary represents is Cairene Arabic.
This recipe uses the speech and transcripts available through LDC. In addition, an Egyptian Arabic phonetic lexicon (available via LDC) is used to get word-to-phoneme mappings for the vocabulary. These datasets are:
Speech: LDC97S45
Transcripts: LDC97T19
Lexicon: LDC99L22 (unused here)
To actually read the audio, you will need the SPH2PIPE binary: you can provide its path, so that we will add it in the manifests (otherwise you might need to modify your PATH environment variable to find sph2pipe).
lhotse prepare callhome-egyptian [OPTIONS] AUDIO_DIR TRANSCRIPT_DIR OUTPUT_DIR
Options
- --absolute-paths <absolute_paths>
Whether to return absolute or relative (to the corpus dir) paths for recordings.
Arguments
- AUDIO_DIR
Required argument
- TRANSCRIPT_DIR
Required argument
- OUTPUT_DIR
Required argument
callhome-english
CallHome American English corpus preparation.
Depending on whether --transcript-dir is provided, either the SRE corpus (LDC2001S97) or the ASR corpus (LDC97S42 and LDC97T14) will be prepared. The data is not available for free.
The data should be located at AUDIO_DIR. Optionally, for the SRE task, RTTM_DIR can be provided that has the contents of http://www.openslr.org/resources/10/; otherwise, we will download it.
To actually read the audio, you will need the SPH2PIPE binary: you can provide its path, so that we will add it in the manifests (otherwise you might need to modify your PATH environment variable to find sph2pipe).
Example:
lhotse prepare callhome-english /export/corpora5/LDC/LDC97S42 --transcript-dir /export/corpora5/LDC/LDC97T14 ./callhome_asr
or
lhotse prepare callhome-english /export/corpora5/LDC/LDC2001S97 ./callhome_sre
lhotse prepare callhome-english [OPTIONS] AUDIO_DIR OUTPUT_DIR
Options
- --rttm-dir <rttm_dir>
- --absolute-paths <absolute_paths>
Whether to return absolute or relative (to the corpus dir) paths for recordings.
- --transcript-dir <transcript_dir>
Path to the LDC97T14 corpus. Note that when this path is provided, the ASR corpus is prepared, not the SRE corpus!
Arguments
- AUDIO_DIR
Required argument
- OUTPUT_DIR
Required argument
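The choice between the ASR and SRE variants comes down to whether `--transcript-dir` is passed. A minimal sketch of that switch follows, using only the Python standard library and the example paths given above. The helper `callhome_english_command` is a hypothetical name introduced for illustration; it builds and prints the command lines without executing them.

```python
import shlex
from typing import List, Optional


def callhome_english_command(
    audio_dir: str,
    output_dir: str,
    transcript_dir: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `lhotse prepare callhome-english`.

    Passing transcript_dir selects the ASR corpus; omitting it selects SRE.
    """
    cmd = ["lhotse", "prepare", "callhome-english"]
    if transcript_dir is not None:
        cmd += ["--transcript-dir", transcript_dir]
    cmd += [audio_dir, output_dir]
    return cmd


# ASR variant: audio from LDC97S42 plus transcripts from LDC97T14.
asr = callhome_english_command(
    "/export/corpora5/LDC/LDC97S42",
    "./callhome_asr",
    transcript_dir="/export/corpora5/LDC/LDC97T14",
)
# SRE variant: no transcript directory is given.
sre = callhome_english_command("/export/corpora5/LDC/LDC2001S97", "./callhome_sre")

print(shlex.join(asr))
print(shlex.join(sre))
```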
cdsd
CDSD ASR data preparation.
lhotse prepare cdsd [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
chime6
CHiME-6 data preparation.
lhotse prepare chime6 [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -p, --dataset-parts <dataset_parts>
List of dataset parts to prepare. To prepare multiple parts, pass each with -p. Example: -p dev -p eval. By default, all parts are prepared.
- --mic <mic>
CHiME-6 microphone setting.
- Options:
ihm | mdm
- --use-reference-array
If True, use the reference array for the MDM setting. Only the supervision segments have the reference array information in the channel field. The recordings will still have all the channels in the array. Note that reference array is not available for the train set.
- --perform-array-sync
If True, perform array synchronization for the MDM setting.
- --verify-md5-checksums
If True, verify the MD5 checksums of the audio files. This is useful to ensure correct array synchronization. Note that this step is slow, so we recommend only doing it once. You can speed it up by using more jobs.
- -j, --num-jobs <num_jobs>
Number of parallel jobs to run for array synchronization.
- -t, --num-threads-per-job <num_threads_per_job>
Number of threads to use per job for clock drift correction. Large values may require more memory, so we recommend using a job scheduler.
- --sox-path <sox_path>
Path to the sox binary. This must be v14.4.2.
- Default:
/usr/bin/sox
- --normalize-text <normalize_text>
Text normalization method.
- Default:
kaldi
- Options:
none | upper | kaldi
- --use-chime7-split
If True, use the new splits from CHiME-7 challenge.
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
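With this many options, it can help to assemble the command programmatically and check the choice-valued options before launching a long job. The sketch below is one way to do that with the Python standard library. The helper `chime6_command` and the paths are hypothetical; only the option names and allowed values come from the listing above, and the helper does not cover every option.

```python
import shlex
from typing import List, Optional, Sequence

# Allowed values copied from the option listing above.
MIC_CHOICES = ("ihm", "mdm")
NORMALIZE_CHOICES = ("none", "upper", "kaldi")


def chime6_command(
    corpus_dir: str,
    output_dir: str,
    mic: str,
    parts: Sequence[str] = (),
    normalize_text: Optional[str] = None,
    use_reference_array: bool = False,
    num_jobs: Optional[int] = None,
) -> List[str]:
    """Build the argument list for `lhotse prepare chime6`."""
    if mic not in MIC_CHOICES:
        raise ValueError(f"mic must be one of {MIC_CHOICES}, got {mic!r}")
    if normalize_text is not None and normalize_text not in NORMALIZE_CHOICES:
        raise ValueError(f"normalize_text must be one of {NORMALIZE_CHOICES}")

    cmd = ["lhotse", "prepare", "chime6", "--mic", mic]
    for part in parts:
        # Each dataset part gets its own -p flag.
        cmd += ["-p", part]
    if normalize_text is not None:
        cmd += ["--normalize-text", normalize_text]
    if use_reference_array:
        cmd.append("--use-reference-array")
    if num_jobs is not None:
        cmd += ["-j", str(num_jobs)]
    cmd += [corpus_dir, output_dir]
    return cmd


# Hypothetical paths, for illustration only.
cmd = chime6_command(
    "/path/to/CHiME6", "data/manifests", mic="mdm", parts=["dev", "eval"], num_jobs=4
)
print(shlex.join(cmd))
```

Options left at `None` are omitted from the command, so the CLI's own defaults apply.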
cmu-arctic
CMU Arctic data preparation.
lhotse prepare cmu-arctic [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
cmu-indic
CMU Indic data preparation.
lhotse prepare cmu-indic [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
cmu-kids
CMU Kids corpus data preparation.
lhotse prepare cmu-kids [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --absolute-paths <absolute_paths>
Use absolute paths for recordings
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
commonvoice
Mozilla CommonVoice manifest preparation script. CORPUS_DIR is expected to contain sub-directories that are named with CommonVoice language codes, e.g., “en”, “pl”, etc.
lhotse prepare commonvoice [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -l, --language <language>
Languages to prepare (scans CORPUS_DIR for language codes by default).
- -s, --split <split>
Splits to prepare (available options: train, dev, test, validated, invalidated, other)
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
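To restrict preparation to particular languages and splits, the options can be combined. The sketch below assumes that `-l` and `-s` are repeated once per value, as the plural wording of their help text suggests; that is an assumption, not something the listing states outright. The helper `commonvoice_command` and the paths are hypothetical, and the split names are those listed in the `--split` help text.

```python
import shlex
from typing import List, Sequence

# Split names listed in the --split help text above.
KNOWN_SPLITS = ("train", "dev", "test", "validated", "invalidated", "other")


def commonvoice_command(
    corpus_dir: str,
    output_dir: str,
    languages: Sequence[str] = (),
    splits: Sequence[str] = (),
) -> List[str]:
    """Build the argument list for `lhotse prepare commonvoice`."""
    unknown = [s for s in splits if s not in KNOWN_SPLITS]
    if unknown:
        raise ValueError(f"Unknown split(s): {unknown}")

    cmd = ["lhotse", "prepare", "commonvoice"]
    for lang in languages:
        cmd += ["-l", lang]  # assumed: one -l per language
    for split in splits:
        cmd += ["-s", split]  # assumed: one -s per split
    cmd += [corpus_dir, output_dir]
    return cmd


# Hypothetical paths; "en" and "pl" are the language codes mentioned above.
cmd = commonvoice_command(
    "/path/to/cv-corpus", "data/manifests", languages=["en", "pl"], splits=["train", "dev"]
)
print(shlex.join(cmd))
```

Leaving `languages` empty produces a command without `-l`, in which case the CLI scans CORPUS_DIR for language codes, as described above.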
csj
Prepare Corpus of Spontaneous Japanese
lhotse prepare csj [OPTIONS] CORPUS_DIR MANIFEST_DIR
Options
- -t, --transcript-dir <transcript_dir>
Directory to save parsed transcripts in txt format, with valid and eval sets created from the core and noncore datasets. If not provided, this script will not create valid and eval sets.
- -p, --dataset-parts <dataset_parts>
List of dataset parts to prepare. To prepare multiple parts, pass each with -p. Example: -p eval1 -p eval2
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
Arguments
- CORPUS_DIR
Required argument
- MANIFEST_DIR
Required argument
cslu-kids
CSLU Kids corpus data preparation.
lhotse prepare cslu-kids [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --absolute-paths <absolute_paths>
Use absolute paths for recordings
- --normalize-text <normalize_text>
Remove noise tags (<bn>, <bs>) from spontaneous speech transcripts
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
daily-talk
DailyTalk recording and supervision manifest preparation.
lhotse prepare daily-talk [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --num-jobs <num_jobs>
Number of parallel workers.
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
dihard3
DIHARD3 data preparation.
lhotse prepare dihard3 [OPTIONS] OUTPUT_DIR
Options
- --dev <dev>
- --eval <eval>
- --uem, --no-uem
Specify whether or not to create UEM supervision
- -j, --num-jobs <num_jobs>
Number of jobs to scan corpus directory for recordings.
Arguments
- OUTPUT_DIR
Required argument
dipco
DiPCo data preparation.
lhotse prepare dipco [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --mic <mic>
DiPCo microphone setting.
- Options:
ihm | mdm
- --normalize-text <normalize_text>
Text normalization method.
- Default:
kaldi
- Options:
none | upper | kaldi
- --use-chime7-offset
If True, offset session IDs (from CHiME-7 challenge).
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
earnings21
Earnings21 data preparation.
lhotse prepare earnings21 [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --normalize-text, --no-normalize-text
Normalize the text.
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
earnings22
Earnings22 data preparation.
lhotse prepare earnings22 [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --normalize-text, --no-normalize-text
Normalize the text.
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
ears
EARS data preparation.
lhotse prepare ears [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
edacc
The Edinburgh International Accents of English Corpus (EDACC) data preparation.
lhotse prepare edacc [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
emilia
Prepare the Emilia corpus manifests.
lhotse prepare emilia [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -l, --lang <lang>
The language to process. Valid values: zh, en, ja, ko, de, fr
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
eval2000
The Eval2000 corpus preparation.
This data is not available for free - your institution needs to have an LDC subscription.
lhotse prepare eval2000 [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --transcript-dir <transcript_dir>
- --absolute-paths <absolute_paths>
Whether to return absolute or relative (to the corpus dir) paths for recordings.
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
fisher-english
The Fisher English Part 1, 2 corpus preparation.
This data is not available for free - your institution needs to have an LDC subscription.
lhotse prepare fisher-english [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -ad, --audio-dirs <audio_dirs>
Audio dirs, e.g., LDC2004S13 LDC2005S13. Multiple corpora can be provided by repeating -ad.
- -td, --transcript-dirs <transcript_dirs>
Transcript dirs, e.g., LDC2004T19 LDC2005T19. Multiple corpora can be provided by repeating -td.
- --absolute-paths <absolute_paths>
Whether to return absolute or relative (to the corpus dir) paths for recordings.
- -j, --num-jobs <num_jobs>
Number of concurrent processes scanning the audio files.
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
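When both Fisher parts are prepared together, the audio and transcript packages are each passed with their own repeated flag. A minimal sketch follows, using the LDC catalog names from the option descriptions above. The helpers `repeat_option` and `fisher_english_command` and the corpus path are hypothetical, introduced only for illustration.

```python
import shlex
from typing import List, Optional, Sequence


def repeat_option(flag: str, values: Sequence[str]) -> List[str]:
    """Expand values into [flag, v1, flag, v2, ...]."""
    args: List[str] = []
    for value in values:
        args += [flag, value]
    return args


def fisher_english_command(
    corpus_dir: str,
    output_dir: str,
    audio_dirs: Sequence[str],
    transcript_dirs: Sequence[str],
    num_jobs: Optional[int] = None,
) -> List[str]:
    """Build the argument list for `lhotse prepare fisher-english`."""
    cmd = ["lhotse", "prepare", "fisher-english"]
    cmd += repeat_option("-ad", audio_dirs)
    cmd += repeat_option("-td", transcript_dirs)
    if num_jobs is not None:
        cmd += ["-j", str(num_jobs)]
    cmd += [corpus_dir, output_dir]
    return cmd


# Hypothetical corpus location; the LDC directory names are from the text above.
cmd = fisher_english_command(
    "/path/to/LDC",
    "data/manifests",
    audio_dirs=["LDC2004S13", "LDC2005S13"],
    transcript_dirs=["LDC2004T19", "LDC2005T19"],
)
print(shlex.join(cmd))
```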
fisher-spanish
The Fisher Spanish corpus preparation.
This data is not available for free - your institution needs to have an LDC subscription.
lhotse prepare fisher-spanish [OPTIONS] AUDIO_DIR TRANSCRIPT_DIR OUTPUT_DIR
Options
- --absolute-paths <absolute_paths>
Whether to return absolute or relative (to the corpus dir) paths for recordings.
Arguments
- AUDIO_DIR
Required argument
- TRANSCRIPT_DIR
Required argument
- OUTPUT_DIR
Required argument
fleurs
Fleurs ASR data preparation.
lhotse prepare fleurs [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
- -l, --lang <lang>
Specify which languages to prepare, e.g., lhotse prepare fleurs fleurs_corpus data -l de -l fr -l es
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
gale-arabic
GALE Arabic Phases 1 to 4 Broadcast news and conversation data preparation.
lhotse prepare gale-arabic [OPTIONS] OUTPUT_DIR
Options
- -s, --audio <audio>
Paths to audio dirs, e.g., LDC2013S02. Multiple corpora can be provided by repeating -s.
- -t, --transcript <transcript>
Paths to transcript dirs, e.g., LDC2013T17. Multiple corpora can be provided by repeating -t
- --absolute-paths <absolute_paths>
Use absolute paths for recordings
Arguments
- OUTPUT_DIR
Required argument
gale-mandarin
GALE Mandarin Broadcast speech data preparation.
lhotse prepare gale-mandarin [OPTIONS] OUTPUT_DIR
Options
- -s, --audio <audio>
Paths to audio dirs, e.g., LDC2013S08. Multiple corpora can be provided by repeating -s.
- -t, --transcript <transcript>
Paths to transcript dirs, e.g., LDC2013T20. Multiple corpora can be provided by repeating -t
- --absolute-paths <absolute_paths>
Use absolute paths for recordings
- --segment-words <segment_words>
Use ‘jieba’ package to perform word segmentation on the text
Arguments
- OUTPUT_DIR
Required argument
gigaspeech
Gigaspeech ASR data preparation.
lhotse prepare gigaspeech [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --subset <subset>
Which parts of Gigaspeech to download (by default XL + DEV + TEST).
- Options:
auto | XL | L | M | S | XS | DEV | TEST
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
gigast
GigaST data preparation.
lhotse prepare gigast [OPTIONS] CORPUS_DIR MANIFESTS_DIR OUTPUT_DIR
Options
- -l, --language <language>
Languages to download. One of: ‘all’ (downloads all known languages) or a single language code (e.g., ‘en’).
- Options:
auto | de | zh
- --subset <subset>
Which parts of Gigaspeech to download (by default XL + DEV + TEST).
- Options:
auto | XL | L | M | S | XS | DEV | TEST
Arguments
- CORPUS_DIR
Required argument
- MANIFESTS_DIR
Required argument
- OUTPUT_DIR
Required argument
grid
Grid audio-visual speech corpus preparation.
lhotse prepare grid [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --with-supervisions, --no-supervisions
Note: using supervisions might discard some recordings that do not have them.
- -j, --jobs <jobs>
The number of parallel jobs.
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
heroico
heroico Answers ASR data preparation.
lhotse prepare heroico [OPTIONS] SPEECH_DIR TRANSCRIPT_DIR OUTPUT_DIR
Arguments
- SPEECH_DIR
Required argument
- TRANSCRIPT_DIR
Required argument
- OUTPUT_DIR
Required argument
hifitts
HiFiTTS data preparation.
lhotse prepare hifitts [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -j, --num-jobs <num_jobs>
How many jobs to use (can give good speed-ups with slow disks).
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
himia
HI_MIA and HI_MIA_CW data preparation.
lhotse prepare himia [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -p, --dataset-parts <dataset_parts>
List of dataset parts to prepare. To prepare multiple parts, pass each with -p. Example: -p test -p cw_test. Both HI_MIA and HI_MIA_CW are prepared by default. All possible data parts are train, dev, test and cw_test.
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
icmcasr
ICMC-ASR data preparation.
lhotse prepare icmcasr [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
- --mic <mic>
Microphone type.
- Options:
ihm | sdm | mdm
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
icsi
ICSI data preparation.
lhotse prepare icsi [OPTIONS] AUDIO_DIR OUTPUT_DIR
Options
- --transcripts-dir <transcripts_dir>
- --mic <mic>
ICSI microphone setting.
- Options:
ihm | ihm-mix | sdm | mdm
- --normalize-text <normalize_text>
Type of text normalization to apply (kaldi style, by default)
- Options:
none | upper | kaldi
- --save-to-wav
If True and mic is sdm/ihm/mdm, save the recordings as WAV for faster processing.
Arguments
- AUDIO_DIR
Required argument
- OUTPUT_DIR
Required argument
iwslt22-ta
IWSLT_2022 data preparation.
This is conversational telephone speech collected as 8kHz-sampled data. The catalog number LDC2022E01 corresponds to the train, dev, and test1 splits of the iwslt2022 shared task. To obtain this data, your institution needs to have an LDC subscription. You should also download the predefined splits with:
git clone https://github.com/kevinduh/iwslt22-dialect.git
lhotse prepare iwslt22-ta [OPTIONS] CORPUS_DIR SPLITS OUTPUT_DIR
Options
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
- --normalize-text <normalize_text>
Whether to perform additional text cleaning and normalization from https://aclanthology.org/2022.iwslt-1.29.pdf.
- --langs <langs>
Comma-separated list of language abbreviations for source and target languages
Arguments
- CORPUS_DIR
Required argument
- SPLITS
Required argument
- OUTPUT_DIR
Required argument
kespeech
The KeSpeech corpus preparation.
lhotse prepare kespeech [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -p, --dataset-parts <dataset_parts>
List of dataset parts to prepare. To prepare multiple parts, pass each with -p. Example: -p dev_phase1 -p dev_phase2
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
ksponspeech
KsponSpeech ASR data preparation.
lhotse prepare ksponspeech [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -p, --dataset-parts <dataset_parts>
List of dataset parts to prepare. To prepare multiple parts, pass each with -p. Example: -p train -p test
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
- --normalize-text <normalize_text>
Type of text normalization to apply.
- Options:
none | default
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
l2-arctic
L2 Arctic data preparation.
lhotse prepare l2-arctic [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
libricss
LibriCSS recording and supervision manifest preparation.
lhotse prepare libricss [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --type <type>
Type of the corpus to prepare
- Default:
mdm
- Options:
ihm | ihm-mix | sdm | mdm
- --segmented, --no-segmented
If True, the manifest will contain Cuts corresponding to 1-minute segments.
- Default:
False
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
librilight
LibriLight data preparation.
lhotse prepare librilight [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
librimix
LibriMix source separation data preparation.
lhotse prepare librimix [OPTIONS] LIBRIMIX_CSV OUTPUT_DIR
Options
- --sampling-rate <sampling_rate>
Sampling rate to set in the RecordingSet manifest.
- --min-segment-seconds <min_segment_seconds>
Remove segments shorter than MIN_SEGMENT_SECONDS.
- --with-precomputed-mixtures, --no-precomputed-mixtures
Optionally create a RecordingSet manifest including the precomputed LibriMix mixtures.
Arguments
- LIBRIMIX_CSV
Required argument
- OUTPUT_DIR
Required argument
librispeech
(Mini) Librispeech ASR data preparation.
lhotse prepare librispeech [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --alignments-dir <alignments_dir>
Path to the directory with the alignments (optional).
- -p, --dataset-parts <dataset_parts>
List of dataset parts to prepare. To prepare multiple parts, pass each with -p. Example: -p train-clean-360 -p dev-other
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
- --normalize-text <normalize_text>
Conversion of transcripts to lower-case (originally in upper-case).
- Default:
none
- Options:
none | lower
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
libritts
LibriTTS data preparation.
lhotse prepare libritts [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -j, --num-jobs <num_jobs>
How many jobs to use (can give good speed-ups with slow disks).
- --link-previous-utterance, --no-previous-utterance
If true, adds the previous utterance ID to supervisions. Useful for reconstructing chains of utterances as they were read from LibriVox books. If the previous utterance was skipped from the LibriTTS datasets, the previous_utt label is None. 66% of utterances have a previous utterance.
- -p, --dataset-parts <dataset_parts>
List of dataset parts to prepare. To prepare multiple parts, pass each with -p. Example: -p train-clean-360 -p dev-other
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
librittsr
LibriTTS-R data preparation.
lhotse prepare librittsr [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -j, --num-jobs <num_jobs>
How many jobs to use (can give good speed-ups with slow disks).
- --link-previous-utterance, --no-previous-utterance
If true, adds the previous utterance ID to supervisions. Useful for reconstructing chains of utterances as they were read from LibriVox books. If the previous utterance was skipped from the LibriTTS datasets, the previous_utt label is None. 66% of utterances have a previous utterance.
- -p, --dataset-parts <dataset_parts>
List of dataset parts to prepare. To prepare multiple parts, pass each with -p. Example: -p train-clean-360 -p dev-other
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
ljspeech
LJSpeech data preparation.
lhotse prepare ljspeech [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
magicdata
Magicdata ASR data preparation.
lhotse prepare magicdata [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
mdcc
MDCC data preparation.
lhotse prepare mdcc [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -p, --dataset-parts <dataset_parts>
List of dataset parts to prepare. To prepare multiple parts, pass each with -p. Example: -p train -p valid
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
medical
Medical data preparation.
lhotse prepare medical [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
mgb2
mgb2 ASR data preparation.
lhotse prepare mgb2 [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --text-cleaning, --no-text-cleaning
Basic text cleaning.
- --buck-walter, --no-buck-walter
Use BuckWalter transliteration.
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
- --mer-thresh <mer_thresh>
Filter out segments based on MER (Match Error Rate).
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
mls
Multilingual Librispeech (MLS) data preparation.
Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages - English, German, Dutch, Spanish, French, Italian, Portuguese, Polish. It is available at OpenSLR: http://openslr.org/94
lhotse prepare mls [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --opus, --flac
Which codec should be used (OPUS or FLAC)
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
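The `--opus, --flac` pair is a two-way switch: exactly one of the two flags selects the codec. A minimal sketch of building the command for either codec is shown below, using only the Python standard library. The helper `mls_command` and the paths are hypothetical and serve only as an illustration.

```python
import shlex
from typing import List, Optional


def mls_command(
    corpus_dir: str,
    output_dir: str,
    use_opus: bool,
    num_jobs: Optional[int] = None,
) -> List[str]:
    """Build the argument list for `lhotse prepare mls`."""
    # Exactly one of the paired flags is emitted.
    codec_flag = "--opus" if use_opus else "--flac"
    cmd = ["lhotse", "prepare", "mls", codec_flag]
    if num_jobs is not None:
        cmd += ["-j", str(num_jobs)]
    cmd += [corpus_dir, output_dir]
    return cmd


# Hypothetical paths, for illustration only.
flac_cmd = mls_command("/path/to/mls", "data/manifests", use_opus=False, num_jobs=8)
opus_cmd = mls_command("/path/to/mls", "data/manifests", use_opus=True)
print(shlex.join(flac_cmd))
print(shlex.join(opus_cmd))
```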
mtedx
MTEDx ASR data preparation.
lhotse prepare mtedx [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
- -l, --lang <lang>
Specify which languages to prepare, e.g., lhotse prepare mtedx mtedx_corpus data -l de -l fr -l es
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
musan
MUSAN data preparation.
lhotse prepare musan [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --use-vocals, --no-vocals
Whether to include vocal music in “music” part.
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
must-c
MUST-C speech translation data preparation.
lhotse prepare must-c [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --tgt-lang <tgt_lang>
The target language, e.g., zh, de, fr.
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
nsc
lhotse prepare nsc [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -p, --dataset-part <dataset_part>
Which part of NSC should be prepared
- Options:
PART1_CHANNEL0 | PART1_CHANNEL1 | PART1_CHANNEL2 | PART2_CHANNEL0 | PART2_CHANNEL1 | PART2_CHANNEL2 | PART3_SameBoundaryMic | PART3_SameCloseMic | PART3_SeparateIVR | PART3_SeparateStandingMic | PART4_CodeswitchingDiffRoom | PART4_CodeswitchingSameRoom | PART5_Debate | PART5_FinanceEmotion | PART6_CallCentreDesign1 | PART6_CallCentreDesign2 | PART6_CallCentreDesign3
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
peoples-speech
Prepare The People’s Speech corpus manifests.
lhotse prepare peoples-speech [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
primewords
Primewords ASR data preparation.
lhotse prepare primewords [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
radio
Data preparation
lhotse prepare radio [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -d, --min-seg-dur <min_seg_dur>
The minimum segment duration
- -j, --num-jobs <num_jobs>
The number of parallel threads to use for data preparation
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
reazonspeech
ReazonSpeech ASR data preparation.
lhotse prepare reazonspeech [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
rir-noise
RIRS and noises data preparation.
lhotse prepare rir-noise [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -p, --parts <parts>
Parts to prepare.
- Default:
point_noise, iso_noise, real_rir, sim_rir
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
sbcsae
SBCSAE data preparation.
lhotse prepare sbcsae [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --geolocation
Include geographic coordinates of speakers’ hometowns in the manifests.
- --omit-realignments
Only output the original corpus segmentation without boundary improvements.
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
slu
lhotse prepare slu [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
spatial-librispeech
Spatial-LibriSpeech ASR data preparation.
lhotse prepare spatial-librispeech [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -p, --dataset-parts <dataset_parts>
List of dataset parts to prepare. To prepare multiple parts, pass each with -p. Example: -p train -p test
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
- --normalize-text <normalize_text>
Conversion of transcripts to lower-case (originally in upper-case).
- Default:
none
- Options:
none | lower
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
speechcommands
Speech Commands v0.01 or v0.02 data preparation.
lhotse prepare speechcommands [OPTIONS] SPEECHCOMMANDS_VERSION CORPUS_DIR OUTPUT_DIR
Arguments
- SPEECHCOMMANDS_VERSION
Required argument
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
speechio
SpeechIO data preparation. See https://github.com/SpeechColab/Leaderboard
lhotse prepare speechio [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
spgispeech
SPGISpeech ASR data preparation.
lhotse prepare spgispeech [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
- --normalize-text, --no-normalize-text
Normalize the text.
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
stcmds
Stcmds ASR data preparation.
lhotse prepare stcmds [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
switchboard
The Switchboard corpus preparation.
This data is not available for free - your institution needs to have an LDC subscription.
lhotse prepare switchboard [OPTIONS] AUDIO_DIR OUTPUT_DIR
Options
- --transcript-dir <transcript_dir>
- --sentiment-dir <sentiment_dir>
Optional path to LDC2020T14 package with sentiment annotations for SWBD.
- --omit-silence, --retain-silence
Should the [silence] segments be kept.
- --absolute-paths <absolute_paths>
Whether to return absolute or relative (to the corpus dir) paths for recordings.
Arguments
- AUDIO_DIR
Required argument
- OUTPUT_DIR
Required argument
tal-asr
Tal_asr ASR data preparation.
lhotse prepare tal-asr [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
tal-csasr
Tal_csasr ASR data preparation.
lhotse prepare tal-csasr [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
tedlium
TED-LIUM v3 recording and supervision manifest preparation.
lhotse prepare tedlium [OPTIONS] TEDLIUM_DIR OUTPUT_DIR
Options
- -p, --parts <parts>
Which parts of TED-LIUM v3 to prepare (by default all).
- Options:
train | dev | test
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
- --normalize-text <normalize_text>
Type of text normalization to apply (no normalization, by default). Selecting kaldi will remove <unk> tokens and join suffixes.
- Options:
none | upper | kaldi
Arguments
- TEDLIUM_DIR
Required argument
- OUTPUT_DIR
Required argument
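The kaldi normalization described above can be illustrated with a rough sketch. It assumes that "join suffixes" means re-attaching apostrophe suffixes that are written as separate tokens (e.g. "they 're"); the function name and the exact rules are illustrative and are not taken from Lhotse's implementation.

```python
import re


def normalize_kaldi_style(text: str) -> str:
    """Illustrative only: drop <unk> tokens and join apostrophe suffixes."""
    # Remove the <unk> tokens.
    text = text.replace("<unk>", " ")
    # Join a detached suffix with the preceding word, e.g. "they 're" -> "they're".
    text = re.sub(r"(\w) '(\w)", r"\1'\2", text)
    # Collapse any whitespace left behind by the removals.
    return " ".join(text.split())


print(normalize_kaldi_style("they 're <unk> going"))
```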
tedlium2
TED-LIUM v2 recording and supervision manifest preparation.
lhotse prepare tedlium2 [OPTIONS] TEDLIUM_DIR OUTPUT_DIR
Options
- -p, --parts <parts>
Which parts of TED-LIUM v2 to prepare (by default all, i.e., (‘train’, ‘dev’, ‘test’)).
- Options:
train | dev | test
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
- --normalize-text <normalize_text>
Type of text normalization to apply (no normalization, by default). Selecting kaldi will remove <unk> tokens and join suffixes.
- Options:
none | upper | kaldi
Arguments
- TEDLIUM_DIR
Required argument
- OUTPUT_DIR
Required argument
thchs-30
thchs_30 ASR data preparation.
lhotse prepare thchs-30 [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
this-american-life
This American Life data preparation.
lhotse prepare this-american-life [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
timit
TIMIT data preparation. CORPUS_DIR is the path to the data directory; OUTPUT_DIR is the path where the manifests are written.
lhotse prepare timit [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -p, --num-phones <num_phones>
The number of phones (60, 48 or 39) used for modeling. The default value is 48.
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
uwb-atcc
UWB-ATCC data preparation.
lhotse prepare uwb-atcc [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --silence-sym <silence_sym>
- --breath-sym <breath_sym>
- --noise-sym <noise_sym>
- --foreign-sym <foreign_sym>
- --partial-sym <partial_sym>
- --unintelligble-sym <unintelligble_sym>
- --unknown-sym <unknown_sym>
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
vctk
VCTK data preparation.
lhotse prepare vctk [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --use-edinburgh-vctk-url <use_edinburgh_vctk_url>
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
voxceleb
The VoxCeleb corpus preparation.
VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube. VoxCeleb contains speech from speakers spanning a wide range of different ethnicities, accents, professions and ages. There are a total of 7000+ speakers and 1 million utterances.
lhotse prepare voxceleb [OPTIONS] OUTPUT_DIR
Options
- -v1, --voxceleb1 <voxceleb1>
Path to VoxCeleb1 dataset.
- -v2, --voxceleb2 <voxceleb2>
Path to VoxCeleb2 dataset.
- -j, --num-jobs <num_jobs>
Number of parallel jobs.
Arguments
- OUTPUT_DIR
Required argument
voxconverse
VoxConverse data preparation.
lhotse prepare voxconverse [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --split-test
Split test part into dev and test parts
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
voxpopuli
VoxPopuli data preparation.
lhotse prepare voxpopuli [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --task <task>
The task for which to prepare the VoxPopuli data.
- Default:
asr
- Options:
asr | s2s | lm
- --lang <lang>
The language to prepare (only used if task is asr or lm).
- Default:
en
- Options:
en | de | fr | es | pl | it | ro | hu | cs | nl | fi | hr | sk | sl | et | lt | pt | bg | el | lv | mt | sv | da | en_v2 | de_v2 | fr_v2 | es_v2 | pl_v2 | it_v2 | ro_v2 | hu_v2 | cs_v2 | nl_v2 | fi_v2 | hr_v2 | sk_v2 | sl_v2 | et_v2 | lt_v2 | pt_v2 | bg_v2 | el_v2 | lv_v2 | mt_v2 | sv_v2 | da_v2
- --src-lang <src_lang>
The source language (only used if task is s2s).
- Options:
en | de | fr | es | pl | it | ro | hu | cs | nl | fi | hr | sk | sl | et | lt
- --tgt-lang <tgt_lang>
The target language (only used if task is s2s).
- Options:
en | de | fr | es | pl | it | ro | hu | cs | nl | fi | hr | sk | sl | et | lt | pt | bg | el | lv | mt | sv | da
- -j, --num-jobs <num_jobs>
Number of parallel jobs (can provide small speed-ups).
- Default:
1
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
wenet-speech
The WenetSpeech corpus preparation.
lhotse prepare wenet-speech [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -p, --dataset-parts <dataset_parts>
List of dataset parts to prepare. To prepare multiple parts, pass each with -p. Example: -p M -p TEST_NET
- -j, --num-jobs <num_jobs>
How many threads to use (can give good speed-ups with slow disks).
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
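As a small sketch of how the repeated -p option composes with the other arguments, the command line can be assembled programmatically. The paths below are hypothetical placeholders, and the helper function is illustrative rather than part of Lhotse.

```python
import shlex


def build_prepare_command(corpus_dir, output_dir, parts, num_jobs=1):
    """Assemble an argument list for `lhotse prepare wenet-speech`."""
    cmd = ["lhotse", "prepare", "wenet-speech"]
    # Each dataset part gets its own -p flag.
    for part in parts:
        cmd += ["-p", part]
    cmd += ["-j", str(num_jobs), corpus_dir, output_dir]
    return cmd


# Hypothetical corpus and output locations.
cmd = build_prepare_command("/data/WenetSpeech", "manifests", ["M", "TEST_NET"], num_jobs=4)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where Lhotse is installed.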
wenetspeech4tts
WenetSpeech4TTS data preparation.
lhotse prepare wenetspeech4tts [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -j, --num-jobs <num_jobs>
How many jobs to use (can give good speed-ups with slow disks).
- -p, --dataset-parts <dataset_parts>
List of dataset parts to prepare. To prepare multiple parts, pass each with -p. Example: -p Basic -p Premium
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
xbmu-amdo31
XBMU-AMDO31 ASR data preparation.
lhotse prepare xbmu-amdo31 [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
yesno
yes_no data preparation.
lhotse prepare yesno [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR
Required argument
- OUTPUT_DIR
Required argument
shar
Commands for working with the Lhotse Shar format, which is optimized for I/O.
lhotse shar [OPTIONS] COMMAND [ARGS]...
compute-features
Compute features for Lhotse Shar cuts stored in SHAR_DIR.
The features are computed sequentially on CPU within shards, and parallelized across shards up to NUM_JOBS concurrent workers.
FEATURE_CONFIG defines the feature extractor type and settings. You can generate default feature extractor settings with: lhotse feat write-default-config --help
lhotse shar compute-features [OPTIONS] SHAR_DIR
Options
- -f, --feature-config <feature_config>
Optional manifest specifying feature extractor configuration (use Fbank by default).
- -c, --compression <compression>
Which compression to use (lilcom is lossy, numpy is lossless).
- Options:
lilcom | numpy
- -j, --num-jobs <num_jobs>
Number of parallel workers.
- -v, --verbose
Arguments
- SHAR_DIR
Required argument
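The scheduling described for compute-features (sequential within a shard, parallel across shards) might be pictured with the sketch below. It uses a thread pool and a placeholder in place of real feature extraction; both are assumptions for illustration and do not reflect how Lhotse implements the command.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_for_shard(shard):
    # Placeholder for feature extraction: cuts are handled one by one, in order.
    return [f"features({cut_id})" for cut_id in shard]


def compute_all(shards, num_jobs):
    # Up to num_jobs shards are processed concurrently; map() keeps shard order.
    with ThreadPoolExecutor(max_workers=num_jobs) as pool:
        return list(pool.map(extract_for_shard, shards))


results = compute_all([["c1", "c2"], ["c3"]], num_jobs=2)
print(results)
```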
export
Export CutSet from CUTS into Lhotse Shar format in OUTDIR.
This script partitions the input manifest into smaller pieces called shards with SHARD_SIZE cuts per shard. The input is optionally shuffled. In addition to sharding, the user can choose to export AUDIO or FEATURES into sequentially readable tar files with a selected compression type. This typically yields very high speedups vs random read formats such as HDF5, especially on slower disks or clusters, at the expense of a data copy.
The result is readable in Python using: CutSet.from_shar(OUTDIR)
lhotse shar export [OPTIONS] CUTS OUTDIR
Options
- -a, --audio <audio>
Format in which to export audio. Original will save in the same format as the original audio (disabled by default, enabling will make a copy of the data)
- Options:
none | wav | flac | mp3 | opus | original
- -f, --features <features>
Format in which to export features (disabled by default, enabling will make a copy of the data)
- Options:
none | lilcom | numpy
- -c, --custom <custom>
Custom fields to export. Use syntax NAME:FORMAT, e.g.: -c target_recording:flac -c embedding:numpy. Use format options for audio and features depending on the custom fields type, or ‘jsonl’ for metadata.
- -s, --shard-size <shard_size>
The number of cuts in a single shard.
- --shuffle, --no-shuffle
Should we shuffle the cuts before splitting into shards.
- --fault-tolerant, --fast-fail
Should we skip over cuts that failed to load data or raise an error.
- --seed <seed>
Random seed.
- -j, --num-jobs <num_jobs>
Number of parallel workers. We recommend keeping this number low on machines with slow disks, as I/O speed will likely be the bottleneck.
- -v, --verbose
Arguments
- CUTS
Required argument
- OUTDIR
Required argument
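The partitioning step of the export can be sketched as follows. This is a simplified illustration that works on plain cut IDs; the function name is hypothetical, and the assumption that the last shard simply holds the remainder is not confirmed by this reference.

```python
import random


def make_shards(cut_ids, shard_size, shuffle=False, seed=0):
    """Split cut IDs into consecutive shards of at most shard_size items."""
    ids = list(cut_ids)
    if shuffle:
        # A seeded generator makes the shuffled order reproducible.
        random.Random(seed).shuffle(ids)
    return [ids[i:i + shard_size] for i in range(0, len(ids), shard_size)]


shards = make_shards(range(10), shard_size=4)
print([len(s) for s in shards])  # [4, 4, 2]
```

On the Python side, the exported directory is then read back with CutSet.from_shar(OUTDIR), as noted above.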
split
Load MANIFEST, split it into NUM_SPLITS equal parts and save as separate manifests in OUTPUT_DIR.
When your manifests are very large, prefer to use “lhotse split-lazy” instead.
lhotse split [OPTIONS] NUM_SPLITS MANIFEST OUTPUT_DIR
Options
- -s, --shuffle
Optionally shuffle the sequence before splitting.
- --pad, --no-pad
Whether to pad the split output idx with zeros (e.g. 00, 01, 02, .., 10).
- -i, --start-idx <start_idx>
Count splits starting from this index.
Arguments
- NUM_SPLITS
Required argument
- MANIFEST
Required argument
- OUTPUT_DIR
Required argument
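To illustrate how --pad and --start-idx could interact, the sketch below generates the index labels for the splits. The padding width (the number of digits of the largest index) is an assumption chosen to match the "00, 01, 02, .., 10" example; the helper itself is hypothetical.

```python
def split_labels(num_splits, start_idx=0, pad=True):
    """Return the index label of each split, optionally zero-padded."""
    last = start_idx + num_splits - 1
    # Assumed rule: pad to the width of the largest index.
    width = len(str(last)) if pad else 0
    return [str(i).zfill(width) for i in range(start_idx, start_idx + num_splits)]


print(split_labels(11))                          # '00' ... '10'
print(split_labels(3, start_idx=1, pad=False))   # ['1', '2', '3']
```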
split-lazy
Load MANIFEST (lazily if in JSONL format) and split it into parts, each with CHUNK_SIZE items. The parts are saved to separate files with pattern “{output_dir}/{manifest.stem}.{chunk_idx}.jsonl.gz”.
Prefer this to “lhotse split” when your manifests are very large.
lhotse split-lazy [OPTIONS] MANIFEST OUTPUT_DIR CHUNK_SIZE
Options
- -i, --start-idx <start_idx>
Count splits starting from this index.
Arguments
- MANIFEST
Required argument
- OUTPUT_DIR
Required argument
- CHUNK_SIZE
Required argument
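The difference from "lhotse split" is that the number of output files follows from CHUNK_SIZE rather than being given directly. A minimal sketch of that arithmetic, assuming the final chunk holds whatever items remain (the helper is illustrative and not part of Lhotse):

```python
def chunk_sizes(num_items, chunk_size, start_idx=0):
    """Map each chunk index to the number of items it would contain."""
    sizes = {}
    idx = start_idx
    remaining = num_items
    while remaining > 0:
        sizes[idx] = min(chunk_size, remaining)
        remaining -= sizes[idx]
        idx += 1
    return sizes


print(chunk_sizes(2500, 1000))  # {0: 1000, 1: 1000, 2: 500}
```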
subset
Load MANIFEST, select the first or last N items (given by --first or --last), and store them in OUTPUT_MANIFEST.
lhotse subset [OPTIONS] MANIFEST OUTPUT_MANIFEST
Options
- --first <first>
- --last <last>
- --cutids <cutids>
A JSON string or path to a JSON file containing an array of cut ID strings, e.g. --cutids '["cutid1", "cutid2"]'.
Arguments
- MANIFEST
Required argument
- OUTPUT_MANIFEST
Required argument
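Because the --cutids value is a JSON array, shell quoting is easy to get wrong. One way to sidestep that is to build the value with a JSON serializer and let shlex handle the quoting, as in this sketch (the manifest file names are hypothetical):

```python
import json
import shlex

cut_ids = ["cutid1", "cutid2"]

# Serialize the IDs as a JSON array, which is the format --cutids expects.
ids_json = json.dumps(cut_ids)

# Hypothetical input and output manifest paths.
cmd = ["lhotse", "subset", "--cutids", ids_json, "cuts.jsonl.gz", "subset.jsonl.gz"]

# shlex.join() quotes the JSON argument so it survives the shell intact.
print(shlex.join(cmd))
```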
supervision
Commands related to manipulating supervision manifests.
lhotse supervision [OPTIONS] COMMAND [ARGS]...
with-alignment-from-ctm
Add alignments from CTM file to the supervision set.
- param in_supervision_manifest:
Path to input supervision manifest.
- param out_supervision_manifest:
Path to output supervision manifest.
- param ctm_file:
Path to CTM file.
- param alignment_type:
Alignment type (optional, default = word).
- param match_channel:
If True, also match the channel between the CTM and the SupervisionSegment.
- param verbose:
Whether to print verbose output.
- return:
A new SupervisionSet with AlignmentItem objects added to the segments.
lhotse supervision with-alignment-from-ctm [OPTIONS] IN_SUPERVISION_MANIFEST OUT_SUPERVISION_MANIFEST
Options
- --ctm-file <ctm_file>
CTM file containing alignments to add.
- --alignment-type <alignment_type>
Type of alignment to add (default = word).
- --match-channel, --no-match-channel
Whether to match channel between CTM and SupervisionSegment (default = False).
- -v, --verbose
Whether to print verbose output.
Arguments
- IN_SUPERVISION_MANIFEST
Required argument
- OUT_SUPERVISION_MANIFEST
Required argument
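For readers unfamiliar with CTM files, the sketch below parses a single line, assuming the conventional layout of recording ID, channel, start time, duration, symbol, and an optional confidence score. The type and function names are hypothetical; how Lhotse turns such entries into AlignmentItem objects is not shown here.

```python
from typing import NamedTuple, Optional


class CtmEntry(NamedTuple):
    recording_id: str
    channel: str
    start: float
    duration: float
    symbol: str
    score: Optional[float]


def parse_ctm_line(line: str) -> CtmEntry:
    """Parse one whitespace-separated CTM line (confidence is optional)."""
    parts = line.split()
    rec, channel, start, duration, symbol = parts[:5]
    score = float(parts[5]) if len(parts) > 5 else None
    return CtmEntry(rec, channel, float(start), float(duration), symbol, score)


print(parse_ctm_line("rec1 1 0.50 0.30 hello 0.98"))
```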
validate
Validate a Lhotse manifest file.
lhotse validate [OPTIONS] MANIFEST
Options
- --read-data, --dont-read-data
Should the audio/features data be read from disk to perform additional checks (could be extremely slow for large manifests).
Arguments
- MANIFEST
Required argument
validate-pair
Validate a pair of Lhotse RECORDINGS and SUPERVISIONS manifest files. Checks whether the two manifests are consistent with each other.
lhotse validate-pair [OPTIONS] RECORDINGS SUPERVISIONS
Options
- --read-data, --dont-read-data
Should the audio/features data be read from disk to perform additional checks (could be extremely slow for large manifests).
Arguments
- RECORDINGS
Required argument
- SUPERVISIONS
Required argument
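What "consistent with each other" might involve can be sketched with two plausible checks: every supervision refers to a known recording, and no supervision extends past the end of its recording. These checks, the data layout, and the tolerance are assumptions for illustration; the actual set of checks performed by Lhotse may differ.

```python
def find_inconsistencies(recording_durations, supervisions, tolerance=1e-3):
    """Return a list of human-readable problems found in the supervisions."""
    problems = []
    for sup in supervisions:
        duration = recording_durations.get(sup["recording_id"])
        if duration is None:
            problems.append(f"{sup['id']}: unknown recording {sup['recording_id']}")
        elif sup["start"] + sup["duration"] > duration + tolerance:
            problems.append(f"{sup['id']}: exceeds recording duration")
    return problems


recording_durations = {"rec1": 10.0}
supervisions = [
    {"id": "s1", "recording_id": "rec1", "start": 0.0, "duration": 4.0},
    {"id": "s2", "recording_id": "rec1", "start": 8.0, "duration": 3.0},
    {"id": "s3", "recording_id": "rec2", "start": 0.0, "duration": 1.0},
]
problems = find_inconsistencies(recording_durations, supervisions)
print(problems)
```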
workflows
Workflows using corpus creation tools.
lhotse workflows [OPTIONS] COMMAND [ARGS]...
activity-detection
Use activity detection methods (e.g., Silero VAD) to detect and annotate the segmentation of Lhotse RecordingSets and save the results in the SupervisionSet manifest. The output manifest will be saved in the path specified by OUTPUT_SUPERVISIONS_MANIFEST. If OUTPUT_SUPERVISIONS_MANIFEST is not provided, the output manifest will be saved in the same directory as RECORDINGS_MANIFEST.
Note: this is an experimental feature and it does not guarantee high-quality performance and data annotation.
lhotse workflows activity-detection [OPTIONS]
Options
- -r, --recordings-manifest <recordings_manifest>
Path to an existing recording manifest.
- -o, --output-supervisions-manifest <output_supervisions_manifest>
Path to the output supervisions manifest or a directory where it will be saved.
- -m, --model-name <model_name>
The activity detector to use, one of: silero_vad_16k, silero_vad_8k.
- -d, --device <device>
Device on which to run the inference.
- -j, --jobs <jobs>
Number of jobs for audio scanning.
- --force_download
Force clearing the cache and downloading the model again.
align-with-torchaudio
Use a pretrained ASR model from torchaudio to force align IN_CUTS (a Lhotse CutSet) and write the results to OUT_CUTS. It will attach word-level alignment information (start, end, and score) to the supervisions in each cut.
This is based on a tutorial from torchaudio: https://pytorch.org/audio/stable/tutorials/forced_alignment_tutorial.html
To use a multilingual alignment model, pass --bundle-name MMS_FA (based on the multilingual tutorial: https://pytorch.org/audio/main/tutorials/forced_alignment_for_multilingual_data_tutorial.html).
Note: this is an experimental feature of Lhotse, and is not guaranteed to yield high quality of data.
lhotse workflows align-with-torchaudio [OPTIONS] IN_CUTS OUT_CUTS
Options
- -n, --bundle-name <bundle_name>
One of torchaudio pretrained ‘bundle’ variants (see: https://pytorch.org/audio/stable/pipelines.html)
- -d, --device <device>
Device on which to run the inference.
- --normalize-text, --dont-normalize-text
By default, we’ll try to normalize the text by making it uppercase and discarding symbols outside of model’s character level vocabulary. If this causes issues, turn the option off and normalize the text yourself.
- -j, --num-jobs <num_jobs>
Number of parallel jobs to run.
- --check-language, --dont-check-language
If False, warnings about non-existent language tags in supervisions will be suppressed.
Arguments
- IN_CUTS
Required argument
- OUT_CUTS
Required argument
annotate-dnsmos
Use the Microsoft DNSMOS P.835 prediction model to annotate either RECORDINGS_MANIFEST, RECORDINGS_DIR, or CUTS_MANIFEST. It predicts the DNSMOS P.835 scores: SIG, BAK, and OVRL.
See the original repo for more details: https://github.com/microsoft/DNS-Challenge/tree/master/DNSMOS
RECORDINGS_MANIFEST, RECORDINGS_DIR, and CUTS_MANIFEST are mutually exclusive. If CUTS_MANIFEST is provided, its supervisions will be overwritten with the results of the inference.
lhotse workflows annotate-dnsmos [OPTIONS] OUT_CUTS
Options
- -m, --recordings-manifest <recordings_manifest>
Path to an existing recording manifest.
- -r, --recordings-dir <recordings_dir>
Directory with recordings. We will create a RecordingSet for it automatically.
- -c, --cuts-manifest <cuts_manifest>
Path to an existing cuts manifest.
- -e, --extension <extension>
Audio file extension to search for. Used with RECORDINGS_DIR.
- -p, --is-personalized-mos <is_personalized_mos>
Flag indicating whether to predict the personalized MOS score instead of the regular one.
- -j, --jobs <jobs>
Number of jobs for audio scanning.
Arguments
- OUT_CUTS
Required argument
annotate-with-whisper
Use OpenAI Whisper model to annotate either RECORDINGS_MANIFEST, RECORDINGS_DIR, or CUTS_MANIFEST. It will perform automatic segmentation, transcription, and language identification.
RECORDINGS_MANIFEST, RECORDINGS_DIR, and CUTS_MANIFEST are mutually exclusive. If CUTS_MANIFEST is provided, its supervisions will be overwritten with the results of the inference.
Note: this is an experimental feature of Lhotse, and is not guaranteed to yield high quality of data.
lhotse workflows annotate-with-whisper [OPTIONS] OUT_CUTS
Options
- -m, --recordings-manifest <recordings_manifest>
Path to an existing recording manifest.
- -r, --recordings-dir <recordings_dir>
Directory with recordings. We will create a RecordingSet for it automatically.
- -c, --cuts-manifest <cuts_manifest>
Path to an existing cuts manifest.
- -e, --extension <extension>
Audio file extension to search for. Used with RECORDINGS_DIR.
- -n, --model-name <model_name>
One of Whisper variants (base, medium, large, etc.)
- -l, --language <language>
Language spoken in the audio. Inferred by default.
- -d, --device <device>
Device on which to run the inference.
- -j, --jobs <jobs>
Number of jobs for audio scanning.
- --force-nonoverlapping, --keep-overlapping
If True, the Whisper segment time-stamps will be processed to make sure they are non-overlapping.
Arguments
- OUT_CUTS
Required argument
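One simple way to obtain non-overlapping time-stamps, as --force-nonoverlapping promises, is to trim each segment so that it ends no later than the start of the next one. The sketch below shows that idea on (start, end) pairs; it is an assumed approach for illustration and may not match what Lhotse actually does.

```python
def make_nonoverlapping(segments):
    """Trim each (start, end) segment to end no later than the next start."""
    segs = sorted(segments)
    fixed = []
    for i, (start, end) in enumerate(segs):
        if i + 1 < len(segs):
            # Cap the end time at the start of the following segment.
            end = min(end, segs[i + 1][0])
        fixed.append((start, end))
    return fixed


print(make_nonoverlapping([(0.0, 2.5), (2.0, 4.0), (4.0, 5.0)]))
# [(0.0, 2.0), (2.0, 4.0), (4.0, 5.0)]
```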
simulate-meetings
Simulate meeting-style mixtures using a provided CutSet containing single-channel cuts. Different simulation techniques can be selected using the --method option. Currently, the following methods are supported:
- independent: each speaker is simulated independently, using the provided cuts as a finite set of utterances.
- conversational: the speakers are simulated as a group, using overall silence/overlap statistics.
The number of speakers per meeting is sampled uniformly from the range provided in --num-speakers-per-meeting.
The number of meetings to simulate is controlled by either --num-meetings or --num-repeats. If the former is provided, that number of meetings will be simulated. If the latter is provided, the provided cuts will be repeated num_repeats times, and the resulting cuts will be used as a finite set of utterances to use for simulation.
The simulated meetings can be optionally reverberated using the RIRs from a provided recording set. If no RIRs are provided, we will use a fast random approximation technique to simulate the reverberation. The RIRs can be provided as a single recording set, or as a directory containing multiple recording sets. In the latter case, the RIRs will be sampled from the provided directory.
lhotse workflows simulate-meetings [OPTIONS] IN_CUTS OUT_CUTS
Options
- --method <method>
The simulation method to use: independent - each speaker is simulated independently, conversational - the speakers are simulated as a group, using overall silence/overlap statistics.
- Options:
independent | conversational
- --loc <loc>
The minimum silence duration between two consecutive utterances from the same speaker.
- Default:
0.0
- --scale <scale>
The scale parameter of the exponential distribution used to sample the silence duration between two consecutive utterances from a speaker.
- Default:
2.0
- --same-spk-pause <same_spk_pause>
The mean pause duration between utterances of the same speaker
- Default:
1.0
- --diff-spk-pause <diff_spk_pause>
The mean pause duration between utterances of different speakers
- Default:
1.0
- --diff-spk-overlap <diff_spk_overlap>
The mean overlap duration between utterances of different speakers
- Default:
2.0
- --prob-diff-spk-overlap <prob_diff_spk_overlap>
The probability of overlap between utterances of different speakers
- Default:
0.5
- -f, --fit-to-supervisions <fit_to_supervisions>
Path to a supervision set to learn the distributions for simulation.
- --reverberate, --dont-reverberate
If True, the simulated meetings will be reverberated.
- --rir-recordings, --rir <rir_recordings>
Path to a recording set containing RIRs. If provided, the simulated meetings will be reverberated using the RIRs from this set. A directory containing recording sets can also be provided, in which case each meeting will use a recording set sampled from this directory.
- -n, --num-meetings <num_meetings>
Number of meetings to simulate. Either this or num_repeats must be provided.
- -r, --num-repeats <num_repeats>
Number of times to repeat each input cut. The resulting cuts will be used as a finite set of utterances to use for simulation. Either this or num_meetings must be provided.
- -s, --num-speakers-per-meeting <num_speakers_per_meeting>
Number of speakers per meeting. One or more integers can be provided (comma-separated). In this case, the number of speakers will be sampled uniformly from the provided list, or using the distribution provided in speaker-count-probs.
- -p, --speaker-count-probs <speaker_count_probs>
A list of probabilities for each speaker count. The length of the list must be equal to the number of elements in num-speakers-per-meeting.
- -d, --max-duration-per-speaker <max_duration_per_speaker>
Maximum duration of a single speaker in a meeting.
- -u, --max-utterances-per-speaker <max_utterances_per_speaker>
Maximum number of utterances per speaker in a meeting.
- --allow-3fold-overlap, --no-3fold-overlap
If True, the simulated meetings will allow more than 2 speakers to overlap. This is only relevant for the conversational method.
- --seed <seed>
Random seed for reproducibility.
- -j, --num-jobs <num_jobs>
Number of parallel jobs to run.
Arguments
- IN_CUTS
Required argument
- OUT_CUTS
Required argument
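The roles of --loc, --scale, --num-speakers-per-meeting and --speaker-count-probs can be pictured with a small sampling sketch. It assumes that a pause is drawn as loc plus an exponentially distributed value with the given scale, and that speaker counts are drawn from the provided list (uniformly, or weighted by the probabilities). The function names are hypothetical and this is not Lhotse's simulation code.

```python
import random


def sample_pause(rng, loc=0.0, scale=2.0):
    """Sample a silence duration: a minimum of loc plus an exponential term."""
    # expovariate() takes the rate, which is the inverse of the scale (mean).
    return loc + rng.expovariate(1.0 / scale)


def sample_num_speakers(rng, counts, probs=None):
    """Pick a speaker count uniformly, or according to probs if given."""
    if probs is None:
        return rng.choice(counts)
    return rng.choices(counts, weights=probs, k=1)[0]


rng = random.Random(0)  # fixed seed for reproducibility
pauses = [sample_pause(rng, loc=0.5, scale=2.0) for _ in range(1000)]
num_speakers = sample_num_speakers(rng, [2, 3, 4], probs=[0.2, 0.5, 0.3])
print(min(pauses) >= 0.5, num_speakers in (2, 3, 4))
```

With these assumptions, no sampled pause is shorter than loc, and larger scale values produce longer pauses on average.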