Command-line interface¶
lhotse¶
The shell entry point to Lhotse, a tool and a library for audio data manipulation in high altitudes.
lhotse [OPTIONS] COMMAND [ARGS]...
Options
- -s, --seed <seed>¶
Random seed.
combine¶
Load MANIFESTS, combine them into a single one, and write it to OUTPUT_MANIFEST.
lhotse combine [OPTIONS] [MANIFESTS]... OUTPUT_MANIFEST
Arguments
- MANIFESTS¶
Optional argument(s)
- OUTPUT_MANIFEST¶
Required argument
copy¶
Load INPUT_MANIFEST and store it to OUTPUT_MANIFEST. Useful for conversion between different serialization formats (e.g. JSON, JSONL, YAML). Automatically supports gzip compression when ‘.gz’ suffix is detected.
lhotse copy [OPTIONS] INPUT_MANIFEST OUTPUT_MANIFEST
Arguments
- INPUT_MANIFEST¶
Required argument
- OUTPUT_MANIFEST¶
Required argument
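The suffix-based gzip handling described for this command can be pictured with a small standard-library sketch. It is not Lhotse's implementation: the helper names open_manifest and copy_jsonl are hypothetical, and only the JSONL format is handled to keep the example short.

```python
import gzip
import json
import tempfile
from pathlib import Path


def open_manifest(path, mode="r"):
    """Open a text file, transparently using gzip when the name ends with '.gz'."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, mode + "t", encoding="utf-8")
    return open(path, mode, encoding="utf-8")


def copy_jsonl(src, dst):
    """Copy a JSONL manifest line by line; returns the number of items copied."""
    count = 0
    with open_manifest(src) as fin, open_manifest(dst, "w") as fout:
        for line in fin:
            if line.strip():
                fout.write(json.dumps(json.loads(line)) + "\n")
                count += 1
    return count


# Demo on a throwaway directory: plain JSONL in, gzip-compressed JSONL out.
workdir = Path(tempfile.mkdtemp())
plain = workdir / "cuts.jsonl"
plain.write_text('{"id": "cut-1", "duration": 2.5}\n{"id": "cut-2", "duration": 4.0}\n')
packed = workdir / "cuts.jsonl.gz"
n_copied = copy_jsonl(plain, packed)
with open_manifest(packed) as f:
    round_trip = [json.loads(line) for line in f]
```

The only decision point is the file suffix, which is why the same command line can convert between compressed and uncompressed manifests.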
copy-feats¶
Load INPUT_MANIFEST of type lhotse.FeatureSet or lhotse.CutSet, read every feature matrix using features.load() or cut.load_features(), save them in STORAGE_PATH, and save the updated manifest to OUTPUT_MANIFEST.
lhotse copy-feats [OPTIONS] INPUT_MANIFEST OUTPUT_MANIFEST STORAGE_PATH
Options
- -t, --storage-type <storage_type>¶
Which storage backend should we use for writing the copied features.
- Options
chunked_lilcom_hdf5 | kaldiio | lilcom_chunky | lilcom_files | lilcom_hdf5 | lilcom_url | memory_lilcom | memory_raw | numpy_files | numpy_hdf5
- -j, --max-jobs <max_jobs>¶
Maximum number of parallel copying processes. By default, one process is spawned for every existing feature file in the INPUT_MANIFEST (e.g., if the features were extracted with 20 jobs, there will typically be 20 files).
Arguments
- INPUT_MANIFEST¶
Required argument
- OUTPUT_MANIFEST¶
Required argument
- STORAGE_PATH¶
Required argument
cut¶
Group of commands used to create CutSets.
lhotse cut [OPTIONS] COMMAND [ARGS]...
append¶
Create a new CutSet by appending the cuts in CUT_MANIFESTS. CUT_MANIFESTS are iterated position-wise (the cuts at the i-th position in each manifest are appended to each other). The cuts are appended in the order in which they appear in the input argument list. If CUT_MANIFESTS have different lengths, the script stops once the shortest CutSet is depleted.
lhotse cut append [OPTIONS] [CUT_MANIFESTS]... OUTPUT_CUT_MANIFEST
Arguments
- CUT_MANIFESTS¶
Optional argument(s)
- OUTPUT_CUT_MANIFEST¶
Required argument
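The position-wise iteration behaves much like Python's zip(). The following is a minimal sketch of that behaviour, assuming each manifest is reduced to a list of (cut ID, duration) pairs; the function name and the "+"-joined IDs are hypothetical and not part of Lhotse.

```python
def append_positionwise(*manifests):
    """Concatenate the i-th items of every manifest, in argument order.

    Each manifest is a list of (cut_id, duration_in_seconds) tuples.
    zip() stops as soon as the shortest manifest is exhausted.
    """
    appended = []
    for group in zip(*manifests):
        ids = [cut_id for cut_id, _ in group]
        total = sum(duration for _, duration in group)
        appended.append(("+".join(ids), total))
    return appended


first = [("a1", 2.0), ("a2", 3.0), ("a3", 1.0)]
second = [("b1", 0.5), ("b2", 1.5)]
result = append_positionwise(first, second)
```

Here the third cut of the first manifest has no partner, so it does not appear in the result.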
decompose¶
Decompose CUTSET into:
- recording set (recordings.jsonl.gz)
- feature set (features.jsonl.gz)
- supervision set (supervisions.jsonl.gz)
If any of these are not present in any of the cuts, the corresponding file will be empty.
lhotse cut decompose [OPTIONS] CUTSET OUTPUT
Arguments
- CUTSET¶
Required argument
- OUTPUT¶
Required argument
describe¶
Describe some statistics of CUTSET, such as the total speech and audio duration.
lhotse cut describe [OPTIONS] CUTSET
Arguments
- CUTSET¶
Required argument
export-to-webdataset¶
Export CUTS into a WebDataset tarfile, or a collection of tarfile shards, as specified by WSPECIFIER.
WSPECIFIER can be:
- a regular path (e.g., “data/cuts.tar”),
- a path template for sharding (e.g., “data/shard-%06d.tar”), or
- a “pipe:” expression (e.g., “pipe:gzip -c > data/shard-%06d.tar.gz”).
The resulting CutSet contains audio/feature data in addition to metadata, and can be read in Python using the ‘CutSet.from_webdataset’ API.
This function is useful for I/O intensive applications where random reads are too slow, and a one-time lengthy export step that enables fast sequential reading is preferable.
See the WebDataset project for more information: https://github.com/webdataset/webdataset
lhotse cut export-to-webdataset [OPTIONS] CUTSET WSPECIFIER
Options
- -s, --shard-size <shard_size>¶
Number of cuts per shard (sharding disabled if not defined).
- -f, --audio-format <audio_format>¶
Format in which the audio is encoded (uses torchaudio available formats).
- --audio, --no-audio¶
Should we load and add audio data.
- --features, --no-features¶
Should we load and add feature data.
- --custom, --no-custom¶
Should we load and add custom data.
- --fault-tolerant, --stop-on-fail¶
Should we omit the cuts for which loading data failed, or stop the execution.
Arguments
- CUTSET¶
Required argument
- WSPECIFIER¶
Required argument
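To make the interaction between a sharding template and --shard-size concrete, here is a minimal sketch. It assumes a printf-style template such as "data/shard-%06d.tar" and shard numbering that starts at zero; both the function name and the numbering convention are assumptions for illustration, and no tar files are written.

```python
def shard_assignments(cut_ids, pattern, shard_size):
    """Map each shard file name to the cut IDs it would hold."""
    shards = {}
    for start in range(0, len(cut_ids), shard_size):
        shard_index = start // shard_size
        # printf-style expansion of the template, e.g. %06d -> 000000
        shards[pattern % shard_index] = cut_ids[start:start + shard_size]
    return shards


cut_ids = [f"cut-{i}" for i in range(5)]
shards = shard_assignments(cut_ids, "data/shard-%06d.tar", shard_size=2)
```

With five cuts and a shard size of two, the last shard is only partially filled.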
mix-by-recording-id¶
Create a CutSet stored in OUTPUT_CUT_MANIFEST by matching the Cuts from CUT_MANIFESTS by their recording IDs and mixing them together.
lhotse cut mix-by-recording-id [OPTIONS] [CUT_MANIFESTS]... OUTPUT_CUT_MANIFEST
Arguments
- CUT_MANIFESTS¶
Optional argument(s)
- OUTPUT_CUT_MANIFEST¶
Required argument
mix-sequential¶
Create a CutSet stored in OUTPUT_CUT_MANIFEST by iterating jointly over CUT_MANIFESTS and mixing the Cuts on the same positions. E.g. the first output cut is created from the first cuts in each input manifest. The mix is performed by summing the features from all Cuts. If the CUT_MANIFESTS have a different number of Cuts, the mixing ends when the shortest manifest is depleted.
lhotse cut mix-sequential [OPTIONS] [CUT_MANIFESTS]... OUTPUT_CUT_MANIFEST
Arguments
- CUT_MANIFESTS¶
Optional argument(s)
- OUTPUT_CUT_MANIFEST¶
Required argument
pad¶
Create a new CutSet by padding the cuts in CUT_MANIFEST. The cuts will be right-padded, i.e. the padding is placed after the signal ends.
lhotse cut pad [OPTIONS] CUT_MANIFEST OUTPUT_CUT_MANIFEST
Options
- -d, --duration <duration>¶
Desired duration of cuts after padding. Cuts longer than this won’t be affected. By default, pad to the longest cut duration found in CUT_MANIFEST.
Arguments
- CUT_MANIFEST¶
Required argument
- OUTPUT_CUT_MANIFEST¶
Required argument
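The effect of --duration on cut lengths can be summarised in a few lines. This is a sketch that works on durations only (no audio is touched), and the function name is hypothetical.

```python
def padded_durations(durations, target=None):
    """Return the duration of every cut after right-padding.

    When no target is given, pad to the longest cut in the manifest.
    Cuts that are already longer than the target are left unchanged.
    """
    if target is None:
        target = max(durations)
    return [max(duration, target) for duration in durations]


durations = [1.0, 2.5, 4.0]
default_padding = padded_durations(durations)
explicit_padding = padded_durations(durations, target=3.0)
```

In the second call the 4.0 second cut is longer than the requested 3.0 seconds, so it keeps its original length.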
simple¶
Create a CutSet stored in OUTPUT_CUT_MANIFEST. Depending on the provided options, it may contain any combination of recording, feature and supervision manifests. Either RECORDING_MANIFEST or FEATURE_MANIFEST has to be provided. When SUPERVISION_MANIFEST is provided, the cuts' time span will correspond to that of the supervision segments. Otherwise, that time span corresponds to the one found in features, if available, otherwise recordings.
lhotse cut simple [OPTIONS] OUTPUT_CUT_MANIFEST
Options
- -r, --recording-manifest <recording_manifest>¶
Optional recording manifest - will be used to attach the recordings to the cuts.
- -f, --feature-manifest <feature_manifest>¶
Optional feature manifest - will be used to attach the features to the cuts.
- -s, --supervision-manifest <supervision_manifest>¶
Optional supervision manifest - will be used to attach the supervisions to the cuts.
- --force-eager¶
Force reading full manifests into memory before creating the manifests (useful when you are not sure about the input manifest sorting).
Arguments
- OUTPUT_CUT_MANIFEST¶
Required argument
trim-to-supervisions¶
Splits each input cut into as many cuts as there are supervisions. These cuts have identical start times and durations as the supervisions. When there are overlapping supervisions, they can be kept or discarded with options.
For example, the following cut:
            Cut
    |-----------------|
     Sup1
    |----|  Sup2
       |-----------|
is transformed into two cuts:
     Cut1
    |----|
     Sup1
    |----|
       Sup2
       |-|
            Cut2
       |-----------|
       Sup1
       |-|
            Sup2
       |-----------|
lhotse cut trim-to-supervisions [OPTIONS] CUTS OUTPUT_CUTS
Options
- --keep-overlapping, --discard-overlapping¶
Whether to keep or discard the parts of other supervisions that overlap with the main supervision. With --discard-overlapping, Sup2 would be discarded in Cut1 and Sup1 in Cut2 in the illustration.
- -d, --min-duration <min_duration>¶
An optional duration in seconds; specifying this argument will extend the cuts that would have been shorter than min_duration with actual acoustic context in the recording/features. If there are supervisions present in the context, they are kept when keep_overlapping is true. If there is not enough context, the returned cut will be shorter than min_duration. If the supervision segment is longer than min_duration, the returned cut will be longer.
- -c, --context-direction <context_direction>¶
Which direction should the cut be expanded towards to include context. The value of “center” implies equal expansion to left and right; random uniformly samples a value between “left” and “right”.
- Options
center | left | right | random
Arguments
- CUTS¶
Required argument
- OUTPUT_CUTS¶
Required argument
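The splitting rule and the keep/discard choice can be sketched with plain dataclasses. This is an illustration only: Sup and Cut are hypothetical stand-ins for Lhotse's classes, all times are absolute seconds (an assumption made for brevity), and the --min-duration context extension is not modelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sup:
    id: str
    start: float
    duration: float

    @property
    def end(self):
        return self.start + self.duration


@dataclass
class Cut:
    id: str
    start: float
    duration: float
    supervisions: List[Sup]


def trim_to_supervisions(cut, keep_overlapping=True):
    """Create one output cut per supervision, spanning exactly that supervision."""
    trimmed = []
    for index, main in enumerate(cut.supervisions):
        kept = []
        for sup in cut.supervisions:
            if sup is main:
                kept.append(sup)
            elif keep_overlapping and sup.start < main.end and main.start < sup.end:
                # Keep only the part of the other supervision inside the new cut.
                start = max(sup.start, main.start)
                end = min(sup.end, main.end)
                kept.append(Sup(sup.id, start, end - start))
        trimmed.append(Cut(f"{cut.id}-{index}", main.start, main.duration, kept))
    return trimmed


cut = Cut("cut", 0.0, 18.0, [Sup("sup1", 0.0, 5.0), Sup("sup2", 3.0, 12.0)])
kept_cuts = trim_to_supervisions(cut, keep_overlapping=True)
dropped_cuts = trim_to_supervisions(cut, keep_overlapping=False)
```

With overlaps kept, the first output cut carries sup1 plus the two seconds of sup2 that fall inside it; with overlaps discarded, each output cut carries only its own supervision.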
truncate¶
Truncate the cuts in the CUT_MANIFEST and write them to OUTPUT_CUT_MANIFEST. Cuts shorter than MAX_DURATION will not be modified.
lhotse cut truncate [OPTIONS] CUT_MANIFEST OUTPUT_CUT_MANIFEST
Options
- --preserve-id¶
Should the cuts preserve IDs (by default, they will get new, random IDs)
- -d, --max-duration <max_duration>¶
Required. The maximum duration in seconds of a cut in the resulting manifest.
- -o, --offset-type <offset_type>¶
Where should the truncated cut start: “start” - at the start of the original cut, “end” - MAX_DURATION before the end of the original cut, “random” - randomly choose somewhere between “start” and “end” options.
- Options
start | end | random
- --keep-overflowing-supervisions, --discard-overflowing-supervisions¶
When a cut is truncated in the middle of a supervision segment, should the supervision be kept.
Arguments
- CUT_MANIFEST¶
Required argument
- OUTPUT_CUT_MANIFEST¶
Required argument
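The three --offset-type choices differ only in where the truncated window starts. A minimal sketch of that choice, assuming offsets are measured in seconds from the start of the original cut (the function name is hypothetical):

```python
import random


def truncation_start(cut_duration, max_duration, offset_type="start", rng=None):
    """Return the offset at which the truncated cut begins."""
    if cut_duration <= max_duration:
        return 0.0  # cuts shorter than MAX_DURATION are not modified
    latest = cut_duration - max_duration  # the "end" option starts here
    if offset_type == "start":
        return 0.0
    if offset_type == "end":
        return latest
    if offset_type == "random":
        rng = rng or random.Random()
        return rng.uniform(0.0, latest)
    raise ValueError(f"Unknown offset type: {offset_type}")


start_offset = truncation_start(10.0, 4.0, "start")
end_offset = truncation_start(10.0, 4.0, "end")
random_offset = truncation_start(10.0, 4.0, "random", rng=random.Random(0))
```

For a 10 second cut and a maximum duration of 4 seconds, the window can start anywhere between 0 and 6 seconds.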
download¶
Command group for downloading and extracting data.
lhotse download [OPTIONS] COMMAND [ARGS]...
adept¶
ADEPT prosody transfer evaluation corpus download.
lhotse download adept [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR¶
Required argument
aidatatang-200zh¶
aidatatang_200zh download. A directory named aidatatang_200zh is created inside TARGET_DIR to contain all downloaded and extracted files.
lhotse download aidatatang-200zh [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR¶
Required argument
aishell¶
Aishell download.
lhotse download aishell [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR¶
Required argument
aishell4¶
AISHELL-4 download.
lhotse download aishell4 [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR¶
Required argument
ali-meeting¶
AliMeeting download.
lhotse download ali-meeting [OPTIONS] TARGET_DIR
Options
- --force-download¶
Arguments
- TARGET_DIR¶
Required argument
ami¶
AMI download.
lhotse download ami [OPTIONS] TARGET_DIR
Options
- --annotations <annotations>¶
To download annotations in a different directory than corpus.
- --mic <mic>¶
AMI microphone setting.
- Options
ihm | ihm-mix | sdm | mdm
- --url <url>¶
AMI data downloading URL.
- --force-download <force_download>¶
If True, download even if file is present.
Arguments
- TARGET_DIR¶
Required argument
bvcc¶
BVCC/VoiceMOS challenge data cannot be downloaded.
See the following for information and instructions on how to obtain the BVCC dataset used for the VoiceMOS challenge:
- https://arxiv.org/abs/2105.02373
- https://nii-yamagishilab.github.io/ecooper-demo/VoiceMOS2022/index.html
- https://codalab.lisn.upsaclay.fr/competitions/695
lhotse download bvcc [OPTIONS]
cmu-arctic¶
CMU Arctic download.
lhotse download cmu-arctic [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR¶
Required argument
cmu-indic¶
CMU Indic download.
lhotse download cmu-indic [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR¶
Required argument
daily-talk¶
Download DailyTalk dataset.
lhotse download daily-talk [OPTIONS] TARGET_DIR
Options
- --force-download¶
Force download.
Arguments
- TARGET_DIR¶
Required argument
earnings21¶
Earnings21 dataset download.
lhotse download earnings21 [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR¶
Required argument
earnings22¶
Earnings22 dataset download.
lhotse download earnings22 [OPTIONS]
gigaspeech¶
Gigaspeech download.
lhotse download gigaspeech [OPTIONS] PASSWORD TARGET_DIR
Options
- --subset <subset>¶
Which parts of Gigaspeech to download (by default XL + DEV + TEST).
- Options
auto | XL | L | M | S | XS | DEV | TEST
- --host <host>¶
Which host to download Gigaspeech from.
Arguments
- PASSWORD¶
Required argument
- TARGET_DIR¶
Required argument
heroico¶
heroico download.
lhotse download heroico [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR¶
Required argument
hifitts¶
HiFiTTS data download.
lhotse download hifitts [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR¶
Required argument
icsi¶
ICSI data download.
lhotse download icsi [OPTIONS] AUDIO_DIR
Options
- --transcripts-dir <transcripts_dir>¶
To download annotations in a different directory than audio.
- --mic <mic>¶
ICSI microphone setting.
- Options
ihm | ihm-mix | sdm | mdm
- --url <url>¶
ICSI data downloading URL.
- --force-download <force_download>¶
If True, download even if file is present.
Arguments
- AUDIO_DIR¶
Required argument
libricss¶
Download LibriCSS dataset.
lhotse download libricss [OPTIONS] TARGET_DIR
Options
- --force-download¶
Force download
Arguments
- TARGET_DIR¶
Required argument
librimix¶
Mini LibriMix download.
lhotse download librimix [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR¶
Required argument
librispeech¶
(Mini) Librispeech download.
lhotse download librispeech [OPTIONS] TARGET_DIR
Options
- --full, --mini¶
Download Librispeech [default] or mini Librispeech.
Arguments
- TARGET_DIR¶
Required argument
libritts¶
LibriTTS data download.
lhotse download libritts [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR¶
Required argument
ljspeech¶
LJSpeech download.
lhotse download ljspeech [OPTIONS] [TARGET_DIR]
Arguments
- TARGET_DIR¶
Optional argument
magicdata¶
Magicdata download.
lhotse download magicdata [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR¶
Required argument
mtedx¶
MTEDx download.
lhotse download mtedx [OPTIONS] TARGET_DIR
Options
- -l, --lang <lang>¶
Specify which languages to download, e.g., lhotse download mtedx . -l de -l fr -l es; lhotse download mtedx
Arguments
- TARGET_DIR¶
Required argument
musan¶
MUSAN download.
lhotse download musan [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR¶
Required argument
primewords¶
Primewords download.
lhotse download primewords [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR¶
Required argument
rir-noise¶
RIRS and noises download.
lhotse download rir-noise [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR¶
Required argument
spgispeech¶
SPGISpeech download.
lhotse download spgispeech [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR¶
Required argument
stcmds¶
Stcmds download.
lhotse download stcmds [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR¶
Required argument
tedlium¶
TED-LIUM v3 download (approx. 11GB).
lhotse download tedlium [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR¶
Required argument
thchs-30¶
thchs_30 download.
lhotse download thchs-30 [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR¶
Required argument
timit¶
TIMIT download.
lhotse download timit [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR¶
Required argument
vctk¶
VCTK download.
lhotse download vctk [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR¶
Required argument
voxceleb1¶
VoxCeleb1 download.
lhotse download voxceleb1 [OPTIONS] TARGET_DIR
Options
- --force-download¶
Force download
Arguments
- TARGET_DIR¶
Required argument
voxceleb2¶
VoxCeleb2 download.
lhotse download voxceleb2 [OPTIONS] TARGET_DIR
Options
- --force-download¶
Force download
Arguments
- TARGET_DIR¶
Required argument
yesno¶
yes_no dataset download.
lhotse download yesno [OPTIONS] TARGET_DIR
Arguments
- TARGET_DIR¶
Required argument
feat¶
Feature extraction related commands.
lhotse feat [OPTIONS] COMMAND [ARGS]...
extract¶
Extract features for recordings in a given RECORDING_MANIFEST. The features are stored in OUTPUT_DIR, with one file per recording (or segment).
lhotse feat extract [OPTIONS] RECORDING_MANIFEST OUTPUT_DIR
Options
- -f, --feature-manifest <feature_manifest>¶
Optional manifest specifying feature extractor configuration.
- --storage-type <storage_type>¶
Select a storage backend for the feature matrices.
- Options
chunked_lilcom_hdf5 | kaldiio | lilcom_chunky | lilcom_files | lilcom_hdf5 | lilcom_url | memory_lilcom | memory_raw | numpy_files | numpy_hdf5
- -t, --lilcom-tick-power <lilcom_tick_power>¶
Determines the compression accuracy; the input will be compressed to integer multiples of 2^tick_power
- -r, --root-dir <root_dir>¶
Root directory - all paths in the manifest will use this as prefix.
- -j, --num-jobs <num_jobs>¶
Number of parallel processes.
Arguments
- RECORDING_MANIFEST¶
Required argument
- OUTPUT_DIR¶
Required argument
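The phrase "integer multiples of 2^tick_power" can be illustrated with plain rounding. This sketch only shows the granularity that a given tick power implies; it is not the lilcom codec, and the function name is hypothetical.

```python
def quantize(value, tick_power):
    """Round value to the nearest integer multiple of 2**tick_power."""
    tick = 2.0 ** tick_power
    return round(value / tick) * tick


coarse = quantize(0.3, -3)  # multiples of 0.125
fine = quantize(0.3, -5)    # multiples of 0.03125
```

A more negative tick power gives a finer grid, so the stored value lies closer to the original at the cost of less compression.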
extract-cuts¶
Extract features for cuts in a given CUTSET manifest. The features are stored in STORAGE_PATH, and the output manifest with features is stored in OUTPUT_CUTSET.
lhotse feat extract-cuts [OPTIONS] CUTSET OUTPUT_CUTSET STORAGE_PATH
Options
- -f, --feature-manifest <feature_manifest>¶
Optional manifest specifying feature extractor configuration.
- --storage-type <storage_type>¶
Select a storage backend for the feature matrices.
- Options
chunked_lilcom_hdf5 | kaldiio | lilcom_chunky | lilcom_files | lilcom_hdf5 | lilcom_url | memory_lilcom | memory_raw | numpy_files | numpy_hdf5
- -j, --num-jobs <num_jobs>¶
Number of parallel processes.
Arguments
- CUTSET¶
Required argument
- OUTPUT_CUTSET¶
Required argument
- STORAGE_PATH¶
Required argument
extract-cuts-batch¶
Extract features for cuts in a given CUTSET manifest. The features are stored in STORAGE_PATH, and the output manifest with features is stored in OUTPUT_CUTSET.
This version enables CUDA acceleration for feature extractors that support it (e.g., kaldifeat extractors).
Example usage of kaldifeat fbank with CUDA:
$ pip install kaldifeat # note: ensure it’s compiled with CUDA
$ lhotse feat write-default-config -f kaldifeat-fbank feat.yml
$ sed 's/device: cpu/device: cuda/' feat.yml > feat-cuda.yml
$ lhotse feat extract-cuts-batch -f feat-cuda.yml cuts.jsonl cuts_with_feats.jsonl feats.h5
lhotse feat extract-cuts-batch [OPTIONS] CUTSET OUTPUT_CUTSET STORAGE_PATH
Options
- -f, --feature-manifest <feature_manifest>¶
Optional manifest specifying feature extractor configuration. If you want to use CUDA, you should specify the device in this config.
- --storage-type <storage_type>¶
Select a storage backend for the feature matrices.
- Options
chunked_lilcom_hdf5 | kaldiio | lilcom_chunky | lilcom_files | lilcom_hdf5 | lilcom_url | memory_lilcom | memory_raw | numpy_files | numpy_hdf5
- -j, --num-jobs <num_jobs>¶
Number of dataloader workers.
- -b, --batch-duration <batch_duration>¶
At most this many seconds of audio will be processed in each batch.
Arguments
- CUTSET¶
Required argument
- OUTPUT_CUTSET¶
Required argument
- STORAGE_PATH¶
Required argument
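The --batch-duration limit can be pictured as grouping cuts until the next one would exceed the budget. The greedy strategy below is an assumption made for illustration (the actual batching may differ), and a single cut longer than the budget is given a batch of its own.

```python
def batch_by_duration(durations, batch_duration):
    """Greedily group cut durations so each batch stays within batch_duration."""
    batches, current, total = [], [], 0.0
    for duration in durations:
        if current and total + duration > batch_duration:
            batches.append(current)
            current, total = [], 0.0
        current.append(duration)
        total += duration
    if current:
        batches.append(current)
    return batches


batches = batch_by_duration([10.0, 25.0, 30.0, 5.0, 40.0], batch_duration=60.0)
```

Each batch sums to at most 60 seconds of audio, which bounds the memory needed per extraction step.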
upload¶
Read an existing FEATURE_MANIFEST, upload the feature matrices it contains to a URL location, and save a new feature OUTPUT_MANIFEST that refers to the uploaded features.
The URL can refer to endpoints such as AWS S3, GCP, Azure, etc. For example: “s3://my-bucket/my-features” is a valid URL.
This script does not currently support credentials, and assumes that you have the write permissions.
lhotse feat upload [OPTIONS] FEATURE_MANIFEST URL OUTPUT_MANIFEST
Options
- -j, --num-jobs <num_jobs>¶
Arguments
- FEATURE_MANIFEST¶
Required argument
- URL¶
Required argument
- OUTPUT_MANIFEST¶
Required argument
write-default-config¶
Save a default feature extraction config to OUTPUT_CONFIG.
lhotse feat write-default-config [OPTIONS] OUTPUT_CONFIG
Options
- -f, --feature-type <feature_type>¶
Which feature extractor type to use.
- Options
fbank | kaldi-fbank | kaldi-mfcc | kaldifeat-fbank | kaldifeat-mfcc | librosa-fbank | mfcc | opensmile-extractor | spectrogram
Arguments
- OUTPUT_CONFIG¶
Required argument
filter¶
Filter a MANIFEST according to the rule specified in PREDICATE, and save the result to OUTPUT_MANIFEST. It is intended to work generically with most manifest types - it supports RecordingSet, SupervisionSet and CutSet.
The PREDICATE specifies which attribute is used for item selection. Some examples:
lhotse filter 'duration>4.5' supervision.json output.json
lhotse filter 'num_frames<600' cuts.json output.json
lhotse filter 'start=0' cuts.json output.json
lhotse filter 'channel!=0' audio.json output.json
It currently only supports comparison of numerical manifest item attributes, such as: start, duration, end, channel, num_frames, num_features, etc.
lhotse filter [OPTIONS] PREDICATE MANIFEST OUTPUT_MANIFEST
Arguments
- PREDICATE¶
Required argument
- MANIFEST¶
Required argument
- OUTPUT_MANIFEST¶
Required argument
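A PREDICATE such as 'duration>4.5' has three parts: an attribute name, a comparison operator, and a number. The sketch below shows one way such a string could be parsed and applied; the regular expression, the tolerance-based equality, and the use of plain dicts as manifest items are all assumptions for illustration, not Lhotse's code.

```python
import math
import operator
import re

# Attribute name, comparison operator, numeric value; longer operators come first.
PREDICATE = re.compile(
    r"^(?P<key>\w+)(?P<op>>=|<=|==|!=|=|>|<)(?P<value>-?\d+(?:\.\d+)?)$"
)

COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "=": math.isclose,
    "==": math.isclose,
    "!=": lambda a, b: not math.isclose(a, b),
}


def make_filter(predicate):
    """Turn a string such as 'duration>4.5' into a function over manifest items."""
    match = PREDICATE.match(predicate.replace(" ", ""))
    if match is None:
        raise ValueError(f"Cannot parse predicate: {predicate!r}")
    key = match["key"]
    compare = COMPARATORS[match["op"]]
    value = float(match["value"])
    return lambda item: compare(item[key], value)


def select_ids(predicate, items):
    """Return the IDs of the items (plain dicts here) that satisfy the predicate."""
    keep = make_filter(predicate)
    return [item["id"] for item in items if keep(item)]


items = [
    {"id": "a", "start": 0.0, "duration": 3.0, "channel": 0},
    {"id": "b", "start": 1.5, "duration": 5.0, "channel": 1},
]
```

For example, select_ids("duration>4.5", items) keeps only item "b", while select_ids("start=0", items) keeps only item "a".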
fix¶
Fix a pair of Lhotse RECORDINGS and SUPERVISIONS manifests. It removes supervisions without corresponding recordings and vice versa, trims the supervisions that exceed the recording, etc. Stores the output files in OUTPUT_DIR under the same names as the input files.
lhotse fix [OPTIONS] RECORDINGS SUPERVISIONS OUTPUT_DIR
Arguments
- RECORDINGS¶
Required argument
- SUPERVISIONS¶
Required argument
- OUTPUT_DIR¶
Required argument
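The clean-up rules named above (drop unmatched items, trim supervisions that run past the recording) can be sketched on simplified data. Recordings are reduced to a mapping from ID to duration and supervisions to plain dicts; these shapes, the function name, and the choice to drop supervisions that start after the recording ends are assumptions for illustration.

```python
def fix_manifests(recordings, supervisions):
    """Return (recordings, supervisions) that are consistent with each other."""
    fixed_supervisions = []
    for sup in supervisions:
        rec_duration = recordings.get(sup["recording_id"])
        if rec_duration is None or sup["start"] >= rec_duration:
            continue  # no matching recording, or nothing left to keep
        end = min(sup["start"] + sup["duration"], rec_duration)
        fixed_supervisions.append({**sup, "duration": end - sup["start"]})
    used = {sup["recording_id"] for sup in fixed_supervisions}
    fixed_recordings = {rid: dur for rid, dur in recordings.items() if rid in used}
    return fixed_recordings, fixed_supervisions


recordings = {"rec1": 10.0, "rec2": 5.0}
supervisions = [
    {"id": "s1", "recording_id": "rec1", "start": 8.0, "duration": 4.0},
    {"id": "s2", "recording_id": "rec3", "start": 0.0, "duration": 1.0},
]
fixed_recordings, fixed_supervisions = fix_manifests(recordings, supervisions)
```

Supervision s1 is trimmed from 4 to 2 seconds, s2 is removed because rec3 does not exist, and rec2 is removed because nothing refers to it.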
install-sph2pipe¶
Install the sph2pipe program to handle sphere (.sph) audio files with “shorten” codec compression (needed for older LDC data).
It downloads an archive and then decompresses and compiles the contents.
lhotse install-sph2pipe [OPTIONS]
Options
- --install-dir <install_dir>¶
Directory where sph2pipe will be downloaded and installed.
- --url <url>¶
URL from which to download sph2pipe.
kaldi¶
Kaldi import/export related commands.
lhotse kaldi [OPTIONS] COMMAND [ARGS]...
export¶
Convert a pair of RecordingSet and SupervisionSet manifests into a Kaldi-style data directory.
lhotse kaldi export [OPTIONS] RECORDINGS SUPERVISIONS OUTPUT_DIR
Options
- -u, --map-underscores-to <map_underscores_to>¶
Optional string with which we will replace all underscores. This helps avoid issues with Kaldi data dir sorting.
- -p, --prefix-spk-id¶
Prefix utterance ids with speaker ids. This helps avoid issues with Kaldi data dir sorting.
Arguments
- RECORDINGS¶
Required argument
- SUPERVISIONS¶
Required argument
- OUTPUT_DIR¶
Required argument
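The two options act on utterance IDs before they are written out. The following sketch shows the kind of transformation involved; the "-" separator between speaker and utterance ID, the function name, and applying the underscore mapping to the speaker ID as well are assumptions, not confirmed details of the exporter.

```python
def kaldi_utt_id(utt_id, speaker, map_underscores_to=None, prefix_spk_id=False):
    """Build an utterance ID following the two export options."""
    if map_underscores_to is not None:
        utt_id = utt_id.replace("_", map_underscores_to)
        speaker = speaker.replace("_", map_underscores_to)
    if prefix_spk_id:
        # Hypothetical separator; chosen here only for illustration.
        utt_id = f"{speaker}-{utt_id}"
    return utt_id


plain_id = kaldi_utt_id("utt_001", "spk_a")
mapped_id = kaldi_utt_id("utt_001", "spk_a", map_underscores_to="-")
prefixed_id = kaldi_utt_id("utt_001", "spk_a", map_underscores_to="-", prefix_spk_id=True)
```

Prefixing with the speaker ID makes utterances of the same speaker sort next to each other, which is the ordering Kaldi data directories expect.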
import¶
Convert a Kaldi data dir DATA_DIR into a directory MANIFEST_DIR of lhotse manifests. Ignores feats.scp. The SAMPLING_RATE has to be explicitly specified as it is not available to read from DATA_DIR.
lhotse kaldi import [OPTIONS] DATA_DIR SAMPLING_RATE MANIFEST_DIR
Options
- -f, --frame-shift <frame_shift>¶
Frame shift (in seconds) is required to support reading feats.scp.
- -u, --map-string-to-underscores <map_string_to_underscores>¶
When specified, we will replace all instances of this string in SupervisionSegment IDs with underscores. This is to help with handling underscores in Kaldi (see ‘export_to_kaldi’).
- -j, --num-jobs <num_jobs>¶
Number of jobs for computing recording durations.
Arguments
- DATA_DIR¶
Required argument
- SAMPLING_RATE¶
Required argument
- MANIFEST_DIR¶
Required argument
prepare¶
Command group with data preparation recipes.
lhotse prepare [OPTIONS] COMMAND [ARGS]...
adept¶
ADEPT prosody transfer evaluation corpus data preparation.
lhotse prepare adept [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
aidatatang-200zh¶
aidatatang_200zh ASR data preparation. CORPUS_DIR should contain a subdirectory “aidatatang_200zh”; OUTPUT_DIR is the output directory.
lhotse prepare aidatatang-200zh [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
aishell¶
Aishell ASR data preparation.
lhotse prepare aishell [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
aishell2¶
Aishell2 ASR data preparation.
lhotse prepare aishell2 [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -j, --num-jobs <num_jobs>¶
How many threads to use (can give good speed-ups with slow disks).
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
aishell4¶
AISHELL-4 data preparation.
lhotse prepare aishell4 [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
ali-meeting¶
AliMeeting data preparation.
lhotse prepare ali-meeting [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --mic <mic>¶
- Options
near | far
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
ami¶
AMI data preparation.
lhotse prepare ami [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --annotations <annotations>¶
Provide if annotations are downloaded in a different directory than corpus.
- --mic <mic>¶
AMI microphone setting.
- Options
ihm | ihm-mix | sdm | mdm
- --partition <partition>¶
Data partition to use (see http://groups.inf.ed.ac.uk/ami/corpus/datasets.shtml).
- Options
scenario-only | full-corpus | full-corpus-asr
- --normalize-text <normalize_text>¶
Type of text normalization to apply (kaldi style, by default)
- Options
none | upper | kaldi
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
aspire¶
ASpIRE data preparation.
lhotse prepare aspire [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --mic <mic>¶
- Options
single | multi
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
babel¶
This is a data preparation recipe for the IARPA BABEL corpus (see: https://www.iarpa.gov/index.php/research-programs/babel). It should support all of the languages available in BABEL. It will prepare the data from the “conversational” part of BABEL.
This script should be invoked separately for each language you want to prepare, e.g.:
$ lhotse prepare babel /export/corpora5/Babel/IARPA_BABEL_BP_101 data/cantonese
$ lhotse prepare babel /export/corpora5/Babel/BABEL_OP1_103 data/bengali
lhotse prepare babel [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
broadcast-news¶
English Broadcast News 1997 data preparation. It will output three manifests: for recordings, topic sections, and speech segments. It supports the following LDC distributions:
* 1997 English Broadcast News Train (HUB4): Speech LDC98S71, Transcripts LDC98T28
This data is not available for free - your institution needs to have an LDC subscription.
lhotse prepare broadcast-news [OPTIONS] AUDIO_DIR TRANSCRIPT_DIR OUTPUT_DIR
Arguments
- AUDIO_DIR¶
Required argument
- TRANSCRIPT_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
bvcc¶
BVCC data preparation.
CORPUS_DIR should contain the following directory structure:
./phase1-main/README
./phase1-main/DATA/sets/*
./phase1-main/DATA/wav/*
…
./phase1-ood/README
./phase1-ood/DATA/sets/
./phase1-ood/DATA/wav/
…
Check the READMEs for details.
See ‘lhotse download bvcc’ for links to instructions on how to obtain the corpus.
lhotse prepare bvcc [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -nj, --num_jobs <num_jobs>¶
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
callhome-egyptian¶
About the Callhome Egyptian Arabic Corpus
The CALLHOME Egyptian Arabic corpus of telephone speech consists of 120 unscripted telephone conversations between native speakers of Egyptian Colloquial Arabic (ECA), the spoken variety of Arabic found in Egypt. The dialect of ECA that this dictionary represents is Cairene Arabic.
This recipe uses the speech and transcripts available through LDC. In addition, an Egyptian Arabic phonetic lexicon (available via LDC) is used to get word-to-phoneme mappings for the vocabulary. These datasets are:
Speech: LDC97S45
Transcripts: LDC97T19
Lexicon: LDC99L22 (unused here)
To actually read the audio, you will need the SPH2PIPE binary: you can provide its path, so that we will add it in the manifests (otherwise you might need to modify your PATH environment variable to find sph2pipe).
lhotse prepare callhome-egyptian [OPTIONS] AUDIO_DIR TRANSCRIPT_DIR OUTPUT_DIR
Options
- --absolute-paths <absolute_paths>¶
Whether to return absolute or relative (to the corpus dir) paths for recordings.
Arguments
- AUDIO_DIR¶
Required argument
- TRANSCRIPT_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
callhome-english¶
CallHome American English corpus preparation.
Depending on the value of transcript_dir, will prepare either:
* if transcript_dir = None, the SRE task (expected corpus LDC2001S97). The setup will reflect speaker diarization on a portion of CALLHOME used in the 2000 NIST speaker recognition evaluation. The 2000 NIST SRE is required, and has an LDC catalog number LDC2001S97. The data is not available for free, but can be licensed from the LDC (Linguistic Data Consortium).
* otherwise, data for the ASR task (expected LDC corpora LDC97S42 and LDC97T14) will be prepared. The data is not available for free, but can be licensed from the LDC (Linguistic Data Consortium).
The data should be located at AUDIO_DIR. Optionally, for the SRE task, RTTM_DIR can be provided that has the contents of http://www.openslr.org/resources/10/; otherwise, we will download it.
To actually read the audio, you will need the SPH2PIPE binary: you can provide its path, so that we will add it in the manifests (otherwise you might need to modify your PATH environment variable to find sph2pipe).
Example:
lhotse prepare callhome-english /export/corpora5/LDC/LDC97S42 --transcript-dir /export/corpora5/LDC/LDC97T14 ./callhome_asr
or
lhotse prepare callhome-english /export/corpora5/LDC/LDC2001S97 ./callhome_sre
lhotse prepare callhome-english [OPTIONS] AUDIO_DIR OUTPUT_DIR
Options
- --rttm-dir <rttm_dir>¶
- --absolute-paths <absolute_paths>¶
Whether to return absolute or relative (to the corpus dir) paths for recordings.
- --transcript-dir <transcript_dir>¶
Path to the LDC97T14 corpus. Note that when this path is provided, the ASR corpus will be prepared, not the SRE corpus!
Arguments
- AUDIO_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
cmu-arctic¶
CMU Arctic data preparation.
lhotse prepare cmu-arctic [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
cmu-indic¶
CMU Indic data preparation.
lhotse prepare cmu-indic [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
cmu-kids¶
CMU Kids corpus data preparation.
lhotse prepare cmu-kids [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --absolute-paths <absolute_paths>¶
Use absolute paths for recordings
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
commonvoice¶
Mozilla CommonVoice manifest preparation script. CORPUS_DIR is expected to contain sub-directories that are named with CommonVoice language codes, e.g., “en”, “pl”, etc.
lhotse prepare commonvoice [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -l, --language <language>¶
Languages to prepare (scans CORPUS_DIR for language codes by default).
- -s, --split <split>¶
Splits to prepare (available options: train, dev, test, validated, invalidated, other)
- -j, --num-jobs <num_jobs>¶
How many threads to use (can give good speed-ups with slow disks).
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
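When --language is not given, CORPUS_DIR is scanned for language codes. A minimal sketch of such a scan, assuming every sub-directory name is taken as a language code (the function name is hypothetical, and any validation of the codes is left out):

```python
import tempfile
from pathlib import Path


def discover_languages(corpus_dir):
    """List sub-directory names of corpus_dir, treating each as a language code."""
    return sorted(p.name for p in Path(corpus_dir).iterdir() if p.is_dir())


# Demo on a throwaway directory laid out like CORPUS_DIR.
corpus_dir = Path(tempfile.mkdtemp())
for code in ("pl", "en"):
    (corpus_dir / code).mkdir()
(corpus_dir / "README.txt").write_text("not a language directory\n")
languages = discover_languages(corpus_dir)
```

Regular files next to the language directories are ignored, so only "en" and "pl" are found here.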
cslu-kids¶
CSLU Kids corpus data preparation.
lhotse prepare cslu-kids [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --absolute-paths <absolute_paths>¶
Use absolute paths for recordings
- --normalize-text <normalize_text>¶
Remove noise tags (<bn>, <bs>) from spontaneous speech transcripts
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
daily-talk¶
DailyTalk recording and supervision manifest preparation.
lhotse prepare daily-talk [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --num-jobs <num_jobs>¶
Number of parallel workers.
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
dihard3¶
DIHARD3 data preparation.
lhotse prepare dihard3 [OPTIONS] OUTPUT_DIR
Options
- --dev <dev>¶
- --eval <eval>¶
- --uem, --no-uem¶
Specify whether or not to create UEM supervision
- -j, --num-jobs <num_jobs>¶
Number of jobs to scan corpus directory for recordings.
Arguments
- OUTPUT_DIR¶
Required argument
earnings21¶
Earnings21 data preparation.
lhotse prepare earnings21 [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --normalize-text, --no-normalize-text¶
Normalize the text.
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
earnings22¶
Earnings22 data preparation.
lhotse prepare earnings22 [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --normalize-text, --no-normalize-text¶
Normalize the text.
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
eval2000¶
The Eval2000 corpus preparation.
This is conversational telephone speech collected as 2-channel, 8kHz-sampled data. The catalog numbers are LDC2002S09 for the audio corpus and LDC2002T43 for the transcripts.
This data is not available for free - your institution needs to have an LDC subscription.
lhotse prepare eval2000 [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --absolute-paths <absolute_paths>¶
Whether to return absolute or relative (to the corpus dir) paths for recordings.
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
fisher-english¶
The Fisher English Part 1, 2 corpus preparation.
This is conversational telephone speech collected as 2-channel, 8kHz-sampled data. The LDC catalog numbers are LDC2004S13 and LDC2005S13 for the audio corpora, and LDC2004T19 and LDC2005T19 for the transcripts.
This data is not available for free - your institution needs to have an LDC subscription.
lhotse prepare fisher-english [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -ad, --audio-dirs <audio_dirs>¶
Audio dirs, e.g., LDC2004S13 LDC2005S13. Multiple corpora can be provided by repeating -ad.
- -td, --transcript-dirs <transcript_dirs>¶
Transcript dirs, e.g., LDC2004T19 LDC2005T19. Multiple corpora can be provided by repeating -td.
- --absolute-paths <absolute_paths>¶
Whether to return absolute or relative (to the corpus dir) paths for recordings.
- -j, --num-jobs <num_jobs>¶
Number of concurrent processes scanning the audio files.
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
fisher-spanish¶
The Fisher Spanish corpus preparation.
This is conversational telephone speech collected as 2-channel μ-law, 8kHz-sampled data. The LDC catalog numbers are LDC2010S01 for the audio corpus and LDC2010T04 for the transcripts.
This data is not available for free - your institution needs to have an LDC subscription.
lhotse prepare fisher-spanish [OPTIONS] AUDIO_DIR TRANSCRIPT_DIR OUTPUT_DIR
Options
- --absolute-paths <absolute_paths>¶
Whether to return absolute or relative (to the corpus dir) paths for recordings.
Arguments
- AUDIO_DIR¶
Required argument
- TRANSCRIPT_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
gale-arabic¶
GALE Arabic Phases 1 to 4 Broadcast news and conversation data preparation.
lhotse prepare gale-arabic [OPTIONS] OUTPUT_DIR
Options
- -s, --audio <audio>¶
Paths to audio dirs, e.g., LDC2013S02. Multiple corpora can be provided by repeating -s.
- -t, --transcript <transcript>¶
Paths to transcript dirs, e.g., LDC2013T17. Multiple corpora can be provided by repeating -t.
- --absolute-paths <absolute_paths>¶
Use absolute paths for recordings
Arguments
- OUTPUT_DIR¶
Required argument
gale-mandarin¶
GALE Mandarin Broadcast speech data preparation.
lhotse prepare gale-mandarin [OPTIONS] OUTPUT_DIR
Options
- -s, --audio <audio>¶
Paths to audio dirs, e.g., LDC2013S08. Multiple corpora can be provided by repeating -s.
- -t, --transcript <transcript>¶
Paths to transcript dirs, e.g., LDC2013T20. Multiple corpora can be provided by repeating -t.
- --absolute-paths <absolute_paths>¶
Use absolute paths for recordings
- --segment-words <segment_words>¶
Use ‘jieba’ package to perform word segmentation on the text
Arguments
- OUTPUT_DIR¶
Required argument
gigaspeech¶
Gigaspeech ASR data preparation.
lhotse prepare gigaspeech [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --subset <subset>¶
Which parts of Gigaspeech to prepare (by default XL + DEV + TEST).
- Options
auto | XL | L | M | S | XS | DEV | TEST
- -j, --num-jobs <num_jobs>¶
How many threads to use (can give good speed-ups with slow disks).
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
heroico¶
Heroico Answers ASR data preparation.
lhotse prepare heroico [OPTIONS] SPEECH_DIR TRANSCRIPT_DIR OUTPUT_DIR
Arguments
- SPEECH_DIR¶
Required argument
- TRANSCRIPT_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
hifitts¶
HiFiTTS data preparation.
lhotse prepare hifitts [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -j, --num-jobs <num_jobs>¶
How many jobs to use (can give good speed-ups with slow disks).
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
icsi¶
ICSI data preparation.
lhotse prepare icsi [OPTIONS] AUDIO_DIR OUTPUT_DIR
Options
- --transcripts-dir <transcripts_dir>¶
- --mic <mic>¶
ICSI microphone setting.
- Options
ihm | ihm-mix | sdm | mdm
- --normalize-text¶
If set, convert all text annotations to upper case (similar to Kaldi)
Arguments
- AUDIO_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
l2-arctic¶
L2 Arctic data preparation.
lhotse prepare l2-arctic [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
libricss¶
LibriCSS recording and supervision manifest preparation.
lhotse prepare libricss [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --type <type>¶
Type of the corpus to prepare
- Options
ihm | ihm-mix | mdm
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
librimix¶
LibriMix source separation data preparation.
lhotse prepare librimix [OPTIONS] LIBRIMIX_CSV OUTPUT_DIR
Options
- --sampling-rate <sampling_rate>¶
Sampling rate to set in the RecordingSet manifest.
- --min-segment-seconds <min_segment_seconds>¶
Remove segments shorter than MIN_SEGMENT_SECONDS.
- --with-precomputed-mixtures, --no-precomputed-mixtures¶
Optionally create a RecordingSet manifest including the precomputed LibriMix mixtures.
Arguments
- LIBRIMIX_CSV¶
Required argument
- OUTPUT_DIR¶
Required argument
librispeech¶
(Mini) Librispeech ASR data preparation.
lhotse prepare librispeech [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -p, --dataset-parts <dataset_parts>¶
List of dataset parts to prepare. To prepare multiple parts, pass each with -p. Example: -p train-clean-360 -p dev-other
- -j, --num-jobs <num_jobs>¶
How many threads to use (can give good speed-ups with slow disks).
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
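The repeated -p convention described above can be scripted. The following is a minimal sketch, assuming you launch the command from Python; the helper name librispeech_prepare_argv and the example paths are hypothetical, and only the flags documented in this section (-p, -j) are used.

```python
import shlex
from typing import List, Optional, Sequence


def librispeech_prepare_argv(
    corpus_dir: str,
    output_dir: str,
    dataset_parts: Sequence[str] = (),
    num_jobs: Optional[int] = None,
) -> List[str]:
    """Build an argument list for `lhotse prepare librispeech` (hypothetical helper)."""
    argv = ["lhotse", "prepare", "librispeech"]
    # Each dataset part gets its own -p flag, as the option text describes.
    for part in dataset_parts:
        argv += ["-p", part]
    if num_jobs is not None:
        argv += ["-j", str(num_jobs)]
    # Positional arguments come last: CORPUS_DIR, then OUTPUT_DIR.
    argv += [corpus_dir, output_dir]
    return argv


argv = librispeech_prepare_argv(
    "LibriSpeech", "manifests", ["train-clean-360", "dev-other"], num_jobs=4
)
# Print a shell-safe rendering of the command.
print(shlex.join(argv))
# prints: lhotse prepare librispeech -p train-clean-360 -p dev-other -j 4 LibriSpeech manifests
```

The list can be passed to subprocess.run(argv, check=True) on a machine where Lhotse is installed.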
libritts¶
LibriTTS data preparation.
lhotse prepare libritts [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -j, --num-jobs <num_jobs>¶
How many jobs to use (can give good speed-ups with slow disks).
- --link-previous-utterance, --no-previous-utterance¶
If true, adds the previous utterance ID to the supervisions. Useful for reconstructing chains of utterances as they were read from LibriVox books. If the previous utterance was skipped from the LibriTTS datasets, the previous_utt label is None. 66% of utterances have a previous utterance.
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
ljspeech¶
LJSpeech data preparation.
lhotse prepare ljspeech [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
magicdata¶
Magicdata ASR data preparation.
lhotse prepare magicdata [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
mgb2¶
MGB2 ASR data preparation.
lhotse prepare mgb2 [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --text-cleaning, --no-text-cleaning¶
Basic text cleaning.
- --buck-walter, --no-buck-walter¶
Use BuckWalter transliteration.
- -j, --num-jobs <num_jobs>¶
How many threads to use (can give good speed-ups with slow disks).
- --mer-thresh <mer_thresh>¶
Filter out segments based on MER (Match Error Rate).
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
mls¶
Multilingual Librispeech (MLS) data preparation.
Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages - English, German, Dutch, Spanish, French, Italian, Portuguese, Polish. It is available at OpenSLR: http://openslr.org/94
lhotse prepare mls [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --opus, --flac¶
Which codec should be used (OPUS or FLAC)
- -j, --num-jobs <num_jobs>¶
How many threads to use (can give good speed-ups with slow disks).
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
mtedx¶
MTEDx ASR data preparation.
lhotse prepare mtedx [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -j, --num-jobs <num_jobs>¶
How many threads to use (can give good speed-ups with slow disks).
- -l, --lang <lang>¶
Specify which languages to prepare, e.g., lhotse prepare mtedx mtedx_corpus data -l de -l fr -l es
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
musan¶
MUSAN data preparation.
lhotse prepare musan [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --use-vocals, --no-vocals¶
Whether to include vocal music in “music” part.
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
nsc¶
This is a data preparation recipe for the National Corpus of Speech in Singaporean English.
lhotse prepare nsc [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -p, --dataset-part <dataset_part>¶
Which part of NSC should be prepared
- Options
PART3_SameCloseMic | PART3_SeparateIVR
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
peoples-speech¶
Prepare The People’s Speech corpus manifests.
lhotse prepare peoples-speech [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
primewords¶
Primewords ASR data preparation.
lhotse prepare primewords [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
rir-noise¶
RIRS and noises data preparation.
lhotse prepare rir-noise [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -p, --parts <parts>¶
Parts to prepare.
- Default
point_noise, iso_noise, real_rir, sim_rir
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
spgispeech¶
SPGISpeech ASR data preparation.
lhotse prepare spgispeech [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -j, --num-jobs <num_jobs>¶
How many threads to use (can give good speed-ups with slow disks).
- --normalize-text, --no-normalize-text¶
Normalize the text.
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
stcmds¶
Stcmds ASR data preparation.
lhotse prepare stcmds [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
switchboard¶
The Switchboard corpus preparation.
This is conversational telephone speech collected as 2-channel, 8kHz-sampled data. We are using just the Switchboard-1 Phase 1 training data. The catalog number LDC97S62 (Switchboard-1 Release 2) corresponds, we believe, to what we have. We also use the Mississippi State transcriptions, which we download separately from http://www.isip.piconepress.com/projects/switchboard/releases/switchboard_word_alignments.tar.gz
This data is not available for free - your institution needs to have an LDC subscription.
lhotse prepare switchboard [OPTIONS] AUDIO_DIR OUTPUT_DIR
Options
- --transcript-dir <transcript_dir>¶
- --sentiment-dir <sentiment_dir>¶
Optional path to LDC2020T14 package with sentiment annotations for SWBD.
- --omit-silence, --retain-silence¶
Should the [silence] segments be kept.
- --absolute-paths <absolute_paths>¶
Whether to return absolute or relative (to the corpus dir) paths for recordings.
Arguments
- AUDIO_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
tal-asr¶
Tal_asr ASR data preparation.
lhotse prepare tal-asr [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
tal-csasr¶
Tal_csasr ASR data preparation.
lhotse prepare tal-csasr [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -j, --num-jobs <num_jobs>¶
How many threads to use (can give good speed-ups with slow disks).
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
tedlium¶
TED-LIUM v3 recording and supervision manifest preparation.
lhotse prepare tedlium [OPTIONS] TEDLIUM_DIR OUTPUT_DIR
Arguments
- TEDLIUM_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
thchs-30¶
thchs_30 ASR data preparation.
lhotse prepare thchs-30 [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
timit¶
TIMIT data preparation.
CORPUS_DIR is the path to the data directory. OUTPUT_DIR is the path where the manifests are written.
lhotse prepare timit [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -p, --num-phones <num_phones>¶
The number of phones (60, 48 or 39) used for modeling. The default is 48.
- -j, --num-jobs <num_jobs>¶
How many threads to use (can give good speed-ups with slow disks).
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
vctk¶
VCTK data preparation.
lhotse prepare vctk [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- --use-edinburgh-vctk-url <use_edinburgh_vctk_url>¶
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
voxceleb¶
The VoxCeleb corpus preparation.
VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube. VoxCeleb contains speech from speakers spanning a wide range of different ethnicities, accents, professions and ages. There are a total of 7000+ speakers and 1 million utterances.
lhotse prepare voxceleb [OPTIONS] OUTPUT_DIR
Options
- -v1, --voxceleb1 <voxceleb1>¶
Path to VoxCeleb1 dataset.
- -v2, --voxceleb2 <voxceleb2>¶
Path to VoxCeleb2 dataset.
- -j, --num-jobs <num_jobs>¶
Number of parallel jobs.
Arguments
- OUTPUT_DIR¶
Required argument
wenet-speech¶
The WenetSpeech corpus preparation.
lhotse prepare wenet-speech [OPTIONS] CORPUS_DIR OUTPUT_DIR
Options
- -p, --dataset-parts <dataset_parts>¶
List of dataset parts to prepare. To prepare multiple parts, pass each with -p. Example: -p M -p TEST_NET
- -j, --num-jobs <num_jobs>¶
How many threads to use (can give good speed-ups with slow disks).
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
yesno¶
yes_no data preparation.
lhotse prepare yesno [OPTIONS] CORPUS_DIR OUTPUT_DIR
Arguments
- CORPUS_DIR¶
Required argument
- OUTPUT_DIR¶
Required argument
split¶
Load MANIFEST, split it into NUM_SPLITS equal parts and save as separate manifests in OUTPUT_DIR.
When your manifests are very large, prefer “lhotse split-lazy” instead.
lhotse split [OPTIONS] NUM_SPLITS MANIFEST OUTPUT_DIR
Options
- -s, --shuffle¶
Optionally shuffle the sequence before splitting.
- --pad, --no-pad¶
Whether to pad the split output idx with zeros (e.g. 01, 02, .., 10).
Arguments
- NUM_SPLITS¶
Required argument
- MANIFEST¶
Required argument
- OUTPUT_DIR¶
Required argument
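To make the idea of NUM_SPLITS equal parts and zero-padded indices concrete, here is a minimal standalone sketch. It is not Lhotse's implementation: the function names are hypothetical, and the way a remainder is spread over the first parts is an assumption made for illustration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_evenly(items: Sequence[T], num_splits: int) -> List[List[T]]:
    """Split items into num_splits contiguous parts whose sizes differ by at most one."""
    if not 1 <= num_splits <= len(items):
        raise ValueError("num_splits must be between 1 and len(items)")
    base, extra = divmod(len(items), num_splits)
    parts, start = [], 0
    for i in range(num_splits):
        # Assumption: the first `extra` parts receive one additional item.
        size = base + (1 if i < extra else 0)
        parts.append(list(items[start:start + size]))
        start += size
    return parts


def padded_index(idx: int, num_splits: int) -> str:
    """Zero-pad a 1-based split index to the width of num_splits (e.g. 01, 02, .., 10)."""
    return str(idx).zfill(len(str(num_splits)))


parts = split_evenly(list(range(10)), 3)
labels = [padded_index(i, 10) for i in (1, 2, 10)]
```

Here parts holds three lists of sizes 4, 3 and 3, and labels mirrors the padding example given for --pad.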
split-lazy¶
Load MANIFEST (lazily if in JSONL format) and split it into parts, each with CHUNK_SIZE items. The parts are saved to separate files with pattern “{output_dir}/{manifest.stem}.{chunk_idx}.jsonl.gz”.
Prefer this to “lhotse split” when your manifests are very large.
lhotse split-lazy [OPTIONS] MANIFEST OUTPUT_DIR CHUNK_SIZE
Arguments
- MANIFEST¶
Required argument
- OUTPUT_DIR¶
Required argument
- CHUNK_SIZE¶
Required argument
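The chunking behaviour described for split-lazy can be sketched without loading the whole manifest into memory. This is an illustrative stand-in rather than Lhotse's code: lazy_chunks and chunk_path are hypothetical names, and starting chunk_idx at 0 is an assumption, since the numbering base is not stated here.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def lazy_chunks(items: Iterable[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of up to chunk_size items, reading the input lazily."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def chunk_path(output_dir: str, stem: str, chunk_idx: int) -> str:
    """Format an output path following the documented pattern."""
    return f"{output_dir}/{stem}.{chunk_idx}.jsonl.gz"


chunks = list(lazy_chunks(range(5), 2))
# Assumption: chunk indices are numbered from 0.
paths = [chunk_path("out", "cuts", idx) for idx, _ in enumerate(chunks)]
```

Only one chunk is held in memory at a time when the generator is consumed in a loop, which is the reason to prefer this approach for very large manifests.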
subset¶
Load MANIFEST, select the FIRST or LAST number of items and store it in OUTPUT_MANIFEST.
lhotse subset [OPTIONS] MANIFEST OUTPUT_MANIFEST
Options
- --first <first>¶
- --last <last>¶
- --cutids <cutids>¶
A JSON string or a path to a JSON file containing an array of cut ID strings. E.g. --cutids '["cutid1", "cutid2"]'.
Arguments
- MANIFEST¶
Required argument
- OUTPUT_MANIFEST¶
Required argument
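Quoting a JSON array for --cutids by hand is error-prone, so it can help to generate it. The sketch below assumes the command is assembled in Python; subset_argv and the manifest file names are hypothetical, while the --cutids option and the argument order follow this section.

```python
import json
import shlex
from typing import List, Sequence


def subset_argv(manifest: str, output_manifest: str, cut_ids: Sequence[str]) -> List[str]:
    """Build an argument list for `lhotse subset --cutids ...` (hypothetical helper)."""
    # json.dumps produces the array-of-strings form that the option expects.
    ids_json = json.dumps(list(cut_ids))
    return ["lhotse", "subset", "--cutids", ids_json, manifest, output_manifest]


argv = subset_argv("cuts.jsonl.gz", "subset.jsonl.gz", ["cutid1", "cutid2"])
# shlex.join adds the shell quoting needed around the JSON string.
print(shlex.join(argv))
# prints: lhotse subset --cutids '["cutid1", "cutid2"]' cuts.jsonl.gz subset.jsonl.gz
```

Passing argv directly to subprocess.run avoids shell quoting altogether.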
validate¶
Validate a Lhotse manifest file.
lhotse validate [OPTIONS] MANIFEST
Options
- --read-data, --dont-read-data¶
Should the audio/features data be read from disk to perform additional checks (could be extremely slow for large manifests).
Arguments
- MANIFEST¶
Required argument
validate-pair¶
Validate a pair of Lhotse RECORDINGS and SUPERVISIONS manifest files. Checks whether the two manifests are consistent with each other.
lhotse validate-pair [OPTIONS] RECORDINGS SUPERVISIONS
Options
- --read-data, --dont-read-data¶
Should the audio/features data be read from disk to perform additional checks (could be extremely slow for large manifests).
Arguments
- RECORDINGS¶
Required argument
- SUPERVISIONS¶
Required argument
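As an illustration of what consistency between the two manifests can mean, the sketch below checks plain dictionaries rather than Lhotse objects. It is an assumption about the kind of check involved, not the list of checks Lhotse performs: find_inconsistencies and the tolerance value are hypothetical, and the dictionaries only mimic the id, recording_id, start and duration fields.

```python
from typing import Dict, List


def find_inconsistencies(
    recordings: List[Dict],
    supervisions: List[Dict],
    tolerance: float = 1e-3,
) -> List[str]:
    """Report supervisions that point at unknown recordings or exceed their duration."""
    durations = {rec["id"]: rec["duration"] for rec in recordings}
    problems = []
    for sup in supervisions:
        rec_id = sup["recording_id"]
        if rec_id not in durations:
            problems.append(f"{sup['id']}: unknown recording_id {rec_id!r}")
            continue
        end = sup["start"] + sup["duration"]
        # Assumed rule: a supervision must lie within its recording's time span.
        if sup["start"] < 0 or end > durations[rec_id] + tolerance:
            problems.append(f"{sup['id']}: spans outside recording {rec_id!r}")
    return problems


recordings = [{"id": "rec1", "duration": 10.0}]
supervisions = [
    {"id": "s1", "recording_id": "rec1", "start": 0.0, "duration": 5.0},
    {"id": "s2", "recording_id": "rec1", "start": 8.0, "duration": 5.0},
    {"id": "s3", "recording_id": "rec2", "start": 0.0, "duration": 1.0},
]
problems = find_inconsistencies(recordings, supervisions)
```

In this example s1 passes, s2 is flagged for running past the end of rec1, and s3 is flagged for referring to a recording that is absent.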