Command-line interface

lhotse

The shell entry point to Lhotse, a tool and a library for audio data manipulation in high altitudes.

lhotse [OPTIONS] COMMAND [ARGS]...

Options

-s, --seed <seed>

Random seed.

combine

Load MANIFESTS, combine them into a single one, and write it to OUTPUT_MANIFEST.

lhotse combine [OPTIONS] [MANIFESTS]... OUTPUT_MANIFEST

Arguments

MANIFESTS

Optional argument(s)

OUTPUT_MANIFEST

Required argument

copy

Load INPUT_MANIFEST and store it to OUTPUT_MANIFEST. Useful for conversion between different serialization formats (e.g. JSON, JSONL, YAML). Gzip compression is handled automatically when a ‘.gz’ suffix is detected.

lhotse copy [OPTIONS] INPUT_MANIFEST OUTPUT_MANIFEST

Arguments

INPUT_MANIFEST

Required argument

OUTPUT_MANIFEST

Required argument

cut

Group of commands used to create CutSets.

lhotse cut [OPTIONS] COMMAND [ARGS]...

append

Create a new CutSet by appending the cuts in CUT_MANIFESTS. CUT_MANIFESTS are iterated position-wise (the cuts at the i-th position in each manifest are appended to each other). The cuts are appended in the order in which they appear in the input argument list. If CUT_MANIFESTS have different lengths, the script stops once the shortest CutSet is depleted.

lhotse cut append [OPTIONS] [CUT_MANIFESTS]... OUTPUT_CUT_MANIFEST

Arguments

CUT_MANIFESTS

Optional argument(s)

OUTPUT_CUT_MANIFEST

Required argument
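
The position-wise behavior of cut append can be sketched in plain Python. This is an illustration only, not Lhotse code: each cut is modelled as a hypothetical (cut_id, duration) tuple, and "appending" is reduced to joining the IDs and summing the durations.

```python
def append_positionwise(*cut_lists):
    """Append the cuts found at the same position in each input list.

    zip() stops as soon as the shortest input is exhausted, which mirrors
    the "stops once the shortest CutSet is depleted" rule.
    """
    appended = []
    for cuts in zip(*cut_lists):
        ids = "+".join(cut_id for cut_id, _ in cuts)
        total_duration = sum(duration for _, duration in cuts)
        appended.append((ids, total_duration))
    return appended


first = [("a1", 1.0), ("a2", 2.0)]
second = [("b1", 0.5)]
# Only one output cut is produced, because `second` has a single cut.
print(append_positionwise(first, second))
```

The order of the input lists matters: swapping first and second changes the order in which the cuts are joined.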

mix-by-recording-id

Create a CutSet stored in OUTPUT_CUT_MANIFEST by matching the Cuts from CUT_MANIFESTS by their recording IDs and mixing them together.

lhotse cut mix-by-recording-id [OPTIONS] [CUT_MANIFESTS]...
                               OUTPUT_CUT_MANIFEST

Arguments

CUT_MANIFESTS

Optional argument(s)

OUTPUT_CUT_MANIFEST

Required argument

mix-sequential

Create a CutSet stored in OUTPUT_CUT_MANIFEST by iterating jointly over CUT_MANIFESTS and mixing the Cuts on the same positions. E.g. the first output cut is created from the first cuts in each input manifest. The mix is performed by summing the features from all Cuts. If the CUT_MANIFESTS have different numbers of Cuts, the mixing ends when the shortest manifest is depleted.

lhotse cut mix-sequential [OPTIONS] [CUT_MANIFESTS]... OUTPUT_CUT_MANIFEST

Arguments

CUT_MANIFESTS

Optional argument(s)

OUTPUT_CUT_MANIFEST

Required argument

pad

Create a new CutSet by padding the cuts in CUT_MANIFEST. The cuts will be right-padded, i.e. the padding is placed after the signal ends.

lhotse cut pad [OPTIONS] CUT_MANIFEST OUTPUT_CUT_MANIFEST

Options

-d, --duration <duration>

Desired duration of cuts after padding. Cuts longer than this won’t be affected. By default, pad to the longest cut duration found in CUT_MANIFEST.

Arguments

CUT_MANIFEST

Required argument

OUTPUT_CUT_MANIFEST

Required argument
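
The padding rule for cut pad can be expressed as a short calculation. The sketch below is not Lhotse's implementation; pad_durations is a hypothetical helper that works on bare durations and reports how much padding would be added on the right of each cut.

```python
def pad_durations(durations, target=None):
    """Return (original_duration, right_padding) pairs.

    With no target, pad to the longest duration in the input. Cuts that are
    already at least as long as the target receive no padding.
    """
    if not durations:
        return []
    if target is None:
        target = max(durations)
    return [(d, max(0.0, target - d)) for d in durations]


# Default: pad everything to the longest cut (5.0 seconds here).
print(pad_durations([2.0, 5.0, 3.5]))
# Explicit target: the 5.0 second cut is longer and stays untouched.
print(pad_durations([2.0, 5.0], target=4.0))
```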

random-mixed

Create a CutSet stored in OUTPUT_CUT_MANIFEST that contains supervision regions from SUPERVISION_MANIFEST and features supplied by FEATURE_MANIFEST. It first creates a trivial CutSet, splits it into two equal, randomized parts and mixes their features. The parameters of the mix are controlled via SNR_RANGE and OFFSET_RANGE.

lhotse cut random-mixed [OPTIONS] SUPERVISION_MANIFEST FEATURE_MANIFEST
                        OUTPUT_CUT_MANIFEST

Options

-s, --snr-range <snr_range>

Range of SNR values (in dB) that will be uniformly sampled in order to mix the signals.

-o, --offset-range <offset_range>

Range of relative offset values (0 - 1), which will offset the “right” signal by this many times the duration of the “left” signal. It is uniformly sampled for each mix operation.

Arguments

SUPERVISION_MANIFEST

Required argument

FEATURE_MANIFEST

Required argument

OUTPUT_CUT_MANIFEST

Required argument
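
How the SNR and offset ranges of cut random-mixed translate into per-mix parameters can be sketched as follows. This is a simplified illustration, not Lhotse code: split_in_two and sample_mix_params are hypothetical names, and dropping a leftover item when the input has an odd length is an assumption.

```python
import random


def split_in_two(items, rng):
    """Shuffle and split into two equally sized halves.

    Assumption: with an odd number of items, the leftover item is dropped.
    """
    shuffled = list(items)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half : 2 * half]


def sample_mix_params(left_duration, snr_range, offset_range, rng):
    """Draw one SNR (in dB) and one absolute offset (in seconds) for a mix.

    The relative offset is sampled uniformly and scaled by the duration of
    the "left" signal to obtain the delay of the "right" signal.
    """
    snr_db = rng.uniform(snr_range[0], snr_range[1])
    relative_offset = rng.uniform(offset_range[0], offset_range[1])
    return snr_db, relative_offset * left_duration


rng = random.Random(0)  # fixed seed so the draw is repeatable
left, right = split_in_two(["c1", "c2", "c3", "c4"], rng)
snr_db, offset = sample_mix_params(4.0, (10.0, 20.0), (0.25, 0.75), rng)
print(left, right, snr_db, offset)
```

With a left signal of 4.0 seconds and an offset range of 0.25 to 0.75, the right signal starts somewhere between 1.0 and 3.0 seconds into the left one.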

simple

Create a CutSet stored in OUTPUT_CUT_MANIFEST. Depending on the provided options, it may contain any combination of recording, feature and supervision manifests. Either RECORDING_MANIFEST or FEATURE_MANIFEST has to be provided. When SUPERVISION_MANIFEST is provided, the cuts’ time span will correspond to that of the supervision segments. Otherwise, the time span is taken from the features if they are available, and from the recordings if not.

lhotse cut simple [OPTIONS] OUTPUT_CUT_MANIFEST

Options

-r, --recording-manifest <recording_manifest>

Optional recording manifest - will be used to attach the recordings to the cuts.

-f, --feature-manifest <feature_manifest>

Optional feature manifest - will be used to attach the features to the cuts.

-s, --supervision-manifest <supervision_manifest>

Optional supervision manifest - will be used to attach the supervisions to the cuts.

Arguments

OUTPUT_CUT_MANIFEST

Required argument

truncate

Truncate the cuts in the CUT_MANIFEST and write them to OUTPUT_CUT_MANIFEST. Cuts shorter than MAX_DURATION will not be modified.

lhotse cut truncate [OPTIONS] CUT_MANIFEST OUTPUT_CUT_MANIFEST

Options

--preserve-id

Whether the cuts should preserve their IDs (by default, they will get new, random IDs).

-d, --max-duration <max_duration>

Required. The maximum duration in seconds of a cut in the resulting manifest.

-o, --offset-type <offset_type>

Where should the truncated cut start: “start” - at the start of the original cut, “end” - MAX_DURATION before the end of the original cut, “random” - randomly choose somewhere between “start” and “end” options.

Options

start | end | random

--keep-overflowing-supervisions, --discard-overflowing-supervisions

When a cut is truncated in the middle of a supervision segment, whether the supervision should be kept.

Arguments

CUT_MANIFEST

Required argument

OUTPUT_CUT_MANIFEST

Required argument
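
The three offset types of cut truncate can be illustrated by computing where the truncated cut would start. The function below is a hypothetical sketch that works on plain numbers; it is not taken from Lhotse.

```python
import random


def truncated_span(cut_start, cut_duration, max_duration, offset_type="start", rng=None):
    """Return (new_start, new_duration) for a cut truncated to max_duration.

    Cuts that already fit within max_duration are returned unchanged.
    """
    if cut_duration <= max_duration:
        return cut_start, cut_duration
    # Latest start that still leaves max_duration seconds before the cut ends.
    latest_start = cut_start + cut_duration - max_duration
    if offset_type == "start":
        new_start = cut_start
    elif offset_type == "end":
        new_start = latest_start
    elif offset_type == "random":
        new_start = (rng or random).uniform(cut_start, latest_start)
    else:
        raise ValueError(f"unknown offset type: {offset_type!r}")
    return new_start, max_duration


print(truncated_span(0.0, 10.0, 4.0, "start"))  # keeps the first 4 seconds
print(truncated_span(0.0, 10.0, 4.0, "end"))    # keeps the last 4 seconds
```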

windowed

Create a CutSet stored in OUTPUT_CUT_MANIFEST from feature regions in FEATURE_MANIFEST. The feature matrices are traversed in windows with CUT_SHIFT increments, creating cuts of constant CUT_DURATION.

lhotse cut windowed [OPTIONS] FEATURE_MANIFEST OUTPUT_CUT_MANIFEST

Options

-d, --cut-duration <cut_duration>

How long should the cuts be in seconds.

-s, --cut-shift <cut_shift>

How much to shift the cutting window in seconds (by default the shift is equal to CUT_DURATION).

--keep-shorter-windows, --discard-shorter-windows

When true, the last window will be used to create a Cut even if its duration is shorter than CUT_DURATION.

Arguments

FEATURE_MANIFEST

Required argument

OUTPUT_CUT_MANIFEST

Required argument
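
The windowing scheme of cut windowed can be sketched as a loop over window start times. This is an illustrative approximation with a hypothetical function name, operating on a single total duration rather than on real feature matrices.

```python
def window_spans(total_duration, cut_duration, cut_shift=None, keep_shorter_windows=False):
    """Return (start, duration) pairs for windows laid over one feature region.

    The shift defaults to the cut duration, which gives non-overlapping windows.
    The first window that would run past the end is either dropped or, when
    keep_shorter_windows is set, shortened to fit; no further windows follow.
    """
    if cut_shift is None:
        cut_shift = cut_duration
    if cut_shift <= 0 or cut_duration <= 0:
        raise ValueError("cut_duration and cut_shift must be positive")
    spans = []
    index = 0
    while True:
        start = index * cut_shift
        if start >= total_duration:
            break
        if start + cut_duration > total_duration:
            if keep_shorter_windows:
                spans.append((start, total_duration - start))
            break
        spans.append((start, cut_duration))
        index += 1
    return spans


print(window_spans(10, 4))                             # two full windows
print(window_spans(10, 4, keep_shorter_windows=True))  # plus a 2 second tail
print(window_spans(10, 4, cut_shift=3))                # overlapping windows
```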

download

Command group for downloading and extracting data.

lhotse download [OPTIONS] COMMAND [ARGS]...

adept

ADEPT prosody transfer evaluation corpus download.

lhotse download adept [OPTIONS] TARGET_DIR

Arguments

TARGET_DIR

Required argument

aishell

Aishell download.

lhotse download aishell [OPTIONS] TARGET_DIR

Arguments

TARGET_DIR

Required argument

aishell4

AISHELL-4 download.

lhotse download aishell4 [OPTIONS] TARGET_DIR

Arguments

TARGET_DIR

Required argument

ami

AMI download.

lhotse download ami [OPTIONS] TARGET_DIR

Options

--annotations <annotations>

Directory in which to download the annotations, if it is different from the corpus directory.

--mic <mic>

AMI microphone setting.

Options

ihm | ihm-mix | sdm | mdm

--url <url>

AMI data downloading URL.

--force-download <force_download>

If True, download even if file is present.

Arguments

TARGET_DIR

Required argument

cmu-arctic

CMU Arctic download.

lhotse download cmu-arctic [OPTIONS] TARGET_DIR

Arguments

TARGET_DIR

Required argument

cmu-indic

CMU Indic download.

lhotse download cmu-indic [OPTIONS] TARGET_DIR

Arguments

TARGET_DIR

Required argument

gigaspeech

Gigaspeech download.

lhotse download gigaspeech [OPTIONS] PASSWORD TARGET_DIR

Options

--subset <subset>

Which parts of Gigaspeech to download (by default XL + DEV + TEST).

Options

auto | XL | L | M | S | XS | DEV | TEST

--host <host>

Which host to download Gigaspeech from.

Arguments

PASSWORD

Required argument

TARGET_DIR

Required argument

heroico

heroico download.

lhotse download heroico [OPTIONS] TARGET_DIR

Arguments

TARGET_DIR

Required argument

hifitts

HiFiTTS data download.

lhotse download hifitts [OPTIONS] TARGET_DIR

Arguments

TARGET_DIR

Required argument

libricss

Download LibriCSS dataset.

lhotse download libricss [OPTIONS] TARGET_DIR

Options

--force-download

Force download

Arguments

TARGET_DIR

Required argument

librimix

Mini LibriMix download.

lhotse download librimix [OPTIONS] TARGET_DIR

Arguments

TARGET_DIR

Required argument

librispeech

(Mini) Librispeech download.

lhotse download librispeech [OPTIONS] TARGET_DIR

Options

--full, --mini

Download Librispeech [default] or mini Librispeech.

Arguments

TARGET_DIR

Required argument

libritts

LibriTTS data download.

lhotse download libritts [OPTIONS] TARGET_DIR

Arguments

TARGET_DIR

Required argument

ljspeech

LJSpeech download.

lhotse download ljspeech [OPTIONS] [TARGET_DIR]

Arguments

TARGET_DIR

Optional argument

mtedx

MTEDx download.

lhotse download mtedx [OPTIONS] TARGET_DIR

Options

-l, --lang <lang>

Specify which languages to download, e.g., lhotse download mtedx . -l de -l fr -l es or lhotse download mtedx

Arguments

TARGET_DIR

Required argument

musan

MUSAN download.

lhotse download musan [OPTIONS] TARGET_DIR

Arguments

TARGET_DIR

Required argument

tedlium

TED-LIUM v3 download (approx. 11GB).

lhotse download tedlium [OPTIONS] TARGET_DIR

Arguments

TARGET_DIR

Required argument

timit

TIMIT download.

lhotse download timit [OPTIONS] TARGET_DIR

Arguments

TARGET_DIR

Required argument

vctk

VCTK download.

lhotse download vctk [OPTIONS] TARGET_DIR

Arguments

TARGET_DIR

Required argument

yesno

yes_no dataset download.

lhotse download yesno [OPTIONS] TARGET_DIR

Arguments

TARGET_DIR

Required argument

feat

Feature extraction related commands.

lhotse feat [OPTIONS] COMMAND [ARGS]...

extract

Extract features for recordings in a given RECORDING_MANIFEST. The features are stored in OUTPUT_DIR, with one file per recording (or segment).

lhotse feat extract [OPTIONS] RECORDING_MANIFEST OUTPUT_DIR

Options

-f, --feature-manifest <feature_manifest>

Optional manifest specifying feature extractor configuration.

--storage-type <storage_type>

Select a storage backend for the feature matrices.

Options

chunked_lilcom_hdf5 | lilcom_files | lilcom_hdf5 | lilcom_url | numpy_files | numpy_hdf5

-t, --lilcom-tick-power <lilcom_tick_power>

Determines the compression accuracy; the input will be compressed to integer multiples of 2^tick_power

-r, --root-dir <root_dir>

Root directory - all paths in the manifest will use this as prefix.

-j, --num-jobs <num_jobs>

Number of parallel processes.

Arguments

RECORDING_MANIFEST

Required argument

OUTPUT_DIR

Required argument
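
The phrase "integer multiples of 2^tick_power" in the --lilcom-tick-power description can be made concrete with a few lines of arithmetic. The sketch below only illustrates that rounding step; it is not the lilcom codec, and quantize is a hypothetical helper.

```python
def quantize(value, tick_power):
    """Round a value to the nearest integer multiple of 2**tick_power."""
    tick = 2.0 ** tick_power
    return round(value / tick) * tick


# A smaller (more negative) tick power gives a finer grid and a smaller error.
print(quantize(0.3, -1))  # 0.5
print(quantize(0.3, -5))  # 0.3125
print(quantize(0.3, -8))  # 0.30078125
```

Going from a tick power of -5 to -8 shrinks the rounding error for 0.3 from 0.0125 to under 0.001, at the cost of more distinct values to store.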

extract-cuts

Extract features for cuts in a given CUTSET manifest. The features are stored in STORAGE_PATH, and the output manifest with features is stored in OUTPUT_CUTSET.

lhotse feat extract-cuts [OPTIONS] CUTSET OUTPUT_CUTSET STORAGE_PATH

Options

-f, --feature-manifest <feature_manifest>

Optional manifest specifying feature extractor configuration.

--storage-type <storage_type>

Select a storage backend for the feature matrices.

Options

chunked_lilcom_hdf5 | lilcom_files | lilcom_hdf5 | lilcom_url | numpy_files | numpy_hdf5

-j, --num-jobs <num_jobs>

Number of parallel processes.

Arguments

CUTSET

Required argument

OUTPUT_CUTSET

Required argument

STORAGE_PATH

Required argument

extract-cuts-batch

Extract features for cuts in a given CUTSET manifest. The features are stored in STORAGE_PATH, and the output manifest with features is stored in OUTPUT_CUTSET.

This version enables CUDA acceleration for feature extractors that support it (e.g., kaldifeat extractors).

Example usage of kaldifeat fbank with CUDA:

$ pip install kaldifeat # note: ensure it’s compiled with CUDA

$ lhotse feat write-default-config -f kaldifeat-fbank feat.yml

$ sed 's/device: cpu/device: cuda/' feat.yml > feat-cuda.yml

$ lhotse feat extract-cuts-batch -f feat-cuda.yml cuts.jsonl cuts_with_feats.jsonl feats.h5

lhotse feat extract-cuts-batch [OPTIONS] CUTSET OUTPUT_CUTSET STORAGE_PATH

Options

-f, --feature-manifest <feature_manifest>

Optional manifest specifying feature extractor configuration. If you want to use CUDA, you should specify the device in this config.

--storage-type <storage_type>

Select a storage backend for the feature matrices.

Options

chunked_lilcom_hdf5 | lilcom_files | lilcom_hdf5 | lilcom_url | numpy_files | numpy_hdf5

-j, --num-jobs <num_jobs>

Number of dataloader workers.

-b, --batch-duration <batch_duration>

At most this many seconds of audio will be processed in each batch.

Arguments

CUTSET

Required argument

OUTPUT_CUTSET

Required argument

STORAGE_PATH

Required argument
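
The effect of --batch-duration can be pictured as grouping cuts so that each batch holds at most a fixed number of seconds of audio. The greedy grouping below is an assumption made for illustration, with a hypothetical function name; the actual batching strategy used by extract-cuts-batch may differ.

```python
def batch_by_duration(durations, batch_duration):
    """Greedily group consecutive durations into batches of limited total length.

    Assumption: a single cut longer than batch_duration is placed in a batch
    of its own rather than being dropped or split.
    """
    batches = []
    current = []
    total = 0.0
    for duration in durations:
        if current and total + duration > batch_duration:
            batches.append(current)
            current, total = [], 0.0
        current.append(duration)
        total += duration
    if current:
        batches.append(current)
    return batches


print(batch_by_duration([3, 4, 2, 6, 1], 7))
```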

upload

Read an existing FEATURE_MANIFEST, upload the feature matrices it contains to a URL location, and save a new feature OUTPUT_MANIFEST that refers to the uploaded features.

The URL can refer to endpoints such as AWS S3, GCP, Azure, etc. For example: “s3://my-bucket/my-features” is a valid URL.

This script does not currently support credentials, and assumes that you have the write permissions.

lhotse feat upload [OPTIONS] FEATURE_MANIFEST URL OUTPUT_MANIFEST

Options

-j, --num-jobs <num_jobs>

Arguments

FEATURE_MANIFEST

Required argument

URL

Required argument

OUTPUT_MANIFEST

Required argument

write-default-config

Save a default feature extraction config to OUTPUT_CONFIG.

lhotse feat write-default-config [OPTIONS] OUTPUT_CONFIG

Options

-f, --feature-type <feature_type>

Which feature extractor type to use.

Options

fbank | kaldi-fbank | kaldi-mfcc | kaldifeat-fbank | kaldifeat-mfcc | librosa-fbank | mfcc | spectrogram

Arguments

OUTPUT_CONFIG

Required argument

filter

Filter a MANIFEST according to the rule specified in PREDICATE, and save the result to OUTPUT_MANIFEST. It is intended to work generically with most manifest types - it supports RecordingSet, SupervisionSet and CutSet.

The PREDICATE specifies which attribute is used for item selection. Some examples:
lhotse filter 'duration>4.5' supervision.json output.json
lhotse filter 'num_frames<600' cuts.json output.json
lhotse filter 'start=0' cuts.json output.json
lhotse filter 'channel!=0' audio.json output.json

It currently only supports comparison of numerical manifest item attributes, such as: start, duration, end, channel, num_frames, num_features, etc.

lhotse filter [OPTIONS] PREDICATE MANIFEST OUTPUT_MANIFEST

Arguments

PREDICATE

Required argument

MANIFEST

Required argument

OUTPUT_MANIFEST

Required argument
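
A PREDICATE such as 'duration>4.5' combines an attribute name, a comparison operator and a number. The sketch below shows one way such a string could be parsed and applied; it is not Lhotse's parser. The function name is hypothetical, manifest items are modelled as dicts, and only the four operators that appear in the examples above are supported.

```python
import operator
import re

# Only the operators shown in the documented examples.
OPERATORS = {
    "<": operator.lt,
    ">": operator.gt,
    "=": operator.eq,
    "!=": operator.ne,
}

# "!=" is listed before "=" so that the longer operator wins.
PATTERN = re.compile(r"^\s*(\w+)\s*(!=|<|>|=)\s*(-?\d+(?:\.\d+)?)\s*$")


def parse_predicate(predicate):
    """Turn a string like 'duration>4.5' into a function of one item."""
    match = PATTERN.match(predicate)
    if match is None:
        raise ValueError(f"cannot parse predicate: {predicate!r}")
    attribute, op, value = match.groups()
    compare = OPERATORS[op]
    threshold = float(value)
    return lambda item: compare(item[attribute], threshold)


items = [
    {"id": "a", "duration": 3.0, "channel": 0},
    {"id": "b", "duration": 6.0, "channel": 1},
]
keep = parse_predicate("duration>4.5")
print([item["id"] for item in items if keep(item)])
```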

fix

Fix a pair of Lhotse RECORDINGS and SUPERVISIONS manifests. It removes supervisions without corresponding recordings and vice versa, trims the supervisions that exceed the recording, etc. Stores the output files in OUTPUT_DIR under the same names as the input files.

lhotse fix [OPTIONS] RECORDINGS SUPERVISIONS OUTPUT_DIR

Arguments

RECORDINGS

Required argument

SUPERVISIONS

Required argument

OUTPUT_DIR

Required argument
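
The kind of clean-up that fix performs can be sketched on toy data. This is a simplified illustration rather than Lhotse's logic: recordings are modelled as a dict from recording ID to duration, supervisions as dicts with hypothetical keys, and only the steps named in the description are covered. Dropping supervisions that start after the recording ends is an extra assumption.

```python
def fix_manifests(recordings, supervisions):
    """Return (recordings, supervisions) with orphans removed and overruns trimmed.

    recordings: {recording_id: duration_in_seconds}
    supervisions: list of dicts with "recording_id", "start" and "duration"
    """
    kept_supervisions = []
    for sup in supervisions:
        rec_duration = recordings.get(sup["recording_id"])
        if rec_duration is None:
            continue  # supervision without a corresponding recording
        if sup["start"] >= rec_duration:
            continue  # assumption: nothing left to keep inside the recording
        end = min(sup["start"] + sup["duration"], rec_duration)
        kept_supervisions.append({**sup, "duration": end - sup["start"]})
    # Keep only recordings that still have at least one supervision.
    used = {sup["recording_id"] for sup in kept_supervisions}
    kept_recordings = {rid: dur for rid, dur in recordings.items() if rid in used}
    return kept_recordings, kept_supervisions


recordings = {"r1": 10.0, "r2": 5.0}
supervisions = [
    {"id": "s1", "recording_id": "r1", "start": 8.0, "duration": 4.0},
    {"id": "s2", "recording_id": "r3", "start": 0.0, "duration": 1.0},
]
print(fix_manifests(recordings, supervisions))
```

In this toy input, s1 is trimmed from 4.0 to 2.0 seconds, s2 is dropped because r3 does not exist, and r2 is dropped because no supervision refers to it.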

install-sph2pipe

Install the sph2pipe program to handle sphere (.sph) audio files with “shorten” codec compression (needed for older LDC data).

It downloads an archive and then decompresses and compiles the contents.

lhotse install-sph2pipe [OPTIONS]

Options

--install-dir <install_dir>

Directory where sph2pipe will be downloaded and installed.

--url <url>

URL from which to download sph2pipe.

kaldi

Kaldi import/export related commands.

lhotse kaldi [OPTIONS] COMMAND [ARGS]...

export

Convert a pair of RecordingSet and SupervisionSet manifests into a Kaldi-style data directory.

lhotse kaldi export [OPTIONS] RECORDINGS SUPERVISIONS OUTPUT_DIR

Options

-u, --map-underscores-to <map_underscores_to>

Optional string with which we will replace all underscores. This helps avoid issues with Kaldi data dir sorting.

Arguments

RECORDINGS

Required argument

SUPERVISIONS

Required argument

OUTPUT_DIR

Required argument

import

Convert a Kaldi data dir DATA_DIR into a directory MANIFEST_DIR of lhotse manifests. Ignores feats.scp. The SAMPLING_RATE has to be explicitly specified as it is not available to read from DATA_DIR.

lhotse kaldi import [OPTIONS] DATA_DIR SAMPLING_RATE MANIFEST_DIR

Options

-f, --frame-shift <frame_shift>

Frame shift (in seconds) is required to support reading feats.scp.

-u, --map-string-to-underscores <map_string_to_underscores>

When specified, we will replace all instances of this string in SupervisionSegment IDs with underscores. This is to help with handling underscores in Kaldi (see ‘export_to_kaldi’).

-j, --num-jobs <num_jobs>

Number of jobs for computing recording durations.

Arguments

DATA_DIR

Required argument

SAMPLING_RATE

Required argument

MANIFEST_DIR

Required argument

prepare

Command group with data preparation recipes.

lhotse prepare [OPTIONS] COMMAND [ARGS]...

adept

ADEPT prosody transfer evaluation corpus data preparation.

lhotse prepare adept [OPTIONS] CORPUS_DIR OUTPUT_DIR

Arguments

CORPUS_DIR

Required argument

OUTPUT_DIR

Required argument

aishell

Aishell ASR data preparation.

lhotse prepare aishell [OPTIONS] CORPUS_DIR OUTPUT_DIR

Arguments

CORPUS_DIR

Required argument

OUTPUT_DIR

Required argument

aishell4

AISHELL-4 data preparation.

lhotse prepare aishell4 [OPTIONS] CORPUS_DIR OUTPUT_DIR

Arguments

CORPUS_DIR

Required argument

OUTPUT_DIR

Required argument

ami

AMI data preparation.

lhotse prepare ami [OPTIONS] CORPUS_DIR OUTPUT_DIR

Options

--annotations <annotations>

Provide if the annotations were downloaded to a different directory than the corpus.

--mic <mic>

AMI microphone setting.

Options

ihm | ihm-mix | sdm | mdm

--partition <partition>

Data partition to use (see http://groups.inf.ed.ac.uk/ami/corpus/datasets.shtml).

Options

scenario-only | full-corpus | full-corpus-asr

--normalize-text

If set, convert all text annotations to upper case (similar to Kaldi)

Arguments

CORPUS_DIR

Required argument

OUTPUT_DIR

Required argument

babel

This is a data preparation recipe for the IARPA BABEL corpus (see: https://www.iarpa.gov/index.php/research-programs/babel). It should support all of the languages available in BABEL. It will prepare the data from the “conversational” part of BABEL.

This script should be invoked separately for each language you want to prepare, e.g.:

$ lhotse prepare babel /export/corpora5/Babel/IARPA_BABEL_BP_101 data/cantonese

$ lhotse prepare babel /export/corpora5/Babel/BABEL_OP1_103 data/bengali

lhotse prepare babel [OPTIONS] CORPUS_DIR OUTPUT_DIR

Arguments

CORPUS_DIR

Required argument

OUTPUT_DIR

Required argument

broadcast-news

English Broadcast News 1997 data preparation. It will output three manifests: for recordings, topic sections, and speech segments. It supports the following LDC distributions:

* 1997 English Broadcast News Train (HUB4)
Speech LDC98S71
Transcripts LDC98T28

This data is not available for free - your institution needs to have an LDC subscription.

lhotse prepare broadcast-news [OPTIONS] AUDIO_DIR TRANSCRIPT_DIR OUTPUT_DIR

Arguments

AUDIO_DIR

Required argument

TRANSCRIPT_DIR

Required argument

OUTPUT_DIR

Required argument

callhome-egyptian

About the Callhome Egyptian Arabic Corpus

The CALLHOME Egyptian Arabic corpus of telephone speech consists of 120 unscripted telephone conversations between native speakers of Egyptian Colloquial Arabic (ECA), the spoken variety of Arabic found in Egypt. The dialect of ECA that this dictionary represents is Cairene Arabic.

This recipe uses the speech and transcripts available through LDC. In addition, an Egyptian Arabic phonetic lexicon (available via LDC) can be used to get word-to-phoneme mappings for the vocabulary. These datasets are:

Speech: LDC97S45
Transcripts: LDC97T19
Lexicon: LDC99L22 (unused here)

To actually read the audio, you will need the SPH2PIPE binary: you can provide its path, so that we will add it in the manifests (otherwise you might need to modify your PATH environment variable to find sph2pipe).

lhotse prepare callhome-egyptian [OPTIONS] AUDIO_DIR TRANSCRIPT_DIR OUTPUT_DIR

Options

--absolute-paths <absolute_paths>

Whether to return absolute or relative (to the corpus dir) paths for recordings.

Arguments

AUDIO_DIR

Required argument

TRANSCRIPT_DIR

Required argument

OUTPUT_DIR

Required argument

callhome-english

CallHome American English corpus preparation.

Depending on the value of transcript_dir, this will prepare either:
* if transcript_dir = None, the SRE task (expected corpus LDC2001S97). The setup will reflect speaker diarization on a portion of CALLHOME used in the 2000 NIST speaker recognition evaluation. The 2000 NIST SRE is required, and has an LDC catalog number LDC2001S97. The data is not available for free, but can be licensed from the LDC (Linguistic Data Consortium).
* otherwise, data for the ASR task (expected LDC corpora LDC97S42 and LDC97T14). The data is not available for free, but can be licensed from the LDC (Linguistic Data Consortium).

The data should be located at AUDIO_DIR. Optionally, for the SRE task, RTTM_DIR can be provided that has the contents of http://www.openslr.org/resources/10/; otherwise, we will download it.

To actually read the audio, you will need the SPH2PIPE binary: you can provide its path, so that we will add it in the manifests (otherwise you might need to modify your PATH environment variable to find sph2pipe).

Example:

lhotse prepare callhome-english /export/corpora5/LDC/LDC97S42 --transcript-dir /export/corpora5/LDC/LDC97T14 ./callhome_asr

or

lhotse prepare callhome-english /export/corpora5/LDC/LDC2001S97 ./callhome_sre

lhotse prepare callhome-english [OPTIONS] AUDIO_DIR OUTPUT_DIR

Options

--rttm-dir <rttm_dir>
--absolute-paths <absolute_paths>

Whether to return absolute or relative (to the corpus dir) paths for recordings.

--transcript-dir <transcript_dir>

Path to the LDC97T14 corpus. Note that when this path is provided, the ASR corpus will be prepared, not the SRE corpus!

Arguments

AUDIO_DIR

Required argument

OUTPUT_DIR

Required argument

cmu-arctic

CMU Arctic data preparation.

lhotse prepare cmu-arctic [OPTIONS] CORPUS_DIR OUTPUT_DIR

Arguments

CORPUS_DIR

Required argument

OUTPUT_DIR

Required argument

cmu-indic

CMU Indic data preparation.

lhotse prepare cmu-indic [OPTIONS] CORPUS_DIR OUTPUT_DIR

Arguments

CORPUS_DIR

Required argument

OUTPUT_DIR

Required argument

cmu-kids

CMU Kids corpus data preparation.

lhotse prepare cmu-kids [OPTIONS] CORPUS_DIR OUTPUT_DIR

Options

--absolute-paths <absolute_paths>

Use absolute paths for recordings

Arguments

CORPUS_DIR

Required argument

OUTPUT_DIR

Required argument

commonvoice

Mozilla CommonVoice manifest preparation script. CORPUS_DIR is expected to contain sub-directories that are named with CommonVoice language codes, e.g., “en”, “pl”, etc.

lhotse prepare commonvoice [OPTIONS] CORPUS_DIR OUTPUT_DIR

Options

-l, --language <language>

Languages to prepare (scans CORPUS_DIR for language codes by default).

-s, --split <split>

Splits to prepare (available options: train, dev, test, validated, invalidated, other)

-j, --num-jobs <num_jobs>

How many threads to use (can give good speed-ups with slow disks).

Arguments

CORPUS_DIR

Required argument

OUTPUT_DIR

Required argument

cslu-kids

CSLU Kids corpus data preparation.

lhotse prepare cslu-kids [OPTIONS] CORPUS_DIR OUTPUT_DIR

Options

--absolute-paths <absolute_paths>

Use absolute paths for recordings

--normalize-text <normalize_text>

Remove noise tags (<bn>, <bs>) from spontaneous speech transcripts

Arguments

CORPUS_DIR

Required argument

OUTPUT_DIR

Required argument

dihard3

DIHARD3 data preparation.

lhotse prepare dihard3 [OPTIONS] OUTPUT_DIR

Options

--dev <dev>
--eval <eval>
--uem, --no-uem

Specify whether or not to create UEM supervision

-j, --num-jobs <num_jobs>

Number of jobs to scan corpus directory for recordings.

Arguments

OUTPUT_DIR

Required argument

fisher-english

The Fisher Spanish corpus preparation.

This is conversational telephone speech collected as 2-channel μ-law, 8kHz-sampled data.
The LDC catalog numbers are LDC2010S01 for the audio corpus and LDC2010T04 for the transcripts.

This data is not available for free - your institution needs to have an LDC subscription.

lhotse prepare fisher-english [OPTIONS] AUDIO_DIR TRANSCRIPT_DIR OUTPUT_DIR

Options

--absolute-paths <absolute_paths>

Whether to return absolute or relative (to the corpus dir) paths for recordings.

Arguments

AUDIO_DIR

Required argument

TRANSCRIPT_DIR

Required argument

OUTPUT_DIR

Required argument

gale-arabic

GALE Arabic Phases 1 to 4 Broadcast news and conversation data preparation.

lhotse prepare gale-arabic [OPTIONS] OUTPUT_DIR

Options

-s, --audio <audio>

Paths to audio dirs, e.g., LDC2013S02. Multiple corpora can be provided by repeating -s.

-t, --transcript <transcript>

Paths to transcript dirs, e.g., LDC2013T17. Multiple corpora can be provided by repeating -t

--absolute-paths <absolute_paths>

Use absolute paths for recordings

Arguments

OUTPUT_DIR

Required argument

gale-mandarin

GALE Mandarin Broadcast speech data preparation.

lhotse prepare gale-mandarin [OPTIONS] OUTPUT_DIR

Options

-s, --audio <audio>

Paths to audio dirs, e.g., LDC2013S08. Multiple corpora can be provided by repeating -s.

-t, --transcript <transcript>

Paths to transcript dirs, e.g., LDC2013T20. Multiple corpora can be provided by repeating -t

--absolute-paths <absolute_paths>

Use absolute paths for recordings

--segment-words <segment_words>

Use ‘jieba’ package to perform word segmentation on the text

Arguments

OUTPUT_DIR

Required argument

gigaspeech

Gigaspeech ASR data preparation.

lhotse prepare gigaspeech [OPTIONS] CORPUS_DIR OUTPUT_DIR

Options

--subset <subset>

Which parts of Gigaspeech to prepare (by default XL + DEV + TEST).

Options

auto | XL | L | M | S | XS | DEV | TEST

-j, --num-jobs <num_jobs>

How many threads to use (can give good speed-ups with slow disks).

Arguments

CORPUS_DIR

Required argument

OUTPUT_DIR

Required argument

heroico

heroico Answers ASR data preparation.

lhotse prepare heroico [OPTIONS] SPEECH_DIR TRANSCRIPT_DIR OUTPUT_DIR

Arguments

SPEECH_DIR

Required argument

TRANSCRIPT_DIR

Required argument

OUTPUT_DIR

Required argument

hifitts

HiFiTTS data preparation.

lhotse prepare hifitts [OPTIONS] CORPUS_DIR OUTPUT_DIR

Options

-j, --num-jobs <num_jobs>

How many jobs to use (can give good speed-ups with slow disks).

Arguments

CORPUS_DIR

Required argument

OUTPUT_DIR

Required argument

l2-arctic

L2 Arctic data preparation.

lhotse prepare l2-arctic [OPTIONS] CORPUS_DIR OUTPUT_DIR

Arguments

CORPUS_DIR

Required argument

OUTPUT_DIR

Required argument

libricss

LibriCSS recording and supervision manifest preparation.

lhotse prepare libricss [OPTIONS] CORPUS_DIR OUTPUT_DIR

Options

--type <type>

Type of the corpus to prepare

Options

replay | mix

Arguments

CORPUS_DIR

Required argument

OUTPUT_DIR

Required argument

librimix

LibriMix source separation data preparation.

lhotse prepare librimix [OPTIONS] LIBRIMIX_CSV OUTPUT_DIR

Options

--sampling-rate <sampling_rate>

Sampling rate to set in the RecordingSet manifest.

--min-segment-seconds <min_segment_seconds>

Remove segments shorter than MIN_SEGMENT_SECONDS.

--with-precomputed-mixtures, --no-precomputed-mixtures

Optionally create a RecordingSet manifest including the precomputed LibriMix mixtures.

Arguments

LIBRIMIX_CSV

Required argument

OUTPUT_DIR

Required argument

librispeech

(Mini) Librispeech ASR data preparation.

lhotse prepare librispeech [OPTIONS] CORPUS_DIR OUTPUT_DIR

Options

-j, --num-jobs <num_jobs>

How many threads to use (can give good speed-ups with slow disks).

Arguments

CORPUS_DIR

Required argument

OUTPUT_DIR

Required argument

libritts

LibriTTS data preparation.

lhotse prepare libritts [OPTIONS] CORPUS_DIR OUTPUT_DIR

Options

-j, --num-jobs <num_jobs>

How many jobs to use (can give good speed-ups with slow disks).

Arguments

CORPUS_DIR

Required argument

OUTPUT_DIR

Required argument

ljspeech

LJSpeech data preparation.

lhotse prepare ljspeech [OPTIONS] CORPUS_DIR OUTPUT_DIR

Arguments

CORPUS_DIR

Required argument

OUTPUT_DIR

Required argument

mls

Multilingual Librispeech (MLS) data preparation.

Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages - English, German, Dutch, Spanish, French, Italian, Portuguese, Polish. It is available at OpenSLR: http://openslr.org/94

lhotse prepare mls [OPTIONS] CORPUS_DIR OUTPUT_DIR

Options

--opus, --flac

Which codec should be used (OPUS or FLAC)

-j, --num-jobs <num_jobs>

How many threads to use (can give good speed-ups with slow disks).

Arguments

CORPUS_DIR

Required argument

OUTPUT_DIR

Required argument

mtedx

MTEDx ASR data preparation.

lhotse prepare mtedx [OPTIONS] CORPUS_DIR OUTPUT_DIR

Options

-j, --num-jobs <num_jobs>

How many threads to use (can give good speed-ups with slow disks).

-l, --lang <lang>

Specify which languages to prepare, e.g., lhotse prepare mtedx mtedx_corpus data -l de -l fr -l es

Arguments

CORPUS_DIR

Required argument

OUTPUT_DIR

Required argument

musan

MUSAN data preparation.

lhotse prepare musan [OPTIONS] CORPUS_DIR OUTPUT_DIR

Options

--use-vocals, --no-vocals

Whether to include vocal music in “music” part.

Arguments

CORPUS_DIR

Required argument

OUTPUT_DIR

Required argument

nsc

This is a data preparation recipe for the National Corpus of Speech in Singaporean English.

lhotse prepare nsc [OPTIONS] CORPUS_DIR OUTPUT_DIR

Options

-p, --dataset-part <dataset_part>

Which part of NSC should be prepared

Options

PART3_SameCloseMic | PART3_SeparateIVR

Arguments

CORPUS_DIR

Required argument

OUTPUT_DIR

Required argument

switchboard

The Switchboard corpus preparation.

This is conversational telephone speech collected as 2-channel, 8kHz-sampled
data. We are using just the Switchboard-1 Phase 1 training data.
The catalog number LDC97S62 (Switchboard-1 Release 2) corresponds, we believe,
to what we have. We also use the Mississippi State transcriptions, which
we download separately from

This data is not available for free - your institution needs to have an LDC subscription.

lhotse prepare switchboard [OPTIONS] AUDIO_DIR OUTPUT_DIR

Options

--transcript-dir <transcript_dir>
--sentiment-dir <sentiment_dir>

Optional path to LDC2020T14 package with sentiment annotations for SWBD.

--omit-silence, --retain-silence

Should the [silence] segments be kept.

--absolute-paths <absolute_paths>

Whether to return absolute or relative (to the corpus dir) paths for recordings.

Arguments

AUDIO_DIR

Required argument

OUTPUT_DIR

Required argument

tedlium

TED-LIUM v3 recording and supervision manifest preparation.

lhotse prepare tedlium [OPTIONS] TEDLIUM_DIR OUTPUT_DIR

Arguments

TEDLIUM_DIR

Required argument

OUTPUT_DIR

Required argument

timit

TIMIT data preparation. CORPUS_DIR is the path of the data dir; OUTPUT_DIR is the path where the manifests are written.

lhotse prepare timit [OPTIONS] CORPUS_DIR OUTPUT_DIR

Options

-p, --num-phones <num_phones>

The number of phones (60, 48 or 39) for modeling. The default value is 48.

-j, --num-jobs <num_jobs>

How many threads to use (can give good speed-ups with slow disks).

Arguments

CORPUS_DIR

Required argument

OUTPUT_DIR

Required argument

vctk

VCTK data preparation.

lhotse prepare vctk [OPTIONS] CORPUS_DIR OUTPUT_DIR

Arguments

CORPUS_DIR

Required argument

OUTPUT_DIR

Required argument

yesno

yes_no data preparation.

lhotse prepare yesno [OPTIONS] CORPUS_DIR OUTPUT_DIR

Arguments

CORPUS_DIR

Required argument

OUTPUT_DIR

Required argument

split

Load MANIFEST, split it into NUM_SPLITS equal parts and save as separate manifests in OUTPUT_DIR.

lhotse split [OPTIONS] NUM_SPLITS MANIFEST OUTPUT_DIR

Options

-s, --shuffle

Optionally shuffle the sequence before splitting.

Arguments

NUM_SPLITS

Required argument

MANIFEST

Required argument

OUTPUT_DIR

Required argument
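
Splitting a manifest into NUM_SPLITS parts, with an optional shuffle beforehand, can be sketched as follows. This is not Lhotse's implementation: split_items is a hypothetical helper, and the way a remainder is distributed (earlier parts receive one extra item) is an assumption, since the description only says "equal parts".

```python
import random


def split_items(items, num_splits, shuffle=False, rng=None):
    """Split a sequence into num_splits contiguous parts of near-equal size."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    items = list(items)
    if shuffle:
        (rng or random).shuffle(items)
    base, extra = divmod(len(items), num_splits)
    parts = []
    start = 0
    for index in range(num_splits):
        # Assumption: the first `extra` parts absorb the remainder.
        size = base + (1 if index < extra else 0)
        parts.append(items[start : start + size])
        start += size
    return parts


print(split_items(range(7), 3))
```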

subset

Load MANIFEST, select the FIRST or LAST number of items and store it in OUTPUT_MANIFEST.

lhotse subset [OPTIONS] MANIFEST OUTPUT_MANIFEST

Options

--first <first>
--last <last>
--cutids <cutids>

A JSON string or a path to a JSON file containing an array of cut ID strings, e.g. --cutids '["cutid1", "cutid2"]'.

Arguments

MANIFEST

Required argument

OUTPUT_MANIFEST

Required argument
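
The three selection modes of subset can be sketched on a list of toy items. The code is illustrative only: load_cut_ids and subset are hypothetical helpers, items are plain dicts with an "id" key, and treating any value that does not start with "[" as a file path is an assumption about how the --cutids value might be interpreted. Requiring exactly one selection mode is also an assumption.

```python
import json
from pathlib import Path


def load_cut_ids(cutids):
    """Accept either an inline JSON array or a path to a JSON file."""
    text = cutids.strip()
    if not text.startswith("["):
        # Not an inline JSON array, so treat the argument as a file path.
        text = Path(cutids).read_text(encoding="utf-8")
    ids = json.loads(text)
    if not isinstance(ids, list) or not all(isinstance(i, str) for i in ids):
        raise ValueError("expected a JSON array of strings")
    return ids


def subset(items, first=None, last=None, cut_ids=None):
    """Select the first N items, the last N items, or the items with given IDs."""
    chosen = [opt for opt in (first, last, cut_ids) if opt is not None]
    if len(chosen) != 1:
        raise ValueError("provide exactly one of first, last or cut_ids")
    if first is not None:
        return items[:first]
    if last is not None:
        return items[max(0, len(items) - last):]
    wanted = set(cut_ids)
    return [item for item in items if item["id"] in wanted]


items = [{"id": "c1"}, {"id": "c2"}, {"id": "c3"}]
print(subset(items, first=2))
print(subset(items, cut_ids=load_cut_ids('["c1", "c3"]')))
```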

validate

Validate a Lhotse manifest file.

lhotse validate [OPTIONS] MANIFEST

Options

--read-data, --dont-read-data

Should the audio/features data be read from disk to perform additional checks (could be extremely slow for large manifests).

Arguments

MANIFEST

Required argument