pipeline

pipelines.pipeline(
    vad_model,
    emissions_model,
    processor,
    audio_paths,
    audio_dir,
    speeches=None,
    sample_rate=16000,
    chunk_size=30,
    alignment_strategy='speech',
    text_normalizer_fn=text_normalizer,
    tokenizer=None,
    start_wildcard=False,
    end_wildcard=False,
    blank_id=0,
    word_boundary='|',
    indent=2,
    ndigits=5,
    batch_size_files=1,
    num_workers_files=2,
    prefetch_factor_files=1,
    batch_size_features=8,
    num_workers_features=4,
    streaming=True,
    save_json=True,
    save_msgpack=False,
    save_emissions=True,
    return_alignments=False,
    delete_emissions=False,
    output_vad_dir='output/vad',
    output_emissions_dir='output/emissions',
    output_alignments_dir='output/alignments',
    device='cuda',
)

Complete pipeline to run VAD, extract emissions, and perform alignment.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `vad_model` | object | The loaded VAD model. | required |
| `emissions_model` | object | The loaded ASR model. | required |
| `processor` | Wav2Vec2Processor | Wav2Vec2Processor used to preprocess the audio. | required |
| `audio_paths` | list | List of paths to audio files (relative to `audio_dir`). | required |
| `audio_dir` | str | Base directory that the entries in `audio_paths` are relative to. | required |
| `speeches` | list[list[SpeechSegment]] or None | List of SpeechSegment objects to run VAD and alignment only on specific segments of the audio. If `alignment_strategy` is `'speech'`, the text must be supplied in the SpeechSegment objects. If `alignment_strategy` is `'chunk'` and ASR transcriptions are used, no text needs to be supplied in the SpeechSegment objects. | `None` |
| `sample_rate` | int | Sample rate to resample audio to. | `16000` |
| `chunk_size` | int | When `alignment_strategy` is set to `'speech'`, SpeechSegments are split into `chunk_size`-sized chunks for feature extraction. | `30` |
| `alignment_strategy` | str | Strategy for aligning features to text. One of `'speech'` or `'chunk'`. If `'speech'`, audio is split into `chunk_size`-sized chunks based on SpeechSegments. If `'chunk'`, VAD chunks are used as the basis for feature extraction and alignment. Note: `'chunk'` currently works only with ASR transcriptions, because the individual VAD chunks do not carry the text needed for alignment. | `"speech"` |
| `text_normalizer_fn` | callable | Function to normalize text according to regex rules. | `text_normalizer` |
| `tokenizer` | object | Optional tokenizer for custom segmentation of text (e.g. sentence or paragraph segmentation). The tokenizer should either (i) be a PunktTokenizer from nltk, or (ii) directly return a list of spans `(start_char, end_char)` when called on a string. | `None` |
| `start_wildcard` | bool | Whether to add a wildcard token at the start of the segments. | `False` |
| `end_wildcard` | bool | Whether to add a wildcard token at the end of the segments. | `False` |
| `blank_id` | int | ID of the blank token in the tokenizer. | `0` |
| `word_boundary` | str | Token indicating word boundaries in the tokenizer. | `"\|"` |
| `indent` | int | Indentation level for saved JSON files. Pass `None` to disable pretty formatting. | `2` |
| `ndigits` | int | Number of decimal digits to round the alignment times and scores to. | `5` |
| `batch_size_files` | int | Batch size for the file DataLoader. | `1` |
| `num_workers_files` | int | Number of workers for the file DataLoader. | `2` |
| `prefetch_factor_files` | int | Prefetch factor for the file DataLoader. | `1` |
| `batch_size_features` | int | Batch size for the feature DataLoader. | `8` |
| `num_workers_features` | int | Number of workers for the feature DataLoader. | `4` |
| `streaming` | bool | Whether to use streaming loading of audio files. | `True` |
| `save_json` | bool | Whether to save the output files as JSON. | `True` |
| `save_msgpack` | bool | Whether to save the output files as Msgpack. | `False` |
| `save_emissions` | bool | Whether to save the raw emissions as .npy files. | `True` |
| `return_alignments` | bool | Whether to return the alignment mappings. | `False` |
| `delete_emissions` | bool | Whether to delete the emissions files after alignment to save space. | `False` |
| `output_vad_dir` | str | Directory to save the VAD output files. | `"output/vad"` |
| `output_emissions_dir` | str | Directory to save the emissions output files. | `"output/emissions"` |
| `output_alignments_dir` | str | Directory to save the alignment output files. | `"output/alignments"` |
| `device` | str | Device to run the alignment on (e.g. `"cuda"` or `"cpu"`). | `"cuda"` |
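
The tokenizer parameter accepts any callable that returns a list of (start_char, end_char) spans when called on a string. The following is a minimal sketch of such a callable, not part of the library: the name `sentence_spans` and the regular expression are illustrative assumptions, and text with abbreviations or decimal numbers would need a more careful splitter.

```python
import re

# Hypothetical helper (not part of the library): splits text into
# sentence-like units and returns their character spans.
_SENTENCE_RE = re.compile(r"[^.!?\s][^.!?]*[.!?]*")


def sentence_spans(text):
    """Return a list of (start_char, end_char) tuples, one per sentence-like unit.

    Each unit starts at a non-space, non-terminator character and runs up to
    and including the terminators (., !, ?) that follow it.
    """
    return [match.span() for match in _SENTENCE_RE.finditer(text)]


text = "Hello world. How are you? Fine"
for start, end in sentence_spans(text):
    # Slicing the original string with a span recovers the segment text.
    print((start, end), repr(text[start:end]))
```

A callable like this could then be passed as tokenizer=sentence_spans, assuming the pipeline only needs the spans and calls the tokenizer with the text as its single argument.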

Returns

| Name | Type | Description |
| --- | --- | --- |
| | list[list[SpeechSegment]] or None | If `return_alignments` is True, returns a list of alignment mappings for each audio file. Otherwise, returns None (the alignments are saved to disk only). |
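
The sketch below shows one way to build audio_paths relative to audio_dir and to handle the None return value. `align_directory`, its `pattern` argument, and the `overrides` mechanism are hypothetical names introduced here for illustration. The pipeline function is passed in as `pipeline_fn` so that no import path is assumed, and the sketch assumes the documented parameters can all be passed by keyword.

```python
from pathlib import Path


def align_directory(pipeline_fn, vad_model, emissions_model, processor,
                    audio_dir, pattern="*.wav", **overrides):
    """Hypothetical wrapper: collect audio files under audio_dir and run the pipeline.

    pipeline_fn is expected to be pipelines.pipeline; keyword names below are
    taken from the documented signature.
    """
    base = Path(audio_dir)
    # audio_paths must be relative to audio_dir, so strip the base directory.
    audio_paths = sorted(p.relative_to(base).as_posix() for p in base.rglob(pattern))

    kwargs = dict(
        vad_model=vad_model,
        emissions_model=emissions_model,
        processor=processor,
        audio_paths=audio_paths,
        audio_dir=str(base),
        return_alignments=True,  # request the alignments instead of None
    )
    # Caller-supplied settings win, e.g. speeches=..., device="cpu".
    kwargs.update(overrides)

    alignments = pipeline_fn(**kwargs)
    # With return_alignments=False the pipeline returns None; normalise to a list.
    return audio_paths, (alignments if alignments is not None else [])
```

With the default alignment_strategy of 'speech', the text has to come from the SpeechSegment objects, so a real call would also pass speeches=... (or alignment_strategy='chunk' with ASR transcriptions) through the keyword overrides.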