Complete pipeline to run VAD, extract emissions, and perform alignment.
Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `vad_model` | object | The loaded VAD model. | required |
| `emissions_model` | object | The loaded ASR model. | required |
| `processor` | Wav2Vec2Processor | `Wav2Vec2Processor` used to preprocess the audio. | required |
| `audio_paths` | list | List of paths to audio files (relative to `audio_dir`). | required |
| `audio_dir` | str | Base directory containing the audio files; the entries in `audio_paths` are resolved relative to it. | required |
| `speeches` | list[list[SpeechSegment]] or None | List of `SpeechSegment` objects to run VAD and alignment only on specific segments of the audio. If `alignment_strategy` is `'speech'`, the text must be supplied in the `SpeechSegment` objects. If `alignment_strategy` is `'chunk'` and ASR transcriptions are used, the `SpeechSegment` objects do not need to contain text. | None |
| `sample_rate` | int | Sample rate to resample the audio to. | 16000 |
| `chunk_size` | int | When `alignment_strategy` is `'speech'`, `SpeechSegment`s are split into `chunk_size`-sized chunks for feature extraction. | 30 |
| `alignment_strategy` | str | Strategy for aligning features to text; one of `'speech'` or `'chunk'`. With `'speech'`, the audio is split into `chunk_size`-sized chunks based on `SpeechSegment`s. With `'chunk'`, the VAD chunks are used as the basis for feature extraction and alignment. NOTE: `'chunk'` currently only works with ASR, since the individual VAD chunks do not contain the text information needed for alignment. | "speech" |
| `text_normalizer_fn` | callable | Function to normalize text according to regex rules. | text_normalizer |
| `tokenizer` | object | Optional tokenizer for custom segmentation of text (e.g. sentence or paragraph segmentation). The tokenizer should either i) be a `PunktTokenizer` from nltk, or ii) directly return a list of `(start_char, end_char)` spans when called on a string. | None |
| `start_wildcard` | bool | Whether to add a wildcard token at the start of the segments. | False |
| `end_wildcard` | bool | Whether to add a wildcard token at the end of the segments. | False |
| `blank_id` | int | ID of the blank token in the tokenizer. | 0 |
| `word_boundary` | str | Token indicating word boundaries in the tokenizer. | "\|" |
| `indent` | int | Indentation level for saved JSON files. None disables pretty formatting. | 2 |
| `ndigits` | int | Number of decimal digits to round the alignment times and scores to. | 5 |
| `batch_size_files` | int | Batch size for the file DataLoader. | 1 |
| `num_workers_files` | int | Number of workers for the file DataLoader. | 2 |
| `prefetch_factor_files` | int | Prefetch factor for the file DataLoader. | 1 |
| `batch_size_features` | int | Batch size for the feature DataLoader. | 8 |
| `num_workers_features` | int | Number of workers for the feature DataLoader. | 4 |
| `streaming` | bool | Whether to stream the audio files when loading them. | False |
| `save_json` | bool | Whether to save the output files as JSON. | True |
| `save_msgpack` | bool | Whether to save the output files as Msgpack. | False |
| `save_emissions` | bool | Whether to save the raw emissions as .npy files. | True |
| `return_alignments` | bool | Whether to return the alignment mappings. | False |
| `delete_emissions` | bool | Whether to delete the emissions files after alignment to save space. | False |
| `output_vad_dir` | str | Directory to save the VAD output files. | "output/vad" |
| `output_emissions_dir` | str | Directory to save the emissions output files. | "output/emissions" |
| `output_alignments_dir` | str | Directory to save the alignment output files. | "output/alignments" |
| `device` | str | Device to run the alignment on (e.g. "cuda" or "cpu"). | "cuda" |
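The `text_normalizer_fn` parameter accepts any plain `str -> str` callable; the default is the library-provided `text_normalizer`. As an illustration of the expected interface only, here is a minimal hedged sketch of a regex-based normalizer (the function name and the specific rules are hypothetical, not the library's defaults):

```python
import re

def simple_text_normalizer(text: str) -> str:
    """Hypothetical normalizer: lowercase, strip punctuation, collapse whitespace.

    The library's default text_normalizer may apply different regex rules;
    any str -> str callable can be passed as text_normalizer_fn.
    """
    text = text.lower()
    # Replace everything that is not a word character, whitespace, or
    # apostrophe with a space (keeps contractions like "it's" intact).
    text = re.sub(r"[^\w\s']", " ", text)
    # Collapse runs of whitespace into single spaces and trim the ends.
    return re.sub(r"\s+", " ", text).strip()
```

For example, `simple_text_normalizer("Hello, World!")` yields `"hello world"`. Keeping the normalizer a pure function makes it easy to swap in language-specific rules without touching the rest of the pipeline.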
Returns

| Name | Type | Description |
|------|------|-------------|
|  | list[list[SpeechSegment]] or None | If `return_alignments` is True, returns a list of alignment mappings for each audio file. Otherwise returns None (the alignments are only saved to disk). |
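As noted in the parameters table, a custom `tokenizer` only needs to return a list of `(start_char, end_char)` spans when called on a string (option ii). A minimal sketch of such a callable, using a naive regex-based sentence splitter (the function name and splitting rules are illustrative assumptions, not part of the library):

```python
import re

def sentence_spans(text: str) -> list[tuple[int, int]]:
    """Hypothetical tokenizer: return (start_char, end_char) sentence spans.

    Splits after '.', '!' or '?'. Any callable with this return shape
    (or an nltk PunktTokenizer) can be passed as `tokenizer`.
    """
    spans = []
    start = 0
    for match in re.finditer(r"[.!?](?:\s+|$)", text):
        end = match.start() + 1      # include the terminal punctuation mark
        spans.append((start, end))
        start = match.end()          # skip the whitespace between sentences
    if start < len(text):            # trailing text without end punctuation
        spans.append((start, len(text)))
    return spans
```

For `"Hello world. How are you?"` this returns `[(0, 12), (13, 25)]`, i.e. one span per sentence, with whitespace between sentences excluded from the spans.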