Complete pipeline to run VAD, extract emissions, and perform alignment.
Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `vad_model` | object | The loaded VAD model. | required |
| `emissions_model` | object | The loaded ASR model. | required |
| `processor` | Wav2Vec2Processor | `Wav2Vec2Processor` used to preprocess the audio. | required |
| `audio_paths` | list | List of paths to audio files (relative to `audio_dir`). | required |
| `audio_dir` | str | Base directory containing the audio files; the entries in `audio_paths` are resolved relative to it. | required |
| `speeches` | list[list[SpeechSegment]] or None | List of `SpeechSegment` objects to run VAD and alignment only on specific segments of the audio. If `alignment_strategy` is `'speech'`, the text must be supplied in the `SpeechSegment` objects. If `alignment_strategy` is `'chunk'` and ASR transcriptions are used, the `SpeechSegment` objects do not need to contain text. | None |
| `sample_rate` | int | Sample rate to resample the audio to. | 16000 |
| `chunk_size` | int | When `alignment_strategy` is `'speech'`, `SpeechSegment`s are split into `chunk_size`-sized chunks for feature extraction. | 30 |
| `alignment_strategy` | str | Strategy for aligning features to text; one of `'speech'` or `'chunk'`. With `'speech'`, the audio is split into `chunk_size`-sized chunks based on `SpeechSegment`s. With `'chunk'`, the VAD chunks are used as the basis for feature extraction and alignment. NOTE: `'chunk'` currently only works with ASR, since the individual VAD chunks do not contain the text information needed for alignment. | "speech" |
| `text_normalizer_fn` | callable | Function to normalize text according to regex rules. | text_normalizer |
| `tokenizer` | object | Optional tokenizer for custom segmentation of text (e.g. sentence or paragraph segmentation). The tokenizer should either i) be a `PunktTokenizer` from nltk, or ii) directly return a list of `(start_char, end_char)` spans when called on a string. | None |
| `start_wildcard` | bool | Whether to add a wildcard token at the start of the segments. | False |
| `end_wildcard` | bool | Whether to add a wildcard token at the end of the segments. | False |
| `blank_id` | int | ID of the blank token in the tokenizer. | 0 |
| `word_boundary` | str | Token indicating word boundaries in the tokenizer. | "\|" |
| `indent` | int | Indentation level for saved JSON files. None disables pretty formatting. | 2 |
| `ndigits` | int | Number of decimal digits to round the alignment times and scores to. | 5 |
| `batch_size_files` | int | Batch size for the file DataLoader. | 1 |
| `num_workers_files` | int | Number of workers for the file DataLoader. | 2 |
| `prefetch_factor_files` | int | Prefetch factor for the file DataLoader. | 1 |
| `batch_size_features` | int | Batch size for the feature DataLoader. | 8 |
| `num_workers_features` | int | Number of workers for the feature DataLoader. | 4 |
| `streaming` | bool | Whether to stream the audio files when loading them. | False |
| `save_json` | bool | Whether to save the output files as JSON. | True |
| `save_msgpack` | bool | Whether to save the output files as Msgpack. | False |
| `save_emissions` | bool | Whether to save the raw emissions as .npy files. | True |
| `return_alignments` | bool | Whether to return the alignment mappings. | False |
| `delete_emissions` | bool | Whether to delete the emissions files after alignment to save space. | False |
| `output_vad_dir` | str | Directory to save the VAD output files. | "output/vad" |
| `output_emissions_dir` | str | Directory to save the emissions output files. | "output/emissions" |
| `output_alignments_dir` | str | Directory to save the alignment output files. | "output/alignments" |
| `device` | str | Device to run the alignment on (e.g. "cuda" or "cpu"). | "cuda" |
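The `text_normalizer_fn` parameter accepts any plain `str -> str` callable; the default is the library-provided `text_normalizer`. As an illustration of the expected interface only, here is a minimal hedged sketch of a regex-based normalizer (the function name and the specific rules are hypothetical, not the library's defaults):

```python
import re

def simple_text_normalizer(text: str) -> str:
    """Hypothetical normalizer: lowercase, strip punctuation, collapse whitespace.

    The library's default text_normalizer may apply different regex rules;
    any str -> str callable can be passed as text_normalizer_fn.
    """
    text = text.lower()
    # Replace everything that is not a word character, whitespace, or
    # apostrophe with a space (keeps contractions like "it's" intact).
    text = re.sub(r"[^\w\s']", " ", text)
    # Collapse runs of whitespace into single spaces and trim the ends.
    return re.sub(r"\s+", " ", text).strip()
```

For example, `simple_text_normalizer("Hello, World!")` yields `"hello world"`. Keeping the normalizer a pure function makes it easy to swap in language-specific rules without touching the rest of the pipeline.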
Returns

| Name | Type | Description |
|------|------|-------------|
|  | list[list[SpeechSegment]] or None | If `return_alignments` is True, returns a list of alignment mappings for each audio file. Otherwise returns None (the alignments are only saved to disk). |
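As noted in the parameters table, a custom `tokenizer` only needs to return a list of `(start_char, end_char)` spans when called on a string (option ii). A minimal sketch of such a callable, using a naive regex-based sentence splitter (the function name and splitting rules are illustrative assumptions, not part of the library):

```python
import re

def sentence_spans(text: str) -> list[tuple[int, int]]:
    """Hypothetical tokenizer: return (start_char, end_char) sentence spans.

    Splits after '.', '!' or '?'. Any callable with this return shape
    (or an nltk PunktTokenizer) can be passed as `tokenizer`.
    """
    spans = []
    start = 0
    for match in re.finditer(r"[.!?](?:\s+|$)", text):
        end = match.start() + 1      # include the terminal punctuation mark
        spans.append((start, end))
        start = match.end()          # skip the whitespace between sentences
    if start < len(text):            # trailing text without end punctuation
        spans.append((start, len(text)))
    return spans
```

For `"Hello world. How are you?"` this returns `[(0, 12), (13, 25)]`, i.e. one span per sentence, with whitespace between sentences excluded from the spans.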