alignment_pipeline

pipelines.alignment_pipeline(
    dataloader,
    text_normalizer_fn,
    processor,
    tokenizer=None,
    alignment_strategy='speech',
    start_wildcard=False,
    end_wildcard=False,
    blank_id=0,
    word_boundary='|',
    chunk_size=30,
    ndigits=5,
    indent=2,
    save_json=True,
    save_msgpack=False,
    return_alignments=False,
    delete_emissions=False,
    remove_wildcards=True,
    emissions_dir='output/emissions',
    output_dir='output/alignments',
    device='cuda',
)

Perform alignment on speech segments or VAD chunks using the emissions stored in `emissions_dir`.

Speech-based alignment is typically used to align human transcriptions, while chunk-based alignment is typically used to align the output of ASR models.
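
The choice between the two strategies can be sketched as follows. This is a minimal illustration, not part of the library: `choose_alignment_strategy` and its `transcript_source` labels are hypothetical names introduced here, and the final call is left commented out because it needs the library and real inputs.

```python
def choose_alignment_strategy(transcript_source: str) -> str:
    """Map the origin of a transcript to an `alignment_strategy` value.

    `transcript_source` is a hypothetical label used only in this sketch.
    """
    strategies = {
        "human": "speech",  # human transcriptions: align on speech segments
        "asr": "chunk",     # ASR output: align on VAD chunks
    }
    if transcript_source not in strategies:
        raise ValueError(f"unknown transcript source: {transcript_source!r}")
    return strategies[transcript_source]


# Keyword arguments that could be forwarded to alignment_pipeline.
# dataloader, text_normalizer_fn and processor are required and must be
# supplied by the caller; they are omitted here.
pipeline_kwargs = {
    "alignment_strategy": choose_alignment_strategy("asr"),
    "device": "cpu",
}

# Hypothetical call, commented out because it needs the library and real inputs:
# pipelines.alignment_pipeline(dataloader, text_normalizer_fn, processor, **pipeline_kwargs)
```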

Parameters

Name	Type	Description	Default
dataloader	torch.utils.data.DataLoader	DataLoader loading AudioMetadata objects from JSON or Msgpack files.	required
text_normalizer_fn	callable	Function to normalize text according to regex rules.	required
processor	Wav2Vec2Processor	Wav2Vec2Processor to preprocess the audio.	required
tokenizer	object	Optional tokenizer for custom segmentation of text (e.g. sentence or paragraph segmentation). The tokenizer should either i) be a PunktTokenizer from nltk, or ii) directly return a list of spans `(start_char, end_char)` when called on a string.	`None`
alignment_strategy	str	Strategy for aligning features to text. One of `"speech"` or `"chunk"`. If `"speech"`, alignments are performed on SpeechSegments. If `"chunk"`, alignments are performed on VAD chunks.	`"speech"`
start_wildcard	bool	Whether to add a wildcard token at the start of the segments.	`False`
end_wildcard	bool	Whether to add a wildcard token at the end of the segments.	`False`
blank_id	int	ID of the blank token in the tokenizer.	`0`
word_boundary	str	Token indicating word boundaries in the tokenizer.	`"|"`
chunk_size	int	Maximum chunk size in seconds.	`30`
ndigits	int	Number of decimal digits to round the alignment times and scores to.	`5`
indent	int	Indentation level for saved JSON files. `None` to disable pretty formatting.	`2`
save_json	bool	Whether to save alignment metadata in JSON format.	`True`
save_msgpack	bool	Whether to save alignment metadata in Msgpack format.	`False`
return_alignments	bool	Whether to return the alignment mappings.	`False`
delete_emissions	bool	Whether to delete the emissions files after alignment to save space.	`False`
remove_wildcards	bool	Whether to remove wildcard tokens from the final alignment.	`True`
emissions_dir	str	Directory where the emissions are stored.	`"output/emissions"`
output_dir	str	Directory to save alignment outputs.	`"output/alignments"`
device	str	Device to run the alignment on (e.g. `"cuda"` or `"cpu"`).	`"cuda"`
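
The second form accepted by `tokenizer`, a callable that returns a list of `(start_char, end_char)` spans, might look like the sketch below. `sentence_span_tokenizer` is a hypothetical name, and the regex is a deliberately naive sentence splitter chosen for illustration. It is not the library's own segmentation logic.

```python
import re


def sentence_span_tokenizer(text: str) -> list[tuple[int, int]]:
    """Return one (start_char, end_char) span per sentence-like unit."""
    spans = []
    # A unit is a run of non-terminator characters followed by any '.', '!' or '?'.
    for match in re.finditer(r"[^.!?]+[.!?]*", text):
        chunk = match.group()
        # Shrink the span so it excludes leading and trailing whitespace.
        start = match.start() + (len(chunk) - len(chunk.lstrip()))
        end = match.end() - (len(chunk) - len(chunk.rstrip()))
        if start < end:  # skip whitespace-only matches
            spans.append((start, end))
    return spans


text = "Hello world. How are you?"
spans = sentence_span_tokenizer(text)
sentences = [text[start:end] for start, end in spans]
```

Under the contract stated in the table, such a function could be passed as `tokenizer=sentence_span_tokenizer`. Slicing the original string with each span recovers the segment text, so character offsets stay tied to the unmodified input.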

Returns

Name	Type	Description
	list[list[SpeechSegment]] or None	If `return_alignments` is `True`, returns one list of alignment mappings per audio file. Otherwise, returns `None`.
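
Because the return value is either a nested list or `None`, calling code has to handle both shapes. The helper below is a minimal sketch of that check. `count_aligned_segments` is a hypothetical name, and it treats the segments as opaque objects because the attributes of `SpeechSegment` are not described here.

```python
from typing import Optional, Sequence


def count_aligned_segments(
    alignments: Optional[Sequence[Sequence[object]]],
) -> list[int]:
    """Return the number of aligned segments for each audio file."""
    if alignments is None:
        # return_alignments=False: nothing is returned in memory.
        return []
    # One inner sequence per audio file.
    return [len(file_segments) for file_segments in alignments]
```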