alignment_pipeline

pipelines.alignment_pipeline(
    dataloader,
    text_normalizer_fn,
    processor,
    tokenizer=None,
    alignment_strategy='speech',
    start_wildcard=False,
    end_wildcard=False,
    blank_id=0,
    word_boundary='|',
    chunk_size=30,
    ndigits=5,
    indent=2,
    save_json=True,
    save_msgpack=False,
    return_alignments=False,
    delete_emissions=False,
    remove_wildcards=True,
    emissions_dir='output/emissions',
    output_dir='output/alignments',
    device='cuda',
)

Perform alignment on speech segments or VAD chunks using emissions.

Speech-based alignment is typically used to align human transcriptions, while chunk-based alignment is typically used to align the output of ASR models.
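The rule of thumb above can be written as a small lookup. This is only a sketch: `strategy_kwargs` and the `"human"` / `"asr"` labels are hypothetical names introduced here for illustration, and only the `alignment_strategy` values `"speech"` and `"chunk"` come from this page.

```python
def strategy_kwargs(transcript_source):
    """Hypothetical helper: choose alignment_strategy from the transcript's origin.

    "human" -> "speech" (align on SpeechSegments)
    "asr"   -> "chunk"  (align on VAD chunks)
    """
    strategies = {"human": "speech", "asr": "chunk"}
    if transcript_source not in strategies:
        raise ValueError(f"unknown transcript source: {transcript_source!r}")
    return {"alignment_strategy": strategies[transcript_source]}


# Intended use, assuming dataloader, text_normalizer_fn and processor already exist:
# pipelines.alignment_pipeline(
#     dataloader, text_normalizer_fn, processor, **strategy_kwargs("asr")
# )
```

Since both uses are described as typical rather than required, treat the mapping as a starting default and not a constraint.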

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| dataloader | torch.utils.data.DataLoader | DataLoader loading AudioMetadata objects from JSON or Msgpack files. | required |
| text_normalizer_fn | callable | Function to normalize text according to regex rules. | required |
| processor | Wav2Vec2Processor | Wav2Vec2Processor to preprocess the audio. | required |
| tokenizer | object | Optional tokenizer for custom segmentation of text (e.g. sentence segmentation or paragraph segmentation). The tokenizer should either i) be a PunktTokenizer from nltk, or ii) directly return a list of spans (start_char, end_char) when called on a string. | None |
| alignment_strategy | str | Strategy for aligning features to text. One of "speech" or "chunk". If "speech", alignments are performed on SpeechSegments. If "chunk", alignments are performed on VAD chunks. | "speech" |
| start_wildcard | bool | Whether to add a wildcard token at the start of the segments. | False |
| end_wildcard | bool | Whether to add a wildcard token at the end of the segments. | False |
| blank_id | int | ID of the blank token in the tokenizer. | 0 |
| word_boundary | str | Token indicating word boundaries in the tokenizer. | "\|" |
| chunk_size | int | Maximum chunk size in seconds. | 30 |
| ndigits | int | Number of decimal digits to round the alignment times and scores to. | 5 |
| indent | int | Indentation level for saved JSON files. None to disable pretty formatting. | 2 |
| save_json | bool | Whether to save alignment metadata in JSON format. | True |
| save_msgpack | bool | Whether to save alignment metadata in Msgpack format. | False |
| return_alignments | bool | Whether to return the alignment mappings. | False |
| delete_emissions | bool | Whether to delete the emissions files after alignment to save space. | False |
| remove_wildcards | bool | Whether to remove wildcard tokens from the final alignment. | True |
| emissions_dir | str | Directory where the emissions are stored. | "output/emissions" |
| output_dir | str | Directory to save alignment outputs. | "output/alignments" |
| device | str | Device to run the alignment on (e.g. "cuda" or "cpu"). | "cuda" |
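As an illustration of the second tokenizer option (a callable that returns a list of (start_char, end_char) spans when called on a string), here is a minimal sketch. The name `sentence_spans` and its naive punctuation-based splitting rule are assumptions made for this example. The page does not state how the pipeline treats whitespace between spans or whether it expects tuples specifically, so check that against the library before relying on it.

```python
import re


def sentence_spans(text):
    """Hypothetical span tokenizer: one (start_char, end_char) span per sentence.

    Splits naively after runs of '.', '!' or '?', and trims surrounding
    whitespace so each span covers only the sentence text itself.
    """
    spans = []
    for match in re.finditer(r"[^.!?]+[.!?]*", text):
        chunk = match.group()
        stripped = chunk.strip()
        if not stripped:
            continue  # skip whitespace-only pieces
        # Shift the start past any leading whitespace in the matched chunk.
        start = match.start() + (len(chunk) - len(chunk.lstrip()))
        spans.append((start, start + len(stripped)))
    return spans


text = "Hello world. How are you? Fine"
spans = sentence_spans(text)
# spans == [(0, 12), (13, 25), (26, 30)]
sentences = [text[start:end] for start, end in spans]
# sentences == ["Hello world.", "How are you?", "Fine"]
```

A callable like this would be passed as `tokenizer=sentence_spans`. Because the rule splits on every period, abbreviations such as "e.g." produce extra spans, which is one reason to prefer the PunktTokenizer option for real text.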

Returns

| Name | Type | Description |
| --- | --- | --- |
| | list[list[SpeechSegment]] or None | If return_alignments is True, returns a list of alignment mappings for each audio file. Otherwise, returns None. |