alignment_pipeline
pipelines.alignment_pipeline(
dataloader,
text_normalizer_fn,
processor,
tokenizer=None,
alignment_strategy='speech',
start_wildcard=False,
end_wildcard=False,
blank_id=0,
word_boundary='|',
chunk_size=30,
ndigits=5,
indent=2,
save_json=True,
save_msgpack=False,
return_alignments=False,
delete_emissions=False,
remove_wildcards=True,
emissions_dir='output/emissions',
output_dir='output/alignments',
device='cuda',
)

Perform alignment on speech segments or VAD chunks using emissions.
Speech-based alignment is typically used to align human transcriptions, while chunk-based alignment is typically used to align the output of ASR models.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| dataloader | torch.utils.data.DataLoader | DataLoader loading AudioMetadata objects from JSON or Msgpack files. | required |
| text_normalizer_fn | callable | Function to normalize text according to regex rules. | required |
| processor | Wav2Vec2Processor | Wav2Vec2Processor to preprocess the audio. | required |
| tokenizer | object | Optional tokenizer for custom segmentation of text (e.g. sentence segmentation, or paragraph segmentation). The tokenizer should either i) be a PunktTokenizer from nltk, or ii) directly return a list of spans (start_char, end_char) when called on a string. | None |
| alignment_strategy | str | Strategy for aligning features to text. One of "speech" or "chunk". If "speech", alignments are performed on SpeechSegments. If "chunk", alignments are performed on VAD chunks. | "speech" |
| start_wildcard | bool | Whether to add a wildcard token at the start of the segments. | False |
| end_wildcard | bool | Whether to add a wildcard token at the end of the segments. | False |
| blank_id | int | ID of the blank token in the tokenizer. | 0 |
| word_boundary | str | Token indicating word boundaries in the tokenizer. | "\|" |
| chunk_size | int | Maximum chunk size in seconds. | 30 |
| ndigits | int | Number of decimal digits to round the alignment times and scores to. | 5 |
| indent | int | Indentation level for saved JSON files. None to disable pretty formatting. | 2 |
| save_json | bool | Whether to save alignment metadata in JSON format. | True |
| save_msgpack | bool | Whether to save alignment metadata in Msgpack format. | False |
| return_alignments | bool | Whether to return the alignment mappings. | False |
| delete_emissions | bool | Whether to delete the emissions files after alignment to save space. | False |
| remove_wildcards | bool | Whether to remove wildcard tokens from the final alignment. | True |
| emissions_dir | str | Directory where the emissions are stored. | "output/emissions" |
| output_dir | str | Directory to save alignment outputs. | "output/alignments" |
| device | str | Device to run the alignment on (e.g. “cuda” or “cpu”). | "cuda" |
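
The `tokenizer` parameter accepts any callable that returns a list of `(start_char, end_char)` spans when called on a string. The following is a minimal sketch of such a callable, assuming a simple punctuation-based sentence split is adequate. The function name `sentence_spans` and the regular expression are illustrative and not part of the library; the rule is deliberately naive and will split on abbreviations such as "e.g.".

```python
import re


def sentence_spans(text: str) -> list[tuple[int, int]]:
    """Return one (start_char, end_char) span per sentence in `text`.

    Hypothetical helper: a sentence is taken to be a run of text that
    starts at a non-whitespace character and ends at ., ! or ? followed
    by whitespace, or at the end of the string.
    """
    pattern = re.compile(r"\S.*?(?:[.!?]+(?=\s|$)|$)", flags=re.DOTALL)
    return [(m.start(), m.end()) for m in pattern.finditer(text)]


if __name__ == "__main__":
    sample = "Hello world. How are you? Fine"
    for start, end in sentence_spans(sample):
        # Each span indexes back into the original, unmodified string.
        print((start, end), repr(sample[start:end]))
```

A callable like this would be passed as `tokenizer=sentence_spans`. Because the spans are character offsets into the original string, they can be sliced back out of the text without any re-tokenization.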
Returns
| Name | Type | Description |
|---|---|---|
| | list[list[SpeechSegment]] or None | If return_alignments is True, returns a list of alignment mappings for each audio file. Otherwise, returns None. |