easytranscriber supports custom regex-based text normalization functions to preprocess ASR output before alignment. Reconciling the output of the ASR model with that of the emissions model gives the forced alignment algorithm a better chance of producing accurate alignments.
Wav2vec2 models tend to produce all-lowercase or all-uppercase output without punctuation, whereas Whisper models output mixed-case text with punctuation. Whisper also emits non-verbal tokens (often within parentheses or brackets), as well as symbols and abbreviations.
Normalizing these outputs can substantially improve the quality of forced alignment.
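For instance, a single regex pass can strip Whisper-style bracketed or parenthesized non-verbal tokens before alignment. The pattern and sample output below are illustrative, not the library's defaults:

```python
import re

# A hypothetical Whisper-style output containing non-verbal tokens
whisper_output = "[Music] It was the best of times. (coughs) It was the worst of times."

# Strip bracketed and parenthesized non-verbal tokens (illustrative pattern)
cleaned = re.sub(r"\[.*?\]|\(.*?\)", "", whisper_output)
# Collapse the leftover whitespace
cleaned = re.sub(r"\s+", " ", cleaned).strip()
print(cleaned)
```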
Note
The normalizations are reversible, meaning that the original text can be recovered after forced alignment has been performed.¹

¹ Whitespace is not always recoverable, depending on the regex patterns used.
Normalization
Let’s apply some basic normalization to our example from A Tale of Two Cities. To explore the effect of our normalizations, we will import the SpanMapNormalizer class from the easyaligner library and remove punctuation.
```python
from easyaligner.text.normalization import SpanMapNormalizer

text = """Book 1. Chapter 1, The Period. It was the best of times. It was the worst of times.
It was the age of wisdom. It was the age of foolishness. It was the epoch of belief.
It was the epoch of incredulity. It was the season of light.
It was the season of darkness. It was the spring of hope."""

normalizer = SpanMapNormalizer(text)
normalizer.transform(r"[^\w\s]", "")  # Remove punctuation and special characters
print(normalizer.current_text)
```
```
Book 1 Chapter 1 The Period It was the best of times It was the worst of times
It was the age of wisdom It was the age of foolishness It was the epoch of belief
It was the epoch of incredulity It was the season of light
It was the season of darkness It was the spring of hope
```
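Comparing the output above with the one below, a lowercasing step has evidently been applied at this point. A sketch of such a step (the SpanMapNormalizer call in the comment is an assumption, mirroring the default normalizer shown later; the re.sub line is a runnable stand-in):

```python
import re

# With SpanMapNormalizer, this step would presumably be:
#   normalizer.transform(r"\S+", lambda m: m.group().lower())
# The equivalent plain-regex operation:
text = "Book 1 Chapter 1 The Period"
lowered = re.sub(r"\S+", lambda m: m.group().lower(), text)
print(lowered)
```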
```
book 1 chapter 1 the period it was the best of times it was the worst of times
it was the age of wisdom it was the age of foolishness it was the epoch of belief
it was the epoch of incredulity it was the season of light
it was the season of darkness it was the spring of hope
```
We may also want to convert the numbers to their word forms. The library num2words² can help with this:

² pip install num2words
```python
from num2words import num2words

normalizer.transform(r"\d+", lambda m: num2words(int(m.group())))  # Convert numbers to words
print(normalizer.current_text)
```
```
book one chapter one the period it was the best of times it was the worst of times
it was the age of wisdom it was the age of foolishness it was the epoch of belief
it was the epoch of incredulity it was the season of light
it was the season of darkness it was the spring of hope
```
When you are done, it is good practice to always apply whitespace normalization as the final transformation step:
```python
normalizer.transform(r"\s+", " ")  # Normalize whitespace to a single space
normalizer.transform(r"^\s+|\s+$", "")  # Strip leading and trailing whitespace
print(normalizer.current_text)
```
```
book one chapter one the period it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness it was the epoch of belief it was the epoch of incredulity it was the season of light it was the season of darkness it was the spring of hope
```
Our text is now ready for forced alignment. However, if we want to recover the original text, it does not suffice to pass only the normalized text to the forced alignment algorithm. We also need to pass a mapping between the original text and the normalized text. Any user-supplied text normalization function in easytranscriber needs to return the following two objects:
```python
mapping = normalizer.get_token_map()
normalized_tokens = [item["normalized_token"] for item in mapping]
```
Let’s inspect what’s in the first five items of the mapping:
We can see that normalized_token and text contain the token in the normalized and original text, respectively, while start_char and end_char indicate the character indices of the token in the original text.
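The structure described above can be illustrated with a toy example. This is not the library's implementation; the field names simply follow the description:

```python
import re

original = "Book 1. Chapter 1, The Period."
normalized = "book one chapter one the period"

# Toy construction of the mapping: pair each normalized token with the
# character span of the corresponding token in the original text.
orig_tokens = list(re.finditer(r"\S+", original))
norm_tokens = normalized.split()
mapping = [
    {
        "normalized_token": norm,
        "text": m.group(),
        "start_char": m.start(),
        "end_char": m.end(),
    }
    for norm, m in zip(norm_tokens, orig_tokens)
]
for item in mapping[:5]:
    print(item)
```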
Tip
When you are done testing your transformations, combine them into a function that takes in a string and outputs the normalized tokens and the mapping. See below for an example of how the default normalization function in easytranscriber is implemented.
Default text_normalizer
easytranscriber provides a conservative default text normalization function. This default function is applied³ unless the user specifies their own function.
Here is the default normalization function, for reference:
```python
import unicodedata

from easyaligner.text.normalization import SpanMapNormalizer


def text_normalizer(text: str) -> tuple:
    """
    Default text normalization function. Applies

    - Unicode normalization (NFKC)
    - Lowercasing
    - Normalization of whitespace
    - Removal of parentheses and special characters

    Parameters
    ----------
    text : str
        Input text to normalize.

    Returns
    -------
    tuple
        Tuple containing (normalized_tokens, mapping).
    """
    normalizer = SpanMapNormalizer(text)

    # # Remove parentheses, brackets, stars, and their content
    # normalizer.transform(r"\(.*?\)", "")
    # normalizer.transform(r"\[.*?\]", "")
    # normalizer.transform(r"\*.*?\*", "")

    # Unicode normalization on tokens and lowercasing
    normalizer.transform(r"\S+", lambda m: unicodedata.normalize("NFKC", m.group()))
    normalizer.transform(r"\S+", lambda m: m.group().lower())

    normalizer.transform(r"[^\w\s]", "")  # Remove punctuation and special characters
    normalizer.transform(r"\s+", " ")  # Normalize whitespace to a single space
    normalizer.transform(r"^\s+|\s+$", "")  # Strip leading and trailing whitespace

    mapping = normalizer.get_token_map()
    normalized_tokens = [item["normalized_token"] for item in mapping]
    return normalized_tokens, mapping
```
In many cases you may want, or need, to be more careful about how punctuation and special characters are removed. Hyphenated words, em dashes, and scores in sports games (e.g. 3-2) are examples where you may want to insert a space instead of removing the characters entirely. Beware also that the order of the transformations can sometimes matter.
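A small regex illustration of why replacing such characters with a space can be safer than deleting them outright:

```python
import re

text = "A hard-fought match that ended 3-2."

# Deleting punctuation merges tokens that should stay apart:
deleted = re.sub(r"[^\w\s]", "", text)
# Replacing punctuation with a space keeps them separate
# (followed by whitespace normalization):
spaced = re.sub(r"[^\w\s]", " ", text)
spaced = re.sub(r"\s+", " ", spaced).strip()

print(deleted)
print(spaced)
```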
Warning
Avoid overly broad regex patterns. A pattern that matches everything will produce a useless mapping.
It is highly recommended to inspect the intermediate outputs of the applied transformations as described in the previous section.
Sentence tokenization
easytranscriber supports passing a tokenizer to the pipeline function that segments the input text according to the user’s preferences. The best matching start and end timestamps will be assigned to each tokenized segment based on the outputs from forced alignment.
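The idea can be sketched with a toy example of how a tokenized segment might receive its timestamps from word-level alignment output. The field names here are assumptions for illustration, not easytranscriber's actual data structures:

```python
# Hypothetical word-level forced alignment output
words = [
    {"word": "it", "start": 0.00, "end": 0.12},
    {"word": "was", "start": 0.12, "end": 0.31},
    {"word": "the", "start": 0.31, "end": 0.40},
    {"word": "best", "start": 0.40, "end": 0.75},
]

# A segment spanning these words takes the first word's start time
# and the last word's end time:
segment = {"start": words[0]["start"], "end": words[-1]["end"]}
print(segment)
```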
For sentence tokenization, we recommend using nltk.tokenize.punkt.PunktTokenizer. The load_tokenizer function from the easyaligner library provides a convenient way to load an appropriate tokenizer for your language:
```python
from easyaligner.text import load_tokenizer

tokenizer = load_tokenizer(language="swedish")
```
PunktTokenizer maintains lists of abbreviations to avoid incorrectly splitting sentences. We can inspect the loaded tokenizer’s list of abbreviations as follows:
```python
print("Current abbreviations:", tokenizer._params.abbrev_types)
print(f"Length of abbreviations: {len(tokenizer._params.abbrev_types)}")
```
When you are done, pass the tokenizer to the pipeline function in easytranscriber.
Arbitrary tokenization
The tokenizer argument of the pipeline function accepts any function that takes in a string and outputs a list of (start_char, end_char) tuples. Custom tokenization functions can therefore be defined based on any arbitrary segmentation of the source text.
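Such a function might look as follows. This line-based tokenizer is a hypothetical example, not part of the library:

```python
import re


def line_tokenizer(text: str) -> list[tuple[int, int]]:
    """Hypothetical tokenizer: one segment per non-empty line."""
    return [(m.start(), m.end()) for m in re.finditer(r"[^\n]+", text)]


segments = line_tokenizer("first line\nsecond line")
print(segments)
```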
In our case, however, the source text is Whisper’s ASR output. Since Whisper inference is restricted to 30-second audio chunks, a coarser level of tokenization (e.g. paragraph-level) is generally infeasible.
Arbitrary tokenization is more relevant when aligning existing ground truth texts with longer audio segments. easytranscriber is built on top of easyaligner, which is designed for this purpose. A more in-depth guide on custom tokenization will be available in the easyaligner documentation.