Text processing

easytranscriber supports custom regex-based text normalization functions to preprocess ASR outputs before alignment. Reconciling the outputs of the ASR model with those expected by the emissions model gives the forced alignment algorithm a better chance of producing accurate alignments.

Wav2vec2 models tend to produce all lowercase or all uppercase outputs without punctuation, whereas Whisper models output mixed case text with punctuation. Whisper furthermore outputs non-verbal tokens (often within parentheses or brackets), symbols and abbreviations.

Normalizing these outputs can substantially improve the quality of forced alignment.

Note

The normalizations are reversible, meaning that the original text can be recovered after forced alignment has been performed1.

1 Whitespace is not always recoverable, depending on the regex patterns used.

Normalization

Let’s apply some basic normalization to our example from A Tale of Two Cities. To explore the effect of our normalizations, we will import the SpanMapNormalizer class from the easyaligner library, and remove punctuation.

from easyaligner.text.normalization import SpanMapNormalizer

text = """Book 1. Chapter 1, The Period. It was the best of times. It was the worst of times. 
It was the age of wisdom. It was the age of foolishness. It was the epoch of belief. 
It was the epoch of incredulity. It was the season of light. 
It was the season of darkness. It was the spring of hope."""

normalizer = SpanMapNormalizer(text)
normalizer.transform(r"[^\w\s]", "")  # Remove punctuation and special characters
print(normalizer.current_text)
Book 1 Chapter 1 The Period It was the best of times It was the worst of times 
It was the age of wisdom It was the age of foolishness It was the epoch of belief 
It was the epoch of incredulity It was the season of light 
It was the season of darkness It was the spring of hope

Let’s make it all lowercase as well:

normalizer.transform(r"\S+", lambda m: m.group().lower())  # Lowercase each token
print(normalizer.current_text)
book 1 chapter 1 the period it was the best of times it was the worst of times 
it was the age of wisdom it was the age of foolishness it was the epoch of belief 
it was the epoch of incredulity it was the season of light 
it was the season of darkness it was the spring of hope

We may also want to convert the numbers to their word forms. The library num2words2 can help with this:

2 pip install num2words

from num2words import num2words
normalizer.transform(r"\d+", lambda m: num2words(int(m.group())))  # Convert numbers to words
print(normalizer.current_text)
book one chapter one the period it was the best of times it was the worst of times 
it was the age of wisdom it was the age of foolishness it was the epoch of belief 
it was the epoch of incredulity it was the season of light 
it was the season of darkness it was the spring of hope

When you are done, it is good practice to apply whitespace normalization as the final transformation step:

normalizer.transform(r"\s+", " ")  # Normalize whitespace to a single space
normalizer.transform(r"^\s+|\s+$", "")  # Strip leading and trailing whitespace
print(normalizer.current_text)
book one chapter one the period it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness it was the epoch of belief it was the epoch of incredulity it was the season of light it was the season of darkness it was the spring of hope

Our text is now ready for forced alignment. However, if we want to recover the original text afterwards, it does not suffice to pass only the normalized text to the forced alignment algorithm. We also need to pass a mapping between the original text and the normalized text. Any user-supplied text normalization function in easytranscriber therefore needs to return the following two objects:

mapping = normalizer.get_token_map()
normalized_tokens = [item["normalized_token"] for item in mapping]

Let’s inspect what’s in the first five items of the mapping:

for item in mapping[:5]:
    print(item)
{'normalized_token': 'book', 'text': 'Book', 'start_char': 0, 'end_char': 4}
{'normalized_token': 'one', 'text': '1', 'start_char': 5, 'end_char': 6}
{'normalized_token': 'chapter', 'text': 'Chapter', 'start_char': 8, 'end_char': 15}
{'normalized_token': 'one', 'text': '1', 'start_char': 16, 'end_char': 17}
{'normalized_token': 'the', 'text': 'The', 'start_char': 19, 'end_char': 22}

We can see that normalized_token and text contain the token in the normalized and original text, respectively, while start_char and end_char indicate the character indices of the token in the original text.
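We can verify this relationship directly: slicing the original text with start_char and end_char should reproduce the token stored under text. A small self-contained check, using the first items printed above:

```python
text = """Book 1. Chapter 1, The Period. It was the best of times."""

# First items of the token map, as printed above
mapping = [
    {"normalized_token": "book", "text": "Book", "start_char": 0, "end_char": 4},
    {"normalized_token": "one", "text": "1", "start_char": 5, "end_char": 6},
    {"normalized_token": "chapter", "text": "Chapter", "start_char": 8, "end_char": 15},
]

for item in mapping:
    # Slicing the original text with the span indices recovers the original token
    assert text[item["start_char"]:item["end_char"]] == item["text"]
```

This is what makes the normalizations reversible: the original text is never discarded, only indexed into.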

Tip

When you are done testing your transformations, combine them into a function that takes in a string and outputs the normalized tokens and the mapping. See below for an example of how the default normalization function in easytranscriber is implemented.

Default text_normalizer

easytranscriber provides a conservative default text normalization function. This default function is applied3 unless the user specifies their own function.

3 See pipeline and its text_normalizer_fn argument.

Here is the default normalization function, for reference:

import unicodedata

from easyaligner.text.normalization import SpanMapNormalizer


def text_normalizer(text: str) -> tuple[list[str], list[dict]]:
    """
    Default text normalization function.

    Applies
        - Unicode normalization (NFKC)
        - Lowercasing
        - Removal of punctuation and special characters
        - Normalization of whitespace

    Parameters
    ----------
    text : str
        Input text to normalize.

    Returns
    -------
    tuple
        Tuple containing (normalized_tokens, mapping).
    """
    normalizer = SpanMapNormalizer(text)
    # # Optionally remove parentheses, brackets, stars, and their content
    # normalizer.transform(r"\(.*?\)", "")
    # normalizer.transform(r"\[.*?\]", "")
    # normalizer.transform(r"\*.*?\*", "")

    # Unicode normalization (NFKC) on tokens, then lowercasing
    normalizer.transform(r"\S+", lambda m: unicodedata.normalize("NFKC", m.group()))
    normalizer.transform(r"\S+", lambda m: m.group().lower())
    normalizer.transform(r"[^\w\s]", "")  # Remove punctuation and special characters
    normalizer.transform(r"\s+", " ")  # Normalize whitespace to a single space
    normalizer.transform(r"^\s+|\s+$", "")  # Strip leading and trailing whitespace

    mapping = normalizer.get_token_map()
    normalized_tokens = [item["normalized_token"] for item in mapping]
    return normalized_tokens, mapping

In many cases you may want, or need, to be more careful about how punctuation and special characters are removed. Hyphenated words, em dashes, and scores in sports results (e.g. 3-2) are examples where you may want to insert a space instead of removing the character entirely. Beware also that the order of the transformations can matter.
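As an illustrative sketch using Python's built-in re module (the same patterns could be passed to SpanMapNormalizer.transform; the example text here is made up), intra-word hyphens can be replaced with a space before the remaining punctuation is stripped:

```python
import re

text = "A state-of-the-art match ended 3-2."

# Replace hyphens between word characters with a space first...
step1 = re.sub(r"(?<=\w)-(?=\w)", " ", text)
# ...then strip remaining punctuation. Order matters: stripping first
# would fuse the tokens into "stateoftheart" and "32".
step2 = re.sub(r"[^\w\s]", "", step1)
print(step2)  # A state of the art match ended 3 2
```

Reversing the two steps would remove the hyphens outright and merge the surrounding words, which is exactly the failure mode described above.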

Warning

Avoid overly broad regex patterns. A pattern that matches everything will produce a useless mapping.

It is highly recommended to inspect the intermediate outputs of the applied transformations as described in the previous section.

Sentence tokenization

easytranscriber supports passing a tokenizer to the pipeline function that segments the input text according to the user’s preferences. The best matching start and end timestamps will be assigned to each tokenized segment based on the outputs from forced alignment.

For sentence tokenization, we recommend using nltk.tokenize.punkt.PunktTokenizer. The load_tokenizer function from the easyaligner library provides a convenient way to load an appropriate tokenizer for your language:

from easyaligner.text import load_tokenizer
tokenizer = load_tokenizer(language="swedish")

PunktTokenizer maintains lists of abbreviations to avoid incorrectly splitting sentences. We can inspect the loaded tokenizer’s list of abbreviations as follows:

print("Current abbreviations:", tokenizer._params.abbrev_types)
print(f"Length of abbreviations: {len(tokenizer._params.abbrev_types)}")
Current abbreviations: {'rif', 'kap', 'ital', 'dna', 'p.g.a', 'o.d', 'm.m', 'e.m', 'ppm', 'osv', 's.k', 'bf', 'jaha', 'f.m', 'hushålln.-sällsk', 'resp', 'z.b', 'föreläsn.-fören', 'ordf', 'landtm.-förb', 'o.s.v', 'åk', 'hrm', 'bl.a', 'p', 'f.d', 'ex', 'rskr', 't.o.m', 'fig', 'f.n', 'mom', 'prop', 'postst', 't.ex', 'mm', 'aig', 'm.fl', 'dir'}
Length of abbreviations: 39

Depending on the domain of your data, you may however want to add custom abbreviations to the tokenizer:

new_abbreviations = {
    "d.v.s",
    "dvs",
    "fr.o.m",
    "kungl",
    "m.m",
    "milj",
    "o.s.v",
    "t.o.m",
    "milj.kr",
}
tokenizer._params.abbrev_types.update(new_abbreviations)
print("Updated abbreviations:", tokenizer._params.abbrev_types)
print(f"Length of abbreviations: {len(tokenizer._params.abbrev_types)}")
Updated abbreviations: {'rif', 'kap', 'ital', 'dna', 'p.g.a', 'o.d', 'm.m', 'e.m', 'ppm', 'osv', 's.k', 'bf', 'jaha', 'dvs', 'f.m', 'hushålln.-sällsk', 'fr.o.m', 'resp', 'z.b', 'milj', 'föreläsn.-fören', 'ordf', 'landtm.-förb', 'o.s.v', 'åk', 'hrm', 'bl.a', 'p', 'f.d', 'ex', 'rskr', 't.o.m', 'fig', 'f.n', 'mom', 'milj.kr', 'prop', 'd.v.s', 'postst', 't.ex', 'mm', 'aig', 'm.fl', 'kungl', 'dir'}
Length of abbreviations: 45

When you are done, pass the tokenizer to the pipeline function in easytranscriber.

Arbitrary tokenization

The tokenizer argument of the pipeline function accepts any function that takes in a string and outputs a list of (start_char, end_char) tuples. Custom tokenization functions can therefore be defined based on any arbitrary segmentation of the source text.
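As a minimal sketch of this interface (the function name and segmentation rule here are hypothetical), a custom tokenizer could segment the text by non-empty lines:

```python
import re

def line_tokenizer(text: str) -> list[tuple[int, int]]:
    # Treat each non-empty line as one segment and return its character span
    return [m.span() for m in re.finditer(r"[^\n]+", text)]

spans = line_tokenizer("first line\nsecond line")
print(spans)  # [(0, 10), (11, 22)]
```

Any function with this signature works, so segments can follow paragraphs, speaker turns, subtitle cues, or whatever unit suits your data.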

In our case, however, the source text is Whisper’s ASR output. Since Whisper inference is restricted to 30-second audio chunks, a coarser level of tokenization (e.g. paragraph-level) is generally infeasible.

Arbitrary tokenization is more relevant when aligning existing ground truth texts with longer audio segments. easytranscriber is built on top of easyaligner, which is designed for this purpose. A more in-depth guide on custom tokenization will be available in the easyaligner documentation.