Text processing

easyaligner supports custom regex-based text normalization functions to preprocess text before forced alignment. The ground-truth text is normalized to bring its style and formatting as close as possible to the wav2vec2 output.

Wav2vec2 models tend to produce all-lowercase or all-uppercase outputs, without punctuation. Ground-truth text transcripts, on the other hand, tend to have casing, punctuation, symbols and numbers. Normalizing the text to match the model’s output before alignment can substantially improve alignment quality.

Note

The normalizations are reversible, meaning that the original text can be recovered after forced alignment has been performed1.

1 Whitespace is not always recoverable, depending on the regex patterns used

Normalization

Let’s apply some basic normalization to an example text derived from running automatic speech recognition on the audiobook of A Tale of Two Cities. To explore the effect of our normalizations step by step, we will import the SpanMapNormalizer class from the easyaligner library. We begin by removing punctuation and other special characters.

from easyaligner.text.normalization import SpanMapNormalizer

text = """Book 1. Chapter 1, The Period. It was the best of times. It was the worst of times.
It was the age of wisdom. It was the age of foolishness. It was the epoch of belief.
It was the epoch of incredulity. It was the season of light.
It was the season of darkness. It was the spring of hope."""

normalizer = SpanMapNormalizer(text)
normalizer.transform(r"[^\w\s]", "")  # Remove punctuation and special characters
print(normalizer.current_text)
Book 1 Chapter 1 The Period It was the best of times It was the worst of times
It was the age of wisdom It was the age of foolishness It was the epoch of belief
It was the epoch of incredulity It was the season of light
It was the season of darkness It was the spring of hope

Let’s make the text all lowercase as well:

normalizer.transform(r"\S+", lambda m: m.group().lower())
print(normalizer.current_text)
book 1 chapter 1 the period it was the best of times it was the worst of times
it was the age of wisdom it was the age of foolishness it was the epoch of belief
it was the epoch of incredulity it was the season of light
it was the season of darkness it was the spring of hope

We may also want to convert the numbers to their word forms, as this is how wav2vec2 transcribes them. The library num2words2 can help with this:

2 pip install num2words

from num2words import num2words
normalizer.transform(r"\d+", lambda m: num2words(int(m.group())))  # Convert numbers to words
print(normalizer.current_text)
book one chapter one the period it was the best of times it was the worst of times
it was the age of wisdom it was the age of foolishness it was the epoch of belief
it was the epoch of incredulity it was the season of light
it was the season of darkness it was the spring of hope

Once you are done with your transformations, it is a good idea to apply whitespace normalization as the final step:

normalizer.transform(r"\s+", " ")  # Normalize whitespace to a single space
normalizer.transform(r"^\s+|\s+$", "")  # Strip leading and trailing whitespace
print(normalizer.current_text)
book one chapter one the period it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness it was the epoch of belief it was the epoch of incredulity it was the season of light it was the season of darkness it was the spring of hope
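These whitespace passes are ordinary regular-expression substitutions. For illustration only (outside the normalizer, with no span tracking), the same two passes can be reproduced with plain re.sub:

```python
import re

s = "  book one   chapter one\nthe period  "
s = re.sub(r"\s+", " ", s)       # Collapse runs of whitespace, including newlines
s = re.sub(r"^\s+|\s+$", "", s)  # Strip leading and trailing whitespace
print(s)  # book one chapter one the period
```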

Our text is now ready for forced alignment. However, if we want to recover the original text, it is not enough to pass only the normalized text to the forced alignment algorithm; we also need to pass a mapping between the original text and the normalized text. Any user-supplied text normalization function in easyaligner needs to return the following two objects:

mapping = normalizer.get_token_map()
normalized_tokens = [item["normalized_token"] for item in mapping]

Let’s inspect the first five items of the mapping:

for item in mapping[:5]:
    print(item)
{'normalized_token': 'book', 'text': 'Book', 'start_char': 0, 'end_char': 4}
{'normalized_token': 'one', 'text': '1', 'start_char': 5, 'end_char': 6}
{'normalized_token': 'chapter', 'text': 'Chapter', 'start_char': 8, 'end_char': 15}
{'normalized_token': 'one', 'text': '1', 'start_char': 16, 'end_char': 17}
{'normalized_token': 'the', 'text': 'The', 'start_char': 19, 'end_char': 22}

We can see that normalized_token and text contain the token in the normalized and original text, respectively, while start_char and end_char indicate the character indices of the token in the original text.
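To see why these spans make the normalization reversible, consider this self-contained illustration, built from the mapping entries printed above and the corresponding slice of the original text:

```python
original = "Book 1. Chapter 1, The Period."
mapping = [
    {"normalized_token": "book", "text": "Book", "start_char": 0, "end_char": 4},
    {"normalized_token": "one", "text": "1", "start_char": 5, "end_char": 6},
    {"normalized_token": "chapter", "text": "Chapter", "start_char": 8, "end_char": 15},
    {"normalized_token": "one", "text": "1", "start_char": 16, "end_char": 17},
    {"normalized_token": "the", "text": "The", "start_char": 19, "end_char": 22},
]

# If forced alignment assigns a timestamp range to normalized tokens 0..3,
# the corresponding original text is a single character slice, with the
# original casing and punctuation intact:
start = mapping[0]["start_char"]
end = mapping[3]["end_char"]
print(original[start:end])  # Book 1. Chapter 1
```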

Tip

When you are done testing your transformations, combine them into a function that takes in a string and outputs the normalized tokens and the mapping. See below for an example of how the default normalization function in easyaligner is implemented.

Default text_normalizer

easyaligner provides a conservative default text normalization function. This default function is applied3 unless the user specifies their own function.

3 See pipeline and the arg text_normalizer_fn

Here is the default normalization function, for reference:

import unicodedata


def text_normalizer(text: str):
    """
    Default text normalization function.

    Applies:
        - Unicode normalization (NFKC)
        - Lowercasing
        - Normalization of whitespace
        - Removal of punctuation and special characters

    Parameters
    ----------
    text : str
        Input text to normalize.

    Returns
    -------
    tuple
        Tuple containing (normalized_tokens, mapping).
    """
    normalizer = SpanMapNormalizer(text)
    normalizer.transform(r"\S+", lambda m: unicodedata.normalize("NFKC", m.group()))
    normalizer.transform(r"\S+", lambda m: m.group().lower())
    normalizer.transform(r"[^\w\s]", "")  # Remove punctuation and special characters
    normalizer.transform(r"\s+", " ")  # Normalize whitespace to a single space
    normalizer.transform(r"^\s+|\s+$", "")  # Strip leading and trailing whitespace

    mapping = normalizer.get_token_map()
    normalized_tokens = [item["normalized_token"] for item in mapping]
    return normalized_tokens, mapping
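To see what the NFKC pass contributes, here is a standard-library illustration of a few characters it canonicalizes (ligatures, the numero sign, and fullwidth digits):

```python
import unicodedata

for s in ["ﬁnancial", "№ 5", "２０２３"]:
    # NFKC replaces compatibility characters with their canonical equivalents
    print(unicodedata.normalize("NFKC", s))
# financial
# No 5
# 2023
```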

In many cases you may want, or need, to be more careful about how punctuation and special characters are removed. Hyphenated words, em dashes, and scores in sports games (e.g. 3-2) are examples where you may want to insert a whitespace instead of removing the characters entirely. Beware, also, that the order of the transformations can sometimes matter.
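The difference is easy to demonstrate with plain re.sub (a toy example, independent of the normalizer):

```python
import re

text = "A hard-fought game that ended 3-2."
# Removing matched characters merges the surrounding tokens:
removed = re.sub(r"[^\w\s]", "", text)
print(removed)  # A hardfought game that ended 32
# Replacing with a space keeps the tokens separate (with whitespace cleanup after):
spaced = re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", text)).strip()
print(spaced)  # A hard fought game that ended 3 2
```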

Warning

Avoid overly broad regex patterns. A pattern that matches everything will produce a useless mapping.

It is highly recommended to inspect the intermediate outputs of the applied transformations as described in the previous section.

Sentence tokenization

easyaligner supports passing a tokenizer to the pipeline function that segments the input text according to the user’s preferences. The best matching start and end timestamps will be assigned to each tokenized segment based on the word-level outputs from forced alignment.

For sentence tokenization, we recommend using nltk.tokenize.punkt.PunktTokenizer. The load_tokenizer function from easyaligner provides a convenient way to load an appropriate tokenizer for your language:

from easyaligner.text import load_tokenizer
tokenizer = load_tokenizer(language="english")

PunktTokenizer maintains lists of abbreviations to avoid incorrectly splitting sentences. We can inspect the loaded tokenizer’s list of abbreviations as follows:

print("Current abbreviations:", tokenizer._params.abbrev_types)
print(f"Length of abbreviations: {len(tokenizer._params.abbrev_types)}")
Current abbreviations: {'l.f', 'c', 'e.l', 'va', 'w.r', 'd', 'ariz', 'nev', 'n.h', 'calif', 'sw', 'h.f', 'd.c', 'r.t', 'wash', 'n.y', 'aug', 'ct', 'ft', 's.s', 'reps', 'g', 'n.j', 'n.c', 'jan', 'feb', 'j.b', 'u.k', 'w', 'a.g', 'sr', 'kan', 'mrs', 'v', 'colo', 't.j', 'p', 'lt', 'conn', 'tenn', 'l.p', 'mg', 'h.c', 'r.k', 'd.w', 'sept', 'corp', 'a.m.e', 'okla', 'maj', 'a.d', 'ga', 'a.m', 'cie', 'p.a.m', 'st', 'w.c', 'e.h', 'fla', 'u.s', 'm.j', 'w.va', 'chg', 'j.r', 'm.b.a', 'ph.d', 'g.f', 's.a', 'm.d.c', 'oct', 'd.h', 'e', 'u.s.s.r', 'c.i.t', 'cos', '. . ', 's', 'ms', 'a.a', 'f.g', 'h.m', 'prof', 'fri', 'mr', 'ala', 'dec', 's.a.y', 'r.h', 'inc', 's.p.a', 'ltd', 'a.s', 'j.k', 'h', 'a.t', 'j.j', 'jr', 'minn', 'u.n', 'g.d', 'mich', 'c.o.m.b', 'bros', 'r.i', 's.g', 'b.v', 'wis', 'sep', 'ore', 'w.w', 't', 'a.h', 'ky', 'k', 'tues', 'n.d', 'wed', 'col', 'm', 'r.a', 'ok', 'co', 'r', 'f.j', 's.c', 'vs', 'messrs', 'e.m', 'i.m.s', 'ave', 'l.a', 'c.v', 'f', 'u.s.a', 'yr', 'ill', 'adm', 'j.p', 'sen', 'r.j', 'a.c', 'n', 'g.k', 'gen', 'p.m', 'n.v', 'vt', 'rep', 'l', 'j.c', 'b.f', 'dr', 'pa', 'e.f', 'nov', 'n.m'}
Length of abbreviations: 156

You may, however, want to add custom abbreviations to the tokenizer, depending on the domain of your data:

new_abbreviations = {"rev", "capt"}
tokenizer._params.abbrev_types.update(new_abbreviations)
print("Updated abbreviations:", tokenizer._params.abbrev_types)
print(f"Length of abbreviations: {len(tokenizer._params.abbrev_types)}")
Updated abbreviations: {'l.f', 'c', 'e.l', 'va', 'w.r', 'd', 'rev', 'ariz', 'nev', 'n.h', 'calif', 'sw', 'h.f', 'd.c', 'r.t', 'wash', 'n.y', 'aug', 'ct', 'ft', 's.s', 'reps', 'g', 'n.j', 'n.c', 'jan', 'feb', 'j.b', 'u.k', 'w', 'a.g', 'sr', 'kan', 'mrs', 'v', 'colo', 't.j', 'p', 'lt', 'conn', 'tenn', 'l.p', 'mg', 'h.c', 'r.k', 'd.w', 'sept', 'corp', 'a.m.e', 'okla', 'maj', 'a.d', 'ga', 'a.m', 'cie', 'p.a.m', 'st', 'w.c', 'e.h', 'fla', 'u.s', 'm.j', 'w.va', 'chg', 'j.r', 'm.b.a', 'ph.d', 'g.f', 's.a', 'm.d.c', 'oct', 'd.h', 'e', 'u.s.s.r', 'c.i.t', 'cos', '. . ', 's', 'ms', 'a.a', 'f.g', 'h.m', 'prof', 'fri', 'mr', 'ala', 'dec', 's.a.y', 'r.h', 'inc', 's.p.a', 'ltd', 'a.s', 'j.k', 'h', 'a.t', 'j.j', 'jr', 'minn', 'u.n', 'g.d', 'mich', 'c.o.m.b', 'bros', 'r.i', 's.g', 'b.v', 'wis', 'sep', 'ore', 'w.w', 't', 'a.h', 'ky', 'k', 'tues', 'n.d', 'wed', 'col', 'm', 'r.a', 'ok', 'co', 'r', 'f.j', 's.c', 'vs', 'messrs', 'e.m', 'i.m.s', 'ave', 'l.a', 'c.v', 'f', 'u.s.a', 'yr', 'ill', 'adm', 'j.p', 'sen', 'r.j', 'a.c', 'n', 'g.k', 'gen', 'p.m', 'n.v', 'vt', 'rep', 'l', 'j.c', 'b.f', 'capt', 'dr', 'pa', 'e.f', 'nov', 'n.m'}
Length of abbreviations: 158
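The effect of abbreviation handling is easy to see with a naive regex split (a toy example; Punkt itself is considerably more sophisticated):

```python
import re

text = "Dr. Smith arrived. He was late."
# A naive split after every period breaks at the abbreviation:
result = re.split(r"(?<=\.)\s+", text)
print(result)  # ['Dr.', 'Smith arrived.', 'He was late.']
```

Punkt’s abbreviation list exists precisely to avoid this kind of false sentence break.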

When it comes to text segmentation, there are two main options for passing the relevant information to the pipeline function. One can either apply the tokenizer directly and pass the pre-computed (start, end) character indices as text_spans in the SpeechSegment object, or pass the tokenizer to the pipeline function and let it compute the character spans. The example below shows the first option, with pre-computed spans:

from easyaligner.data.datamodel import SpeechSegment

text = text.strip()  # We recommend always stripping leading and trailing whitespace
span_list = list(tokenizer.span_tokenize(text))  # start, end character indices for each sentence
speeches = [[SpeechSegment(speech_id=0, text=text, text_spans=span_list, start=None, end=None)]]

pipeline(
    ...,
    speeches=speeches,
    tokenizer=tokenizer,
)

Pre-computing spans can be helpful for debugging purposes when creating custom tokenizers. Once the tokenizer works as expected, you can simply pass the tokenizer itself, and the character spans will be computed as part of the pipeline.
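One useful debugging check is that each pre-computed span slices the original text to exactly the segment you expect. A sketch of this check, using a hypothetical regex-based span tokenizer in place of Punkt:

```python
import re

def toy_span_tokenize(text: str) -> list[tuple[int, int]]:
    """Toy sentence spans: break after '.', '!' or '?' followed by whitespace or end."""
    spans, start = [], 0
    for m in re.finditer(r"[.!?](?=\s|$)", text):
        spans.append((start, m.end()))
        # Skip the single whitespace character after the sentence-final punctuation
        start = m.end() + 1
    return spans

text = "It was the best of times. It was the worst of times."
# Each span should slice out exactly one sentence:
for s, e in toy_span_tokenize(text):
    print(repr(text[s:e]))
```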

Paragraph tokenization

For long-form texts, such as books, sentence-level tokenization may be too fine-grained. Sentence tokenizers are also fallible and may split sentences at incorrect locations. Such failures can leave a paragraph break in the middle of an AlignmentSegment, which prevents the easysearch interactive search interface from displaying the results with the same formatting as the original text.

easyaligner therefore includes a paragraph_tokenizer that splits on double newlines, to ensure that each AlignmentSegment corresponds to exactly one paragraph:

from easyaligner.text import paragraph_tokenizer

sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
print(paragraph_tokenizer(sample))
[(0, 16), (18, 35), (37, 53)]

Here is the source code for reference:

import re

def paragraph_tokenizer(text: str) -> list[tuple[int, int]]:
    """
    Tokenize text into paragraphs based on double newlines.

    Returns character spans (start, end) for each non-empty paragraph.
    Suitable as a drop-in replacement for PunktTokenizer when paragraph-level
    alignment granularity is desired.

    Parameters
    ----------
    text : str
        The text to tokenize into paragraphs.

    Returns
    -------
    list of tuple[int, int]
        List of (start_char, end_char) spans, one per paragraph.
    """
    spans = []
    start = 0
    for m in re.finditer(r'\r?\n\r?\n', text):
        spans.append((start, m.start()))
        start = m.end()
    spans.append((start, len(text)))
    return [(s, e) for s, e in spans if text[s:e].strip()]

paragraph_tokenizer is a plain callable (i.e. a function). No pre-computation of character spans is needed. Just pass it directly to the pipeline:

from easyaligner.text import paragraph_tokenizer

pipeline(
    ...,
    tokenizer=paragraph_tokenizer,
)

Using this tokenizer, each alignment segment in the output JSON will correspond to one paragraph, and paragraph breaks will appear at the correct locations in the easysearch search interface.

Arbitrary tokenization

The tokenizer argument of the pipeline function accepts any function that takes in a string and outputs a list of (start_char, end_char) tuples. Custom tokenization functions can therefore be defined based on any arbitrary segmentation of the source text.

Here, we create a highly specific tokenizer that splits an ebook text at chapter heading markers such as “CHAPTER I.”, “CHAPTER II.” or “CHAPTER XIV.” (note the trailing period, which the regex below requires).

import re

def chapter_tokenizer(text: str) -> list[tuple[int, int]]:
    """Split text at chapter headings (e.g. 'CHAPTER I.')."""
    spans = []
    boundaries = [m.start() for m in re.finditer(r"CHAPTER [IVXLC]+\.", text)]
    boundaries.append(len(text))
    for start, end in zip(boundaries, boundaries[1:]):
        if text[start:end].strip():
            spans.append((start, end))
    return spans
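A quick check on a toy string (the definition is repeated here so the snippet is self-contained):

```python
import re

def chapter_tokenizer(text: str) -> list[tuple[int, int]]:
    """Split text at chapter headings (e.g. 'CHAPTER I.')."""
    spans = []
    boundaries = [m.start() for m in re.finditer(r"CHAPTER [IVXLC]+\.", text)]
    boundaries.append(len(text))
    for start, end in zip(boundaries, boundaries[1:]):
        if text[start:end].strip():
            spans.append((start, end))
    return spans

book = "Front matter. CHAPTER I. It was the best of times. CHAPTER II. It was the worst."
for s, e in chapter_tokenizer(book):
    print(repr(book[s:e]))
```

Note that any text before the first chapter heading (the front matter above) is not included in any span; extend the boundary list if you also want to align such text.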

Pass it to the pipeline the same way as paragraph_tokenizer:

pipeline(
    ...,
    tokenizer=chapter_tokenizer,
)

Try it for yourself!