normalization.SpanMapNormalizer

text.normalization.SpanMapNormalizer(text)

Apply regex text transformations while keeping track of the character spans in the original text.

Parameters

Name  Type  Description                       Default
text  str   The input text to be normalized.  required

Example

from easyaligner.text.normalization import SpanMapNormalizer

text = '''Book 1. Chapter 1, The Period. It was the best of times. It was the worst of times.
It was the age of wisdom. It was the age of foolishness. It was the epoch of belief.
It was the epoch of incredulity. It was the season of light.
It was the season of darkness. It was the spring of hope.'''

normalizer = SpanMapNormalizer(text)
normalizer.transform(r"[^\w\s]", "")  # Remove punctuation and special characters
normalizer.transform(r"\S+", lambda m: m.group().lower())  # Lowercase
normalizer.transform(r"\s+", " ")  # Normalize whitespace to a single space
normalizer.transform(r"^\s+|\s+$", "")  # Strip leading and trailing whitespace
print(normalizer.current_text)
print(normalizer.get_token_map())

Methods

Name           Description
get_token_map  Tokenize the current text and create a mapping of normalized tokens to the character spans in the original text.
transform      Apply a regex transformation to the current text, while keeping track of the character spans in the original text.
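
The span tracking that these two methods describe can be illustrated with a small standalone sketch. It is not the easyaligner implementation: the class `TinySpanTracker` and its method `token_spans` are hypothetical names invented for this example, and the output format of the real `get_token_map` may differ. The sketch assumes one simple strategy: store an original `(start, end)` span for every character of the current text, and give every character produced by a replacement the span covered by the whole match.

```python
import re


class TinySpanTracker:
    """Hypothetical illustration of span tracking; not the library's code."""

    def __init__(self, text):
        self.original = text
        self.chars = list(text)
        # One (start, end) span in the original text per current character.
        self.spans = [(i, i + 1) for i in range(len(text))]

    @property
    def current_text(self):
        return "".join(self.chars)

    def transform(self, pattern, repl):
        """Apply a regex substitution and carry the original spans along."""
        new_chars, new_spans, pos = [], [], 0
        for m in re.finditer(pattern, self.current_text):
            # Text between matches is copied unchanged, spans included.
            new_chars.extend(self.chars[pos:m.start()])
            new_spans.extend(self.spans[pos:m.start()])
            out = repl(m) if callable(repl) else m.expand(repl)
            if m.end() > m.start():
                covered = (self.spans[m.start()][0], self.spans[m.end() - 1][1])
            else:
                # Zero-width match: anchor the replacement at a single point.
                at_end = m.start() >= len(self.spans)
                edge = len(self.original) if at_end else self.spans[m.start()][0]
                covered = (edge, edge)
            # Every replacement character inherits the span of the whole match.
            new_chars.extend(out)
            new_spans.extend([covered] * len(out))
            pos = m.end()
        new_chars.extend(self.chars[pos:])
        new_spans.extend(self.spans[pos:])
        self.chars, self.spans = new_chars, new_spans

    def token_spans(self):
        """Return (token, start, end) triples with offsets into the original."""
        text = self.current_text
        return [
            (m.group(), self.spans[m.start()][0], self.spans[m.end() - 1][1])
            for m in re.finditer(r"\S+", text)
        ]


tracker = TinySpanTracker("Hello,  World!")
tracker.transform(r"[^\w\s]", "")                       # Remove punctuation
tracker.transform(r"\S+", lambda m: m.group().lower())  # Lowercase
tracker.transform(r"\s+", " ")                          # Collapse whitespace
print(tracker.current_text)   # → hello world
print(tracker.token_spans())  # → [('hello', 0, 5), ('world', 8, 13)]
```

In this sketch, slicing the original string with a returned span recovers the source text of each token, for example `"Hello,  World!"[8:13]` is `"World"`. Deleted characters such as the trailing `!` fall outside the token spans, which is a consequence of the simplified strategy assumed here.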