Tutorial 1: Simple case

This tutorial will demonstrate the most straightforward case for aligning text and audio. It assumes we have access to a transcript that corresponds to everything spoken in the audio.

For the tutorial, we will be using:

easyaligner can handle some extra speech at the beginning and end of the audio with its start_wildcard and end_wildcard arguments. For best results, however, the transcript should cover as much of the spoken audio as possible.

NoteScenario

A text transcript that covers (almost) everything that is spoken in the audio.

Download text and audio

Let’s begin with downloading the book from Project Gutenberg, and then apply a regular expression to extract the first chapter’s text.

import re
import requests

url = "https://www.gutenberg.org/cache/epub/98/pg98.txt"
full_text = requests.get(url).text

# Extract Chapter I text (i.e. the text between "CHAPTER I. The Period" and "CHAPTER II.")
match = re.search(
    r"(?<=CHAPTER I\.\r\nThe Period\r\n)[\s\S]+?(?=CHAPTER II\.)",
    full_text,
)
text = match.group().strip()

Before we move on to forced alignment, here’s the text of the first chapter for reference. Note especially the paragraph breaks and punctuation. One of easyaligner’s main features is that it will retain the original text’s formatting after alignment.

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way–in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.

There were a king with a large jaw and a queen with a plain face, on the throne of England; there were a king with a large jaw and a queen with a fair face, on the throne of France. In both countries it was clearer than crystal to the lords of the State preserves of loaves and fishes, that things in general were settled for ever.

It was the year of Our Lord one thousand seven hundred and seventy-five. Spiritual revelations were conceded to England at that favoured period, as at this. Mrs. Southcott had recently attained her five-and-twentieth blessed birthday, of whom a prophetic private in the Life Guards had heralded the sublime appearance by announcing that arrangements were made for the swallowing up of London and Westminster. Even the Cock-lane ghost had been laid only a round dozen of years, after rapping out its messages, as the spirits of this very year last past (supernaturally deficient in originality) rapped out theirs. Mere messages in the earthly order of events had lately come to the English Crown and People, from a congress of British subjects in America: which, strange to relate, have proved more important to the human race than any communications yet received through any of the chickens of the Cock-lane brood.

France, less favoured on the whole as to matters spiritual than her sister of the shield and trident, rolled with exceeding smoothness down hill, making paper money and spending it. Under the guidance of her Christian pastors, she entertained herself, besides, with such humane achievements as sentencing a youth to have his hands cut off, his tongue torn out with pincers, and his body burned alive, because he had not kneeled down in the rain to do honour to a dirty procession of monks which passed within his view, at a distance of some fifty or sixty yards. It is likely enough that, rooted in the woods of France and Norway, there were growing trees, when that sufferer was put to death, already marked by the Woodman, Fate, to come down and be sawn into boards, to make a certain movable framework with a sack and a knife in it, terrible in history. It is likely enough that in the rough outhouses of some tillers of the heavy lands adjacent to Paris, there were sheltered from the weather that very day, rude carts, bespattered with rustic mire, snuffed about by pigs, and roosted in by poultry, which the Farmer, Death, had already set apart to be his tumbrils of the Revolution. But that Woodman and that Farmer, though they work unceasingly, work silently, and no one heard them as they went about with muffled tread: the rather, forasmuch as to entertain any suspicion that they were awake, was to be atheistical and traitorous.

In England, there was scarcely an amount of order and protection to justify much national boasting. Daring burglaries by armed men, and highway robberies, took place in the capital itself every night; families were publicly cautioned not to go out of town without removing their furniture to upholsterers’ warehouses for security; the highwayman in the dark was a City tradesman in the light, and, being recognised and challenged by his fellow-tradesman whom he stopped in his character of “the Captain,” gallantly shot him through the head and rode away; the mail was waylaid by seven robbers, and the guard shot three dead, and then got shot dead himself by the other four, “in consequence of the failure of his ammunition:” after which the mail was robbed in peace; that magnificent potentate, the Lord Mayor of London, was made to stand and deliver on Turnham Green, by one highwayman, who despoiled the illustrious creature in sight of all his retinue; prisoners in London gaols fought battles with their turnkeys, and the majesty of the law fired blunderbusses in among them, loaded with rounds of shot and ball; thieves snipped off diamond crosses from the necks of noble lords at Court drawing-rooms; musketeers went into St. Giles’s, to search for contraband goods, and the mob fired on the musketeers, and the musketeers fired on the mob, and nobody thought any of these occurrences much out of the common way. In the midst of them, the hangman, ever busy and ever worse than useless, was in constant requisition; now, stringing up long rows of miscellaneous criminals; now, hanging a housebreaker on Saturday who had been taken on Tuesday; now, burning people in the hand at Newgate by the dozen, and now burning pamphlets at the door of Westminster Hall; to-day, taking the life of an atrocious murderer, and to-morrow of a wretched pilferer who had robbed a farmer’s boy of sixpence.

All these things, and a thousand like them, came to pass in and close upon the dear old year one thousand seven hundred and seventy-five. Environed by them, while the Woodman and the Farmer worked unheeded, those two of the large jaws, and those other two of the plain and the fair faces, trod with stir enough, and carried their divine rights with a high hand. Thus did the year one thousand seven hundred and seventy-five conduct their Greatnesses, and myriads of small creatures–the creatures of this chronicle among the rest–along the roads that lay before them.

Next, we download the audio of the first chapter and save it to disk.

from huggingface_hub import snapshot_download

snapshot_download(
    "Lauler/easytranscriber_tutorials",
    repo_type="dataset",
    local_dir="data/tutorials",
    allow_patterns="tale-of-two-cities_short-en/*",
)

Here’s the audio we’ll be aligning with the text:

Sample audio
A Tale of Two Cities — Chapter 1 (LibriVox)

Align text and audio

The process for aligning the text and audio looks similar to the simple example in the overview section. This time around, however, we will be aligning the entire first chapter of A Tale of Two Cities, as opposed to only the first paragraph.

from pathlib import Path

from transformers import (
    AutoModelForCTC,
    Wav2Vec2Processor,
)

from easyaligner.text import load_tokenizer
from easyaligner.data.datamodel import SpeechSegment
from easyaligner.pipelines import pipeline
from easyaligner.text import text_normalizer
from easyaligner.vad.pyannote import load_vad_model

text = text.strip()

# The alignments will be organized according to how the text is tokenized
tokenizer = load_tokenizer(language="english")  # sentence tokenizer
span_list = list(tokenizer.span_tokenize(text))  # start, end character indices for each sentence
speeches = [[SpeechSegment(speech_id=0, text=text, text_spans=span_list, start=None, end=None)]]

# Load models and run pipeline
model_vad = load_vad_model()
model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h").to("cuda").half()
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

# File(s) to align
audio_dir = "data/tutorials/tale-of-two-cities_short-en"
audio_files = [file.name for file in Path(audio_dir).glob("*")]

pipeline(
    vad_model=model_vad,
    emissions_model=model,
    processor=processor,
    audio_paths=audio_files,
    audio_dir=audio_dir,
    speeches=speeches,
    alignment_strategy="speech",
    text_normalizer_fn=text_normalizer,
    tokenizer=tokenizer,
    start_wildcard=True,
    end_wildcard=True,
    blank_id=processor.tokenizer.pad_token_id,
    word_boundary="|",
)
Tip

A list of suitable emission models for different languages can be found in the WhisperX library.

If your audio and transcript are multilingual, you can try using the Massively Multilingual Speech model from Meta: mms-1b-all.

Result: Force aligned output

The text transcript below the audio player is highlighted in sync with the words spoken in the audio.

Sample audio
A Tale of Two Cities — Chapter 1 (LibriVox)

Tip

You can click anywhere in the text to jump to that point in the audio. The text is also highlighted when you drag the audio slider!