from pathlib import Path
from transformers import (
AutoModelForCTC,
Wav2Vec2Processor,
)
from huggingface_hub import snapshot_download
from easyaligner.text import load_tokenizer, text_normalizer
from easyaligner.data.datamodel import SpeechSegment
from easyaligner.pipelines import pipeline
from easyaligner.vad.pyannote import load_vad_model
snapshot_download(
"Lauler/easytranscriber_tutorials",
repo_type="dataset",
local_dir="data/tutorials",
allow_patterns="tale-of-two-cities_align-en/*",
)
text = """
It was the best of times, it was the worst of times, it was the age of
wisdom, it was the age of foolishness, it was the epoch of belief, it
was the epoch of incredulity, it was the season of Light, it was the
season of Darkness, it was the spring of hope, it was the winter of
despair, we had everything before us, we had nothing before us, we were
all going direct to Heaven, we were all going direct the other way--in
short, the period was so far like the present period, that some of its
noisiest authorities insisted on its being received, for good or for
evil, in the superlative degree of comparison only.
"""
text = text.strip()
# The alignments will be organized according to how the text is tokenized
tokenizer = load_tokenizer(language="english") # sentence tokenizer
span_list = list(tokenizer.span_tokenize(text)) # start, end character indices for each sentence
speeches = [[SpeechSegment(speech_id=0, text=text, text_spans=span_list, start=None, end=None)]]
# Load models and run pipeline
model_vad = load_vad_model()
model = (
AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h").to("cuda").half()
)
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
# File(s) to align
audio_files = [file.name for file in Path("data/tutorials/tale-of-two-cities_align-en").glob("*")]
pipeline(
vad_model=model_vad,
emissions_model=model,
processor=processor,
audio_paths=audio_files,
audio_dir="data/tutorials/tale-of-two-cities_align-en",
speeches=speeches,
alignment_strategy="speech",
text_normalizer_fn=text_normalizer,
tokenizer=tokenizer,
start_wildcard=True,
end_wildcard=True,
blank_id=processor.tokenizer.pad_token_id,
word_boundary="|",
)

Overview
easyaligner is a forced alignment library for aligning text transcripts with audio. The aligned outputs can be segmented at any level of granularity (sentence, paragraph, etc.), while preserving the original text’s formatting. Outputs include precise word-level timestamps.
The library supports aligning both ground-truth transcripts and ASR (automatic speech recognition) model outputs. Check out the easytranscriber library for an example where easyaligner is used as a backend to align ASR outputs.
Installation
With GPU support
pip install easyaligner --extra-index-url https://download.pytorch.org/whl/cu128

Using uv
When installing with uv, the appropriate PyTorch build is selected automatically (CPU for macOS, CUDA for Linux/Windows/ARM):
uv pip install easyaligner

Usage
Below is a minimal example of the simplest alignment scenario: a text transcript that covers the full audio. Tutorial 1 demonstrates the same scenario in more detail.
The example below uses a 57-second snippet from a LibriVox recording of A Tale of Two Cities, corresponding to the first paragraph of Chapter I. The text used for alignment is assigned directly to the text variable as a string.
SpeechSegments allow specifying regions of the audio to be processed independently.
You can find out more about how they are used in Tutorial 2 and Tutorial 3.
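The text_spans passed to a SpeechSegment are plain (start, end) character offsets into the original string, which is how the output can preserve the text's original formatting. As a rough illustration of what span tokenization produces (using a simple regex stand-in for demonstration, not the library's actual tokenizer):

```python
import re

text = "It was the best of times. It was the worst of times."

def span_tokenize(text):
    """Naive sentence-span tokenizer: yields (start, end) character
    offsets into `text`. Illustration only; easyaligner's tokenizer
    is more sophisticated."""
    for match in re.finditer(r"[^.!?]+[.!?]?", text):
        start, end = match.span()
        # Trim leading whitespace so each span starts at the sentence.
        while start < end and text[start].isspace():
            start += 1
        if start < end:
            yield (start, end)

spans = list(span_tokenize(text))
for start, end in spans:
    print((start, end), repr(text[start:end]))
```

Slicing the original string with each span recovers the sentence verbatim, whitespace and punctuation included, so timestamps can be mapped back onto the untouched source text.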
Tutorials
We recommend checking out the tutorials for more advanced use cases. The guides show how to handle cases where the ground-truth text only covers part of the spoken content in the audio, and where the relevant audio region is either known in advance, or unknown.
Align text and audio: text transcript covers all the spoken content in the audio.
Known audio region: text covers only part of the spoken content in the audio, but the relevant audio region is known in advance.
Locate relevant audio region: text covers only part of the spoken content in the audio, and the relevant audio region is not known in advance.
Results: Force aligned output
The text transcript below the audio player is highlighted in sync with the words spoken in the audio.
You can click anywhere in the text to jump to that point in the audio. The text is also highlighted when you drag the audio slider!
To browse and search your own alignments with the same synchronized playback, see easysearch.
Outputs
By default, easyaligner outputs a JSON file for each stage of the pipeline (VAD, emissions, forced alignment). The final aligned output can be found in output/alignments. The directory structure will look as follows:
output
├── vad ← SpeechSegments with AudioChunks (VAD boundaries)
├── emissions ← + emission file paths (.npy)
└── alignments ← + AlignmentSegments with word timestamps
Reading the output
Let’s read the final aligned output and print one of the aligned segments. We can use a convenience function from easyaligner that reads the file as an AudioMetadata object.
from easyaligner.data.utils import read_json
from pprint import pprint
results = read_json("output/alignments/taleoftwocities_01_dickens_64kb_align.json")
# Print the 1st aligned segment of the first speech
pprint(results.speeches[0].alignments[0].to_dict())
{'duration': 52.01999,
'end': 56.99155,
'id': '0-0',
'score': 0.97476,
'start': 4.97156,
'text': 'It was the best of times, it was the worst of times, it was the age '
'of\n'
'wisdom, it was the age of foolishness, it was the epoch of belief, '
'it\n'
'was the epoch of incredulity, it was the season of Light, it was '
'the\n'
'season of Darkness, it was the spring of hope, it was the winter of\n'
'despair, we had everything before us, we had nothing before us, we '
'were\n'
'all going direct to Heaven, we were all going direct the other '
'way--in\n'
'short, the period was so far like the present period, that some of '
'its\n'
'noisiest authorities insisted on its being received, for good or '
'for\n'
'evil, in the superlative degree of comparison only.',
'words': [WordSegment(text='It ', start=4.97156, end=5.05162, score=0.63086),
WordSegment(text='was ', start=6.67286, end=6.77294, score=0.99967),
WordSegment(text='the ', start=6.853, end=6.95308, score=0.92004),
WordSegment(text='best ', start=7.27333, end=7.59357, score=0.99961),
WordSegment(text='of ', start=7.73368, end=7.77371, score=0.99951),
WordSegment(text='times, ', start=7.8938, end=8.57433, score=0.96924),
WordSegment(text='it ', start=8.83453, end=8.89457, score=0.99544),
WordSegment(text='was ', start=8.97463, end=9.09473, score=0.99854),
WordSegment(text='the ', start=9.21482, end=9.35493, score=0.93916),
WordSegment(text='worst ', start=9.53506, end=9.9754, score=0.95497),
WordSegment(text='of ', start=10.11551, end=10.15554, score=0.99585),
WordSegment(text='times, ', start=10.29565, end=10.85608, score=0.98682),
WordSegment(text='it ', start=11.17633, end=11.21636, score=0.99683),
WordSegment(text='was ', start=11.25639, end=11.33645, score=0.99919),
WordSegment(text='the ', start=11.3965, end=11.47656, score=1.0),
WordSegment(text='age ', start=11.63668, end=11.83683, score=0.91875),
WordSegment(text='of\n', start=11.9169, end=11.95693, score=1.0),
WordSegment(text='wisdom, ', start=12.15708, end=12.6975, score=0.99548),
WordSegment(text='it ', start=13.55816, end=13.59819, score=0.99951),
WordSegment(text='was ', start=13.63822, end=13.7383, score=0.99646),
WordSegment(text='the ', start=13.81836, end=13.91843, score=0.99414),
WordSegment(text='age ', start=14.03853, end=14.19865, score=1.0),
WordSegment(text='of ', start=14.27871, end=14.31874, score=0.99976),
WordSegment(text='foolishness, ', start=14.61897, end=15.41959, score=0.97705),
WordSegment(text='it ', start=15.57971, end=15.61974, score=0.99927),
WordSegment(text='was ', start=15.67979, end=15.75985, score=0.99951),
WordSegment(text='the ', start=15.83991, end=15.91997, score=0.99939),
WordSegment(text='epoch ', start=16.10011, end=16.4804, score=0.98382),
WordSegment(text='of ', start=16.58048, end=16.62051, score=0.99927),
WordSegment(text='belief, ', start=16.80065, end=17.28102, score=0.99182),
WordSegment(text='it\n', start=17.86147, end=17.9015, score=0.99951),
WordSegment(text='was ', start=17.96154, end=18.0416, score=0.99951),
WordSegment(text='the ', start=18.10165, end=18.18171, score=0.99935),
WordSegment(text='epoch ', start=18.32182, end=18.70211, score=0.98676),
WordSegment(text='of ', start=18.76216, end=18.80219, score=1.0),
WordSegment(text='incredulity, ', start=18.90227, end=20.10319, score=0.99255),
WordSegment(text='it ', start=20.46347, end=20.5035, score=0.99976),
WordSegment(text='was ', start=20.56354, end=20.66362, score=0.99967),
WordSegment(text='the ', start=20.7637, end=20.84376, score=0.97864),
WordSegment(text='season ', start=21.00388, end=21.48425, score=0.94658),
WordSegment(text='of ', start=21.58433, end=21.62436, score=0.99951),
WordSegment(text='Light, ', start=21.90457, end=22.2048, score=0.97893),
WordSegment(text='it ', start=23.12551, end=23.16554, score=0.99951),
WordSegment(text='was ', start=23.20557, end=23.28564, score=0.99886),
WordSegment(text='the\n', start=23.3657, end=23.44576, score=0.98014),
WordSegment(text='season ', start=23.6259, end=24.06624, score=0.97461),
WordSegment(text='of ', start=24.16631, end=24.20634, score=0.99951),
WordSegment(text='Darkness, ', start=24.50657, end=25.38725, score=0.99574),
WordSegment(text='it ', start=25.9677, end=26.02774, score=0.9821),
WordSegment(text='was ', start=26.06777, end=26.18787, score=0.98711),
WordSegment(text='the ', start=26.26793, end=26.368, score=0.99414),
WordSegment(text='spring ', start=26.52813, end=27.1486, score=0.98633),
WordSegment(text='of ', start=27.28871, end=27.34876, score=0.97786),
WordSegment(text='hope, ', start=27.86916, end=28.16939, score=0.99536),
WordSegment(text='it ', start=29.19017, end=29.23021, score=0.99951),
WordSegment(text='was ', start=29.29025, end=29.43036, score=0.91426),
WordSegment(text='the ', start=29.53044, end=29.65053, score=0.90833),
WordSegment(text='winter ', start=29.85068, end=30.41111, score=0.95028),
WordSegment(text='of\n', start=30.53121, end=30.57124, score=1.0),
WordSegment(text='despair, ', start=30.75137, end=31.75214, score=0.997),
WordSegment(text='we ', start=33.33336, end=33.41342, score=0.98991),
WordSegment(text='had ', start=33.47347, end=33.59356, score=0.98547),
WordSegment(text='everything ', start=33.7737, end=34.17401, score=0.99474),
WordSegment(text='before ', start=34.27408, end=34.79448, score=0.99988),
WordSegment(text='us, ', start=34.91458, end=34.99464, score=0.90869),
WordSegment(text='we ', start=35.41496, end=35.49502, score=1.0),
WordSegment(text='had ', start=35.55507, end=35.67516, score=0.99967),
WordSegment(text='nothing ', start=35.89533, end=36.29564, score=0.99396),
WordSegment(text='before ', start=36.39571, end=36.87608, score=0.95527),
WordSegment(text='us, ', start=36.99618, end=37.07624, score=0.99447),
WordSegment(text='we ', start=37.71673, end=37.79679, score=0.88672),
WordSegment(text='were\n', start=37.83682, end=37.97693, score=0.85181),
WordSegment(text='all ', start=38.05699, end=38.1971, score=0.99341),
WordSegment(text='going ', start=38.27716, end=38.57739, score=0.98354),
WordSegment(text='direct ', start=38.63744, end=38.9777, score=0.96343),
WordSegment(text='to ', start=39.05776, end=39.15784, score=0.98568),
WordSegment(text='Heaven, ', start=39.25791, end=39.61819, score=0.97201),
WordSegment(text='we ', start=40.47885, end=40.55891, score=1.0),
WordSegment(text='were ', start=40.61896, end=40.73905, score=0.97734),
WordSegment(text='all ', start=40.75907, end=40.89918, score=0.95488),
WordSegment(text='going ', start=40.95922, end=41.21942, score=0.98551),
WordSegment(text='direct ', start=41.27947, end=41.63975, score=0.98828),
WordSegment(text='the ', start=41.73982, end=41.81988, score=0.92448),
WordSegment(text='other ', start=41.95999, end=42.14013, score=1.0),
WordSegment(text='way--in\n', start=42.22019, end=43.00079, score=1.0),
WordSegment(text='short, ', start=43.16091, end=43.44113, score=0.9668),
WordSegment(text='the ', start=43.68131, end=43.78139, score=0.89551),
WordSegment(text='period ', start=43.88147, end=44.28178, score=0.99772),
WordSegment(text='was ', start=44.36184, end=44.46192, score=0.99919),
WordSegment(text='so ', start=44.62204, end=44.78216, score=0.85278),
WordSegment(text='far ', start=44.98232, end=45.20248, score=0.99984),
WordSegment(text='like ', start=45.28255, end=45.46268, score=1.0),
WordSegment(text='the ', start=45.54275, end=45.62281, score=0.91345),
WordSegment(text='present ', start=45.72288, end=46.10318, score=0.93546),
WordSegment(text='period, ', start=46.20325, end=46.62358, score=0.99717),
WordSegment(text='that ', start=46.86376, end=46.98385, score=0.99609),
WordSegment(text='some ', start=47.06392, end=47.24405, score=0.94922),
WordSegment(text='of ', start=47.3041, end=47.34413, score=1.0),
WordSegment(text='its\n', start=47.42419, end=47.54428, score=0.93579),
WordSegment(text='noisiest ', start=47.70441, end=48.28485, score=0.98397),
WordSegment(text='authorities ', start=48.42496, end=49.08547, score=0.97602),
WordSegment(text='insisted ', start=49.18555, end=49.84605, score=0.97929),
WordSegment(text='on ', start=49.92612, end=49.96615, score=0.99976),
WordSegment(text='its ', start=50.02619, end=50.14629, score=0.95586),
WordSegment(text='being ', start=50.22635, end=50.46653, score=0.99707),
WordSegment(text='received, ', start=50.52658, end=51.08701, score=0.97939),
WordSegment(text='for ', start=51.48732, end=51.60741, score=1.0),
WordSegment(text='good ', start=51.68747, end=51.90764, score=0.99961),
WordSegment(text='or ', start=52.44805, end=52.5081, score=0.92513),
WordSegment(text='for\n', start=52.58816, end=52.70825, score=0.99902),
WordSegment(text='evil, ', start=52.88839, end=53.14859, score=0.95912),
WordSegment(text='in ', start=53.52889, end=53.56892, score=1.0),
WordSegment(text='the ', start=53.62896, end=53.70902, score=0.87646),
WordSegment(text='superlative ', start=53.82912, end=54.74982, score=0.95378),
WordSegment(text='degree ', start=54.82989, end=55.23019, score=0.97207),
WordSegment(text='of ', start=55.29024, end=55.33027, score=1.0),
WordSegment(text='comparison ', start=55.45036, end=56.27099, score=0.99941),
WordSegment(text='only.', start=56.6713, end=56.99155, score=0.9767)]}
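Word-level timestamps like these are straightforward to post-process into other formats. As one example, here is a sketch that groups words into SRT subtitle cues; the words list uses plain dicts (with timings copied from the output above) in place of WordSegment objects, and the grouping size is an arbitrary choice — this is not an easyaligner API:

```python
# Plain dicts standing in for WordSegment objects; timings are taken
# from the example output above.
words = [
    {"text": "It ", "start": 4.97156, "end": 5.05162},
    {"text": "was ", "start": 6.67286, "end": 6.77294},
    {"text": "the ", "start": 6.853, "end": 6.95308},
    {"text": "best ", "start": 7.27333, "end": 7.59357},
    {"text": "of ", "start": 7.73368, "end": 7.77371},
    {"text": "times, ", "start": 7.8938, "end": 8.57433},
]

def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words, max_words=4):
    """Group word timestamps into numbered SRT cues of up to max_words."""
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i : i + max_words]
        text = "".join(w["text"] for w in group).strip()
        cues.append(
            f"{len(cues) + 1}\n"
            f"{srt_timestamp(group[0]['start'])} --> {srt_timestamp(group[-1]['end'])}\n"
            f"{text}\n"
        )
    return "\n".join(cues)

print(words_to_srt(words))
```

Each cue spans from the first word's start to the last word's end, so pauses between words inside a cue are absorbed; a production version might also split cues on long silences.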
Schema
See the reference page for a detailed overview of the data models. Below is a simplified schema of the output after forced alignment of the quick-start example.
AudioMetadata
├── audio_path "taleoftwocities_01_dickens_64kb_align.mp3"
├── sample_rate 16000
├── duration 57.519
├── metadata null
└── speeches[]
└── SpeechSegment
├── speech_id 0
├── start 1.769
├── end 57.052
├── text "It was the best of times, it was the worst of
│ times, it was the age of wisdom..."
├── text_spans [(0, 612)] ← one span: the paragraph is a single sentence (no full stops)
├── duration 55.283
├── audio_frames 884520
├── probs_path "taleoftwocities_01_dickens_64kb_align/0.npy"
├── metadata null
│
├── chunks[] ← VAD segments
│ ├── [0] AudioChunk
│ │ ├── start 1.769
│ │ ├── end 28.162
│ │ ├── duration 26.393
│ │ ├── audio_frames 422280
│ │ └── num_logits 1319
│ └── [1] AudioChunk
│ ├── start 29.039
│ ├── end 57.052
│ ├── duration 28.013
│ ├── audio_frames 448200
│ └── num_logits 1400
│
└── alignments[] ← segment-level, with word timestamps
└── [0] AlignmentSegment
├── id "0-0"
├── start 4.972
├── end 56.992
├── duration 52.020
├── score 0.975
├── text "It was the best of times, it was the worst of
│ times, it was the age of wisdom..."
└── words[]
├── { text: "It ", start: 4.972, end: 5.052, score: 0.631 }
├── { text: "was ", start: 6.673, end: 6.773, score: 1.000 }
├── { text: "the ", start: 6.853, end: 6.953, score: 0.920 }
├── { text: "best ", start: 7.273, end: 7.594, score: 1.000 }
├── { text: "of ", start: 7.734, end: 7.774, score: 1.000 }
├── { text: "times, ", start: 7.894, end: 8.574, score: 0.969 }
└── ... (119 words total)
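Since every word carries a score, the output is convenient to audit. A small sketch that flags low-confidence words, operating on plain dicts in the shape shown in the schema above (the threshold value is an arbitrary choice for illustration):

```python
# Words in the dict shape shown in the schema above (abbreviated).
words = [
    {"text": "It ", "start": 4.972, "end": 5.052, "score": 0.631},
    {"text": "was ", "start": 6.673, "end": 6.773, "score": 1.000},
    {"text": "the ", "start": 6.853, "end": 6.953, "score": 0.920},
    {"text": "best ", "start": 7.273, "end": 7.594, "score": 1.000},
]

# Flag words whose alignment confidence falls below a threshold,
# e.g. to review them manually or re-align that region.
THRESHOLD = 0.9
suspect = [w for w in words if w["score"] < THRESHOLD]

for w in suspect:
    print(f"{w['text'].strip()!r} at {w['start']:.3f}s (score {w['score']:.3f})")

# Average confidence across the words, as a rough quality signal.
mean_score = sum(w["score"] for w in words) / len(words)
print(f"mean word score: {mean_score:.3f}")
```

In the example output above, the opening "It" scores noticeably lower (0.631) than its neighbors, which is typical for words at segment boundaries.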