from pathlib import Path
from transformers import (
AutoModelForCTC,
Wav2Vec2Processor,
)
from huggingface_hub import snapshot_download
from easyaligner.text import load_tokenizer, text_normalizer
from easyaligner.data.datamodel import SpeechSegment
from easyaligner.pipelines import pipeline
from easyaligner.vad.pyannote import load_vad_model
snapshot_download(
"Lauler/easytranscriber_tutorials",
repo_type="dataset",
local_dir="data/tutorials",
allow_patterns="tale-of-two-cities_align-en/*",
)
text = """
It was the best of times, it was the worst of times, it was the age of
wisdom, it was the age of foolishness, it was the epoch of belief, it
was the epoch of incredulity, it was the season of Light, it was the
season of Darkness, it was the spring of hope, it was the winter of
despair, we had everything before us, we had nothing before us, we were
all going direct to Heaven, we were all going direct the other way--in
short, the period was so far like the present period, that some of its
noisiest authorities insisted on its being received, for good or for
evil, in the superlative degree of comparison only.
"""
text = text.strip()
# The alignments will be organized according to how the text is tokenized
tokenizer = load_tokenizer(language="english") # sentence tokenizer
span_list = list(tokenizer.span_tokenize(text)) # start, end character indices for each sentence
speeches = [[SpeechSegment(speech_id=0, text=text, text_spans=span_list, start=None, end=None)]]
# Load models and run pipeline
model_vad = load_vad_model()
model = (
AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h").to("cuda").half()
)
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
# File(s) to align
audio_files = [file.name for file in Path("data/tutorials/tale-of-two-cities_align-en").glob("*")]
pipeline(
vad_model=model_vad,
emissions_model=model,
processor=processor,
audio_paths=audio_files,
audio_dir="data/tutorials/tale-of-two-cities_align-en",
speeches=speeches,
alignment_strategy="speech",
text_normalizer_fn=text_normalizer,
tokenizer=tokenizer,
start_wildcard=True,
end_wildcard=True,
blank_id=processor.tokenizer.pad_token_id,
word_boundary="|",
)

Overview
easyaligner is a forced alignment library for aligning text transcripts with audio. The aligned outputs can be segmented at any level of granularity (sentence, paragraph, etc.), while preserving the original text’s formatting. Outputs include precise word-level timestamps.
The library supports aligning both ground-truth transcripts and ASR (automatic speech recognition) model outputs. Check out the easytranscriber library for an example where easyaligner is used as a backend to align ASR outputs.
Installation
With GPU support
pip install easyaligner --extra-index-url https://download.pytorch.org/whl/cu128

Using uv
When installing with uv, the appropriate PyTorch build is selected automatically (CPU for macOS, CUDA for Linux/Windows/ARM):
uv pip install easyaligner

Usage
Below is a minimal example of the simplest alignment scenario: a text transcript that covers the full audio. Tutorial 1 demonstrates the same scenario in more detail.
The example below uses a 57-second snippet from a LibriVox recording of A Tale of Two Cities, corresponding to the first paragraph of Chapter I. The text used for alignment is assigned directly to the text variable as a string.
SpeechSegments allow specifying regions of the audio to be processed independently.
You can find out more about how they are used in Tutorial 2 and Tutorial 3.
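The text_spans passed to a SpeechSegment are plain (start, end) character offsets into the original string, which is how the output can preserve the text's original formatting. As a rough illustration of what span tokenization produces (using a simple regex stand-in for demonstration, not the library's actual tokenizer):

```python
import re

text = "It was the best of times. It was the worst of times."

def span_tokenize(text):
    """Naive sentence-span tokenizer: yields (start, end) character
    offsets into `text`. Illustration only; easyaligner's tokenizer
    is more sophisticated."""
    for match in re.finditer(r"[^.!?]+[.!?]?", text):
        start, end = match.span()
        # Trim leading whitespace so each span starts at the sentence.
        while start < end and text[start].isspace():
            start += 1
        if start < end:
            yield (start, end)

spans = list(span_tokenize(text))
for start, end in spans:
    print((start, end), repr(text[start:end]))
```

Slicing the original string with each span recovers the sentence verbatim, whitespace and punctuation included, so timestamps can be mapped back onto the untouched source text.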
Tutorials
We recommend checking out the tutorials for more advanced use cases. The guides show how to handle cases where the ground-truth text only covers part of the spoken content in the audio, and where the relevant audio region is either known in advance, or unknown.
Align text and audio: text transcript covers all the spoken content in the audio.
Known audio region: text covers only part of the spoken content in the audio, but the relevant audio region is known in advance.
Locate relevant audio region: text covers only part of the spoken content in the audio, and the relevant audio region is not known in advance.
Results: Force aligned output
The text transcript below the audio player is highlighted in sync with the words spoken in the audio.
You can click anywhere in the text to jump to that point in the audio. The text is also highlighted when you drag the audio slider!
To browse and search your own alignments with the same synchronized playback, see easysearch.
Outputs
By default, easyaligner outputs a JSON file for each stage of the pipeline (VAD, emissions, forced alignment). The final aligned output can be found in output/alignments. The directory structure will look as follows:
output
├── vad ← SpeechSegments with AudioChunks (VAD boundaries)
├── emissions ← + emission file paths (.npy)
└── alignments ← + AlignmentSegments with word timestamps
Reading the output
Let’s read the final aligned output and print one of the aligned segments. We can use a convenience function from easyaligner that reads the file as an AudioMetadata object.
from easyaligner.data.utils import read_json
from pprint import pprint
results = read_json("output/alignments/taleoftwocities_01_dickens_64kb_align.json")
# Print the 1st aligned segment of the first speech
pprint(results.speeches[0].alignments[0].to_dict())
{'duration': 52.01999,
'end': 56.99155,
'id': '0-0',
'score': 0.97476,
'start': 4.97156,
'text': 'It was the best of times, it was the worst of times, it was the age '
'of\n'
'wisdom, it was the age of foolishness, it was the epoch of belief, '
'it\n'
'was the epoch of incredulity, it was the season of Light, it was '
'the\n'
'season of Darkness, it was the spring of hope, it was the winter of\n'
'despair, we had everything before us, we had nothing before us, we '
'were\n'
'all going direct to Heaven, we were all going direct the other '
'way--in\n'
'short, the period was so far like the present period, that some of '
'its\n'
'noisiest authorities insisted on its being received, for good or '
'for\n'
'evil, in the superlative degree of comparison only.',
'words': [WordSegment(text='It ', start=4.97156, end=5.05162, score=0.63086),
WordSegment(text='was ', start=6.67286, end=6.77294, score=0.99967),
WordSegment(text='the ', start=6.853, end=6.95308, score=0.92004),
WordSegment(text='best ', start=7.27333, end=7.59357, score=0.99961),
WordSegment(text='of ', start=7.73368, end=7.77371, score=0.99951),
WordSegment(text='times, ', start=7.8938, end=8.57433, score=0.96924),
WordSegment(text='it ', start=8.83453, end=8.89457, score=0.99544),
WordSegment(text='was ', start=8.97463, end=9.09473, score=0.99854),
WordSegment(text='the ', start=9.21482, end=9.35493, score=0.93916),
WordSegment(text='worst ', start=9.53506, end=9.9754, score=0.95497),
WordSegment(text='of ', start=10.11551, end=10.15554, score=0.99585),
WordSegment(text='times, ', start=10.29565, end=10.85608, score=0.98682),
WordSegment(text='it ', start=11.17633, end=11.21636, score=0.99683),
WordSegment(text='was ', start=11.25639, end=11.33645, score=0.99919),
WordSegment(text='the ', start=11.3965, end=11.47656, score=1.0),
WordSegment(text='age ', start=11.63668, end=11.83683, score=0.91875),
WordSegment(text='of\n', start=11.9169, end=11.95693, score=1.0),
WordSegment(text='wisdom, ', start=12.15708, end=12.6975, score=0.99548),
WordSegment(text='it ', start=13.55816, end=13.59819, score=0.99951),
WordSegment(text='was ', start=13.63822, end=13.7383, score=0.99646),
WordSegment(text='the ', start=13.81836, end=13.91843, score=0.99414),
WordSegment(text='age ', start=14.03853, end=14.19865, score=1.0),
WordSegment(text='of ', start=14.27871, end=14.31874, score=0.99976),
WordSegment(text='foolishness, ', start=14.61897, end=15.41959, score=0.97705),
WordSegment(text='it ', start=15.57971, end=15.61974, score=0.99927),
WordSegment(text='was ', start=15.67979, end=15.75985, score=0.99951),
WordSegment(text='the ', start=15.83991, end=15.91997, score=0.99939),
WordSegment(text='epoch ', start=16.10011, end=16.4804, score=0.98382),
WordSegment(text='of ', start=16.58048, end=16.62051, score=0.99927),
WordSegment(text='belief, ', start=16.80065, end=17.28102, score=0.99182),
WordSegment(text='it\n', start=17.86147, end=17.9015, score=0.99951),
WordSegment(text='was ', start=17.96154, end=18.0416, score=0.99951),
WordSegment(text='the ', start=18.10165, end=18.18171, score=0.99935),
WordSegment(text='epoch ', start=18.32182, end=18.70211, score=0.98676),
WordSegment(text='of ', start=18.76216, end=18.80219, score=1.0),
WordSegment(text='incredulity, ', start=18.90227, end=20.10319, score=0.99255),
WordSegment(text='it ', start=20.46347, end=20.5035, score=0.99976),
WordSegment(text='was ', start=20.56354, end=20.66362, score=0.99967),
WordSegment(text='the ', start=20.7637, end=20.84376, score=0.97864),
WordSegment(text='season ', start=21.00388, end=21.48425, score=0.94658),
WordSegment(text='of ', start=21.58433, end=21.62436, score=0.99951),
WordSegment(text='Light, ', start=21.90457, end=22.2048, score=0.97893),
WordSegment(text='it ', start=23.12551, end=23.16554, score=0.99951),
WordSegment(text='was ', start=23.20557, end=23.28564, score=0.99886),
WordSegment(text='the\n', start=23.3657, end=23.44576, score=0.98014),
WordSegment(text='season ', start=23.6259, end=24.06624, score=0.97461),
WordSegment(text='of ', start=24.16631, end=24.20634, score=0.99951),
WordSegment(text='Darkness, ', start=24.50657, end=25.38725, score=0.99574),
WordSegment(text='it ', start=25.9677, end=26.02774, score=0.9821),
WordSegment(text='was ', start=26.06777, end=26.18787, score=0.98711),
WordSegment(text='the ', start=26.26793, end=26.368, score=0.99414),
WordSegment(text='spring ', start=26.52813, end=27.1486, score=0.98633),
WordSegment(text='of ', start=27.28871, end=27.34876, score=0.97786),
WordSegment(text='hope, ', start=27.86916, end=28.16939, score=0.99536),
WordSegment(text='it ', start=29.19017, end=29.23021, score=0.99951),
WordSegment(text='was ', start=29.29025, end=29.43036, score=0.91426),
WordSegment(text='the ', start=29.53044, end=29.65053, score=0.90833),
WordSegment(text='winter ', start=29.85068, end=30.41111, score=0.95028),
WordSegment(text='of\n', start=30.53121, end=30.57124, score=1.0),
WordSegment(text='despair, ', start=30.75137, end=31.75214, score=0.997),
WordSegment(text='we ', start=33.33336, end=33.41342, score=0.98991),
WordSegment(text='had ', start=33.47347, end=33.59356, score=0.98547),
WordSegment(text='everything ', start=33.7737, end=34.17401, score=0.99474),
WordSegment(text='before ', start=34.27408, end=34.79448, score=0.99988),
WordSegment(text='us, ', start=34.91458, end=34.99464, score=0.90869),
WordSegment(text='we ', start=35.41496, end=35.49502, score=1.0),
WordSegment(text='had ', start=35.55507, end=35.67516, score=0.99967),
WordSegment(text='nothing ', start=35.89533, end=36.29564, score=0.99396),
WordSegment(text='before ', start=36.39571, end=36.87608, score=0.95527),
WordSegment(text='us, ', start=36.99618, end=37.07624, score=0.99447),
WordSegment(text='we ', start=37.71673, end=37.79679, score=0.88672),
WordSegment(text='were\n', start=37.83682, end=37.97693, score=0.85181),
WordSegment(text='all ', start=38.05699, end=38.1971, score=0.99341),
WordSegment(text='going ', start=38.27716, end=38.57739, score=0.98354),
WordSegment(text='direct ', start=38.63744, end=38.9777, score=0.96343),
WordSegment(text='to ', start=39.05776, end=39.15784, score=0.98568),
WordSegment(text='Heaven, ', start=39.25791, end=39.61819, score=0.97201),
WordSegment(text='we ', start=40.47885, end=40.55891, score=1.0),
WordSegment(text='were ', start=40.61896, end=40.73905, score=0.97734),
WordSegment(text='all ', start=40.75907, end=40.89918, score=0.95488),
WordSegment(text='going ', start=40.95922, end=41.21942, score=0.98551),
WordSegment(text='direct ', start=41.27947, end=41.63975, score=0.98828),
WordSegment(text='the ', start=41.73982, end=41.81988, score=0.92448),
WordSegment(text='other ', start=41.95999, end=42.14013, score=1.0),
WordSegment(text='way--in\n', start=42.22019, end=43.00079, score=1.0),
WordSegment(text='short, ', start=43.16091, end=43.44113, score=0.9668),
WordSegment(text='the ', start=43.68131, end=43.78139, score=0.89551),
WordSegment(text='period ', start=43.88147, end=44.28178, score=0.99772),
WordSegment(text='was ', start=44.36184, end=44.46192, score=0.99919),
WordSegment(text='so ', start=44.62204, end=44.78216, score=0.85278),
WordSegment(text='far ', start=44.98232, end=45.20248, score=0.99984),
WordSegment(text='like ', start=45.28255, end=45.46268, score=1.0),
WordSegment(text='the ', start=45.54275, end=45.62281, score=0.91345),
WordSegment(text='present ', start=45.72288, end=46.10318, score=0.93546),
WordSegment(text='period, ', start=46.20325, end=46.62358, score=0.99717),
WordSegment(text='that ', start=46.86376, end=46.98385, score=0.99609),
WordSegment(text='some ', start=47.06392, end=47.24405, score=0.94922),
WordSegment(text='of ', start=47.3041, end=47.34413, score=1.0),
WordSegment(text='its\n', start=47.42419, end=47.54428, score=0.93579),
WordSegment(text='noisiest ', start=47.70441, end=48.28485, score=0.98397),
WordSegment(text='authorities ', start=48.42496, end=49.08547, score=0.97602),
WordSegment(text='insisted ', start=49.18555, end=49.84605, score=0.97929),
WordSegment(text='on ', start=49.92612, end=49.96615, score=0.99976),
WordSegment(text='its ', start=50.02619, end=50.14629, score=0.95586),
WordSegment(text='being ', start=50.22635, end=50.46653, score=0.99707),
WordSegment(text='received, ', start=50.52658, end=51.08701, score=0.97939),
WordSegment(text='for ', start=51.48732, end=51.60741, score=1.0),
WordSegment(text='good ', start=51.68747, end=51.90764, score=0.99961),
WordSegment(text='or ', start=52.44805, end=52.5081, score=0.92513),
WordSegment(text='for\n', start=52.58816, end=52.70825, score=0.99902),
WordSegment(text='evil, ', start=52.88839, end=53.14859, score=0.95912),
WordSegment(text='in ', start=53.52889, end=53.56892, score=1.0),
WordSegment(text='the ', start=53.62896, end=53.70902, score=0.87646),
WordSegment(text='superlative ', start=53.82912, end=54.74982, score=0.95378),
WordSegment(text='degree ', start=54.82989, end=55.23019, score=0.97207),
WordSegment(text='of ', start=55.29024, end=55.33027, score=1.0),
WordSegment(text='comparison ', start=55.45036, end=56.27099, score=0.99941),
WordSegment(text='only.', start=56.6713, end=56.99155, score=0.9767)]}
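Word-level timestamps like these are straightforward to post-process into other formats. As one example, here is a sketch that groups words into SRT subtitle cues; the words list uses plain dicts (with timings copied from the output above) in place of WordSegment objects, and the grouping size is an arbitrary choice — this is not an easyaligner API:

```python
# Plain dicts standing in for WordSegment objects; timings are taken
# from the example output above.
words = [
    {"text": "It ", "start": 4.97156, "end": 5.05162},
    {"text": "was ", "start": 6.67286, "end": 6.77294},
    {"text": "the ", "start": 6.853, "end": 6.95308},
    {"text": "best ", "start": 7.27333, "end": 7.59357},
    {"text": "of ", "start": 7.73368, "end": 7.77371},
    {"text": "times, ", "start": 7.8938, "end": 8.57433},
]

def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words, max_words=4):
    """Group word timestamps into numbered SRT cues of up to max_words."""
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i : i + max_words]
        text = "".join(w["text"] for w in group).strip()
        cues.append(
            f"{len(cues) + 1}\n"
            f"{srt_timestamp(group[0]['start'])} --> {srt_timestamp(group[-1]['end'])}\n"
            f"{text}\n"
        )
    return "\n".join(cues)

print(words_to_srt(words))
```

Each cue spans from the first word's start to the last word's end, so pauses between words inside a cue are absorbed; a production version might also split cues on long silences.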
Schema
See the reference page for a detailed overview of the data models. Below is a simplified schema of the output after forced alignment of the quick-start example.
AudioMetadata
├── audio_path "taleoftwocities_01_dickens_64kb_align.mp3"
├── sample_rate 16000
├── duration 57.519
├── metadata null
└── speeches[]
└── SpeechSegment
├── speech_id 0
├── start 1.769
├── end 57.052
├── text "It was the best of times, it was the worst of
│ times, it was the age of wisdom..."
├── text_spans [(0, 612)] ← one span: the paragraph is a single sentence (no full stops)
├── duration 55.283
├── audio_frames 884520
├── probs_path "taleoftwocities_01_dickens_64kb_align/0.npy"
├── metadata null
│
├── chunks[] ← VAD segments
│ ├── [0] AudioChunk
│ │ ├── start 1.769
│ │ ├── end 28.162
│ │ ├── duration 26.393
│ │ ├── audio_frames 422280
│ │ └── num_logits 1319
│ └── [1] AudioChunk
│ ├── start 29.039
│ ├── end 57.052
│ ├── duration 28.013
│ ├── audio_frames 448200
│ └── num_logits 1400
│
└── alignments[] ← segment-level, with word timestamps
└── [0] AlignmentSegment
├── id "0-0"
├── start 4.972
├── end 56.992
├── duration 52.020
├── score 0.975
├── text "It was the best of times, it was the worst of
│ times, it was the age of wisdom..."
└── words[]
├── { text: "It ", start: 4.972, end: 5.052, score: 0.631 }
├── { text: "was ", start: 6.673, end: 6.773, score: 1.000 }
├── { text: "the ", start: 6.853, end: 6.953, score: 0.920 }
├── { text: "best ", start: 7.273, end: 7.594, score: 1.000 }
├── { text: "of ", start: 7.734, end: 7.774, score: 1.000 }
├── { text: "times, ", start: 7.894, end: 8.574, score: 0.969 }
└── ... (119 words total)
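Since every word carries a score, the output is convenient to audit. A small sketch that flags low-confidence words, operating on plain dicts in the shape shown in the schema above (the threshold value is an arbitrary choice for illustration):

```python
# Words in the dict shape shown in the schema above (abbreviated).
words = [
    {"text": "It ", "start": 4.972, "end": 5.052, "score": 0.631},
    {"text": "was ", "start": 6.673, "end": 6.773, "score": 1.000},
    {"text": "the ", "start": 6.853, "end": 6.953, "score": 0.920},
    {"text": "best ", "start": 7.273, "end": 7.594, "score": 1.000},
]

# Flag words whose alignment confidence falls below a threshold,
# e.g. to review them manually or re-align that region.
THRESHOLD = 0.9
suspect = [w for w in words if w["score"] < THRESHOLD]

for w in suspect:
    print(f"{w['text'].strip()!r} at {w['start']:.3f}s (score {w['score']:.3f})")

# Average confidence across the words, as a rough quality signal.
mean_score = sum(w["score"] for w in words) / len(words)
print(f"mean word score: {mean_score:.3f}")
```

In the example output above, the opening "It" scores noticeably lower (0.631) than its neighbors, which is typical for words at segment boundaries.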