Pipelines

The overview guide shows how to run the full transcription pipeline with a single pipeline() call. Under the hood, this runs four stages in sequence: VAD, transcription, emission extraction, and forced alignment.

For more control – e.g. tuning parameters per stage, parallelizing on different hardware, or resuming from intermediate outputs – you can run each stage independently. This page walks through each step.

Setup

We’ll use a longer version of the A Tale of Two Cities dataset (8 chapters) from the same tutorial repository used in the overview.

import torch
from pathlib import Path
from huggingface_hub import snapshot_download

snapshot_download(
    "Lauler/easytranscriber_tutorials",
    repo_type="dataset",
    local_dir="data/tutorials",
    allow_patterns="tale-of-two-cities_long-en/*",
    max_workers=4,
)

AUDIO_DIR = "data/tutorials/tale-of-two-cities_long-en"
audio_files = sorted(file.name for file in Path(AUDIO_DIR).glob("*"))  # sort for a deterministic order
json_paths = [Path(p).with_suffix(".json") for p in audio_files]

Step 1: Voice Activity Detection (VAD)

VAD segments each audio file into regions containing speech. The output is a set of JSON files containing AudioMetadata with SpeechSegment objects, each of which holds a list of AudioChunk objects1.

1 See the output schema for details on the data structure.

from easyaligner.pipelines import vad_pipeline
from easyaligner.vad.pyannote import load_vad_model

model_vad = load_vad_model()

vad_pipeline(
    model=model_vad,
    audio_paths=audio_files,
    audio_dir=AUDIO_DIR,
    chunk_size=30,
    sample_rate=16000,
    batch_size=1,
    num_workers=1, # No need for too many workers
    prefetch_factor=2,
    save_json=True,
    output_dir="output/vad",
)

The chunk_size parameter sets the maximum duration (in seconds) of each VAD chunk when smaller speech segments are merged together2.

2 See merge_chunks in the silero and pyannote VAD implementations.
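As intuition for this merging step, here is a simplified sketch (not the library's actual merge_chunks implementation): consecutive speech segments are greedily concatenated until adding the next one would push the chunk past chunk_size.

```python
# Simplified sketch of VAD chunk merging -- not the library's actual
# merge_chunks implementation. Consecutive (start, end) speech segments
# are greedily combined until the chunk would exceed `chunk_size` seconds.
def merge_segments(segments, chunk_size=30.0):
    chunks = []
    current = None
    for start, end in segments:
        if current is None:
            current = [start, end]
        elif end - current[0] <= chunk_size:
            current[1] = end  # extend the current chunk
        else:
            chunks.append(tuple(current))
            current = [start, end]
    if current is not None:
        chunks.append(tuple(current))  # oversized segments pass through unsplit
    return chunks

print(merge_segments([(0.0, 12.5), (13.0, 24.0), (26.0, 38.0), (40.0, 45.0)]))
# → [(0.0, 24.0), (26.0, 45.0)]
```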

Note

The default VAD model is from pyannote and is gated. You need to accept terms and conditions and authenticate with Hugging Face. Alternatively, use silero VAD (CPU-only, no authentication required):

from easyaligner.vad.silero import load_vad_model
model_vad = load_vad_model()

See the vad_pipeline reference for all parameters.
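After the VAD step it can be worth sanity-checking how much speech was detected before spending GPU time on transcription. A minimal helper, assuming you have already extracted (start, end) pairs from the JSON metadata (the exact key names are described in the output schema):

```python
def total_speech_seconds(segments):
    """Sum the duration of (start, end) speech segments in seconds."""
    return sum(end - start for start, end in segments)

print(total_speech_seconds([(0.0, 24.0), (26.0, 45.0)]))  # → 43.0
```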

Step 2: Transcription

Transcription runs a Whisper model over each VAD chunk to produce text. easytranscriber supports two backends:

  • ctranslate2 — C++ optimized inference for Whisper. Recommended for production use.
  • transformers — Hugging Face’s native implementation.

CTranslate2 backend

import ctranslate2
from transformers import WhisperProcessor
from easyaligner.data.collators import audiofile_collate_fn
from easyaligner.data.dataset import JSONMetadataDataset
from easytranscriber.asr.ct2 import transcribe
from easytranscriber.data import StreamingAudioFileDataset

processor = WhisperProcessor.from_pretrained(
    "distil-whisper/distil-large-v3.5", cache_dir="models"
)
model = ctranslate2.models.Whisper("models/ct2/distil-large-v3.5", device="cuda")

json_dataset = JSONMetadataDataset(
    json_paths=[str(Path("output/vad") / p) for p in json_paths]
)

file_dataset = StreamingAudioFileDataset(
    metadata=json_dataset,
    processor=processor,
    audio_dir=AUDIO_DIR,
    sample_rate=16000,
    chunk_size=30,
    alignment_strategy="chunk",
)

file_dataloader = torch.utils.data.DataLoader(
    file_dataset,
    batch_size=1,
    shuffle=False,
    collate_fn=audiofile_collate_fn,
    num_workers=2,
    prefetch_factor=2,
)

transcribe(
    model=model,
    processor=processor,
    file_dataloader=file_dataloader,
    batch_size=16,
    output_dir="output/transcriptions",
    language="en",
    task="transcribe",
    beam_size=5,
)
Note

easytranscriber converts models from Hugging Face format to CTranslate2 format automatically when you use the unified pipeline() function. When running the decomposed pipeline, you instead need to either (i) convert the model yourself with hf_to_ct2_converter, (ii) download a pre-converted model, or (iii) run pipeline() once so it performs the conversion for you.

Hugging Face backend

from transformers import WhisperForConditionalGeneration, WhisperProcessor
from easytranscriber.asr.hf import transcribe

processor = WhisperProcessor.from_pretrained(
    "distil-whisper/distil-large-v3.5", cache_dir="models"
)
model = WhisperForConditionalGeneration.from_pretrained(
    "distil-whisper/distil-large-v3.5", cache_dir="models"
).to("cuda")

# ... create file_dataloader as above ...

transcribe(
    model=model,
    processor=processor,
    file_dataloader=file_dataloader,
    batch_size=8,
    num_workers=2,
    prefetch_factor=2,
    output_dir="output/transcriptions",
    language="en",
    beam_size=1,  # beam search is slow in HF transformers
    device="cuda",
)

See the transcribe (ct2) and transcribe (hf) reference for all parameters.

Data loading

Both backends require a DataLoader that yields batches of audio chunks. easytranscriber provides StreamingAudioFileDataset, which reads audio chunks on demand via ffmpeg rather than loading entire files into memory. This is the recommended approach for transcribing with predictable memory usage3.

3 Under the hood, each chunk is decoded by an ffmpeg subprocess. The overhead is negligible in most use cases, since GPU inference, not data loading, is the primary bottleneck.

Alternatively, easyaligner provides AudioFileDataset, which loads the full audio file into memory before chunking. This can be slightly faster for some datasets, but uses more memory.
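Conceptually, both dataset classes yield the same fixed-size chunks; they differ only in when the audio bytes are read. The chunking itself can be sketched in a few lines (illustrative only, using a plain list in place of a decoded waveform):

```python
# Split a decoded waveform into fixed-size windows, as an in-memory
# AudioFileDataset-style loader might. Illustrative sketch only.
def chunk_waveform(waveform, sample_rate=16000, chunk_size=30):
    samples_per_chunk = sample_rate * chunk_size
    return [
        waveform[i : i + samples_per_chunk]
        for i in range(0, len(waveform), samples_per_chunk)
    ]

audio = [0.0] * (16000 * 75)  # 75 s of silence at 16 kHz
chunks = chunk_waveform(audio)
print([len(c) / 16000 for c in chunks])  # → [30.0, 30.0, 15.0]
```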

Step 3: Emission extraction

Emission extraction runs a wav2vec2 model over the audio to produce frame-level CTC logits. These emissions are then used for forced alignment in the next step.

from transformers import AutoModelForCTC, Wav2Vec2Processor
from easyaligner.data.dataset import JSONMetadataDataset
from easyaligner.pipelines import emissions_pipeline

model_w2v = AutoModelForCTC.from_pretrained(
    "facebook/wav2vec2-base-960h", cache_dir="models"
).to("cuda").half()
processor_w2v = Wav2Vec2Processor.from_pretrained(
    "facebook/wav2vec2-base-960h", cache_dir="models"
)

json_dataset = JSONMetadataDataset(
    json_paths=[str(Path("output/transcriptions") / p) for p in json_paths]
)

emissions_pipeline(
    model=model_w2v,
    processor=processor_w2v,
    metadata=json_dataset,
    audio_dir=AUDIO_DIR,
    sample_rate=16000,
    chunk_size=30,
    alignment_strategy="chunk",
    batch_size_files=1,
    num_workers_files=2,
    prefetch_factor_files=2,
    batch_size_features=4,
    num_workers_features=4,
    save_json=True,
    save_emissions=True,
    output_dir="output/emissions",
)
Tip

A list of suitable wav2vec2 emission models for different languages can be found in the WhisperX library.

See the emissions_pipeline reference for all parameters.
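With save_emissions=True, each chunk's logits land on disk as a .npy array of shape (frames, vocab_size); wav2vec2 produces roughly one frame per 20 ms of audio. A quick way to inspect one (round-tripping a dummy array here, since the actual file names follow the chunk metadata in the JSON files):

```python
import os
import tempfile

import numpy as np

# Round-trip a dummy emissions array to show the expected layout:
# shape (frames, vocab_size), roughly one frame per 20 ms of audio.
dummy = np.random.randn(1499, 32).astype(np.float16)  # ~30 s chunk
path = os.path.join(tempfile.mkdtemp(), "chunk_000.npy")
np.save(path, dummy)

emissions = np.load(path)
frames, vocab_size = emissions.shape
print(f"{frames} frames x {vocab_size} tokens ~ {frames * 0.02:.1f} s")
# → 1499 frames x 32 tokens ~ 30.0 s
```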

Step 4: Forced alignment

Forced alignment matches the transcribed text to the audio at word level using the CTC emissions from the previous step. This is where word-level and sentence-level timestamps are produced.
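To build intuition for what the aligner consumes: each emission frame scores every token in the wav2vec2 vocabulary, and collapsing repeated tokens and blanks recovers characters together with the frames (and hence times) at which they fire. A minimal greedy sketch with a hypothetical four-token vocabulary (the real aligner searches the full CTC trellis against the transcript rather than taking per-frame argmaxes):

```python
# Greedy CTC collapse: drop blanks and repeats from per-frame argmax ids,
# recording the time at which each new character fires. Illustrative only;
# forced alignment instead searches the full trellis for the transcript.
def greedy_collapse(frame_ids, vocab, blank_id=0, frame_ms=20):
    chars, prev = [], blank_id
    for frame, tok in enumerate(frame_ids):
        if tok != blank_id and tok != prev:
            chars.append((vocab[tok], frame * frame_ms / 1000))
        prev = tok
    return chars

vocab = {0: "<blank>", 1: "h", 2: "i", 3: "|"}  # "|" marks word boundaries
print(greedy_collapse([0, 1, 1, 0, 2, 2, 3, 0], vocab))
# → [('h', 0.02), ('i', 0.08), ('|', 0.12)]
```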

from easyaligner.data.collators import metadata_collate_fn
from easyaligner.data.dataset import JSONMetadataDataset
from easyaligner.pipelines import alignment_pipeline
from easytranscriber.text.normalization import text_normalizer

json_dataset = JSONMetadataDataset(
    json_paths=[str(Path("output/emissions") / p) for p in json_paths]
)
audiometa_loader = torch.utils.data.DataLoader(
    json_dataset,
    batch_size=1,
    num_workers=4,
    prefetch_factor=2,
    collate_fn=metadata_collate_fn,
)

alignment_pipeline(
    dataloader=audiometa_loader,
    text_normalizer_fn=text_normalizer,
    processor=processor_w2v,
    tokenizer=None,  # or pass an nltk PunktTokenizer for sentence segmentation
    emissions_dir="output/emissions",
    output_dir="output/alignments",
    alignment_strategy="chunk",
    start_wildcard=True,
    end_wildcard=True,
    blank_id=0,
    word_boundary="|",
    chunk_size=30,
    ndigits=5,
    save_json=True,
    device="cuda",
)

The text_normalizer_fn preprocesses ASR output before alignment (lowercasing, removing punctuation, etc.). See text processing for details on writing custom normalization functions.
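As a sketch of what such a function might look like (the exact signature alignment_pipeline expects is documented in the text processing guide), this one lowercases, strips punctuation, and collapses whitespace:

```python
import re
import string

# Hypothetical custom normalizer: lowercase, strip punctuation, and
# collapse runs of whitespace before alignment.
def my_normalizer(text: str) -> str:
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(my_normalizer("It was the best of times,  it was the worst of times!"))
# → it was the best of times it was the worst of times
```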

The tokenizer parameter controls how the aligned text is segmented. Passing an nltk PunktTokenizer produces sentence-level alignment segments. See sentence tokenization for details.

See the alignment_pipeline reference for all parameters.

Output

Each step writes JSON files to its output directory. The final aligned output in output/alignments contains the complete data structure with word-level timestamps. See the output schema in the overview for the full structure.

output
├── vad                  ← SpeechSegments with AudioChunks
├── transcriptions       ← + transcribed text per chunk
├── emissions            ← + emission file paths (.npy)
└── alignments           ← + AlignmentSegments with word timestamps

Since each step reads JSON from the previous step’s output directory, you can resume the pipeline from any intermediate stage. For example, if you want to re-run alignment with different parameters, you can skip VAD, transcription, and emission extraction entirely.