pipelines.pipeline

pipelines.pipeline(
    vad_model,
    emissions_model,
    transcription_model,
    audio_paths,
    audio_dir,
    backend='ct2',
    speeches=None,
    sample_rate=16000,
    chunk_size=30,
    alignment_strategy='chunk',
    text_normalizer_fn=text_normalizer,
    tokenizer=None,
    language=None,
    task='transcribe',
    beam_size=5,
    max_length=250,
    repetition_penalty=1.0,
    length_penalty=1.0,
    patience=1.0,
    no_repeat_ngram_size=0,
    start_wildcard=False,
    end_wildcard=False,
    blank_id=None,
    word_boundary=None,
    indent=2,
    ndigits=5,
    batch_size_files=1,
    num_workers_files=2,
    prefetch_factor_files=2,
    batch_size_features=8,
    num_workers_features=4,
    streaming=True,
    save_json=True,
    save_msgpack=False,
    save_emissions=True,
    return_alignments=False,
    delete_emissions=False,
    output_vad_dir='output/vad',
    output_transcriptions_dir='output/transcriptions',
    output_emissions_dir='output/emissions',
    output_alignments_dir='output/alignments',
    cache_dir='models',
    hf_token=None,
    device='cuda',
)

Run the full transcription pipeline (VAD -> Transcribe -> Emissions -> Align).

Parameters

Name Type Description Default
vad_model str Voice Activity Detection model: “pyannote” or “silero”. required
emissions_model str Hugging Face model ID for the emissions model (“org_name/model_name”). required
transcription_model str Local path or Hugging Face model ID for the transcription model (“org_name/model_name”). required
audio_paths list List of audio file paths. required
audio_dir str Directory containing audio files. required
speeches list[list[SpeechSegment]] Existing speech segments for alignment. None
backend str Backend to use for the transcription model: “ct2” (CTranslate2) or “hf” (Hugging Face transformers). 'ct2'
sample_rate int Sample rate. 16000
chunk_size int Chunk size in seconds. 30
alignment_strategy str Alignment strategy (‘speech’ or ‘chunk’). 'chunk'
text_normalizer_fn callable Function to normalize text before forced alignment. text_normalizer
tokenizer object An nltk tokenizer or a custom callable tokenizer that takes a string as input and returns a list of tuples (start_char, end_char), marking the spans/boundaries of sentences, paragraphs, or any other text unit of interest. None
language str | None Language code for transcription (e.g. “en”). If None, language detection is performed. None
task str Task for the transcription model: “transcribe” or “translate”. 'transcribe'
beam_size int Number of beams for beam search. Recommended: 5 for ct2 and 1 for hf (beam search is slow in Hugging Face transformers). 5
max_length int Maximum length of the generated text. 250
repetition_penalty float Repetition penalty. See the HF source code for details. 1.0
length_penalty float Length penalty for beam search. See the HF source code for details. 1.0
patience float Beam search patience. Only implemented in ct2. 1.0
no_repeat_ngram_size int Size of n-grams that may not repeat in the generated text; 0 disables the check. 0
start_wildcard bool Add start wildcard to forced alignment. False
end_wildcard bool Add end wildcard to forced alignment. False
blank_id int | None Blank token ID of the emissions model (generally the pad token ID). None
word_boundary str | None Word boundary character of the emissions model (usually “|”). None
indent int JSON indentation. 2
ndigits int Number of digits for rounding. 5
batch_size_files int Batch size for file loading; keeping this at 1 is recommended. 1
num_workers_files int Number of workers for file loading. 2
prefetch_factor_files int Prefetch factor for files. 2
batch_size_features int Batch size for feature extraction. 8
num_workers_features int Number of workers for feature extraction. 4
streaming bool Use streaming mode. True
save_json bool Save results to JSON. True
save_msgpack bool Save results to MessagePack. False
save_emissions bool Save emissions. True
return_alignments bool Return alignment results. False
delete_emissions bool Whether to delete emissions numpy files after processing. False
output_vad_dir str Output directory for VAD. 'output/vad'
output_transcriptions_dir str Output directory for transcriptions. 'output/transcriptions'
output_emissions_dir str Output directory for emissions. 'output/emissions'
output_alignments_dir str Output directory for alignments. 'output/alignments'
cache_dir str Cache directory for transcription and emissions models. 'models'
hf_token str or None Hugging Face authentication token for gated models. None
device str Device to run models on (e.g. “cuda” or “cpu”). 'cuda'
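The tokenizer argument only needs to honor the span contract described above: a callable that takes a string and returns a list of (start_char, end_char) tuples. A minimal sketch of such a custom callable (the name sentence_spans and the regex are illustrative, not part of the library):

```python
import re

def sentence_spans(text: str) -> list[tuple[int, int]]:
    """Illustrative custom tokenizer: return (start_char, end_char) spans
    for sentences ending in ., ! or ? followed by whitespace or end of text."""
    spans = []
    start = 0
    for m in re.finditer(r"[.!?]+(?=\s|$)", text):
        end = m.end()
        spans.append((start, end))
        # Advance past the whitespace to the start of the next sentence.
        rest = text[end:]
        start = end + (len(rest) - len(rest.lstrip()))
    if start < len(text):
        # Trailing text without sentence-final punctuation.
        spans.append((start, len(text)))
    return spans
```

Passing tokenizer=sentence_spans would then drive alignment at sentence granularity; an nltk sentence tokenizer's span_tokenize method satisfies the same contract.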

Examples

from pathlib import Path
from easyaligner.text import load_tokenizer
from easytranscriber.pipelines import pipeline
from easytranscriber.text.normalization import text_normalizer

tokenizer = load_tokenizer("english")
audio_files = [file.name for file in Path("data/en").glob("*.wav")]
pipeline(
    vad_model="pyannote",
    emissions_model="facebook/wav2vec2-base-960h",
    transcription_model="distil-whisper/distil-large-v3.5",
    audio_paths=audio_files,
    audio_dir="data/en",
    backend="ct2",
    language="en", # None to perform language detection
    tokenizer=tokenizer,
    text_normalizer_fn=text_normalizer,
    cache_dir="models",
)
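A custom text_normalizer_fn can be supplied in the same way. A rough sketch of such a function, assuming an English CTC vocabulary of lowercase letters, digits, and apostrophes (the built-in text_normalizer remains the safer default; the name my_normalizer is illustrative):

```python
import re
import unicodedata

def my_normalizer(text: str) -> str:
    # Unicode-normalize and lowercase, then keep only characters that a
    # typical English wav2vec2 vocabulary can emit, collapsing whitespace.
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```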

Returns

Type Description
list[list[SpeechSegment]] or None If return_alignments is True, returns a list of alignment mappings for each audio file. Otherwise, returns None (the alignments are saved to disk only).