StreamingAudioFileDataset

data.dataset.StreamingAudioFileDataset(
    metadata,
    processor,
    audio_dir='data',
    sample_rate=16000,
    chunk_size=30,
    alignment_strategy='chunk',
)

Streaming version of AudioFileDataset that reads audio chunks on demand.

Instead of loading entire audio files and chunking them in memory, this dataset returns a StreamingAudioSliceDataset that lazily decodes each chunk via ffmpeg.
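The lazy decode boils down to asking ffmpeg for just one time window of the file. A minimal sketch of the idea (the flag layout and helper name are assumptions, not the library's internals):

```python
# Hypothetical helper: build an ffmpeg command that decodes only one chunk
# of the file to mono 16-bit PCM on stdout. Seeking with -ss *before* -i
# makes ffmpeg skip straight to the window instead of decoding from zero.
def ffmpeg_chunk_cmd(path, start_s, duration_s, sample_rate=16000):
    return [
        "ffmpeg", "-nostdin", "-v", "error",
        "-ss", f"{start_s:.3f}",    # input-side seek to chunk start (seconds)
        "-t", f"{duration_s:.3f}",  # decode only this many seconds
        "-i", path,
        "-f", "s16le",              # raw signed 16-bit little-endian PCM
        "-ac", "1",                 # mono
        "-ar", str(sample_rate),    # resample to the target rate
        "pipe:1",                   # write to stdout, e.g. for np.frombuffer
    ]

cmd = ffmpeg_chunk_cmd("data/en/example.wav", 30.0, 30.0)
```

Running such a command per chunk (e.g. via `subprocess.run(cmd, capture_output=True)`) keeps only one chunk's samples in memory at a time.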

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| metadata | JSONMetadataDataset or list[AudioMetadata] or AudioMetadata | Metadata source. | required |
| processor | transformers.Wav2Vec2Processor or transformers.WhisperProcessor | Processor for feature extraction. | required |
| audio_dir | str | Base directory for audio files. | 'data' |
| sample_rate | int | Target sample rate for resampling. | 16000 |
| chunk_size | int | Maximum chunk size in seconds (for speech-based chunking). | 30 |
| alignment_strategy | str | 'speech' or 'chunk'; determines how chunks are defined. | 'chunk' |
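To make the two strategies concrete, here is an illustrative sketch (not the library's implementation): 'chunk' tiles the file into fixed windows of `chunk_size` seconds, while 'speech' greedily packs consecutive VAD speech segments into spans no longer than `chunk_size`.

```python
# Hypothetical illustration of the two alignment strategies.
def plan_chunks(duration, chunk_size=30, strategy="chunk", speech_segments=None):
    """Return a list of (start, end) chunk boundaries in seconds."""
    if strategy == "chunk":
        # Fixed windows; the final window is simply shorter.
        bounds, start = [], 0.0
        while start < duration:
            bounds.append((start, min(start + chunk_size, duration)))
            start += chunk_size
        return bounds
    if strategy == "speech":
        # Merge consecutive speech segments while the merged span
        # still fits inside one chunk_size window.
        bounds = []
        for seg_start, seg_end in speech_segments:
            if bounds and seg_end - bounds[-1][0] <= chunk_size:
                bounds[-1] = (bounds[-1][0], seg_end)
            else:
                bounds.append((seg_start, seg_end))
        return bounds
    raise ValueError(f"unknown alignment_strategy: {strategy!r}")
```

With 'chunk', a 65-second file yields windows (0, 30), (30, 60), (60, 65); with 'speech', silence between distant segments is never decoded at all.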

Examples

import torch
from pathlib import Path
from transformers import WhisperProcessor
from easyaligner.data.collators import audiofile_collate_fn
from easyaligner.data.dataset import JSONMetadataDataset
from easytranscriber.data import StreamingAudioFileDataset

AUDIO_DIR = "data/en"
json_paths = sorted(str(p) for p in Path("output/vad").glob("*.json"))

processor = WhisperProcessor.from_pretrained(
    "distil-whisper/distil-large-v3.5", cache_dir="models"
)

json_dataset = JSONMetadataDataset(json_paths=json_paths)

file_dataset = StreamingAudioFileDataset(
    metadata=json_dataset,
    processor=processor,
    audio_dir=AUDIO_DIR,
    sample_rate=16000,
    chunk_size=30,
    alignment_strategy="chunk",
)

file_dataloader = torch.utils.data.DataLoader(
    file_dataset,
    batch_size=1,
    shuffle=False,
    collate_fn=audiofile_collate_fn,
    num_workers=2,
    prefetch_factor=2,
)