# StreamingAudioFileDataset
```python
data.dataset.StreamingAudioFileDataset(
    metadata,
    processor,
    audio_dir='data',
    sample_rate=16000,
    chunk_size=30,
    alignment_strategy='chunk',
)
```

Streaming version of AudioFileDataset that reads audio chunks on demand.
Instead of loading entire audio files and chunking in memory, this dataset returns a StreamingAudioSliceDataset that lazily loads each chunk via ffmpeg.
## Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| metadata | JSONMetadataDataset or list[AudioMetadata] or AudioMetadata | Metadata source. | required |
| processor | transformers.Wav2Vec2Processor or transformers.WhisperProcessor | Processor for feature extraction. | required |
| audio_dir | str | Base directory for audio files. | 'data' |
| sample_rate | int | Target sample rate for resampling. | 16000 |
| chunk_size | int | Maximum chunk size in seconds (for speech-based chunking). | 30 |
| alignment_strategy | str | `'speech'` or `'chunk'`; determines how chunk boundaries are defined. | 'chunk' |
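The interaction of `chunk_size` and `sample_rate` can be illustrated with a small standalone sketch (plain Python, no package imports). Under fixed-window chunking, a file is cut into windows of at most `chunk_size` seconds, each spanning up to `chunk_size * sample_rate` samples; the function below is illustrative only and is not part of the library's API.

```python
# Hypothetical illustration of fixed-window chunking: an audio file of
# `duration` seconds is cut into windows of at most `chunk_size` seconds.
# This mirrors the parameter semantics above but is not library internals.

def chunk_boundaries(duration: float, chunk_size: int = 30,
                     sample_rate: int = 16000) -> list[tuple[int, int]]:
    """Return (start_sample, end_sample) pairs covering `duration` seconds."""
    total_samples = int(duration * sample_rate)
    step = chunk_size * sample_rate  # samples per full chunk
    return [(start, min(start + step, total_samples))
            for start in range(0, total_samples, step)]

# A 75 s file at 16 kHz yields two full 30 s chunks plus a 15 s remainder.
print(chunk_boundaries(75))
# [(0, 480000), (480000, 960000), (960000, 1200000)]
```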
## Examples
```python
import torch
from pathlib import Path
from transformers import WhisperProcessor

from easyaligner.data.collators import audiofile_collate_fn
from easyaligner.data.dataset import JSONMetadataDataset
from easytranscriber.data import StreamingAudioFileDataset

AUDIO_DIR = "data/en"

processor = WhisperProcessor.from_pretrained(
    "distil-whisper/distil-large-v3.5", cache_dir="models"
)

# Build metadata from per-file VAD JSON outputs.
json_dataset = JSONMetadataDataset(
    json_paths=[str(p) for p in Path("output/vad").glob("*.json")]
)

file_dataset = StreamingAudioFileDataset(
    metadata=json_dataset,
    processor=processor,
    audio_dir=AUDIO_DIR,
    sample_rate=16000,
    chunk_size=30,
    alignment_strategy="chunk",
)

file_dataloader = torch.utils.data.DataLoader(
    file_dataset,
    batch_size=1,
    shuffle=False,
    collate_fn=audiofile_collate_fn,
    num_workers=2,
    prefetch_factor=2,
)
```