SpeechSegment

data.datamodel.SpeechSegment()

A slice of the audio file that contains speech of interest to be aligned.

A SpeechSegment may be a speech given by a single speaker, a dialogue between multiple speakers, a book chapter, or whatever unit of organisational abstraction the user prefers.

If no SpeechSegment is defined, one will automatically be added, treating the entire audio file as a single speech segment.
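As a rough illustration of how such a segment might be constructed, the sketch below defines a minimal stand-in dataclass with the documented attributes (the real `data.datamodel.SpeechSegment` may define additional fields and behaviour; the values shown are made up):

```python
from dataclasses import dataclass, field
from typing import Optional, Union

# Minimal illustrative stand-in for data.datamodel.SpeechSegment;
# field names follow the attribute table in this reference.
@dataclass
class SpeechSegment:
    start: Optional[float] = None            # start time in seconds
    end: Optional[float] = None              # end time in seconds
    text: Optional[str] = None               # transcription (manual or ASR)
    text_spans: Optional[list] = None        # (start_char, end_char) indices
    speech_id: Union[str, int, None] = None  # unique identifier
    metadata: dict = field(default_factory=dict)

# A segment covering one speech by a single speaker (hypothetical values):
seg = SpeechSegment(
    start=12.5,
    end=98.0,
    text="Ladies and gentlemen ...",
    speech_id="speech_001",
    metadata={"speaker": "Jane Doe"},
)
```

If no segment is supplied, the library would fall back to one segment spanning the whole file, i.e. `start=0.0` and `end` equal to the audio duration.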

Attributes

| Name | Type | Description |
| --- | --- | --- |
| `start` | `float`, optional | Start time of the speech segment in seconds. |
| `end` | `float`, optional | End time of the speech segment in seconds. |
| `text` | `str`, optional | Text transcription (manual, or produced by ASR). |
| `text_spans` | `list[tuple[int, int]]`, optional | `(start_char, end_char)` character indices into `text` that define a custom segmentation of the text to be aligned to the audio. Can, for example, be used to perform alignment at paragraph, sentence, or other levels of granularity. |
| `chunks` | `list[AudioChunk]` | Audio chunks from which the wav2vec2 logits are created (when `alignment_strategy` is `'chunk'`). When ASR is used, each chunk also contains its transcribed text, which is used for forced alignment within the chunk. |
| `alignments` | `list[AlignmentSegment]` | Aligned text segments. |
| `duration` | `float`, optional | Duration of the speech segment in seconds. |
| `audio_frames` | `int`, optional | Number of audio frames the speech segment spans. |
| `speech_id` | `str` or `int`, optional | Unique identifier for the speech segment. |
| `probs_path` | `str`, optional | Path to saved wav2vec2 emissions/probabilities. |
| `metadata` | `dict`, optional | Extra metadata, such as the speaker name. |
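The `text_spans` attribute can be sketched with a simple sentence segmentation: each span is a pair of character indices into `text`, so slicing `text` with a span recovers the unit to be aligned. The regex split below is only an illustration, not the library's segmentation method:

```python
import re

text = "First sentence. Second sentence. Third one."

# (start_char, end_char) indices for each sentence, skipping leading
# whitespace; text[a:b] recovers the sentence for span (a, b).
text_spans = [(m.start(), m.end())
              for m in re.finditer(r"[^.\s][^.]*\.?", text)]

sentences = [text[a:b] for a, b in text_spans]
```

Passing spans like these in a SpeechSegment would make the aligner produce one alignment per sentence rather than one for the whole transcription.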