SpeechSegment
`data.datamodel.SpeechSegment()`

A slice of the audio file that contains speech of interest to be aligned.
A SpeechSegment may be a speech given by a single speaker, a dialogue between multiple speakers, a book chapter, or whatever unit of organisational abstraction the user prefers.
If no SpeechSegment is defined, one will automatically be added, treating the entire audio as a single speech.
Attributes
| Name | Type | Description |
|---|---|---|
| start | (float, optional) | Start time of the speech segment in seconds. |
| end | (float, optional) | End time of the speech segment in seconds. |
| text | (str, optional) | Optional text transcription (manual, or created by ASR). |
| text_spans | (list[tuple[int, int]], optional) | Optional (start_char, end_char) indices into the text that allow a custom segmentation of the text to be aligned to audio. Can, for example, be used to perform alignment at paragraph, sentence, or other custom levels of granularity. |
| chunks | list[AudioChunk] | Audio chunks from which wav2vec2 logits are created (if alignment_strategy is 'chunk'). When ASR is used, these chunks will additionally contain the transcribed text of the chunk. The ASR output will be used for forced alignment within the chunk. |
| alignments | list[AlignmentSegment] | Aligned text segments. |
| duration | (float, optional) | Duration of the speech segment in seconds. |
| audio_frames | (int, optional) | Number of audio frames speech segment spans. |
| speech_id | (str or int, optional) | Optional unique identifier for the speech segment. |
| probs_path | (str, optional) | Path to saved wav2vec2 emissions/probs. |
| metadata | (dict, optional) | Optional extra metadata such as speaker name, etc. |
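To illustrate how the attributes above relate, the sketch below uses a minimal stand-in dataclass that mirrors a few of the documented fields (the real class lives in `data.datamodel` and carries additional fields such as `chunks` and `alignments`). The field values, the `duration` derivation, and the span-slicing usage are illustrative assumptions, not the library's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpeechSegment:
    # Illustrative stand-in mirroring a subset of the documented attributes.
    start: Optional[float] = None
    end: Optional[float] = None
    text: Optional[str] = None
    text_spans: Optional[list[tuple[int, int]]] = None
    speech_id: Optional[str] = None
    metadata: Optional[dict] = None

    @property
    def duration(self) -> Optional[float]:
        # Duration in seconds, derivable when start and end are set.
        if self.start is None or self.end is None:
            return None
        return self.end - self.start

# A segment covering 12.0-47.5 s of the audio, with sentence-level text_spans.
seg = SpeechSegment(
    start=12.0,
    end=47.5,
    text="Hello everyone. Welcome to the meeting.",
    text_spans=[(0, 15), (16, 39)],  # (start_char, end_char) per sentence
    speech_id="speech_001",
    metadata={"speaker": "Alice"},
)

# Each span slices out one unit of text to be aligned to audio.
units = [seg.text[s:e] for s, e in seg.text_spans]
print(seg.duration)  # 35.5
print(units)         # ['Hello everyone.', 'Welcome to the meeting.']
```

Sentence-level spans like these let the aligner produce one AlignmentSegment per sentence rather than one per whole speech; paragraph-level spans work the same way with coarser indices.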