RixVox: A Swedish Speech Corpus with 5500 Hours of Speech from Parliamentary Debates

KBLab releases RixVox, a speech dataset comprised of 5500 hours of speech from parliamentary debates. The speeches have been aligned with transcripts from written protocols, and contain additional metadata such as the speaker’s gender, electoral district and birth year. RixVox is open and free for everyone to download and use.

Faton Rekathati (https://github.com/Lauler), KBLab (https://www.kb.se/in-english/research-collaboration/kblab.html)
2023-03-09

Automatic Speech Recognition (ASR) systems that convert spoken language to text rely heavily on annotated data to produce the best possible results. Such datasets are unfortunately not widely available for Swedish. The combined total of currently available audio datasets with annotated transcripts for Swedish amounts to somewhere in the hundreds of hours.

To help address this shortage, KBLab releases RixVox, a new Swedish ASR dataset consisting of \(5500\) hours of speech. The data originates from parliamentary debates held between \(2003\) and \(2023\), which were made available via the Swedish Parliament’s open data initiative. KBLab used written protocols to segment speeches from the debates, and subsequently to force align the written transcripts with audio from the speeches. In addition to audio and transcripts, metadata such as the name, gender, birth year, political party, and electoral district of speakers is also available.

RixVox is free and open for anyone to download and use. The dataset is available at https://huggingface.co/datasets/KBLab/rixvox.

RixVox dataset statistics

The RixVox dataset was constructed from parliamentary debates. You can read more about how we segmented speeches from debates and determined their precise start and end location within a debate in a previous article on this blog: “Finding Speeches in the Riksdag’s debates” (Rekathati 2023).

The audio from each speech has been chunked into smaller snippets suitable for training ASR models. Each observation is up to 30 seconds in length, and consists of one or several sentences from the written transcript of a speech, along with the corresponding audio. The dataset consists of a total of \(5493.6\) hours of speech. There are \(1194\) different speakers represented in the data. The average duration of an observation is \(23.68\) seconds. In the table below, we present the distribution of observations across the train, validation and test splits of RixVox, along with some summary statistics for each split.

Dataset split | Observations | Total duration of speech (hours) | Average duration per observation (seconds) | Number of speakers
Train | 818227 | 5383 | 23.69 | 1165
Validation | 7933 | 52 | 23.50 | 18
Test | 8884 | 59 | 23.74 | 11

The dataset splits were created by sampling speakers until a threshold was reached in terms of total duration of speech. For the training set, we randomly sampled speakers until \(98\%\) of the total duration of the RixVox dataset was reached (\(5384\) hours). For the test and validation sets, we randomly sampled speakers until each set filled up a bucket of \(1\%\) of the total duration of the entire dataset. Because sampling is done per speaker, no speaker appears in more than one split.
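The speaker-level sampling described above can be sketched roughly as follows. This is a minimal illustration, not the code used to build RixVox: the function name `split_by_speaker`, the fixed seed, and the choice to fill the training bucket first, then validation, and give the remaining speakers to the test set are all assumptions.

```python
import random


def split_by_speaker(hours_per_speaker, train_share=0.98, valid_share=0.01, seed=0):
    """Assign whole speakers to splits until each duration bucket is full.

    hours_per_speaker maps a speaker id to that speaker's total hours.
    Returns (splits, filled): speaker lists and accumulated hours per split.
    """
    speakers = sorted(hours_per_speaker)
    random.Random(seed).shuffle(speakers)  # random speaker order

    total = sum(hours_per_speaker.values())
    # Buckets are filled in this order; leftover speakers go to "test".
    targets = [("train", train_share * total), ("validation", valid_share * total)]

    splits = {"train": [], "validation": [], "test": []}
    filled = {"train": 0.0, "validation": 0.0, "test": 0.0}

    bucket = 0
    for speaker in speakers:
        # Move on to the next bucket once the current one has reached its target.
        while bucket < len(targets) and filled[targets[bucket][0]] >= targets[bucket][1]:
            bucket += 1
        name = targets[bucket][0] if bucket < len(targets) else "test"
        splits[name].append(speaker)
        filled[name] += hours_per_speaker[speaker]
    return splits, filled
```

Since a bucket only closes after the speaker that pushes it over the threshold has been added, the realised durations will rarely match the targets exactly.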

Let’s also take a look at the gender distribution of speakers. We have \(602\) men, \(519\) women, and \(73\) speakers for whom this metadata is missing.

Looking at the total duration of speech for each gender, we see a similar distribution. Among speakers with known gender, \(46.3\%\) are women, and \(44.3\%\) of the total duration of speeches in RixVox is made up of women speaking.
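As a quick check on the speaker share, the figure of \(46.3\%\) is reproduced from the counts given earlier if the \(73\) speakers with missing gender metadata are left out of the denominator (this exclusion is our reading of the numbers, not something stated explicitly):

\[
\frac{519}{602 + 519} = \frac{519}{1121} \approx 0.463
\]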

Most and least intelligible electoral districts

Each observation in our dataset belongs to a speech in a debate. After segmenting the speeches from debate audio files, we machine transcribed every speech using KBLab’s wav2vec2-large-voxrex-swedish model (Malmsten, Haffenden, and Börjeson 2022). We then calculated the BLEU score to measure the correspondence between the machine generated transcription and the official written transcript. A high BLEU score indicates there’s a higher correspondence, or overlap, between the machine generated transcript and the official transcript. This may indicate that ASR systems find certain regions easier to transcribe, or may alternatively indicate that the people who transcribe the speeches tend to rephrase or reword written transcripts of speeches from these districts.
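To make the comparison concrete, here is a minimal sketch of a sentence-level BLEU score in plain Python. It is a simplified illustration, not the implementation used for RixVox: it assumes lowercased whitespace tokenization, n-grams up to length 4, and no smoothing, so any pair of texts without a shared 4-gram scores 0. The example sentences are made up.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(reference, hypothesis, max_n=4):
    """Unsmoothed BLEU of a hypothesis against a single reference, in [0, 1]."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref or not hyp:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))

    # Brevity penalty: penalise hypotheses that are shorter than the reference.
    brevity = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(sum(log_precisions) / max_n)


official = "jag vill tacka ministern för svaret på min fråga"
machine = "jag vill tacka ministern för svaret på min fråga idag"
score = bleu(official, machine)  # below 1.0 because of the extra word
```

Because the score only measures n-gram overlap, it drops both when the ASR model makes mistakes and when the official transcript has been edited relative to what was actually said.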

The speakers with the highest scores are those for whom the district is missing (NA). These are mostly government ministers who have never been members of parliament. The least intelligible electoral districts are southern Skåne, Gotland, and Malmö municipality (also in southern Skåne).

Longest total duration speaker

Which speakers have spent the most time on the Riksdag Chamber’s podium? The table below shows that Morgan Johansson is the undisputed \(\#1\) debater in terms of total duration of speech.

Method of creation

Before RixVox could be created, we needed to accurately segment speeches from debates. In other words: locate where each speech started and ended within the debate audio file. The most cumbersome parts of the preliminary work undertaken to segment speeches from debates are described in our previous article “Finding Speeches in the Riksdag’s Debates” (Rekathati 2023). We recommend reading this article for background on the speech segmentation.

Quality filtering

Once the speeches were segmented, the remaining work consisted of performing some quality filtering based on simple heuristics, aligning the written transcripts with the audio on a sentence level, adding metadata about the speakers, and finally converting the alignments to short snippets up to 30 seconds in length (a suitable format for training ASR models).

The first round of quality filters applied on speeches can be found in the following lines of code. These include:

The second round of quality filters was applied after another fuzzy string matching sanity check was performed. This time, instead of fuzzy string matching the text of a written transcript against the machine transcription of an entire debate, we fuzzy string matched the text of the written transcript against the machine transcription of the segmented speech, as predicted by our speaker diarization. A short summary of the second round of quality filters follows:

See the following lines of code for a full list of the filtering conditions.

The above filtering procedures reduced the number of speeches to be included in RixVox from about 122k speeches to 115k speeches.
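As an illustration of the fuzzy string matching check described above, the sketch below compares an official transcript with a machine transcription using Python's standard `difflib`. The article does not say which matching library or thresholds were used, so the normalisation step, the 0 to 100 score scale, and the `min_score` cutoff are all hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def fuzzy_score(official, machine):
    """Similarity between two transcripts on a 0-100 scale."""
    matcher = SequenceMatcher(None, normalize(official), normalize(machine), autojunk=False)
    return 100 * matcher.ratio()


def keep_speech(official, machine, min_score=80.0):
    """Keep a speech only if the two transcripts agree closely enough.

    min_score is an illustrative threshold, not the one used for RixVox.
    """
    return fuzzy_score(official, machine) >= min_score
```

A speech whose segmented audio was cut at the wrong place would typically produce a machine transcription that diverges from the official text, and a filter of this kind would drop it.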

Forced alignment

Once we had high confidence in the remaining set of predictions, we proceeded to align the written protocols with the audio. This was done by:

  1. Sentence tokenizing the written transcripts.
  2. Using the aeneas library to force align the audio with the text on the sentence level.

The aeneas library outputs the predicted start and end location of each sentence within the speech.
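The sketch below shows one way to read such alignments back into Python. It assumes the alignment was exported in the JSON sync map format of aeneas, in which each fragment carries `begin`, `end` and `lines` fields; the helper name `read_sync_map` and the example fragments are made up for illustration.

```python
import json


def read_sync_map(sync_map_json):
    """Turn an aeneas-style JSON sync map into a list of sentence alignments."""
    data = json.loads(sync_map_json)
    sentences = []
    for fragment in data["fragments"]:
        sentences.append({
            "text": " ".join(fragment["lines"]),
            "start": float(fragment["begin"]),  # seconds from the start of the speech
            "end": float(fragment["end"]),
        })
    return sentences


# Made-up example of what an aligned speech could look like.
example = """
{"fragments": [
  {"id": "f000001", "begin": "0.000", "end": "1.520", "lines": ["Fru talman!"]},
  {"id": "f000002", "begin": "1.520", "end": "7.840", "lines": ["Jag vill tacka ministern för svaret."]}
]}
"""
alignments = read_sync_map(example)
```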

We recommend reading the master’s thesis “Automatic Annotation of Speech: Exploring Boundaries within Forced Alignment for Swedish and Norwegian” (Biczysko 2022) for an excellent review of available forced alignment tools for Swedish and Norwegian.

Creating 30s observation snippets

In the final step, we concatenate sentences from the same speech that follow one another up to a maximum length of \(30\) seconds per observation. The observations in RixVox are thus composed of either a single sentence, or several sentences in order within a speech up until the “bucket” fills up to the threshold of \(30\) seconds.

We remove the first sentence of each speech, as transcriptions tend to add a “Fru talman!” or “Herr talman!” here as a matter of formality, regardless of whether this was uttered by the speaker or not.
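The two steps in this section, dropping the first sentence and then filling buckets of up to 30 seconds, can be sketched as a greedy loop over the sentence alignments. This is an illustration under assumptions rather than the project's actual code: sentences are assumed to be dicts with `text`, `start` and `end` (in seconds), and a single sentence longer than the limit is simply skipped, which may differ from how RixVox handles that case.

```python
def merge(sentences):
    """Combine consecutive sentences into one observation."""
    start, end = sentences[0]["start"], sentences[-1]["end"]
    return {
        "text": " ".join(s["text"] for s in sentences),
        "start": start,
        "end": end,
        "duration": end - start,
    }


def build_observations(sentences, max_seconds=30.0, drop_first=True):
    """Greedily group consecutive sentences into observations of at most max_seconds."""
    if drop_first:
        sentences = sentences[1:]  # formal opening such as "Fru talman!"

    observations, bucket = [], []
    for sentence in sentences:
        if sentence["end"] - sentence["start"] > max_seconds:
            # Assumption: a sentence that cannot fit in any bucket is left out.
            if bucket:
                observations.append(merge(bucket))
                bucket = []
            continue
        # Close the bucket if adding this sentence would exceed the limit.
        if bucket and sentence["end"] - bucket[0]["start"] > max_seconds:
            observations.append(merge(bucket))
            bucket = []
        bucket.append(sentence)
    if bucket:
        observations.append(merge(bucket))
    return observations
```

Measuring the span from the start of the first sentence to the end of the last one means that pauses between sentences count towards the 30 second limit, which matches how the audio snippet would be cut.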

Dataset card

RixVox has a dataset card on Hugging Face, where you can find more details about the dataset, its features, and how to download and use it. You can also preview the first 100 observations of the train, validation and test sets in the dataset viewer.

Acknowledgements

Part of this development work was carried out within the frame of the infrastructural project HUMINFRA.

Code

The code for reproducing results in this article can be found on https://github.com/kb-labb/riksdagen_anforanden.

References

Biczysko, Klaudia. 2022. “Automatic Annotation of Speech: Exploring Boundaries Within Forced Alignment for Swedish and Norwegian.” Master’s thesis, Uppsala University, Department of Linguistics and Philology.
Malmsten, Martin, Chris Haffenden, and Love Börjeson. 2022. “Hearing Voices at the National Library – a Speech Corpus and Acoustic Model for the Swedish Language.” https://arxiv.org/abs/2205.03026.
Rekathati, Faton. 2023. “The KBLab Blog: Finding Speeches in the Riksdag’s Debates.” https://kb-labb.github.io/posts/2023-02-15-finding-speeches-in-the-riksdags-debates/.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Citation

For attribution, please cite this work as

Rekathati (2023, March 9). The KBLab Blog: RixVox: A Swedish Speech Corpus with 5500 Hours of Speech from Parliamentary Debates. Retrieved from https://kb-labb.github.io/posts/2023-03-09-rixvox-a-swedish-speech-corpus/

BibTeX citation

@misc{rekathati2023rixvox,
  author = {Rekathati, Faton},
  title = {The KBLab Blog: RixVox: A Swedish Speech Corpus with 5500 Hours of Speech from Parliamentary Debates},
  url = {https://kb-labb.github.io/posts/2023-03-09-rixvox-a-swedish-speech-corpus/},
  year = {2023}
}