Welcome KB-Whisper, a new fine-tuned Swedish Whisper model!

KBLab proudly presents KB-Whisper, a speech-to-text model fine-tuned using 50,000 hours of transcribed speech. We report overall improvements across all model sizes compared to OpenAI's Whisper evaluated on Swedish. Most notably, we report an average 47% reduction in WER comparing our best-performing model to OpenAI's Whisper-large-v3, in evaluations across FLEURS, Common Voice, and NST.

Authors: Leonora Vesterbacka, Faton Rekathati, Robin Kurtz, Justyna Sikora, Agnes Toftgård

Affiliation: KBLab, National Library of Sweden

Published: March 7, 2025


Improving Swedish speech recognition

KBLab proudly presents KB-Whisper, a speech-to-text model fine-tuned using 50,000 hours of transcribed speech. Traditionally, automatic speech recognition (ASR) systems have been based on models that require either extensive unsupervised pretraining or supervised training on very high-quality orthographic transcripts, which are rare and expensive to produce.

The Whisper model (Radford et al. 2022), originally released by OpenAI, has revolutionized automatic speech recognition by showing that high performance can be achieved with a slight decrease in transcript quality, thus unlocking large amounts of training data that had hitherto gone unused. Subtitles for TV often use abbreviations to fit the text on the screen and are not considered gold-standard transcriptions. However, Radford et al. (2022) showed that this training data, extracted from the web, was still good enough for Whisper to learn from.

The massive improvement with Whisper has been demonstrated for English, a result that rests on the vast amount of English training data available on the web. For languages with fewer speakers, this type of data is less represented on the web, leading to poorer performance. To bridge this performance gap, the team at KBLab has constructed a training dataset of transcribed Swedish speech of unprecedented size, which is used to fine-tune Whisper models.

The team behind KB-Whisper. Back row: Agnes Toftgård, Robin Kurtz, Justyna Sikora. Front row: Leonora Vesterbacka, Faton Rekathati. Photography: Lina Löfström Baker/KB

Subtitles, Parliament recordings and dialect archives

The National Library of Sweden is responsible for collecting, preserving and giving access to everything that is published in Sweden. The collections include the audiovisual archives, which hold TV broadcast in Sweden. Swedish subtitles paired with spoken Swedish from TV broadcasts constitute a large portion of the training data. Parliament recordings paired with high-quality transcripts in the form of protocols provide the second-largest source of training data. This dataset is made publicly available on Hugging Face and constitutes 23,000 hours of transcribed Swedish speech.

Both of these data sources have the advantage of covering wide variation in spoken Swedish. To enhance KB-Whisper's performance on rarer varieties of Swedish, dialect recordings from the Institute for Language and Folklore (Isof) are included. Subtitles are also extracted from YouTube channels with Swedish content. Finally, crowd-sourced datasets such as Mozilla's Common Voice, Google's FLEURS and the Nordic Speech Technology (NST) dataset are used partly for training and partly for evaluation.

From the audiovisual archives of the National Library of Sweden. Photography: Lina Löfström Baker/KB

Two stages of data quality

To assess the quality of the transcriptions, i.e. how well the subtitles, protocols and other transcriptions match the spoken audio, we implemented a preprocessing pipeline. The training examples are split into small 30-second chunks, and each audio chunk is transcribed using OpenAI's Whisper-large-v3 and KBLab's VoxRex (Malmsten, Haffenden, and Börjeson 2022). The overlap between the original transcript and the two AI-generated transcriptions is then assessed using character error rate (CER), BLEU and ROUGE scores.
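As an illustration of this scoring step, here is a minimal sketch; jiwer, sacrebleu and rouge_score are assumed here as metric implementations, and the exact metrics and normalization in our pipeline may differ:

```python
# Sketch of the overlap scoring between an original transcript chunk and a
# machine transcription. jiwer, sacrebleu and rouge_score are assumed as
# metric implementations; the actual pipeline may use others.
import jiwer
import sacrebleu
from rouge_score import rouge_scorer

_rouge = rouge_scorer.RougeScorer(["rougeL"])

def overlap_scores(original: str, machine: str) -> dict:
    """Score how well a machine transcription agrees with the original text."""
    return {
        "cer": jiwer.cer(original, machine),                          # 0 = identical
        "bleu": sacrebleu.sentence_bleu(machine, [original]).score,   # 0-100 scale
        "rouge": _rouge.score(original, machine)["rougeL"].fmeasure,  # 0-1 scale
    }

print(overlap_scores(
    "det var en gång en katt som hette Måns",
    "det var en gång en katt som heter Måns",
))
```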

The first quality assessment filters out examples with little or no overlap, keeping data that is still of sufficient quality for the model to learn from, yielding a large training dataset with an increased probability of covering rare Swedish words and names, denoted below as the Stage 1 data. The second quality assessment focuses on defining the style of transcription, aiming to teach the model how to transcribe rather than providing a large number of examples. Two styles of transcription are defined: one more subtitle-like (Stage 2 subtitle), and one more orthographic (Stage 2 standard) for more precise transcription.

Each stage is defined by a set of criteria on CER, BLEU and ROUGE, which are outlined in Table 1.

Table 1: Quality criteria defining each training stage.

|                  | BLEU  | CER-head | CER-tail | ROUGE |
|------------------|-------|----------|----------|-------|
| Stage 1          | > 0.2 | -        | -        | -     |
| Stage 2 standard | > 0.6 | < 0.3    | < 0.3    | > 0.7 |
| Stage 2 subtitle | > 0.6 | < 0.4    | < 0.4    | -     |
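The criteria in Table 1 can be read as simple threshold filters. A minimal sketch follows, with all scores on a 0-1 scale; CER-head and CER-tail are taken here to mean CER computed over the beginning and end of a chunk, which is an assumption about the pipeline:

```python
# Table 1 criteria as a threshold filter; all scores on a 0-1 scale.
# CER-head/CER-tail are assumed to be CER over the start/end of a chunk.
def assign_stage(bleu: float, cer_head: float, cer_tail: float, rouge: float) -> str:
    if bleu > 0.6 and cer_head < 0.3 and cer_tail < 0.3 and rouge > 0.7:
        return "stage2_standard"   # strictest: precise, orthographic style
    if bleu > 0.6 and cer_head < 0.4 and cer_tail < 0.4:
        return "stage2_subtitle"   # looser: subtitle-like style
    if bleu > 0.2:
        return "stage1"            # noisy, but still useful for training
    return "discard"

print(assign_stage(bleu=0.65, cer_head=0.25, cer_tail=0.2, rouge=0.8))  # stage2_standard
```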

The resulting number of hours that fall into each category is presented in Table 2.

Table 2: Hours of training data per source and stage.

| Dataset   | Stage 1 (h) | Stage 2 standard (h) | Stage 2 subtitle (h) |
|-----------|-------------|----------------------|----------------------|
| Subtitles | 34,261      | 3,110                | 6,928                |
| Riksdag   | 21,949      | 5,119                | 8,710                |
| Isof      | 54          | 54                   | 54                   |
| NST       | 250         | 250                  | 250                  |
| Total     | 56,514      | 8,533                | 15,942               |

Based on open weights from OpenAI’s Whisper

Following the excellent Whisper fine-tuning tutorial from Hugging Face, we fine-tune all sizes of Whisper models on our Swedish training dataset. Training proceeds in two stages: the first leverages the large Stage 1 dataset, and it is followed by two parallel runs in which the model is trained on either the Stage 2 subtitle data or the Stage 2 standard data.
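A condensed sketch of that setup is shown below; the checkpoint, hyperparameters, `stage1_dataset` and `data_collator` are placeholders rather than our exact training configuration:

```python
# Condensed fine-tuning sketch following the Hugging Face Whisper tutorial.
# The checkpoint, hyperparameters, dataset and collator below are
# placeholders, not the exact KB-Whisper training configuration.
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="swedish", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-sv-stage1",  # hypothetical output path
    per_device_train_batch_size=32,
    learning_rate=1e-5,
    warmup_steps=500,
    fp16=True,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=stage1_dataset,  # assumed: a preprocessed datasets.Dataset
    data_collator=data_collator,   # assumed: the padding collator from the tutorial
    tokenizer=processor,
)
trainer.train()
```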

The training was executed on the Leonardo supercomputer hosted by CINECA (Italy), which we were granted access to through a EuroHPC JU AI and data-intensive applications call.

The Leonardo supercomputer. Source: CINECA

A great improvement in Swedish ASR

We have evaluated the models on three datasets: FLEURS (train and test sets), NST (test set), and Common Voice 16.0 (train, validation, and test sets). The Common Voice and FLEURS data were not part of the training set and can therefore serve as a benchmark for the models' out-of-domain performance.

To compare our newly trained models with OpenAI's models, we calculate Word Error Rate (WER) and BLEU scores for each of the mentioned datasets. WER measures transcription accuracy as the percentage of words that are substituted, deleted, or inserted relative to a reference, while the BLEU score evaluates how well a transcription matches the reference text.
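As a concrete example of how WER behaves, here is a small worked case using the jiwer package (one possible implementation of the metric):

```python
# Worked WER example with jiwer: one substitution ("biblioteket" ->
# "bibliotek") and one insertion ("den") against 8 reference words.
import jiwer

reference  = "jag gick till biblioteket för att läsa böcker"
hypothesis = "jag gick till bibliotek för att den läsa böcker"

print(jiwer.wer(reference, hypothesis))  # 2 errors / 8 words = 0.25
```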

The results evaluated in terms of WER are presented in the table below.

| Model size | Model  | FLEURS | Common Voice | NST  |
|------------|--------|--------|--------------|------|
| tiny       | KBLab  | 13.2   | 12.9         | 11.2 |
|            | OpenAI | 59.2   | 67.8         | 85.2 |
| base       | KBLab  | 9.1    | 8.7          | 7.8  |
|            | OpenAI | 39.6   | 52.1         | 53.4 |
| small      | KBLab  | 7.3    | 6.4          | 6.6  |
|            | OpenAI | 20.6   | 26.4         | 26.4 |
| medium     | KBLab  | 6.6    | 5.4          | 5.8  |
|            | OpenAI | 12.1   | 15.8         | 17.1 |
| large-v3   | KBLab  | 5.4    | 4.1          | 5.2  |
|            | OpenAI | 7.8    | 9.5          | 11.3 |

Our evaluations show that the best-performing model reduces the WER by an average of 47% compared to Whisper-large-v3.
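This figure can be reproduced from the large-v3 rows of the WER table above:

```python
# Average relative WER reduction, computed from the large-v3 rows above.
kb_whisper = {"FLEURS": 5.4, "Common Voice": 4.1, "NST": 5.2}
openai     = {"FLEURS": 7.8, "Common Voice": 9.5, "NST": 11.3}

reductions = [(openai[d] - kb_whisper[d]) / openai[d] for d in openai]
print(f"{100 * sum(reductions) / len(reductions):.0f}%")  # -> 47%
```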

The results evaluated in terms of BLEU are presented in the table below.

| Model size | Model  | FLEURS | Common Voice | NST  |
|------------|--------|--------|--------------|------|
| tiny       | KBLab  | 76.6   | 73.7         | 74.3 |
|            | OpenAI | 26.9   | 21.1         | 24.0 |
| base       | KBLab  | 83.2   | 79.9         | 78.3 |
|            | OpenAI | 41.1   | 32.5         | 36.9 |
| small      | KBLab  | 86.6   | 83.5         | 79.6 |
|            | OpenAI | 64.0   | 56.5         | 58.2 |
| medium     | KBLab  | 87.6   | 85.0         | 80.2 |
|            | OpenAI | 77.1   | 70.1         | 68.9 |
| large-v3   | KBLab  | 89.8   | 87.2         | 81.1 |
|            | OpenAI | 84.9   | 79.1         | 75.1 |

The most significant improvements are observed in the smaller models, demonstrating that high-quality transcriptions can be achieved with fewer computational resources. The KB-Whisper-small model outperforms OpenAI's Whisper-large-v3, a model six times its size. This means that similar transcription quality can be obtained using a smaller model, making speech-to-text more accessible and less costly.

Where to find the models?

All models are freely available for download from KBLab's page on Hugging Face.

For more information on ASR models and how to use KBLab’s Whisper models programmatically, we recommend exploring this notebook.
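As a minimal quick start, the models load like any other Whisper checkpoint through the transformers pipeline; the model ID below assumes the naming used on KBLab's Hugging Face page:

```python
# Minimal transcription example; the model ID is assumed to follow the
# naming on KBLab's Hugging Face page.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="KBLab/kb-whisper-small",
    chunk_length_s=30,  # Whisper operates on 30-second windows
)

result = asr("audio.wav", generate_kwargs={"task": "transcribe", "language": "sv"})
print(result["text"])
```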

Acknowledgments

We acknowledge the EuroHPC Joint Undertaking for awarding this project access to the EuroHPC supercomputer LEONARDO, hosted by CINECA (Italy) and the LEONARDO consortium, through the Development Access call and the AI and data-intensive applications access call.

The development work to produce the notebook mentioned above was carried out within the HUMINFRA infrastructure project.

References

Malmsten, Martin, Chris Haffenden, and Love Börjeson. 2022. “Hearing Voices at the National Library – a Speech Corpus and Acoustic Model for the Swedish Language.” https://arxiv.org/abs/2205.03026.
Radford, Alec, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. “Robust Speech Recognition via Large-Scale Weak Supervision.” https://arxiv.org/abs/2212.04356.

Citation

BibTeX citation:
@online{vesterbacka2025,
  author = {Vesterbacka, Leonora and Rekathati, Faton and Kurtz, Robin
    and Sikora, Justyna and Toftgård, Agnes},
  title = {Welcome {KB-Whisper,} a New Fine-Tuned {Swedish} {Whisper}
    Model!},
  date = {2025-03-07},
  url = {https://kb-labb.github.io/posts/2025-03-07-welcome-KB-Whisper/},
  langid = {en}
}
For attribution, please cite this work as:
Vesterbacka, Leonora, Faton Rekathati, Robin Kurtz, Justyna Sikora, and Agnes Toftgård. 2025. “Welcome KB-Whisper, a New Fine-Tuned Swedish Whisper Model!” March 7, 2025. https://kb-labb.github.io/posts/2025-03-07-welcome-KB-Whisper/.