The KBLab Blog: Swedish speech synthesis

Faton Rekathati

Swedish speech synthesis

KBLab releases a neural network-based text-to-speech model for Swedish. The model was trained on an open Swedish speech synthesis dataset from NST. We make our latest training checkpoint available for anyone wishing to finetune on a new voice. We also contribute the model weights to the open source project Piper, where users can deploy a lightweight, optimized version of the model on their own computers or on their Raspberry Pis.

Author: Faton Rekathati
Affiliation: KBLab
Published: May 23, 2023
Citation: Rekathati, 2023

Creating realistic-sounding speech with proper intonation, pitch and tone from text has long been a goal of speech synthesis systems. These systems have a wide range of applications, including:

Accessibility tooling for the visually impaired via screen readers.
Helping people with speech impairments communicate.
Allowing for interaction with computers, digital assistants, and robots through audio as opposed to text.

Recent developments in neural speech synthesis have made it possible to synthesize increasingly natural-sounding voices. However, many of the high-quality systems that support smaller languages remain proprietary.

At KBLab we recently discovered a relatively user-friendly option for training a neural speech synthesis model: the Piper library (https://github.com/rhasspy/piper). Luckily, the Norwegian Language Bank (Språkbanken) maintains several speech datasets originally produced by the company NST (Nordisk Språkteknologi). One of those datasets consists of 5300 recordings of a single Swedish speaker (https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-18/), recorded for the purpose of training speech synthesis systems.

Have a listen

Have a listen to the output of KBLab’s model below. We use text from the Wikipedia articles on Regnbåge and Europaparlamentet as the source text. For each sample, we also compare our model to a recently released open source text-to-speech model from Meta AI. See the Massively Multilingual Speech project (MMS) at https://ai.facebook.com/blog/multilingual-model-speech-recognition/ for further details about Meta’s model.

KBLab TTS (Piper)

Facebook/Meta TTS

En regnbåge är ett optiskt, meteorologiskt fenomen som uppträder som ett (nästintill) fullständigt ljusspektrum i form av en båge på himlen då solen lyser på nedfallande regn. Regnbågen består färgmässigt av en kontinuerlig övergång från rött (ytterst) via gula, gröna och blå nyanser till violett innerst; ofta definieras antalet färger som sju, inklusive orange och indigo.

KBLab TTS (Piper)

Facebook/Meta TTS

Europaparlamentet (EP), även känt som EU-parlamentet, är den ena lagstiftande institutionen inom Europeiska unionen; den andra är Europeiska unionens råd. Parlamentet, som består av 705 ledamöter, väljs genom allmänna och direkta val vart femte år, och företräder unionsmedborgarna direkt på unionsnivå. Parlamentet kan förenklat liknas vid ett underhus i ett tvåkammarsystem.

Odd pronunciations

Piper and the MMS model from Meta both use VITS to train the text-to-speech model. This model relies on espeak-ng to translate text to phonemes. The extent and comprehensiveness of espeak-ng’s pronunciation and prosody rules vary from language to language, as the software is largely reliant on volunteer contributions. The words meteorologiskt, fenomen and kontinuerlig are pronounced strangely because espeak-ng generates an incorrect text-to-phoneme conversion for them (https://github.com/rhasspy/piper/issues/72#issuecomment-1550170779).
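To see which phonemes espeak-ng produces for a given word, one option is to query it directly. The sketch below assumes espeak-ng is installed separately on your system and that its Swedish voice is named sv; ESPEAK_BIN and show_phonemes are hypothetical names used only for illustration. The output of a system-wide espeak-ng may differ from the version bundled with Piper.

```shell
# Sketch: print espeak-ng's IPA transcription for each word given as an argument.
# ESPEAK_BIN is a hypothetical variable name; it defaults to the espeak-ng command.
ESPEAK_BIN="${ESPEAK_BIN:-espeak-ng}"

show_phonemes() {
  for word in "$@"; do
    # -q: do not play audio, --ipa: write IPA phonemes to stdout, -v sv: Swedish voice
    printf '%s\t%s\n' "$word" "$("$ESPEAK_BIN" -q --ipa -v sv "$word")"
  done
}
```

Calling show_phonemes meteorologiskt fenomen kontinuerlig prints one line per word, which can be compared against the expected Swedish pronunciation.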

How do I use the model?

To try out KBLab’s TTS model yourself using Piper:

Download the Piper binary from GitHub (an executable file that lets you run the model in your terminal): https://github.com/rhasspy/piper. For Linux:

wget https://github.com/rhasspy/piper/releases/download/v0.0.2/piper_amd64.tar.gz

Extract the downloaded archive in a directory of your choice.

# The contents will be extracted to a directory named piper/
tar -xvf piper_amd64.tar.gz

Download the Svenska (Swedish) model weights from Piper’s voice samples.

wget https://github.com/rhasspy/piper/releases/download/v0.0.2/voice-sv-se-nst-medium.tar.gz

Extract the downloaded archive in a directory of your choice. We suggest using the same directory where you extracted the Piper binary: piper/.

tar -xvf voice-sv-se-nst-medium.tar.gz --directory="piper"

Generate speech via the terminal.

# Let's first move into the piper directory
cd piper

# Generate speech to the audio file min_talsyntes.wav
echo 'Jag genererar tal med hjälp av talsyntes.' | ./piper \
  --model sv-se-nst-medium.onnx \
  --output_file min_talsyntes.wav
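If you want to synthesize several sentences in one go, the command above can be wrapped in a small loop. The following is a minimal sketch, not part of Piper itself: synthesize_lines, PIPER_BIN and PIPER_MODEL are hypothetical names, and the defaults assume you run it from inside the piper directory as in the previous step. It uses only the --model and --output_file options shown above.

```shell
# Sketch: synthesize one WAV file per non-empty line of a text file.
# PIPER_BIN and PIPER_MODEL are hypothetical variable names; the defaults
# match the file names used in the steps above.
PIPER_BIN="${PIPER_BIN:-./piper}"
PIPER_MODEL="${PIPER_MODEL:-sv-se-nst-medium.onnx}"

synthesize_lines() {
  local input_file out_dir n line
  input_file="$1"
  out_dir="$2"
  n=0
  mkdir -p "$out_dir"
  # "|| [ -n "$line" ]" also handles a last line without a trailing newline
  while IFS= read -r line || [ -n "$line" ]; do
    # Skip empty lines so they do not produce empty audio files
    if [ -z "$line" ]; then
      continue
    fi
    n=$((n + 1))
    # Output files are numbered line_001.wav, line_002.wav, ...
    printf '%s\n' "$line" | "$PIPER_BIN" \
      --model "$PIPER_MODEL" \
      --output_file "$(printf '%s/line_%03d.wav' "$out_dir" "$n")"
  done < "$input_file"
}
```

For example, synthesize_lines meningar.txt wav_out would read the (hypothetical) file meningar.txt and write the audio files to the directory wav_out. Piper is started once per line here, which keeps the sketch simple at the cost of reloading the model each time.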

Pretrained model checkpoints

For anyone interested in using this model to finetune other voices, we have uploaded the pretrained checkpoint weights to Hugging Face.

git clone https://huggingface.co/KBLab/piper-tts-nst-swedish
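Large files on Hugging Face are typically stored with Git LFS, so a clone made without git-lfs installed may contain small text pointer files instead of the actual weights. The sketch below is one way to check for that, assuming the checkpoint in this repository is stored via LFS (we have not verified this here); is_lfs_pointer is a hypothetical helper name, not part of git or Piper.

```shell
# Sketch: succeed if the given file is a Git LFS pointer rather than real content.
# is_lfs_pointer is a hypothetical helper name used for illustration.
is_lfs_pointer() {
  # LFS pointer files are small text files whose first line names the LFS spec
  head -c 200 "$1" 2>/dev/null | grep -q '^version https://git-lfs.github.com/spec/v1'
}
```

If the check succeeds for a checkpoint file in the cloned directory, installing git-lfs and running git lfs pull inside the repository should fetch the real file.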

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Citation

For attribution, please cite this work as

Rekathati (2023, May 24). The KBLab Blog: Swedish speech synthesis. Retrieved from https://kb-labb.github.io/posts/2023-05-24-swedish-text-to-speech/

BibTeX citation

@misc{rekathati2023swedish,
  author = {Rekathati, Faton},
  title = {The KBLab Blog: Swedish speech synthesis},
  url = {https://kb-labb.github.io/posts/2023-05-24-swedish-text-to-speech/},
  year = {2023}
}