The KBLab Blog: Swedish speech synthesis

Faton Rekathati

Creating realistic-sounding speech from text, with proper intonation, pitch and tone, has long been a goal of speech synthesis systems. These systems have a wide range of applications, including:

Accessibility tooling for the visually impaired via screen readers.
Helping people with speech impairments communicate.
Allowing for interaction with computers, digital assistants, and robots through audio as opposed to text.

Recent developments in neural speech synthesis have made it possible to synthesize increasingly natural-sounding voices. However, many of the high-quality systems that support smaller languages remain proprietary.

At KBLab we recently discovered a relatively user-friendly way to train a neural speech synthesis model: the Piper library ¹. Luckily, the Norwegian Language Bank (Språkbanken) maintains several speech datasets originally produced by the company NST (Nordisk Språkteknologi). One of those datasets consists of 5300 recordings of a single Swedish speaker ², recorded for the purpose of training speech synthesis systems.

Have a listen

Have a listen to the output of KBLab’s model below. We use text from the Wikipedia articles on Regnbåge and Europaparlamentet as the source text. For each sample, we also compare our model to a recently released open source text-to-speech model from Meta AI. See the Massively Multilingual Speech project (MMS) ³ for further details about Meta’s model.

KBLab TTS (Piper)

Facebook/Meta TTS

En regnbåge är ett optiskt, meteorologiskt fenomen som uppträder som ett (nästintill) fullständigt ljusspektrum i form av en båge på himlen då solen lyser på nedfallande regn. Regnbågen består färgmässigt av en kontinuerlig övergång från rött (ytterst) via gula, gröna och blå nyanser till violett innerst; ofta definieras antalet färger som sju, inklusive orange och indigo.

KBLab TTS (Piper)

Facebook/Meta TTS

Europaparlamentet (EP), även känt som EU-parlamentet, är den ena lagstiftande institutionen inom Europeiska unionen; den andra är Europeiska unionens råd. Parlamentet, som består av 705 ledamöter, väljs genom allmänna och direkta val vart femte år, och företräder unionsmedborgarna direkt på unionsnivå. Parlamentet kan förenklat liknas vid ett underhus i ett tvåkammarsystem.

Odd pronunciations

Piper and the MMS model from Meta both use VITS to train the text-to-speech model. This model relies on espeak-ng to translate text to phonemes. The extent and comprehensiveness of espeak-ng’s pronunciation and prosody rules vary from language to language, as the software is largely reliant on volunteer contributions. The words meteorologiskt, fenomen and kontinuerlig are pronounced so strangely because espeak-ng generates an incorrect text-to-phoneme conversion for them ⁴.
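To see what espeak-ng produces for the problem words, you can ask it for the phonemes directly. The following is a minimal sketch, assuming espeak-ng is installed and that its Swedish voice is selected with -v sv. The function name show_phonemes and the ESPEAK_BIN override are hypothetical names introduced here for illustration; they are not part of espeak-ng or Piper.

```shell
# Hypothetical helper: print "word<TAB>phonemes" for each argument.
# -v sv  selects the Swedish voice
# -q     suppresses audio output
# --ipa  writes the phonemes in IPA to stdout
# ESPEAK_BIN is an assumed override for systems where the binary has another name.
show_phonemes() {
  for word in "$@"; do
    printf '%s\t%s\n' "$word" "$("${ESPEAK_BIN:-espeak-ng}" -v sv -q --ipa "$word")"
  done
}

# Only run the lookup if the binary is actually available.
if command -v "${ESPEAK_BIN:-espeak-ng}" >/dev/null 2>&1; then
  show_phonemes meteorologiskt fenomen kontinuerlig
else
  echo "espeak-ng not found; install it to inspect the phonemes" >&2
fi
```

Comparing the printed phonemes with how a Swedish speaker would pronounce each word is one way to check whether an odd pronunciation comes from the phoneme conversion rather than from the voice model.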

How do I use the model?

To try out KBLab’s TTS model yourself using Piper:

Download the Piper binary from GitHub (an executable file that allows you to run the model in your terminal) ⁵. For Linux:

wget https://github.com/rhasspy/piper/releases/download/v0.0.2/piper_amd64.tar.gz

Unzip/Untar the downloaded archive in a directory of your choice.

# The contents will be extracted to a directory named piper/
tar -xvf piper_amd64.tar.gz

Download the Svenska (Swedish) model weights from Piper’s voice samples.

wget https://github.com/rhasspy/piper/releases/download/v0.0.2/voice-sv-se-nst-medium.tar.gz

Unzip/Untar the downloaded archive file in a directory of your choice. We suggest doing it in the same directory where you untarred the piper binary: piper/.

tar -xvf voice-sv-se-nst-medium.tar.gz --directory="piper"

Generate speech via the terminal.

# Let's first move into the piper directory
cd piper

# Generate speech to the audio file min_talsyntes.wav
echo 'Jag genererar tal med hjälp av talsyntes.' | ./piper \
  --model sv-se-nst-medium.onnx \
  --output_file min_talsyntes.wav
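If you have several sentences to synthesize, the same command can be wrapped in a loop. The sketch below uses only the --model and --output_file flags shown above and writes one WAV file per non-empty line of a text file. The function name synthesize_lines, the PIPER_BIN override and the talsyntes_NNN.wav naming scheme are assumptions made for this example.

```shell
# Hypothetical helper: synthesize each non-empty line of a text file
# into its own numbered WAV file in the current directory.
#   $1  text file with one sentence per line
#   $2  path to the .onnx model
# PIPER_BIN is an assumed override; it defaults to ./piper as in the steps above.
synthesize_lines() {
  input_file=$1
  model=$2
  n=0
  # "|| [ -n "$line" ]" also handles a last line without a trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    [ -z "$line" ] && continue          # skip empty lines
    n=$((n + 1))
    out=$(printf 'talsyntes_%03d.wav' "$n")
    # Piper reads the sentence from the pipe, not from the loop's input file.
    printf '%s\n' "$line" | "${PIPER_BIN:-./piper}" \
      --model "$model" \
      --output_file "$out"
  done < "$input_file"
}
```

From inside the piper directory, a call could look like synthesize_lines meningar.txt sv-se-nst-medium.onnx, where meningar.txt is a hypothetical file with one sentence per line.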

Pretrained model checkpoints

For anyone interested in using this model to fine-tune other voices, we have uploaded the pretrained checkpoint weights to Hugging Face.

git clone https://huggingface.co/KBLab/piper-tts-nst-swedish
