KBLab releases a neural network-based text-to-speech model for Swedish. The model was trained on an open Swedish speech synthesis dataset from NST. We make our latest training checkpoint available for anyone wishing to finetune on a new voice. We also contribute the model weights to the open source project Piper, where users can deploy a lightweight, optimized version of the model on their own computers or on their Raspberry Pis.
Creating realistic-sounding speech with proper intonation, pitch and tone from text has long been a goal of speech synthesis systems. These systems have a wide range of applications.
Recent developments in neural speech synthesis have made it possible to synthesize increasingly natural-sounding voices. However, many of the high-quality systems that support smaller languages remain proprietary.
At KBLab we recently discovered a relatively user-friendly option for training a neural speech synthesis model: the Piper library.
Have a listen to the output of KBLab’s model below. We use text from the Wikipedia articles on Regnbåge (rainbow) and Europaparlamentet (the European Parliament) as the source text. For each sample, we also compare our model to a recently released open source text-to-speech model from Meta AI. See the Massively Multilingual Speech (MMS) project.
En regnbåge är ett optiskt, meteorologiskt fenomen som uppträder som ett (nästintill) fullständigt ljusspektrum i form av en båge på himlen då solen lyser på nedfallande regn. Regnbågen består färgmässigt av en kontinuerlig övergång från rött (ytterst) via gula, gröna och blå nyanser till violett innerst; ofta definieras antalet färger som sju, inklusive orange och indigo.
Europaparlamentet (EP), även känt som EU-parlamentet, är den ena lagstiftande institutionen inom Europeiska unionen; den andra är Europeiska unionens råd. Parlamentet, som består av 705 ledamöter, väljs genom allmänna och direkta val vart femte år, och företräder unionsmedborgarna direkt på unionsnivå. Parlamentet kan förenklat liknas vid ett underhus i ett tvåkammarsystem.
Piper and the MMS model from Meta both use VITS to train the text-to-speech model. This model relies on espeak-ng to translate text to phonemes. The extent and comprehensiveness of espeak-ng’s pronunciation and prosody rules vary from language to language, as the software is largely reliant on volunteer contributions. The words meteorologiskt, fenomen and kontinuerlig are pronounced so strangely because espeak-ng generates an incorrect text-to-phoneme conversion for them.
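To see what espeak-ng produces for a particular word, one option is to call it directly from the command line. The sketch below assumes espeak-ng is installed and that `sv` selects its Swedish voice. The helper name `show_phonemes` is hypothetical, and this is not necessarily how Piper invokes espeak-ng internally.

```shell
# Print the IPA phonemes espeak-ng generates for each word passed as an argument.
# Assumption: espeak-ng is on PATH and provides the Swedish voice "sv".
show_phonemes() {
    if ! command -v espeak-ng >/dev/null 2>&1; then
        echo "espeak-ng not found; install it to inspect phonemes" >&2
        return 1
    fi
    for word in "$@"; do
        # -q: produce no audio, --ipa: print IPA phonemes, -v sv: Swedish voice
        printf '%s\t%s\n' "$word" "$(espeak-ng -q --ipa -v sv "$word")"
    done
}

# Inspect the three words mentioned above (skipped quietly if espeak-ng is missing)
show_phonemes meteorologiskt fenomen kontinuerlig || true
```

Comparing the printed phonemes with the expected Swedish pronunciation gives a quick indication of whether a mispronunciation originates in the phonemization step rather than in the trained voice.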
To try out KBLab’s TTS model yourself using Piper:
wget https://github.com/rhasspy/piper/releases/download/v0.0.2/piper_amd64.tar.gz
# The contents will be extracted to a directory named piper/
tar -xvf piper_amd64.tar.gz
wget https://github.com/rhasspy/piper/releases/download/v0.0.2/voice-sv-se-nst-medium.tar.gz
# Extract the voice files into the piper/ directory
tar -xvf voice-sv-se-nst-medium.tar.gz --directory="piper"
# Let's first move into the piper directory
cd piper
# Generate speech to the audio file min_talsyntes.wav
echo 'Jag genererar tal med hjälp av talsyntes.' | ./piper \
--model sv-se-nst-medium.onnx \
--output_file min_talsyntes.wav
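To synthesize several sentences in one go, the same command can be wrapped in a loop. The following is a minimal sketch that only uses the `--model` and `--output_file` flags shown above. The function name `synthesize_lines`, the `PIPER_BIN` variable and the numbered output naming are assumptions made for illustration, not part of Piper.

```shell
# Synthesize each non-empty line of a text file to its own numbered WAV file.
# Usage (from inside the piper/ directory): synthesize_lines meningar.txt talsyntes
# PIPER_BIN is a hypothetical override for the path to the piper binary.
synthesize_lines() {
    input_file=$1
    prefix=$2
    n=0
    # The "|| [ -n ... ]" part also handles a last line without a trailing newline
    while IFS= read -r line || [ -n "$line" ]; do
        [ -z "$line" ] && continue
        n=$((n + 1))
        out=$(printf '%s_%03d.wav' "$prefix" "$n")
        printf '%s\n' "$line" | "${PIPER_BIN:-./piper}" \
            --model sv-se-nst-medium.onnx \
            --output_file "$out"
    done < "$input_file"
}
```

With a file containing one sentence per line, this writes `talsyntes_001.wav`, `talsyntes_002.wav` and so on, skipping empty lines.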
For anyone interested in using this model to finetune other voices, we have uploaded the pretrained checkpoint weights to Hugging Face.
git clone https://huggingface.co/KBLab/piper-tts-nst-swedish
If you see mistakes or want to suggest changes, please create an issue on the source repository.
For attribution, please cite this work as
Rekathati (2023, May 24). The KBLab Blog: Swedish speech synthesis. Retrieved from https://kb-labb.github.io/posts/2023-05-24-swedish-text-to-speech/
BibTeX citation
@misc{rekathati2023swedish,
  author = {Rekathati, Faton},
  title = {The KBLab Blog: Swedish speech synthesis},
  url = {https://kb-labb.github.io/posts/2023-05-24-swedish-text-to-speech/},
  year = {2023}
}