KBLab presents a robust, multi-label sentiment classifier trained on Swedish texts. The model is robust in the sense that it is trained on multiple datasets of different text types, and it labels neutral as well as positive and negative texts. It is available under the Apache 2.0 license on the Hugging Face Hub.
Many researchers in the humanities and adjacent fields are interested in the tonality of texts, for which sentiment analysis is an excellent tool. KBLab presents a robust, transformer-based sentiment classifier for Swedish. The model is available as a multi-class model (negative/neutral/positive). It is publicly available via the Hugging Face Hub, published under the Apache 2.0 license.
The model was developed in collaboration with KBLab researcher Nora Hansson Bittár, who is a PhD student at the Stockholm School of Economics. She is currently studying the development of sentiments and emotional load in the Swedish media landscape over time, inspired by Rozado et al. (2022).
A particular requirement when studying tonality in news media is the need for a category representing a neutral tone, as many news articles are neither inherently positive nor negative. This category is often omitted in sentiment modeling, as it adds complexity to the task, which in turn affects model performance.
Another aspect of previously available Swedish sentiment models that makes them difficult to use in a project like Nora's is their poor generalization. Most, if not all, previously published sentiment models in Swedish are trained exclusively on one type of text (reviews), which leads to poor performance in other linguistic domains. We have trained our models on multiple datasets of various types and sizes.
Robustness, in this case, refers to a language model's generalization capabilities. Since most, if not all, previously published sentiment models in Swedish are trained on only one type of text (reviews), their performance in other linguistic domains suffers. We have trained our models on five datasets from different sources, of varying size and quality. Note that these datasets do not share a consistent underlying annotation schema; this is compensated for by the relatively large size of the combined corpus.
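Training on datasets without a shared annotation schema implies a harmonization step in which each dataset's native labels are mapped onto the common negative/neutral/positive scheme. A minimal sketch of what such a step might look like (the dataset names and native label sets below are hypothetical illustrations, not the actual corpora used):

```python
# Map each dataset's native labels onto a shared 3-class schema.
# The dataset names and native label sets here are made up for illustration.
UNIFIED = ["NEGATIVE", "NEUTRAL", "POSITIVE"]

LABEL_MAPS = {
    # e.g. a review corpus with 1-5 star ratings
    "reviews": {1: "NEGATIVE", 2: "NEGATIVE", 3: "NEUTRAL", 4: "POSITIVE", 5: "POSITIVE"},
    # e.g. a news corpus with string labels
    "news": {"neg": "NEGATIVE", "neu": "NEUTRAL", "pos": "POSITIVE"},
}

def harmonize(example: dict, source: str) -> dict:
    """Replace a dataset-specific label with the unified integer label id."""
    unified = LABEL_MAPS[source][example["label"]]
    return {"text": example["text"], "label": UNIFIED.index(unified)}

print(harmonize({"text": "Rihanna uppges gravid", "label": "neu"}, "news"))
# {'text': 'Rihanna uppges gravid', 'label': 1}
```

Mapping star ratings onto three classes is a common but lossy choice; where the boundary between, say, a 3-star review and "neutral" is drawn inevitably affects the resulting label distribution.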
The accuracy of the model is evaluated on a balanced test set: 0.80 for the multiclass version and 0.88 for the binary version. More extensive evaluation will be conducted at a later stage and included in Nora's report (albeit not on a balanced test set, but in the news media domain specifically). Both models are fine-tuned from the Swedish BERT-large model with 340M parameters, developed here at KBLab.
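Evaluating on a balanced test set means every class contributes equally to the accuracy figure, so the score is not inflated by a dominant class. A small sketch of balancing by downsampling to the rarest class (assuming a plain list of (text, label) pairs; this is an illustration, not the actual evaluation code):

```python
import random
from collections import defaultdict

def balance(examples, seed=42):
    """Downsample every class to the size of the rarest class."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    n = min(len(items) for items in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, n))
    return balanced

def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```

On a balanced set, plain accuracy coincides with macro-averaged recall, which is why it is a fairer summary statistic here than accuracy on the raw, skewed label distribution.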
Below is a minimal example showcasing how to perform predictions with the model using the transformers pipeline.
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("KBLab/megatron-bert-large-swedish-cased-165k")
model = AutoModelForSequenceClassification.from_pretrained("KBLab/robust-swedish-sentiment-multiclass")
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

text = "Rihanna uppges gravid"
result = classifier(text)
print(result)
In this case, the classifier outputs the label NEUTRAL with a score of 0.89. Good luck and happy inference!
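Under the hood, the pipeline applies a softmax to the model's output logits and returns the highest-scoring label. That step can be reproduced by hand; in the sketch below, the logit values are made up for illustration, and the label order is an assumed negative/neutral/positive head, not read from the actual model config:

```python
import math

# Assumed label order for a 3-class sentiment head (illustrative only;
# the real mapping lives in the model's id2label config).
ID2LABEL = {0: "NEGATIVE", 1: "NEUTRAL", 2: "POSITIVE"}

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits standing in for a model forward pass.
logits = [-1.2, 2.3, 0.1]
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(ID2LABEL[best])  # the highest-probability label for these logits
```

The pipeline returns exactly this top label and its probability; passing `top_k=None` to the pipeline instead returns the scores for all three classes, which is useful when the neutral/non-neutral boundary itself is of interest.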
Part of this development work was carried out within the HUMINFRA infrastructure project.
Preview photo by Bo Wingård (1967).
If you see mistakes or want to suggest changes, please create an issue on the source repository.
For attribution, please cite this work as
Hägglöf (2023, June 16). The KBLab Blog: A robust, multi-label sentiment classifier for Swedish. Retrieved from https://kb-labb.github.io/posts/2023-06-16-a-robust-multi-label-sentiment-classifier-for-swedish/
BibTeX citation
@misc{hägglöf2023a,
  author = {Hägglöf, Hillevi},
  title = {The KBLab Blog: A robust, multi-label sentiment classifier for Swedish},
  url = {https://kb-labb.github.io/posts/2023-06-16-a-robust-multi-label-sentiment-classifier-for-swedish/},
  year = {2023}
}