A robust, multi-label sentiment classifier for Swedish

KBLab presents a robust, multi-label sentiment classifier trained on Swedish texts. The model is robust in the sense that it is trained on multiple datasets of different text types and can label neutral as well as positive and negative texts. It is available under the Apache 2.0 license on the Hugging Face Hub.

Hillevi Hägglöf (https://github.com/gilleti), KBLab (https://www.kb.se/in-english/research-collaboration/kblab.html)
2023-06-16

Many researchers in the humanities and adjacent fields are interested in the tonality of texts, for which sentiment analysis is an excellent tool. KBLab presents a robust, transformer-based sentiment classifier for Swedish. The model is available as a multi-class model (negative/neutral/positive). It is publicly available via the Hugging Face Hub, published under the Apache 2.0 license.

The model was developed in collaboration with KBLab researcher Nora Hansson Bittár, who is a PhD student at the Stockholm School of Economics. She is currently studying the development of sentiment and emotional load in the Swedish media landscape over time, inspired by Rozado et al. (2022)¹. Nora’s project will be further documented on this blog.

A particular requirement when studying tonality in news media is a category representing a neutral tone, as many news articles are neither inherently positive nor negative. This is often overlooked in sentiment modeling, as it adds complexity to the task, which in turn affects model performance.

Another aspect of existing Swedish sentiment models that makes them difficult to use in a project like Nora’s is their poor generalization capability. Most, if not all, previously published sentiment models in Swedish are trained exclusively on one type of text (reviews), which leads to poor performance in other linguistic domains. We have therefore trained our models on multiple datasets of various types and sizes.

Robustness, in this case, refers to a language model’s generalization capabilities. Since models trained on a single text type generalize poorly, we have trained ours on five different datasets from different sources, of varying size and quality. Note that these datasets do not share a consistent underlying annotation schema; this is compensated for by the relatively large size of the combined corpus.

The accuracy of the model is evaluated on a balanced test set and is measured at 0.80 for the multi-class version and 0.88 for the binary version. More extensive evaluation will be conducted at a later stage and included in Nora’s report (albeit not on a balanced test set, but in the news media domain specifically). Both models are fine-tuned from the Swedish BERT-large model with 340M parameters, developed here at KBLab.
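For reference, accuracy here is simply the share of test examples whose predicted label matches the gold label; with a balanced test set, each of the three classes contributes equally to the score. A minimal sketch of the calculation, using invented predictions and gold labels rather than the actual test data:

# Hypothetical predictions and gold labels -- not the actual test set.
predicted = ["POSITIVE", "NEUTRAL", "NEGATIVE", "NEUTRAL", "POSITIVE", "NEGATIVE"]
gold      = ["POSITIVE", "NEUTRAL", "NEUTRAL",  "NEUTRAL", "POSITIVE", "NEGATIVE"]

# Accuracy = number of correct predictions / total number of predictions
accuracy = sum(p == g for p, g in zip(predicted, gold)) / len(gold)
print(accuracy)  # 0.8333...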

Inference

Below is a minimal example showcasing how to perform predictions with the model using the transformers pipeline.

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the tokenizer of the underlying Swedish Megatron-BERT model
tokenizer = AutoTokenizer.from_pretrained("KBLab/megatron-bert-large-swedish-cased-165k")
# Load the fine-tuned multi-class sentiment model
model = AutoModelForSequenceClassification.from_pretrained("KBLab/robust-swedish-sentiment-multiclass")

# Wrap model and tokenizer in a sentiment-analysis pipeline and classify a headline
text = "Rihanna uppges gravid"
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
result = classifier(text)

In this case, the classifier outputs the label NEUTRAL with a score of 0.89. Good luck and happy inference!
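If you need the scores for all three classes rather than only the top label, the text-classification pipeline in recent versions of transformers accepts a top_k argument (older releases expose return_all_scores=True instead), and the pipeline also accepts a list of texts for batch inference. A minimal sketch reusing the classifier defined above; the second headline is an invented example:

# Scores for every label, not just the most likely one (requires a reasonably recent transformers release)
all_scores = classifier("Rihanna uppges gravid", top_k=None)
# e.g. [{'label': 'NEUTRAL', 'score': 0.89}, {'label': 'POSITIVE', 'score': ...}, {'label': 'NEGATIVE', 'score': ...}]

# Passing a list of texts runs batch inference and returns one result per text
results = classifier(["Rihanna uppges gravid", "Tågkaos efter helgens snöoväder"])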

Acknowledgements

Part of this development work was carried out within the HUMINFRA infrastructure project.

Preview photo by Bo Wingård (1967).


  1. Rozado D, Hughes R, Halberstadt J (2022). Longitudinal analysis of sentiment and emotion in news media headlines using automated labeling with Transformer language models. PLoS ONE 17(10): e0276367. https://doi.org/10.1371/journal.pone.0276367

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Citation

For attribution, please cite this work as

Hägglöf (2023, June 16). The KBLab Blog: A robust, multi-label sentiment classifier for Swedish. Retrieved from https://kb-labb.github.io/posts/2023-06-16-a-robust-multi-label-sentiment-classifier-for-swedish/

BibTeX citation

@misc{hägglöf2023a,
  author = {Hägglöf, Hillevi},
  title = {The KBLab Blog: A robust, multi-label sentiment classifier for Swedish},
  url = {https://kb-labb.github.io/posts/2023-06-16-a-robust-multi-label-sentiment-classifier-for-swedish/},
  year = {2023}
}