Posts

Analysis of financial data at the Financial Supervisory Authority

A guest post from the Swedish Financial Supervisory Authority (Finansinspektionen, FI), in which FI describes how it uses machine learning to analyze financial data and perform risk assessments. (Image from I99pema, Wikimedia Commons)

Unearthing forgotten images with the help of AI

How can new AI methods be used to improve the searchability and accessibility of visual heritage collections? We’ve taken advantage of the possibilities opened up by multimodal AI models to produce a demo showcasing these capabilities, using postcards as an example. Here we explain the project behind the demo and encourage you to explore our image search: https://lab.kb.se/bildsok

Words unboxed: discovering new words with Kubord

KBLab, together with Språkbanken Text, has developed 75 freely available datasets to support research in lexicography and beyond. The datasets, released as Kubord 2, offer intriguing material for humanistic research. Let's see what the datasets have to offer and where to find them!

For how long is a person recognisable by their voice?

Searching a database of speakers by their voice presents a unique challenge, as speakers' voices change as they age. We can represent a speaker's voice computationally with what we call "voiceprints", and compare pairs of them to decide whether they belong to the same speaker, i.e. to "verify" their identity. But how do we know that two voiceprints are still similar enough when recorded at different ages, for example 40 and 45? In this project, I investigated how voiceprints age over a nine-year range, how this aging depends on the age at which the first voiceprint was recorded, and the effects of the audio length used to create the voiceprints and of the speaker's gender. For this I used debate speeches from Riksdagen.
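Verification of this kind typically reduces to comparing two fixed-length embeddings with cosine similarity against a threshold. A minimal sketch, where the embedding dimension, the threshold, and the simulated drift are illustrative rather than the values from the project:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two voiceprint embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(vp1: np.ndarray, vp2: np.ndarray, threshold: float = 0.6) -> bool:
    """Accept the pair as the same speaker if similarity exceeds the threshold.

    The 0.6 threshold is illustrative; in practice it is tuned on a
    development set to trade off false accepts against false rejects.
    """
    return cosine_similarity(vp1, vp2) >= threshold

# Two voiceprints of the same speaker recorded years apart tend to drift;
# aging shows up as a gradual drop in similarity.
rng = np.random.default_rng(0)
vp_age_40 = rng.normal(size=192)
vp_age_45 = vp_age_40 + rng.normal(scale=0.2, size=192)  # simulated drift
print(same_speaker(vp_age_40, vp_age_45))
```

In the study, the interesting question is how fast that similarity decays with the gap between recordings, and whether the decay rate differs by starting age, audio length, and gender.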

A robust, multi-label sentiment classifier for Swedish

KBLab presents a robust, multi-label sentiment classifier trained on Swedish texts. The model is robust in the sense that it is trained on multiple datasets of different text types and allows labeling of neutral as well as positive and negative texts. It is available under the Apache 2.0 license on the Hugging Face Hub.

Swedish speech synthesis

KBLab releases a neural-network-based text-to-speech model for Swedish. The model was trained on an open Swedish speech synthesis dataset from NST. We make our latest training checkpoint available for anyone wishing to finetune it on a new voice. We also contribute the model weights to the open source project Piper, where users can deploy a lightweight, optimized version of the model on their own computers or on their Raspberry Pis.

Scientific discourse with BERTopic

We describe a typical topic modeling use case where BERTopic is applied to scientific abstracts in the research field of education. We discuss the limitations of BERTopic and potential ways to overcome them. Some further exploratory data analysis is performed using the topic model as a base.

RixVox: A Swedish Speech Corpus with 5500 Hours of Speech from Parliamentary Debates

KBLab releases RixVox, a speech dataset comprised of 5500 hours of speech from parliamentary debates. The speeches have been aligned with transcripts from written protocols, and contain additional metadata such as the speaker's gender, electoral district and birth year. RixVox is open and free for everyone to download and use.

Finding Speeches in the Riksdag's Debates

The Riksdag is the Parliament of Sweden. It has made available twenty years of parliamentary debates through its website and open data platform. Each speech is accompanied by rich metadata, including a transcript and markers indicating its start location and duration within the media file of the debate. However, we find that a not insignificant portion of the speeches has been misaligned, with the metadata being particularly unreliable for debates prior to 2012. In this work, we employ automatic speech recognition and speaker diarization to devise a fully automated approach to aligning transcripts with their corresponding speech in audio files.

Swedish zero-shot classification model

KBLab has released a BERT model fine-tuned on NLI tasks, which can be used for zero-shot text classification.
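Zero-shot classification with an NLI model works by posing each candidate label as a hypothesis ("This text is about X") and scoring entailment; the label with the highest entailment probability wins. Loading the actual model requires the `transformers` library and a download, so the sketch below keeps only the label-selection step executable, with the pipeline call shown in a comment (the model identifier there is a placeholder, not a confirmed name):

```python
import math

def pick_label(entailment_logits: dict[str, float]) -> str:
    """Given per-label entailment logits from an NLI model, softmax them
    and return the most probable label."""
    m = max(entailment_logits.values())
    exp = {lab: math.exp(v - m) for lab, v in entailment_logits.items()}
    total = sum(exp.values())
    return max(exp, key=lambda lab: exp[lab] / total)

# With transformers installed, the same idea is one call
# (model id is a placeholder -- check KBLab's page on the Hugging Face Hub):
#
#   from transformers import pipeline
#   clf = pipeline("zero-shot-classification", model="KBLab/<nli-model>")
#   clf("Jag älskar den här filmen", candidate_labels=["positiv", "negativ"])

print(pick_label({"sport": 2.1, "politik": 0.3, "kultur": -1.0}))  # -> sport
```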

Swedish Sentence Transformer 2.0

KBLab's Swedish sentence transformer has been updated to a newer version. The new version features an increased maximum sequence length of 384 tokens, allowing users to encode longer documents. It also performs better on retrieval tasks, such as matching questions and answers.

BERTopic for Swedish: Topic modeling made easier via KB-BERT

Topic modeling is an exciting option for exploring and finding patterns in large volumes of text data. While this previously required considerable programming skills, a recent innovation has simplified the method to make it more accessible for researchers in and beyond the academy. We explain how BERTopic harnesses KBLab’s language models to produce state-of-the-art topic modeling, and we offer some tips on how to get started.
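Under the hood, BERTopic clusters document embeddings (for Swedish, from a KB-BERT-based sentence model) and then describes each cluster with class-based TF-IDF (c-TF-IDF): all documents in a topic are treated as one "class", and terms are weighted by how specific they are to that class. A minimal, illustrative implementation of the weighting itself, assuming documents have already been grouped into topics and can be tokenized by whitespace:

```python
import math
from collections import Counter

def c_tf_idf(topic_docs: dict[str, list[str]], top_n: int = 3) -> dict[str, list[str]]:
    """Class-based TF-IDF: weight terms by their specificity to a topic.

    topic_docs maps a topic name to its (whitespace-tokenizable) documents.
    Returns the top_n highest-weighted terms per topic.
    """
    class_counts = {t: Counter(w for doc in docs for w in doc.split())
                    for t, docs in topic_docs.items()}
    total_freq = Counter()
    for counts in class_counts.values():
        total_freq.update(counts)
    avg_words = sum(total_freq.values()) / len(class_counts)  # "A" in the c-TF-IDF formula
    top_terms = {}
    for topic, counts in class_counts.items():
        n_words = sum(counts.values())
        weights = {w: (c / n_words) * math.log(1 + avg_words / total_freq[w])
                   for w, c in counts.items()}
        top_terms[topic] = sorted(weights, key=weights.get, reverse=True)[:top_n]
    return top_terms

topics = {
    "skola": ["läraren undervisar elever i skolan", "elever lär sig läsa"],
    "idrott": ["laget vann matchen", "spelaren gjorde mål i matchen"],
}
print(c_tf_idf(topics))
```

The BERTopic library wraps this (plus embedding, dimensionality reduction, and clustering) behind a `fit_transform` call; the point of the sketch is only to show why topic keywords come out class-specific rather than merely frequent.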

Evaluating Swedish Language Models

We present _OverLim_, a new benchmark for evaluating large language models for Swedish, Danish, and Norwegian, created by translating a subset of the GLUE and SuperGLUE benchmark tasks. The lack of suitable downstream tasks for these three Scandinavian languages has made it difficult to evaluate new models as they are published; model developers cannot easily see whether their training has succeeded, and comparison between models becomes even more tedious. While the translations were done using state-of-the-art models, their quality was not double-checked with native speakers, meaning that any results on these datasets should be interpreted carefully. The dataset is available via the _huggingface_ 🤗 ecosystem and can be downloaded at https://huggingface.co/datasets/KBLab/overlim.

Swedish Bootleg model

We at KBLab have trained a Swedish version of an entity disambiguation model called Bootleg, developed by the Hazy Research Lab at Stanford. The model is trained on Swedish Wikipedia and can be used to disambiguate named entities based on their context.

SUCX 3.0 - NER

We present a remix of the venerable SUC 3.0 dataset for Swedish Named Entity Recognition (NER), and explore the effect of hyperparameter optimization (HPO) for this task and dataset using our Swedish BERT model KB-BERT. We publish the data with a balanced train-development-test split using both manually and automatically annotated tags as a huggingface 🤗 dataset at https://huggingface.co/datasets/KBLab/sucx3_ner.
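Hyperparameter optimization for a task like NER fine-tuning often amounts to sampling configurations and keeping the one with the best development-set score. A schematic random search; the search space and the objective below are placeholders, not the settings or scores from our experiments (in practice the objective would fine-tune KB-BERT on the SUCX 3.0 train split and return development-set F1):

```python
import random

def random_search(objective, space: dict, n_trials: int = 20, seed: int = 42):
    """Sample n_trials configurations from `space` (name -> list of values)
    and return the best (config, score) pair according to `objective`."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {name: rng.choice(values) for name, values in space.items()}
        score = objective(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Placeholder objective: peaks at lr=3e-5, batch_size=32 to make the
# example deterministic; a real run would train and evaluate a model here.
def dev_f1(cfg):
    return -abs(cfg["learning_rate"] - 3e-5) - 0.01 * abs(cfg["batch_size"] - 32)

space = {"learning_rate": [1e-5, 2e-5, 3e-5, 5e-5], "batch_size": [16, 32, 64]}
best, score = random_search(dev_f1, space)
print(best)
```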

KBLab publishes an article about AI in the library in C&RL

What does AI mean for libraries and how could libraries pave the way for ethical AI? KBLab reflects on these questions in an article in the Open Access journal College & Research Libraries.

KBLab's director Love Börjeson nominated as AI Swede of the year 2021

We are very pleased to announce that our director Love Börjeson has been nominated for the prestigious prize Årets AI Svensk 2021 (AI Swede of the year 2021) which is awarded by TechSverige.

Introducing a Swedish Sentence Transformer

While language models such as BERT are effective at many tasks, they have limited use when it comes to information retrieval and large scale similarity comparisons. In this post we introduce a Swedish sentence transformer which produces semantically meaningful sentence embeddings suitable for use in semantic search applications. We evaluate the model on SuperLim (Swedish SuperGLUE), where it achieves the highest published scores on SweParaphrase (a test set to evaluate sentence similarity). The model is publicly available on Hugging Face.
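Semantic search with sentence embeddings follows a simple recipe: encode the corpus once, encode the query, and rank documents by cosine similarity. The ranking step is model-agnostic and can be sketched as below; the embeddings here are random stand-ins for the sentence transformer's output, and the commented model call requires the `sentence-transformers` library and a model download:

```python
import numpy as np

def semantic_search(query_emb: np.ndarray, doc_embs: np.ndarray, top_k: int = 3) -> list[int]:
    """Return indices of the top_k documents by cosine similarity to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(-scores)[:top_k].tolist()

# With the actual model (model id as published on the Hugging Face Hub):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("KBLab/sentence-bert-swedish-cased")
#   doc_embs = model.encode(documents)
#   query_emb = model.encode("Hur ansöker jag om lånekort?")

rng = np.random.default_rng(1)
doc_embs = rng.normal(size=(10, 768))
query_emb = doc_embs[4] + rng.normal(scale=0.1, size=768)  # near document 4
print(semantic_search(query_emb, doc_embs, top_k=1))
```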

A Swedish-Norwegian Federated Language Model

We trained a bilingual Swedish-Norwegian [ELECTRA](https://github.com/google-research/electra) language model in a federated setup, showcasing LM training when various corpora cannot be shared directly. The main goal of the project was to validate the feasibility of the approach, as well as to study some key numerical properties affecting the performance of the FL process.
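The core aggregation step in federated learning is typically federated averaging (FedAvg): each participant trains locally, and a coordinator averages the resulting parameters, weighted by local data size, so the raw corpora never leave their owners. A minimal sketch of that averaging step (our actual setup may have differed in detail):

```python
import numpy as np

def fedavg(client_weights: list[np.ndarray], n_examples: list[int]) -> np.ndarray:
    """Weighted average of client model parameters, proportional to how
    much data each client trained on."""
    total = sum(n_examples)
    return sum(w * (n / total) for w, n in zip(client_weights, n_examples))

# Two clients, e.g. a Swedish and a Norwegian corpus holder, with
# flattened parameter vectors after one round of local training:
swedish = np.array([1.0, 2.0, 3.0])
norwegian = np.array([3.0, 4.0, 5.0])
global_update = fedavg([swedish, norwegian], n_examples=[100, 300])
print(global_update)  # the Norwegian client counts 3x as much
```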

Topic models for Statens Offentliga Utredningar

We built some topic models on the National Library's SOU collection, as an example of what can be done with the Library's materials. The curated dataset can be a useful resource for other researchers interested in this collection, or a good playground for students who need a large dataset of freely available Swedish text.

The Power of Crowdsourcing

High-quality, curated datasets for minor languages like Swedish are hard to come by for most NLP tasks. We at KBLab decided to do something about it and are enlisting the help of our colleagues to annotate data to improve our models. Currently we are working on a NER dataset particularly suited for library purposes and we are also starting a project to transcribe radio speech for a new speech recognition model.

A multimodal approach to advertisement classification in digitized newspapers

The process of digitizing historical newspapers at the National Library of Sweden involves scanning physical copies of newspapers and storing them as images. In order to make the scanned contents machine readable and searchable, OCR (optical character recognition) procedures are applied. This results in a wealth of information being generated from different data modalities (images, text and OCR metadata). In this article we explore how features from multiple modalities can be integrated into a unified advertisement classifier model.
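One straightforward way to integrate the modalities is late feature fusion: embed the image, the text, and the OCR metadata separately, concatenate the vectors, and feed the result to a classifier head. The sketch below is schematic; the dimensions, the random features, and the linear head are illustrative stand-ins, not the article's actual architecture:

```python
import numpy as np

def fuse(image_feat: np.ndarray, text_feat: np.ndarray, ocr_feat: np.ndarray) -> np.ndarray:
    """Concatenate per-modality feature vectors into one fused representation."""
    return np.concatenate([image_feat, text_feat, ocr_feat])

def is_advertisement(fused: np.ndarray, w: np.ndarray, b: float) -> bool:
    """Toy linear classifier head on top of the fused features."""
    return float(fused @ w + b) > 0.0

rng = np.random.default_rng(2)
fused = fuse(rng.normal(size=512),   # e.g. an image embedding from a CNN
             rng.normal(size=768),   # e.g. a KB-BERT text embedding
             rng.normal(size=8))     # e.g. OCR confidence / layout metadata
w, b = rng.normal(size=fused.size), 0.0
print(fused.shape, is_advertisement(fused, w, b))
```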

Introduction to Distill for R Markdown

A short introduction showcasing the capabilities of the distill package for scientific and technical writing. Get yourself set up and start publishing posts!
