A guest post from the Swedish Financial Supervisory Authority (Finansinspektionen, FI), in which FI describes how it uses machine learning to analyze financial data and perform risk assessments.
How can new AI methods be used to improve the searchability and accessibility of visual heritage collections? We’ve taken advantage of the possibilities opened up by multimodal AI models to produce a demo showcasing these capabilities, using postcards as an example. Here we explain the project behind the demo and encourage you to explore it: https://lab.kb.se/bildsok
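To give a hint of how such multimodal search works under the hood, here is a sketch using a CLIP-style model from the sentence-transformers library. The post does not name the demo's actual model, so "clip-ViT-B-32" and the file names below are purely illustrative assumptions:

```python
# Hypothetical sketch of multimodal image search with a CLIP-style model.
# "clip-ViT-B-32" and the postcard file names are illustrative assumptions.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

# Encode a small collection of postcard scans into the shared embedding space.
image_paths = ["postcard_001.jpg", "postcard_002.jpg", "postcard_003.jpg"]
image_embeddings = model.encode([Image.open(p) for p in image_paths])

# A free-text query is encoded into the same space, so retrieval is just
# nearest-neighbour search over cosine similarities.
query_embedding = model.encode("a harbour with sailing boats")
hits = util.semantic_search(query_embedding, image_embeddings, top_k=3)[0]
for hit in hits:
    print(image_paths[hit["corpus_id"]], hit["score"])
```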
KBLab, together with Språkbanken Text, has developed 75 freely available datasets to support research in lexicography and beyond. The datasets, released as Kubord 2, offer intriguing material for research in the humanities. Let's see what the datasets have to offer and where to find them!
Searching a database of speakers by their voice presents a unique challenge, as speakers' voices change as they age. We can represent a speaker's voice computationally with what we call "voiceprints". We can compare pairs of them to decide whether they belong to the same speaker, that is, to "verify" the speaker's identity. But how do we know that two voiceprints are still similar enough to each other when recorded at two different ages, for example at 40 and 45? In this project, I investigated how voiceprints age over a 9-year range, how their aging depends on when the first voiceprint was recorded, and the effects of the audio length used to create the voiceprints and of the speaker's gender. For this I used debate speeches from the Riksdag, the Swedish Parliament.
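As a minimal illustration of the verification step, assuming voiceprints are fixed-length embedding vectors, the comparison boils down to a cosine similarity checked against a tuned threshold. All vectors and numbers below are synthetic:

```python
# Minimal sketch of voiceprint verification: two fixed-length speaker
# embeddings are compared with cosine similarity and accepted as the same
# speaker above a tuned threshold. Vectors and the threshold are illustrative.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
voiceprint_age_40 = rng.normal(size=192)  # embedding of a speech at age 40
voiceprint_age_45 = voiceprint_age_40 + rng.normal(scale=0.3, size=192)  # drifted voice

score = cosine_similarity(voiceprint_age_40, voiceprint_age_45)
THRESHOLD = 0.7  # in practice tuned on held-out same/different speaker pairs
print("same speaker" if score >= THRESHOLD else "different speaker", round(score, 3))
```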
KBLab presents a robust, multi-label sentiment classifier trained on Swedish texts. The model is robust in the sense that it is trained on multiple datasets of different text types, and it allows labeling of neutral as well as positive and negative texts. It is available under the Apache 2.0 license on the Hugging Face Hub.
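A minimal usage sketch with the 🤗 transformers pipeline API. The model id below is our guess at the published name and should be checked against the Hub:

```python
# Usage sketch; the model id "KBLab/robust-swedish-sentiment-multiclass" is an
# assumption -- verify the exact name on the Hugging Face Hub.
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="KBLab/robust-swedish-sentiment-multiclass")
print(classifier("Jag älskar den här boken!"))  # expected: a positive label
print(classifier("Tåget var försenat igen."))   # expected: negative or neutral
```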
KBLab releases a neural network based text-to-speech model for Swedish. The model was trained on an open Swedish speech synthesis dataset from NST. We make our latest training checkpoint available for anyone wishing to finetune it on a new voice. We also contribute the model weights to the open source project Piper, where users can deploy a lightweight, optimized version of the model on their own computers, or even on a Raspberry Pi.
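For the Piper route, usage can be as simple as piping text to the CLI, which reads from standard input. A small sketch via Python's subprocess, assuming the `piper` binary is installed and that the Swedish NST voice is published under the id `sv_SE-nst-medium` (an assumption worth verifying):

```python
# Sketch of invoking the Piper CLI from Python; the voice id below is assumed.
import subprocess

text = "Hej! Detta är ett test av talsyntes."
subprocess.run(
    ["piper", "--model", "sv_SE-nst-medium", "--output_file", "test.wav"],
    input=text.encode("utf-8"),  # Piper reads the text to synthesize on stdin
    check=True,
)
```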
We describe a typical topic modeling use case where BERTopic is applied to scientific abstracts in the research field of education. We discuss the limitations of BERTopic and potential ways to overcome them. Some further exploratory data analysis is performed using the topic model as a base.
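For readers who want to try the workflow themselves, a minimal BERTopic run looks like the sketch below. Since the education abstracts are not bundled here, we substitute the classic 20 Newsgroups corpus as stand-in data:

```python
# Minimal BERTopic run; 20 Newsgroups stands in for the scientific abstracts.
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data[:2000]

topic_model = BERTopic(language="multilingual", min_topic_size=10)
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info().head())  # topic sizes and keyword labels
print(topic_model.get_topic(0))             # top words of the largest topic
```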
KBLab releases RixVox, a speech dataset comprising 5,500 hours of speech from parliamentary debates. The speeches have been aligned with transcripts from the written protocols and contain additional metadata such as the speaker's gender, electoral district and birth year. RixVox is open and free for everyone to download and use.
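A quick way to peek at the data without downloading all 5,500 hours is streaming via 🤗 datasets. The dataset id "KBLab/rixvox" follows the Hub naming convention but is worth verifying:

```python
# Stream a single RixVox example to inspect the available metadata fields;
# the dataset id is an assumed Hub name.
from datasets import load_dataset

rixvox = load_dataset("KBLab/rixvox", split="train", streaming=True)
for example in rixvox.take(1):
    print(sorted(example.keys()))  # list the metadata columns without guessing names
```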
The Riksdag is the Parliament of Sweden. It has made available twenty years of parliamentary debates through its website and open data platform. Each speech is accompanied by rich metadata, including a transcript and markers indicating its start location and duration within the media file of the debate. However, we find that a not insignificant portion of the speeches have been misaligned, with the metadata being particularly unreliable for debates prior to 2012. In this work, we employ automatic speech recognition and speaker diarization to devise a fully automated approach to aligning transcripts with their corresponding speech in audio files.
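To give a flavor of the approach, the sketch below shows the simplest version of the matching step: normalize the ASR output and the protocol text, then measure their overlap with fuzzy string matching. The actual pipeline additionally relies on speaker diarization and word-level timestamps:

```python
# Simplified illustration of the alignment idea: locate the protocol speech
# inside the ASR transcript via fuzzy string matching. Texts are toy examples.
from difflib import SequenceMatcher

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

asr_transcript = normalize("herr talman jag vill börja med att tacka utskottet")
protocol_speech = normalize("Herr talman! Jag vill börja med att tacka utskottet")

matcher = SequenceMatcher(None, asr_transcript, protocol_speech, autojunk=False)
match = matcher.find_longest_match(0, len(asr_transcript), 0, len(protocol_speech))
overlap = match.size / max(len(protocol_speech), 1)
print(f"overlap ratio: {overlap:.2f}")  # high overlap -> speech correctly located
```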
KBLab has released a BERT model fine-tuned on NLI tasks, which can be used for zero-shot text classification.
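A usage sketch via the zero-shot classification pipeline, which wraps the NLI formulation. The model id below is an assumption; look up KBLab's NLI-finetuned model on the Hugging Face Hub:

```python
# Zero-shot classification sketch; the model id is assumed, not confirmed.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="KBLab/megatron-bert-large-swedish-cased-165-zero-shot",
)
result = classifier(
    "Regeringen föreslår höjd skatt på drivmedel.",
    candidate_labels=["politik", "sport", "kultur"],
)
print(result["labels"][0], result["scores"][0])  # highest-scoring label first
```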
KBLab's Swedish sentence transformer has been updated to a newer version. The new version features an increased maximum sequence length of 384 tokens, allowing users to encode longer documents. It also performs better on retrieval tasks, such as matching questions and answers.
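A small retrieval sketch with the updated model, matching a question against candidate answers; we assume the model id "KBLab/sentence-bert-swedish-cased" as published on the Hugging Face Hub:

```python
# Question-answer retrieval with the Swedish sentence transformer.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("KBLab/sentence-bert-swedish-cased")

answers = [
    "Biblioteket öppnar klockan nio på vardagar.",
    "Forskarsalen kräver ett lånekort.",
]
question_embedding = model.encode("När öppnar biblioteket?")
answer_embeddings = model.encode(answers)

# Rank the candidate answers by cosine similarity to the question.
hits = util.semantic_search(question_embedding, answer_embeddings, top_k=1)[0]
print(answers[hits[0]["corpus_id"]], hits[0]["score"])
```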
Topic modeling is an exciting option for exploring and finding patterns in large volumes of text data. While this previously required considerable programming skills, a recent innovation has simplified the method to make it more accessible for researchers in and beyond the academy. We explain how BERTopic harnesses KBLab’s language models to produce state-of-the-art topic modeling, and we offer some tips on how to get started.
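Concretely, pointing BERTopic at one of KBLab's models is a one-liner: pass a sentence transformer as the embedding backend. The sketch below uses KBLab's Swedish sentence transformer; the fit step is commented out since it needs your own document collection:

```python
# Plugging a KBLab sentence transformer into BERTopic as the embedding backend.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("KBLab/sentence-bert-swedish-cased")
topic_model = BERTopic(embedding_model=embedding_model, language="multilingual")
# topics, probs = topic_model.fit_transform(documents)  # documents: list of Swedish texts
```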
We present _OverLim_, a new benchmark for evaluating large language models for Swedish, Danish, and Norwegian, created by translating a subset of the GLUE and SuperGLUE benchmark tasks. The lack of suitable downstream tasks for these three Scandinavian languages has made it difficult to evaluate new models that are being published; model developers cannot easily see whether their training has succeeded, and comparison between various models becomes even more tedious. While the translations were done using state-of-the-art models, their quality was not double-checked with native speakers, meaning that any results on these datasets should be interpreted carefully. The dataset is available via the _huggingface_ 🤗 ecosystem and can be downloaded at https://huggingface.co/datasets/KBLab/overlim.
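A loading sketch via 🤗 datasets. Since the config name for each task is an assumption on our part (here "sst", following the GLUE originals), the snippet first lists the available tasks:

```python
# Load one OverLim task; "sst" is an assumed config name, so we list configs first.
from datasets import get_dataset_config_names, load_dataset

print(get_dataset_config_names("KBLab/overlim"))           # available tasks
sst = load_dataset("KBLab/overlim", "sst", split="train")  # assumed config name
print(sst[0])
```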
We at KBLab have trained a Swedish version of an entity disambiguation model called Bootleg, developed by the Hazy Research Lab at Stanford. The model is trained on Swedish Wikipedia and can be used to disambiguate named entities based on their context.
We present a remix of the venerable SUC 3.0 dataset for Swedish Named Entity Recognition (NER), and explore the effect of hyperparameter optimization (HPO) for this task and dataset using our Swedish BERT model KB-BERT. We publish the data with a balanced train-development-test split, using both manually and automatically annotated tags, as a huggingface 🤗 dataset at https://huggingface.co/datasets/KBLab/sucx3_ner.
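A loading sketch; the config name "original_cased" and the column names follow common conventions for this dataset and should be checked against the dataset card:

```python
# Load the remixed SUC 3.0 NER data; config and column names are assumptions
# based on the usual 🤗 NER dataset conventions.
from datasets import load_dataset

suc = load_dataset("KBLab/sucx3_ner", "original_cased")
print(suc)  # train/validation/test splits
print(suc["train"][0]["tokens"], suc["train"][0]["ner_tags"])
```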
What does AI mean for libraries and how could libraries pave the way for ethical AI? KBLab reflects on these questions in an article in the Open Access journal College & Research Libraries.
We are very pleased to announce that our director Love Börjeson has been nominated for the prestigious prize Årets AI Svensk 2021 (AI Swede of the Year 2021), awarded by TechSverige.
While language models such as BERT are effective at many tasks, they have limited use when it comes to information retrieval and large scale similarity comparisons. In this post we introduce a Swedish sentence transformer which produces semantically meaningful sentence embeddings suitable for use in semantic search applications. We evaluate the model on SuperLim (Swedish SuperGLUE), where it achieves the highest published scores on SweParaphrase (a test set to evaluate sentence similarity). The model is publicly available on Huggingface.
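For a taste of what these embeddings enable, the sketch below scores the similarity of two Swedish sentences, in the spirit of SweParaphrase; the model id is the one published on the Hub:

```python
# Sentence-similarity sketch: embed two sentences and compare them with cosine
# similarity. A value near 1.0 indicates near-paraphrases.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("KBLab/sentence-bert-swedish-cased")
emb = model.encode(["Hunden springer i parken.", "En hund leker utomhus."])
print(util.cos_sim(emb[0], emb[1]).item())
```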
We trained a bilingual Swedish-Norwegian [ELECTRA](https://github.com/google-research/electra) language model in a federated setup, showcasing LM training when various corpora cannot be shared directly. The main goal of the project was to validate the feasibility of the approach, as well as to study some key numerical properties affecting the performance of the federated learning (FL) process.
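To make the federated setup concrete, here is a toy sketch of federated averaging (FedAvg), the aggregation step at the heart of most FL schemes. The actual training involved considerably more machinery, so treat this purely as an illustration:

```python
# Conceptual FedAvg sketch: each participant trains locally on data that
# cannot be shared, and only the model weights are averaged centrally.
import numpy as np

def federated_average(client_weights: list[dict[str, np.ndarray]]) -> dict[str, np.ndarray]:
    """Average each named parameter across clients (equal client weighting)."""
    return {
        name: np.mean([weights[name] for weights in client_weights], axis=0)
        for name in client_weights[0]
    }

# Two toy "clients" with a single parameter matrix each.
swedish_client = {"embedding": np.ones((2, 2))}
norwegian_client = {"embedding": 3 * np.ones((2, 2))}
print(federated_average([swedish_client, norwegian_client])["embedding"])  # all 2.0
```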
We built some topic models on the National Library's SOU collection, as an example of what can be done with the Library's materials. The curated dataset can be a useful resource for other researchers interested in this collection, or a good playground for students who need a large dataset of freely available Swedish text.
High-quality, curated datasets for smaller languages like Swedish are hard to come by for most NLP tasks. We at KBLab decided to do something about it and are enlisting the help of our colleagues to annotate data to improve our models. Currently we are working on an NER dataset particularly suited to library purposes, and we are also starting a project to transcribe radio speech for a new speech recognition model.
The process of digitizing historical newspapers at the National Library of Sweden involves scanning physical copies of newspapers and storing them as images. In order to make the scanned contents machine readable and searchable, OCR (optical character recognition) procedures are applied. This results in a wealth of information being generated from different data modalities (images, text and OCR metadata). In this article we explore how features from multiple modalities can be integrated into a unified advertisement classifier model.
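As a schematic of the idea, a late-fusion classifier can concatenate per-modality feature vectors before a shared classification head. The dimensions and the two-class setup below are illustrative assumptions, not the architecture from the article:

```python
# Schematic late-fusion classifier: concatenate image, text and OCR-metadata
# features and feed them to a shared head. All dimensions are illustrative.
import torch
import torch.nn as nn

class AdClassifier(nn.Module):
    def __init__(self, img_dim=512, txt_dim=768, meta_dim=16):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_dim + txt_dim + meta_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 2),  # advertisement vs. editorial content
        )

    def forward(self, img_feat, txt_feat, meta_feat):
        return self.head(torch.cat([img_feat, txt_feat, meta_feat], dim=-1))

model = AdClassifier()
logits = model(torch.randn(4, 512), torch.randn(4, 768), torch.randn(4, 16))
print(logits.shape)  # torch.Size([4, 2])
```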
A short introduction showcasing the capabilities of the distill package for scientific and technical writing. Get yourself set up and start publishing posts!