Posts

BERTopic for Swedish: Topic modeling made easier via KB-BERT

Topic modeling is an exciting option for exploring and finding patterns in large volumes of text data. While this previously required considerable programming skills, a recent innovation has simplified the method to make it more accessible for researchers in and beyond the academy. We explain how BERTopic harnesses KBLab’s language models to produce state-of-the-art topic modeling, and we offer some tips on how to get started.

Evaluating Swedish Language Models

We present _OverLim_, a new benchmark for evaluating large language models for Swedish, Danish, and Norwegian, created by translating a subset of the GLUE and SuperGLUE benchmark tasks. The lack of suitable downstream tasks for these three Scandinavian languages has made it difficult to evaluate new models as they are published; model developers cannot easily see whether their training has succeeded, and comparing models becomes even more tedious. While the translations were done using state-of-the-art models, their quality was not double-checked with native speakers, so any results on these datasets should be interpreted carefully. The dataset is available via the _huggingface_ 🤗 ecosystem and can be downloaded at https://huggingface.co/datasets/KBLab/overlim.

Swedish Bootleg model

We at KBLab have trained a Swedish version of an entity disambiguation model called Bootleg, developed by the Hazy Research Lab at Stanford. The model is trained on Swedish Wikipedia and can be used to disambiguate named entities based on their context.

SUCX 3.0 - NER

We present a remix of the venerable SUC 3.0 dataset for Swedish Named Entity Recognition (NER), and explore the effect of hyperparameter optimization (HPO) for this task and dataset using our Swedish BERT model KB-BERT. We publish the data with a balanced train-development-test split using both manually and automatically annotated tags as a huggingface 🤗 dataset at https://huggingface.co/datasets/KBLab/sucx3_ner.

KBLab publishes an article about AI in the library in C&RL

What does AI mean for libraries and how could libraries pave the way for ethical AI? KBLab reflects on these questions in an article in the Open Access journal College & Research Libraries.

KBLab's director Love Börjeson nominated as AI Swede of the year 2021

We are very pleased to announce that our director Love Börjeson has been nominated for the prestigious prize Årets AI Svensk 2021 (AI Swede of the year 2021), which is awarded by TechSverige.

Introducing a Swedish Sentence Transformer

While language models such as BERT are effective at many tasks, they have limited use when it comes to information retrieval and large scale similarity comparisons. In this post we introduce a Swedish sentence transformer which produces semantically meaningful sentence embeddings suitable for use in semantic search applications. We evaluate the model on SuperLim (Swedish SuperGLUE), where it achieves the highest published scores on SweParaphrase (a test set to evaluate sentence similarity). The model is publicly available on Huggingface.
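To make the idea of semantic search over sentence embeddings concrete, here is a minimal sketch that ranks sentences by cosine similarity to a query. The three-dimensional vectors are made up for illustration and stand in for the embeddings a sentence transformer would produce; the function names are hypothetical and not part of any KBLab code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, corpus):
    """Return (sentence, score) pairs sorted from most to least similar."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Assumption: toy vectors standing in for real sentence embeddings.
corpus = {
    "Katten sover på soffan.": [0.9, 0.1, 0.0],
    "Riksdagen röstade om budgeten.": [0.0, 0.2, 0.9],
    "En kattunge vilar i fåtöljen.": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]  # hypothetical embedding of a cat-related query

ranking = rank_by_similarity(query, corpus)
for sentence, score in ranking:
    print(f"{score:.3f}  {sentence}")
```

With a real model, the only change would be where the vectors come from; the ranking step stays the same.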

A Swedish-Norwegian Federated Language Model

We trained a bilingual Swedish-Norwegian [ELECTRA](https://github.com/google-research/electra) language model in a federated setup, showcasing language model training when the underlying corpora cannot be shared directly. The main goal of the project was to validate the feasibility of the approach, as well as to study some key numerical properties affecting the performance of the federated learning (FL) process.
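As a rough illustration of how a federated setup can combine models without sharing data, the sketch below averages client parameters weighted by local dataset size. The summary does not say which aggregation rule the project used; federated averaging (FedAvg) is assumed here because it is a common choice, and the client names, weights, and example counts are invented.

```python
def federated_average(client_updates):
    """Weighted average of client parameter vectors.

    client_updates is a list of (weights, num_examples) tuples; each client
    contributes in proportion to the number of examples it trained on.
    """
    total_examples = sum(n for _, n in client_updates)
    dim = len(client_updates[0][0])
    averaged = [0.0] * dim
    for weights, n in client_updates:
        for i, w in enumerate(weights):
            averaged[i] += w * n / total_examples
    return averaged


# Assumption: two hypothetical clients with tiny two-parameter "models".
swedish_client = ([1.0, 2.0], 300)    # (local weights, local example count)
norwegian_client = ([3.0, 6.0], 100)

global_weights = federated_average([swedish_client, norwegian_client])
print(global_weights)
```

Only the parameter vectors and example counts leave each client; the text corpora themselves stay local.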

Topic models for Statens Offentliga Utredningar

We built some topic models on the National Library's SOU collection, as an example of what can be done with the Library's materials. The curated dataset can be a useful resource for other researchers interested in this collection, or a good playground for students who need a large dataset of freely available Swedish text.

The Power of Crowdsourcing

High-quality, curated datasets for minor languages like Swedish are hard to come by for most NLP tasks. We at KBLab decided to do something about it and are enlisting the help of our colleagues to annotate data to improve our models. Currently we are working on a NER dataset particularly suited for library purposes and we are also starting a project to transcribe radio speech for a new speech recognition model.

A multimodal approach to advertisement classification in digitized newspapers

The process of digitizing historical newspapers at the National Library of Sweden involves scanning physical copies of newspapers and storing them as images. In order to make the scanned contents machine readable and searchable, OCR (optical character recognition) procedures are applied. This results in a wealth of information being generated from different data modalities (images, text and OCR metadata). In this article we explore how features from multiple modalities can be integrated into a unified advertisement classifier model.

Introduction to Distill for R Markdown

A short introduction showcasing the capabilities of the distill package for scientific and technical writing. Get yourself set up and start publishing posts!
