CLiPS Technical Report Series (CTRS)

This page contains all issues of the CLiPS Technical Report Series (CTRS) as published by the CLiPS Research Center at the University of Antwerp. The issues are listed in reverse chronological order.

ISSN: 2033-3544

11. Multiscope: A User-Friendly Multi-Label Text Classification Dashboard

Abstract

We present Multiscope, a user-friendly tool for training and applying Multi-Label Text Classification (MLTC) models to datasets of choice. MLTC is a complex, yet vital component of analyzing large corpora that aims to assign multiple labels to a single text. However, compared to more traditional classification approaches, training, evaluating and deploying MLTC models can be challenging because of the nature of the task. Multiscope provides a complete pipeline for this classification problem, starting with data stratification, providing insights into the label distribution and interactions between labels. The tool also provides a framework for fine-tuning state-of-the-art transformer models and training classical Machine Learning models. The trained models can be evaluated using multi-label classification metrics.

Keywords: Multi-label text classification, NLP for the humanities

Issue #

011

Author(s)

Jens Van Nooten

Walter Daelemans

ISSN

2033-3544

Published

January 2025

Download

Pdf

10. Styloscope and Toposcope: Towards User-Friendly Digital Text Analysis

Abstract

We present two Natural Language Processing tools that were developed during the CLARIAH-Flanders project, which aims to facilitate large-scale research in the humanities and social sciences. The first tool, Styloscope, allows automatic writing style analysis for large text corpora. Secondly, Toposcope detects and annotates topics in unstructured text data. The pipelines are available on GitHub and can be used from the command line or through a user interface. Additionally, they provide downloadable outputs, including raw document-level results and visualizations of aggregated results.

Keywords: Topic modeling, stylometry, NLP for the humanities and social sciences

Issue #

010

Author(s)

Jens Lemmens
Walter Daelemans

ISSN

2033-3544

Published

May 2024

Download

Pdf

9. The LiLaH Emotion Lexicon of Greek, Kurdish, Turkish, Spanish, Farsi and Chinese

Abstract

In this technical report, we present manual translations in Greek, Farsi, Spanish, Kurdish, Turkish, simplified Chinese and traditional Chinese of the NRC word-emotion association lexicon (Mohammad and Turney, 2013). For each of the aforementioned languages, we collaborated with one native speaker who checked and - if necessary - corrected the automatically generated translations of the lexicon according to the provided annotation guidelines.

Issue #

009

Author(s)

Jens Lemmens

Ilia Markov

Walter Daelemans

ISSN

2033-3544

Published

February 2023

Download

Pdf

8. Multilingual Cross-domain Perspectives on Online Hate Speech

Abstract

In this report, we present a study of eight corpora of online hate speech, by demonstrating the NLP techniques that we used to collect and analyze the jihadist, extremist, racist, and sexist content. Analysis of the multilingual corpora shows that the different contexts share certain characteristics in their hateful rhetoric. To expose the main features, we have focused on text classification, text profiling, keyword and collocation extraction, along with manual annotation and qualitative study.

Issue #

008

Author(s)

Tom De Smedt

Sylvia Jaki

Eduan Kotzé

Leïla Saoud

Maja Gwóźdź

Guy De Pauw

Walter Daelemans

ISSN

2033-3544

Published

11/09/2018

Download

Pdf

7. Automatic Detection of Online Jihadist Hate Speech

Abstract

We have developed a system that automatically detects online jihadist hate speech with over 80% accuracy, by using techniques from Natural Language Processing and Machine Learning. The system is trained on a corpus of 45,000 subversive Twitter messages collected from October 2014 to December 2016. We present a qualitative and quantitative analysis of the jihadist rhetoric in the corpus, examine the network of Twitter users, outline the technical procedure used to train the system, and discuss examples of use.

Issue #

007

Author(s)

Tom De Smedt
Guy De Pauw
Pieter Van Ostaeyen

ISSN

2033-3544

Published

01/03/2018

Download

Pdf

6. Creating TwiSty: Corpus Development and Statistics

Abstract

This document provides information on the creation of the Twitter Stylometry (TwiSty) corpus (Verhoeven et al., 2016). The corpus contains Twitter profiles annotated with MBTI personality types and gender information, covering six languages: Italian (IT), Dutch (NL), German (DE), Spanish (ES), French (FR), and Portuguese (PT).

Issue #

006

Author(s)

Ben Verhoeven
Walter Daelemans
Barbara Plank

ISSN

2033-3544

Published

01/05/2016

Download

Pdf

5. Annotation Guidelines for Compound Analysis

Abstract

This technical report introduces three sets of annotation guidelines for the analysis of compounds in Afrikaans and Dutch. The first protocol serves the annotation of compound boundaries when creating a dataset to use for compound segmentation. The second and third protocols serve the semantic annotation of the relation between the constituents of compounds. Whereas the second protocol focuses only on noun-noun (NN) compounds, the third protocol deals with other two-part nominal (XN) compounds.

The report further contains a terminology list with definitions of concepts and abbreviations relevant to the analysis of compounds and an overview of the AuCoPro project in the context of which these guidelines were developed.

Issue #

005

Author(s)

Ben Verhoeven
Gerhard van Huyssteen
Menno van Zaanen
Walter Daelemans

ISSN

2033-3544

Published

31/01/2014

Download

Pdf

4. STYLENE: an Environment for Stylometry and Readability Research for Dutch

Abstract

This report provides a practical introduction to the use of the interface and corresponding backend system that were developed in the context of the Stylene project. The goal of that project was to implement a robust, modular system for stylometry and readability research on the basis of existing methods, and to develop a web service that allows researchers in the humanities and social sciences to analyze texts with this system.

Issue #

004

Author(s)

Walter Daelemans
Véronique Hoste

ISSN

2033-3544

Published

19/07/2013

Download

Pdf

3. Annotation of Negation Cues and their Scope. Guidelines v1.0.

Abstract

This technical report contains the guidelines for the annotation of negation information at sentence level in two Conan Doyle stories, The Hound of the Baskervilles and The Adventure of Wisteria Lodge. In sentences containing negation, the negation cues and their scope are marked, as well as the event that is negated. The annotated corpus is publicly available.

Issue #

003

Author(s)

Roser Morante
Sarah Schrauwen
Walter Daelemans

ISSN

2033-3544

Published

04/05/2011

Download

Pdf

2. Memory-based Shallow Parser for Python

Abstract

MBSP is a text analysis system based on the TiMBL and MBT memory based learning applications developed at CLiPS and ILK. It provides tools for Tokenization and Sentence Splitting, Part of Speech Tagging, Chunking, Lemmatization, Relation Finding and Prepositional Phrase Attachment.

The Python implementation of MBSP is open source and freely available.

Issue #

002

Author(s)

Tom De Smedt
Vincent Van Asch
Walter Daelemans

ISSN

2033-3544

Published

07/09/2010

Download

Pdf

1. Machine Learning Approaches to Sentiment Analysis Using the Dutch Netlog Corpus

Abstract

Sentiment analysis deals with the computational treatment of opinion, sentiment and subjectivity. We constructed and manually annotated a corpus, the Dutch Netlog Corpus, with data extracted from the social networking website Netlog. This corpus was annotated on three levels: ‘valence’ (expressing the opinion of the writer; we distinguish between ‘positive’, ‘negative’, ‘both’, ‘neutral’ and ‘n/a’) and additionally language performance, which is divided into two areas: ‘performance’ (‘standard’ versus ‘dialect’) and ‘chat’ (‘chat’ versus ‘non-chat’). We tackle sentiment analysis as a text classification task and employ two simple feature sets (the most frequent and the most informative words of the corpus) and three supervised classifiers implemented in the Natural Language Toolkit (the Naïve Bayes, Maximum Entropy and Decision Tree classifiers).

Issue #

001

Author(s)

Sarah Schrauwen

ISSN

2033-3544

Published

28/07/2010

Download

Pdf