CLiPS Technical Report Series (CTRS)
This page lists all issues of the CLiPS Technical Report Series (CTRS), published by the CLiPS Research Center at the University of Antwerp. The issues are listed in reverse chronological order.
ISSN: 2033-3544
10. Styloscope and Toposcope: Towards User-Friendly Digital Text Analysis
Abstract
We present two Natural Language Processing tools that were developed during the CLARIAH-Flanders project, which aims to facilitate large-scale research in the humanities and social sciences. The first tool, Styloscope, allows automatic writing style analysis for large text corpora. The second, Toposcope, detects and annotates topics in unstructured text data. The pipelines are available on GitHub and can be used from the command line or through a user interface. Additionally, they provide downloadable outputs, including raw document-level results and visualizations of aggregated results.
Keywords: Topic modeling, stylometry, NLP for the humanities and social sciences
Issue #
010
Author(s)
Jens Lemmens
Walter Daelemans
ISSN
2033-3544
Published
May 2024
Download
9. The LiLaH Emotion Lexicon of Greek, Kurdish, Turkish, Spanish, Farsi and Chinese
Abstract
In this technical report, we present manual translations in Greek, Farsi, Spanish, Kurdish, Turkish, simplified Chinese and traditional Chinese of the NRC word-emotion association lexicon (Mohammad and Turney, 2013). For each of the aforementioned languages, we collaborated with one native speaker who checked and - if necessary - corrected the automatically generated translations of the lexicon according to the provided annotation guidelines.
Issue #
009
Author(s)
Jens Lemmens
Ilia Markov
Walter Daelemans
ISSN
2033-3544
Published
February 2023
Download
8. Multilingual Cross-domain Perspectives on Online Hate Speech
Abstract
In this report, we present a study of eight corpora of online hate speech, by demonstrating the NLP techniques that we used to collect and analyze the jihadist, extremist, racist, and sexist content. Analysis of the multilingual corpora shows that the different contexts share certain characteristics in their hateful rhetoric. To expose the main features, we have focused on text classification, text profiling, keyword and collocation extraction, along with manual annotation and qualitative study.
Issue #
008
Author(s)
Tom De Smedt
Sylvia Jaki
Eduan Kotzé
Leïla Saoud
Maja Gwóźdź
Guy De Pauw
Walter Daelemans
ISSN
2033-3544
Published
11/09/2018
Download
7. Automatic Detection of Online Jihadist Hate Speech
Abstract
We have developed a system that automatically detects online jihadist hate speech with over 80% accuracy, by using techniques from Natural Language Processing and Machine Learning. The system is trained on a corpus of 45,000 subversive Twitter messages collected from October 2014 to December 2016. We present a qualitative and quantitative analysis of the jihadist rhetoric in the corpus, examine the network of Twitter users, outline the technical procedure used to train the system, and discuss examples of use.
Issue #
007
Author(s)
Tom De Smedt
Guy De Pauw
Pieter Van Ostaeyen
ISSN
2033-3544
Published
01/03/2018
Download
6. Creating TwiSty: Corpus Development and Statistics
Abstract
This document provides information on the creation of the Twitter Stylometry (TwiSty) corpus (Verhoeven et al., 2016). The corpus contains Twitter profiles annotated with MBTI personality types and gender information, covering six languages: Italian (IT), Dutch (NL), German (DE), Spanish (ES), French (FR), and Portuguese (PT).
Issue #
006
Author(s)
Ben Verhoeven
Walter Daelemans
Barbara Plank
ISSN
2033-3544
Published
01/05/2016
Download
5. Annotation Guidelines for Compound Analysis
Abstract
This technical report introduces three sets of annotation guidelines for the analysis of compounds in Afrikaans and Dutch. The first protocol serves the annotation of compound boundaries when creating a dataset to use for compound segmentation. The second and third protocols serve the semantic annotation of the relation between the constituents of compounds. Where the second protocol only focuses on noun-noun (NN) compounds, the third protocol deals with other two-part nominal (XN) compounds.
The report further contains a terminology list with definitions of concepts and abbreviations relevant to the analysis of compounds and an overview of the AuCoPro project in the context of which these guidelines were developed.
Issue #
005
Author(s)
Ben Verhoeven
Gerhard van Huyssteen
Menno van Zaanen
Walter Daelemans
ISSN
2033-3544
Published
31/01/2014
Download
4. STYLENE: an Environment for Stylometry and Readability Research for Dutch
Abstract
This report provides a practical introduction to the use of the interface and corresponding backend system that were developed in the context of the Stylene project. The goal of that project was to implement a robust, modular system for stylometry and readability research on the basis of existing methods, and to develop a web service that allows researchers in the humanities and social sciences to analyze texts with this system.
Issue #
004
Author(s)
Walter Daelemans
Véronique Hoste
ISSN
2033-3544
Published
19/07/2013
Download
3. Annotation of Negation Cues and their Scope. Guidelines v1.0.
Abstract
This technical report contains the guidelines for the annotation of negation information at sentence level in two Conan Doyle stories, The Hound of the Baskervilles and The Adventure of Wisteria Lodge. In sentences containing negation, the negation cues and their scope are marked, as well as the event that is negated. The annotated corpus is publicly available.
Issue #
003
Author(s)
Roser Morante
Sarah Schrauwen
Walter Daelemans
ISSN
2033-3544
Published
04/05/2011
Download
2. Memory-based Shallow Parser for Python
Abstract
MBSP is a text analysis system based on the TiMBL and MBT memory-based learning applications developed at CLiPS and ILK. It provides tools for Tokenization and Sentence Splitting, Part of Speech Tagging, Chunking, Lemmatization, Relation Finding and Prepositional Phrase Attachment.
The Python implementation of MBSP is open source and freely available.
Issue #
002
Author(s)
Tom De Smedt
Vincent Van Asch
Walter Daelemans
ISSN
2033-3544
Published
07/09/2010
Download
1. Machine Learning Approaches to Sentiment Analysis Using the Dutch Netlog Corpus
Abstract
Sentiment analysis deals with the computational treatment of opinion, sentiment and subjectivity. We constructed and manually annotated a corpus, the Dutch Netlog Corpus, with data extracted from the social networking website Netlog. This corpus was annotated on three levels: ‘valence’ (expressing the opinion of the writer; we distinguish between ‘positive’, ‘negative’, ‘both’, ‘neutral’ and ‘n/a’) and additionally language performance, which is divided into two areas: ‘performance’ (‘standard’ versus ‘dialect’) and ‘chat’ (‘chat’ versus ‘non-chat’). We tackle sentiment analysis as a text classification task and employ two simple feature sets (the most frequent and the most informative words of the corpus) and three supervised classifiers implemented from the Natural Language Toolkit (the Naïve Bayes, Maximum Entropy and Decision Tree classifiers).
Issue #
001
Author(s)
Sarah Schrauwen
ISSN
2033-3544
Published
28/07/2010