CLiPS Colloquium - Artur Kulmizev : How good is your Wikipedia?

Date: December 6, 15:00

Location: Room S.R.231

Registration: Please confirm your attendance by mailing jens.lemmens@uantwerpen.be

Abstract: Wikipedia’s perceived high quality and broad language coverage have established it as a fundamental resource in multilingual NLP. In the context of low-resource languages, however, these quality assumptions are increasingly being scrutinised. This paper critically examines the data quality of Wikipedia in a non-English setting by subjecting it to various quality filtering techniques, revealing widespread issues such as a high percentage of one-line articles and duplicate articles. We evaluate the downstream impact of quality filtering on Wikipedia and find that data quality pruning is an effective means for resource-efficient training without hurting performance, especially for low-resource languages. Moreover, we advocate for a shift in perspective from seeking a general definition of data quality towards a more language- and task-specific one. Ultimately, we aim for this study to serve as a guide to using Wikipedia for pretraining in a multilingual setting. 

Bio: Artur Kulmizev is a Postdoctoral Researcher at KU Leuven (🇧🇪), working within the LAGoM-NLP group led by Miryam de Lhoneux. His research concerns multilingual NLP and methods for making language representation fair and equitable. In particular, he is interested in how principled data selection and language sampling can be employed during pre-training, so that models learn from better data rather than simply more of it. He is also interested in how typological bias is reflected in commonly used evaluation datasets and metrics, and how this affects our appraisal of multilingual models.

Prior to KU Leuven, Artur earned his PhD in computational linguistics at Uppsala University (🇸🇪), supervised by Joakim Nivre and Anders Søgaard. His dissertation focused on the syntactic knowledge encoded by language models, investigated through the lens of dependency parsing. Before his PhD, Artur graduated from the EM-LCT program, where he spent his first year at the University of Groningen (🇳🇱) and his second year at the University of the Basque Country (🇪🇸).

Kushal Tatariya is a PhD student at KU Leuven (🇧🇪), working within the LAGoM-NLP group led by Miryam de Lhoneux. She works on problems in multilingual and low-resource NLP; her past projects have looked at NLP for creole languages and code-switching. She is particularly interested in how data availability and quality influence the development of language models for low-resource languages. A central aspect of her research is linguistically informed interpretability as a way to aid the understanding and development of multilingual language models.

Prior to her PhD, Kushal earned her master's in linguistics from the University of Glasgow and her advanced master's in digital humanities from KU Leuven (🇧🇪), focusing on the sociolinguistics of language contact in post-colonial settings.