Developing new text data collections for specific research questions
Crawling textual data sources from the Internet (in compliance with the GDPR)
- collecting corpora from social media platforms such as Twitter & Facebook
- crawling discussion threads from forums such as reddit.com
Data conversion and extraction from various file formats
- unstructured data: .pdf / .doc / .txt etc.
- structured data: XML etc.
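As an illustration of extraction from structured data, the following is a minimal sketch using Python's standard-library `xml.etree.ElementTree`. The element name `post`, the sample document, and the helper `extract_texts` are hypothetical and chosen only for the example; real sources will use their own schema.

```python
import xml.etree.ElementTree as ET


def extract_texts(xml_string, tag="post"):
    """Return the stripped text content of every <tag> element.

    The tag name "post" is an assumption made for this example.
    """
    root = ET.fromstring(xml_string)
    # itertext() also yields text nested inside child elements such as <b>.
    return ["".join(el.itertext()).strip() for el in root.iter(tag)]


# Hypothetical sample document, not taken from any real data source.
sample = (
    "<thread>"
    "<post id='1'>Hello <b>world</b>!</post>"
    "<post id='2'> Second post </post>"
    "</thread>"
)

texts = extract_texts(sample)
```

Unstructured formats such as .pdf or .doc generally need a dedicated third-party library and are not covered by this sketch.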
Text data preprocessing
- tokenization
- automatic linguistic analysis: e.g. Named Entity Recognition (NER)
- automated cleanup: spelling normalization & correction
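The tokenization and cleanup steps above can be sketched with the standard library alone. This is a deliberately simple, regex-based illustration: the token pattern, the function names, and the `VARIANTS` lookup table are assumptions made for the example, not a description of any particular toolchain. Named Entity Recognition usually relies on a trained model and is left out here.

```python
import re
import unicodedata

# Words (optionally joined by apostrophes or hyphens) or single punctuation marks.
TOKEN_RE = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")

# Hypothetical lookup table mapping known misspellings to a normal form.
VARIANTS = {"teh": "the", "recieve": "receive"}


def normalize(text):
    """Apply Unicode NFKC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split normalized text into word and punctuation tokens."""
    return TOKEN_RE.findall(normalize(text))


def normalize_spelling(tokens):
    """Replace tokens found in VARIANTS (case-insensitively); keep the rest."""
    return [VARIANTS.get(tok.lower(), tok) for tok in tokens]


tokens = tokenize("Don't  panic, it's state-of-the-art!")
cleaned = normalize_spelling(["Teh", "cat"])
```

A table lookup only handles known variants; correcting unseen misspellings would need an edit-distance or model-based approach.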