Developing new text data collections for specific research questions

Crawling textual data sources from the Internet (compliant with GDPR)

  • collecting corpora from social media platforms such as Twitter & Facebook
  • crawling threads from reddit.com etc.

Data conversion and extraction from various file formats

  • unstructured data: .pdf / .doc / .txt etc.
  • structured data: XML etc.
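
As a rough illustration of extraction from structured data, the following sketch pulls plain-text records out of an XML string using Python's standard library. The sample document, its element names (corpus, doc, title, body) and the function name extract_documents are assumptions made for this example, not a fixed format.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input; the element names are assumptions for illustration.
SAMPLE_XML = """<corpus>
  <doc id="1"><title>First</title><body>Hello <b>world</b>.</body></doc>
  <doc id="2"><title>Second</title><body>More   text here.</body></doc>
</corpus>"""


def extract_documents(xml_string):
    """Return one dict per <doc> element with its id, title and body text."""
    root = ET.fromstring(xml_string)
    records = []
    for doc in root.iter("doc"):
        title = doc.findtext("title", default="")
        body = doc.find("body")
        # itertext() also visits nested elements, so inline markup is flattened.
        raw_text = "".join(body.itertext()) if body is not None else ""
        records.append({
            "id": doc.get("id"),
            "title": title,
            # Collapse runs of whitespace into single spaces.
            "text": " ".join(raw_text.split()),
        })
    return records


if __name__ == "__main__":
    for record in extract_documents(SAMPLE_XML):
        print(record)
```

Unstructured formats such as .pdf or .doc usually need a dedicated parser before a step like this can be applied.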

Text data preprocessing

  • tokenization
  • automatic linguistic analysis: e.g. Named Entity Recognition (NER)
  • automated cleanup: spelling normalization & correction
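
A minimal sketch of two of these steps, tokenization and spelling normalization, using only Python's standard library. The regular expression, the lookup table SPELLING_VARIANTS and the function names are assumptions chosen for illustration; real projects typically rely on language-specific tokenizers and much larger normalization resources.

```python
import re
import unicodedata

# Hypothetical lookup table mapping variant or misspelled forms to a target form.
SPELLING_VARIANTS = {"colour": "color", "teh": "the"}

# A token is either a run of word characters or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    # NFC normalization makes composed and decomposed characters compare equal.
    text = unicodedata.normalize("NFC", text)
    return TOKEN_PATTERN.findall(text)


def normalize(tokens):
    """Lowercase each token and replace known spelling variants."""
    normalized = []
    for token in tokens:
        lowered = token.lower()
        normalized.append(SPELLING_VARIANTS.get(lowered, lowered))
    return normalized


if __name__ == "__main__":
    print(normalize(tokenize("Teh colour, please!")))
```

Lowercasing is a deliberate simplification here; it would usually be skipped before Named Entity Recognition, where capitalization carries information.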