The research topics include:
- Mobile Advertising
- Social Network Analysis
- Mining Behavioural Big Data
- Explaining Documents' Classification
- Rule Extraction
- Bigraph Social Networks
- Credit Scoring
- Swarm Intelligence for Data Mining
- Text Mining for Politics
- Explaining Deep Learning Models
- Customs Fraud Detection
Mobile Advertising
In this research, we make use of location data seen in Real-Time Bidding (RTB) environments. We have proposed and empirically evaluated a new social-targeting design for effective, privacy-friendly mobile advertising. Specifically, the design uses location data from mobile devices (smartphones, tablets, laptops, etc.) to create user/device cohorts, which can then be used (1) to target advertisements in a manner that is both effective and privacy-friendly, and (2) to evaluate campaigns across multiple channels. A short summary of this work can be found here.
This is joint work with Foster Provost (NYU) and EveryScreen Media (a DStillery company).
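To make the cohort idea concrete, here is a minimal sketch that groups device IDs by co-visited locations. The (device, location) input format, the pairing rule and the min_shared threshold are illustrative assumptions, not the actual targeting design:

```python
from collections import defaultdict

def build_cohorts(bid_requests, min_shared=2):
    """Group devices into cohorts based on co-visited locations.

    bid_requests: iterable of (device_id, location_id) pairs, as might be
    observed in an RTB bid stream. Two devices end up in the same cohort
    when they share at least `min_shared` locations (transitively).
    """
    locations_per_device = defaultdict(set)
    for device, location in bid_requests:
        locations_per_device[device].add(location)

    devices = list(locations_per_device)
    parent = {d: d for d in devices}  # union-find over devices

    def find(d):
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for i, a in enumerate(devices):
        for b in devices[i + 1:]:
            if len(locations_per_device[a] & locations_per_device[b]) >= min_shared:
                parent[find(a)] = find(b)

    cohorts = defaultdict(set)
    for d in devices:
        cohorts[find(d)].add(d)
    return list(cohorts.values())

requests = [("dev1", "locA"), ("dev1", "locB"),
            ("dev2", "locA"), ("dev2", "locB"), ("dev3", "locC")]
print(build_cohorts(requests))  # dev1 and dev2 share two locations
```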
Social Network Analysis
In recent years, vast amounts of networked data on a broad range of information flows between interlinked entities have become available, such as calls and text messages linking telephone accounts or online users, and money transfers connecting bank accounts. In this research, we develop and apply new techniques that leverage such networked data to build better prediction models in marketing and finance domains.
Mining Behavioural Big Data
The identification and comprehension of customer behaviour is of crucial importance in current business environments. Data mining plays an important role in this field and has been applied widely, e.g. for churn and response prediction. Social network data has become very valuable for marketing purposes: addressing the network neighbours of current customers can be a very efficient marketing strategy. This idea of using network learners for response modelling has already been applied successfully in the literature. However, such analyses have so far largely been limited to domains where an explicit social network is available, such as online friendship communities or telecommunications. In this research, we look for proxies of social ties to build pseudo social network variables in retail and B2B settings. Concretely, we enrich existing data with pseudo social network data, and using these enriched datasets we investigate whether adding social network data improves the predictive performance of churn prediction models.
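As a minimal sketch of the idea, the snippet below derives weighted pseudo-network edges from proxy observations; the (customer, proxy) input format and the count-based edge weights are illustrative assumptions:

```python
from collections import defaultdict
from itertools import combinations

def pseudo_network(observations):
    """Build pseudo social-network edges from proxy data.

    observations: iterable of (customer_id, proxy) pairs, where `proxy`
    is any stand-in for a social tie (shared address, same store visit,
    joint invoice, ...). Edge weights count the shared proxies.
    """
    customers_per_proxy = defaultdict(set)
    for customer, proxy in observations:
        customers_per_proxy[proxy].add(customer)

    edges = defaultdict(int)
    for customers in customers_per_proxy.values():
        for a, b in combinations(sorted(customers), 2):
            edges[(a, b)] += 1
    return dict(edges)

obs = [("ann", "store_17"), ("bob", "store_17"),
       ("bob", "addr_9"), ("carl", "addr_9")]
print(pseudo_network(obs))  # {('ann', 'bob'): 1, ('bob', 'carl'): 1}
```

The resulting edge weights can then be turned into pseudo social network variables (e.g. the fraction of churned neighbours) and added to the base churn model.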
Explaining Documents' Classification
Document classification has widespread applications: web pages for advertising, emails for legal discovery, blog entries for sentiment analysis, and many more. Unfortunately, due to the high dimensionality of text data, understanding the decisions made by document classifiers is very difficult. We define a new sort of explanation, tailored to the business needs of document classification and able to cope with the associated technical constraints. Specifically, an explanation is defined as a set of words (terms, more generally) such that removing all words within this set from the document changes the predicted class from the class of interest.
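A simple way to search for such an explanation is a greedy best-first removal, sketched below; the scoring interface and the greedy strategy are simplifying assumptions rather than the exact published algorithm:

```python
def explain_document(words, score, threshold=0.5):
    """Greedily find a word set whose removal flips the predicted class.

    words: the terms in the document.
    score: function mapping a list of words to the classifier's score
    for the class of interest; the class flips when the score drops
    below `threshold`.
    """
    removed, remaining = [], list(words)
    while remaining and score(remaining) >= threshold:
        # Remove the word whose deletion lowers the score the most
        # (all occurrences of the word are removed at once).
        word = min(set(remaining),
                   key=lambda w: score([x for x in remaining if x != w]))
        removed.append(word)
        remaining = [x for x in remaining if x != word]
    return removed  # removing these words changes the predicted class

# Toy scorer: a document is "sports" if it mentions enough sports terms.
sports_score = lambda doc: sum(w in {"goal", "match", "team"} for w in doc) / 3
print(explain_document(["the", "team", "won", "the", "match", "goal"],
                       sports_score))
```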
Rule Extraction
The lack of transparency of many state-of-the-art data mining techniques, such as support vector machines, artificial neural networks or random forests, renders them useless in any domain where comprehensibility is of importance. Rule extraction is a technique designed to remedy this: it extracts comprehensible rules that mimic the decisions made by the black-box model.
ALPA-R
Advances in data mining have led to algorithms that produce accurate regression models for large and difficult-to-approximate data. Most of these algorithms use non-linear models to handle complex relationships in the input data. Their lack of transparency, however, is problematic in the many application domains where comprehensibility is of key importance.
Rule-extraction algorithms have been proposed to solve this problem for classification by extracting comprehensible rulesets from the better-performing, complex models. We present a new rule extraction algorithm for regression, based on active learning and the pedagogical approach to rule extraction. Empirical results show that it improves on classical rule induction techniques, in fidelity as well as accuracy.
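A compact sketch of this pedagogical approach, assuming scikit-learn, with a shallow regression tree as the comprehensible surrogate; the plain Gaussian perturbation simplifies the actual active-learning sampling:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def pedagogical_regression_rules(X, black_box, n_extra=1000, scale=0.1, seed=0):
    """Sample artificial points around the training data, label them
    with the black-box regressor, and fit an interpretable surrogate."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), size=n_extra)
    noise = rng.normal(scale=scale * X.std(axis=0), size=(n_extra, X.shape[1]))
    X_aug = np.vstack([X, X[idx] + noise])
    y_aug = black_box(X_aug)  # labels come from the model, not the data
    return DecisionTreeRegressor(max_depth=3).fit(X_aug, y_aug)
```

Fidelity can then be measured by how well the surrogate reproduces the black box's predictions on held-out data, and accuracy by how well it predicts the true targets.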
ALPA
Many state-of-the-art data mining techniques introduce non-linearities in their models to cope effectively with complex relationships in the data. Although such techniques are consistently among the top classification techniques in terms of predictive performance, their lack of transparency renders them useless in any domain where comprehensibility is of importance. Rule-extraction algorithms remedy this by distilling comprehensible rulesets from complex models, explaining how the classifications are made.
This research considers a new pedagogical rule extraction technique based on active learning (ALPA). The technique generates artificial data points around training instances for which the black box produces output scores with low confidence, and has these points labelled by the black-box model. The main novelty of the proposed method is that it takes a pedagogical approach without making any architectural assumptions about the underlying model, so it can be applied to any black-box technique. Furthermore, ALPA can generate any rule format, depending on the chosen underlying rule induction technique.
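In code, the core step might look like the scikit-learn-based sketch below, where sampling concentrates near training points with uncertain scores and a decision tree stands in for the chosen rule inducer; the sampling details are simplified assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def alpa_style_extract(X, bb_proba, bb_predict, n_extra=1000, scale=0.1, seed=0):
    """Oversample near low-confidence training points, label the new
    points with the black box, and fit a white-box rule learner.

    bb_proba: maps X to the black box's positive-class score in [0, 1].
    bb_predict: maps X to the black box's predicted class labels.
    """
    rng = np.random.default_rng(seed)
    proba = bb_proba(X)
    uncertainty = 1 - 2 * np.abs(proba - 0.5) + 1e-9  # 1 near the boundary
    idx = rng.choice(len(X), size=n_extra, p=uncertainty / uncertainty.sum())
    noise = rng.normal(scale=scale * X.std(axis=0), size=(n_extra, X.shape[1]))
    X_aug = np.vstack([X, X[idx] + noise])
    y_aug = bb_predict(X_aug)  # pedagogical: the model provides the labels
    return DecisionTreeClassifier(max_depth=4).fit(X_aug, y_aug)
```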
We validated these claims in an empirical study using combinations of popular data mining algorithms (SVM, ANN, Random Forest, C4.5, Ripper). Our results show that not only do the ALPA-generated rules explain the black-box models well, the proposed algorithm also performs substantially better than traditional rule induction techniques in terms of accuracy.
Rule Evaluation for Metaheuristic-based Sequential Covering Algorithms
While many papers propose innovative methods for constructing individual rules in separate-and-conquer rule learning algorithms, comparatively few study the heuristic rule evaluation functions used in these algorithms to ensure that the selected rules combine into a good rule set. Underestimating the impact of this component has led to suboptimal design choices in many algorithms. The main goal of this paper is to demonstrate the importance of heuristic rule evaluation functions by improving existing rule induction techniques and to provide guidelines for algorithm designers.
We first select optimal heuristic rule evaluation functions for several metaheuristic-based algorithms and empirically compare the resulting heuristics across algorithms. This results in large and significant improvements in predictive accuracy for two techniques. We find that, despite the absence of a globally optimal choice for all algorithms, good default choices seem to exist for families of algorithms. A near-optimal selection can thus be found for new algorithms with minor experimental tuning. A major contribution is made towards balancing a model's predictive accuracy with its comprehensibility, as the parametrized heuristics offer unmatched flexibility in setting the trade-off between accuracy and comprehensibility.
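As a concrete example of such a parametrized heuristic, the m-estimate below trades a rule's precision off against the class prior through a single parameter m (shown purely for illustration; no claim is made that this exact heuristic was the optimum in the study):

```python
def m_estimate(tp, fp, P, N, m=22.5):
    """Parametrized rule evaluation heuristic (m-estimate).

    tp, fp: positive/negative examples covered by the rule.
    P, N: total positives/negatives in the data.
    m=0 reduces to plain precision; larger m pulls the estimate towards
    the class prior, favouring more general rules. m=22.5 is only an
    illustrative default.
    """
    prior = P / (P + N)
    return (tp + m * prior) / (tp + fp + m)

# A rule covering 30 positives and 5 negatives out of 100/400:
print(m_estimate(30, 5, P=100, N=400))       # 0.60 with the default m
print(m_estimate(30, 5, P=100, N=400, m=0))  # 0.857: plain precision
```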
ALBA
Support vector machines (SVMs) are currently state-of-the-art for the classification task and, generally speaking, exhibit good predictive performance due to their ability to model nonlinearities. However, their strength is also their main weakness, as the generated nonlinear models are typically regarded as incomprehensible black-box models. In this research, we propose a new Active Learning-Based Approach (ALBA) to extract comprehensible rules from opaque SVM models. Through rule extraction, some insight is provided into the logic of the SVM model. ALBA extracts rules from the trained SVM model by explicitly making use of key concepts of the SVM: the support vectors, and the observation that these are typically close to the decision boundary. Active learning implies a focus on apparent problem areas, which for rule induction techniques are the regions close to the SVM decision boundary where most of the noise is found. By generating extra data close to these support vectors, labelled by the trained SVM model, rule induction techniques are better able to discover suitable discrimination rules. This performance increase, both in terms of predictive accuracy and comprehensibility, is confirmed in our experiments, where we apply ALBA to several publicly available data sets.
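A minimal scikit-learn sketch of the ALBA idea, with a decision tree standing in for the rule learner and an illustrative perturbation scale:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def alba_style_extract(X, y, n_per_sv=20, scale=0.1, seed=0):
    """Train an SVM, generate extra points close to its support
    vectors, label them with the SVM, and fit a rule inducer."""
    svm = SVC(kernel="rbf", gamma="scale").fit(X, y)
    rng = np.random.default_rng(seed)
    X_extra = np.repeat(svm.support_vectors_, n_per_sv, axis=0)
    X_extra = X_extra + rng.normal(scale=scale * X.std(axis=0),
                                   size=X_extra.shape)
    X_aug = np.vstack([X, X_extra])
    y_aug = svm.predict(X_aug)  # the SVM, not the data, provides labels
    return DecisionTreeClassifier(max_depth=4).fit(X_aug, y_aug), svm
```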
Bigraph Social Networks
Many real-world large datasets correspond to bipartite graphs: think, for example, of users rating movies or people visiting locations. Although some work exists on analysis and collaborative filtering over such bigraphs, no general methodology has yet been proposed for node classification. In this research, we propose a three-stage classification framework: first, a weighting of the top nodes is defined. Secondly, the bigraph is projected into a unipartite (homogeneous) graph among the bottom nodes, where the weights of the edges are a function of the weights of the top nodes in the bigraph. Finally, relational learners are applied to the resulting weighted unigraph. In a large-scale experimental setup, we propose and assess a range of weighting schemes, aggregation functions and relational learners. The beta distribution turns out to be the best weighting scheme due to its flexibility, but its tuning procedure scales badly to large datasets; for those, the inverse degree performs best. The optimal aggregation function is simply summing the weights of the shared nodes, while the network-only link-based classifier is the optimal relational learner. A comparison with applying statistical learning methods (SVM) to the adjacency matrix of the bigraph shows the superiority of our approach.
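The three stages can be sketched in a few lines, assuming (bottom, top) edges as input: inverse-degree weighting of the top nodes, projection with sum aggregation, and a simple weighted-vote relational scorer standing in for the network-only link-based classifier:

```python
from collections import defaultdict

def project_bigraph(edges):
    """Project a bigraph onto its bottom nodes: weight each top node by
    its inverse degree and sum the weights of shared top nodes."""
    bottoms_per_top = defaultdict(set)
    for bottom, top in edges:
        bottoms_per_top[top].add(bottom)

    uni = defaultdict(float)
    for bottoms in bottoms_per_top.values():
        w = 1.0 / len(bottoms)  # inverse-degree weight of the top node
        bs = sorted(bottoms)
        for i, a in enumerate(bs):
            for b in bs[i + 1:]:
                uni[(a, b)] += w  # sum aggregation over shared top nodes
    return dict(uni)

def relational_score(node, uni, labels):
    """Weighted vote over labelled neighbours in the projected graph."""
    num = den = 0.0
    for (a, b), w in uni.items():
        other = b if a == node else a if b == node else None
        if other in labels:
            num += w * labels[other]
            den += w
    return num / den if den else 0.5

# Users (bottom) visiting locations (top):
g = project_bigraph([("u1", "parkA"), ("u2", "parkA"),
                     ("u2", "cafeB"), ("u3", "cafeB")])
print(relational_score("u2", g, {"u1": 1, "u3": 0}))  # 0.5
```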
Credit Scoring
Credit risk is the risk that a borrower does not honour its obligation to service debt, which can occur when the debt is not serviced on time and/or in full. Banks use predicted estimates of credit risk to set their loan granting policies and capital requirements (cf. Basel II/III). In our research, we use innovative input drivers and data mining techniques to improve the accuracy and comprehensibility of credit scoring models.
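As a toy illustration, assuming scikit-learn: a logistic regression scorecard on made-up borrower features. Logistic regression is a classical credit scoring technique; the features and data here are entirely hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features: [income / 10k, debt ratio, years employed]
X = np.array([[5.0, 0.2, 10], [1.2, 0.8, 1], [3.0, 0.5, 4],
              [0.8, 0.9, 0], [4.5, 0.3, 7], [2.0, 0.7, 2]])
y = np.array([0, 1, 0, 1, 0, 1])  # 1 = debt not serviced on time / in full

model = LogisticRegression().fit(X, y)
# The predicted probability of default drives the loan granting policy
# and, under Basel II/III, feeds into capital requirements.
print(model.predict_proba([[2.5, 0.6, 3]])[0, 1])
```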
Swarm Intelligence for Data Mining
This research is at the intersection of two fascinating and increasingly popular domains: swarm intelligence and data mining. Whereas data mining has been a popular academic topic for decades, swarm intelligence is a relatively new sub-field of artificial intelligence which studies the emergent collective intelligence of groups of simple agents.
Text Mining for Politics
Research on sentiment analysis has a wide variety of economic and social applications, from finding out which books your friends will like or predicting the stock market, to helping detect depressed teenagers on social networking sites. During the course of 2012, we applied sentiment analysis to various text sources with political content. This has resulted in one journal publication and a good reception by the Belgian media.
Explaining Deep Learning Models
Deep learning has been shown to outperform many other prediction techniques in making accurate predictions from behavioural big data. Unfortunately, combining behavioural data and deep learning results in incomprehensible black-box predictions, which leads to skepticism about using them in practice. Explainable AI has therefore become a research area that attracts a lot of attention because of its implications for model deployment. Here, we aim to find both global and instance-based explanations that increase transparency when using such models.
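One generic way to obtain a global explanation is permutation importance, sketched below; it is a model-agnostic stand-in for illustration, not necessarily one of the specific methods developed in this research:

```python
import numpy as np

def permutation_importance(predict, X, y, metric, seed=0):
    """Permute one feature at a time and measure the performance drop;
    larger drops indicate globally more important features. Works for
    any black box, including deep nets on behavioural data."""
    rng = np.random.default_rng(seed)
    base = metric(y, predict(X))
    drops = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        drops.append(base - metric(y, predict(Xp)))
    return np.array(drops)
```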