Building Customer Loyalty
About SurfGold
Solutions
Products
Approach
Our Clients
Knowledge Hub
Careers
Home > Knowledge Hub > White Papers | Information tools | Articles | Press Releases
Articles

Core data mining algorithms - Statistical

We have progressed somewhat in the journey of data mining and knowledge discovery. We have differentiated what data mining is and what it is not. We have discussed the various processes underlying a typical knowledge discovery engagement and in the most recent article we went through a number of data mining techniques involving visualization. The time is now ripe for us to jump right in and explore the intricacies of algorithms based analytical techniques commonly adopted in data mining. There will be a little mathematics involved, but nothing that requires you to dust off that copy of College Calculus that is still hibernating on the bookshelf.

Analytical Algorithmic methods tend to discourage user interactions. Although we are often given permissions to decide on initial conditions and the variables upon which the analysis is targeted, much of the mining mechanics is left to the algorithms themselves. This is why developing a good understanding of the algorithms utilized is important; otherwise, the results might not make much sense.

From a classification standpoint, analytical data mining algorithms can be categorized into the following:
> Statistical Analysis
> Outcome based Algorithms
> Cluster Algorithms
> Affinity Algorithms
Again, each broad category of technique consists of many sub categories. In the interest of time and space, we shall discuss the first 2 techniques here leaving the last two in the next article.

Statistical Analysis As expected, statistics often play a large part in analytical algorithms. In fact, much of the results from other analytical algorithms are typically presented in some form of statistics. As you can expect, statistical methods require the use of quantitative data. They work with “numbers” rather than qualitative data like “East”, “Jurong” or “Male”. Obviously, these qualitative data needs to be encoded into quantitative forms before statistical methods can be applied. The most common encoding method would be that of categorization where data items are segmented before statistical analysis is performed. We have discussed some of these techniques in the previous article.

Statistical methods often forms the entry point of more involved analysis works. Many data analysts would often equip themselves with the “vital statistics” of the variables that they are dealing with before applying other analytical algorithms on them. Typically the basic level of statistics that interests most data analysts are the mean, median, mode and standard deviation of each variable within the data set. These give analysts a good “feel” of the data set that they are dealing with. For category-based data, another statistical value that is also of interest is the number of discrete values that can be found within the data set.

Another popular use of statistical methods is in the testing of hypothesis. This is where the F tests, t tests,…etc, belonging to the realm of parametric statistical tests, can be applied. Typically, the testing of a hypothesis begins with, well, a hypothesis; a statement that is true or false. “The ROI of the Spring Madness program is effective, while considering seasonal variations in revenue” and “Owners of luxury cars are less likely to shift their loyalty in petroleum purchase if its price increases by 10%” are examples of outcomes that an analyst might want to investigate. In hypothesis-speak, however, these would become, “The ROI of the Spring Madness program is significantly more than that due to seasonal variation” and “The number of owners of luxury cars shifting their loyalty in petroleum purchase if its price increases by 10% is not significant”.

The next step is always to calculate a ratio that compares the tested hypothesis with random variances that occurs within the entire data set. These “random variances” would capture characteristics like variations due to seasonality, data errors, baseline fluctuations, noise…which could otherwise influence the hypothesis. So, if the calculated ratio is large, it would mean that the hypothesis being tested is “significant” as compared to the random variation that the data is inherently exposed to. We would then infer that this hypothesis would be observed again when similar conditions prevailed.

Statistical methods are also used in Regression Analysis. The objective is always to map the variables being analyzed into some form of equation so that postulation can be made. We all learned about best-fit lines in high school and how they do not always work. We are not going to discuss too much about that here. More sophisticated regression analysis uses some form of polynomial, exponential or logarithmic equation to fit as many data points as possible. This is where variations of Bezier, B-Splines and other exotically named algorithms can be utilized in the generation of best-fit curves; each with their own strengths and weaknesses. If you are keen in exploring some of these without investing thousands of dollars in a data-mining tool, rudimentary functions exist in MS Excel, allowing analysts to generate trend lines onto charts depicting series data. Try it; it can often produce rather insightful results.

Outcome based Algorithms Outcome based Algorithms always begin with an outcome in mind. And this outcome is typically the reason why analysis is performed to begin with. For example, in a scenario depicting customer transactions on a regularly purchased product (like petroleum, for example), customer churn might be defined as a situation where a customer has not purchased the product for a period of 2 months. From an analysis perspective, we might want to figure out the factors identifying customer churn with respect to petroleum purchase. In this situation the outcome to be investigated is, therefore, customer churn.

Outcome based algorithms are usually applied to data sets containing many variables, of which the outcome variable is one of. These analyses are typically performed in order to determine which variables would influence the state of outcome variable. In situations where the outcome variable does not exist in the original transactional records, it may have to be generated before the algorithms can be applied. Without going through all the details, the transformed transactional records would likely take the form as depicted in figure 1, before algorithms can be applied. Each record would describe a customer’s profile details and his transactional behaviors. It should be noted these transactional behavior record values tend to be generated from the transactional records themselves. The decisions to be made on how these variables would be consolidated and categorized are oftentimes difficult (see Sidebar).

With the records generated, we are now ready to apply algorithms to the data set. By far, the most popular outcome based algorithms are those that generate decision trees. One of the main reasons for their popularity has to do with how easily they can be understood. Figure 2 shows a portion of the decision tree generated based on the data set described above.

The way to read the decision tree would be as follows:
> If the total purchase is greater than $200 AND the car is 1600cc, the likelihood of churn is high.

> If the total purchase is greater than $200 AND the car is between 1601 and 2000cc AND the number of loyalty points accumulated is less than 1000 then the likelihood of churn is high.

> ..etc.

As can be noticed, those variables closer to the root exert greater influence on the outcome variable. These variables are said to be more predictive for the outcome variable. In our data sample, the total purchased made within the 6 months period is the most predictive variable for customer churn. It basically means that if you are only given one variable to predict if the customer would churn or not, you would want to know the total purchased made by the customer within the most recent 6 months. And in this case, the split lies at the $200 mark. From the Decision Tree generated the next variable within those that spend more than $200 that can help you determine churners and loyal customer would be the capacity of their car, and so on…

Equip with these rules on the outcome variable, we can build a system that traverse the tree thus created, in order to predict the likelihood in which a customer might churn on us in the future. In the following month, when we have collected and cleansed the most up to date transactional records, we could apply transformations on them to generate records similar to those in figure 1 and have the decision tree generate the value of the outcome variable to predict the likelihood of churn amongst the customer set being investigated. In fact, since the traversing of a tree takes a very short time (maximally, this is the height of the tree), we could actually predict in real time, if the customer is going to churn on us at the point of transaction. What this means, theoretically, is that, with the decision tree, we can potentially predict the likelihood in which the customer would churn, at the instance in which he is paying for his purchase at the petrol kiosk and do something, immediately, to influence his loyalty.

Before you get all too excited and rush out to build your first decision tree, a word of caution. We have only begun to scratch the surface of decision trees as a concept in data mining. Consider some of the questions below:

Since the root node is the most predictive variable, how do we systematically choose the root node?
Is having a binary tree (one in which each node has 2 or less children nodes) better than an n-ary tree (one in which the number of children nodes are allowed to vary) in prediction?
How do we prune a tree to rid it of rules that are nonsensical in the context of the application?
There is a great level of temptation to dive into a discussion on Information Gain, how modes are split, ID3, Goodness functions…etc. so that a much deeper understanding of decision trees can be established. This would, however, necessitate an inescapable discussion with the mathematical complexity that I have warned you about. Let’s devote a future article on these topics if there are sufficient interests.



Partner Relationship Management,Partner Relationship Marketing

Related Links
Download
our Fact Sheet
on Data Analytics
Read the
HP Case Study
on Data Analytics

Download the Data Analytics Brochure
Click here to download Chapter 1 of our book on Data Analytics
© Copyright 2005 SurfGold. All rights reserved.

Customer Loyalty Solutions | Partner Relationship Management | Data Analytics | Promo@Ease | AdoreAsia Rewards | Loyalty Whitepapers | Relationship Management Consulting | Loyalty Case Studies | Loyalty Engine | Loyalty Cube | PRM Solutions | Strategic Planning Process | Loyalty Solutions