Building Customer Loyalty
About SurfGold
Solutions
Products
Approach
Our Clients
Knowledge Hub
Careers
Home > Knowledge Hub > White Papers | Information tools | Articles | Press Releases
Articles

Core data mining algorithms - Cluster Based

We have indeed advanced substantially in the journey of data mining and knowledge discovery. Instead of giving the normal prose of what we have discussed, I think it might be useful to provide a map of our journey thus far, as can be found in figure 1.

In this article, we want to continue our exploration of the analytical data mining algorithms that we have begun in the previous article. We shall focus on

Cluster Algorithms
Affinity or Association Algorithms
Cluster Algorithms We met these algorithms in a discussion on visualization where variables are displaced on multi dimensional graphs so that they can be observed to congregate into visual groupings. The strength to any techniques involving visualization is also often its limitation. While the generated model is attractive to look at and interesting to interact with, clusters need to be discovered through observation. And if you have begun dabbling with visualization, you’d realize that the human eye is not often the best tool for such discovery. Thankfully, clusters can be identified through algorithms. And these can be utilized by tireless machines in discovering potential patterns that can be found within the data set under investigation. How do these algorithms work? Invariably, they form groupings, or clusters containing objects that are very similar in comparison to one another and very dissimilar when compared to objects in other clusters. Similarities (or dissimilarities) are assessed based on the values of the objects. Most often, some form of distance measurements are involved so that objects within the same cluster are said to be “close” to one another and those in other clusters are said to be “far” from them.

Herein lies the most complicated part of clustering algorithms. How does one effectively calculate the “distance” between 2 objects in a data set when the attributes within the objects may be rather different in dimensions, unit of measurements, scale and even level of importance? For example, we can clearly say that a person whose Sex attribute is Male (encoded as 1) is very different from a person whose Sex attribute is Female (encoded as 0). After all, the Sex attribute of an object has 2 states, 1 and 0; what we often called binary attribute. Can we say the same for the same 2 persons if the Age attribute of one is 17 and that of the other is 18?

In clustering, the algorithm is, you will very quickly realize, more keen on “relative distances” than absolute distances between objects. In the example above, if all objects of the data set, except one, have the Sex attribute equals to “Male”, then the odd object out would be considered as an outlier, beyond any clusters that might be formed within the data set. Similarly, if the Age attribute has a range of 0 to 99, then the 2 objects with a difference in Age attribute of 1 might be very well be in the same cluster rather than be attributed with any significance in their differences.

The statistical methods behind attribute value normalization and standardization for the computing of dissimilarity is rather established. If there are sufficient interests we can devote future article to it. The mathematics is reasonably hairy, but it is not something that only mathematicians can understand.

Clustering algorithms are, by far, useful because they are inherently unsupervised, that is, we do not need to have any a priori notion of what the underlying data set is before they can be applied. There isn’t the need to establish a training set to which the algorithm must be applied in order to perform prediction. Clustering belongs, hence, to the set of algorithms that learn by observation, rather than by example. They can be applied to data sets raw without biasness. It is, therefore, unsurprising that they are often used as a stand-alone tool to gain insight on the distribution of data so that individual clusters might be extracted for further detailed analysis.

Affinity or Association Algorithms
Affinity or Association Algorithms attempts to uncover interesting associations and correlation relationships between objects within the data set being analyzed. These algorithms would discover which group of objects are likely to come together, which comes before which comes after,..etc, basically the affinity of the objects of each other. A simple example will illustrate how Association Algorithms can be applied. Consider the last grocery trip that you had been. Do you remember what you have bought? Chances are, many of the things that you have placed in your shopping cart are bought because of one another. For example, you could have bought Jam or Peanut Butter because you had bought bread, curry powder because you had bought potatoes, Pampers because you had bought milk powder,….etc. Imagine you have at your disposal millions and millions of such transactions from customers visiting the grocery store. Comparing each item that was bought and what was bought together with them, you would very quickly, be able to come up with association rules that looks like:

Bread => Jam, Orange Juice, Cereals

Curry Powder=>Potatoes, Tomatoes. Onions

Pampers=>Milk Powder, Soft Tissues, Soap, Bread

…etc.

Such is the basis of Association Algorithms. We are able to uncover the affinity, for example, of bread to jam, orange juice and cereals. It would hence be useful to place these products closely together so that the purchase of bread can “encourage” that of orange juice, cereals and jam.

A discussion of Association Algorithms is never complete without the introduction of 2 additional concepts: support and confidence.

Support has the following definition.

Support = P(condition È result)

Yes, the symbol in the equation is union, and P( ) means the probability of. In mathematical terms, support means the probability of that an item (the condition) and the items it is associated to (the result) are in the data set. In English, it refers to the usefulness of the association rule. If the purchase of bread does indeed generates further purchase of jam, cereal and orange juice, but this happens only 1 out of every 100 purchases (meaning the probability of 0.01), this rule while true is hardly useful. A useful rule is hence one which has a high support level, typically greater than 0.6.

Confidence has the following definition

Confidence = P(condition È result)/ P(condition)

I shall let you figure out what it means mathematically, but in English it measures the truthfulness of the rule. Basically, how confident we are of this rule. A rule can have high support, meaning that it occurs very often, but every purchase of bread also lead to the purchase of Pampers, Curry Powder, Milk Formula, and a dozen other products also sold in the store which makes the truthfulness of rule a little questionable.

So, the rules generated by Association Algorithms are useful only if they have great support (occurring very often) and equally great confidence (very truthful). Most algorithms would calculate these metrics when the rules are being generated, rules below a certain threshold would then be rejected.

Similar to Clustering Algorithms, Association Algorithms are unsupervised. They tend to work on transactional data and require plenty of computing power. There are variants of Association Algorithms that generates rules that has a sequential nature to it, meaning, if a customer buys Bread today, his likelihood of buying Jam, Cereals and Milk Powder in his next visit would have a confidence of 0.99 and a support of 0.83. These are great rules for recommending a customer of his next purchase now that you have got him to commit on his first purchase. With such an understanding it would be easy to make existing customers, loyal customers.

So, we have once again reached the end of another article. We have looked at 2 additional non-visual based, analytical data mining techniques and their variants. Clustering Algorithms we have first met in the discussion on visualization techniques, the ones discussed here are code based, requiring minimal human supervision. Association based algorithms generate rules that would reveal the relationships between objects within datasets, which objects have a natural tendency to come together, which would lead from one to another. In the next article, we shall discuss some of the ways in which data mining techniques can be utilized, how they can be applied and what likely benefits can we derived from them, what disasters can we avoid through their applications.

CRM solution,Customer Loyalty Solutions

Related Links
Download
our Fact Sheet
on Data Analytics
Read the
HP Case Study
on Data Analytics

Download the Data Analytics Brochure
Click here to download Chapter 1 of our book on Data Analytics
© Copyright 2005 SurfGold. All rights reserved.

Customer Loyalty Solutions | Partner Relationship Management | Data Analytics | Promo@Ease | AdoreAsia Rewards | Loyalty Whitepapers | Relationship Management Consulting | Loyalty Case Studies | Loyalty Engine | Loyalty Cube | PRM Solutions | Strategic Planning Process | Loyalty Solutions