
Core data mining algorithms

We have progressed somewhat in the journey of data mining and knowledge discovery. We have differentiated what data mining is from what it is not, and we have discussed the various processes underlying a typical knowledge discovery engagement.

Let’s recap what we have discussed thus far. In the first article of this series, we differentiated what data mining is from what it is not, and looked at how people have positioned their products as delivering data mining and how they have not. In the second article, we investigated the various stages of setting up a knowledge discovery environment and where the actual mystical algorithms can be applied.

We have unwittingly reached an interesting crossroads. There are a number of options for this article. We could follow the logical sequence and describe in detail the various data preparation, data optimization, data analysis and other processes spelt out in the previous article, or we could dive right into a discussion of the algorithms themselves. Based on the feedback I have received on the first article, I have decided to focus on the latter. Perhaps we can revisit the discussion on data preparation, optimization and cleaning later.

From 20,000 feet, data mining algorithms can be separated into 2 large families:
• Visualization methods, and
• Algorithm-based analytical methods.

In this article we shall explore some of the more popular visualization methods used in data discovery.

As the name suggests, visualization methods entail some form of plotting of data points into some space so that the analysis can be “eye-balled”. Often used as a precursor to data sampling, visualization can be applied quickly and frequently provides a good “feel” of what the underlying data entails.

Visualization methods, by themselves, however, are not great at building models. They tend to answer questions such as “Who will respond well to my future marketing campaign that is similar to the one that I ran last Christmas?” better than “Give me the rules that will identify if this is a high risk loan or not”.

The other good thing about visualization is that it helps identify outliers, or data records that are erroneous in some manner or form. Recall from the previous article that analytical methods are largely machine based. These algorithms would treat data records that escape the cleaning process, but are still obviously erroneous, out-of-range or exceptional, as being just as statistically significant as those that contain valuable data. When we plot these data sets in space, the outliers become immediately apparent. These records can then be extracted and isolated before further analysis is made.
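Once the outliers have been spotted on a plot, isolating them can be as simple as applying the ranges that looked plausible to the eye. The following is a minimal sketch only; the function name `isolate_outliers`, the record layout and the sample values are all hypothetical and are not taken from any engagement described here.

```python
def isolate_outliers(records, limits):
    """Split records into (kept, isolated) using per-variable limits.

    limits maps a variable name to the (low, high) range judged plausible,
    for example after eye-balling a plot of that variable.
    """
    kept, isolated = [], []
    for record in records:
        in_range = all(low <= record[name] <= high
                       for name, (low, high) in limits.items())
        (kept if in_range else isolated).append(record)
    return kept, isolated


# Hypothetical customer records; an age of 212 is clearly erroneous.
records = [
    {"age": 34, "spend": 120.0},
    {"age": 212, "spend": 95.0},
    {"age": 41, "spend": 80.0},
]
kept, isolated = isolate_outliers(records, {"age": (0, 120)})
```

The isolated records are set aside rather than deleted, so they can still be inspected later.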

Before we go any further, let’s have a pop quiz. How many variables can you effectively plot in space, for effective visualization?
If your answer is 3, don’t despair; this is the path that our high school teachers like to lead us down. Nobody quite knows what the correct answer is, but it is at least larger than 20. How? Consider the diagram below:
A simple 3-D plot can depict the following variables:
- x, y and z axes to represent 3 variables
- The dimensions of each plotted object to represent 3 further variables
- Color to depict another variable

That is 7 variables in all that can be visualized from a simple plot! Add in frame based animations, surface plots and so on, and very soon you can tell where the 20+ variables come from. For practical purposes, however, one seldom investigates more than 10 variables in any single, useful visualization session.
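To make the 7-variable count concrete, here is a rough sketch of how one record could be mapped onto the visual channels of a single plotted object. The `Glyph` class, the `encode` function, the size range of 0.2 to 1.0 and the blue-to-red color ramp are all assumptions made for illustration; a real plotting tool would have its own conventions.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    x: float
    y: float
    z: float
    width: float
    height: float
    depth: float
    color: tuple  # (red, green, blue), each 0.0 to 1.0


def normalize(value, low, high):
    """Scale value into the range 0.0 to 1.0 (0.5 if the range is empty)."""
    if high == low:
        return 0.5
    return (value - low) / (high - low)


def encode(record, ranges):
    """Map a 7-variable record onto position, size and color channels."""
    if len(record) != 7 or len(ranges) != 7:
        raise ValueError("expected exactly 7 variables")
    v = [normalize(val, lo, hi) for val, (lo, hi) in zip(record, ranges)]
    # Sizes are kept between 0.2 and 1.0 so small values stay visible.
    sizes = [0.2 + 0.8 * s for s in v[3:6]]
    # Color runs from blue (low) to red (high).
    color = (v[6], 0.0, 1.0 - v[6])
    return Glyph(v[0], v[1], v[2], sizes[0], sizes[1], sizes[2], color)


# Hypothetical record whose 7 variables all range from 0 to 100.
ranges = [(0, 100)] * 7
glyph = encode([50, 0, 100, 0, 100, 50, 100], ranges)
```

Variables 1 to 3 set the position, variables 4 to 6 set the object's dimensions, and variable 7 sets its color.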

As can be expected, visualization methods can be separated into several subcategories. Below is a description of some of the more popular ones.

Cluster Plots
Data points are plotted in space across multiple variables and investigated. With luck, they would appear in the manner depicted in Figure 1 above, with some obvious clusters forming in the top-left hand corner of the plot.
Obviously, the choice of variables is of paramount importance in discovering clusters within the data set, and rarely would one stop at the first trial. The choice of variables is typically an iterative, interactive process and can be rather time consuming. From personal experience, we once spent 2 days analyzing a data set of 20-plus variables before settling on 3 variables that were useful in depicting cluster based information within the set. You can imagine that how the variables are ordered, how they are segmented and how they are grouped into categories would affect their clustered appearance significantly.
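Part of the reason the search is slow is the sheer number of candidate plots: 20 variables give 1,140 different 3-variable combinations. The sketch below counts them and ranks the combinations with a crude "grid occupancy" score. That score is an assumption introduced here as a screening aid, not the method used in the engagement mentioned above, and it does not replace looking at the plots.

```python
from itertools import combinations
import math


def occupancy(points, bins=5):
    """Fraction of grid cells that contain at least one point.

    Points are assumed to be scaled to the range 0.0 to 1.0 on every axis.
    Tightly clustered data fills few cells, so lower means more clustered.
    """
    dims = len(points[0])
    cells = {tuple(min(int(c * bins), bins - 1) for c in p) for p in points}
    return len(cells) / bins ** dims


def rank_triples(table, names):
    """Score every 3-variable combination; most clustered first."""
    scored = []
    for triple in combinations(range(len(names)), 3):
        points = [tuple(row[i] for i in triple) for row in table]
        scored.append((occupancy(points), tuple(names[i] for i in triple)))
    return sorted(scored)


print(math.comb(20, 3))  # 1140 possible 3-variable plots for 20 variables

# Hypothetical data: a, b and c form two tight groups, d is spread evenly.
table = []
for i in range(100):
    center = 0.1 if i % 2 == 0 else 0.9
    table.append((center, center, center, (i % 10) / 10 + 0.05))

ranking = rank_triples(table, ["a", "b", "c", "d"])
best_score, best_triple = ranking[0]
```

On this toy data the combination of a, b and c ranks first, because its points fall into only two grid cells.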

Relative and Absolute Clusters
Hierarchy Maps
Hierarchy Maps can be used when an object’s position is determined by its relationship with other objects. Often, the relationship is one of parent-child or super class-sub class, typically in some form of categorization. Hierarchies are also great at showing value chain based relationships. An example encountered in the course of our work here at SurfGold serves as a useful illustration.
Many brand owners today (like HP, Johnson & Johnson, Motorola, APB) do not sell direct. If you want to buy something, ranging from contact lenses to printers, from hand phones to beer, you will have to get it from a reseller. Resellers in turn get their wares from a distributor. Distributors buy these products in bulk from the brand owner. This is known as a 2-tiered distribution channel. Describing the details of distribution structures is outside the scope of this article; suffice to say that brand owners typically would like to know which resellers buy from which distributors and whether they do so consistently across time.
If there are channel marketing managers amongst you, ask yourself: how is this information presented to you today? More often than not, this information, if available, is given to you in a table. If the table is sorted, you should be thankful already. Consider the clarity you could achieve if it were presented in the form shown in figure 2, which, by the way, is generated from real world data, with the relevant identities masked. Here Hierarchy Maps are used, where the root node is defined as the brand owner with the distributors occupying level 1 nodes. Resellers that have a direct purchase relationship with specific distributors occupy level 2 nodes.
What insight can we gain from such a Hierarchy Map? Plenty. We would know, for example, that Reseller ABC buys from both Distributor 1 and Distributor 2. Judging from the “strength” of the link between Reseller ABC and both distributors, we can safely conclude that the relationship with Distributor 1 is probably primary while that with Distributor 2 is secondary. We can also see that Distributor 3 is not doing such a great job in cultivating resellers and deserves closer scrutiny. As it turns out, Distributor 3 is a sub-distributor; they have been secretly shipping products from Singapore to Vietnam. In marketing terms, this is called gray marketing, something that all brand owners would frown upon.
Imagine, if you will, that we superimpose the hierarchical map over a locality map and animate it over time. Very quickly we would depict channel coverage across a geographical region and how such dynamics change with seasonality. I’ll leave a detailed discussion of this to a future article where we discuss successful case studies of data mining.
On another dimension, are there any reasons to expand the hierarchical map to greater depths? The answer is, in fact, yes. Consider the potential of level 3 nodes to represent the sales reps within each reseller making a sale. Next, assign a color to those nodes that contribute more than 50% of their reseller’s sales. What have we accomplished here? We have effectively generated the invite list for this quarter’s New Product Training. These are the folks that will bring in the sales that you need to meet your targets this year.
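The structure behind such a map can be sketched in a few lines. The purchase records, the unit counts and the function names below are hypothetical, and link "strength" is assumed here to be total units purchased; the real map may well use a different measure.

```python
from collections import defaultdict

# Hypothetical purchase records: (distributor, reseller, units bought).
purchases = [
    ("Distributor 1", "Reseller ABC", 120),
    ("Distributor 2", "Reseller ABC", 30),
    ("Distributor 1", "Reseller XYZ", 80),
    ("Distributor 3", "Reseller QRS", 5),
]


def build_hierarchy(brand_owner, purchases):
    """Return {brand_owner: {distributor: {reseller: link strength}}}."""
    tree = {brand_owner: defaultdict(lambda: defaultdict(int))}
    for distributor, reseller, units in purchases:
        tree[brand_owner][distributor][reseller] += units
    return tree


def primary_distributor(tree, brand_owner, reseller):
    """The distributor with the strongest link to the given reseller."""
    links = {d: r[reseller]
             for d, r in tree[brand_owner].items() if reseller in r}
    return max(links, key=links.get)


def top_reps(rep_sales, share=0.5):
    """Sales reps contributing more than `share` of their reseller's sales."""
    total = sum(rep_sales.values())
    return [rep for rep, sales in rep_sales.items() if sales > share * total]


tree = build_hierarchy("Brand Owner", purchases)
primary = primary_distributor(tree, "Brand Owner", "Reseller ABC")
invite_list = top_reps({"Ann": 70, "Ben": 20, "Cal": 10})
```

With these sample figures, Distributor 1 comes out as the primary relationship for Reseller ABC, and only the rep with more than half of the reseller's sales lands on the invite list.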



Self Organizing Networks
Lastly, we want to discuss another popular visualization technique, known as self-organizing networks. A network structure would typically consist of multiple instances of 3 components: Objects, Links and Networks, as depicted in the figure.


How do network structures work? They are based on the concept of push and pull. Basically, objects repel each other while links attract them. With these dynamics built in, we can very quickly create a network of data items interacting with each other based on their relationships with other data items. Objects that relate to one another in a relatively strong manner form cluster groupings that separate themselves from the others, while data items that “influence” the behavior of others form centroids in a largely fanned out network.
We once did an engagement where the links were defined as dependency relationships, and the resulting network structure allowed an estimation of the cost involved in removing one or more of the variables in the data set.
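The push and pull idea can be sketched as a small force-directed layout. This is a simplified illustration under stated assumptions, not the tool used in the engagement above: the function name `layout`, the force constants, the inverse-square repulsion, the spring-like links and the step cap are all choices made here for the example.

```python
import math


def layout(nodes, links, steps=200, repel=0.01, attract=0.05):
    """Place nodes in 2-D: every pair repels, every link pulls together."""
    # Start on a circle so no two nodes share a position.
    pos = {
        n: [math.cos(2 * math.pi * i / len(nodes)),
            math.sin(2 * math.pi * i / len(nodes))]
        for i, n in enumerate(nodes)
    }
    for _ in range(steps):
        move = {n: [0.0, 0.0] for n in nodes}
        # Push: repulsion falls off with the square of the distance.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                force = repel / dist ** 2
                for n, sign in ((a, 1), (b, -1)):
                    move[n][0] += sign * force * dx / dist
                    move[n][1] += sign * force * dy / dist
        # Pull: each link acts like a spring between its two ends.
        for a, b in links:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            move[a][0] += attract * dx
            move[a][1] += attract * dy
            move[b][0] -= attract * dx
            move[b][1] -= attract * dy
        for n in nodes:
            # Cap each step so a near-collision cannot fling a node away.
            length = math.hypot(*move[n])
            scale = min(1.0, 0.1 / length) if length else 0.0
            pos[n][0] += scale * move[n][0]
            pos[n][1] += scale * move[n][1]
    return pos


def distance(pos, a, b):
    return math.hypot(pos[a][0] - pos[b][0], pos[a][1] - pos[b][1])


# Hypothetical data items: two groups of three, linked only within a group.
nodes = ["A", "B", "C", "D", "E", "F"]
links = [("A", "B"), ("B", "C"), ("A", "C"),
         ("D", "E"), ("E", "F"), ("D", "F")]
pos = layout(nodes, links)
```

After the loop settles, linked items sit closer to each other than to items in the other group, which is the cluster separation described above.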

Visualization techniques are often a means to an end. They are often used as a precursor to algorithm-based analysis techniques, giving the data miner a good feel of what to expect from the algorithms. As discussed above, they are also great at exposing data records that are outliers, allowing them to be isolated so as to remove unnecessary bias that can affect further analysis. In some cases, visualization would also influence the selection of the algorithms to be applied to the data sets in order to uncover more fruitful results.
Often, visualization techniques are revisited once all analysis has been done on the data set. This time the emphasis is on presentation; after all, even the most remarkable discovery needs to be communicated convincingly in order for the appropriate actions to take place. Usually, the variables that have the greatest influence on the data segmentation would have been identified, new data records that depict the uncovered trend would have been generated, and meaningful categorization of the data sets would have been created. The task of visualization techniques in this case is to communicate the findings effectively to the targeted audience. Much depends on the level of sophistication of the intended audience, and on whether they have a technical bias or are marketing types. Understandably, this task is frequently better defined and more guided than that of data discovery.

In the next article, we shall explore algorithm-based analytical methods. Unfortunately, we may not be able to have a useful discussion of them without a small dose of mathematics. The journey is, however, by no means less exciting.

© Copyright 2005 SurfGold. All rights reserved.