Core data mining algorithms
Let’s recap what we have discussed thus far. In the first
article of this series, we differentiated what data mining is from
what it is not, and looked at how people have positioned their
products as delivering data mining and how some of them do not
actually deliver it. In the second article, we investigated the
various stages of setting up a knowledge discovery environment
and where the actual mystical algorithms can be applied.
We have unwittingly reached an interesting crossroads. There are
a number of options for this article. We could follow the logical
sequence and describe in detail the various processes spelt out in
the previous article (data preparation, data optimization, data
analysis, etc.), or we could dive right into a discussion of the
algorithms themselves. Based on the feedback I have received on
the first article, I have decided to focus on the latter. Perhaps
we can revisit the discussion on data preparation, optimization
and cleaning later.
From 20,000 feet, data mining algorithms can be separated into
2 large families:
• Visualization methods, and
• Algorithm-based analytical methods.
In this article we shall explore some of the more popular visualization
methods used in data discovery.
As the name suggests, visualization methods entail some form of
plotting of data points into some space so that the analysis can be
“eye-balled”. Often used as a precursor to data sampling,
visualization can be applied quickly and frequently provides a good
“feel” for what the underlying data entails.
Visualization methods, by themselves, however, are not great at
building models. They tend to answer questions such as “Who
will respond well to a future marketing campaign that is similar
to the one that I ran last Christmas?” better than “Give
me the rules that will identify whether this is a high-risk loan or not”.
The other strength of visualization is in identifying outliers,
that is, data records that are erroneous in some manner or form.
Recall from the previous article that analytical methods are largely
machine based. These algorithms would treat data records that escaped
the cleaning process but are still obviously erroneous, out of range
or otherwise exceptional as just as statistically significant as
those that contain valuable data. When we plot these data sets in
space, the outliers become immediately apparent. These records can
then be extracted and isolated before further analysis is made.
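Once a plot has shown what “out of range” looks like, the isolation step itself can be approximated in code. The following is only a minimal sketch and not the method used in our engagements: it assumes a single numeric field and substitutes the common 1.5 x IQR fence rule for the human eye. The function name `split_outliers` and the sample ages are hypothetical.

```python
from statistics import quantiles

def split_outliers(values, k=1.5):
    """Split values into (kept, outliers) using the k * IQR fence rule.

    Assumption: the IQR rule stands in for what the eye picks out of a plot.
    """
    q1, _, q3 = quantiles(values, n=4)  # first and third quartiles
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return kept, outliers

# Hypothetical customer ages; 212 and -4 slipped through the cleaning process.
ages = [23, 31, 35, 29, 41, 38, 27, 212, 33, -4, 36, 30]
kept, outliers = split_outliers(ages)
print(outliers)  # the isolated records
print(kept)      # what goes on to further analysis
```

The isolated records are set aside rather than discarded, so they can still be inspected for the reason they are exceptional.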
Before we go any further, let’s have a pop quiz. How many
variables can you plot in space and still visualize effectively?
If your answer is 3, don’t despair; this is the path that
our high school teachers like to lead us down. Nobody quite knows
what the correct answer is, but it is larger than 20. How?
Consider the diagram below:
A simple 3-D plot can depict the following variables:
- x, y and z axes to represent 3 variables
- Dimensions of each plotted object to represent 3 further variables
- Color to depict another variable
That is 7 variables in all that can be visualized from a simple
plot! Add in frame-based animations, surface plots, etc. and
very soon you can tell where the 20+ variables can come from. For
practical purposes, however, one seldom investigates more than 10
variables in any single, useful visualization session.
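The seven channels listed above can be sketched as a simple mapping from record fields to the attributes of a plotted object. This is an illustration under stated assumptions, not a plotting library: the `Glyph` class, the `to_glyphs` function, the choice of width, height and depth as the three object dimensions, and the field names are all hypothetical. Each field is rescaled to the range 0 to 1 so that no single variable dominates the picture.

```python
from dataclasses import dataclass

# Assumed channel names: three axes, three object dimensions, one color.
CHANNELS = ("x", "y", "z", "width", "height", "depth", "color")

@dataclass
class Glyph:
    """One plotted object: position, size along three axes, and color."""
    x: float
    y: float
    z: float
    width: float
    height: float
    depth: float
    color: float

def to_glyphs(records, fields):
    """Map seven fields of each record onto the seven channels, scaled to 0..1."""
    if len(fields) != len(CHANNELS):
        raise ValueError("expected exactly one field per visual channel")
    lo = {f: min(r[f] for r in records) for f in fields}
    hi = {f: max(r[f] for r in records) for f in fields}

    def scale(f, value):
        span = hi[f] - lo[f]
        return (value - lo[f]) / span if span else 0.5  # constant field: mid-scale

    return [Glyph(**{ch: scale(f, r[f]) for ch, f in zip(CHANNELS, fields)})
            for r in records]

# Hypothetical customer records with seven variables each.
fields = ("age", "income", "tenure", "spend", "visits", "basket", "risk")
records = [
    dict(zip(fields, (25, 30000, 1, 200, 4, 50, 0.1))),
    dict(zip(fields, (40, 60000, 5, 800, 10, 80, 0.4))),
    dict(zip(fields, (55, 90000, 9, 500, 7, 65, 0.9))),
]
glyphs = to_glyphs(records, fields)
print(glyphs[1])
```

Swapping the order of `fields` changes which variable lands on which channel, which is usually the first thing one experiments with.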

As can be expected, visualization methods can be separated into
sub-categories. Below is a description of some of the more popular
ones.
Cluster Plots
Data points are plotted in space across multiple variables and investigated.
With luck, they would appear in the manner depicted in Figure 1
above, with some obvious clusters forming in the top left-hand corner
of the plot.
Obviously, the choice of variables is of paramount importance in
discovering clusters within the data set, and rarely would one stop
at the first trial. The choice of variables is typically an interactive
process and can be rather time consuming. From personal experience,
we once spent 2 days analyzing a data set of 20-plus variables before
settling on 3 variables that were useful in depicting cluster-based
information within the set. As you can imagine, how the variables
are ordered, how they are segmented and how they are grouped into
categories all affect their clustered appearance significantly.
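A quick count shows why the search takes time. The sketch below assumes exactly 20 variables with the hypothetical names v1 to v20, and it only counts which 3 variables go on the axes; it ignores the ordering, segmentation and grouping choices, each of which multiplies the number of views further.

```python
from itertools import combinations
from math import comb

# Hypothetical variable names for a 20-variable data set.
variables = [f"v{i}" for i in range(1, 21)]

# Every distinct choice of 3 variables is one candidate cluster plot.
triples = list(combinations(variables, 3))

print(len(triples))  # 1140 candidate plots
print(triples[0])    # ('v1', 'v2', 'v3')
```

At even a minute per plot, inspecting every candidate would take 19 hours, which is why domain knowledge is normally used to prune the list first.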
Relative and Absolute Clusters
Hierarchy Maps
Hierarchy Maps can be used when an object’s position is determined
by its relationship with other objects. Often the relationship
is one of parent-child or super class-sub class, typically in some
form of categorization. Hierarchies are also great at showing
value chain based relationships. An example encountered in the course
of our work here in SurfGold serves as a useful illustration.
Many brand owners today (like HP, Johnson & Johnson, Motorola,
APB) do not sell direct. If you want to buy something, ranging from
contact lenses to printers, from hand phones to beer, you will have
to get it from a reseller. Resellers in turn get their wares from
a distributor. Distributors buy these products in bulk from the
brand owner. This is known as a 2-tiered distribution channel.
Describing the details of distribution structures is outside the
scope of this article; suffice it to say that brand owners typically
would like to know which resellers buy from which distributors and
whether they do so consistently over time.
If there are channel marketing managers amongst you, ask yourself:
how is this information presented to you today? More often than not,
this information, if available, is given to you in a table. If the
table is sorted, you should be thankful already. Consider the clarity
that you could achieve if it were presented in the form shown in
figure 2, which, by the way, is generated from real world data with
the relevant identities masked.
Here Hierarchy Maps are used, where the root node is defined
as the brand owner, with the distributors occupying level 1 nodes.
Resellers that have a direct purchase relationship with specific
distributors occupy level 2 nodes.
What insight can we gain from such a Hierarchy Map? Plenty. We would
know, for example, that Reseller ABC buys from both Distributor 1 and
Distributor 2. Judging from the “strength” of the links
between Reseller ABC and both distributors, we can reasonably conclude
that the relationship with Distributor 1 is probably primary while
that with Distributor 2 is secondary. We can also see that
Distributor 3 is not doing such a great job in cultivating resellers
and deserves closer scrutiny.
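The structure of such a map can be sketched from a plain table of purchases. This is a minimal illustration, not the tool that produced the figure: the purchase records are invented, the function names `build_hierarchy` and `primary_distributor` are hypothetical, and link “strength” is assumed here to be the total units bought, which the real map may define differently.

```python
from collections import defaultdict

# Hypothetical purchase records: (distributor, reseller, units bought).
purchases = [
    ("Distributor 1", "Reseller ABC", 120),
    ("Distributor 2", "Reseller ABC", 30),
    ("Distributor 1", "Reseller DEF", 80),
    ("Distributor 2", "Reseller GHI", 95),
    ("Distributor 3", "Reseller JKL", 5),
]

def build_hierarchy(records, root="Brand Owner"):
    """Return {root: {distributor: {reseller: link strength}}}.

    Assumption: link strength is the total number of units purchased.
    """
    tree = defaultdict(lambda: defaultdict(int))
    for distributor, reseller, units in records:
        tree[distributor][reseller] += units
    return {root: {d: dict(resellers) for d, resellers in tree.items()}}

def primary_distributor(hierarchy, reseller, root="Brand Owner"):
    """Pick the distributor with the strongest link to the given reseller."""
    links = {d: resellers[reseller]
             for d, resellers in hierarchy[root].items()
             if reseller in resellers}
    return max(links, key=links.get)

hierarchy = build_hierarchy(purchases)
print(primary_distributor(hierarchy, "Reseller ABC"))  # Distributor 1
print(len(hierarchy["Brand Owner"]["Distributor 3"]))  # 1 reseller only
```

A distributor with very few level 2 nodes, like Distributor 3 in this toy data, is the kind of pattern that stands out at once on the map but is easy to miss in a table.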
As it turns out, Distributor 3 is a sub-distributor; they have been
secretly shipping products from Singapore to Vietnam. In marketing
terms, this is called gray marketing, something that all brand owners
frown upon.
Imagine, if you will, that we superimpose the hierarchical map over
a locality map and animate it over time. Very quickly we would depict
channel coverage across a geographical region and how such dynamics
change with seasonality. I’ll leave a detailed discussion of
this to a future article where we discuss successful case studies of
data mining.
On another dimension, are there any reasons to expand the
hierarchical map to greater depths? The answer is, in fact, yes.
Consider the potential of level 3 nodes to represent the sales reps
making the sales within each reseller. Next, assign a color to those
nodes that contribute more than 50% of their reseller’s sales.
What have we accomplished here? We have effectively generated the
invite list for this quarter’s New Product Training. These are
the folks that will bring in the sales that you need to meet your
targets this year.
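The coloring rule for level 3 nodes reduces to a share-of-sales calculation. The sketch below reads the rule as “a rep is highlighted when he or she contributes more than 50% of the reseller’s sales”, which is one interpretation; the rep names, the figures and the function name `invite_list` are hypothetical.

```python
from collections import defaultdict

# Hypothetical level 3 data: (reseller, sales rep, sales amount).
rep_sales = [
    ("Reseller ABC", "Ann", 700),
    ("Reseller ABC", "Bob", 300),
    ("Reseller DEF", "Cid", 400),
    ("Reseller DEF", "Dee", 400),
    ("Reseller DEF", "Eve", 200),
]

def invite_list(records, threshold=0.5):
    """Return (reseller, rep) pairs whose share of reseller sales exceeds threshold."""
    totals = defaultdict(float)
    for reseller, _, amount in records:
        totals[reseller] += amount
    return [(reseller, rep)
            for reseller, rep, amount in records
            if totals[reseller] and amount / totals[reseller] > threshold]

print(invite_list(rep_sales))  # [('Reseller ABC', 'Ann')]
```

Note that a reseller whose sales are spread evenly across its reps, like Reseller DEF here, produces no highlighted node at all, so the threshold may need tuning per channel.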

Self Organizing Networks
Lastly, we want to discuss another popular visualization technique
known as self-organizing networks. A network structure would typically
consist of multiple instances of 3 components: Objects, Links and
Networks, as depicted in the figure.

How do network structures work? They are based on the concept of
push and pull. Basically, objects repel each other while links
attract them. With these dynamics built in, we can very quickly
create a network of data items interacting with each other based
on their relationships with other data items. Objects that relate
to one another in a relatively strong manner form cluster
groupings that separate themselves from the others, while data items
that “influence” the behavior of others form
centroids in a largely fanned-out network.
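The push and pull dynamics can be sketched as a small force-directed layout. This is a simplified illustration and not the product we used: it assumes inverse-square repulsion between every pair of objects, a linear spring along every link, a fixed number of steps and a cap on movement per step. The function name `layout`, the constants and the node names are all hypothetical.

```python
import math
import random

def layout(nodes, links, steps=200, repulsion=0.05, attraction=0.1,
           max_step=0.1, seed=0):
    """Place nodes in 2-D: every pair repels, every link pulls its ends together."""
    rng = random.Random(seed)
    pos = {n: [rng.random(), rng.random()] for n in nodes}
    for _ in range(steps):
        force = {n: [0.0, 0.0] for n in nodes}
        # Push: each pair of objects repels along the line joining them.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                push = repulsion / dist ** 2
                force[a][0] += push * dx / dist
                force[a][1] += push * dy / dist
                force[b][0] -= push * dx / dist
                force[b][1] -= push * dy / dist
        # Pull: each link behaves like a spring between its two objects.
        for a, b in links:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            force[a][0] += attraction * dx
            force[a][1] += attraction * dy
            force[b][0] -= attraction * dx
            force[b][1] -= attraction * dy
        # Move each object, capping the step so the layout stays stable.
        for n in nodes:
            fx, fy = force[n]
            size = math.hypot(fx, fy)
            if size > max_step:
                fx, fy = fx * max_step / size, fy * max_step / size
            pos[n][0] += fx
            pos[n][1] += fy
    return {n: (p[0], p[1]) for n, p in pos.items()}

# Two hypothetical "influencer" objects, each linked to three followers.
nodes = ["hub A", "a1", "a2", "a3", "hub B", "b1", "b2", "b3"]
links = [("hub A", "a1"), ("hub A", "a2"), ("hub A", "a3"),
         ("hub B", "b1"), ("hub B", "b2"), ("hub B", "b3")]
pos = layout(nodes, links)
for name in ("hub A", "hub B"):
    print(name, pos[name])
```

With no link between the two groups, nothing holds them together, so they drift apart into separate clusters with each hub sitting at the center of its own fan.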
We once did an engagement where the links were defined as dependency
relationships, and the resulting network structure allowed an
estimation of the cost involved in removing one or more of the
variables in the data set.
Visualization techniques are often a means to an end.
They are often used as a precursor to algorithm-based analysis techniques,
giving the data miner a good feel for what to expect from the algorithms.
As discussed above, they are also great at exposing data records
that are outliers, allowing them to be isolated so as to remove
unnecessary bias that can affect further analysis. In some cases,
visualization would also influence the selection of the algorithms
to be applied to the data sets in order to uncover more fruitful
results.
Often, visualization techniques are revisited once all
analysis has been done on the data set. This time the emphasis is
on presentation; after all, even the most remarkable discovery
needs to be communicated convincingly in order for the appropriate
actions to take place. Usually, the variables that have the greatest
influence on the data segmentation would have been identified; new
data records that depict the uncovered trend would have been generated;
meaningful categorization of the data sets would have been created.
The task of visualization techniques in this case is to effectively
communicate the findings to the targeted audience. Much depends
on the level of sophistication of the intended audience and whether
they are technically inclined or marketing types. Understandably,
this task is frequently better defined and more guided than
that of data discovery.
In the next article, we shall explore algorithm-based analytical
methods. Unfortunately, we may not be able to have a useful discussion
of them without a small dose of mathematics. The journey is, however,
no less exciting.