Articles
Core data mining
algorithms - Cluster Based
We have indeed advanced substantially in the journey of data mining
and knowledge discovery. Instead of giving the normal prose of what
we have discussed, I think it might be useful to provide a map of
our journey thus far, as can be found in figure 1.
In this article, we want to continue our exploration of the analytical
data mining algorithms that we have begun in the previous article.
We shall focus on
Cluster Algorithms
Affinity or Association Algorithms
Cluster Algorithms We met these algorithms in a discussion on visualization
where variables are displaced on multi dimensional graphs so that
they can be observed to congregate into visual groupings. The strength
to any techniques involving visualization is also often its limitation.
While the generated model is attractive to look at and interesting
to interact with, clusters need to be discovered through observation.
And if you have begun dabbling with visualization, you’d realize
that the human eye is not often the best tool for such discovery.
Thankfully, clusters can be identified through algorithms. And these
can be utilized by tireless machines in discovering potential patterns
that can be found within the data set under investigation. How do
these algorithms work? Invariably, they form groupings, or clusters
containing objects that are very similar in comparison to one another
and very dissimilar when compared to objects in other clusters.
Similarities (or dissimilarities) are assessed based on the values
of the objects. Most often, some form of distance measurements are
involved so that objects within the same cluster are said to be
“close” to one another and those in other clusters are
said to be “far” from them.
Herein lies the most complicated part of clustering algorithms.
How does one effectively calculate the “distance” between
2 objects in a data set when the attributes within the objects may
be rather different in dimensions, unit of measurements, scale and
even level of importance? For example, we can clearly say that a
person whose Sex attribute is Male (encoded as 1) is very different
from a person whose Sex attribute is Female (encoded as 0). After
all, the Sex attribute of an object has 2 states, 1 and 0; what
we often called binary attribute. Can we say the same for the same
2 persons if the Age attribute of one is 17 and that of the other
is 18?
In clustering, the algorithm is, you will very quickly realize,
more keen on “relative distances” than absolute distances
between objects. In the example above, if all objects of the data
set, except one, have the Sex attribute equals to “Male”,
then the odd object out would be considered as an outlier, beyond
any clusters that might be formed within the data set. Similarly,
if the Age attribute has a range of 0 to 99, then the 2 objects
with a difference in Age attribute of 1 might be very well be in
the same cluster rather than be attributed with any significance
in their differences.
The statistical methods behind attribute value normalization and
standardization for the computing of dissimilarity is rather established.
If there are sufficient interests we can devote future article to
it. The mathematics is reasonably hairy, but it is not something
that only mathematicians can understand.
Clustering algorithms are, by far, useful because they are inherently
unsupervised, that is, we do not need to have any a priori notion
of what the underlying data set is before they can be applied. There
isn’t the need to establish a training set to which the algorithm
must be applied in order to perform prediction. Clustering belongs,
hence, to the set of algorithms that learn by observation, rather
than by example. They can be applied to data sets raw without biasness.
It is, therefore, unsurprising that they are often used as a stand-alone
tool to gain insight on the distribution of data so that individual
clusters might be extracted for further detailed analysis.
Affinity or Association Algorithms
Affinity or Association Algorithms attempts to uncover interesting
associations and correlation relationships between objects within
the data set being analyzed. These algorithms would discover which
group of objects are likely to come together, which comes before
which comes after,..etc, basically the affinity of the objects of
each other. A simple example will illustrate how Association Algorithms
can be applied. Consider the last grocery trip that you had been.
Do you remember what you have bought? Chances are, many of the things
that you have placed in your shopping cart are bought because of
one another. For example, you could have bought Jam or Peanut Butter
because you had bought bread, curry powder because you had bought
potatoes, Pampers because you had bought milk powder,….etc.
Imagine you have at your disposal millions and millions of such
transactions from customers visiting the grocery store. Comparing
each item that was bought and what was bought together with them,
you would very quickly, be able to come up with association rules
that looks like:
Bread => Jam, Orange Juice, Cereals
Curry Powder=>Potatoes, Tomatoes. Onions
Pampers=>Milk Powder, Soft Tissues, Soap, Bread
…etc.
Such is the basis of Association Algorithms. We are able to uncover
the affinity, for example, of bread to jam, orange juice and cereals.
It would hence be useful to place these products closely together
so that the purchase of bread can “encourage” that of
orange juice, cereals and jam.
A discussion of Association Algorithms is never complete without
the introduction of 2 additional concepts: support and confidence.
Support has the following definition.
Support = P(condition È result)
Yes, the symbol in the equation is union, and P( ) means the probability
of. In mathematical terms, support means the probability of that
an item (the condition) and the items it is associated to (the result)
are in the data set. In English, it refers to the usefulness of
the association rule. If the purchase of bread does indeed generates
further purchase of jam, cereal and orange juice, but this happens
only 1 out of every 100 purchases (meaning the probability of 0.01),
this rule while true is hardly useful. A useful rule is hence one
which has a high support level, typically greater than 0.6.
Confidence has the following definition
Confidence = P(condition È result)/ P(condition)
I shall let you figure out what it means mathematically, but in
English it measures the truthfulness of the rule. Basically, how
confident we are of this rule. A rule can have high support, meaning
that it occurs very often, but every purchase of bread also lead
to the purchase of Pampers, Curry Powder, Milk Formula, and a dozen
other products also sold in the store which makes the truthfulness
of rule a little questionable.
So, the rules generated by Association Algorithms are useful only
if they have great support (occurring very often) and equally great
confidence (very truthful). Most algorithms would calculate these
metrics when the rules are being generated, rules below a certain
threshold would then be rejected.
Similar to Clustering Algorithms, Association Algorithms are unsupervised.
They tend to work on transactional data and require plenty of computing
power. There are variants of Association Algorithms that generates
rules that has a sequential nature to it, meaning, if a customer
buys Bread today, his likelihood of buying Jam, Cereals and Milk
Powder in his next visit would have a confidence of 0.99 and a support
of 0.83. These are great rules for recommending a customer of his
next purchase now that you have got him to commit on his first purchase.
With such an understanding it would be easy to make existing customers,
loyal customers.
So, we have once again reached the end of another article. We have
looked at 2 additional non-visual based, analytical data mining
techniques and their variants. Clustering Algorithms we have first
met in the discussion on visualization techniques, the ones discussed
here are code based, requiring minimal human supervision. Association
based algorithms generate rules that would reveal the relationships
between objects within datasets, which objects have a natural tendency
to come together, which would lead from one to another. In the next
article, we shall discuss some of the ways in which data mining
techniques can be utilized, how they can be applied and what likely
benefits can we derived from them, what disasters can we avoid through
their applications.
|