Articles
Core data mining
algorithms - Statistical
We have progressed somewhat in the journey of data mining and knowledge
discovery. We have differentiated what data mining is and what it
is not. We have discussed the various processes underlying a typical
knowledge discovery engagement and in the most recent article we
went through a number of data mining techniques involving visualization.
The time is now ripe for us to jump right in and explore the intricacies
of algorithms based analytical techniques commonly adopted in data
mining. There will be a little mathematics involved, but nothing
that requires you to dust off that copy of College Calculus that
is still hibernating on the bookshelf.
Analytical Algorithmic methods tend to discourage user interactions.
Although we are often given permissions to decide on initial conditions
and the variables upon which the analysis is targeted, much of the
mining mechanics is left to the algorithms themselves. This is why
developing a good understanding of the algorithms utilized is important;
otherwise, the results might not make much sense.
From a classification standpoint, analytical data mining algorithms
can be categorized into the following:
> Statistical Analysis
> Outcome based Algorithms
> Cluster Algorithms
> Affinity Algorithms
Again, each broad category of technique consists of many sub categories.
In the interest of time and space, we shall discuss the first 2
techniques here leaving the last two in the next article.
Statistical Analysis As expected, statistics often play a large
part in analytical algorithms. In fact, much of the results from
other analytical algorithms are typically presented in some form
of statistics. As you can expect, statistical methods require the
use of quantitative data. They work with “numbers” rather
than qualitative data like “East”, “Jurong”
or “Male”. Obviously, these qualitative data needs to
be encoded into quantitative forms before statistical methods can
be applied. The most common encoding method would be that of categorization
where data items are segmented before statistical analysis is performed.
We have discussed some of these techniques in the previous article.
Statistical methods often forms the entry point of more involved
analysis works. Many data analysts would often equip themselves
with the “vital statistics” of the variables that they
are dealing with before applying other analytical algorithms on
them. Typically the basic level of statistics that interests most
data analysts are the mean, median, mode and standard deviation
of each variable within the data set. These give analysts a good
“feel” of the data set that they are dealing with. For
category-based data, another statistical value that is also of interest
is the number of discrete values that can be found within the data
set.
Another popular use of statistical methods is in the testing of
hypothesis. This is where the F tests, t tests,…etc, belonging
to the realm of parametric statistical tests, can be applied. Typically,
the testing of a hypothesis begins with, well, a hypothesis; a statement
that is true or false. “The ROI of the Spring Madness program
is effective, while considering seasonal variations in revenue”
and “Owners of luxury cars are less likely to shift their
loyalty in petroleum purchase if its price increases by 10%”
are examples of outcomes that an analyst might want to investigate.
In hypothesis-speak, however, these would become, “The ROI
of the Spring Madness program is significantly more than that due
to seasonal variation” and “The number of owners of
luxury cars shifting their loyalty in petroleum purchase if its
price increases by 10% is not significant”.
The next step is always to calculate a ratio that compares the
tested hypothesis with random variances that occurs within the entire
data set. These “random variances” would capture characteristics
like variations due to seasonality, data errors, baseline fluctuations,
noise…which could otherwise influence the hypothesis. So,
if the calculated ratio is large, it would mean that the hypothesis
being tested is “significant” as compared to the random
variation that the data is inherently exposed to. We would then
infer that this hypothesis would be observed again when similar
conditions prevailed.
Statistical methods are also used in Regression Analysis. The objective
is always to map the variables being analyzed into some form of
equation so that postulation can be made. We all learned about best-fit
lines in high school and how they do not always work. We are not
going to discuss too much about that here. More sophisticated regression
analysis uses some form of polynomial, exponential or logarithmic
equation to fit as many data points as possible. This is where variations
of Bezier, B-Splines and other exotically named algorithms can be
utilized in the generation of best-fit curves; each with their own
strengths and weaknesses. If you are keen in exploring some of these
without investing thousands of dollars in a data-mining tool, rudimentary
functions exist in MS Excel, allowing analysts to generate trend
lines onto charts depicting series data. Try it; it can often produce
rather insightful results.
Outcome based Algorithms Outcome based Algorithms always begin
with an outcome in mind. And this outcome is typically the reason
why analysis is performed to begin with. For example, in a scenario
depicting customer transactions on a regularly purchased product
(like petroleum, for example), customer churn might be defined as
a situation where a customer has not purchased the product for a
period of 2 months. From an analysis perspective, we might want
to figure out the factors identifying customer churn with respect
to petroleum purchase. In this situation the outcome to be investigated
is, therefore, customer churn.
Outcome based algorithms are usually applied to data sets containing
many variables, of which the outcome variable is one of. These analyses
are typically performed in order to determine which variables would
influence the state of outcome variable. In situations where the
outcome variable does not exist in the original transactional records,
it may have to be generated before the algorithms can be applied.
Without going through all the details, the transformed transactional
records would likely take the form as depicted in figure 1, before
algorithms can be applied. Each record would describe a customer’s
profile details and his transactional behaviors. It should be noted
these transactional behavior record values tend to be generated
from the transactional records themselves. The decisions to be made
on how these variables would be consolidated and categorized are
oftentimes difficult (see Sidebar).
With the records generated, we are now ready to apply algorithms
to the data set. By far, the most popular outcome based algorithms
are those that generate decision trees. One of the main reasons
for their popularity has to do with how easily they can be understood.
Figure 2 shows a portion of the decision tree generated based on
the data set described above.
The way to read the decision tree would be as follows:
> If the total purchase is greater than $200 AND the car is 1600cc,
the likelihood of churn is high.
> If the total purchase is greater than $200 AND the car is
between 1601 and 2000cc AND the number of loyalty points accumulated
is less than 1000 then the likelihood of churn is high.
> ..etc.
As can be noticed, those variables closer to the root exert greater
influence on the outcome variable. These variables are said to be
more predictive for the outcome variable. In our data sample, the
total purchased made within the 6 months period is the most predictive
variable for customer churn. It basically means that if you are
only given one variable to predict if the customer would churn or
not, you would want to know the total purchased made by the customer
within the most recent 6 months. And in this case, the split lies
at the $200 mark. From the Decision Tree generated the next variable
within those that spend more than $200 that can help you determine
churners and loyal customer would be the capacity of their car,
and so on…
Equip with these rules on the outcome variable, we can build a
system that traverse the tree thus created, in order to predict
the likelihood in which a customer might churn on us in the future.
In the following month, when we have collected and cleansed the
most up to date transactional records, we could apply transformations
on them to generate records similar to those in figure 1 and have
the decision tree generate the value of the outcome variable to
predict the likelihood of churn amongst the customer set being investigated.
In fact, since the traversing of a tree takes a very short time
(maximally, this is the height of the tree), we could actually predict
in real time, if the customer is going to churn on us at the point
of transaction. What this means, theoretically, is that, with the
decision tree, we can potentially predict the likelihood in which
the customer would churn, at the instance in which he is paying
for his purchase at the petrol kiosk and do something, immediately,
to influence his loyalty.
Before you get all too excited and rush out to build your first
decision tree, a word of caution. We have only begun to scratch
the surface of decision trees as a concept in data mining. Consider
some of the questions below:
Since the root node is the most predictive variable, how do we
systematically choose the root node?
Is having a binary tree (one in which each node has 2 or less children
nodes) better than an n-ary tree (one in which the number of children
nodes are allowed to vary) in prediction?
How do we prune a tree to rid it of rules that are nonsensical in
the context of the application?
There is a great level of temptation to dive into a discussion on
Information Gain, how modes are split, ID3, Goodness functions…etc.
so that a much deeper understanding of decision trees can be established.
This would, however, necessitate an inescapable discussion with
the mathematical complexity that I have warned you about. Let’s
devote a future article on these topics if there are sufficient
interests.
|