Data Mining
Data mining is part of a larger discipline known as Knowledge Discovery
in Databases (KDD), which draws on mathematics, algorithms, machine
learning, statistics and artificial intelligence in order to make
more intelligent sense of data while at the same time refining our
understanding of the data collected.
Why is Data Mining Important?
Look around you. The world is obviously a data-centric
one. Humans and corporations alike have been collecting data with
a fervor that rivals that of the gold rush. The innate need
to tag things, to label objects and to track transactions has led
to a parallel industry that records, classifies, categorizes, segments
and otherwise organizes the data thus collected.
We hence live in a sea of data, not unlike deep-sea ecosystems
where unimaginable life forms wobble through their existence
surrounded by water. We are, unfortunately, not much smarter than
these creatures whose world is in perpetual darkness. We are visitors
to the Library of Babel, as described by Argentine writer Jorge Luis
Borges, whose infinite shelves of books contain an infinite amount
of knowledge. Yet that knowledge is unattainable without the discovery
of the hinted catalogue of catalogues, and inaccessible because the
contents of the books are incomprehensible to humans.
We know who eats in which restaurant at which time, and yet have
no idea when he will next be hungry for pancakes. We know where
and when he pumped his last gallon of gas, but would not have any
clue as to why if he decided to switch his petrol allegiance to
another brand.
The explanation is simple: data, massive amounts
of it, does not equate to intelligence. We have massive collections
of data, but we are no more intelligent than we were before.
This explains why there is currently so much interest in the area
of data mining, the art and science of extracting intelligence
from massive collections of data.
Like any technology under the spotlight, data mining is a
frequently misunderstood term. I have personally encountered
companies promoting statistical packages as having data mining
capabilities. Others regurgitate data from backend systems into
nicely formatted web front ends and claim data mining competencies.
The classic cases are those that tout POS (point-of-sale) systems,
which merely supply transactional information, as data mining systems.
What exactly is data mining, this mysterious silver bullet that
would convert our heaps and heaps of data into intelligence? Simply
put, data mining refers to processes and algorithms that enable
the discovery of hidden information from collections of data. This
discovered information can be trends, segmentations, clusters,
associations, rules of thumb or other insights; fundamentally, a
new realization about the data that was previously unknown.
In fact, it is widely accepted that data mining is but the discovery
stage of a wider discipline known as Knowledge Discovery in Databases
(KDD). Pieter Adriaans et al. define KDD as the “non-trivial
extraction of implicit, previously unknown and potentially useful
information from data”. KDD is hence an amalgamation of machine
learning, statistical methods, visualization, expert systems and
database technologies.
This series of articles attempts to offer a comprehensive introduction
to the entire KDD process and does not limit itself to data mining
algorithms and methods. Future issues will discuss Data Preparation,
Cleansing and Transformation, Data Visualization techniques and
Data Modeling. The intention is to round off the series with a discussion
of real-life applications of data mining techniques and how companies
have benefited from these explorations. Oh, yes, there will also
be an elaboration of data mining algorithms with an emphasis on
the more popular methods for intelligence discovery.
To round off this issue, perhaps it might be useful to clarify some
of the greatest myths surrounding data mining.
1 I don’t think there are any trends or
clusters in my data set.
Nothing could be further from the truth. Data mining engagements
always suffer from insufficient data, both in quality (number of
attributes available) and in quantity (number of instances
available), but seldom from a lack of character. There are rare
occasions where data sets of significant size do not reveal any
interesting patterns, but these are few and far between. More often
than not, data mining uncovers interesting patterns that were at
best suspected.
2 My data is mine. I don’t want you to be
sticking your nose into it in case the competition gets wind of
what it reveals.
This is partially true. Data in its raw form reveals many things,
such as how sloppy the data collection process can be, or how
redundancy might be exploited to build in data verification
or solicit more details from the user. It is, therefore, conceivable
that an analysis of the data would reveal the deepest and darkest
secrets within your enterprise. But there are many things that can
be done to protect the semantic content of your data. Encryption
is one. Normalization is another. Metadata generation is yet
another. We could probably fill this entire article with transformations
that can be applied to data to achieve privacy, with the added
benefit of data modeling before data mining algorithms are applied.
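As a rough illustration of such transformations, here is a minimal Python sketch. It rests on assumptions that are not in the article: a keyed hash (pseudonymization) stands in for encryption, "normalization" is read as rescaling numeric values to a common range, and the key, function names and sample records are all invented for the example.

```python
import hashlib
import hmac

# Hypothetical secret held only by the data owner (an assumption for this sketch).
SECRET_KEY = b"replace-with-a-private-key"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a keyed hash so the analyst never sees the raw value."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:12]  # shortened for readability


def min_max_scale(values):
    """Rescale numbers to [0, 1]: absolute magnitudes are hidden, relative order is kept."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Invented sample records: (customer name, invoice amount).
records = [("Alice Tan", 120.0), ("Bob Lee", 480.0), ("Alice Tan", 300.0)]

names = [pseudonymize(name) for name, _ in records]
amounts = min_max_scale([amount for _, amount in records])
masked = list(zip(names, amounts))
```

The same customer still maps to the same token, so patterns per customer remain discoverable, while the names and the actual invoice amounts no longer appear in the data handed over for analysis.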
3 I am already doing data mining. We have this
application that displays my data in rows and columns, allowing me
to drill down into its rows and columns to see greater and greater
detail of my business.
You could be deriving a great deal of insight through drilling
down rows and columns, but that is not data mining. What you have
just described is probably an OLAP application. OLAP, which stands
for Online Analytical Processing, allows the user to gather data
from multiple databases into highly complex tables. OLAP basically
deals with aggregates, which is quite different from intelligence
discovery such as identifying patterns, trends, segmentations, clusters
and associations.
However, it is unwise to overlook what OLAP tools can reveal, especially
when used in conjunction with data mining algorithms. I have seen
numerous instances of OLAP and data mining offerings augmenting
each other, providing great insights into business operations.
OLAP will be discussed extensively in a later issue in
conjunction with analytical methods.
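The difference between an aggregate and a discovered pattern can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the transactions, the store names and the 0.5 support threshold are invented, and simple pair counting stands in for a full association-rule algorithm.

```python
from collections import Counter
from itertools import combinations

# Invented point-of-sale baskets: (store, set of items, total spent).
transactions = [
    ("north", {"bread", "butter", "milk"}, 10.0),
    ("north", {"bread", "butter"}, 5.0),
    ("south", {"milk", "eggs"}, 4.0),
    ("south", {"bread", "butter", "eggs"}, 8.0),
    ("south", {"milk"}, 2.0),
]

# OLAP-style question: total sales per store. The user decides what to sum up.
sales_by_store = Counter()
for store, _, total in transactions:
    sales_by_store[store] += total

# Mining-style question: which item pairs turn up together often?
# Nobody names the pair in advance; it emerges from the data.
pair_counts = Counter()
for _, items, _ in transactions:
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

min_support = 0.5  # hypothetical threshold: pair appears in at least half the baskets
frequent_pairs = {
    pair: count / len(transactions)
    for pair, count in pair_counts.items()
    if count / len(transactions) >= min_support
}
```

Here `sales_by_store` answers a question the user already knew to ask, whereas `frequent_pairs` surfaces the bread-and-butter association without anyone specifying it.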
4 I am a marketing guy; I don’t understand
predictive regression analysis or cross-correlation matrices. Heck,
I don’t even know how to set up a DBMS properly, nor do I have
the money to pay for the hardware or software needed for data mining.
Again, this is partially true. Data mining has, however, evolved
into offerings available in an ASP (Application Service Provider)
model. You pay the data-mining provider a small fee per month and
you get reports on the analysis performed on your data set. You
don’t need to know how predictive regression analysis is done,
nor how cross-correlation matrices are created. Heck, you don’t
even need to know how to set up a DBMS; your friendly data-mining
service provider will provide you with reports that describe the
interesting patterns discovered in your data set in terms that you
understand.
5 Data mining is not suitable for me. I don’t
have a computer system; the only data I have on my customers
is in the form of paper invoices or warranty cards.
Nothing could be further from the truth. If you are keen on uncovering
hidden trends in your data, the analog form in which your data is
trapped is not going to stop you. In fact, many of the data mining
engagements that we have undertaken began with data trapped in
invoices, warranty cards, reports and the like. There are
well-established, automated and semi-automated ways to encode and
otherwise digitize data into a form to which data mining algorithms
can be applied.
Data mining is a field that has generated a lot of interest lately.
It tends to straddle the realms of art and science, requiring
the data-mining practitioner to be well versed in both the science
of database technologies and the art of data manipulation. Too many
people, in the recent past, have claimed to offer data mining solutions.
Too many of them border on outright misrepresentation and fraud.
It is hoped that this series of articles will dispel many of the
fallacies surrounding data mining. In the next article we shall
describe the processes necessary to set up a knowledge discovery
environment.
| Term | Description |
|------|-------------|
| DM | Data Mining or Direct Marketing or Direct Mailer |
| DBMS | Database Management System |
| RDBMS | Relational DBMS |
| KDD | Knowledge Discovery in Databases |
| OLAP | Online Analytical Processing |
| OLTP | Online Transactional Processing |
| BI | Business Intelligence |
| ETL | Extract-Transform-Load |
| k-NN | k-nearest neighbour algorithm |