Stages of data mining
In the first article, we discussed what data mining is and is not,
why it is important and how it could affect our lives. In this article,
we explore the stages involved in setting up a knowledge discovery
environment where data mining algorithms can be effectively applied.
As with any serious endeavor, the key word here is plan, plan
and plan. A wise man once said that any project manager worth their
salt would run through their projects twice, once in their
head and once during execution. This is exactly the case for
any knowledge discovery project. Don’t be mistaken; the knowledge
discovery process need not be a long-winded commitment requiring
an extended period of time. If sufficient preparation work is done,
the knowledge discovery process can be relatively short and
painless.
What then are the steps of a knowledge discovery process?
We have many of the usual suspects here, as depicted
by the diagram below:
Data Load: We typically start with a collection
of data from multiple sources that we want to store in a single coherent
location. Typical sources can range
from tables in the databases of various systems (if you are lucky) to
POS (point of sale) records, transactional receipts, facsimiles,
printouts or even a series of screen-scraping exercises. Data trapped
in non-digital records would need to be transcribed, of course.
Data Cleansing: By now we should have all the required
data fields in digital form residing within a single coherent location.
The data fields are still undeniably raw and not yet in any
form suitable for analysis, but believe me, we are already well
on our way in the knowledge discovery process.
We next have to clean the data in our possession to ensure that
each table contains data that is relatively free of errors, anomalies
and misspellings. The obvious key word here is “relatively”.
We need to decide if going the extra mile to squeeze out that extra
percentage point of cleanliness in the data is worth the effort.
While it is always great to have as accurate a data set to work
on as possible, more often than not a small percentage of unclean
data would not significantly affect the results of any analysis
applied to the data set.
Over the years, many data cleansing techniques have been developed;
an elaborate discussion of them could fill an entire article.
In fact, there are companies that make a living solely from cleaning
data for corporations. The more common data cleansing techniques
typically involve some form of custom-built or industry-strength
taxonomy against which each data item is compared.
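The taxonomy idea can be sketched in a few lines of Python. This is only
an illustration under stated assumptions: the reference list, the
similarity cutoff and the function name clean_value are all made up for
the example, and the fuzzy matching from the standard difflib module
stands in for whatever matching a real cleansing tool would use.

```python
import difflib

# Hypothetical reference taxonomy: the approved spellings for one field.
TAXONOMY = ["Braddell", "Toa Payoh", "Jurong East", "Jurong West"]


def clean_value(raw, taxonomy=TAXONOMY, cutoff=0.8):
    """Return the closest taxonomy term, or None if nothing is close enough."""
    # Normalise whitespace and letter case before comparing.
    candidate = " ".join(raw.split()).title()
    matches = difflib.get_close_matches(candidate, taxonomy, n=1, cutoff=cutoff)
    return matches[0] if matches else None


raw_values = ["Bradell", "toa  payoh", "Jurong Wset", "Unknown Place"]
cleaned = [clean_value(v) for v in raw_values]
```

Values that come back as None are the ones worth a human look; how
strict the cutoff should be is exactly the “relatively clean” judgment
call discussed above.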
Transformation: Now that the raw data is reasonably
clean, we need to begin the next process of transforming the data
into a form that is suitable for data mining and analysis. The need
for data transformation is best illustrated with the following example.
Consider the records found in the tables in figure 2. The fields
as presented seem to be useless for analysis. What results
can one realistically gather from one table showing details of people’s
addresses and another showing the locality where the transactions
were made? The situation here could be one of too much detail,
making analysis useless. On closer observation, those who are a
little familiar with the geography of Singapore would notice that
everyone in table 1, except for Bob, lives in the vicinity of each
other. Similarly, the “Locality” field stores locations
that appear to be in the vicinity of each other as well. To do
effective analysis, like figuring out if there is a relationship
between where a person stays and where he buys stuff, we might want
to consider transforming the tables into those in figure
3, consistently encoding “Address” and “Locality”
information in the Braddell-Toa Payoh vicinity as 01 and the Jurong
vicinity as 02. There is obviously a lot more to data transformation
than converting addresses in the same vicinity into numbers, but further
elaboration can be left to a future article.
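A minimal sketch of this kind of encoding follows. The codes 01 and 02
come from the example above; the keyword lookup, the function name
encode_vicinity and the fallback code "00" for anything the lookup does
not cover are assumptions made purely for illustration.

```python
# Hypothetical lookup: keyword found in an address or locality -> vicinity code.
# The codes 01 and 02 follow the article; the keyword list is an assumption.
VICINITY_CODES = {
    "toa payoh": "01",
    "braddell": "01",
    "jurong": "02",
}


def encode_vicinity(text, default="00"):
    """Return the vicinity code for a free-text address or locality."""
    lowered = text.lower()
    for keyword, code in VICINITY_CODES.items():
        if keyword in lowered:
            return code
    return default  # "00" marks a value the lookup does not cover


addresses = ["2 Toa Payoh Lorong 8", "23 Braddell Heights", "84 Jurong West"]
localities = ["Braddell", "Jurong East"]

encoded_addresses = [encode_vicinity(a) for a in addresses]
encoded_localities = [encode_vicinity(loc) for loc in localities]
```

Because both fields now share one coding scheme, a question such as
“do people buy where they live?” becomes a simple comparison of two
codes. A real project would need a far more complete lookup than three
keywords.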
Records optimization: Some consider this phase
part of data transformation; others tend to keep it separate.
Whether they are separate or not is purely academic, and
we are not going to worry ourselves with that here.
The optimization phase does the familiar conversion of
raw data into formats that make the analysis process easier and
more meaningful. The key difference between this phase and the previous
one is that the number of resultant records from this phase
would naturally be fewer than the number we began with. We
are again faced with a situation where the raw data has too much
information to make effective analysis practical. Think
of it this way: if we have 5 million records of the daily transactions
of customers patronizing retail outlets, and it takes weeks of computing
time to churn out useful models, can we optimize the data set by
aggregating the transactions into weekly or even monthly totals?
This can convert the required weeks of processing into days,
which is often more palatable, at least for the initial investigative phase.
Again, deciding which records to consolidate and what criteria to apply
is more art than science. An iterative approach to selection is
often adopted to suit the resource limitations faced.
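The weekly aggregation described above might look like the following
sketch. The sample records, the field layout and the function name
aggregate_weekly are invented for the example, and grouping by ISO week
is just one reasonable choice of period.

```python
from collections import defaultdict
from datetime import date

# Made-up daily transactions: (transaction date, customer id, amount).
daily = [
    (date(2024, 1, 1), "0001", 12.50),
    (date(2024, 1, 3), "0001", 7.50),
    (date(2024, 1, 9), "0001", 20.00),
    (date(2024, 1, 2), "0002", 5.00),
]


def aggregate_weekly(records):
    """Sum amounts per (ISO year, ISO week, customer)."""
    totals = defaultdict(float)
    for day, customer, amount in records:
        iso_year, iso_week, _ = day.isocalendar()
        totals[(iso_year, iso_week, customer)] += amount
    return dict(totals)


weekly = aggregate_weekly(daily)
```

Here four daily records collapse into three weekly ones. On millions of
records the reduction is far larger, at the cost of losing any pattern
that only shows up within a week.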
Data Upload: Once the records are effectively scrubbed
clean, transformed and selected, they are loaded into files or
tables suitable for analysis.
This should be a relatively straightforward phase, where data sets
from multiple tables are converted into file formats in which data mining
algorithms can discover hidden trends and rules. Many tools
prefer CSV (comma-separated values); others require the user to
import the data into their own table formats.
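As a rough illustration of the CSV route, the sketch below serialises a
few transformed records with Python's standard csv module. The field
names and the helper to_csv are assumptions for the example; a real
mining tool will dictate its own column layout.

```python
import csv
import io

# Hypothetical transformed records, using the encoded fields from the article.
records = [
    {"user_id": "0001", "address": "01", "locality": "01"},
    {"user_id": "0002", "address": "01", "locality": "01"},
]


def to_csv(rows, fieldnames):
    """Serialise a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(records, ["user_id", "address", "locality"])
```

To produce a file instead of a string, the same writer can be pointed at
a file opened with open(path, "w", newline="").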
Data Discovery: This is probably the most exciting
stage of the knowledge discovery process. This is the time to apply
the various data mining algorithms to the data set, sit back and
decipher the findings that the algorithms uncover for us. The difficulty
is often: where does one begin? There aren’t rules of thumb
here, but there are preferences. Many prefer to start with a statistical
analysis, doing the simple things like finding the mean, median, mode
and standard deviation of the data elements, followed by a rough
visualization of the data set across fields that have suspected
relationships. You will get a lot of noise, of course, but there
will be times when the data selected reveals possible
trends that analysis algorithms can zero in on later.
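The first-pass statistics mentioned above need nothing more than
Python's standard statistics module. The sample values below are made
up for illustration only.

```python
import statistics

# Made-up sample: weekly spend for a handful of customers.
weekly_spend = [20, 35, 20, 50, 25]

summary = {
    "mean": statistics.mean(weekly_spend),
    "median": statistics.median(weekly_spend),
    "mode": statistics.mode(weekly_spend),
    "stdev": statistics.stdev(weekly_spend),  # sample standard deviation
}
```

A mean well above the median, as in this sample, is a quick hint that a
few large values are pulling the average up and may deserve a closer
look.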
What typically happens next is a series of iterative steps where
one would apply directed and undirected mining algorithms first to
a large data set, then to a more selective data set for detailed
analysis across a collection of variables, and then zoom back
out to a larger data set if the detailed analysis fails to uncover
anything interesting. Instead of going on and on, we’ll cover
the details of this process in a structured and organized way across
multiple articles in the future.
Visualization: Thought we had this before in the
previous stage? Well, yes and no. In the data discovery stage, visualization
techniques are used to “get a feel” for the underlying characteristics
of the data. In this phase, we need to concentrate
on the presentation of information. We need to design
the presentation of our findings from the discovery phase in a format
that will deliver the best possible impact. Why is this needed?
More often than not, the output churned out by
the discovery phase can be rather boring or difficult for a finance
manager, for example, to understand. An experienced data analyst
would “massage” the raw output and map it onto the appropriate
visualization tools to clearly depict the results uncovered in the
discovery phase. Many times, rudimentary charts such as
pie charts and line graphs will do. To impress business
executives and clearly depict inter-variable relationships, 3-D
plots are naturally the favorites.
Figure 2

|    | User Name | Address                    |   | Transaction | Locality     |
| 1. | John      | 2 Toa Payoh Lorong 8       |   | XNC 839903  | Braddell     |
| 2. | James     | 23 Braddell Heights        |   | XBH 2878900 | Potong Pasir |
| 3. | Mary      | 78 Daisy Ave               |   | CXV 4908439 | Jurong East  |
| 4. | Helen     | 93 Woskel Rd               |   | XVG 943003  | Jurong West  |
| 5. | Ann       | #19-192 Upper Serangoon Rd |   |             |              |
| 6. | Joanna    | #01-198 Woodsville Rd      |   |             |              |
| 7. | May       | 27 Jalan Lateh             |   |             |              |
| 8. | Bob       | 84 Jurong West             |   |             |              |

Figure 3

|    | User ID | Transaction Details |   | Address | Locality |
| 1. | 0001    | -------             |   | 01      | 01       |
| 2. | 0002    | -------             |   | 01      | 01       |
| 3. | 0003    | -------             |   | 01      | 01       |

Aren’t these stages similar to those found in a typical data processing
project? Well, one shouldn’t be too surprised.
Knowledge discovery is, after all, from a macro perspective, the
processing of data. The age-old adage “GIGO” (garbage in, garbage out)
naturally holds true, and particularly in the case of knowledge discovery,
erroneous raw data leads to flawed predictions and conclusions. And as with
any data processing project, good planning is essential to an excellent
outcome. Too many have fallen into the trap of analyzing data before
it is sufficiently cleaned and transformed. A similar number have
suffered from “analysis paralysis”, where far too many
resources are spent on analysis that digs too deep into the data
to be useful. What is enough and yet not too much? What is useful
and significant and yet not too little? This fine line is, perhaps,
what separates the art from the science of data mining.