Articles
OLAP
In the first series of articles we have explored what is
data mining, fundamental processes in data exploration and
elaborated on a number of visualization and analytical algorithms.
With this article, we begin a new series where a number of
the more popular data mining concepts would be explored in details.
To start off, we shall discuss an
often-misunderstood term, OLAP in this article.
OLAP or Online Analytical Processing is best understood when contrasted
with OLTP or Online Transaction Processing.
OLTP systems are transaction-based system. Those systems that runs
the banks, allows the telecommunications company to generate bills
that charges you for phone usage are examples of OLTP systems.
The basis of OLTP systems is to process transactions.
As such, you can imagine how the data would be stored. The database schema
of an OLTP system is geared towards operational efficiencies.
Queries should be short and to the point. Spot data is important
and must be highly accurate. Results should be generated
in seconds if not milliseconds. When a user interacts
with an OLTP system, the results can be generated while the user
is waiting.
In contrast, OLAP systems focus on analysis.
Applications supported by a data warehouse, with tools that
allow you to “drill into details of data”, slice-and-dice data from
multiple dimensions, are examples of OLAP systems.
Data accuracy is important but this loses in significance
to trend information. Users
of OLAP systems are typically knowledge workers that are keen in
understanding the trends that exist in the underlying data rather
than their spot values. Operational efficiencies
are not as important as clarity in the abstraction of meta data. Queries are typically
complex and not often crafted manually no matter how good you are
in SQL. While
some OLAP tools can generate reports while the user is waiting,
OLAP may require longer processing time (hours, if not days) while
their underlying model is being constructed.
OLAP Fundamentals
At the core of OLAP tools lay a multidimensional data model. The best and most
typical way to visualize this is in the form of a data cube.
Simply stated, a data cube allows data to be modeled and
viewed in multiple dimensions. In general, each
cube is defined by 2 entities, measurements and metric.
Let’s first discuss what is
a metric. A metric is basically
the dimensions in which data in an organization is kept for. Time (in Years,
Quarters or Months) or Region (North, South, East West), would be
examples of metrics. Each vendor pushing
their own brand of OLAP tools would have their own way of calling
a metric. I have seen it
being called as perspective, dimension, viewpoint…etc. just to name
a few.
Measurement, on the other hand, represents
values of the data that is being stored.
Think of measurements as quantities in which we want to analyze
relationships between metrics. Measurements are
typically numeric in nature. While Time (in
Years, Quarters or Months) is a good example of a metric in the
example given above, Yearly Sales or Average Quarter-To-Quarter
Growth are good examples of measurements.
So very quickly, with a combination
of metrics (Years, Quarters or Months) and measurements we have
the following “cube”.
At the core of OLAP tools lay a multidimensional data model. The best and most
typical way to visualize this is in the form of a data cube.
Simply stated, a data cube allows data to be modeled and
viewed in multiple dimensions. In general, each
cube is defined by 2 entities, measurements and metric.
Let’s first discuss what is
a metric. A metric is basically
the dimensions in which data in an organization is kept for. Time (in Years,
Quarters or Months) or Region (North, South, East West), would be
examples of metrics. Each vendor pushing
their own brand of OLAP tools would have their own way of calling
a metric. I have seen it
being called as perspective, dimension, viewpoint…etc. just to name
a few.
Measurement, on the other hand, represents
values of the data that is being stored.
Think of measurements as quantities in which we want to analyze
relationships between metrics. Measurements are
typically numeric in nature. While Time (in
Years, Quarters or Months) is a good example of a metric in the
example given above, Yearly Sales or Average Quarter-To-Quarter
Growth are good examples of measurements.
So very quickly, with a combination
of metrics (Years, Quarters or Months) and measurements we have
the following “cube”..
| |
Yr 98 |
Yr 99 |
Yr 00 |
Yr 01 |
| Yearly Sales |
329243 |
4894004 |
499420 |
4829040 |
| Avg Q-to-Q Growth |
20% |
12% |
2% |
8% |
Strictly speaking, a 1 dimensional cube is more
commonly known as a list, but with a blink of an eye, a marketer’s
analysis needs can very quickly turn this into a more complex animal.
For example, what if we want to know
the sales performance across different regions as well? We could
get a cross tab in the form depicted below, transforming the list
into a table.
| |
Yr 98 |
Yr 99 |
Yr 00 |
Yr 01 |
| North |
3232 |
42424 |
42244 |
422442 |
| South |
32323 |
24234 |
24424 |
842244 |
| East |
42424 |
24444 |
54353 |
24447 |
| West |
42442 |
245345 |
2444343 |
23833 |
What if we would also like to know how sales contributions are segmented
across Corporate, Retail, Government and Education segments?
In order to do that, in OLAP speak, we’d need to build a cube
with the added metric of region as indicated in the diagram below
Figure1: 3 dimensions of OLAP cube
Voila, we get a cube. It is obviously
the objective of OLAP tools to allow the user to analyze data across
more than 3 dimensions but from here on, things get a little hairy
in 2 dimension graphical depiction such as printed-paper or
static images. A typical
way of visualizing OLAP N-dimension cubes is to repeat each cube
for each value of the additional metric and grouped them up accordingly. In the diagram
below, we try to depict a 4 dimensional cube and I am sure you would
agree with me, the 2 dimensional papers that you are holding gets
clumsy very quickly.
Figure2: N dimensions of OLAP cube
Strictly speaking, OLAP is not data mining. OLAP systems do not
discover trends or segmentation in any automated fashion. They basically store
data in this format and spit them back at you when you ask them
to. Stripped to its
basic, OLAP is basically a database models, a bunch of complicated
queries and fancy user interfaces. Why do we then discuss
them here? Fundamentally,
because OLAP is extremely useful in analysis; when coupled with
a visualization tool, OLAP systems allows a marketing oriented person
to dive into the data to gain a much better understanding. Are OLAP systems easy
to deploy? Not quite.
And following the same train of thought started in the previous
series of articles, if there is enough interest in OLAP systems
roll out; we could allocate a future article on it.
|