Abstract
Data mining
derives its name from the similarities between searching for valuable business
information in a large database — for example, finding linked products in gigabytes
of store scanner data — and mining a mountain for a vein of valuable ore. Both processes
require either sifting through an immense amount of material, or intelligently probing
it to find exactly where the value resides. Given databases of sufficient size
and quality, data mining technology can generate new business opportunities by providing
these
capabilities:
Automated prediction
of trends and behaviors. Data mining automates the process of finding predictive
information in large databases. Questions that traditionally required extensive
hands-on analysis can now be answered directly from the data — quickly. A typical
example of a predictive problem is targeted marketing. Data mining uses data on
past promotional mailings to identify the targets most likely to maximize return
on investment in future mailings. Other predictive problems include forecasting
bankruptcy and other forms of default, and identifying segments of a population
likely to respond similarly to given events.
Automated
discovery of previously unknown patterns. Data mining tools sweep through databases
and identify previously hidden patterns in one step. An example of pattern discovery
is the analysis of retail sales data to identify seemingly unrelated products that
are often purchased together. Other pattern discovery problems include detecting
fraudulent credit card transactions and identifying anomalous data that could represent
data entry keying errors.
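To make the pattern-discovery example concrete, the sketch below counts how often pairs of products appear together across purchase transactions. It is a minimal illustration under assumed data, not a production association-rule miner; the transactions and product names are hypothetical.

    # Minimal sketch of co-occurrence analysis over retail transactions.
    # Transactions and products are hypothetical; a real system would run
    # association-rule algorithms over gigabytes of scanner data.
    from collections import Counter
    from itertools import combinations

    transactions = [
        {"bread", "milk", "eggs"},
        {"bread", "milk"},
        {"milk", "diapers", "beer"},
        {"bread", "milk", "diapers"},
    ]

    pair_counts = Counter()
    for basket in transactions:
        # Count every unordered pair of products bought together.
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    # The most frequent pairs surface as candidate purchase patterns.
    for pair, count in pair_counts.most_common(3):
        print(pair, count)

Pairs that co-occur far more often than their individual frequencies would predict are the "seemingly unrelated products" a retailer would investigate further.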
Data mining
techniques can yield the benefits of automation on existing software and hardware
platforms, and can be implemented on new systems as existing platforms are upgraded
and new products developed. When data mining tools are implemented on high performance
parallel processing systems, they can analyze massive databases in minutes. Faster
processing means that users can automatically experiment with more models to
understand complex data. High speed makes it practical for users to analyze huge
quantities of data. Larger databases, in turn, yield improved predictions.
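As a rough illustration of the parallel-processing point, the sketch below partitions a dataset and scores the partitions concurrently. It is a minimal sketch using Python's standard multiprocessing module; the "model" and the data are placeholders, and a real MPP deployment would distribute work across many machines rather than local processes.

    # Minimal sketch: analyzing partitions of a large dataset in parallel.
    from multiprocessing import Pool

    def score_partition(rows):
        """Toy stand-in for a model: count rows whose value exceeds 0.5."""
        return sum(1 for value in rows if value > 0.5)

    if __name__ == "__main__":
        # Placeholder for a massive table split into four partitions.
        partitions = [[i / 1000 for i in range(p, 1000, 4)] for p in range(4)]
        with Pool(processes=4) as pool:
            counts = pool.map(score_partition, partitions)
        print(sum(counts))  # total flagged rows across all partitions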
Databases can
be larger in both depth and breadth:
More
columns. Analysts must often limit the number of variables they examine when
doing hands-on analysis due to time constraints. Yet variables that are discarded
because they seem unimportant may carry information about unknown patterns. High
performance data mining allows users to explore the full depth of a database,
without preselecting a subset of variables.
More
rows. Larger samples yield lower estimation errors and variance, and allow users
to make inferences about small but important segments of a population.
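The statistical intuition behind "more rows" is that the standard error of an estimate shrinks roughly as one over the square root of the sample size. The simulation below is a minimal sketch of that effect on the sample mean; the synthetic normal data and the sample sizes are assumptions chosen for illustration.

    # Minimal simulation: larger samples yield lower estimation error.
    import random
    import statistics

    random.seed(0)

    def standard_error_of_mean(sample_size, repeats=2000):
        """Empirical spread of the sample mean across repeated draws."""
        means = []
        for _ in range(repeats):
            sample = [random.gauss(0, 1) for _ in range(sample_size)]
            means.append(statistics.fmean(sample))
        return statistics.stdev(means)

    for n in (10, 100, 1000):
        print(n, round(standard_error_of_mean(n), 4))

Each tenfold increase in rows cuts the error by roughly a factor of three (the square root of ten), which is what makes inferences about small population segments feasible.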
A recent Gartner
Group Advanced Technology Research Note listed data mining and artificial intelligence
at the top of the five key technology areas that "will clearly have a major
impact across a wide
range of industries within the next 3 to 5 years." [2] Gartner also listed parallel
architectures and data mining as two of the top 10 new technologies in which companies
will invest during the next 5 years. According to a recent Gartner HPC Research
Note, "With the rapid advance in data capture, transmission and storage, large-systems
users will increasingly need to implement
new and innovative ways to mine the after-market value of their vast stores of
detail data, employing MPP [massively parallel processing] systems to create new
sources of business advantage (0.9 probability)." [3]
The most commonly
used techniques in data mining are:
Artificial
neural networks: Non-linear predictive models that learn through training and resemble
biological neural networks in structure.
Decision
trees: Tree-shaped structures that represent sets of decisions. These decisions
generate rules for the classification of a dataset. Specific decision tree methods
include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID).
Genetic
algorithms: Optimization techniques that use processes such as genetic combination,
mutation, and natural selection in a design based on the concepts of evolution.
Nearest
neighbor method: A technique that classifies each record in a dataset based on a
combination of the classes of the k record(s) most similar to it in a historical
dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
Rule
induction: The extraction of useful if-then rules from data based on statistical
significance.
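Of the techniques listed above, the nearest neighbor method is simple enough to show in a few lines. The sketch below is a minimal k-nearest-neighbor classifier over a toy historical dataset; the records, features, class labels, and Euclidean distance function are all assumptions made for illustration.

    # Minimal k-nearest-neighbor classifier (k >= 1).
    # Historical records are (features, class) pairs; values are hypothetical.
    from collections import Counter
    import math

    historical = [
        ((1.0, 1.2), "responder"),
        ((0.9, 1.0), "responder"),
        ((3.1, 2.8), "non-responder"),
        ((3.0, 3.2), "non-responder"),
    ]

    def classify(record, k=3):
        """Assign the majority class among the k most similar records."""
        neighbors = sorted(
            historical,
            key=lambda item: math.dist(record, item[0]),  # Euclidean distance
        )[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]

    print(classify((1.1, 1.1)))  # -> responder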
Many of these
technologies have been in use for more than a decade in specialized analysis tools
that work with relatively small volumes of data. These capabilities are now evolving
to integrate directly with industry-standard data warehouse and OLAP platforms.
The appendix to this white paper provides a glossary of data mining terms.
Data Mining Processes
Data mining is a promising and relatively new technology. It is defined as the process of discovering hidden, valuable knowledge by analyzing large amounts of data stored in databases or data warehouses, using techniques drawn from machine learning, artificial intelligence (AI), and statistics.
Many organizations across industries such as manufacturing, marketing, chemicals, and aerospace are taking advantage of data mining to increase their business efficiency. As a result, the need for a standard data mining process has grown dramatically. Such a process must be reliable and repeatable by business people with little or no data mining background. To that end, the Cross-Industry Standard Process for Data Mining (CRISP-DM) was first published in 1999, following a series of workshops and contributions from over 300 organizations.