Data Mining: Techniques, Applications, and Examples
What Is Data Mining?
Data mining is a process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.
In simple words, data mining is a process used to extract usable information from a larger set of raw data. It involves analyzing patterns in large batches of data using one or more software techniques.
The Need for Data Mining
In recent years, data mining techniques have been used to process the growing amount of data piling up from many sources.
These techniques have been applied effectively in many areas of human activity, such as the Human Genome Project.
Genome Project Data Set
The Human Genome Project was an international scientific research project with the goal of determining the base pairs that make up human DNA, and of identifying and mapping all of the genes of the human genome from both a physical and a functional standpoint.
The Project has provided researchers all over the world with a large set of data containing valuable information that remains to be discovered. The code that codifies life has been read, but it is not yet known how life works. What is still needed is an understanding of the relationships between genes: how they interact and affect one another.
Complex analysis is needed to discover interesting information hidden in the data.
Data Produced by Web Crawler
Another important data set is produced by web crawlers on the internet. Discovering interesting patterns in the documents available on web pages may also benefit many human activities.
Discovering interesting patterns in the chaotic interconnection of web pages helps in finding useful relationships for web searching purposes.
Sensors Capturing Images and Sounds
Sensors capturing images or sounds are used in the agricultural and industrial sectors for monitoring or performing various tasks.
In order to extract valuable information, these data need to be analyzed.
A collection of apple images captured by a scanner can be used to separate good apples from bad ones for marketing purposes.
Sounds recorded from animals can be used to identify various diseases and poor environmental conditions.
Computational techniques can be designed to perform various tasks and to substitute for human ability. These techniques can perform these tasks effectively and even in harsh environmental conditions that may be harmful to human beings.
The computational techniques developed in the present study try to mimic the human ability to solve a specific problem and also to perform even better than human beings.
Whereas an experienced farmer can personally monitor the sounds generated by animals to detect the presence of disease or poor environmental conditions, there are various tasks that humans can perform only with great difficulty.
As an example, human experts can check apples on a conveyor belt to separate good and bad apples. The percentage of bad apples removed from the belt is a function of the conveyor's speed and the inspector's dedication to the task.
It is well established that it is difficult for the human brain to stay focused on a particular object for a long time.
Furthermore, there are tasks that humans cannot perform, such as the task of discovering complex hidden valuable patterns among various items stored in a huge repository of any application domain.
DATA MINING DISCOVERIES
Gregory Piatetsky-Shapiro coined the term Knowledge Discovery in Databases (KDD) in 1989, but most people liked the term "data mining" better.
Historically, the idea of discovering valuable patterns in data has been given a variety of names, including knowledge discovery, information harvesting, data archeology, data mining, and pattern discovery.
The phrase "knowledge discovery in databases" was first used at the Knowledge Discovery in Databases (KDD) Workshop in 1989 to emphasize that knowledge is the end product of data-driven discovery.
It has been popularized in the Artificial Intelligence (AI) and Machine Learning (ML) fields. KDD refers to the overall process of discovering valuable knowledge from data while Data Mining (DM) is one of the important steps used in the knowledge discovery process as shown in Fig. 1.1.
Data mining was introduced to the business industry in the 1990s, but its roots have been in the business world for a substantial time, in three fields: machine learning, artificial intelligence, and classical statistics.
The traditional statistical model comprised concepts such as point estimation, regression analysis, correlation analysis, standard distributions, variance, classification, and cluster analysis.
These techniques can be regarded as the study of data, and all these concepts served as the building blocks of data mining.
Data mining is not a new discipline; statisticians have used similar techniques to analyze data and provide business projections for many years.
Data mining technology introduced various changes in the old data storage and discovery process.
These changes in data mining techniques enable organizations and decision-makers to store, retrieve, and analyze data in new ways.
The first change occurred in data collection and data storage.
Organizations once collected data in paper files; they have since made the transition from paper-based records to electronic, computer-based records.
Now they are able to answer complex and detailed queries faster, easier, and with more accuracy.
The second change occurred in data organization and data retrieval procedures, namely data warehousing and decision support systems.
The development of these systems has enabled organizations to extend queries from describing existing conditions to predicting future trends. Business industries now have more information at their disposal: a huge data repository.
They can use a data mining algorithm to process this huge repository. They can determine the outcomes of the data analysis by the parameters they choose, thus providing additional value to business strategies and initiatives.
Without such parameters, data mining algorithms generate every permutation and combination of patterns, irrespective of their relevance or interest to the user.
The explosive growth in data has generated an immediate requirement for new techniques that can automatically transform this data into knowledge.
Consequently, data mining has become a research area of increasing importance. Data mining is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data.
It is the computer-assisted process of digging through and analyzing enormous amounts of data and then extracting the meaning of the data, as shown in Fig. 1.2.
DATA MINING: INTERDISCIPLINARY NATURE
Data mining has evolved, and continues to evolve, from the intersection of research fields such as Artificial Intelligence (AI), Machine Learning (ML), statistics, databases, expert systems, pattern recognition, high-performance computing, the World Wide Web, and information retrieval, along with many application domains.
Data mining currently relies on known techniques from ML, pattern recognition, and statistics to discover patterns in data.
Statistics has much in common with KDD: it provides a framework for quantifying the uncertainty in results when one tries to infer general patterns from a particular sample of a population. Another main force behind data mining is the database field.
A related field evolving from a database is data warehousing, which refers to the important business activity of collecting and cleaning transactional data to make them available for analysis and decision making.
The data mining process can be viewed as a multidisciplinary activity that includes techniques beyond the scope of any one particular discipline, such as machine learning.
Hence, there is an opportunity for other fields of AI to contribute to data mining.
Although data mining is still in its infancy, organizations in retail, finance, medicine, manufacturing, transportation, aerospace, and other sectors are already using data mining techniques to benefit from historical data.
Data mining techniques can generate new business opportunities by:
- Automated prediction of trends and behavior: Data mining automates the prediction of newly targeted customers, new targeted products for specific customers, and so on. It can also forecast bankruptcy and other business failures.
- Automated discovery of previously unknown patterns: Data mining algorithms can discover previously unknown hidden patterns, revealing the risks already prevailing in the application domain as well as the factors that can minimize those risks, so that the situation can be improved.
DATA MINING TECHNIQUES
The two main goals of data mining in practice are prediction and description [5]. Prediction involves using some variables or attributes in the database to predict unknown or future values of other variables of interest.
Description focuses on the discovery of human-understandable patterns describing the data. The goal of prediction and description can be achieved using a variety of data mining methods.
From the above, it can be inferred that the main goal of data mining is to discover previously unknown and interesting patterns. There are two techniques to discover and identify valuable patterns. One is supervised data mining techniques and the other is unsupervised data mining techniques.
Supervised techniques use a training data set to discover patterns; an example of a supervised technique is classification.
Unsupervised techniques discover patterns without a training data set; they treat the input as an unlabeled sample and build a model of its underlying structure, such as a joint density estimate, from the data.
An example of unsupervised techniques is clustering. Brief reviews of classification and clustering techniques are as follows:
Classification in Data Mining
Classification is a learning function that maps a data item into one of several predefined classes. It is referred to as supervised learning because classes are decided before examining the data.
Often a training set is used to develop the specific parameters required by the technique.
Training data consists of sample input data. These algorithms are based on statistics, decision trees, neural networks, and rules.
Classification algorithms require that the classes be defined based on attribute values.
The algorithm describes these classes by viewing the characteristics of data already known to belong to a class.
One straightforward way to perform classification is association rule-based classification.
An association rule has two parts: an antecedent (the "if" part) and a consequent (the "then" part). The consequent is an item or class that is found in combination with the antecedent.
Association rule mining is an important data mining task involving the discovery of hidden relationships in items stored in a huge repository.
Various extensions of traditional association rule mining have been proposed so far; however, the problem of mining complex hidden patterns has not yet been tackled to the satisfaction of the user.
These complex undiscovered patterns are useful in many applications such as cross-marketing, catalog design, predicting the failure of telecommunication switches, weather forecasting, minimizing socioeconomic risk factors associated with the farming systems, and many more.
An association rule captures the presence of one factor when another factor is present.
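For illustration, the support and confidence of a single candidate rule can be counted directly from transaction records. The transactions and the rule {bread, butter} → {milk} below are hypothetical; this is a sketch of the two measures, not a full rule-mining algorithm:

```python
# Minimal sketch: support and confidence for one hypothetical
# association rule {bread, butter} -> {milk}.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk", "eggs"},
]

antecedent = {"bread", "butter"}   # the "if" part
consequent = {"milk"}              # the "then" part

n = len(transactions)
# Transactions containing both antecedent and consequent items.
both = sum(1 for t in transactions if antecedent | consequent <= t)
# Transactions containing the antecedent.
ante = sum(1 for t in transactions if antecedent <= t)

support = both / n          # fraction of all transactions with the full itemset
confidence = both / ante    # how often the consequent follows the antecedent

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```

A rule is typically reported only when both measures exceed user-chosen thresholds, as discussed under interestingness constraints later in this chapter.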
Various algorithms fall under classification techniques, as follows:
- Classification based on association rules
- Statistics based classification
- Decision tree-based classification
- Bayesian classification
- Neural network-based classification
- Support vector machine
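To make the supervised setting concrete, here is a minimal sketch of a nearest-neighbor classifier, one of the simplest statistics-based methods related to those listed above. The feature vectors and labels are invented for illustration (they could stand for, say, apple weight and color score):

```python
import math

# Hypothetical training set: (feature vector, class label).
training = [
    ((1.0, 1.0), "good"),
    ((1.2, 0.9), "good"),
    ((5.0, 4.8), "bad"),
    ((5.2, 5.1), "bad"),
]

def classify(x):
    """Assign x the label of its nearest training example (1-NN)."""
    _, label = min(
        ((math.dist(x, features), label) for features, label in training),
        key=lambda pair: pair[0],
    )
    return label

print(classify((1.1, 1.0)))  # near the "good" examples -> good
print(classify((5.1, 5.0)))  # near the "bad" examples -> bad
```

Note how the classes are decided before examining new data: the labeled training set defines them, and each new item is simply assigned the class of its closest known example.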
Clustering in Data Mining
A process of grouping a set of physical or abstract objects into a class of similar objects is called clustering.
Clustering is the method by which like records are grouped together.
According to another definition, clustering is a common descriptive task where one seeks to identify a finite set of categories or clusters to describe the data.
Clustering is similar to classification in that data are grouped; however, the groups are not predefined. Clustering is of two main types: hierarchical clustering and partition clustering.
Various clustering algorithms include K-means clustering, nearest neighbor, genetic algorithms, etc. The K-means clustering algorithm partitions a set of data into 'K' clusters by finding inherent patterns in the given set.
'K' is a predetermined value given by an expert or the user. K-means clustering has been used to analyze agricultural meteorology for enhancing crop yields and reducing crop losses, and, in combination with GPS-based technologies, to classify soils and plants.
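The K-means procedure can be sketched in a few lines of pure Python. The one-dimensional data values and the choice K = 2 below are invented for illustration; a production implementation would also handle empty clusters, convergence checks, and multidimensional points:

```python
# Minimal K-means sketch for 1-D data with K = 2.
data = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centroids = [data[0], data[3]]  # naive initialisation from the data

for _ in range(10):  # fixed number of refinement iterations
    # Assignment step: attach each point to its nearest centroid.
    clusters = [[], []]
    for x in data:
        k = min(range(2), key=lambda i: abs(x - centroids[i]))
        clusters[k].append(x)
    # Update step: move each centroid to the mean of its cluster.
    centroids = [sum(c) / len(c) for c in clusters]

print(centroids)  # roughly [1.0, 8.07]
```

Each iteration alternates assignment and update until the partition stabilizes, which is the inherent-pattern-finding behavior described above.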
Categorization of different types of clustering is as follows:
- Hierarchical clustering
- Partition clustering
- Categorical clustering
- Density-based clustering
- Grid-based clustering
DATA MINING: AN OVERVIEW FROM DATABASE PERSPECTIVE
Recently, the capabilities of both generating and collecting data have been increasing rapidly. Millions of databases have been used in various domains of applications.
Here, a brief description of various databases contributing to data mining is given.
Relational Database
A relational database is a collection of data items organized as a set of formally described tables from which data can be accessed in different ways.
The relational database was invented by E.F. Codd at IBM in 1970. The standard application program interface to a relational database is the structured query language (SQL).
SQL statements are used both for interactive queries against a relational database and for gathering data for reports. Data mining on relational databases focuses on discovering patterns and trends.
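As a toy illustration, the snippet below builds an in-memory relational table with Python's built-in sqlite3 module and runs a descriptive SQL query over it; the sales table and its rows are invented:

```python
import sqlite3

# In-memory relational database with a hypothetical sales table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("north", 80.0), ("south", 200.0), ("south", 50.0)],
)

# A descriptive query: total sales per region, highest first.
rows = con.execute(
    "SELECT region, SUM(amount) AS total FROM sales "
    "GROUP BY region ORDER BY total DESC"
).fetchall()
print(rows)  # [('south', 250.0), ('north', 200.0)]
```

Aggregate queries like this summarize existing data; data mining proper goes further, searching such tables for patterns that were not explicitly asked for.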
Transactional Database
A transactional database is a collection of transaction records, mostly sales or purchase records. Data mining on a transactional database focuses on mining association rules, that is, discovering relationships among the items stored in the transaction records.
Spatial Database
A spatial database is a database that has been optimized to store and query data related to space, including lines, points, and polygons. A spatial database contains not only traditional data but also the location or geographic information associated with that data.
Temporal or Time Series Database
A temporal database is one that stores time-related data. Temporal data mining extracts temporal patterns from such a database.
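As a toy illustration of extracting a temporal pattern, the sketch below smooths an invented time series with a moving average to expose its underlying trend:

```python
# Hypothetical daily temperature readings (a small temporal data set).
series = [20, 22, 21, 23, 25, 24, 26, 28]

def moving_average(values, window):
    """Smooth a time series with a fixed-size sliding window."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

smoothed = moving_average(series, 3)
print(smoothed)
# A monotonically rising smoothed series reveals an upward trend
# that is obscured by the noise in the raw readings.
print(all(a <= b for a, b in zip(smoothed, smoothed[1:])))
```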
DATA MINING ALGORITHM
A data mining algorithm is a well-defined procedure that takes data as input and produces patterns as output.
"Well defined" means that the algorithm is precisely encoded as a fixed set of rules and that the procedure terminates after a finite number of steps and produces output. A data mining algorithm has three primary components.
- Model representation
- Model evaluation
- Search
Model representation is the set of heuristics and calculations used to create the patterns.
Model evaluation criteria are statements indicating how well a particular pattern or model and its parameters meet the goals of the knowledge discovery process.
For instance, predictive models are judged by the empirical prediction accuracy on the test data set.
Descriptive models can be evaluated along the dimensions of predictive accuracy, utility, and understandability of the given model or pattern.
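For instance, empirical prediction accuracy is simply the fraction of test items a model labels correctly. The threshold model and test set below are invented placeholders:

```python
# Hypothetical held-out test set: (input value, true label).
test_set = [(0.2, "low"), (0.4, "low"), (0.6, "high"),
            (0.9, "high"), (0.55, "low")]

def model(x):
    """A toy predictive model: threshold the input at 0.5."""
    return "high" if x >= 0.5 else "low"

# Count how many test items the model labels correctly.
correct = sum(1 for x, label in test_set if model(x) == label)
accuracy = correct / len(test_set)
print(f"accuracy = {accuracy:.2f}")  # 4 of 5 correct -> 0.80
```

Crucially, the accuracy is measured on data the model was not built from, so it estimates how the model will behave on future inputs.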
Search methods consist of computational statements to describe the procedure to calculate the path of descent and to terminate the search.
Data mining that uses user-specified constraints to guide the search for desired patterns is called constraint-based data mining.
CONSTRAINT-BASED DATA MINING
The data mining process may uncover thousands of patterns from a given data set; most of them may be uninteresting or unrelated to the needs of the user.
Uninteresting patterns consume more time in analysis and create confusion for the decision-maker. Such unfocused pattern mining can reduce the efficiency and usability of the mining process.
Often, users have a good sense of which "line" of mining may lead to related or interesting patterns.
Therefore, a good heuristic is to have users specify such intuition or expectations as constraints that confine the search space. This strategy is known as constraint-based data mining.
Constraints can be classified as follows:
- Knowledge Constraints: These constraints specify the type of knowledge to be mined, such as characterization, discrimination, association, and correlation analysis.
- Task Related Constraints: These constraints specify the type of task-related information to be used in mining.
- Dimension Constraints: These constraints specify the desired dimensions (or attributes) of the data to be used in mining.
- Level Constraints: These constraints specify the level of abstraction at which data are to be mined. A concept hierarchy is a popular form of background knowledge that allows data to be mined at multiple levels of abstraction.
- Interestingness Constraints: These constraints are based on threshold values of support and confidence. Rules whose support and confidence fall below the user-specified thresholds are considered uninteresting.
- Meta-Rule Constraints: These constraints specify the form of the rules to be mined. A meta-rule may be based on the user's knowledge, understanding, experience, or expectations. Generally, a meta-rule forms a hypothesis regarding a relationship that the user is interested in confirming. The data mining system then searches for rules that match the given meta-rule.
Constraint-based data mining allows users to describe the rules they would like to mine, thereby making the process more relevant and effective.
Constraints can be implemented using a high-level declarative data mining query language, a user interface, or a query optimizer.
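As a minimal sketch of applying an interestingness constraint, the snippet below keeps only those candidate rules whose support and confidence meet user-specified thresholds; the rules and numbers are invented:

```python
# Hypothetical candidate rules: (antecedent, consequent, support, confidence).
candidate_rules = [
    ("bread", "butter", 0.40, 0.80),
    ("milk", "eggs", 0.05, 0.90),   # support too low
    ("tea", "sugar", 0.30, 0.45),   # confidence too low
]

# User-specified interestingness constraints.
min_support = 0.10
min_confidence = 0.60

# Keep only rules meeting both thresholds.
interesting = [
    rule for rule in candidate_rules
    if rule[2] >= min_support and rule[3] >= min_confidence
]
print(interesting)  # only the bread -> butter rule survives
```

In a real system such filtering would be pushed into the mining algorithm itself rather than applied afterward, so that uninteresting branches of the search space are never explored.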