What is data mining?
Posted on: May 12, 2023
Data mining is a process that blends artificial intelligence and statistics to sort, analyse, and extract useful information from large datasets.
Through data mining tools and techniques, the patterns, relationships, and anomalies in raw data can be identified and then used to solve problems, answer questions, support decision-making, and predict outcomes for businesses and other organisations.
The data mining process
The data mining process typically involves four key steps: data gathering, preparation, mining and analysis.
Data gathering
The initial step in data mining is effectively an information gathering exercise. During this stage, relevant stakeholders ascertain what data sources are available, how they should be collected, and how they are – or should be – stored and secured. They should also identify any missing data, and agree what problem they are attempting to solve, or question they are attempting to answer, through the data mining exercise. Finally, they should determine relevant parameters, metrics, and limits. Data can then be extracted, uploaded, and otherwise gathered.
Data preparation
Once data is gathered, it needs to be prepared for mining. This includes data exploration, profiling, and pre-processing, followed by cleansing, standardising, and transforming into consistent datasets.
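As a minimal sketch of cleansing and standardising, assume a hypothetical list of raw customer records (the field names and values are invented for illustration). The snippet trims whitespace, normalises case, converts text to numbers, and drops incomplete or duplicate rows. Real preparation work is usually far more involved.

```python
# Hypothetical raw records; field names and values are illustrative only.
raw_records = [
    {"customer": " Alice ", "country": "uk", "spend": "120.50"},
    {"customer": "BOB", "country": "UK", "spend": "80"},
    {"customer": "alice", "country": "Uk", "spend": "120.50"},  # duplicate of the first row
    {"customer": "Carol", "country": "FR", "spend": ""},        # missing spend
]


def prepare(records):
    """Cleanse and standardise records into a consistent dataset."""
    cleaned = []
    seen = set()
    for record in records:
        name = record["customer"].strip().title()
        country = record["country"].strip().upper()
        spend_text = record["spend"].strip()
        if not spend_text:
            continue  # drop rows with missing values
        row = (name, country, float(spend_text))
        if row in seen:
            continue  # drop duplicates that only appear once standardised
        seen.add(row)
        cleaned.append({"customer": name, "country": country, "spend": row[2]})
    return cleaned


prepared = prepare(raw_records)
for row in prepared:
    print(row)
```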
Data mining
Once data has been gathered and prepared, it can then be mined using appropriate techniques, algorithms, and machine learning applications. The mining process will look for relationships, trends, correlations, affiliations, associations, and sequential patterns in the data, and may even use predictive models to propose future outcomes.
Data analysis
During the final stage, data mining results are assessed, interpreted, analysed, and communicated to help inform business decisions and actions. This step often includes using models, visualisation aids, and data storytelling techniques to ensure findings can be understood.
Following these four steps, data analysts will typically evaluate the results of the data mining exercise to assess its performance, a process known as validation.
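One simple way to illustrate validation is to compare a model's predictions against records whose outcomes are already known but were held back from mining. The sketch below assumes a hypothetical rule ("customers with more than four monthly visits will purchase") standing in for a mined model; the data and the rule are invented for illustration.

```python
# Hypothetical held-back records: (monthly_visits, actually_purchased).
holdout = [(1, False), (2, False), (6, True), (8, True), (3, True), (7, True)]


def predict_purchase(monthly_visits):
    """Stand-in for a mined model: predicts a purchase above four visits."""
    return monthly_visits > 4


def accuracy(model, labelled_records):
    """Share of held-back records for which the model's prediction is correct."""
    correct = sum(1 for features, actual in labelled_records if model(features) == actual)
    return correct / len(labelled_records)


score = accuracy(predict_purchase, holdout)
print(f"Holdout accuracy: {score:.2f}")
```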
Areas of data mining
Descriptive modelling
Descriptive data mining is used to identify patterns, extract information, and summarise findings from a dataset in order to answer questions about it.
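A small sketch of descriptive summarising, assuming a hypothetical list of orders (the categories and values are invented). It answers questions about the existing data only: how many orders there are, their typical value, and the most common category.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical order records: (product_category, order_value).
orders = [("books", 12.0), ("games", 40.0), ("books", 8.0), ("music", 15.0), ("books", 10.0)]

values = [value for _, value in orders]
category_counts = Counter(category for category, _ in orders)

# Summarise what the dataset already contains; nothing here predicts the future.
summary = {
    "orders": len(orders),
    "mean_value": mean(values),
    "median_value": median(values),
    "most_common_category": category_counts.most_common(1)[0][0],
}
print(summary)
```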
Predictive modelling
Predictive data mining is more complex than descriptive data mining because it uses the information gleaned from data in order to answer questions about future scenarios, or to predict a future outcome.
Data mining techniques
Association
Association rules detect relationships between variables in records and datasets. They are often used to examine and predict consumer behaviour.
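Association rules are commonly scored with two measures: support (how often the items appear together) and confidence (how often the rule holds when its first part is present). The sketch below computes both for a rule "bread -> butter" over a hypothetical set of shopping baskets; the baskets are invented for illustration.

```python
# Hypothetical shopping baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]


def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    hits = sum(1 for transaction in transactions if itemset <= transaction)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated probability of the consequent given the antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)


rule_support = support({"bread", "butter"}, baskets)
rule_confidence = confidence({"bread"}, {"butter"}, baskets)
print(f"bread -> butter: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```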
Classification
The classification technique assigns predefined classes to different data points based on their characteristics, ensuring even large-scale datasets can be categorised and organised. Algorithms used for the classification technique in data mining are called classifiers. They are useful in a number of scenarios, from summarising sales trends to identifying spam email messages.
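To make the spam example concrete, here is a minimal sketch of one possible classifier, a naive Bayes model with add-one smoothing. The choice of naive Bayes, the training messages, and the function names are assumptions made for illustration; the article does not prescribe a particular classifier.

```python
from collections import Counter
from math import log

# Hypothetical labelled messages; "ham" means not spam.
training = [
    ("win a free prize now", "spam"),
    ("free offer claim your prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]


def train(messages):
    """Count words per class and messages per class."""
    word_counts = {}
    class_counts = Counter()
    for text, label in messages:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log-probability for the message."""
    word_counts, class_counts, vocabulary = model
    total_messages = sum(class_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = log(class_counts[label] / total_messages)
        for word in text.split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += log((counts[word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(training)
print(classify("free prize", model))
print(classify("monday meeting", model))
```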
Clustering
Clustering is similar to the classification technique, but groups data by similarity without relying on predefined classes. This data mining technique is typically used for segmentation, such as grouping customers with similar purchasing habits, which helps businesses to better understand and anticipate behaviours and patterns.
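As an illustration of grouping without assigned classes, the sketch below runs a simple one-dimensional k-means on hypothetical customer spend figures. K-means is only one of many clustering algorithms, and the data and starting centroids are assumptions chosen to keep the example deterministic.

```python
def k_means_1d(values, centroids, iterations=10):
    """Simple one-dimensional k-means starting from the given centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        # Assign each value to its nearest centroid.
        for value in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical weekly spend per customer; no labels are provided.
spend = [10, 12, 11, 50, 52, 49]
centroids, clusters = k_means_1d(spend, centroids=[10, 50])
print(centroids)
print(clusters)
```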
Regression
Regression techniques predict future outcomes by modelling the relationship between a numeric outcome and one or more variables in past or historic data.
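A minimal sketch of the idea, assuming hypothetical monthly sales figures: fit a straight line to past data with ordinary least squares, then extend it to estimate the next month. Simple linear regression is just one form of regression, and the figures are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical historic data: sales for months 1 to 5.
months = [1, 2, 3, 4, 5]
sales = [110, 120, 130, 140, 150]

slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept  # estimate for month 6
print(f"slope={slope:.1f}, intercept={intercept:.1f}, month 6 forecast={forecast:.1f}")
```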
Decision trees
Decision trees rely on classification or regression techniques to predict outcomes based on a series of decisions. This technique creates a tree-like visual to represent different decisions and outcomes, and can be used to augment other decision support systems used by businesses and other organisations.
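The sketch below shows a single decision of such a tree (sometimes called a decision stump): it searches for the threshold on one numeric feature that best separates two classes, measured by Gini impurity. The records are hypothetical, and a full decision tree would repeat this search on each resulting branch.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    if not labels:
        return 0.0
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))


def best_split(rows):
    """Find the threshold with the lowest weighted Gini impurity."""
    best_threshold, best_score = None, float("inf")
    values = sorted({value for value, _ in rows})
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [label for value, label in rows if value <= threshold]
        right = [label for value, label in rows if value > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Hypothetical (age, bought_product) records.
rows = [(22, "no"), (25, "no"), (31, "no"), (38, "yes"), (45, "yes"), (52, "yes")]
threshold, impurity = best_split(rows)

# Each branch predicts the majority label of the records that fall into it.
left_label = Counter(l for v, l in rows if v <= threshold).most_common(1)[0][0]
right_label = Counter(l for v, l in rows if v > threshold).most_common(1)[0][0]


def predict(age):
    """Follow the single decision to a predicted outcome."""
    return left_label if age <= threshold else right_label


print(threshold, impurity, predict(40), predict(30))
```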
K-nearest neighbor (KNN)
The K-nearest neighbor (KNN) technique classifies data based on proximity. It calculates the distance between a new data point and existing labelled data points, finds the K closest ones, and assigns the new point the category most common among those neighbours.
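A minimal sketch of KNN, assuming hypothetical two-dimensional points with categories "A" and "B" and Euclidean distance as the measure of proximity. The data, the value of k, and the function name are illustrative choices.

```python
from collections import Counter
from math import dist

# Hypothetical labelled points: ((feature_1, feature_2), category).
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 6.5), "B"), ((6.5, 7.0), "B"),
]


def knn_classify(point, labelled_points, k=3):
    """Assign the category most common among the k nearest labelled points."""
    nearest = sorted(labelled_points, key=lambda item: dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(knn_classify((1.2, 1.5), training))
print(knn_classify((6.2, 6.4), training))
```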
Data mining tools and resources
Data warehousing
Data warehouses bring together data from multiple sources into a single repository – or warehouse – in order to support data mining as well as data analysis, artificial intelligence, and machine learning.
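As a toy illustration of the idea, the sketch below loads rows from two hypothetical source systems into a single table and queries across both. An in-memory SQLite database stands in for a real warehouse, and the table and column names are invented for the example.

```python
import sqlite3

# Hypothetical source systems, each exporting (customer, amount) rows.
online_sales = [("alice", 30.0), ("bob", 45.0)]
store_sales = [("alice", 20.0), ("carol", 15.0)]

warehouse = sqlite3.connect(":memory:")  # in-memory stand-in for a real warehouse
warehouse.execute("CREATE TABLE sales (source TEXT, customer TEXT, amount REAL)")
warehouse.executemany("INSERT INTO sales VALUES ('online', ?, ?)", online_sales)
warehouse.executemany("INSERT INTO sales VALUES ('store', ?, ?)", store_sales)

# With everything in one repository, a single query spans both sources.
totals = warehouse.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```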
Neural networks
Neural networks offer more sophisticated data mining techniques. They process data through layers of interconnected nodes and are the basis of deep learning algorithms.
Support vector machines
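Support vector machines (SVMs) are supervised learning models used for classification and regression. For classification, an SVM looks for the boundary that separates classes with the widest possible margin, which helps predictive models to generalise to new data.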
Open-source data mining tools
In addition to the techniques and tools described above, there are also a number of open-source tools available to data analysts.
Data Mining and Knowledge Discovery publication
Data Mining and Knowledge Discovery is a technical Springer publication with article submissions spanning data mining theory and research to methods and techniques. It also publishes special issues that focus on specific issues and trends within data mining.
Uses and applications for data mining
Data mining has become an increasingly important area of data science, particularly as Big Data continues to expand and integrate into modern business and organisational practices in a wide variety of fields. For example, medical informatics – also known as health informatics – uses data mining to support healthcare systems and improve patient outcomes, such as when it’s used to analyse medical data in electronic health records (EHR). It has even been used to analyse clinical data to support decision-making processes for the early – and successful – detection of breast cancer.
Other fields that harness data mining include bioinformatics and biomedical informatics, which use data mining to analyse biomedical data and support analysis in areas such as genomic medicine and gene expression.
What is the difference between data mining and data analytics?
Data mining is effectively a tool used for data analytics. The knowledge gleaned through data mining exercises supports the analysis of data.
What is the difference between data mining and knowledge discovery in databases (KDD)?
The knowledge discovery in databases (KDD) process is a data science methodology for gathering, processing, and analysing data. Data mining is one step in this KDD process.
Explore data mining and the knowledge discovery process
Take an in-depth look at data mining – including data collection and processing data types from a variety of sources – with the MSc Computer Science at the University of Wolverhampton. This flexible Master’s degree has been developed for ambitious individuals who may not have a background in computer science, and is delivered 100% online.
The degree includes a key module in data mining and informatics, so in addition to learning other in-demand computer science skills, you will explore different data processing and mining techniques, as well as data processing and mining tools. This includes:
- data models
- the knowledge discovery process
- data selection
- pre-processing and cleaning
- data mining algorithms and association rules
- classification
- clustering
- text mining
- social media mining
- data visualisation.
After studying this module, you will be able to use all aspects of the knowledge discovery process – from data collection and data processing to representation – and apply cutting-edge data mining techniques to extract information from data. You will also develop your own data mining solution using a state-of-the-art data mining tool.