Data mining is a process, while data science is a field. The aim of data mining is to make data more usable, while the aim of data science is to build data-centric products for an organisation.
Data mining is an activity that forms part of the broader Knowledge Discovery in Databases (KDD) process, while data science is a field of study in the same way as applied mathematics or computer science.
Data mining is the process of finding anomalies, patterns and correlations in large datasets in order to predict outcomes. Using a wide range of techniques, organisations can apply this information to cut costs, boost revenue, improve customer relationships and reduce risk.
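As a rough illustration of what finding correlations and anomalies can mean in practice, here is a minimal sketch using only the Python standard library. The figures, the variable names and the z-score threshold of 2.0 are made up for the example; real projects typically use far larger datasets and more robust techniques.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def zscore_anomalies(values, threshold=2.0):
    """Indices of values lying more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing can stand out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical data, invented for illustration only.
ad_spend = [10, 20, 30, 40, 50, 60]
revenue = [25, 41, 62, 79, 101, 118]
daily_orders = [100, 102, 98, 101, 99, 250, 100, 97]

correlation = pearson(ad_spend, revenue)      # close to 1: strong linear relationship
outliers = zscore_anomalies(daily_orders)     # positions of unusual days
print(f"correlation={correlation:.3f}, outlier indices={outliers}")
```

A strong correlation such as this one only suggests a relationship worth investigating; it does not by itself show that one variable causes the other.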
Data mining builds on three scientific disciplines: statistics, artificial intelligence and machine learning. In data mining, machine learning is used mainly for pattern recognition, while in data science it has a more general use.
Many of the algorithms were invented years ago, but advances in processing power and speed over the past decade now make it possible to automate much of what previously had to be done manually. The more complex the datasets, the greater the potential to find relevant insights.
The key steps involved in a data mining process are:
Extracting, transforming and loading data into a data warehouse
Storing and administering data in multidimensional databases
Giving data access to business analysts via applications
Presenting analysed data in easily understandable forms, such as graphs
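The steps above can be sketched end to end in a few lines. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for a data warehouse, and the CSV content, the `sales` table and the function names are invented for the example.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; note the stray space and the missing value.
RAW_CSV = """region,month,sales
north,2024-01, 1200
south,2024-01,950
north,2024-02,1350
south,2024-02,
"""


def extract(text):
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim whitespace, drop rows without a value, convert to numbers."""
    records = []
    for row in rows:
        sales = row["sales"].strip()
        if not sales:
            continue
        records.append((row["region"], row["month"], float(sales)))
    return records


def load(conn, records):
    """Load: store the cleaned records in the (stand-in) warehouse."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, month TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()


def sales_by_region(conn):
    """Access: the kind of aggregate query an analyst's application might run."""
    cur = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
    return dict(cur.fetchall())


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
totals = sales_by_region(conn)

# Present: a crude text bar chart, one '#' per 250 units.
for region, total in totals.items():
    print(f"{region:<6} {'#' * int(total // 250)} {total:.0f}")
```

In a real setting each function would be replaced by dedicated tooling, but the division of responsibilities stays the same.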
Cegal and data mining
In our AI & Analytics projects, data mining is a fundamental element of extracting insight from data. For example, one energy project we worked on involved data from many sensors installed at a number of hydropower plants. In this case, the steps involved in data mining were easy to identify:
Combine the various data sources from different hydropower plants and systems into a single data source
Administer the data that has been collected live
Decide what data to use, and cleanse it. Not all available data is of interest; what matters depends on the project and the need. In this case, we wanted to detect deviations in the start-up of a turbine at a hydropower plant.
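To make these steps concrete, here is a minimal sketch of combining sources, cleansing the data and flagging deviating turbine start-ups. Everything in it is an assumption made for illustration: the plant and turbine names, the start-up durations, the 20 % tolerance and the choice of a per-turbine median as baseline. It does not describe the method used in the actual project.

```python
from statistics import median

# Hypothetical start-up durations in seconds, one list per plant.
SOURCES = {
    "plant_a": [("T1", 61.0), ("T1", 63.5), ("T1", None), ("T1", 62.2), ("T1", 95.0)],
    "plant_b": [("T7", 48.0), ("T7", 47.1), ("T7", 49.3), ("T7", 48.4)],
}


def combine(sources):
    """Merge the per-plant records into a single list of dicts."""
    return [
        {"plant": plant, "turbine": turbine, "startup_s": seconds}
        for plant, records in sources.items()
        for turbine, seconds in records
    ]


def cleanse(rows):
    """Drop records that lack a usable start-up duration."""
    return [r for r in rows if r["startup_s"] is not None and r["startup_s"] > 0]


def deviating_startups(rows, tolerance=0.2):
    """Flag start-ups that differ from their turbine's median by more than `tolerance`."""
    by_turbine = {}
    for r in rows:
        by_turbine.setdefault((r["plant"], r["turbine"]), []).append(r["startup_s"])
    baseline = {key: median(values) for key, values in by_turbine.items()}

    flagged = []
    for r in rows:
        expected = baseline[(r["plant"], r["turbine"])]
        if abs(r["startup_s"] - expected) > tolerance * expected:
            flagged.append(r)
    return flagged


rows = cleanse(combine(SOURCES))
for r in deviating_startups(rows):
    print(f"{r['plant']} {r['turbine']}: start-up took {r['startup_s']} s")
```

Comparing each turbine with its own baseline, rather than with a global average, keeps a naturally slower turbine from being flagged merely for being different from the others.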