Data Wrangling

What is data wrangling?

Data wrangling, sometimes known as data munging, can be described as data reconciliation, data adjustment or data arranging. It is one part of data mining or data preparation.

Data wrangling/data reconciliation/data adjustment is about bringing together two datasets or extracts from datasets.

For example, dataset A may consist of name, year of birth and income, while dataset B may consist of name, year of birth and insurance data. If we link dataset A to dataset B using names only, a common name such as "John Smith" would produce many hits. However, if we link the two datasets using both name and year of birth, this may be sufficient to create an unambiguous link. It is the process of executing this link on uncertain data that is referred to as ‘data wrangling’.
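The linking example above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the sample records, the field names and the `link` helper are all hypothetical, and it uses only the standard library.

```python
from collections import defaultdict

# Hypothetical sample records, invented purely for illustration.
dataset_a = [
    {"name": "John Smith", "year_of_birth": 1970, "income": 52000},
    {"name": "John Smith", "year_of_birth": 1985, "income": 61000},
    {"name": "Mary Jones", "year_of_birth": 1990, "income": 58000},
]
dataset_b = [
    {"name": "John Smith", "year_of_birth": 1970, "insurance": "home"},
    {"name": "John Smith", "year_of_birth": 1985, "insurance": "car"},
    {"name": "Mary Jones", "year_of_birth": 1990, "insurance": "life"},
]


def link(left, right, keys):
    """Pair each record in `left` with every record in `right` whose `keys` match."""
    # Index the right-hand dataset by the chosen key fields.
    index = defaultdict(list)
    for row in right:
        index[tuple(row[k] for k in keys)].append(row)

    linked = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            linked.append({**row, **match})
    return linked


# Linking on name alone pairs each "John Smith" with both insurance records.
by_name = link(dataset_a, dataset_b, ["name"])

# Adding year of birth to the key yields one match per person in this sample.
by_name_and_year = link(dataset_a, dataset_b, ["name", "year_of_birth"])

print(len(by_name), len(by_name_and_year))
```

In this sample, linking on name alone produces more linked rows than there are people, while the two-field key gives exactly one row per person. With real data, even name plus year of birth may not be unique, so the result would still need checking.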

Data wrangling is often carried out with a graphical tool, which makes it a simple, intuitive way of preparing data. The aim is to convert and map data from one format to another. Raw data from a source is a kind of raw material: to get something useful from it, you have to process it, just as beets are processed into sugar or crude oil is refined into petrol.
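As a rough sketch of what converting and mapping data from one format to another can look like in code, the following Python snippet turns a small raw CSV extract into cleaned JSON records. The sample data, the column names and the `csv_to_records` helper are assumptions made for illustration only.

```python
import csv
import io
import json

# Hypothetical raw extract: stray whitespace, and numbers stored as text.
raw_csv = """name,year_of_birth,income
 John Smith ,1970,52000
Mary Jones,1990, 58000
"""


def csv_to_records(text):
    """Parse CSV text into a list of dicts with trimmed names and integer fields."""
    reader = csv.DictReader(io.StringIO(text))
    records = []
    for row in reader:
        records.append({
            "name": row["name"].strip(),                 # remove surrounding spaces
            "year_of_birth": int(row["year_of_birth"]),  # text -> integer
            "income": int(row["income"]),                # int() tolerates padding spaces
        })
    return records


records = csv_to_records(raw_csv)

# Map the cleaned records to a second format, here JSON.
as_json = json.dumps(records, indent=2)
print(as_json)
```

Graphical wrangling tools perform comparable steps through a point-and-click interface; the code version simply makes each transformation explicit.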

The tools created for data wrangling are designed to be user-friendly. They are aimed less at data scientists or developers and are typically used by business analysts instead.

Tools: Trifacta, DataWrangler, OpenRefine and even Excel. It is also possible to use various libraries or packages in the programming languages Python and/or R.

Cegal and data wrangling

At Cegal, our Business Intelligence (BI) teams carry out data wrangling on a daily basis. In BI projects, once the data is available, e.g. after data mining, it has to be cleansed, merged and transformed in order to create visualisations or reports that are useful to customers.

Cegal uses data wrangling to validate the data and to segment your dataset in order to overcome complex business problems and identify solutions.
