Data Science Series: Extracting and Transforming Data (2 of 3)

Carlos Brignardello

April 17, 2022

Today we are increasingly connected to the internet: we consume more information and, in turn, generate it in equal or greater measure. This sea of data is a competitive advantage for organisations that are able to capture it and derive real value from it. We saw this in detail in the previous article, "Data Science Series: Data and roles (1 of 3)", in which we explained the value of data, the different types of data we can find and the alternatives for storing it, as well as the roles and processes that come into play when contracting this type of service.

In this article, however, we will focus on more technical aspects: data extraction and transformation, and how they enable us to drive data and start generating new value.

About this article

This is the second article in the series on Data Science. Each article can be read out of order: the content is divided into stages that, despite being strongly connected, can be understood individually. Each publication aims to shed light on the processes that take place in the industry, whether to help you decide if your organisation should contract a service to migrate its data to the cloud or, if you are a student, to show how this type of project is developed. Some concepts are taken for granted here but explained elsewhere in the series, so a full read is recommended if some keywords sound unfamiliar.

Different professionals are involved in the development of Data Science projects; in this case we will focus on the role of the Data Engineer.

What is the role of a Data Engineer?

The Data Engineer builds ETL flows so that data can be queried easily, and in the appropriate format, from different platforms. These flows contain a series of transformations, which are simply scripts programmed to modify and enrich the data. The Data Engineer is also expected to create automations that run the flow and its ETL processes periodically. In essence, the Data Engineer ensures that all other roles have data to analyse. The ETL flow is explained in detail below.

Data enrichment: any process in which a dataset goes through a transformation that results in new columns or new data being added to it. It can also result in the data being standardised and cleaned after the transformation.

ETL Flow

To understand where to extract our data from and why we need to transform it, we must first understand what the ETL flow is.

ETL stands for Extract, Transform and Load. It is a process that extracts information from different data sources into a Data Lake, transforms the data within a path or zone of that Data Lake, and finally loads it into a Data Warehouse for consumption.
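
To make the three stages concrete, here is a minimal conceptual sketch in Python; the function names, path and table name are hypothetical placeholders rather than part of any particular library or of the case developed later in this article.

-- CODE language-python --
# Minimal conceptual sketch of an ETL flow (hypothetical names throughout).
def extract(source_path):
    """Read raw records from a source system into the staging area."""
    ...


def transform(raw_records):
    """Clean, standardise and enrich the raw records."""
    ...


def load(clean_records, warehouse_table):
    """Write the final records into a Data Warehouse table."""
    ...


def run_etl():
    raw = extract('/datalake/raw/sales/')
    clean = transform(raw)
    load(clean, 'analytics.sales')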

Extraction

In extraction, data is collected from various sources and the selected data is stored in a staging area. It is quite common for the data sources to be files extracted from SAP or other similar platforms that manage enterprise data.

Staging area: an area dedicated entirely to data processing. It is where the data is taken from to perform the transformations and, at the same time, where the processing files are stored: files generated by the flow that will be used by later transformation steps of the ETL process.

During the extraction process, the selected data can come in several formats, including:

  • Tables
  • SQL databases
  • NoSQL databases
  • Unstructured data
  • Text files

It is important to leave the extracted data in a staging area because it very commonly contains inconsistent, dirty and unpredictably formatted values. By placing the data in a staging area, the process can read it and correct all its errors through transformations, so that we can generate new data, correctly formatted and ready to be analysed.

Inconsistent data is information where the same column or document holds values in different formats. For example, a date column where some values arrive in the format 2021/11/09 and others in the format 2022-03-10 14:30.
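
As a small illustration, and assuming a date column called FECHA plus Spark's default behaviour of returning null for values that do not match the given format, this kind of mismatch could be normalised in PySpark along these lines:

-- CODE language-python --
from pyspark.sql.functions import coalesce, col, to_date

def normalise_dates(df):
    # Try each known format in turn; the first one that parses wins
    # (values matching neither format end up as null).
    return df.withColumn(
        'FECHA_NORMALIZADA',
        coalesce(
            to_date(col('FECHA'), 'yyyy/MM/dd'),
            to_date(col('FECHA'), 'yyyy-MM-dd HH:mm')
        )
    )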

Transformation

The transformation process takes the data loaded during extraction and converts it to a standard format so that it can be loaded into another area of the Data Lake or into a Data Warehouse, producing results to analyse or present.

Transformations are performed multiple times with different scripts within the same flow or pipeline. They are usually written as scripts that can be built with different languages and libraries; most commonly, Python is used as the programming language, with Pandas or PySpark as the library for working with the data.

Data pipeline: a series of data processes executed in sequence, where each process has an input and an output that continues the flow.

In transformation scripts, sequential tasks are applied to all the data, such as the following (a short PySpark sketch illustrating them follows the list):

  • Apply formatting: We can define the data type of each column and perform transformations according to the data type assigned.
  • Apply filters: To obtain only the values we need.
  • Clean columns: Apply transformations to empty or null values and unify values that are repeated or recorded differently but refer to the same thing.
  • Integrating data: We can obtain new data by integrating two different data sources or columns.
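
As a rough sketch of how these four tasks could look in PySpark, assuming two hypothetical DataFrames (raw_df and products_df) whose column names are purely illustrative:

-- CODE language-python --
from pyspark.sql.functions import col, to_date

def basic_transformations(raw_df, products_df):
    # Illustrative column names; not the ones used in the case below.
    return (
        raw_df
        # Apply formatting: cast each column to its intended type.
        .withColumn('QUANTITY', col('QUANTITY').cast('int'))
        .withColumn('DATE', to_date(col('DATE'), 'yyyyMMdd'))
        # Apply filters: keep only the rows we need.
        .filter(col('QUANTITY') > 0)
        # Clean columns: replace empty or null values with a default.
        .fillna({'COUNTRY': 'UNKNOWN'})
        # Integrate data: enrich with a second source through a join.
        .join(products_df, on='PRODUCT_CODE', how='left')
    )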

Load

In this process the transformed data is taken and loaded into a Data Warehouse. This is the end result of the ETL process, and the idea is that the whole flow, from extraction to loading, is executed regularly at defined intervals.

Once the data is available in a Data Warehouse it can be queried and visualised from a reporting system such as Tableau or Google Data Studio. The second step of the process, i.e. how the data is transformed, is explored in depth below.

Driving the data

Being able to leverage available data is essential for organisations, even more so if the business wants to go through a digital transformation process. One of the most common ways to leverage data is to take two or more datasets and obtain new results by blending them.

We can build new data based on some criteria: if we take a sales dataset and a supplier dataset, by relating them we can obtain a new column indicating the age of the supplier, or build a new value by joining the content of values from both datasets.
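
A minimal sketch of this kind of blending in PySpark, assuming hypothetical sales_df and suppliers_df DataFrames with the columns named in the comments:

-- CODE language-python --
from pyspark.sql.functions import col, concat_ws, current_date, floor, months_between

def enrich_sales_with_suppliers(sales_df, suppliers_df):
    # sales_df: SUPPLIER_ID, PRODUCT_NAME, ...
    # suppliers_df: SUPPLIER_ID, SUPPLIER_NAME, REGISTRATION_DATE, ...
    return (
        sales_df
        .join(suppliers_df, on='SUPPLIER_ID', how='left')
        # New column: the supplier's age in years, derived from its registration date.
        .withColumn(
            'SUPPLIER_AGE_YEARS',
            floor(months_between(current_date(), col('REGISTRATION_DATE')) / 12)
        )
        # New column built by joining values from both sources.
        .withColumn(
            'PRODUCT_LABEL',
            concat_ws(' - ', col('SUPPLIER_NAME'), col('PRODUCT_NAME'))
        )
    )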

Datasets: sets of data that are cohesive or generated from the same source. They generally have a tabular format, i.e. rows and columns. An example would be the data generated by a user when making an online purchase.

Data mapping

To start making the most of an organisation's information, it is first necessary to map the data. To do this, a data map can be generated: a diagram identifying all the data sources generated by the business processes.

The base component from which to start identifying the data being generated is the dataset. We can group a set of datasets from the same origin to form a data Bucket. For example, we can have a Bucket with all the Users/Customers datasets, another with Corporate data and finally a Bucket of Industry/Competition data.

Identify unmapped data

These three Buckets make up the data map. If more datasets are added to the Buckets, much richer conclusions can be drawn from the new information found.

However, there may be times when data exists that we are not mapping or locating correctly. In this case we can start from the Buckets to determine which dataset we are missing. For example: my organisation is generating data on operations that we have not yet captured, so in the data map we indicate that a dataset is pending in the Corporate data Bucket.
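
Purely as an illustration, a data map at this level could be captured as something as simple as a dictionary of Buckets and the datasets identified so far; all the names below are invented for the example:

-- CODE language-python --
# Hypothetical data map: Bucket -> datasets identified (or pending) in it.
data_map = {
    'Users/Customers': ['web_signups', 'purchase_history'],
    'Corporate': ['sales', 'costs', 'operations (pending)'],
    'Industry/Competition': ['market_reports'],
}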

Patterns for moving to the cloud

Migrations to the cloud give systems and solutions access to different benefits, such as developing more agile and secure solutions, working with data from the cloud, saving costs and generating new value. In addition, if the on-premise solution is obsolete or legacy, it can be completely transformed to integrate other services and modernise it.

Moving solutions to the cloud is not a process that always follows the same rules - it all depends on the context. There are at least five patterns that can be adopted when modernising an on-premise solution with the cloud.

On-premise: refers to applications or systems that have been developed and run locally, on the business's own infrastructure.

01. Move and change

This pattern consists of moving the application to the cloud with minimal modifications to make it work; once it is running, the original solution can be updated and improved in a simpler and more agile way.

02. Change and move

The next pattern is similar to the previous one, but the steps are reversed. The on-premise solution is modified to prepare it for the cloud and once ready, the migration is performed to start running the application and apply any changes or improvements that are needed with the services associated with the selected cloud provider.

03. Maintain and extend

This consists of building from scratch a new solution from the cloud that is an extension of or connects with the on-premise application. An example of this type of pattern is the creation of APIs to connect the local application with other services or applications that cannot be integrated into the original application.

04. Maintain and replace

Another common alternative is to keep the on-premise application running while a new application is created in the cloud environment to replace the on-premise solution once it is developed and thus begin to realise greater value and agility for the business.

05. Move only

In some cases it is sufficient to simply move the application or data in order to modernise the infrastructure layer. This can be done with simpler or more common products. For example: migrating the organisation's data to the cloud, or modernising the infrastructure of an application by creating a virtualised environment for disaster recovery, among many other cases.

In the following section we will look at a concrete case where these transformations are applied.

Transforming data

As described above, transforming information helps us get the most out of the data we have, while giving us more control over it so that we can analyse it.

Use Case

To demonstrate how the transformations are performed, we will use the construction of a report as an example.

This on-premise report can take a certain amount of time to produce and is built from many different data sources, which must be manually queried and configured to obtain the desired metrics. All the data is then assembled in an Excel spreadsheet and various calculations are applied. The final report would look something like this:

The resulting report simulates the data of a company that distributes different products. It concentrates the results obtained for some products, together with the calculation of the budget and forecast for each one. This is the result table, and to get to it we will have to apply transformations to different sources, specifically a series of transactional and master data.

Transactional data

Transactional data is data that is processed at regular intervals; each time it is processed it usually brings new records, modifications of previous records or the same records again.

In our example we have several transactional sources, which we will name venta_ropa.csv, venta_accesorios.csv and costs_margin.csv.

What we will do is take the sales sources and group the data based on product_type. At the same time, we need to relate these data to the master source to obtain the descriptive values that will be used to create the name and description columns.

Master data

Master data, or master tables, are data that do not usually change over time and have already-defined values. In our example, these are tables containing the code of each product together with its description.

In our example we will use a single master source, which we will call tipo_producto_detail.csv.

Both transactional and master sources are flat, delimited files: the transactional files carry columns such as NOMBRE, CODIGO_PRODUCTO, FECHA, CANTIDAD and US_VENTAS, while the master file carries codigo, nombre and desc, as used in the scripts below.

Transformations applied to the case

In this case, we will solve the transformations using PySpark, a library for Big Data processing designed to work with Data Lakes and Data Warehouses. It focuses on processing speed, and Spark can be used from Python, Java, R and Scala. In our case, we will use Python as the programming language.

When starting to create scripts, it is normal to first develop exploratory scripts that contain all the logic and tests needed to build the solution; once tested, these are converted into the development version with slight modifications.
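
The snippets in the following sections assume a running SparkSession and the PySpark functions and types they reference; a minimal setup sketch (the application name is arbitrary) would be:

-- CODE language-python --
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, concat, lit, length, when, to_date, regexp_replace
)
from pyspark.sql.types import DateType, IntegerType, DoubleType, StringType

# Entry point for working with DataFrames
spark = SparkSession.builder.appName('report_etl').getOrCreate()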

Read data

To read the data in PySpark we first use read.format() to indicate the type of file we are going to read, then with .option() we select options separately; in this case we use it to define the file delimiter and whether or not the file includes headers. Finally, with .load() we indicate the path where the file is located.

In the example we read only venta_ropa.csv, but the logic is the same for all other sources.

Headers: First row of the table containing the names of each column.

-- CODE language-python --
# We read the file
ropa_df = spark.read.format('csv') \
    .option('delimiter', ';') \
    .option('header', False) \
    .load('/datasets/transactional/venta_ropa.csv')

# Assign the headers (column names match those used in set_correct_data_types below)
ropa_df = ropa_df.toDF(
    'NOMBRE',
    'CODIGO_PRODUCTO',
    'FECHA',
    'CANTIDAD',
    # ... remaining column names elided in the original
)

Correct data types

In this case we create the function set_correct_data_types, which is responsible for standardising the data so that the correct data type is assigned to each column, while also modifying the values according to the data type assigned.

In this code example we correct numeric values whose negative sign appears at the end, badly formatted text values, and date formats.

-- CODE language-python --
def set_correct_data_types(df):
    # Expected type for each column of the transactional file
    columns_types_dict = {
        'FECHA': DateType(),
        'CANTIDAD': IntegerType(),
        'US_VENTAS': DoubleType(),
        'NOMBRE': StringType(),
        'CODIGO_PRODUCTO': IntegerType(),
        # ... remaining columns elided in the original
    }

    for column_name, column_type in columns_types_dict.items():
        if column_type == DoubleType():
            # Numeric values may arrive with a trailing minus sign (e.g. '123.45-'):
            # move the sign to the front before casting.
            column = col(column_name)
            negative_col = concat(
                column.substr(-1, 1),
                column.substr(lit(1), length(column) - 1)
            )
            df = df.withColumn(
                column_name,
                when(
                    column.endswith('-'),
                    negative_col.cast(column_type)
                ).otherwise(column.cast(column_type))
            )
        elif column_type == DateType():
            # Dates arrive as yyyyMMdd strings
            df = df.withColumn(
                column_name,
                to_date(col(column_name), 'yyyyMMdd')
            )
        elif column_type == StringType():
            # Remove stray double quotes from text values
            df = df.withColumn(
                column_name,
                regexp_replace(column_name, '"', '')
            )

    return {'add_correct_data_types': df}

Obtaining master relationships

In our example we have a master source called tipo_producto_detail.csv, which holds the descriptive values used to construct two of the columns of the final report.

Transactional sources such as venta_ropa.csv have a column called codigo_producto which, when related to the codigo column of the master file tipo_producto_detail.csv, allows us to obtain the product name (nombre) and the product description, which we will rename descripcion. The result is that venta_ropa.csv gains two additional columns by being related to the master file.

In the code we use .join to obtain the columns we need, matching the codigo_producto column with codigo.

-- CODE language-python --
def add_tipo_producto_descr(df):
    # dataframes_dict is assumed to hold the DataFrames already read from the staging area
    maestro = dataframes_dict['tipo_producto_detail'].select('codigo', 'nombre')
    # The descriptive column keeps its original name, 'nombre'
    df = df.join(maestro, df.codigo_producto == maestro.codigo, 'left') \
        .drop('codigo')
    return {'add_tipo_producto': df}


def add_descripcion_descr(df):
    maestro = dataframes_dict['tipo_producto_detail'].select('codigo', 'desc')
    # The 'desc' column from the master is renamed to 'descripcion'
    df = df.join(maestro, df.codigo_producto == maestro.codigo, 'left') \
        .withColumnRenamed('desc', 'descripcion') \
        .drop('codigo')
    return {'add_descripcion': df}

Filter to generate new data

Filters are typically used to obtain only the information that is relevant to our objective. However, as we will see in the next case, they can also be used to construct new data.

With withColumn() we can modify the value of a column, or create it if we pass the name of a column that does not exist. Then, with when(), we can add a condition and the value that will be applied if the condition matches. In the example we use it to create the CATEGORY column.

-- CODE language-python --
def add_category_filter(df):
    # Build the CATEGORY column from the product type
    df = df.withColumn(
        'CATEGORY',
        when(
            (df.product_type == 'T-Shirt') |
            (df.product_type == 'Sweatshirt') |
            (df.product_type == 'Cap'),
            lit('Clothing').cast(StringType())
        ).when(
            (df.product_type == 'Eyeglasses') |
            (df.product_type == 'Wallet'),
            lit('Accessories').cast(StringType())
        ).otherwise('')
    )
    # Returned in the same dict style as the other transformation scripts
    return {'add_category': df}

Understanding the outcome

Based on what we have seen above, the transformations can be summarised as follows:

We took two data sources that shared characteristics, venta_ropa.csv and venta_accesorios.csv, applied standardisation processes (the data correction script), obtained new columns by relating a column of the original file to the master source (the master relations script), and generated a new column by matching a condition in a filtering process (the filtering script). Additionally, we could apply the same approach to perform more complex calculations, for example using the transactional table costs_margin.csv to obtain the Forecast values.

Additionally, because the data is now categorised by the CATEGORY column, we could obtain the sum of the numeric columns per category, thus extending the scope and purpose of our transformations.
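
A minimal sketch of that aggregation in PySpark, using the column names that appear in the scripts above:

-- CODE language-python --
from pyspark.sql.functions import sum as spark_sum

def totals_per_category(df):
    # Sum the numeric columns for each value of CATEGORY.
    return (
        df.groupBy('CATEGORY')
          .agg(
              spark_sum('CANTIDAD').alias('TOTAL_CANTIDAD'),
              spark_sum('US_VENTAS').alias('TOTAL_US_VENTAS')
          )
    )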

In the next article in the series we will look in detail at the next step of the ETL flow: loading the data and reconciling the results. We hope this article has been helpful; if you have any remaining questions, or if your organisation needs help with projects of this type, please do not hesitate to contact us.
