Data Science Series: Data Loading and Squaring (3 of 3)

Carlos Brignardello

June 9, 2022

Organisations depend on their data to make intelligent decisions, and that decision making tends to fall to senior leaders and management teams. To be understood, the data must be delivered in a report or a similar format that is easy to interpret. There are many ways to generate reporting; ideally, however, reports should be automated.

About this article

This is the third article in the Data Science series: Part 1 and Part 2

Each article can be read in any order, since the content is split into different stages that, despite being closely connected, can be understood individually. Each post aims to shed light on the processes carried out in the industry, which may help you decide whether your organisation should hire a service to migrate its data to the cloud, or help you learn how this kind of project is developed if you are a student. Some concepts are taken for granted; they are explained throughout the series, so a full read is recommended if any of the key terms sound unfamiliar.

One of the fundamental processes in Data Science projects is data loading. In this article we will look at what the loading processes for visualisation consist of, and how data squaring is performed to ensure that the developed product returns the data the business expects.

Data loading

Data loading is the last step in the ETL flow and makes the data available for data visualisation, which consists of presenting the data in a system in visual form so that customers can make vital decisions. The goal is always to automate the delivery of the data to be visualised, so that the transformation and cleansing tasks run without manual intervention. There are several tools for carrying out data loading and automating all these processes; today, the most important of them is the cloud.

Why might it be in our interest to migrate to the cloud?

The cloud is the technology that allows you to store, manage and provide access to all your business data. Beyond letting organisations avoid investing in their own equipment and data centres to manage their business information, the cloud gives companies access to a wide variety of powerful applications and services that are fundamental to modernising the business, turning technology into a competitive advantage that drives it forward.

Cloud platforms offer a multitude of services that let you extract the full digital potential of your organisation, including tools to build and manage Big Data solutions.

Loading in the cloud

Once the transformation processes have run, the results are sent to the cloud, into a zone of a Datalake Bucket. Since this is the last transformation process, the results are stored in a provision-zone to be uploaded to a cloud Data Warehouse system such as AWS Redshift, or to AWS Glue with Athena via a Crawler. The data then moves from the provision-zone into a partitioned, structured table ready to be queried by a reporting system, enabling data visualisation and Business Intelligence practices. The tools we can use to manage these big data processes include the following:

AWS Glue

This is an AWS service for implementing ETL processes that categorise, cleanse, enrich and move data. It crawls data sources and loads the structured data into a destination. When used with a Crawler and Athena, it builds a catalogue by interpreting the data types. These catalogues, called Data Catalogs, are repositories of metadata describing data formats and schemas.

Glue Crawler

Crawlers analyse the data to create the metadata that allows Glue and services such as Athena to see the information stored in S3 as databases with tables. This is what makes the data available in a structured form through the Glue Data Catalog, which is generated and updated on each execution of the ETL flow.

Athena

It is a data analysis tool used to run complex SQL queries in a short time. It is a serverless service, so it requires no infrastructure administration and you pay only per query executed. With this service, the data can also be queried from platforms such as Tableau.

SQL queries: SQL queries are computational processes in which information is requested from a structured database and the database returns the specified information.
Serverless: Serverless platforms or systems run without a dedicated server permanently enabled; they execute only when consumed, which saves computing power and avoids keeping unused servers running.
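To give an idea of how this looks in practice, the following is a minimal sketch of launching a query against a Glue/Athena table from Python with boto3; the database name, the SQL statement and the results bucket are illustrative assumptions, not values from this project.

import boto3

# Minimal sketch: launch a SQL query against a table catalogued by Glue.
# The database, query and output bucket below are illustrative assumptions.
athena = boto3.client('athena', region_name='us-east-1')

response = athena.start_query_execution(
    QueryString='SELECT CATEGORY, SUM(m3) AS total_m3 FROM sales GROUP BY CATEGORY',
    QueryExecutionContext={'Database': 'provision_zone'},
    ResultConfiguration={'OutputLocation': 's3://my-datalake-bucket/athena-results/'}
)

# The query runs asynchronously; this id lets us poll for the results later.
print('Query started:', response['QueryExecutionId'])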

Redshift

It is a Data Warehouse service in the AWS cloud that allows SQL queries over big data. You only need to create a cluster to start working with the Redshift Data Warehouse. It is designed to work with huge amounts of data and processes it in parallel in order to save time.

Cluster: Clusters consist of one or more servers that work together as one to carry out a process that requires computing power.
Big data: Big data refers to processes in which very large amounts of information are processed.
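As a rough sketch of how a Redshift cluster is queried once it exists: Redshift speaks the PostgreSQL protocol, so a standard driver such as psycopg2 can be used; the host, credentials and table below are placeholders, not real project values.

import psycopg2

# Minimal sketch: query a Redshift cluster through its PostgreSQL-compatible
# endpoint. Host, credentials and table name are illustrative placeholders.
conn = psycopg2.connect(
    host='my-cluster.abc123.us-east-1.redshift.amazonaws.com',
    port=5439,
    dbname='analytics',
    user='awsuser',
    password='********'
)

with conn.cursor() as cur:
    cur.execute('SELECT CATEGORY, SUM(m3) FROM sales GROUP BY CATEGORY;')
    for row in cur.fetchall():
        print(row)

conn.close()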

Tableau

It is a data visualisation tool for data analysis and Business Intelligence. It takes data from queryable systems such as Athena or Redshift and allows you to visualise it with multiple tools and multiple views.

Visualisation: Visualisation consists of taking structured business data and generating graphs or visual representations that allow the business to be understood quickly and simply, so that intelligent decisions can be made.
Business Intelligence: Business intelligence consists of interpreting and analysing business data in the simplest and quickest way, using technology and various processes, so that the business can make vital decisions.

Data squaring

Once data is loaded into a Data Warehouse it is available for consumption. However, whether or not sensitive data has been worked with, squaring tasks must be performed to verify that the final result has not altered the expected values and that the correct figures are being produced.

Squaring tasks are processes whose aim is to match the results the user arrives at with a manual solution against the results produced by the loading or by one of the transformation processes.

In some cases, instead of checking against a result constructed by the user or the client, it may be necessary to build the reference data source manually, taking the data from the ingest area.
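For example, a minimal sketch of reading the raw files back from the ingest area with PySpark to build that reference source; the bucket path, separator and header option are assumptions for illustration.

from pyspark.sql import SparkSession

# Minimal sketch: read the raw ingest files to build a reference source for
# squaring. The path, separator and header option are assumptions.
spark = SparkSession.builder.appName('squaring-reference').getOrCreate()

ingest_df = (
    spark.read
    .option('header', True)
    .option('sep', ';')
    .csv('s3://my-datalake-bucket/ingest-zone/sales/')
)

ingest_df.printSchema()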

Alternatives for squaring

There are many alternatives for performing the squaring. You can use Excel, or Python libraries such as Pandas and PySpark.

We can square the values of some columns by obtaining the total sum or the number of elements; we can also run the same calculations or specific filters on the transformed results and confirm that the same values are reached.

The process would be similar to the following:

  1. We download the original data from the ingest area.
  2. We read the source from a tool.
  3. We perform the calculations, transformations, filters, pivot tables, etc., to arrive at the same results as the script.
  4. We compare both sources and share the results with the team.

It is essential that the data source used for the transformations and the one used for the manual check are the same; otherwise it is difficult to reach the same results.
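As a small illustrative sketch of this process in Pandas (the file names and columns are assumptions), the idea is simply to aggregate both sources in the same way and check that the totals coincide:

import pandas as pd

# Minimal sketch: compare totals from the ingest source against the
# transformed output. File and column names are illustrative assumptions.
source = pd.read_csv('ingest_sales.csv', sep=';')
transformed = pd.read_csv('provision_sales.csv', sep=';')

source_totals = source.groupby('CATEGORY')['m3'].sum()
transformed_totals = transformed.groupby('CATEGORY')['m3'].sum()

# Differences close to zero mean that both sources square.
print((source_totals - transformed_totals).round(3))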

Squaring with Excel

To square in this way, simply open the data source with Microsoft Excel or a similar tool.

We then calculate a set of figures related to the transformation file. For example, if we know that the transformation relates certain data, we should validate it by grouping and filtering that particular data, in addition to the relevant numerical values.

We build a pivot table to validate that the groupings we generate give the same result for the relevant filters and values.

Squaring in PySpark

PySpark is a Python library that allows you to perform Big Data transformations with Spark. Below is a series of scripts that can be used in data squaring processes.

Get grouping sum

To obtain the sum of the values of each grouping we generate, we can use groupBy(), indicating the columns used for grouping followed by the type of calculation to be performed on each group, in this case a sum with sum().

dataframe.groupBy('CATEGORY','PRODUCT_TYPE').sum('m3').sort('CATEGORY').show()

Recall that dataframes are a data structure that can be manipulated in code through specialised data analysis libraries such as Pandas and PySpark.


""" Output
+-----------------+---------------+------------------+
|CATEGORY | PRODUCT_TYPE | sum(m3)|
+-----------------+---------------+------------------+
| null| null| null| 31|
| Clothing| T-Shirt| 556.291|
| Clothing| Sweatshirt| 10154.277|
| Technology| Ipad| 20905.502|
+-----------------+---------------+------------------+
"""

Get grouping count

We can use groupBy() as in the previous case, or we can use filters and count different indicators to check that the values obtained coincide.

count_clothing = df.where(df.CATEGORY=='Clothing').count()
count_technology = df.where(df.CATEGORY=='Technology').count()

print('Clothing:', count_clothing, 'Technology:', count_technology)


""" Output
Clothing: 2314 Technology: 232
"""

Solving squaring errors

It is extremely common that, once checked, the data do not match. This does NOT always mean that the transformations were carried out incorrectly; it can also happen that the data source is different, or that when entering the data into a system to square them we are misinterpreting some numerical values. We will have to identify the cases in which the values do not match and isolate them in order to analyse and contrast them.

Focus on error detection

Next we will correct a squaring error with real examples, in this case a count of values that does not match. A useful way to identify the source of the error is to group by a column used in the transformation, filtering by the value that does not match; this way we can count the values, export the result and contrast it against the file in Excel.


df = df.select('group_column').where(df.PRODUCT_TYPE == 'ROP').groupBy('group_column').count()
df.show()
""" Output
+------------+-----+
|group_column|count|
+------------+-----+
|  32413EH200|    1|
|  30413EH300|    2|
|  30413EH400|    1|
|  33313EH500|    3|
+------------+-----+
"""

# We export the file for comparison in Excel.
df.write.csv('grp_matnr', sep=';')

We create a tab in Excel containing the result of the exported CSV alongside the same calculation done directly in Excel, each with its corresponding columns.

This directly shows us which values differ from our script's output and lets us focus the analysis on a handful of records rather than thousands of them.
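The same side-by-side contrast can also be sketched in Pandas, which leaves visible only the groups whose counts disagree; the file names and columns here are assumptions for illustration.

import pandas as pd

# Minimal sketch: contrast the counts exported by the script against the
# counts calculated manually in Excel. File/column names are assumptions.
script_counts = pd.read_csv('grp_matnr.csv', sep=';', names=['group_column', 'count_script'])
excel_counts = pd.read_csv('excel_counts.csv', sep=';', names=['group_column', 'count_excel'])

merged = script_counts.merge(excel_counts, on='group_column', how='outer')

# Keep only the groups where the two counts do not match.
mismatches = merged[merged['count_script'] != merged['count_excel']]
print(mismatches)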

Transform values that alter columns

On the other hand, we may find that some column values are altering the script's calculations or preventing them from squaring. To correct this we can write scripts that transform the resulting data or the ingested data. In the example below, double quotes in some of the values were identified as the cause of the problem, so the script removes them.


from pyspark.sql.functions import regexp_replace
from pyspark.sql.types import StringType

df = dataframes_dict['main']

# Dictionary with the data type of each column
columns_types_dict = {
    #...
    'product_desc': StringType(),
}

# Iterate over each column declared in the dictionary
for column_name, column_type in columns_types_dict.items():
    ...

    # Remove the double quote from every String column
    elif column_type == StringType():
        df = df.withColumn(
            column_name,
            regexp_replace(column_name, '"', '')
        )

Correct separator in transformation

Sometimes the separator in the output file does not match the one expected by the tool we use for squaring. Fortunately, correcting the separator of the generated document is very simple to do when exporting from the script.


df.write.csv('export', sep=';')

It is also possible to correct the separator by searching for all the places where it appears and replacing it in every case with another one. A tool such as Notepad++ can be very useful for these tasks.
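If we prefer to do that replacement programmatically, a minimal sketch in Python could look like the following; the file names are assumptions, and it naively assumes the old separator does not appear inside the field values themselves.

# Minimal sketch: replace the separator of an exported file.
# File names are assumptions; this assumes the old separator never
# appears inside the field values themselves.
with open('export.csv', encoding='utf-8') as source_file:
    content = source_file.read()

with open('export_fixed.csv', 'w', encoding='utf-8') as target_file:
    target_file.write(content.replace(',', ';'))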

Correct sum of values

Sometimes, when obtaining raw data from management systems such as SAP, we may find values that must be transformed to make them readable from the Data Warehouse and allow them to be squared, converting formats in the following way:

410.500,210 ⇒ 410500.210

Script to correct decimal format


df_double_obj = ['net_weight', 'quantity', 'us$_fob', 'us$_cif', 'us$_flete', 'quantity_aux', 'thickness', 'volume_m3', 'density']

for column_name in df_double_obj:
    if column_name not in data.columns:
        continue
    # Remove the thousands separator and turn the decimal comma into a point.
    data[column_name] = data[column_name].apply(lambda x: str(x).replace('.', '') if ',' in str(x) else x)
    data[column_name] = data[column_name].apply(lambda x: str(x).replace(',', '.') if ',' in str(x) else x)

Sometimes we cannot use the script to square and will have to correct the file manually. For example, in Excel, numerical values with incorrectly formatted decimals prevent the sums from squaring, because the numbers are misinterpreted and the real value of the results is altered.

As in the previous case, it is possible to modify all these types of formatting with replacements in tools such as Notepad++.

