Data Pipelines with SQL and Sqoop

Team Kranio

December 18, 2020

In this article we review how to transfer data from a relational database with limited permissions to a file system free of those restrictions, using a solution we call Skoop-IO.

Companies need to extract large amounts of data from different databases, then transform and enrich it so that it generates added value for the business.

The extraction process is often complex to develop because, in some cases, the data provider grants access only to views and, in general, limited access to the database. This makes sense for security and as good practice. So how do we extract the data when providers grant only limited access to their databases, what alternatives do we have, and how do we build the extraction so that it works independently of the database engine? Read on.



Basic concepts

Before we start, we will review some concepts related to big data and the tools Apache Sqoop, Apache Hadoop, AWS EMR and AWS DynamoDB.

Big data: A term that describes the large volume of data, both structured and unstructured, that floods businesses every day. But it is not the amount of data that is important; what matters with big data is what organizations do with it. Big data can be analyzed for insights that lead to better strategic business decisions and moves. Source: https://www.powerdata.es/big-data

Apache Sqoop: A tool designed to efficiently transfer bulk data between Apache Hadoop and structured data stores such as relational databases, in either direction. Source: https://sqoop.apache.org/

Apache Hadoop: An open-source framework that allows distributed storage and processing of large data sets on commodity hardware. Within Apache Hadoop, HDFS (the Hadoop Distributed File System) stores large amounts of data with horizontal scaling, while MapReduce handles the processing of that data. Source: https://blog.powerdata.es/el-valor-de-la-gestion-de-datos/bid/397377/qu-es-el-apache-hadoop

AWS EMR: Amazon EMR is a managed cluster platform that simplifies the execution of Big Data frameworks such as Apache Hadoop and Apache Spark on AWS to process and analyze large amounts of data. Source: https://docs.aws.amazon.com/es_es/emr/latest/ManagementGuide/emr-what-is-emr.html

AWS DynamoDB: A non-relational key-value database. Source: https://aws.amazon.com/es/dynamodb/

Now that we know these concepts in general terms, let's look at how we use these tools in a solution we call "Skoop-IO".

There are different database extraction alternatives for big data, but the one that best fits our requirements is Sqoop, because it is a tool specialized in data extraction and it is also open source. We built Skoop-IO on top of the Sqoop extraction tool.
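As a rough illustration of what a Sqoop extraction looks like when launched from Python (every connection value below is a placeholder, not something taken from an actual project):

```python
import subprocess

# Minimal Sqoop import: pull one table from a relational source into
# Avro files on S3. All connection details are placeholders.
subprocess.run(
    [
        "sqoop", "import",
        "--connect", "jdbc:mysql://db-host:3306/sales",        # placeholder JDBC URL
        "--username", "reader",
        "--password-file", "hdfs:///user/hadoop/db.password",  # keeps the secret off the command line
        "--table", "orders",
        "--target-dir", "s3://my-bucket/raw/orders/",          # placeholder bucket
        "--as-avrodatafile",
        "--num-mappers", "4",
    ],
    check=True,
)
```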

Skoop-IO is a program capable of extracting records from different database engines, regardless of their limitations. The processing is automatic; the only manual step is configuring the credentials and the load type (full, partial or incremental).
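As a sketch of what that configuration could look like when stored per table (the attribute names below are illustrative, not the actual Skoop-IO schema):

```python
# Hypothetical configuration item for one source table. Attribute names
# and values are illustrative placeholders.
table_config = {
    "table_name": "orders",                              # source table or view
    "jdbc_url": "jdbc:oracle:thin:@db-host:1521/ORCL",   # placeholder connection string
    "secret_id": "skoop-io/orders/credentials",          # credentials kept outside the item
    "load_type": "incremental",                          # full | partial | incremental
    "check_column": "updated_at",                        # date column used by partial/incremental loads
    "target_path": "s3://my-bucket/raw/orders/",         # placeholder destination bucket
}
```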

Types of loads

Currently Skoop-IO has three load types, fully configurable, for importing data to an S3 bucket in AWS (a sketch of how each maps to Sqoop arguments follows the list).

Full: Gets all the data from the database and leaves it in the S3 bucket in Avro format.

Partial: Gets the data from the last configured number of months, filtering on a date-type field.

Incremental: Gets only the records that do not yet exist in our S3 storage on each Skoop-IO run.
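As a sketch of how these load types can translate into Sqoop arguments (assuming Skoop-IO ultimately builds a Sqoop command; the column names, paths and configuration shape here are illustrative):

```python
def sqoop_args(cfg: dict, last_value: str) -> list:
    """Translate a load-type configuration into Sqoop import arguments (illustrative)."""
    base = [
        "sqoop", "import",
        "--connect", cfg["jdbc_url"],
        "--table", cfg["table_name"],
        "--target-dir", cfg["target_path"],
        "--as-avrodatafile",
    ]
    if cfg["load_type"] == "full":
        # Full: import every row of the table.
        return base
    if cfg["load_type"] == "partial":
        # Partial: re-import only the rows whose date column falls inside
        # the configured window; a free-form query replaces --table.
        return [
            "sqoop", "import",
            "--connect", cfg["jdbc_url"],
            "--query",
            f"SELECT * FROM {cfg['table_name']} "
            f"WHERE {cfg['check_column']} >= '{last_value}' AND $CONDITIONS",
            "--split-by", cfg["check_column"],
            "--target-dir", cfg["target_path"],
            "--as-avrodatafile",
        ]
    # Incremental: only rows newer than the last value already stored in S3.
    return base + [
        "--incremental", "lastmodified",
        "--check-column", cfg["check_column"],
        "--last-value", last_value,
        "--append",
    ]
```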

Architecture

To build Skoop-IO we use AWS resources: EMR as the cluster that runs the program, DynamoDB for the data import configuration, and S3 to store the imported data.
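Putting the pieces together, here is a minimal sketch with boto3 (the cluster ID, DynamoDB table name and configuration attributes are placeholders, and it reuses the sqoop_args helper sketched above): the configuration is read from DynamoDB and the resulting Sqoop command is submitted as a step to a running EMR cluster, with the imported Avro files landing in S3.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
emr = boto3.client("emr")

def run_import(table_name: str, cluster_id: str) -> str:
    # 1. Read the import configuration for this table from DynamoDB
    #    ("skoop-io-config" is a placeholder table name).
    cfg = dynamodb.Table("skoop-io-config").get_item(
        Key={"table_name": table_name}
    )["Item"]

    # 2. Build the Sqoop command and submit it to the EMR cluster as a
    #    step executed through command-runner.jar.
    step = {
        "Name": f"skoop-io import {table_name}",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": sqoop_args(cfg, last_value=cfg.get("last_value", "1970-01-01")),
        },
    }
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])

    # 3. The imported Avro files end up in the S3 path from the configuration.
    return response["StepIds"][0]
```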




Thanks to big data tools and cloud resources (AWS in this case), we were able to quickly implement a stable solution for relational data extraction, independent of the database engine and of the limited access granted by providers.
