Data Science Series: Data and Roles (1 of 3)

Carlos Brignardello

March 4, 2022

Data Science projects are developments in which data is extracted from various sources, then manipulated and visualised for analysis.

To build these projects, the client's business and the data they possess must be understood, so that the resulting solution delivers value to the organisation and supports decision-making.

About this series and article

This is the first article of the series "Data Science". Each article can be read in any order, as the content is divided into stages that, although strongly connected, can be understood individually. Each publication aims to shed light on the processes carried out in the industry; it may help you decide whether your organisation should hire a service to migrate its data to the cloud, or, if you are a student, show you how the development of this type of project works. In this first part we will talk about the value of data and the roles that clients, users and developers play in data science projects.

The data

Data is all information that is useful to a business, and organisations today can access a great deal of it. This encompasses internal organisational data, external customer data and external industry or competitor data. Companies that have digitised their operations generate data that can be captured, processed and analysed.

In order to work with data it is first necessary to store it, and for this we have several alternatives. Cloud computing services such as Google Cloud Platform or Amazon Web Services (among others) are extremely efficient and cost-effective, as each provides a variety of services that help us store data efficiently and securely.
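As a concrete illustration of this kind of storage, the following sketch uploads a local file to an Amazon S3 bucket with the boto3 library. It assumes AWS credentials are already configured; the bucket and file names are hypothetical.

```python
# Minimal sketch: storing a local data file in Amazon S3 with boto3.
# Assumes AWS credentials are configured (environment variables or
# ~/.aws/credentials); bucket and file names are hypothetical.
import boto3

s3 = boto3.client("s3")

s3.upload_file(
    Filename="sales_2022.csv",        # local file to store
    Bucket="my-company-data",         # hypothetical bucket name
    Key="raw/sales/sales_2022.csv",   # destination path inside the bucket
)
```

Google Cloud Storage offers an equivalent client library, so the same idea applies regardless of the provider chosen.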

The value of data

In order to derive value from data we must capture, store and structure it in a way that allows business decisions to be made. Data can be used not only to analyse past or current situations, but also to make predictions and take intelligent actions. This means that after capturing the data you must find a way to get real value out of it.

Once we have captured, identified or enabled a data source, we must store it. We can distinguish two types of storage systems, which we explain below.

Data Warehouse vs Data Lake

Both data warehouses and data lakes aim to solve the same problem: storing large amounts of data. Their main difference is that data lakes are designed to store raw data, while a data warehouse stores structured information that has already been filtered and processed into the schema used by the warehouse.
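To make the contrast concrete, the sketch below shows the typical access pattern of each system: a data lake holds raw objects that are listed and read as files, while a warehouse exposes already-structured tables queried with SQL. The bucket, table and column names are hypothetical.

```python
# Minimal sketch: typical access patterns of a data lake vs. a data warehouse.
# Bucket, table and column names are hypothetical.
import boto3

# Data lake: raw files of any format, addressed by object key.
s3 = boto3.client("s3")
listing = s3.list_objects_v2(Bucket="my-company-lake", Prefix="raw/sales/")
for obj in listing.get("Contents", []):
    print(obj["Key"])  # e.g. raw/sales/2022-03-01.json, raw/sales/report.pdf

# Data warehouse: structured, pre-processed tables queried with SQL.
# The query would be submitted through the client library of whichever
# warehouse product is in use (BigQuery, Redshift, etc.).
query = """
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region;
"""
```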

Structured vs. unstructured data

When storing data, we can find two formats:

Structured data: This is highly organised data, such as customer records and other tabular data, and it tends to be quantitative. The advantage of this format is that it can be easily stored and managed in databases. This type of data is produced by building models and structures that allow it to accumulate in an orderly fashion. Structured information is stored in Data Warehouses.

Unstructured data: This is data that is not organised; it tends to be qualitative, contains undefined information and comes in many formats. Examples include images, audio and PDF files. This type of information is stored in Data Lakes.

Quantitative: Information that can be measured.
Qualitative: This is all information that cannot be measured and for which measurement scales or models must be created.

Below is a figure describing the differences between structured and unstructured data.

[Figure: structured vs. unstructured data. ©Kranio SPA]
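As a small illustration of the two formats, the sketch below loads a structured CSV file into a table and reads an unstructured document as raw bytes; the file names are hypothetical.

```python
# Minimal sketch: structured vs. unstructured data (file names are hypothetical).
import pandas as pd

# Structured: tabular customer records with a fixed schema, easy to store
# and manage in a database.
customers = pd.read_csv("customers.csv")   # e.g. columns: id, name, country, total_spent
print(customers.dtypes)

# Unstructured: the raw bytes of a scanned PDF, with no predefined schema.
with open("invoice_scan.pdf", "rb") as f:
    raw_document = f.read()
print(len(raw_document), "bytes of unstructured content")
```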

Both formats can be used to obtain results and make intelligent decisions. However, unstructured historical data is much more difficult to analyse; with the right cloud tools you can still extract value from it, for example by using APIs to create structure.

Historical data: This is the information that organisations generate over the years. It is usually in disarray and comes from various sources.
APIs: These are tools that allow two different systems to integrate or communicate.
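As an illustration of using an API to create structure, the sketch below passes a piece of unstructured text to Amazon Comprehend, which returns the entities it detects as structured records that could then be stored in a table. The input text is hypothetical and AWS credentials are assumed to be configured.

```python
# Minimal sketch: using a cloud API (Amazon Comprehend) to extract structure
# from unstructured text. Assumes configured AWS credentials; the input text
# is hypothetical.
import boto3

comprehend = boto3.client("comprehend")

unstructured_text = "ACME Ltd. opened a new warehouse in Santiago in January 2021."

response = comprehend.detect_entities(Text=unstructured_text, LanguageCode="en")
for entity in response["Entities"]:
    # Each detected entity becomes a structured record: a type and a value.
    print(entity["Type"], "->", entity["Text"])
# e.g. ORGANIZATION -> ACME Ltd., LOCATION -> Santiago, DATE -> January 2021
```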

Roles

To carry out a project, there must be effective communication between the three main roles in a data project: the client, the user and the development team.

The Client

The client plays a fundamental role in this type of project: when working with an organisation's data, it is imperative that the development team understands, throughout the construction of the project, how the client company works and how it works with its data. In essence, it is a matter of understanding the business logic.

This understanding of the business is built hand in hand by the developers and the client. The development team must ensure that all its doubts about how the business works and how the data is used are resolved; the client, in turn, must be able to resolve those doubts, and this makes the difference in obtaining a good result.

Meetings with the client

The first thing to do at the start of the project is to hold a series of meetings with the client to understand their expectations and their workflow, in order to define the solution they need. The first meetings between the client and the developers are called "survey meetings", and all subsequent meetings are called "understanding meetings". Both are intended to get to the heart of the problem to be solved and, ideally, to determine together what value the data has for the business. However, the real focus of these meetings is to understand the business logic (hence the name). This process encompasses knowing how the data is obtained, what manual processes are carried out with it, how it is presented and, ultimately, how it is expected to be viewed or accessed. In short, the objective of the understanding meetings is to trace the data that makes up the business logic.

During project development it is recommended that the client designates a Product Owner, so that communication with the client is even more effective and agile. At the same time, a Product Owner working with the development team can ensure, as far as possible, that the team's efforts are aimed directly at what the client is looking for, reducing discarded work or time invested in developments that are later modified or dropped because they are far from the client's needs.

In agile projects, the Product Owner is the team member belonging to the client organisation who supports the developers and the Scrum Master in driving the project forward in line with the vision and requirements of their own organisation.

Automated data intake from the client

For the correct development of the project, the client must ensure that the development team has sufficient data available to build the logic.

The "upload" of these files is usually stored in ingest zones in a cloud computing service, such as S3 from AWS.

Ingest zones are the directories or spaces within a Data Lake where the data that will enter the pipeline, or data flow, is stored. They are used by the development team to test and build the expected result.
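As a hypothetical example of how such a zone might be fed, the sketch below uploads a new daily extract into a date-partitioned prefix of the data lake bucket; the bucket name and prefix layout are assumptions, not a fixed convention.

```python
# Minimal sketch: delivering a new daily extract into an ingest zone of the
# data lake. The bucket name and prefix layout are hypothetical.
from datetime import date
import boto3

s3 = boto3.client("s3")
today = date.today().isoformat()  # e.g. "2022-03-04"

# Partitioning the ingest zone by source and date lets each run of the
# pipeline pick up only the new files.
s3.upload_file(
    Filename="daily_sales_extract.csv",
    Bucket="my-company-lake",
    Key=f"ingest/sales/{today}/daily_sales_extract.csv",
)
```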

However, this process is not limited to development: for the solution's data flow to operate correctly, new data must be supplied to these ingest zones periodically. Generally, the frequency with which this data is uploaded is directly related to how often the whole process is triggered, at least when working with a serverless project.

A serverless project is one that runs, and therefore consumes resources, only when required.
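To make the serverless idea concrete, here is a minimal sketch of an AWS Lambda handler that runs only when a new object lands in the ingest zone, assuming an S3 event notification is configured to trigger it; the processing step is left as a placeholder.

```python
# Minimal sketch: a serverless (AWS Lambda) handler that runs only when a new
# file arrives in the ingest zone. Assumes the bucket is configured to send
# S3 event notifications to this function; the processing logic is a placeholder.
def handler(event, context):
    # An S3 event carries one record per created object.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"New file in ingest zone: s3://{bucket}/{key}")
        # Here the pipeline would read the file and start its transformations.
    return {"status": "ok"}
```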

The user

When developing the solution, it should always be built with the user in mind. For a data project, a user can range from the management of an area and its supervisors to a worker who has been generating reports for some time and who has already built the solution manually. This solution will now be moved to the cloud to automate their work.

The development team needs users in order to understand the solution to be developed, especially when the solution already exists and is to be moved to a cloud service.

In order to ensure that the solution is consistent with what the user is looking for, the user should have a meeting directly with the development team. In this way the team that will develop the solution will have all the context, so that they can understand the data model and how the necessary metrics are obtained. As with the understanding meetings, this is an iterative process in which all the details related to the data, specifically the flow of the data, must be resolved.

Development team

The development team is made up of professionals from various IT disciplines and is in charge of moving the solution forward.

To give an example, we can find professionals who fulfil roles such as Data Engineer, Data Ops, Dev Ops, Cloud Engineer and Data Analyst.

The context of the project changes throughout development and the team must always be aware and focused. Feedback from both users and customers allows developers to build a deliverable that meets the expectations of both roles. We can synthesise the concepts seen in this last section using the following image:

[Figure: roles in a data project. ©Kranio SPA]

In the next article of the series we will see in detail what an ETL flow is and how data is extracted and transformed. We hope this article has been helpful; if you have any questions, or your organisation needs support with projects of this type, do not hesitate to contact us.
