Methodology for Successful Data Projects

JP

June 22, 2020

Many businesses build initiatives around their data by implementing Data Lakes or other technologies that allow them to extract, store and organize as much data as possible for analysis and better decision making. The objective is to better understand the needs of their clients, improve service quality, and predict and prevent outcomes by taking advantage of the knowledge embedded in all this data and information.

With data organized and stored, we can process and analyze current and future scenarios and answer questions such as: what is happening (Descriptive), why it is happening (Diagnostic), what will happen (Predictive), and what actions we should take to optimize and increase efficiency (Prescriptive).


Value Matrix - Complexity in Data Projects - @2020 Kranio.io


The Data Lake is a living system, dynamic and evolving, that receives data from different sources (structured and unstructured) in a variety of formats. The data arrives raw, not optimized or transformed for specific purposes, so it is important to understand the characteristics, regulations and standards this data must comply with before it is consumed by users.

When planning a Data Lake, you should always define its data governance. The first step is to understand what data governance is. There are many definitions; for us, it is the set of processes, policies and tools that ensure the security, availability, usability and quality of the data, guaranteeing that only authorized users can explore and exploit it, that the data is up to date and complete, and that the risk of exposing sensitive and confidential information of people and organizations is avoided.

Deciding how to protect the data, how to maintain its consistency, integrity and accuracy, and how to keep it up to date are the points we cover in this article, and they are the core of Kranio's data process and project methodology.

How do we do it? - The Secret Ingredient 


Methodology for Data Projects - @2020 Kranio.io


The following is a summary of the Kranio methodology applied to the data process, which consists of several stages that are not necessarily sequential:


Preparation

Here the data process begins. Together with the business we adopt an agile framework and lead a design sprint that is organized in terms of time, the people involved and their context, in order to define the KPIs and the iteration cycles for building the product, and to understand the business definitions and rules. We conclude with an initial backlog of activities that will be prioritized during the execution of the project.

Depending on the type of project, we use or propose a framework based on agile methodologies, mostly Scrum or Kanban, that allows us to prioritize tasks over time and define follow-up routines with the business so that all stakeholders have visibility of the product (weekly dashboards are very good for this).

Regarding the form and tools of communication, monitoring and documentation, our vision is agnostic: we adapt to what customers already have, and if they have nothing, we recommend and implement the minimum needed to meet the expectations and success of the project, supporting the establishment of standards for data projects.

The key factors to consider in the preparation phase of a data project are: 

  • Get involved with the needs and expectations of the business; understand what the problems are and what value will be unlocked by implementing the project. What are we doing the project for? What is the real value for the business? Clarity about this allows you not only to capture the business requirements, but also to contribute your experience and make suggestions and value proposals during execution.
  • Identify the interlocutors and stakeholders by clearly establishing the role they play within the project.
  • Identify early the sources of information that are available, both external and internal to the Data Lake. This helps manage expectations better and allows you to raise early warnings so you can prepare and present action plans.

Data Ingestion 

As important as understanding the problem to be solved is understanding the data we have available. Customers have ways of storing and manipulating data that are not necessarily the right ones, or they may not have the necessary platforms to do so. With that in mind, we begin to define and agree with the client on a technological architecture oriented to satisfying the needs of the current business and designed with future usability, scalability and ease of maintenance in mind.



With the initial architecture defined, identify the sources that need to be integrated, the best way to extract the data (tool or technology), and the frequency, and also evaluate whether the data contains structures that identify people (PII) or other confidential data. This is important in order to give it the appropriate treatment before it is stored in an organized and secure way in the format and structure defined for the Data Lake. Analyze the information that is already available, identify what data you can reuse, and avoid extra effort.
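
As an illustration of that early evaluation, the following is a minimal sketch (not an actual Kranio tool) that flags columns whose names suggest personal data; the patterns and column names are hypothetical.

```python
import re

# Hypothetical name patterns that suggest a column holds personal data (PII).
PII_PATTERNS = [r"name", r"email", r"phone", r"address", r"passport", r"birth"]

def flag_pii_columns(columns):
    """Return the column names that look like personal data and need special treatment."""
    return [c for c in columns if any(re.search(p, c.lower()) for p in PII_PATTERNS)]

# Columns discovered in a hypothetical new source.
source_columns = ["customer_name", "email", "purchase_date", "amount"]
print(flag_pii_columns(source_columns))  # ['customer_name', 'email']
```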

Kranio DataOps then starts building the digital products (code) that move data from the different sources into the Data Lake. There is a wide range of tools and services that support the extraction and storage process; however, creating a data pipeline is vital because it helps automate the validation and data-loading steps. It also provides a centralized tool for orchestration and monitoring, incorporating execution tracking, alert generation, error logging and audit trails.
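
As a minimal sketch of what such a pipeline looks like (assuming a plain Python script; in practice the extraction and load steps would use the chosen tools and services), the example below validates the extracted rows, logs every step and raises failures so an orchestrator can turn them into alerts.

```python
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("pipeline")

def extract():
    # Placeholder extraction step: in practice this reads from the source system.
    return [{"id": 1, "amount": 100.0}, {"id": 2, "amount": None}]

def validate(rows):
    # Reject rows with missing mandatory fields and report how many were dropped.
    valid = [r for r in rows if r["amount"] is not None]
    log.info("validation: %d of %d rows accepted", len(valid), len(rows))
    return valid

def load(rows):
    # Placeholder load step: in practice this writes to the Data Lake.
    log.info("loaded %d rows at %s", len(rows), datetime.now(timezone.utc).isoformat())

def run():
    try:
        load(validate(extract()))
    except Exception:
        # The orchestrator (Airflow, Step Functions, etc.) turns this into an alert.
        log.exception("pipeline failed")
        raise

if __name__ == "__main__":
    run()
```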

Learn how to create a simple and robust data pipeline between legacy applications and the Data Lake in this video 

Checklist to ensure at this stage:

Define Standards:

  • Establish the programming languages, code repositories, libraries, cloud services, and the orchestration, monitoring, management and auditing tools.
  • Write parametric code: never leave static values hard-coded inside the programs; use configuration files/tables (see the sketch after this list).
  • Use a standardized nomenclature for buckets and stored files.
  • Define the data storage format in the transformation and consumption layers.
  • If the project requires it, define a Data Model. Ideally this should not be oriented to a single specific requirement; we must think about scalability. A robust model lays the foundations for future requirements, not only the specific one of the current project.
  • Products must have validation points to ensure that, apart from finishing correctly, correct and consistent data will be generated.
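
A minimal sketch of the parametric-code idea, with invented bucket names and settings: paths and formats come from a configuration document, so nothing environment-specific is hard-coded in the program.

```python
import json

# Hypothetical configuration: in a real project this would live in a separate
# file or parameter table, not inside the program.
CONFIG = json.loads("""
{
  "target_bucket": "curated-zone-example",
  "output_format": "parquet",
  "partition_key": "load_date"
}
""")

def build_target_path(dataset_name, load_date):
    """Compose a standardized storage path from configuration instead of hard-coded values."""
    return (f"s3://{CONFIG['target_bucket']}/{dataset_name}/"
            f"{CONFIG['partition_key']}={load_date}/data.{CONFIG['output_format']}")

print(build_target_path("sales", "2020-06-22"))
# s3://curated-zone-example/sales/load_date=2020-06-22/data.parquet
```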

Create monitoring and auditing processes:

  • Provide traceability of all executions (successful and unsuccessful).
  • Record all data capture, transformation and output actions.
  • Provide enough information to minimize diagnostic time when something goes wrong.
  • Provide a centralized and easily accessible log repository that helps us get to the solution of a problem quickly (a sketch of such structured audit records follows this list).
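
One way to satisfy these points is to emit every action as a structured record that a centralized log repository can index. The sketch below is illustrative only; the field names are invented, and a real project would ship these records to its chosen logging service.

```python
import json
import logging
import uuid
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("audit")

def audit_event(run_id, step, status, **details):
    """Emit one structured audit record per action (capture, transformation, output)."""
    record = {
        "run_id": run_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "step": step,
        "status": status,
        "details": details,
    }
    log.info(json.dumps(record))

run_id = str(uuid.uuid4())
audit_event(run_id, "extract", "success", rows_read=15000)
audit_event(run_id, "transform", "failed", error="schema mismatch in column 'amount'")
```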

Ensure the Quality of the products:

  • The delivery of the products must include evidence of the respective quality controls carried out. Guarantee not only that a process is executed, but that it is executed well and with the expected result.
  • Generate evidence of the correctness of the generated data, recording what was correct and under which scenario and conditions it was generated.
  • Reliable data, with self-correcting validations and high-coverage monitoring, prevents errors or mismatches from undermining the credibility of the digital product. Your best ally is delivering certified work: free of errors, consistent and guaranteed (a minimal validation sketch follows this list).
  • Products designed for operational continuity, with all the resources available for easy delivery and control by the Operations team.
  • Automated end-to-end products avoid manual interventions that jeopardize operational continuity.
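
As a minimal illustration of such a validation point (the checks and field names below are invented), a small function can return the evidence that accompanies the delivery: whether the dataset passed, and which checks failed.

```python
def check_dataset(rows, expected_min_rows, key_field):
    """Run simple consistency checks and return evidence to attach to the delivery."""
    issues = []
    if len(rows) < expected_min_rows:
        issues.append(f"row count {len(rows)} below expected minimum {expected_min_rows}")
    keys = [r[key_field] for r in rows]
    if len(keys) != len(set(keys)):
        issues.append(f"duplicate values in key field '{key_field}'")
    if any(r.get("amount") is not None and r["amount"] < 0 for r in rows):
        issues.append("negative amounts found")
    return {"passed": not issues, "issues": issues, "rows_checked": len(rows)}

sample = [{"id": 1, "amount": 100.0}, {"id": 2, "amount": 25.5}]
print(check_dataset(sample, expected_min_rows=1, key_field="id"))
```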

All of this, besides certifying the work, ensures that every implementation follows the same guidelines and way of doing things, which reduces construction time and improves the quality, clarity of understanding, and traceability of each part of the process.

Data processing and enrichment 

Once the data in the Data Lake is secure and organized, we establish within the framework the procedures and transformations needed to move from data stored in raw form to information usable by customers.

Since data is obtained from multiple sources that may be unreliable, it is vital to have a process for analyzing its quality and usability. This process can start out manual, but it must end up automated with tools. The role of the architects and data engineers is vital: they help us evaluate and detect whether more information is required to define a reliable dataset for the required analyses, identify data that is missing from the dataset that is fundamental to the project, and eliminate incorrect, duplicated or incomplete data.
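
The pandas sketch below illustrates that kind of analysis on an invented extract: profile the data first (row count, duplicates, missing values), then remove duplicated and incomplete records.

```python
import pandas as pd

# Invented raw extract with the typical problems: duplicates and missing values.
raw = pd.DataFrame({
    "customer_id":   [1, 1, 2, 3, None],
    "purchase_date": ["2020-06-01", "2020-06-01", "2020-06-02", "2020-06-03", "2020-06-03"],
    "amount":        [100.0, 100.0, None, 50.0, 30.0],
})

# Profile the quality of the extract before deciding how to treat it.
profile = {
    "rows": len(raw),
    "duplicates": int(raw.duplicated().sum()),
    "nulls_per_column": raw.isna().sum().to_dict(),
}
print(profile)

# Eliminate duplicated and incomplete records to build a reliable dataset.
clean = raw.drop_duplicates().dropna(subset=["customer_id", "amount"])
print(clean)
```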

In this process, specialists apply the best practices of data (Big Data) projects and organize the information into the categories and classifications the data allows. Each subset is analyzed independently, and transformations are made to structure or enrich the data: adding new columns, calculating data derived from existing data, or incorporating information from other external sources. The final data can be made available to the user through a relational model in a data warehouse, a view for data access, or a consumption file.
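
A small hypothetical example of such an enrichment in pandas: a calculated column derived from existing data, plus a join against an external reference table.

```python
import pandas as pd

# Cleaned transactions plus an external reference table (invented data).
transactions = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "amount":      [100.0, 25.5, 50.0],
    "quantity":    [2, 1, 5],
})
segments = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "segment":     ["retail", "corporate", "retail"],
})

# New calculated column from existing data, then enrichment from the external source.
enriched = (transactions
            .assign(unit_price=transactions["amount"] / transactions["quantity"])
            .merge(segments, on="customer_id", how="left"))
print(enriched)
```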

Taking advantage of the "intelligence" provided by the data is the role of the data scientist, who analyzes simulated scenarios to make predictions and applies mathematical modeling, statistical analysis and machine learning techniques (predictive analysis, clustering, regression, pattern recognition) to produce new data that enriches the project and gives more value to customers.
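
For instance, a clustering step like the one sketched below (scikit-learn on invented customer features) produces a segment label that can be added as a new column and served back to the business.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented customer features: monthly spend and purchase frequency.
features = np.array([
    [120.0, 4], [95.0, 3], [30.0, 1], [25.0, 1], [400.0, 12], [380.0, 10],
])

# Group customers into three segments; the labels become a new enrichment column.
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(features)
print(model.labels_)
```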

Information security and proper user access to the Data Lake must be established in the data governance, identifying ownership, forms of access, security rules for confidential data, data history, data sources and more.
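
A minimal sketch of one such security rule, with hypothetical roles and columns: confidential values are returned in clear text only to authorized roles, otherwise they are masked.

```python
# Hypothetical rules: which roles may read each confidential column in clear text.
ACCESS_RULES = {
    "customer_email": {"data_steward", "marketing_lead"},
    "customer_name":  {"data_steward"},
}

def read_column(column, value, role):
    """Return the value only for authorized roles; otherwise return a masked placeholder."""
    allowed = ACCESS_RULES.get(column)
    if allowed is None or role in allowed:
        return value
    return "***MASKED***"

print(read_column("customer_email", "ana@example.com", role="analyst"))       # ***MASKED***
print(read_column("customer_email", "ana@example.com", role="data_steward"))  # ana@example.com
```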

Exploitation, consumption and operational continuity 

The last aspect is how users will use the data. For the consumption and exploitation of the generated data, the focus must be on the business users' needs; with them we co-design a solution proposal that meets the requirements and parameters defined in the preparation phase. This work includes reviewing the requirements defined during project preparation and selecting a platform appropriate for their visualization needs, such as Tableau, Power BI, AWS QuickSight or another. You must create a storyboard for the different categories of users and prepare a customized, efficient design that makes the data being presented easy to understand.

Good data discovery and exploration work lays the foundation for the goal of a self-service data visualization platform, where users get insights to improve management and decision making intelligently, supported by reliable information. For example, see how a company "listens" to what its customers say on forms, contact centers and social networks and uses it to improve customer service.

The quality and reliability of the data is of vital importance: if bad data reaches the platform, there will be errors in the visualizations, and the dashboards and generated files will not provide the information the client requires. You must emphasize this point, since this analysis accounts for 70% of the total time required to create a good dashboard.

Not everything ends with the delivery of a dashboard or file to customers; you must also think about the future, i.e. consider other aspects that are important:

  • Operational continuity of the digital products developed. Most of the time, operation and monitoring in the production environment will be the customer's responsibility, so the focus is on making those people's lives easier: minimizing monitoring time, minimizing the time it takes to find the cause of a problem, and minimizing the time to resolve a failure.
  • Scalability of the solution in all its components, such as infrastructure and architecture, together with the tools used during development, so it can grow as the business requires.
  • Easy traceability of eventual problems, so you can quickly find the relevant information, reach an initial diagnosis and prepare a correction plan.
  • Minimize the complexity of processes to simplify future adjustments or improvements.
  • All projects include complete documentation to facilitate collective understanding. 
  • Avoid using a multiplicity of tools that in the long run do not give value to the business. 
  • Create channels of communication with the client that allow an effective and clear handover of everything that has been done. The more clarity the client has about all the good work that was done, the higher the degree of satisfaction; always consider the profile of the audience (operations area, business and others).

Conclusion 

Ensure success in the design, implementation and development of a data project by applying a data project methodology. In this article we showed you the Kranio methodology, applied and refined in dozens of data projects in several countries. If you apply this methodology, you have a better chance of meeting expectations and avoiding errors that can ruin the project.

Another fundamental aspect you must ensure is the participation of the business users. They are the ones who consume the data to improve decisions. From the beginning and at every stage, robust, aligned teams that communicate well deliver better projects.

Do you want to review your current methodology or implement this methodology for successful projects?

Contact us and receive articles like this in your inbox.
