According to this Gartner report, the cost of poor data quality is $15 million per year on average for an organization. Not only does bad data quality impact an organization's revenue, it also hinders the organization from becoming more data-driven. Data teams must have a solid and robust strategy to make sure that data products are trusted by the stakeholders, especially since this trust ultimately increases the ROI of data teams and makes them a crucial component in the success of the organization.
Therefore, focusing on data observability and data quality becomes one of the most critical tasks for a data and analytics team. Data Observability is defined by Validio in their whitepaper as
“The degree to which an organization has visibility into its data pipelines. A high degree of Data Observability enables data teams to improve data quality.”
Good data quality is the end product of the data observability process, but what does data quality mean?
Validio defines data quality as
“The extent to which an organization’s data can be considered fit for its intended purpose. Data quality should be observed along five dimensions: freshness, volume, schema, (lack of) anomalies, and distribution. It is always relative to the data’s specific business context.”
This article explores the role a Data Quality (DQ) Engineer can play in solving data quality-related challenges. I also address why a DQ Engineer is needed, what value they bring to an organization, what their responsibilities could be, and the required skills for succeeding in this role. I’ll also discuss key stakeholders that a DQ Engineer should work with.
Why the Data Quality Engineer is Needed
A search on LinkedIn for the Data Quality Engineer job profile shows perhaps two to three openings for this role, even though 46% of participants in a survey conducted by the data transformation tool dbt said that they want to invest more in data quality and data observability. So why are data teams not hiring Data Quality Engineers?
There are several answers to this question:
- While data quality as a concept is not new, the industry is only recently realizing its importance, after seeing that predictive and descriptive data engineering initiatives have a very low business impact because of bad data.
- Many people perceive data quality as a boring and mundane topic. It is not as exciting as building machine learning models for predictive analysis, and data quality is usually an afterthought. In contrast to ML teams that have clear models and data analytics teams that have specific dashboards, data quality work usually lacks a tangible output.
- Resource constraints: Small companies or startups usually lack the budget to hire a dedicated Data Quality Engineer. In such cases, data quality responsibilities might be assigned to data analysts, developers, or other existing roles.
Nevertheless, Data Quality Engineers are becoming increasingly important, especially during the current challenging economic times. Let's explore why this is the case.
The data pipeline is the starting point of the data life cycle for most data products, such as machine learning models, BI dashboards, and reverse ETLs. Data Engineering teams should pay considerable attention to making sure that data quality is high at this stage. Otherwise, poor quality affects all the data products that consume this data later.
Data Quality Engineers who can implement a robust data quality assurance process are game changers for organizations that rely on data. High-quality data products can enable organizations to experience faster adoption of these products, which in turn also increases trust among their stakeholders.
The Responsibilities of the Data Quality Engineer
In this section, I will define the core focus areas of a Data Quality Engineer, so that we can differentiate this role from the other roles within the data industry, such as data engineer, data analyst, data scientist, and analytics engineer. The Data Quality Engineer should validate the data flow as a whole, from data sources to data consumers, such as BI dashboards and ML models.
- A Data Quality Engineer should identify metrics that are used to measure data quality. Ideally, all other data stakeholders should be involved in this process, including business users with relevant domain knowledge.
- A Data Quality Engineer should focus on data products or datasets with a high revenue impact on the business. Doing so leads to a high return on investment for the data team, and increases the impact they have on the business. This in turn sets them up for success.
- A Data Quality Engineer should also identify or establish validations that cover both functional and domain-specific data quality dimensions. There are many ways to put dimensions of data quality into buckets and categories. One such way is covered by Validio in this whitepaper. There, they list the data quality dimensions as: data freshness, adherence to the defined schema, data volume, lack of data anomalies, and expected distribution of data points in a dataset.
- When it comes to the domain-specific category of testing, the data product is tested against business rules derived from the requirements of the data product. Here we emphasize making sure that we are building the right product to address the needs of our business users. For example, if a data product is a report showing the number of impressions and clicks received on an advertisement placed on a YouTube video, we also need to make sure that all the click and impression events are captured correctly and reflected in the report for a particular advertisement.
- A Data Quality Engineer should monitor data quality metrics, perform root cause analysis of data quality issues that lead to data downtime, and take appropriate steps to address these issues. For example, if the data freshness of a particular data source is identified as an issue that leads to significant data downtime, then the relevant data source owner should be contacted, and a data contract with associated validators put in place to minimize these data freshness issues.
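The functional and domain-specific validations described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the column names (`ad_id`, `event_type`, `event_time`), the thresholds, and the shape of the report are hypothetical, and a dedicated data observability tool would normally handle these checks at scale.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical schema for an ad-events dataset (illustrative only).
EXPECTED_SCHEMA = {"ad_id": str, "event_type": str, "event_time": datetime}


def check_freshness(rows, now, max_age):
    """Freshness: the newest event must be no older than max_age."""
    if not rows:
        return False
    newest = max(row["event_time"] for row in rows)
    return now - newest <= max_age


def check_volume(rows, min_rows, max_rows):
    """Volume: the row count must fall inside an expected range."""
    return min_rows <= len(rows) <= max_rows


def check_schema(rows, schema):
    """Schema: every row has exactly the expected columns and types."""
    return all(
        row.keys() == schema.keys()
        and all(isinstance(row[col], typ) for col, typ in schema.items())
        for row in rows
    )


def check_report_matches_events(rows, report):
    """Domain rule: per-ad click and impression counts in the report
    must equal the counts derived from the raw events."""
    counts = {}
    for row in rows:
        key = (row["ad_id"], row["event_type"])
        counts[key] = counts.get(key, 0) + 1
    return counts == report


# Made-up sample data: three raw events and the report built from them.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
events = [
    {"ad_id": "ad_1", "event_type": "impression", "event_time": now - timedelta(minutes=30)},
    {"ad_id": "ad_1", "event_type": "impression", "event_time": now - timedelta(minutes=20)},
    {"ad_id": "ad_1", "event_type": "click", "event_time": now - timedelta(minutes=10)},
]
report = {("ad_1", "impression"): 2, ("ad_1", "click"): 1}

results = {
    "freshness": check_freshness(events, now, max_age=timedelta(hours=1)),
    "volume": check_volume(events, min_rows=1, max_rows=1000),
    "schema": check_schema(events, EXPECTED_SCHEMA),
    "report_matches_events": check_report_matches_events(events, report),
}
print(results)
```

Each check returns a plain boolean, so the results can feed a dashboard or an alert. In practice, the thresholds would come out of the conversation with the data source owner and could be written down in the data contract.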
Skillsets of the Data Quality Engineer
Now that I’ve covered the need for, and the responsibilities of, the DQ Engineer, it’s time to talk about the skillsets. They could include the following:
Data management
A strong understanding of data management concepts, including data profiling, data mapping, and data integration. This helps to establish data quality checks and identify data quality issues.
Data quality assessment
The skills to perform data quality assessments help with identifying inconsistencies, inaccuracies, and incompleteness in data products.
Data analysis
Strong analytical skills to identify patterns and trends in data, as well as data quality issues.
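As one small illustration of this kind of analysis, the sketch below flags unusual values in a series, such as daily row counts, using the modified z-score based on the median and the median absolute deviation (MAD). The function name, the sample numbers, and the 3.5 cut-off are assumptions chosen for the example. The technique is one common option rather than the method of any particular tool.

```python
from statistics import median


def find_anomalies(values, threshold=3.5):
    """Return values whose modified z-score exceeds the threshold.

    Uses the median and the median absolute deviation (MAD), which are
    less distorted by outliers than the mean and standard deviation.
    """
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return []  # No spread to compare against, so nothing is flagged.
    return [v for v in values if abs(0.6745 * (v - center) / mad) > threshold]


# Hypothetical daily row counts for a table; the last day looks suspicious.
daily_row_counts = [100, 102, 98, 101, 99, 500]
print(find_anomalies(daily_row_counts))
```

A robust statistic is used here because a single extreme value in a short series inflates the ordinary standard deviation enough to hide itself from a plain z-score check.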
Data Observability tools
Knowledge of, and familiarity with, common data observability tools such as Validio, Great Expectations, and Elementary. This expertise helps them quickly assess the health of data in the organization and come up with checks to address data quality issues.
Programming languages
Proficiency in programming languages, such as SQL and Python, helps with writing and automating common functional DQ checks.
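To make this concrete, here is a minimal sketch of how SQL checks can be automated from Python. It uses the standard-library `sqlite3` module with an in-memory database so that it runs anywhere. The `orders` table, its columns, and the three checks are hypothetical examples, and a real setup would point at the organization's warehouse instead.

```python
import sqlite3

# Hypothetical table and column names, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "c1", 10.0), (2, "c2", 25.5), (2, "c2", 25.5), (3, None, 7.0)],
)

# Each check is a SQL query that returns the number of violations it finds.
CHECKS = {
    "null_customer_id": "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL",
    "duplicate_order_id": (
        "SELECT COUNT(*) FROM ("
        "SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1"
        ")"
    ),
    "negative_amount": "SELECT COUNT(*) FROM orders WHERE amount < 0",
}


def run_checks(connection, checks):
    """Run every check and return {check_name: violation_count}."""
    return {name: connection.execute(sql).fetchone()[0] for name, sql in checks.items()}


violations = run_checks(conn, CHECKS)
failed = sorted(name for name, count in violations.items() if count > 0)
print(violations)
print("failed checks:", failed)
```

Keeping the checks in a dictionary of named queries makes it easy to add new rules without touching the runner, and the same pattern works on a schedule or inside a CI job.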
Engineering Practices
Familiarity with common data engineering concepts and tools, such as Git, data contracts, slowly changing dimensions (SCD), and CI/CD systems like Jenkins and CircleCI. This enables them to test the data pipelines written by data engineers and make sure they fulfill all requirements.
Communication skills
Excellent communication skills are required to collaborate with stakeholders, such as data analysts and data scientists. These skills are also needed to meet data quality standards set by other stakeholders.
Attention to detail
Strong attention to detail is crucial to ensure the accuracy of the data and to identify data quality issues.
Data governance
A good understanding of data governance concepts, such as data ownership, data privacy, and data security, is required to ensure compliance with regulations and standards.
Continuous improvement
A mindset of continuous improvement helps with identifying areas for improvement in data quality processes and workflows.
Problem-solving
Ability to troubleshoot and resolve data quality issues, as well as identify the root cause of problems and implement effective solutions.
Interface within the Organization
The Data Quality Engineer interacts with stakeholders across the organization to make sure that the quality of different data products fulfills the intended business purpose. Collaboration with these stakeholder personas is critical for the adoption of data products in the organization, as it helps maximize the ROI of data teams. In everyday operations, the Data Quality Engineer interacts with the following personas within their organization:
- Data owners and Product managers: Understand the data they are responsible for, including its quality, sources, and usage. This enables the Data Quality Engineer to identify potential issues and work with the owners to resolve them.
- Business stakeholders: Understand their requirements and priorities, and make sure that data quality initiatives align with the organization's business goals.
- Data analysts: Identify patterns and trends in the data, validate the data, and make sure that the data is fit for their analysis.
- Data Engineering team: Make sure that data quality processes are integrated into the data architecture and that data quality issues are addressed in a timely and effective manner.
- Data Governance team: Data quality is a function of data governance, so the Data Quality Engineer should support all the data governance initiatives in the organization.
Conclusion
I have discussed why the role of a Data Quality Engineer is crucial to ensure high data quality within organizations. In the past, data quality has been neglected, as the focus was mainly on building new data products. Now, after realizing the cost of poor data quality, the industry recognizes its importance. A Data Quality Engineer is responsible for implementing a data quality assurance process that increases the adoption of reliable data products built by the Data Engineering Team. This requires a high level of technical expertise, great attention to detail, and strong communication skills to work together with other data users, such as business stakeholders, product owners, and data engineers.
Overall, a Data Quality Engineer plays a critical role in maintaining the high-quality data that feeds different data products, and in breaking the cycle of Garbage In, Garbage Out that is currently plaguing organizations.
Please let me know what you think! You’re more than welcome to contact me on LinkedIn.
P.S. A big shoutout to Sara Landfors and the whole team at Validio who helped me in writing this article.