I got involved in the data space back in 2010, when I joined Collibra, a platform for data governance, cataloging, and discovery. At that time, the role of the Chief Data Officer / Head of Data didn’t exist, and organizations were struggling to understand what was going on with their data, let alone manage data quality issues. The tooling that was available was unwieldy and antiquated, and did not support the needs of the business (in most cases, it still does not). When I was introduced to Tom Baeyens, my fellow co-founder, it didn’t take long to connect the dots (and the data) and realize that there was a real-world problem we could solve by combining Tom’s open source background and my data experience.
The issue we identified, and stand by today, is silent data issues. Silent data issues are at the core of what needs to be solved. Most data teams today are flying blind without systems or processes to detect problems with data. And as a result, data issues remain silent. Data issues are hidden across an organization’s entire data stack and largely impact data products — software that uses data to drive outcomes such as automated decision making, algorithms, derived data, raw data and decision support. Not only does the volume of data continue to grow exponentially, so does the number of data products that need to be managed on an ongoing basis. The data of these products can be compromised anywhere, at any time, as it moves from source to decision.
The issue keeps growing as systems continue to run and process bad data, with uncontrolled consequences, including unexpected or erroneous results. This is why we call them silent data issues: data quality issues that are identified only once datasets are put to use in reports, campaigns, models, and decision making.
This can, and often does, result in data engineers spending too much time fire-fighting a data issue; data consumers having no confidence to trust the data; and the business spending too much time trying to resolve the far-reaching effects and consequences.
Let me be clear: data quality is an age-old problem. You’ll find it referenced in DAMA’s Data Management Body of Knowledge, whose first edition was published in 2009. The overall principles are solid; it’s the process and technology that need a refresh.
Our big idea was to empower data teams with tools to detect and resolve the issues that matter earlier, upstream, to enable data confidence. And so Tom, the team, and I set to work to remove the fear of not knowing and the pain of finding out too late that a silent data issue has had a downstream impact.
Early in our market research, it struck us that there was a general lack of observability into data systems (commonly called Data Observability). Many organizations had a way to index and discover data (catalog), but very few organizations could automatically discover and track the data problems across and within data sources that surface in the context of building data products. As a result, most data issues remained silent.
When teams did implement a system to detect issues, it was mostly a rules-based testing service. To achieve this, they relied on homegrown frameworks, most often using YAML files to configure the checks that needed to run each time new data arrived. At runtime, the DSL is translated into a set of compute instructions, typically in SQL or Spark, that calculate metrics and evaluate checks.
The biggest problem data teams have with this setup is that it doesn’t scale. It doesn’t scale because rules are hard to set up and maintain, and it doesn’t scale because there is no automated issue discovery (for example, detecting that a table is not refreshing according to the historical refresh schedule inferred from the query logs).
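To make that pattern concrete, here is a minimal sketch of how a declarative check configuration might be translated into SQL and evaluated. It is an illustration only, not Soda's implementation: the configuration keys, the metric names, and the `orders` table are all hypothetical, and a Python dict stands in for the YAML file to keep the example self-contained.

```python
import sqlite3

# Hypothetical check configuration; a dict stands in for a YAML file.
CONFIG = {
    "table": "orders",
    "checks": [
        {"metric": "row_count", "min": 1},
        {"metric": "missing_count", "column": "customer_id", "max": 0},
    ],
}

# Each (hypothetical) metric name maps to a SQL expression template.
METRIC_SQL = {
    "row_count": "COUNT(*)",
    "missing_count": "SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END)",
}


def compile_check(table, check):
    """Translate one declarative check into a SQL query string."""
    expr = METRIC_SQL[check["metric"]].format(column=check.get("column", ""))
    return f"SELECT {expr} FROM {table}"


def run_checks(conn, config):
    """Compute each metric in SQL and evaluate it against its thresholds."""
    results = []
    for check in config["checks"]:
        # SUM over an empty table yields NULL, so fall back to 0.
        value = conn.execute(compile_check(config["table"], check)).fetchone()[0] or 0
        lower = check.get("min", float("-inf"))
        upper = check.get("max", float("inf"))
        results.append((check["metric"], value, lower <= value <= upper))
    return results


# Sample data: three orders, one of which has no customer_id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10), (2, None), (3, 12)])

results = run_checks(conn, CONFIG)
for metric, value, passed in results:
    print(metric, value, "PASS" if passed else "FAIL")
```

With this sample data, the row count check passes and the missing value check fails. Every new rule means another hand-written entry in the configuration, which is part of why this style of framework is hard to scale.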
Another, even more important, scalability problem that we identified had nothing to do with technology, but with people and process. The majority of high-value checks were defined by subject matter experts (SMEs) of data, who are often not comfortable with Git + a YAML DSL. This limits adoption of both the new process and technology tremendously.
Data teams require a common framework to define and manage expectations for data behavior. Setting “Service Level Agreements” between data engineers and data consumers would bring clarity and consistency to teams when creating data products, eliminating (unwanted) assumptions and improving data quality on a continuous (automated) basis.
Soda Cloud is a new, prescriptive approach to get ahead of silent data issues and manage data quality. It combines predictive capabilities with a super simple, yet powerful rules-based system. This allows data teams to create coverage very quickly, and get end-to-end observability.
Soda Cloud is designed for a broad range of team members to get involved, from data platform engineers, to analytics engineers, product managers, and analysts. The goal is to help them to discover, prioritize, and resolve data issues collaboratively — and sooner. This way, the CDO / Head of Data can keep the oversight they need to help remove bottlenecks and ensure governance.
Through automated monitoring, metadata is collected, tracked, and monitored across core data quality dimensions including timeliness, completeness, consistency, and validity. By tracking datasets over time, Soda learns the data refresh interval, the typical volume of data processed, and any changes in the table schema (including changes in inferred types). By doing this automatically for all datasets, a large portion of data issues can already be discovered.
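As a rough illustration of the idea, the sketch below learns a refresh interval from past refresh timestamps and flags a dataset as stale or its volume as anomalous. This is not how Soda Cloud works internally; the median-gap heuristic, the z-score test, the thresholds, and all function names are assumptions chosen for simplicity.

```python
from datetime import datetime, timedelta
from statistics import mean, median, stdev


def learned_interval(refresh_times):
    """Median gap between consecutive refreshes, used as the expected interval."""
    gaps = [(b - a).total_seconds() for a, b in zip(refresh_times, refresh_times[1:])]
    return timedelta(seconds=median(gaps))


def is_stale(refresh_times, now, tolerance=1.5):
    """Flag the dataset when the time since the last refresh exceeds
    tolerance times the learned interval (tolerance is an assumed setting)."""
    return now - refresh_times[-1] > tolerance * learned_interval(refresh_times)


def is_volume_anomaly(row_counts, latest, z_threshold=3.0):
    """Flag a row count more than z_threshold standard deviations
    away from the historical mean."""
    mu, sigma = mean(row_counts), stdev(row_counts)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold


# A week of daily refreshes (made-up history for illustration).
start = datetime(2021, 1, 1)
history = [start + timedelta(hours=24 * i) for i in range(7)]

stale_after_26h = is_stale(history, history[-1] + timedelta(hours=26))
stale_after_48h = is_stale(history, history[-1] + timedelta(hours=48))
volume_flag = is_volume_anomaly([1000, 1020, 980, 1010, 990], latest=400)
```

In this example a refresh that is two hours late stays within tolerance, a missed day is flagged, and a drop from roughly 1,000 rows to 400 is flagged as a volume anomaly. A production system would need to handle seasonality and irregular schedules, which this sketch ignores.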
Next to this automated anomaly detection for “known unknowns”, data issues can also be discovered by testing and validating data. For this, our team has developed a simple yet very powerful low-code domain-specific language (DSL) that allows you to run a wide range of checks. These include, but are not limited to, consistency-over-time checks, reconciliations, reference data checks, and virtually any business logic.
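Two of these check types, reconciliation and reference data checks, can be sketched in plain SQL. The example below does not use Soda's DSL; the tables, columns, and function names are hypothetical, and SQLite is used only so that the sketch runs anywhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_orders (id INTEGER, country TEXT, amount REAL);
    CREATE TABLE warehouse_orders (id INTEGER, country TEXT, amount REAL);
    CREATE TABLE ref_countries (code TEXT);
    INSERT INTO source_orders VALUES (1, 'BE', 10.0), (2, 'US', 20.0), (3, 'XX', 5.0);
    INSERT INTO warehouse_orders VALUES (1, 'BE', 10.0), (2, 'US', 20.0);
    INSERT INTO ref_countries VALUES ('BE'), ('US');
""")


def reconcile(conn, source, target):
    """Reconciliation: compare row counts and amount totals between two tables.
    Table names are interpolated directly, so pass trusted identifiers only."""
    query = "SELECT COUNT(*), COALESCE(SUM(amount), 0) FROM {}"
    src = conn.execute(query.format(source)).fetchone()
    tgt = conn.execute(query.format(target)).fetchone()
    return src == tgt


def invalid_reference_values(conn, table, column, ref_table, ref_column):
    """Reference data check: distinct non-null values in table.column
    that do not appear in the reference table."""
    rows = conn.execute(
        f"SELECT DISTINCT t.{column} FROM {table} t "
        f"LEFT JOIN {ref_table} r ON t.{column} = r.{ref_column} "
        f"WHERE r.{ref_column} IS NULL AND t.{column} IS NOT NULL"
    ).fetchall()
    return [row[0] for row in rows]


tables_match = reconcile(conn, "source_orders", "warehouse_orders")
unknown_countries = invalid_reference_values(
    conn, "source_orders", "country", "ref_countries", "code"
)
```

Here the reconciliation fails because one order never reached the warehouse table, and the reference data check surfaces the unknown country code `XX`.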
Triaging and prioritizing data issues today has become an ever greater challenge, largely because of the volume of data that’s being amassed — and the variety of different data owners, consumers and stakeholders across the business. Data teams are realizing that not all data is created equal, and in order to keep quality high, an ongoing process needs to be put in place where data owners take responsibility for both prioritizing and resolving data issues. This thinking has most recently been outlined in the so-called data mesh.
Data Freshness Dashboards — a mechanism to manage product performance
Soda Cloud gives teams a tangible framework to define, visualize, and track data behavior through Data Freshness Dashboards. Whether it’s for a BI report for Finance or machine-learning instructions set up by Operations, this approach enables data owners to manage their data as a product and understand what every team expects of the data downstream. These dashboards show the data owner the data issues discovered across the organization, to accelerate both discovery and resolution.
Further analysis on the root cause of a data issue is often also needed. This can be achieved by, for example, exploring the data lineage in your data transformation, orchestration, and/or data cataloging tool.
There’s no point in discovering and prioritizing data issues if there isn’t a robust follow-up process to resolve them. Behind our mission is the ethos that the right people need to be brought together at the right time. In this part of the process, the aim is to resolve issues collaboratively and, ultimately, to prevent them.
At Soda, we believe that data quality is a team sport. However, you need to make sure that the right people get involved, through role-based alerting. The Soda platform facilitates collaboration by creating a shared context and a clear resolution workflow for prioritizing and resolving the issues, and assigning the tasks, that matter most to the business.
Our approach and workflows recognize that it is the data owner who ultimately decides where to invest time and effort to improve the quality of data. Data owners should be able to easily ask both data engineers and data SMEs in the business to help analyze and fix issues. To streamline communications, integration options are available for the most-used channels such as e-mail, Chat, Slack, or ServiceNow.
We are strongly driven by the community that is forming, and motivated by our mantra that data quality is a team sport. The moment an organization can bring everyone (and we believe that is every single person in the business) closer to the data is when the magic happens!
Soda’s approach is different. Whilst there are a number of offerings in the market (with more and more joining this growing data space every day), they tend to focus on only one part of the problem, or solve the challenges of only one role in the data team.
With Soda Cloud, the entire end-to-end data quality process is brought together in a single platform, integrated and centralized. Each user can work with the right tools and workflows, in an environment that best suits their needs and expertise. This is what we mean when we say that Soda brings everyone closer to the data.
We solve the problem with a combination of a cloud platform and a set of open source developer tools.
Soda Cloud prescriptively solves the problem of discovering the silent data issues that matter, by giving data teams a central platform to track and score the health of data across core quality dimensions. For us, end-to-end observability and real-time collaboration will only occur when:
For us, the goal of our community is to simplify the effort and share best practices. Everyone is looking to solve common problems with common solutions. And to help, we’re being open and reducing the friction to quickly get started with testing and monitoring data. We also want to reduce the fear, pain and sleepless nights caused by no solution or homegrown solutions that are not solving the problem.
Soda SQL is delivered in a white box approach that puts engineers in control. The Soda Developer Tools kit, available on GitHub, is built to fit naturally into the data engineer’s workflow.
Soda Cloud is available as a free trial version (now extended until June 30, 2021). Users can set up monitoring in just minutes and see the power of Soda on their own data.
There is power in the community and there is power in good data.
If I can help, let me know.