Test Your Data as You Would Test Your Code

Published May 11, 2020

Updated Jan 30, 2026

Kyra Petrov

Former Marketing Content Manager at Soda

Have you ever had unexpected or bad data flow into one of your analytical products?

Bad data leads to compromised analytical integrity, faulty decision-making, and a loss of trust in your data.

As organizations do more with data, the need for high-quality data only increases. In machine learning, for example, it is crucial to first find an unbiased, clean subset on which to train your predictive model, because bad data leads to faulty predictions.

This is why companies are warming up to the idea of testing their data.

Testing datasets is not an entirely new concept. In software engineering, test-driven development (TDD) has been a common practice for many years, and automated testing goes hand in hand with writing code. If you are not testing your code automatically after every commit, you’d be considered an amateur instead of a professional. To trust software, you need automated test suites. This is no different when it comes to data that’s used in production.

In the last few years, data product development has made big progress. Mainstream adoption of new data technologies became a reality. Even for machine learning, one of the frontiers of software development, many off-the-shelf solutions can readily be found online.

Alongside ML, traditional data warehousing has advanced just as much. Companies and consumers started to produce a lot more data, so a new type of cloud-native data warehouse (like Snowflake) came to market, in which storage and compute are separated (read: fast and scalable) and pricing is based on the compute power and uptime you consume (read: low cost).

Before we continue, let’s first define what we mean by “data in production” and introduce the concept of “data product”.

A data product is similar to other software products (e.g. it has an interface and gets new features that are deployed to dev and production), but it differs in that its primary goal is to provide value through analytical datasets. Data products therefore rely heavily on data to drive outcomes. Another characteristic is that the most recent data (how many traffic jams are there right now?) is often the most important.

When developing a data product, it makes sense to start testing data as soon as it goes into production. Examples could be testing the daily set of transactions that came in, or the distribution of predictions we’ve made across all segments.

It’s vital that the data is clean and as expected, to ensure the integrity of your data product.
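The daily-transactions example above can be sketched as a small test function. This is a minimal illustration, not a prescribed implementation: the field names (`id`, `amount`, `day`), the `min_rows` threshold, and the function name are all assumptions made for the example.

```python
from datetime import date


def check_daily_transactions(rows, expected_day, min_rows=1):
    """Return a list of failure messages for one day's batch (empty list = pass).

    The row layout (id, amount, day) is a hypothetical schema for illustration.
    """
    failures = []

    # Volume: an empty or tiny batch usually means an upstream load failed.
    if len(rows) < min_rows:
        failures.append(f"expected at least {min_rows} rows, got {len(rows)}")

    # Uniqueness: the same transaction should not arrive twice.
    ids = [row["id"] for row in rows]
    if len(ids) != len(set(ids)):
        failures.append("duplicate transaction ids")

    for row in rows:
        # Validity: every transaction needs a positive amount.
        if row["amount"] is None or row["amount"] <= 0:
            failures.append(f"transaction {row['id']}: invalid amount {row['amount']}")
        # Freshness: the batch should only contain the expected day.
        if row["day"] != expected_day:
            failures.append(f"transaction {row['id']}: unexpected day {row['day']}")

    return failures


# Example batch with three deliberate problems: a duplicate id,
# a missing amount, and a row from the wrong day.
batch = [
    {"id": 1, "amount": 19.99, "day": date(2020, 5, 11)},
    {"id": 2, "amount": None, "day": date(2020, 5, 11)},
    {"id": 2, "amount": 5.00, "day": date(2020, 5, 10)},
]
for failure in check_daily_transactions(batch, expected_day=date(2020, 5, 11)):
    print(failure)
```

Returning a list of messages rather than raising on the first problem lets a single run report everything that is wrong with the batch.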

Why should you test data products once they are put into production?

In software development, developers test because they need to spot bugs and their underlying causes early, as bugs tend to lead to crashes. Clearly, bugs that crash your systems aren’t a good thing, but at least they signal that something is wrong, which makes them easy to detect.

In data product development, testing is even more crucial, for several reasons. Many factors force data to change constantly. The people who cause these changes often work in different parts of the organization, and they don’t always assess the impact on the data value chain when they change the operational systems.

Furthermore, data issues are often silent. A machine learning model or algorithm, for example, will continue to work even if some of its inputs are off. It can take weeks, months or even years to spot an issue, so it’s crucial to detect anomalies as soon as possible.
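To illustrate how a silent input problem might be surfaced, here is a minimal sketch that compares today's values for one model input against a reference sample. The function names, the threshold of 3 standard deviations, and the euros-to-cents scenario are assumptions chosen for the example; real drift detection would usually look at more than the mean.

```python
from statistics import mean, stdev


def mean_shift(reference, current):
    """Distance between the current mean and the reference mean,
    measured in reference standard deviations."""
    spread = stdev(reference)
    if spread == 0:
        raise ValueError("reference values have no spread")
    return abs(mean(current) - mean(reference)) / spread


def input_looks_off(reference, current, threshold=3.0):
    """True when the current values have drifted past the (assumed) threshold."""
    return mean_shift(reference, current) > threshold


# Hypothetical scenario: an upstream system silently switches
# order values from euros to cents. The model would still run.
reference_order_values = [10, 12, 11, 9, 10, 11, 12, 9]
todays_order_values = [1000, 1200, 1100]

if input_looks_off(reference_order_values, todays_order_values):
    print("order values differ sharply from the reference sample")
```

A check like this does not explain what went wrong, but it turns a silent change into a visible signal on the day it happens.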

Keep in mind, too, that data lakes and modern data warehouses can become very complex, which means manual testing alone is no longer a viable way to prevent data lakes from becoming data swamps.

When should you start testing your data?

The answer here is short and sweet: as soon as possible. This means introducing a culture change in your teams that starts with creating transparency about errors and data deficiencies. We definitely recommend testing your first data product after every build or change you make (even if your product is a report!).

Your data products will break when they are not monitored. Knowing why they break is crucial for further data product development. On a more technical level, it’s recommended to test after each step in a pipeline, as well as between pipelines.
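Testing after each pipeline step could look roughly like the sketch below. Everything here is hypothetical: the step names, the row fields, and the `run_pipeline` helper are made up to show the pattern of pairing each transformation with a check on its output.

```python
class DataCheckError(Exception):
    """Raised when the output of a pipeline step fails its checks."""


def run_pipeline(rows, steps):
    """Run (name, transform, check) steps in order.

    Each check receives the step's output and returns a list of failures.
    The pipeline stops at the first step whose output fails, so the error
    names the step where the data went wrong.
    """
    for name, transform, check in steps:
        rows = transform(rows)
        failures = check(rows)
        if failures:
            raise DataCheckError(f"after step '{name}': " + "; ".join(failures))
    return rows


def drop_cancelled(rows):
    return [r for r in rows if r["status"] != "cancelled"]


def check_not_empty(rows):
    return [] if rows else ["no rows left"]


def cents_to_euros(rows):
    return [{**r, "amount": r["amount_cents"] / 100} for r in rows]


def check_amounts_positive(rows):
    bad = [r["id"] for r in rows if r["amount"] <= 0]
    return [f"non-positive amount for ids {bad}"] if bad else []


STEPS = [
    ("drop_cancelled", drop_cancelled, check_not_empty),
    ("cents_to_euros", cents_to_euros, check_amounts_positive),
]

orders = [
    {"id": 1, "status": "paid", "amount_cents": 1250},
    {"id": 2, "status": "cancelled", "amount_cents": 500},
]
print(run_pipeline(orders, STEPS))
```

Because the check runs directly after the step that produced the data, a failure points at one transformation instead of at the final report. The same pattern applies at the hand-off between two pipelines.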

Who should be involved in data quality testing?

Historically, data testing has been left to the engineering team, and because of their rather limited data domain knowledge, they focused on operational metrics. Experience has shown that testing data without domain knowledge is of little use, as many high-impact issues will stay hidden in the dataset.

We therefore recommend always including data subject-matter experts (SMEs) in the testing of the source data that goes into all your data products.
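The difference between operational checks and domain checks can be shown with a small, hypothetical example. The refund rule below stands in for the kind of knowledge a subject-matter expert would contribute; the field names and both functions are assumptions made for illustration.

```python
def operational_checks(rows):
    """Checks an engineer can write without knowing the business."""
    failures = []
    if not rows:
        failures.append("batch is empty")
    if any(r["paid"] is None or r["refund"] is None for r in rows):
        failures.append("missing values")
    return failures


def domain_checks(rows):
    """Hypothetical rule from a subject-matter expert:
    a refund can never exceed what the customer paid."""
    return [
        f"order {r['order_id']}: refund {r['refund']} exceeds payment {r['paid']}"
        for r in rows
        if r["refund"] > r["paid"]
    ]


orders = [
    {"order_id": "A1", "paid": 50.0, "refund": 0.0},
    {"order_id": "A2", "paid": 20.0, "refund": 200.0},
]
print(operational_checks(orders))  # the batch is complete, so nothing is flagged
print(domain_checks(orders))       # the refund rule flags order A2
```

The batch is complete and well formed, so the operational checks pass. Only the domain rule reveals that one order cannot be right.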

Conclusion

Data product development teams can and should apply some of the best practices that have been developed in software engineering and data management over the last decades.

Data issues are often silent: most data products won’t even break, so issues go unnoticed for a while, resulting in poor product quality and days or weeks of clean-up work.

Nobody wants to be a data janitor, so start testing your data products today!
