Dedicated frameworks for data quality management help engineers concentrate on designing and optimizing reliable data pipelines that provide the best value for businesses.
In the era of big data, decision-making is all about inferring a future state by understanding the past and present. But when your data doesn’t properly capture the reality of your business, it won’t serve as a reliable basis for any predictive model. Rather than driving smart business decisions, data that’s not validated for quality and reliability can be worthless, or even damaging, to your company.
Unlike buggy code, which causes software to break, low-quality data can remain undetected for a long time. But when it creates issues, the firefighters (a.k.a. the data engineering team) are called to the rescue. At Soda, we often see data engineers within the industry who spend too much of their work time patching up existing data pipelines and debugging data issues when their expertise would be much better used for designing and optimizing the company’s overall data infrastructure, or building new data products.
It is for these firefighters that Soda exists. Data quality and reliability checks help businesses detect data-related issues long before they have a negative impact. In this blog, we share a few simple, effective checks that you can implement today to help your business run more smoothly and efficiently. Plus, we’ll share our thoughts on some longer-term solutions that will help you place good data at the heart of your business model.
What are data quality checks?
Data quality checks formulate your expectations of the tables in your database or of the columns within a table. You could, for example, specify that your datasets shouldn’t be empty or that a certain column shouldn’t contain duplicate values. The Soda Checks Language (SodaCL) is a concise and readable language built expressly for data quality and reliability, and it offers several ways to define expectations. Data engineers and technical users can write SodaCL checks directly in a checks.yml file, leverage check suggestions in the Soda Library CLI to generate a basic set of data quality checks, or add SodaCL checks to a programmatic invocation of Soda Library. Non-technical and business users, such as data analysts or data scientists, can use a simple user interface: dropdown menus and pre-populated fields make it easy to specify data quality rules with no-code checks. In addition, you can provide natural language instructions to SodaGPT, the first AI co-pilot for data quality, to receive fully-formed, syntax-correct checks.
To compare the expectations outlined in your quality checks file to your actual data, Soda utilizes a scan that it runs against your datasets to extract metadata and gauge data quality. The results of the scan alert you to irregularities in your data. Depending on the type of alert and the relevancy of the affected data, you may take different measures to address the issues, such as fixing the source of the problem or attaching a warning to the data before handing it over to another team. For a detailed introduction to Soda, have a look at our guide to implementing data quality checks.
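As a sketch of what this looks like in practice, here is a minimal checks.yml file. The dataset name and column names below are hypothetical placeholders, not part of any real configuration:

```yaml
# checks.yml — a minimal, illustrative set of SodaCL checks.
# "orders" and its columns are hypothetical names; substitute your own.
checks for orders:
  - row_count > 0                     # the dataset must not be empty
  - duplicate_count(order_id) = 0     # order IDs must be unique
  - missing_count(customer_id) = 0    # no NULLs in the customer column
```

Assuming Soda Library is installed and a configuration.yml describes your data source connection, a scan of this file can be run from the CLI with a command along the lines of `soda scan -d my_datasource -c configuration.yml checks.yml`, where `my_datasource` is the name you gave your data source.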
Be proactive, rather than reactive, using automated checks
Proactively checking data in order to prevent downstream impact introduces an element of foresight into the processes and workflows that rely on (good-quality) data. This approach is very different from the reactive approach that we’ve seen in many companies. In a reactive workflow, when a problem occurs, the data engineer has to go in ASAP and write ad-hoc checks and fixes. Too often, this means that they are inundated with tickets, resulting in the notorious data engineering bottleneck and frustration across the team.
We’ve also seen data engineers routinely repeat the same manual reliability checks over and over again — for instance, at ingestion or after a transformation. They usually know that this situation is far from ideal but don’t have the time or resources to look for alternatives.
6 checks for data quality that you can implement today
Here’s the good news: if you’re a data engineer looking to automate your data quality procedures, you don’t need to reinvent the wheel. As experts in the space, we’ve identified some checks that will make your life easier from day one and require almost no domain knowledge. If any of these checks sound an alarm during a scan, then there’s a high likelihood that something is off.
1. Track the number of rows in your dataset
Simple but effective, a row count check lets you make sure that your datasets aren’t empty — an important prerequisite for any downstream task. Row count checks can also alert you to unusual spikes in the volume of your data. When a transformed dataset suddenly contains many more rows than expected, it could point to a bug in your analytics code, such as an outer join being used where an inner join was intended.
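In SodaCL, these checks are one-liners. A sketch, with a hypothetical dataset name and illustrative thresholds:

```yaml
checks for orders:                      # "orders" is a hypothetical dataset name
  - row_count > 0                       # the dataset must not be empty
  - row_count between 1000 and 100000   # flag unexpected spikes or drops in volume
```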
2. Track your schema evolution
A schema describes the columns in your dataset. Although dataset schemas may change during the early stages of your business – columns added or removed, or changes to column ordering – they should stabilize at some point. Add a schema evolution check to automatically monitor changes to your schema and notify you when anything happens. Run two scans to start seeing results: one to capture a baseline measurement, and a second to compare against it.
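A schema evolution check can be sketched in SodaCL as follows; note that this type of check relies on Soda Cloud to store the baseline between scans, and the dataset name is again a placeholder:

```yaml
checks for orders:                  # "orders" is a hypothetical dataset name
  - schema:
      warn:
        when schema changes: any    # columns added, removed, reordered, or retyped
```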
3. Check the timeliness of your data
At a time when new data points are produced and transmitted in a continuous flow, it is particularly important to keep an eye on the timeliness of data. To that end, you can use SodaCL to implement a freshness check on a date or timestamp column. For instance, you could use it to configure an alert if the youngest data in a dataset is older than a day. When triggered, it alerts you to roadblocks in your larger data ecosystem. Perhaps a third-party supplier accidentally sent a file with old data? Or maybe a pipeline didn’t run correctly? With a freshness check, you’ll know.
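The day-old-data example above, assuming a hypothetical timestamp column named created_at, might look like:

```yaml
checks for orders:                 # hypothetical dataset and column names
  - freshness(created_at) < 1d     # alert if the most recent row is older than a day
```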
4. Check that values are unique
Duplicate values can greatly distort datasets. Apply a duplicate check to make sure a column contains only unique values. You may, for instance, apply it to both order_id and account_number to make sure that orders are not falsely duplicated.
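The two checks mentioned above could be written as follows; the dataset name is a placeholder:

```yaml
checks for orders:                       # "orders" is a hypothetical dataset name
  - duplicate_count(order_id) = 0        # every order ID must be unique
  - duplicate_count(account_number) = 0  # every account number must be unique
```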
5. Surface invalid values
Did someone accidentally enter a date incorrectly? Should a column of order numbers contain a certain number of characters? Wouldn’t you like to know if either of those things has happened? Use a validity check to issue warnings when data in your dataset is invalid or unexpected.
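Both scenarios can be covered by validity checks. A sketch, assuming hypothetical order_number and order_date text columns and an order-number length of 10 characters:

```yaml
checks for orders:                  # hypothetical dataset and column names
  - invalid_count(order_number) = 0:
      valid length: 10              # order numbers must be exactly 10 characters
  - invalid_count(order_date) = 0:
      valid format: date iso 8601   # format checks apply to text-type columns
```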
6. Find the missing pieces
A report on forecasted revenue will not yield very accurate predictions if a monthly payments column is missing values. Use a missing check to find the NULLs and make sure the data that your teams are working with is complete.
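Using the monthly payments example, a missing check could be sketched as:

```yaml
checks for payments:                    # hypothetical dataset and column names
  - missing_count(monthly_payment) = 0  # fail if any value in the column is NULL
```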
What happens when you start automating your data quality checks?
We never get tired of repeating it: automating your data quality checks will bring your company nothing but positive results. Data engineers can go back to doing their actual jobs and hopefully be relieved of the pressure associated with undetected data quality issues. No more data engineer-related bottlenecks!
Of course, unreliable data is not just a constant source of stress for the data engineer. It also results in an environment in which you never really know how much you can trust your data-informed decisions. After all, even the cleverest machine learning model will only be as good as the data it’s trained on. Further, having automated data quality checks in place also increases the potential for self-service analytics, which we’ll go into in another guide.
What other measures can your company take?
Data isn’t inherently of good or bad quality. That judgment depends very much on what you want the data to achieve. For example, the same dataset can have different quality requirements depending on whether it’s used for reports that only a few people read, or for making strategic decisions for a whole department.
When everyone in your company is clear about what they expect from the data they use, you get better-informed conversations about data. Here are two more ways you can guide your teams toward an environment of trusted data.
1. Establish the concept of data owners
Regular, automated quality checks are an important foundation for any data-driven business. But they can only provide true value when someone is responsible for addressing the alerts raised during a scan. That’s why every dataset should have a data owner, a person who is ultimately accountable for the quality of that data. When there’s an issue or someone further downstream requires a change, the data owner is their contact person.
Note that data owners are not typically data engineers. That’s because a data engineer’s expertise lies in managing the data rather than understanding the content and context of the data itself. A data owner brings domain expertise to the table with their intimate knowledge of what the data represents and the processes that generate it. Data owners and engineers work closely together to bring high-quality data to everyone on the team who needs it.
2. Attach a health score to your data-based products
Teams often want their data-based products to be 100% accurate but are unaware of how unrealistic that expectation is. In reality, data that is truly interesting can also be very messy! Real-life data always has missing values, outliers, and other noise. A good way for your company to respond to your data’s inherent variability is by quantifying the reliability of the data as a “health score.”
Let’s imagine for a moment that one of the datasets used in a periodically updated dashboard fails the freshness check. By introducing a health score, you can still update your dashboard despite the stale data, but signal to viewers that this iteration is slightly less reliable than previous ones. The users of your data can then decide whether to wait for more reliable data or work with what they already have.
Getting a grip on data quality can feel like an insurmountable challenge, but not anymore! By introducing procedures dedicated to data quality and reliability into your workflow, you can enable data engineers to put their expertise to its best use. Plus, everyone in your company is rewarded with better-quality, trustworthy data to work with.
Start a free trial of Soda to implement foundational data quality checks today and avoid the pain of not knowing, or finding out too late, that a data quality issue has had a downstream impact. If you’d prefer to talk directly to us, schedule a meeting.