Data teams tackle data migration projects when they need to permanently transfer data from one storage system to another. These projects typically promise reduced costs from consolidating systems, and increased productivity, such as faster queries on unified data.
However, data migration projects come with risks. Historically, almost 85% of data migration projects have overrun their time and budget(1). More often than not, the root cause of these failures is a lack of investment in the systems and processes that ensure reliable, high-quality data. When newly migrated data appears inaccurate in the new storage system, end users are likely to distrust it and revert to their old way of working.
In an effort to mitigate that outcome, this article discusses several data migration best practices that use Soda to help address data migration risks.
Decide on your approach
The first thing to think about is your approach. Are you going with an incremental approach, or a big bang? Do you plan on running both systems in parallel for a while and if so, for how long? To decide on an approach, it’s best to create a stakeholder map that helps you understand who is using the system, for what purpose, and when each team can invest the time to migrate.
Plan your data migration
What follows are some steps to consider adding to your next data migration project plan.
Catalog data to be migrated
- Document schemas and data types; Understand the variety of data in the system
- Document volumes; Understand the volume of data processed daily
Define scope, priority, and stakeholders
- Stack-rank datasets; What’s most important to do first?
- Identify technical stakeholders; Which data and infrastructure engineers need to be involved?
- Identify domain experts (SMEs); Which business SMEs can you leverage?
- Identify key data consumers; Which consumers will do acceptance testing?
Document legacy data quality issues; Understand the problems of the past
Evaluate and pick tools
- CI/CD; Which tool will you use to store and release code?
- Ingestion; Which tool will move data from source to target?
- Warehousing; Which SQL engine will you use?
- Transformation; Which tool will build derived models and views?
- Cataloging; Which tool will document data and stakeholders?
- Quality; Which tool will evaluate data quality and debug issues?
Design first iteration
- Select datasets; Take into account other projects
- Profile and add baseline checks; Iterate on quality checks with the stakeholders
- Map source to target types; Convert to more specific types, where possible
- Migrate data transformation logic; Rewrite the transformation logic in the new tool
- Write schema and reconciliation checks; Write checks that compare source to target
- Share data quality results; Share the results of data quality checks with stakeholders
Test user acceptance; Ask SMEs and consumers to test
Decommission; Proceed team by team, or use-case by use-case
Report on TCO; Analyze and share the results of the project
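To make the "profile and add baseline checks" step concrete, here is a minimal SodaCL sketch of baseline and schema checks. The dataset name, column names, and required types are hypothetical placeholders; adapt them to the datasets you cataloged.

```yaml
# checks.yml - baseline data quality checks for a hypothetical employee dataset
checks for dim_employee:
  # Baseline volume and validity checks to iterate on with stakeholders
  - row_count > 0
  - missing_count(email) = 0
  - duplicate_count(employee_id) = 0
  # Schema check: fail if key columns are missing or have drifted types
  - schema:
      fail:
        when required column missing: [employee_id, email, hire_date]
        when wrong column type:
          employee_id: integer
```

Start with a small set of checks like these, share the results with domain experts, and tighten the thresholds as you learn what "good" looks like for each dataset.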
Manage data quality
If data quality is the most common root cause of data migration failure, then let’s unpack what can go wrong and why.
Data teams can measure the success of a data migration project by aggregating the check results. SodaCL reconciliation checks provide the most important measure of accuracy as they compare the source and target tables to identify differences at a metric and record level.
For example, use a metric reconciliation check to evaluate the percentage of missing values in a column, and a record reconciliation check to validate that employee name is the same in both the source and target tables. These checks make sure that the state of data in one system exactly matches the state in the other system. Sharing the result of these metrics with everyone increases transparency and trust in the migrated data.
You can implement Soda reconciliation checks as in the following SodaCL example, which compares only the data specified in the list of columns, mapped according to the order in which they are listed.
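The sketch below assumes hypothetical data source names (legacy_postgres, new_snowflake), dataset and column names, and an illustrative threshold of six row-level differences; substitute your own.

```yaml
# recon.yml - reconciliation checks comparing source and target datasets
reconciliation Employees:
  label: "Legacy Postgres vs. new Snowflake"
  datasets:
    source:
      dataset: dim_employee
      datasource: legacy_postgres
    target:
      dataset: dim_employee
      datasource: new_snowflake
  checks:
    # Metric reconciliation: the percentage of missing emails must match
    - missing_percent(email) diff = 0
    # Record reconciliation: compare only the listed columns, mapped by
    # position; the check fails with more than six row-level differences
    - rows diff <= 6:
        source columns: [employee_id, full_name]
        target columns: [emp_id, name]
```

Run these checks in Staging on every migration iteration so that differences surface before you cut over.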
With more than six inconsistencies between the columns' data, the check fails. The check result output in Soda Cloud identifies reconciliation issues so that you can address problems early, ideally in a Staging environment, preventing data quality issues from migrating into Production along with the data itself.
Transition to the new source
Many teams are reluctant to change for a variety of reasons. Therefore, it’s very important to work closely with your stakeholders, like subject matter experts and consumers.
Make sure they plan time to test and migrate their processes to the new data source. Check in with them as they progress, and when they are finished, decommission the datasets in the old data source. Best practice dictates that you run both systems in parallel for a short while, and work iteratively towards complete transition. “Big bang” transitions seldom work out well.
To further motivate data teams to make the transition, you can also introduce new data management capabilities that further drive the adoption of the new data source. You can, for example, embark on a project to improve data cataloging, quality, security, retention, or access management. The easier the tools and processes, the more likely users will migrate.
Measure twice, cut once
This old adage holds up well in the digital world of data migration: measure data quality in Staging using SodaCL reconciliation checks before cutting over to Production. We could even add "An ounce of prevention is worth a pound of cure" or "Garbage in, garbage out". These proverbs remind us that we're all addressing the same kinds of challenges, and those of us who make the effort to prevent data quality issues early in a data migration project will fare much better.