Introduction
In data warehousing and business intelligence, ensuring data integrity is paramount, especially after implementing code changes. Data transformation jobs are often modified to improve performance, correct errors, or accommodate new business logic. However, these modifications must not unintentionally alter the correctness of the data. This blog post discusses a practical SQL approach for comparing two data sets — typically, one before and one after a transformation — to ensure that the output remains consistent. This technique is applicable in various environments, including validating a development ("dev") table against a production ("prd") table prior to migrating code.
The Use Case
Imagine you have made code changes to a transformation job that populates a key dimension table. Now, you need to verify that the modified version produces the same results as the previous one. This could mean comparing a development copy of the table to a production copy or even comparing two versions of a table within the same environment. While there are tools and frameworks available for data validation, a simple SQL approach can be highly effective, especially in environments like Snowflake where data can be queried directly with minimal setup.
The SQL Query: Set-Based Comparison
The SQL method utilizes MINUS and UNION ALL operations to identify any discrepancies between two data sets. Here's a basic template:
select 'A-B' as set_name, set_a.* from dev_db.schema.table as set_a
minus
select 'A-B' as set_name, set_b.* from prd_db.schema.table as set_b
union all
select 'B-A' as set_name, set_b.* from prd_db.schema.table as set_b
minus
select 'B-A' as set_name, set_a.* from dev_db.schema.table as set_a;
This query helps identify two types of discrepancies:
- Records present in set_a but not in set_b (A-B).
- Records present in set_b but not in set_a (B-A).
Practical Example
select 'DEV-PRD' as set_name, set_a.* from dev_edw_db.analytics_base.dim_client as set_a
minus
select 'DEV-PRD' as set_name, set_b.* from prd_edw_db.analytics_base.dim_client as set_b
union all
select 'PRD-DEV' as set_name, set_b.* from prd_edw_db.analytics_base.dim_client as set_b
minus
select 'PRD-DEV' as set_name, set_a.* from dev_edw_db.analytics_base.dim_client as set_a;
The result set will display any rows that are mismatched, making it easy to identify specific differences between the tables. The columns in the output will include all fields in the table, preceded by the set_name field to indicate the discrepancy type.
Advantages of This Approach
- Simplicity: No additional setup or complex frameworks required.
- Readability: SQL is easy to read and modify, making it accessible to data engineers and analysts.
- Scalability: Works with large data sets, leveraging Snowflake's performance optimizations.
Limitations and Considerations
- Schema Differences: This approach works best when the two tables have the same schema. If there are schema mismatches, additional handling is needed.
- Performance: MINUS and UNION ALL can be computationally expensive for very large tables. Use appropriate filters or sampling if necessary.
- Null Handling: Ensure that NULL values are handled consistently, as MINUS treats NULLs in a special manner.
Conclusion
This SQL pattern is an effective way to validate changes to your data transformation logic. It provides a quick and reliable method to ensure that code changes have not introduced unintended differences in your data. Incorporate this technique into your development and testing workflows to enhance data quality assurance and maintain the integrity of your data warehouse.