Using Minus Query Method for Data Validation Testing
The Minus Query data testing method is one of the 2 most popular methods of testing the ETL process and validating that the ETL mapping specifications have been implemented properly (the other popular method is Sampling).
What is a Minus Query?
A Minus Query is a query that uses the MINUS operator in SQL to subtract one result set from another result set to evaluate the result set difference. If there is no difference, there is no remaining result set. If there is a difference, the resulting rows will be displayed.
So the first minus query for these tables is Table_1 MINUS Table_2:
(SELECT f_name, l_name, position FROM Yankees1 MINUS SELECT firstName, lastName, Positions FROM Yankees2)
The result set should be the rows we have highlighted in RED below:
Then you need to subtract Table_2 MINUS Table_1:
(SELECT firstName, lastName, Positions FROM Yankees2 MINUS SELECT f_name, l_name, position FROM Yankees1)
The result set should be the rows we have highlighted in GREEN below:
The result set for the first minus query are the rows in Red from Table_1 that do not exist in Table_2. The result set for the second minus query is the row in Green in Table_2 that does not exist in Table_1.
How Do You Test with Minus Queries?
The way to test using Minus Queries is to perform source-minus-target and target-minus-source queries for all data, making sure the extraction process did not provide duplicate data in the source and all unnecessary columns are removed before loading the data for validation.
Challenges with the Minus Query Method
While working with Minus Queries, we identified some common business and technical issues with this data testing method:
Minus Queries that pass will return no data because the result sets are the same. This has the impact of:
- Preventing any historical review or analysis of data from past testing
- Presenting a drawback for organizations that have compliance requirements and must document test results for audit purposes
- Potential false positive results
- Minus Queries are processed on either the source or the target database, which can draw significantly on database resources (CPU, memory, and hard drive read/write)
- In the standard minus query implementation, minus queries need to be executed twice (Source-to-Target and Target-to-Source). This doubles execution time and resource utilization
- If directional minus queries are combined via a Union (a union reduces the number of queries executed by half), information about which side the extra rows are on can be lost
- Result sets may be inaccurate when duplicate rows of data exist (the minus query may only return 1 row even if there are duplicates)
QuerySurge - The Automated Data Validation & Testing Solution
QuerySurge is the leading solution for automating the testing of Data Warehouses and Big Data. QuerySurge ensures that the data extracted from data sources remains intact in the target data store by analyzing and pinpointing any differences quickly. And QuerySurge makes it really easy for both novice and experienced team members to validate their organization’s data quickly through our Query Wizards while still allowing power users the ability to write custom code.
All with Deep-Dive reporting, a Data Health dashboard, Test Management integration, DevOps compatibility and the ability to connect to any data source. QuerySurge is everything you need to automate the testing and validation of your project’s data.
Learn more about QuerySurge here»
Revolutionize your Data Validation and Testing with QuerySurge
QuerySurge seamlessly replaces minus queries with a full-featured, reliable solution for testing & validating up to 100% of your data. QuerySurge provides immediate value by building on and optimizing your current method in the following ways:
- Validates 100% of your data quickly, storing complete data sets from both source and target data (at a 90% compression rate). This exposes all data mismatch failures, row count differences, and column type mismatch failures.
- Provides robust reports and data health dashboards for analyzing, sharing, and documenting test results.
- Eliminates the processing overhead of minus query execution on target & source servers and distributes the computing to its own, separate infrastructure.
The Benefits of QuerySurge
- Easily automate your manual testing effort for repeatability
- Provide testing across different platforms – data warehouses Hadoop and NoSQL stores, traditional databases, flat files, Excel, web services, json, XML and others
- Speed up testing up to 1,000 x while providing up to 100% data coverage
- Continuous Delivery — integrates an out-of-the-box DevOps solution for most Build, ETL & QA management software
- Deliver shareable, automated email reports and data health dashboards
- Provide a huge Return On Investment (ROI), as much as 1,600%