A pharmaceutical firm organizes its safety data by pulling information from multiple source systems into a data warehouse. At the time of this writing, the data warehouse holds in the range of 1,500,000 cases. An average case size is roughly about 67 KB of data, which, with case versioning, brings the size of the warehouse to about 500 Gigabyte (GB). The goal of the software quality effort is to verify the data integrity of each new build of the ETL (Extract, Transform, Load) and database code by sampling the data in the source and target systems, with a target sampling for each run of 50% of the case data.
In the Pharmaceutical industry, where the correctness of Safety Data is paramount for reasons of both public health and compliance with Federal regulation, the issue of data quality takes on even greater proportions.
- Testing large amounts of data that are derived from multiple sources into the main data warehouse
- Produce audit trail documentation and reporting for compliance with regulations and data quality standards
- Deploy a data quality solution that will scale to support increasing growth in data volume, data warehouse users, and complexity of business intelligence demands
Implement a high volume data sampling effort that allows upwards of 50% of the data to be verified for each build.
- RTTS used QuerySurge™ to set up SQL-based data comparisons between source systems and the data warehouse.
- Approximately 1000 SQL queries were written against the combined systems to compare case data from the source systems to the analogous data in the data warehouse after ETL.
- Approximately 45% of the case data was sampled in each run
- QuerySurge™ was used to vet the data quality of the system after ETL for each build of the system.
During this development cycle, 14 builds were deployed by the development team and vetted by the software quality team. Data integrity runs were performed on each build in which new ETL code was deployed, sampling ~45% of the total case data (ca. 700,000 cases) on each build.
Data throughput on the system is increased simply by increasing the number of QuerySurge agents used in a run. For the runs described here, 11 clients were used and a complete run finishes in ~72 hours. This permitted the development team to deploy ca. 2 builds per week, which the software quality team could provide solid “build health” numbers for.
During the development cycle, approximately 115 defects were discovered and remediated. As is expected, discovery of defects leveled off toward the end of the development cycle.
During the build cycles, the defect rate per build was monitored as new functionality was delivered to the software quality team. Because of the complexity of the application, the delivery of new functionality often caused defects to appear in existing functionality. Build 3 was clearly a highly problematic build. Around build 9, the team hit the “coalescence point” where the defect rate was reduced with each build. The final builds brought the number of defects to 0 and the release candidate was deployed.
- By using the automated ETL testing tool, QuerySurge, the team was able to increase testing coverage by approximately 10-fold and decreased testing time by a factor of 3.
- The software quality team was able to vet each build at the same high level, so build quality could be compared and the development trend was available to the whole team.
- Automated data verification of the source-to-data warehouse leg for each source permitted rapid localization of defects to specific blocks of ETL code.
- Using a large sample size led to significantly high confidence interval results.