Should "bad" data be corrected in the Data Warehouse? Of course! Or should it?

Mar 05, 2012

I ran across an interesting issue the other day while developing a new report for a client from their data warehouse. As I was going through the validation process, I discovered that there were two pieces of data in some of the records that didn't agree, yet by definition (on the surface at least) they should have.

The specific example was a college student with a term registration status of "first time student" in the "2009 Fall" term (as opposed to "returning" which is assigned to subsequent registration terms), yet the student had an entry term cohort value of "2001 Fall." The entry term cohort value didn't match the term value selected in my query of "first time" students registered in "2009 Fall."

How could this happen? How could the data have ended up this way in the data warehouse? A quick data validation query showed a clear discrepancy: on about 10% of records, the student had a "first time" term registration status, but that registration term was not their entry cohort term.
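The validation check described above can be sketched in a few lines. This is a hypothetical illustration, not the client's actual query; the field names (`reg_status`, `term`, `entry_cohort`) and the sample values are assumptions modeled on the example in this post.

```python
# Flag records where a student's term registration status is "first time"
# but the registration term does not match their entry cohort term.
# Field names and sample data are illustrative assumptions.

def find_cohort_discrepancies(records):
    """Return records flagged 'first time' whose term != entry cohort."""
    return [
        r for r in records
        if r["reg_status"] == "first time" and r["term"] != r["entry_cohort"]
    ]

students = [
    {"id": 1, "reg_status": "first time", "term": "2009 Fall", "entry_cohort": "2009 Fall"},
    {"id": 2, "reg_status": "first time", "term": "2009 Fall", "entry_cohort": "2001 Fall"},
    {"id": 3, "reg_status": "returning",  "term": "2009 Fall", "entry_cohort": "2005 Fall"},
]

discrepancies = find_cohort_discrepancies(students)
print(discrepancies)  # the second student is the kind of record the validation query surfaced
```

In a real warehouse this would of course be a SQL comparison across the fact and dimension tables rather than an in-memory filter, but the rule being tested is the same.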

Further research and discussion with the client revealed that their business process dictates that after certain periods of absence a student is required to reapply and is therefore treated as a "first time student" again, even though the student's original entry term cohort value is never changed, manually or automatically. So what should be used to query for those entering for the first time? The registration status? Or the entry cohort? Each returns slightly different results. The question is now before the data governance team: what should be done with the "Entry Cohort" values in this case, and what is the impact of making a business rule change, given the apparent conflict of definition between these two data elements and the different purposes they serve, often for different business units at the college.

Ultimately, it raises the question of what to do about the data in the data warehouse. Should it be changed? Or should it be left as is? It is an interesting dilemma that applies even to simpler scenarios where data was entered incorrectly, such as a student's ethnicity being keyed in wrong and corrected a week later. The data warehouse will capture these changes and show the student with the "incorrect" ethnicity for that period of time. If a report showing the breakdown of students by ethnicity is requested for that very time period, the report would be "incorrect."
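The ethnicity scenario is exactly what a history-keeping warehouse (for example, a type 2 slowly changing dimension) is designed to capture: an as-of query returns whatever value was on record on the requested date, including the "incorrect" one. The sketch below assumes effective-dated rows with hypothetical dates and field names.

```python
from datetime import date

# Effective-dated history rows for one student. The first row is the
# incorrect entry; the second is the correction made a week later.
# Dates, field names, and values are illustrative assumptions.
history = [
    {"student_id": 7, "ethnicity": "Hispanic", "valid_from": date(2011, 9, 1), "valid_to": date(2011, 9, 8)},
    {"student_id": 7, "ethnicity": "Asian",    "valid_from": date(2011, 9, 8), "valid_to": date(9999, 12, 31)},
]

def ethnicity_as_of(history, as_of):
    """Return the ethnicity that was on record on the given date."""
    for row in history:
        if row["valid_from"] <= as_of < row["valid_to"]:
            return row["ethnicity"]
    return None

print(ethnicity_as_of(history, date(2011, 9, 5)))   # the value during the "incorrect" week
print(ethnicity_as_of(history, date(2011, 10, 1)))  # the corrected value
```

A point-in-time report run against the first window faithfully reproduces the error, which is precisely the dilemma: the warehouse is being accurate about history while being "wrong" about the student.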

This caused me to do some thinking and searching for other people's wisdom on this topic. It seems there is no clear consensus on the best approach, with pros and cons on both sides. For the first example, this type of discrepancy can be captured in data quality exception reports during ETL and either fixed automatically to match an agreed-upon business rule or left to the business users to review and correct manually in the source transaction system, so that the fix flows into the warehouse during the next load. The second example, however, has no automated way of correcting, since there is no way for the system to truly know what the correct data value should be.
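The ETL-time handling described above might look something like this sketch: records failing the cohort rule are either routed to an exception report for manual review, or, if the governance team has agreed on a rule, corrected during the load. The rule shown (treat a repeat "first time" flag as "returning") is only one possible choice and is an assumption for illustration, as are the field names.

```python
# Hedged sketch of ETL-time exception handling: flag or auto-fix records
# that violate the cohort rule. The specific fix applied is a hypothetical
# example of an agreed business rule, not the client's actual decision.

def apply_cohort_rule(record, auto_fix=False):
    """Return (record, exception_message) after checking the cohort rule."""
    if record["reg_status"] == "first time" and record["term"] != record["entry_cohort"]:
        if auto_fix:
            # One possible agreed rule: the entry cohort wins, so the
            # status is rewritten to "returning" during the load.
            return dict(record, reg_status="returning"), None
        # Default: leave the record as-is and report it for manual review.
        return record, f"cohort mismatch: {record['term']} vs {record['entry_cohort']}"
    return record, None

rec = {"id": 2, "reg_status": "first time", "term": "2009 Fall", "entry_cohort": "2001 Fall"}

_, exc = apply_cohort_rule(rec)                 # flag for the exception report
fixed, _ = apply_cohort_rule(rec, auto_fix=True)  # or apply the agreed rule in-flight
print(exc)
print(fixed["reg_status"])
```

Which branch to take is the governance question, not a technical one; the code only makes the two options concrete.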

The decision to correct data anomalies of this nature depends somewhat on the business requirements and the general philosophy on data quality. There is a distinct argument for saying "leave it as is, since the warehouse is meant to represent the historical state of transaction data, and the errors will not create enough variance to change decisions made from analysis of that data." Others argue that data warehouses are meant to provide the cleanest possible inputs to business decisions, and that data should be corrected when it is known to be wrong. This blog post has a good summary of the perspectives, with thoughts from some of the founders of data warehousing, Ralph Kimball and Bill Inmon.

What do you think? I'd be interested in your comments and experiences.