BI, Uncertainty, and the Two Watch paradox – Part 2

January 29, 2013

In part 1 of this post, I described what I called the “two watch” phase of BI system adoption, when discrepancies between existing reports and the results of a newly implemented BI system cause angst on the part of the client and can bring the process of system acceptance and adoption to a screeching halt.  In this second part of the post, we move from describing the problem to asking whether there are constructive steps to help everyone through it.  In other words, what can be done to get clients to the point where they feel comfortable putting that old watch back in their pocket for good, and accepting the numbers coming out of the BI system as the new single watch that tells them the time without any lingering doubts?

I think there actually are a couple of things that can help.  First, you can discuss up front the questions that are going to get asked anyway when there are discrepancies between old and new reports.  What level of accuracy is needed from the new BI system?  Does the analysis to be done require correct tallying of every single system transaction?  Are the existing reports known to REALLY be accurate?  Are they validated on an ongoing basis in any way?  What will make the client comfortable about pulling the plug on the old reports for good?  What are the essential acceptance criteria for the new system?  Discussing these issues ahead of time can instill confidence, whereas raising them only after the results of the new and old systems diverge will sound defensive and instill doubt.

However, the most critical aspect of the old and new comparison game is the ability to pinpoint differences in terms of the specific transactions causing the deltas in total numbers.  The conceptual cul-de-sac to be avoided at all costs is the deadly standoff where the new system is simply trying to match an aggregate number: “Our old report says 45,039 and the new one says 44,823.  That can’t be right!”  In this case, you are shooting in the dark, not knowing whether the old report is any good, or whether there really is a bad business rule or programming error in there somewhere.  It is critical that both the old and new reports be traceable to the individual transactions they represent, so that the transactions causing any discrepancy can be individually evaluated.  This can actually turn a negative (the reporting discrepancy) into a positive (increased customer confidence) when individual transactions can be isolated and their inclusion or exclusion explained in terms of the organization’s business rules.
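To make the idea concrete, here is a minimal sketch in Python of what tracing a delta back to individual transactions might look like.  Everything in it is an assumption for illustration: the `reconcile` function, the transaction IDs, and the amounts are hypothetical, and a real comparison would read from the old report's source and the warehouse rather than from two small dictionaries.

```python
from decimal import Decimal


def reconcile(old_rows, new_rows):
    """Compare two reports transaction by transaction.

    Each argument maps a transaction ID to the amount that report counted
    for it (a hypothetical shape chosen for this sketch).
    """
    old_ids, new_ids = set(old_rows), set(new_rows)
    return {
        # Counted by the old report but missing from the new one.
        "only_in_old": sorted(old_ids - new_ids),
        # Counted by the new report but missing from the old one.
        "only_in_new": sorted(new_ids - old_ids),
        # Present in both reports, but with different amounts.
        "amount_differs": sorted(
            txn for txn in old_ids & new_ids if old_rows[txn] != new_rows[txn]
        ),
    }


# Made-up sample data, not taken from any real system.
old_report = {
    "T001": Decimal("100.00"),
    "T002": Decimal("250.00"),
    "T003": Decimal("75.50"),
}
new_report = {
    "T001": Decimal("100.00"),
    "T003": Decimal("70.50"),
    "T004": Decimal("40.00"),
}

deltas = reconcile(old_report, new_report)
for reason, txn_ids in deltas.items():
    print(reason, txn_ids)
```

Each ID that comes back is a specific transaction that can be put in front of the client and explained against a business rule, which is a very different conversation from arguing over two totals.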

And finally, at the right time, the client needs to agree to turn off the old reports, to cut the cord to the past.  It is always tempting to keep the old ones around “just in case”, but this will always provide a lingering organizational dependence that really needs to be nipped in the bud.   If the new reports are validated, it should be in with the new and definitely out with the old. 

At the end of the day, there is that moment when this happens, when existing reports can be safely relegated to the vast scrap heap of obsolete software and everyone can settle down with the one, shiny new watch that delivers the single accepted version of reality.  And that day, rather than any other milestone in the BI project, is the real finish line we should keep our eyes on from day one.

The Last Mile Problem and Business Intelligence

July 18, 2012

The most expensive piece in a telecommunications grid is not the huge data pipes that make up national and international data networks, or even the incredibly specialized switching equipment that controls the staggering amount of data which moves across these networks, or even the back-office billing systems that somehow tally the charges for all this data.  In each of these cases, economies of scale bring the per-user price of these components into a manageable cost structure.  No, the most expensive and problematic aspect of the whole system is what is called the ‘Last Mile’, that critical connection from the telecom switch to the consumer’s home or place of business.  This is pure infrastructure, and involves digging up streets, running wires, and working with customers one at a time.  There are no economies of scale and every possible avenue to more efficiently make this connection has been explored.  My favorite examples of innovative ‘Last Mile’ solutions from my time in the telecom field are two companies from the 1990s.  One built robots that would crawl through sewer lines to run telecom cable to access points within a customer premise.  The other used free-space optical beams to send information to receivers on customer rooftops, a solution with obvious challenges when the environment is not cooperative, for example, in rain and fog!

So how does this ‘Last Mile’ problem in telecom relate to Business Intelligence?  I would argue the same connection problem exists between the BI solution (including the data warehouse, the BI infrastructure and presentation tools, the cubes and reports) and the end user.  It has struck me that there are so many amazing BI solutions out there that provide so many potentially game-changing capabilities for their users but which still, well, fail.  They fail to make an impact, fail in their adoption by business users, and fail to meet the rosy expectations of their institutional sponsors.   In my experience, the most common reason for this is the inability to effectively bridge the ‘Last Mile’ of BI solution delivery, that is, to make the connection from the infrastructure of the solution to its end users.   Some of these solutions have amazing capabilities, just like those telecom networks, but are worthless if they cannot get their content into the hands of their end users for those users to make it part of the way they do business. 

So, having drawn this parallel, what insights into achieving success with BI solutions can be gained by looking at them in this light?

I think the fundamental one is the importance of incorporating the BI ‘Last Mile’ into the overall design of the BI solution.  It is always tempting to adopt the adage from the movie ‘Field of Dreams’: “If you build it, they will come!”  But experience shows that they may not come, no matter how wonderfully it is built.  The users of a BI implementation and what they are capable of must be considered from the outset.  A realistic assessment must be made of what they will need in order to adopt the capabilities being rolled out.

Secondly, solutions to both of these problems are difficult to scale.  The analogy to the telecom guy climbing a pole or digging a ditch is the one-on-one communication that has to take place to get BI users on board and invested in the solution.  How do you bring novice users along so that they step up to what can be a daunting new challenge?  How do you convince reporting users who have adapted to existing, less-capable but known solutions that they should expend the effort to learn a new system?  How do you show everyone involved that the BI solution will help them, and is not just something imposed on them by upper management?  This involves careful design of the user-facing artifacts of the BI system, but also careful documentation and training.  When it comes right down to it, you really have to sell the solution to the user community.  In our consulting practice, we have found that one of the best ways to engage and motivate new users of a system is to take reporting problems that they struggle with and solve them as sample problems in a training session.  This obviously requires an individualized approach to training, tailored not just for a specific customer but for a particular set of users within an organization.  But the enthusiasm this engenders, and the buy-in that comes from seeing the system’s capabilities demonstrated in a well-understood domain, make it worth the effort.  Even when solving these problems requires advanced skills with the tool, skills users might not totally understand, it is a concrete demonstration that the time and effort needed to learn the system will have a meaningful payback.

And the final insight is the stark reality that having a flawed plan or no plan at all is going to be fatal to the success of the whole system, no matter how remarkable the technical solution or infrastructure underlying it may be.

The ‘Last Mile’ problem applied to the BI world is more insidious and more likely to be overlooked than the physical ‘Last Mile’ problem in the telecom world.  For all its thorniness, the telecom version at least makes it starkly obvious that some solution is necessary to bridge the physical gap from switch to home or office.  It is far easier to delude oneself that the BI baseball field needs only to be constructed in an Iowa cornfield and that users will emerge like ghostly Chicago White Sox and start running down fly balls, or rather discovering business insights from the BI solution.  With all due respect to Kevin Costner, that just isn’t likely to happen.

Big Data – Deep or Wide?

May 29, 2012

There has been a revolution going on in the Business Intelligence (BI) world in recent years.  Those who follow the trends in BI and data warehousing are probably aware of the growing interest in a wave of database systems expressly developed to analyze the unbelievably huge data stores created by the maturing internet juggernauts.  Companies such as Google, Facebook, Amazon, and Yahoo now want to analyze hundreds or even thousands of terabytes of data that they consider essential to their business.  Welcome to the brave new world of “Big Data”.  Technologies such as NoSQL database systems and MapReduce algorithms, and products such as Hadoop, Hive, and Pig, seem to be becoming more and more mainstream, and consequently more and more the topic of discussion on blogs and at conferences.

The question is, how much of this really pertains to the world of higher education management systems, i.e. the institutions that run SunGard HE, Datatel (both now Ellucian), Jenzabar, Campus Management, and PeopleSoft systems?  Aren’t they also struggling to make sense of copious amounts of data?   As someone who has worked with BI and reporting in this space for most of the last decade, I find the focus on the “Big Data” solutions a bit frustrating, because I see these tools addressing a different problem than that faced in the higher education BI world.   This may seem a little counter-intuitive, as there certainly is more data than ever involved in running our campuses and institutional systems.   Wouldn’t tools focused on “Big Data” help us too?

To illustrate, if you think of data as a swimming pool, the typical “Big Data” applications work with swimming pools that are very, very deep, and contain a whole lot of water.  The ability to pump lots and lots of water is what the job is all about.  On the other hand, I see our data in the higher education management space as being a swimming pool that is not very deep, comparatively, but which has an incredibly broad surface area.  The overall volume of water is not comparable to those “Big Data” swimming pools, but the surface area may be much greater and the structure and interrelationships of the different parts of the “swimming pool” are very complex.

Typically, an institution is not dealing with mammoth volumes of administrative data (unless it is a really big school doing clickstream analysis on its websites and learning management system, perhaps).  The total number of customers at our enterprises (our students) and the number of items they typically buy (classes, housing, meal plans) are relatively modest, again compared to the Amazons of the world.  However, the variety of types of data we deal with is huge, ranging from housing preferences to complicated faculty contract and tenure payments to accounts payable records to course prerequisite and degree requirement rules.  The list of business transactions that occur in the management of an institution is incredibly diverse and complex.  It is a wide swimming pool of data with a huge surface area, though as I said, perhaps not that deep at any point.

As someone working in the higher education reporting world, I am looking for support not for “Big Data”, but for what I think of as this “Wide Data” paradigm.   Rather than tools that support incredible throughput on massive data sets, this implies a need for  tools that help in the analysis of complex data sets. In particular, these tools should make us more nimble in quickly modeling and integrating new data sources into BI and data warehousing environments. This data must be readily available for our reporting and analytic delivery to our end-users. 

There is another trend in the BI world which may prove much more fruitful for our future endeavors, in my view.  This is the emergence of in-memory databases such as SAP HANA, or even Microsoft’s PowerPivot and BI Semantic Model, which are essentially making the whole idea of pre-aggregated measures a thing of the past.  But more on that in a future post.

 

Should “bad” data be corrected in the Data Warehouse? Of course! Or should it?

March 5, 2012

I ran across an interesting issue the other day while developing a new report for a client from their data warehouse.  As I was going through the validation process, I discovered that there were two pieces of data in some of the records that didn’t agree, yet by definition (on the surface at least) they should have.

The specific example was a college student with a term registration status of “first time student” in the “2009 Fall” term (as opposed to “returning” which is assigned to subsequent registration terms),  yet the student had an entry term cohort value of “2001 Fall”.  The entry term cohort value didn’t match the term value  selected in my query of “first time” students registered in “2009 Fall”.

How could this happen?  How could the data have ended up this way in the data warehouse?  A quick data validation query showed a clear discrepancy in about 10% of the records: the student had a “first time” term registration status, but that registration term was not their entry cohort term.
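For readers who want to try a similar check, here is a rough Python sketch of the kind of validation I mean.  It is not the query I actually ran: the field names (`student_id`, `term`, `status`, `entry_cohort`) and the three sample rows are made up for illustration, and in practice the rows would come from the warehouse.

```python
def cohort_mismatches(registrations):
    """Return "first time" registrations whose term is not the entry cohort term."""
    return [
        row
        for row in registrations
        if row["status"] == "first time" and row["term"] != row["entry_cohort"]
    ]


# Hypothetical rows; the field names are assumptions made for this sketch.
registrations = [
    {"student_id": "S1", "term": "2009 Fall", "status": "first time", "entry_cohort": "2009 Fall"},
    {"student_id": "S2", "term": "2009 Fall", "status": "first time", "entry_cohort": "2001 Fall"},
    {"student_id": "S3", "term": "2009 Fall", "status": "returning", "entry_cohort": "2008 Fall"},
]

first_time = [row for row in registrations if row["status"] == "first time"]
mismatches = cohort_mismatches(registrations)

print(
    f"{len(mismatches)} of {len(first_time)} first-time registrations "
    "disagree with the entry cohort"
)
for row in mismatches:
    print(row["student_id"], row["term"], row["entry_cohort"])
```

Listing the individual students, not just the count, is what made the follow-up conversation with the client possible.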

Further research and discussion with the client revealed that their business process dictates that after certain periods of absence the student is required to reapply and is therefore treated as a “first time student” again, even though the student’s original entry term cohort value is not changed, either manually or automatically.  So what should be used to query for those entering for the first time?  The Registration Status or the Entry Cohort?  Each returns slightly different results.  The question now before the data governance team is what to do with the “Entry Cohort” values in this case, and what the impact of a business rule change would be, given the apparent conflict of definition between these two data elements and the different purposes they serve, often for different business units at the college.

Ultimately, it raises the question of what to do about the data in the data warehouse.  Should it be changed?  Or should it be left as is?  It is an interesting dilemma that also applies to even simpler scenarios, such as a student’s ethnicity being entered incorrectly and then corrected a week later.  The data warehouse will capture these data changes and show the student with the “incorrect” ethnicity for that period of time.  If a report showing the breakdown of students by ethnicity is requested for that very time period, the report would be “incorrect”.
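Here is a small Python sketch of why that happens, assuming the warehouse keeps one history row per version of the value with effective dates.  That storage design is my assumption for the example, not a description of the client’s warehouse, and the dates, the codes “Code A” and “Code B”, and the `value_as_of` helper are all made up.

```python
from datetime import date

# Hypothetical history rows: one row per version of the value.
# valid_to is exclusive, and None marks the current version.
history = [
    {
        "student_id": "S1",
        "ethnicity": "Code A",  # value keyed in by mistake
        "valid_from": date(2012, 1, 9),
        "valid_to": date(2012, 1, 16),
    },
    {
        "student_id": "S1",
        "ethnicity": "Code B",  # corrected value entered a week later
        "valid_from": date(2012, 1, 16),
        "valid_to": None,
    },
]


def value_as_of(rows, student_id, as_of):
    """Return the value the warehouse held for the student on a given date."""
    for row in rows:
        starts_in_time = row["valid_from"] <= as_of
        still_valid = row["valid_to"] is None or as_of < row["valid_to"]
        if row["student_id"] == student_id and starts_in_time and still_valid:
            return row["ethnicity"]
    return None  # no version existed yet on that date


print(value_as_of(history, "S1", date(2012, 1, 10)))
print(value_as_of(history, "S1", date(2012, 1, 20)))
```

A report run “as of” a date inside the first week faithfully returns the value that was keyed in by mistake.  The warehouse is doing exactly what it was designed to do, which is what makes the question of correcting it a real dilemma.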

This caused me to do some thinking and searching for other people’s wisdom on this topic.  It seems there is no clear consensus on the best approach, with pros and cons on both sides.  In the first example, the one I came across, this type of discrepancy can be captured in data quality exception reports during ETL.  It can then either be fixed automatically to match an agreed upon business rule, or left to the business users to review and correct manually in the source transaction system so that the fix flows into the warehouse during the next load.  The second example, however, has no automated way of correcting, since there is no way for the system to truly know what the correct data value should be.
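A data quality exception report of this kind can be sketched in a few lines of Python.  Treat this as an illustration under stated assumptions rather than a recipe: the rule names, the field names, and the `exception_report` function are hypothetical, and a real ETL tool would have its own way of expressing such checks.

```python
# Each rule is a (name, check) pair; the check returns True when a row passes.
# The rule names and field names are assumptions made for this sketch.
rules = [
    (
        "first_time_matches_entry_cohort",
        lambda row: row["status"] != "first time" or row["term"] == row["entry_cohort"],
    ),
    ("ethnicity_present", lambda row: bool(row.get("ethnicity"))),
]


def exception_report(rows, rules):
    """List (student_id, rule_name) for every rule a row fails.

    Rows are only flagged here, never altered; deciding whether to fix
    them in ETL or in the source system is left to the business.
    """
    return [
        (row["student_id"], name)
        for row in rows
        for name, passes in rules
        if not passes(row)
    ]


# Made-up rows standing in for one ETL batch.
batch = [
    {"student_id": "S1", "term": "2009 Fall", "status": "first time",
     "entry_cohort": "2009 Fall", "ethnicity": "Code A"},
    {"student_id": "S2", "term": "2009 Fall", "status": "first time",
     "entry_cohort": "2001 Fall", "ethnicity": "Code B"},
]

report = exception_report(batch, rules)
for student_id, rule_name in report:
    print(student_id, rule_name)
```

Notice what the sketch cannot do.  A rule can catch the cohort conflict, or a missing ethnicity, but a plausible ethnicity value that simply happens to be wrong passes every check.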

The decision to correct data anomalies of this nature depends somewhat on the business requirements and the general philosophy on data quality.  There is a distinct argument for saying “leave it as is, since the warehouse is meant to represent the historical state of transaction data and the errors will not create enough of a variance to change decisions made by analysis of that data”.  Others argue that data warehouses are meant to provide as clean an input as possible to business decisions, and that the data should be cleaned when it is known to be wrong.  This blog post has a good summary of the perspectives, with thoughts from two of the founders of data warehousing, Ralph Kimball and Bill Inmon.

What do you think? I’d be interested in your comments and experiences.