When does predictive analytics go too far?

March 20, 2013

This recent news item about Target using analytics to target (pun intended) promotions to newly pregnant mothers, and the controversy surrounding it, illustrates a profound dilemma. How should data be used for predictive purposes? The privacy issues and the loss of control over the sharing of personal information, not to mention the risks of unexpected consequences, raise serious questions for those of us in the industry who develop the models to answer business questions of apparent importance to an organization.

Target’s business question seems innocent enough: determine as quickly as possible which customers are likely to be pregnant and interested in certain products and promotions, to capture their purchases and loyalty before losing them to the competition. But at what cost? In the case of the father who found out his teen daughter might be pregnant because of coupons sent to the home before she shared the information with the family, Target is facing more than just a public relations challenge. A false positive for this family might have created a bit of an awkward firestorm at home. In this case, the correct prediction did more than create a firestorm: it changed their lives and took choice and control away from the customer. Is that really the desired result?

What should we be asking? What is the appropriate use of source data? What are the possible implications of accurate predictions and false positives? False positives in a predictive model to identify fraudulent tax refunds might only embarrass the taxpayer or delay processing while the return is scrutinized in a deeper review. There may be no lasting damage other than a frustrated taxpayer. Furthermore, correct predictions may have no negative consequences for the tax agency, but appropriately negative consequences for the perpetrator of refund fraud. Determining which students may be likely to drop out of university, accept an offer of admission to a program, or be delinquent in tuition payments also seems relatively innocuous.
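One way to make these questions concrete is to weigh each outcome of a model by its consequence rather than just counting hits and misses. The sketch below is purely illustrative: the outcome counts and the per-case "harm" scores are invented placeholders, not figures from Target, a tax agency, or any real model.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    name: str
    count: int            # hypothetical number of cases with this outcome
    harm_per_case: float  # hypothetical harm score to the customer (0 = none)


def total_harm(outcomes):
    """Sum count * harm_per_case over all outcomes of a model."""
    return sum(o.count * o.harm_per_case for o in outcomes)


# Hypothetical refund-fraud model: a false positive means a delay, little else.
tax_model = [
    Outcome("true positive (fraud caught)", 90, 0.0),
    Outcome("false positive (honest filer delayed)", 10, 1.0),
]

# Hypothetical pregnancy-promotion model: even a correct prediction can harm.
promo_model = [
    Outcome("true positive (private fact revealed)", 90, 5.0),
    Outcome("false positive (awkward mailing)", 10, 1.0),
]

print(total_harm(tax_model))    # 10.0
print(total_harm(promo_model))  # 460.0
```

Both made-up models have the same accuracy, yet very different totals. Assigning the harm scores is the hard part, and it is a business and ethical judgment rather than a modeling one.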

What do we ever really know about what organizations might be doing with information collected about us? Very little. Should this level of use be disclosed and required in privacy notices? Should it depend on the type of use? Recently when my mortgage was sold to another servicer, I received a privacy disclosure that made it very clear that I had no rights or choice on how the bank used my personal and loan data for internal purposes. The notice pointed out this was legal under federal law. I only could indicate my preferences for how data was used with affiliates and how it was shared outside the bank. That still leaves the problem of how they might use personal data internally for their own predictive modeling that I may find inappropriate.

As BI professionals we should consider more than just the technical accuracy of a predictive model and the selected target variable. It would also seem appropriate to consider privacy, potential consequences, and whether the end customer has a say in how that data is used for decision-making purposes. Perhaps the Golden Rule would be a prudent test.

BI, Uncertainty, and the Two Watch Paradox – Part 1

December 14, 2012

There is a well-known quote about certainty that goes something like “A person with one watch always knows what time it is; a person with two watches is never sure”. I was struck the other day by the pertinence of this quote to the process of working with clients who are implementing a new BI solution which replaces existing reporting tools.

Many organizations are like the man with one old watch – they may not have much in the way of reporting, and the reports may be labor-intensive and involve combining the results of multiple queries on different source systems, and may be of indeterminate accuracy, but the end results tend to take on a kind of hallowed authority and become the organization’s defining vision of reality.  These organizations “know” what time it is, and accurately enough for everyday purposes, it seems. 

Whatever the promise of a new BI system, whatever savings in staff effort, broader reporting scope, expanded data visualization capabilities and increased accuracy it delivers through state-of-the-art ETL, OLAP databases, and automated, scheduled report generation, it will inevitably need to deal with the fact that, at least initially, it is the proverbial second watch, the one that shatters the certainty of “really” knowing what time it is. Given the complexity of the organization’s business rules, and the effect of applying these complex rules to arcane data configurations, the values of accepted metrics will differ from the results of previously accepted reports, and this variance will cause angst.

Getting through this “two watch” phase is critical, and it is made more difficult because the watch that is invariably believed is the one which has been depended upon for all the preceding years. In fact, projects can get permanently bogged down trying to exactly match these pre-existing reports as de facto acceptance criteria for the new system.

In the next post, I’ll discuss some ideas on getting through this problematic phase.

Moonlight and Other Correlated Factors

November 5, 2012

Today’s Financial Times “Weekly Review of the Fund Management Industry” has an interesting front page article that really struck me. The article describes how a firm which specializes in longevity research for pension funds recently discovered a spike in death rates when more than half of the moon is visible in the night sky.

My immediate reaction, and one that I think is relevant for any of us in the field of research and predictive modeling, is “Who even thought of data on the moon phases as an input variable to this research??”  Some think predictive modeling is an automatic magical black box exercise. But it really is just math and depends on the capacity to throw the net wide, so to speak, across a range of seemingly completely unrelated data to see if patterns emerge.

Now, does knowing that there is a higher death rate at certain points in the lunar cycle help with predicting longevity? The article suggests not, but it does help with predicting payout patterns, which is of concern to pensions as well.

Now, let’s see… what kinds of things might cause students to drop out? Donors to increase giving? Students to default on Financial Aid payments?? Maybe that crazy full moon has something to do with it! And now I have the King Harvest “Dancing in the Moonlight” song stuck in my head…

What was the question?

September 24, 2012

The answer is 42. Fans of “The Hitchhiker’s Guide to the Galaxy” know the story of how interstellar traveller Arthur Dent, while visiting another planet, learns that another race of beings had created a supercomputer to answer the ultimate question of the meaning of Life, the Universe and Everything. It took millions of years to compute the answer, but by that time nobody really remembered the question.

This often happens when working with BI solutions that have been established for a while. It can also be an issue while trying to gather requirements. People often become so focused on the answer (and particularly on whether it is “right” or not) that they forget the question. What is the real purpose of the data? What do you do with the analysis? How will it change your behavior and decision making? What will you do differently in the interaction with the student or constituent knowing the answer you have?

Every so often in any Business Intelligence (BI) program it is essential to step back and consider: “What was the question?” If you cannot honestly come up with a purpose for a report, measure, or Key Performance Indicator (KPI), then maybe it is time to retire it and review what is really needed to answer the new questions at hand. Don’t wait until everyone forgets and none of the information is meaningful.

The Last Mile Problem and Business Intelligence

July 18, 2012

The most expensive piece in a telecommunications grid is not the huge data pipes that make up national and international data networks, or even the incredibly specialized switching equipment that controls the staggering amount of data which moves across these networks, or even the back-office billing systems that somehow tally the charges for all this data. In each of these cases, economies of scale bring the per-user price of these components into a manageable cost structure. No, the most expensive and problematic aspect of the whole system is what is called the ‘Last Mile’, that critical connection from the telecom switch to the consumer’s home or place of business. This is pure infrastructure, and involves digging up streets, running wires, and working with customers one at a time. There are no economies of scale, and every possible avenue to make this connection more efficiently has been explored. My favorite examples of the innovative solutions developed to solve the ‘Last Mile’ problem when I worked in the telecom field were a company in the 90’s that built robots which would crawl through sewer lines to run telecom cable to access points within a customer’s premises, and another which used free-space optical beams to send information to receivers on customer rooftops, a solution with obvious challenges when the environment is not cooperative, for example in rain and fog!

So how does this ‘Last Mile’ problem in telecom relate to Business Intelligence?  I would argue the same connection problem exists between the BI solution (including the data warehouse, the BI infrastructure and presentation tools, the cubes and reports) and the end user.  It has struck me that there are so many amazing BI solutions out there that provide so many potentially game-changing capabilities for their users but which still, well, fail.  They fail to make an impact, fail in their adoption by business users, and fail to meet the rosy expectations of their institutional sponsors.   In my experience, the most common reason for this is the inability to effectively bridge the ‘Last Mile’ of BI solution delivery, that is, to make the connection from the infrastructure of the solution to its end users.   Some of these solutions have amazing capabilities, just like those telecom networks, but are worthless if they cannot get their content into the hands of their end users for those users to make it part of the way they do business. 

So, having drawn this parallel, what insights can BI solutions gain from looking at the problem in this light?

I think the fundamental one is the importance of building the BI ‘Last Mile’ into the overall design of the BI solution. It is always tempting to adopt the adage from the movie ‘Field of Dreams’: “If you build it, they will come!” But experience shows that they may not come, no matter how wonderfully it is built. The users of a BI implementation and what they are capable of must be considered from the outset. A realistic assessment must be made of what they will need in order to adopt the capabilities being rolled out.

Secondly, solutions to both of these problems are difficult to scale. The analogy to the telecom guy climbing a pole or digging a ditch is the one-on-one communication that has to take place to get BI users on board and invested in the solution. How do you bring along novice users to step up to what can be a daunting new challenge? How do you convince reporting users who have adapted to existing, less-capable but known solutions that they should expend the effort to learn a new system? How do you show everyone involved that the BI solution will help them, and is not just something imposed on them by upper management? This involves careful design of the user-facing artifacts of the BI system, but also careful documentation and training. When it comes right down to it, you really have to sell the solution to the user community. In our consulting practice, we have found that one of the best ways to engage and motivate new users of a system is to take reporting problems that they struggle with and solve them as sample problems in a training session. This obviously requires an individualized approach to training, tailored not just for a specific customer but for a particular set of users within an organization. But the enthusiasm that this engenders, and the system buy-in that comes from demonstrating the system’s capabilities in a well-understood domain, make it worth the effort. Even when solving these problems requires advanced skills with the tool, skills users might not totally understand, it is a concrete demonstration that the time and effort needed to learn the system will have a meaningful payback.

And the final insight is the stark reality that having a flawed plan or no plan at all is going to be fatal to the success of the whole system, no matter how remarkable the technical solution or infrastructure underlying it may be.

The ‘Last Mile’ problem applied to the BI world is more insidious and more likely to be overlooked than the physical ‘Last Mile’ problem in the telecom world.  With all its thorniness, it is starkly obvious that some solution is necessary to bridge the physical gap from switch to home or office.  However, it is far easier to delude oneself that the BI baseball field needs only to be constructed in an Iowa cornfield and that users will emerge like ghostly Chicago White Sox and start running down fly balls, or rather discovering business insights from the BI solution.  With all due respect to Kevin Costner, that just isn’t likely to happen.

When More Complicated is Actually Easier

June 14, 2012

Many institutions struggle with their reporting and analytics deployment. They face a dilemma about how to roll out self service to users and still meet complex reporting requirements. These requirements appear to need the “high cost and high touch” of IT support for those users to be successful.

I hear frequently from BI project leaders that they’ve tried to give users the ability to create their own reports, but most can’t figure out how to do it with the tools and training provided. The problem is that an assumption has been made (often perpetuated by the marketing and sales messages of the BI and ERP vendors themselves) that their drag and drop reporting and available templates are easy for anyone to use.

Let’s break apart that conventional wisdom, however, and dig a little deeper into this paradox. True enough, conceptually, many of these environments such as SAP Business Objects WebIntelligence, which is the core technology in Datatel’s (now Ellucian) Reporting and Operating Analytics (DROA) solution, are designed for casual users and ease of creating reports with advanced interactivity. The features of Cognos used with the Banner Enterprise Data Warehouse (EDW) are similar and you’re faced with a similar dilemma. Pick your favorite BI tool to use with the complex data of a mature ERP system: Tableau, SAS, even Excel. You name it, soon you’ll come to a brick wall.

The problem is, many reports users need are quite complex and don’t fit neatly into the typical approach of trying to drag and drop all the necessary data and filters into a single query. Yet, that is the way that most projects approach the training and roll out. Everyone is lulled into the “it should be easy and all the data is here in one place to query” effect!

Take this real world example:  the Graduate Studies division needs a report at the end of each term to determine those who are not eligible to continue. They need a list of students who are actively enrolled in the Graduate level, in certain programs, have more than one C grade in a 500 level course in the term, and alongside this info list their General Academic Advisor and not their program advisor.

There are at least 4 areas of data needed, namely, the student, registrations, enrolled program, and assigned advisors. Worse, many of the filters and conditions only apply to certain pieces of the data. One might think since they are all tied together (and they are, albeit loosely) by the student ID, shouldn’t all this data be able to be dragged at once into the query? Magically there should be a result. There will be one, but not at all what is expected. Suddenly, IT is needed to help build this report!

Instead of trying to do this all in one query, it is much easier to break it apart into four distinct pieces. In WebIntelligence, each piece can be defined in its own query and tied to the others with “advanced” query techniques. This approach actually expresses the reporting requirements more naturally. One query is the list of students in a program. Another is the count of C grades, but only for those in the query result of students in the graduate programs for that term. WebIntelligence allows you to filter on the results of another query in this way quite easily. There are also sub-query and combined-query capabilities. Any report can combine data from more than one query and even more than one data source. These are very “complex” and powerful features, but they actually make the reporting problem simpler because they break it down into smaller, manageable pieces.
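To illustrate the idea outside of any particular BI tool, here is a minimal Python sketch of the Graduate Studies report built as four small pieces, each filtering on the result of the one before. This is not WebIntelligence syntax. The sample rows, field names, program codes, and the `ineligible_students` function are all hypothetical, and the sketch assumes three-digit course numbers in a `SUBJECT-NUMBER` format.

```python
# Hypothetical sample data; field names are illustrative, not a real ERP schema.
students = [
    {"id": 1, "level": "GR", "status": "active"},
    {"id": 2, "level": "GR", "status": "active"},
    {"id": 3, "level": "UG", "status": "active"},
]
programs = [
    {"student_id": 1, "program": "MBA"},
    {"student_id": 2, "program": "MSCS"},
    {"student_id": 3, "program": "BA.HIST"},
]
registrations = [
    {"student_id": 1, "term": "2012SP", "course": "ACCT-510", "grade": "C"},
    {"student_id": 1, "term": "2012SP", "course": "FIN-520", "grade": "C"},
    {"student_id": 2, "term": "2012SP", "course": "CS-501", "grade": "C"},
    {"student_id": 2, "term": "2012SP", "course": "CS-530", "grade": "A"},
    {"student_id": 3, "term": "2012SP", "course": "HIST-510", "grade": "C"},
    {"student_id": 3, "term": "2012SP", "course": "HIST-520", "grade": "C"},
]
advisors = [
    {"student_id": 1, "type": "general", "name": "Dr. Lee"},
    {"student_id": 1, "type": "program", "name": "Dr. Ortiz"},
    {"student_id": 2, "type": "general", "name": "Dr. Shah"},
]


def ineligible_students(term, target_programs):
    # Piece 1: actively enrolled graduate students.
    grads = {s["id"] for s in students
             if s["level"] == "GR" and s["status"] == "active"}

    # Piece 2: of those, the students in the programs of interest.
    in_program = {p["student_id"]: p["program"] for p in programs
                  if p["student_id"] in grads and p["program"] in target_programs}

    # Piece 3: count C grades in 500-level courses for the term,
    # but only for students returned by piece 2.
    c_counts = {}
    for r in registrations:
        course_number = r["course"].split("-")[1]
        if (r["student_id"] in in_program and r["term"] == term
                and r["grade"] == "C" and course_number.startswith("5")):
            c_counts[r["student_id"]] = c_counts.get(r["student_id"], 0) + 1

    # Piece 4: the general (not program) advisor for each student.
    general = {a["student_id"]: a["name"] for a in advisors
               if a["type"] == "general"}

    # Tie the pieces together on student ID; keep students with more than one C.
    return [(sid, in_program[sid], n, general.get(sid))
            for sid, n in sorted(c_counts.items()) if n > 1]


print(ineligible_students("2012SP", {"MBA", "MSCS"}))
# [(1, 'MBA', 2, 'Dr. Lee')]
```

Each piece can be run and checked on its own, which is the point: a wrong count in piece 3 is far easier to spot than a wrong row count in one large combined query.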

In fact, taking the seemingly more complex approach and teaching users how to do it actually makes them more likely to create the desired reports successfully and correctly. Part of the reason for this is that users think of their requirements in small chunks. Sometimes, they even forget to define very important little chunks and a query may run on the whole database! Any time you break a data problem down into smaller pieces it is easier to define and verify results. The query statements can be better expressed as subsets of data and filters that are linked together by some common identifier such as the Student ID or a term ID. For Datatel users familiar with UniQuery or QueryBuilder used with the UniData database, this is fundamentally the same concept as savedlists and using the results of one query to select for another.

Take your pick of example. Maybe you need only those students with a GPA of 3.0 or higher, or 20 credits after 18 months, or have a Pell FA award. The list goes on. This type of query problem is also not unique to higher education. In any case, don’t hesitate to introduce users to advanced query concepts. This will make them more self reliant and can reduce support requirements.

John Marsh Joins ASR Team

April 18, 2012

Potomac, MD, April 18, 2012 – ASR is pleased to announce the addition of a new Managing Consultant, John Marsh. He was formerly the Lead Software Designer/Developer for Business Intelligence and Reporting Solutions at Datatel (now Ellucian). In his role at Datatel he created advanced data model, reporting, and analytic designs to support the needs of a diverse client base of nearly 800 colleges and universities across North America. Prior to his work at Datatel he worked at Qwest, where he was heavily involved in data warehousing design and implementation.

At ASR John has joined the Higher Education practice to implement a variety of Business Intelligence (BI) solutions for our growing client base.  Some of the projects currently underway include implementation and knowledge transfer of the popular SAP Business Objects Enterprise, SAS Enterprise BI, and Microsoft BI platforms.

Beyond the technology itself, many institutions are struggling with getting data across their numerous systems organized, defined, and delivered to internal and external constituents. Our “analytic accelerators” combine higher education expertise and data and systems knowledge with predefined template designs for student enrollment and retention analysis, and for student outcomes analysis covering course completion and success. Coupled with recommendations for Data Governance processes, this approach covers three key success factors: people, process, and technology.

One of the more interesting projects currently underway begins to address the void of information about a student once they leave an institution. Student “swirl”, as it is called, presents even greater challenges for institutions trying to understand the ultimate success of a student in achieving their educational goal. This is particularly true for community colleges, where it is common for a student to come for a couple of semesters or years and then transfer to a four-year university. Or they may have regular stop-outs and take courses at other nearby institutions as convenience and the demands of life dictate.

By combining enrollment data with National Student Clearinghouse (NSC) enrollment and degree verification data, an institution can get a more comprehensive picture of the long term outcomes for their students after they leave. The results of this project will provide a whole new level of measurement of long term persistence and graduation rates while giving institutional leaders a broad range of student dimensional categories to help analyze their own student success and intervention initiatives.

To learn more about the services that ASR Analytics offers and how we can help with your reporting, analytics, BI and data warehousing needs, visit our Solutions page or use our Contact form.

Should “bad” data be corrected in the Data Warehouse? Of course! Or should it?

March 5, 2012

I ran across an interesting issue the other day while developing a new report for a client from their data warehouse.  As I was going through the validation process, I discovered that there were two pieces of data in some of the records that didn’t agree, yet by definition (on the surface at least) they should have.

The specific example was a college student with a term registration status of “first time student” in the “2009 Fall” term (as opposed to “returning” which is assigned to subsequent registration terms),  yet the student had an entry term cohort value of “2001 Fall”.  The entry term cohort value didn’t match the term value  selected in my query of “first time” students registered in “2009 Fall”.

How could this happen? How could the data have ended up this way in the data warehouse? A quick data validation query showed a clear discrepancy on about 10% of records where the student had a “first time” term registration status, but that registration term was not their entry cohort term.
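A validation check of this kind can be sketched in a few lines of Python. The records, field names, and term codes below are hypothetical stand-ins, not the client’s warehouse schema, and the resulting rate reflects only this made-up sample.

```python
# Hypothetical term registration records; field names are illustrative only.
records = [
    {"student_id": 101, "term": "2009FA", "reg_status": "first time", "entry_cohort": "2009FA"},
    {"student_id": 102, "term": "2009FA", "reg_status": "first time", "entry_cohort": "2001FA"},
    {"student_id": 103, "term": "2009FA", "reg_status": "returning", "entry_cohort": "2007FA"},
    {"student_id": 104, "term": "2009FA", "reg_status": "first time", "entry_cohort": "2009FA"},
    {"student_id": 105, "term": "2009FA", "reg_status": "first time", "entry_cohort": "2009FA"},
]


def cohort_mismatches(rows):
    """First-time registrations whose term is not the student's entry cohort term."""
    return [r for r in rows
            if r["reg_status"] == "first time" and r["term"] != r["entry_cohort"]]


def mismatch_rate(rows):
    """Share of first-time registrations that disagree with the entry cohort."""
    first_time = [r for r in rows if r["reg_status"] == "first time"]
    if not first_time:
        return 0.0
    return len(cohort_mismatches(rows)) / len(first_time)


print([r["student_id"] for r in cohort_mismatches(records)])  # [102]
print(mismatch_rate(records))  # 0.25
```

Listing the offending records, rather than only the rate, gives the business users something concrete to review.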

Further research and discussion with the client revealed that their business process dictates that after certain periods of absence the student is required to reapply and is therefore treated as a “first time student” again, even though the student’s original entry term cohort value is not changed, either manually or automatically. So what should be used to query for those entering for the first time? The Registration Status? Or the Entry Cohort? Each returns slightly different values. The question now before the data governance team is what to do with the “Entry Cohort” values in this case, and what the impact of a business rule change would be, given the apparent conflict of definition between these two data elements and the different purposes they serve, often for different business units at the college.

Ultimately, it raises the question of what to do about the data in the data warehouse. Should it be changed? Or should it be left as is? It is an interesting dilemma that could apply to even simpler scenarios, where a value such as the student’s ethnicity is entered incorrectly and then corrected a week later. The data warehouse will capture these data changes and show the student with the “incorrect” ethnicity for that period of time. If a report showing the breakdown of students by ethnicity is requested for that very time period, the report would be “incorrect”.

This caused me to do some thinking and searching for other people’s wisdom on this topic. It seems there is no clear consensus on the best approach with pros and cons on both sides.  In the first example I came across, there is a way this type of discrepancy can be captured in data quality exception reports during ETL and either fixed to match an agreed upon business rule or left to the business users to review and implement manually in the source transaction system to flow into the warehouse during the next load.  The second example however has no automated way of correcting since there is no way for the system to truly know what the correct data value should be.

The decision to correct data anomalies of this nature depends somewhat on the business requirements and the general philosophy on data quality. There is a distinct argument for saying “leave it as is since the warehouse is meant to represent the historical state of transaction data and the errors will not create a variance to change decisions made by analysis of that data”. Others argue that data warehouses are meant to provide inputs to business decisions that are as clean as possible, and that data should be cleaned when known to be wrong. This blog post has a good summary of the perspectives, with thoughts from some of the founders of data warehousing, Ralph Kimball and Bill Inmon.

What do you think? I’d be interested in your comments and experiences.

 

What Makes a Good Measure?

December 18, 2011

I travel a lot on United Airlines, and since their merger with Continental, the new CEO, Jeff Smisek, proudly states at the opening of the safety video that he and thousands of his colleagues are “creating the world’s leading airline.” Now, more recently, Etihad Airways has been advertising that they are building the world’s leading airline. What? Two leading airlines?? Now we have a fight on our hands!

But what does “leading” really mean? The first time I heard the phrase my immediate reaction was: Huh? That sounds terrible. Are they not going to strive to be the best airline? Aren’t they trying to be #1 like most would assume is the goal of a merger? But by what measure? Size? Revenue? Fleet age? Service and satisfaction? Destinations served? Complaints? Lost bags?  Cost management? 

For an industry with dozens of closely watched measures of performance, creating a public marketing message to be “leading” is vague and pointless to me. It’s also a bit risky. After all, vague or non-existent goals will always make you successful, but maybe not in the way you intended. It is a safe way to go, though, if you are not sure what you’re doing or how things might go. Maybe the new United will be able to say by the end of next year, “We’re leading with the worst on-time performance of any airline!” That’s nice.

We’ve all heard the phrase “for good measure.”   Hey, throw in some extra salt for good measure! Maybe you do it just in case what you’re cooking tastes terrible. It seems rather arbitrary. Why not taste it first? So, a more thoughtful approach to planning and success may be in order. I have worked with plenty of clients who do not understand how to make a good measure. Their five year plans are a wealth of vague, uncertain and impossible to measure goals, usually in an attempt to placate many differing views.

Think about this as you are setting goals for the new year for yourself or your organization.  Are the measures meaningful? Can they really be measured? Is the necessary data collected? How will you know you have achieved the goal? Is it actually a good measure people will recognize as success?

After all, if you are “leading” an airline industry that still has a terrible overall reputation for service, you haven’t really set a good measure or accomplished much.

BI and the NHL Playoffs

May 25, 2011

I find that living in the DC area, one doesn’t meet a whole lot of hockey fans in everyday interactions. Most people are into baseball, football, and even soccer. But hockey, not so much; not nearly as much as in my hometown. Living here for almost 20 years now, I’ve become a Caps fan. This season was promising, and the beginning of the post-season made a run at the Stanley Cup look even more hopeful. We know how that ended, but more on that in a moment.

When a Canadian friend of mine, who happens to work for SAP Business Objects, forwarded this link to me that showcases their BI platform, I was intrigued. It takes full advantage of their analytics and data exploration technologies using hockey statistics. I mostly deal with higher education related data like student enrollment, retention, financial aid, and human resources. This was different and fun!

I took a look at how Washington stacked up against their second round rival Tampa Bay. Hmm…. Not such a good picture. Tampa Bay had higher average Goals For and lower Goals Against. Their offense and defense looked better by the numbers. I looked at the goalie save percentages. I compared some key individual players from each team. Everyone thought the Caps would keep winning and go to the finals. After exploring some of the data and visualizations, I wasn’t so sure of a spot in the finals. And, in fact, it didn’t happen. Sadly, the numbers seemed to support that outcome.  Certainly there is more to hockey than just numbers. Passion for play, pure skill, wanting to win, and luck sometimes create amazing upsets. That’s what happened in last year’s post-season. (And seemingly in every year’s March Madness for all of you college basketball fans!)

Of course statistics don’t always tell the whole story. Lots of other variables can come into play. And often good analysis includes domain knowledge with the human element to enhance any interpretation. But I can’t help thinking that those stats didn’t lie, and the results certainly bear that out. Now with Vancouver in the Cup finals and Tampa Bay winning tonight to force a game 6, it’s time to go back and do a bit more research and exploration!

Take a look at the site. Play around. Even if you don’t know much about hockey, it’s a good way to become familiar with some of the great analysis and visualization tools available in Business Objects. Maybe you can improve your chances of winning the office Stanley Cup pool!
