Data Lake Pollution

Sep 05, 2018

Data Lakes Create Increased Risk of Pollution

I was doing some research for an upcoming project that will leverage a data federation platform, and it piqued my curiosity as to how these companies position themselves relative to the ever-growing interest in Data Lakes. Like all good analysis, this led to another question: “Why would any organization that continues to struggle with reporting and analytics want to implement a Data Lake?”

I found this blog post from Gartner that is a few years old but still as relevant today as it was then: “Gartner Says Beware of the Data Lake Fallacy.” It is important to remember that a Data Lake solves two very specific problems FOR IT:

  • Taking the first step in breaking down data silos, by gathering data and putting it in one place
  • Solving performance issues inherent in large amounts of data

Unless you have an army of data scientists, a Data Lake will not solve the problems with enterprise data that still exist FOR USERS:

  • Understanding of the data, i.e. data definitions and context
  • Data quality
  • Security
  • Regulatory concerns, e.g. PII, HIPAA, GDPR
  • Purpose-built / simplified data models for business users

Additionally, if you are concerned about reporting on your data from specific points in time, how are you going to manage change data capture in your Data Lake?
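To make that question concrete, here is a minimal sketch in Python of what point-in-time reporting needs from change data capture: every change is kept as a timestamped record, and a report replays those records up to the date of interest. The names `ChangeRecord` and `as_of` are hypothetical and invented for this illustration; they are not taken from any Data Lake product, and a real implementation would also have to deal with late-arriving changes, schema drift, and data volume.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class ChangeRecord:
    """One captured change (hypothetical structure, for illustration only)."""
    key: str                # business key of the row, e.g. a customer id
    value: Optional[dict]   # row contents after the change; None marks a delete
    changed_at: datetime    # when the change took effect in the source system


def as_of(changes: list[ChangeRecord], point_in_time: datetime) -> dict[str, dict]:
    """Rebuild the state of every key as it stood at point_in_time."""
    state: dict[str, dict] = {}
    # Replay changes in chronological order and stop at the requested date.
    for change in sorted(changes, key=lambda c: c.changed_at):
        if change.changed_at > point_in_time:
            break
        if change.value is None:
            state.pop(change.key, None)   # the row was deleted
        else:
            state[change.key] = change.value
    return state


# Made-up sample data: two customers whose tier changes over the year.
changes = [
    ChangeRecord("cust-1", {"tier": "silver"}, datetime(2018, 1, 10)),
    ChangeRecord("cust-2", {"tier": "gold"}, datetime(2018, 2, 1)),
    ChangeRecord("cust-1", {"tier": "gold"}, datetime(2018, 6, 15)),
    ChangeRecord("cust-2", None, datetime(2018, 8, 1)),
]

march_view = as_of(changes, datetime(2018, 3, 31))
september_view = as_of(changes, datetime(2018, 9, 5))
```

If the Data Lake only ever holds the latest snapshot of each source, the March view above cannot be reconstructed; somebody has to decide up front to capture and retain the change history.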

Ultimately, as with any technology, there may be a place for a Data Lake at your organization. However, if you buy into the vendor hype and believe that it will solve the fundamental challenges your organization faces with data today, you may end up creating the next Lake Karachay, often described as the most polluted place on earth.