Geoffrey Marsh

Garbage In, Garbage Out

Blog Post created by Geoffrey Marsh on Feb 27, 2018

Let’s face it: analytics, whether big data, machine learning, or IoT, is sexy. I figured if I waited long enough it would get cool, and I am glad I did, but I think we need to spend a little time on the less exciting side of analytics: data quality.


So why do we care so much about data quality? Aren’t we taught that all data generated has some form of inherent value? Would we not want to analyze everything we can get our hands on? I have been in organizations that have thought both ways on this, but a key thing to remember is bad data will cost you money!


It is not difficult to understand how bad data could drive the wrong decisions, so it is a good thing someone is often cleaning the data. That cleaning could happen anywhere during the data’s lifecycle. Perhaps an analyst does it while querying the tables or scrubbing a report before sending it out, or perhaps it happens in the code the analyst writes before the data is output. While it’s great the cleaning is getting done, doing it this late is terribly inefficient. To show this, let’s look at some numbers. A standard web search will give you these rough averages:


  • The average data scientist in Silicon Valley makes $200K a year. If he/she receives full benefits with 401K and other company perks, each data scientist costs the company about $250K (conservatively) per year.
  • Let’s say company X has five data scientists, so it spends $1.25M on these resources every year.
  • On average, most data scientists spend 60%-80% of their time combing through data and cleansing it. Let’s split the difference and take 70%.
  • Given all the information above, this means company X is spending $875K per year (70% of $1.25M) to have its data cleaned.
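
The arithmetic behind those bullets is simple enough to check in a few lines. Here is a minimal sketch in Python that uses only the rough averages quoted above; the variable names are mine, and the figures are illustrative, not measured data. Swap in your own numbers to see what cleaning costs your team.

```python
# Rough averages from the bullets above; illustrative, not measured data.
LOADED_COST_PER_SCIENTIST = 250_000  # salary plus benefits, USD per year
TEAM_SIZE = 5                        # data scientists at company X
CLEANING_SHARE_PERCENT = 70          # midpoint of the 60%-80% range

# Total yearly spend on the data science team.
team_cost = LOADED_COST_PER_SCIENTIST * TEAM_SIZE

# Portion of that spend that goes to cleaning data (integer math, whole dollars).
cleaning_cost = team_cost * CLEANING_SHARE_PERCENT // 100

# What is left for building models and algorithms.
modeling_cost = team_cost - cleaning_cost

print(f"Team cost:         ${team_cost:,}")
print(f"Spent on cleaning: ${cleaning_cost:,}")
print(f"Left for modeling: ${modeling_cost:,}")
```

With the numbers above, that leaves only $375K of the $1.25M for the work the team was actually hired to do.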


That time and money are not being spent creating algorithms and models to drive more sales or support the next big product launch; they are spent just wrangling the data. And that is the cost for the data scientists alone! If you factor in what everyone else in your organization is doing to “clean things up,” the cost would be considerably higher.


Throughout my career I have seen few organizations manage their data quality successfully. Unfortunately, for most organizations I worked with, cleaning their data was an afterthought. Instead of following those organizations, consider these top priorities to help drive successful data quality for your organization:


  1. Lead with quality in mind. Whether you are building a new app or building a table in a warehouse with existing data, stay focused on the quality of the data while building it.
  2. Data quality is not a “one-time thing.” You need to implement business processes that ensure a consistent level of quality.
  3. Do not rely on tools only. Master data management tools will not fix all your problems.
  4. Look at other tools. Being able to manage the data lifecycle, whether the data is at rest or in motion (in query and ETL), is important.
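
To make the first and last points a little more concrete, here is a minimal sketch of a quality check applied while data is in motion, for example inside an ETL step. Everything in it is hypothetical: the field names (`customer_id`, `amount`, `order_date`), the rules, and the function names are assumptions chosen for illustration, not a prescribed standard or any particular tool’s API.

```python
from datetime import date


def check_row(row):
    """Return a list of quality problems found in one row (empty if clean)."""
    problems = []

    # Hypothetical rule: every row needs a non-blank customer_id.
    customer_id = row.get("customer_id")
    if customer_id is None or not str(customer_id).strip():
        problems.append("missing customer_id")

    # Hypothetical rule: amount must be a number and cannot be negative.
    amount = row.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        problems.append("amount must be a non-negative number")

    # Hypothetical rule: order_date must already be parsed into a date.
    if not isinstance(row.get("order_date"), date):
        problems.append("order_date must be a date")

    return problems


def split_by_quality(rows):
    """Separate clean rows from rejected ones, keeping the reasons."""
    clean, rejected = [], []
    for row in rows:
        problems = check_row(row)
        if problems:
            rejected.append((row, problems))
        else:
            clean.append(row)
    return clean, rejected


rows = [
    {"customer_id": "C-1", "amount": 19.99, "order_date": date(2018, 2, 27)},
    {"customer_id": " ", "amount": -5, "order_date": "2018-02-27"},
]
clean, rejected = split_by_quality(rows)
print(f"{len(clean)} clean, {len(rejected)} rejected")
for row, problems in rejected:
    print(row, "->", "; ".join(problems))
```

Keeping the rejected rows together with their reasons, rather than silently dropping them, gives the business process in point 2 something to act on.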


So, I will leave you with this. The buzzwords and hype around analytics are exciting, and I am all for gaining insights we have never been able to grasp before, but remember that those insights must come from a foundation of quality data. Put in place the right tools and business processes and you will be successful.