Matthew O'Keefe

Some Techniques for Reducing Errors in Your Data Science Efforts

Blog Post created by Matthew O'Keefe on Apr 29, 2015

Recently, Ken Wood and I collaborated on a study of the relationship between cold, windy weather and oil production in North Dakota.


Ken had obtained oil production data from the State as well as climate data from NOAA (the federal agency), and found a correlation.


However, while he was creating his monthly production totals for the State of North Dakota, an error crept into his query.

I was asked to review the data before he published it in his blog post, and since I follow the industry (I’m a native of the State

and have family in the oil-producing region), I was tasked with vetting the numbers for general accuracy and matching them

against common sense. Unfortunately, in this task I failed.


This blog post is about why I failed.


I’m a big believer in failure analysis, both before and after a product or process has been

developed and used, as a way of improving things. It certainly happens after large disasters like the

sinking of the Titanic and the Challenger explosion, but it should

happen more often, at both small and large scales, in product and process development. The saying “hope for the best,

but plan for the worst” is too optimistic. You should assume the worst will happen, do your best to imagine the ways it

could happen, and try to avert them. In the area of data science, here are some suggestions to help make that happen:


(1) Peer review

Find an expert in the area and have him or her carefully review your input, process,

and findings. The flaw in our study was found by Wes Peck at EERC, one of the best geologists

(and skeptics!) I know. Wes’s temperament is to question everything and demand to see all the details

before he accepts something. Peers like that are your friends. Like them, think like a detective: demand

that people show proof of what they are claiming, and cross-check stories and facts against each other to determine

the truth.


(2) Wherever possible, use direct sources of data

When deriving data inputs, avoid complex SQL queries where possible.  After Ken and I reviewed the data for

monthly Bakken production (derived via SQL queries) and saw it was incorrect,

I found another online source at the State web site that had just the data we needed.

So "keep it simple" is a good principle when gathering data. This approach also has the advantage

of supporting peer review, as your peers can follow the flow of logic and results more easily.

Also, whenever possible, cross-correlate your data with other, independent data sources

(at least one other source, but preferably more).
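
A minimal sketch of such a cross-check, in Python. It assumes each source has already been reduced to a mapping from month to total; the function name compare_sources, the 2% tolerance, and the numbers are all hypothetical placeholders, not our actual data or the method we used.

```python
def compare_sources(derived, reference, rel_tol=0.02):
    """Return (month, derived_value, reference_value) where the sources disagree.

    A month missing from either source is reported with None for the
    absent value. rel_tol is an assumed tolerance, relative to the reference.
    """
    mismatches = []
    for month in sorted(set(derived) | set(reference)):
        a = derived.get(month)
        b = reference.get(month)
        if a is None or b is None:
            mismatches.append((month, a, b))
        elif abs(a - b) > rel_tol * abs(b):
            mismatches.append((month, a, b))
    return mismatches


# Placeholder numbers for illustration only, NOT real production figures.
derived = {"2014-01": 1000.0, "2014-02": 210.0, "2014-03": 1015.0}
reference = {"2014-01": 1000.0, "2014-02": 990.0, "2014-03": 1020.0}

for month, a, b in compare_sources(derived, reference):
    print(month, a, b)
```

Any month this prints is a place to stop and look at the raw inputs before going further.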


(3) Be suspicious of your results even if they agree with your intuition

In this case, because the data Ken showed me agreed with my overall intuition about Bakken production

(i.e., it has gone up every year since 2005, except for some decrease

or pause from February through April due to cold, windy weather in North Dakota), I didn't

review the data in detail, and I missed the clear errors Wes Peck found.


(4) Curiosity is good

Philipp Janert, author of a well-known book on data analysis, believes that curiosity about input data and results is the

most important trait for data scientists. Be curious about your input data, the trends it contains and whether they match

common sense. Our complex SQL query produced monthly production figures that dropped by a huge amount (nearly 80%). In

fact, that kind of drop is not possible: wells that are already producing (the vast majority of wells) are not directly affected by

cold, windy weather, whereas wells being completed can be drastically affected. The largest month-to-month drop seen in

actual production during the last few winters was approximately 4.2% in 2013.
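
This kind of common-sense check is easy to automate. The sketch below, in Python, flags any month whose total falls by more than a chosen percentage from the month before. The function name flag_large_drops, the 10% threshold, and the numbers are hypothetical; pick a threshold from what you know about the real system.

```python
def flag_large_drops(monthly_totals, max_drop_pct=10.0):
    """Return (month, pct_change) for months that fell by more than max_drop_pct.

    monthly_totals is a list of (month, total) pairs in chronological order.
    max_drop_pct is an assumed threshold, not a value from the study.
    """
    flagged = []
    for (_, prev), (month, cur) in zip(monthly_totals, monthly_totals[1:]):
        if prev <= 0:
            continue  # percentage change is undefined without a positive baseline
        pct_change = 100.0 * (cur - prev) / prev
        if pct_change < -max_drop_pct:
            flagged.append((month, pct_change))
    return flagged


# Placeholder numbers for illustration only, NOT real production figures.
totals = [
    ("2013-12", 1000.0),
    ("2014-01", 950.0),   # a 5% dip, below the threshold
    ("2014-02", 209.0),   # a 78% drop, which should be flagged
    ("2014-03", 1010.0),
]

print(flag_large_drops(totals))
```

A flagged month is not proof of an error, only a prompt to go back and check the query or the source.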


(5) Models are good

Based on the trends you find in the data, see if you can develop a model that consistently predicts future

patterns from incoming data. If it can’t, that could indicate an error in your data or in your processes for

summarizing results.
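
As one very simple illustration in Python, the sketch below fits a straight-line trend to past values and checks whether a new value lands near the prediction. The linear trend, the 10% tolerance, the function names, and the numbers are all assumptions chosen for brevity; a real model of production would need to account for seasonality and much else.

```python
def fit_linear_trend(values):
    """Least-squares fit of values against their index; returns (slope, intercept)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def check_new_value(history, new_value, rel_tol=0.10):
    """Predict the next value from the trend; return (predicted, is_consistent)."""
    slope, intercept = fit_linear_trend(history)
    predicted = slope * len(history) + intercept
    is_consistent = abs(new_value - predicted) <= rel_tol * abs(predicted)
    return predicted, is_consistent


# Placeholder numbers for illustration only, NOT real production figures.
history = [100.0, 110.0, 120.0, 130.0]

print(check_new_value(history, 138.0))  # close to the trend
print(check_new_value(history, 30.0))   # far from the trend, worth investigating
```

If incoming values keep failing a check like this, either the model is too crude or something upstream in the data is wrong, and both are worth knowing.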