Recently, Ken Wood and I collaborated on a study of the relationship between cold, windy weather and oil production in North Dakota.
Ken had pulled oil production data from the State as well as climate data from NOAA (the Federal agency), and found a correlation.
However, in the process of deriving his monthly production totals for the State of North Dakota, an error crept into his query.
I was asked to review the data before he published it in his blog post, and since I follow the industry (I’m a native of the State
and have family in the oil-producing region), I was tasked with vetting the numbers for general accuracy and matching them
against common sense. Unfortunately, in this task I failed.
This blog post is about why I failed.
I’m a big believer in failure analysis both before and after a product/process has been
developed and used as a way of improving things. This certainly happens after large disasters like the
sinking of the Titanic and the Challenger explosion, but it should
happen more often, at both small and large scales, in product and process development. The saying “hope for the best,
but plan for the worst” is too optimistic. You should assume the worst will happen and do your best to imagine ways it
could happen and try to avert them. In the area of data science, here are some suggestions to help make that happen:
(1) Peer review
Find an expert in the area and have him or her carefully review your input, process,
and findings. The flaw in our study was found by Wes Peck at EERC, one of the best geologists
(and skeptics!) I know. Wes’s temperament is to question everything and demand to see all the details
before he accepts something. Peers like that are your friend. Like them, think like a detective: demand
that people show proof of what they are claiming, and cross-correlate stories and facts to determine what is actually true.
(2) Wherever possible, use direct sources of data
When deriving data inputs, avoid complex SQL queries where possible. After Ken and I reviewed the data for
monthly Bakken production (derived via SQL queries) and saw it was incorrect,
I found another online source at the State web site that had just the data we needed.
So "keep it simple" is a good principle when gathering data. This approach also has the advantage
of supporting peer review, as your peers can follow the flow of logic and results more easily.
Also, whenever possible, cross-correlate your data with other, independent data sources
(at least one other source, but preferably more).
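As a rough sketch of what such a cross-check might look like in practice (the source names, barrel counts, and 2% tolerance below are all illustrative assumptions, not our actual data):

```python
# Cross-check monthly totals from two independent sources and report
# the months where they disagree beyond a tolerance.
# All values and the 2% tolerance are made-up, illustrative numbers.

def cross_check(primary, secondary, tolerance=0.02):
    """Return {month: (primary_value, secondary_value)} for months where
    the two sources disagree by more than `tolerance` (as a fraction of
    the primary value), or where the secondary source is missing data."""
    discrepancies = {}
    for month, value in primary.items():
        other = secondary.get(month)
        if other is None:
            discrepancies[month] = (value, None)
        elif abs(value - other) > tolerance * abs(value):
            discrepancies[month] = (value, other)
    return discrepancies

# Hypothetical barrel counts: the derived Feb number is wildly off.
derived  = {"2014-01": 31_000_000, "2014-02": 6_500_000}   # from a SQL query
official = {"2014-01": 31_100_000, "2014-02": 28_900_000}  # from a direct source
print(cross_check(derived, official))  # flags only "2014-02"
```

A check like this would have flagged our bad query output immediately, since the derived totals diverged from the State's published numbers by far more than any plausible reporting noise.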
(3) Be suspicious of your results even if they agree with your intuition
In this case, because the data Ken showed me agreed with my overall intuition about Bakken production
(i.e., it has risen year after year since 2005, with the exception of some decrease
or pause in February through April due to cold, windy weather in North Dakota), I didn't
carefully review the data in detail, and missed the clear errors Wes Peck found.
(4) Curiosity is good
Philipp Janert, author of a well-known book on data analysis, believes that curiosity about input data and results is the
most important trait for data scientists. Be curious about your input data, the trends it contains and whether they match
common sense. Our complex SQL query showed monthly production dropping by a huge amount (nearly 80%), when in
fact a drop of that size is not possible: the vast majority of producing wells are not directly affected by
cold, windy weather, although wells being completed can be drastically affected. The largest month-to-month drop seen in
actual production during the last few winters was approximately 4.2% in 2013.
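One way to turn that kind of curiosity into a routine check is to flag month-to-month swings that exceed what history says is plausible. A minimal sketch (the sample series and the 10% threshold are made up for illustration; the real historical ceiling mentioned above was about 4.2%):

```python
# Flag month-to-month changes larger than a plausibility threshold.
# The threshold and sample series are illustrative assumptions.

def flag_big_swings(series, threshold=0.10):
    """series: list of (month, value) pairs in chronological order.
    Returns (month, fractional_change) for each month whose change
    from the prior month exceeds the threshold in magnitude."""
    flagged = []
    for (_, prev), (month, curr) in zip(series, series[1:]):
        change = (curr - prev) / prev
        if abs(change) > threshold:
            flagged.append((month, round(change, 3)))
    return flagged

production = [("2013-12", 100.0), ("2014-01", 98.0), ("2014-02", 20.0)]
print(flag_big_swings(production))  # → [('2014-02', -0.796)]
```

An 80% drop sails past any reasonable threshold, so a two-line check like this, run before publication, would have caught the error mechanically rather than relying on someone's eyeball.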
(5) Models are good
Based on the trends you find in the data, see if you can develop a model that is consistently able to predict future
patterns based on incoming data. If it can’t, that could be an indication of an error in your data or in the processes used to generate it.
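As a minimal sketch of using a model this way (the data and the 20% tolerance are hypothetical), one could fit a simple linear trend to past monthly totals and flag new observations that fall far outside the trend's prediction:

```python
# Fit a least-squares linear trend to past values, then check whether
# a new observation is consistent with it. Data and the 20% tolerance
# are illustrative assumptions, not real production figures.

def fit_linear(values):
    """Least-squares line y = a + b*x through (0, v0), (1, v1), ..."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

def check_new_value(history, new_value, tolerance=0.20):
    """True if new_value is within `tolerance` of the trend's prediction
    for the next time step."""
    a, b = fit_linear(history)
    predicted = a + b * len(history)
    return abs(new_value - predicted) <= tolerance * predicted

history = [10.0, 11.0, 12.1, 13.0, 14.2]  # steadily rising totals
print(check_new_value(history, 15.0))  # consistent with the trend: True
print(check_new_value(history, 3.0))   # implausible drop: False
```

The point is not that a straight line is a good production model; it is that even a crude model gives you an explicit prediction to compare each new data point against, so a wildly inconsistent number announces itself instead of slipping through.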