Ken Wood

UPDATE! Data Blending, Spurious Correlations and a Rainy Day

Blog Post created by Ken Wood on Apr 25, 2015

It has come to my attention that the conclusions of the analysis presented in this blog are overstated. The folks at the EERC have pointed out two discrepancies that I would like to address here: the volume of production shown in the charts below is overstated (the y-axis), and the amount of cold-weather production decline is overstated (the x-axis). After examining these details with Matthew O'Keefe, it turns out they are right. You’ve got to love experts!


First, the volume of production shown on the y-axis is wrong. This boils down to an operational error on my part: the database was loaded more than once with the same data, resulting in up to three copies of the same records in the table. This greatly inflated the results of the analysis.
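One way to guard against this kind of repeated load is to deduplicate on a natural key before summing. The sketch below is a minimal illustration in Python, not the Pentaho transformation used for this analysis. The field names (`well_id`, `month`, `barrels`) and the sample values are hypothetical.

```python
from collections import defaultdict


def monthly_totals(rows):
    """Sum production per month, counting each (well_id, month) pair once."""
    seen = set()
    totals = defaultdict(float)
    for row in rows:
        key = (row["well_id"], row["month"])  # assumed natural key
        if key in seen:
            continue  # skip copies left behind by a repeated load
        seen.add(key)
        totals[row["month"]] += row["barrels"]
    return dict(totals)


# Hypothetical sample: the same two records loaded three times.
once = [
    {"well_id": "W1", "month": "2013-01", "barrels": 100.0},
    {"well_id": "W2", "month": "2013-01", "barrels": 50.0},
]
loaded_three_times = once * 3

naive_total = sum(r["barrels"] for r in loaded_three_times)  # inflated
deduped = monthly_totals(loaded_three_times)                 # counted once
```

Comparing `naive_total` with the deduplicated total for the same month is a quick sanity check that would have flagged the inflated y-axis.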


This also affects the second issue, cold weather’s impact on production. In the same chart, an “up to 80% reduction” was stated for months of extreme cold weather. This is in fact wrong by a significant factor. The decrease in production is closer to 9% to 15% from peak production to lowest production in recent years, as the new chart shows.
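For reference, a peak-to-lowest decline like the one quoted above is the drop from the peak expressed as a share of the peak. This is a minimal sketch of that arithmetic; the function name and the sample numbers are hypothetical and are not the real Bakken figures.

```python
def percent_decline(peak, low):
    """Percentage drop from peak to low, relative to peak."""
    if peak <= 0:
        raise ValueError("peak must be positive")
    return (peak - low) * 100.0 / peak


# Hypothetical monthly totals, for illustration only.
example = percent_decline(peak=1000.0, low=880.0)
```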


While the results of this analysis are part of a broader illustration of how to blend data sources, the data sources themselves need to be thoroughly understood and examined before reaching conclusions. As Matthew O'Keefe says, “don’t trust your data sources, don’t trust your analysis and don’t trust your results”.


I do want to thank the EERC for reading our blogs and for commenting on their accuracy. This means they are really reading them thoroughly!

Data Blending, Spurious Correlations and a Rainy Day

With the recent announcement of HDS’ intention to acquire Pentaho, I can finally share openly what I have been working on. Not the business stuff, but much of the experimentation work.


I’ve been working with the Pentaho team on this project for over 16 months now, evaluating and using the technology to solve problems and create visuals. My team has managed to create some very interesting and useful data analyses. Some of them were fun and interesting, while others were a lot of hard work that turned out to be duds; you need a few failures to appreciate the successes. Over the course of my next few blogs, I will share some of the insights and analyses that have been created and discovered using the Pentaho suite of data blending, analytics and data orchestration tools.


A while back, I did share some early work with Pentaho once we had signed the OEM relationship, though I did not share the tool set I used. Given the sensitivity and magnitude of the project, I felt at the time that sharing that detail would be too revealing; some smart people would have figured out what was going on. In fact, much of my recent lapse in blogging is directly related to this work. That earlier work was the “TwitterVerse” analysis. You can review the blogs “If you tweet in the US…” and “Influencers added to the TwitterVerse”. In case you didn’t figure this out, this was created with Pentaho. It actually runs as a live dashboard (internally on the HDS network) alongside other charts of Twitter insight and information. Currently, it runs every hour, grabbing as much data as Twitter allows and updating the dashboard accordingly. I’m only running it hourly due to resource constraints; the frequency can be dialed up or down for demonstrations and for social events that warrant it.


However, I’m actually not going to write about the details of that system in this blog. As a future component of this new blog series, I’ll save that write-up for a later time since it will be a multi-part series and will require a lot more detail to explain.


It’s a rainy day today here in southern California and I’m in the mood to write something about weather from a data analysis perspective. Instead of “I’m Singing in the Rain”, this is more appropriately called “I’m Blending Data in the Rain”.


Early in this project, I gained access to oil and gas production data sets from the state of North Dakota and did some work with the Energy & Environmental Research Center, the EERC. In fact, I have a whole HCP namespace dedicated to EERC and North Dakota oil and gas data. This data set is not the more infamous seismic data; it is oil production data and detailed oil well information for the shale oil fields of the North Dakota Bakken region. You can read about a One Hitachi meeting we had back in May 2014, where we toured the EERC research facility and discussed joint research and development opportunities. The geographical map pictures in that blog were done with Pentaho. Now, I’d like to share some additional analysis I’ve just completed while asking a basic question that may seem obvious.


“How does weather impact oil production in the North Dakota Bakken?”


This is not a bogus association of the kind you might read about at spurious correlations (Donna Yobs has been sharing the “facts” found on that site with me for the past few weeks); this one actually has useful insight. Without researching the question on the internet, I used my recent work on the North Dakota Bakken data sets, which plots oil well locations and production for the region, as seen in the picture below.




This is a series of geographical map zooms drilling into an oil well location, showing the condition and environment of a typical wellhead. Actually, the wells and their locations are in different stages of completion and condition depending on the maturity of the well, its output, the operator and other factors. But one thing can be observed: the areas are somewhat rugged, undeveloped terrain with lots of dirt roads and no pipelines, and the primary condition is that it’s in NORTH Dakota. To make these charts a little easier to read: development of the Bakken region started on the left (west) side of the charts, and additional wells were drilled toward the east (right) over the decades. The circle size indicates lifetime production, so the leftmost wells have produced the most oil, while the newer wells toward the east are younger and haven’t produced as much in comparison.


To answer the question “How does weather impact oil production in the Bakken?”, I need some weather data, specifically historical weather data for the region. The oil production data for the region goes back to 1960, so there is a lot of weather history to work with. Using the National Oceanic and Atmospheric Administration (NOAA) website, I was able to obtain historical weather data for the Bakken region, selected with a latitude and longitude polygon, going back to 1960, the date of the first entry in my oil production dataset.
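To illustrate what selecting by a latitude and longitude polygon means, here is a minimal sketch of a ray-casting point-in-polygon test in Python. It is not how the NOAA site implements its selection. The polygon is only a rough, made-up box over western North Dakota, not the one used for this analysis, and the station names and coordinates are hypothetical.

```python
def point_in_polygon(lat, lon, polygon):
    """Ray-casting test; polygon is a list of (lat, lon) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        # Does this edge straddle the point's latitude?
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which the edge crosses that latitude.
            cross_lon = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < cross_lon:
                inside = not inside
    return inside


# Hypothetical polygon and stations, for illustration only.
region = [(47.0, -104.0), (47.0, -102.0), (49.0, -102.0), (49.0, -104.0)]
stations = {
    "STATION_A": (48.1, -103.6),  # falls inside the box
    "STATION_B": (46.8, -100.8),  # falls outside the box
}

selected = sorted(
    name for name, (lat, lon) in stations.items()
    if point_in_polygon(lat, lon, region)
)
```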


By correlating the two datasets, which are unassociated other than by location, using the Pentaho analytic tools, a distinct pattern emerged based on one weather data point, DX32. Of the data fields contained in the weather report dataset, the field DX32 indicates the number of days per month that were at or below freezing for the area. Have a look at the dashboard below, which shows the total monthly oil production for the region against the number of days in the month at or below freezing. Data from the 1960s are dropped because those data points are overshadowed by the higher production rates of the past 36 years.
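The blend described above can be sketched outside of Pentaho as a join on month followed by a grouping on DX32. This is a minimal Python illustration under stated assumptions, not the actual process flow: both inputs are assumed to be already aggregated to one value per month for the region, the function names are hypothetical, and the sample values are made up.

```python
from collections import defaultdict


def blend_by_month(production, weather):
    """Inner-join two {month: value} mappings on the month key."""
    return [
        (month, weather[month], production[month])
        for month in sorted(production)
        if month in weather
    ]


def mean_production_by_freezing_days(blended):
    """Average monthly production for each DX32 value."""
    groups = defaultdict(list)
    for _month, dx32, barrels in blended:
        groups[dx32].append(barrels)
    return {dx32: sum(v) / len(v) for dx32, v in sorted(groups.items())}


# Hypothetical values, for illustration only.
production = {
    "2013-01": 900.0, "2013-02": 880.0,
    "2013-07": 1000.0, "2013-08": 1020.0,
}
weather = {  # month -> DX32 (days at or below freezing)
    "2013-01": 25, "2013-02": 25,
    "2013-07": 0, "2013-08": 0, "2013-09": 0,
}

blended = blend_by_month(production, weather)
summary = mean_production_by_freezing_days(blended)
```

Months present in only one dataset (here `2013-09`) drop out of the join, which is worth checking before reading anything into the grouped averages.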




For the past 36 years, when it is freezing or colder in the North Dakota Bakken region for any extended number of days, oil production drops, while the peak production happens when the temperatures are over 32 degrees Fahrenheit. Stated another way, more oil is produced when it’s warmer than when it’s freezing. Thus, expect higher oil production in the warmer spring, summer and fall days, and much lower oil production in the colder fall, winter and spring days.
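A visual pattern like this can also be summarized with a single number. The sketch below computes a Pearson correlation coefficient in plain Python; it is a swapped-in illustration, not something the dashboard calculates, and the sample values are hypothetical. A value near -1 would mean production tends to fall as the count of freezing days rises.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical monthly pairs: freezing days vs. production.
freezing_days = [0, 5, 10, 20, 30]
barrels = [1000.0, 980.0, 950.0, 900.0, 860.0]

r = pearson(freezing_days, barrels)
```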


This is not a solid scientific study; it is a data correlation exercise on two different datasets with a visual outcome. Also, I’m only using oil production data through 2013, so this insight doesn’t reflect techniques introduced over the past year. There are a number of possible reasons for this observed pattern, from freezing roads and dangerous conditions, to equipment sensitivities to the cold, to the difficulties in providing and supplying heated water (steam) to facilitate the hydraulic fracturing process. I am looking forward to seeing if I receive any comments from experts in this industry on more detailed reasons for the difference in output.


Actually, this is one of the simpler process flows I’ve done, but it comes after the initial hard work of getting all of the well locations and production levels identified. So, in a way, this result is iterative work that builds on the previous work, which makes it more impressive. You can review the flow below to see how this insight was created, though the diagram doesn’t show the previous work.



While this isn't a classical Big Data style problem (it is only about 100,000 records), it is what Michael Hay likes to refer to as Little Big Data. There's still big insight in little data.

As I mentioned, I will be writing up and sharing more of these analytic projects as I document them internally. These are actual live dashboards that can be called up as needed. If everything goes according to plan, these blog write-ups will continue to get more complex in their descriptions and implementations as I share them with you.


I hope you find these useful and insightful, and that you will follow me through these exercises. Also, please share your comments and opinions, and together we can develop even further insight going forward.