Hu Yoshida

The internet is turning inside out – Time for Time Series DB? 

Blog Post created by Hu Yoshida on Mar 24, 2019

Data Hole.png

The internet is normally accessed like a pyramid, where a single URL may be accessed by hundreds, thousands, or even millions of users. Now, with the Internet of Things (IoT), we have a multitude of “things” sending millions of records to the internet. In a sense, the internet is being turned inside out, with millions more data points being ingested than served up, thanks to the sensors that enable IoT.

 

An IoT device like an autonomous vehicle may have hundreds of sensors generating thousands of GB of data. It is estimated that a single autonomous car will collect over 4000 GB of data per day! The reason for this large amount of data is that IoT devices are concerned with change. To track change, the data must be collected as a time series, where new data is always added and existing data is never updated. A time series allows us to measure change: to analyze how something changed in the past, how it is changing in the present, and to predict how it may change in the future. By focusing on change, we can understand how a system, process, or behavior changes over time and automate the response to future changes.
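To make the append-only idea above concrete, here is a minimal Python sketch. The class `AppendOnlySeries` and the helper `average_ingest_mb_per_second` are hypothetical names chosen for illustration and are not part of any product mentioned in this post. The sketch assumes decimal units (1 GB = 1000 MB) when converting the 4000 GB per day estimate into an average ingest rate.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AppendOnlySeries:
    """Hypothetical in-memory time series: points are appended, never updated."""
    points: List[Tuple[float, float]] = field(default_factory=list)

    def append(self, timestamp: float, value: float) -> None:
        # Reject out-of-order writes so recorded history is never rewritten.
        if self.points and timestamp <= self.points[-1][0]:
            raise ValueError("timestamps must be strictly increasing")
        self.points.append((timestamp, value))

    def rates_of_change(self) -> List[float]:
        # Change per second between each pair of consecutive points.
        return [
            (v1 - v0) / (t1 - t0)
            for (t0, v0), (t1, v1) in zip(self.points, self.points[1:])
        ]


def average_ingest_mb_per_second(gb_per_day: float) -> float:
    # Assumes decimal units: 1 GB = 1000 MB, and 86,400 seconds per day.
    return gb_per_day * 1000 / 86_400


if __name__ == "__main__":
    series = AppendOnlySeries()
    for ts, temp in [(0, 20.0), (10, 25.0), (20, 24.0)]:
        series.append(ts, temp)
    print(series.rates_of_change())
    print(round(average_ingest_mb_per_second(4000), 1), "MB/s on average")
```

Under these assumptions, 4000 GB per day works out to roughly 46 MB every second from a single vehicle, sustained around the clock.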

 

The downside is that time series data accumulates very rapidly, producing more data than transactional or NoSQL databases can normally absorb. This has spawned a rapidly developing market for time series databases (TSDBs). TSDBs are fine-tuned for time series data. This tuning yields performance gains, including higher ingest rates, faster queries at scale, and better data compression. TSDBs also include functions and operations common to time series data analysis, such as data retention policies, continuous queries, and flexible time aggregations, which improve the user experience with time series data. You know that time series databases are mainstream when AWS gets in the game. AWS has announced Amazon Timestream, a fast, scalable, fully managed time series database service for IoT and operational applications that, according to AWS, makes it easy to store and analyze trillions of events per day at 1/10th the cost of relational databases. The following chart from DB-Engines, November 2018, shows the growing acceptance of TSDBs compared to other forms of databases. In 2018,
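Two of the TSDB operations named above, time aggregation and data retention, can be sketched in a few lines of Python. This is only an illustration of the ideas, not how any particular TSDB implements them; the function names `downsample_avg` and `apply_retention` are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

Point = Tuple[int, float]  # (Unix timestamp in seconds, value)


def downsample_avg(points: List[Point], bucket_seconds: int) -> List[Point]:
    """Average all values that fall into the same fixed-width time bucket."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


def apply_retention(points: List[Point], now: int, max_age_seconds: int) -> List[Point]:
    """Keep only points that are no older than max_age_seconds."""
    cutoff = now - max_age_seconds
    return [(ts, value) for ts, value in points if ts >= cutoff]


if __name__ == "__main__":
    raw = [(0, 1.0), (30, 3.0), (60, 5.0), (90, 7.0), (120, 9.0)]
    print(downsample_avg(raw, 60))
    print(apply_retention(raw, now=120, max_age_seconds=60))
```

A continuous query can be thought of as running a function like `downsample_avg` automatically on a schedule and storing the result, so that coarse summaries remain available after a retention policy has removed the raw points.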

 

 

TSDB graph.png

 

If an autonomous automobile can generate 4000 GB of data per day, imagine what a more complex system like an oil refinery would produce. We recently worked with a large oil refinery in Europe that had thousands of sensors installed on equipment, including heat exchange networks, power plants, pipelines, and many other systems, collectively generating millions of data points every second. Their operators, process engineers, IT staff, and data scientists were collecting the data manually from these systems, as well as from Oracle, SQL Server, and SAP, and using tools such as Excel to derive their insights. Data silos inhibited collaboration between management, scientists, engineers, and IT, resulting in short-sighted or incorrect decisions. This approach was neither efficient, reusable, nor scalable.

Oil Refinery.png

 

A TSDB, OpenTSDB, was used to collect all the sensor data into a data lake. Pentaho Data Integration was used to connect to OpenTSDB, eliminating the need for third-party vendors and leveraging distributed compute. OpenTSDB has extensive, REST-based, open APIs, which gave our Pentaho engineers great flexibility to retrieve data extremely fast and parse it within Pentaho. The analytics used ranged from simple correlation and visualization to machine learning (ML) for predicting values. That said, Pentaho’s value proposition lay more in the data integration part (data acquisition, extraction, blending, data science, etc.), which consumes 80% of a data scientist’s time compared with the time spent on mining and modeling. Pentaho also enabled the process engineers, IT, and data scientists to work as a team and gave business users self-service consumption of operational data. This not only led to better decisions, but also reduced the lead time from 2 days to less than 10 minutes.
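As an illustration of what retrieving data through OpenTSDB’s REST API can look like, here is a minimal Python sketch built on the `/api/query` HTTP endpoint. It is not the Pentaho integration used in the project, which is not shown in this post. The host and port (`http://localhost:4242`), the metric name `refinery.hx.temperature`, and the tag `unit=hx01` are all hypothetical placeholders.

```python
import json
from typing import Dict, List, Optional, Tuple
from urllib.parse import urlencode
from urllib.request import urlopen

Series = Tuple[str, Dict[str, str], List[Tuple[int, float]]]


def build_query_url(base_url: str, metric: str, start: str,
                    aggregator: str = "avg",
                    downsample: Optional[str] = None,
                    tags: Optional[Dict[str, str]] = None) -> str:
    """Build an /api/query URL with m=<aggregator>[:<downsample>]:<metric>{tags}."""
    parts = [aggregator]
    if downsample:
        parts.append(downsample)
    parts.append(metric)
    m = ":".join(parts)
    if tags:
        m += "{" + ",".join(f"{k}={v}" for k, v in sorted(tags.items())) + "}"
    return base_url.rstrip("/") + "/api/query?" + urlencode({"start": start, "m": m})


def parse_query_response(body: str) -> List[Series]:
    """Turn the JSON response into (metric, tags, sorted data points) tuples."""
    result: List[Series] = []
    for series in json.loads(body):
        # "dps" maps timestamp strings to values; sort them into time order.
        points = sorted((int(ts), float(v)) for ts, v in series["dps"].items())
        result.append((series["metric"], series.get("tags", {}), points))
    return result


def fetch_series(url: str, timeout: float = 10.0) -> List[Series]:
    # Performs the HTTP GET; requires a reachable OpenTSDB instance.
    with urlopen(url, timeout=timeout) as response:
        return parse_query_response(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical endpoint, metric, and tag; adjust for a real deployment.
    print(build_query_url("http://localhost:4242", "refinery.hx.temperature",
                          "1h-ago", downsample="1m-avg", tags={"unit": "hx01"}))
```

Asking the server for a downsampled series (here one-minute averages) rather than raw points keeps the response small, which matters when sensors report many times per second.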

 

Time series data are simply measurements or events that are tracked, monitored, downsampled, and aggregated over time. This could be server metrics, application performance monitoring, network data, sensor data, events, clicks, trades in a market, and many other types of analytics data. Time series data can be analyzed to understand the underlying structure and function that produce the observations. A mathematical model can be developed to explain the data in such a way that prediction, monitoring, or control can occur. As the internet turns inside out with time series databases, Hitachi Vantara’s Pentaho will be there to scale with the explosion of data and provide integration, analysis, and visualization for greater insights into current and new time series applications.
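As one very simple example of a mathematical model that explains observations and supports prediction, the Python sketch below fits a straight-line trend to a series by ordinary least squares and extrapolates it. A linear trend is an assumption chosen for brevity, since real sensor data usually calls for richer models, and the function names are hypothetical.

```python
from typing import List, Tuple


def fit_linear_trend(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Ordinary least-squares fit of value = slope * t + intercept."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to fit a trend")
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in points)
    variance = sum((t - mean_t) ** 2 for t, _ in points)
    if variance == 0:
        raise ValueError("timestamps must not all be identical")
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return slope, intercept


def predict(model: Tuple[float, float], t: float) -> float:
    """Evaluate the fitted line at time t."""
    slope, intercept = model
    return slope * t + intercept


if __name__ == "__main__":
    history = [(0, 1.0), (1, 3.0), (2, 5.0), (3, 7.0)]
    model = fit_linear_trend(history)
    print(model, predict(model, 5))
```

The same fitted line also serves monitoring: a new reading that falls far from `predict(model, t)` is a candidate anomaly worth investigating.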

 

