As part of the ongoing Financial Forum series, I'm going to discuss the emerging exa-scale era. One of my favorite projects on the "lunatic fringe", referenced on the slide to the left, is the Square Kilometer Array (SKA). This project is an effort to observe the heavens, in Australia/New Zealand, with a telescope that essentially covers 1 square kilometer. When completed, sometime near 2020, the system is expected to produce roughly 10x the traffic of the Internet, or somewhere near 12 exabytes of data per month. (For reference, the Internet produces 70%-80% of an exabyte per month today.) Further to the left I've also put an exabyte in the context of the number and capacity of Blu-ray movies. This facet of the image shows that an exabyte is roughly 20,000,000 movies, which corresponds to 4,566 years of viewing time, or one heck of a popcorn bill.
If you read how the SKA team is planning to handle the extreme scale, you'll find a rethink of traditional architectures. It isn't quite distributed and it isn't quite centralized. Essentially, processing is likely to occur out by one or more telescopes, with the results then pumped back into a core for further processing. (We talked a bit about this project in the OpenStack Panel Discussion, where we focused on if and how OpenStack might be part of an exa-scale architecture.) In fact, we are seeing this sort of trend more regularly, and have given it the name Smart Ingestion.
Smart Ingestion has several attributes, listed below.
- Processing can include data winnowing, data packaging, decision making, real-time analytics, predictive analytics, and triggering automation, all close to the data/information source itself.
- The results of processing are persisted in open data formats and likely stored in either an intermediate "cache-like" system that eventually gets to a core or directly in a core site.
- A subset of processing activities are persisted along with the data as an abstracted information model and are also stored in open formats like XML or JSON.
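To make these attributes a little more concrete, here is a minimal sketch of what a Smart Ingestor might do with a batch of readings. Everything in it is an assumption for illustration: the `smart_ingest` function, the field names, the threshold rule, and the `telescope-07` source label are hypothetical and are not taken from the SKA or any product. It winnows the data near the source, makes a simple decision, and packages both the results and a small information model as JSON.

```python
import json


def smart_ingest(source_id, samples, threshold):
    """Winnow raw samples near the source and package the outcome as JSON."""
    # Data winnowing: keep only the samples at or above the threshold.
    kept = [s for s in samples if s >= threshold]
    record = {
        "source": source_id,
        # Results of processing: the thinned data that travels to the core.
        "results": kept,
        # Abstracted information model: a coarse summary plus the
        # processing steps that produced the results.
        "information_model": {
            "samples_seen": len(samples),
            "samples_kept": len(kept),
            "max_kept": max(kept) if kept else None,
            "processing": [{"step": "winnow", "threshold": threshold}],
        },
        # Decision making at the edge: flag the batch if anything survived.
        "alert": len(kept) > 0,
    }
    # Persist in an open format (JSON) rather than a proprietary one.
    return json.dumps(record, sort_keys=True)


packet = smart_ingest("telescope-07", [0.1, 0.4, 2.5, 0.2, 3.1], threshold=1.0)
print(packet)
```

The point of the sketch is the shape of the output: the core receives far less data than the source produced, along with enough description to analyze it without re-reading the raw samples.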
The specific location of the Smart Ingestor will of course depend on the deployment model of your aggregate system. In the case of the SKA it would potentially be close to the telescopes; if you're an oil and gas company it might be on the ocean-going platforms; and if you're a telecommunications operator it might be in the remote/branch offices adjacent to your wire-line or wireless infrastructure. What is very clear is that proximity to the data source matters and the ability to process in the open is critical. That is because having humans or core systems process the equivalent of, say, 240 million movies per month, in the case of the SKA, isn't really feasible. This doesn't even get into the problems of moving this data over LAN, MAN and WAN networks, or how much energy would be required to persist and process multiple exabytes.
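The movie equivalents quoted in this post follow from simple arithmetic. The sketch below reproduces them under assumptions the post does not spell out: decimal units, 50 GB per movie (a dual-layer Blu-ray disc), and two hours of running time per movie.

```python
# Back-of-the-envelope check of the Blu-ray comparison.
# Assumptions (not stated in the post): decimal units, 50 GB per
# dual-layer Blu-ray movie, 2 hours of running time per movie.
EXABYTE = 10**18
BLURAY_BYTES = 50 * 10**9
HOURS_PER_MOVIE = 2
HOURS_PER_YEAR = 24 * 365
SKA_EXABYTES_PER_MONTH = 12  # figure quoted in the post

movies_per_exabyte = EXABYTE // BLURAY_BYTES
viewing_years = movies_per_exabyte * HOURS_PER_MOVIE / HOURS_PER_YEAR
ska_movies_per_month = SKA_EXABYTES_PER_MONTH * movies_per_exabyte

print(f"{movies_per_exabyte:,} movies per exabyte")
print(f"{viewing_years:,.0f} years of viewing time")
print(f"{ska_movies_per_month:,} movies per month from the SKA")
```

With those assumptions the numbers line up with the slide: about 20 million movies per exabyte, roughly 4,566 years of viewing, and 240 million movies per month at 12 exabytes.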
What else? Well, perhaps a key question in the exa-scale era is how to get back to the fundamentals of storage, networking and processing. Customers contemplating this scale tell us they doubt there is enough power to spin an exabyte of magnetic media in a single data center. Even if there were sufficient money and power supply for a single data center, the bandwidth of WAN pipes and the ability to afford a second site to assure recoverability from a disaster are suspect. Therefore thinking begins to move towards highly reliable cold media coupled with analytics that start from coarse information models. Interestingly, these customers are asking: what if the media itself could last for hundreds of years and survive floods and electromagnetic shenanigans? Effectively, you'd have highly reliable cold storage that begins to make exa-scale architectures more feasible. This alone has led users to think that perhaps a workflow starting from search and coarse analytics on information models is a better bet. Specifically, the workflow might look like:
- Coarsely analyze an information model.
- Identify content/data assets from the coarse analysis for deeper inspection, and analyze them on the best-fit platform (e.g. small-file random I/O = HNAS).
- Combine deep & coarse historic analytics with real-time analytics from smart ingestion to make a decision. Augment the information model accordingly with the results.
- As necessary move analytic/processing capabilities directly to the source so that source data is thinner and smarter. (Note this is Smart Ingestion.)
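The first three steps of this workflow can be sketched in a few lines. This is only an illustration under stated assumptions: the information model is a hypothetical in-memory list, the scores and the `realtime_boost` signal are made up, and the platform labels in `PLATFORM_FOR_KIND` are placeholders rather than product names. The fourth step, pushing processing out to the source, is not modeled here.

```python
# Hypothetical coarse information model: one small entry per stored asset.
information_model = [
    {"asset": "scan-001", "kind": "small-file", "score": 91},
    {"asset": "scan-002", "kind": "large-file", "score": 12},
    {"asset": "scan-003", "kind": "small-file", "score": 78},
]

# Assumed mapping from asset kind to a best-fit platform (labels are illustrative).
PLATFORM_FOR_KIND = {"small-file": "nas", "large-file": "object-store"}


def coarse_analyze(model, min_score):
    """Steps 1 and 2: use only the information model to shortlist assets."""
    return [entry for entry in model if entry["score"] >= min_score]


def deep_inspect(entry):
    """Step 2: stand-in for deep analysis on the best-fit platform."""
    return {
        "asset": entry["asset"],
        "platform": PLATFORM_FOR_KIND[entry["kind"]],
        "deep_score": entry["score"],  # placeholder: reuses the coarse score
    }


def decide_and_augment(model, deep_results, realtime_boost, cutoff=100):
    """Step 3: combine deep results with a real-time signal, then record
    the outcome back into the information model."""
    flagged = {r["asset"] for r in deep_results
               if r["deep_score"] + realtime_boost >= cutoff}
    for entry in model:
        entry["flagged"] = entry["asset"] in flagged
    return sorted(flagged)


shortlist = coarse_analyze(information_model, min_score=50)
deep_results = [deep_inspect(entry) for entry in shortlist]
decision = decide_and_augment(information_model, deep_results, realtime_boost=15)
print(decision)
```

Note that only the shortlisted assets ever reach `deep_inspect`; everything else is handled from the information model alone, which is where the cost savings would come from.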
To better wrap our heads around what a platform for coarse information models would look like, I want to borrow an example from my colleague Sean Moser. He cites Yelp as a great example of a platform that stores and acts upon coarse information models for services, stores, and restaurants. So if we were to use Yelp to motivate the discussion about coarse analytics, the above process would look like:
- Interact with the information model
  - Go to the Yelp web service, which naturally uses geo-IP to detect your location, to get a coarse understanding of relevant services.
  - Search or pick your class of service, say Japanese restaurants.
  - Once the class has been picked, the system produces a listing of restaurants sorted by rank.
  - Pick a restaurant, let's say Orenchi Ramen, and then interact with the information model to understand user ratings, maps, and references to its website.
- Exit the information model
  - If you choose to click on the URL for your restaurant, you exit the information model system, Yelp, and begin to work with the raw contents, Orenchi's web site.
Of course, Yelp is programmatically controllable through a series of APIs, so you could automate this process and create rich (analytics) applications that interact with both the coarse information models and the raw content.
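As a rough sketch of what automating that interaction could look like, the example below walks the same steps against a tiny in-memory model. It deliberately does not call Yelp's actual APIs: the `Listing` type, the `search` and `exit_to_raw_content` helpers, and all ratings and URLs are hypothetical placeholders made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    name: str
    category: str
    rating: float  # placeholder value, not a real rating
    url: str       # pointer to the raw content outside the model


# Hypothetical sample data standing in for the coarse information model.
LISTINGS = [
    Listing("Example Sushi", "japanese", 4.0, "https://example.com/sushi"),
    Listing("Orenchi Ramen", "japanese", 4.5, "https://example.com/orenchi"),
    Listing("Example Tacos", "mexican", 3.5, "https://example.com/tacos"),
]


def search(listings, category):
    """Interact with the model: filter by class of service, sort by rank."""
    matches = [item for item in listings if item.category == category]
    return sorted(matches, key=lambda item: item.rating, reverse=True)


def exit_to_raw_content(listing):
    """Exit the model: hand back the pointer to the raw content."""
    return listing.url


results = search(LISTINGS, "japanese")
choice = results[0]
print(choice.name, exit_to_raw_content(choice))
```

All of the searching and ranking happens against the small model; the raw content is only touched once a single item has been chosen.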
In essence, our customers expect the same thing when they think about the future of analytics and applications at the exa-scale. This way they could run many iterative analytics passes on an information model which, we've heard, could be as small as 1% of the actual data/content being stored. Once the iterations on the information model are complete, our customers expect their users to have identified a small, finite set of data/content for deeper inspection. The customers with whom we've discussed this idea at length believe that, under this workflow assumption, there will be sufficient CAPEX containment and OPEX improvement to make the exa-scale a reality.