Folks, it has been a while since I've been able to dedicate time and thought to the Innovation Center on the Community, partly because I've been finishing the semester at San Jose State University. In other words, lots to do and lots going on. Enough of that, and on to the post.
Continuing my series detailing the Financial Forum presentation, I'm going to dig a little deeper into a customer study that concluded at the beginning of this calendar year. This formal customer study is something my team, and associated organizations, conduct annually to keep a pulse on the market. The output aggregates feedback and helps our planners, researchers, and engineers stay aligned with the market. One key challenge is that the resulting document typically runs up to 300 pages; while we certainly read, discuss, and debate the themes in each study, this year we wanted some quantitative validation as well. So the team loaded up R and its wordcloud package and visualized interesting terms in the report. The result was similar to the first image in this post, and it helped validate the trends we ultimately uncovered. The three trends we believe are important to highlight from our study are: innovating around the periphery of Hadoop, coarse metadata analytics when operating at scale, and rich media based analytics. I'll address each in turn, and where I've already covered a trend in the community I'll point to existing content.
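For those curious what the word cloud step amounts to, it is essentially a term-frequency tally over the study text. Here is a minimal Python sketch of that idea (we actually used R's wordcloud package; the sample text and stop-word list below are hypothetical stand-ins for the real 300-page report):

```python
from collections import Counter
import re

# Illustrative stop-word subset; a real run would use a fuller list
STOP_WORDS = {"the", "and", "of", "to", "a", "in", "is", "that"}

def term_frequencies(text, top_n=10):
    """Tokenize, drop stop words, and return the most common terms."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)

# Hypothetical excerpt standing in for the study text
report = "Hadoop metadata analytics at scale. Rich media analytics and Hadoop workloads."
print(term_frequencies(report, top_n=3))
```

A word cloud simply renders these counts with font size proportional to frequency, which is why it works as a quick sanity check on the themes we pulled out by hand.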
- Innovate on the periphery of Hadoop - This was particularly interesting. There is a lot of purism in the community right now about shuffling data in and out of Hadoop. Most telling was a discussion with a prospect that centered on one of Hadoop's espoused virtues: move the application to the data. The user underscored this point specifically, then proceeded to describe acquiring data from an RDBMS/EDWH, processing it in Hadoop, and pushing it back to the database server. Huh? That very process violates the espoused virtue, does it not? During our discussions we've also run into customers who want to Hadoop-ify their existing SAN and NAS infrastructures so they can shed light on dark data and extract new value from it. Finally, we ran into customers who pulled data out of a leading commercial RDBMS, stored it in Hadoop, and then pushed it back into another instance of the same RDBMS so that users could more easily work with traditional BI tools. Now, I'll be the first to admit that the Hadoop community has made major progress in supporting SQL and SQL-like query languages, and of course the robustness of the file system has improved. I could get into a religious debate about file systems, but that would not further the discussion. Essentially, customers have what I feel is a legitimate request: to Hadoop-ify existing content that may be stored in an object store or NAS device, or even to sweat some existing SAN assets. Further, a key limitation in the I/O profiles, and therefore the application workloads Hadoop handles most efficiently, comes from HDFS. In file systems, affinity for one workload or another is determined by things like lock management semantics and the fundamental file system format.
What I mean is that file systems like IBM's GPFS or Hitachi's Supercomputing Filesystem (HSFS) do well with large-file sequential workloads, while HNAS is more of a balanced performer. If we were to replace HDFS with either of these types of file systems, I think it is fair to suggest that YARN, MapReduce, and the rest of the Hadoop ecosystem goodness would behave differently. So, in summary: customers are interested in Hadoop-ifying existing content and sweating their existing assets, and technically I think there is a solid argument that Hadoop could support new workloads via file systems and object stores other than HDFS.
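Mechanically, swapping the filesystem under Hadoop is a configuration exercise, since Hadoop maps URI schemes to pluggable FileSystem implementations. A hypothetical core-site.xml fragment sketches the shape of it; the scheme and connector class names below are placeholders, and the real values would come from the vendor's Hadoop connector:

```xml
<!-- core-site.xml: pointing Hadoop at a non-HDFS filesystem.
     The gpfs:// scheme and connector class are illustrative only. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>gpfs:///</value>
  </property>
  <property>
    <!-- Maps the URI scheme to a vendor-supplied FileSystem class -->
    <name>fs.gpfs.impl</name>
    <value>com.example.hadoop.GpfsFileSystem</value>
  </property>
</configuration>
```

The interesting part isn't the plumbing, of course; it's that the replacement filesystem's lock semantics and on-disk format then shape which workloads the cluster serves well.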
- Coarse Metadata Analytics when at Scale - There are already several discussions in the Community about metadata, and before I get into anything new I want to point out those references. First, John Montgomery started an interesting discussion called A Metadata Paradox - or Two, where we discuss the importance and awareness of metadata. The Snowden effect has of course broadened awareness of metadata, and Greg Knieriemen talks about this in Random thoughts (and questions) on PRISM, data gathering and metadata. Further, on the subject of extreme scale and memory-based systems, I waxed a bit philosophical and we got into why metadata-based analytics are required at scale, in If memory based storage becomes the dominate type for primary function and performance what kind of interconnect is required? That doesn't even get into all of the metadata discussions in the Hitachi Developer Network for the Hitachi Content Platform, which I encourage you to check out. So as you can see, the Community is already full of interesting discussions about metadata. The "so what" about metadata in any ICT system taken to extreme scale is, in short, that architectures require it. The figure in this section shows the expected behavior pattern for metadata systems. I won't get into the specifics of the machine-oriented or person-oriented workflows, other than to say that machines work bottom-up and people work top-down. I also want to point out that both DNS and LDAP/AD are excellent examples of highly scalable, distributed metadata architectures that make a difference in people's lives every day.
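DNS is worth dwelling on as a pattern: it resolves names by delegating down a hierarchy, so no single node ever holds the full namespace. A toy Python sketch of that coarse, hierarchical metadata lookup (all zones, names, and addresses below are hypothetical):

```python
# Toy model of DNS-style delegation: each level only knows how to
# hand a query down to the next level, so the metadata is distributed
# rather than centralized. All zone data here is made up.
ZONES = {
    ".":            {"com.": "com-servers"},
    "com.":         {"example.com.": "example-servers"},
    "example.com.": {"www.example.com.": "203.0.113.7"},
}

def resolve(name, zone="."):
    """Walk the delegation chain from the root until a record is found."""
    for suffix, target in ZONES[zone].items():
        if name == suffix:                    # authoritative answer here
            return target
        if suffix == "." or name.endswith("." + suffix):
            return resolve(name, suffix)      # delegate down one level
    raise KeyError(name)

print(resolve("www.example.com."))  # walks ".", "com.", "example.com."
```

The point for extreme-scale architectures is the same: a coarse, layered metadata index lets you answer "where is it?" without scanning everything, which is exactly why metadata-based analytics become mandatory at scale.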
- Rich Media Analytics Highly Desired - During our 2012 study we ran into a class of companies that use images and video to build data products, advise users, and drive decisions. These customers were intensely interested in technologies that help their staff improve productivity, not in complete automation at this time. Their activities ranged from estimating the financial performance of retail locations using satellite data, to helping farmers improve yields, detecting scenes/actors in serial shows/movies, detecting specific kinds of objects that may go boom, and counting biological features (like brain synapses). In all cases, as hinted, the goal is to make the human being more efficient at their tasks. Let me illustrate with a concrete example: a customer we've talked with keeps satellite images for years due to governmental requirements. They sell their imagery products as-is to various customers for upstream consumption. What is more interesting is how their data is used during, say, a flood. In that case the company needed tools that recalled imagery from days and months before the flood event so they could advise NGOs on rescue plans. Essentially, the workflow was to provide satellite images of structures both without water and with it. This let rescuers better target their efforts, since flood waters obscured, in some cases, where structures were and therefore where people might need to be rescued.
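Note that the recall step in that flood workflow is really just a metadata query over the image archive, not pixel analysis. A minimal Python sketch of the idea (the catalog records, region names, and dates are all hypothetical):

```python
from datetime import date

# Hypothetical image catalog: each record is coarse metadata, not pixels
CATALOG = [
    {"id": "img-001", "region": "delta-7",  "captured": date(2013, 5, 2)},
    {"id": "img-002", "region": "delta-7",  "captured": date(2013, 8, 30)},
    {"id": "img-003", "region": "plains-2", "captured": date(2013, 6, 1)},
]

def imagery_before(region, event_date):
    """Return pre-event images of a region, newest first, so rescuers can
    compare dry-land structures against the flooded scene."""
    hits = [r for r in CATALOG
            if r["region"] == region and r["captured"] < event_date]
    return sorted(hits, key=lambda r: r["captured"], reverse=True)

flood = date(2013, 9, 15)
print([r["id"] for r in imagery_before("delta-7", flood)])
```

This is also why the previous trend matters here: rich media analytics at scale leans entirely on good coarse metadata to find the right assets before any human or algorithm ever looks at a pixel.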
Obviously we remain engaged with our customers, and in fact we're in the midst of this year's formal study too. Instead of information and content, our focus this time is on networking and understanding what profound things the market is asking for...