Oh, The Places You’ll Go
A couple of weeks ago we held our annual HDS Influencer Summit – so titled because it is a gathering of IT industry influencers spanning the financial analyst, industry analyst and social media guru/blogger communities. One day spent together so folks can hear and see some of the latest and greatest things coming out of HDS. This time we dived deeper into some things we have cooking with our sister companies within Hitachi (which pleasantly surprised many and appeared to get the Twittersphere humming). As many heard and saw that day, we have been executing toward a vision that enables businesses to harness and leverage the power of information for competitive advantage. The scope of technologies and services we are bringing to bear encompasses the breadth of Hitachi capabilities in delivering what we call “Social Innovation”. As Mark Peters recently observed, “It’s not about an IT stack (however complete and wonderful) sold into a vertical… it’s about ‘big’ Hitachi serving/selling an entire vertical (business) ‘stack.’” It’s a big vision that perhaps just a few companies in the world have the ability to pursue.
At the same time we outlined how our business performance continues to outpace the market. We were pleased to report our Q2 financials (ending September 30), which continued a string of twelve consecutive record growth quarters – in Q2 we grew 11% year-over-year. The diversification of our portfolio continues, with 50% of revenues coming from software and services and our file/content solutions growing in triple digits from this time a year ago. I could go on with more figures, but hopefully you get the picture.
Looking to the near term, we anticipate continued changes not only in how customers consume technology and services but also dramatic innovations in the technology itself. As Hu Yoshida outlines in his 2013 predictions, we can expect continued shifts in customers’ CAPEX/OPEX mix as capacity demands steadily grow and the technology available to manage them (such as virtualization) becomes more prevalent. In fact, as Hu points out, the consumption models that customers leverage to handle technology CAPEX demands will evolve dramatically toward per-unit acquisition through cloud and/or vendor-managed service offerings. In parallel, expect continued innovation (as demanded by customers) in midrange systems, flash storage, truly “converged” infrastructure stacks and, very importantly, the scalability and automation of object-based file systems to support the explosion in unstructured content and the demands of big data analytics.
Innovation now, with an eye toward a vision for the future, is essential in any journey that leads to something truly unique. We will continue to learn from you – our customers, partners and industry observers – but we can’t hide our excitement about where we can go.
Big Data – It’s not the size of your source but the size of the insight that really matters
Over the past couple of weeks I’ve heard some great quotes from acquaintances and colleagues. Ron Lee from my team recently attended a CIO breakfast in Asia Pacific (APAC) where he heard the following from a CIO: “I’m not worried about my Big Data; I’m really worried about my Little Data.” Also, in a recent interaction with a Hitachi partner we explored the broad usage of smaller data sets to help in organizational and interior architectural design. There, one of the two data scientists we met with (he holds a Ph.D. in Psychology/Cognitive Science) had the following to say: “For years I’ve been developing islands of specialty in vast seas of ignorance.”
These discussions got me thinking…
I think that CIO in APAC is on to something. Big Data is not just about mining big data sources – big insight can also come from smaller data sets that just haven’t been tapped yet. The comments from the data scientist about democratization and extreme usability also resonated with me. The real prize here is making insight and analysis as ubiquitous and easy to use and collaborate around as, say, email.
So though the de facto definitions for Big Data seem to revolve around the three Vs – volume, variety and velocity – I think there is another angle. Here is my proposed definition of what Big Data will mean in the future.
Big Data of the Future – Agile processes, realized at scale by multidiscipline teams, that leverage a variety of data categories and types flowing through various technologies, with provisions for security and privacy. The end result: timely discovery of sparks of insight leading to valuable innovation and knowledge.
Putting your business under the microscope
At Hitachi we are already starting to realize this vision through a number of really innovative projects – one of these is called the Business Microscope. You may have already seen our announcements about how it is used to deliver new insight in retail stores and call centers. Check out my previous blog post for more information. These efforts are powered by “the continuous measurement of human behaviors” (1) using a variety of sensors connected to and near human beings in their environment.
Traffic flow at a retail store goes under the microscope
In the retail example the technology was utilized in a “home center” store. Over a period of 10 days they used the Business Microscope to track and analyze customer service activity, the standing locations of employees throughout the store floor and the impact on customer flow. By analyzing the results and appropriately repositioning shop floor staff, they were able to achieve a 15% average increase in sales per customer.
That’s a big insight for the retailer from a relatively small amount of previously untapped data!
I’ve included some images here of the traffic patterns and you can see clearly the impact of a few subtle changes that would have been unrealized without this type of analysis.
So, certainly, the output and resulting recommendations point to valuable insight and knowledge: repositioning a sales team to the right location for 16 seconds or more results in a 15% increase in average sales per customer.
In my mind this certainly matches a key part of our proposed Big Data definition in that there was a multidiscipline team behind it. In this instance the team comprised a management consulting firm, Hitachi’s own Ph.D. grade data scientists, along with customers, sales clerks, and temporarily employed staff to collect the data.
There was also a wide variety of data sources involved in the analysis and different analytic engines to process all the point-of-sale and sensor data before we arrived at the visualization shown above. This was not as clear-cut a process as traditional database transactional-style analytics.
What about the security and privacy of the data? In our retail example we were very cognizant of those issues. Our data scientists K. Yano and N. Moriwaki, along with the other partners involved in the study, employed techniques like data anonymization, network and file encryption and limited collection duration to ease privacy fears and to ensure compliance with security and privacy policies at the customer’s site.
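To make two of those privacy measures concrete – pseudonymizing participant IDs with a keyed hash, and enforcing a limited collection window – here is a minimal sketch. The key, field names and window length are my own illustrative assumptions, not details from the actual study:

```python
import hashlib
import hmac
from datetime import datetime, timedelta

SECRET_KEY = b"study-specific-secret"   # hypothetical key, stored apart from the data
COLLECTION_WINDOW = timedelta(days=10)  # matches the 10-day retail study duration

def pseudonymize(participant_id: str) -> str:
    """Replace a real ID with a stable, non-reversible pseudonym."""
    digest = hmac.new(SECRET_KEY, participant_id.encode(), hashlib.sha256)
    return digest.hexdigest()[:12]

def filter_by_window(records: list, start: datetime) -> list:
    """Drop sensor records collected outside the agreed retention window."""
    end = start + COLLECTION_WINDOW
    return [r for r in records if start <= r["timestamp"] < end]
```

The keyed hash keeps records joinable for analysis while making it impractical to recover the original identity without the key.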
Surprising insight when call center sales performance also goes under the microscope
We leveraged the Business Microscope on a second project this July with MOSHI MOSHI HOTLINE, Inc. in Japan to identify what impacts call center sales performance. The results of the analysis were pretty enlightening and, dare I say, surprising too. Two call centers participated in the study, with 51 telemarketers at one location and 79 at the other. A variety of data was collected, from face-to-face interactions between employees and supervisors to sensor readings of employees’ body movement as they went about their day. The goal was to determine what factors directly correlated with the “order receipt rate” of the telemarketers. If one were to guess, it would be fair to assume that skill level was the primary factor influencing sales success, but the analysis showed that the degree of activity during breaks had the biggest impact on sales. Surprise!
Armed with this new insight, the team tested their theory by picking a group of telemarketers of the same age and comparing sales results over a three-week period when they took breaks together versus separately. The result: when they took breaks together, their activity increased and so did their sales – by approximately 13%.
So size really doesn’t matter
This is another great example of how obscure patterns in small amounts of data can lead to big insights and big business impact. Both these examples dealt with relatively small-scale projects, with limited investment and small data sets in the tens of gigabytes. Of course, because Hitachi has been doing experiments like this with the Business Microscope for years, the overall amount of data analyzed runs closer to billions of data items across a wide number of projects. But the point here is that Big Data doesn’t have to mean big project, big volumes of data or big cost. The “big” refers to insight, and sometimes it’s the little things that make all the difference.
1. K. Yano et al., “Measurement of Human Behavior: Creating a Society for Discovering Opportunities,” April 2009
Well We Scaled Down Hitachi VSP
As per my last post, we’ve scaled down Hitachi Virtual Storage Platform (VSP) into a smaller form factor. I’d like to dig in a little deeper and explain why Hitachi Unified Storage VM (HUS VM) is dramatically different from our past efforts with scaled-down enterprise platforms such as Universal Storage Platform VM (USP VM). My esteemed colleague Hu Yoshida discusses some of the key design points of HUS VM and its relationship to both VSP and HUS. In the past, and as Hu articulates in his post, we followed a practice of miniaturization: take an enterprise storage system and make a physically smaller version of it. Essentially, with USP VM and Network Storage Controller (NSC) before that, we kept the architecture the same but put less of the ingredients inside. The miniaturized version was the same system with less cache, fewer processors, fewer ports, less on the backend, etc., in less space and a 19-inch rack. With HUS VM we’ve rewritten the rules: rather than miniaturize the system, we’ve scaled it down. Instead of using less of the same, we’ve replaced key components with functional equivalents that are Hitachi value-added or COTS (commercial off-the-shelf) in nature. The net is that the core hardware architecture of HUS VM is quite different from VSP, yet it still runs the same value-added block microcode as its bigger brother.
We were able to do this by improving our microcode, relying more on COTS equivalents and using a new Hitachi-specific microprocessor. Here’s an example: in the VSP system there are five types of Hitachi processors or ASICs, while in HUS VM there is only one type. These changes have increased flexibility in the microcode, affording our customers the advantage of enterprise capability in a modular footprint. So unlike USP VM, HUS VM is not a miniaturized version of VSP; instead it is a distinct class in the HDS portfolio apart from VSP. Additionally, if our users want a “miniature” version of VSP, we have been able to accommodate that need since the release of the product. That is accomplished through our 3D scaling methodology, which allows customers to configure a small VSP in a single 19-inch rack, with or without mass storage, and grow the system to the biggest, baddest storage system on the planet.
As suggested, HUS VM carries the same microcode base as its bigger brother, leading to a clear benefit: consistent operational behaviors in the microcode from the biggest VSP to the smallest HUS VM. This can deliver operational consistency across a larger part of the overall portfolio than in the past. Specifically, it means that not only are CLIs, APIs and GUIs the same, but also the core engines like Universal Volume Manager (UVM), Volume Migration, Hitachi Dynamic Provisioning (HDP), Hitachi Dynamic Tiering (HDT), ShadowImage, TrueCopy, etc. operate in the same way on VSP and on HUS VM. Having been an IT administrator in a past life, I know personally how big a deal this is. In effect it is like running the same Linux OS on a Hitachi 2U rack-mount server or a blade server: I don’t need to be retrained, I don’t need to switch my mental model when I move from one system to the other, and there is less chance of operational mishaps.
The Rumble in the Cage at SNW
Well, we’re a day away from “The Rumble”. Forget about the Presidential debates, the Vice Presidential debate, and the Jon Stewart/Bill O’Reilly debate. “The Rumble” (Tuesday, October 16 at 1:55PM at SNW in San Jose) is for keeps. This time it’s for real. If you’re an SNW attendee, you might consider this event for some afternoon entertainment.
Moderating this group of ragtag fighters is Mark Peters (Enterprise Strategy Group). The fighters are George Teixeira (CEO, DataCore), Mark Davis (CEO, Virsto), Ron Riffe (Business Line Manager, IBM), and of course yours truly (Chief Scientist, HDS). The subject will be storage hypervisors and storage virtualization. Come join the fun. The winner of this debate should be the attendees. It’s “winner take all” and there will be no rematch.
Is the Google Car the Future of Storage?
Well, I think it is. In school, I majored in Analogies (with a double minor in Metaphors and Euphemisms) so I think the comparison is very appropriate, and something I’ve spoken much about (those of you having to sit through my diatribes of late will get the connection).
For those of you who don’t know, the Google Car is a “driverless” car. Actually, it does require a driver, but the driver is not required to do anything. These cars have now been promoted from “experimental” status to legal in California (as in Nevada and Florida), with a bill signed by our governor Jerry Brown this week.
I’ve run into them (not literally!) a few times on Bay Area freeways and I find them fascinating. As you can see in the picture, there is a cylinder on the roof that rotates at 10 RPM to sense road conditions so the car can adjust its driving appropriately. It apparently also reads posted speed limit signs—something I am still learning to do.
So what does this have to do with storage? One word: Automation. I not only speak about the history of storage, but the future of storage as well. I talk about how our “storage computer” absorbs tasks that we mere humans have done in the past. I talk about how we’re beginning to automate performance with Hitachi Dynamic Provisioning and disk tier selection with Hitachi Dynamic Tiering. I also talk about how LUNs are being turned into simply “containers” for data and will lose all sense of physicality. Do we have a fully automated storage environment today? No; but we are getting closer and at some point (drumroll, please, here comes the obvious analogy) storage will be the equivalent of the Google Car.
Taking Converged Infrastructure to the Next Level
by Pete Gerr on Oct 9, 2012
Today Hitachi Data Systems (HDS) announced the new generation of its Unified Compute Platform (UCP) portfolio as well as a new software product, UCP Director.
UCP Director was developed from the ground up by HDS to provide unified management, orchestration and monitoring of the complete converged infrastructure within VMware vCenter itself—something we believe is unique in the industry. With these new UCP solutions, Hitachi achieves the lowest TCO per VM in the industry today without compromising reliability or flexibility.
One of the new solutions announced this week, Hitachi UCP Pro for VMware vSphere, showcases best-of-breed compute and storage technology from Hitachi along with integrated IP and FC networking from Brocade. This solution was developed for, and achieves very tight integration with, vCenter environments, and it comes with seamless support and service from Hitachi for customers’ convenience.
As more customers move their most demanding applications onto virtualized infrastructure, UCP Pro for VMware vSphere is designed to support mission-critical workloads and provides the highest reliability and availability in the industry today. We believe UCP Pro for VMware vSphere provides a cost per VM that is between 25% and 40% lower than the current industry average. And that’s without asking customers to give up the enterprise-class reliability, availability and world-class serviceability that Hitachi Data Systems is known for. Customers can also enjoy the flexibility to choose from server CPU and memory options to best suit their needs.
Along with this new UCP Pro solution comes UCP Director software. UCP Director provides simple and scalable monitoring of all elements of the UCP Pro for VMware vSphere solution under a single unified view which is seamlessly integrated within the vCenter graphical user interface. It’s simple to use, scalable and allows VM administrators to use the familiar vCenter tools with which they are most comfortable.
Furthermore, by enabling administrators to manage, provision, configure and monitor the entire converged infrastructure directly within vCenter, without requiring third-party or external tools, UCP Director saves customers time and money even as their virtualized environments scale. And as requirements change and grow, the solution can scale along with these needs, in addition to being extensible through open APIs.
By allowing customers to focus on deploying new workloads quickly and not constantly recreating infrastructure designs to accommodate change, Hitachi Data Systems enables customers to accelerate their business and do so even while reducing costs and complexity. It’s a powerful combination and one that we believe is unique in the industry today. What do you think?
Realizing Big Cost Cuts Requires a New Generation of Converged Infrastructure
If you are thinking of deploying a converged infrastructure (CI) stack, you are among a fast-growing group of IT professionals in enterprises and cloud service providers. The promise of CI – which integrates server, network, storage, element management and, in many cases, hypervisor software – is too attractive to ignore. The increasing popularity of CI is driven by organizations’ desire to find new ways to cut their fastest growing cost component: OPEX. Everyone buys hardware and software to benefit from advances in technology and to help their business stay competitive. So why not pass the cost of integrating, testing and configuring the ever-changing landscape of new systems and applications to the vendors? Surveys show that this can consume a quarter of IT resources and time. Highly optimized CI solutions can also lower capital expenses through higher utilization, less cabling and fewer network connections. Some vendors are only too happy to embrace the cost of pre-testing complete infrastructure so they can offer a complete package versus selling individual elements.
Organizations that have tried first-generation CI, which focuses mostly on hardware integration and validation, have seen a reduction in deployment time and related costs. But they have also seen less-than-stellar benefits in automating end-to-end infrastructure management. Most of these original CI architectures use multiple element managers, a result of bundling systems from different vendors, with no real orchestration to simplify the high volume and wide range of day-to-day tasks. Organizations have not seen an appreciable reduction in a substantial part of the cost of operations. IT professionals say that processes like configuring systems, deploying virtualization images, provisioning storage or network to virtual servers, and monitoring and troubleshooting end-to-end systems are still complex, error prone and can take days or even weeks to complete. The big disadvantage of not having true unified orchestration is that IT administrators continue to spend a majority of their time on manual, low-value tasks rather than on value-added activities that streamline the data center into a service model. A few first-generation CIs with so-called unified infrastructure management offer only basic integration and/or provide a “link and launch” to multiple device managers from a single GUI. Others have offered new integrated management tools that require IT professionals to learn new tools through lengthy training and/or new processes, yet provide limited ability to truly automate and optimize the orchestration and management of the whole infrastructure. This is not an option easily embraced.
CIOs are requiring their organizations to re-architect data centers, including a migration to private clouds with virtualization at the core. The focal point of infrastructure orchestration is moving to serve these virtualized environments. VMware vCenter and Microsoft System Center are becoming critical to managing and orchestrating the virtual environment, but their visibility into the physical elements is largely limited to the servers that host hypervisor images. The rest of the infrastructure – including bare-metal servers, switches and storage – is mostly invisible to hypervisor-based management. To overcome this shortcoming, many products provide software tools that enable VMware vCenter or Microsoft System Center to perform some management functions, for instance storage provisioning or snapshot management, but again, these are limited to individual infrastructure elements and do not orchestrate the entire infrastructure from one source. First-generation CI solutions produced real cost benefits in the pre-deployment phase but remain limited to traditional element management tools bundled together. The division of labor for virtual machines, physical servers, client networking, storage and storage networking produces management overlap and no labor-related benefits without optimized converged system management.
IT professionals at both enterprises and cloud service providers are clear about what they wish for in a solution. To truly realize the full benefits of CI, a new generation of CI needs unified orchestration of the physical layer from the same software that manages the virtual. To achieve seamless, true orchestration of end-to-end physical and virtual environments, tight integration across all elements, including virtualization management software, is essential. The industry needs innovations in CI that support seamless infrastructure, virtualization and management integration in order to enable dramatic simplification, automating manually intensive tasks and reducing associated staff time. These innovations must also cut overprovisioning of storage and network resources and, most importantly, maintain commitments for mission-critical workload SLAs. With these advancements, a new generation of CI can become the foundation for the migration to private clouds and enable bigger reductions in operating costs. Savings will come from reduced labor requirements due to simplified and automated data center management, and will eventually drive enterprises and cloud service providers toward consolidating storage, network and server management teams into CI teams.
When will this day come? I’d be curious to get your thoughts.
Laying the Foundation for our Future Vision of Data Protection
by Sean Moser on Sep 24, 2012
Data protection is straining under the weight of big data. Exploding data volumes, stringent restore requirements, and shrinking OPEX and CAPEX budgets all mean that traditional data protection solutions aren’t cutting it any more. IT teams are relying on disparate tools for operational recovery, disaster recovery, long term archive, performance, migration, hardware refresh, etc. And each of these use cases involves capturing a copy of the data, yet there is no integration, sharing or reuse. The result is multiple, redundant, and often poorly tracked copies, leading to increased management overhead and excess use of key resources.
Hitachi Data Systems wants to address these challenges head on, so I am excited to announce that we have acquired Cofio, a privately held company with a successful unified data protection solution. Its flagship product, AIMstor®, is an innovative and intuitive solution for protecting customer data globally.
The Cofio acquisition is part of a longer term data protection vision that goes beyond the existing “backup” paradigm. We envision a comprehensive solution that integrates backup, archive and storage platform-based replication technologies. To realize this vision, we will be integrating Cofio technology with the high performance of storage-based data protection to meet our customers’ current and future needs. Over time, we plan to offer a centralized data instance management solution that promises significant infrastructure, storage, and management cost and efficiency savings by consolidating, reducing and better managing and tracking data copies.
We are excited to welcome the Cofio employees to HDS and to work with them to continue to evolve our data protection portfolio. We also look forward to supporting existing Cofio customers through the integration.
I will continue to share updates and insights on how our innovative data protection vision is emerging in the coming months.
Scaling-out is a well-oiled, in-vogue term today. There are, of course, other related scaling terms like scaling-up, scaling-down or my recent favorite, scaling-right. In different contexts these terms imply different things – for instance, whether we use scale-up to mean that a product or technology is moving from one market segment to another. For example, “NetApp is still attempting to scale up their product to be enterprise class.” In another case, however, we can use it to talk about a product improving an attribute, as in, “Hitachi can add multiple Virtual Storage Directors to scale up performance.” I think the term scaling-down is more interesting; the first time I heard it was in 1999, in reference to the Linux kernel, from Linus Torvalds.
Delivering last night’s keynote to a boisterous LinuxWorld crowd here, Torvalds prodded open-source programmers to shy away from the “sexy” task of scaling up the OS to compete with commercial Unix flavors. Instead, he said, programmers should actually focus on scaling down the operating system for user-friendly use on devices from desktop PCs to PDAs. (ZDNet, Linux takes aim at the desktop, 1999)
HDS is one of the few companies I know that intentionally scales down capability and function from enterprise class systems into midrange devices. Here are four examples:
- Mainframe MLPF/LPAR to Intel architecture LPARs on our Compute Blade platform
- Enterprise storage ShadowImage LUN/volume cloning to our midrange storage
- Mainframe class bus fabrics and I/O capabilities to intense I/O expandability, and bus based fabrics on our Compute Blade platforms
- Enterprise storage-based thin provisioning to our midrange storage
We take this path for many reasons. One is ensuring core feature consistency; another is that we recognize that when a capability or feature is mature enough for our enterprise customers, it is sufficiently hardened for consumers of midrange or distributed systems. Obviously, this might make you wonder what’s up our sleeve.
I think the title of this blog might be better stated as: “What can you predict from Hitachi implementations today?” I’ll illustrate this point through example. A long time ago, in a storage universe not so far away, Hitachi sold both the enterprise-class 9980V and the midrange 9500. Toward the end of the product lifecycle we introduced something called “cross system copy” that allowed a user to replicate data from the 9980V to the 9500 for disaster recovery purposes. A little while later, we debuted Universal Storage Platform (USP), one of the first products to embed storage virtualization inside an intelligent controller — an approach that has ultimately proven to dominate the market. In this example we were able to trial storage virtualization for a limited use case, disaster recovery, and observe a set of core behaviors. Learnings from these observations were then folded into a more general storage virtualization feature around 2005.
As I touched on briefly in a recent Speaking in Tech podcast, if we look at the HDS enterprise storage portfolio today we have a lot of interesting IP –Hitachi High Availability Manager, non-disruptive migration, rock solid UVM, pervasive use of Intel microprocessors, value added firmware, etc. What might you imagine we’d scale down next?
Hitachi Data Systems Now Supports Windows Server 2012
Today, Microsoft announced its first Windows Server Cloud OS release with the introduction of Windows Server 2012, and HDS is excited to be part of the next generation of Microsoft solutions this release enables. HDS has a reputation in the IT Industry for leadership in mission-critical architectures with enterprise service and support, and we’re extending our expertise to bring the best possible solutions to the Windows Server platform.
We’ve collaborated with Microsoft since early 2010 and have architected our products to be optimized for Windows Server 2012 features. We also believe that Windows Server 2012 sets the foundation for Hitachi solutions built on Microsoft Private Cloud Fast Track architecture that combine Hitachi compute and storage with industry-standard network infrastructure and management with System Center 2012 SP1. This is compelling because Windows Server 2012 scales up to 64 nodes and can support thousands of virtual machines using Hitachi infrastructure-as-a-service (IaaS) and platform-as-a-service (PaaS) offerings. Customers trust HDS infrastructure because it’s designed for the most demanding environments, and we’re ready to meet the challenge.
HDS servers and storage are Windows Logo Listed and Certified for Windows Server 2012 on the Windows Server Catalog. In fact, HDS has more storage products listed for Windows Server 2012 than any other company. To view our listings from the Windows Server Catalog website, click “Certified for Windows Server 2012” for a complete list of Hitachi-certified products.
Search on the following HDS products:
- Hitachi Unified Storage
- Hitachi Virtual Storage Platform
- Hitachi Universal Storage Platform
- Hitachi Compute Blade 500
- Hitachi Compute Blade 2000
For more information on Hitachi solutions for Microsoft, visit: www.hds.com/go/microsoft
Digital Archiving Part 4: The Data About Your Archive Data
One of the things that is frequently overlooked in archiving is the catalog of the data in the archive. In fact, I refer to the whole suite of data about data in an archive as metadata (not to overload an already overloaded term, but it can’t be helped). Metadata is the broad term that I use to label all of the pieces of information that describe what is in the archive, with the exception of the actual data itself. I use metadata to refer to catalogs, POSIX metadata, embedded metadata, custom metadata and search indexes. The way that “I” describe these aspects of metadata goes like this (from a file and object store perspective):
- Catalogs
- Summary information about a file (or files) that includes location, names, light keywords, references, application, type, etc.
- POSIX Metadata*
- Data around a file — the filename, file path or path name, creation and modification dates and times, size, permissions and ownership, type, etc.
- Embedded Metadata
- Data about the data within a file, typically from its header, which could include pixel resolution (height x width), color palette, color depth, application, a magic number, bitrates, etc.
- Custom Metadata
- User- or machine-supplemented data, or other desired information about a file or the data within a file (e.g. associated weather conditions, location data**, geo-coordinates, hashes, camera type, comments, keywords and tags, system level data, thumbnails, other references, etc.)
- Search Indexes
- Keywords or other data components organized in a scannable or searchable arrangement used for fast lookup and a summary of content.
* Typically, this is the metadata with which most file data is associated
** Location data in this respect is system location such as a URL or pathname
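To make the taxonomy concrete, here is a minimal Python sketch of how these metadata layers might be modeled for a single archived file. All class and field names are my own invention for illustration, not from any HDS product.

```python
from dataclasses import dataclass, field

@dataclass
class PosixMetadata:
    # The POSIX layer: path, size, timestamps, ownership, permissions.
    path: str
    size_bytes: int
    mtime: float
    owner: str
    permissions: str

@dataclass
class ArchiveRecord:
    # Catalog summary: name, location, type.
    name: str
    location: str
    file_type: str
    posix: PosixMetadata
    # Embedded (header) metadata, e.g. pixel resolution, color depth.
    embedded: dict = field(default_factory=dict)
    # Custom metadata, e.g. tags, camera type, geo-coordinates.
    custom: dict = field(default_factory=dict)

    def index_terms(self) -> set:
        """Derive searchable keywords from several metadata layers."""
        terms = {self.name.lower(), self.file_type.lower()}
        terms.update(str(t).lower() for t in self.custom.get("tags", []))
        return terms

rec = ArchiveRecord(
    name="sunset.jpg",
    location="archive://vol1/2012/sunset.jpg",
    file_type="image/jpeg",
    posix=PosixMetadata("/vol1/2012/sunset.jpg", 2_048_576, 1.34e9, "alice", "rw-r--r--"),
    embedded={"resolution": "1920x1080", "color_depth": 24},
    custom={"tags": ["beach", "vacation"], "camera": "DSLR"},
)
print(sorted(rec.index_terms()))
```

Note how the search-index terms are derived from the other layers: the data about the data is what makes the archived file findable long after anyone remembers where it lives.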
There may be some differences in opinion on the way I’ve defined these terms, but I like them and they are sufficient for going forward in this blog. The point is, there can potentially be a lot of data about data, without including the actual data itself in the overall count.
We all know that, as data becomes less interesting for any number of reasons (age, usefulness, importance, etc.), the access activity of these files diminishes as well.
There are also several reasons for archiving data, such as data preservation (want to), compliance (have to), and/or to get the unused data off of the expensive primary storage systems and to locate it somewhere more appropriate (either want to or have to, and need to). There are several terms thrown around to describe data that is no longer in use, but is deemed too important to delete entirely. My favorite is “long-tail data.” See the RED activity line level in the chart below.
Long-tail data usually starts out active, especially at creation time, and depending on what the data is, can be active for a long or short period of time. This data then goes through a slightly active phase, less active than at the time of creation, with occasional references and maybe some minor updates. Finally, toward the end of this data’s usefulness, it will enter the long-tail phase of its lifecycle, rarely being referenced, opened or used.
My last blog post is a perfect example of an accelerated lifecycle for a piece of data. When I created it, I had it open and didn’t close it until I was done, or had to reboot my system. That file was active. I knew exactly where it was. It was listed at the top of my “Recently Opened” list. I never had to search for it. When I finished writing it, I reviewed it again and made some mild updates and changes. I emailed it for review and for posting on the blog site. It was done.
That blog is now officially long-tail data. I would consider it a short-term file with an accelerated lifecycle that lasted a couple of days at the most. Many files of enterprise projects can be active for months or years. Versioning is used to save the state of data that can become inactive, backups of data can be inactive and so forth, but there is an interesting phenomenon about long-tail data.
Looking back at the chart, you’ll see an inverse relationship between the long-tail line and the BLUE line, which shows the activity of the data about this data, the metadata (of course this is not a true logical inverse relationship). The less active data becomes, the more referenced the metadata becomes. If you think about how you use data, this should make sense. As I described in my blog file example, while the file is active its location is known and it sits on the “Recently Opened” list. However, the more inactive the file becomes, the more I need to search for it, either by browsing through the file system or through a full system search. The data about this data is always referenced, either in a search for other data or for this data itself; it becomes a part of every query from now on.
Now, scale this scenario up and out to an enterprise-level archive. Data is constantly being ingested. This reminds me of one of my favorite one-liners: “a true data archive only gets bigger.” Part of this ingestion could include cataloging, metadata extraction or full-blown indexing in order to make data in the archive findable at some level.
Then again, this assumes that this is a “seamless” system for lifecycle management where data flows automatically through these systems and that all storage tiers are part of the same namespace.
There are archiving systems and data repositories that tend to be more active than long-tail data archives and that are purpose-built to ingest data from the get-go. These archiving systems tend to be partitioned off from the rest of the operational and active systems and may not be part of the same, overall “namespace” of which the active data is a part. Actually, the flow in this case is reversed in that data is pulled from these repositories and staged into an operational state to be processed. Then again, searching for the right data to be “pulled” is also a big part of the system and process.
I’d love to hear from you, what’s your opinion on my use of these terms? How important is data about data in your environment?
From the show floor at SIGGRAPH — Tuesday, August 7, 2012
Matt Jacobs, Visual Effects (VFX) Supervisor at Tippett Studio, discussed how, during rendering for the movie The Immortals, Tippett’s artists struggled with creating what he referred to as “The Ballet of Blood”. They used PipelineFX Qube, Autodesk Maya and Houdini’s fluid simulation tools, and built mesh simulations in-house. (And of course, the infrastructure was supported by Hitachi NAS storage from Hitachi Data Systems!)
Salaries in China and interest in Western movies are both rising. Ron Stinson of Rainmaker, a visual effects shop in Vancouver, BC, commented that with XingXing Animation’s recent acquisition of Rainmaker, movies can now move into China more easily despite government quotas (previously limited to 26 films per studio). With the Rainmaker purchase, films can move from Rainmaker to China with almost limitless transparency through their new parent company.
Other observations from SIGGRAPH beyond the HDS booth…
Although you might think that Hollywood is American or maybe even think of media and entertainment as an “American” Olympic sport, the artists and studio post-production executives that are speaking in the HDS booth this week (and many who are attending SIGGRAPH) are keenly aware of how interconnected and international this industry truly is.
With tens of thousands of artists streaming into the exhibition hall, North American video imagery creators must improve their own effectiveness and digital storytelling or be replaced by craftsmen who are residents of other countries. Creators from Sydney to Beijing, from Barcelona to Belgrade, and from Moscow to Morocco might have faster rendering platforms, better data migration software tools, and improved storage solutions that provide that extra edge. This is what SIGGRAPH is all about: learning as a community about new technologies on the horizon or those rapidly coming to a post-production house near you. More to come tomorrow on day two….
Follow HDS and Visual Effects Studio Customers at SIGGRAPH
by Jeff Greenwald on Aug 2, 2012
When I look at 2012 and reflect on the top 10 grossing movies worldwide, contrasting their total gross against their opening-weekend ticket receipts, it is clear that a BIG opening weekend (great weather, positive industry buzz, good critical acclaim, and an interesting plot) contributes mightily to a profitable return on an increasingly bigger investment by the studios.
Here are the top 10 as of August 2, 2012:
Movie (Studio):
- Marvel’s The Avengers (BV)
- The Hunger Games (LGF)
- The Dark Knight Rises (WB)
- The Amazing Spider-Man (Sony)
- Dr. Seuss’ The Lorax (Uni.)
- Madagascar 3: Europe’s Most Wanted (P/DW)
- Snow White and the Huntsman (Uni.)
What is clear to me when I look at this list, however, is that all of these movies employed a significant amount of VFX, animation, and post-production work to help intensify, enhance, and build the exact reality for the audience, as requested by the film’s director. Simply taking a camera, shooting a scene, editing out the retakes and slapping together files has been relegated today to YouTube, the Sundance Film Festival, and maybe amateur Blair Witch projects.
Today, post-production, lighting, color enhancement, 3D conversion, and HD camera technologies, along with animated characters (Ted, the Lorax, etc.), all contribute to the world that we seek when we “escape” to the movies. There will still be a place for dramas and reality TV, but Hollywood has honed the art of developing artificial reality that scares, entertains, and delights. HDS is proud to have studio customers who worked on 7 of these top 10 films.
This week (August 7-9), HDS will exhibit at SIGGRAPH at the Los Angeles Convention Center, where 25,000 artists and studio heads will attend. HDS technologies enable our media and entertainment customers to be creative with compute and storage infrastructures that are highly scalable, available, and reliable, and that deliver faster file renders, transcoded videos, and films that are quickly broadcast over cable, the Internet, and traditional TV airways. Five HDS customer studios (Arc Productions, Lux VFX, newbreed Studio, Rainmaker, and Tippett Studio) will highlight their 2012 video projects in HDS booth #622.
We hope you can join us, but if you can’t, I will blog about highlights throughout the show. Talk to you soon.
Walls Come Tumbling Down…
I just returned from an almost-2-week vacation, and it’s actually good to be back at work. I’m rested and ready to launch myself back into the regular routine. I’ll blog on some interesting vacation observations in a few days, but wanted to cover some recent customer activity that occurred before I left on my global trek.
I’ve often said I’ve got the greatest job at HDS (@HuYoshida says he does, but he still has to manage me, so I win that point). The best part is meeting with our customers and talking technology and challenges. The latest round of visits included some of our largest customers who have been with us for many years, and the topic was a software product we’ll be coming out with soon. We’re still open to suggestions and modifications and the customers provided honest and constructive feedback.
Not just with this product, but with the many I’ve worked on over the decades, it’s great to come up with an idea, refine it here in the labs, and then pass it by the customers to get their reaction. Their responses range from:
Perfect!! I love it!! I want it now!!
That’s the dumbest idea ever, but if you add this and that it will be awesome!!
That’s the beauty of these interactions. In the end, it makes for better products.
Many years ago (for different products and different employers of mine) that was not the case (at least not to this extent) and product development was a “push” activity. In other words, “Here’s our new product, how can I convince you to buy it.”
Customer activity these days is much more collaborative and the “walls” have come down (hence the title of this blog).
But I want to talk about another “wall” that greeted me on this trip, and that was in Berlin. I’ve met many “walls” in my life – Walmart, Walgreens, and of course the world-famous Wall Drug Store, in Wall, South Dakota (that’s a whole other story!!), but this was my first visit to Berlin. Spending the weekend there (between customers), I had to visit the site and the museums.
The first thing I noticed was my name graffiti’d on the wall (uhhh, I didn’t do it, really). But in an attempt to capture the “Kodak moment” with my arms and iPhone outstretched, I had an offer from a nearby Danish tour guide to take the picture. Turns out, his name was also Claus, and we had a brief chuckle about Claus taking a photo of Claus by the Claus graffiti.
But back to the real topic at hand – relationships with our customers have become amazingly collaborative, and that’s a classic win-win. The products we develop are better for all, and the goodwill generated is amazing. I can’t ever imagine a better business model for our industry.
And now that I’m back in the daily routine, it’s time to book that next fistful of flights and get back on the road. I can’t wait.
A Series on Hadoop Architecture
This is guest blogger Matt’s second installment in his series on Hadoop and MapReduce. Specifically, he explores the viability of the Hadoop and MapReduce framework for computational science and engineering (CSE). Traditionally, CSE environments have focused on inter-processor and parallel communications for forming huge compute clusters. Hadoop and MapReduce take very different approaches from these types of complex systems, but can they handle CSE workloads and eventually change the game? Read more to find out…
MapReduce’s Potential in Computational Science and Engineering (CSE) (part 2)
Hadoop and MapReduce — A Brief Introduction
As we mentioned in our previous post, Google’s pioneering MapReduce parallel programming framework, which was later cloned and extended by the open source community via Hadoop and its ecosystem (including HBase, Hive, Pig, and other software), today drives much of the web’s infrastructure. Search engine index creation, search operations, ad-targeting analytics, operations management and optimization, and social media graph processing are some of the core tools driving web operations today.
The simplicity and regularity of MapReduce’s semantics allow the underlying system software (including the Hadoop Distributed File System (HDFS) or the Google File System (GFS)) to be optimized for the streaming data flow while providing automated mechanisms for recovering from server, application, storage or networking problems. The automated failure recovery made possible by the MapReduce framework brings two significant benefits: (1) application code can rely on these automated recovery mechanisms, reducing application development complexity and avoiding the need to replicate these failure recovery mechanisms in a custom way in each application; (2) system operation and maintenance of a MapReduce cluster is simplified, since many failure modes can be recovered from automatically.
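The retry mechanism can be sketched in a few lines of Python. This is an illustrative toy, not Hadoop’s actual scheduler: the point is simply that a failed task is rescheduled on another worker without the application code ever seeing the fault.

```python
import random

def run_task(task, worker, fail_rate):
    # Simulate a worker that sometimes dies mid-task.
    if random.random() < fail_rate:
        raise RuntimeError(f"worker {worker} lost while running {task}")
    return f"{task}:done"

def schedule_with_retry(tasks, workers, fail_rate=0.3, max_attempts=10):
    results = {}
    for task in tasks:
        for attempt in range(max_attempts):
            worker = workers[attempt % len(workers)]
            try:
                results[task] = run_task(task, worker, fail_rate)
                break
            except RuntimeError:
                continue  # reschedule on another worker; the app never sees the fault
    return results

random.seed(42)  # deterministic for the demo
out = schedule_with_retry([f"map-{i}" for i in range(4)], ["w1", "w2", "w3"])
print(out)
```

Application code just submits tasks; the framework absorbs the failures. That is benefit (1) above in miniature.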
Programming for Computational Science and Engineering
One area where MapReduce is rarely used today is applications in computational science and engineering (CSE). These applications typically model the underlying physics of the science being studied (e.g., ocean circulation, atmospheric weather, gas dynamics in star formation, etc.) or engineering device or process being developed. These codes have been developed and tuned over several years, sometimes decades, using traditional programming languages like Fortran, C, or C++ and libraries like MPI and OpenMP to support parallel operations. They are written as a series of fine-grain mathematical transformations on input data and intermediate data in memory, and it’s likely that the overhead of MapReduce’s framework is too high to support this kind of fine-grain parallelism. Completely rewriting these applications to use MapReduce would require that the prior software development investment be thrown away, a very expensive and probably impractical strategy.
Limits of Exploitable Parallelism in CSE
In addition, most CSE applications are run on a few hundred processors maximum (though the median number of processors per run is probably no more than 8 to 16 cores). This hasn’t changed much since the 1990s, when MPPs first appeared, although large supercomputers and MapReduce clusters today often have more than 10,000 cores. For applications that are parallelized using threading libraries like Apple’s Grand Central Dispatch or Linux’s pthreads, the exploitable parallelism in applications is most likely going to be on the order of 4 to 8.
Even in applications where a lot of parallelism exists and the codes are written in parallel form, there is often not enough data to justify running across more than a few dozen processors. On the other hand, a few dozen cores are becoming the commodity server sweet spot these days. Since the 1990s, we’ve gone from 1- to 2-processor machines to multi-core servers with 16-24 cores, with 32- and 64-core servers on the horizon. Today it is possible to run parallel calculations on 16-way multi-core single-box servers that were considered exotic parallel machines in the 1990s (when your humble correspondent was very active in parallel application development). Given Amdahl’s Law, which dictates that small amounts of load imbalance and inefficiency can greatly limit the speedup achievable via parallelism, it’s still difficult to run most parallel scientific applications efficiently on more than 16 to 32 cores.
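Amdahl’s Law is easy to verify numerically. This short Python sketch (my own illustration) shows how even a code that is 95% parallelizable tops out well under 20x speedup, no matter how many cores you throw at it:

```python
def amdahl_speedup(p: float, n: int) -> float:
    # Amdahl's Law: speedup = 1 / ((1 - p) + p / n),
    # where p is the parallelizable fraction and n the core count.
    return 1.0 / ((1.0 - p) + p / n)

# Even a 95%-parallel code is throttled by its serial 5%.
for n in (8, 16, 32, 1024):
    print(n, "cores ->", round(amdahl_speedup(0.95, n), 2), "x speedup")
# 8 -> 5.93x, 16 -> 9.14x, 32 -> 12.55x, 1024 -> 19.64x
```

At 16 cores the code already runs at barely half its ideal efficiency, which is why most parallel scientific applications are run on relatively modest core counts.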
MapReduce Cluster Nodes Have Optimal Parallelism for CSE Applications
On the other hand, typical MapReduce clusters have hundreds of large-memory, high-core-count (at least 4- or 8-cores today, moving towards 16- and 32-cores in the near future) multi-core servers with multiple terabytes of non-RAIDed local disk storage. Using MapReduce directly to parallelize typical scientific or engineering codes does not appear to be the best approach. As mentioned earlier, these codes require fine-grain data sharing and regular synchronization; MapReduce’s semantics involve localized computation on independent pieces of data, followed by a shuffle and reduce phase to aggregate results. Its semantics are probably not general enough to support most codes built for computational science and engineering. The whole point of MapReduce is to simplify the programming model so that large amounts of data can be processed within a highly scalable compute and storage framework. So the question is: if we look at the problem differently, can we still exploit the MapReduce framework to get real computational science and engineering work done?
Why Not Use MapReduce to Run CSE Ensembles?
CSE software uses input data from external measuring devices or input models to drive its execution. It can be very useful to vary the input model or perturb the measured input data to create N input data sets, and then run N calculations on these varied/perturbed inputs and observe the differences between runs. These ensemble calculations are useful in many ways, including:
- Determining how sensitive the output results are to the inputs; systems whose outputs are highly sensitive to inputs are less predictable, which can be extremely useful information when analyzing the results
- Determining how sensitive the output results are to changes in one or a few input variables can help pinpoint the effects of specific inputs on the overall CSE software model
- Various input configurations can be modeled to determine which engineering design is the most efficient, high performance, or optimizes some other desirable feature relative to other input configurations
As a concrete example, ensembles are used in numerical weather forecasting to determine how predictable the atmospheric state is at a particular point in time. Unstable atmospheric conditions that are less predictable can be tagged as such when providing forecast results. The scalability and automation of MapReduce has the potential to accelerate the computation and analysis of ensembles. Each MapReduce node can be given one parallel calculation in the ensemble to perform during the map phase, while the reduce phase can be used to combine results to provide comparisons between ensemble members as well as aggregate statistics on the overall ensemble calculations. The resulting data output can also remain in-situ in the MapReduce cluster, available for further analysis, reprocessing, and long-term archiving. There is no need to provide an external shared storage file system (e.g., Lustre or a scale-out NFS server) as the data can stay resident within the MapReduce cluster. The goal is to “exploit the restricted programming model (of MapReduce) to parallelize the user program automatically and to provide transparent fault-tolerance” (this quote is from the original MapReduce paper by Dean and Ghemawat) for CSE ensemble programs. Using MapReduce for ensembles could provide:
- Simplified management of ensemble computing and results
- Higher performance and efficiency by leveraging the moderate parallelism in each MapReduce node, rather than trying to make a single calculation run efficiently across a huge number of processors
- Automated load balancing and simplified cluster management
- Fault-tolerance in commodity server clusters allowing the ensemble jobs to complete, even in the presence of system failures
Leveraging Hadoop MapReduce clusters for ensemble computing could exploit most of the existing CSE software development investments (that don’t use MapReduce programming) while making ensemble computing easier to program and execute, and allowing the ensemble outputs to be conveniently analyzed and stored in a scalable way. So, although Hadoop and MapReduce have not been used much for computational science and engineering, ensemble computing may be the killer application for their use in this domain.
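To make the ensemble idea concrete, here is a toy Python sketch. The “simulation” is a stand-in decay model of my own invention, not a real CSE code; the point is the shape of the computation: the map phase runs one ensemble member per perturbed input, and the reduce phase aggregates statistics across members.

```python
import random

def simulate(initial_temp: float) -> float:
    """Stand-in for a full CSE calculation: relax toward 15.0 degrees."""
    t = initial_temp
    for _ in range(10):
        t = t + 0.1 * (15.0 - t)
    return t

def map_phase(perturbed_inputs):
    # One ensemble member per map task, each a complete (toy) calculation.
    return [simulate(x) for x in perturbed_inputs]

def reduce_phase(outputs):
    # Aggregate ensemble statistics: mean result and spread between members.
    mean = sum(outputs) / len(outputs)
    spread = max(outputs) - min(outputs)
    return mean, spread

random.seed(0)
inputs = [20.0 + random.uniform(-1, 1) for _ in range(8)]  # N perturbed inputs
mean, spread = reduce_phase(map_phase(inputs))
print("ensemble mean:", round(mean, 2), "spread:", round(spread, 2))
```

A small output spread relative to the input perturbations suggests the (toy) system is predictable for those conditions, exactly the kind of tag a weather ensemble would attach to a forecast.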
Why are Hadoop and MapReduce Eating the World? (Part 1 of a Series on Hadoop Architecture)
Guest blogger Matt continues to provide insight on interesting subjects. Here, read part 1 of a multi-part series comparing the Hadoop and MapReduce architectures to traditional programming approaches. Why are Hadoop and MapReduce eating the world? What makes Hadoop so popular, and how does Hadoop process volumes of data – dare I say it – BIG DATA – easily and cheaply? This is a nice contrast between past approaches to solving this problem and the modern techniques being employed today. Enjoy!
Hadoop and MapReduce – A Brief Introduction
Google’s pioneering MapReduce parallel programming framework, which was later cloned and extended by the open source community via Hadoop and its ecosystem (including HBase, Hive, Pig, and other software), today drives much of the web’s infrastructure. Search engine index creation and search operations, ad-targeting analytics, and social media graph processing are the core tools driving web operations today. Most of these infrastructure applications are based on Google’s GFS file system or the open source Hadoop file system, and related MapReduce programming techniques. The techniques are characterized by a series of transformations between groups of key-value pairs, an initial process of local transformations known as the map phase, followed by a sort-partition-and-combine process known as the reduce phase. More complex applications generally require multiple map-reduce phases for a complete calculation.
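The map, sort-partition, and reduce flow described above can be sketched with the classic word-count example. This mirrors the conceptual model, not any particular Hadoop API:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(docs):
    # Map: local transformation emitting (key, value) pairs,
    # one per word occurrence.
    return [(word, 1) for doc in docs for word in doc.split()]

def shuffle(pairs):
    # Sort-and-partition: bring all values sharing a key together.
    pairs.sort(key=itemgetter(0))
    return {k: [v for _, v in g] for k, g in groupby(pairs, key=itemgetter(0))}

def reduce_phase(grouped):
    # Reduce: combine each key's values into a final result.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)
# {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

A real framework runs the map and reduce functions across many servers and makes the shuffle a distributed sort, but the programmer writes essentially nothing beyond the two small functions.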
Why has MapReduce become so popular? There are really three parts to that story, but it’s also important to remember that MapReduce was not invented in a vacuum. In the early 2000s, Google’s engineering teams had a very specific problem that they had to solve: scaling operations by a factor of 100x (or so they thought; it later turned out that they needed to scale even further, by a factor of 10,000x and more). Existing distributed and parallel file systems couldn’t come close, either in terms of storage capacity and performance (100s to 1000s of petabytes in a single system using commodity servers and disk storage), or in the ability to tolerate faulty software and hardware in extreme-scale environments with millions of components (and where 2% to 5% of the system may be inoperable at any given time).
Let’s look at three reasons MapReduce has become popular.
Part 1: Simplified Parallel Programming
Parallel programming is hard. Really hard. In fact, it’s so hard that the most productive way to program for parallelism is to never do it directly. Instead, find some way to allow the programmer to express the work they want to perform in such a way that (a) they can efficiently, succinctly, and easily express the computation they want to perform; and (b) the programming model is expressive enough to meet the requirements in (a), but restrictive enough so that the parallelism inherent in the calculation is not hidden or lost via the act of writing the program, and automated software tools can be used to execute the program efficiently. MapReduce provides both (a) and (b), while scaling from small to extremely large datasets within the same programming and system framework.
Part 2: The File System is the Computer (and the Database Too!)
Historically, large-scale computer systems have separated memory and compute resources from disk (and tape) storage. System (Infiniband) or storage (Fibre Channel) networks are then used to connect the two. The problem is that these networks are expensive and generally lack web- or supercomputer-scale connectivity. Supercomputers still use this compute-and-memory-separated-from-storage paradigm, but require custom networks to make it work. However, the more serious problem with this approach (beyond cost and scalability limitations) is that it does not exploit the benefits of locality when computations are co-located with the file data they need from the storage system.
MapReduce is designed to co-locate computations with data, and in fact, goes beyond this most excellent idea by leveraging the system-wide replication already required for data resilience to flexibly co-locate data with computations. Furthermore, it can exploit the co-location to aid load-balancing (e.g., it avoids co-locating computation on an already busy server and has alternatives to do so because data is replicated to other, potentially less busy, servers). With MapReduce, the file system is essentially embedded into the parallel computer, with significant performance benefits (both from load-balancing and from exploiting locality between computations and data).
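A toy sketch of replica-aware placement follows (all names invented for illustration): because each block lives on several servers, the scheduler can send the computation to the least-loaded server that already holds a local copy, getting locality and load balancing at the same time.

```python
def place_task(block_replicas, server_load):
    """Pick the replica host with the lowest current load."""
    return min(block_replicas, key=lambda host: server_load[host])

server_load = {"s1": 9, "s2": 2, "s3": 5, "s4": 0}
replicas = {"block-A": ["s1", "s2", "s3"], "block-B": ["s1", "s3", "s4"]}

placements = {}
for block, hosts in replicas.items():
    chosen = place_task(hosts, server_load)
    server_load[chosen] += 1  # account for the newly placed task
    placements[block] = chosen
print(placements)
# {'block-A': 's2', 'block-B': 's4'}
```

Note that the busy server s1 holds copies of both blocks but receives neither task; replication gives the scheduler alternatives, which is exactly the load-balancing benefit described above.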
Part 3: Simplified Operations via Fault-Tolerance in Extreme-Scale Environments
Extreme-scale web infrastructure, with millions of components and 100s or 1000s of petabytes of storage, absolutely require system designs that transparently and automatically tolerate and recover from failures of all kinds, including data, storage, networking, server and software faults. This fault-tolerance requirement for scalability allows for two additional benefits:
(a) Building on the simple MapReduce programming model for parallelism, the framework also tolerates and transparently recovers from a variety of system faults and load imbalance conditions, with no additional effort from the programmer. The fault-tolerance is built right into the framework. This contrasts with other popular parallel programming models such as MPI or OpenMP, or pre-MapReduce programming frameworks at Google, which require that the programmer write custom, application-specific code to handle faults. This adds significantly to program complexity, and this complexity reduces performance and programmer productivity. MapReduce allows the programmer to avoid this complexity entirely.
(b) What’s the difference between a server failure and a system administrator simply shutting down a server? From the MapReduce framework’s perspective, absolutely nothing! Hence, routine system maintenance does not generally require downtime: instead, administrators can easily replace servers, disks and network switches while maintaining high system performance and throughput. Failed components (e.g., disks, servers, file data, etc.) can be replaced when convenient, without disrupting operations, while the system automatically re-replicates data copies from other servers to maintain data redundancy at prescribed (generally N=3) levels.
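The re-replication idea can be sketched as follows. This is an illustrative simplification, not HDFS or GFS source: when a server disappears (failure or planned maintenance, the framework cannot tell the difference), any block that falls below the target copy count is copied to another live server.

```python
TARGET_COPIES = 3

def re_replicate(block_locations, live_servers):
    """Restore each block to TARGET_COPIES replicas on live servers."""
    for block, hosts in block_locations.items():
        hosts[:] = [h for h in hosts if h in live_servers]  # drop dead hosts
        candidates = [s for s in sorted(live_servers) if s not in hosts]
        while len(hosts) < TARGET_COPIES and candidates:
            hosts.append(candidates.pop(0))  # copy from a surviving replica
    return block_locations

locations = {"blk1": ["s1", "s2", "s3"], "blk2": ["s2", "s3", "s4"]}
live = {"s2", "s3", "s4", "s5"}  # s1 was shut down for maintenance
print(re_replicate(locations, live))
# blk1 picks up a new third copy; blk2 was never degraded
```

From the operator’s point of view, decommissioning s1 required no downtime and no manual copying; the system healed itself back to N=3.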
MapReduce combines a widely applicable programming model that is implicitly parallel and scalable, tolerates system faults and load imbalance, and integrates file storage directly into the computer. The difference between MapReduce and existing parallel programming and data storage models reminds me of the difference between aircraft carriers and battleships. During the first part of the 20th century, the battleship reigned supreme and naval battles were primarily between fleets of battleships, which also provided limited coastal bombardment and amphibious operation support. Aircraft carriers were originally used strictly for reconnaissance, but faster ship speeds, more efficient operational practices, more powerful aircraft and better tactics led to their use for both reconnaissance and attack for operations over 100s or 1000s of miles, instead of the 10s of miles to which battleship fleets were restricted. By integrating long-range operations (think big data), reconnaissance and attack (computation and storage), aircraft carriers revolutionized naval warfare to a similar degree that MapReduce is now revolutionizing how computing systems process huge data sets.
Part 2 of Hitachi NAS SiliconFS Object-based File System
My first guest blogger, Matthew O’Keefe, PhD, provides the conclusion to his two-part blog on accelerating NAS with hardware by describing how hardware pipelining is implemented.
Accelerating NAS via Hardware
By Matthew O’Keefe, PhD
The Key to Efficient CIFS and NFS Performance: Pipelining
In my previous blog post, I pointed out that network protocols like NFS and CIFS could, in theory, exploit pipelining to significantly improve NAS system performance. Pipelining allows many operations to proceed in parallel across multiple, independent memory banks and FPGA chips, greatly increasing performance, stability under heavy load and power efficiency. In this post, I’ll talk about how the Hitachi NAS server implements pipelining.
How HNAS Implements Pipelining: SiliconFS Object-based File System
HNAS implements pipelined network file operations via its Silicon File System FPGA-based server architecture. FPGAs are a form of non-custom ASIC (application-specific integrated circuit) that contains programmable logic components called “logic blocks”, and a hierarchy of reconfigurable interconnects that allow the blocks to be “wired together”—somewhat like many (changeable) logic gates that can be inter-wired in (many) different configurations. Logic blocks can be configured to perform complex, combinational functions or merely simple logic functions like AND and XOR. In most FPGAs, the logic blocks also include memory elements, which may be simple flip-flops or more complete blocks of memory.
FPGAs can be used to implement any logical function that a custom ASIC could perform. FPGAs cannot achieve quite the same density or performance possible with a custom ASIC, but the ability to update functionality after shipping, partial re-configuration of a portion of the design, and the low non-recurring engineering costs relative to an ASIC design (notwithstanding the generally higher unit cost) offer advantages for many applications, including, as we will see, network file servers.
Mercury uses separate groups of FPGAs to implement network, file system and storage operations, and in particular, the data flow of these operations, as shown in the following Mercury hardware layout diagram.
HNAS architecture includes four FPGA chips: the network interface, disk interface, hardware file system (WFS) and data movement (TFL) FPGAs, all connected to the data movement chip via LVDS (low voltage differential signaling) connections.
A key advantage of this design is the point-to-point relationship between the FPGAs along the pipelines. While traditional computers are filled with shared buses requiring arbitration between processes, this pipeline architecture allows data to transfer between logical blocks in a point-to-point fashion, ensuring no conflicts or bottlenecks. For example, data being processed and transferred from a network process to a file system process is completely independent of all other data transfers, so it would have no impact on data moving to the storage interface. This is vastly different from conventional file servers where all I/O must navigate through shared buses and memory, which can cause significant performance reductions and fluctuations.
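A software analogy (not the FPGA implementation itself) may help: if each pipeline stage owns a dedicated point-to-point queue to the next stage, transfers between one pair of stages never contend with transfers between another pair. A minimal Python sketch, with stage and link names invented for illustration:

```python
from queue import Queue
from threading import Thread

def stage(inbox, outbox, transform):
    # Each stage consumes from its own dedicated inbox and writes to its
    # own dedicated outbox -- no shared bus, no arbitration.
    while True:
        item = inbox.get()
        if item is None:          # sentinel: shut the stage down
            outbox.put(None)
            return
        outbox.put(transform(item))

# Dedicated links: network -> file system -> storage.
net_to_fs, fs_to_disk, done = Queue(), Queue(), Queue()
threads = [
    Thread(target=stage, args=(net_to_fs, fs_to_disk, lambda r: r + ":fs")),
    Thread(target=stage, args=(fs_to_disk, done, lambda r: r + ":disk")),
]
for t in threads:
    t.start()

for i in range(3):                # requests enter at the network stage
    net_to_fs.put(f"req{i}")
net_to_fs.put(None)

results = []
while (item := done.get()) is not None:
    results.append(item)
print(results)
# ['req0:fs:disk', 'req1:fs:disk', 'req2:fs:disk']
```

While req1 is in the file-system stage, req0 can already be in the storage stage and req2 in the network stage; the dedicated links are what let all stages work concurrently, which is the essence of the point-to-point pipeline described above.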
The following file system operations are performed directly in HNAS hardware:
- Create/delete (files and directories)
- Read/write (user data and attributes)
- Directory operations (lookup and readdir)
- FS consistency and stable storage
- Metadata caching
- Free space allocation
As in CPU pipelines, exception conditions (e.g., error handling) and complex but infrequently used operations like quotas, file system check, NVRAM replay and management (format, mount, shutdown) are executed in software, outside the pipeline hardware. Software (running on today’s fastest multi-core CPUs) associated with pipeline operations can also extend the functionality of the fast data pipeline into other areas, including encryption and storage tiering. The Hitachi NAS SiliconFS object-based file system allows customers to exploit custom hardware acceleration for the datapath while riding the price-performance improvements in multi-core CPUs to execute complex, value-add features in software.
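One way to picture the fast-path/slow-path split is as a dispatcher that keeps common operations on the hardware pipeline and punts exceptions and rare operations to software. The operation names below are paraphrased from the lists above; the code is a conceptual sketch, not the HNAS implementation.

```python
# Conceptual sketch: common operations ride the FPGA fast path,
# while exceptions and infrequent operations fall back to software.

FAST_PATH = {"create", "delete", "read", "write", "lookup", "readdir",
             "metadata_cache", "free_space_alloc"}
SLOW_PATH = {"quota", "fsck", "nvram_replay", "format", "mount", "shutdown"}

def dispatch(op, exception=False):
    if exception or op in SLOW_PATH:
        return "software"           # executed on the multi-core CPUs
    if op in FAST_PATH:
        return "hardware pipeline"  # executed by the FPGA datapath
    return "software"               # anything unrecognized is punted too

print(dispatch("read"))                  # hardware pipeline
print(dispatch("fsck"))                  # software
print(dispatch("read", exception=True))  # software (error handling)
```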
HNAS Hardware Architecture: The Results
HNAS pipelined server architecture, implemented via FPGAs, yields outstanding performance (and price/performance metrics) in several dimensions:
- Highest aggregate and most predictable performance — HDS effectively designs directly for the performance it wishes to achieve in the product
- Highest aggregate IOPS of any platform (95,757 SPECsfs2008 NFS throughput with an ORT of 1.73 ms, the highest single-node throughput of any NAS platform)
- Lowest cost per IOPS of any NAS platform
- Largest number of files, directories and overall storage, due to high-throughput SiliconFS object-based file system metadata engines
- Graceful handling of heavy load, due to the lack of underlying OS bottlenecks (virtual memory, scheduling, thrashing, etc.)
- Highest SPECsfs numbers in the industry (189,994 SPECsfs2008 throughput via a 2-node cluster)
- Lowest power per IOPS
HNAS hardware pipelining strengths allow it to directly address one of the biggest weaknesses found in NAS technology deployments: NAS server proliferation. By using hardware very efficiently, a single NAS server can potentially replace dozens of slower software-based NAS servers.
By integrating a hardware datapath and multi-core CPU system, HNAS delivers the highest performance, most efficient and scalable file server in the market today.
Honestly, We’re Proud. All Over Again.
That’s what the posters say on many walls across the company right now. Why, you ask? We’ve had quite a nice run of delivering industry leading financial results the past three years here at HDS, and just concluded another stellar fiscal year at the end of March. But before I get into that, we’re proud for a number of equally meaningful reasons as well – in the past four months alone, we have been honored by being ranked in FORTUNE’s “100 Best Companies To Work For,” in Chief Executive’s “40 Best Companies for Leaders,” as well as being named to the Ethisphere Institute’s 2012 World’s Most Ethical Companies list for the second consecutive year. All while delivering 15% year-over-year growth in the fiscal year FY11 just concluded in March, and achieving record consolidated revenues of $4.4 Billion.
How could we achieve such results in today’s hyper-competitive and uneven economic environment? Simple: attract great people to execute a vision for information technology that is unmatched in the industry and solves real customer problems every day. Our storage virtualization is the foundation for our solutions and is resonating now more than at any other time in recent history, due to the economic savings associated with reclaiming existing (heterogeneous) capacity and applying automated management capabilities to a single dynamic pool of storage. This piece of our business grew more than 30% Y/Y in the quarter just ended, but that doesn’t begin to tell the entire story. The intelligence that can be layered on top to leverage this foundation as a single platform for all data is where the power of the solution can really be achieved. Whether it be blocks, files or objects that are stored and managed, the level of integration (as evidenced by the recently announced Hitachi Unified Storage) is where the value proposition gets really interesting. Our file and content portfolio (consisting of Hitachi Content Platform, Hitachi NAS and Hitachi Data Ingestor) grew more than 50% Y/Y in the just-completed quarter and achieved record revenues as well. The use cases range from cloud enablement and big data to simply content archiving, and the demand keeps growing.
The services wrapped around all of this are critical to customer success, and we are working with the largest companies on the planet to transform their businesses, growing this portion of our business close to 20% last quarter. Some people may be surprised to hear that software and services now account for nearly half of our overall revenue. In addition, we were recently highlighted by Gartner as the top performing vendor in terms of revenue growth in the firm’s Storage Management Software Report. We achieved a remarkable 28.7% Y/Y growth in storage management software revenue for calendar year 2011, which was more than any other vendor. Everything is geared towards our singular design goal of making our customers’ information matter.
So let’s recap the facts – a great place to work, recognized for developing future business leaders, operating with the highest of ethics and growing faster than almost every company in our chosen markets. Sometimes you can’t help but feel proud.
That Awkward Moment…
When you realize that you need to announce a product that your biggest competitor has had for almost 8 years. That awkward moment when you realize you have been bashing controller based virtualization for the duration of that 8 years, and now you’re in the position to have to endorse it. Ouch! That’s gotta hurt and I can only imagine what the EMC marketing team had to go through to prepare for this announcement. That’s not a team I’d want to have been a part of. “Gee guys, we’re announcing something that HDS has had – and we’ve been bashing – for 8 years. Any suggestions on how we couch it?” Eight years in this biz is a lifetime. It’s like American Motors being resurrected and developing a Rambler hybrid. “Yeah, we’ll take on those Prius guys and dominate market share.”
Really, do you hope that customers have bad memories and will forget your well-orchestrated messaging over these past 8 years? Will your customers forgive you and just think you messed up on this one subject, FOR 8 YEARS? Or do you long for the good old days of InVista and that over-hyped announcement from New Orleans? InVista? How did that one work out for you guys?
OK, Claus. Calm down, here. What is this blog all about?
Well, EMC just announced controller based storage virtualization. After all that EMC has said on the subject, they’ve seen the light. And I’m happy about that since it is a ringing endorsement of the strategy we embarked on many years ago. Welcome to the club, but…
There are two old tricks in politics. The first: if you say something loudly enough and often enough, people will believe you. It works, unfortunately. The second is the well-honed art of mudslinging. It also works, unfortunately. Both are on display now from EMC, but don’t expect HDS to be silent on this. Actually, we’re quite pleased they’re agreeing with us – we can only imagine it will help our bottom line.
Combing through the information coming out of EMC World, we’ve come across a few statements that need “adjusting”. I hope you don’t mind if I come across on the blunt side. Then again, I’ve waited 8 years for this blog….
I have already given some of my feedback to Dave Raffo on the subject, but wanted to expound a bit on some of the statements that I understand EMC made:
“We’ve extended Symmetrix’s data integrity to non-EMC devices. Hitachi does not do that.”
Seriously, what kind of statement is this? Of course we “extend data integrity” (whatever that means) to external storage and actually improve reliability, integrity, and availability of external storage. Corral your marketing guys and give them a little technical training.
“Our technology is free of charge. You can virtualize any amount of non-EMC storage behind Symmetrix. Hitachi gives you a certain amount of terabytes for free, and then they charge when you go beyond that.”
Well, EMC states that software license enablement of FTS (Federated Tiered Storage) is a no-charge feature to customers, but they fail to mention the future impact on software maintenance costs for the FTS license and any other EMC software license that charges maintenance based on installed capacity. With the Switch It On program from HDS, virtualization *IS* free and third party capacity is deeply discounted. Hmmmm…again, not true. But, good try guys….
“Hitachi will also tell you not to use virtualization for databases, we don’t say that.”
Well, as an ex-DBA, let me give you my take – and yes I’ve been guilty of saying this. But if EMC is not saying this, they should. Are you nuts? Database applications are all about performance. The second most important aspect of databases is performance. Do you want me to name the 3rd, 4th, and 5th most important thing about database applications? So, EMC, if you imply that your customers should safely keep their Oracle (or whatever) databases on 3rd party virtualized storage, are you saying that virtualized storage outperforms your internal storage? C’mon guys, stop with the ridiculousness…
“We also extend FAST to other arrays. They [Hitachi] don’t extend auto tiering.”
You must not be aware of our website. It’s pretty simple. It’s www.hds.com and if you took the time to check it out, you’d find we actually do support this. Again, what you’re saying is not true.
So this announcement from EMC on controller-based storage virtualization is rather amusing to me. Look for more to follow from me and my colleagues on what you need to know and how to separate fact from fiction.
And one final note: does FTS support z/OS? I think we know the answer to that question, and yes, of course we do support z/OS.
Hitachi NAS SiliconFS Object-based File System
Today, I get to introduce my first guest blogger, Matthew O’Keefe, PhD. My colleague Matt will discuss hardware accelerating the various components of NAS systems, specifically Hitachi NAS (HNAS, aka BlueArc Mercury), in a multi-part series. Matt’s expertise is in scale-out file systems and kernel development, so you might want to read this thoroughly. Since I am a performance nut (as well as having a passion for efficiency) this post seems appropriate to provide you with some insight on the thinking and architecture within the design of HNAS.
Accelerating NAS via Hardware
The Foundation of NAS Technology: NFS and CIFS protocols
Network-attached storage (NAS) first became widespread in the mid-1980s with the advent of LANs and workstations, and later, PCs. Client machines on the network shared data via a file server using two protocols: for Windows-based systems, Microsoft adapted IBM’s Server Message Block (SMB), adding features and renaming it the Common Internet File System (CIFS) as it evolved beyond NETBIOS/NETBEUI, while Sun popularized NFS in the UNIX world. The basic idea for these protocols was to implement file operations (e.g., create/open/read/write/close/truncate/delete/mkdir/rmdir/link/unlink/stat/fsync) over the network via client requests to a server. Specialized network file server appliances became popular versus roll-your-own file servers because they overcame protocol performance hurdles (such as synchronous writes, which could be accelerated via NVRAM), simplified storage hardware deployment and volume management, exploited operating systems tuned specifically for file serving, and simplified system management.
Implementing each NFS and CIFS operation generally involves a series of three sub-operations—network, file and storage—to determine what operation a client is requesting, transfer the necessary data, then send any data and return codes for the file back across the network from server to client. Each sub-operation stage can be broken down further into micro-operations (such as translating a file byte address to the appropriate block address, performing a lookup of a file name in a directory, etc.). Traditionally, these micro-operations have been performed in software sharing a single memory space, using conventional operating system support for network and file system operations.
The Basics of HNAS Pipelined FPGA Architecture
Pipelining is a classic technique to speed up processes consisting of a series of operations, including assembly lines (which are inherently pipelined) and computer central processing units (CPUs), which have been pipelined since the 1960s. Instead of processing one operation completely, then starting and completing the next operation sequentially thereafter, and so on, pipelined operations are broken into n sub-operations; each operation is completed by going through the n sub-operations implemented by the pipeline. Several good things result from pipelining operations:
- At any point in time, n operations are being performed in parallel;
- After the first operation gets through the pipeline, each following operation completes at the pipeline rate, which for efficient pipelines is n times the rate of doing each operation sequentially;
- Each pipeline stage can use its own local memory, providing n times the bandwidth of a single main memory and removing memory conflicts between requests in different pipeline stages.
For example, an NFS read request could be implemented roughly as a sequence of 4 steps:
- The request is encapsulated as a network packet and sent from the client to the server;
- The server interprets the read request and determines the blocks associated with the file offset and length requested;
- The server requests these blocks from the storage devices;
- The server returns the file data obtained from the blocks, along with a return code indicating the operation completed successfully.
If a series of 100 such read requests is sent to a non-pipelined NFS server implemented in software, each request is completely executed before the next request is started, so the total execution time is (4)*(time per step)*(100) or (400)*(time per step). In a pipelined NFS server with 4 independent stages, 4 operations are occurring simultaneously and once the pipeline fills, a read request is fulfilled every (time per step). Hence, the amount of time to complete the 100 read requests is 100*(time per step), or ¼ the time required by the non-pipelined server.
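The arithmetic above is easy to verify with a toy timing model. This sketch counts time steps for the sequential and pipelined cases; note the pipelined total is actually 103 steps because of the initial pipeline fill, which the text rounds to 100.

```python
# Toy timing model for n_requests through an n_stages-deep server.

def sequential_time(n_requests, n_stages, step=1):
    # Each request runs all stages to completion before the next starts.
    return n_requests * n_stages * step

def pipelined_time(n_requests, n_stages, step=1):
    # The first request takes n_stages steps to fill the pipeline;
    # each subsequent request then completes one step later.
    return (n_stages + (n_requests - 1)) * step

seq = sequential_time(100, 4)   # 400 time steps
pipe = pipelined_time(100, 4)   # 103 time steps, roughly 1/4 of sequential
print(seq, pipe, round(seq / pipe, 2))
```

The deeper the pipeline and the longer the request stream, the closer the speedup gets to the ideal factor of n.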
HNAS SiliconFS technology implements precisely this kind of pipelining, but with much deeper pipelines, more parallelism, and multiple memory modules to remove bottlenecks. In fact, the NFS and CIFS protocols are so amenable to pipelining that the pipeline depth and the resulting clock rate can be increased as necessary to achieve the targeted performance. Moreover, by avoiding resource conflicts over memory ports and other hardware, performance scales predictably across different loads, and can sustain itself consistently even under very heavy, difficult (e.g., random small file write) workloads.
The Performance Potential of Pipelined NAS Hardware versus Traditional Software Designs
In modern multi-core processors, CPU pipelines are limited to about 6 to 7 stages and require complex interlock logic to delay certain operations from moving forward in the pipeline until earlier operations in the pipeline complete. Due to the sequential nature of most software, this interlock logic is activated quite often, creating stalls in the pipeline and reducing the pipeline rate increase to less than the number of stages n. This effect is so deleterious that in the mid-2000s, processor vendors like Sun and Intel reduced the pipeline depths of their processors from 14 to 6 stages (Sun UltraSPARC) and from 31 to 14 stages (Intel), and reduced clock rates by over 50%.
In contrast, NFS and CIFS server operations require little interlock logic and generally execute completely independently of each other. This means that pipeline depth can be increased as necessary to match the network and storage hardware speeds available at the time, so that pipelines for implementing NFS and CIFS can be designed for very specific performance targets and can be guaranteed to reach them.
However, if these file server operations are implemented sequentially in software, then the pipeline speedup potential inherent in the NFS and CIFS protocols is lost. Today’s multi-core processors have significant contention for the limited bandwidth between processors and to off-chip memory. Networking, file system and storage operations implemented in parallel across multiple cores contend for the bandwidth into the single off-chip, main memory, creating bottlenecks, contention and erratic performance, especially under heavy load. What’s needed is a pipelined implementation of the network, file system and block storage operations with separate, parallel memories per pipeline stage.
In my next blog post, I’ll tell you exactly how HNAS implements this kind of pipelining, and describe the amazing performance it can achieve.
HDS at SAP SAPPHIRE NOW and ASUG Annual Conference; it’s going to be a gem!
Hitachi Data Systems is excited to be sponsoring SAP SAPPHIRE NOW and the ASUG 2012 Annual Conference. The HDS team is off to sunny Orlando, Florida, where the conference is being held this week, May 14th-16th, at the Orange County Convention Center. HDS is an Onyx-level sponsor, so we will be actively involved throughout the conference.
HDS hopes you will attend our many speaking sessions. They include a joint SAP-HDS big data session on Wednesday, May 16 / 12:00 – 12:40 PM in the Partner Theater. We will also co-present with SAP, HP and Cisco on Wednesday, May 16 / 3:00 – 3:45 PM at the D&T/HANA Campus; this Microforum session will provide an overview of hardware appliance options optimized for SAP HANA. Hitachi Consulting will also be speaking on Tuesday, May 15 / 11:00 – 11:45 AM in the Partner Center; the topic will be SAP HANA-enabled Next-generation Market Responsiveness for CPG Companies. This exciting subject should generate a lot of discussion.
Please come visit us at the HDS Booth #1063 and in the SAP Test Drive area. There will be lots of action, giveaways, frequent mini stage presentations and active informative discussions!
Check our QR code microsite http://qr.hds.com/sapphire.
Digital Archiving Part 3: How would you define “Technology Longevity”?
I can’t help but enjoy reading the comments, debates and various blog articles on optical storage technologies, especially on technology durability and media longevity – even more so when optical is compared to tape storage for long-term digital archiving or as a slightly active data repository. Back in November 2011, I wrote a blog on the capacity density of the new BDXL Blu-ray format being denser than LTO5 tapes. But now I want to explore longevity. How long should your data last on a given technology? And I don’t just mean the data on the media – I mean, is the media supported today without your datacenter doubling as a technology museum? I’m sure we all have floppy disks, tapes or removable disks that still have data on them, but you’ll never know if that data is any good because, well, there’s nothing left to read them with.
I would like to hear your opinions as to the information in the chart below (my timing might be a little off, but you should be able to get the gist of this post). Fortunately or unfortunately, I’ve been in the technology industry for a long time. I have participated in and implemented optical technology both professionally, and as a “prosumer” (prosumer in this case is a very knowledgeable consumer, or professional consumer). Actually, most of you have as well whether you know it or not.
Let me explain what I am trying to describe in this chart at a high level. The first commercially available Compact Discs (CDs) hit the market in the early 1980s. Eventually two formats were released, raising capacity from 650MB per disc to 700MB per disc. The first commercial use was music distribution, competing with tapes (cassettes and 8-tracks) and vinyl records. Tapes and records were analog recording technologies, while CDs were digital. The important takeaway: digitally recorded music means the music was stored as digital data and converted to analog sound. Soon thereafter, CDs were used as a vehicle for other content distribution like audio books, video, documents and multimedia documents, games and software. Today, in 2012 (about 30 years later), CDs are still in use for content distribution, consumer storage, software, games, data distribution, etc. During this time, prices dropped dramatically for both the blank media and the drives for reading and writing, yet quality and features continued to increase. I like to use the example of a used CD I bought at a swap meet for $1.00 that didn’t work 12 years ago on the drives of the day (I knew it wouldn’t work due to the scratches, but “Dark Side of the Moon” for a buck??), but works today on a modern drive. I’m not saying this always works, but in this case, it did.
Around the early to mid 1990s (roughly 12 years later), the follow-up to CDs was announced: the Digital Versatile Disc (DVD). This time, multiple layers for recording data were incorporated, with 4.7GB and 8.5GB capacities. The primary market focus was video distribution, competing with VHS videotape and possibly LaserDisc (though no real threat on that front). The follow-on markets for DVD included game distribution, software distribution, music, and consumer and enterprise recording for content distribution, backup and archiving, and other multimedia documents. Today, DVD is still the main media distribution vehicle for software, games, consumers, etc. We are currently in the “overlap” era between Blu-ray (BD) technology and DVD. The overlap era between movies on VHS tape and movies on DVD lasted about 2 to 3 years, and my guess is the overlap era between DVD and BD will be much longer. The cost of DVD blank media and drive devices for reading and writing has also dropped dramatically, as with CD technology. One important note here: every DVD drive device today – reader, writer, rewriter – supports both DVD and CD media.
Blu-ray Disc (BD) is the newest media, introduced around 2006/2007 and established as the supported standard around 2007. Blu-ray is still in the overlap era with DVD for movie distribution and as the other content distribution vehicle, but also in the mix are online streaming services, application downloads and “cloud”. Here, I’m only going to discuss the BD technology. BD capacities when announced were 25GB and 50GB (dual layer), and the format famously competed with High-Definition DVD (HD-DVD). This consumer-based, public battle was not the replacement of a staple media format, like CDs taking over from audio tapes and vinyl records for music or DVD replacing VHS video, but more like the VHS versus Betamax battle for the videotape standard back in the 1980s. The new BDXL Blu-ray format now supports 100GB and 128GB capacities using both additional recording layers (3 and 4 layers) and a new recording format to increase the per-layer density. Projections are that this multi-layer approach to Blu-ray technology will continue for some time to increase the capacity density of the media and drive down the bit cost per disc.
Ok, now that the history and background are laid out the way I want for this article, let’s talk about longevity. Today, if you Google “buy a Blu-ray drive”, you will see internal drive devices for around $60 with a modern SATA interface. If you read closely, these drives are capable of reading and writing media for CD, DVD and BD (BDXL support is still new and a little more expensive for now). For a few dollars more, you can even have the drive write custom labels for you directly on the disc. Think about that: a brand new, mass-produced, commodity device manufactured to support a 30-year-old media technology for around $60. The device itself is greatly different from the original CD device of 30 years ago, with newer technology, enhanced features, new interface(s), and faster, smaller, denser packaging. The oldest CD in my collection that I burned is from March 1995, and it still works today in my new MacBook Pro (of course, this CD isn’t stored in the bottom of a drawer somewhere), and my first music CD from the 1980s still plays today in these new devices.
In my chart, I tried to illustrate the industry-standard interfaces at the time each technology was introduced, such as SCSI, IDE and so forth (avoiding the exotic stuff). This shows the technology world advancing with faster and more economical interfaces, improved manufacturing efficiencies, and so forth, but here’s my take: with the many markets that use these technologies, especially the consumer and high-volume markets, dropping support for legacy media isn’t an option. My guess is it could be harder to rewrite the firmware to drop support for these media formats than to just keep including it going forward. Bottom line: 3 generations of media are supported with read/write capability today, spanning about 30 years, with improvements and added features, all at reduced pricing.
Now let’s discuss tape, specifically Linear Tape-Open (LTO), and more specifically tape for long-term data preservation. According to the Wikipedia entry for LTO, a modern LTO tape drive, for example LTO5:
- Can read the current generation tape cartridge (n), LTO5, and the two prior generations (n-1 and n-2), that is, LTO4 and LTO3 cartridges written by their associated tape drives, in their native capacities and formats
- Can write the current generation tape cartridge (n), i.e., an LTO5 drive writes to an LTO5 cartridge, and the prior generation (n-1), i.e., an LTO5 drive writes to an LTO4 cartridge at its native capacity and format
- This would apply to all tape generations. For example, an LTO4 tape drive can read LTO4, LTO3 and LTO2 tape cartridges, and can write an LTO4 and LTO3 tape cartridge
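The n/n-1/n-2 rule is simple enough to encode directly; here is a small sketch of the compatibility matrix just described:

```python
# LTO compatibility rule: a generation-n drive reads generations
# n down to n-2, and writes generations n and n-1.

def can_read(drive_gen, tape_gen):
    return drive_gen - 2 <= tape_gen <= drive_gen

def can_write(drive_gen, tape_gen):
    return drive_gen - 1 <= tape_gen <= drive_gen

# Examples from the text: an LTO5 drive reads LTO3-LTO5 and writes LTO4-LTO5.
assert can_read(5, 3) and can_read(5, 5) and not can_read(5, 2)
assert can_write(5, 4) and not can_write(5, 3)
# An LTO4 drive reads LTO2-LTO4 and writes LTO3-LTO4.
assert can_read(4, 2) and can_write(4, 3) and not can_write(4, 2)
```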
Also, according to the LTO generation table below from the same Wikipedia page, LTO dates back to 2000 with new generations of LTO standards introduced on average about every 30 months (2 ½ years).
The next-generation LTO release, LTO6, is rumored to be out by the end of this year, 2012. This means that read support for LTO3, released in 2005, and write support for LTO4, released in 2007, will be dropped at that time, at least for LTO6 tape drives. This is the 6th generation of this technology in 12 years, so this standard moves very fast, maybe too fast. For long-term data preservation requirements, technology and media becoming obsolete so quickly drives operational costs (OPEX) higher over the lifetime of the data due to technology and media migration costs; and in many cases, that lifetime is forever.
Granted, the standard optical storage track (notice at the bottom of my chart the carcasses of defeated technologies throughout optical storage history) is currently only in its 3rd generation, so there’s no telling what will be supported in the next. Some holographic technologies promise backwards compatibility, though maybe not for all generations and possibly at the cost of significant advances like capacity; others might be disruptive to the compatibility track altogether. Then there’s the notion that the market could split into a consumer/distribution-based holographic technology and an enterprise-class holographic archiving technology. The trick will be to maintain the price curves using common manufacturing and parts.
Currently, Blu-ray has its pros and cons. Technically, the BDXL format is denser than the current LTO5 tape cartridge when measured in gigabytes per cubic inch; its data is randomly accessible, which is great for non-streaming use cases; it has a highly rated durability factor (to be discussed in another blog); and it has media longevity options of 30 years with standard-quality media, and 50 to 100 years for specially certified media. The technology evolution is slow, but this is not necessarily a negative when long-term data preservation requires a stable, enduring technology. Stable here does not mean static. The technology advances and improves, and this applies to the older media as well; several market segments drive the requirement to keep supporting older media going forward, providing stability while also driving down cost. Stated another way: you cannot upset billions of users and consumers in a highly competitive market and stay in business long.
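For the density claim, here is a back-of-the-envelope check. The dimensions and capacities are my own assumptions (a bare 120 mm x 1.2 mm disc at 128GB versus an LTO5 cartridge at roughly 10.2 x 10.54 x 2.15 cm and 1.5TB native), not figures from this post, so treat the result as a rough sanity check only.

```python
import math

# Assumed BDXL disc: 120 mm diameter, 1.2 mm thick, 128 GB.
disc_volume_cm3 = math.pi * (12.0 / 2) ** 2 * 0.12   # ~13.6 cm^3
bdxl_density = 128 / disc_volume_cm3                  # GB per cm^3

# Assumed LTO5 cartridge: ~10.2 x 10.54 x 2.15 cm, 1500 GB native.
lto5_volume_cm3 = 10.2 * 10.54 * 2.15                 # ~231 cm^3
lto5_density = 1500 / lto5_volume_cm3                 # GB per cm^3

print(f"BDXL ~{bdxl_density:.1f} GB/cm^3 vs LTO5 ~{lto5_density:.1f} GB/cm^3")
# With these assumptions, the bare BDXL disc comes out denser per unit volume.
```

Caddies, cases and drive mechanisms change the picture, of course, which is why I’d treat this only as a rough comparison.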
Tape is faster for streaming use cases like backup restores, but is not suited for randomly accessing its stored data. In fact, randomly seeking within a tape shortens its lifespan, as does every tape load operation. It will be interesting to see the longer-term effect of support for LTFS (Linear Tape File System) in archiving and data repository systems that have a slightly active requirement. However, with the short, accelerated compatibility matrix, this may not be an issue, as the underlying technology will require more frequent tape migrations in order to remain supported and compatible.
What are your thoughts and experiences on this subject? Do you agree with my chart? What’s the oldest CD or DVD you’ve burned that still works? Do you have any predictions going forward?
HCP Announcement with an Archiving Angle
Last week’s announcement dovetails perfectly with my current series on data archiving, although the Hitachi Content Platform (HCP) team may not appreciate my angle, given all their accomplishments with the HCP product that go above and beyond just archiving. Be that as it may, I can’t ignore an important feature (for me, anyway) of the new release of HCP. I have a passion for efficiency, especially when it comes to power and environmental efficiencies. Overall, the new release of the Hitachi Content Platform has increased its dominant foothold in the object store and cloud arena, with a richer set of enhancements and features designed to provide a world-class platform for managing the massive scale-out requirements of today’s explosive data growth.
A highlighted list of these new and enhanced features includes:
Improved Operational Efficiency to Lower Costs
- Lower costs for large scale unstructured data storage and reduce overall energy consumption with HCP support of spin-down disk in Hitachi Unified Storage (HUS)
- Eliminate downtime with nondisruptive, online hardware & software upgrades
- Proactively address any bottlenecks or hardware issues before they impact SLAs with improved component and performance monitoring and email alerting
Greater Scalability and Reliability
- Improve economies of scale, reduce costs and maximize utilization with support for thousands of tenants and tens of thousands of namespaces per system
- Ensure service availability with advanced replication and failover capabilities
- Meet vaulting requirements with support for tape-based copies of objects
Robust Security and Control from Edge and Core
- Reduce risk and control access to content with new object access control lists
- Support corporate security policies with active directory integration
- Identify sets of related objects for information, action and automation with custom metadata search across system and custom metadata
HCP is already one of the densest data storage platforms in the industry, scaling easily and seamlessly from a few terabytes up to tens of petabytes in a single system. In fact, it’s this scalability and multi-tenant support that has transformed HCP into the premiere object storage platform for the massive scale-out of unstructured data management in the industry. HCP embodies our content cloud approach, allowing organizations to store and manage billions of data objects while providing intelligence layers and policies to help index and search the data independently of the application that created it, expand and scale to match or exceed the unstructured data growth rate organizations are experiencing, and protect data in the most cost-efficient manner of your choosing.
While I’m listing the individual features and enhancements of the new HCP release as isolated capabilities, the net result is an unstructured data storage platform that is flexible and agile, scales to meet any demand, and provides upper-layer capabilities embedded directly in the platform. This combination of advanced features and reliability, without the traditional management complexity, is unique in the industry.
Actually, the flexibility of HCP may cause a slight retro-effect. HCP and its predecessor, HCAP (Hitachi Content Archiving Platform), have always been the platform of choice for compliance-based archiving; that is, for customers that HAVE TO archive data because of the laws and regulations of their respective industries. HCP can be configured to erase and/or digitally shred data when it is legally time to do so, and to keep that data safe and immutable in the meantime. Many customers use HCP for long-term data archiving, but many more use it for shorter-term archiving driven by regulatory requirements and retention timeframes.
So, back to this series on archiving data, specifically long-term data archiving. When configured with the new HUS platform, HCP is more cost effective and environmentally friendly thanks to support for disk drive spin-down, which saves power over the life of the data. Data stored in HCP can reside on an HUS system that spins its disk drives down, consuming less power and generating less heat, thereby saving on datacenter cooling power as well. For data that must be retained for a long period of time, this new feature can significantly improve Total Cost of Ownership (TCO) over the lifetime of the data, where power and cooling add up to a substantial operational cost.
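To make the power-and-cooling argument concrete, here is a back-of-the-envelope sketch in Python. All of the numbers (drive wattages, spin-down fraction, PUE, electricity rate, drive count) are illustrative assumptions of mine, not HDS-published figures:

```python
# Rough power savings from disk spin-down on an archive tier.
# Every constant below is an illustrative assumption, not vendor data.
DRIVES = 240                 # drives in the archive pool
ACTIVE_W = 8.0               # watts per drive while spinning
IDLE_W = 1.0                 # watts per drive while spun down
SPUN_DOWN_FRACTION = 0.8     # share of time the archive tier sits idle
PUE = 1.8                    # datacenter overhead factor (cooling etc.)
RATE = 0.12                  # dollars per kWh
HOURS_PER_YEAR = 24 * 365

def annual_cost(avg_watts_per_drive):
    kwh = DRIVES * avg_watts_per_drive * HOURS_PER_YEAR / 1000.0
    return kwh * PUE * RATE

always_on = annual_cost(ACTIVE_W)
with_spin_down = annual_cost(
    ACTIVE_W * (1 - SPUN_DOWN_FRACTION) + IDLE_W * SPUN_DOWN_FRACTION)

print(f"always on:      ${always_on:,.0f}/yr")
print(f"with spin-down: ${with_spin_down:,.0f}/yr")
print(f"savings:        ${always_on - with_spin_down:,.0f}/yr")
```

Even with these modest assumptions the mostly-idle tier costs roughly a third as much to power and cool, and over the multi-year life of archived data that gap compounds.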
Massive Array of Idle Disks (MAID), a term coined by Copan Systems, was a technology that once captured the imagination of many but failed to fulfill its promise. Because HCP maintains and stores metadata and data separately, even the slightest metadata lookup does not require touching the underlying storage. So while metadata can be actively searched, queried and referenced in HCP, the actual data does not have to be active or accessed. In an active archive or data repository used for research or as a library of information, index searches, metadata queries and custom metadata searches will make up the majority of the activity. The actual retrieval of data based on those search requests will most likely be the last task performed.
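The pattern described above, where searches hit an always-on metadata index and the cold data tier is touched only on final retrieval, can be sketched as follows. The class and field names are my own illustration, not the HCP API:

```python
# Sketch of metadata/data separation: queries hit an always-online
# metadata index; the (possibly spun-down) data tier is accessed only
# on final retrieval. Names are illustrative, not the HCP API.

class ArchiveStore:
    def __init__(self):
        self.metadata = {}        # object name -> metadata dict (always online)
        self.data = {}            # object name -> bytes (may be spun down)
        self.data_accesses = 0    # counts touches of the cold data tier

    def put(self, name, content, **meta):
        self.metadata[name] = meta
        self.data[name] = content

    def search(self, **criteria):
        # Search consults only the metadata index, never the data disks.
        return [name for name, meta in self.metadata.items()
                if all(meta.get(k) == v for k, v in criteria.items())]

    def retrieve(self, name):
        # Only this path has to spin up and read the data tier.
        self.data_accesses += 1
        return self.data[name]

store = ArchiveStore()
store.put("scan-001.dcm", b"...", modality="CT", year=2011)
store.put("scan-002.dcm", b"...", modality="MR", year=2012)

hits = store.search(modality="CT")   # no cold-storage access needed
print(hits, store.data_accesses)     # ['scan-001.dcm'] 0
```

Because the `data_accesses` counter stays at zero through any number of searches, the drives holding the objects can remain spun down until someone actually asks for the bytes.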
HCP lays down a critical foundation for future ways of designing cost efficient data systems for large active data archives, research libraries and other long-term data repositories. While HCP lists an impressive number of significant enhancements to simplify the complex task of managing massive-scale data repositories, disk drive spin-down support is my personal favorite. With environmental impact and energy costs on the forefront of many people’s minds these days, this exciting feature allows the current explosion in global data growth to also have a positive global effect.
Not Clear on the Concept
I just returned from Cleveland, where I had the honor of presenting the recent HDS announcement of Hitachi Unified Storage to customers of the CHI Corporation at an event hosted by the esteemed Greg @Knieriemen. I enjoyed presenting our new product for the first time.
On the flight to Cleveland I was reading some old magazines that had been accumulating on my coffee table and came across this cartoon from The New Yorker and thought it was appropriate for this post.
The caption may be a bit hard to read, but it says “It was much nicer before people started storing their personal information in the cloud”. And although intended for humor, this is a classic example of the “concept” being misunderstood.
I have my own favorite story on “not clear on the concept”. When I returned from the military and rejoined IBM in 1970, I was greeted with technology that was job changing: IBM had implemented what was called the SUN Network, tying major IBM locations together electronically. An intranet, essentially. Shortly thereafter email was emerging as a new form of communication within the company.
I was completing a project with a colleague and after our final meeting asked him to document our agreement and send a note to me copying his manager and my manager. The next day I received 3 identical emails, which was a bit confusing, but 10 minutes later I received a phone call from him apologizing that he had sent me all 3 copies and asked me to return 2 of them so he could send them to his manager and mine. Not clear on the concept? Doh!
I will swear to the ends of the earth, that this is a true story. I’m not smart enough to make this stuff up.
What does this have to do with unified storage and HUS? Plenty, actually. Unified storage is the latest challenge in the industry and all the vendors are tripping over themselves bringing products to market. But what is the “concept” of unified storage? Is it packaging GigE, FC and iSCSI in a box and shipping it? I don’t think so. Unified storage is not packaging, it’s engineering and integration, which is what we’ve done.
For example, with HUS there are no static partitions of file storage and block storage, and capacity can be nondisruptively reassigned. Also, with the Hitachi Command Suite all of this storage, as well as any other storage in your data center, can be managed with one product, not separate products depending on the last storage acquisition your company has made.
The “concept” with unified storage can be summed up in a single word: UNIFIED! Not just another box that has packaged multiple protocols. There is a vast difference, and we were willing to invest the engineering effort to make it right. What a “concept”.
HCP and HDI, A Monster Release
Firstly, it has been quite a while since I last posted. A lot has happened between the beginning of the year and today. In fact, it has been so action packed that the past 100 days seem more like 365. While I cannot spill the beans on everything, my colleague Ken Wood and I are super excited to tell you about the new Hitachi Content Platform (HCP) and Hitachi Data Ingestor (HDI) developments.
With HCP and HDI we are bringing well over 130 new features and capabilities to the cloud-enabled object storage market. While documenting every detail in a log-like post would be downright boring, I believe highlighting the top 5 HCP capabilities and top 2 HDI capabilities will showcase the stellar work completed by our HCP and HDI R&D teams.
HCP’s top 5 new capabilities are:
- Energy efficient content storage via spin down disk on HUS (Note: this will be covered in depth by Ken Wood)
- A radical increase of the number of tenants and namespaces
- Improved authentication and authorization through Microsoft AD support and S3-inspired object-level ACLs
- Native metadata search via the enhancements to the existing Metadata Query Engine (MQE)
- HCP packaged in VMware to support limited use cases like custom application development, evaluation, PoCs, etc.
HDI’s top 2 new capabilities are:
- File pinning at the edge to ensure fast, local access to designated files
- Improved VMware appliances supporting HA and easier installation
HCP code named Godzilla – Scaling done right
As HCP has evolved, the need for extreme scale has become paramount. Extreme scale means different things to different types of users. For example, application authors are likely more concerned with how best to extract solid performance, object density and the like, whereas service providers care more about extreme tenancy, operational efficiency and stellar manageability. For HCP we’ve focused on the needs of both application authors and service providers by expanding key attributes of the system, improving user performance and usability, and zeroing in on the details needed to improve online upgrades. I’m using the image at the beginning of this section to illustrate that when we build features and capabilities into HCP, we look at all aspects. Specifically, with HCP plumping up the number of supported tenants and namespaces to 1,000 and 10,000, respectively, we needed to ensure the administrator can gracefully interact with an extreme number of tenants. Further, tenants themselves are now more power packed through the addition of more protocols (SMTP, NFS, CIFS and WebDAV), a new external authentication method (Microsoft Active Directory), and S3-inspired ACLs, a great illustration of a feature with strong manageability. Object-level ACLs can be individually controlled by an application or user, but application or service providers can establish overrides at the namespace or tenant level which are then “pushed down” to all objects below. This kind of control is critical for use cases where application and service providers may want to quickly remove or restrict access to an entire tenant when something untoward has happened, such as a security breach. Again, we’ve made this feature scalable both for the applications and systems using HCP and for the administrators.
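The "pushed down" override behavior described above can be sketched with a tiny resolver: higher-scope denials win over per-object grants, so locking out an entire tenant takes effect immediately without rewriting every object ACL. The model below is my own illustration, not HCP's actual design:

```python
# Sketch of ACL resolution with tenant/namespace-level overrides that
# take precedence over per-object ACLs. Illustrative model only.

class AclResolver:
    def __init__(self):
        self.tenant_overrides = {}      # tenant -> set of denied users
        self.namespace_overrides = {}   # (tenant, namespace) -> denied users
        self.object_acls = {}           # (tenant, ns, obj) -> allowed users

    def deny_tenant(self, tenant, user):
        self.tenant_overrides.setdefault(tenant, set()).add(user)

    def set_object_acl(self, tenant, ns, obj, allowed):
        self.object_acls[(tenant, ns, obj)] = set(allowed)

    def can_read(self, tenant, ns, obj, user):
        # Overrides at higher scopes win, so a tenant-wide lockout works
        # instantly without touching any individual object ACL.
        if user in self.tenant_overrides.get(tenant, set()):
            return False
        if user in self.namespace_overrides.get((tenant, ns), set()):
            return False
        return user in self.object_acls.get((tenant, ns, obj), set())

acl = AclResolver()
acl.set_object_acl("acme", "ns1", "report.pdf", ["alice"])
print(acl.can_read("acme", "ns1", "report.pdf", "alice"))  # True
acl.deny_tenant("acme", "alice")   # e.g. responding to a security breach
print(acl.can_read("acme", "ns1", "report.pdf", "alice"))  # False
```

The design choice worth noting is that the override is evaluated at read time rather than fanned out to every object, which is what keeps the operation fast at the scale of thousands of tenants.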
You’ll see this same theme applied to nearly all of the features our monster Godzilla brings to the table with the most recent release. Whether it is a seamless online upgrade or a VMware version available for an application developer or for special evaluation, I believe you’ll see we’ve done scaling right!
HDI code named Emerald — Access done right
While many modern or in-development applications can take advantage of the cloud and object lingua franca, REST over HTTP, many existing applications, and certainly human beings, cannot. Further, we know that many users are interested in relieving pressure on WANs and pushing content out to remote sites, both key facets of wide area collaboration. Today I’m proud to talk about two new HDI capabilities that add to its growing arsenal: file pinning and availability/reliability improvements to our VMware appliance model. First, file pinning allows an administrator to define which files will remain in HDI’s cache. From a user’s perspective, pinning allows popular or hot content to sustain higher performance, because in a WAN-style deployment the entry point to the content is intentionally close to the user. Complementing file pinning is an HA cluster version of our popular Virtual Machine Appliance (VMA) model of HDI. HDI’s HA VMA allows administrators to quickly and easily deploy and re-deploy a highly available HDI infrastructure at a remote site. Taken together, these two capabilities allow users to rely on the power of HDI, afford administrators simple administration that meets users’ needs, and allow companies to deploy infrastructures that lead them down the path toward wide area collaboration.
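The essence of file pinning, a cache that never evicts designated files no matter what else flows through it, can be sketched in a few lines. This is my own toy model, not HDI's implementation:

```python
from collections import OrderedDict

# Sketch of edge-cache file pinning: pinned files are never evicted,
# so designated hot content always stays local. Illustrative only.

class EdgeCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()   # path -> content, in LRU order
        self.pinned = set()

    def pin(self, path):
        self.pinned.add(path)

    def add(self, path, content):
        self.cache[path] = content
        self.cache.move_to_end(path)
        # Evict least-recently-used UNPINNED entries when over capacity.
        while len(self.cache) > self.capacity:
            victim = next((p for p in self.cache if p not in self.pinned), None)
            if victim is None:       # everything remaining is pinned
                break
            del self.cache[victim]

cache = EdgeCache(capacity=2)
cache.pin("/branch/hot-report.xls")
cache.add("/branch/hot-report.xls", b"...")
cache.add("/branch/a.txt", b"...")
cache.add("/branch/b.txt", b"...")  # evicts a.txt, never the pinned file
print(sorted(cache.cache))  # ['/branch/b.txt', '/branch/hot-report.xls']
```

The point of the sketch is the eviction rule: ordinary cache churn from passing traffic can never push pinned content back across the WAN, which is exactly why pinned files sustain local-access performance.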
Our keen attention to detail in our gem of a product, HDI, helps our users do access right so they can begin to meet objectives like reducing pressure on WAN pipes and move forward in generating new information over distance.
Why is this relevant, and why am I excited?
The last points on HDI are really why I’m excited: efficient usage of WAN resources to share and engender insightful information that matters. Intimate interaction with our global customers produced the spark that set us down this specific path for HDI and HCP. Here is an anonymous quote from an EMEA customer which crystallized our goal:
“I want to be able to find and share information from both London and Hong Kong so that we can create collaborative projects to analyze markets on a global basis.” (EMEA Financial Services Customer)
Helping our customers and users get at and manage their information without limitations is really what jazzes me and all of the teams at Hitachi and HDS. I hope you can see that we are moving intentionally down this path, and I really appreciate all of the efforts by the teams so far to get us to the super competitive position we are in today.
What can HDS do for you?
I’m very pleased and excited to be part of a company that has a compelling vision, focused on solving customers’ business challenges, AND continues to execute toward that vision in a steady drumbeat of announcements, enhancements and new product introductions. Today, we made two significant announcements (unified and cloud), both highlighting our commitment and serving as proof-points to our vision and journey to the information cloud. Before I go any further, let’s briefly recap on that vision.
In October 2011, we announced our 3-tier strategy—starting from infrastructure cloud to provide more dynamic infrastructure, then layering content cloud to enable more fluid content, and then finally building to information cloud to facilitate more sophisticated insight. You can read more about that here.
Today’s Hitachi Unified Storage announcement supports our vision by providing a platform that underpins it, granting customers unified and seamless access to all resources, data, content and information. This unified architecture provides a single pool, so that the whole capacity can be managed from one place, improving utilization, simplifying management and lowering costs for our customers. To achieve this, we are bringing to market a unified platform, as well as a unique, extensible management framework to provide a single way to store, manage and protect all data: block, file and object.
Now, in order to manage the massive growth of information, with limited resources all while bubbling up the trends and insights gleaned across previously siloed datasets, it’s critical to free data from its originating application and underlying media. This is where object storage comes in (enter Hitachi Content Platform – HCP). HCP provides a different way of storing information. It stores information as objects, which is the data, plus metadata (data that describes the data itself). This approach unleashes a whole host of interesting things that can be done with that data. For example, the management of that data can be automated based on what the data is, how it was created, who created it, who can access it, what service levels should be assigned, how it should be protected, when it should be deleted, and the list goes on and on. This also means we can index, search and discover across all data within the object store. This is critical to finding data for insight, reuse and action.
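The paragraph above says management can be automated "based on what the data is": retention, protection and deletion derived from an object's own metadata rather than from where it happens to live. A minimal sketch of that idea, with made-up metadata field names and retention rules:

```python
from datetime import date, timedelta

# Sketch of metadata-driven policy automation: each object carries
# metadata, so retention can be computed from what the data IS.
# Field names and retention periods are illustrative assumptions.

RETENTION_RULES = {
    "radiology-image": timedelta(days=7 * 365),   # e.g. a 7-year regulation
    "email":           timedelta(days=3 * 365),
    "build-artifact":  timedelta(days=90),
}

def retention_expiry(obj_metadata):
    """Compute when an object may be deleted, from its own metadata."""
    created = obj_metadata["created"]
    kind = obj_metadata["type"]
    return created + RETENTION_RULES[kind]

obj = {"type": "radiology-image",
       "created": date(2012, 5, 1),
       "creator": "pacs-ingest"}
print(retention_expiry(obj))  # the date deletion becomes permissible
```

Because the policy is a pure function of metadata, the same rule applies uniformly across billions of objects, and indexing that metadata is what makes the "search and discover across all data" capability possible.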
Today’s content cloud announcement includes new features across Hitachi Content Platform and Hitachi Data Ingestor, as well as a new Content Audit service. The new capabilities provide greater scale within a single architecture, with significant increases in tenants and namespaces, storage density and broad protocol support; more granular visibility and control, through more intelligent queryability of the data and more comprehensive access controls; and improved operational efficiency, achieved through spin-down support, simplified installation, enhanced monitoring and online upgrades. These features are targeted at customers looking to centralize data within their organization or their remote and branch offices, and at service providers looking for a robust foundation on which to build and deliver their own reliable and profitable cloud services.
This is indeed an exciting set of announcements, enhancements and introductions, but don’t just take my word for it: take a look at the details of what’s included in unified, covered by Hu Yoshida here, with more to come on HCP/HDI from Michael Hay and Ken Wood. Stay tuned!
Healthcare Cloud Solution Checklist
Ultimately, there are certain minimum requirements that healthcare providers need to consider when evaluating a healthcare cloud provider. Without these considerations, providers put their services at risk and will fail to realize the full potential of cloud technology.
Security and Privacy
To overcome current perceptions of the risks associated with using the cloud for personal health information, cloud providers must demonstrate security measures that prevent unauthorized access to patient data. With security comes privacy. Consideration must be given to the following:
- Secure access to the facility
- Network security
- Data security
- Staff training and regulatory compliance awareness
High Availability
Healthcare organizations are dealing with mission-critical applications where downtime can mean the difference between a patient’s life and death. Cloud providers need to be aware of and prepared for these stringent availability requirements and should be ready to guarantee delivery of information. Consider:
- Downtime for maintenance
- Responsiveness as data volume grows
- Network latency and redundancy
- Hardware redundancy
Standards-based Data Management
Healthcare is driving the development of standards throughout many different areas. Using the following standards in managing data will future-proof it, ensuring that access and migration of the data will always be possible.
- XML metadata
- IHE framework
Scalability
As new systems come online, the volume of data will grow, creating a need for the cloud provider to be able to scale up, out and deep. As the data volume grows, the impact on performance should also be negligible. Consider:
- Plug and play growth
- Dynamic scaling
Remote Access
Flexibility of access to the data should be considered by healthcare organizations as they look to the cloud. Various aspects need to be taken into account to ensure adequate services are provided to users.
- Capacity of users
- Performance at peak access times
- Flexibility of mobile devices
Contractual Assurance
As with any agreement, healthcare facilities should develop ironclad agreements that ensure the delivery of services will not be interrupted without penalty. Contracts should include items such as:
- Cure periods for breach of contract without interruption of service
- Insurance for breach of privacy
- Service level agreements
- Migration assistance
What’s on your checklist?
This is part four in a series on cloud technologies in healthcare. You can read the previous three here.
It is a common misconception that cloud technology equates to inexpensive technology. There are economies of scale that must be achieved for savings to be realized. In the case of healthcare providers, a private cloud will cost more than a public or hybrid cloud, because in a private cloud the resources are shared amongst fewer constituents. It should also be noted that price does not equal cost, and the total cost of ownership (TCO) should be evaluated when looking at a cloud architecture. (My colleague David Merrill has written about this numerous times over on his blog.)
All of the costs important to the facility must be examined beyond just those of capital outlay. For example, if cooling and power are not a visible expense, any cost savings here will not be immediately tangible. Costs that are borne by other departments need to be considered in the overall business justification for cloud adoption.
For those considering adoption of cloud technology, attention should be paid to the following financial areas:
- Total cost of acquisition (TCA)
- Total cost of data ownership (TCDO)
- Total cost of ownership — hard (TCOH)
- Total cost of ownership — soft (TCOS)
When choosing the architecture of the cloud provider, facilities should consider the underlying technologies. Direct attached storage will likely be the cheapest of technologies, but it carries a risk of performance issues with the very large-scale growth common to hospitals. Modular storage and enterprise storage systems have improved benefits of performance and features—such as thin provisioning and dynamic tiering—but bring increased costs. It will be important for adopters to consider what they are buying in any cloud model.
Total cost of acquisition is what most people think of immediately when making buying decisions. TCA takes into account the initial outlay, or CAPEX, but does not consider the ongoing costs. Considering only TCA will result in an architecture that may not fully support the initiatives the organization is trying to achieve.
Total cost of data ownership takes a more practical view of the costs associated with managing data. When considering a cloud model, TCDO has real meaning as the OPEX costs are factored in and can be more easily compared to the current costs a facility is experiencing. This becomes an apples-to-apples comparison that will help in the decision making process. Cloud adoption brings with it additional benefits that need to be factored into the equation beyond just TCA and TCDO.
Total cost of ownership — hard includes items that are easily measured, such as maintenance, cooling, power and so on. TCOH should be considered with cloud technologies. However, TCOS is where the real financial benefits will be shown.
Total cost of ownership — soft for a hospital adds in the business and clinical impact that cloud technology enables. For instance, TCOS would encompass access to specialist storage or network resources that would otherwise be unaffordable, or the ability to adopt new applications in a more dynamic and timely manner than through normal procurement processes. TCOS of cloud technology also has a business impact by reducing paperwork and process when additional storage must be purchased to scale to meet the demands of a department. These financial aspects can’t be overlooked when making a decision on cloud adoption.
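The TCA → TCDO → TCO progression above is essentially a layering of cost categories, which a short sketch makes explicit. Every dollar figure and category name below is a made-up example of mine, purely to show how each successive measure widens (or, for soft TCO, credits) the picture:

```python
# Sketch of the TCA -> TCDO -> TCO-hard -> TCO-soft cost layering.
# All categories and dollar figures are illustrative examples only.

def tca(costs):
    # Acquisition only: the initial CAPEX outlay.
    return costs["hardware"] + costs["software"] + costs["install"]

def tcdo(costs):
    # Data ownership: acquisition plus the OPEX of owning the data.
    return tca(costs) + costs["admin_labor"] + costs["migration"]

def tco_hard(costs):
    # Hard TCO adds easily measured items: power, cooling, maintenance.
    return tcdo(costs) + costs["power_cooling"] + costs["maintenance"]

def tco_soft(costs):
    # Soft TCO credits the business/clinical benefits the model enables.
    return tco_hard(costs) - costs["avoided_procurement_delay_value"]

example = {
    "hardware": 500_000, "software": 120_000, "install": 30_000,
    "admin_labor": 200_000, "migration": 50_000,
    "power_cooling": 90_000, "maintenance": 60_000,
    "avoided_procurement_delay_value": 150_000,
}
for name, fn in [("TCA", tca), ("TCDO", tcdo),
                 ("TCO hard", tco_hard), ("TCO soft", tco_soft)]:
    print(f"{name:8s} ${fn(example):>9,}")
```

Note how a decision made on TCA alone would compare $650,000 figures across vendors, while the soft-TCO view can land back near the TCDO number once enabled benefits are credited, which is exactly why the two extremes lead to different choices.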
Another area to consider beyond operational costs is operational transition costs. There are some more obscure costs here, beyond labor and management, such as remedying a privacy breach; any agreement with a cloud provider should identify who bears such costs. Insurance policies that cover payments to third parties as the result of a breach also need to be included in the costs.
Cloud technology can bring many benefits, but due diligence will show what those financial outcomes will be. It is important that facilities look beyond just the acquisition costs, as many benefits will be found outside this single factor.
My next blog in this series will cover areas that healthcare organizations should consider when looking at cloud providers. Until then, I’m looking forward to hearing your thoughts…
Got Private Cloud?
Note: This is a guest post by Bob Laliberte (@BobLaliberte), Senior Analyst at the Enterprise Strategy Group (ESG), who focuses on data center infrastructure management, automation software, data center infrastructure and technologies, and professional services. Take it away, Bob…
The tight integration of IT with business is essential for the success of any modern organization. In fact, ESG research (subscription required) indicates that organizations use business process improvements as the top justification for IT purchases, beating out OPEX and CAPEX concerns.
This is reflective of organizations moving to highly virtualized and dynamic environments (clouds, private or public) to better serve the business. More and more, the CIO is looking for efficient problem resolution and is becoming a broker of IT services, balancing speed and agility with security and control when deploying new applications. Typically, this means less architecting, designing and testing of numerous disparate pieces, and more deployment of solutions that can rapidly scale to meet the needs of the business.
Converged data center solutions can play a big role in helping organizations accelerate time to value while still maintaining control. These solutions combine virtualization, compute, network, storage and management for enterprises and service providers. Typically, these infrastructures are used to create a foundation beyond simple consolidation and cost containment: solutions like virtual desktop environments, server virtualization efforts and even business-critical applications are becoming common. Essentially, they enable organizations to build out a private cloud environment. Again, ESG research confirms the need for solutions like these, as survey respondents report that increasing the use of server virtualization, desktop virtualization and private cloud computing all rank in the top 10 of 2012 IT initiatives. ESG has seen further evidence at end-user events; more information can be found on the New England VMUG site.
While business process improvement is critical, organizations do not have unlimited budgets, so in many cases they have to work within the confines of existing infrastructure to create these private clouds. Therefore, the ability to create a private cloud leveraging existing server, network and storage resources should be appealing as well. This requires a more open model capable of integrating and orchestrating industry-standard infrastructure to enable an end-to-end solution.
HDS has a range of converged data center solutions designed to automate, simplify and accelerate the adoption of cloud computing. The company’s objective is to provide solutions that enable faster deployment, automation and scalability to help organizations adopt cloud infrastructures at a pace that works for them, with predictable results and faster time to value. These converged solutions help eliminate some of the roadblocks to private cloud deployments that come from a lack of infrastructure standards, expertise and best practices. The HDS vision includes three levels of solutions:
- Level 1: Reference architectures – designed for key applications such as Microsoft Exchange and Oracle as well as virtualized environments such as Microsoft Hyper-V Cloud and VMware. Currently includes Hitachi Solutions Based on Microsoft Hyper-V Cloud Fast Track.
- Level 2: Hardware integration – solution specific, validated bundles of integrated hardware platforms. Hitachi Converged Platform for Exchange is one example.
- Level 3: Management integration – orchestrating various components across the boundaries of technology domains. Hitachi Unified Compute Platform enables organizations to leverage multi-vendor or existing hardware.
Regardless of the path taken, organizations are transforming their environments to better respond to the needs of the business. HDS offers multiple paths that can help organizations achieve the desired end state, a private cloud infrastructure, at their own pace and budget.
For more on this topic, check out this video.
My EMR Is LIVE, So Where Is My Data?
As we near the final phases of the HITECH Act in the coming months, I’m starting to see more and more news (and having firsthand conversations) as to why this adoption curve is moving so slowly, and how it’s really only the first step in providing greater access to clinical information.
As clinicians and health professionals across the nation are logging into their EMRs for the first time, they are quickly discovering that the “promised land” leaves them with a convenient but spotty view of the very information they desire.
Is this really a surprise, when only 30% of clinical data is neatly structured into efficient rows, tables and databases, while the other 70% sits in imaging files, wave forms, clinical reports and the like? This is like driving a car where you can only see 30% of the windshield.
To be sure, EMR adoption is a major first step forward, but until we tackle the interoperability of these systems with the “other 70%”, it is going to be a long climb.
Here are a few interesting pieces I’ve found on similar topics:
Off to Dallas for SNW April 2-5: Texas BBQ and Cloud…BIG and BOLD!!
Next week Hitachi Data Systems is participating as an Underwriter Sponsor at SNW Spring in Dallas from April 2-5, 2012 at the OMNI Dallas Hotel.
Our focus at #SNWusa is “The New Data Center Economy”. You can find us in booth #201. The HDS booth will showcase virtualization, capacity efficiency, big data, file & content, and cloud solutions, and feature a Technical Operations kiosk with a VSP and demos.
During the four-day conference I will be in the booth, meeting with customers, and presenting on “HDS cloud at your own pace”. I am excited to meet with people to talk about what is relevant to their business data center challenges. There will be fun giveaways and a lot of action in our booth. Look for my tweets for #HDS and #SNWusa news and fun! Follow me at @tdoyle49
And don’t forget to go see my colleague Fred Oh (@fredhds on Twitter) for his Big Data session (details below).
Main highlights of HDS participation include:
- Main stage presentation “The New Data Center Economy” by Michael Gustafson, Senior Vice President, Global File & Content Solutions Business, Hitachi Data Systems on Tuesday, April 3 at 10:00am – 10:30am
- 5 Solution Provider sessions, including:
- Big Data Track: Big Data, Big Content, and Aligning Your Storage Strategy: Fred Oh, Sr. Product Marketing Manager, File, Content & Cloud Portfolio, Hitachi Data Systems, on Monday 4/2: 1:00 pm – 1:45 pm.
- SNIA Tutorial: The Evolution of File Systems: Thomas Rivera, Sr. Technical Associate, File, Content & Cloud Solutions, Hitachi Data Systems, on Tuesday 4/3: 3:05 pm – 3:50 pm.
- Cloud Storage Track: A Hype Free Stroll through Cloud Storage Security: Eric Hibbard, CTO Security & Privacy, Hitachi Data Systems, on Wednesday 4/4: 11:40 am – 12:25 pm.
- Data Management Track: Advanced Data Reduction Concepts: Thomas Rivera, Sr. Technical Associate, File, Content & Cloud Solutions, Hitachi Data Systems & Gene Nagle, Manager, Applications Engineering, Storage Systems, Exar Corporation, on Wednesday 4/4: 11:40 am – 12:25 pm.
- Data Security Track: Storage Security – The ISO/IEC Standard: Eric Hibbard, CTO Security & Privacy, Hitachi Data Systems, on Thursday 4/5: 8:30 am – 9:15 am.
Will we see you there? What questions or specific issues would you like me to address?
Benefits of Cloud Adoption for Healthcare
This is part three in a series on cloud technologies in healthcare. You can read the previous two here.
While many challenges have contributed to the slow adoption of the cloud, there are just as many benefits for providers that embrace this new technology across the enterprise. These benefits encompass both business and clinical areas. In today’s world of cost cutting, many facilities must show clinical benefit in order to justify expenditures, and cloud technologies are potential tools to do just that.
The single, biggest clinical benefit that cloud technology can provide is access to applications that were previously unattainable. For example, the implementation of digital pathology—managed through cloud services—has a huge clinical impact on an organization. The organization can now roll out a service that would have cost millions just for the storage alone, but now can pay for it as they use it.
Access to pathologists, previously available exclusively near centers of excellence, means that remote facilities can offer new services to their local patient populations while relying on remote experts to render diagnoses. Patient care can be improved by providing this service through the cloud faster and more efficiently. Since patients don’t need to travel, waiting lists are more easily managed, as more patients can have the same tests in multiple locations with various experts now available.
These same experts can access patient data remotely and on demand through the Internet via a variety of connected devices. Physicians can review the latest diagnostic results from home and perhaps determine that the patient can be discharged immediately, rather than wait for their afternoon rounds.
Collaboration between researchers or physicians and allied health professionals suddenly becomes a reality, as the patient information is centrally located and accessible to authorized users. Patient information is now being shared between caregivers, regardless of location, allowing for more informed decisions to be made.
Obviously there must be some business benefit for a new technology to be adopted, or it won’t be considered. Cloud technologies provide tremendous benefits that can contribute to the welfare of a provider organization.
Healthcare providers are in the business of treating and caring for patients. They are not IT focused; their purchasing patterns show that investment in IT falls far below that of other industries. In many cases providers’ IT staffs are stretched very thin and other staff must overcompensate.
For example, in radiology it is often a medical technologist—with a technical affinity but no formal technical background—who becomes the PACS administrator. The cloud offers providers the ability to access specific experts to manage and maintain their systems. A cloud provider will have a block storage expert, a network security expert, and an archiving and backup expert who will manage the different components. Providers need not build up these skill sets, but rather, for example, focus on a clinical applications specialist for PACS who helps clinical users maximize the application. These experts can spend the time and effort to implement the best practices for each component, which ultimately delivers added benefit to the clinical users and their patients.
Today’s purchasing environment through capital outlay usually works in cycles. A department will be given capital for the next 5 years and then will need to reapply and compete for funds to continue to operate their systems. The cloud provides a way to operationalize investments while guaranteeing that they can continue to operate.
Take the radiology example again. The department adds a new CT scanner and their data volume increases by 10%. Their storage is not scaled to handle this added volume and so they will deplete their available storage faster than expected. In a cloud model, the facility has access to the needed capacity and performance to meet the demand of the new CT. This “unlimited” scalability allows for the IT department to meet the interests of various departments simultaneously, and respond more quickly to changing needs as they develop.
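To make the runway math concrete, here is a minimal sketch in Python (all figures are hypothetical assumptions, not drawn from any real facility) of how a modest 10% bump in annual data volume shortens the life of a fixed storage purchase:

```python
# Hypothetical figures: a department buys storage sized for 5 years of
# growth, then a new CT scanner raises annual data volume by 10%.

def runway_years(purchased_tb, annual_growth_tb):
    """Years until a fixed storage purchase is fully consumed."""
    return purchased_tb / annual_growth_tb

planned = runway_years(purchased_tb=500, annual_growth_tb=100)  # as budgeted
actual = runway_years(purchased_tb=500, annual_growth_tb=110)   # +10% from the new CT

print(planned)           # 5.0
print(round(actual, 2))  # 4.55 -- the capacity runs out roughly half a year early
```

In a cloud model that half-year shortfall never materializes, because the provider absorbs the capacity growth and the facility simply pays for what it consumes.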
This model lowers the barriers for adoption of innovative new technologies and helps to address the massive overhaul and modernization needs in healthcare.
Cloud models provide transaction-based pricing—as a facility uses more storage, it pays more. Traditional capital models mean that the storage purchased in year 1 sits mainly idle, waiting for data to be captured. The ROI is low because utilization rates are very low to start.
With cloud technologies, utilization rates are 100% from the start, and the cloud provider is responsible for maintaining the hardware. By year 5, for example, the cloud provider has probably already refreshed its technology, while the organization that bought capital equipment is looking to replace it and migrate the data—a costly proposition.
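As a rough illustration of the utilization argument, the two models can be compared like this (all prices and capacities below are hypothetical assumptions, not real HDS or cloud rates):

```python
# Compare a 5-year capital storage purchase against pay-per-use cloud
# pricing. All numbers are hypothetical.

CAPACITY_TB = 500          # capital purchase, sized for year-5 demand
CAPEX_PER_TB = 1000        # one-time hardware cost per TB
CLOUD_PER_TB_YEAR = 300    # annual pay-per-use rate per TB
usage_tb = [100, 200, 300, 400, 500]   # capacity actually consumed each year

capex_cost = CAPACITY_TB * CAPEX_PER_TB                    # all paid up front
cloud_cost = sum(u * CLOUD_PER_TB_YEAR for u in usage_tb)  # paid as used

year1_utilization = usage_tb[0] / CAPACITY_TB  # share of the purchase used in year 1

print(capex_cost)         # 500000, spent in year 1 at 20% utilization
print(cloud_cost)         # 450000, spread over 5 years at 100% utilization
```

The exact crossover depends on the real rates, of course; the point is that the capital model pays for idle capacity in the early years, while the pay-per-use model tracks actual consumption.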
Cloud technology shifts the paradigm for the delivery of healthcare. Consistent delivery of IT services and scalable hardware and software on a pay-per-use model enable healthcare providers to focus on what they really should be focused on: effective delivery of patient care.
In my next post I will explore the economics of the cloud in healthcare. In the meantime, let me know what you think.
Do you “HAVE TO” Archive?
I’d like to start a series of blogs on digital archiving over the next few weeks—primarily, I’ll be making some statements and asking questions. In this first installment, I want to know: what is your archive type? Over the past year, I have been talking with customers and have come to a simple conclusion: there are primarily two types of archivers…
- Those that HAVE TO archive their data,
- And those that WANT TO archive their data.
Of the many companies I’ve talked to about archiving their digital data, these two types stand out. Sure, there are a few industry segments that HAVE TO archive, and have for a very long time, like healthcare. Typically however, the data archived in healthcare is active data, and while compliance mandates the retention of healthcare records for the life of the patient (plus some number of years beyond), the records are actually used and provide long-term benefits to the overall well-being of patients. Plus, much of this data is redacted of personal information and preserved for research purposes. So, I typically classify healthcare as the WANT TO type.
Another industry segment that potentially has a duality between HAVE TO archive and WANT TO archive is manufacturing and engineering. The original designs, test data, simulation results, etc., typically are kept for a very long time by choice, mostly because researching prior designs or results, or reusing old test data, can yield many benefits in newer designs. However, many sub-segments of this industry also have compliance laws or regulations to adhere to. For example, design data in the aerospace industry or at medical device manufacturers (including components of the devices) must be kept for years beyond the end of life of an airframe or device. So again, of the two types, I classify manufacturing and engineering as a WANT TO industry.
Now, for the easy-to-classify types. The industries that I label as WANT TO types are typically those that generate or acquire content. Movie and animation studios, publishers, video game studios, marketing firms, recording studios, research and academic institutions, national archives and possibly law firms fall into this category. Generating this content typically requires a large investment, and in some cases it can provide a source of future revenue. In other cases, the state of the art simply doesn’t yet allow for sufficient processing of the data.
Take a movie studio for example. New distribution formats (VHS to DVD to Blu-ray), “never before seen footage”, director’s cut editions, 20-year anniversary editions, collector’s editions, and so on and so on are ways that original content is reused and retrieved from archives. These assets can have a business requirement to maintain this content forever, or at least for the life of the company. In fact, there are cases where the archives of a failed company are worth more than the other assets of a company itself during acquisition or takeover talks.
There are also national governments that want to preserve the history and culture of their countries. Besides trying to convert everything to a digital format and digitally archiving it, the long-term language used by these organizations takes on an ominous tone: “Data preservation for the life of the republic”. The retention requirements in these types of discussions are always fun and challenging to work through. Some organizations have requirements for storage media that is electromagnetic pulse (EMP) proof, or that can survive a disaster. Again, quite a grim tone is set when this comes up.
There’s also a different mindset in managing these data repositories and archives. This is their primary storage. Capital expenditure (CAPEX) takes a backseat to operational expenditure (OPEX), but the data still needs to be accessible in a timely manner. In most government-related meetings about archiving data, 25-year planning is considered short-term planning. The operational costs of managing data through many technology migrations, and possibly data format migrations, are painstakingly planned. The space, power, facilities and maintenance costs are the overwhelming expenditures “over the life of the republic”. The WANT TO archive types consider archiving their primary mission.
On the other hand, organizations that HAVE TO archive usually do so because laws and regulations mandate it in order for them to do business. They must be compliant and, in many cases, prove it. Now, don’t confuse the HAVE TO types with the WANT TO types here. It’s true that these companies don’t want to pay a hefty fine for failing to have the data available, which lands them in the WANT TO category, but only for violation-avoidance reasons. Given a choice, most would delete unwanted data. In fact, the compliance archiving mentality is to be able to prove that you have the data for the required retention period, delete it as soon as that period expires, and then prove that the data no longer exists.
These organizations are concerned primarily with CAPEX, with OPEX coming up only in passing. These environments rarely plan beyond a 5-year total cost of ownership (TCO) horizon, as anything longer is considered a very long-term endeavor. In fact, it has been stated to me that no CIO looks beyond 5 years in a TCO study, to which I counter: “Maybe in these industries. You should talk to Hollywood”.
So, it comes down to compliance archiving, which is typically seen as a HAVE TO archive activity where the data is rarely accessed, if at all, but must be there just in case an auditor or regulator wants to see it. This data is usually retained for the short term, and the language around it includes expiration dates and data expunging.
At the same time, long-term data preservation activities are typically a WANT TO archive mission, where data is reused, researched and retained forever. The planning for these types of systems uses its own language, such as 100-year archive, longevity and durability. With the forecasted growth of the data we create as humans, the amount of machine-generated data being created, and the “keep everything” mindset (mostly due to our current inability to do everything that needs to be done with the data), exabyte-scale archives (1,000,000,000,000,000,000 bytes, or 10^18 bytes) have come up in some serious discussions recently.
So, what’s your archive type? Is your organization a WANT TO archive or a HAVE TO archive? How busy or active is your archive? I would love to hear about your organization.
As a footnote, Hitachi Content Platform (HCP) is a great solution for compliance archiving. The ability to immutably retain data and digitally shred it (when allowed) is a great way to manage these compliance-based archives. In fact, HCP is also the archiving platform used by many organizations for long-term archiving projects with no plans for deletion. That is something of a paradox given this article’s distinction, but it goes to show the versatility of the HCP solution.
Reunions, Capacity Efficiency, and Bad Haircuts
So, as many of you know, we at HDS have been talking a lot of late about capacity efficiency. I’ve blogged about it, Hu has put in his views on it, as has David Merrill, who owns all things “economics” for us.
David and I both work for Hu, and the running joke at HDS is that we’ve actually never been in the same place at the same time due to our collective travel schedules. Seriously, getting any two of us together at the same time is rare; getting all three of us together? Well, that almost never happens.
But it did last August. One of the members of our Executive Committee was so stunned to see the three of us in the same building that he felt compelled to capture that rare “Kodak Moment”. That’s me on the left (the scruffy “Dos Equis man” beard is now gone), David is in the middle, and, of course, Hu on the right.
Well, it happened again a few weeks ago, and someone had the great idea of capturing the three of us on video talking about capacity efficiency. Well, almost.
Hu was tied up with customers, so David and I launched into this cool discussion on “CapEff”, and the economics supporting our positioning. It’s not a long video, so I hope you have time to watch it, and I totally apologize for that bad haircut of mine. David is more GQ in the video, by far.
One of the many things I love about the video is that this was completely unrehearsed, and there was no “David, you give this message, and Claus, you give that”. We were very much in sync on the topic from the beginning.
So take a gander, and give us your thoughts. As for our competitors, you might want to watch it as well, since it will drive future sales. I love the message, and the impromptu nature of the video. It was just fun to do.
Except for the stupid haircut, that is.
Don’t miss the opportunity to win a free capacity efficiency assessment from an HDS economic expert.
For other posts on maximizing storage and capacity efficiencies, check these out: http://blogs.hds.com/capacity-efficiency.php
Healthcare Drivers for Cloud Technology
This is part 2 in a series of posts on the role of cloud in improving patient care. Part 1 can be found here.
As with any industry, certain drivers need to be present in order for new technologies to be adopted. For many years, these drivers have been minimally present in healthcare, resulting in a reluctance to change. Recent investments and the increased visibility of healthcare on many countries’ national agendas have strengthened the drivers for cloud adoption.
Delivery of Cost-effective Healthcare
The cost of healthcare delivery has grown to such huge proportions that governments now face serious funding issues if there is no resolution. Healthcare costs in some countries amount to 35% of gross domestic product (GDP), an unsustainable model that could drive some nations into bankruptcy. The drive to lower the cost of healthcare delivery has become so predominant in society that governments have risen and fallen on these platforms. Alternative models of healthcare delivery that lead to cost savings and efficiencies must be explored in order to rein in increasing costs.
Governments around the world are providing financial incentives for healthcare facilities to adopt new technologies, such as electronic health records. The recognition that technology can improve patient care while reducing costs has meant that governments are willing to invest in the traditionally slow healthcare industry to incite a faster pace of adoption. Reimbursement, the development of standards, introduction of legislation and regulatory compliance are just some of the mechanisms governments are using to advance healthcare infrastructure. The result is an increased awareness and consideration of these new technologies by healthcare facilities.
Healthcare is always striving to innovate. Yet the ability of healthcare providers to adopt new technologies that drive better patient care has always been a challenge, born of the cost and complexity of rolling them out. Today, facilities seeking to improve their technology adoption must identify funding for a capital purchase and develop complex tenders—likely without a full understanding of the impact on their existing infrastructure and staff.
Advances in technology combined with government incentives push organizations to adopt new technologies. Thus, there must be mechanisms in place for these organizations to deploy, test and validate the effectiveness of these proposed solutions and prove the return on investment (ROI), without significant upfront investment. Increasing clinical innovation drives better patient care and outcomes, which is the main reason for the existence of healthcare facilities in the first place. Increasing responsiveness of facilities to deploy these new technologies in a cost effective manner will be a driver for cloud adoption.
Big Data Growth
Healthcare has become one of the best examples of big data. As the amount of digital information increases, the ability to manage this data becomes a growing problem. Petabytes of data sit in storage devices. This data holds the key to future clinical advancement, but often remains inaccessible to researchers. The ability to access this data and run analytical tools against it can drive clinical and business intelligence. This will contribute to better utilization of healthcare practices, even driving new clinical decision-making processes. Big data analysis holds the promise of better treatment paths for diseases and faster recovery times through the understanding of best practices.
Hospitals are patient care centers, not centers of technical innovation. IT departments are stretched to accommodate the different clinical systems that are introduced into use, dealing with different vendor systems, platforms and licensing models. Clinical departments drive the acquisition of relevant applications without always considering the existing infrastructure, and the result is inefficiency. Take storage purchases as an example. Departments typically buy 5 years of storage during the procurement cycle without any rationalization against the storage needs of other departments. This storage can sit unused—but paid for—for years, tying up valuable capital dollars. Add to that the requirement for the IT department to then manage the application’s backup and archiving needs alongside those of other departments. There can be 10 to 20 different applications to manage, pulling the IT department away from responding strategically to physicians’ needs and into day-to-day operations. Simplifying administration in the IT department allows more time to be spent on clinical systems and less on the infrastructure.
Cloud Challenges in Healthcare
We have established that healthcare lags behind other industries with respect to technology adoption, and embracing the cloud is certainly in that category. Healthcare providers face many challenges as they investigate moving to a cloud model. Once these challenges have been addressed, cloud technology will become less a question of “if” and more a question of “when.”
Privacy and security rank at the top of the list of reasons for slow adoption. Putting personal health information (PHI) into a third-party, remote data center raises red flags where patient privacy laws are concerned. The possibility that patient data could be lost, misused or fall into the wrong hands affects adoption. What recourse does an organization have should the cloud provider lose data? It has happened, and it has the potential to be a very expensive problem to resolve. Violation of patient confidentiality carries heavy fines, including significant costs of recovery and patient notification. A cloud provider needs to demonstrate how it deals with this issue.
A potential solution is a private cloud model. In this case the data still resides at the customer data center and a certain degree of control still exists for organizations to manage patient privacy. The organization can also ensure that the data center complies with certain standards, such as NIST 800-146 Cloud Computing Synopsis and Recommendations. This model may be more expensive, but security and privacy are more visible.
Security challenges may be a moot point where healthcare providers are concerned. One of the benefits of cloud technology is the ability to access resources that would otherwise be unattainable. A cloud provider will have security experts deploying the latest patches and software across its data center. Physical access to the property will be well guarded, and many policies, processes and mechanisms will be in place to keep security intact. Add to that the fact that any applications operating through the cloud will store all their data in the cloud. This means no PHI remains on computers within the facility, and you have a more secure situation than today’s environment.
Health and Human Services studies show that PHI violations have come from the theft of computers taken from various locations: facilities, loading docks and even physicians’ vehicles. These thefts have been more about the computer and less about the PHI. This raises the question: wouldn’t it be better to have everything in the cloud?
Healthcare providers are notorious for resisting change. Therefore, we should assume that the adoption of a cloud model would be a major change management issue for providers. Current processes are often inefficient, relying on paper in many cases to manage patient care. Any transition to a cloud would require significant support from the technology partners to ensure a smooth transition for users.
Take, for example, the current practice of requesting a diagnostic exam:
- A physician fills out a request form with patient details, history and reason for exam
- This gets sent to the radiology department for scheduling (assuming it’s a magnetic resonance (MR), nuclear medicine (NM) or computed tomography (CT) type exam)
- The clinical staff books the appointment and informs the doctor, who advises the patient, who has a conflict with the time
- Back and forth it goes.
Now consider an electronic scheduling system based in the cloud whereby the doctor enters all the relevant information and the system determines the most appropriate exam and notifies the patient directly of possible options. The patient logs in, selects the best time for the predetermined exam and the system books the exam. This process relies on many people to do their part: the physician must enter the correct information for the most appropriate exam to be selected, the patient must cooperate by selecting the best time, and so on. It seems simple, but change management is required to ensure this transition is smooth.
As a part of this workflow transition, serious consideration should be given to the staffing needs within the organization’s IT department. As the cloud starts to permeate the clinical environment, the same skill sets will no longer be required. Different technology will need to be supported, new training will be required and new skill sets will need to be defined. Staff who once managed backups and archiving will shift to managing network connections and clinical applications. IT staff will focus on the rollout of the electronic medical record (EMR) instead of managing the storage layer the EMR sits upon. This kind of skill set is in high demand today, with experts suggesting healthcare IT could be one of the fastest growing areas for employment.
These challenges contribute to slow adoption of cloud technologies but should not stop cloud progress. Organizations are weighing the benefits against the risks. As more providers migrate to the cloud, we will see these challenges overcome with new and innovative solutions.
In my next post I will discuss what benefits cloud technology will bring to healthcare.
Previous posts in the series:
Mainframes are for Kids
I’ve been spending a lot of time on mainframe activities over the past 6 months or so, which is totally fine with me. This year alone I’ve visited 8 very large customers wanting to improve efficiencies. What seems to be getting the most traction are a few unique products of ours, namely Hitachi Virtual Storage Platform (VSP), Hitachi Dynamic Provisioning (HDP) and Hitachi Dynamic Tiering (HDT).
I’ve blogged about these products on numerous occasions, as have Hu Yoshida and David Merrill, but the latest news here is that HDP now works on z/OS (and soon HDT will as well). The response from the customers I’ve been with, just in the past few weeks, has been extremely positive. Everyone wants to save money, improve storage performance, and reduce OPEX, whether open systems or z/OS. I’ve blogged on the notion of the Storage Computer, and this now applies to the “big iron” guys as well.
We are the only vendor to have these features available on the z/OS platform. EMC does not have it, but even more surprisingly, neither does IBM—and they own the platform!
To those of you that have read this far and assume that the MF platform is going away: think again. I came across this article from Forbes entitled Mainframes are for Kids that strongly indicates otherwise. New “mainframers” are being trained as we speak. How cool is that!
The truth is that z/OS pretty much runs itself and has solved many of the issues we’ve struggled with in open systems, so it tends not to get a lot of focus. So this new news is big news.
Basically, VSP allows our z/OS customers to deploy much cheaper storage. Previously, there were only three storage options available, namely IBM, EMC, and HDS, since those were the only platforms supporting FICON. Now, with VSP, our customers have a much larger (and much less expensive) choice in storage.
HDP, through our unique data dispersion architecture, provides a significant improvement in performance, especially for cache-unfriendly workloads. DB2 performance improvements anyone? You’ll love it!!
And finally, HDT automates the placement of data on the most appropriate tier. I recently wrote in a post that roughly 80% of data residing on tier 1 disk doesn’t need to be there. HDT will dramatically and automatically reduce capacity costs, along with the power and cooling costs that go with them. All in support of our capacity efficiency efforts.
So there you have it. I’ve always wondered how many mainframers read blogs, so if you know one, please forward this around. I’ll be doing more z/OS blogs in the future, and if you are a big iron person (or anyone else for that matter), feel free to leave a comment.
I want to know if you’re out there!!
For other posts on maximizing storage and capacity efficiencies, check these out: http://blogs.hds.com/capacity-efficiency.php
The Role of Cloud in Improving Patient Care
This is the first in a series of posts discussing the role that cloud technologies play in the healthcare market.
It is no secret that healthcare organizations lag behind most other industries in adopting new technologies—by some estimates by as much as 10 years. Providers must modernize their IT infrastructures and massively overhaul their paper-based workflows, all while dealing with budget cuts and government reforms. It’s no wonder that healthcare organizations are often slow to move.
Healthcare providers invest 10% of revenue into IT, compared to other industries that regularly invest 25%. To date, their IT focus has been primarily around the digitization of images with picture archive and communication systems (PACS), payment and reimbursement applications and maintaining regulatory compliance. In addition, government incentives are driving providers to look at electronic health records (EHR), health information exchanges (HIE) and business intelligence or analytics tools as a way to push the boundaries of patient care.
The reality is, these types of initiatives can mean huge upfront capital expenditures, sizable ongoing operating expenses and a huge investment in change management. This is a major challenge in an industry that is historically reluctant to change.
Enter cloud computing.
Embracing cloud technology in healthcare may be the answer to enabling healthcare organizations to focus their efforts on clinically relevant services and improved patient outcomes. At the same time, it may reduce and even remove the burden of infrastructure management. Cloud technologies can provide access to hardware, software, IT knowledge, and resources and services, all within an operating model that drives down costs and simplifies technology adoption. Suddenly, management and migration of legacy hardware fall upon the cloud provider, allowing hospitals to get back to their primary intent of business—patient care.
As with any new technology, there are concerns that are both unique to healthcare and common to all industries. Security and privacy become a regulatory compliance issue, while high availability is a must for systems that deal with life-and-death situations. Data movement across borders and ownership of that data are also important. Reports show as many as 30% of healthcare organizations are either implementing or operating cloud-based solutions, and the result is a wealth of vendors moving their applications to cloud models. Unfortunately, these cloud offerings are mostly limited to email applications and collaboration tools like Microsoft Live Meeting, but the movement to clinical systems is starting to grow. Electronic health records, diagnostic imaging, analytics and the introduction of health information exchanges all lend themselves to being cloud-based with a clinical focus.
Over the next few months, this series will explore the different aspects of cloud adoption and how healthcare providers can move forward with a cloud-based solution.
But as the saying goes, you can’t know where you’re going until you know where you’ve been…
Current State of Healthcare
The healthcare industry has traditionally underutilized technology as a means of improving the delivery of patient care. Even today, organizations still rely on antiquated paper medical records and handwritten notes to inform and make decisions. Digital information is siloed between departments and applications, making access to a patient’s longitudinal record difficult, if not impossible. This lack of access costs the healthcare industry millions of dollars each year in duplication and waste.
The sharing of patient data among clinicians, departments and even patients is rare and complex. A hospital’s reliance on vendors to “knit” together their diverse technologies leads to expensive and unproven data experiments that fail to deliver the expected outcomes. Various countries have approached this issue in different ways, from the central national clearinghouse (UK) to regional health centers (Canada) to more granular HIEs, all realizing varying degrees of success. Those countries that skipped over paper records and started with diagnostic imaging seem to have had more success, albeit in a limited manner, and have yet to achieve success with the larger components of the patient record.
Most provider IT departments are accustomed to traditional technologies: licensed software platforms and elaborate, hardware-heavy infrastructures supported by a large staff. The staff members need to be experts in all areas of the IT department, including hardware, software, networking, backup and archiving. As new technologies are introduced, the demands on the IT infrastructure start to push the limits of the promised efficiencies. While ground-breaking in concept, government incentives simply don’t cover enough of the true costs of overhauling legacy equipment and modernizing a facility.
As EHRs, PACS and advanced clinical systems evolve and become more prominent, current storage resources are stretched. The implementation of a digital pathology system alone could instantly put petabyte-level demands on the current infrastructure. Implementation time for these projects is consumed with ensuring the back-end technologies are properly configured and working, often taking focus away from the clinical aspects of the applications and what users need. Reducing this implementation time is critical to a facility’s ability to adapt quickly to changing needs and the introduction of new applications.
Patients today are better advocates for their own healthcare; they are more educated about their diseases and increasingly demand access to the latest technologies. At the same time, they seek the best care at the lowest cost, and are willing to investigate their options. As a result, demands for access to personal patient records are increasing and organizations need to keep up. When citizens can access bank accounts from anywhere in the world to withdraw money, check balances and make payments, it is hard to understand why we cannot travel across town and tell a physician what medications we are taking and what diagnostic procedures we’ve had, never mind what the results were. Patients require universal access to their secure health information.
The picture is not all doom and gloom, however: many facilities have recognized these challenges and still provide top-notch care. Many developed countries are establishing healthcare data clearinghouses or data centers that can help make data more portable. Canada has established diagnostic imaging repositories across the country, with demonstrated benefits in both patient care and cost savings. Countries everywhere continue to invest in new technology that will improve patient care.
And this is where cloud computing can help drive the industry. CDW's study (same reference as above) showed that 37% of healthcare providers have cloud adoption in their strategic plans, 22% are in the planning stages and 25% are in the midst of implementing. Only 5% have already embraced cloud computing, and they have realized an average of 20% savings on the applications they've implemented. The next step is to move more clinically focused applications into the cloud.
In the next blog we will discuss the healthcare drivers for organizations to adopt cloud technologies. Until then, I’d love to hear your thoughts.
HDS at Storage Visions 2012
Guest post by Tracey Doyle
After a relaxing break for the holidays, Storage Visions 2012 was a great way to ease back into the swing of things. Sure, many people might not look at a whirlwind trip to a Las Vegas-based conference kicking off CES (the world’s largest consumer technology show of the year) “easing back in,” but well…I do.
SV2012 starts a few days before CES, so there is somewhat of a quiet before the storm. I love the feel of Storage Visions; it remains intimate even though attendance continues to grow each year. This is thanks to Tom Coughlin and his team, who run the conference like a community. I see many familiar faces each year, and there is a heavy emphasis on networking with your peers. Networking is made easy by the laid-back atmosphere at the conference, and with all the friendly exchanges it is very easy to meet new contacts and get excited about what you do. Our HDS cloud vision really seemed to resonate with the people I talked with.
I was able to present during the session “They’re Out There: Opportunities and Challenges for Consumer and Enterprise Cloud Storage”. What a fun panel session it was. It included a variety of presenters (even a competitor), but it was such a lively interactive exchange that it made for an interesting and informative discussion. The session focused on distributing content, as well as how on-line back-up and disaster recovery are driving demand for remote storage. We also addressed storage requirements and trends for online content delivery and remote storage. We covered some new business opportunities and how they’ll impact the growth and use of storage in this growing market.
Another thing made SV 2012 a bigger and better event this year for me and HDS: Hitachi Content Platform (HCP) was honored with a Visionary Product Award at the 2012 conference. HDS was recognized in the Enabling Professional Storage Technology category for the benefits HCP brings to organizations, including simplified IT, reduced costs and reduced risks. I swear I thought I was accepting an Oscar! A little too excited, maybe? Oh well, a little too much excitement won't kill anyone.
That’s how I started off my New Year. I know it will be an exciting year here at HDS. I look forward to spreading the HDS cloud vision and continuing to share the many exciting things we have going on!
Follow me on twitter @tdoyle49
From ASIC to Microprocessor and Back Again
Other than being an allusion to J. R. R. Tolkien’s The Hobbit, there is real meaning in the title of this post, which I’ll get to towards the end. What I want to start with is a look back into the past and talk about, of all things, math co-processors.
Do you remember them? If you go back that far in personal computing land you should recall what an external FPU or math co-processor is. Here’s the Wikipedia definition for context, which I find personally very interesting for this post:
A floating-point unit (FPU, colloquially a math coprocessor) is a part of a computer system specially designed to carry out operations on floating point numbers. Typical operations are addition, subtraction, multiplication, division, and square root. Some systems (particularly older, microcode-based architectures) can also perform
various transcendental functions such as exponential or trigonometric calculations, though in most modern processors these are done with software library routines. In most modern general purpose computer architectures, one or more FPUs are integrated with the CPU; however many embedded processors, especially older designs, do not have hardware support for floating-point operations. In the past, some systems have implemented floating point via a coprocessor rather than as an integrated unit; in the microcomputer era, this was generally a single integrated circuit, while in older systems it could be an entire circuit board or a cabinet. Not all computer architectures have a hardware FPU. In the absence of an FPU, many FPU functions can be emulated, which saves the added hardware cost of an FPU but is significantly slower. Emulation can be implemented on any of several levels: in the CPU as microcode, as an operating system function, or in user space code. (source: http://en.wikipedia.org/wiki/Math_coprocessor )
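The last point in the quote, that FPU functions can be emulated in software, is worth making concrete. Below is a minimal, hedged sketch (in Python, purely for illustration) of software floating-point multiplication, the kind of work an FPU does in hardware. The `(sign, mantissa, exponent)` representation and the 24-bit normalization limit are simplifications of single precision, not any real emulation library.

```python
# Minimal sketch of software floating-point multiplication.
# A number is a (sign, mantissa, exponent) triple meaning:
#   sign * mantissa * 2**exponent

def fp_mul(a, b):
    """Multiply two software floats represented as (sign, mantissa, exponent)."""
    sign = a[0] * b[0]
    mantissa = a[1] * b[1]
    exponent = a[2] + b[2]
    # Normalize: keep the mantissa within 24 bits, as single precision would.
    while mantissa >= (1 << 24):
        mantissa >>= 1
        exponent += 1
    return (sign, mantissa, exponent)

def to_float(x):
    """Convert the triple back to a native float for checking."""
    return x[0] * x[1] * 2.0 ** x[2]

a = (1, 3, -1)   # 3 * 2**-1 = 1.5
b = (1, 5, -1)   # 5 * 2**-1 = 2.5
result = fp_mul(a, b)
print(to_float(result))  # 3.75
```

Every multiply costs several integer operations plus a normalization loop, which is exactly why emulation is "significantly slower" than a hardware FPU, and why integrating the FPU into the CPU won out.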
Notice the key sentence in the quoted text above: in most modern processors, math co-processors have been replaced by embedded floating point units and software libraries. So what has happened is that a previous cottage industry, which provided ASICs functioning alongside a CPU, has disappeared.
However, that hasn't stopped new technologies from cropping up in the area of numerical processing. A type that has become extraordinarily popular of late for graphics and vector processing is the GPU. For specific numerical and highly parallel tasks, GPUs paired with standard x86 CPUs have arrived on the scene and become popular for increasing compute capability while decreasing physical system footprint. Generalizing a bit, what I see is the sedimentary hypothesis in action: a separate hardware function lives for a while, but eventually the microprocessor, libraries and compilers become good enough that the need for separate hardware goes away. Repeat cycle!
Now let’s take a look at what Intel has been doing with their microprocessor family around embedded applications such as storage. Specifically, if you read some of Intel’s product briefs on their microprocessors for embedded applications and you’re a storage vendor, you might think that hell has finally frozen over.
Intel has been implementing embedded application functionality into their Xeon processor line adding in a veritable alphabet soup of TLAs. Here are but a few of the capabilities:
- Internal support for RAID 0, 1, 5, and 10
- Integrated SAS and PCIe
- Support for AES, Hashing, Chunking and Compression
- Non-transparent bridging
- Various virtualization assists
There’s also the assertion from Intel that software RAID stacks with Intel microprocessor assists are on par with ASICs that support RAID offload from a standard microprocessor.
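To make that assertion tangible, here is a hedged sketch (in Python, for clarity rather than speed) of the XOR parity computation at the heart of RAID 5: the kind of work that once justified a dedicated RAID ASIC and that Intel now argues can run in software with microprocessor assists. The function names are my own illustration, not any vendor's API.

```python
# Sketch of RAID 5 parity: XOR the data stripes together. Losing any one
# stripe, the XOR of the survivors plus the parity rebuilds it.

def raid5_parity(stripes):
    """XOR a list of equal-length data stripes into one parity stripe."""
    parity = bytearray(len(stripes[0]))
    for stripe in stripes:
        for i, byte in enumerate(stripe):
            parity[i] ^= byte
    return bytes(parity)

def recover_stripe(surviving, parity):
    """Rebuild a lost stripe from the surviving stripes plus parity."""
    return raid5_parity(surviving + [parity])

data = [b"AAAA", b"BBBB", b"CCCC"]
p = raid5_parity(data)
# Lose the middle stripe and rebuild it from the survivors plus parity:
assert recover_stripe([data[0], data[2]], p) == b"BBBB"
```

In a real implementation the per-byte loop would be replaced by wide SIMD XORs, which is precisely the kind of processor assist Intel is pointing to.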
My response: Okay, this is nothing more than the sedimentary hypothesis in action, and eventually Intel's Xeon SoC for embedded systems will solve some, but not all, storage problems. Furthermore, new whitespace problems will emerge in the storage market, and guess what? Intel won't have that capability on or near their processor for a while, just like we saw with math co-processors being sucked into the microprocessor and GPUs rising, Phoenix-like, from the math co-processor ashes. So from ASIC to microprocessor and back again!
Any ideas for what the white space could be? Drop me a line or comment here if you have any suggestions. Otherwise, tune in soon to read some ideas in a future post.
From NAS Virtualization to NAS Feature
Per my previous post, I wanted to provide more concrete examples from the storage world related to the sedimentary hypothesis.
Here goes example number one: NAS virtualization.
You may recall past companies and products in this space. Those that come immediately to mind include Rainfinity, Acopia, and StorageX, with only Acopia ARX still existing at F5 as a standalone NAS virtualization product. All the others have either been acquired or have gone out of business (at least as far as I know). The fact that none of these is highlighted any longer as a standalone application or appliance begs the question: is NAS virtualization a viable technology?
You bet, and you can see it in action within two Hitachi products, except not as separate appliances: Notably, you’ll find NAS virtualization in the Hitachi Data Ingestor (HDI) and the Hitachi NAS Platform (HNAS).
Our first incarnation was done in 2007/2008 by applying engineering talent from HDS to the then standalone BlueArc. (Here's a shout out to Simon, Paul, and Phil…welcome back!) It showed up as a feature called eXternal Volume Link (XVL) and was controlled through a basic interface on the native element manager, or with full content indexing via Hitachi Data Discovery Suite (HDDS). XVL can talk to any NFSv3 server, as well as use REST over HTTP to talk to Hitachi Content Platform (HCP). So what we did was put NAS virtualization as a feature into the storage infrastructure four years ago.
The second incarnation is within HDI, first implemented as a connection to HCP using REST over HTTP. It was, and is, designed as a cloud on-ramp for remote locations to connect to stellar Hitachi private cloud/object storage infrastructure. Most recently, with the updated version of HDI, we are now also able to virtualize via the CIFS protocol to consolidate existing NAS and Windows filers into a Hitachi private cloud infrastructure. The setup of HDI for this purpose, just like XVL, is as an inline file system virtualizer which can take over shares from the target filers or file servers and allow users to smartly drain these older systems into the cloud.
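The "inline virtualizer" idea can be sketched in a few lines. Below is a hedged, offline model (Python, with both tiers represented as plain dicts) of the take-over-and-drain pattern described above; real systems of this kind speak NFSv3/CIFS on the front end and REST over HTTP to the object store, and the class and method names here are purely illustrative.

```python
# Offline sketch of an inline file virtualizer: it takes over a share,
# keeps serving reads from the legacy filer, and drains files into the
# cloud tier behind the scenes without clients noticing.

class InlineVirtualizer:
    def __init__(self, legacy_filer):
        self.legacy = dict(legacy_filer)  # files still on the old filer
        self.cloud = {}                   # files migrated to object storage

    def read(self, name):
        # Serve from the cloud tier if already migrated, else pass through.
        if name in self.cloud:
            return self.cloud[name]
        return self.legacy[name]

    def drain(self):
        # Migrate everything; the read() path is unchanged for clients.
        for name, data in list(self.legacy.items()):
            self.cloud[name] = data
            del self.legacy[name]

v = InlineVirtualizer({"/share/a.txt": b"alpha"})
assert v.read("/share/a.txt") == b"alpha"   # served from the legacy filer
v.drain()
assert v.read("/share/a.txt") == b"alpha"   # now served from the cloud tier
assert not v.legacy                         # the old system is drained
```

The point of the pattern is that the client-facing path never changes while the back end empties out, which is what lets older filers be retired with no user disruption.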
In both instances you can see that in-band/inline NAS or file system virtualization is no longer a standalone product like F5 ARX or any of the other legacy technologies. In fact the NAS virtualization feature has transformed from a standalone application or appliance to features in the storage infrastructure. Digging a little deeper, two more key questions are: Why did we do this and why in this way?
Well to answer the first one, our customers asked us to. Here is a customer quote from 2006/2007. (Now, I will add that at the time this customer was the “poster child” for Acopia and since there is no statute of limitations on protecting customer names, I’ve removed the customer name from the quote.)
“Acopia was our only choice at the time, but if it was incorporated into a NAS product we’d throw out their [ARX] product in a second.”
Wow! This is still, and was back then, a very clear driver to do what we did. As to why we implemented XVL and HDI file system and NAS virtualization the way we did, that is pretty simple. When we looked at our existing portfolio, we already had what was becoming a blockbuster success in in-band block storage virtualization in the form of the original USP. This system had the data movement engine within the storage controller, with a basic control point on the native element manager and an advanced control mechanism in an out-of-band controller called, at the time, Tiered Storage Manager. As a result, we determined that to help our customers as they added NAS to their portfolios, we'd follow a similar approach in the hope of making adoption easier.
If this isn’t a data point screaming that the sedimentary hypothesis of technology is true then I don’t know what else is. However, this is only one data point and more are needed, and for that you’ll have to wait until the next post.
The Sedimentary Hypothesis of Technology
I’ve mentioned the sedimentary hypothesis of technology in a few tweets already, and now I wanted to take the time to explain this concept in more detail. Before I get into explaining the hypothesis, let me provide a warm-up in the form of a definition of the process for forming organic sedimentary rock.
Organic sedimentary rocks are formed under varying degrees of pressure and temperature over long periods of time. More pressure and an increase in temperature will form different types of organic sedimentary rocks. When organic material is broken down it becomes peat. Peat is the first step in the organic sedimentary rock process. As more earth accumulates over the peat and causes the peat to come under greater pressure and a higher temperature, then lignite is formed, another type of organic sedimentary rock. After the lignite is formed it begins to undergo a similar process as the peat. More pressure is applied to the lignite and the temperature becomes hotter resulting in the formation of bituminous coal. Bituminous coal then becomes anthracite coal as its temperature and pressure increases. Coal is created under swampy conditions that are not commonly found in our era because it needs higher sea levels to help it form. (Source: eHow.com on Organic Sedimentary Rock)
Obviously, what precedes the generation of organic sedimentary rock is a vibrant, active ecosystem filled with fauna and flora, both of which eventually die, initiating the process of rock formation. I see technology in much the same way; basically, it goes like this:
- Application – correlates to the vibrant and active ecosystem, but eventually every application or at least some parts of an application “die”, begetting.
- Middleware/Feature-ware – matches the peat stage of organic sedimentary rock formation and occurs when what were once vibrant applications or several application components transform into a middleware stack or a set of capabilities within an existing middleware stack, and with time and market pressure produce.
- OS-ware/Infrastructure-ware – is rather like lignite or bituminous coal, happening when middleware and feature-ware end up as features or components in either the OS or the infrastructure (e.g. storage, network or compute), and finally, with additional market innovations, result in.
- Microprocessors, ASICs, ASSPs or FPGAs – realize the equivalent of anthracite coal and are comprised of accelerators, full or partial offloads of capabilities into silicon, or assembly-like instructions executing on FPGAs. (Note that complete implementations may never find their way into silicon; however, when algorithms do arrive on silicon, extreme performance boosts and reductions in power consumption are often major benefits.)

This is the general "hypothesis" I've been referring to, and I think there may even be sub-cycles within each layer. For example, multimedia functions (e.g. graphics and audio) used to be merely software running on general purpose processors. Then, over time, GPUs and other accelerators arose, moving a large part of this function onto silicon. Now, given even more time, there is pressure from the SoC model to further compress things like GPUs onto a single multi-type, many-core processor produced by the likes of Intel or AMD. Another example is in the DBMS world, where there is a plethora of open source alternatives to Oracle, and NO-SQL systems whose core is available for free. I believe this shows that in the middleware layer there is healthy market pressure and competition, resulting in a wide selection of offerings.
One conclusion, though an inappropriate one, would be that because of the large number of DBMS technologies, especially the focus on open source, this market is officially commoditizing.
I have a couple of other posts up my sleeve with some real world examples coming soon. Until then, what do you think? Am I on to something? Can we transform the hypothesis into a theory?
But Which Big Data Again?
As I have mentioned before, there is more to the Big Data story than Data Warehousing. Let me conclude first and back my way into the “why”.
I would say that the next tool in the arsenal of any Big Data question is Search!
However, the big "S" Search I'm talking about comes before an analytic query across data residing in a data mart, key value store, columnar data store, or any other NO-SQL (not only SQL) system. Since, in the era of the big bang of data, the vast majority of data is potentially exabytes in scale and structured, unstructured and semi-structured in type, I argue that this pre-Search may indeed be the most important step of all.
In his post, Philip Russom talks about this very point: an early step in the overall analytic process that he calls "Discovery Analytics," which comes prior to the institutionalization phase requiring formal ETL to place the data into a DWH or NO-SQL store. This is not dissimilar to the early phases of eDiscovery, which include a kind of raw search across mounds of content; results from that search are then passed to a case management tool for further refinement and analysis. This Discovery Analytics process, to use Philip's term, identifies the insightful diamonds in the rough which can literally transform, refine, revolutionize, or save an enterprise. Without this phase we are left with no seed to initiate a longer-term, deep and recurring analytic process, the kind that Mr. Russom dubs institutionalized.
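The mechanics of this pre-Search phase are simple enough to sketch. Below is a minimal, hedged illustration (Python) of the core idea: build an inverted index over raw, unstructured content so a discovery query can surface candidate documents before any formal ETL into a warehouse or NO-SQL store. The whitespace tokenizer and the document names are assumptions for the example; real enterprise search engines add stemming, fields, ranking, and security trimming.

```python
# Sketch of "pre-Search": an inverted index maps each term to the set of
# documents containing it, so discovery queries can run across raw content.

from collections import defaultdict

def build_index(docs):
    """Map each lowercase term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def discover(index, *terms):
    """Return the documents containing every query term."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

docs = {
    "mail-001": "Q3 revenue forecast revised upward",
    "mail-002": "revenue recognition policy memo",
    "log-771":  "forecast job failed at 02:00",
}
idx = build_index(docs)
assert discover(idx, "revenue", "forecast") == {"mail-001"}
```

The query touches only the index, never the raw content, which is what makes a discovery sweep across mounds of heterogeneous data tractable before anything is institutionalized.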
My worry is that the industry is largely leaving Search, or Discovery Analytics, out of the general discussions surrounding Big Data. Instead there appears to be a fascination with NO-SQL data stores, feeding Hadoop, releasing your own version of Hadoop, evolving BI tools to handle Big Data, etc. Perhaps this is because Search is not trendy enough to warrant hype and excitement, but I suppose if we modify the name to "Discovery Analytics" things could change.
Rest assured that paying attention to Search within the enterprise can yield real and tangible results beyond Big Data. In fact, Forrester states that, as of 2009, information workers spend almost half a day a week merely finding things inside the enterprise. To me, this means that if enterprises, and the vendors who serve them, focus on Search as Discovery Analytics, we could improve the lives of everyday users and put in the rebar needed to pave the path towards managing Big Data.
Furthermore, I think that an added and positive consequence of focusing on search is the real potential to start the democratization of the Data Scientist. In my humble opinion this could not happen soon enough so that the role is prevented from being entrenched in an almost ivory tower-esque way throughout the industry.
Here’s to a Big-Data-verse for the people, of the people, and by the people.