Nathan Moffitt

Is NVMe Killing Shared Storage?

Blog Post created by Nathan Moffitt on Sep 7, 2017

If you've been investigating NVMe solutions lately, you may have noticed some interesting comments around shared storage and NVMe. Namely, NVMe and NVMe over Fabrics (NVMeF or NVMe-oF, depending on who you talk to; I like fewer characters) introduce some... challenges for current shared storage architectures.


What? You've been told that NVMe slots right into current designs? Well sure, you can support NVMe and NVMeF by 'tweaking' an existing array design but that doesn't mean it’s going to give you the ROI you expect. So buyer beware lest you become a grumpy cat.


Note: For an overview of NVMe, NVMe over Fabrics and different approaches to implementing NVMe, check out this blog by Mark Adams.



The notion of shared storage has been around for a long time. Implementations vary, but the basic idea is the same: a system owns a pool of storage and shares it out for use by multiple hosts. This enables superior economies of scale compared to direct attached storage because capacity is:


  • Managed and serviced from a consolidated location (operational simplicity)
  • Not stranded on / under-utilized by individual hosts (reduced budgetary waste)
  • Scaled independently of the host (operational efficiency)
  • Able to be virtualized and over-subscribed to minimize idle resources (storage efficiency)


Combined, these benefits significantly improve IT economics, reducing system, management and environmental costs (power & footprint).



From a ‘raw’ performance standpoint, a single NVMe device has the ability to completely saturate a 40GbE network link (some think 3D XPoint will saturate a 100GbE link; I’d be very skeptical). A single NVMe device also has the potential to soak up storage controller CPU time faster than a dry sponge in a swimming pool.


Note for the experts: I agree NVMe is more CPU efficient than AHCI, but even an improvement in IO processing from 10 µs to 3 µs of CPU time still means an NVMe device can saturate a CPU with 100% read workloads. At 100% writes you've got only slightly more scale.


The implication is that even a small set of NVMe devices is going to consume a lot of network ports and CPU resources – even with all data services turned off (more on that in a minute). You’ll be spending a lot on a high end storage controller and expensive NVMe media to share… a few TB? Sure, you can add NVMe capacity but to what end? You aren't accessing its value because the controller is tapped out.
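To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The 3 µs of CPU time per IO and the 40GbE link come from the text above; the device profile (700,000 read IOPS and 3 GB/s per device) is purely an illustrative assumption, not a measurement of any particular drive, and the model ignores protocol overhead and data services entirely.

```python
import math

def controller_budget(devices, iops_per_device, gbytes_per_sec_per_device,
                      cpu_us_per_io=3.0, link_gbits=40.0):
    """Estimate cores and network ports needed to drive `devices` flat out."""
    # One core provides 1,000,000 µs of CPU time per second.
    iops_per_core = 1_000_000 / cpu_us_per_io
    cores = math.ceil(devices * iops_per_device / iops_per_core)
    # Convert link speed from gigabits to gigabytes per second.
    link_gbytes = link_gbits / 8
    ports = math.ceil(devices * gbytes_per_sec_per_device / link_gbytes)
    return cores, ports

# Assumed device profile: 700k read IOPS, 3 GB/s of read throughput.
cores, ports = controller_budget(devices=8, iops_per_device=700_000,
                                 gbytes_per_sec_per_device=3.0)
print(f"8 devices: ~{cores} cores and ~{ports} x 40GbE ports")
```

Under those assumptions, eight devices already want about 17 cores and 5 ports just to move IO. Change `cpu_us_per_io` to 10.0 and the core count climbs to 56.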


The other challenge is that the storage services used to abstract physical media into logical resources take processing time, adding latency. And if you plan on using deduplication to keep costs down? Bad news, deduplication adds a lot of overhead.


Even scale-out AFAs are not immune. Scale-out gets expensive if you only have a few NVMe devices per node (and nodes with more CPUs & 40GbE ports = more cost). Plus, cluster communications across multiple nodes will increase latency and reduce value.



This is why software-defined vendors are saying you should get rid of external shared storage and use NVMe only in direct attached storage (e.g. hyperconverged) or NVMe over Fabric RDMA solutions (e.g. rack-scale flash). Rack-scale flash vendors get more out of NVMe storage by breaking the mainstream storage design:


  • Stripping out storage services that can add latency (e.g. thin provisioning)
  • Moving core storage services to the host (e.g. very basic RAID)
  • Implementing NVMe over Fabric with RDMA (e.g. RoCE or iWarp)


Yes. Very very basic diagram, but hopefully it illustrates the point. Using NVMe over Fabrics with RDMA can be particularly helpful in increasing performance because it sends IO requests directly to the NVMe media, bypassing storage controller processing and its capabilities. The complication is that you lose shared storage capabilities unless you add a ‘manager’ that owns carving up the media and telling each host which block ranges it owns (note: a few vendors have this).
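To make the ‘manager’ idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical (the `RangeManager` name, the host labels, the block counts) and it is not how any vendor implements this; it only shows the bookkeeping of handing each host a non-overlapping block range. It never frees or resizes ranges, which a real manager would have to do.

```python
from dataclasses import dataclass, field

@dataclass
class RangeManager:
    """Hypothetical manager that hands out non-overlapping block ranges."""
    total_blocks: int
    next_free: int = 0
    owners: dict = field(default_factory=dict)  # host -> list of (start, end)

    def allocate(self, host, blocks):
        """Give `host` the next `blocks` logical blocks; end is exclusive."""
        if blocks <= 0:
            raise ValueError("blocks must be positive")
        if self.next_free + blocks > self.total_blocks:
            raise RuntimeError("not enough unallocated blocks")
        extent = (self.next_free, self.next_free + blocks)
        self.next_free += blocks
        self.owners.setdefault(host, []).append(extent)
        return extent

    def owner_of(self, lba):
        """Return the host that owns block `lba`, or None if unallocated."""
        for host, extents in self.owners.items():
            if any(start <= lba < end for start, end in extents):
                return host
        return None

mgr = RangeManager(total_blocks=1_000_000)
print(mgr.allocate("host-a", 400_000))  # (0, 400000)
print(mgr.allocate("host-b", 400_000))  # (400000, 800000)
print(mgr.owner_of(450_000))            # host-b
```

Once a host knows its range it can issue IO straight to the media, which is the whole point. The manager only sits in the control path, not the data path.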


Net, this architecture can really tap into the potential of NVMe but at the cost of enterprise sharing and data services. Do you want to give up replication? What about thin provisioning to over-commit and reduce costs? Data reduction to make flash more affordable? All are lost – at least for now.


So rack-scale flash gets us past the ROI challenges with current shared storage architectures, but it struggles in the shared storage department.




So how do we get the performance of rack scale flash (which really unlocks the ROI from NVMe) coupled with the enterprise shared storage functionality (which delivers the best TCO and, ultimately, resiliency) found in today’s arrays?


This is the million dollar question storage vendors are looking to solve. It’s also the question that may drive you to hold tight and focus NVMe investment on hyperconverged workloads that benefit from a local NVMe footprint (see the Hitachi UCP HC) or rack scale flash workloads where IT teams may value performance over the cost efficiencies of shared storage.


OPTION 1: Improve shared storage in rack-scale flash. It’s absolutely possible to do this, just be aware that it may not be done completely in the storage array. Some services may be added to the array while others are done host side and managed via a hive intelligence or master control server. Where they are done will depend on how frequently updates need to be made (lots of status updates across hosts? Do it in the array.) and on technical feasibility.


My 2 cents. This is already happening, but it’s going to take a while before these solutions deliver the level of sharing that you expect from an enterprise shared storage array. Even when they do, you can argue whether you want that functionality for the workloads rack-scale flash serves best. In the near term I’d use rack-scale flash for what it does better than anything else: running analytic workloads at high speed.


OPTION 2: Use a hyper-converged infrastructure with robust data services and NVMe support. The benefit of this strategy is that it adds back in our shared storage data services (virtualize and abstract!). By having a software-defined storage element and virtual server hypervisor on the same system you can access NVMe at high speed. There are considerations though:


  1. Data service overhead. The same thing that makes this a better solution also makes it slower. Every data service adds overhead so make sure you can toggle services off and on.
  2. Data set size / Distance impacts. Some day we’ll be able to fold space and instantly transmit data to nodes in different galaxies / pocket universes / points in time. Until then, we have to deal with wires. If a VM has to access another node to find data, you lose time. For smaller data sets this isn't an issue, but as the capacity used by a workload increases, it becomes a reality. You can optimize the code, use SR-IOV to bypass virtual NICs and use bigger pipes (100 GbE) to get around this, but it won’t have the streamlined stack of rack-scale flash.
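The distance impact in point 2 can be sketched with a toy model in Python. All the numbers are assumptions for illustration (100 µs for a local read, 50 µs extra per network hop, 10 TB of NVMe per node), and the model assumes reads are spread evenly over the data set, which real workloads rarely are.

```python
def mean_read_latency_us(dataset_tb, node_capacity_tb,
                         local_us=100.0, network_hop_us=50.0):
    """Average read latency when data beyond one node's capacity is remote."""
    if dataset_tb <= 0:
        raise ValueError("dataset_tb must be positive")
    # Fraction of the data set that fits on the node running the VM.
    local_fraction = min(1.0, node_capacity_tb / dataset_tb)
    remote_us = local_us + network_hop_us
    return local_fraction * local_us + (1.0 - local_fraction) * remote_us

for size_tb in (5, 10, 20, 40):
    print(size_tb, "TB ->", round(mean_read_latency_us(size_tb, 10), 1), "µs")
```

With these assumed figures the average stays at 100 µs while the data set fits on one node, then rises to 125 µs at 20 TB and 137.5 µs at 40 TB as more reads have to cross the wire.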


My 2 cents. For ‘smaller’ data sets, HCIS (hyperconverged infrastructure solution) is a great option, but as data sets grow you need to consider a solution with a lighter stack (rack-scale) or the ability to consolidate larger amounts of NVMe devices.


OPTION 3: Use an NVMe over Fabric Optimized AFA. This doesn't exist yet. Yes, I hear vendor whining, but it doesn't. Here’s what we need, in my ever so insane opinion:


  • More, faster ports. For you to have a real NVMe AFA you need to have a much faster network interconnect.
  • NVMe Optimized Scale-out. Scale-out can be a good way to tap into NVMe power, but you have to consider cost & latency issues. To resolve this an AFA will need controller modules that can support more than a few NVMe devices and new QoS protocols that optimize for locality (including use of Directives) plus dynamically provision resources. Scaling will also likely include tiering. Wait… I said that before didn’t I?
  • Resource Fencing. To get the most out of NVMe devices you need to minimize the impact of data services. Use of offload cards or fixed / dynamic resource fencing is one way of shunting tasks to ‘dedicated’ resources so performance impacts are limited. There is a cost associated with this, but to reach maximum performance, it could be worth it.


My 2 cents. Getting to a fully optimized NVMe AFA is going to take time. Along the way we will see partial implementations that treat NVMe as ‘bulk capacity’, only tapping into the performance of a few devices. For those with deep pockets and the desire to upgrade over time (current solutions are not future proof), this may be fine. For those who want the best ROI, you may want to wait until an architecture has many of these elements. And if you have archive data, even near-line data, there are very few reasons to look at NVMe yet.



What shocks me is that this is a trimmed down blog. There is just too much to say. NVMe and NVMe over Fabrics change the parameters of how we share storage and necessitate changes to our storage architectures. Those changes are evolving rapidly as vendors consider how to adjust architectures to support new and emerging workloads.


In the near term many of our notions of shared storage will have to adjust if we want to get every ounce of ‘juice’ out of NVMe, but there are options like Hitachi’s UCP HC – our hyperconverged offering that has NVMe as direct attached storage for high performance access. Like other offerings, it doesn’t hit the current scale of enterprise storage, but it is also a lot easier to upgrade and scale. Longer term solutions will evolve but you’ll likely deploy a mix of the solutions I called out (starts to feel a little like the rock band blog…).



