Hu Yoshida

Enterprise Storage Arrays and NVMe

Blog Post created by Hu Yoshida on Aug 29, 2018

The hot topic in storage today is NVMe, an open standards protocol for digital communications between servers and non-volatile memory storage. It replaces the SCSI protocol that was designed and implemented for mechanical hard drives which processed one command at a time. NVMe was designed for flash and other non-volatile storage devices that may be in our future. The command set is leaner, and it supports a nearly unlimited queue depth that takes advantage of the parallel nature of flash drives (a max 64K queue depth for up to 64K separate queues). The NVMe protocol is designed to transport signals over a PCIe bus and removes the need for an I/O controller between the server CPU and the flash drives.
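To put the queue figures in perspective, here is a rough back-of-the-envelope sketch. The NVMe numbers are the ones quoted above (64K queues, each 64K deep). The "legacy" numbers (one queue, 32 commands) are an assumption that matches an AHCI/SATA-style interface and are used only for comparison; they are not taken from this post.

```python
# Back-of-the-envelope comparison of command parallelism.
# NVMe figures are the ones quoted in the post (64K queues, 64K commands each).
# The legacy figures are an ASSUMPTION (AHCI/SATA-style: 1 queue, 32 commands).

NVME_QUEUES = 64 * 1024
NVME_QUEUE_DEPTH = 64 * 1024
LEGACY_QUEUES = 1          # assumption, not from the post
LEGACY_QUEUE_DEPTH = 32    # assumption, not from the post


def max_outstanding_commands(queues: int, depth: int) -> int:
    """Upper bound on the number of commands that can be queued at once."""
    return queues * depth


nvme_limit = max_outstanding_commands(NVME_QUEUES, NVME_QUEUE_DEPTH)
legacy_limit = max_outstanding_commands(LEGACY_QUEUES, LEGACY_QUEUE_DEPTH)

print(f"NVMe:   {nvme_limit:,} commands")
print(f"Legacy: {legacy_limit:,} commands")
print(f"Ratio:  {nvme_limit // legacy_limit:,}x")
```

These are protocol ceilings, not what any real device or driver actually configures; the point is only the scale of the difference.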

 

PCIe (Peripheral Component Interconnect Express) is a standard type of connection for internal devices in a computer. NVMe SSD devices connected to PCIe have been available in PCs for some time. Hitachi Vantara has implemented NVMe on our hyperconverged Unified Compute Platform (UCP HC), where internal NVMe flash drives are connected directly to the servers through PCIe. The benefit of this is having a software-defined storage element and virtual server hypervisor on the same system where you can access NVMe at high speed. It makes sense to us to first bring the performance advantages of NVMe to commodity storage like our UCP HC because improvements will be greater for our customers. There are considerations, though: since there is not a separate storage controller, data services will have to be done by the host CPU, which adds overhead. If a VM has to access another node to find data, you lose time. For smaller data sets this isn't an issue, but as the workload increases, this negates some of the performance advantages of NVMe. However, you are still ahead of the game compared to SCSI devices, and UCP HC with NVMe is a great option for hyperconverged infrastructure workloads.

 

NVMe is definitely the future, but the storage industry is not quite there yet with products that can fully take advantage of the technology. PCIe has not broken out of the individual computer enclosure to function as a high-speed, wide bandwidth, scalable serial interconnect of several meters in length between control platforms and I/O, data storage or other boxes within an I/O rack. Here are the current proposals for NVMe transport:

NVME Transport.png

Clearly this is an evolving area, and most storage solutions that are available with NVMe today use PCIe for the back-end transport. NVMe SSDs plug into a PCIe backplane or switch which plugs directly into the PCIe root complex. However, PCIe has limited scalability. There’s a relatively low number of flash devices that can reside on the bus. This is okay for hyperconverged storage but it’s not what most customers are used to dealing with in All Flash Arrays. Scalable enterprise NVMe storage arrays will likely require a fabric on the backend.

 

What one would like is an NVMe All Flash Array with an enterprise controller for data services and shared connectivity over a high-speed fabric. The backend flash devices could connect to the controller or data services engine over an internal NVMe-oF, which would in turn connect to a host system using an external NVMe-oF using FC or RDMA. Since PCIe connections are limited in distance and do not handle switching, a fabric layer is required for host connectivity to external storage systems. While NVMe standards are available, both FC-NVMe and NVMe-oF are still works in progress: Rev 1 of the NVMe-oF standard was published in June of 2016 by the NVM Express organization, and Rev 1 of the FC-NVMe standard was just released last summer by the T11 committee of INCITS. Only a few proprietary implementations are available today. Several of our competitors have announced AFAs that they claim to be NVMe “fabric ready”. In fact, they are promoting features that have not been tested for performance, resiliency or scalability and are based on incomplete standards that are still evolving. Implementing based on these promises can add a huge risk to your installation and tie you to a platform that may never live up to the hype.

 

Here is where I believe we are in the NVMe introduction of enterprise storage arrays.

Enterprise NVME Arrays.png

NVMe-oF is needed to scale the connectivity and speed up the transmission of data between an NVMe SSD device and the controller, and FC-NVMe or NVMe-oF can do the same between the controller and a fabric-connected host. However, there is a lot that goes on in the SSD device, the controller, the fabric, and the host that can affect the overall throughput. In fact, the congestion caused by the higher speeds of NVMe and the higher queue depths can negate the transmission speeds unless the entire system is designed for NVMe.
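One way to see why deeper queues alone do not help is Little's Law, which says that in steady state the average number of requests in flight equals throughput multiplied by average latency. The sketch below rearranges it to estimate latency once some component has hit its throughput ceiling. The 500,000 IOPS ceiling is a hypothetical number chosen for illustration, not a measurement of any product.

```python
# Little's Law: outstanding_ios = iops * mean_latency_seconds
# Rearranged to estimate mean latency when throughput is capped.

def mean_latency_seconds(outstanding_ios: float, iops: float) -> float:
    """Average time an I/O spends in the system at the given throughput."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    return outstanding_ios / iops


# HYPOTHETICAL ceiling for some saturated component (controller, fabric, ...).
SATURATED_IOPS = 500_000

for queue_depth in (32, 256, 1024):
    latency_us = mean_latency_seconds(queue_depth, SATURATED_IOPS) * 1e6
    print(f"{queue_depth:>5} outstanding I/Os -> about {latency_us:,.0f} microseconds each")
```

Once throughput stops growing, every extra outstanding I/O simply waits longer, which is the congestion effect described above.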

 

On the backend, flash drives require a lot of software and a lot of processing power for mapping pages to blocks, wear leveling, extended ECC, data refresh, housekeeping, and other management tasks which can limit performance, degrade durability, and limit the capacity of the flash device. The higher I/O rates of NVMe could create bottlenecks in these software functions, especially on writes. While Hitachi Vantara flash storage systems can use standard SAS SSDs, we also provide our own flash modules, the FMD, FMD DC2 (with inline compression), and the FMD HD for high capacity (14TB), to improve the performance, durability and capacity of NAND devices. In order to support these processing requirements, the FMDs from Hitachi Vantara are built with a quad core multiprocessor, with 8 lanes of PCIe on the FMD PCBA and integrated flash controller logic, which supports 32 paths to the flash array. Having direct access to the engineering resources of Hitachi Ltd., Hitachi Vantara is able to deliver patented new technology in our FMDs, which sets them apart from competitive flash vendors. As the NVMe rollout progresses, expect to see other vendors trying to catch up with us by putting more processing power into their flash modules. This advantage that Hitachi has from our years of flash hardware engineering efforts is one of the reasons why we aren’t rushing NVMe into our Virtual Storage Platform (VSP) all-flash arrays. Our customers are already seeing best-in-class performance levels today.

 

One of the biggest reasons for a controller or data services engine is to be able to have a pool of storage that can be shared over a fabric by multiple hosts. This enables hosts and storage to be scaled separately for operational simplicity, storage efficiency and lower costs. Controllers also offload a lot of enterprise functions that are needed for availability, disaster recovery, clones, copies, dedupe, compression, etc. Because of their central role, controllers have to be designed for high availability and scalability to avoid being the bottleneck in the system. Dedupe and compression are key requirements for reducing the cost of flash and are done in the controller if both are required (note that compression is done in the FMDs when they are installed, but in the controllers for SSDs). The new controllers for an NVMe array must support all these functions while talking NVMe to the backend flash devices and FC-NVMe or NVMe-oF across the fabric to the multiple hosts. Here again, the increase in workloads due to NVMe could create bottlenecks in the controller functions unless the controller has been designed to handle it.

 

Over the many generations of VSP and the VSP controller software, SVOS, Hitachi has been optimizing the hardware and software for the higher performance and throughput of flash devices. The latest version of Storage Virtualization Operating System RF (SVOS RF) was designed specifically to combine QoS with a flash-aware I/O stack to eliminate I/O latency and processing overhead. WWN-, port-, and LUN-level QoS provide throughput and transaction controls to eliminate the cascading effects of noisy neighbors, which is crucial when multiple NVMe hosts are vying for storage resources. For low latency flash, the SVOS RF priority handling feature bypasses cache staging and avoids cache slot allocation latency for 3x read throughput and 65% lower response time. We have also increased compute efficiency, enabling us to deliver up to 71% more IOPS per core. This is important today and in the future because it allows us to free up CPU resources for other purposes, like high speed media. Dedupe and compression overheads have been greatly reduced by SVOS RF (allowing us to run up to 240% faster while data reduction is active) and hardware assist features. Adaptive Data Reduction (ADR) with artificial intelligence (AI) can detect, in real time, sequential data streams, data migration or copy requests that can more effectively be handled inline. Alternatively, random data writes to cells that are undergoing frequent changes will be handled in a post-process manner to avoid thrashing on the controllers. Without getting into too much technical detail, suffice it to say that the controller has a lot to do with overall performance, and more will be required when NVMe is implemented. The good news is that we’ve done a lot of the necessary design work within SVOS to optimize the data services engine in VSP for NVMe workloads.
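For readers who want a feel for how per-host or per-LUN throughput controls can contain a noisy neighbor, the sketch below uses a token bucket, a common general-purpose rate-limiting technique. It is not a description of how SVOS RF implements QoS; the class name, the rates, and the host labels are all hypothetical and chosen only for illustration.

```python
# Generic token-bucket rate limiter, one bucket per host (or LUN, or port).
# ILLUSTRATION ONLY: not the SVOS RF implementation; all names are hypothetical.

class TokenBucket:
    def __init__(self, rate_per_second: float, burst: float):
        self.rate = rate_per_second   # I/Os credited per second
        self.burst = burst            # maximum credits that can accumulate
        self.tokens = burst           # start with a full bucket
        self.last = 0.0               # time of the previous request, in seconds

    def allow(self, now: float) -> bool:
        """Return True if an I/O arriving at time `now` is within the limit."""
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Each host gets its own bucket, so one host's burst cannot drain another's.
limits = {
    "noisy-host": TokenBucket(rate_per_second=8, burst=8),
    "quiet-host": TokenBucket(rate_per_second=8, burst=8),
}

# The noisy host fires 20 I/Os at t=0; only its burst allowance is admitted.
admitted = sum(limits["noisy-host"].allow(0.0) for _ in range(20))
print("noisy-host admitted:", admitted)

# The quiet host is unaffected and its single I/O goes straight through.
print("quiet-host admitted:", limits["quiet-host"].allow(0.0))
```

Passing the clock in as `now` keeps the sketch deterministic; a real limiter would read a monotonic clock and would typically queue rather than reject the excess I/Os.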

 

From a fabric standpoint, FC-NVMe can operate over an FC fabric, so data centers could potentially use the technology they have in place by upgrading the firmware in their switches. The host bus adapters (HBAs) would need to be replaced or upgraded with new drivers, and the switches and HBAs could be upgraded to 32 Gbps to get the performance promised by NVMe. If NVMe-oF is desired, it will require RDMA implementations, which means InfiniBand, iWARP or RDMA over Converged Ethernet (RoCE). Vendors such as Mellanox offer adapter cards capable of speeds as much as 200 Gbps for both InfiniBand and Ethernet. Consideration needs to be given to the faster speeds, higher queue depths, LUN masking, QoS, etc.; otherwise congestion in the fabrics will degrade performance. More information about NVMe over fabric can be found in blogs by our partners Broadcom/Brocade and Cisco. J Metz of Cisco published a recent tutorial on Fabrics for SNIA.

 

Another consideration will be whether the current applications can keep up with the volume of I/O. When programs knew they were talking to disk storage, they could branch out and do something else while the data was accessed and transferred into their memory space. Now it may be better to just wait for the data rather than go through the overhead of branching out, waiting for interrupts and branching back.
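A simple model makes the trade-off concrete. Assume that "branching out" costs one switch away from the task and one switch back when the interrupt arrives, and compare that overhead with the device latency. All of the latency figures below are hypothetical round numbers picked for illustration, not measurements.

```python
# Toy model: wait in place for the data, or switch to other work and come back?
# All latency numbers used below are HYPOTHETICAL, for illustration only.

def completion_time_us(device_latency_us: float,
                       switch_overhead_us: float,
                       wait_in_place: bool) -> float:
    """Time from issuing a read until the program can use the data."""
    if wait_in_place:
        return device_latency_us
    # Switch away, wait for the interrupt, then switch back.
    return device_latency_us + 2 * switch_overhead_us


def overhead_share(device_latency_us: float, switch_overhead_us: float) -> float:
    """Fraction of the interrupt-driven completion time spent on switching."""
    total = completion_time_us(device_latency_us, switch_overhead_us, wait_in_place=False)
    return (2 * switch_overhead_us) / total


SWITCH_US = 5.0          # hypothetical cost of one switch
for name, device_us in (("disk", 5000.0), ("flash", 20.0)):
    share = overhead_share(device_us, SWITCH_US)
    print(f"{name}: switching is {share:.1%} of the total wait")
```

With a slow device the switching cost disappears into the noise, so doing other work is nearly free. With a fast device the same cost becomes a large share of the wait, which is why simply waiting can win. The model ignores the CPU time burned while waiting in place, which is the price of that choice.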

 

NVMe is definitely in our future. However, moving to NVMe will take careful planning on the part of the vendors and consumers. You don’t want to jump at the first implementation and find out later that you have painted yourself into a corner. Although the Hitachi Vantara VSP F series of all flash arrays does not support NVMe at this time, it compares very favorably with products which have introduced NVMe.

 

A recent Gartner Critical Capabilities for Solid State Arrays report, dated August 6, 2018, provides some answers. In terms of performance rating, the VSP F series came in third, in front of several vendors that had NVMe. This evaluation did not include the latest SVOS RF and VSP F900/F700/F370/F350 enhancements which were announced in May, because they did not make Gartner’s cutoff date for this year’s evaluation. These new enhancements featured an improved overall flash design, with 3x more IOPS, lower latency and 2.5x more capacity than previous VSP all flash systems.

 

The only two products ahead of the F series in performance are the Kaminario K2 and the Pure Storage FlashBlade, neither of which has the high reliability, scalability and enterprise data services of the VSP. In fact, the VSP F series placed the highest in RAS (reliability, availability, serviceability) of all 18 products that were evaluated. The Kaminario K2 has a proprietary NVMe-oF host connection which they call NVMeF, and the Pure Storage NVMe arrays followed us with a DirectFlash storage module instead of the standard SSD. One can assume that the performance of the Hitachi Vantara All Flash Arrays would be higher if the new models of the VSP and SVOS RF had been included in the evaluation. Here are the product scores for the High-Performance Use Case for the top three places, on a scale from 1 to 5 with 5 being the highest.

 

Product                         Score
Kaminario K2                    4.13
Pure Storage FlashBlade         4.08
Hitachi VSP F Series            4.03
Pure Storage M and X Series     4.03

 

Hitachi product management leadership has confirmed that our VSP core storage will include NVMe in 2019, and we are happy to share more roadmap details with interested customers on an NDA basis. In the meantime, I recommend that you follow the NVMe blog posts by Mark Adams and Nathan Moffit.
