Last August, I posted a blog about NVMe, an open-standards protocol for digital communications between servers and non-volatile memory storage. It replaces SCSI, a protocol designed and implemented for mechanical hard drives, which processed one command at a time. NVMe was designed for flash and whatever other non-volatile storage devices may be in our future. The command set is leaner, and its queuing model takes advantage of the parallel nature of flash drives: up to 64K separate queues, each with a queue depth of up to 64K commands.
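To make that contrast concrete, here is a toy Python sketch of why deep, parallel queues matter for a device with no seek penalty. This is illustrative only, not real driver code — the fixed latency, command count, and queue count are made-up numbers, and real NVMe queues live in the kernel, not in Python threads:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def io_command(lba):
    # Simulate one flash read: fixed latency, no head to move,
    # so commands are independent and can overlap freely.
    time.sleep(0.01)
    return lba

commands = list(range(64))

# SCSI-era model: one outstanding command at a time.
start = time.time()
serial = [io_command(c) for c in commands]
serial_t = time.time() - start

# NVMe-style model: many commands in flight across parallel queues
# (64 workers here; real NVMe allows up to 64K queues, 64K deep).
start = time.time()
with ThreadPoolExecutor(max_workers=64) as pool:
    parallel = list(pool.map(io_command, commands))
parallel_t = time.time() - start

assert serial == parallel
print(f"serial: {serial_t:.2f}s, parallel: {parallel_t:.2f}s")
```

With 64 commands at 10 ms each, the serial loop pays the full latency 64 times over, while the parallel pool finishes in roughly one command's latency — the same effect deep NVMe queues have against flash.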
There are several transports for the NVMe protocol. NVMe by itself uses PCIe (Peripheral Component Interconnect Express), the standard connection for internal devices in a computer, to carry commands and data between the host and a non-volatile storage device (SSD) over the PCIe bus. Hitachi Vantara has implemented NVMe on our hyperconverged Unified Compute Platform (UCP HC), where internal NVMe flash drives are connected directly to the servers through PCIe. While direct-attached architectures offer high performance and are easy to deploy at a small scale, data services like snapshots and replication have to be done by the host CPU, which adds overhead. If a VM has to access another node to find its data, you will need to move either the data or the application to the same node. For smaller data sets this isn't an issue, but as the workload increases, it negates some of the performance advantages of NVMe. However, you are still ahead of the game compared to SCSI devices, and UCP HC with NVMe is a great option for hyperconverged infrastructure workloads.
In my post from last year, I introduced the other transports that enable NVMe to be carried over a fabric for external attachment (NVMe over Fabrics, or NVMe-oF): NVMe-oF using Fibre Channel, and NVMe-oF using RDMA over InfiniBand, RoCE, or iWARP.
Late last year, another transport was ratified: NVMe-oF using TCP. The value proposition for TCP is that it is well understood and can use existing TCP/IP routers and switches. One of the disadvantages of TCP/IP is congestion handling. Unlike FC, where buffer credits ensure that the target can receive a packet before it is sent, the IP layer simply drops packets when the network gets congested, and it is up to TCP to ensure that no data is lost, which slows the transport down when the network is overloaded. While TCP overreacts to congestion, it doesn't fail; it just slows down. NVMe over TCP is still substantially ahead of SCSI in terms of latency, while still behind NVMe over FC and RDMA.
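For readers who want to try NVMe over TCP, Linux's nvme-cli tool can bring up a connection in a few commands. This is a hedged sketch, not a definitive recipe: the address, port, and NQN below are placeholders, and you need root privileges, a recent kernel with the nvme-tcp module, and a reachable NVMe/TCP target of your own.

```shell
# Load the NVMe/TCP initiator module (recent Linux kernels).
modprobe nvme-tcp

# Ask a target what subsystems it exports. 192.0.2.10 is a
# placeholder address; 4420 is the standard NVMe-oF port.
nvme discover -t tcp -a 192.0.2.10 -s 4420

# Connect to a discovered subsystem by its NQN (placeholder here).
nvme connect -t tcp -a 192.0.2.10 -s 4420 \
    -n nqn.2019-08.com.example:subsystem1

# The remote namespace now shows up as an ordinary block device.
nvme list
```

Once connected, the fabric-attached namespace appears alongside local NVMe drives (e.g. as /dev/nvme1n1) and can be partitioned and mounted like any other block device.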
RDMA provides direct memory access and will be the choice for high performance, but there are decisions to be made on the choice of networking protocol: InfiniBand, RoCE (RDMA over Converged Ethernet), or iWARP (Internet Wide Area RDMA Protocol).
While there is still some standardization work to be done on NVMe-oF for Fibre Channel, it will probably be the first fabric transport to gain acceptance, since it is more mature than NVMe-oF over TCP, provides flow control through buffer credits, and is a familiar network protocol for storage users. Like TCP, Fibre Channel can use existing routers and switches with relatively minor changes in software. There will likely be different network protocols depending on use case: direct-attached NVMe over PCIe for hyperconverged and software-defined storage, Fibre Channel for enterprise storage, TCP for distributed storage, and RDMA for high-performance storage. SCSI will still be the dominant interface for the next few years. However, NVMe and NVMe-oF will eventually replace traditional SCSI-based storage. I would expect the first implementations to be 50% Fibre Channel, 30% TCP, 12% PCIe, and 8% RDMA.
This could change dramatically depending on what the hyperscale vendors do. This week Amazon acquired E8 Storage, an Israeli company that has an end-to-end 2U NVMe storage system that uses NVMe-oF over TCP. TCP is a logical choice for a cloud company.