An In-Depth Conversation with Brocade
As we've discussed over the last several blogs, NVMe is much more than a communication protocol. It’s a catalyst for change. A catalyst that touches every aspect of the data path.
At Hitachi we understand that customers have to consider each of these areas, and so today we’re bringing in a heavy hitter from Brocade to cover their view of how data center network design changes – and doesn't change – with the introduction of NVMe.
The heavy hitter in this case is Curt Beckmann, principal architect for storage networking. A guy who makes me, someone who used to teach SEs how to build and debug FC SANs, feel like a total FC newbie. He’s also a humanitarian, on the board of Village Hope, Inc. Check it out.
Let’s dig in.
Nathan: Does NVMe have a big impact on data center network design?
Curt: Before I answer, we should probably be precise. NVMe is used to communicate over a local PCIe bus to a piece of flash media (see Mark’s NVMe overview blog for more). What we want to focus on is NVMe over Fabric, NVMe-oF. It’s the version of NVMe used when communicating beyond the local PCIe bus.
Nathan: Touché. With that in mind, does NVMe-oF have a big impact on network design?
Curt: It really depends on how you implement NVMe-oF. If you use a new protocol that changes how a host interacts with a target NVMe device, you may need to make changes to your network environment. If you’re encapsulating NVMe in existing storage protocols like FC, though, you may not need to change your network design at all.
Nathan: New protocols. You’re referring to RDMA based NVMe-oF protocols, right?
Curt: Yes. NVMe over Fabrics protocols that use RDMA, namely iWARP and RoCE, reduce IP network latency by talking directly to memory. For NVMe devices that can expose raw media, RDMA can bypass CPU processing on the storage controller. This allows faster, more ‘direct’ access between host and media. It does, however, require changes to the way networks are designed.
Nathan: Can you expand on this? Why would network design need to change?
Curt: Both iWARP and RoCE are based on Ethernet and IP. Ethernet was designed around the idea that data may not always reach its target, or at least not in order, so it relies on higher layer functions, traditionally TCP, to retry communications and reorder data. That’s useful over the WAN, but sub-optimal in the data center. For storage operations, it’s also the wrong strategy.
For a storage network, you need to make sure data is always flowing in order and is ‘lossless’ to avoid retries that add latency. To enable this, you have to turn on point-to-point flow control functions. Both iWARP and RoCE v2 use Explicit Congestion Notification (ECN) for this purpose. iWARP uses it natively. RoCE v2 added Congestion Notification Packets (CNP) to enable ECN to work over UDP. But:
- They aren't always ‘automatic.’ ECN has to be configured on a host. If it isn't, any unconfigured host will not play nice and can interfere with other hosts’ performance.
- They aren't always running. Flow control turns on when the network is under load. Admins need to configure exactly WHEN it turns on. If ECN kicks in too late and traffic is still increasing, you get a ‘pause’ on the network and latency goes up for all hosts.
- They aren't precise. I could spend pages on flow control, but to keep things short, you should be aware that Fibre Channel enables a sender to know precisely how much buffer space remains before it needs to stop. Ethernet struggles here.
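To make the last point concrete, here is a minimal Python sketch of credit-based flow control in the spirit of Fibre Channel's buffer-to-buffer credits. It's a toy model under stated assumptions, not how any switch or HBA actually implements it: the `CreditedLink` class and its method names are hypothetical, and the receiver is assumed to advertise its buffer count once, up front.

```python
from collections import deque


class CreditedLink:
    """Toy model of a sender/receiver pair using credit-based flow control."""

    def __init__(self, receiver_buffers):
        # Assumption: the receiver advertises its buffer count once, up front.
        self.credits = receiver_buffers
        self.receiver_queue = deque()

    def send(self, frame):
        """Send one frame if a credit is available; otherwise hold it back."""
        if self.credits == 0:
            # The sender knows the receiver is full, so nothing is dropped.
            return False
        self.credits -= 1
        self.receiver_queue.append(frame)
        return True

    def receiver_drain(self):
        """Receiver frees one buffer and hands a credit back to the sender."""
        frame = self.receiver_queue.popleft()
        self.credits += 1  # stands in for the receiver's "ready" signal
        return frame


link = CreditedLink(receiver_buffers=2)
print([link.send(f"frame-{i}") for i in range(3)])  # [True, True, False]
link.receiver_drain()
print(link.send("frame-2"))  # True
```

The sender never has to guess: it stops at exactly zero credits and resumes as soon as one comes back, which is the precision a threshold-based scheme lacks.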
There are protocol specific considerations too. For instance, TCP-based protocols like iWARP start slow when communication paths are new or have been idle, and build to max performance. That adds latency any time communication is bursty.
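A rough way to see the cost of that slow start is to count round trips. The sketch below assumes idealized, textbook behaviour: no loss, no receiver window limit, and a congestion window that doubles every round trip. The function name and the default initial window of 10 segments are illustrative assumptions, not values taken from any iWARP implementation.

```python
def round_trips_to_send(total_segments, initial_window=10):
    """Count round trips needed to send `total_segments` under idealized slow start.

    Assumes no loss and a congestion window that doubles every round trip.
    """
    if total_segments <= 0:
        return 0
    window = initial_window
    sent = 0
    round_trips = 0
    while sent < total_segments:
        sent += window      # send a full window this round trip
        window *= 2         # slow start: the window doubles each round trip
        round_trips += 1
    return round_trips


# A cold (new or idle) path versus one whose window is already wide open.
print(round_trips_to_send(100))                      # 4
print(round_trips_to_send(100, initial_window=100))  # 1
```

Under these assumptions the same 100-segment burst costs four round trips on a cold path and one on a warm path, which is the extra latency bursty traffic keeps paying.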
Nathan: So if I net it out, is it fair to say that combining Ethernet and NVMe is pretty complex today?
Curt: (Smiles). There’s definitely a level of expertise needed. This isn't as simple as just hooking up some cables to existing switches. And since we have multiple RDMA standards which are still evolving (Azure is using a custom RoCE build, call it RoCE v3), admins will need to stay sharp. Which raises a point I forgot to mention. These new protocols require custom equipment.
Nathan: You can’t deploy iWARP or RoCE protocols onto installed systems?
Curt: Not without a NIC upgrade. You need something called an R-NIC. There are a few vendors that have them, but they aren’t fully qualified with every switch in production.
That’s why you are starting to hear about NVMe over TCP. It’s a separate NVMe protocol similar to iSCSI that runs on existing NICs so you don’t need new hardware. It isn't as fast, but it is interoperable with everything. You just need to worry about the network design complexities. You may see it ultimately eclipse RDMA protocols and be the NVMe Ethernet protocol of choice.
Nathan: But what if I don’t care, Curt? What if I have the expertise to configure flow control, plan hops / buffer management so I don’t hit a network pause? What if R-NICs are fine by me? If I have a top-notch networking team, is NVMe over Fabric with RDMA faster?
Curt: What you can say is that for Ethernet / IP networks, RDMA is faster than no RDMA. In a data center, most of your latency comes from the host stack (virtualization can change the amount of latency here) and a bit from the target storage stack (see Figure 1). That is why application vendors are designing their applications to use a local cache for data that needs the lowest latency. No wire, lower latency. With hard disks, network latency was tiny compared to the disk, and array caching and spindle count could mask the latency of software features. This meant that you could use an array instead of internal drives. Flash is a game changer in this dynamic, because now the performance difference between internal and external flash is significant. Most latency continues to be from software features, which has prompted the move from the sluggish SCSI stack to faster NVMe.
Figure 1: Where Latency Comes From
I've seen claims that RoCE can do small IOs, like 512 bytes, at maybe 1 or 2 microseconds less latency than NVMe over Fibre Channel when the queue depth is set to 1 or some other configuration not used in normal storage implementations. We have not been able to repeat these benchmarks, but this is the nature of comparing benchmarks. We were able to come very close to quoted RoCE numbers for larger IO, like 4K. At those sizes and larger, the winner is the one with faster wire speed. This is where buyers have to be very careful. A comparison of 25G Ethernet to 16G FC is inappropriate. Ditto for 32G FC versus 40G Ethernet. A better comparison is 25G Ethernet to 32G FC, but even here check the numbers and the costs.
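As a back-of-the-envelope check on why mismatched link speeds skew a comparison, the sketch below estimates how long a single IO spends on the wire. The data rates are rounded nominal figures assumed for illustration, not measured throughput, and the calculation ignores protocol headers, switch hops, and everything in the host and target stacks.

```python
# Nominal one-direction data rates in megabytes per second.
# Assumption: rounded figures for illustration only, not measured throughput.
NOMINAL_MB_PER_S = {
    "16G FC": 1600,
    "25G Ethernet": 3125,  # 25 Gb/s divided by 8 bits per byte
    "32G FC": 3200,
    "40G Ethernet": 5000,  # 40 Gb/s divided by 8 bits per byte
}


def serialization_time_us(io_bytes, mb_per_s):
    """Microseconds needed to put `io_bytes` on the wire at `mb_per_s`.

    Bytes divided by (megabytes per second) comes out directly in microseconds.
    """
    return io_bytes / mb_per_s


for link_name, rate in NOMINAL_MB_PER_S.items():
    wire_time = serialization_time_us(4096, rate)
    print(f"{link_name:>13}: {wire_time:.2f} us on the wire for a 4 KiB IO")
```

With these assumed rates, 25G Ethernet and 32G FC land within a few hundredths of a microsecond of each other for a 4 KiB IO, while 16G FC takes roughly twice as long, so pairing it against 25G Ethernet mostly measures the wire rather than the protocol.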
Nathan: Any closing thoughts?
Curt: One we didn't really cover is ease of deployment alongside existing systems. For instance, what if you want to use a single storage infrastructure to support NVMe-oF enabled hosts and ‘classic’ hosts that are using existing, SCSI-based protocols? With FC you can do that. You can use existing Gen 5 and Gen 6 switches and have servers that support multiple storage interface types. With Ethernet? Not so much. You need new NICs and quite possibly new switches too. Depending on who you speak with, DCB switches are either recommended, if you want decent performance, or required. I recommend you investigate.
CLOSING THOUGHTS (NATHAN’S THOUGHTS)
Every vendor has their own take on things, but I think Curt’s commentary brings to light some very interesting considerations when it comes to NVMe.
- Ecosystem readiness – With FC (and maybe future Ethernet protocols), NVMe may require minimal to no changes in your network resources (granted, a network speed upgrade may be advised). But with RDMA, components change, so check on implementation details and interop. Make sure the equipment cost of changing to a new protocol isn't higher than you expect.
- Standard readiness – Much like any new technology, standards are evolving. FC is looking to make the upgrade transparent and there may even be similar Ethernet protocols coming. If you use RDMA, great. Just be aware you may not be future proofed. That can increase operational costs and result in upgrades sooner than you think.
- Networking expertise – With Ethernet, you may need to be more thoughtful about congestion and flow control design. This may mean reducing the maximum load on components of the network to prevent latency spikes. It can absolutely be done; you just need to be aware that NVMe over Fabric with RDMA may increase operational complexity, which could result in lower than expected performance / ROI. To be clear though, my Ethernet friends may have a different view. We’ll have to discuss that with them.
Other than that, I’ll tell you what I told myself when I was a systems administrator. Do your homework. Examine what is out there and ask vendors for details on implementations. If you buy a storage solution that is built around what is available today, you may be designing for a future upgrade versus designing for the future. Beware vendors that say ‘future proof.’ That’s 100% pure marketing spin.