An In-Depth Conversation with Cisco
For those of you who haven’t been following along in real time – c’mon man! – we’re in the midst of a multi-part blog series on NVMe and data center design (list of blogs below).
That’s right, data center design. Because NVMe affects more than just storage design. It influences every aspect of how you design an application data path. At least it should if you want maximum return on NVMe investments.
In our last blog we got the Brocade view on how NVMe affects network design. As you might imagine, that conversation was very Fibre Channel-centric. Today we’re looking at the same concept – network design – but bringing in a powerhouse from Cisco.
If you've been in the industry for a while, you've probably heard of him: J Michel Metz. Dr. Metz is an R&D Engineer for Advanced Storage at Cisco and sits on the Board of Directors for SNIA, FCIA and NVM Express. So… yeah. He knows a little something about the industry. In fact, check out the post on his site called “Storage Forces” for some background on today’s discussion. And if you think you know what he’s going to say, think again.
Ok. Let’s dig in.
Nathan: Does NVMe have a big impact on data center network design?
J: Absolutely. In fact, I could argue the networking guys have some of the heaviest intellectual lift when it comes to NVMe. With hard disks, tweaking the network design wasn't nearly as critical. Storage arrays, as fast as they are, were slow enough that you could just send data across the wire and things were fine.
Flash changed things, reducing latency and making network design more critical, but NVMe takes it to another level. The more storage bits we can put on the wire, the more network design matters. You need to treat it almost like weather forecasting: monitoring and adjusting as patterns change. You can’t just treat storage as “data on a stick,” a repository at the end of the wire that you only have to worry about accessing.
Nathan: So how does that influence the way companies design networks and implement storage?
J: To explain I need to start with a discussion of how NVMe communications work. This may sound like a bizarre metaphor, but bear with me.
Think of it like ordering food in a ’50s diner. A waitress takes an order, puts the order ticket on the kitchen counter and rings a bell. The cook grabs the ticket, cooks the food, puts the finished order back on the counter and rings the bell. The waitress then grabs the order and takes it to the customer. The process is efficient and allows for parallel work queues (multiple wait staff and cooks).
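The diner maps loosely onto NVMe’s paired queues: the order ticket is a command placed in a submission queue, the waitress’s bell is the doorbell write that tells the controller new work is waiting, and the finished plate is an entry in a completion queue (the cook’s bell is roughly an interrupt). The toy Python model below is a sketch under those assumptions, not the NVMe specification’s data structures or any real driver API. `QueuePair`, `controller_service` and `reap_completions` are hypothetical names, the doorbell is just a counter, and a real controller fetches commands asynchronously rather than inside a function call.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Command:
    cid: int      # command identifier, echoed back in the completion entry
    opcode: str   # "read" or "write" (real NVMe uses numeric opcodes)
    lba: int      # starting logical block address


@dataclass
class Completion:
    cid: int      # which command this completion belongs to
    status: int   # 0 means success in this toy model


class QueuePair:
    """Toy model of one submission/completion queue pair (hypothetical)."""

    def __init__(self, depth: int) -> None:
        self.depth = depth
        self.sq = deque()         # order tickets waiting on the counter
        self.cq = deque()         # finished plates waiting for pickup
        self.doorbell_rings = 0   # times the host "rang the bell"

    def submit(self, cmd: Command) -> None:
        # Host (the waitress) places a ticket on the counter.
        if len(self.sq) >= self.depth:
            raise RuntimeError("submission queue full")
        self.sq.append(cmd)

    def ring_doorbell(self) -> None:
        # Host tells the controller that new work is waiting.
        self.doorbell_rings += 1


def controller_service(qp: QueuePair) -> None:
    # Controller (the cook) drains the submission queue and posts completions.
    while qp.sq:
        cmd = qp.sq.popleft()
        qp.cq.append(Completion(cid=cmd.cid, status=0))


def reap_completions(qp: QueuePair) -> list:
    # Host picks up finished work and returns the completed command IDs.
    done = []
    while qp.cq:
        done.append(qp.cq.popleft().cid)
    return done


# One queue pair per core lets several "waitresses" work in parallel.
pairs = [QueuePair(depth=4) for _ in range(2)]
for core, qp in enumerate(pairs):
    for i in range(3):
        qp.submit(Command(cid=core * 10 + i, opcode="read", lba=i * 8))
    qp.ring_doorbell()
for qp in pairs:
    controller_service(qp)
results = [reap_completions(qp) for qp in pairs]
print(results)  # [[0, 1, 2], [10, 11, 12]]
```

Giving each core its own queue pair is what lets many “waitresses” and “cooks” work without waiting on one shared counter.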
Now imagine the customers (our applications) are a mile away from the kitchen (our storage). You can absolutely have the waitress or the cook cross that distance, but it isn't very efficient. You can cut the time it takes to send orders by using a pneumatic tube to pass them to the kitchen, but someone ultimately has to walk the food back. That adds delays. The same is true of NVMe: you can optimize NVMe for transfer over a network, but you’re still dealing with the physics of moving data across it.
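To put rough numbers on the “physics” point, the sketch below estimates a round trip as propagation delay in both directions plus an optional per-switch delay. It assumes signals travel through optical fiber at roughly 200 meters per microsecond (about two-thirds of the speed of light) and ignores serialization, queuing and protocol processing. `round_trip_us` and its `per_hop_us` parameter are hypothetical placeholders, not figures for any particular product.

```python
# Assumed signal speed in optical fiber: roughly two-thirds of the speed of
# light, i.e. about 200 meters per microsecond (about 5 microseconds per km).
FIBER_M_PER_US = 200.0


def round_trip_us(distance_m: float, switch_hops: int = 0,
                  per_hop_us: float = 0.0) -> float:
    """Round-trip time in microseconds for one request/response exchange.

    Counts propagation in both directions plus a forwarding delay for each
    switch hop in each direction. per_hop_us is a placeholder input; real
    switch latency varies by product and configuration.
    """
    propagation_us = 2 * distance_m / FIBER_M_PER_US
    switching_us = 2 * switch_hops * per_hop_us
    return propagation_us + switching_us


print(round_trip_us(1000))                                 # 10.0
print(round_trip_us(1000, switch_hops=3, per_hop_us=1.0))  # 16.0
```

No amount of protocol efficiency removes the propagation term; only shortening the distance does, which is the argument for moving hosts closer to storage.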
At this stage you might stop and say, ‘hey, at least our process is a lot more efficient and allows for parallelism.’ That could leave you with a solid NVMe over Fabrics design. But for maximum speed, what you really want is to co-locate the customers and the kitchen: you want your hosts as close to the storage as possible. At that point, it’s the trade-offs that matter. Sometimes you want the customers in the kitchen. That’s what hyper-convergence is, though obviously it can only grow so large. Sometimes you want a centralized kitchen and many dining rooms. That’s what you can achieve with rack-scale solutions that put an NVMe capacity layer close to the applications, at the ‘top of rack.’ And so on.
Nathan: It sounds like you’re advocating a move away from traditional storage array architectures.
J: I want to be careful, because this isn’t an ‘or’ discussion; it’s an ‘and’ discussion. HCIS solves a management problem. It’s for customers who want a compute solution with a pretty interface and freedom from storage administration. HCIS may not have nearly the same scalability as an external array, but it does allow application administrators to spin up VMs quickly and easily.
As we know, though, there are customers that need scale: scale in capacity, scale in performance and scale in the number of workloads they need to host. For these customers, HCIS isn’t going to fit the bill. Customers that need scale, across any vector, will trade some management simplicity for the enterprise growth an external array provides.
This also applies to networking protocols. We choose protocols like iWARP for simplicity and addressability: you choose the address and then let the network determine the best way to get data from point A to point B. But there is a performance trade-off.
Nathan: That’s an excellent point. At no point have we ever seen IT coalesce into a single architecture or protocol. If a customer needs storage scale with a high-speed network what would you recommend?
J: Haven’t you heard that every storage question is answered with, “It depends?”
Joking aside, it’s never as simple as figuring out the best connectivity options. All storage networks can be examined “horizontally,” which is the phrase I use to describe the connectivity and topology from a host through a network to the storage device. Any storage network can be described this way, which makes it easy to throw metrics and hero numbers at the problem: What are the IOPS? What is the latency? What is the maximum number of nodes?
What that question misses, however, is whether there is a mismatch between the overall storage needs (e.g., general-purpose network, dedicated storage network, ultra-high performance, massive scale, ultra-low latency) and the “sweet spot” of what a storage system can provide.
There is a reason why Fibre Channel is the gold standard for dedicated storage networks. Not only is it a well-understood technology, it’s very, very good, not just in performance but in reliability. But some people have other considerations to pay attention to. Perhaps their workloads don’t need a dedicated storage network. Perhaps “good enough” is, well, “good enough.” These customers are perfectly fine with really great performance from Ethernet to the top of rack, and they don’t need the kind of high availability and resiliency that a Fibre Channel network, for instance, is designed to provide.
Still others are looking more for accessibility and management, and for them the administrative user interface is the most important thing. They can tolerate performance hits because management matters more. Perhaps they run only a limited number of virtual machines, so HCIS with high-speed Ethernet interconnects is perfect.
As a general rule, “all things being equal” never holds; things are never actually equal. There’s no shortcut for good storage network design.
Nathan: Let’s look forward now. How does NVMe affect long term network and data center design?
J: <Pause> OK, for this one I’m very pointedly giving my own personal opinion. I think time is something we’ve been able to ignore for quite a while because storage was slow. With NVMe and flash, though, time IS a factor, and it is forcing us to reconsider overall storage design, which ultimately affects network design.
Here is what I mean. Every IO is processed by a CPU. The CPU receives a request (a write, for example), passes it on and then goes off to do something else. That process was fine when IO was slow enough, because the CPU could go off and do any number of other tasks in the meantime. But now IO can complete so fast that the CPU cannot switch between tasks before the IO response is received. The end result is that a CPU can be completely saturated by a few NVMe drives.
Now, this is a worst-case scenario and should be taken with a grain of salt. Obviously, other processes also affect IO and CPU utilization. But the basic premise is that emerging technologies threaten to overwhelm both the CPU and the network. The key takeaway is that we cannot simply replace spinning disks or flash drives with NVMe and expect all boats to rise.
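The CPU argument can be framed as simple arithmetic. If every IO costs a fixed slice of CPU time, total IOPS multiplied by that cost gives the number of cores kept busy just handling IO; and handing the CPU to another task only helps if the IO takes longer than switching away and back. The functions and inputs below are hypothetical, chosen for round numbers rather than taken from benchmarks, and the fixed-cost model ignores interrupt coalescing, polling and other real-world effects.

```python
def cores_consumed(drive_count: int, iops_per_drive: float,
                   cpu_us_per_io: float) -> float:
    """Number of CPU cores kept fully busy just handling IO.

    Assumes every IO costs a fixed cpu_us_per_io of CPU time and that the
    drives actually run at iops_per_drive; both are inputs, not measurements.
    """
    total_iops = drive_count * iops_per_drive
    # CPU-seconds needed per wall-clock second equals cores fully occupied.
    return total_iops * cpu_us_per_io / 1_000_000


def switching_pays_off(io_latency_us: float, context_switch_us: float) -> bool:
    """Going off to do other work only helps if the IO outlasts the cost of
    switching away and switching back (two context switches)."""
    return io_latency_us > 2 * context_switch_us


# Hypothetical inputs chosen for round numbers, not benchmark results.
print(cores_consumed(drive_count=4, iops_per_drive=500_000, cpu_us_per_io=5.0))  # 10.0
print(switching_pays_off(io_latency_us=5_000.0, context_switch_us=5.0))  # True
print(switching_pays_off(io_latency_us=8.0, context_switch_us=5.0))      # False
```

With a slow device there is ample time to switch tasks; when the device answers in roughly the time a switch costs, the CPU gains nothing by leaving, which is the shift J describes.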
In my mind, this means we need more intelligence in the storage layer. Storage systems, whether external arrays or hyperconverged infrastructures, will ultimately be able to say no to requests and ask other storage systems for help. Like an organic being, they’ll work together to coordinate and decide who handles which tasks.
Yes, some of this happens as a result of general machine learning advancements, but it will be accelerated because of technologies like NVMe that force us to rethink our notion of time. This may take a number of years to happen, but it will happen.
Nathan: If storage moves down this path, what happens to the network?
J: Well, you still have a network connecting storage and compute, but it, too, is more intelligent. The network understands its primary objectives and how to prioritize traffic. It also knows how to negotiate with the storage and the application to determine the best path for moving data back and forth. In effect, the network, storage and application act as equal peers in deciding on the best route.
You can also see a future where storage tells the network what it can and can’t do at any given time. The network could then use this information to pick the best storage device for a request based on SLA considerations. To be fair, this model puts the network in a ‘service broker’ position that some vendors may not be comfortable with. But because the network is the common element that brings storage and servers together, it gives us an opportunity to establish the best end-to-end route.
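One way to picture the broker model J describes: each storage device advertises its current headroom, and the network picks a device that meets a request’s SLA or declines the request. This is a hypothetical illustration of the idea, not an existing protocol or product. `DeviceAdvert`, `pick_device`, the device names and all numbers are invented for the example, and “lowest latency wins” is just one possible policy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DeviceAdvert:
    """What a storage device tells the network it can do right now (hypothetical)."""
    name: str
    free_iops: int      # spare IOPS the device is willing to accept
    latency_us: float   # latency the device currently reports


def pick_device(adverts: List[DeviceAdvert], need_iops: int,
                max_latency_us: float) -> Optional[str]:
    """Broker logic: choose the lowest-latency device that satisfies the SLA,
    or return None so the request can be declined or sent elsewhere."""
    eligible = [a for a in adverts
                if a.free_iops >= need_iops and a.latency_us <= max_latency_us]
    if not eligible:
        return None
    return min(eligible, key=lambda a: a.latency_us).name


# Hypothetical devices and numbers, for illustration only.
adverts = [
    DeviceAdvert("array-a", free_iops=50_000, latency_us=120.0),
    DeviceAdvert("tor-nvme-1", free_iops=200_000, latency_us=40.0),
    DeviceAdvert("tor-nvme-2", free_iops=10_000, latency_us=30.0),
]
print(pick_device(adverts, need_iops=100_000, max_latency_us=100.0))  # tor-nvme-1
print(pick_device(adverts, need_iops=300_000, max_latency_us=100.0))  # None
```

Returning `None` models a request that no device can currently honor, the point at which storage systems would “say no” and the broker would look for help elsewhere.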
In a lot of ways, I see end-to-end systems coming together in a fashion similar to Conway’s Game of Life. What you’ll see is data self-organizing based on priorities that matter to the whole system: the application, the server, the network and the storage. In effect, you’ll have autopoiesis, a self-adaptive system.
I should note that what I’m referring to here are really, really large systems of storage, not necessarily smaller host-to-storage-array products. There are a lot of stars that need to align before we can see something like this as a reality. Again, this is my personal view.
Nathan: I can definitely see why you called this out as your opinion. You’re looking pretty far into the future. What if we pull back to the next 18 to 24 months? How do NVMe fabrics play out?
Nathan: I know. I’m constraining you. Sorry about that.
J: <Laughs> In the near term we’re going to see a lot of battles. That’s to be expected because the standards for NVMe over Fabrics (NVMe-oF) are still relatively new.
Some vendors are taking shortcuts and building easy-to-use proprietary solutions. That gives them a head start and improves traction with customers and mind share, but it doesn't guarantee a long-term advantage. DSSD proved that.
The upside is that these solutions can help the rest of the industry identify interesting ways to implement NVMe-oF and improve the NVMe-oF standard. That will help make standards-based solutions stronger in the long run. The downside is that companies implementing early standards may feel some pain.
Nathan: So, to close this out, and maybe lead the witness a bit: is the safest way to implement NVMe today to implement it in an HCI solution and wait for the NVMe-oF standards to mature?
J: Yeah. I think that is fair to say, especially if there is a need to address manageability challenges. HCIS absolutely helps there. For customers that do need to implement NVMe over Fabrics today, Fibre Channel is probably the easiest way to do that. But don’t expect FC to be the only team on the ball field, long term.
If I go back to my earlier point, different technologies are optimized for different needs. FC is a deterministic storage network, and it’s great for that. Ethernet-based approaches can be good for simplicity of management, but it’s never a strict “either-or” when looking at the different options.
I expect Ethernet-based NVMe-oF to be used in smaller deployments to begin with: single-switch environments, rack-scale architectures, or standalone servers with wicked-fast NVMe drives connected across the network via a Software Defined Storage abstraction layer. We are already seeing some hyperconvergence vendors flirt with NVMe and NVMe-oF as well. So small deployments will likely be the first forays into NVMe-oF over Ethernet, and larger deployments will probably gravitate toward Fibre Channel, at least for the foreseeable future.
CLOSING THOUGHTS <NATHAN’S THOUGHTS>
I couldn’t agree more. In my mind, NVMe can and should serve as a tipping point that forces us vendors to rethink our approach to storage and how devices in the data path interoperate.
This applies to everything from the hardware architecture of storage arrays, to how, when and where data services are implemented, to the way devices communicate. I have some thoughts around digital force feedback, where IT infrastructure resists a proposed change and responds with a more optimal configuration in real time. (Imagine pushing a capacity allocation to an array from your mobile phone, feeling the pressure of it resisting, and then seeing green lights over more optimal locations along with details on why the change is proposed.) But that is a blog for a day when I have time to draw pictures.
The net is that, as architects, administrators and vendors, we should view NVMe as an opportunity for change and consider what we keep and what we change over time. As J points out, NVMe-oF is still maturing, and so are the solutions that leverage it. So, to you, dear reader:
- NVMe on HCI (hyper-converged infrastructure) is a great place to start today.
- External storage with NVMe can be implemented, but beware anyone who says their architecture is future-proof or optimized to take full advantage of NVMe (J’s comment on overloading CPUs is a perfect example of why).
- Think beyond the box. Invest in an analytics package that looks at the entire data path and lets you understand where bottlenecks exist.