Introduction
Data availability ensures reliable access to information and is a critical requirement of any storage provider. The concept covers the infrastructure, systems, processes, and policies that organizations use to keep data accessible and usable for authorized users. As data volume and complexity grow, organizations allocate more resources to maintain reliability. However, in an NVMe/TCP storage environment, resource scaling has practical limits.
This blog describes a scenario in which multiple NVMe namespaces are allocated to an ESXi host through multiple NVMe subsystems and controllers. Such a configuration can lead to resource exhaustion during error recovery, resulting in system outages and failed recovery attempts.
Terms to Know
- NVMe (Non-Volatile Memory Express): A modern storage access and transport protocol designed for flash and next-generation solid-state drives (SSDs), offering high throughput and low response times for diverse enterprise workloads. To deliver this high-bandwidth, low-latency experience, the NVMe protocol leverages a PCI Express (PCIe) bus, which supports tens of thousands of parallel command queues. This architecture enables significantly faster performance than hard disks and traditional all-flash systems, which are constrained to a single command queue.
- NVMe over Fabrics (NVMe-oF): Extends the performance and low-latency advantages of NVMe across network fabrics, such as Ethernet, Fibre Channel, and InfiniBand.
- NVMe/TCP: Implements NVMe-oF over Ethernet by encapsulating NVMe commands and data within TCP segments. Compared with iSCSI, NVMe/TCP supports more queues and data transport paths, increasing throughput and reducing latency. It can be deployed over any TCP network, offering a simpler and more cost-effective setup.
The following diagram shows an NVMe/TCP network:
Figure 1: NVMe/TCP network
- NVMe Subsystem: In a Hitachi storage system, administrators configure an NVMe subsystem much as they would configure LUN masking or host groups in a Fibre Channel environment. This setup allows NVMe LUNs/namespaces to be assigned to NVMe/TCP hosts. Administrators use CCI (Command Control Interface) commands to define a host mode specific to the host operating system. Each NVMe subsystem includes a unique identifier (NVMSS_ID), a name (NVMSS_NAME), and an NVMe Qualified Name (NVMSS_NQN). Additionally, administrators register the NVMe storage ports and the NQNs of the hosts that access the NVMe subsystem.
- NVMe Namespace: The namespace functions as a mapping layer between the storage system and the host, facilitating seamless interaction with NVMe resources. When creating a namespace, administrators assign a pre-created logical volume on the storage system as the namespace. This logical volume serves as a storage segment that the NVMe subsystem presents to the host for read and write operations. A sketch of this provisioning flow using CCI commands follows these definitions.
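The following is a minimal sketch of the CCI (raidcom) provisioning flow described above. It is illustrative only: the subsystem ID, name, host mode value, port, LDEV ID, and NQN shown here are placeholders, and the exact command options should be verified against the CCI reference for the specific storage model.

    # Create an NVMe subsystem (IDs, names, and the host mode value are placeholders)
    raidcom add nvm_subsystem -nvm_subsystem_id 1 -nvm_subsystem_name ESXI_NVMSS_01 -host_mode VMWARE_EX

    # Register the NVMe/TCP storage port and the host NQN allowed to access the subsystem
    raidcom add nvm_subsystem_port -nvm_subsystem_id 1 -port CL1-A
    raidcom add host_nqn -nvm_subsystem_id 1 -host_nqn nqn.2014-08.org.nvmexpress:uuid:esxi-host-01

    # Create a namespace from an existing logical volume (LDEV) and map it to the host NQN
    raidcom add namespace -nvm_subsystem_id 1 -ldev_id 100
    raidcom add namespace_path -nvm_subsystem_id 1 -namespace_id 1 -host_nqn nqn.2014-08.org.nvmexpress:uuid:esxi-host-01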
The following diagram shows the workflow of NVMe/TCP, where the Host NIC communicates with the NVM subsystem to access the namespaces created from the Hitachi Virtual Storage Platform One Block 20 (VSP One B20) storage system:
Figure 2: NVMe/TCP workflow
- NVMe Controllers: A controller is linked to one or more NVMe namespaces and provides a path for the ESXi host to access those namespaces within the storage system. After NVMe targets are assigned to the host, the ESXi host adds the controllers automatically upon reboot, or an administrator adds them manually using the target port details (for NVMe/TCP, the target IP address and subsystem NQN; for NVMe over Fibre Channel, the WWPN).
To access controllers, the host can use either of the following mechanisms (example esxcli commands follow this list):
o Controller Discovery: The ESXi host retrieves a list of available controllers from a discovery controller. Selecting a controller provides access to its namespaces.
o Controller Connection: The ESXi host connects directly to a specified controller, providing access to its linked namespaces.
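As a sketch of both mechanisms from the ESXi side, the commands below use the esxcli nvme fabrics namespace available in recent ESXi releases. The adapter name, target IP address, TCP ports, and subsystem NQN are placeholders for this environment; verify the exact options for your ESXi version.

    # List the NVMe over TCP adapters (vmhba names) configured on the host
    esxcli nvme adapter list

    # Controller discovery: query the discovery controller at the target (8009 is the default discovery port)
    esxcli nvme fabrics discover -a vmhba65 -i 192.168.10.20 -p 8009

    # Controller connection: connect directly to a subsystem by NQN (4420 is the default NVMe/TCP IO port)
    esxcli nvme fabrics connect -a vmhba65 -i 192.168.10.20 -p 4420 -s nqn.2014-08.com.example:nvme:subsystem-01

    # Verify the controllers and namespaces now visible to the host
    esxcli nvme controller list
    esxcli nvme namespace list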
Scenario
An ESXi host can access multiple NVMe namespaces through one or more NVMe subsystems. When a large number of namespaces from a Hitachi storage system must be presented to an ESXi host across multiple initiator (NVMe adapter) and target ports, the overall count of NVMe controllers grows quickly, placing significant strain on host resources. NVMe controllers are categorized as either discovery controllers or IO controllers, and each IO controller requires four IO queues and one admin queue, for a total of five queues per controller.
In this scenario, an ESXi host is configured with two NVMe/TCP adapters, and each NVMe namespace is assigned through its own NVMe subsystem. This setup functions efficiently while the number of namespaces remains within a manageable range. However, as the number of namespaces increases, the number of NVMe subsystems and controllers increases with it, resulting in a proportional rise in resource consumption. With at least one IO controller per namespace, this growth strains critical host resources, specifically CPU resources, because each additional controller adds to the resource load.
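As a rough, back-of-the-envelope illustration (an assumption for sizing purposes, not a vendor-published formula), suppose each namespace sits behind its own subsystem and each initiator/target path to a subsystem creates one IO controller. The controller and queue counts then scale as follows:

    # Illustrative scaling math only; actual counts depend on the pathing and configuration
    initiator_ports=2     # NVMe/TCP adapters on the ESXi host
    target_ports=2        # storage ports presenting the subsystems
    namespaces=64         # one dedicated subsystem per namespace in this scenario

    controllers=$((initiator_ports * target_ports * namespaces))
    queues=$((controllers * 5))          # 4 IO queues + 1 admin queue per IO controller

    echo "IO controllers: $controllers"  # 256
    echo "Queues:         $queues"       # 1280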
During an environmental outage, such as when target ports go offline and multiple controllers transition to an offline state, the ESXi NVMe core layer initiates an error recovery process. The maximum number of helper requests per helper queue is constrained by the available CPU cores in the system, which limits how many helper requests VMware can manage within the kernel. This limitation can lead to resource exhaustion, preventing the error recovery process from completing successfully.
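Because the helper-queue capacity is bounded by the core count, it is useful to note how many cores the host actually has when sizing such a configuration. For example, a standard esxcli query (output fields may vary by release):

    # Show the physical CPU layout of the ESXi host (packages, cores, threads)
    esxcli hardware cpu global get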
This scenario is documented in the Broadcom Knowledge Base article available at the following URL: https://knowledge.broadcom.com/external/article/378512
Resolution
To reduce the strain on system resources, multiple NVMe namespaces can be assigned to an ESXi host through a single NVMe subsystem. Consolidating namespaces under one subsystem reduces the total number of NVMe controllers required by the host, optimizing resource usage, particularly CPU allocation. This approach minimizes the overhead of managing multiple controllers, which becomes increasingly taxing on host resources as the number of namespaces grows.
Hitachi NVMe/TCP storage systems support exporting multiple namespaces to an ESXi host through a single NVMe subsystem. Following VMware guidelines, it is recommended to limit the number of subsystems connected to an ESXi host to a maximum of eight, regardless of the number of namespaces the host needs to access.
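Continuing the earlier back-of-the-envelope example, consolidating all 64 namespaces behind a single subsystem over the same two initiator and two target ports would reduce the IO controller count from 256 to 4, and the queue count from 1,280 to 20. The reduced counts can then be confirmed from the ESXi host:

    # Same illustrative pathing as before, but one subsystem for all namespaces
    initiator_ports=2; target_ports=2; subsystems=1
    controllers=$((initiator_ports * target_ports * subsystems))
    echo "IO controllers: $controllers, queues: $((controllers * 5))"   # 4 controllers, 20 queues

    # On the host, confirm the controller count dropped while all namespaces remain visible
    esxcli nvme controller list
    esxcli nvme namespace list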
Conclusion
An ESXi system outage can occur when a large number of NVMe namespaces are presented to the ESXi host in a suboptimal configuration. When a target port goes offline, multiple controllers can enter recovery mode simultaneously, consuming extensive system resources and potentially preventing some controllers from recovering. This issue can be avoided by presenting the same number of namespaces through fewer NVMe subsystems, which reduces the controller count and the associated resource consumption.