Introduction
This blog describes the FC-NVMe High-Performance Plug-in (HPP) path auto-discovery delay observed in VMware ESXi 8.0 Update 1, a General Availability (GA) release, during link-up events. In FC-NVMe configurations, a delay is observed in path recovery following a namespace path failure.
Block Diagram
The following figure shows the FC-NVMe block diagram:
Figure: FC-NVMe block diagram
What is HPP?
The High-Performance Plug-in (HPP) is VMware's multipathing software for storage devices on ESXi hosts. The default multipathing plug-in on an ESXi host is the Native Multipathing Plug-in (NMP); the HPP replaces NMP for high-speed devices such as NVMe.
Starting with vSphere 7.0 Update 2, HPP became the default plug-in for local NVMe devices; it is also the default plug-in that claims NVMe-oF targets, improving performance and reliability by streamlining I/O path management.
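To see which devices and paths HPP has claimed on a host, the esxcli storage hpp namespace can be queried directly from the ESXi shell. A minimal sketch (the output will vary by environment):

# List NVMe devices claimed by HPP
esxcli storage hpp device list
# List the paths HPP manages and their current states
esxcli storage hpp path list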
Configurations
The following lists the hardware requirements:
· Host System: ProLiant DL380 Gen10
· Host HBA: Emulex LPe35002
The following lists the software requirements (commands to verify these versions on the host follow the list):
· OS install media: VMware ESXi 8.0.1 build-21495797
· Host HBA Driver version: v14.2.560.8
· Host HBA Firmware version: v14.2.455.11
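As a quick sketch, the installed driver VIB and loaded module version can be confirmed from the ESXi shell (the exact VIB name for the Emulex driver may vary by image, so the grep pattern below is an assumption):

# Show the installed Emulex FC driver VIB, if present
esxcli software vib list | grep -i lpfc
# Show details of the loaded lpfc module, including its version
esxcli system module get -m lpfc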
Observing the issue
During a path failure between the HBA and a Hitachi Virtual Storage Platform (VSP) port, it was observed that after the cable was reconnected, the paths to the storage devices remained in a dead state on the ESXi host and were restored only after a considerable delay. This behavior is unusual for NVMe devices, whose paths typically recover almost immediately.
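While the paths are stuck, their state can be inspected from the ESXi shell. A minimal sketch, assuming the adapter name vmhba64 seen in the logs below:

# Show HPP-managed paths and their states (look for "dead")
esxcli storage hpp path list
# Show NVMe controllers and whether they are online
esxcli nvme controller list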
This issue was observed in ESXi 8.0 Update 1 with the following build details (a quick way to confirm a host's build appears after the list):
· VMware ESXi 8.0 Update 1
o Release Date: 18 April 2023
o GA ISO Build: 21495797
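To confirm whether a host is running the affected build, the version can be checked directly on the host:

# Prints, for example: VMware ESXi 8.0.1 build-21495797
vmware -vl
esxcli system version get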
Log Analysis
Analysis of the ESXi vmkernel logs shows that after hardware recovery, a link-up event is received but the NVMe controller is not identified. On a subsequent attempt, the NVMe controller is discovered, which activates the path to the devices.
Linkup
2023-04-18T12:57:42.177Z In(182) vmkernel: cpu46:1049378)lpfc: lpfc_mbx_cmpl_read_topology:3664: 0:1303 Link Up Event x11 received Data: x11 x0 x90 x0 x0
2023-04-18T12:57:42.177Z In(182) vmkernel: cpu46:1049378)NVMFEVT:324 Received event 1 (0x430cd534e8d0) for vmhba64 event queue.
2023-04-18T12:57:42.177Z In(182) vmkernel: cpu71:1049450)NVMFEVT:650 vmhba64 NVMe adapter's link is up
2023-04-18T12:57:42.207Z In(182) vmkernel: cpu46:1049378)lpfc: lpfc_issue_gidft:4108: 0:(0):fc4 type 3
2023-04-18T12:57:42.220Z In(182) vmkernel: cpu46:1049378)lpfc: lpfc_cmpl_els_prli:2423: 0:(0):0103 PRLI completes to NPort x20100 Data: x0 x20000 x14 x0
NVMSS Reconnect Completion
2023-04-18T12:57:42.236Z In(182) vmkernel: cpu36:1049566)NVMEDEV:6178 Discover namespaces on controller 259 is complete
2023-04-18T12:57:42.236Z In(182) vmkernel: cpu36:1049566)NVMEDEV:7944 Reset controller 259 successful.
Identify Controller Failed Log
2023-04-18T12:57:43.222Z Wa(180) vmkwarning: cpu50:1049103)WARNING: NvmeDiscover: 7536: GetIdentifyController failed for controller:nqn.1994-04.jp.co.hitachi:nvme:storage-subsystem-sn.5-30008-nvmssid.00007#vmhba64#50060e8008753845:50060e8008753845,
2023-04-18T12:57:43.222Z Wa(180) vmkwarning: cpu50:1049103)WARNING: status:Transient storage condition, suggest retry
2023-04-18T12:58:54.060Z In(182) vmkernel: cpu25:1058316)NvmeDiscover: 7413: subsystem wide controller probe already in progress - subNqn:nqn.1994-04.jp.co.hitachi:nvme:storage-subsystem-sn.5-30008-nvmssid.00007
2023-04-18T12:58:54.060Z In(182) vmkernel: cpu41:1049721)HPP: HppPathGroupMovePath:661: Path "vmhba64:C0:T0:L0" state changed from "dead" to "active"
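The relevant events can be pulled out of the vmkernel log with a simple filter; a rough sketch (the patterns match the message strings in the excerpts above):

# Trace the link-up, identify-failure, and path-state-change events
grep -E "Link Up Event|GetIdentifyController failed|HppPathGroupMovePath" /var/log/vmkernel.log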
Across several iterations, the path recovery delay averaged 90 seconds, and in some iterations it took even longer. The issue was intermittent and did not occur on every failure-and-recovery cycle.
Impact of FC Switch
This issue was observed on both switched and direct-attached paths, which indicates that the problem is not related to the Fibre Channel (FC) switch and occurs regardless of whether a switch is present.
Resolution
VMware resolved the issue in ESXi 8.0 Update 1c by updating the nvmetcp VIB to address the delay. For more details, see the VMware ESXi 8.0 Update 1c release notes.
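One way to apply the fix is a profile update from an offline depot; a sketch, assuming a hypothetical depot path and the standard image-profile naming for the 8.0 Update 1c build (verify the actual profile name with the first command before updating):

# List the image profiles in the depot to confirm the profile name
esxcli software sources profile list -d /vmfs/volumes/datastore1/VMware-ESXi-8.0U1c-22088125-depot.zip
# Apply the update, then reboot the host
esxcli software profile update -d /vmfs/volumes/datastore1/VMware-ESXi-8.0U1c-22088125-depot.zip -p ESXi-8.0U1c-22088125-standard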
Note
Although the issue is fixed in ESXi 8.0 Update 1c, auto-discovery might still fail on networks configured to use a vSphere Distributed Switch. The issue does not affect NVMe/TCP configurations that use standard switches.
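To check which switch type a host's networking uses, the virtual switch inventory can be listed; a minimal sketch:

# Distributed switches the host participates in
esxcli network vswitch dvs vmware list
# Standard switches configured on the host
esxcli network vswitch standard list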
Conclusion
This blog explained the path recovery delay issue observed in ESXi 8.0 Update 1 build-21495797, including how it was identified and resolved. To avoid this issue, upgrade hosts to ESXi 8.0 Update 1c build-22088125 or later.