
Troubleshooting FC-NVMe HPP Path Auto Discovery Delay in ESXi 8.0 Update 1

By Pratibha Prasad posted 01-29-2025 05:06

  

Introduction

This blog describes an FC-NVMe HPP path auto-discovery delay observed in VMware ESXi 8.0 Update 1 (the General Availability release) during link-up events. With FC-NVMe, a delay is observed in path recovery following a namespace path failure.

Block Diagram

The following diagram shows the FC-NVMe block diagram.

Figure: FC-NVMe block diagram

What is HPP?

The High-Performance Plug-in (HPP) is VMware's multipathing software for storage devices on ESXi hosts. The default multipathing package on an ESXi host is the Native Multipathing Plug-in (NMP). The HPP replaces the default NMP for high-speed devices such as NVMe.

Starting with vSphere 7.0 Update 2, the HPP is the default plug-in for local NVMe devices. It is also the default plug-in that claims NVMe-oF targets, ensuring better performance and reliability by streamlining I/O path management.
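As a quick check, you can confirm which plug-in claims each device from the ESXi shell. The following is a minimal sketch using standard esxcli namespaces; it is guarded so it degrades gracefully on a machine without esxcli:

```shell
#!/bin/sh
# Show which multipathing plug-in claims each device on an ESXi host.
# Guarded so the script is a harmless no-op where esxcli is unavailable.
show_hpp_claims() {
  if command -v esxcli >/dev/null 2>&1; then
    echo "== Devices claimed by the HPP =="
    esxcli storage hpp device list
    echo "== Claim rules (which plug-in claims what) =="
    esxcli storage core claimrule list
  else
    echo "esxcli not found: run these commands in an ESXi host shell"
  fi
}
show_hpp_claims
```

Local NVMe devices and NVMe-oF targets should appear in the HPP device list; anything still claimed by the NMP shows up under the claim rules instead.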

Configurations

The following lists the hardware requirements:

·        Host System: ProLiant DL380 Gen10

·        Host HBA: Emulex LPe35002

The following lists the software requirements:

·        OS install media: VMware ESXi 8.0.1 build-21495797

·        Host HBA Driver version: v14.2.560.8 

·        Host HBA Firmware version: v14.2.455.11

Observing the issue

During a path failure between the HBA and a Hitachi Virtual Storage Platform (VSP) port, it was observed that after the cable was reconnected, the paths to the storage devices remained in a dead state on the ESXi host for some time before being restored. This behavior is unusual for NVMe devices, which typically require only a few microseconds to recover.
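The dead-path symptom can be observed live from the ESXi shell. A hedged sketch (guarded so it is a no-op off-host; the exact adapter and device names will differ per setup):

```shell
#!/bin/sh
# Inspect NVMe path and controller state after a link event on an ESXi host.
check_nvme_paths() {
  if command -v esxcli >/dev/null 2>&1; then
    esxcli storage hpp path list   # per-path state (active/dead)
    esxcli nvme controller list    # NVMe controllers currently discovered
    esxcli nvme namespace list     # namespaces behind those controllers
  else
    echo "esxcli not found: run these commands in an ESXi host shell"
  fi
}
check_nvme_paths
```

During the delay window described above, the HPP path list continues to report the path as dead even though the physical link is back up.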

 

This issue was observed in ESXi 8.0 Update 1 with the following build details:

·        VMware ESXi 8.0 Update 1

o   Release Date: 18 April, 2023

o   GA ISO Build: 21495797

Log Analysis

Analyzing the ESXi vmkernel logs, it was observed that after hardware recovery, a link-up event is received without identifying the NVMe controller. After subsequent attempts, the NVMe controller is discovered, activating the path to the devices.

Linkup

2023-04-18T12:57:42.177Z In(182) vmkernel: cpu46:1049378)lpfc: lpfc_mbx_cmpl_read_topology:3664: 0:1303 Link Up Event x11 received Data: x11 x0 x90 x0 x0
2023-04-18T12:57:42.177Z In(182) vmkernel: cpu46:1049378)NVMFEVT:324 Received event 1 (0x430cd534e8d0) for vmhba64 event queue.
2023-04-18T12:57:42.177Z In(182) vmkernel: cpu71:1049450)NVMFEVT:650 vmhba64 NVMe adapter's link is up
2023-04-18T12:57:42.207Z In(182) vmkernel: cpu46:1049378)lpfc: lpfc_issue_gidft:4108: 0:(0):fc4 type 3
2023-04-18T12:57:42.220Z In(182) vmkernel: cpu46:1049378)lpfc: lpfc_cmpl_els_prli:2423: 0:(0):0103 PRLI completes to NPort x20100 Data: x0 x20000 x14 x0

NVMSS Reconnect Completion

2023-04-18T12:57:42.236Z In(182) vmkernel: cpu36:1049566)NVMEDEV:6178 Discover namespaces on controller 259 is complete
2023-04-18T12:57:42.236Z In(182) vmkernel: cpu36:1049566)NVMEDEV:7944 Reset controller 259 successful.

Identify Controller Failed Log

2023-04-18T12:57:43.222Z Wa(180) vmkwarning: cpu50:1049103)WARNING: NvmeDiscover: 7536: GetIdentifyController failed for controller:nqn.1994-04.jp.co.hitachi:nvme:storage-subsystem-sn.5-30008-nvmssid.00007#vmhba64#50060e8008753845:50060e8008753845,
2023-04-18T12:57:43.222Z Wa(180) vmkwarning: cpu50:1049103)WARNING: status:Transient storage condition, suggest retry
2023-04-18T12:58:54.060Z In(182) vmkernel: cpu25:1058316)NvmeDiscover: 7413: subsystem wide controller probe already in progress - subNqn:nqn.1994-04.jp.co.hitachi:nvme:storage-subsystem-sn.5-30008-nvmssid.00007
2023-04-18T12:58:54.060Z In(182) vmkernel: cpu41:1049721)HPP: HppPathGroupMovePath:661: Path "vmhba64:C0:T0:L0" state changed from "dead" to "active"

 

Across several iterations, the path recovery delay averaged 90 seconds, and in some iterations it exceeded 90 seconds. The issue was intermittent and did not occur during every failure and recovery cycle.
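The delay for a given iteration can be derived directly from the vmkernel log timestamps: it is the gap between the adapter link-up event and the HPP path-state change. A self-contained sketch using the two sample lines from the excerpt above (the file path is illustrative); for these two lines the gap works out to 72 seconds:

```shell
#!/bin/sh
# Compute the path recovery delay from vmkernel log timestamps:
# time between the adapter link-up event and the path going "dead" -> "active".
cat > /tmp/vmkernel_sample.log <<'EOF'
2023-04-18T12:57:42.177Z In(182) vmkernel: cpu71:1049450)NVMFEVT:650 vmhba64 NVMe adapter's link is up
2023-04-18T12:58:54.060Z In(182) vmkernel: cpu41:1049721)HPP: HppPathGroupMovePath:661: Path "vmhba64:C0:T0:L0" state changed from "dead" to "active"
EOF
# Split each line on 'T' and '.' so $2 is the HH:MM:SS portion of the timestamp,
# convert to seconds, and print the difference.
delay=$(awk -F'T|\.' '
  /link is up/       { split($2, a, ":"); t1 = a[1]*3600 + a[2]*60 + a[3] }
  /dead" to "active/ { split($2, b, ":"); t2 = b[1]*3600 + b[2]*60 + b[3] }
  END { print t2 - t1 }' /tmp/vmkernel_sample.log)
echo "Recovery delay: ${delay} seconds"
```

The same pipeline can be pointed at /var/log/vmkernel.log on the host to measure each failure and recovery cycle (add a date term if an iteration spans midnight).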

Impact of FC Switch

This issue was observed with both switched and direct-attached paths. This indicates that the problem is not related to the Fibre Channel (FC) switch, and it occurs regardless of whether a switch is present.

Resolution

VMware resolved the issue in ESXi 8.0 Update 1c by updating the nvmetcp VIB to address the delay. For more details, see the VMware ESXi 8.0 Update 1c release notes.

Keynote

Although the issue is fixed in ESXi 8.0 Update 1c, auto discovery on networks configured to use a vSphere Distributed Switch might still fail. The issue does not affect NVMe/TCP configurations that use standard switches.

Conclusion

This blog explained the path recovery delay issue observed in ESXi 8.0 Update 1 build-21495797, including its identification and resolution. To avoid this issue, upgrade hosts to ESXi 8.0 Update 1c build-22088125 or a later version.
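After upgrading, the running build can be confirmed from the ESXi shell. A minimal sketch (guarded so it is a no-op where esxcli is unavailable):

```shell
#!/bin/sh
# Confirm the ESXi version and build number; the fix ships in
# ESXi 8.0 Update 1c (build 22088125) and later.
check_esxi_build() {
  if command -v esxcli >/dev/null 2>&1; then
    esxcli system version get
  else
    echo "esxcli not found: run this in an ESXi host shell"
  fi
}
check_esxi_build
```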
