On ESXi 7.x, one of my VMs caused the whole ESXi host to hang. The only way to recover was a hard reset. The root cause was the passthrough NVIDIA Quadro P600 adapter I had configured on the VM: during VM shutdown, the PCI reset caused the host to hang.

The following articles helped me: Reddit, VMware KB

Some of the PCI reset types are:

  • Function Level Reset (FLR)
  • Secondary Bus Reset
  • Link Disable/Enable
  • Device power state transition (D0 > D3hot > D0; non-standard reset method)

To resolve the issue, first find the vendor ID and device ID of the passthrough device.

Use the command esxcli hardware pci list and look for the Vendor ID, Device ID, and Reset Method fields:

$ esxcli hardware pci list
0000:04:00.0
   Address: 0000:04:00.0
   Segment: 0x0000
   Bus: 0x04
   Slot: 0x00
   Function: 0x0
   VMkernel Name: 
   Vendor Name: NVIDIA Corporation
   Device Name: GP107GL [Quadro P600]
   Configured Owner: VM Passthru
   Current Owner: VM Passthru
   Vendor ID: 0x10de
   Device ID: 0x1cb2
   SubVendor ID: 0x10de
   SubDevice ID: 0x11bd
   Device Class: 0x0300
   Device Class Name: VGA compatible controller
   Programming Interface: 0x00
   Revision ID: 0xa1
   Interrupt Line: 0x0b
   IRQ: 255
   Interrupt Vector: 0x00
   PCI Pin: 0x00
   Spawned Bus: 0x00
   Flags: 0x3001
   Module ID: 49
   Module Name: pciPassthru
   Chassis: 0
   Physical Slot: 4
   Slot Description: Slot4
   Device Layer Bus Address: s00000004.00
   Passthru Capable: true
   Parent Device: PCI 0:0:3:0
   Dependent Device: PCI 0:0:3:0
   Reset Method: Bridge reset
   FPT Sharable: true
   NUMA Node: -1
   Extended Device ID: 0
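
The full listing is long on a host with many devices, so it can help to filter it down to the fields that matter here. The following is a minimal sketch, assuming the "Label: value" layout shown in the output above; the function name pci_ids is hypothetical, and it reads the listing from standard input.

```shell
# Hypothetical helper: print address, vendor ID, device ID and reset method
# for every device whose "Vendor Name" line contains the given string.
# Assumes the indented "Label: value" layout of `esxcli hardware pci list`.
pci_ids() {
  awk -v vendor="$1" '
    /^ +Address:/      { addr = $2 }
    /^ +Vendor Name:/  { hit = (index($0, vendor) > 0) }
    /^ +Vendor ID:/    { vid = $3 }
    /^ +Device ID:/    { did = $3 }
    /^ +Reset Method:/ {
      # Strip the label so that $0 holds only the reset method text.
      sub(/^ +Reset Method: */, "")
      if (hit) print addr, vid, did, $0
    }
  '
}

# Usage on the host:
#   esxcli hardware pci list | pci_ids NVIDIA
```

The vendor and device IDs are printed with a 0x prefix; passthru.map expects them without it (10de and 1cb2 in this case).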

Edit the /etc/vmware/passthru.map file:

[root@dell-t5600:~] cat /etc/vmware/passthru.map 
# passthrough attributes for devices
#
# file format: vendor-id device-id resetMethod fptShareable
# vendor/device id: xxxx (in hex) (ffff can be used for wildchar match)
# reset methods: flr, d3d0, link, bridge, default
# fptShareable: true/default, false
#
# Description:
#
# - fptShareable: when set to true means the PCI device can be shared.
#   Sharing refers to using multiple functions of a multi‐function
#   device in different contexts. That is, sharing between two
#   virtual machines or between a virtual machine and VMkernel.
#
# - resetMethod: override for the type of reset to apply to a PCI device.
#   Bus reset and link reset prevent functions in a multi-function
#   device from being assigned to different virtual machines, or from
#   being assigned between the VMkernel and virtual machines. In
#   some devices it's possible to use PCI power management capability
#   D3->D0 transitions to reset the device. In the absence of the
#   override, the VMkernel decides the type of PCI reset to apply
#   based on the device's capabilities. The VMkernel prioritizes
#   function level reset (flr).
#
# Restrictions:
#
# - PCI SR-IOV physical and virtual functions (PFs/VFs) are not allowed
#   in the list below. Those must support function-level-reset and
#   must be shareable.
#

# Intel 82579LM Gig NIC can be reset with d3d0
8086  1502  d3d0     default
# Intel 82598 10Gig cards can be reset with d3d0
8086  10b6  d3d0     default
8086  10c6  d3d0     default
8086  10c7  d3d0     default
8086  10c8  d3d0     default
8086  10dd  d3d0     default
# Broadcom 57710/57711/57712 10Gig cards are not shareable
14e4  164e  default  false
14e4  164f  default  false
14e4  1650  default  false
14e4  1662  link     false
# Qlogic 8Gb FC card can not be shared
1077  2532  default  false
# LSILogic 1068 based SAS controllers
1000  0056  d3d0     default
1000  0058  d3d0     default
# NVIDIA
10de  ffff  bridge   false
# AMD FCH SATA Controller [AHCI mode]
1022  7901  d3d0     default

NVIDIA was already listed, and the device ID ffff is a wildcard that matches all NVIDIA devices, so I just had to change the reset method from bridge to link.

From:
# NVIDIA
10de  ffff  bridge   false

To:
# NVIDIA
10de  ffff  link     false
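
The same change can be made without opening an editor. This is a minimal sketch, assuming the entry has the four-column layout shown above; the function name set_reset_method is hypothetical, and it keeps a copy of the original file with a .bak suffix before rewriting it.

```shell
# Hypothetical helper: set the reset method for one vendor/device entry
# in a passthru.map-style file, keeping a backup copy first.
# Arguments: file, vendor-id, device-id, new reset method.
set_reset_method() {
  file=$1 vendor=$2 device=$3 method=$4
  cp "$file" "$file.bak" || return 1
  # Rewrite only the four-column line for "<vendor> <device>": the third
  # column (reset method) is replaced, the fptShareable column is kept.
  # Comment lines and all other entries are copied through unchanged.
  awk -v v="$vendor" -v d="$device" -v m="$method" '
    $1 == v && $2 == d && NF == 4 {
      printf "%s  %s  %-7s  %s\n", $1, $2, m, $4
      next
    }
    { print }
  ' "$file.bak" > "$file"
}

# Usage on the host:
#   set_reset_method /etc/vmware/passthru.map 10de ffff link
```

If the new reset method does not help, copying the .bak file back restores the original entry.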

Reboot the host after making changes.