Fixing host connection issues on Dell servers in vSphere 5.x

I recently spoke with a few colleagues at the Dell Enterprise Forum, and the symptoms they described with some Dell servers in their vSphere cluster sounded similar to what I had experienced with my new M620 hosts running vSphere 5.0 Update 2.  While I’m uncertain whether their issues were related to mine, it occurred to me that I might not be the only one who ran into this problem.  So I thought I’d write a post to help anyone else experiencing the behavior I encountered.

Symptoms
The new cluster of Dell M620 blades running vSphere 5.0 U2, which was being used as our Development Team’s code compiling cluster, was randomly dropping host connections.  Yep, not good.  This wasn’t normal behavior of course, and the effects ranged from a host still being up (but acting odd) to complete isolation of the host with no success at a soft recovery.  The hosts had the latest firmware applied, and I used the custom Dell ESXi ISO when building them.  Each service (Mgmt, LAN, vMotion, storage) was meshed so that no service depended on a single multiport NIC adapter, but they still went down.  What was creating the problem?  I won’t leave you hanging.  It was the Broadcom network drivers for ESXi.

Before I figured out what the problem was, here is what I knew:

  • The behavior was only occurring on a cluster of 4 Dell M620 hosts.  The other cluster, containing M610s, never experienced this issue.
  • It had occurred on each host at least once, typically at times when heavy traffic was more likely.
  • Various services had been impacted.  One time it was storage, and another time it was the LAN side.

Blade configuration background
To understand the symptoms and the correction a bit better, it is worth getting an overview of what the Dell M620 blade looks like in terms of network connectivity.  What I show below reflects my 1GbE environment, and would look different if I were using 10GbE, or switch modules instead of passthrough modules.

The M620 blades come with a built-in Broadcom NetXtreme II BCM57810 10Gbps Ethernet adapter.  This provides two 10Gbps ports on fabric A of the blade enclosure.  These will negotiate down to 1GbE if you have passthroughs on the back of the enclosure, as I do.

image

There are two spots in each blade that will accept additional mezzanine adapters, for fabric B and fabric C respectively.  Since I also have 1GbE passthroughs on these fabrics, I chose the Broadcom NetXtreme BCM5719 Gigabit Ethernet adapter.  Each provides four 1GbE ports.  With passthroughs, only two of the four on each adapter are reachable.  The end result is six 1GbE ports available for use on each blade: two for storage, two for Production LAN traffic, and two for vSphere Mgmt and vMotion.  All services needed (iSCSI, Mgmt, etc.) are assigned so that in the event of a single adapter failure, you’re still good to go.

image

And yes, I’d love to go to 10GbE as much as anyone, but that is a larger matter especially when dealing with blades and the enclosure that they reside in.  Feel free to send me a check, and I’ll return the favor with a nice post.

How to diagnose, and correct
In one case, this event caused an All Paths Down from the host to my storage.  I looked in /scratch/log on the host, with the intent of checking the vmkernel.log and vobd.log files to see what was up.  The following command returned several entries that looked like those below:

less /scratch/log/vobd.log

2013-04-03T16:17:33.849Z: [iscsiCorrelator] 6384105406222us: [esx.problem.storage.iscsi.target.connect.error] Login to iSCSI target iqn.2001-05.com.equallogic:0-8a0906-d0a034d04-d6b3c92ecd050e84-vmfs001 on vmhba40 @ vmk3 failed. The iSCSI initiator could not establish a network connection to the target.

2013-04-03T16:17:44.829Z: [iscsiCorrelator] 6384104156862us: [vob.iscsi.target.connect.error] vmhba40 @ vmk3 failed to login to iqn.2001-05.com.equallogic:0-8a0906-e98c21609-84a00138bf64eb18-vmfs002 because of a network connection failure.
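When vobd.log is long, it can help to count how often each adapter and vmkernel port pair shows up in these login failures.  The following is a minimal sketch, not something from my original troubleshooting: the function name summarize_iscsi_errors is made up for illustration, and it assumes the log lines follow the "vmhbaNN @ vmkN" pattern seen in the two entries above.

```shell
# Count iSCSI login failures per "adapter vmkernel-port" pair.
# Reads vobd.log lines on stdin. The function name is illustrative only.
summarize_iscsi_errors() {
    # Keep only the iSCSI connect-error events, pull out the
    # "vmhbaNN @ vmkN" pair, then count identical pairs.
    grep 'iscsi.target.connect.error' |
        sed -n 's/.*\(vmhba[0-9]*\) @ \(vmk[0-9]*\).*/\1 \2/p' |
        sort | uniq -c
}

# Usage on the host (log path as used above):
#   summarize_iscsi_errors < /scratch/log/vobd.log
```

A count concentrated on a single vmk port points toward the uplink behind that port, which is what led me to look at the NICs next.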

Then I ran the following to verify which NICs I had and which driver each one was using:

esxcfg-nics -l

Name    PCI           Driver      Link Speed     Duplex MAC Address       MTU    Description
vmnic0  0000:01:00.00 bnx2x       Up   1000Mbps  Full   00:22:19:9e:64:9b 1500   Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet
vmnic1  0000:01:00.01 bnx2x       Up   1000Mbps  Full   00:22:19:9e:64:9e 1500   Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet
vmnic2  0000:03:00.00 tg3         Up   1000Mbps  Full   00:22:19:9e:64:9f 1500   Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet
vmnic3  0000:03:00.01 tg3         Up   1000Mbps  Full   00:22:19:9e:64:a0 1500   Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet
vmnic4  0000:03:00.02 tg3         Down 0Mbps     Half   00:22:19:9e:64:a1 1500   Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet
vmnic5  0000:03:00.03 tg3         Down 0Mbps     Half   00:22:19:9e:64:a2 1500   Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet
vmnic6  0000:04:00.00 tg3         Up   1000Mbps  Full   00:22:19:9e:64:a3 1500   Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet
vmnic7  0000:04:00.01 tg3         Up   1000Mbps  Full   00:22:19:9e:64:a4 1500   Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet
vmnic8  0000:04:00.02 tg3         Down 0Mbps     Half   00:22:19:9e:64:a5 1500   Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet
vmnic9  0000:04:00.03 tg3         Down 0Mbps     Half   00:22:19:9e:64:a6 1500   Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet
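With ten vmnics per host, a short filter makes it easier to see which driver sits behind each active uplink.  This is only a sketch: the function name nics_up_by_driver is hypothetical, and it assumes the column order shown in the listing above (name, PCI address, driver, link state).

```shell
# Print "driver vmnic" for every NIC whose link is Up.
# Reads `esxcfg-nics -l` output on stdin. The function name is illustrative only.
nics_up_by_driver() {
    # Column 1 is the vmnic name, column 3 the driver, column 4 the link state.
    # Matching on ^vmnic also skips the header row.
    awk '$1 ~ /^vmnic/ && $4 == "Up" { print $3, $1 }' | sort
}

# Usage on the host:
#   esxcfg-nics -l | nics_up_by_driver
```

In my case, that narrows things down to two bnx2x ports and four tg3 ports.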

Knowing which vmnics were being used for storage traffic, I took a look at the driver version for vmnic3:

ethtool -i vmnic3

driver: tg3
version: 3.124c.v50.1
firmware-version: FFV7.4.8 bc 5719-v1.31
bus-info: 0000:03:00.1
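Checking one NIC at a time gets tedious across several hosts, so the same ethtool output can be boiled down to a single line per NIC.  The following is a rough sketch under the assumption that the output keeps the "key: value" layout shown above; driver_version is a name I made up for the example.

```shell
# Print "driver version" from `ethtool -i` output supplied on stdin.
# The function name is illustrative only.
driver_version() {
    # Split each line at ": " and remember the two fields we care about.
    # "firmware-version" is not matched because the key is compared exactly.
    awk -F': ' '
        $1 == "driver"  { d = $2 }
        $1 == "version" { v = $2 }
        END { print d, v }
    '
}

# Usage on the host, once per NIC of interest:
#   for nic in vmnic2 vmnic3 vmnic6 vmnic7; do
#       printf '%s ' "$nic"; ethtool -i "$nic" | driver_version
#   done
```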

Time to check and see if there were updated drivers.

Finding and updating the drivers
The first step was to check the VMware Compatibility Guide for this particular NIC.  The good news was that there was an updated driver for this adapter: 3.129d.v50.1.  I downloaded the latest driver (VIB) for that NIC to a datastore that was accessible to the host, so that it could be installed.  Making the driver available and installing it can certainly be done with VMware Update Manager, but for my example, I’m performing these steps from the command line.  Remember to put the host into maintenance mode first.
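Maintenance mode can be entered from the same shell session.  The sketch below assumes the esxcli system maintenanceMode namespace is available on your build and that the running VMs have already been migrated or powered off; the wrapper function name is hypothetical.

```shell
# Put the host into maintenance mode, then print the resulting state.
# The function name is illustrative only; run it on the ESXi host itself.
enter_maintenance_mode() {
    # Stop here if the host refuses to enter maintenance mode.
    esxcli system maintenanceMode set --enable true || return 1
    # Report the current state so it can be confirmed before installing.
    esxcli system maintenanceMode get
}

# Usage on the host:
#   enter_maintenance_mode
```

Once the state is confirmed, the driver can be installed.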

esxcli software vib install -v /vmfs/volumes/VMFS001/drivers/broadcom/net-tg3-3.129d.v50.1-1OEM.500.0.0.472560.x86_64.vib

Installation Result
Message: The update completed successfully, but the system needs to be rebooted for the changes to be effective.
Reboot Required: true
VIBs Installed: Broadcom_bootbank_net-tg3_3.129d.v50.1-1OEM.500.0.0.472560
VIBs Removed: Broadcom_bootbank_net-tg3_3.124c.v50.1-1OEM.500.0.0.472560
VIBs Skipped:

The final steps are to reboot the host and verify the results.

ethtool -i vmnic3

driver: tg3
version: 3.129d.v50.1
firmware-version: FFV7.4.8 bc 5719-v1.31
bus-info: 0000:03:00.1
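If you are updating several hosts, the verification can be turned into a pass or fail check.  This is a minimal sketch rather than part of my original procedure: has_driver_version is a made-up helper, and it assumes the same "version: ..." line format shown in the ethtool output above.

```shell
# Succeed only if the `ethtool -i` output on stdin reports the expected
# driver version. The function name is illustrative only.
has_driver_version() {
    expected="$1"
    # Compare the "version" field exactly; exit 0 on a match, 1 otherwise.
    awk -F': ' -v want="$expected" '
        $1 == "version" { found = ($2 == want) }
        END { exit !found }
    '
}

# Usage on the host after the reboot:
#   ethtool -i vmnic3 | has_driver_version 3.129d.v50.1 && echo "vmnic3 OK"
```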

Conclusion
I initially suspected that the problems were driver related, but the symptoms produced by the bad drivers gave the impression that a larger issue was at play.  Nevertheless, I couldn’t get the updated drivers loaded fast enough, and since that time (about 3 months), the hosts have been rock solid and behaving normally.

Helpful links
Determining Network/Storage firmware and driver version in ESXi
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1027206

VMware Compatibility Guide
http://www.vmware.com/resources/compatibility/search.php?deviceCategory=io&productid=19946&deviceCategory=io&releases=187&keyword=bcm5719&page=1&display_interval=10&sortColumn=Partner&sortOrder=Asc