Diagnosing a failed iSCSI switch interconnect in a vSphere environment
October 6, 2012
The beauty of a well-constructed, highly redundant environment is that if a single point fails, systems should continue to operate without issue. Sometimes, though, knowing exactly what failed is more challenging than it first appears. That is what I ran into recently, and I wanted to share what happened, how it was diagnosed, and how it was ultimately corrected.
A group of two EqualLogic arrays was running happily against a pair of stacked Dell PowerConnect 6224 switches, serving a 7-node vSphere cluster. The switches were rebuilt over a year ago, and since then they have been rock solid. Suddenly, the arrays started spitting out all kinds of different errors. Many of the messages looked similar to these:
iSCSI login to target '10.10.0.65:3260, iqn.2001-05.com.equallogic:0-8a0906-b6cc21609-d200014832f4ecfb-vmfs001' from initiator '10.10.0.10:52155, iqn.1998-01.com.vmware:esx1-70a98577' failed for the following reason:
Initiator disconnected from target during login.
Some of the earliest errors on the array looked like this:
10/1/2012 1:01:11 AM to 10/1/2012 1:01:11 AM
Warning: Member PS6000e network port cannot be reached. Unable to obtain network performance data for the member.
Warning: Member PS6100e network port cannot be reached. Unable to obtain network performance data for the member.
10/1/2012 1:01:11 AM to 10/1/2012 1:01:11 AM
Caution: Some SNMP requests to member PS6100e for disk drive information timed out.
Caution: Some SNMP requests for information about member PS6100e disk drives timed out.
VMs that had guest attached volumes were generating errors similar to this:
Subject: ASMME smartcopy from SVR001: MPIO Reconfiguration Request IPC Error – iqn.2001-05.com.equallogic:0-8a0906-bd5d27503-7ef000ed5d54a8c1-ntfs001 on host SVR001
[01:01:11] MPIO failure during reconfiguration request for target iqn.2001-05.com.equallogic:0-8a0906-476f6bd06-0c500008a0c4c41f-ntfs002 with error status 0x16000000.
[01:01:11] MPIO failure during reconfiguration request for target iqn.2001-05.com.equallogic:0-8a0906-dc0da1609-2fe0014145f4e931-ntfs001 with error status 0x80070006.
Before I had a chance to look at anything, I suspected something was wrong with the SAN switch stack, but was uncertain beyond that. I jumped into vCenter to see if anything obvious showed up, but vSphere and all of the VMs were motoring along just like normal. There were no failed uplink errors or anything else noticeable. I didn’t do much vSphere log fishing at this point because everything pointed to something on the storage side, and I had a number of tools that could narrow down the problem. With all things related to storage traffic, I wanted to be extra cautious and avoid making matters worse with reckless attempts at a fix.
First, some background on how EqualLogic arrays work. All arrays have two controllers, working in an active/passive arrangement. Depending on the model of array, each controller will have between two and four Ethernet ports, with each port having an IP address assigned to it. Additionally, there will be a single IP address to define the “group” the member array is a part of. (The Group IP is a single IP used by systems looking for an iSCSI target, letting the intelligence of the arrays figure out how to distribute traffic across interfaces.) If some of the interfaces can’t be contacted (e.g. disconnected cable, switch failure, etc.), the EqualLogic arrays will be smart enough to distribute traffic across the active links.
The ports of each EqualLogic array are connected to the stacked SAN switches in a meshed arrangement for redundancy. If there were a switch failure, one wouldn’t be able to contact the IP addresses of the Ethernet ports connected to the failed switch. But using a VM with guest attached volumes (which have direct access to the SAN), I could successfully ping all four interfaces (eth0 through eth3) on each array. Hmm…
So then I decided to SSH into the array and see if I could perform the same test. The idea would be to test from one IP on one of the arrays to see if a ping would be successful on eth0 through eth3 on the other array. The key to doing this is to use an IP of one of the individual interfaces as the source, and not the Group IP. Controlling the source and the target during this test will tell you a lot. After connecting to the array via SSH, the syntax for testing the interfaces on the target array would be this:
ping -I "[sourceIP] [destinationIP]" (quotes are needed!)
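Since the test has to be repeated for every source and destination pair, it can help to write the full list of commands out ahead of time. The following is a minimal sketch in Python that only builds the command strings in the form shown above; it does not connect to the array or run anything. The function name and all of the IP addresses are hypothetical placeholders, not values from this environment.

```python
from itertools import product


def build_ping_commands(source_ips, destination_ips):
    """Return one array-CLI ping command per (source, destination) pair."""
    commands = []
    for src, dst in product(source_ips, destination_ips):
        # Same form as the command shown above; the quotes are part of it.
        commands.append(f'ping -I "{src} {dst}"')
    return commands


# Hypothetical addresses: two source ports on the first array (one cabled to
# each switch) and the four ports, eth0 through eth3, on the second array.
sources = ["10.10.0.51", "10.10.0.52"]
destinations = ["10.10.0.61", "10.10.0.62", "10.10.0.63", "10.10.0.64"]

for command in build_ping_commands(sources, destinations):
    print(command)
```

Pasting the commands into the SSH session one at a time, and noting which ones get replies, gives a complete source-by-destination picture rather than a single data point.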
From one of the arrays, pinging all four interfaces on the second array revealed that only two of the four ports succeeded. But the earlier test from the VM proved that I could ping all interfaces, so I changed the source IP to one of the interfaces living on the other switch. I performed the same test, and the opposite results occurred: the ports that failed the last test passed this time, and the ports that passed the last test failed this time. This seemed to indicate that both switches were up, but the communication between the switches was down.
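The reasoning behind that conclusion can be written down as a small rule: if every ping between ports on the same switch succeeds, and every ping that has to cross from one switch to the other fails, the link between the switches is the likely culprit. Here is a rough Python sketch of that rule. The port labels, the port-to-switch mapping, and the function name are made up for illustration, and the result data simply mirrors the pattern described above.

```python
def diagnose(port_switch, ping_results):
    """Classify ping results gathered between array ports.

    port_switch:  maps each port label to the switch it is cabled to.
    ping_results: maps (source, destination) to True (reply) or False.
    """
    same_ok = same_fail = cross_ok = cross_fail = 0
    for (src, dst), replied in ping_results.items():
        same_switch = port_switch[src] == port_switch[dst]
        if same_switch and replied:
            same_ok += 1
        elif same_switch:
            same_fail += 1
        elif replied:
            cross_ok += 1
        else:
            cross_fail += 1

    if same_fail == 0 and cross_fail == 0:
        return "all tested paths healthy"
    # Same-switch traffic works, but nothing crosses between the switches.
    if same_ok > 0 and same_fail == 0 and cross_ok == 0:
        return "switch interconnect suspected"
    return "inconclusive"


# Hypothetical wiring: array A supplies the source ports, array B the targets.
port_switch = {
    "A-eth0": "switch1", "A-eth1": "switch2",
    "B-eth0": "switch1", "B-eth1": "switch2",
    "B-eth2": "switch1", "B-eth3": "switch2",
}

# Results shaped like the ones described above.
ping_results = {
    ("A-eth0", "B-eth0"): True,  ("A-eth0", "B-eth1"): False,
    ("A-eth0", "B-eth2"): True,  ("A-eth0", "B-eth3"): False,
    ("A-eth1", "B-eth0"): False, ("A-eth1", "B-eth1"): True,
    ("A-eth1", "B-eth2"): False, ("A-eth1", "B-eth3"): True,
}

print(diagnose(port_switch, ping_results))
```

Anything that does not fit the clean pattern falls through to "inconclusive", which is deliberate: a mixed result points at individual ports or cables and deserves a closer look rather than a guess.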
While I’ve never seen these errors on switches using stacking modules, I have seen the MPIO errors above on a trunked arrangement. One might run into these issues more with trunking, as it tends to leave more opportunity for issues caused by configuration errors. I knew that in this case, the switch configurations had not been touched for quite some time. The status of the switches via the serial console stated the following:
SW  Management Status  Standby Status  Preconfig Model ID  Plugged-in Model ID  Switch Status  Code Version
1   Mgmt Sw                            PCT6224             PCT6224              OK             18.104.22.168
2   Unassigned                         PCT6224                                  Not Present    0.0.0.0
The result above wasn’t totally surprising: if the stacking module was down, the master switch wouldn’t be able to gather information from the other switch.
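If you capture that status output to a file on a regular basis, a few lines of Python can flag a missing stack member. This is only a sketch that assumes the output looks like the text above, with each member row starting with its switch number; the function name is hypothetical, and the sample rows are copied from the output shown here.

```python
def missing_stack_members(status_output):
    """Return the switch numbers whose row reports "Not Present"."""
    missing = []
    for line in status_output.splitlines():
        fields = line.split()
        # Member rows start with the switch number; header rows do not.
        if fields and fields[0].isdigit() and "Not Present" in line:
            missing.append(int(fields[0]))
    return missing


sample = """\
1 Mgmt Sw PCT6224 PCT6224 OK 18.104.22.168
2 Unassigned PCT6224 Not Present 0.0.0.0
"""

print(missing_stack_members(sample))  # prints [2]
```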
Dell also has an interesting little tool called “Lasso.” The Dell Lasso Tool will help grab general diagnostics data from a variety of sources (servers, switches, storage arrays). But in this case, I found it convenient to test connectivity from the array group itself. The screen capture below seems to confirm what I learned through the testing above.
So the next step was figuring out what to do about it. I wanted to reboot/reload the slave switch, but knowing both switches were potentially passing live data, I didn’t want to do anything to compromise the traffic. So I employed an often overlooked but convenient way of manipulating traffic to the arrays: turning off the interfaces on each array that are connected to the SAN switch that needs to be restarted. With those interfaces off, no live data passes through that switch. Be warned that you had better have a nice, accurate wiring schematic of your infrastructure so that you know which interfaces can be disabled. You want to make things better, not worse.
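A wiring schematic can be turned into a quick sanity check before any interface is touched. The sketch below is a planning aid only, written in Python under the assumption that the wiring map is accurate. The port-to-switch assignments, the switch labels, and the function name are hypothetical (only the member names PS6000e and PS6100e come from the messages earlier in this post), and it does not talk to the arrays at all.

```python
def plan_switch_maintenance(wiring, switch):
    """List the interfaces to disable on each array before restarting a switch.

    wiring: {array_name: {interface_name: switch_name}}
    Raises ValueError if an array would be left with no active interface.
    """
    plan = {}
    for array, ports in wiring.items():
        to_disable = sorted(i for i, sw in ports.items() if sw == switch)
        remaining = [i for i, sw in ports.items() if sw != switch]
        if not remaining:
            raise ValueError(f"{array} has no interface on another switch")
        plan[array] = to_disable
    return plan


# Hypothetical wiring map; replace it with your own schematic.
wiring = {
    "PS6000e": {"eth0": "switch1", "eth1": "switch2",
                "eth2": "switch1", "eth3": "switch2"},
    "PS6100e": {"eth0": "switch1", "eth1": "switch2",
                "eth2": "switch1", "eth3": "switch2"},
}

for array, interfaces in plan_switch_maintenance(wiring, "switch2").items():
    print(array, "-> disable", ", ".join(interfaces))
```

The check that raises an error is the important part: if any array is cabled to only the switch being restarted, disabling its interfaces would take that member offline, which is exactly the kind of "worse" to avoid.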
After a restart of the second switch, the interconnect reestablished itself. The interfaces on the arrays were re-enabled, and all of the errors disappeared. I’m not entirely sure why the interconnect went down, but the primary objective was diagnosing and correcting in a safe, deliberate, yet speedy way. No VMs were down, and the only side effects of the issue were the errors generated and some degraded performance. Hopefully this will help you in case you see similar symptoms in your environment.
Reworking my PowerConnect 6200 switches for my iSCSI SAN
Dell TechCenter. A great resource for all things related to Dell in the Enterprise.