Saving time with the AutoLab for vSphere

Earlier this year, I started building up a few labs for a little work, and a little learning (three labs for three reasons).  Not too long after that, I started playing around with the AutoLab.  For those not familiar with it, the AutoLab is most easily described as a crafty collection of scripts, open source VMs, and shell VMs that lets you build a nested vSphere lab environment with minimal effort.  The nested arrangement can live inside VMware Workstation, Fusion, or ESXi.  AutoLab comes to you from a gentleman by the name of Alastair Cooke, along with support from many over at the vBrownBag sessions.  What?  You haven’t heard of the vBrownBag sessions either?  Even if you are just a mild enthusiast of VMware and virtualization, check out this great resource.  Week after week, they put out somewhat informal, but highly informative webinars.  The AutoLab and the vBrownBag sessions are both great examples of paying it forward to an already strong virtualization community.

The Value of AutoLab
Why use it?  Simple.  It saves time.  Who doesn’t like that?  To be fair, it doesn’t really do anything you couldn’t do on your own.  But here’s the key: scripts don’t forget to do stuff.  People do (me included).  Thus, the true appreciation of it really only comes after you have manually built up your own nested lab a few times.  With the AutoLab, set up the requirements (well documented in the deployment guide), kick it off, and a few hours later your lab is complete.  There are tradeoffs, of course, with a fully nested environment, but it is an incredibly powerful arrangement, and thanks to the automation, it lets you standardize your lab deployments.  The AutoLab has made big improvements with each version, and now helps incorporate vCloud Director, vShield, View, Veeam ONE, and Veeam Backup & Replication.

Letting the AutoLab take care of some of the menial things through automation is a nice reminder of the power of automation in general.  Development teams are quite familiar with this concept, where automation and unit testing allow them to be more aggressive in their coding to produce better results faster.  Fortunately, momentum seems to be gaining in IT around this, and VMware is doing its part as well with things like vCenter Orchestrator, PowerCLI, and stateless host configurations with AutoDeploy.
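
As a quick taste of what that looks like (a sketch only; the vCenter name and NTP source are placeholders, not from my lab), here is the sort of chore PowerCLI will never forget to finish on host number seven of seven:

Connect-VIServer vc.lab.local
Get-VMHost | Add-VMHostNtpServer -NtpServer 'pool.ntp.org'
Get-VMHost | Get-VMHostService | Where-Object { $_.Key -eq 'ntpd' } | Start-VMHostService

Run it once or run it fifty times; every host comes out configured the same way, which is exactly the point.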

In my other post, I touch a bit on what I use my labs for.  Those who know me also know I’m a stickler for documentation.  I have high standards for documentation from software manufacturers, and from myself when providing it for others.  The lab really helps me provide detailed steps and accurate screen shots along the way.

The AutoLab nested arrangement I touch most is on my laptop.  Since my original post, I did bump up my laptop to 32GB of RAM.  Some might think this would be an outrageously expensive luxury, but a 16GB kit for a Dell Precision M6600 laptop cost only $80 on NewEgg at the time of purchase (don’t ask me why it is so affordable; I have no idea).  Regardless, don’t let that number prevent you from using the AutoLab.  The documentation demonstrates how to make the core of the lab run with just 8GB of RAM.

A few tips
Here are a few tips that have made my experience with the AutoLab nested vSphere environment a bit better.  Nothing groundbreaking, just a few things to make life easier.

  • I keep a shortcut to a folder that houses all of the shortcuts needed for the lab: items like the router administration page, the NAS appliance, and the local hosts file, which you might need to edit on occasion.
  • I chose to create another VMnet network in VMware Workstation so that I could add a few more vNICs to my nested ESXi hosts.  That allows me to create a vSwitch to play with additional storage options (VSAs, etc.) while preserving what was already set up (see the .vmx sketch after this list).
  • The FREESCO router VM is quite a gem.  It provides quite a bit of flexibility in a lab environment (a few adjustments and you can connect to another lab living elsewhere), and you might even find other uses for it outside of a lab.
  • To allow direct access to the FreeNAS storage share from your workstation, you will need to click on the “Connect a host virtual adapter to this network” option on the VMnet3 network in the Network Editor of VMware Workstation.
  • You might be tempted to trim up the RAM on various VMs to make everything fit.  Trim the RAM too much on, say, the vMA, and SSH won’t work.  Just something to be mindful of.
  • On my laptop, I have a 256GB Crucial M4 SSD for the OS, and a 750GB SATA disk for everything else.  I keep a few of the VMs (the virtualized ESXi hosts, the vCenter server, and a DC) on the SSD, and most everything else on the SATA drive.  This makes everything pretty fast while using my SSD space wisely.
  • You’ll be downloading a number of ISOs and packages.  Start out with a plan for organization so that you know where things are when/if you have to rebuild.
  • The ReadMe file inside of the packaged FreeNAS VM is key to understanding where and how to place the installation bits.  Read carefully.
  • The automated build of the DC and vCenter VMs can be pretty finicky about which Windows Server ISO it will work with.  If you are running into problems, you may not be using the correct ISO.
  • If you build up your lab on a laptop, and suddenly you can’t get anything to talk to, say, the storage network, it may be the wireless (or wired) network you connected to.  I had this happen to me once, where the wireless address range happened to be the same as part of my lab.
  • As with any VMs you’ll be running on a desktop/laptop with VMware Workstation, make sure you create antivirus real-time scanning exceptions for all locations that will be housing VMs.  The last thing you need is your antivirus thinking it’s doing you a favor.
  • The laptop I use for my lab is also my primary system.  It’s worth a few bucks to protect it with a disk imaging solution.  I choose to dump the entire system out to an external drive using Acronis TrueImage.  I typically run this when all of the VMs are shut off.
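
Here is the .vmx sketch mentioned in the networking tip above.  With the nested ESXi host powered off, a few lines like these wire an extra vNIC to the new VMnet (the NIC number, VMnet number, and device type are examples; adjust them to your own layout):

ethernet2.present = "TRUE"
ethernet2.connectionType = "custom"
ethernet2.vnet = "VMnet4"
ethernet2.virtualDev = "e1000"

You can do the same thing through the VM settings in Workstation; editing the .vmx just makes it easy to repeat identically on every nested host.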

So there you have it.  Get your lab set up, hunker down with your favorite book, blog, links from Twitter, or vBrownBag session, and see what you can learn.  Use it and abuse it.  It’s not a production environment, so it is a great opportunity to improve your skills and polish up your documentation.

– Pete

Helpful Links
AutoLab
http://www.labguides.com/autolab
Twitter: @DemitasseNZ #AutoLab https://twitter.com/DemitasseNZ

vBrownBag sessions.  A great resource, and an easy way to surround yourself (virtually) with smart people. http://professionalvmware.com/brownbags/
Twitter:  @cody_bunch @vBrownBag https://twitter.com/vBrownBag

FREESCO virtual router.  Included and preconfigured with the AutoLab, but worth looking at their site too.
http://freesco.org/

FreeNAS virtual storage.  Also included and preconfigured with the AutoLab.
http://www.freenas.org/

Diagnosing a failed iSCSI switch interconnect in a vSphere environment

The beauty of a well-constructed, highly redundant environment is that if a single point fails, systems should continue to operate without issue.  Sometimes, though, knowing exactly what failed is more challenging than it first appears.  That is what I ran into recently, and I wanted to share what happened, how it was diagnosed, and how it was ultimately corrected.

A group of two EqualLogic arrays was running happily against a pair of stacked Dell PowerConnect 6224 switches, serving up a 7-node vSphere cluster.  The switches had been rebuilt over a year ago, and since that time they had been rock solid.  Suddenly, the arrays started spitting out all kinds of different errors.  Many of the messages looked similar to these:

iSCSI login to target '10.10.0.65:3260, iqn.2001-05.com.equallogic:0-8a0906-b6cc21609-d200014832f4ecfb-vmfs001' from initiator '10.10.0.10:52155, iqn.1998-01.com.vmware:esx1-70a98577' failed for the following reason:
Initiator disconnected from target during login.

Some of the earliest errors on the array looked like this:

10/1/2012 1:01:11 AM to 10/1/2012 1:01:11 AM
Warning: Member PS6000e network port cannot be reached. Unable to obtain network performance data for the member.
Warning: Member PS6100e network port cannot be reached. Unable to obtain network performance data for the member.
10/1/2012 1:01:11 AM to 10/1/2012 1:01:11 AM
Caution: Some SNMP requests to member PS6100e for disk drive information timed out.
Caution: Some SNMP requests for information about member PS6100e disk drives timed out.

VMs that had guest attached volumes were generating errors similar to this:

Subject: ASMME smartcopy from SVR001: MPIO Reconfiguration Request IPC Error – iqn.2001-05.com.equallogic:0-8a0906-bd5d27503-7ef000ed5d54a8c1-ntfs001 on host SVR001

[01:01:11] MPIO failure during reconfiguration request for target iqn.2001-05.com.equallogic:0-8a0906-476f6bd06-0c500008a0c4c41f-ntfs002 with error status 0x16000000.

[01:01:11] MPIO failure during reconfiguration request for target iqn.2001-05.com.equallogic:0-8a0906-dc0da1609-2fe0014145f4e931-ntfs001 with error status 0x80070006.

Before I had a chance to look at anything, I suspected something was wrong with the SAN switch stack, but was uncertain beyond that.  I jumped into vCenter to see if anything obvious showed up, but vSphere and all of the VMs were motoring along just like normal.  No failed uplink errors, or anything else noticeable.  I didn’t do much vSphere log fishing at this point, because everything pointed to the storage side, and I had a number of tools that could narrow down the problem.  With anything related to storage traffic, I wanted to be extra cautious and avoid making matters worse with reckless attempts at a fix.

First, some background on how EqualLogic arrays work.  Each array has two controllers, working in an active/passive arrangement.  Depending on the model of array, each controller will have between two and four ethernet ports, with each port having an IP address assigned to it.  Additionally, there is a single IP address that defines the “group” the member array is a part of.  (The Group IP is the one address used by systems looking for an iSCSI target, letting the intelligence of the arrays figure out how to distribute traffic across interfaces.)  If some of the interfaces can’t be contacted (e.g. disconnected cable, switch failure, etc.), the EqualLogic arrays are smart enough to distribute traffic across the links that remain active.
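
This is also why the vSphere hosts aim their iSCSI discovery only at the Group IP, never at the individual port IPs.  As a sketch from an ESXi shell (the adapter name and Group IP below are examples, not the addresses from this environment):

esxcli iscsi adapter discovery sendtarget add --adapter=vmhba33 --address=10.10.0.60:3260
esxcli iscsi adapter discovery sendtarget list

The array group answers discovery on that one address and redirects each login to whichever member ports it chooses.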

The ports of each EqualLogic array are connected to the stacked SAN switches in a meshed arrangement for redundancy.  If there were a switch failure, one wouldn’t be able to contact the IP addresses of the ethernet ports connected to that switch.  But using a VM with guest attached volumes (which have direct access to the SAN), I could successfully ping all four interfaces (eth0 through eth3) on each array.  Hmm…
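
That quick check from the guest amounts to nothing more than this at a Windows command prompt (the addresses are illustrative; one ping per array interface, and in a batch file you would double the percent signs):

for %i in (61 62 63 64) do ping -n 1 -w 1000 10.10.0.%i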

So then I decided to SSH into one of the arrays and perform the same test from there.  The idea was to test from a single interface IP on one array, and see if a ping would be successful to eth0 through eth3 on the other array.  The key to doing this is to use the IP of one of the individual interfaces as the source, and not the Group IP.  Controlling both the source and the target during this test will tell you a lot.  After connecting to the array via SSH, the syntax for testing the interfaces on the target array is this:

ping -I "[sourceIP] [destinationIP]"  (quotes are needed!)
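
For example, sourcing the ping from one interface on the first member and aiming it at an interface on the second (both addresses made up for illustration; substitute the real interface IPs from your group):

ping -I "10.10.0.66 10.10.0.72"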

From one of the arrays, pinging all four interfaces on the second array revealed that only two of the four ports succeeded.  But the earlier test from the VM proved that I could ping all interfaces, so I changed the source IP to one of the interfaces living on the other switch.  I performed the same test, and the opposite results occurred: the ports that failed the last test passed this one, and the ports that passed the last test now failed.  This seemed to indicate that both switches were up, but that the communication between the switches was down.

While I’ve never seen these errors on switches using stacking modules, I have seen the MPIO errors above in a trunked arrangement.  One might run into these issues more with trunking, as it leaves more room for configuration errors.  I knew that in this case, the switch configurations had not been touched for quite some time.  The status of the switches via the serial console was the following:

SANSTACK>show switch

    Management   Standby   Preconfig   Plugged-in   Switch        Code
SW  Status       Status    Model ID    Model ID     Status        Version
--- ------------ --------- ----------- ------------ ------------- ---------
1   Mgmt Sw                PCT6224     PCT6224      OK            3.2.1.3
2   Unassigned             PCT6224                  Not Present   0.0.0.0

The result above wasn’t totally surprising: if the stacking module was down, the master switch wouldn’t be able to gather information from the other switch.

Dell also has an interesting little tool called “Lasso.”  The Dell Lasso tool helps grab general diagnostic data from a variety of sources (servers, switches, storage arrays).  But in this case, I found it convenient for testing connectivity from the array group itself.  The screen capture below seems to confirm what I learned through the testing above.

[Screen capture: Lasso connectivity test run from the array group, confirming the failed paths across the switch interconnect]

So the next step was figuring out what to do about it.  I wanted to reboot/reload the slave switch, but knowing both switches were potentially passing live data, I didn’t want to do anything to compromise the traffic.  So I employed an often overlooked but convenient way of manipulating traffic to the arrays: turning off the array interfaces connected to the SAN switch that needed to be restarted.  With those interfaces down on each array, no live data passes through that switch while it is being serviced.  Be warned that you had better have a nice, accurate wiring schematic of your infrastructure so that you know which interfaces can be disabled.  You want to make things better, not worse.
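
For reference, downing a port from the EqualLogic Group Manager CLI goes roughly like this (the member name and port number are placeholders, and I am quoting the syntax from memory, so verify it against the CLI reference for your firmware; the Group Manager GUI can do the same thing):

member select PS6100E-1 eth select 0
down
exit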

After a restart of the second switch, the interconnect reestablished itself.  The interfaces on the arrays were re-enabled, and all of the errors disappeared.  I’m not entirely sure why the interconnect went down, but the primary objective was to diagnose and correct it in a safe, deliberate, yet speedy way.  No VMs went down, and the only side effects of the issue were the errors generated and some degraded performance.  Hopefully this will help you if you see similar symptoms in your environment.

Helpful Links

Dell Lasso Tool
http://www.dell.com/support/drivers/us/en/555/DriverDetails?driverId=4T3Y6&c=us&l=en&s=biz

Reworking my PowerConnect 6200 switches for my iSCSI SAN
https://vmpete.com/2011/06/26/reworking-my-powerconnect-6200-switches-for-my-iscsi-san/

Dell TechCenter.  A great resource for all things related to Dell in the enterprise.
http://en.community.dell.com/techcenter/b/techcenter/default.aspx