Side effects of upgrading VM’s to Virtual Hardware 7 in vSphere

 

New Years day, 2010 was a day of opportunity for me.  With the office closed, I jumped on the chance of upgrading my ESX cluster from 3.5 to vSphere 4.0 U1.  Those in IT know that a weekend flanked by a holiday is your friend; offering a bit more time to resolve problems, or back out entirely.  The word on the street was the upgrade to vSphere was a smooth one, but I wanted extra time, just in case.  The plan was laid out, the prep work had been done, and I was ready to pull the trigger.  The steps fell into 4 categories.

1.  Upgrade vCenter 
2.  Upgrade ESX hosts 
3.  Upgrade VM Tools and virtual hardware 
4.  Tidy up by taking care of licensing, VCB, and confirm that everything works as expected.

Pretty straightforward, right?  Well, in fact it was.  VMware should be commended for the fine job they did in making the transition  relatively easy.  I didn’t even need the 3 day weekend to do it.  Did I overlook anything?  Of course.  In particular, I didn’t pay attention to how the Virtual Hardware 7 upgrade was going to affect my systems at the application layer.

At the heart of the matter is that when the Virtual Hardware upgrade is made, VMware will effortlessly disable and hide those old NICs, add new NICs in there, and then reassign the addressing info that were on the old cards.  It’s pretty slick, but  does cause a few problems.

  1. Static IP addressing gets transferred over correctly, but the other NIC settings do not.  Do you have all but IPv4 disabled (e.g. Client for Microsoft networks, QoS, etc.) for your iSCSI connections to your SAN?  Do you have NetBIOS over TCP/IP shut off as well?  Well, after the Hardware 7 upgrade, all of those services will be turned on.  Do you have IPv6 disabled on your LAN NIC?  (no, it’s not a commentary on my dislike of IPv6, but there are many legitimate reasons to do this).  That will be turned back on.
  2. NIC binding order will reset itself to an order you probably do not want.  These affect services in a big way, especially when you factor in side effect #1.  (Please note that none of my systems were multi-homed on the LAN side.  The additional NIC’s for each VM were simply for iSCSI based storage access using a guest iSCSI initiator.)
  3. Guest iSCSI initiator settings *may* be different.  A few of the most common reactions I saw were the “Discovery” tab had duplicate entries, and the “Volumes and Devices” tab no longer had the drive letter of the guest initiated drive.  This is necessary to have in order for some services to not jump the gun too early.
  4. Duplicate reverse DNS records.  I stumbled upon this after the update based off of some errors I was seeing.  Many mysteries can occur with orphaned, duplicate reverse DNS records.  Get rid of ‘em as soon as you see them.  It won’t hurt to check your WINs service and clear that out as well, and keep an eye on those machines with DHCP reservations.
  5. In Microsoft’s Operating Systems, their network configuration subsystem generates a Global Unique Identifier (GUID) for each NIC that is partially based on the MAC address of the NIC.  This GUID may or may not be used in applications like Exchange, CRM, Sharepoint, etc.  When the NIC changes, the GUID changes.  …and services may break.

Items 1 through 4 are pretty easy to handle – even easier when you know what’s coming.  Item #5 is a total wildcard.

What’s interesting about this is that it created the kinds of problems that in many ways are the most problematic for Administrators; where you think it’s running fine, but it’s not.  Most things work long enough to make your VM snapshots no longer relevant, if you plan on the quick fix. 

Now, in hindsight, I see that some of this was documented, as much of this type of thing comes up in P2V, and V2V conversions.  However, much of it was not.  My hope is to save someone else a little heartache.  Here is what I did after each VM was upgraded to Virtual Hardware 7.

All VM’s

Removing old NICs

  1. Open up command shell option for the "run as administrator". In Shell, type set devmgr_show_nonpresent_devices=1, then hit Enter
  2. Type start devmgmt.msc then hit Enter
  3. Click View > Show hidden devices
  4. Expand Network Adapter tree
  5. Right click grayed out NICs, and click uninstall
  6. Click View > Show hidden devices to untoggle.
  7. Exit out of application
  8. type set devmgr_show_nonpresent_devices=0, then hit Enter.

Change binding order of new NICs

  1. Right click on Network, then click Properties > Manage Network Connections
  2. Rename NICs to "LAN" "iSCSI-1" and "iSCSI-2" or whatever you wish.
  3. Change binding order to have LAN NIC at the top of the list
  4. Disable IPV6 on LAN NIC
  5. For iSCSI NIC’s, disable all but TCP/IPv4. Verify static IP (w/ no gateway or DNS servers), verify "Register this connections address in DNS" is unchecked, and Disable NetBIOS over TCP/IP

Verify/Reset iSCSI initator settings

  1. Open iSCSI initiator
  2. Verify that in the Discovery tab, just one entry is in there; x.x.x.x:3260, Default Adapter, Default IP address
  3. Verify Targets and favorite Targets tab
  4. On Volumes and Devices tab, click on "Autoconfigure" to repopulate entries (clears up mangled entries on some).
  5. Click OK, and restart machine.

 DNS and DHCP

  1. Remove duplicate reverse lookup records for systems upgraded to Virtual Hardware 7
  2. For systems that have DHCP reserved addresses, jump into your DHCP manager, and modify as needed.

Exchange 2007 Server

Exchange seemed operational at first, but the more I looked at the event logs, the more I realized I needed to do some clean up

After performing the tasks under the “All VMs” fix, most issues went away.  However, one stuck around.  Because of the GUID issue, if your Exchange Server is running the transport server role, it will flag you with Event ID: 205 errors.  It is still looking for the old GUID.  Here is what to do.

First, determine status of the NICs

[PS] C:\Windows\System32>get-networkconnectioninfo

Name : Intel(R) PRO/1000 MT Network Connection #4
DnsServers : {192.168.0.9, 192.168.0.10}
IPAddresses : {192.168.0.20}
AdapterGuid : 5ca83cae-0519-43f8-adfe-eefca0f08a04
MacAddress : 00:50:56:8B:5F:97

Name : Intel(R) PRO/1000 MT Network Connection #5
DnsServers : {}
IPAddresses : {10.10.0.193}
AdapterGuid : 6d72814a-0805-4fca-9dee-4bef87aafb70
MacAddress : 00:50:56:8B:13:3F

Name : Intel(R) PRO/1000 MT Network Connection #6
DnsServers : {}
IPAddresses : {10.10.0.194}
AdapterGuid : 564b8466-dbe2-4b15-bd15-aafcde21b23d
MacAddress : 00:50:56:8B:2C:22

Then get the transport server info

[PS] C:\Windows\System32>get-transportserver | ft -wrap Name,*DNS*

image 

Then, set the transport server info correctly

set-transportserver SERVERNAME -ExternalDNSAdapterGuid 5ca83cae-0519-43f8-adfe-eefca0f08a04

set-transportserver SERVERNAME -InternalDNSAdapterGuid 5ca83cae-0519-43f8-adfe-eefca0f08a04

 

Sharepoint Server

Thank heavens we are just in the pre-deployment stages for Sharepoint.  Our Sharepoint Consultant asked what I did to mess up the Central Administration site, as it could no longer be accessed (the other sites were fine however).  After a bizarre series of errors, I thought it would be best to restore it from snapshot and test what was going on.  The virtual hardware upgrade definitely broke the connection, but so did removing the existing NIC, and adding another one.   As of now, I can’t determine that it is in fact a problem with the NIC GUID, but it sure seems to be.   My only working solution in the time allowed was to keep the server at Hardware level 4, and build up a new Sharepoint Front End Server. 

One might question why, with the help of vCenter,  a MAC address can’t be forced on the server.   Even though one is able to get the last 12 characters of the GUID (representing the MAC address) the first part is different.  It makes sense because the new device is different.  The applications care about the GUID as a whole, not just the MAC address.

Here is how you can find the GUID of the system’s NIC in question.  Run this BEFORE you perform a virtual hardware update, and save it for future reference in case you run into problems.  Also make note of where it exists in the registry.  It’s not a solution to the issue I had with Sharepoint, but its worth knowing about.

C:\Users\administrator.DOMAINNAME>net config rdr
Computer name
\\SERVERNAME
Full Computer name SERVERNAME.domainname.lan
User name Administrator
Workstation active on
NetbiosSmb (000000000000)
NetBT_Tcpip_{56BB9E44-EA93-43C3-B7B3-88DD478E9F73} (0050568B60BE)
Software version Windows Server (R) 2008 Standard
Workstation domain DOMAINNAME
Workstation Domain DNS Name domainname.lan
Logon domain DOMAINNAME
COM Open Timeout (sec) 0
COM Send Count (byte) 16
COM Send Timeout (msec) 250

 

 In hindsight…

Let me be clear that there is really not much that VMWare can do about this.  The same troubles would occur on a physical machine if you needed to change out Network cards.  The difference is, that it’s not done as easily, as often, or as transparently as on a VM.

If I were to do it over again (and surely I will the day when VMware releases a major upgrade again), I would have done a few things different.

  1. Note all existing application errors and warnings on each server prior to the upgrade, just so I don’t try to ponder if that warning I’m starring at had existed before the upgrade.
  2. Note those GUID’s before you upgrade.  You could always capture it after restoring from a snapshot if you do run into problems, but save yourself a little time and get this down on paper ahead of time.
  3. Take the virtual hardware upgrade slowly.  After everything else went pretty smooth, I was in a “get it done” mentality.  Although the results were not  catastrophic, I could have done better minimizing the issues.
  4. Keep the snapshots around at least the length of your scheduled maintenance window.  It’s not a get-out-of-jail card, but if you have the weekend or off-hours to experiment, it offers you a good tool to do so.

This has also helped me in the decision of taking a very conservative approach to implementing the new VMXNet3 NIC driver to existing VMs.  I might simply update my templates and only deploy them on new systems, or systems that don’t run services that rely on the NIC’s GUID.

One final note.  “GUID” can be many things depending on the context, and may be referenced in different ways (UUID, SID, etc).  Not all GUID’s are NIC GUID’s.  The term can be used quite loosely in various subject matters.   What does this mean to you?  It means that it makes searching the net pretty painful at times.

Interesting links:

A simple powershell script to get the NIC GUID
http://pshscripts.blogspot.com/2008/07/get-guidps1.html

Follow

Get every new post delivered to your Inbox.

Join 870 other followers