Replication with an EqualLogic SAN; Part 1

 

Behind every great virtualized infrastructure is a great SAN to serve everything up.  I’ve had the opportunity to work with the Dell/EqualLogic iSCSI array for a while now, taking advantage of all of the benefits that the iSCSI-based SAN array offers.  One feature that I haven’t been able to use is the built-in replication feature.  Why?  I only had one array, and I didn’t have a location offsite to replicate to.

I suppose the real “part 1” of my replication project was selling the idea to the Management Team.  When it came to protecting our data and the systems that help generate that data, it didn’t take long for them to realize it wasn’t a matter of what we could afford, but how much we could afford to lose.  Having a building less than a mile away burn to the ground also helped the proposal.  On to the fun part; figuring out how to make all of this stuff work.

Of the many forms of replication out there, the most obvious one for me to start with is native SAN to SAN replication.  Why?  Well, it’s built right into the EqualLogic PS arrays, with no additional components to purchase, or license keys or fees to unlock features.  Other solutions exist, but it was best for me to start with the one I already had.

For companies with multiple sites, replication using EqualLogic arrays seems pretty straightforward.  For a company with nothing more than a single site, there are a few more steps that need to occur before replication can start.

 

Decision:  Colocation, or hosting provider

One of the first decisions that had to be made was whether we wanted our data to be replicated to a Colocation (CoLo) with equipment that we owned and controlled, or to a hosting provider that could provide native PS array space and replication abilities.  Most hosting providers charge based on some mix of metering of the data replicated.  Accurately estimating your replication costs assumes you have a really good understanding of how much data will be replicated.  Unfortunately, this is difficult to know until you start replicating.  The pricing models of these hosting providers reminded me too much of a cab fare; never knowing what you are going to pay until you get the big bill when you are finished.  A CoLo with equipment that we owned fit with our current and future objectives much better.  We wanted fixed costs, and the ability to eventually do some hosting of critical services at the CoLo (web, ftp, mail relay, etc.), so it was an easy decision for us.

Our decision was to go with a CoLo facility located in the Westin Building in downtown Seattle.  Home to the Seattle Internet Exchange (SIX), this is an impressive facility not only in its physical infrastructure, but in how it provides peered interconnects directly from one ISP to another.  Our ISP uses this facility, so it worked out well to have our CoLo there as well.

 

Decision:  Bandwidth

Bandwidth requirements for our replication were, and still are, unknown, but I knew our bonded T1s probably weren’t going to be enough, so I started exploring other options for higher speed access.  The first thing to check was whether we qualified for Metro-E or “Ethernet over Copper” (award winner for the dumbest name ever).  Metro-E removes the element of T-carrier lines along with any proprietary signaling, and provides internet access or point-to-point connections at Layer 2, instead of Layer 3.  We were not close enough to the carrier’s central office to get adequate bandwidth, and even if we were, it probably wouldn’t scale up to our future needs.

Enter QMOE, or Qwest Metro Optical Ethernet.  This solution feeds Layer 2 Ethernet to our building via fiber, offering high bandwidth and low latency that can be scaled easily.

Our first foray using QMOE is running a 30 Mbps point-to-point feed to our CoLo, uplinked to the Internet.  If we need more later, there is no need to add or change equipment.  Just have them turn up the dial, and bill us accordingly.
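Since the amount of data to replicate is still unknown, a bit of back-of-the-envelope arithmetic helps frame what a link can carry.  The sketch below is only an illustration: the 20 GB delta, the count of two bonded T1s, and the 80% efficiency factor are all assumptions, not measurements from our environment.

```python
def transfer_hours(data_gb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Hours needed to move data_gb (decimal gigabytes) over a link_mbps link.

    efficiency is an assumed fraction of line rate left over after
    protocol overhead; 0.8 is a guess, not a measured value.
    """
    bits = data_gb * 8 * 1000**3
    usable_bits_per_second = link_mbps * 1000**2 * efficiency
    return bits / usable_bits_per_second / 3600


if __name__ == "__main__":
    links = [
        ("2 bonded T1s (assumed count)", 2 * 1.544),  # a T1 carries 1.544 Mbps
        ("30 Mbps QMOE feed", 30.0),
    ]
    for label, mbps in links:
        hours = transfer_hours(20, mbps)  # 20 GB is a hypothetical nightly delta
        print(f"{label}: {hours:.1f} hours for 20 GB")
```

Plugging in your own daily change rate, once replication is running and you can measure it, gives a quick check on whether the circuit needs to be turned up.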

 

Decision:  Topology

Topology planning has been interesting to say the least.  The best decision here depends on the use-case, and let’s not forget, what’s left in the budget.

Two options immediately presented themselves.

1.  Replication data from our internal SAN would be routed (Layer 3) to the SAN at the CoLo.

2.  Replication data  from our internal SAN would travel by way of a VLAN to the SAN at the CoLo.

If my need was only to send replication data to the CoLo, I could take advantage of that Layer 2 connection, and send replication data directly to the CoLo, without it being routed.  This would mean that it would have to bypass any routers/firewalls in place, and run to the CoLo on its own VLAN.

The QMOE network is built off of Cisco equipment, so in order to utilize any VLANing from the CoLo to the primary facility, you must have Cisco switches that will support their VLAN trunking protocol (VTP).  I don’t have the proper equipment for that right now.

In my case, here is a very simplified illustration as to how the two topologies would look:

Routed Topology

image

 

Topology using VLANs

image

Routing the traffic may introduce more overhead and reduce effective throughput.  This is where a WAN optimization solution could come into play.  These solutions (SilverPeak, Riverbed, etc.) appear to be extremely good at improving effective throughput across many types of WAN connections.  These of course must sit at the correct spot in the path to the destination.  The units are often priced on bandwidth speed, and while they are very effective, are also quite an investment.  But they work at Layer 3, and must sit in between the source and a router at both ends of the communication path; something that wouldn’t exist on a Metro-E circuit where VLANing was used to transmit replicated data.

The result is that for right now, I have chosen to go with a routed arrangement with no WAN optimization.  This does not differ too much from a traditional WAN circuit, other than my latencies should be much better.  The next step if our needs are not sufficiently met would be to invest in a couple of Cisco switches, then send replication data over its own VLAN to the CoLo, similar to the illustration above.

 

The equipment

My original SAN array is an EqualLogic PS5000e connected to a couple of Dell PowerConnect 5424 switches.  My new equipment closely mirrors this, but is slightly better: an EqualLogic PS6000e and two PowerConnect 6224 switches.  Since both items will scale a bit better, I’ve decided to change out the existing array and switches with the new equipment.

 

Some Lessons learned so far

If you are changing ISPs, and your old ISP has authoritative control of your DNS zone files, make sure your new ISP has the zone file EXACTLY the way you need it.  Then confirm it one more time.  Spelling errors and omissions in DNS zone files don’t work out very well, especially when you factor in the time it takes for the corrections to propagate through the net.  (Usually up to 72 hours, but it can feel like a lifetime when your customers can’t get to your website.)

If you are going to go with a QMOE or Metro-E circuit, be mindful that you might have to force the external interface on your outermost equipment (in our case, the firewall/router, but it could be a managed switch as well) to negotiate to 100 Mbps full duplex.  Auto negotiation apparently doesn’t work too well on many Metro-E implementations, and can cause fragmentation that will reduce your effective throughput by quite a bit.  This is exactly what we saw.  Fortunately it was an easy fix.

 

Stay tuned for what’s next…

Using OneNote in IT

 

It’s hard to believe that as an IT administrator, one of my favorite applications I use is one of the least technical.  Microsoft created an absolutely stellar application when they created OneNote.  If you haven’t used it, you should.

Most IT Administrators have high expectations of themselves.  Somehow we expect to remember pretty much everything.  Deployment planning, research, application specific installation steps and issues.  Information gathering for troubleshooting, and documenting as-built installations.  You might have information that you work with every day, and think “how could I ever forget that?” (you will), along with that obscure, required setting on your old phone system that hasn’t been looked at in years.

The problem is that nobody can remember everything. 

After years of using my share of spiral binders, backs of print outs, and Post-It notes to gather and manage systems and technologies, I’ve realized a few things.  1.)  I can’t read my own writing.  2.)  I never wrote enough down for the information to be valuable.  3.)  What I can’t fit on one physical page, I squeeze in on another page that makes no sense at all.  4.)  The more I have to do, the more I tried (and failed) to figure out a way to file it.  5.)  These notes eventually became meaningless, even though I knew I kept them for a reason.  I just couldn’t remember why.

Do you want to make a huge change in how you work?   Read on.

OneNote was first adopted by our Sales team several years ago, and while I knew what it was, I never bothered to use it for real IT projects until late in 2007, when a colleague of mine (thanks Glenn if you are reading) suggested that it was working well for him and his IT needs.  Ever since then, I wonder how I ever worked without it.

If you aren’t familiar with OneNote, there isn’t too much to understand.  It’s an electronic Notebook. 

image

It’s arranged just as you’d expect a real notebook to be.  The left side represents notebooks, the top area of tabs represents sections or earmarks, and the right side represents the pages in a notebook.  It’s that easy.  Just like its physical counterpart, its free-form formatting allows you to place objects anywhere on a page (goodbye MS Word).

What has transpired since my experiment to use OneNote is how well it tackles every single need I have in information gathering and mining of that data after the fact.  Here are some examples.

Long term projects and Research

What better time to try out a new way of working on one of the biggest projects I’ve had to tackle in years, right?  Virtualizing my infrastructure was a huge undertaking, and I had what seemed like an infinite amount of information to learn in a very short period of time, under all different types of subject matters.  In a Notebook called “Virtualization” I had sections that narrowed subject matters down to things like ESX, SAN array, Blades, switchgear, UPS, etc.  Each one of those sections had pages (at least a few dozen for the ESX section, as there was a lot to tackle) that were specific subject matters of information I needed to gather to learn about, or to keep for reference.  Links, screen captures, etc.  I dumped everything in there, including my deployment steps before, during, and after.

 

Procedures

Our Linux code compiling machines have very specific package installations and settings that need to be set before deployment.  OneNote works great for this.  The no-brainer checkboxes offer nice clarity.

image

If you maintain different flavors of Unix or various distributions of Linux, you know how much the syntax can vary.  OneNote helps keep your sanity.  With so many Windows products going the way of PowerShell, you’d better have your command line syntax down for that too.

This has also worked well with backend installations.  My installations of VMware, SharePoint, Exchange, etc. have all been documented this way.  It takes just a bit longer, but is invaluable later on.  Below is a capture of part of my cutover plan from Exchange 2003 to Exchange 2007.

image

Migrations and Post migration outstanding issues

After big migrations, you have to be on your toes to address issues that are difficult to predict.  OneNote has allowed me to use a simple ISSUE/FIX approach.  So, in an “Apps” notebook, under an “E2007 Migration” section, I might have a page called “Postfix” and it might look something like this.

image

You can label these pages “Outstanding issues” or as I did for my ESX 3.5 to vSphere migration, “Postfix” pages.

image

As-builts

Those in the Engineering/Architectural world are quite familiar with As-built drawings.  Those are drawings that reflect how things were really built.  Many times in IT, deployment plans and documentation never go further than the day you deploy it.  OneNote allows for an easy way to turn that deployment plan into a living copy, or as-built configuration of the product you just deployed.  Configurations are as dynamic as the technologies that power them.  It’s best to know what sort of monster you created, and how to recreate it if you need to.

 

Daily issues (fire fighting)

Emergencies, impediments, fires, or whatever you’d like to call them, come up all the time.  I’ve found OneNote to be most helpful in two specific areas on this type of task.  I use it as a quick way to gather data on an issue that I can look at later (copying and pasting screenshots and URLs into OneNote), and for comparing the current state of a system against past configurations.  Both ways help me solve the problems more quickly.

Searching text in bitmapped screen captures

One of the really interesting things about OneNote is that you can paste a screen capture of, say, a dialog box into the notebook, and when you search later for a keyword, it will include those bitmaps in the search results!  Below is one of the search results OneNote pulled up when I searched for “KDC”.  This was a screen capture sitting in OneNote.  Neat.

image

 

Goodbye Browser Bookmarks

How much time have you spent trying to organize your web browser bookmarks or favorites, only to never look at them again, or to wonder why you bookmarked something?  It’s an exercise in futility.  No more!  Toss them all away.  Paste those links into the various locations in OneNote (where the subject matter is applicable), enter a brief little description on top of each one, and you can always find it later when searching for it.

 

Summary

I won’t ever go without using OneNote for projects large or small again.  It is right next to my email as my most used application.  OneNote users tend to be a loyal bunch, and after a few years of using it, I can see why.  At about $80 retail, you can’t go wrong.  And, lucky for you, it will be included in all versions of Office 2010.

Additional Links

New features coming in OneNote 2010
http://blogs.msdn.com/descapa/archive/2009/07/15/overview-of-onenote-2010-what-s-new-for-you.aspx

Using OneNote with SharePoint
http://blogs.msdn.com/mcsnoiwb/archive/2008/12/03/onenote-and-sharepoint-the-basics.aspx 

Interesting tips and tricks with OneNote
http://blogs.msdn.com/onenotetips/

Comparing Nehalem and Harpertown running vSphere in a production environment

 

The good press that Intel’s Nehalem chip and underlying architecture have been receiving lately gave me pretty good reason to be excited for the arrival of two Dell M610 blades based on the Nehalem chipset.  I really wanted to know how they were going to stack up against my Dell M600s (running Harpertown chips).  So I thought I’d do some side-by-side comparisons in a real world environment.  It was also an opportunity to put some 8 vCPU VMs to the test under vSphere.

First, a little background information.  The software my company produces runs on just about every version of Windows, Linux, and Unix there is.  We have to compile and validate (exercise) those builds on every single platform.  The majority of our customers run under Windows and Linux, so the ability to virtualize our farm of Windows and Linux build machines was a compelling argument in my case for our initial investment.

Virtualizing build/compiler machines is a great way to take advantage of your virtualized infrastructure.  What seems odd to me though is that I never read about others using their infrastructure in this way.  Our build machines are critical to us.  Ironically, they’d often be running on old leftover systems.  Now that they are virtualized, we are letting those physical machines do nothing but exercise and validate the builds.  Unfortunately, we cannot virtualize our exerciser machines because of our reliance on GPUs from the physical machine’s video cards in our validation routines.

Our Development Team has also invested heavily in Agile and Scrum principles.  One of the hallmarks of that is Test Driven Development (TDD).  Short development cycles, and the ability for each developer to compile and test their changes, allow for more aggressive programming, producing more dramatic results.

How does this relate?  Our Developers need build machines that are as fast as possible.  Unlike so many other applications, their compilers actually can use every processor you give them (some better than others, as you will see).  This meant that many Developer machines were being over spec’d, because we’d use them as a build machine as well as the Developer’s primary workstation.  This worked, but you could imagine the disruption that occurred when a Developer’s machine was scheduled to be upgraded or modified in any way (read:  angry Developer gets territorial over their system, even though YOU are the IT guy).  Plus, we typically spent more for desktop workstations than necessary because of the needed horsepower for these systems performing dual roles.

Two recent advancements have allowed me to deliver on my promises to leverage our virtualized infrastructure for our build processes:  vSphere’s improved co-scheduler (along with support for 8 vCPUs), and Intel’s Nehalem chip.  Let’s see how the improvements pan out.

Hardware tested

  • Dell PowerEdge M600 (Harpertown).  Dual chip, quad core Intel E5430 (2.66 GHz).  32GB RAM
  • Dell PowerEdge M610 (Nehalem).  Dual chip, quad core Intel x5550 (2.66 GHz).  32GB RAM

 

Software & VM’s and applications tested

  • vSphere Enterprise Plus 4.0 Update 1
  • VM:  Windows XP x64.  2GB RAM.  4 vCPUs.  Visual Studio 2005*
  • VM:  Windows XP x64.  2GB RAM.  8 vCPUs.  Visual Studio 2005*
  • VM:  Ubuntu 8.04 x64.  2GB RAM.  4 vCPUs.  Cmake
  • VM:  Ubuntu 8.04 x64.  4GB RAM**.  8 vCPUs.  Cmake

*I wanted to test Windows 7 and Visual Studio 2008, which is said to be better at multithreading, but ran out of time.

** 8vCPU Linux VM was bumped up to 4GB of RAM to eliminate some swapping errors I was seeing, but it never used more than about 2.5 GB during the build run.

 

Testing scenarios

My goals for testing were pretty straightforward:

  • Compare full-build times for VMs running on hosts with Harpertown chips against the same VMs running on hosts with Nehalem chips
  • Compare performance of builds when I changed the number of vCPUs assigned to a VM
  • Observe how well each compiler on each platform handled multithreading

I limited observations to full build runs, as incremental builds don’t lend themselves well to using multiple threads.

I admit that my testing methods were far from perfect.  I wish I could have sampled more data to come up with more solid numbers, but these were production build systems, and the situation dictated that I not interfere too much with our build processes just for my own observations.  My focus is mostly on CPU performance in real world scenarios.  I monitored other resources such as disk I/O and memory just to make sure they were not inadvertently affecting the results beyond my real world allowances.

The numbers

Each test run shows two graphs.  The line graph shows total CPU utilization as a percentage of what is available to the VM.  The stacked line graph shows the number of CPU cycles in MHz used by each vCPU.

Each testing scenario shows the time in minutes to complete.

Configuration          Windows XP64    Linux x64
2 vCPU Nehalem         41              N/A
4 vCPU Harpertown      32              38
4 vCPU Nehalem         27              32
8 vCPU Nehalem         32              8.5
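
To put the table into percentages, here is a small sketch that computes how much faster (or slower) one configuration was than another.  The numbers are copied from the table as printed; the function and variable names are just ones I made up for the illustration.

```python
# Full-build times in minutes, copied from the table above.
# None marks a combination that was not tested.
BUILD_MINUTES = {
    "2 vCPU Nehalem": {"Windows XP64": 41, "Linux x64": None},
    "4 vCPU Harpertown": {"Windows XP64": 32, "Linux x64": 38},
    "4 vCPU Nehalem": {"Windows XP64": 27, "Linux x64": 32},
    "8 vCPU Nehalem": {"Windows XP64": 32, "Linux x64": 8.5},
}


def percent_faster(baseline_min, new_min):
    """Percent reduction in build time; negative means the build got slower."""
    return (baseline_min - new_min) / baseline_min * 100


def compare(baseline, candidate):
    """Per-guest percent change between two rows of the table."""
    result = {}
    for guest, base in BUILD_MINUTES[baseline].items():
        new = BUILD_MINUTES[candidate][guest]
        if base is not None and new is not None:
            result[guest] = percent_faster(base, new)
    return result


if __name__ == "__main__":
    pairs = [
        ("4 vCPU Harpertown", "4 vCPU Nehalem"),
        ("4 vCPU Nehalem", "8 vCPU Nehalem"),
    ]
    for base, cand in pairs:
        for guest, pct in compare(base, cand).items():
            print(f"{base} -> {cand}, {guest}: {pct:+.1f}%")
```

A negative percentage means the build took longer, which is what the Windows guest shows when going from 4 to 8 vCPUs.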

 

VM #1  WinXP64.  4 vCPU.  2GB RAM.  Visual Studio 2005.
Harpertown chipset (E5430)
Full build:  33 minutes

 01-tpb004-4vcpu-m600-cpu

 02-tpb004-4vcpu-m600-cpustacked

VM #2 WinXP64.  4 vCPU.  2GB RAM.  Visual Studio 2005
Nehalem chipset (x5550)
Full build:  27 minutes

01-tpb004-4vcpu-m610-cpu

 02-tpb004-4vcpu-m610-cpustacked

VM #3 WinXP64.  8 vCPU.  2GB RAM.  Visual Studio 2005.
Nehalem chipset (x5550)
Full build:  32 minutes

01-tpb004-8vcpu-m610-cpu

02-tpb004-8vcpu-m610-cpustacked

VM #4 WinXP64.  2 vCPU.  2GB RAM.  Visual Studio 2005.
Nehalem chipset (x5550)
Full build:  41 minutes

01-tpb004-2vcpu-m610-cpu

 02-tpb004-2vcpu-m610-cpustacked

VM #5 Ubuntu 8.04 x64.  4 vCPU.  2GB RAM.  Cmake.
Harpertown chipset (E5430)
Full build:  38 minutes

(no graphs available.  My dog ate ‘em.)

 

VM #6 Ubuntu 8.04 x64.  4 vCPU.  2GB RAM.  Cmake.
Nehalem chipset (x5550)
Full build:  32 minutes

01-tpb002-4vcpu-m610-cpu

02-tpb002-4vcpu-m610-cpustacked

VM #7 Ubuntu 8.04 x64.  8 vCPU.  4GB RAM.  Cmake.
Nehalem chipset (x5550)
Full build:  8.5 minutes  (note:  disregard first blip of data on chart)

01-tpb002-8vcpu-m610-cpu

02-tpb002-8vcpu-m610-cpustacked

Notice the tremendous multithreading performance of the build process under Ubuntu 8.04 (x64)!  Utilization is remarkably even for each vCPU and thread, which is best observed on the stacked graph charts, where the higher it is stacked, the better it is using all vCPUs available.  Windows and its compiler were not nearly as good, actually becoming less efficient when I moved from 4 vCPUs to 8 vCPUs.  The build times reflect this.

A few other things I noticed along the way…

Unlike the old E5430 hosts, Hyper-Threading is possible on the x5550 hosts, and according to VMware’s documentation, is recommended.  Whether it actually improves performance is subject to some debate, as found here.

If you want to VMotion VMs between your x5550 and your E5430 based hosts, you will need to turn on EVC mode in vCenter.  You can do this in the cluster settings section of vCenter.  According to Intel and VMware, you won’t be dumbing down or hurting the performance of your new hosts.

My Dell M610 blades (Nehalem) had the Virtualization Technology toggle turned off in the BIOS.  This was the same as my M600’s (Harpertown).  Why this is the default is beyond me, especially on a blade.  Remember to set that before you even start installing vSphere.

For Windows VMs, remember that the desktop OSes are limited to what they see as two physical sockets.  By default, each vCPU is presented to the guest as one single-core processor in its own socket.  To utilize more than just 2 vCPUs on those VMs, set the “cpuid.corespersocket” setting in the settings of the VM.  More details can be found here.
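
The socket arithmetic is easy to get wrong, so here is a minimal sketch of it.  It is a simplified model that only assumes what is described above (a two-socket limit in the desktop guest, and vCPUs grouped into sockets by the cores-per-socket value); it says nothing about how Windows actually schedules work, and the function name is hypothetical.

```python
def usable_vcpus(vcpus: int, cores_per_socket: int = 1, socket_limit: int = 2) -> int:
    """vCPUs a socket-limited guest can use under this simplified model.

    vcpus            -- vCPUs assigned to the VM
    cores_per_socket -- cores presented per virtual socket (1 is the default
                        behavior described in the post)
    socket_limit     -- sockets the guest OS will use (2 for the desktop
                        Windows guests discussed here)
    """
    if cores_per_socket < 1 or vcpus % cores_per_socket:
        raise ValueError("vcpus must be a positive multiple of cores_per_socket")
    sockets = vcpus // cores_per_socket
    return min(sockets, socket_limit) * cores_per_socket


if __name__ == "__main__":
    for vcpus, cores in [(4, 1), (4, 2), (8, 1), (8, 4)]:
        print(f"{vcpus} vCPUs, {cores} core(s) per socket -> "
              f"{usable_vcpus(vcpus, cores)} usable")
```

Under this model an 8 vCPU desktop guest needs four cores per socket before it can touch all eight vCPUs.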

Conclusion
I’ve observed nice performance gains using the hosts with the Nehalem chips.  15 to 20% from my small samples.  However, my very crude testing has not revealed improvements as noted in various posts suggesting that a single vCPU VM running on a Nehalem chip would be nearly equal to that of a 2 vCPU VM on a Harpertown chip (see here).  This is not to say that it can’t happen.  I just haven’t seen that yet.

I was impressed by how well and how evenly the compilers running on a Linux VM handled multithreading, versus their Windows counterpart.  So were the Developers, who saw the 8.5 minute build time as good or better than any physical system we have in the office.  But make no mistake, if you are running a VM with 8 vCPUs on a host with 8 cores, and it’s able to use all of those 8 vCPUs, you won’t be getting something for nothing.  Your ESX host will be nearly pegged for those times it’s running full tilt, and other VMs will suffer.  This was the reason behind our purchase of additional blades.

Side effects of upgrading VM’s to Virtual Hardware 7 in vSphere

 

New Year’s Day, 2010 was a day of opportunity for me.  With the office closed, I jumped on the chance of upgrading my ESX cluster from 3.5 to vSphere 4.0 U1.  Those in IT know that a weekend flanked by a holiday is your friend; offering a bit more time to resolve problems, or back out entirely.  The word on the street was the upgrade to vSphere was a smooth one, but I wanted extra time, just in case.  The plan was laid out, the prep work had been done, and I was ready to pull the trigger.  The steps fell into 4 categories.

1.  Upgrade vCenter 
2.  Upgrade ESX hosts 
3.  Upgrade VM Tools and virtual hardware 
4.  Tidy up by taking care of licensing, VCB, and confirm that everything works as expected.

Pretty straightforward, right?  Well, in fact it was.  VMware should be commended for the fine job they did in making the transition  relatively easy.  I didn’t even need the 3 day weekend to do it.  Did I overlook anything?  Of course.  In particular, I didn’t pay attention to how the Virtual Hardware 7 upgrade was going to affect my systems at the application layer.

At the heart of the matter is that when the Virtual Hardware upgrade is made, VMware will effortlessly disable and hide those old NICs, add new NICs in there, and then reassign the addressing info that was on the old cards.  It’s pretty slick, but does cause a few problems.

  1. Static IP addressing gets transferred over correctly, but the other NIC settings do not.  Do you have all but IPv4 disabled (e.g. Client for Microsoft networks, QoS, etc.) for your iSCSI connections to your SAN?  Do you have NetBIOS over TCP/IP shut off as well?  Well, after the Hardware 7 upgrade, all of those services will be turned on.  Do you have IPv6 disabled on your LAN NIC?  (no, it’s not a commentary on my dislike of IPv6, but there are many legitimate reasons to do this).  That will be turned back on.
  2. NIC binding order will reset itself to an order you probably do not want.  These affect services in a big way, especially when you factor in side effect #1.  (Please note that none of my systems were multi-homed on the LAN side.  The additional NIC’s for each VM were simply for iSCSI based storage access using a guest iSCSI initiator.)
  3. Guest iSCSI initiator settings *may* be different.  A few of the most common reactions I saw were the “Discovery” tab had duplicate entries, and the “Volumes and Devices” tab no longer had the drive letter of the guest initiated drive.  This is necessary to have in order for some services to not jump the gun too early.
  4. Duplicate reverse DNS records.  I stumbled upon this after the update based off of some errors I was seeing.  Many mysteries can occur with orphaned, duplicate reverse DNS records.  Get rid of ‘em as soon as you see them.  It won’t hurt to check your WINS service and clear that out as well, and keep an eye on those machines with DHCP reservations.
  5. In Microsoft’s Operating Systems, the network configuration subsystem generates a Globally Unique Identifier (GUID) for each NIC that is partially based on the MAC address of the NIC.  This GUID may or may not be used in applications like Exchange, CRM, SharePoint, etc.  When the NIC changes, the GUID changes.  …and services may break.

Items 1 through 4 are pretty easy to handle – even easier when you know what’s coming.  Item #5 is a total wildcard.

What’s interesting about this is that it created the kinds of problems that in many ways are the most problematic for Administrators; where you think it’s running fine, but it’s not.  Most things work long enough to make your VM snapshots no longer relevant, if you plan on the quick fix. 

Now, in hindsight, I see that some of this was documented, as much of this type of thing comes up in P2V, and V2V conversions.  However, much of it was not.  My hope is to save someone else a little heartache.  Here is what I did after each VM was upgraded to Virtual Hardware 7.

All VM’s

Removing old NICs

  1. Open a command prompt using the "Run as administrator" option. At the prompt, type set devmgr_show_nonpresent_devices=1, then hit Enter
  2. Type start devmgmt.msc then hit Enter
  3. Click View > Show hidden devices
  4. Expand Network Adapter tree
  5. Right click grayed out NICs, and click uninstall
  6. Click View > Show hidden devices to untoggle.
  7. Exit out of application
  8. type set devmgr_show_nonpresent_devices=0, then hit Enter.

Change binding order of new NICs

  1. Right click on Network, then click Properties > Manage Network Connections
  2. Rename NICs to "LAN" "iSCSI-1" and "iSCSI-2" or whatever you wish.
  3. Change binding order to have LAN NIC at the top of the list
  4. Disable IPV6 on LAN NIC
  5. For iSCSI NICs, disable all but TCP/IPv4. Verify static IP (w/ no gateway or DNS servers), verify "Register this connection's addresses in DNS" is unchecked, and disable NetBIOS over TCP/IP

Verify/Reset iSCSI initiator settings

  1. Open iSCSI initiator
  2. Verify that in the Discovery tab, just one entry is in there; x.x.x.x:3260, Default Adapter, Default IP address
  3. Verify the Targets and Favorite Targets tabs
  4. On Volumes and Devices tab, click on "Autoconfigure" to repopulate entries (clears up mangled entries on some).
  5. Click OK, and restart machine.

 DNS and DHCP

  1. Remove duplicate reverse lookup records for systems upgraded to Virtual Hardware 7
  2. For systems that have DHCP reserved addresses, jump into your DHCP manager, and modify as needed.
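
If you have a lot of records to eyeball, a few lines of script can flag the suspects.  This is only a sketch: it assumes you have already exported your reverse lookup zone to a list of (IP address, hostname) pairs by whatever means you prefer, and the sample records below are made up.

```python
from collections import defaultdict


def duplicate_ptrs(records):
    """Return {ip: [hostnames]} for every IP with more than one PTR target.

    records is an iterable of (ip, hostname) pairs.  Hostnames are compared
    case-insensitively and without a trailing dot.
    """
    hosts_by_ip = defaultdict(set)
    for ip, hostname in records:
        hosts_by_ip[ip].add(hostname.lower().rstrip("."))
    return {ip: sorted(hosts) for ip, hosts in hosts_by_ip.items() if len(hosts) > 1}


if __name__ == "__main__":
    # Hypothetical export of a reverse lookup zone.
    sample = [
        ("192.168.0.20", "mail.domainname.lan."),
        ("192.168.0.20", "oldbox.domainname.lan"),
        ("192.168.0.21", "web.domainname.lan"),
    ]
    for ip, hosts in duplicate_ptrs(sample).items():
        print(f"{ip} has multiple PTR records: {', '.join(hosts)}")
```

It only reports; deciding which record is the orphan is still up to you.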

Exchange 2007 Server

Exchange seemed operational at first, but the more I looked at the event logs, the more I realized I needed to do some clean up.

After performing the tasks under the “All VMs” fix, most issues went away.  However, one stuck around.  Because of the GUID issue, if your Exchange Server is running the transport server role, it will flag you with Event ID: 205 errors.  It is still looking for the old GUID.  Here is what to do.

First, determine status of the NICs

[PS] C:\Windows\System32>get-networkconnectioninfo

Name : Intel(R) PRO/1000 MT Network Connection #4
DnsServers : {192.168.0.9, 192.168.0.10}
IPAddresses : {192.168.0.20}
AdapterGuid : 5ca83cae-0519-43f8-adfe-eefca0f08a04
MacAddress : 00:50:56:8B:5F:97

Name : Intel(R) PRO/1000 MT Network Connection #5
DnsServers : {}
IPAddresses : {10.10.0.193}
AdapterGuid : 6d72814a-0805-4fca-9dee-4bef87aafb70
MacAddress : 00:50:56:8B:13:3F

Name : Intel(R) PRO/1000 MT Network Connection #6
DnsServers : {}
IPAddresses : {10.10.0.194}
AdapterGuid : 564b8466-dbe2-4b15-bd15-aafcde21b23d
MacAddress : 00:50:56:8B:2C:22

Then get the transport server info

[PS] C:\Windows\System32>get-transportserver | ft -wrap Name,*DNS*

image 

Then, set the transport server info correctly

set-transportserver SERVERNAME -ExternalDNSAdapterGuid 5ca83cae-0519-43f8-adfe-eefca0f08a04

set-transportserver SERVERNAME -InternalDNSAdapterGuid 5ca83cae-0519-43f8-adfe-eefca0f08a04
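
If you have several servers to fix, picking the right GUID out of that output by eye gets old.  Here is a rough sketch that parses text in the shape shown above and builds the two commands.  It assumes the LAN adapter is the only one with DNS servers listed (true in my output, since the iSCSI NICs have none), and SERVERNAME is a placeholder just as it is above.

```python
import re

# Trimmed copy of the Get-NetworkConnectionInfo output shown above.
SAMPLE = """\
Name : Intel(R) PRO/1000 MT Network Connection #4
DnsServers : {192.168.0.9, 192.168.0.10}
IPAddresses : {192.168.0.20}
AdapterGuid : 5ca83cae-0519-43f8-adfe-eefca0f08a04
MacAddress : 00:50:56:8B:5F:97

Name : Intel(R) PRO/1000 MT Network Connection #5
DnsServers : {}
IPAddresses : {10.10.0.193}
AdapterGuid : 6d72814a-0805-4fca-9dee-4bef87aafb70
MacAddress : 00:50:56:8B:13:3F
"""


def parse_adapters(text):
    """Split the output into one dict of 'key : value' fields per adapter."""
    adapters = []
    for block in re.split(r"\n\s*\n", text.strip()):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(" : ")
            if sep:
                fields[key.strip()] = value.strip()
        if fields:
            adapters.append(fields)
    return adapters


def lan_adapter_guid(adapters):
    """GUID of the single adapter with DNS servers (assumed to be the LAN NIC)."""
    with_dns = [a for a in adapters if a.get("DnsServers", "{}") != "{}"]
    if len(with_dns) != 1:
        raise ValueError("expected exactly one adapter with DNS servers")
    return with_dns[0]["AdapterGuid"]


def transport_commands(server, guid):
    """The two Set-TransportServer commands for the given server and GUID."""
    return [f"Set-TransportServer {server} -{side}DNSAdapterGuid {guid}"
            for side in ("External", "Internal")]


if __name__ == "__main__":
    guid = lan_adapter_guid(parse_adapters(SAMPLE))
    print("\n".join(transport_commands("SERVERNAME", guid)))
```

Review the generated commands before pasting them into the Exchange Management Shell; the script only builds strings and does not touch Exchange.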

 

SharePoint Server

Thank heavens we are just in the pre-deployment stages for SharePoint.  Our SharePoint Consultant asked what I did to mess up the Central Administration site, as it could no longer be accessed (the other sites were fine however).  After a bizarre series of errors, I thought it would be best to restore it from snapshot and test what was going on.  The virtual hardware upgrade definitely broke the connection, but so did removing the existing NIC, and adding another one.  As of now, I can’t confirm that it is in fact a problem with the NIC GUID, but it sure seems to be.  My only working solution in the time allowed was to keep the server at Hardware level 4, and build up a new SharePoint Front End Server.

One might question why, with the help of vCenter,  a MAC address can’t be forced on the server.   Even though one is able to get the last 12 characters of the GUID (representing the MAC address) the first part is different.  It makes sense because the new device is different.  The applications care about the GUID as a whole, not just the MAC address.

Here is how you can find the GUID of the system’s NIC in question.  Run this BEFORE you perform a virtual hardware update, and save it for future reference in case you run into problems.  Also make note of where it exists in the registry.  It’s not a solution to the issue I had with SharePoint, but it’s worth knowing about.

C:\Users\administrator.DOMAINNAME>net config rdr
Computer name                        \\SERVERNAME
Full Computer name                   SERVERNAME.domainname.lan
User name                            Administrator
Workstation active on
        NetbiosSmb (000000000000)
        NetBT_Tcpip_{56BB9E44-EA93-43C3-B7B3-88DD478E9F73} (0050568B60BE)
Software version                     Windows Server (R) 2008 Standard
Workstation domain                   DOMAINNAME
Workstation Domain DNS Name          domainname.lan
Logon domain                         DOMAINNAME
COM Open Timeout (sec)               0
COM Send Count (byte)                16
COM Send Timeout (msec)              250
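
If you are collecting this from a number of servers before an upgrade, it helps to pull just the GUID and MAC address out of the saved output.  The following is a rough Python sketch, assuming the output was captured to text and that each NIC shows up as a NetBT_Tcpip_{GUID} (MAC) line like the one above.  The function name is made up for illustration.

```python
import re

# Matches lines such as: NetBT_Tcpip_{56BB9E44-...-88DD478E9F73} (0050568B60BE)
BINDING = re.compile(r"NetBT_Tcpip_\{([0-9A-Fa-f-]{36})\}\s+\(([0-9A-Fa-f]{12})\)")


def nic_bindings(output):
    """Return a list of (guid, mac) tuples found in 'net config rdr' output."""
    results = []
    for guid, mac in BINDING.findall(output):
        # Reformat the bare 12 hex digits as a colon-separated MAC address.
        pretty_mac = ":".join(mac[i:i + 2] for i in range(0, 12, 2))
        results.append((guid.upper(), pretty_mac.upper()))
    return results


SAMPLE = """Workstation active on
NetbiosSmb (000000000000)
NetBT_Tcpip_{56BB9E44-EA93-43C3-B7B3-88DD478E9F73} (0050568B60BE)
"""

for guid, mac in nic_bindings(SAMPLE):
    print(guid, mac)
```

Saving the result per server gives you something to compare against after the upgrade.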

 

 In hindsight…

Let me be clear that there is really not much that VMware can do about this.  The same troubles would occur on a physical machine if you needed to change out network cards.  The difference is that it’s not done as easily, as often, or as transparently as on a VM.

If I were to do it over again (and surely I will the day VMware releases a major upgrade again), I would do a few things differently.

  1. Note all existing application errors and warnings on each server prior to the upgrade, just so I don’t have to wonder whether the warning I’m staring at existed before the upgrade.
  2. Note those GUIDs before you upgrade.  You could always capture them after restoring from a snapshot if you do run into problems, but save yourself a little time and get this down on paper ahead of time.
  3. Take the virtual hardware upgrade slowly.  After everything else went pretty smoothly, I was in a “get it done” mentality.  Although the results were not catastrophic, I could have done a better job of minimizing the issues.
  4. Keep the snapshots around at least the length of your scheduled maintenance window.  It’s not a get-out-of-jail card, but if you have the weekend or off-hours to experiment, it offers you a good tool to do so.

This has also convinced me to take a very conservative approach to rolling out the new VMXNET3 NIC driver to existing VMs.  I might simply update my templates and only deploy it on new systems, or on systems that don’t run services that rely on the NIC’s GUID.

One final note.  “GUID” can be many things depending on the context, and may be referenced in different ways (UUID, SID, etc).  Not all GUIDs are NIC GUIDs.  The term is used quite loosely across subject areas.  What does this mean to you?  It means that it makes searching the net pretty painful at times.

Interesting links:

A simple PowerShell script to get the NIC GUID
http://pshscripts.blogspot.com/2008/07/get-guidps1.html

Resource allocation for Virtual Machines

Ever since I started transitioning our production systems to my ESX cluster, I’ve been fascinated by how visible resource utilization has become.  Or to put it another way, how blind I was before.  I’ve also been interested to hear about the natural tendency of many Administrators to over-allocate resources to their VMs.  Why does it happen?  Who’s at fault?  From my humble perspective, it’s a party everyone has shown up to.

  • Developers & Technical Writers
  • IT Administrators
  • Software Manufacturers
  • Politics

Developers & Technical Writers
Best practices and installation guides are usually written by Technical Writers for that Software Manufacturer.  They are provided information by whom else?  The Developers.  Someone on the Development team will take a look-see at their Task Manager or htop in Linux, maybe PerfMon if they have extra time.  They determine (with a less than thorough vetting process) what the requirement should be, and then pass it off to the Technical Writer.  Does the app really need two CPUs, or does that just indicate it’s capable of multithreading?  Or both?  …Or none of the above?  Developers are the group that seems to be most challenged at understanding the new paradigm of virtualization, yet they are the ones who decide what the requirements are.  Some barely know what it is, or dismiss it as nothing more than a cute toy that won’t work for their needs.  It’s pretty fun to show them otherwise, but frustrating to see their continued suspicions of the technology.

IT Administrators (yep, me included)
Take a look at any installation guide for your favorite (or least favorite) application or OS.  Resource minimums are still written for hardware-based provisioning.  Most best practice guides outline memory and CPU requirements within the first few pages.  Going against recommendations on page 2 generally isn’t good IT karma.  It feels as counterintuitive as trying to breathe with your head under water.  Only through experience have I grown more comfortable with the practice.  It’s still tough though.

Software Manufacturers
Virtualization can be a sensitive matter to Software Manufacturers.  Some would prefer that it didn’t exist, and choose to document and license their products that way.  Others will insist that resources are resources: why would they ever suggest that their server application can run with just 768MB of RAM and a single CPU core if there were even a remote possibility of it hurting performance?

Politics
Let’s face it.  How much is Microsoft going to dive into memory recommendations for an Exchange Server when their own virtualization solution does not support some of the advanced memory handling features that VMware supports?  The answer is, they aren’t.  It’s too bad, because their products run so well in a virtual environment.  Politics can also come from within.  IT departments get coerced by management, project teams or departments, or are just worried about SLAs of critical services.  They acquiesce to try to keep everyone happy.

What can be done about it.
Rehab for everyone involved.  Too ambitious?  Okay, let’s just try to improve Installation/Best Practices guides from the Software Manufacturers.

  • Start with two or three sets of minimum requirements: one for provisioning the application or OS on a physical machine, followed by one or more for provisioning on a VM, accommodating a few different hypervisors.
  • Clearly state if the application is even capable of multithreading.  That would eliminate some confusion on whether you even want to consider two or more vCPUs on a VM.  I suspect many red faces would show up when software manufacturers admit to their customers they haven’t designed their software to work with more than one core anyway.  But this simple step would help Administrators greatly.
  • For VM based installations, note the low threshold for RAM below which unnecessary amounts of disk paging will begin to occur.  While the desire is to allocate as few resources as needed, nobody wants disk thrashing to occur.
  • For physical servers, one may have a single server playing a dozen different roles.  Minimums sometimes assume this, and they will throw in a buffer to pad the minimums, just in case.  With a VM, it might be providing just a single role.  Acknowledge that this new approach exists, and adjust your requirements accordingly.

Wishful thinking perhaps, but it would be a start.  Imagine the uproar (and competition) that would occur if a software manufacturer actually spec’d a lower memory or CPU requirement when running under one hypervisor versus another?  …Now I’m really dreaming.

IT Administrators have some say in this too.  Simply put, the IT department is a service provider.  Your staff and the business model are your customers.  As a Virtualization Administrator, you can assert your expertise in provisioning systems so that they provide a service and work as efficiently as possible.  Let your customers define the acceptance criteria for the need they have, and then you deal with how to make it happen.

Real World Numbers
There are many legitimate variables that make it difficult to give one-size-fits-all recommendations on resource requirements.  This makes it difficult for those first starting out.  Rather than making suggestions, I decided I would just summarize some of the systems I have virtualized, along with their utilization rates, for a staff of about 50 people, 20 of them Software Developers.  These are numbers pulled during business hours.  I do not want to imply that these are the best or most efficient settings.  In fact, many of them were “first guess” settings that I plan on adjusting later.  They might offer you a point of reference for comparison, or help in your upcoming deployment.

| Server/Function | OS | Allocation | Avg % of RAM used | Avg % of CPU used / occasional spike | Comments |
|---|---|---|---|---|---|
| AD Domain Controller (all roles), DNS, DHCP | Windows Server 2008 x64 | 1 vCPU, 2GB RAM | 9% | 2% / 15% | So much for my DC’s working hard.  2GB is overkill for sure, and I will be adjusting all three of my DC’s RAM downward.  I thought the chattiness of DC’s was more of a burden than it really is. |
| Exchange 2007 Server (all roles) | Windows Server 2008 x64 | 1 vCPU, 2.5GB RAM | 30% | 50% / 80% | Consistently our most taxed VM, but pleasantly surprised by how well this runs. |
| Print Server, AV server | Windows Server 2008 x64 | 1 vCPU, 2GB RAM | 18% | 3% / 10% | Sitting as a separate server only because I hate having application servers running as print servers. |
| Source Code Control Database Server | Windows Server 2003 x64 | 1 vCPU, 1GB RAM | 14% | 2% / 40% | There were fears from our Dev Team that this was going to be inferior to our physical server, and they suggested the idea of assigning 2 vCPU’s “just in case.”  I said no.  They reported a 25% performance improvement compared to the physical server.  Eventually they might figure out the ol’ IT guy knows what he’s doing. |
| File Server | Windows Server 2008 x64 | 1 vCPU, 2GB RAM | 8% | 4% / 20% | Low impact as expected.  Definitely a candidate to reduce resources. |
| Sharepoint Front End Server | Windows Server 2008 x64 | 1 vCPU, 2.5GB RAM | 10% | 15% / 30% | Built up, but not fully deployed to everyone in the organization. |
| Sharepoint Back End/SQL Server | Windows Server 2008 x64 | 1 vCPU, 2.5GB RAM | 9% | 15% / 50% | I will be keeping a close eye on this when it ramps up to full production use.  SharePoint farms are known to be hogs.  I’ll find out soon enough. |
| SQL Server for project tracking | Windows Server 2003 x64 | 1 vCPU, 1.5GB RAM | 12% | 4% / 50% | Lower than I would have thought. |
| Code compiling system | Windows XP x64 | 1 vCPU, 1GB RAM | 35% | 5% / 100% | Will spike to 100% CPU usage during compiling (20 min.).  Compilers allow for telling it how many cores to use. |
| Code compiling system | Ubuntu 8.10 LTS x64 | 1 vCPU, 1GB RAM | 35% | 5% / 100% | All Linux distros seem to naturally prepopulate more RAM than their Windows counterparts, at the benefit perhaps of doing less paging. |
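
Percentages can hide how little memory is actually in play, so it can help to convert them to absolute numbers before deciding what to trim.  Here is a small Python sketch of that arithmetic, using three rows from the numbers above.  It only converts "average % of RAM used" into megabytes; it is not a sizing recommendation, and the function name is just for illustration.

```python
def used_mb(allocated_gb, avg_pct_used):
    """Convert an allocation in GB and an average % used into MB in use."""
    return allocated_gb * 1024 * avg_pct_used / 100


# (server, allocated RAM in GB, average % of RAM used), taken from the numbers above
vms = [
    ("AD Domain Controller", 2.0, 9),
    ("Exchange 2007 Server", 2.5, 30),
    ("File Server", 2.0, 8),
]

for name, allocated_gb, pct in vms:
    print(f"{name}: about {used_mb(allocated_gb, pct):.0f} MB of {allocated_gb * 1024:.0f} MB in use")
```

Seen this way, a Domain Controller averaging 9% of 2GB is using well under 200MB, which is why it stands out as a candidate for a smaller allocation.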
       

To complicate matters a bit, you might observe different behaviors on some OSs (XP versus Vista/2008 versus Windows 7/2008R2, or SQL 2005 versus SQL 2008) in their willingness to prepopulate RAM.  Give SQL 2008 4GB of RAM, and it will want to use it even if it isn’t doing much.  You might notice this when looking at relatively idle VMs with different OSs, where some have a smaller memory footprint than others.  At the time of this writing, none of my systems were running Windows 2008 R2, as it wasn’t supported on ESX 3.5 when I was deploying them.

Some of these numbers are a testament to ESX’s/vSphere’s superior memory management and CPU scheduling.  Memory ballooning, swapping, and Transparent Page Sharing all contribute to pretty startling efficiency.

I have yet to virtualize my CRM, web, mail relay, and miscellaneous servers, so I do not have any good data yet for these types of systems.  Having just upgraded to vSphere in the last few days, this also clears the way for me to assign multiple vCPUs to the code compiling machines (as many as 20 VMs).  The compilers have switches that can toggle exactly how many cores end up being used, and our Development Team needs these builds compiled as fast as possible.  That will be a topic for another post.

Lessons Learned
I love having systems isolated to performing their intended function now.  Who wants Peachtree crashing their Email server anyway?  Those administrators working in the trenches know that a server that is serving up a single role is easy to manage, way more stable, and doesn’t cross contaminate other services.  In a virtual environment, it’s  worth any additional costs in OS licensing or overhead.

When the transition to virtualizing our infrastructure began, I thought our needs, and our circumstances would be different than they’ve proven to be.  Others claim extraordinary consolidation ratios with virtualization.  I believed we’d see huge improvements, but those numbers wouldn’t possibly apply to us, because (chests puffed out) we needed real power.  Well, I was wrong, and so far, we really are like everyone else.


Discovering AutoDiscover in Exchange 2007

 

In my post “Exchange 2007… Better Later than Never” I mentioned one of the post-deployment difficulties I faced was getting the “AutoDiscover” function to behave the way it was designed.  For those unfamiliar with the feature, it allows for automated discovery and configuration of various connectivity methods to an Exchange Server.  Exchange MAPI clients, Exchange HTTP/RPC clients, and mobile devices using ActiveSync all can use AutoDiscover in some form or another.

While it wasn’t critical for the transition itself, AutoDiscover was vital for our future deployments of “Outlook Anywhere” and “ActiveSync.”  I figured I’d skim a few TechNet articles and blog postings, and quickly be on to the next project.  That began my long, ugly journey getting AutoDiscover to work.

It became clear that the ingredients for AutoDiscover to work correctly were a properly configured ISA Server, SSL certificates, namespace/DNS accommodations, and of course, Exchange.  What was really interesting about this particular project was that I was dealing with very mature products, yet I had never run across so much contradictory information on how to make something work.  Perhaps some of that stems from so many valid topologies and configurations, or possibly big changes between the RTM versions of Exchange and ISA and their first service packs.  Still, it seemed odd.  I sifted through postings from desperate IT Administrators in similar situations who had no more hair to pull out.  You could sense the defeat in their words.  Now I understand.

One guideline mentioned quite often was the need for a special SSL certificate that allows more than one FQDN to be assigned to it.  You’ll see it referred to as a Unified Communications Certificate (UCC or UC) or a Subject Alternative Name (SAN) certificate.  The purpose is the same; only the names differ.  While UC certificates are not technically a requirement, it is best to think of them that way.  For AutoDiscover, the names needed on a UC cert would look something like:

mypubliccompanyname.com
autodiscover.mypubliccompanyname.com
mail.mypubliccompanyname.com
internalmailservername
internalmailservername.myprivatelanname.lan
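
Before ordering the certificate, it is worth checking your planned list of names against every host name that clients and ISA will actually use.  Below is a simple Python sketch of that check.  It does exact, case-insensitive matching only (no wildcard handling), and the "owa" name in the required list is a made-up example to show what a gap looks like.

```python
def missing_names(required, cert_names):
    """Return the required host names that are not on the certificate's name list."""
    covered = {name.lower() for name in cert_names}
    return [name for name in required if name.lower() not in covered]


# Names planned for the UC certificate (from the list above).
cert_names = [
    "mypubliccompanyname.com",
    "autodiscover.mypubliccompanyname.com",
    "mail.mypubliccompanyname.com",
    "internalmailservername",
    "internalmailservername.myprivatelanname.lan",
]

# Names clients will connect to; the last one is hypothetical.
required = [
    "autodiscover.mypubliccompanyname.com",
    "Mail.MyPublicCompanyName.com",
    "internalmailservername",
    "owa.mypubliccompanyname.com",
]

print(missing_names(required, cert_names))
```

Anything the check reports is a name that would trigger a certificate warning, so it is cheaper to catch it before the certificate is issued.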

I went with a UC cert from DigiCert, but any of the larger commercial CAs should work.  However, a word of warning.  Exchange doesn’t like self-signed certificates, and many mobile phones have trouble with private certificates as well as those from smaller commercial CAs.  You should be fine if you run Certificate Services internally (or so I’m told), and your namespace checks out okay.  Don’t forget to look at your ISA server and make sure you are running SP1 or later, due to limitations in how the RTM version handled UC certificates.

Speaking of namespaces, time for a thorn in my side to come back and sting me.  My internal namespace is not a name that we own (a legacy issue I should have taken care of long ago).  Certificate Authorities will not issue standard or UC SSL certificates to names you do not own for obvious reasons, even if the references are private.  Fortunately, I was able to work around this by making absolutely sure the simple name was used in any Exchange configuration settings that usually accepted the internal FQDN.  Disaster averted.

Now for the dirt on how I was able to make it work.  My as-built design is modeled somewhat after Jason Jones’ method of Publishing Exchange 2007 Services with ISA 2006.  It follows this construct:

  • Not using the existing listener created for OWA.  Instead, creating a separate listener for Outlook Anywhere (OA)/Autodiscover, and binding the UC cert to that listener.  The listener uses HTTP authentication with Integrated/Windows Auth (aka NTLM).  This provides HTTP/Integrated auth from the client to the FW, then basic auth from the FW to the Exchange server.
  • Allowing the ISA server to utilize Kerberos constrained delegation (KCD) by way of changes in AD.
  • Creating a single Publishing rule for OA, where KCD is used.
  • Setting internal and external URLs to their respective internal and external locations (internalmailservername and autodiscover.mypubliccompanyname.com).

After configuring it as above, AutoDiscover worked internally, but not externally.  I continually got failures with the /rpc directory when testing internally (via test-outlookwebservices) and externally (via testexchangeconnectivity.com).  I found a post that gave the missing piece of the puzzle, and modified my configuration per its recommendations (http://forums.isaserver.org/m_2002041377/mpage_2/key_/tm.htm):

  • Create a 2nd Publishing rule for OA, sitting on top of the primary OA publishing rule.
    • Only /rpc/* is published
    • Auth Delegation is set to "No Delegation, but client may authenticate directly"
    • Set to "all users" instead of "authenticated users"
    • Change the "EXPR" Outlook provider to msstd:mail.mypubliccompanyname.com so that the certificate mutual authentication test passes.

Under the conditions described above, Outlook Anywhere with Autodiscover functions as desired.

As Jason Jones put it best, “The reason for the need of a separate listener is that Windows Authentication (NTLM) and Forms Based Authentication (FBA) are mutually exclusive. It is not possible to use a single web listener for all Exchange 2007 publishing and achieve transparent authentication within Outlook anywhere.”  Thus the need to create a dedicated listener to be used exclusively for Outlook Anywhere and associated services.

I did have to make one other adjustment that is rarely brought up in the AutoDiscover deployment scenarios.  AutoDiscover wants to look at your root domain name (e.g. mypubliccompanyname.com) during its discovery process.  However, you may have simply had an “A” record of “mypubliccompanyname.com” pointing to your web server to catch those users who forget to type in “www” before your domain name.  It’s also not far-fetched to assume you had an SSL certificate bound to that web server as well.  This is exactly what I had, so I had to make the following changes:

1.  Have our ISP (or whoever has authoritative control of the DNS zone file for “mypubliccompanyname.com”) change the “A” record from my public web server IP address to my autodiscover address.

2.  In ISA, add a new “DENY (REDIRECT)” rule for mypubliccompanyname.com that does, well, a deny, and a redirect to www.mypubliccompanyname.com.  This sits right above the web publishing rule for www.mypubliccompanyname.com.
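
To see why the bare domain name matters, it helps to list the locations an external Outlook 2007 client tries.  The Python sketch below builds that list from an e-mail address.  The order shown is my understanding of Microsoft’s documented behavior for clients that cannot reach Active Directory; treat it as an assumption, since the exact steps vary with the Outlook version and patch level.  The e-mail address is a made-up example.

```python
def autodiscover_candidates(email_address):
    """List the locations an external client is assumed to try, in order."""
    domain = email_address.rsplit("@", 1)[1].lower()
    return [
        # 1. The bare domain, over SSL. This is the lookup that lands on a web
        #    server if the "A" record for the domain points there.
        f"https://{domain}/autodiscover/autodiscover.xml",
        # 2. The autodiscover host name, over SSL.
        f"https://autodiscover.{domain}/autodiscover/autodiscover.xml",
        # 3. Plain HTTP on the autodiscover host, looking for a redirect.
        f"http://autodiscover.{domain}/autodiscover/autodiscover.xml",
        # 4. A DNS SRV record lookup (not a URL).
        f"_autodiscover._tcp.{domain}",
    ]


for candidate in autodiscover_candidates("someone@mypubliccompanyname.com"):
    print(candidate)
```

Because the bare domain is tried first, a web server with its own SSL certificate answering on that name gets in the way before the client ever reaches the autodiscover host.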

The original setup was a carryover from an earlier time.  The configuration above is the way I should have set it up.  Nice to do a little cleanup along the way.

I can’t tell you how relieved I was in getting this to work, no matter how many hoops I had to jump through.  I also have a complete set of as-built notes in case I need to recreate or debug the existing configuration.  It’s been stable since, but I have a feeling I’ll be looking at this again as soon as we transition to Exchange 2010. 

Other helpful links:

Microsoft Exchange Remote Connectivity Analyzer
https://www.testexchangeconnectivity.com/

Publishing Exchange 2007 Services with ISA Server 2006…
http://blog.msfirewall.org.uk/2008/07/publishing-exchange-2007-services-with.html

Technet white paper:  Exchange 2007 Autodiscover service:
http://technet.microsoft.com/en-us/library/bb332063.aspx

Generating SSL certificates for Exchange 2007 and ISA 2006:
http://www.isaserver.org/tutorials/Generating-SSL-Certificates-Exchange-2007-ISA-Server-2006.html 

Dr. Tom Shinder’s guides on Publishing Exchange 2007 OWA, activeSync, and RPC/HTTP using ISA 2006:
http://www.isaserver.org/tutorials/Publishing-Exchange-2007-OWA-Exchange-ActiveSync-RPCHTTP-using-2006-ISA-Firewall-Part1.html

A bulk discount on Tylenol.  …You’ll need it.
http://www.costco.com

Living with ISA 2006 and the ISA Firewall client

 

One of my big projects in 2008 was making the transition from my old firewall to a new solution.  I’ve had 18 months or so to work with ISA and the workstations running the Firewall Client software, and thought I’d share my experiences.

First, a little background.  The network I inherited long ago was protected by a Watchguard Firewall.  At the time, it was a moderately capable stateful packet inspection (SPI) unit that performed what was asked of it: ingress filtering with a little protection from a few application layer proxies.  But times had changed and communication sessions had become more sophisticated.  Exploits were getting more creative and difficult to defend against because they were occurring high up at the application layer.  Like many SPI firewalls, its ability to intelligently control outbound traffic was limited.

My acceptance criteria included better protection at the application layer, as well as close integration with my Active Directory based infrastructure.  I also needed a firewall that would help me get a handle on outbound traffic.  ISA 2006 was the answer.  I chose a Celestix MsA4000i appliance running ISA to simplify the hardware procurement and deployment process.

During my implementation planning, I had the opportunity to talk at length with Richard Hicks, a Senior Engineer for Celestix Networks.  Celestix makes a fine product line of security solution appliances running ISA, and Richard (a recent MVP award winner) had excellent insight into ISA implementations, large and small.  I give him credit for helping me translate the functional requirements I was used to with my old firewall into ISA terms, and for giving practical recommendations on how ISA performs those same functions, as well as on policy design and implementation.

One of the unique traits of ISA is the variety of ways it allows internal clients to communicate through it.

  • SecureNAT.  The most basic of the three, and uses ISA as the gateway/router for traditional perimeter based protection.  Used when a default gateway is assigned to the client.
  • Web Proxy Client.  Generally called upon when there are web based requests such as HTTP and FTP calls, etc. 
  • Firewall client.  An optional piece of the ISA solution that runs on Windows clients, and extends the functionality of ISA in ways that cannot be matched by other solutions.

These are not mutually exclusive; all three can run at the same time.  Unfortunately, this flexibility can hinder your intentions.  If you want to restrict outbound communication to authenticated access only, running SecureNAT will compromise that ability.  The solution?  Run all non-server systems without a default gateway, to force the client to use the web proxy client or the firewall client.  In the event that the target is beyond your LAN, the firewall client will handle all routing.

The easiest transition would have been using SecureNAT for the initial deployment, but there was an opportunity for monumental improvements if I attempted to go without it.  Am I glad I took this extra step?  Yes!  Some of the highlights have been:

  • Outbound connections limited to authenticated users only.  If an outbound connection is made, I can see which user requested it.  Logging provides meaningful data now.
  • True egress control.  Connections initiated from the inside can finally be controlled.  Once everything was up and running, it was fascinating to see what was initiating outbound connections.
  • Forces compliance with application related restrictions.  IM and P2P applications specialize in working their way around firewalls.  The combination of the web proxy and the firewall client, with no SecureNAT, helps achieve this.
  • Suppression of malware.  The combination of allowing only authenticated outbound access and utilizing an automated malware blacklist database helped control users who had a knack for making a mess out of their PCs.

The results of the improved security stance were impressive.  So was the amount of complaining from end users.  They were furious.  I had angry developers shutting off the firewall client software on their PCs.  It made them feel good until they realized shutting down the firewall client gave them less access, not more.  They made claims that BitTorrent was a necessary part of their job, and found it insulting that outbound SSH sessions were not allowed to any host on the Internet.  They didn’t like that their non-domain joined test machines (or unapproved personal laptops) would require a username and password before they could access the Internet.  Their complaints went straight to the top of the organization, as did my explanations.  Security won out, and policies stood without change.

There were some hiccups along the way.  Most deployment related problems were fixed, while others forced some changes in how we worked.  The ISA community is an active one, but running workstations with the ISA firewall client and no default gateway is uncommon enough that finding answers was much more difficult.  Some of the obstacles I ran into were:

  • Lack of support for CIFS traversing across network segments.  The firewall client cannot handle this alone, and needs a default gateway.
  • Vista and later workstations need a static route added for remote targets that are not web based.  This can be added via DHCP (option 121, but don’t try to add it via the DHCP snap-in in Vista, otherwise it won’t work).  Thanks to Richard Hicks and Microsoft for ultimately explaining the reason behind the inconsistent behavior between XP and Vista.  More info can be found here: http://tmgblog.richardhicks.com/2009/01/10/dns-resolver-behavior-in-windows-vista/
  • Building up a healthy list of domains that will be allowed to have anonymous outbound access.  OS and application update domains and mirrors are good examples of this.
  • Older Outlook Clients (2003) wouldn’t talk to the internal Exchange Server using its MAPI connection until the following tweak was made:  http://www.isaserver.org/articles/2004olpop3smtp.html
  • Web services that use SSL, but do not run over port 443, had to be accommodated.  http://www.isaserver.org/articles/2004tunnelportrange.html
  • Browser proxy configurations in *nix workstations may not be enough.  For those workstations, leave a default gateway.
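
For the DHCP option 121 route mentioned above, it helps to know what the option actually carries.  RFC 3442 packs each classless static route as the prefix length, then only the significant octets of the destination, then the four octets of the router.  Here is a minimal Python sketch of that encoding.  The destination network and gateway are made-up values for illustration, not my actual addresses.

```python
import ipaddress


def encode_option_121(routes):
    """Encode (destination CIDR, gateway) pairs as RFC 3442 option 121 data."""
    data = bytearray()
    for cidr, gateway in routes:
        network = ipaddress.ip_network(cidr)
        # Only the octets covered by the prefix are sent (a /24 sends three).
        significant = (network.prefixlen + 7) // 8
        data.append(network.prefixlen)
        data += network.network_address.packed[:significant]
        data += ipaddress.ip_address(gateway).packed
    return bytes(data)


# Hypothetical example: reach 192.168.50.0/24 through a router at 10.10.0.1.
payload = encode_option_121([("192.168.50.0/24", "10.10.0.1")])
print(payload.hex())
```

Reading the hex output from left to right gives the prefix length, the destination octets, and the router address, which is handy when checking what a DHCP server is really handing out.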

As you can see from the links I provide, I found www.isaserver.org invaluable during my implementation.  It attracts some of the brightest and the best in the security world, who contribute articles and take part in the community forums.  It’s a great resource for any ISA administrator.

My biggest annoyances in using the firewall client are small, but still worth mentioning.

  • The virtual black hole that occurs at the socket level on the workstation running the firewall client.  Trying to debug via traditional methods is nearly impossible.  It reduces the number of connections from the client, but it’s hard to tell the contents of the connection.
  • The name.  “Firewall Client” implies that it is some application that protects a workstation like ZoneAlarm, Norton, or the Windows Firewall.  A simple name change would eliminate this confusion to newer users, and some IT guys not familiar with ISA.

If I were to do it over again, I would give more notice on what changes would be occurring, and why.  I had previous verbal green lights from management to restrict things like P2P and IM sessions, and our written IT policies already reflected these restrictions.  I just never had the capability to enforce them.  I warned staff, but apparently not enough.  I had to do a healthy amount of explaining, which was fine because I had the technical reasons and the business case on my side.

I look forward to the next version of ISA (Threat Management Gateway, or TMG) and the steps it takes to improve upon the Firewall Client component.  Recommended reading on using the Firewall Client in ISA 2004 and 2006 can be found below.

Firewall Client
http://www.isaserver.org/tutorials/Understanding-ISA-Firewall-Client-Part1.html

http://www.isaserver.org/articles/2004firewallclient.html

http://www.isaserver.org/tutorials/Understanding_and_installing_ISA_Firewall_Clients.html

http://www.isaserver.org/tutorials/ISA_Clients__Part_2_SecureNAT_and_Web_Proxy_Client.html

Database of malware domains that can be imported directly into ISA
http://www.malwaredomains.com/

A special thanks to Richard Hicks from Celestix, and my good friend Glenn Barnas from Inno-Tech, who provided invaluable information when I needed it most.