October 2010

Replication with an EqualLogic SAN; Part 5

Well, I’m happy to say that replication to my offsite facility is finally up and running now. Let me share with you the final steps to get this project wrapped up.

You might recall that in my previous offsite replication posts, I had a few extra challenges. We were a single-site organization, so in order to get replication up and running, an infrastructure at a second site needed to be designed and put in place. My topology still reflects what I described in the first installment, but simple pictures don’t describe the work involved in getting this set up. It was certainly a good exercise in keeping my networking skills sharp. My appreciation for the folks who specialize in complex network configurations and address management has been renewed. They probably seldom hear words of thanks for, say, that well-designed subnetting strategy. They are an underappreciated bunch for sure.

My replication has been running for some time now, but this was all within the same internal SAN network. While other projects prevented me from completing this sooner, it gave me a good opportunity to observe how replication works.

Here is the way my topology looks fully deployed.

Most colocation facilities or datacenters give you about 2 square feet to move around (only a slight exaggeration of the truth), so it’s not the place you want to be contemplating reasons why something isn’t working. It’s also no fun realizing you don’t have the remote access you need to make the necessary modifications, and you don’t want to, or can’t, drive to the CoLo. My plan for getting this second site running was simple: build up everything locally (switchgear, firewalls, SAN, etc.) and change my topology at my primary site to emulate the 2nd site.

Here is the way it was running while I worked out the kinks.

All replication traffic occurs over TCP port 3260. Both locations had to have accommodations for this. I also had to ensure I could manage the array living offsite. Testing this out with the modified infrastructure at my primary site allowed me to verify traffic was flowing correctly.
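A quick way to confirm that port 3260 is reachable through the firewalls is a plain TCP connection test from a machine on the source SAN segment. This is only a minimal sketch: the function name is made up for illustration, and a successful connection shows only that the port is open, not that replication itself will work.

```python
import socket


def tcp_port_open(host, port=3260, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection handles name resolution and the connect timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

Calling `tcp_port_open("10.20.0.10")` (a hypothetical address for the target group) from the primary SAN segment would return `False` if either firewall is dropping the traffic.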

The steps taken to get two SAN replication partners transitioned from a single network to two networks (onsite) were:

1. Verify that all replication is running correctly when the two replication partners are in the same SAN network.
2. You will need a way to split the feed from your ISP, so if you don’t have one already, place a temporary switch at the primary site on the outside of your existing firewall. This will allow you to emulate the physical topology of the real site, while having the convenience of all of the equipment located at the primary site.
3. After the 2nd firewall (destined for the CoLo) is built and configured, place it on that temporary switch at the primary site.
4. Place something (a spare computer perhaps) on the SAN segment of the 2nd firewall so you can test basic connectivity (to ensure routing is functioning, etc.) between the two SAN networks.
5. Pause replication on both ends, and take the target array and its switchgear offline.
6. Plug the target array’s Ethernet ports into the SAN switchgear for the second site, then change the IP addressing of the array/group so that it’s running under the correct net block.
7. Re-enable replication and run test replicas, starting with the Group Manager, then moving to ASM/VE, and then on to ASM/ME.
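Before re-addressing the target array, it can help to sanity check the addressing plan on paper. The sketch below is not part of any EqualLogic tooling. It uses Python’s standard `ipaddress` module, with made-up subnets and a hypothetical function name, to confirm that the two SAN networks don’t overlap (which routing between them requires) and that the new group and interface addresses fall inside the second site’s net block.

```python
import ipaddress


def check_san_addressing(primary_net, secondary_net, target_ips):
    """Return a list of problems found; an empty list means the plan looks sane."""
    primary = ipaddress.ip_network(primary_net)
    secondary = ipaddress.ip_network(secondary_net)
    problems = []

    # Routed SAN segments need distinct, non-overlapping subnets.
    if primary.overlaps(secondary):
        problems.append(f"{primary} overlaps {secondary}")

    # Every address planned for the target array must sit in the second subnet.
    for ip in target_ips:
        if ipaddress.ip_address(ip) not in secondary:
            problems.append(f"{ip} is not inside {secondary}")

    return problems
```

For example, `check_san_addressing("10.10.0.0/24", "10.20.0.0/24", ["10.20.0.10", "10.20.0.11"])` returns an empty list, while leaving an old address such as `10.10.0.50` in the plan would be reported.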

It would be crazy not to take one step at a time on this, as you learn a little on each step, and can identify issues more easily. Step 3 introduced the most problems, because traffic has to traverse routers that are also secure gateways. Not only does one have to consider a couple of firewalls, but one also runs into other considerations that may be undocumented. For instance:

ASM/VE replication occurs courtesy of vCenter, but ASM/ME replication is configured inside the VM. Sure, it’s obvious, but so obvious it’s easy to overlook. That means any topology changes will require adjustments in each VM that utilizes guest attached volumes. You will need to re-run the “Remote Setup Wizard” to adjust the IP address of the target group that you will be replicating to.
ASM/ME also uses a VSS control channel to talk with the array. If you changed the target array’s group and interface IP addresses, you will probably need to adjust what IP range will be allowed for VSS control.
Not so fast though. VMs that use guest iSCSI initiated volumes typically have those iSCSI dedicated virtual network cards set with no default gateway. You never want to enter more than one default gateway in this sort of situation. The proper way to do this is to add a persistent static route. This needs to be done before you run the Remote Setup Wizard above. Fortunately the method to do this hasn’t changed for at least a decade. Just type in

route -p add [destinationnetwork] mask [subnetmask] [gateway] metric [metric]

Certain kinds of traffic that pass almost without a trace across a layer 2 segment show up right away when being pushed through very sophisticated firewalls whose default stance is to deny all unless explicitly allowed. Fortunately, Dell puts out a nice document on their EqualLogic arrays.
If possible, it will be easiest to configure your firewalls with route relationships between the source SAN and the target SAN. It may complicate your rulesets (NAT relationships are a little more intelligent when it comes to rulesets in TMG), but it simplifies how each node is seeing each other. This is not to say that NAT won’t work, but it might introduce some issues that wouldn’t be documented.
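Since the persistent static route mentioned above has to be added inside every VM that uses guest attached volumes, generating the command string from the subnet can save some typing mistakes. This is a minimal sketch with a hypothetical helper name and made-up addresses. It only builds the text of the Windows `route` command (using the `mask` and `metric` keywords) and does not run it.

```python
import ipaddress


def persistent_route_command(destination_cidr, gateway, metric=None):
    """Build a Windows 'route -p add' command line for the given network."""
    net = ipaddress.ip_network(destination_cidr)
    ipaddress.ip_address(gateway)  # raises ValueError if the gateway is malformed

    parts = ["route", "-p", "add", str(net.network_address),
             "mask", str(net.netmask), gateway]
    if metric is not None:
        # The metric is optional; Windows picks one if it is omitted.
        parts += ["metric", str(metric)]
    return " ".join(parts)
```

With a remote SAN network of `10.20.0.0/24` reached through a local SAN gateway at `10.10.0.1` (both assumed for illustration), the helper produces `route -p add 10.20.0.0 mask 255.255.255.0 10.10.0.1`.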

Step 7 exposed an unexpected issue: terribly slow replicas. Slow even though it wasn’t even going across a WAN link. We’re talking VERY slow, as in 1/300th the speed I was expecting. The good news is that this problem had nothing to do with the EqualLogic arrays. It was an upstream switch that I was using to split my feed from my ISP. The temporary switch was not negotiating correctly, and was causing packet fragmentation. Once that switch was replaced, all was good.

The other strange issue was that even though replication was running great in this test environment, I was getting errors with VSS. ASM/ME at startup would indicate “No control volume detected.” Even though replicas were running, the replicas couldn’t be accessed, used, or managed in any way. After a significant amount of experimentation, I eventually opened up a case with Dell Support. Running out of time to troubleshoot, I decided to move the equipment offsite so that I could meet my deadline. Well, when I came back to the office, VSS control magically worked. I suspect that the array simply needed to be restarted after I had changed the IP addressing assigned to it.

My CoLo facility is an impressive site. Located in the Westin Building in Seattle, it is also where the Seattle Internet Exchange (SIX) is located. Some might think of it as another insignificant building in Seattle’s skyline, but it plays an important part in efficient peering for major Service Providers. Much of the building has been converted from a hotel to a top tier, highly secure datacenter and a location in which ISPs get to bridge over to other ISPs without hitting the backbone. It has dedicated water and power supplies, full facility fail-over, and elevator shafts that have been remodeled to provide nothing but risers for all of the cabling. Having a CoLo facility that is also an Internet Exchange Point for your ISP is a nice combination.

Since I emulated the offsite topology internally, I was able to simply plug in the equipment, and turn it on, with the confidence that it would work. It did.

My early measurements on my feed to the CoLo are quite good. Since the replication times include buildup and teardown of the sessions, one might get a more accurate measurement of sustained throughput on larger replicas. The early numbers show that my 30 Mbps circuit is translating to replication rates in the neighborhood of 10 to 12GB per hour (at the top end, about 205MB per min, or 3.4MB per sec.). If multiple jobs are running at the same time, the rate will be affected by the other replication jobs, but the overall throughput appears to be about the same. Also affecting speeds will be other traffic coming to and from our site.
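The arithmetic behind those figures is easy to check. The sketch below assumes 1GB = 1024MB for the replication figures, which is what the per-minute number implies, and compares them with the raw line rate of the circuit, ignoring all protocol overhead. The function names are only for illustration.

```python
def gb_per_hour_to_rates(gb_per_hour):
    """Convert GB/hour into (MB per minute, MB per second), with 1 GB = 1024 MB."""
    mb_per_min = gb_per_hour * 1024 / 60
    mb_per_sec = mb_per_min / 60
    return mb_per_min, mb_per_sec


def line_rate_mb_per_sec(mbps):
    """Raw circuit speed in megabytes per second (8 bits per byte, no overhead)."""
    return mbps / 8


per_min, per_sec = gb_per_hour_to_rates(12)
print(round(per_min, 1), round(per_sec, 2))   # 204.8 3.41
print(line_rate_mb_per_sec(30))               # 3.75
```

So 12GB per hour is roughly 3.4MB per second against a raw ceiling of 3.75MB per second on a 30 Mbps circuit, which suggests the link is close to saturated while a replica is running.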

There is still a bit of work to do. I will monitor the resources, and tweak the scheduling to minimize the overlap on the replication jobs. In past posts, I’ve mentioned that I’ve been considering the idea of separating the guest OS swap files from the VMs, in an effort to reduce the replication size. Apparently I’m not the only one thinking about this, as I stumbled upon this article. It’s interesting, but a fair amount of work. Not sure if I want to go down that road yet.

I hope this series helped someone with their plans to deploy replication. Not only was it fun, but it is a relief to know that my data, and the VMs that serve up that data, are being automatically replicated to an offsite location.

Firewall adventures: Transitioning from ISA 2006 to TMG

One of the key parts of my ~~seemingly never-ending~~ Offsite Replication project was to build out a second location to replicate my data to. Before I could do this, some prep work to my network was in order. It was a great opportunity for me to replace my existing firewall running Microsoft’s ISA 2006 server with their newest edition, named Forefront Threat Management Gateway, or TMG.

My new TMG system is running on a 1U appliance provided by Celestix Networks, Inc. I was introduced to the Celestix line of appliances back in 2007, and I’ve been very happy with the great turn-key solutions they provide. It’s great for those who want to run ISA/TMG, but do not want to build up their own unit, and do not want to handle licensing of the OS or TMG. The lineup they offer ranges anywhere from branch office solutions to backbone class systems. Some really nice abilities are built right into the unit, such as web based management, and updating the unit to a new build by booting to PXE. It also offers a “Last Good Version” (LGV) that will reimage the disk to the state it was saved in, in the event of a configuration change going terribly wrong. Definitely peace of mind for those critical upgrades. The nature of the image creation and restore is such that it requires the system to be offline. I hope that in the future, Celestix can perhaps partner with Acronis, or some other disk imaging solution, to make this process a little more convenient. It still works pretty well though. Anyway, onto the transition.

Upgrade, or transition?

This seems to be one of those ubiquitous IT related questions for almost any enterprise solution that is being run in a production environment. Should you do an in-place upgrade, or should you transition to a pristine installation? In this particular case, this was already answered for me, as my old appliance ran a 32-bit version of Windows Server 2003, and could not be upgraded due to system requirements. That was okay with me. A true upgrade fell out of favor with me years ago; there are just too many unknowns introduced, which can make post deployment issues extremely difficult to diagnose. I’ve also sensed that the true upgrade has fallen out of favor with software manufacturers as well. Whether it’s Exchange, SQL, or a server OS, the recommended way these days seems to be transitioning to a pristine installation.

The new box

For the new environment I was building, I chose two Celestix MSA5200i units; one for the primary facility, and one for the CoLo. These particular units run TMG Standard, on top of Windows Server 2008 R2. It would have been nice to go with a unit running the Enterprise Edition of TMG (that offers the ability to create a redundant array of servers), but I had to cut costs, and going with the Standard Edition was the easiest way to do this.

With the new unit sitting in front of me, I decided to build it up in its entirety offline, and wait for a weekend to cut it over. ISA has the ability to dump out all, or parts of the old configuration in XML, so my early (albeit naive) visions had me thinking that my transition steps would simply be exporting the configuration running on the ISA 2006 box, and importing it to the TMG box. Well, the devil is in the details, and while this could work for certain scenarios, it didn’t work for me on the first few tries. I had a choice. Continue chasing down the reason why it wasn’t importing (an unknown time limit), or pound out a new configuration in a few days (a known time limit). No time to complain; just do it and get it over with. Good documentation in OneNote, and the ability to RDP into your existing ISA installation, are key to successfully building a new configuration from scratch. To minimize typos and other fat fingering, I did export custom sets and protocols at the very granular level. Sure, I could type them out easily enough, but it was more reliable to export at the very small item level.

A properly configured TMG box is almost always joined to Active Directory, and there are some steps that you just have to wait to get to on the day of transition. This is reasonable, but it does have to be planned for. Things like using Kerberos Constrained Delegation in publishing rules can only be configured after the box is joined. It’s also worth making sure you know all AD related settings (Delegation, OU location, GPO overrides, etc.) for the existing firewall that you will be decommissioning. Nothing like an oversight here to mess you up.

Post installation surprises

The abilities of TMG make it far more than a simple edge security device. That is what truly separates it from the competition. Since it is integrated into the operation of so many functions up and down the protocol stack, a transition like this can be a bit disruptive. I’m happy to say that considering the type of change, I didn’t run into too many troubles. I had prepared a checklist of basic functions and services I could run through to quickly validate a successful transition. This made validation easy, and prevented most Monday morning surprises.

After about 20 minutes, I had the old ISA box removed from the domain, and the new one added and configured. The rest of the time was spent confirming functionality, and resolving a few issues. Here were some of the minor ones:

ARP caching. This isn’t the first time this has bitten me. I forgot that the ARP cache on the connecting devices needed to be flushed. Silly mistake, but the nice part is, that it eventually corrects itself. (I wish I had a few more of those kinds of problems).
Publishing rules and Listeners. After you join the box to the domain, you will want to check these, and recreate if necessary. I had a few publishing rules that I had to recreate. Not a big deal. They looked okay, but just didn’t work.
I have several publicly registered IP addresses bound to the external (WAN) interface. Windows 2008 and TMG didn’t bind to the IP address I was thinking it was going to bind to (or at least the way Win2003 and ISA did). A quick fix in the TMG configuration resolved this. Look to this TechNet Article on why the behavior is different.

The final issue was a little trickier to fix. The symptoms were that web browsing was working, but it just took a while to connect. After looking at the logging (and being tipped off by a thread on isaserver.org’s community forum), I noticed that the web proxy was attempting to use one of the RRAS adapters as the default gateway. It was being caused by web proxy clients getting confused when reading WPAD for automatic browser/proxy configuration. The slow browsing would go away as soon as the web browser’s proxy settings were manually configured. Apparently this behavior wasn’t unique to TMG (others on ISA 2006 have experienced similar behavior), but this was the first time I’d ever seen it.

There was a .vbs script that supposedly fixed the issue. The purpose of the .vbs script was to insert the FQDN of the TMG unit into WPAD. While the script ran successfully, it didn’t change the behavior for me. At this point, a little bit of panic set in. I thought it best to tap into the expertise of my good friend and TMG superstar, Richard Hicks. Richard is a Microsoft MVP, and has a great blog that should be in everyone’s RSS feed list. After briefing him on the scenario, he provided me with another script (courtesy of TechNet) that would attempt to achieve the same results as the failed script.

' http://blogs.technet.com/isablog/archive/2008/06/26/understanding-by-design-behavior-of-isa-server-2006-using-kerberos-authentication-for-web-proxy-requests-on-isa-server-2006-with-nlb.aspx

Option Explicit

' Possible values for WebProxy.CarpNameSystem
Const fpcCarpNameSystem_DNS = 0
Const fpcCarpNameSystem_WINS = 1
Const fpcCarpNameSystem_IP = 2

Dim Root, Array, WebProxy

' Connect to the local configuration and get the web proxy settings
Set Root = CreateObject("FPC.Root")
Set Array = Root.GetContainingArray
Set WebProxy = Array.ArrayPolicy.WebProxy

' Nothing to do if DNS names are already in use
If fpcCarpNameSystem_DNS = WebProxy.CarpNameSystem Then
    MsgBox "ISA is already configured to provide DNS names in the WPAD script.", vbInformation
    WScript.Quit
End If

' Switch the WPAD script to DNS names and save the change
WebProxy.CarpNameSystem = fpcCarpNameSystem_DNS
WebProxy.Save true

MsgBox "ISA was configured to provide DNS names in the WPAD script.", vbInformation

' Release the COM objects
Set WebProxy = Nothing
Set Array = Nothing
Set Root = Nothing

After I applied the .vbs script above, the issue seems to have been resolved, and now it’s all running smoothly.
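One way to see whether a change like this took effect is to save a copy of the WPAD script the clients receive and look for bare IP addresses where DNS names are expected. The sketch below is a rough check under assumptions, not something from Microsoft or Celestix: it scans any double-quoted string in the script text for an IPv4 literal. The `Node(...)` lines in the usage example are a made-up illustration, not the real script format.

```python
import ipaddress
import re


def quoted_ip_literals(pac_text):
    """Return double-quoted strings in a WPAD/PAC script that are bare IPv4 addresses."""
    found = []
    for literal in re.findall(r'"([^"]*)"', pac_text):
        host = literal.partition(":")[0]  # drop an optional :port suffix
        try:
            ipaddress.IPv4Address(host)
        except ValueError:
            continue  # a DNS name or some other string, not an IP literal
        found.append(literal)
    return found
```

For a hypothetical line such as `new Node("10.10.0.1",0,1.0);` the function returns `["10.10.0.1"]`, while `new Node("tmg.example.local",0,1.0);` returns an empty list, which is what one would hope to see after the script above has been applied.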

Observations

During my initial build of the new TMG unit, the first thing I noticed was the apparent effort the TMG Team took to maintain the same look and feel as the previous version. I had seen screenshots of TMG, but that doesn’t give a good feel for UI interaction. Aside from the new features, it was quite a relief to feel instantly comfortable with the UI. What a welcome relief to the overworked IT guy.

The next step was to give myself a refresher on what was new with TMG, and digest how that was going to influence my configuration after the cutover was complete. The improvements really do read like a wish list for the seasoned ISA 2006 user. Sometimes the value proposition of a software manufacturer and that of their customers don’t match up. The result is an odd rollout of new features that the customer never asked for, while what the customer actually wants is ignored. That doesn’t seem to be the case at all with this product.

For my transition, it was most prudent for me to delay taking advantage of some of these features, just to reduce the number of variables, but I will definitely be exploring the great features of TMG in the coming weeks and months. The top priority right now is getting my second TMG unit built and configured for my CoLo facility, and testing my replication. That’s what a deadline does for you. It ruins all the fun.

Once again, a big thanks to ISAserver.org for being a great resource for the ISA/TMG user community, as well as the folks at Microsoft, Rich, and the others at Celestix for making a quality product.