Upgrading to vSphere 5.0 by starting from scratch. …well, sort of.

It is never any fun getting left behind in IT.  Major upgrades every year or two might not be a big deal if you only had to deal with one piece of software, but take a look at most software inventories, and you’ll see possibly dozens of enterprise level applications and supporting services that all contribute to the chaos.  It can be overwhelming for just one person to handle.  While you may be perfectly justified in holding off on specific upgrades, there still seems to be a bit of guilt around doing so.  You might have ample business and technical factors to support such decisions, and a well crafted message providing clear reasons to stakeholders.  The business and political pressures ultimately win out, and you find yourself addressing the more customer/user facing application upgrades before the behind-the-scenes tools that power it all.

That is pretty much where I stood with my virtualized infrastructure.  My last major upgrade was to vSphere 4.0.  Sure, I had visions of keeping up with every update and patch, but a little time passed, and several hundred distractions later, I found myself left behind.  When vSphere 4.1 came out, I also had every intention of upgrading.  However, I was one of the legions of users who had a vCenter server running on a 32-bit OS, and that complicated matters a little bit.  I looked at the various publications and posts on the upgrade paths and experiences.  Nothing seemed quite as easy as I was hoping for, so I did what came easiest to my already packed schedule: nothing.  I wondered just how many administrators found themselves in the same predicament, not touching an aging, albeit perfectly functional, system.

My ESX 4.0 cluster served my organization well, but times change, and so do needs.  A few things came up to kick-start the desire to upgrade.

  • I needed to deploy a pilot VDI project, fast.  (more about this in later posts)
  • We were a victim of our own success with virtualization, and I needed to squeeze even more power and efficiency out of our investment in our infrastructure.

Both are pretty good reasons to upgrade, and while I would have loved to do my typical due diligence on every possible option, I needed a fast track.  My move to vSphere 5.0 was really just a prerequisite of sorts to my work with VDI. 

But how should I go about an upgrade?

Do I update my 4.0 hosts to the latest update that would be eligible for an upgrade path to 5.0, and if so, how much work would that be?  Should I transition to a new vCenter server, migrating the database, then run a mixed environment of ESX hosts running with different versions?  What sort of problems would that introduce?  After conferring with a trusted colleague of mine who always seems to have pragmatic sensibilities when it comes to virtualization, I decided which option was going to be the best for me.  I opted not to do any upgrade, and simply transition to a pristine new cluster.  It looked something like this:

  • Take a host (either new, or by removing an existing one from the cluster), and build it up with ESXi 5.0.
  • Build up a new 64-bit VM for running a brand new vCenter, and configure it as needed.
  • Remove one VM at a time from the old cluster by powering it down, removing it from inventory, and adding it to the new cluster (see the PowerCLI sketch after this list).
  • Once enough VMs have been removed, take another host, remove it from the old cluster, rebuild it as ESXi 5.0, and add it to the new cluster.
  • Repeat until finished.
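If you happen to have PowerCLI handy, the per-VM shuffle above can be scripted.  Below is a minimal sketch under a few assumptions: both vCenter servers are reachable, the new ESXi 5.0 host already sees the same shared datastores, the server, host, and VM names are purely hypothetical placeholders, and cmdlet behavior may vary a bit by PowerCLI version.

# Unregister the VM from the old vCenter (the files stay on the shared datastore)
Connect-VIServer old-vcenter.example.local
$vm = Get-VM "FileServer01"
$vmxPath = $vm.ExtensionData.Config.Files.VmPathName      # remember where the .vmx lives
Shutdown-VMGuest -VM $vm -Confirm:$false                  # graceful shutdown via VMware Tools
while ((Get-VM $vm.Name).PowerState -ne "PoweredOff") { Start-Sleep -Seconds 5 }
Remove-VM -VM $vm -Confirm:$false                          # removes from inventory only

# Register and start it under the new vCenter, then refresh the tools
Connect-VIServer new-vcenter.example.local
$newVm = New-VM -VMFilePath $vmxPath -VMHost (Get-VMHost "esxi5-host1.example.local")
Start-VM -VM $newVm                                        # answer "moved" if vSphere asks moved/copied
Update-Tools -VM $newVm

Nothing fancy, but it keeps the process consistent when you are repeating it dozens of times.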

For me, the decision to start from scratch won out.  Why?

  • I could build up a pristine vCenter server, with a database that wasn’t going to carry over any unwanted artifacts of my previous installation.
  • I could easily set up the new vCenter to emulate my old settings.  Folders, EVC settings, resource pools, etc.
  • I could transition or build up my supporting VMs or appliances on my new infrastructure to make sure they worked before committing to the transition.
  • I could afford a simple restart of each VM as I transitioned it to the new cluster.  I used this as an opportunity to update the VMware Tools as each VM was added to the new inventory.
  • I was willing to give up historical data in my old vSphere 4.0 cluster for the sake of simplicity of the plan and cleanliness of the configuration.
  • Predictability.  I didn’t have to read a single white paper or discussion thread on database migrations or troubles with DSNs.
  • I have a well documented ESX host configuration that is not terribly complex, and is easy to recreate across 6 hosts.
  • I just happened to have purchased an additional blade and license of ESX, so it was an ideal time to introduce it to my environment.
  • I could get my entire setup working, then get my licensing figured out after it’s all complete.

You’ll notice that one option similar to this approach would have been to simply take a host full of running VMs out of the existing cluster and add it to the new cluster.  This may have been just as good of a plan, as it would have avoided the need to manually shut down and remove each VM one at a time during the transition.  However, I would have needed to run a mix of ESX 4.0 and 5.0 hosts in the new cluster, and I didn’t want to carry anything over from the old setup.  I would have needed to upgrade or rebuild the host anyway, and I had to restart each VM to make sure it was running the latest tools.  If for nothing else than clarity of mind, my approach seemed best for me.

Prior to beginning the transition, I needed to update my Dell EqualLogic firmware to 5.1.2.  A collection of very nice improvements made this a worthwhile upgrade on its own, but it was also a requirement for what I wanted to do.  While the upgrade itself went smoothly, it did re-introduce an issue or two.  The folks at Dell EqualLogic are aware of this, and are working to address it, hopefully in their next release.  The combination of the firmware upgrade and vSphere 5 allowed me to use the latest and greatest tools from EqualLogic, primarily the Host Integration Tools VMware Edition (HIT/VE) and the storage integration in vSphere thanks to VASA.  However, as of this writing, EqualLogic does not have a full production release of their Multipathing Extension Module (MEM) for vSphere 5.0.  The EPA version was just released, but I’ll probably wait for the full release of MEM before I apply it to the hosts in the cluster.

While I was eager to finish the transition, I didn’t want to prematurely create any problems.  I took a page from my own lessons learned during my upgrade to ESX 4.0, and exercised some restraint when it came to updating the Virtual Hardware for each VM to version 8.  My last round of Virtual Hardware updates caused some unexpected results, as I shared in “Side effects of upgrading VM’s to Virtual Hardware 7 in vSphere.”  Apparently, I wasn’t the only one who ran into issues, because that post has statistically been my all-time most popular post.  The abilities of Virtual Hardware 8 powered VMs are pretty neat, but I’m in no rush to make any virtual hardware changes to some of my key production systems, especially those noted.

So, how did it work out?  The actual process completed without a single major hang-up, and I am thrilled with the result.  The irony here is that even though vSphere provides most of the intelligence behind my entire infrastructure, and does things that are mind-bogglingly cool, it was so much easier to upgrade than, say, SharePoint, AD, Exchange, or some other enterprise software.  Great technologies are great because they work like you think they should.  No exception here.  If you are considering a move to vSphere 5.0, and are a little behind on your old infrastructure, this upgrade approach might be worth considering.

Now, onto that little VDI project…

Resources

A great resource on setting up SQL 2008 R2 for vCenter
How to Install Microsoft SQL Server 2008 R2 for VMware vCenter 5

Installing vCenter 5 Best Practices
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2003790

A little VMFS 5.0 info
http://www.yellow-bricks.com/2011/07/13/vsphere-5-0-what-has-changed-for-vmfs/

Information on the EqualLogic Multipathing Extension Module (MEM), and if you are an EqualLogic customer, why you should care.
https://whiteboardninja.wordpress.com/2011/02/01/equallogic-mem-and-vstorage-apis/

Zero to 32 Terabytes in 30 minutes. My new EqualLogic PS4000e

Rack it up, plug it in, and away you go.  Those are basically the steps needed to expand a storage pool by adding another PS array using the Dell/EqualLogic architecture.  A few weeks ago I took delivery of a new PS4000e to complement my PS6000e at my primary site.  The purpose of this additional array was really simple.  We needed raw storage capacity.  My initial proposal and deployment of my virtualized infrastructure a few years ago was a good one, but I deliberately did not include our big flat-file storage servers in that initial scope of storage space requirements.  There was plenty to keep me occupied between the initial deployment and now.  It allowed me to get most of my infrastructure virtualized, and gave the skeptics who thought all of this new-fangled technology was too good to be true a chance to buy in.  Since that time, storage prices have fallen, and larger drive sizes have become available.  Delaying the purchase aligned well with “just-in-time” purchasing principles, and also gave me an opportunity to address the storage issue in the correct way.  At first, I thought all of this was subject matter not worthy of writing about.  After all, EqualLogic makes it easy to add storage.  But that only addresses part of the problem.  Many of you face the same dilemma regardless of what your storage solution is: user facing storage growth.

Before I start rambling about my dilemma, let me clarify what I mean by a few terms I’ll be using: “user facing storage” and “non user facing storage.”

  • User Facing Storage is simply the storage that is presented to end users via file shares (in Windows) and NFS mounts (in Linux).  User facing storage is waiting there, ready to be sucked up by an overzealous end user. 
  • Non User Facing Storage is the storage occupied by the servers themselves, and the services they provide.  Most end users generally have no idea how much space a server reserves for, say, SQL databases or transaction logs (nor should they!)  Non user facing storage needs are easier to anticipate and manage because the storage is only exposed to system administrators. 

Which array…

I decided to go with the PS4000e because of the value it returns, and how it addresses my specific need.  If I had targeted VDI or storage for other I/O intensive services, I would have opted for one of the other offerings in the EqualLogic lineup.  I virtualized the majority of my infrastructure on one PS6000e with sixteen 1TB drives in it, but it wasn’t capable of the raw capacity that we now needed to virtualize our flat-file storage.  While the effective number of 1Gb ports is cut in half on the PS4000e as compared to the PS6000e, I have not been able to gather any usage statistics against my traditional storage servers that suggest the throughput of the PS4000e will not be sufficient.  The PS4000e allowed me to trim a few dollars off of my budget line estimates, and may work well at our CoLo facility if we ever need to demote it.

I chose to create a storage pool so that I could keep my volumes that require higher performance on the PS6000, and have the dedicated storage volumes on the PS4000.  I will do the same for when I eventually add other array types geared for specific roles, such as VDI.

Truth be told, we all know that sixteen 2TB drives do not equal 32 terabytes of real world space.  The RAID50 penalty knocks that down to about 21TB.  Cut that by about half for average snapshot reserves, and it’s more like 11TB.  Keeping a little bit of free pool space available is always a good idea, so let’s just say it effectively adds 10TB of full-fledged enterprise class storage.  This adds to my effective storage space of 5TB on my PS6000.  Fantastic.  …but wait, one problem.  No, several problems.
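Just to make the back-of-the-envelope math above explicit, here it is as a quick PowerShell calculation.  The ratios are rough approximations, not EqualLogic’s exact overhead figures.

$raw        = 16 * 2TB           # sixteen 2TB drives = 32TB raw
$afterRaid  = $raw * 0.65        # RAID50 (plus spares) leaves roughly 21TB
$afterSnaps = $afterRaid * 0.5   # ~50% snapshot reserve leaves roughly 10-11TB
"Usable: about {0:N0} TB" -f (($afterSnaps - 0.5TB) / 1TB)   # keep some free pool space

It isn’t precise, but it keeps expectations honest when someone hears “32TB.”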

The Dilemma

Turning up the new array was the easy part.  In less than 30 minutes, I had it mounted, turned on, and configured to work with my existing storage group.  Now for the hard part: figuring out how to utilize the space in the most efficient way.  User facing storage is a wildcard; do it wrong and you’ll pay for it later.  While I didn’t know the answer, I did know some things that would help me come to an educated decision.

  • If I migrate all of the data on my remaining physical storage servers (two of them, one Linux, and one Windows) over to my SAN, it will consume virtually all of my newly acquired storage space.
  • If I add a large amount of user-facing storage, and present that to end users, it will get sucked up like a vacuum.
  • If I blindly add large amounts of great storage at the primary site without careful thought, I will not have enough storage at the offsite facility to replicate to.
  • Large volumes (2TB or larger) not only run into technical limitations, but are difficult to manage.  At that size, there may also be a co-mingling of data that is not necessarily business critical.  Doling out user facing storage in large volumes is easy to do.  It will come back to bite you later on.
  • Manipulating the old data in the same volume as new data does not bode well for replication and snapshots, which look at block changes.  Breaking them into separate volumes is more effective.
  • Users will not take the time or the effort to clean up old data.
  • If data retention policies are in place, users will generally be okay with them after a substantial amount of complaining.  It’s not too different from the complaining you might hear when there are no data retention policies, but you have no space.  Pick your poison.
  • Your users will not understand data retention policies if you do not understand them.  Time for a plan.

I needed a way to compartmentalize some of the data so that it could be identified as “less important” and then perhaps live on less important storage.  By “less important storage,” I mean storage that lives on a part of the SAN that is not replicated, or in a worst case scenario, on some old decommissioned physical servers, where the data resides for a defined amount of time before it is permanently archived and removed from the probationary location.

The Solution (for now)

Data lifecycle management.  For many this means some really expensive commercial package, and that might be the way to go for you too.  To me, this is really nothing more than determining what is important data and what isn’t as important, and having a plan to help automate the demotion, or retirement, of that data.  However, there is a fundamental problem with this approach.  Who decides what’s important?  What are the thresholds?  Last accessed time?  Last modified time?  What are the ramifications of cherry-picking files from a directory structure because they exceed policy thresholds?  What is this going to break?  How easy is it to recover data that has been demoted?  There are a few steps I need to take to accomplish this. 

1.  Poor man’s storage tiering.  If you are out of SAN space, re-provision an old server.  The purpose of this will be to serve up volumes that can be linked to the primary storage location through symbolic links.  These volumes can then be backed up at a less frequent interval, as it would be considered less important.  If you eventually have enough SAN storage space, these could be easily moved onto the SAN, but in a less critical role, or on a SAN array that has larger, slower disks.
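To make the symbolic link idea concrete, here is roughly what it looks like (server names and paths are hypothetical).  On Windows, mklink can point a folder inside the primary share at a share on the re-provisioned server; on Linux, ln -s does the equivalent.  Note that Windows may require remote symlink evaluation to be enabled (fsutil behavior set SymlinkEvaluation), or you may prefer DFS for the same effect.

rem Windows: link an "Archive" folder in the primary share to the old server
mklink /D "D:\Shares\Engineering\Archive" "\\oldserver01\archive\Engineering"

# Linux: link the demoted location back into the exported tree
ln -s /mnt/oldserver01/archive/engineering /exports/engineering/archive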

2.  Breaking up large volumes.  I’m convinced that giant volumes do nothing for you when it comes to understanding and managing the contents.  Turning larger blobs into smaller blobs also serves another very important role.  It allows the intelligence of the EqualLogic solutions to do their work on where the data should live in a collection of arrays.  A storage group that consists of, say, an SSD based array, a PS6000, and a PS4000 can effectively store each volume on the array that best suits the demand.

3.  Automating the process.  This will come in two parts: a.) deciding on structure, policies, etc., and b.) making or using tools to move the files from one location to another.  On the Linux side, this could mean anything from a bash script to something written in Python, with cron to schedule the occurrence.  In Windows, you could leverage PowerShell, VBScript, or batch files.  This can be as simple, or as complex, as your needs require.  However, if you are like me, you have limited time to tinker with scripting.  If there is something turn-key that does the job, go for it.  For me, that is an affordable little utility called “TreeSize Pro.”  It gives you not only the ability to analyze the contents of NTFS volumes, but can easily automate the pruning of this data to another location.
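If you do want to roll your own, a minimal PowerShell sketch of the idea looks something like this.  The paths and the two-year “last modified” threshold are hypothetical, and you would obviously want to test against a copy before pointing it at production data.

$source  = "D:\Shares\Engineering"
$archive = "\\oldserver01\archive\Engineering"
$cutoff  = (Get-Date).AddYears(-2)

Get-ChildItem -Path $source -Recurse |
  Where-Object { -not $_.PSIsContainer -and $_.LastWriteTime -lt $cutoff } |
  ForEach-Object {
    $dest    = $_.FullName.Replace($source, $archive)          # mirror the folder layout
    $destDir = Split-Path -Path $dest -Parent
    if (-not (Test-Path $destDir)) { New-Item -ItemType Directory -Path $destDir | Out-Null }
    Move-Item -Path $_.FullName -Destination $dest              # demote the stale file
  }

Schedule it with Task Scheduler (or cron for the Linux equivalent) and you have the bones of a poor man’s data lifecycle tool.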

4.  Monitoring the result.  This one is easy to overlook, but you will need to monitor the fruits of your labor, and make sure it is doing what it should be doing: maintaining available storage space on critical storage devices.  There are a handful of nice scripts that have been written for both platforms that help you monitor free storage space at the server level.
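A free-space check doesn’t have to be anything elaborate.  Here is a small PowerShell example that warns when any fixed disk on a server drops below a threshold; the server names and the 15% threshold are hypothetical.

$servers   = "fileserver01", "fileserver02"
$threshold = 15   # warn below 15% free

foreach ($server in $servers) {
  Get-WmiObject Win32_LogicalDisk -ComputerName $server -Filter "DriveType=3" |
    ForEach-Object {
      $pctFree = [math]::Round(($_.FreeSpace / $_.Size) * 100, 1)
      if ($pctFree -lt $threshold) {
        Write-Warning ("{0} {1} is down to {2}% free" -f $server, $_.DeviceID, $pctFree)
      }
    }
}

Pipe the output into an email notification of your choosing and you will know about a problem before your users do.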

The result

The illustration below helps demonstrate how this would work. 

image

As seen below, once a system is established to automatically move and house demoted data, you can more effectively use storage on the SAN.

image

Separation anxiety…

In order to make this work, you will have to work hard at making sure that all of this is pretty transparent to the end user.  If you have data that has complex external references, you will want to preserve the integrity of the data that relies on those dependent files.  Hey, I never said this was going to be easy. 

A few things worth remembering…

If 17 years in IT, and a little observation of human nature, has taught me one thing, it is that we all undervalue our current data, and overvalue our old data.  You see it time and time again.  Storage runs out, and there are cries for running down to the local box store and picking up some $99 hard drives.  What needs to reside on there is mission critical (hence the undervaluing of the new data).  Conversely, efforts to have users clean up old data from 10+ years ago had users hiding files in special locations, even though it was recorded that the data had not been modified, or even accessed, in 4+ years.  All of that, of course, lives on enterprise class storage.  An all too common example of overvaluing old data.

Tip.  Remember your Service Level Agreements.  It is common in IT to not only have SLAs for systems and data, but for one’s position.  These without doubt are tied to one another.  Make sure that one doesn’t compromise the other.  Stop-gap measures to accommodate more storage will trigger desperate, seemingly affordable solutions (e.g. adding cheap non-redundant drives in an old server somewhere).  Don’t do it!  All of those armchair administrators in your organization will be nowhere to be found when those drives fail, and you are left to clean up the mess.

Tip.  Don’t ever thin provision user facing storage.  Fortunately, I was lucky to be clued into this early on, but I can only imagine the well-intentioned administrator who wanted to present a nice amount of storage space to the users, only to find it sucked up a few days later.  Save the thin provisioning for non user facing storage (servers with SQL databases and transaction logs, etc.)

Tip.  If you are presenting proposals to management, or general information updates to users, I would suggest quoting only the amount of effective, usable space that will be added.  In other words, don’t say you are adding 32TB to your storage infrastructure when, in fact, it is closer to 10TB.  Say that it is 10TB of extremely sophisticated, redundant, enterprise class storage that you can “bet the business” on.  Its scalability, flexibility, and robustness are needed for the 24/7 environments we insist upon.  It will just make it easier that way.

Tip.  It may seem unnecessary to you, but continue to show off snapshots, replication, and other unique aspects of SAN storage if you still have those who doubt the power of this kind of technology – especially when they see the cost per TB.  Remind them how long it would take (if it is even possible) to protect that same data with traditional storage.  Do everything you can to help those who approve these purchases.  More than likely, they won’t be as impressed by, say, how quick a snapshot is, but rather shocked at how poorly traditional storage can be protected.

You may have noticed I do not have any rock-solid answers for managing the growth and sustainability of user facing data.  Situations vary, but the factors that help determine that path for a solution are quite similar.  Whether you decide on a turn-key solution, or choose to demonstrate a little ingenuity in times of tight budgets, the topic is one that you will probably have to face at some point.

 

Finally. A practical solution to protecting Active Directory

 

Active Directory.  It is the brains of most modern-day IT infrastructures, providing just about every conceivable control over how users, computers, and information interact with each other.  Authentication and user, group, and computer access control all help provide logical barriers that allow for secure access, yet a seamless user experience with single sign-on access to resources.  While it has the ability to improve and integrate critical services such as DNS, DHCP, and NTP, in many ways those services become dependent on Active Directory.  These days, Active Directory controls more than just pure Windows environments.  Integration with non-Microsoft operating systems like Ubuntu, SUSE, and VMware’s vSphere is becoming more common thanks to products such as Likewise.  The environment that I manage has Windows servers and clients, most distributions of Linux, Macs, a few flavors of Unix, VMware, and iPhones.  All of them rely on Active Directory.  You quickly learn that if Active Directory goes down, so does your job security.

Active Directory will run happily even under less than ideal circumstances.  It is incredibly resilient, and somehow can put up with server crashes, power outages, and all sorts of debauchery.  But neglect is not a required ingredient for things to go wrong.  When it does, the results can be devastating.  AD problems can be difficult to track down, and its tentacles will affect services you never considered.  A corrupt Active Directory, or corrupt Domain Controllers it runs on, can make your Exchange and SQL servers crumble around you.  I lived through this experience (barely) a while back, and even though my preparation for such scenarios looked very good on paper, I spent a healthy amount of time licking my wounds, and reassessing my backup strategy for Active Directory.  I never want to put myself in that position again.

As important as Active Directory is, it can be quite challenging to protect.  Why?  I believe the answer can be boiled down to two main factors: it’s distributed, and it’s transaction based.  In other words, the two traits that make it robust also make it difficult to protect.  Large enterprises usually have a well architected AD infrastructure, and at least understand the complexities of protecting their AD environment.  Many others are left pondering the various ways to protect it.

  • File based backups using traditional backup methods.  This has never been enough, but my bet is that you’d find a number of smaller environments do this – if they do anything at all.  It has worked for them only because they’ve never had a failure of any sort.
  • AD backup agents that are a part of a commercial backup application.  Some applications, like Symantec Backup Exec (what I previously relied on), seem like a good idea, but show their true colors when you actually try to use them for recovery.  While the agents should be extending the functionality of the backup software, they just add to an already complex solution that feels like a monstrosity geared for other purposes.
  • Exporting AD on Windows 2008 based Domain Controllers by using NTDSUTIL and the like.  This is difficult at best, arguably incomplete, and if you have a mix of Windows 2008 and Windows 2003 DCs, it won’t work.
  • Those who have virtualized their Domain Controllers often think that a well timed independent snapshot or VCB backup will protect them.  This is not true either.  You will have a VM consistent backup of the VM itself, but it does nothing to coordinate the application with the other Domain Controllers and the integrity of its contents.  In theory, they could be backed up properly if every single DC were shut down at the same time, but most of us know that would not be a solution at all.
  • Dedicated Solutions exist to protect Active Directory, but can be overly complex, and outrageously expensive.  I’m sure they do their job well, but I couldn’t get the line item past our budget line owner to find out.

The result can be a desire to protect AD, but uncertainty about what “protect” really means.  Is protecting the server good enough?  Is protecting AD itself enough?  Does one need both, and if so, how does one go about doing that?  Without fully understanding the answers to those questions, something inevitably goes wrong, and the Administrator is frantically flipping through the latest TechNet article on authoritative restores, while attempting to figure out their backup software.  It’s particularly painful to the Administrator who had the impression that they were protecting their organization (and themselves) when in fact, they were not. 

In my opinion, protecting the domain should occur at two different levels.

  • Application layer.  This is critical.  Among other things, the backup will coordinate Active Directory so that all of its Update Sequence Numbers (USNs) are at an agreed upon state.  This will avoid USNs that are out of sync, which are at the root of so many AD related problems.  Application layer protection should also honor these AD specific attributes so that granular recovery of individual objects is possible.  Good backup software should leverage APIs that take advantage of the Volume Shadow Copy Service (VSS).  (A minimal built-in example follows this list.)
  • Physical layer.  This protects the system that the services may be running on.  If it’s a physical server, it could be using some disk imaging software such as Acronis, or Backup Exec System Recovery.  If it’s virtualized, an independent backup of the VM will do.  Some might suggest that protecting the actual machine isn’t technically required.  The idea behind that reasoning is that if there is a problem with the physical machine, or the OS, one can quickly decommission and commission another DC with “dcpromo.”  While protecting the system that AD runs on may not be required, it may help speed up your ability (in conjunction with Application layer protection) to correct issues from a previously known working state.
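For a frame of reference on the application layer, Windows Server 2008 and 2008 R2 Domain Controllers can at least take a VSS-based system state backup (which includes the AD database) using the built-in Windows Server Backup feature.  A minimal example, assuming a dedicated E: backup target and that the Windows Server Backup command-line tools are installed:

wbadmin start systemstatebackup -backupTarget:E:

It’s better than nothing, but recovering an individual object from a system state backup still means a Directory Services Restore Mode boot and an authoritative restore – which is exactly the pain a purpose-built tool avoids.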

I was introduced to CionSystems by a colleague of mine who suggested their “Active Directory Self-Service” product to help us with another need of ours.  Along the way, I couldn’t help but notice their AD backup offering.  Aptly named, “Active Directory Recovery” is a complete application layer solution.  I tried it out, and was sold.  It allows for a simple, coordinated backup and recovery of Active Directory.  A recovery can be either a complete point-in-time, or a granular restore of an object.  It is agentless, meaning that you don’t have to install software on the DCs.  The first impression after working with it is that it was designed for one purpose; to backup Active Directory.  It does it, and does it well.

The solution will run on any spare machine running IIS and SQL.  Once installed, configuring it is just a matter of pointing it to your Domain Controller that runs the PDC Emulator role.  After a few configuration entries are made, the Administration console can be accessed with your web browser from anywhere on your network.

image

The next step is to set up a backup job, and let it run.  That’s it.  Fast, simple, and complete.  From the home page, there are a few different ways you can look at objects that you want to recover.

If it’s a deleted object, you can click on the “Deleted Objects” section.  Objects with a backup to restore from will show up in green, and the backups present themselves below each object.  Below you will see a deleted computer object, and the backups that it can be restored from.

image

The “List Backups” view simply shows the backups created, in chronological order.  From there you can do full restores, or restore an individual object that still exists in AD.  Unlike authoritative restores, you do not have to do any system restarts.

image

During the restore process, “Active Directory Recovery” will expose individual attributes of the object that you want to restore – if you wish for the restore to be that granular.  If an attribute is restorable, there is a checkbox next to it.  Non-modifiable attributes will not have a checkbox next to them.

image

One of my favorite features is that it provides a way to make a true, portable backup.  One can export the backup to a single file (a proprietary .bin file) that contains your entire AD backup, and save it onto a CD, or to a remote location.  This is a wish list item I’ve had for about as long as AD has been around.  There are many other nice features, such as email notifications, filtering and comparison tools, as well as backup retention settings. 

I use this product to complement my existing strategy for protecting my AD infrastructure.  While my virtualized Domain Controllers are replicated to a remote site (the physical protection, so to speak), I protect my AD environment at the application level with this product.  The server that “Active Directory Recovery” runs on is also replicated, but to be extra safe, I create a portable/exported backup that is also shipped off to the offsite location.  This way I have a fully independent backup of AD.  If I’m doing some critical updates to my Domain Controllers, I first make a backup using Active Directory Recovery, then make my snapshots on my virtualized DCs.  That way, I have a way to roll back the changes that is truly application consistent.

After using the product for a while, I can appreciate that I don’t have to invest much time to keep my backups up and running.  I previously used Symantec’s Backup Exec to protect AD, but grew tired of agent issues, licensing problems, and the endless backup failure messages.  I lost confidence in its ability to protect AD, and am not interested in going back. 

Hopefully this gives you a little food for thought on how you are protecting your Active Directory environment.  Good luck!

Replication with an EqualLogic SAN; Part 5

 

Well, I’m happy to say that replication to my offsite facility is finally up and running now.  Let me share with you the final steps to get this project wrapped up. 

You might recall that in my previous offsite replication posts, I had a few extra challenges.  We were a single site organization, so in order to get replication up and running, an infrastructure at a second site needed to be designed and put in place.  My topology still reflects what I described in the first installment, but simple pictures don’t convey the work it took to get this set up.  It was certainly a good exercise in keeping my networking skills sharp.  My appreciation for the folks who specialize in complex network configurations and address management has been renewed.  They probably seldom hear words of thanks for, say, that well designed subnetting strategy.  They are an underappreciated bunch for sure.

My replication has been running for some time now, but this was all within the same internal SAN network.  While other projects prevented me from completing this sooner, it gave me a good opportunity to observe how replication works.

Here is the way my topology looks fully deployed.

image

Most colocation facilities or datacenters give you about 2 square feet to move around in (only a slight exaggeration), so it’s not the place you want to be contemplating reasons why something isn’t working.  It’s also no fun realizing you don’t have the remote access you need to make the necessary modifications, and you don’t, or can’t, drive to the CoLo.  My plan for getting this second site running was simple.  Build up everything locally (switchgear, firewalls, SAN, etc.) and change the topology at my primary site to emulate the 2nd site.

Here is the way it was running while I worked out the kinks.

image

All replication traffic occurs over TCP port 3260.  Both locations had to have accommodations for this.  I also had to ensure I could manage the array living offsite.  Testing this out with the modified infrastructure at my primary site allowed me to verify traffic was flowing correctly.

The steps taken to get two SAN replication partners transitioned from a single network to two networks (onsite) were:

  1. Verify that all replication is running correctly when the two replication partners are in the same SAN Network
  2. You will need a way to split the feed from your ISP, so if you don’t have one already, place a temporary switch at the primary site on the outside of your existing firewall.  This will allow you to emulate the physical topology of the real site, while having the convenience of all of the equipment located at the primary site. 
  3. After the 2nd firewall (destined for the CoLo) is built and configured, place it on that temporary switch at the primary site.
  4. Place something (a spare computer perhaps) on the SAN segment of the 2nd firewall so you can test basic connectivity between the two SAN networks and ensure routing is functioning (see the quick check after this list). 
  5. Pause replication on both ends, then take the target array and its switchgear offline. 
  6. Plug the target array’s Ethernet ports into the SAN switchgear for the second site, then change the IP addressing of the array/group so that it’s running under the correct net block.
  7. Re-enable replication and run test replicas, starting with the Group Manager, then ASM/VE, then ASM/ME.
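For the basic connectivity test in step 4, something as simple as the following from the spare machine on the second SAN segment proves that routing works and that the iSCSI/replication port is reachable.  The address is hypothetical, and the telnet client may need to be added as a Windows feature first.

ping 10.10.20.10
telnet 10.10.20.10 3260

If the telnet session opens, TCP 3260 is making it through both firewalls.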

It would be crazy not to take this one step at a time, as you learn a little at each step, and can identify issues more easily.  Step 3 introduced the most problems, because traffic has to traverse routers that are also secure gateways.  Not only does one have to consider a couple of firewalls, you now run into other considerations that may be undocumented.  For instance:

  • ASM/VE replication occurs courtesy of vCenter, but ASM/ME replication is configured inside the VM.  Sure, it’s obvious, but so obvious it’s easy to overlook.  That means any topology changes will require adjustments in each VM that utilizes guest attached volumes.  You will need to re-run the “Remote Setup Wizard” to adjust the IP address of the target group that you will be replicating to.
  • ASM/ME also uses a VSS control channel to talk with the array.  If you changed the target array’s group and interface IP addresses, you will probably need to adjust what IP range will be allowed for VSS control.
  • Not so fast though.  VMs that use guest iSCSI initiated volumes typically have those iSCSI-dedicated virtual network cards set with no default gateway, and you never want to enter more than one default gateway in this sort of situation.  The proper way to handle this is to add a persistent static route.  This needs to be done before you run the Remote Setup Wizard above.  Fortunately the method hasn’t changed for at least a decade.  Just type in:

route -p add [destination network] mask [subnet mask] [gateway] metric [metric]
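For example, a guest whose iSCSI NIC sits on a hypothetical 10.10.10.x network, reaching a replication target network of 10.10.20.0/24 through a gateway of 10.10.10.1, would use:

route -p add 10.10.20.0 mask 255.255.255.0 10.10.10.1 metric 1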

  • Certain kinds of traffic that pass almost without a trace across a layer 2 segment show up right away when being pushed through very sophisticated firewalls whose default stance is to deny everything unless explicitly allowed.  Fortunately, Dell puts out a nice document on their EqualLogic arrays that helps here.
  • If possible, it will be easiest to configure your firewalls with route relationships between the source SAN and the target SAN.  It may complicate your rulesets (NAT relationships are a little more intelligent when it comes to rulesets in TMG), but it simplifies how each node is seeing each other.  This is not to say that NAT won’t work, but it might introduce some issues that wouldn’t be documented.

Step 7 exposed an unexpected issue; terribly slow replicas.  Slow even though it wasn’t even going across a WAN link.  We’re talking VERY slow, as in 1/300th the speed I was expecting.  The good news is that this problem had nothing to do with the EqualLogic arrays.  It was an upstream switch that I was using to split my feed from my ISP.  The temporary switch was not negotiating correctly, and causing packet fragmentation.  Once that switch was replaced, all was good.

The other strange issue was that even though replication was running great in this test environment, I was getting errors with VSS.  ASM/ME at startup would indicate “No control volume detected.”  Even though replicas were running, the replicas couldn’t be accessed, used, or managed in any way.  After a significant amount of experimentation, I eventually opened a case with Dell Support.  Running out of time to troubleshoot, I decided to move the equipment offsite so that I could meet my deadline.  Well, when I came back to the office, VSS control magically worked.  I suspect that the array simply needed to be restarted after I had changed the IP addressing assigned to it. 

My CoLo facility is an impressive site.  Located in the Westin Building in Seattle, it is also where the Seattle Internet Exchange (SIX) is located.  Some might think of it as another insignificant building in Seattle’s skyline, but it plays an important part in efficient peering for major service providers.  Much of the building has been converted from a hotel to a top tier, highly secure datacenter and a location in which ISPs get to bridge over to other ISPs without hitting the backbone.  It has dedicated water and power supplies, full facility fail-over, and elevator shafts that have been remodeled to provide nothing but risers for all of the cabling.  Having a CoLo facility that is also an Internet Exchange Point for your ISP is a nice combination.

Since I emulated the offsite topology internally, I was able to simply plug in the equipment and turn it on, with confidence that it would work.  It did. 

My early measurements on my feed to the CoLo are quite good.  Since the replication times include buildup and teardown of the sessions, one might get a more accurate measurement of sustained throughput from larger replicas.  The early numbers show that my 30Mbps circuit is translating to replication rates in the neighborhood of 10 to 12GB per hour (205MB per min, or 3.4MB per sec.).  If multiple jobs are running at the same time, the rate will be affected by the other replication jobs, but the overall throughput appears to be about the same.  Also affecting speeds will be other traffic coming to and from our site.

There is still a bit of work to do.  I will monitor the resources, and tweak the scheduling to minimize the overlap of the replication jobs.  In past posts, I’ve mentioned that I’ve been considering the idea of separating the guest OS swap files from the VMs, in an effort to reduce the replication size.  Apparently I’m not the only one thinking about this, as I stumbled upon this article.  It’s interesting, but a fair amount of work.  Not sure if I want to go down that road yet.

I hope this series helped someone with their plans to deploy replication.  Not only was it fun, but it is a relief to know that my data, and the VMs that serve up that data, are being automatically replicated to an offsite location.

Firewall adventures: Transitioning from ISA 2006 to TMG

 

One of the key parts of my seemingly never-ending offsite replication project was to build out a second location to replicate my data to.  Before I could do this, some prep work on my network was in order.  It was a great opportunity for me to replace my existing firewall running Microsoft’s ISA Server 2006 with their newest edition, named Forefront Threat Management Gateway, or TMG. 

My new TMG system is running on a 1U appliance provided by Celestix Networks, Inc.  Introduced to the Celestix line of appliances back in 2007, I’ve been very happy with the great turn-key solutions they provide.  It’s great for those who want to run ISA/TMG, but do not want to build up their own unit, and do not want to handle licensing of the OS or TMG.  The lineup they offer ranges anywhere from branch office solutions to backbone class systems.  Some really nice abilities are built right into the unit, such as web based management, and updating the unit to a new build by booting to PXE.  It also offers a “Last Good Version” (LGV) that will reimage the disk to the state in which it was saved, in the event of a configuration change going terribly wrong.  Definitely peace of mind for those critical upgrades.  The nature of the image creation and restore is such that it requires the system to be offline.  I hope that in the future, Celestix can partner with Acronis, or some other disk imaging solution, to make this process a little more convenient.  It still works pretty well though.  Anyway, onto the transition.

 

Upgrade, or transition?

This seems to be one of those ubiquitous IT questions for almost any enterprise solution that is being run in a production environment.  Should you do an in-place upgrade, or should you transition to a pristine installation?  In this particular case, the question was already answered for me, as my old appliance ran a 32-bit version of Windows Server 2003, and could not be upgraded due to system requirements.  That was okay with me.  A true upgrade fell out of favor with me years ago; there are just too many unknowns introduced, which can make post deployment issues extremely difficult to diagnose.  I’ve also sensed that the true upgrade has fallen out of favor with software manufacturers as well.  Whether it’s Exchange, SQL, or a server OS, the recommended way these days seems to be transitioning to a pristine installation.

 

The new box

For the new environment I was building, I chose two Celestix MSA5200i units; one for the primary facility, and one for the colocation site.  These particular units run TMG Standard, on top of Windows Server 2008 R2.  It would have been nice to go with a unit running the Enterprise Edition of TMG (which offers the ability to create a redundant array of servers), but I had to cut costs, and going with the Standard Edition was the easiest way to do this.

With the new unit sitting in front of me, I decided to build it up in its entirety offline, and wait for a weekend to cut it over.  ISA has the ability to dump out all, or parts, of the old configuration in XML, so my early (albeit naive) visions had me thinking that my transition steps would simply be exporting the configuration running on the ISA 2006 box, and importing it to the TMG box.  Well, the devil is in the details, and while this could work for certain scenarios, it didn’t work for me on the first few tries.  I had a choice.  Continue chasing down the reason why it wasn’t importing (an unknown time limit), or pound out a new configuration in a few days (a known time limit).  No time to complain – just do it and get it over with.  Good documentation in OneNote, and the ability to RDP into your existing ISA installation, are key to this being a successful way to build a new configuration from scratch.  To minimize typos and other fat fingering, I did export custom sets and protocols at a very granular level.  Sure, I could have typed them out easily enough, but it was more reliable to export at the individual item level.

A properly configured TMG box is almost always joined to Active Directory, and there are some steps that you just have to wait to do until the day of transition.  This is reasonable, but it does have to be planned for.  Things like using Kerberos Constrained Delegation in publishing rules can only be configured after the box is joined.  It’s also worth making sure you know all AD related settings (delegation, OU location, GPO overrides, etc.) for the existing firewall that you will be decommissioning.  Nothing like an oversight here to mess you up.

 

Post installation surprises

The abilities of TMG make it far more than a simple edge security device.  That is what truly separates it from the competition.  Since it is integrated into the operation of so many functions up and down the protocol stack, a transition like this can be a bit disruptive.  I’m happy to say that considering the type of change, I didn’t run into too many troubles.  I had prepared a checklist of basic functions and services I could run through to quickly validate a successful transition.  This made validation easy, and prevented most Monday morning surprises. 

After about 20 minutes, I had the old ISA box removed from the domain, and the new one added and configured.  The rest of the time was spent confirming functionality, and resolving a few issues.  Here were some of the minor ones:

  • ARP caching.  This isn’t the first time this has bitten me.  I forgot that the ARP cache on the connecting devices needed to be flushed.  Silly mistake, but the nice part is that it eventually corrects itself.  (I wish I had a few more of those kinds of problems.)
  • Publishing rules and Listeners.  After you join the box to the domain, you will want to check these, and recreate if necessary.  I had a few publishing rules that I had to recreate.  Not a big deal.  They looked okay, but just didn’t work.
  • I have several publicly registered IP addresses bound to the external (WAN) interface.  Windows 2008 and TMG didn’t bind to the IP address I was expecting (or at least not the way Windows 2003 and ISA did).  A quick fix in the TMG configuration resolved this.  See this TechNet article on why the behavior is different.

The final issue was a little trickier to fix.  The symptom was that web browsing was working, but it just took a while to connect.  After looking at the logging (and being tipped off by a thread on isaserver.org’s community forum), I noticed that the web proxy was attempting to use one of the RRAS adapters as the default gateway.  It was being caused by web proxy clients getting confused when reading WPAD for automatic browser/proxy configuration.  The slow browsing would go away as soon as the web browser’s proxy settings were manually configured.  Apparently this behavior wasn’t unique to TMG (others on ISA 2006 have experienced similar behavior), but this was the first time I’d ever seen it. 

There was a .vbs script that supposedly fixed the issue.  The purpose of the .vbs script was to insert the FQDN of the TMG unit into WPAD.  While the script ran successfully, it didn’t change the behavior for me.  At this point, a little bit of panic set in.  I thought it best to tap into the expertise of my good friend and TMG superstar Richard Hicks.  Richard is a Microsoft MVP, and has a great blog that should be in everyone’s RSS feed list.  After briefing him on the scenario, he provided me with another script (courtesy of TechNet) that would attempt to achieve the same result as the failed script.

http://blogs.technet.com/isablog/archive/2008/06/26/understanding-by-design-behavior-of-isa-server-2006-using-kerberos-authentication-for-web-proxy-requests-on-isa-server-2006-with-nlb.aspx

Option Explicit

Const fpcCarpNameSystem_DNS = 0
Const fpcCarpNameSystem_WINS = 1
Const fpcCarpNameSystem_IP = 2

Dim Root, Array, WebProxy

Set Root = CreateObject("FPC.Root")
Set Array = Root.GetContainingArray
Set WebProxy = Array.ArrayPolicy.WebProxy

If fpcCarpNameSystem_DNS = WebProxy.CarpNameSystem Then

  MsgBox "ISA is already configured to provide DNS names in the WPAD script.", vbInformation
  WScript.Quit

End If

WebProxy.CarpNameSystem = fpcCarpNameSystem_DNS
WebProxy.Save true

MsgBox "ISA was configured to provide DNS names in the WPAD script.", vbInformation

Set WebProxy = Nothing
Set Array = Nothing
Set Root = Nothing

After I applied the .vbs script above, the issue seemed to resolve itself, and now it’s all running smoothly. 

 

Observations

During my initial build of the new TMG unit, the first thing I noticed was the apparent effort the TMG team took to maintain the same look and feel as the previous version.  I had seen screenshots of TMG, but that doesn’t give a good feel for UI interaction.  Aside from the new features, it was quite a relief to feel instantly comfortable with the UI.  What a welcome relief to the overworked IT guy.

The next step was to give myself a refresher on what was new with TMG, and digest how that was going to influence my configuration after the cutover was complete.  The improvements really do read like a wish list for the seasoned ISA 2006 user.  Sometimes the value propositions for a software manufacturer and its customers don’t match up.  The result is an odd rollout of new features that the customer never asked for, while ignoring what the customer wants.  That doesn’t seem to be the case at all with this product. 

For my transition, it was most prudent for me to delay taking advantage of some of these features, just to reduce the variables, but I will definitely be exploring the great features of TMG in the coming weeks and months.  The top priority right now is getting my second TMG unit built and configured for my CoLo facility, and testing my replication.  That’s what a deadline does for you.  It ruins all the fun.

Once again, a big thanks to ISAserver.org for being a great resource for the ISA/TMG user community, as well as to the folks at Microsoft, Rich, and the others at Celestix for making a quality product.

Replication with an EqualLogic SAN; Part 4

 

If you had asked me 6+ weeks ago how far along my replication project would be on this date, I would have thought I’d be basking in the glory of success, and admiring my accomplishments.

…I should have known better.

Nothing like several IT emergencies unrelated to this project to turn one’s itinerary into garbage.  A failed server (an old physical storage server that I don’t have room on my SAN for), a tape backup autoloader that tanked, some Exchange Server and Domain Controller problems, and a host of other odd things that I don’t even want to think about.  It’s easy to overlook how much work it takes to keep an IT infrastructure from losing any ground from the day before.  At times, it can make you wonder how any progress is made on anything.

Enough complaining for now.  Let’s get back to it.  

 

Replication Frequency

For my testing, all of my replication is set to occur just once a day.  This is to keep it simple, and to help me understand what needs to be adjusted when my offsite replication is finally turned up at the remote site.

I’m not overly anxious to turn up the frequency even if the situation allows.  Some pretty strong opinions exist on how best to configure the frequency of the replicas.  Do a little bit with a high frequency, or a lot with a low frequency.  What I do know is this.  It is a terrible feeling to lose data, and one of the more overlooked ways to lose data is for bad data to overwrite your good data on the backups before you catch it in time to stop it.  Tapes, disk, simple file cloning, or fancy replication; the principle is the same, and so is the result.  Since the big variable is retention period, I want to see how much room I have to play with before I decide on frequency.  My purpose for offsite replication is disaster recovery.  …not to make a disaster bigger.

 

Replication Sizes

The million dollar question has always been how much changed data, as perceived from the SAN, will occur for a given period of time on typical production servers.  It is nearly impossible to know this until one is actually able to run real replication tests.  I certainly had no idea.  This would be a great feature for Dell/EqualLogic to add to their solution suite: have a way for a storage group to run in a simulated replication mode where it simply collects statistics that accurately reflect the amount of data that would be replicated during the test period.  What a great feature for those looking into SAN to SAN replication.

Below are my replication statistics for a 30 day period, where the replicas were created once per day, after the initial seed replica was created.

Average data per day per VM

  • 2 GB for general servers (service based)
  • 3 GB for servers with guest iSCSI attached volumes.
  • 5.2 GB for code compiling machines

Average data per day for guest iSCSI attached data volumes

  • 11.2 GB for Exchange DB and Transaction logs (for a 50GB database)
  • 200 MB for a SQL Server DB and Transaction logs
  • 2 GB for SharePoint DB and Transaction logs

The replica sizes for the VMs were surprisingly consistent.  Our code compiling machines had larger replica sizes, as their build processes temporarily write some data to the VMs’ disks.

The guest iSCSI attached data volumes naturally varied more from day-to-day activities.  Weekdays had larger amounts of replicated data than weekends.  This was expected.

Some servers, and how they generate data, may stick out like sore thumbs.  For instance, our source code control server uses a crude (but important) form of application layer backup.  The result is that for 75 GB worth of repositories, it would generate 100+ GB of changed data that it would want to replicate.  If the backup mechanism (which is a glorified file copy and package dump) is turned off, the amount of changed data drops to a very reasonable 200 MB per day.  This is a good example of how we will have to change our practices to accommodate replication.

 

Decreasing the amount of replicated data

Up to this point, the only step taken to reduce the amount of replicated data is the adjustment made in vCenter to move the VMs’ swap files onto another VMFS volume that will not be replicated.  That of course only affects the VMs’ VMkernel swap files – not the paging files controlled by the guest OS inside each VM.  I suspect that a healthy amount of the changed data on the VMs is their OS paging files.  The amount of changed data on those VMs looked suspiciously similar to the amount of RAM assigned to the VM, and there typically is some correlation between how much RAM an OS has to run with and the size of its page file.  This is pure speculation at this point, but certainly worth looking into.
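For reference, that adjustment can also be sketched out in PowerCLI.  This is a minimal example under a few assumptions: the cluster, host, and datastore names are hypothetical, and the parameter names are from memory, so verify them against your PowerCLI version before using anything like this.

# Have hosts decide swapfile placement instead of storing the .vswp with the VM,
# then point each host in the cluster at a datastore that is excluded from replication.
Set-Cluster -Cluster "Production" -VMSwapfilePolicy InHostDatastore -Confirm:$false
Get-Cluster "Production" | Get-VMHost |
  Set-VMHost -VMSwapfileDatastore (Get-Datastore "NonReplicated_Swap") -Confirm:$false
# Each VM picks up the new swapfile location the next time it is power cycled.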

The next logical step would be to figure out what could be done to reconfigure the VMs to place their paging/swap files in a different, non-replicated location.  Two issues come to mind when I think about this step. 

1.)  This adds an unknown amount of complexity (for deploying, and restoring) to the systems running.  You’d have to be confident in the behavior of each OS type when it comes to restoring from a replica where it expects to see a page file in a certain location, but does not.  You would also have to ask how scalable this approach is.  It might be okay for a few machines, but how about a few hundred?  I don’t know.

2.)  It is unknown as to how much of a payoff there will be.  If the amount of data per VM gets reduced by say, 80%, then that would be pretty good incentive.  If it’s more like 10%, then not so much.  It’s disappointing that there seems to be only marginal documentation on making such changes.  I will look to test this when I have some time, and report anything interesting that I find along the way.

 

The fires… unrelated, and related

One of the first problems to surface recently were issues with my 6224 switches.  These were the switches that I put in place of our 5424 switches to provide better expandability.  Well, something wasn’t configured correctly, because the retransmit ratio was high enough that SANHQ actually notified me of the issue.  I wasn’t about to overlook this, and reported it to the EqualLogic Support Team immediately.

I was able to get these numbers under control by reconfiguring the NICs on my ESX hosts to talk to the SAN with standard frames.  Not a long term fix, but for the sake of the stability of the network, the most prudent step for now.
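For context, dropping back to standard frames on an ESX 4 host is roughly a matter of setting the iSCSI vSwitch and its VMkernel ports back to a 1500 MTU.  A rough sketch from the service console follows; the vSwitch name, port group name, and addressing are hypothetical, the flags are from memory, and VMkernel ports generally have to be removed and recreated to change their MTU, so check the esxcfg-vmknic help on your build first.

esxcfg-vswitch -m 1500 vSwitch2
esxcfg-vmknic -d "iSCSI1"
esxcfg-vmknic -a -i 10.10.10.21 -n 255.255.255.0 -m 1500 "iSCSI1"

The switch ports and the array interfaces need to agree with whatever MTU the hosts end up using.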

After working with the 6224s, they do seem to behave noticeably differently than the 5424s.  They are more difficult to configure, and the suggested configurations in the Dell documentation were more convoluted and contradictory; multiple documents and deployment guides had inconsistent information.  Technical Support from Dell/EqualLogic has been great in helping me determine what the issue is.  Unfortunately, some of the potential fixes can be very difficult to execute.  Firmware updates on a stacked set of 6224s will result in the ENTIRE stack rebooting, so you have to shut down virtually everything if you want to update the firmware.  The ultimate fix for this would be a revamp of the deployment guides (or let’s try just one deployment guide) for the 6224s that nullifies any previous documentation.  By way of comparison, the 5424 switches were, and are, very easy to deploy.

The other issue that came up was some unexpected behavior regarding replication and its use of free pool space.  I don’t have any empirical evidence to tie the two together, but this is what I observed.

During this past month, in which an old physical storage server failed on me, there was a moment where I had to provision what was going to be a replacement for that box, as I wasn’t even sure the old physical server was going to be recoverable.  Unfortunately, I didn’t have a whole lot of free pool space on my array, so I had to trim things up a bit to squeeze it on there.  Once I did, I noticed all sorts of weird behavior.

1.  Since my replication jobs (with ASM/ME and ASM/VE) leverage free pool space for the temporary replica/snapshot created on the source array, this caused problems.  The biggest one was that my Exchange server would completely freeze during its ASM/ME snapshot process.  Perhaps I had this coming to me, because I deliberately configured it to use free pool space (as opposed to a replica reserve) for its replication.  How it behaved caught me off guard, and made it interesting enough that I never want to cut it close on free pool space again.

2.  ASM/VE replica jobs also seem to behave oddly with very little free pool space.  Again, this was self-inflicted because of my configuration settings.  It left me desiring a feature that would allow you to set a threshold, so that when free pool space drops below a certain amount, replication jobs simply would not run.  This goes for ASM/VE and ASM/ME alike (a minimal sketch of that idea appears a little further down).

Once I recovered the failed physical system, I was able to remove the VM I had set aside for the emergency turn-up.  That brought my free pool space back up over 1 TB, and all worked well from that point on.
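Until something like that shows up in the tools themselves, the same guard can be approximated with a small wrapper that checks free pool space before a job is kicked off.  This is only a sketch of the idea in Python; get_free_pool_space_gb() and run_replication_job() are hypothetical placeholders for however you collect that number and launch the job (a SANHQ export, a CLI scrape, or a manual check), not real ASM or EqualLogic APIs.

MIN_FREE_POOL_GB = 500  # threshold below which replication jobs should not run

def get_free_pool_space_gb():
    """Hypothetical placeholder: return current free pool space in GB."""
    raise NotImplementedError

def run_replication_job(job_name):
    """Hypothetical placeholder: kick off the ASM/ME or ASM/VE job."""
    raise NotImplementedError

def guarded_replication(job_name):
    free_gb = get_free_pool_space_gb()
    if free_gb < MIN_FREE_POOL_GB:
        print(f"Skipping '{job_name}': only {free_gb} GB of free pool space left "
              f"(threshold is {MIN_FREE_POOL_GB} GB)")
        return False
    run_replication_job(job_name)
    return True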

 

Timing

Lastly, one subject came up that doesn’t show up in any deployment guide I’ve seen: the timing of all this protection shouldn’t be overlooked.  One wouldn’t want to stack several replication jobs on top of each other that use the same free pool space but haven’t had time to finish replicating.  Other snapshot jobs, replicas, consistency checks, traditional backups, etc. should be well coordinated to keep overlap to a minimum.  If you are limited on resources, you may also be able to use timing to your advantage.  For instance, set your daily replica of your Exchange database to occur at 5:00am, and your daily snapshot to occur at 5:00pm.  That way, you have reduced your maximum loss period from 24 hours to 12 hours, just by offsetting the times.
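Worked out as a quick calculation, the payoff of offsetting jobs is easy to see.  The short Python sketch below uses a made-up schedule and simply finds the largest gap between protection points in a 24-hour day:

# Protection points expressed as hours of the day (5.0 = 5:00am, 17.0 = 5:00pm).
def max_loss_window_hours(points):
    pts = sorted(points)
    gaps = [b - a for a, b in zip(pts, pts[1:])]
    gaps.append(24 - pts[-1] + pts[0])  # wrap-around gap overnight
    return max(gaps)

print(max_loss_window_hours([5.0]))        # 24.0 hours with a single daily job
print(max_loss_window_hours([5.0, 17.0]))  # 12.0 hours once the jobs are offset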

Replication with an EqualLogic SAN; Part 1

 

Behind every great virtualized infrastructure is a great SAN to serve everything up.  I’ve had the opportunity to work with the Dell/EqualLogic iSCSI array for a while now, taking advantage of all of the benefits that an iSCSI-based SAN array offers.  One feature that I haven’t been able to use is the built-in replication feature.  Why?  I only had one array, and I didn’t have an offsite location to replicate to.

I suppose the real “part 1” of my replication project was selling the idea to the Management Team.  When it came to protecting our data and the systems that help generate that data, it didn’t take long for them to realize it wasn’t a matter of what we could afford, but how much we could afford to lose.  Having a building less than a mile away burn to the ground also helped the proposal.  On to the fun part; figuring out how to make all of this stuff work.

Of the many forms of replication out there, the most obvious one for me to start with is native SAN-to-SAN replication.  Why?  Well, it’s built right into the EqualLogic PS arrays, with no additional components to purchase and no license keys or fees to unlock features.  Other solutions exist, but it was best for me to start with the one I already had.

For companies with multiple sites, replication using EqualLogic arrays seems pretty straightforward.  For a company with nothing more than a single site, a few more steps need to occur before data replication can begin.

 

Decision:  Colocation, or hosting provider

One of the first decisions that had to be made was whether we wanted our data replicated to a colocation facility (CoLo) with equipment that we owned and controlled, or to a hosting provider that could provide native PS array space and replication abilities.  Most hosting providers charge by metering the amount of data replicated in one form or another.  Accurately estimating your replication costs assumes you have a really good understanding of how much data will be replicated, and unfortunately, that is difficult to know until you actually start replicating.  The pricing models of these hosting providers reminded me too much of a cab fare: never knowing what you are going to pay until you get the big bill at the end.  A CoLo with equipment that we owned fit our current and future objectives much better.  We wanted fixed costs, and the ability to eventually host some critical services at the CoLo (web, ftp, mail relay, etc.), so it was an easy decision for us.
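The cab-fare comparison is easier to see with numbers.  The Python sketch below compares a hypothetical metered hosting model against a fixed-cost CoLo; every figure in it is made up purely for illustration, since actual rates vary widely by provider.

# All figures hypothetical, purely to illustrate fixed vs. metered pricing.
colo_fixed_monthly = 1200.0        # rack space, power, cross-connect, etc.
hosted_base_monthly = 600.0        # provider's base fee for PS array space
hosted_per_gb_replicated = 0.25    # metered charge per GB of replicated data

for daily_change_gb in (50, 200, 500):
    monthly_gb = daily_change_gb * 30
    hosted = hosted_base_monthly + monthly_gb * hosted_per_gb_replicated
    print(f"{daily_change_gb:>3} GB/day of changed data: "
          f"hosted ~${hosted:,.0f}/month vs CoLo ${colo_fixed_monthly:,.0f}/month")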

Our decision was to go with a CoLo facility located in the Westin Building in downtown Seattle.  Home to the Seattle Internet Exchange (SIX), this is an impressive facility, not only in its physical infrastructure, but in how it provides peered interconnects directly from one ISP to another.  Our ISP uses this facility, so it worked out well to have our CoLo there as well.

 

Decision:  Bandwidth

Bandwidth requirements for our replication were, and still are, unknown, but I knew our bonded T1s probably weren’t going to be enough, so I started exploring options for higher-speed access.  The first thing to check was whether we qualified for Metro-E or “Ethernet over Copper” (award winner for the dumbest name ever).  Metro-E removes the element of T-carrier lines along with any proprietary signaling, and provides internet access or point-to-point connections at Layer 2 instead of Layer 3.  We were not close enough to the carrier’s central office to get adequate bandwidth, and even if we were, it probably wouldn’t scale up to our future needs.

Enter QMOE, or Qwest Metro Optical Ethernet.  This solution feeds Layer 2 Ethernet to our building via fiber, offering high bandwidth and low latency that can be scaled easily.

Our first foray with QMOE is a 30 Mbps point-to-point feed to our CoLo, uplinked to the Internet.  If we need more later, there is no need to add or change equipment.  Just have them turn up the dial, and bill you accordingly.
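A rough feel for what 30 Mbps buys can be had with a back-of-the-envelope calculation of how long a day’s worth of changed data would take to push across the circuit.  The Python sketch below uses made-up daily change amounts and assumes roughly 80% effective throughput to account for protocol overhead; it is an estimate only, not a measurement of what the link will actually deliver.

LINK_MBPS = 30.0
EFFICIENCY = 0.8  # assume ~80% effective throughput after protocol overhead

def hours_to_replicate(changed_gb, link_mbps=LINK_MBPS, efficiency=EFFICIENCY):
    bits = changed_gb * 8 * 1000**3                      # GB -> bits
    seconds = bits / (link_mbps * 1_000_000 * efficiency)
    return seconds / 3600

for gb in (11.2, 50, 100):   # made-up daily change amounts
    print(f"{gb:>6.1f} GB/day -> roughly {hours_to_replicate(gb):.1f} hours on the link")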

 

Decision:  Topology

Topology planning has been interesting, to say the least.  The best decision here depends on the use case, and let’s not forget, what’s left in the budget.

Two options immediately presented themselves.

1.  Replication data from our internal SAN would be routed (Layer 3) to the SAN at the CoLo.

2.  Replication data  from our internal SAN would travel by way of a VLAN to the SAN at the CoLo.

If my need was only to send replication data to the CoLo, I could take advantage of that Layer 2 connection and send replication data directly to the CoLo without it being routed.  This would mean that it would bypass any routers/firewalls in place, and would have to run to the CoLo on its own VLAN.

The QMOE network is built on Cisco equipment, so in order to utilize any VLANing from the CoLo to the primary facility, you must have Cisco switches that support their VLAN Trunking Protocol (VTP).  I don’t have the proper equipment for that right now.

In my case, here is a very simplified illustration as to how the two topologies would look:

Routed Topology

[image: routed topology diagram]

 

Topology using VLANs

[image: topology diagram using VLANs]

Routing the traffic may introduce more overhead and reduce effective throughput.  This is where a WAN optimization solution could come into play.  These solutions (SilverPeak, Riverbed, etc.) appear to be extremely good at improving effective throughput across many types of WAN connections.  They must, of course, sit at the correct spot in the path to the destination.  The units are often priced on bandwidth speed, and while they are very effective, they are also quite an investment.  They also work at Layer 3, and must sit between the source and a router at both ends of the communication path; something that wouldn’t exist on a Metro-E circuit where VLANing was used to transmit the replicated data.

The result is that, for right now, I have chosen to go with a routed arrangement with no WAN optimization.  This does not differ much from a traditional WAN circuit, other than that my latencies should be much better.  The next step, if our needs are not sufficiently met, would be to invest in a couple of Cisco switches, then send replication data over its own VLAN to the CoLo, similar to the illustration above.

 

The equipment

My original SAN array is an EqualLogic PS5000e connected to a couple of Dell PowerConnect 5424 switches.  My new equipment closely mirrors this, but is slightly better: an EqualLogic PS6000e and two PowerConnect 6224 switches.  Since both items will scale a bit better, I’ve decided to swap the existing array and switches out for the new equipment.

 

Some Lessons learned so far

If you are changing ISPs, and your old ISP has authoritative control of your DNS zone files, make sure your new ISP has the zone file EXACTLY the way you need it.  Then confirm it one more time.  Spelling errors and omissions in DNS zone files don’t work out very well, especially when you factor in the time it takes for the corrections to propagate across the net (usually up to 72 hours, but it can feel like a lifetime when your customers can’t get to your website).
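One way to take some of the nerves out of a zone handoff like that is to spot-check the records yourself the moment the new ISP publishes the zone.  Below is a minimal Python sketch that resolves a handful of hostnames and compares the results against what you expect; the names and addresses are placeholders, and it only checks simple A-record lookups, not MX records or anything fancier.

import socket

# Hypothetical hostnames and the addresses the new zone is expected to return.
expected = {
    "www.example.com":  "203.0.113.10",
    "mail.example.com": "203.0.113.25",
    "ftp.example.com":  "203.0.113.11",
}

for host, want in expected.items():
    try:
        got = socket.gethostbyname(host)
    except socket.gaierror as err:
        print(f"{host}: lookup FAILED ({err})")
        continue
    status = "OK" if got == want else f"MISMATCH (expected {want})"
    print(f"{host}: {got} {status}")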

If you are going to go with a QMOE or Metro-E circuit, be mindful that you might have to force the external interface on your outermost equipment (in our case the firewall/router, but it could be a managed switch as well) to negotiate at 100 Mbps full duplex.  Auto-negotiation apparently doesn’t work too well on many Metro-E implementations, and can cause fragmentation that will reduce your effective throughput by quite a bit.  This is exactly what we saw.  Fortunately it was an easy fix.

 

Stay tuned for what’s next…

Using OneNote in IT

 

It’s hard to believe that, as an IT administrator, one of my favorite applications is also one of the least technical.  Microsoft created an absolutely stellar application when they created OneNote.  If you haven’t used it, you should.

Most IT administrators have high expectations of themselves.  Somehow we expect to remember pretty much everything: deployment planning, research, application-specific installation steps and issues, information gathering for troubleshooting, and documenting as-built installations.  You might have information that you work with every day and think, “how could I ever forget that?” (you will), along with that obscure but required setting on your old phone system that hasn’t been looked at in years.

The problem is that nobody can remember everything. 

After years of using my share of spiral binders, backs of printouts, and Post-It notes to gather and manage systems and technologies, I’ve realized a few things.  1.)  I can’t read my own writing.  2.)  I never wrote enough down for the information to be valuable.  3.)  What I couldn’t fit on one physical page, I squeezed onto another page that made no sense at all.  4.)  The more I had to do, the more I tried (and failed) to figure out a way to file it.  5.)  These notes eventually became meaningless, even though I knew I kept them for a reason.  I just couldn’t remember why.

Do you want to make a huge change in how you work?   Read on.

OneNote was first adopted by our Sales team several years ago, and while I knew what it was, I never bothered to use it for real IT projects until late in 2007, when a colleague of mine (thanks, Glenn, if you are reading) suggested that it was working well for him and his IT needs.  Ever since then, I’ve wondered how I ever worked without it.

If you aren’t familiar with OneNote, there isn’t too much to understand.  It’s an electronic Notebook. 

[image: OneNote window showing notebooks, section tabs, and pages]

It’s arranged just as you’d expect a real notebook to be.  The left side represents notebooks, the tabs along the top represent sections or earmarks, and the right side represents the pages in a notebook.  It’s that easy.  Just like its physical counterpart, its free-form formatting allows you to place objects anywhere on a page (goodbye, MS Word).

What has transpired since my experiment with OneNote is a realization of how well it tackles every single need I have for gathering information and mining that data after the fact.  Here are some examples.

Long term projects and Research

What better time to try out a new way of working than on one of the biggest projects I’ve had to tackle in years, right?  Virtualizing my infrastructure was a huge undertaking, and I had what seemed like an infinite amount of information to learn in a very short period of time, across all sorts of subject matter.  In a notebook called “Virtualization” I had sections that narrowed things down to topics like ESX, SAN array, blades, switchgear, UPS, etc.  Each of those sections had pages (at least a few dozen for the ESX section, as there was a lot to tackle) covering specific topics I needed to learn about or keep for reference: links, screen captures, etc.  I dumped everything in there, including my deployment steps before, during, and after.

 

Procedures

Our Linux code compiling machines have very specific package installations and settings that need to be set before deployment.  OneNote works great for this.  The no-brainer checkboxes offer nice clarity.

[image: OneNote checklist for configuring a Linux build machine]

If you maintain different flavors of Unix or various distributions of Linux, you know how much the syntax can vary.  OneNote helps keep your sanity.  With so many Windows products going the way of PowerShell, you’d better have your command-line syntax down for that too.

This has also worked well with back-end installations.  My installations of VMware, SharePoint, Exchange, etc. have all been documented this way.  It takes just a bit longer, but is invaluable later on.  Below is a capture of part of my cutover plan from Exchange 2003 to Exchange 2007.

[image: excerpt of an Exchange 2003 to Exchange 2007 cutover plan in OneNote]

Migrations and Post migration outstanding issues

After big migrations, you have to be on your toes to address issues that are difficult to predict.  OneNote has allowed me to use a simple ISSUE/FIX approach.  So, in an “Apps” notebook, under an “E2007 Migration” section, I might have a page called “Postfix” and it might look something like this.

[image: example of an ISSUE/FIX page in OneNote]

You can label these pages “Outstanding issues” or as I did for my ESX 3.5 to vSphere migration, “Postfix” pages.

[image: “Postfix” pages from an ESX 3.5 to vSphere migration]

As-builts

Those in the Engineering/Architectural world are quite familiar with as-built drawings; those are drawings that reflect how things were really built.  Many times in IT, deployment plans and documentation never go further than the day you deploy.  OneNote allows for an easy way to turn that deployment plan into a living copy, or as-built configuration, of the product you just deployed.  Configurations are as dynamic as the technologies that power them.  It’s best to know what sort of monster you created, and how to recreate it if you need to.

 

Daily issues (fire fighting)

Emergencies, impediments, fires, or whatever you’d like to call them, come up all the time.  I’ve found OneNote to be most helpful in two specific areas for this type of task.  I use it as a quick way to gather data on an issue that I can look at later (copying and pasting screenshots and URLs into OneNote), and for comparing the current state of a system against past configurations.  Both help me solve problems more quickly.

Searching text in bitmapped screen captures

One of the really interesting things about OneNote is that you can paste a screen capture of, say, a dialog box into a notebook, and when searching later for a keyword, it will include those bitmaps in the search results!  Below is one of the search results OneNote pulled up when I searched for “KDC”.  This was a screen capture sitting in OneNote.  Neat.

[image: OneNote search result matching text inside a pasted screen capture]

 

Goodbye Browser Bookmarks

How many times have you tried to organize your web browser bookmarks or favorites, only to never look at them again, or to wonder later why you bookmarked something?  It’s an exercise in futility.  No more!  Toss them all away.  Paste those links into the appropriate locations in OneNote (wherever the subject matter is applicable), add a brief description on top, and you can always find them later by searching.

 

Summary

I won’t ever go without using OneNote for projects large or small again.  It is right next to my email as my most used application.  OneNote users tend to be a loyal bunch, and after a few years of using it, I can see why.  At about $80 retail, you can’t go wrong.  And, lucky for you, it will be included in all versions of Office 2010.

Additional Links

New features coming in OneNote 2010
http://blogs.msdn.com/descapa/archive/2009/07/15/overview-of-onenote-2010-what-s-new-for-you.aspx

Using OneNote with SharePoint
http://blogs.msdn.com/mcsnoiwb/archive/2008/12/03/onenote-and-sharepoint-the-basics.aspx 

Interesting tips and tricks with OneNote
http://blogs.msdn.com/onenotetips/

Virtualization. Making it happen

 

It’s difficult to put into words how exciting, and how overwhelming, the idea of moving to a virtualized infrastructure was for me.  In 12 months, I went from investigating solutions, to presenting our options to senior management, on to the procurement process, followed by the design and implementation of the systems, and finally, the transition of our physical machines to a virtualized environment.

It has been an incredible amount of work, but equally as satisfying.  The pressure to produce results was even bigger than the investment itself.  From this particular project, I took away a few lessons, some of which had nothing to do with virtualization.  Rather than provide endless technical details in this post, I thought I’d share what I learned that has nothing to do with vSwitches or CPU utilization.

1.  The sell.  I never would have been able to achieve what I did without the support of our Management Team.  I’m an IT guy, and do not have a gift for crafty PowerPoint slides or fluid presentation skills.  But there was one slide that hit it out of the park for me.  It showed how much this crazy idea was going to cost, but more importantly, how that compared against what we were going to spend anyway under a traditional environment.  We had delayed server refreshes for a few years, and it was catching up to us.  Without even factoring in the projected growth of the company, the two lines intersected in less than one year.  I’m sure the dozen other slides helped support my proposal, but this one offered the clarity needed to get approval.
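For anyone building a similar slide, the math behind it is simple enough to sketch out.  The numbers below are purely hypothetical; the point is just to show how a larger up-front virtualization spend crosses under the steady drip of deferred server refreshes within the first year.

# Purely hypothetical figures to illustrate the break-even comparison.
virtualization_upfront = 150_000.0   # blades, SAN, licensing
virtualization_monthly = 1_500.0     # support, power, incidental costs
traditional_monthly = 16_000.0       # deferred server refreshes catching up

for month in range(1, 25):
    virtualized = virtualization_upfront + virtualization_monthly * month
    traditional = traditional_monthly * month
    if traditional >= virtualized:
        print(f"Cumulative spend crosses over at month {month}: "
              f"traditional ${traditional:,.0f} vs virtualized ${virtualized:,.0f}")
        break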

2.  Let go.  I tend to be self-reliant, and have made a habit of leaning on my own skills to get things done.  At a smaller company, you get used to that.  Time simply didn’t allow for that approach on this project.  I needed help, and fast.  I felt very fortunate to establish a great working relationship with Mosaic Technologies.  They provided resources that gave me the knowledge I needed to make good purchasing decisions, then assisted with the high-level design.  I had access to a few of the most knowledgeable folks in the industry to help me move forward on the project, minimizing the floundering on my part.  They also helped me sort out what could be done versus real-world recommendations on deployment practices.  It didn’t excuse me from the learning that needed to occur, or from making it happen, but it sped up the process and helped apply a virtualization solution to our environment correctly.  There is no way I would have been able to do it in the required time frame without them.

3.  Ditch the notebook.  Consider the way you assemble what you’re learning.  I’ve never needed to gather as much information on a project as I did on this one.  I hated not knowing what I didn’t know (take that, Yogi Berra).  I was poring through books, white papers, and blogs to give myself a crash course on a number of different subjects, all at the same time, because they needed to work together.  Because of the enormity of the project, I decided from the outset that I needed to try something different.  This was the first project where I abandoned scratchpads, binders, highlighters (mostly), and printouts.  I documented ALL of my information in Microsoft OneNote.  This was a huge success, which I will describe more in another post.

4.  Tune into RSS feeds.  Virtualization is a great example of a topic that many smart people dedicate their entire focus to, and then are kind enough to post that information on their blogs.  Having feeds come right to your browser is the most efficient way to keep up on the content.  Every day I’d see my listing of feeds for the few dozen or so VMware-related blogs I was keeping track of.  It was uncanny how timely, and how applicable, some of the information posted was.  Not every bit of information could be unconditionally trusted, but hey, it’s the Internet.

5.  Understand the architecture.  Looking back, I spent an inordinate amount of time in the design phase.  Much of it was spent trying to fully understand what was being recommended to me by my resources at Mosaic, as well as other material, and how that compared to other environments.  At times grass grew faster than I was moving on the project (exacerbated by other projects getting in the way), but I don’t regret my stubbornness in wanting to understand what I was trying to absorb before moving forward.  We now have a scalable, robust system that avoids some of the common mistakes I see come up on user forums.

6.  Don’t be a renegade.  Learn from those who really know what they are doing, and choose proven technologies, while recognizing trends in the fast-moving virtualization industry.  For me there was a higher up-front cost to this approach, but time didn’t allow for any experimentation.  It helped me settle on VMware ESX powered by Dell blades, running on a Dell/EqualLogic iSCSI SAN.  That is not a suggestion that a different or lesser configuration will not work, but for me, it helped expedite my deployment.

7.  Just because you are a small shop doesn’t mean you don’t have to think big.  Much of my design consideration surrounded planning for the future: how the system could scale and change, and how to minimize the headaches that come with those changes.  I wanted my VLANs arranged logically, and address boundaries configured in a way that would make sense for growth.  For a company of about 50 employees and 120 systems, I had never had to deal with this very much.  Thanks to another good friend of mine, whom I’d been corresponding with on a project a few months prior, I was able to get things started on the right foot.  I’ll tell you more about this in a later post.

The results of the project have exceeded my expectations.  It’s working even better than I anticipated, and has already proven its value when I had a hardware failure occur.  We’ve migrated over 20 of our production systems to the new environment, and will have about 20 more online within about 6 months.  There is a tremendous amount of work yet to be completed, but the benefits are paying for themselves already.

It’s all about the name

Every once in a while you run into a way of doing things that makes you wonder why you ever did it any other way.  For me, that was using DNS aliasing to reference all servers and the services that they provide.  I use aliases whenever possible.

Many years ago I had a catastrophic server failure.  Looking back, it was a fascinating series of events that you would think could never happen, but it did.  This server happened to be the primary storage server for our development team, and was a staple of our development system.  Its full server name was hardcoded in mount points and symbolic links on other *nix systems, as well as in drive mappings from Windows machines connecting to it via Samba.  Its name was buried in countless scripts owned by the Development and QA teams.  Once the new hardware came in, provisioning a new server was relatively easy.  Getting everything functioning again, because of all those broken references, was not.  Other factors prevented me from using the old approach of naming the new server the same as the old one, so I knew there had to be a better way.  There was: using DNS aliases (CNAME records) on your internal DNS servers to decouple the server name itself from the service it provides.  This practice helps you design your server infrastructure for change.

Good candidates for aliasing are:

  • NTP/time servers (automated for domain joined machines, but not for non-joined machines, *nix systems, and network devices)
  • Email servers (primary email servers, as well as mail relay servers)
  • Source code control servers
  • Document management, wikis, or collaboration servers
  • Critical workstations/servers that perform source code compiling and/or validation testing.
  • Network devices and OOB management cards.  I can’t remember what the FQDNs of my switches are.  Can you?
  • Log servers.
  • File Servers and their respective share names or NFS exports (ex. \\infostore\sales & infostore:/exports/sales respectively)

The practice is particularly interesting on file servers.  If you start out with one file server that contains shares for your applications, your files, and your user home directories, you could have share names that reference aliases, all for the very same server.

  • \\appserv\applications
  • \\fileserv\operations
  • \\userserv\joesmith

Now, when you need to move user home directories over to a new server, or bring up a new server to take on that role, you just move the data, turn up the share name, and change the alias.
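After a cutover like that, it’s worth verifying that every alias really does land on the server you intend.  A small Python sketch like the one below does the job; the alias and server names are placeholders for whatever your internal zone actually uses.

import socket

# Hypothetical aliases and the canonical servers they should currently point at.
aliases = {
    "appserv.mycompany.lan":  "srv-files01.mycompany.lan",
    "fileserv.mycompany.lan": "srv-files01.mycompany.lan",
    "userserv.mycompany.lan": "srv-files02.mycompany.lan",
}

for alias, expected_host in aliases.items():
    try:
        canonical, _, addrs = socket.gethostbyname_ex(alias)
    except socket.gaierror as err:
        print(f"{alias}: lookup FAILED ({err})")
        continue
    if canonical.lower() == expected_host.lower():
        status = "OK"
    else:
        status = f"resolves to {canonical} (expected {expected_host})"
    print(f"{alias} -> {canonical} [{', '.join(addrs)}] {status}")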

Now of course, there are some things that aliasing can’t be used on, or doesn’t work well on.

  • DNS clients that need to refer to DNS servers require IP addresses, and can’t use aliases
  • Some Windows services that use complex authentication methods.
  • Services relying on SSL certificates that expect to see the real name, not the alias (ex. Exchange URL references).
  • Windows Server 2003 and earlier do not support aliases for file shares out of the box; they will support only \\realservername\sharename by default.  You will need to add a registry key to disable strict name checking (a sketch of that change follows below this list).  More info can be found here:  http://support.microsoft.com/kb/281308
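For reference, the change the KB article describes is a single DWORD value under the LanmanServer parameters key.  The Python sketch below sets it with the standard winreg module; the value name is my recollection of what the article specifies, so verify it against the KB (and restart the Server service) before relying on it.

import winreg  # Windows only; run from an elevated prompt

KEY_PATH = r"SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters"

# DWORD value described in KB 281308 (verify against the article before applying).
with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0,
                    winreg.KEY_SET_VALUE) as key:
    winreg.SetValueEx(key, "DisableStrictNameChecking", 0, winreg.REG_DWORD, 1)

print("DisableStrictNameChecking set to 1; restart the Server service to apply.")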

Most recently, I made the transition from Exchange 2003 to Exchange 2007.  Usually a project like that has pages of carefully planned-out steps for the cutover: what needs to be changed, and when.  What I didn’t have to worry about this time was all of my internal hosts that reference the mail server by its DNS alias, mailserver.mycompany.lan.  Just one easy step to change the CNAME reference from the old server name to the new server name, and that was it.  The same thing occurred when I transitioned to new Domain Controllers a few months ago; these serve as the internal time servers for all internal systems and devices.

What’s most surprising is that this practice is not followed in IT environments as often as you’d think.  There might be an occasional alias here and there, but not a calculated effort to ease transitions to new servers and reduce downtime.  Whether you are doing planned server transitions or recovering from a server failure, this is a practice that is guaranteed to help in almost any situation.