Replication with an EqualLogic SAN; Part 4

 

If you had asked me 6+ weeks ago how far along my replication project would be on this date, I would have thought I’d be basking in the glory of success, and admiring my accomplishments.

…I should have known better.

Nothing like several IT emergencies unrelated to this project to turn one's itinerary into garbage.  A failed server (an old physical storage server that I don't have room on my SAN for), a tape backup autoloader that tanked, some Exchange Server and Domain Controller problems, and a host of other odd things that I don't even want to think about.  It's easy to overlook how much work it takes just to keep an IT infrastructure from losing ground from the day before.  At times, it can make you wonder how any progress is made on anything.

Enough complaining for now.  Let's get back to it.

 

Replication Frequency

For my testing, all of my replication is set to occur just once a day.  This is to keep it simple, and to help me understand what needs to be adjusted when my offsite replication is finally turned up at the remote site.

I'm not overly eager to turn up the frequency even if the situation allows.  Some pretty strong opinions exist on how best to configure the frequency of the replicas: do a little bit at a high frequency, or a lot at a low frequency.  What I do know is this.  It is a terrible feeling to lose data, and one of the more overlooked ways to lose it is for bad data to overwrite your good data on the backups before you catch it in time to stop it.  Tapes, disk, simple file cloning, or fancy replication; the principle is the same, and so is the result.  Since the big variable is retention period, I want to see how much room I have to play with before I decide on frequency.  My purpose for offsite replication is disaster recovery.  …not to make a disaster bigger.
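
To get a feel for that tradeoff before committing to a schedule, a rough back-of-the-envelope calculation helps.  Below is a minimal sketch in Python; the replica reserve size is an assumption, the daily change figure is borrowed from the Exchange volume statistics in the next section, and the model deliberately ignores EqualLogic's internal overhead and assumes changes are spread evenly through the day.

```python
# Rough sketch: how many replicas fit in a reserve at a given frequency?
# Reserve size is assumed; 11.2 GB/day comes from the Exchange volume stats below.
reserve_gb = 200          # assumed replica reserve for this volume
daily_change_gb = 11.2    # observed average daily change

for replicas_per_day in (1, 4, 12):
    change_per_replica_gb = daily_change_gb / replicas_per_day
    replicas_retained = int(reserve_gb / change_per_replica_gb)
    retention_days = replicas_retained / replicas_per_day
    print(f"{replicas_per_day:>2}/day -> ~{replicas_retained} replicas retained, "
          f"~{retention_days:.1f} days of history")
```

Under those assumptions the retention window barely moves as frequency goes up; a higher frequency buys finer recovery points, not a longer reach back in time, which is exactly why the retention question comes first for me.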

 

Replication Sizes

The million dollar question has always been how much changed data, as perceived by the SAN, will occur over a given period of time on typical production servers.  It is nearly impossible to know this until one is actually able to run real replication tests.  I certainly had no idea.  This would be a great feature for Dell/EqualLogic to add to their solution suite: a way for a storage group to run a simulated replication that simply collects statistics accurately reflecting the amount of data that would be replicated during the test period.  What a great feature for those looking into SAN to SAN replication.

Below are my replication statistics for a 30 day period, where the replicas were created once per day, after the initial seed replica was created.

Average data per day per VM

  • 2 GB for general servers (service based)
  • 3 GB for servers with guest iSCSI attached volumes.
  • 5.2 GB for code compiling machines

Average data per day for guest iSCSI attached data volumes

  • 11.2 GB for Exchange DB and Transaction logs (for a 50GB database)
  • 200 MB for a SQL Server DB and Transaction logs
  • 2 GB for SharePoint DB and Transaction logs

The replica sizes for the VMs were surprisingly consistent.  Our code compiling machines had larger replica sizes, as they write some data temporarily to the VMs during their build processes.

The guest iSCSI attached data volumes naturally varied more from day-to-day activities.  Weekdays had larger amounts of replicated data than weekends.  This was expected.
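
To put those averages in perspective, here is a quick back-of-the-envelope total.  The VM counts in the sketch are assumptions for illustration only; the per-item averages are the ones listed above.

```python
# Back-of-the-envelope: what the per-day averages add up to.
# VM counts are assumptions for illustration; per-item averages are from above.
daily_gb = {
    "general servers (10 VMs x 2 GB)":   10 * 2.0,
    "guest iSCSI VMs (5 VMs x 3 GB)":     5 * 3.0,
    "code compilers (4 VMs x 5.2 GB)":    4 * 5.2,
    "Exchange DB + logs volume":              11.2,
    "SQL DB + logs volume":                    0.2,
    "SharePoint DB + logs volume":             2.0,
}

total = sum(daily_gb.values())
print(f"Estimated daily replication: ~{total:.1f} GB")
print(f"Estimated 30-day total:      ~{total * 30 / 1024:.1f} TB")
```

With numbers like these, the daily delta lands somewhere around 70 GB, which is the figure that matters when sizing the offsite link and the replica reserves.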

Some servers, and the way they generate data, may stick out like sore thumbs.  For instance, our source code control server uses a crude (but important) form of application-layer backup.  The result is that for 75 GB worth of repositories, it would generate 100+ GB of changed data that it would want to replicate.  If the backup mechanism (which is a glorified file copy and package dump) is turned off, the amount of changed data drops to a very reasonable 200 MB per day.  This is a good example of how we will have to change our practices to accommodate replication.

 

Decreasing the amount of replicated data

Up to this point, the only step taken to reduce the amount of replicated data is the adjustment made in vCenter to move the VMs' swap files off onto another VMFS volume that will not be replicated.  That of course only affects the VMs' swap files – not the paging files inside the guest OS.  I suspect that a healthy amount of the changed data on the VMs is their guest OS paging files.  The amount of changed data on those VMs looked suspiciously similar to the amount of RAM assigned to each VM, and there typically is some correlation between how much RAM an OS has to run with and the size of its page file.  This is pure speculation at this point, but certainly worth looking into.
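
Since this is speculation, it is at least easy to sanity-check with the numbers in hand.  The sketch below compares each VM's average daily replica size against its assigned RAM; the sample values are hypothetical and only show the kind of comparison I have in mind.

```python
# Hypothetical sanity check: does daily changed data track assigned RAM?
# VM names, RAM sizes, and change figures below are made-up placeholders.
vms = [
    # (name, assigned RAM in GB, avg daily replica size in GB)
    ("web01",   2, 1.9),
    ("app01",   4, 3.1),
    ("build01", 8, 5.2),
]

for name, ram_gb, change_gb in vms:
    ratio = change_gb / ram_gb
    print(f"{name}: {change_gb} GB changed/day vs {ram_gb} GB RAM "
          f"(ratio {ratio:.2f})")

# Ratios consistently near 1.0 would support the guest page file theory;
# ratios all over the map would point somewhere else.
```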

The next logical step would be to figure out what could be done to reconfigure the VMs to place their paging/swap files in a different, non-replicated location.  Two issues come to mind when I think about this step.

1.)  This adds an unknown amount of complexity (for deploying and restoring) to the systems involved.  You'd have to be confident in the behavior of each OS type when it comes to restoring from a replica where it expects to see a page file in a certain location, but does not.  You'd also have to ask how scalable this approach is.  It might be okay for a few machines, but how about a few hundred?  I don't know.

2.)  It is also unknown how much of a payoff there would be.  If the amount of data per VM gets reduced by, say, 80%, then that would be a pretty good incentive.  If it's more like 10%, then not so much.  It's disappointing that there seems to be only marginal documentation on making such changes.  I will look to test this when I have some time, and report anything interesting that I find along the way.
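
The payoff question at least lends itself to simple arithmetic.  The sketch below runs a couple of what-if reductions against an assumed fleet-wide daily total; every number in it is a placeholder, not a measurement.

```python
# What-if: how much does relocating guest page files actually save?
# The daily total and reduction percentages are assumptions for illustration.
vm_daily_total_gb = 55.8   # assumed combined daily change for all VMs

for reduction in (0.10, 0.50, 0.80):
    saved = vm_daily_total_gb * reduction
    remaining = vm_daily_total_gb - saved
    print(f"{int(reduction*100)}% reduction: save ~{saved:.1f} GB/day, "
          f"replicate ~{remaining:.1f} GB/day")
```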

 

The fires… unrelated, and related

One of the first problems to surface recently was a set of issues with my 6224 switches.  These were the switches that I put in place of our 5424 switches to provide better expandability.  Well, something wasn't configured correctly, because the retransmit ratio was high enough that SANHQ actually notified me of the issue.  I wasn't about to overlook this, and reported it to the EqualLogic Support Team immediately.

I was able to get these numbers under control by reconfiguring the NICs on my ESX hosts to talk to the SAN with standard frames.  Not a long-term fix, but for the sake of the stability of the network, the most prudent step for now.
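
For reference, the retransmit figure itself is just simple arithmetic on counters.  The sketch below shows the calculation with placeholder counter values and an arbitrary alert level; it is an illustration of the math, not a reproduction of how SANHQ computes or thresholds it internally.

```python
# Simple retransmit-ratio check; counter values and threshold are assumptions.
tcp_segments_sent = 48_000_000
tcp_retransmits   = 480_000

ratio_pct = 100.0 * tcp_retransmits / tcp_segments_sent
threshold_pct = 0.5    # placeholder alert level

print(f"Retransmit ratio: {ratio_pct:.2f}%")
if ratio_pct > threshold_pct:
    print("Above threshold - look at switch config, flow control, and frame size.")
```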

After working with the 6224s, they do seem to behave noticeably differently than the 5424s.  They are more difficult to configure, and the suggested configurations in the Dell documentation seemed convoluted and contradictory; multiple documents and deployment guides had inconsistent information.  Technical Support from Dell/EqualLogic has been great in helping me determine what the issue is.  Unfortunately some of the potential fixes can be very difficult to execute.  Firmware updates on a stacked set of 6224s will result in the ENTIRE stack rebooting, so you have to shut down virtually everything if you want to update the firmware.  The ultimate fix for this would be a revamp of the deployment guides (or better yet, just one deployment guide) for the 6224s that nullifies any previous documentation.  By way of comparison, the 5424 switches were, and are, very easy to deploy.

The other issue that came up was some unexpected behavior regarding replication and its use of free pool space.  I don't have any empirical evidence to tie the two together, but this is what I observed.

During this past month in which I had an old physical storage server fail on me, there was a moment where I had to provision what was going to be a replacement for that box, as I wasn't even sure if the old physical server was going to be recoverable.  Unfortunately, I didn't have a whole lot of free pool space on my array, so I had to trim things up a bit to squeeze it on there.  Once I did, I noticed all sorts of weird behavior.

1.  Since my replication jobs (with ASM/ME and ASM/VE) leverage free pool space for the temporary replica/snapshot created on the source array, this caused problems.  The biggest one was that my Exchange server would completely freeze during its ASM/ME snapshot process.  Perhaps I had this coming to me, because I deliberately configured it to use free pool space (as opposed to a replica reserve) for its replication.  How it behaved caught me off guard, and made it interesting enough for me to never want to cut it close on free pool space again.

2.  ASM/VE replica jobs also seem to behave oddly with very little free pool space.  Again, this was self-inflicted because of my configuration settings.  It left me desiring a feature that would let you set a threshold so that if free pool space drops below a certain amount, replication jobs simply will not run.  This goes for ASM/VE and ASM/ME, and a rough sketch of the idea follows below.
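
That threshold idea is easy to approximate outside of ASM with a pre-flight check wrapped around whatever kicks off the job.  The sketch below is entirely hypothetical: get_free_pool_space_gb() is a stand-in for however you would actually pull the number (SNMP, a group manager CLI scrape, etc.), and the threshold is arbitrary.

```python
# Hypothetical pre-flight guard: skip the replication job if free pool
# space in the group is below a threshold. Nothing here is an ASM feature.
import sys

FREE_POOL_THRESHOLD_GB = 500   # arbitrary threshold for illustration

def get_free_pool_space_gb() -> float:
    # Placeholder - replace with a real query against the PS group
    # (SNMP, CLI scrape, or whatever reporting you have available).
    return 412.0

def main() -> int:
    free_gb = get_free_pool_space_gb()
    if free_gb < FREE_POOL_THRESHOLD_GB:
        print(f"Only {free_gb:.0f} GB free in the pool; skipping replication job.")
        return 1
    print("Free pool space OK - kick off the scheduled replication job here.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```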

Once I recovered that failed physical system, I was able to remove the VM I had set aside for emergency turn-up.  That brought my free pool space back up over 1 TB, and all worked well from that point on.

 

Timing

Lastly, one subject came up that doesn't show up in any deployment guide I've seen.  The timing of all this protection shouldn't be overlooked.  One wouldn't want to stack several replication jobs on top of each other that use the same free pool space but haven't had time to finish replicating.  Other snapshot jobs, replicas, consistency checks, traditional backups, etc., should be well coordinated to keep overlap to a minimum.  If you are limited on resources, you may also be able to use timing to your advantage.  For instance, set your daily replica of your Exchange database to occur at 5:00am, and your daily snapshot to occur at 5:00pm.  That way, you have reduced your maximum loss period from 24 hours to 12 hours, just by offsetting the times.
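
The 24-hour versus 12-hour figure generalizes: with recovery points spread through the day, the maximum loss window is simply the largest gap between consecutive points.  A small sketch using the 5:00am replica / 5:00pm snapshot example from above:

```python
# Maximum loss window = largest gap between consecutive recovery points.
recovery_points_hours = [5, 17]   # 5:00am replica, 5:00pm snapshot

points = sorted(recovery_points_hours)
gaps = [b - a for a, b in zip(points, points[1:])]
gaps.append(24 - points[-1] + points[0])   # wrap around midnight

print(f"Maximum loss window: {max(gaps)} hours")   # -> 12 hours
```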

Replication with an EqualLogic SAN; Part 1

 

Behind every great virtualized infrastructure is a great SAN to serve everything up.  I’ve had the opportunity to work with the Dell/EqualLogic iSCSI array for a while now, taking advantage of all of the benefits that the iSCSI based SAN array offers.  One feature that I haven’t been able to use is the built in replication feature.  Why?  I only had one array, and I didn’t have a location offsite to replicate to.

I suppose the real “part 1” of my replication project was selling the idea to the Management Team.  When it came to protecting our data and the systems that help generate that data, it didn’t take long for them to realize it wasn’t a matter of what we could afford, but how much we could afford to lose.  Having a building less than a mile away burn to the ground also helped the proposal.  On to the fun part; figuring out how to make all of this stuff work.

Of the many forms of replication out there, the most obvious one for me to start with is native SAN to SAN replication.  Why?  Well, it’s built right into the EqualLogic PS arrays, with no additional components to purchase, or license keys or fees to unlock features.  Other solutions exist, but it was best for me to start with the one I already had.

For companies with multiple sites, replication using EqualLogic arrays seems pretty straightforward.  For a company with nothing more than a single site, a few more steps need to occur before replication can even begin.

 

Decision:  Colocation, or hosting provider

One of the first decisions that had to be made was whether we wanted our data replicated to a colocation facility (CoLo) with equipment that we owned and controlled, or to a hosting provider that could provide native PS array space and replication abilities.  Most hosting providers charge by metering the amount of data replicated, in one form or another.  Accurately estimating your replication costs that way assumes you have a really good understanding of how much data will be replicated, and unfortunately, that is difficult to know until you start replicating.  The pricing models of these hosting providers reminded me too much of a cab fare: you never know what you are going to pay until you get the big bill at the end.  A CoLo with equipment that we owned fit our current and future objectives much better.  We wanted fixed costs, and the ability to eventually host some critical services at the CoLo (web, ftp, mail relay, etc.), so it was an easy decision for us.
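
The "cab fare" problem is easy to illustrate with a quick comparison of a metered model against a fixed monthly CoLo cost.  Every price and volume in the sketch below is a made-up placeholder, not a quote from any provider; the point is only how quickly the metered line moves once you don't actually know your replication volume.

```python
# Illustration only: metered hosting vs. fixed CoLo cost.
# All prices and volumes are made-up placeholders.
per_gb_rate = 0.50        # assumed $/GB replicated at a hosting provider
fixed_colo_monthly = 900  # assumed rack space + circuit at a CoLo

for monthly_gb in (500, 1500, 3000):
    metered = monthly_gb * per_gb_rate
    print(f"{monthly_gb:>5} GB/month: metered ~${metered:,.0f} "
          f"vs fixed ~${fixed_colo_monthly:,.0f}")
```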

Our decision was to go with a CoLo facility located in the Westin Building in downtown Seattle.  Home of the Seattle Internet Exchange (SIX), this is an impressive facility not only in its physical infrastructure, but in how it provides peered interconnects directly from one ISP to another.  Our ISP uses this facility, so it worked out well to have our CoLo there as well.

 

Decision:  Bandwidth

Bandwidth requirements for our replication were, and still are, unknown, but I knew our bonded T1s probably weren't going to be enough, so I started exploring other options for higher speed access.  The first thing to check was whether we qualified for Metro-E or "Ethernet over Copper" (award winner for the dumbest name ever).  Metro-E removes the element of T-carrier lines along with any proprietary signaling, and provides Internet access or point-to-point connections at Layer 2 instead of Layer 3.  We were not close enough to the carrier's central office to get adequate bandwidth, and even if we were, it probably wouldn't scale up to our future needs.

Enter QMOE, or Qwest Metro Optical Ethernet.  This solution feeds Layer 2 Ethernet to our building via fiber, offering high bandwidth and low latency that can be scaled easily.

Our first foray with QMOE is a 30 Mbps point-to-point feed to our CoLo, uplinked to the Internet.  If we need more later, there is no need to add or change equipment; they just turn up the dial and bill us accordingly.
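
For a sense of scale, here is the daily-delta arithmetic applied to the circuits involved.  The 70 GB/day figure is a rounded assumption based on the replication statistics in Part 4, and the bonded T1 and overhead factors are assumptions as well.

```python
# Hours to move an assumed daily delta across different circuits.
daily_delta_gb = 70            # rounded assumption based on the Part 4 statistics
circuits_mbps = {
    "2x bonded T1": 3.0,       # assumed pair of T1s
    "QMOE 30 Mbps": 30.0,
}

for name, mbps in circuits_mbps.items():
    usable = mbps * 0.8        # assume ~80% usable after protocol overhead
    hours = (daily_delta_gb * 8 * 1024) / (usable * 3600)
    print(f"{name:>12}: ~{hours:.1f} hours per day of replication traffic")
```

Under those assumptions, the T1s would need more hours than a day contains, while the 30 Mbps feed clears the daily delta in well under a work shift; that gap is what drove the circuit decision.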

 

Decision:  Topology

Topology planning has been interesting to say the least.  The best decision here depends on the use case, and let's not forget, what's left in the budget.

Two options immediately presented themselves.

1.  Replication data from our internal SAN would be routed (Layer 3) to the SAN at the CoLo.

2.  Replication data  from our internal SAN would travel by way of a VLAN to the SAN at the CoLo.

If my need was only to send replication data to the CoLo, I could take advantage of that Layer 2 connection and send the replication data directly, without it being routed.  This would mean bypassing any routers/firewalls in place, and running the traffic to the CoLo on its own VLAN.

The QMOE network is built on Cisco equipment, so in order to use any VLANing from the CoLo to the primary facility, you must have Cisco switches that support their VLAN Trunking Protocol (VTP).  I don't have the proper equipment for that right now.

In my case, here is a very simplified illustration as to how the two topologies would look:

Routed Topology

[Diagram: routed topology]

 

Topology using VLANs

[Diagram: topology using VLANs]

Routing the traffic may introduce more overhead and reduce effective throughput.  This is where a WAN optimization solution could come into play.  These solutions (Silver Peak, Riverbed, etc.) appear to be extremely good at improving effective throughput across many types of WAN connections, but they must sit at the correct spot in the path to the destination.  The units are often priced on bandwidth speed, and while they are very effective, they are also quite an investment.  They work at Layer 3, and must sit between the source and a router at both ends of the communication path; something that wouldn't exist on a Metro-E circuit where VLANing was used to transmit replicated data.

The result is that for right now, I have chosen to go with a routed arrangement with no WAN optimization.  This does not differ too much from a traditional WAN circuit, other than that my latencies should be much better.  The next step, if our needs are not sufficiently met, would be to invest in a couple of Cisco switches, then send replication data over its own VLAN to the CoLo, similar to the illustration above.

 

The equipment

My original SAN array is an EqualLogic PS5000e connected to a couple of Dell PowerConnect 5424 switches.  My new equipment closely mirrors this, but is slightly better: an EqualLogic PS6000e and two PowerConnect 6224 switches.  Since both items will scale a bit better, I've decided to swap out the existing array and switches for the new equipment.

 

Some Lessons learned so far

If you are changing ISPs, and your old ISP has authoritative control of your DNS zone files, make sure your new ISP has the zone file EXACTLY the way you need it.  Then confirm it one more time.  Spelling errors and omissions in DNS zone files don't work out very well, especially when you factor in the time it takes for the corrections to propagate through the net.  (Usually up to 72 hours, but it can feel like a lifetime when your customers can't get to your website.)
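
One way to take the guesswork out of that confirmation is to query both the old and new authoritative servers and diff the answers before cutting over.  The sketch below assumes the dnspython package (2.x, for resolver.resolve()); the nameserver addresses and record names are placeholders to swap out for the records that actually matter in your zone.

```python
# Compare a handful of records between the old and new authoritative servers
# before cutting over. Requires dnspython 2.x (pip install dnspython).
# Server IPs and record names below are placeholders.
import dns.resolver

OLD_NS = "203.0.113.10"      # placeholder: old ISP's nameserver
NEW_NS = "198.51.100.20"     # placeholder: new ISP's nameserver

RECORDS = [
    ("www.example.com",  "A"),
    ("example.com",      "MX"),
    ("mail.example.com", "A"),
]

def lookup(nameserver, name, rtype):
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [nameserver]
    try:
        answer = resolver.resolve(name, rtype)
        return sorted(r.to_text() for r in answer)
    except Exception as exc:
        return [f"error: {exc}"]

for name, rtype in RECORDS:
    old, new = lookup(OLD_NS, name, rtype), lookup(NEW_NS, name, rtype)
    status = "OK" if old == new else "MISMATCH"
    print(f"{status:8} {name} {rtype}: old={old} new={new}")
```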

If you are going to go with a QMOE or Metro-E circuit, be mindful that you might have to force the external interface on your outermost equipment (in our case the firewall/router, but it could be a managed switch as well) to negotiate at 100 Mbps full duplex.  Auto-negotiation apparently doesn't work too well on many Metro-E implementations, and can cause fragmentation that will reduce your effective throughput by quite a bit.  This is exactly what we saw.  Fortunately it was an easy fix.

 

Stay tuned for what’s next…

Using OneNote in IT

 

It’s hard to believe that as an IT administrator, one of my favorite applications I use is one of the least technical.  Microsoft created an absolutely stellar application when they created OneNote.  If you haven’t used it, you should.

Most IT Administrators have high expectations of themselves.  Somehow we expect to remember pretty much everything.  Deployment planning, research, application specific installation steps and issues.  Information gathering for troubleshooting, and documenting as-built installations.  You might have information that you work with every day, and think “how could I ever forget that?” (you will), along with that obscure, required setting on your old phone system that hasn’t been looked at in years.

The problem is that nobody can remember everything. 

After years of using my share of spiral binders, backs of print outs, and Post-It notes to gather and manage systems and technologies, I’ve realized a few things.  1.)  I can’t read my own writing.  2.)  I never wrote enough down for the information to be valuable.  3.)  What I can’t fit on one physical page, I squeeze in on another page that makes no sense at all.  4.)  The more I have to do, the more I tried (and failed) to figure out a way to file it.  5.)  These notes eventually became meaningless, even though I knew I kept them for a reason.  I just couldn’t remember why.

Do you want to make a huge change in how you work?   Read on.

OneNote was first adopted by our Sales team several years ago, and while I knew what it was, I never bothered to use it for real IT projects until late in 2007, when a colleague of mine (thanks Glenn if you are reading) suggested that it was working well for him and his IT needs.  Ever since then, I wonder how I ever worked without it.

If you aren’t familiar with OneNote, there isn’t too much to understand.  It’s an electronic Notebook. 

[Screenshot: OneNote notebook layout]

It's arranged just as you'd expect of a real notebook.  The left side represents notebooks, the tabs along the top represent sections or earmarks, and the right side represents the pages in a notebook.  It's that easy.  Just like its physical counterpart, its free-form formatting allows you to place objects anywhere on a page (goodbye MS Word).

What has transpired since my experiment with OneNote began is how well it tackles every single need I have for gathering information and mining that data after the fact.  Here are some examples.

Long term projects and Research

What better time to try out a new way of working than on one of the biggest projects I've had to tackle in years, right?  Virtualizing my infrastructure was a huge undertaking, and I had what seemed like an infinite amount of information to learn in a very short period of time, across all different types of subject matter.  In a notebook called “Virtualization” I had sections that narrowed subject matter down to things like ESX, SAN array, blades, switchgear, UPS, etc.  Each one of those sections had pages (at least a few dozen for the ESX section, as there was a lot to tackle) covering specific topics I needed to learn about, or to keep for reference.  Links, screen captures, etc.  I dumped everything in there, including my deployment steps before, during, and after.

 

Procedures

Our Linux code compiling machines have very specific package installations and settings that need to be set before deployment.  OneNote works great for this.  The no-brainer checkboxes offer nice clarity.

[Screenshot: OneNote checklist for a Linux build machine]

If you maintain different flavors of Unix or various distributions of Linux, you know how much the syntax can vary.  OneNote helps keep your sanity.  With so many Windows products going the way of PowerShell, you'd better have your command line syntax down for that too.

This has also worked well with backend installations.  My installations of VMware, SharePoint, Exchange, etc. have all been documented this way.  It takes just a bit longer, but is invaluable later on.  Below is a capture of part of my cutover plan from Exchange 2003 to Exchange 2007.

[Screenshot: part of the Exchange 2003 to 2007 cutover plan]

Migrations and Post migration outstanding issues

After big migrations, you have to be on your toes to address issues that are difficult to predict.  OneNote has allowed me to use a simple ISSUE/FIX approach.  So, in an “Apps” notebook, under an “E2007 Migration” section, I might have a page called “Postfix” and it might look something like this.

[Screenshot: an ISSUE/FIX page from the E2007 Migration section]

You can label these pages “Outstanding issues” or as I did for my ESX 3.5 to vSphere migration, “Postfix” pages.

[Screenshot: “Postfix” pages from the ESX 3.5 to vSphere migration]

As-builts

Those in the Engineering/Architectural world are quite familiar with as-built drawings; those are drawings that reflect how things were really built.  Many times in IT, deployment plans and documentation never go further than the day you deploy.  OneNote allows for an easy way to turn that deployment plan into a living copy, or as-built configuration, of the product you just deployed.  Configurations are as dynamic as the technologies that power them.  It's best to know what sort of monster you created, and how to recreate it if you need to.

 

Daily issues (fire fighting)

Emergencies, impediments, fires, or whatever you'd like to call them, come up all the time.  I've found OneNote to be most helpful in two specific areas on this type of task.  I use it as a quick way to gather data on an issue that I can look at later (copying and pasting screenshots and URLs into OneNote), and for comparing the current state of a system against past configurations.  Both ways help me solve problems more quickly.

Searching text in bitmapped screen captures

One of the really interesting things about OneNote is that you can paste a screen capture of, say, a dialog box into a notebook, and when searching later for a keyword, it will include those bitmaps in the search results!  Below is one of the search results OneNote pulled up when I searched for “KDC”.  This was a screen capture sitting in OneNote.  Neat.

[Screenshot: OneNote search result matching “KDC” inside a pasted image]

 

Goodbye Browser Bookmarks

How many hours have you spent trying to organize your web browser bookmarks or favorites, only to never look at them again, or to wonder why you bookmarked something in the first place?  It's an exercise in futility.  No more!  Toss them all away.  Paste those links into the appropriate locations in OneNote (wherever the subject matter applies), add a brief description on top, and you can always find them later by searching.

 

Summary

I won’t ever go without using OneNote for projects large or small again.  It is right next to my email as my most used application.  OneNote users tend to be a loyal bunch, and after a few years of using it, I can see why.  At about $80 retail, you can’t go wrong.  And, lucky for you, it will be included in all versions of Office 2010.

Additional Links

New features coming in OneNote 2010
http://blogs.msdn.com/descapa/archive/2009/07/15/overview-of-onenote-2010-what-s-new-for-you.aspx

Using OneNote with SharePoint
http://blogs.msdn.com/mcsnoiwb/archive/2008/12/03/onenote-and-sharepoint-the-basics.aspx 

Interesting tips and tricks with OneNote
http://blogs.msdn.com/onenotetips/

Virtualization. Making it happen

 

It’s difficult to put into words how exciting, and how overwhelming the idea of moving to a virtualized infrastructure was for me.  In 12 months, I went from investigating solutions, to presenting our options to our senior management, onto the procurement process, followed by the design and implementation of the systems.  And finally, making the transition of our physical machines to a virtualized environment.

It has been an incredible amount of work, but equally as satisfying.  The pressure to produce results was even bigger than the investment itself.  With this particular project, I've taken away a few lessons I learned along the way, some of which had nothing to do with virtualization.  Rather than providing endless technical details in this post, I thought I'd share what I learned that has nothing to do with vSwitches or CPU utilization.

1.  The sell.  I never would have been able to achieve what I achieved without the support of our Management Team.  I'm an IT guy, and do not have a gift for crafty PowerPoint slides and fluid presentation skills.  But there was one slide that hit it out of the park for me.  It showed how much this crazy idea was going to cost, but more importantly, how it compared against what we were going to spend anyway under a traditional environment.  We had delayed server refreshes for a few years, and it was catching up to us.  Without even factoring in the projected growth of the company, the two lines intersected in less than one year.  I'm sure the dozen other slides helped support my proposal, but this one offered the clarity needed to get approval.

2.  Let go.  I tend to be self-reliant, and have made a habit of leaning on my own skills to get things done.  At a smaller company, you get used to that.  Time simply didn't allow for that approach on this project.  I needed help, and fast.  I felt very fortunate to establish a great working relationship with Mosaic Technologies.  They provided resources that gave me the knowledge I needed to make good purchasing decisions, then assisted with the high-level design.  I had access to a few of the most knowledgeable folks in the industry to help me move forward on the project, minimizing the floundering on my part.  They also helped me sort out what could be done versus real-world recommendations on deployment practices.  It didn't excuse me from the learning that needed to occur, and making it happen, but rather helped speed up the process, and apply a virtualization solution to our environment correctly.  There is no way I would have been able to do it in the time frame required without them.

3.  Ditch the notebook.  Consider the way you assemble what you're learning.  I've never needed to gather as much information on a project as this.  I hated not knowing what I didn't know.  (take that, Yogi Berra)  I was poring through books, white papers, and blogs to give myself a crash course on a number of different subjects – all at the same time, because they needed to work together.  Because of the enormity of the project, I decided from the outset that I needed to try something different.  This was the first project where I abandoned scratchpads and binders, highlighters (mostly), and printouts.  I documented ALL of my information in Microsoft OneNote.  This was a huge success, which I will describe more in another post.

4.  Tune into RSS feeds.  Virtualization was a great example of a topic that many smart people dedicate their entire focus towards, then are kind enough to post this information on their blogs.  Having feeds come right to your browser is the most efficient way to keep up on the content.  Every day I’d see my listing of feeds for a few dozen or so VMware related blogs I was keeping track of.  It was uncanny how timely, and how applicable some of the information posted was.  Not every bit of information could be unconditionally trusted, but hey, it’s the Internet.

5.  Understand the architecture.  Looking back, I spent an inordinate amount of time in the design phase.  Much of this was trying to fully understand what was being recommended to me by my resources at Mosaic, as well as other material, and how that compared to other environments.  At times, grass grew faster than I was moving on the project (exacerbated by other projects getting in the way), but I don't regret my stubbornness to understand what I was trying to absorb before moving forward.  We now have a scalable, robust system that helps avoid some of the common mistakes I see occur on user forums.

6.  Don’t be a renegade.  Learn from those who really know what they are doing, and choose proven technologies, while recognizing trends in the fast-moving virtualization industry.  For me there was a higher up front cost to this approach, but time didn’t allow for any experimentation.  It helped me settle on VMware ESX powered by Dell blades, running on a Dell/EqualLogic iSCSI SAN.  That is not a suggestion that a different, or lesser configuration will not work, but for me, it helped expedite my deployment.

7.  Just because you are a small shop doesn't mean you don't have to think big.  Much of my design consideration surrounded planning for the future: how the system could scale and change, and how to minimize the headaches that come with those changes.  I wanted my VLANs arranged logically, and address boundaries configured in a way that would make sense for growth.  For a company of about 50 employees/120 systems, I never had to deal with this very much.  Thanks to another good friend of mine whom I'd been corresponding with on a project a few months prior, I was able to get things started on the right foot.  I'll tell you more about this in a later post.

The results of the project have exceeded my expectations.  It's working even better than I anticipated, and has already proven its value when I had a hardware failure occur.  We've migrated over 20 of our production systems to the new environment, and will have about 20 more online within about 6 months.  There is a tremendous amount of work yet to be completed, but the benefits are paying for themselves already.

It’s all about the name

Every once in a while you run into a way of doing things that makes you wonder why you ever did it any other way.  For me, that was using DNS aliasing to reference servers and the services they provide.  I use aliases whenever possible.

Many years ago I had a catastrophic server failure.  Looking back, it was a fascinating series of events that you would think could never happen, but it did.  This server happened to be the primary storage server for our development team, and was a staple of our development system.  Its full server name was hardcoded into mount points and symbolic links on other *nix systems, as well as drive mappings from Windows machines connecting to it via Samba.  Its name was buried in countless scripts owned by the Development and QA teams.  Once the new hardware came in, provisioning a new server was relatively easy.  Getting everything functioning again because of those broken links was not.  Other factors prevented me from using the old approach, which was naming the new server the same name as the old server.  So I knew there had to be a better way.  There was: using DNS aliases (CNAME records) on your internal DNS servers to decouple the server name itself from the service it provides.  This practice helps you design your server infrastructure for change.

Good candidates for aliasing are:

  • NTP/time servers (automated for domain joined machines, but not for non-joined machines, *nix systems, and network devices)
  • Email servers (primary email servers, as well as mail relay servers)
  • Source code control servers
  • Document management, wikis, or collaboration servers
  • Critical workstations/servers that perform source code compiling and/or validation testing.
  • Network devices and OOB management cards.  I can't remember what the FQDNs of my switches are.  Can you?
  • Log servers.
  • File Servers and their respective share names or NFS exports (ex. \\infostore\sales & infostore:/exports/sales respectively)

The practice is particularly interesting on file servers.  If you start out with one file server that contains shares for your applications, your files, and your user home directories, you could have share names that reference aliases, all for the very same server:

  • \\appserv\applications
  • \\fileserv\operations
  • \\userserv\joesmith

Now, when you need to move user home directories over to a new server, or bring up a new server to perform that new role, just move the data, turn up the share name, and change the alias.
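
When you do flip an alias to a new server, it's handy to verify that the CNAME actually lands where you expect before telling anyone.  A minimal sketch using only the Python standard library; the alias and target names are placeholders patterned after the examples in this post.

```python
# Verify that internal aliases resolve to the servers you expect.
# Alias and expected target names are placeholders for illustration.
import socket

EXPECTED = {
    "fileserv.mycompany.lan":   "newfs01.mycompany.lan",
    "mailserver.mycompany.lan": "newexch01.mycompany.lan",
}

for alias, expected_host in EXPECTED.items():
    try:
        # gethostbyname_ex returns (canonical name, alias list, address list)
        canonical, _, addresses = socket.gethostbyname_ex(alias)
    except socket.gaierror as exc:
        print(f"{alias}: lookup failed ({exc})")
        continue
    status = "OK" if canonical.lower() == expected_host.lower() else "CHECK"
    print(f"{status:5} {alias} -> {canonical} {addresses}")
```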

Now of course, there are some things that aliasing can't be used for, or doesn't work well with.

  • DNS clients that need to refer to DNS servers require IP addresses, and can’t use aliases
  • Some Windows services that may use complex authentication methods.
  • Services that are relying on SSL certificates that are expecting to see the real name, not the alias.  (ex.  Exchange URL references)
  • Windows Server 2003 and earlier do not support aliases out of the box; they will serve only \\realservername\sharename by default.  You will need to add a registry value to disable strict name checking (a rough sketch of the change follows below).  More info found here:  http://support.microsoft.com/kb/281308
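
For reference, the value that KB article describes can also be set programmatically.  The sketch below uses Python's winreg module on the file server itself; treat it as an illustration of the change rather than a supported procedure, run it elevated, and expect to restart the Server service (or reboot) before it takes effect.

```python
# Illustration: set DisableStrictNameChecking (per KB281308) on Windows
# Server 2003 and earlier so CNAME aliases work against file shares.
# Run elevated on the file server; restart the Server service afterward.
import winreg

KEY_PATH = r"SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters"

with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0,
                        winreg.KEY_SET_VALUE) as key:
    winreg.SetValueEx(key, "DisableStrictNameChecking", 0, winreg.REG_DWORD, 1)

print("DisableStrictNameChecking set to 1")
```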

Most recently, I made the transition from Exchange 2003 to Exchange 2007.  Usually a project like that has pages of carefully planned steps for the cutover; what needs to be changed, and when.  What I didn't have to worry about this time was all of the internal hosts that reference the mail server by its DNS alias, mailserver.mycompany.lan.  Just one easy step to change the CNAME reference from the old server name to the new server name, and that was it.  The same thing occurred when I transitioned to new Domain Controllers a few months ago; these serve as the internal time servers for all internal systems and devices.

What's most surprising is that this practice is not used in IT environments as often as you'd think.  There might be an occasional alias here and there, but not a calculated effort to ease transitions to new servers and reduce downtime.  Whether you are doing planned server transitions or recovering from a server failure, this is a practice that is guaranteed to help in almost any situation.