Sustained power outages in the datacenter

Ask any child about a power outage, and you can tell it is a pretty exciting thing. Flashlights. Candles. The whole bit. Seen through the eyes of adulthood, that excitement is an inexplicable reaction to an inconvenient, if not frustrating, event. When you are responsible for a datacenter of any size, there is no joy that comes from a power outage. Depending on the facility the infrastructure lives in, and the tools put in place to address the issue, it can be a minor inconvenience, or a real mess.

Planning for failure is one of the primary tenets of IT. It touches operational decisions as much as it does design. Mitigation steps follow in the wake of a failure event, and define if or when further action is needed to become fully operational again. Some events require a series of well-defined actions (automated, manual, or somewhere in between) in order to ensure a predictable result. Classic DR scenarios generally come to mind most often, but shoring up how to react to certain events should also include sustained power outages. Good content on the matter is sparse at best, so I will share a few bits of information I have learned over the years.

The Challenges
One of the limitations with a physical design of redundancy when it comes to facility power is, well, the facility. It is likely served by a single utility district, and the customer simply doesn't have options to bring in other power. The building also may have limited or no backup power. Generators may be sized just large enough to keep the elevators and a few lights running, but that is about it. Many cannot, or do not, provide power conditioned well enough to be worthy of running expensive equipment. The option to feed PDUs using different circuits from the power closet might also be limited.

Defining the intent of your UPS units is an often overlooked consideration. Are they sized just to provide enough time for a simple graceful shutdown? …And how long is that? Or are they sized to meet some SLA decided upon by management and budget line owners? Those are good questions, but inevitably, if the power is out for long enough, you have to deal with how a graceful shutdown will be orchestrated.

SMBs fall into a particularly risky category, as they often have a set of disparate, small UPS units supplying battery-backed power, with no unified management system to orchestrate what should happen in an "on battery" event. It is not uncommon to see an SMB well down the road of virtualization, yet their UPS units do not have the smarts to act on information from the systems they are powering. Picking the winning number on a roulette wheel might give better odds than figuring out which is going to go first, and which is going to go last.

Not all power outages are a simple power versus no power issue. A few years back our building lost one leg of the three-phase power coming in from the electric vault under the nearby street. This caused a voltage "back feed" on one of the legs, which cut nominal voltage severely. This dirty power/brown-out scenario was one of the worst I've seen. It lasted for 7 very long hours in the middle of the night. While the primary infrastructure was able to be safely shut down, workstations and other devices were toggling off and on throughout. Several pieces of equipment were ruined, but many others ended up worse off than we did.

It’s all about the little mistakes
"Sometimes I lie awake at night, and I ask, ‘Where have I gone wrong?’  Then a voice says to me, ‘This is going to take more than one night" –Charlie Brown, Peanuts [Charles Schulz]

A sequence of little mistakes in an otherwise good plan can kill you. This transcends IT. I was a rock climber for many years, and a single tragic mistake was almost always the result of a series of smaller ones. It often stemmed from poor assumptions, bad planning, trivialized variables, or unacknowledged known unknowns. Don't let yourself be the IT equivalent of the climber who cratered into the ground.

One of the biggest potential risks is a running VM not fully committing I/Os from its own queues or anywhere in the data path (all the way down to the array controllers) before the batteries fully deplete. When the VMs are properly shut down before the batteries deplete, you can be assured that all data has been committed, and that the integrity of your systems and data remains intact.

So where does one begin? Properly dealing with a sustained outage starts with recognizing that it is a sequence-driven event.

1. Determine what needs to stay on the longest. Often it is not about how long a VM or system stays up on battery, but that it is gracefully shut off before a hard power failure. Your UPS units buy you a finite amount of time. It takes more than "hope" to make your systems go down gracefully, and in the correct order.

2. Determine your hardware dependency chain. Work through the most logical order of shutdown for your physical equipment, and identify the last pieces of physical equipment that need to stay on. (Your answer had better be switches.)

3. Determine your software dependency chain. Many systems can be shut down at any time, but many others rely on other services to support their needs. Map it out. Also recognize that hardware can be affected by the lack of availability of software based services (e.g. DNS, SMTP, AD, etc.).

4. Determine what equipment might need a graceful shutdown, and what can simply drop when the UPS units run dry. Check with each manufacturer for the answers.

Once you begin to make progress on better understanding the above, then you can look into how you can make it happen.
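To make the sequencing concrete, below is a minimal PowerCLI sketch of what a tiered, graceful VM shutdown might look like. All of the VM names, tiers, and the vCenter address are hypothetical placeholders; your own dependency chains from steps 1 through 4 determine the real order, and the script itself would typically be kicked off by the UPS management piece discussed later.

```powershell
# A minimal PowerCLI sketch of a tiered guest shutdown. All names are
# hypothetical; the tiers come from the dependency chains worked out above.
# Assumes PowerCLI is installed and this runs from a machine that stays up.
Import-Module VMware.PowerCLI
Connect-VIServer -Server "vcenter.example.local"

$tiers = @(
    @("app-01", "app-02", "file-01"),   # Tier 1: general workloads first
    @("sql-01", "exchange-01")          # Tier 2: databases and mail next
    # vCenter, DNS/AD, and the VM running this script are deliberately not
    # listed here - they go down last, typically by connecting straight to
    # an ESXi host after this loop completes.
)

foreach ($tier in $tiers) {
    # Request a guest OS shutdown (via VMware Tools) for each VM in the tier
    Get-VM -Name $tier | Where-Object { $_.PowerState -eq "PoweredOn" } |
        Stop-VMGuest -Confirm:$false

    # Wait for the whole tier to power off before moving to the next tier
    while (Get-VM -Name $tier | Where-Object { $_.PowerState -eq "PoweredOn" }) {
        Start-Sleep -Seconds 15
    }
}
```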

Making a retrospective work for you
It's not uncommon, once the sustained power failure has ended, to simply be grateful that everything came back up without issue. As a result, valuable information on how to improve the process in the future is left on the table. Seize the moment! Take notes during the event so the details can be recalled accurately later; after all, the retrospective's purpose is to define what went well and what didn't, and stressful situations can play tricks on memory. Perhaps you couldn't identify power cables easily, or wondered why your Exchange server took so long to shut down, or didn't know if or when vCenter shut down gracefully. In the "dirty power" story above, the UPS power did not last as long as I had anticipated because the server room's dedicated AC unit shut down. The room heated up, all of the variable speed fans kicked into high gear, and the batteries drained faster than I expected. Lesson learned.

The planning process is served well by mocking up a power failure event on paper. Remember, thinking about it is free, and is a nice way to kick off the planning. Clearly, the biggest challenge around developing power down and power up scenarios is that they have to be tested at some point. How do you test this? Very carefully. In fact, if you have any concerns at all, save it for a lab. Then introduce it into production in such a way that you can tightly control or limit the shutdown event to just a few test machines. The only scenario I can imagine on par with a sustained power outage is kicking off a domino-effect workflow that shuts down your entire datacenter.

The run book
Having a plan that lives only in your head will accomplish only two things. It will be a guaranteed failure. It will put your organization's systems and data at risk. This is why there is a need to define and publish a sustained power outage run book. Sometimes known as a "play chart" in the sports world, it is intended to define a reaction to an event under a given set of circumstances. The purpose is to 1.) vet out the process beforehand, and 2.) avoid "heat of the moment" decisions made under great stress that turn out to be the wrong ones.

The run book also serves as a good planning tool for determining if you have the tools or methods available to orchestrate a graceful, orderly shutdown of VMs and equipment based on the data provided by the UPS units. The run book is not just about graceful power down scenarios, but also the steps required for a successful power-up. Sometimes this part is better understood, since the occasional lights-out maintenance window for storage or firmware updates, hardware replacement, etc. forces you through it. Power-up planning also includes making sure you have some basic services available for the infrastructure as it powers up. For example, see "Using a Synology NAS as an emergency backup DNS server for vSphere" for a few tips on a simple way to serve up DNS to your infrastructure.

And don’t forget to make sure the run book is still accessible when you need it most (when there is no power). 🙂

Tools and tips
I've stayed away from discussing specific scripts or tools for this because each environment is different, and may have different tools available to it. For instance, I use Emerson-Liebert UPS units, and have a controlling VM that orchestrates many of the automated shutdown steps of VMs. PowerCLI, Python, or bash can be a complementary, or even critical, part of a shutdown process. It is up to you. The key is to have some entity that can interpret how much battery runtime remains, and trigger event-driven actions from that information.
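As a rough, vendor-neutral illustration, the trigger can be as simple as a loop that polls the remaining battery runtime and hands off to the shutdown sequence once it crosses a threshold. `Get-UpsRuntimeMinutes` and the script path below are hypothetical placeholders; substitute however your UPS management software or network card actually exposes that value.

```powershell
# Hypothetical sketch: poll remaining UPS runtime and trigger the orderly shutdown.
# Get-UpsRuntimeMinutes is a stand-in for your UPS vendor's query mechanism
# (SNMP OID, management card API, agent CLI) - it is not a real cmdlet.
$thresholdMinutes = 20      # begin shutting down with 20 minutes of battery left
$pollSeconds      = 60

while ($true) {
    $remaining = Get-UpsRuntimeMinutes      # placeholder for the UPS query

    if ($remaining -le $thresholdMinutes) {
        Write-Warning "Battery runtime down to $remaining minutes; starting graceful shutdown."
        # Hand off to the ordered shutdown sequence defined in the run book
        & "C:\Scripts\Invoke-OrderedShutdown.ps1"
        break
    }
    Start-Sleep -Seconds $pollSeconds
}
```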

1. Remember that graceful shutdowns can create a bit of their own CPU and storage I/O storm. It is not as significant as a boot storm on power-up, and it generally only shows up at the beginning of the shutdown process when all systems are still running, but it can be noticeable.

2. Ask your coworkers or industry colleagues for feedback. Learn about what they have in place, and share some stories about what went wrong, and what went right. It’s good for the soul, and your job security.

3. Focus more on the correct steps, sequence, and procedure, before thinking about automating it. You can’t automate something when you do not clearly understand the workflow.

4. Determine how you are going to make this effort a priority, and important to key stakeholders. Take it to your boss, or management. Yes, you heard me right. It won't ever be addressed until it is given visibility, and identified as a risk. It is not about potential self-incrimination. It is about improving the plan of action around these types of events. Help them understand the implications of not handling them correctly.

It is a very strange experience to be in a server room that is whisper quiet during a sustained power outage. There is an opportunity to make it a much less stressful experience with a little planning and preparation. Good luck!

– Pete

Practical tips for a Veeam Backup & Replication deployment

I’ve been using Veeam Backup & Replication in my production environment for a while now, and in hindsight, it was one of the best investments we’ve ever made in our IT infrastructure. It has completely changed the operational overhead of protecting our VMs, and the data they serve up. Using a data protection solution built on VMware’s APIs provides the simplicity and flexibility I always wanted. Moving away from array based features for protection has allowed the protection of VMs to reflect desired RPO and RTO requirements – rather than the limitations imposed by LUN sizes, array capacity, or functionality.

While Veeam is extremely simple in many respects, it is also a versatile, feature packed application that can be configured in a variety of different ways. The versatility and the features can be a little confusing to the new user, so I wanted to share 25 tips that will help make for a quick and successful deployment of Veeam Backup & Replication in your environment.

First, let’s go over a few assumptions that will be the basis for my recommendations:

  • There are two sites that need protection.
  • VMs and data need to be protected at each site, locally.
  • VMs and data need to be protected at each site, remotely.
  • A NAS target exists at each site.
  • Quick deployment is important.
  • You’ve already read all of the documentation. 😉

    Architecture
    There are a number of different ways to set up the architecture for Veeam. I will show a few of the simplest arrangements:

    In the arrangement below there would be no physical servers – only a NAS device. This is a simplified version of what I use. If one wanted a repurposed server (Windows or Linux) acting purely as a storage target, it could take the place of the NAS. The architecture would stay the same.

    image

    Optionally, a physical server not just acting as a storage target, but also as a physical proxy would look something like this below:

    image

    Below is a combination of both, where a physical server is acting as the Proxy, but like the virtual proxy, is using an SMB share to house the data. In this case, a NAS unit.

     

    image

    Implementation tips
    These tips focus not so much on what may ultimately suit your environment best (only you know that) or on leveraging all of the features inside the product, but rather on getting you up and running as quickly as possible so you can start returning great results.

    Job Manager Servers & Proxies

    1.  Have the Job Manager server, any proxies, and the backup targets living on their own VLAN for a dedicated backup network.

    2.  Set up SNMP monitoring on any physical ports used in the backup arrangement.  It will be helpful to understand how utilized the physical links get, and for how long.

    3.  Make sure to give the Job Manager VM enough resources to play with – especially if it will have any data mover/proxy responsibilities.  The deployment documentation has good information on this, but for starters, make it 4vCPU with 5GB of RAM.

    4.  If there is more than one cluster to protect, consider building a virtual proxy inside each cluster it will be responsible for protecting, then assign it to the jobs that protect VMs in that cluster.  In my case, I use PernixData FVP in two clusters.  I have the datastores that house those VMs accessible only by their own cluster (a constraint of FVP).  Because of that, I have a virtual proxy living in each cluster, with backup jobs configured to use a specific virtual proxy.  These virtual proxies have a special setting in FVP that will instruct the VMs being backed up to flush their write cache to the backing storage.

    image

    Storage and Design

    5.  Keep the design simple, even if you know you will need to adjust at a later time.  Architectural adjustments are easy to do with Veeam, so  go ahead and get Veeam pointed to the target, and start running some jobs.  Use this time to get familiar with the product, and begin protecting the jewels as quickly as possible.

    6.  Let Veeam use the default SQL Server Express instance on the Veeam Job Manager VM.  This is a very reasonable, and simple configuration that should be adequate for a lot of environments.

    7.  Question whether a physical proxy is needed.  Typically physical proxies are used for one of three reasons.  1.)  They offload job processing CPU cycles from your cluster.  2.)  In simple arrangements a Windows-based physical proxy might also be the repository (aka storage target).  3.)  They allow one to leverage a "direct-from-SAN" feature by plugging the system into your SAN fabric.  The last one, in my opinion, introduces the most hesitation.  Here is why:

    • Some storage arrays do not have a "read-only" iSCSI connection type.  When this is the case, special care needs to be taken on the physical server directly attached to the SAN to ensure that it cannot initialize the data store.  The reality is that you are one mistake away from having a very long day in front of you.  I do not like this option when there is no secondary safety mechanism from the array on a "read-only" connection type.
    • Direct-from-SAN access can be a very good method for moving data to your target.  So good that it may stress your backing storage enough (via link saturation or physical disk limits) to perhaps interfere with your production I/O requirements.
    • Additional efforts must be taken when using write buffering mechanisms that do not live on the storage array (e.g. PernixData) .

    8.  Veeam has the ability to back up to an SMB share, or an NFS mount.  If an NFS mount is chosen, make sure that it is a storage target running native Linux.  Most NAS units like a Synology are indeed just a tweaked version of Linux, and it would be easy to conclude that one should just use NFS.  However, in this case, you may run into two problems.

    • The SMB connection to a NAS unit will likely be faster (which most certainly is the first time in history that an SMB connection is faster than an NFS connection) .
    • The Job Manager might not be able to manage the jobs on that NAS unit (connected via NFS) properly.  This is due to BusyBox and Perl on the Synology not really liking each other.  For me, this resulted in Veeam being unable to remove aging backups as they fell out of retention.  Changing over to an SMB connection on the NAS improved the performance significantly, and allowed job handling to work as desired.

    9.  Veeam has a great new feature in version 7.x called a "Backup Copy" job, which allows the backup made locally to be shipped to a remote site.  The "Backup Copy" job achieves one of the most basic requirements of data protection in the simplest of ways: two copies of the data at two different locations, with the benefit of only processing the backup job once.  Although it is a great feature, it behaves differently than a standard backup job, and warrants some time spent with it before putting it into production.  For a speedy deployment, it might be best simply to configure two jobs – one to a local target, and one to a remote target.  This will give you the time to experiment with the Backup Copy feature.

    10.  There are compelling reasons for and against using a repurposed server as a storage target versus using a NAS unit.  Both are attractive options.  I ended up using a dedicated NAS unit.  Its form factor, drive bay count, and overall cost of provisioning made it the only option that could match my requirements.

    Operations

    11.  In Veeam B&R, "Replication Jobs" are different from "Backup Jobs."  Instead of trying to figure out all of the nuances of both right away, use just the "Backup Job" function with both local and remote targets.  This will give you time to better understand the characteristics of the replication functionality.  One also might find that the "Backup Job" suits the environment and the need better than the replication option.

    12.  If there are daily backups going to both local and offsite targets (and you are not using the "Backup Copy" option), have them run 12 hours apart from one another to reduce RPOs.

    13.  Build up a test VM to do your testing of a backup and restore.  Restore it in the many ways that Veeam has to offer.  Best to understand this now rather than when you really need to.

    14.  I like the job chaining/dependency feature, which allows you to chain multiple jobs together.  But remember that if a job is manually started, it will run through the rest of the jobs too.  The easiest way to accommodate this is to temporarily remove it from the job chain.

    15.  Your "Backup Repository" is just that, a repository for data.  It can be a Windows Server, a Linux Server, or an SMB share.  If you don’t have a NAS unit, stuff an old server (Windows or Linux) with some drives in it and it will work quite well for you.

    16.  Devise a simple, clear job naming scheme.  Something like [BackupType]-[Descriptive Name]-[TargetLocation] will quickly tell you what it is and where it is going to.  If you use folders in vCenter to organize your VMs, and your backups reflect the same, you could also  choose to use the folder name.  An example would be "Backup-SharePointFarm-LOCAL" which quickly and accurately describes the job.

    17.  Start with a simple schedule.  Say, once per day, then watch the daily backup jobs and the synthetic fulls to see what sort of RPO/RTOs are realistic.

    18.  Repository naming.  Be descriptive, but come up with some naming scheme that remains clear even if you aren’t in the application for several weeks.  I like indicating the location of the repository, if it is intended for local jobs, or remote jobs, and what kind of repository it is (Windows, Linux, or SMB).  For example:  VeeamRepo-[LOCATION]-for-Local(SMB)

    19.  Repository organization.  Create a good tree structure for organization and scalability.  Veeam will do a very good job at handling the organization of the backups once you assign a specific location (share name) on a repository.  However, create a structure that provides the ability to continue with the same naming convention as your needs evolve.  For instance, a logical share name assigned to a repository might be \\nas01\backups\veeam\local\cluster1  This arrangement allows for different types of backups to live in different branches.

    20.  Veeam might prevent you from creating more than one repository pointing to the same share name (it would see \\nas01\backups\veeam\local\cluster1 and \\nas01\backups\veeam\local\cluster2 as the same).  Create DNS aliases to work around it, then make those two targets something like \\nascluster1\backups\veeam\local\cluster1 and \\nascluster2\backups\veeam\local\cluster2.  A quick way to create those aliases is sketched below.
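    If you go the alias route, the records take seconds to create. Below is a quick sketch using the DnsServer PowerShell module on a Windows DNS server; the zone and host names are placeholders for your own environment.

```powershell
# Create two DNS aliases that both resolve to the same NAS, so Veeam treats
# the repository paths as distinct (hypothetical zone and host names).
Import-Module DnsServer

Add-DnsServerResourceRecordCName -ZoneName "corp.local" -Name "nascluster1" -HostNameAlias "nas01.corp.local"
Add-DnsServerResourceRecordCName -ZoneName "corp.local" -Name "nascluster2" -HostNameAlias "nas01.corp.local"
```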

    21.  When in doubt, leave the defaults.  Veeam has put in great effort to make sure that neither you nor the software trips over itself.  Uncertain of job concurrency numbers?  Stick to the default.  Wondering which backup mode to use (reverse incremental versus incrementals with synthetic fulls)?  Stay with the defaults, and save the experimentation for later.

    22.  Don’t overcomplicate the schedule (at least initially).  Veeam might give you flexibility that you never had with array based protection tools, but at the same time, there is no need to make it complicated.  Perhaps group the VMs by something that you can keep track of, such as the folders they are contained in within vCenter.

    23.  Each backup job can be adjusted so that whatever target you are using, it is optimized with a preset storage optimization type: WAN target, LAN target, or local target.  This is easy to overlook, but will make a difference in backup performance.

    24.  How many backups you can keep is a function of change rate, frequency, dedupe and compression, and the size of your target.  Yep, that is a lot of variables.  If nothing else, find some storage that can serve as the target for, say, 2 weeks.  That should give a pretty good sampling of all of the above.

    25.  Take one item/feature once a week, and spend an hour or two looking into it.  This will allow you to find out more about say, Changed block tracking, or what the application aware image processing feature does.  Your reputation (and perhaps, your job) may rely on your ability to recover systems and data.  Come up with a handful of scenarios and see if they work.

    Veeam is an extremely powerful tool that will simplify your layers of protection in your environment. Features like SureBackup, Virtual Labs, and their Replication offerings are all very good. But more than likely, they do not need to be a part of your initial deployment plan. Stay focused, and get that new backup software up and running as quickly as possible. You, and your organization, will be better off for it.

    – Pete

    Zero to 32 Terabytes in 30 minutes. My new EqualLogic PS4000e

    Rack it up, plug it in, and away you go.  Those are basically the steps needed to expand a storage pool by adding another PS array using the Dell/EqualLogic architecture.  A few weeks ago I took delivery of a new PS4000e to complement my PS6000e at my primary site.  The purpose of this additional array was really simple.  We needed raw storage capacity.  My initial proposal and deployment of my virtualized infrastructure a few years ago was a good one, but I deliberately did not include our big flat-file storage servers in that initial scope of storage space requirements.  There was plenty to keep me occupied between the initial deployment and now.  It allowed me to get most of my infrastructure virtualized, and gave a chance for buy-in from the skeptics who thought all of this new-fangled technology was too good to be true.  Since that time, storage prices have fallen, and larger drive sizes have become available.  Delaying the purchase aligned well with “just-in-time” purchasing principles, and also gave me an opportunity to address the storage issue in the correct way.  At first, I thought all of this was a subject matter not worthy of writing about.  After all, EqualLogic makes it easy to add storage.  But that only addresses part of the problem.  Many of you face the same dilemma regardless of what your storage solution is: user facing storage growth.

    Before I start rambling about my dilemma, let me clarify what I mean by a few terms I’ll be using: “user facing storage” and “non user facing storage.”

    • User Facing Storage is simply the storage that is presented to end users via file shares (in Windows) and NFS mounts (in Linux).  User facing storage is waiting there, ready to be sucked up by an overzealous end user. 
    • Non User Facing Storage is the storage occupied by the servers themselves, and the services they provide.  Most end users generally have no idea how much space a server reserves for, say, SQL databases or transaction logs (nor should they!).  Non user facing storage needs are easier to anticipate and manage because the storage is only exposed to system administrators.

    Which array…

    I decided to go with the PS4000e because of the value it returns, and how it addresses my specific need.  If I had targeted VDI or storage for other I/O intensive services, I would have opted for one of the other offerings in the EqualLogic lineup.  I virtualized the majority of my infrastructure on one PS6000e with 16, 1TB drives in it, but it wasn’t capable of the raw capacity that we now needed to virtualize our flat-file storage.  While the effective number of 1GbE ports is cut in half on the PS4000e as compared to the PS6000e, I have not been able to gather any usage statistics from my traditional storage servers that suggest the throughput of the PS4000e will not be sufficient.  The PS4000e allowed me to trim a few dollars off of my budget line estimates, and may work well at our CoLo facility if we ever need to demote it.

    I chose to create a storage pool so that I could keep my volumes that require higher performance on the PS6000, and have the dedicated storage volumes on the PS4000.  I will do the same for when I eventually add other array types geared for specific roles, such as VDI.

    Truth be told, we all know that 16, 2 terabyte drives does not equal 32 terabytes of real world space.  The RAID50 penalty knocks that down to about 21TB.  Cut that by about half for average snapshot reserves, and it’s more like 11TB.  Keeping a little bit of free pool space available is always a good idea, so let’s just say it effectively adds 10TB of full-fledged enterprise class storage.  That adds to my effective storage space of 5TB on my PS6000.  Fantastic.  …but wait, one problem.  No, several problems.

    The Dilemma

    Turning up the new array was the easy part.  In less than 30 minutes, I had it mounted, turned on, and configured to work with my existing storage group.  Now for the hard part; figuring out how to utilize the space in the most efficient way.  User facing storage is a wildcard; do it wrong and you’ll pay for it later.  While I didn’t know the answer, I did know some things that would help me come to an educated decision.

    • If I migrate all of the data on my remaining physical storage servers (two of them, one Linux, and one Windows) over to my SAN, it will consume virtually all of my newly acquired storage space.
    • If I add a large amount of user-facing storage, and present that to end users, it will get sucked up like a vacuum.
    • If I blindly add large amounts of great storage at the primary site without careful thought, I will not have enough storage at the offsite facility to replicate to.
    • Large volumes (2TB or larger) not only run into technical limitations, but are difficult to manage.  At that size, there may also be a co-mingling of data that is not necessarily business critical.  Doling out user facing storage in large volumes is easy to do.  It will come back to bite you later on.
    • Manipulating the old data in the same volume as new data does not bode well for replication and snapshots, which look at block changes.  Breaking them into separate volumes is more effective.
    • Users will not take the time or the effort clean up old data.
    • If data retention policies are in place, users will generally be okay with them after a substantial amount of complaining.  It’s not too different from the complaining you might hear when there are no data retention policies, but you have no space.  Pick your poison.
    • Your users will not understand data retention policies if you do not understand them.  Time for a plan.

    I needed a way to compartmentalize some of the data so that it could be identified as “less important” and then perhaps live on less important storage.  “Less important storage” could mean a part of the SAN that is not replicated, or, in a worst case scenario, even some old decommissioned physical servers, where the data resides for a defined amount of time before it is permanently archived or removed from the probationary location.

    The Solution (for now)

    Data lifecycle management.  For many this means some really expensive commercial package, and that might be the way to go for you too.  To me, it is really nothing more than determining what data is important and what isn’t, and having a plan to help automate the demotion, or retirement, of that data.  However, there is a fundamental problem with this approach.  Who decides what’s important?  What are the thresholds?  Last accessed time?  Last modified time?  What are the ramifications of cherry-picking files from a directory structure because they exceed policy thresholds?  What is this going to break?  How easy is it to recover data that has been demoted?  There are a few steps I need to take to accomplish this.

    1.  Poor man’s storage tiering.  If you are out of SAN space, re-provision an old server.  The purpose of this is to serve up volumes that can be linked to the primary storage location through symbolic links (a minimal example of the link itself follows this step).  These volumes can then be backed up at a less frequent interval, as they would be considered less important.  If you eventually have enough SAN storage space, they could easily be moved onto the SAN, in a less critical role, or onto a SAN array that has larger, slower disks.
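    The linking itself is the easy part. A minimal example on the Windows side (paths are hypothetical), with the Linux equivalent noted in a comment:

```powershell
# Hypothetical example: expose an "Archive" folder inside the primary share
# that actually lives on a re-provisioned server elsewhere (run elevated).
New-Item -ItemType SymbolicLink `
    -Path "D:\Shares\Engineering\Archive" `
    -Target "\\oldserver01\archive\Engineering"

# The Linux equivalent is simply:  ln -s /mnt/oldserver01/engineering /srv/shares/engineering/archive
```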

    2.  Breaking up large volumes.  I’m convinced that giant volumes do nothing for you when it comes to understanding and managing the contents.  Turning larger blobs into smaller blobs also serves another very important role.  It allows the intelligence of the EqualLogic solution to do its work on where the data should live in a collection of arrays.  A storage group that consists of, say, an SSD based array, a PS6000, and a PS4000 can effectively store the volumes on the array that best suits the demand.

    3.  Automating the process.  This comes in two parts: a.) deciding on structure, policies, etc., and b.) making or using tools to move the files from one location to another.  On the Linux side, this could mean anything from a bash script to something written in Python, with cron to schedule the occurrence.  In Windows, you could leverage PowerShell, VBScript, or batch files (a bare-bones example follows this step).  This can be as simple, or as complex, as your needs require.  However, if you are like me, you have limited time to tinker with scripting.  If there is something turn-key that does the job, go for it.  For me, that is an affordable little utility called “TreeSize Pro.”  It gives you not only the ability to analyze the contents of NTFS volumes, but can also easily automate the pruning of this data to another location.
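    If you do script it yourself, the core of such a job is short. The sketch below (hypothetical paths and threshold) moves files untouched for four years to the demoted location while preserving the folder structure; schedule it with Task Scheduler, or translate the same idea to bash and cron on the Linux side.

```powershell
# Hypothetical sketch: demote files not modified in 4+ years to less important storage.
$source  = "D:\Shares\Projects"
$archive = "\\oldserver01\archive\Projects"
$cutoff  = (Get-Date).AddYears(-4)

Get-ChildItem -Path $source -Recurse -File |
    Where-Object { $_.LastWriteTime -lt $cutoff } |
    ForEach-Object {
        # Rebuild the relative path under the archive root, then move the file
        $relative    = $_.FullName.Substring($source.Length).TrimStart('\')
        $destination = Join-Path -Path $archive -ChildPath $relative
        New-Item -ItemType Directory -Path (Split-Path $destination) -Force | Out-Null
        Move-Item -Path $_.FullName -Destination $destination
    }
```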

    4.  Monitoring the result.  This one is easy to overlook, but you will need to monitor the fruits of your labor, and make sure it is doing what it should be doing: maintaining available storage space on critical storage devices.  There are a handful of nice scripts written for both platforms that help you monitor free storage space at the server level; a minimal example is below.
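    A bare-bones version of that check might look like the sketch below (hypothetical threshold); any real monitoring package will do the same with less effort, so treat this as the fallback option.

```powershell
# Hypothetical sketch: warn when any fixed disk drops below 15% free space.
$minimumFreePercent = 15

Get-CimInstance -ClassName Win32_LogicalDisk -Filter "DriveType = 3" | ForEach-Object {
    $freePercent = [math]::Round(($_.FreeSpace / $_.Size) * 100, 1)
    if ($freePercent -lt $minimumFreePercent) {
        # Swap this for Send-MailMessage, an event log entry, or your monitoring tool
        Write-Warning "$($_.DeviceID) on $env:COMPUTERNAME is down to $freePercent% free."
    }
}
```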

    The result

    The illustration below helps demonstrate how this would work. 

    image

    As seen below, once a system is established to automatically move and house demoted data, you can more effectively use storage on the SAN.

    image

    Separation anxiety…

    In order to make this work, you will have to work hard at making sure that all of this is transparent to the end user.  If you have data with complex external references, you will want to preserve the integrity of the data that relies on those dependent files.  Hey, I never said this was going to be easy.

    A few things worth remembering…

    If 17 years in IT, and a little observation of human nature, have taught me one thing, it is that we all undervalue our current data, and overvalue our old data.  You see it time and time again.  Storage runs out, and there are cries for running down to the local box store and picking up some $99 hard drives.  What needs to reside on there is mission critical (hence the undervaluing of the new data).  Conversely, efforts to have users clean up old data from 10+ years ago had users hiding files in special locations, even though the data had not been modified, or even accessed, in 4+ years.  All of this, of course, lives on enterprise class storage.  An all too common example of overvaluing old data.

    Tip.  Remember your Service Level Agreements.  It is common in IT to have SLAs not only for systems and data, but for one’s position.  These are without doubt tied to one another.  Make sure that one doesn’t compromise the other.  Stop-gap measures to accommodate more storage invite desperate, bargain-bin solutions (e.g. adding cheap non-redundant drives to an old server somewhere).  Don’t do it!  All of those arm-chair administrators in your organization will be nowhere to be found when those drives fail, and you are left to clean up the mess.

    Tip.  Don’t ever thin provision user facing storage.  Fortunately, I was lucky to be clued into this early on, but I could only imagine the well intentioned administrator who wanted to present a nice amount of storage space to the user, only to find it sucked up a few days later.  Save the thin provisioning for non user facing storage (servers with SQL databases and transaction logs, etc.)

    Tip.  If you are presenting proposals to management, or general information updates to users, I would suggest quoting only the amount of effective, usable space that will be added.  In other words, don’t say you are adding 32TB to your storage infrastructure, when in fact it is closer to 10TB.  Say that it is 10TB of extremely sophisticated, redundant, enterprise class storage that you can “bet the business” on.  Its scalability, flexibility, and robustness are needed for the 24/7 environments we insist upon.  It will just make things easier that way.

    Tip.  It may seem unnecessary to you, but continue to show off snapshots, replication, and other unique aspects of SAN storage if you still have those who doubt the power of this kind of technology – especially when they see the cost per TB.  Remind them how long it would take (if it is even possible) to protect that same data with traditional storage.  Do everything you can to help those who approve these purchases.  More than likely, they won’t be as impressed by, say, how quick a snapshot is as they will be shocked by how poorly traditional storage can be protected.

    You may have noticed I do not have any rock-solid answers for managing the growth and sustainability of user facing data.  Situations vary, but the factors that help determine that path for a solution are quite similar.  Whether you decide on a turn-key solution, or choose to demonstrate a little ingenuity in times of tight budgets, the topic is one that you will probably have to face at some point.

     

    Finally. A practical solution to protecting Active Directory

     

    Active Directory.  It is the brains of most modern-day IT infrastructures, providing just about every conceivable control over how users, computers, and information interact with each other.  Authentication and user, group, and computer access control all help provide logical barriers that allow for secure access, while still giving a seamless user experience with single sign-on access to resources.  While it has the ability to improve and integrate critical services such as DNS, DHCP, and NTP, in many ways those services become dependent on Active Directory.  These days, Active Directory controls more than just pure Windows environments.  Integration with non-Microsoft operating systems like Ubuntu, SUSE, and VMware’s vSphere is becoming more common thanks to products such as Likewise.  The environment that I manage has Windows servers and clients, most distributions of Linux, Macs, a few flavors of Unix, VMware, and iPhones.  All of them rely on Active Directory.  You quickly learn that if Active Directory goes down, so does your job security.

    Active Directory will run happily even under less than ideal circumstances.  It is incredibly resilient, and somehow can put up with server crashes, power outages, and all sorts of debauchery.  But neglect is not a required ingredient for things to go wrong.  When they do, the results can be devastating.  AD problems can be difficult to track down, and its tentacles will affect services you never considered.  A corrupt Active Directory, or the controllers it runs on, can make your Exchange and SQL servers crumble around you.  I lived through this experience (barely) a while back, and even though my preparation for such scenarios looked very good on paper, I spent a healthy amount of time licking my wounds, and reassessing my backup strategy for Active Directory.  I never want to put myself in that position again.

    As important as Active Directory is, it can be quite challenging to protect.  Why?  I believe the answer boils down to two main factors: it’s distributed, and it’s transaction based.  In other words, the two traits that make it robust also make it difficult to protect.  Large enterprises usually have a well architected AD infrastructure, and at least understand the complexities of protecting their AD environment.  Many others are left pondering the various ways to protect it.

    • File based backups using traditional backup methods.  This has never been enough, but my bet is that you’d find a number of smaller environments do this – if they do anything at all.  It has worked for them only because they’ve never had a failure of any sort.
    • AD backup agents that are part of a commercial backup application.  Some applications like Symantec Backup Exec (what I previously relied on) seem like a good idea, but show their true colors when you actually try to use them for recovery.  While the agents should be extending the functionality of the backup software, they just add to an already complex solution that feels like a monstrosity geared for other purposes.
    • Exporting AD on Windows 2008 based Domain Controllers by using NTDSUTIL and the like.  This is difficult at best, arguably incomplete, and if you have a mix of Windows 2008 and Windows 2003 DCs, it won’t work.
    • Those who have virtualized their domain controllers often think that a well timed independent snapshot or VCB backup will protect them.  This is not true either.  You will have a VM-consistent backup of the VM itself, but it does nothing to coordinate the application with the other Domain Controllers, or to protect the integrity of its contents.  In theory, they could be backed up properly if every single DC was shut down at the same time, but most of us know that would not be a solution at all.
    • Dedicated Solutions exist to protect Active Directory, but can be overly complex, and outrageously expensive.  I’m sure they do their job well, but I couldn’t get the line item past our budget line owner to find out.

    The result can be a desire to protect AD, but uncertainty about what “protect” really means.  Is protecting the server good enough?  Is protecting AD itself enough?  Does one need both, and if so, how does one go about doing that?  Without fully understanding the answers to those questions, something inevitably goes wrong, and the Administrator is frantically flipping through the latest TechNet article on authoritative restores while attempting to figure out their backup software.  It’s particularly painful to the Administrator who had the impression that they were protecting their organization (and themselves) when in fact they were not.

    In my opinion, protecting the domain should occur at two different levels.

    • Application layer.  This is critical.  Among other things, the backup will coordinate Active Directory so that all of its Update Sequence Numbers (USNs) are at an agreed upon state.  This avoids USNs that are out of sync, which are at the root of so many AD related problems.  Application layer protection should also honor these AD specific attributes so that granular recovery of individual objects is possible.  Good backup software should leverage APIs that take advantage of Volume Shadow Copy Services (VSS).  (A minimal native example follows this list.)
    • Physical layer.  This protects the system that the services may be running on.  If it’s a physical server, it could be using some disk imaging software such as Acronis, or Backup Exec System Recovery.  If it’s virtualized, an independent backup of the VM will do.  Some might suggest that protecting the actual machine isn’t technically required.  The idea behind that reasoning is that if there is a problem with the physical machine, or the OS, one can quickly decommission and commission another DC with “dcpromo.”  While protecting the system that AD runs on may not be required, it may help speed up your ability (in conjunction with Application layer protection) to correct issues from a previously known working state.
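    As a point of reference, even without a third-party product, a VSS-based, AD-aware backup of a domain controller can be taken with the built-in Windows Server Backup tooling. This is only the most basic form of application layer protection and is not the product discussed next, but it illustrates the difference from a plain file copy or an uncoordinated snapshot (the E: target below is a placeholder).

```powershell
# Minimal example: a system state backup of a domain controller using the
# built-in Windows Server Backup feature (run elevated, feature installed).
# E: is a placeholder for a dedicated backup target volume.
wbadmin start systemstatebackup -backupTarget:E: -quiet
```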

    I was introduced to CionSystems by a colleague of mine who suggested their “Active Directory Self-Service” product to help us with another need of ours.  Along the way, I couldn’t help but notice their AD backup offering.  Aptly named, “Active Directory Recovery” is a complete application layer solution.  I tried it out, and was sold.  It allows for a simple, coordinated backup and recovery of Active Directory.  A recovery can be either a complete point-in-time, or a granular restore of an object.  It is agentless, meaning that you don’t have to install software on the DCs.  The first impression after working with it is that it was designed for one purpose; to backup Active Directory.  It does it, and does it well.

    The solution will run on any spare machine running IIS and SQL.  Once installed, configuring it is just a matter of pointing it to your Domain Controller that runs the PDC Emulator role.  After a few configuration entries are made, the Administration console can be accessed with your web browser from anywhere on your network.

    image

    The next step is to set up a backup job, and let it run.  That’s it.  Fast, simple, and complete.  From the home page, there are a few different ways you can look at objects that you want to recover.

    If it’s a deleted object, you can click on the “Deleted Objects” section.  Objects with a backup to restore from will show up in green, with the available backups presented below each object.  Below you will see a deleted computer object, and the backups it can be restored from.

    image

    The “List Backups” view simply shows the backups created, in chronological order.  From there you can do full restores, or restore an individual object that still exists in AD.  Unlike authoritative restores, you do not have to do any system restarts.

    image

    During the restore process, “Active Directory Recovery” will expose the individual attributes of the object that you want to restore – if you wish for the restore to be that granular.  If an attribute is restorable, there is a checkbox next to it; non-modifiable attributes will not have one.

    image

    One of my favorite features is that it provides a way for a true, portable backup.  One can export the backup to a single file (a proprietary .bin file) that is your entire AD backup, and save it onto a CD, or to a remote location.  This is a wish list item I’ve had for about as long as AD has been around.    There are many other nice features, such as email notifications, filtering and comparison tools, as well as backup retention settings. 

    I use this product to complement my existing strategy for protecting my AD infrastructure.  While my virtualized Domain Controllers are replicated to a remote site (the physical protection, so to speak), I protect my AD environment at the application level with this product.  The server that “Active Directory Recovery” runs on is also replicated, but to be extra safe, I create a portable/exported backup that is also shipped off to the offsite location.  This way I have a fully independent backup of AD.  If I’m doing critical updates to my Domain Controllers, I first make a backup using Active Directory Recovery, then take my snapshots of the virtualized DCs.  That way, I have a way to roll back the changes that is truly application consistent.

    After using the product for a while, I can appreciate that I don’t have to invest much time to keep my backups up and running.  I previously used Symantec’s Backup Exec to protect AD, but grew tired of agent issues, licensing problems, and the endless backup failure messages.  I lost confidence in its ability to protect AD, and am not interested in going back. 

    Hopefully this gives you a little food for thought on how you are protecting your Active Directory environment.  Good luck!