Sustained power outages in the datacenter

Ask any child about a power outage, and you can tell it is a pretty exciting thing. Flashlights. Candles. The whole bit. Seen through the eyes of adulthood, that excitement is an unexplainable reaction to an inconvenient, if not frustrating, event. When you are responsible for a datacenter of any size, there is no joy that comes from a power outage. Depending on the facility the infrastructure lives in, and the tools put in place to address the issue, it can be a minor inconvenience or a real mess.

Planning for failure is one of the primary tenets of IT. It touches as much on operational decisions as it does design. Mitigation steps for failure events follow from the design itself, and define if or when further steps need to be taken to become fully operational again. Some events require a series of well-defined actions (automated, manual, or somewhere in between) in order to ensure a predictable result. Classic DR scenarios generally come to mind first, but shoring up how to react to certain events should also include sustained power outages. Good content on the matter is sparse at best, so I will share a few bits of information I have learned over the years.

The Challenges
One of the limitations with a physical design of redundancy when it comes to facility power is, well, the facility. It is likely served by a single utility district, and the customer simply doesn’t have options to bring in other power. The building also may have limited or no backup power. Generators may be sized large enough to keep the elevators and a few lights running, but that is about it. Many cannot, or do not, provide power conditioned well enough to run expensive equipment. The option to feed PDUs using different circuits from the power closet might also be limited.

Defining the intent of your UPS units is an often overlooked consideration. Are they sized just to provide enough time for a simple graceful shutdown? …And how long is that? Or are they sized to meet some SLA decided upon by management and budget line owners? Those are good questions, but inevitably, if the power is out for long enough, you have to deal with how a graceful shutdown will be orchestrated.

SMBs fall in a particularly risky category, as they often have a set of disparate, small UPS units supplying battery backed power, with no unified management system to orchestrate what should happen in an "on battery" event. It is not uncommon to see an SMB well down the road of virtualization, but their UPS units do not have the smarts to handle information from the items they are powering. Picking the winning number on a roulette wheel might give better odds than figuring out which is going to go first, and which is going to go last.

Not all power outages are a simple power versus no power issue. A few years back our building lost one leg of the three-phase power coming in from the electric vault under the nearby street. This caused a voltage "back feed" on one of the legs, which cut nominal voltage severely. This dirty power/brown-out scenario was one of the worst I’ve seen. It lasted for 7 very long hours during the middle of the night. While the primary infrastructure was able to be safely shut down, workstations and other devices were toggling off and on due to this scenario. Several pieces of equipment were ruined, but many others ended up worse off than we were.

It’s all about the little mistakes
"Sometimes I lie awake at night, and I ask, ‘Where have I gone wrong?’ Then a voice says to me, ‘This is going to take more than one night.’" –Charlie Brown, Peanuts [Charles Schulz]

A sequence of little mistakes in an otherwise good plan can kill you. This transcends IT. I was a rock climber for many years, and a single tragic mistake was almost always the result of a series of smaller mistakes. It often stemmed from poor assumptions, bad planning, trivializing variables, or not acknowledging the known unknowns. Don’t let yourself be the IT equivalent of the climber that cratered on the ground.

One of the biggest potential risks is a running VM not fully committing I/Os from its own queues, or from anywhere else in the data path (all the way down to the array controllers), before the batteries fully deplete. When the VMs are properly shut down before the batteries deplete, you can be assured that all data has been committed, and that the integrity of your systems and data remains intact.

So where does one begin? Properly dealing with a sustained outage starts with recognizing that it is a sequence-driven event.

1. Determine what needs to stay on the longest. Oftentimes it is not how long a VM or system stays up on battery that matters, but that it is gracefully shut off before a hard power failure. Your UPS units buy you a finite amount of time. It takes more than "hope" to make your systems go down gracefully, and in the correct order.

2. Determine your hardware dependency chain. Work through what is the most logical order of shutdown for your physical equipment, and identify the last pieces of physical equipment that need to stay on. (Your answer better be switches).

3. Determine your software dependency chain. Many systems can be shut down at any time, but many others rely on other services to support their needs. Map it out. Also recognize that hardware can be affected by the lack of availability of software based services (e.g. DNS, SMTP, AD, etc.).

4. Determine what equipment might need a graceful shutdown, and what can drop when the UPS units run dry. Check with each manufacturer for the answers.

Once you begin to make progress on better understanding the above, then you can look into how you can make it happen.
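As a starting point for steps 2 and 3, the dependency chain can be written down as data and sorted mechanically. The sketch below is only an illustration: the device and VM names are hypothetical, and it assumes Python 3.9 or later for the standard library `graphlib` module. Each entry lists what an item needs in order to run, a valid power-up order puts dependencies first, and the shutdown order is simply the reverse.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical inventory: each item lists what it needs in order to run.
# Replace these names with your own hardware and software dependency map.
DEPENDS_ON = {
    "core-switch": [],
    "storage-array": ["core-switch"],
    "esxi-host": ["core-switch", "storage-array"],
    "dns-vm": ["esxi-host"],
    "ad-vm": ["esxi-host", "dns-vm"],
    "sql-vm": ["esxi-host", "ad-vm"],
    "app-vm": ["sql-vm", "dns-vm"],
}


def power_up_order(depends_on):
    """Dependencies first: one valid order for bringing things back up."""
    return list(TopologicalSorter(depends_on).static_order())


def shutdown_order(depends_on):
    """Dependents first: the reverse of a valid power-up order."""
    return list(reversed(power_up_order(depends_on)))


if __name__ == "__main__":
    for step, name in enumerate(shutdown_order(DEPENDS_ON), start=1):
        print(step, name)
```

With this example map the switch ends up last in the shutdown order and first in the power-up order. If the map contains a circular dependency, `TopologicalSorter` raises `graphlib.CycleError`, which is a useful thing to discover on paper rather than on battery.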

Making a retrospective work for you
It’s not uncommon, once a sustained power failure has ended, to simply be grateful that everything came back up without issue. As a result, one leaves valuable information on the table on how to improve the process in the future. Seize the moment! Take notes during the event so that the details can be recalled accurately during a retrospective. After all, the retrospective’s purpose is to define what went well and what didn’t, and stressful situations can play tricks on memory. Perhaps you couldn’t identify power cables easily, or wondered why your Exchange server took a long time to shut down, or didn’t know if or when vCenter shut down gracefully. Notes taken in the moment are a great way to capture this kind of information. In the "dirty power" story above, the UPS power did not last as long as I had anticipated because the server room’s dedicated AC unit shut down. The room heated up, and all of the variable speed fans kicked into high gear, draining the batteries faster than I expected. Lesson learned.
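The fan lesson is easy to see with some back-of-the-envelope arithmetic. The sketch below uses a deliberately simplified model (usable battery energy divided by load) and made-up numbers, not the figures from that night. Real batteries are not this linear, and runtime usually drops faster than the model suggests as load rises, so treat the manufacturer's runtime charts and the UPS's own estimate as the authority.

```python
def estimated_runtime_minutes(usable_watt_hours, load_watts):
    """Rough runtime estimate: usable battery energy divided by load.

    Simplifying assumption: the battery delivers the same usable energy
    at any load. Real batteries deliver less as the load goes up.
    """
    if load_watts <= 0:
        raise ValueError("load_watts must be positive")
    return usable_watt_hours / load_watts * 60


# Hypothetical numbers for illustration only.
USABLE_WH = 1500.0        # usable energy in the battery string
NORMAL_LOAD_W = 2000.0    # load with the room at normal temperature
HOT_ROOM_LOAD_W = 2500.0  # same gear with every fan at full speed

if __name__ == "__main__":
    normal = estimated_runtime_minutes(USABLE_WH, NORMAL_LOAD_W)
    hot = estimated_runtime_minutes(USABLE_WH, HOT_ROOM_LOAD_W)
    print(f"normal room: about {normal:.0f} minutes")
    print(f"hot room:    about {hot:.0f} minutes")
    print(f"time lost:   about {normal - hot:.0f} minutes")
```

In this made-up case, a 25 percent rise in load takes the estimate from roughly 45 minutes down to roughly 36. If the shutdown sequence was planned around the larger number, the last few steps never get to run.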

The planning process is served well by mocking up a power failure event on paper. Remember, thinking about it is free, and is a nice way to kick off the planning. Clearly, the biggest challenge around developing power down and power up scenarios is that they have to be tested at some point. How do you test this? Very carefully. In fact, if you have any concerns at all, save it for a lab. Then introduce it into production in such a way that you can statically control or limit the shutdown event to just a few test machines. The only scenario I can imagine on par with a sustained power outage is kicking off a domino-effect workflow that shuts down your entire datacenter.

The run book
Having a plan located only in your head will accomplish only two things: it will be a guaranteed failure, and it can put your organization’s systems and data at risk. This is why there is a need to define and publish a sustained power outage run book. Sometimes known as a "play chart" in the sports world, it is intended to define a reaction to an event under a given set of circumstances. The purpose is to 1.) vet the process beforehand, and 2.) avoid "heat of the moment" decisions made under great stress that end up being the wrong ones.

The run book also serves as a good planning tool for determining if you have the tools or methods available to orchestrate a graceful, orderly shutdown of VMs and equipment based on the data provided by the UPS units. The run book is not just about graceful power down scenarios, but also the steps required for a successful power-up. Sometimes the power-up side is better known, as an occasional lights out maintenance window may be needed for storage or firmware updates, hardware replacement, and so on. Power-up planning includes making sure you have some basic services available for the infrastructure as it comes back. For example, see "Using a Synology NAS as an emergency backup DNS server for vSphere" for a few tips on a simple way to serve up DNS to your infrastructure.

And don’t forget to make sure the run book is still accessible when you need it most (when there is no power). 🙂

Tools and tips
I’ve stayed away from discussing specific scripts or tools for this because each environment is different, and may have different tools available to them. For instance, I use Emerson-Liebert UPS units, and have a controlling VM that will orchestrate many of the automated shutdown steps of VMs. Using PowerCLI, Python, or bash can be a complementary, or a critical part of a shutdown process. It is up to you. The key is to have some entity that will be able to interpret how much power remains on battery, and how one can trigger event driven actions from that information.
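To make the idea of event-driven actions a little more concrete without tying it to any one vendor, here is a minimal sketch of the decision logic only. Everything in it is an assumption for illustration: the stage names and thresholds are invented, and how you actually read the remaining runtime (SNMP, a vendor agent, something else) and how you actually shut things down are left as callables you would supply yourself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    trigger_below_minutes: float  # fire once estimated runtime drops below this


# Hypothetical stages, ordered from first to go to last to go.
STAGES = [
    Stage("shut down test/dev VMs", 30),
    Stage("shut down application VMs", 20),
    Stage("shut down database and directory VMs", 12),
    Stage("shut down hosts and storage", 6),
]


def stages_due(minutes_remaining, already_done):
    """Stages whose threshold has been crossed and that have not run yet."""
    return [
        stage for stage in STAGES
        if minutes_remaining < stage.trigger_below_minutes
        and stage.name not in already_done
    ]


def run_once(minutes_remaining, already_done, act=print):
    """Handle one battery reading. `act` defaults to print, i.e. a dry run."""
    for stage in stages_due(minutes_remaining, already_done):
        act(stage.name)
        already_done.add(stage.name)
    return already_done


if __name__ == "__main__":
    done = set()
    for reading in (40, 25, 25, 10, 5):  # simulated battery readings in minutes
        print(f"-- {reading} minutes remaining")
        run_once(reading, done)
```

Keeping the default action as a plain `print` means the whole sequence can be rehearsed against simulated readings before anything is wired to a real shutdown command. Tracking which stages have already run keeps a stage from firing twice when the reading hovers around a threshold.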

1. Remember that graceful shutdowns can create a bit of their own CPU and storage I/O storm. It is not as significant as a boot storm upon power up, and is generally only seen at the beginning of the shutdown process when all systems are still up, but it can be noticeable.

2. Ask your coworkers or industry colleagues for feedback. Learn about what they have in place, and share some stories about what went wrong, and what went right. It’s good for the soul, and your job security.

3. Focus more on the correct steps, sequence, and procedure, before thinking about automating it. You can’t automate something when you do not clearly understand the workflow.

4. Determine how you are going to make this effort a priority, and important to key stakeholders. Take it to your boss, or management. Yes, you heard me right. It won’t ever be addressed until it is given visibility, and identified as a risk. It is not about potential self-incrimination. It is about improving the plan of action around these types of events. Help them understand the implications of not handling these events correctly.
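One way to soften the shutdown storm mentioned in the first tip is to send shutdown requests in small batches with a pause between them. The sketch below is only an outline under stated assumptions: `shut_down` is a hypothetical callable standing in for whatever your tooling provides, and the batch size and pause are values you would have to find for your own environment.

```python
import time


def batches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def staggered_shutdown(vm_names, batch_size, pause_seconds, shut_down,
                       sleep=time.sleep):
    """Call `shut_down` for each VM, pausing between batches.

    `shut_down` is a placeholder for your own tooling. `sleep` can be
    swapped out so the sequence can be rehearsed without waiting.
    """
    chunks = list(batches(vm_names, batch_size))
    for index, chunk in enumerate(chunks):
        for name in chunk:
            shut_down(name)
        if index < len(chunks) - 1:  # no pause needed after the last batch
            sleep(pause_seconds)


if __name__ == "__main__":
    # Dry run with hypothetical VM names: print instead of shutting down.
    staggered_shutdown(["web1", "web2", "app1", "app2", "sql1"],
                       batch_size=2, pause_seconds=0, shut_down=print)
```

Every pause spends battery time, so the pauses have to fit inside whatever runtime the UPS units can actually deliver.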

It is a very strange experience to be in a server room that is whisper quiet from a sustained power outage. There is an opportunity to make it a much less stressful experience with a little planning and preparation. Good luck!

– Pete
