Understanding PernixData FVP’s clustered read caching functionality

When PernixData debuted FVP back in August 2013, for me there was one innovation in particular that stood out above the rest.  The ability to accelerate writes (known as “Write Back” caching) on the server side, and do so in a fault tolerant way.  Leverage fast media on the server side to drive microsecond write latencies to a VM while enjoying all of the benefits of VMware clustering (vMotion, HA, DRS, etc.).  Give the VM the advantage of physics by presenting a local acknowledgement of the write, but maintain all of the benefits of keeping your compute and storage layers separate.

But sometimes overlooked with this innovation is the effectiveness that comes with how FVP clusters acceleration devices to create a pool of resources for read caching (known as “Write Through” caching with FVP). For new and existing FVP users, it is good to get familiar with the basics of how to interpret the effectiveness of clustered read caching, and how to look for opportunities to improve the results of it in an environment. For those who will be trying out the upcoming FVP Freedom edition, this will also serve as an additional primer for interpreting the metrics. Announced at Virtualization Field Day 5, the Freedom Edition is a free edition of FVP with a few limitations, such as read caching only, and a maximum of 128GB tier size using RAM.

The power of read caching done the right way
Read caching alone can sometimes be perceived as a helpful way to improve performance, but temporary, and only addressing one side of the I/O dialogue. Unfortunately, this assertion tells an incomplete story. It is often criticized, but let’s remember that caching in some form is used by almost everyone, and everything.  Storage arrays of all types, Hyper Converged solutions, and even DAS.  Dig a little deeper, and you realize its perceived shortcomings are most often attributed to how it has been implemented. By that I mean:

  • Limited, non-adjustable cache sizes in arrays or Hyper Converged environments.
  • Limited to a single host in server side solutions.  (operations like vMotion undermining its effectiveness)
  • Not VM or workload aware.

Existing solutions address some of these shortcomings, but fall short in addressing all three in order to deliver read caching in a truly effective way. FVP’s architecture address all three, giving you the agility to quickly adjust the performance tier while letting your centralized storage do what it does best; store data.

Since FVP allows you to choose the size of the acceleration tier, this impact alone can be profound. For instance, current NVMe based Flash cards are 2TB in size, and are expected to grow dramatically in the near future. Imagine a 10 node cluster that would have perhaps 20-40TB of an acceleration tier that may be serving up just 50TB of persistent storage. Compare this to a hybrid array that may only put in a few hundred GB of flash devices in an array serving up that same 50TB, and funneling through a pair of array controllers. Flash that the I/Os would still have traverse the network and storage stack to get to, and cached data that is arbitrarily evicted for new incoming hot blocks.

Unlike other host side caching solutions, FVP treats the collection of acceleration devices on each host as a pool. As workloads are being actively moved across hosts in the vSphere cluster, those workloads will still be able to fetch the cached content from that pool using a light weight protocol. Traditionally host based caching would have to re-warm the data from the backend storage using the entire storage stack and traditional protocols if something like a vMotion event occurred.

FVP is also VM aware. This means it understands the identity of each cached block – where it is coming from, and going to -  and has many ways to maintain cache coherency (See Frank Denneman’s post Solving Cache Pollution). Traditional approaches to providing a caching tier meant that they were largely unaware of who the blocks of data were associated with. Intelligence was typically lost the moment the block exits the HBA on the host. This sets up one of the most common but often overlooked scenarios in a real environment. One or more noisy neighbor VMs can easily pollute, and force eviction of hot blocks in the cache used by other VMs. The arbitrary nature of this means potentially unpredictable performance with these traditional approaches.

How it works
The logic behind FVP’s clustered read caching approach is incredibly resilient and efficient. Cached reads for a VM can be fetched from any host participating in the cluster, which allows for a seamless leveraging of cache content regardless of where the VM lives in the cluster. Frank Denneman’s post on FVP’s remote cache access describes this in great detail.

Adjusting the charts
Since we will be looking at the FVP charts to better understand the benefit of just read caching alone, let’s create a custom view. This will allow us to really focus on read I/Os and not get them confused with any other write I/O activity occurring at the same time.



Note that when you choose a "Custom Breakdown", the same colors used to represent both reads and writes in the default "Storage Type" view will now be representing ONLY reads from their respective resource type. Something to keep in mind as you toggle between the default "Storage Type" view, and this custom view.


Looking at Offload
The goal for any well designed storage system is to deliver optimal performance to the applications.  With FVP, I/Os are offloaded from the array to the acceleration tier on the server side.  Read requests will be delivered to the VMs faster, reducing latency, and speeding up your applications. 

From a financial investment perspective, let’s not forget the benefit of I/O “offload.”  Or in other words, read requests that were satisfied from the acceleration tier. Using FVP, offload from the storage arrays serving the persistent storage tier, from the array controllers, from the fabric, and the HBAs. The more offload there is, the less work for your storage arrays and fabric, which means you can target more affordable backend storage. The hero numbers showcase the sum of this offload nicely.image

Looking at Network acceleration reads
Unlike other host based solutions, FVP allows for common activities such as vMotions, DRS, and HA to work seamlessly without forcing any sort of rewarming of the cache from the backend storage. Below is an example of read I/O from 3 VMs in a production environment, and their ability to access cached reads on an acceleration device on a remote host.


Note how the Latency maintains its low latency on those read requests that came from a remote acceleration device (the green line).

How good is my read caching working?
Regardless of which write policy (Write Through or Write Back) is being used in FVP, the cache is populated in the same way.

  • All read requests from the backing array will place the data into the acceleration tier as it fetches it from the backing storage.
  • All write I/O is placed in the cache as it is written to the physical storage.
    Therefore, it is easy to conclude that if read I/Os did NOT come from acceleration tier, it is from one of three reasons.
  • A block of data had been requested that had never been requested before.
  • The block of data had not been written recently, and thus, not residing in cache.
  • A block of data had once lived in the cache (via a read or write), but had been evicted due to cache size.

The first two items reflect the workload characteristics, while the last one is a result of a design decision – that being the cache size. With FVP you get to choose how large the devices are that make up the caching tier, so you can determine ultimately how much the solution will benefit you. Cache size can have a dramatic impact on performance because there is less pressure to evict previous data that have already been cached to make room for new data.

Visualizing the read cache usage
This is where the FVP metrics can tell the story. When looking at the "Custom Breakdown" view described earlier in this post, you can clearly see on the image below that while a sizable amount of reads were being serviced from the caching tier, the majority of reads (3,500+ IOPS sustained) in this time frame (1 week) came from the backing datastore.


Now, let’s contrast this to another environment and another workload. The image below clearly shows a large amount of data over the period of 1 day that is served from the acceleration tier. Nearly all of the read I/Os and over 60MBps of throughput that never touched the array.


When evaluating read cache sizing, this is one of the reasons why I like this particular “Custom Breakdown” view so much. Not only does it tell you how well FVP is working at offloading reads. It tells you the POTENTIAL of all reads that *could* be offloaded from the array.  You get to choose how much offload occurs, because you decide on how large your tier size is, or how many VMs participate in that tier.

Hit Rate will also tell you the percentage of reads that are coming from the acceleration tier at any point and time. This can be an effective way to few cache hit frequency, but to gain more insight, I often rely on this "Custom Breakdown" to get better context of how much data is coming from the cache and backing datastores at any point in time. Eviction rate can also provide complimentary information if it shows the eviction rate creeping upward.  But there can be cases were lower eviction percentages may evict enough cached data over time that it can still impact if it is still in cache.  Thus the reason why this particular "Custom Breakdown" is my favorite for evaluating reads.

What might be a scenario for seeing a lot of reads coming from a backing datastore, and not from cache? Imagine running 500 VMs in an acceleration tier size of just a few GB. The working set sizes are likely much larger than the cache size, and will result in churning through the cache and not show significant demonstrable benefit. Something to keep in mind if you are trying out FVP with a very small amount of RAM as an acceleration resource. Two effective ways to make this more efficient would be to 1.) increase the cache size or 2.) decrease the number of VMs participating in acceleration. Both will achieve the same thing; providing more potential cache tier size for each VM accelerated. The idea for any caching layer is to have it large enough to hold most of the active data (aka "working set") in the tier. With FVP, you get to easily adjust the tier size, or the VMs participating in it.

Don’t know what your working set sizes are?  Stay tuned for PernixData Architect!

Once you have a good plan for read caching with FVP, and arrange for a setup with maximum offload, you can drive the best performance possible from clustered read caching. On it’s own, clustered read caching implemented the way FVP does it can change the architectural discussion of how you design and spend those IT dollars.  Pair this with write-buffering with the full edition of FVP, and it can change the game completely.

Dogs, Rush hour traffic, and the history of storage I/O benchmarking–Part 1

Evaluating performance of x86 based servers and workstations has had a history of deficiency. Twenty years ago, Administrators who tested system performance usually did little more than run a simple CPU benchmark to see how much faster a 50MHz system was than a 25 MHz system. Rarely did testing go beyond this. Nostalgia aside, it really was a simpler time.

Fast forward a few years, and testing became slightly more sophisticated. Someone figured out it might be good to test the slowest part of the system (storage), so methods and tools were created to accommodate. Storage moved beyond the physical confines of the server by using dedicated LUNS in a SAN array. The LUNS may have not been shared, but the fabric, and entry points to the array were. However, testing storage generally marched forward with little change. Virtualization changed the landscape even further by changing the notion of a dedicated LUN for a single system. Now, the fabric and every component on the storage system was shared.

Testing tools came and went, with some being nothing more than orphaned side projects. Some tools have more dials to turn, but many still run under the assumption that they are testing a physical host on local spinning disk. They do little to try to emulate a real workload, as they have no idea what that means. Many times these tools try to combine load generation with a single, final number for performance measurement. Almost as if whatever happened in between the start and finish didn’t matter.

Testing methods didn’t evolve much either. The quest for "top speed" was never supplanted by any other method. Noteworthy considering a critical measurement of anything shared is performance under load or contention. Storage architectures and the media used has evolved, but rarely is it properly accounted for in testing. Often lost in the speeds and feeds discussion is the part that really counts – The performance of the applications and the VMs they live on.

This post will point out the flaws of synthetic testing of storage performance (the tools, and the techniques), but it may incorrectly give the impression that they are useless.  Quite the contrary actually.  They can be very helpful when used the correct way, and for the right reasons.  More on this later.

Deficiencies of benchmarks as a meaningful measuring stick
"I don’t use benchmarks. I have users" — Ancient Twitter Proverb

There is no substitute for the value of observing real world performance characteristics, but it does little to address the difficulty with measuring that performance in a repeatable way. Real workloads are a collection of widely moving variables that all have different types of impact on an environment and a user experience. Testing system performance is important, but only when it is properly understood what the testing tools are producing.

Synthetic benchmarks offer a number of benefits. They are typically very easy to run, and often produce some dashboard result that can be referenced later. But these tools and test methods share common characteristics that rarely generate anything resembling real data patterns. Among those distinctions are.

  • I/O generated from them is not a closed loop dialogue
  • They do not mimic dynamic variables of real workloads
  • Improper testing practices in a clustered compute environment

All of these warrant more detail, so let’s elaborate on each one.

I/O generated from them is not a closed loop dialogue
A simplified way to describe typical I/O dialogue be this; Data is fetched, it is processed in some way, then is sent on its way. The I/O “signature” of a workload could be described as to what pattern and degree this dialogue occurs in.  It can be a pattern that is often repeated frequently if you observe workloads long enough.

Consider the fetching of some data, the processing of some data, and the writing of data. One might liken this process to a dog fetching a ball, you wiping the slobber off, and throwing it out again. Over and over again, and in that order. Single threaded of course.


Synthetic load generators attack this quite differently. The one and only job of a synthetic I/O generator is to fill up the queues as fast as possible using every CPU cycle. The I/O generator has no regard for anything. The data is not processed in any way because that is not what was asked of it. By comparison, Synthetic I/O pretty much looks like this:


With a Synthetic I/O generator, every CPU cycle will be pushed to perform a singular action. Reads are requested and writes are issued in an all or nothing fashion. Sure, some generators allow you to mix reads and writes, but the problem still remains.  They do not reflect any meaningful dialogue, and cannot mimic a real workload.

They do not mimic dynamic variables of real workloads
Real workloads consume resources (CPU, Memory, Storage) far differently than their synthetic counterparts. At any given time, storage I/Os will be varying mixes of reads versus writes, I/O sizes, and coming from one or many CPU threads. The two images below shows a 6 1/2 minute snippet of real I/O taken from a single VM in a production environment (using vscsiStats and Excel surface charts). During this time of heavy activity, notice how much the type of I/O that is in play varies.

Below you see the number of read I/Os, and the respective I/O sizes. Typically between 4K and 32K in size.


Below you see the number of write I/Os, and the respective I/O sizes. The majority of sizes range from 32K to 512K in size. This is occurring at the same time on the same VM.


Here you can see that read/write ratios vary for just this single VM, and more importantly, the size and the number of I/Os are all over the place. I/O sizes can have an enormous impact on storage performance, so one can imagine the difficulty in emulating it accurately. VMware’s I/O Analyzer attempts to simulate courtesy of a trace file created and replayed from a real workload, but it still will not behave in the same way as multiple VMs from multiple hosts generating widely varying I/O patterns.  Analytics from storage arrays doesn’t help much either, as they are unable to see the pattern of data in this way.

Improper testing practices in a clustered compute environment
A typical Administrator who tests storage performance usually does so by setting up a single system (VM) to test peak performance (IOPS, Throughput & Latency) on a shared storage backend. It sounds logical at first, but this method doesn’t reflect the way data is handled in a clustered compute environment. Storage I/O from a clustered compute arrangement behaves in a way that is not unlike congestion on an interstate freeway. The performance of a freeway cannot be evaluated by a single car driving on it. It’s performance measurement is derived when it is under load with multiple cars, with different intentions, destinations, sizes, and all of the other variables that introduce congestion. Modern traffic simulation and modeling solutions account for all of these variables to measure and improve what matters most – real traffic.

Unfortunately, most testers and tools take this same "single car" approach, and do not account for one of the most important elements in modern virtualized infrastructures; the clustered compute layer. A fast storage infrastructure needs to be able to handle the given number of compute nodes (physical hosts) now and in the future. After all, the I/O’s are ALWAYS generated by the VMs and the collection of hosts they live on – not the backend storage. Painfully obvious, but often overlooked.

Take a look below. This is an illustration of I/O activity in a real environment (traditional clustered compute with SAN architecture). Green lines represent read I/Os and red lines representing write I/Os.


Now, let’s look at I/O activity from a traditional synthetic benchmarking approach. As you can see below, it looks pretty different.


Storage I/O generated in a real environment is a result of the number of nodes in a cluster and the workloads running on them. So a better (but far from perfect) way to test, at the very minimum, would be:


In a traditional architecture with an array that exceeds the capabilities of I/O generated from a single host, this is the method most commonly used to measure the absolute high water mark numbers of that array. Typically storage manufactures quote the numbers from the array because it is a single point of measurement that fits nicely in Marketing materials. Unfortunately it doesn’t measure what really counts; the reported numbers as seen by the VMs – a fact that must not be forgotten when performing any sort of performance testing. The array of course always hits a limit at some point due to the characteristics of the array, or the fabric it has to traverse.

In any sort of clustered compute system, you cannot recognize the full power of the compute platform by testing off of one VM. The same thing goes for any type of distributed storage architecture. With clustered host based acceleration solutions like PernixData FVP, or even Hyper Converged solutions, the approach will have to be similar to above in order to measure correctly. These are different architectures that reshape the traditional data path, and with the testing recommendations above, should help in evaluating their performance. This approach also puts the focus back where it should be; the performance of the VMs, and not some irrelevant numbers from a physical storage array.


Proper simulation of I/O across all hosts will allow you to adequately factor in the performance of the storage fabric and all connection points. Most fabrics are quite fast when there isn’t any traffic on them. Unfortunately, that isn’t very realistic. It is important to understand the impact the fabric introduces as the environment is scaled. Since the fabric is what connects all of the hosts fetching and committing data to a storage array, we need to simulate how everything (HBAs, storage array controllers, switches, etc.) performs under contention. If you haven’t already done so, take a few moments to read one of my favorite posts from Frank Denneman, Data Path is not managed as a clustered resource.

Testing with multiple VMs on multiple hosts with FVP also allows you to takes advantage of the per VM acceleration (write buffering & read caching) capabilities across a clustered compute environment. It is one of the reasons why the FVP’s decoupled architecture can scale so well, and why real workloads become a such a beneficiary of the architecture.

You thought we were finished, didn’t you
There are just too many ways Synthetic Benchmarks are misused to cover in just one post. Stay tuned for Part 2 for more observations on why they are inadequate as a single test for modern environments, and most importantly, when and how they can actually be useful.

Interpreting Performance Metrics in PernixData FVP

In the simplest of terms, performance charts and graphs are nothing more than lines with pretty colors.  They exist to provide insight and enable smart decision making.  Yet, accurate interpretation is a skill often trivialized, or worse, completely overlooked.  Depending on how well the data is presented, performance graphs can amaze or confuse, with hardly a difference between the two.

A vSphere environment provides ample opportunity to stare at all types of performance graphs, but often lost are techniques in how to interpret the actual data.  The counterpoint to this is that most are self-explanatory. Perhaps a valid point if they were not misinterpreted and underutilized so often.  To appeal to those averse to performance graph overload, many well intentioned solutions offer overly simplified dashboard-like insights.  They might serve as a good starting point, but this distilled data often falls short in providing the detail necessary to understand real performance conditions.  Variables that impact performance can be complex, and deserve more insight than a green/yellow/red indicator over a large sampling period.

Most vSphere Administrators can quickly view the “heavy hitters” of an environment by sorting the VMs by CPU in order to see the big offenders, and then drill down from there.  vCenter does not naturally provide good visual representation for storage I/O.  Interesting because storage performance can be the culprit for so many performance issues in a virtualized environment.  PernixData FVP accelerates your storage I/O, but also fills the void nicely in helping you understand your storage I/O.

FVP’s metrics leverage VMkernel statistics, but in my opinion make them more consumable.  These statistics reported by the hypervisor are particularly important because they are the measurements your VMs and applications feel.  Something to keep in mind when your other components in your infrastructure (storage arrays, network fabrics, etc.) may advertise good performance numbers, but don’t align with what the applications are seeing.

Interpreting performance metrics is a big topic, so the goal of this post is to provide some tips to help you interpret PernixData FVP performance metrics more accurately.

Starting at the top
In order to quickly look for the busiest VMs, one can start at the top of the FVP cluster.  Click on the “Performance Map” which is similar to a heat map. Rather than projecting VM I/O activity by color, the view will project each VM on their respective hosts at different sizes proportional to how much I/O they are generating for that given time period.  More active VMs will show up larger than less active VMs.

(click on images to enlarge)


From here, you can click on the targets of the VMs to get a feel for what activity is going on – serving as a convenient way to drill into the common I/O metrics of each VM; Latency, IOPS, and Throughput.


As shown below, these same metrics are available if the VM on the left hand side of the vSphere client is highlighted, and will give a larger view of each one of the graphs.  I tend to like this approach because it is a little easier on the eyes.


VM based Metrics – IOPS and Throughput
When drilling down into the VM’s FVP performance statistics, it will default to the Latency tab.  This makes sense considering how important latency is, but I find it most helpful to first click on the IOPS tab to get a feel for how many I/Os this VM is generating or requesting.  The primary reason why I don’t initially look at the Latency tab is that latency is a metric that requires context.  Often times VM workloads are bursty, and there may be times where there is little to no IOPS.  The VMkernel can sometimes report latency against little or no I/O activity a bit inaccurately, so looking at the IOPS and Throughput tabs first bring context to the Latency tab.

The default “Storage Type” breakdown view is a good view to start with when looking at IOPs and Throughput. To simplify the view even more tick the boxes so that only the “VM Observed” and the “Datastore” lines show, as displayed below.


The predefined “read/write” breakdown is also helpful for the IOPS and Throughput tabs as it gives a feel of the proportion of reads versus writes.  More on this in a minute.

What to look for
When viewing the IOPS and Throughput in an FVP accelerated environment, there may be times when you see large amounts of separation between the “VM Observed” line (blue) and the “Datastore” (magenta). Similar to what is shown below, having this separation where the “VM Observed” line is much higher than the “Datastore” line is a clear indication that FVP is accelerating those I/Os and driving down the latency.  It doesn’t take long to begin looking for this visual cue.


But there are times when there may be little to no separation between these lines, such as what you see below.


So what is going on?  Does this mean FVP is no longer accelerating?  No, it is still working.  It is about interpreting the graphs correctly.  Since FVP is an acceleration tier only, cached reads come from the acceleration tier on the hosts – creating the large separation between the “Datastore” and the “VM Observed” lines.  When FVP accelerates writes, they are synchronously buffered to the acceleration tier, followed by destaging to the backing datastore as soon as possible – often within milliseconds.  The rate at which data is sampled and rendered onto the graph will report the “VM Observed” and “Datastore” statistics that are at very similar times.

By toggling the “Breakdown” to “read/write” we can confirm in this case that the change in appearance in the IOPS graph above came from the workload transitioning from mostly reads to mostly writes.  Note how the magenta “Datastore” line above matches up with the cyan “Write” line below.


The graph above still might imply that the performance went down as the workload transition from reads to writes. Is that really the case?  Well, let’s take a look at the “Throughput” tab.  As you can see below, the graph shows that in fact there was the same amount of data being transmitted on both phases of the workload, yet the IOPS shows much fewer I/Os at the time the writes were occurring.


The most common reason for this sort of behavior is OS file system buffer caching inside the guest VM, which will assemble writes into larger I/O sizes.  The amount of data read in this example was the same as the amount of data that was written, but measuring that by only IOPS (aka I/O commands per second) can be misleading. I/O sizes are not only underappreciated for their impact on storage performance, but this is a good example of how often the I/O sizes can change, and how IOPS can be a misleading measurement if left on its own.

If the example above doesn’t make you question conventional wisdom on industry standard read/write ratios, or common methods for testing storage systems, it should.

We can also see from the Write Back Destaging tab that FVP destages the writes as aggressively as the array will allow.  As you can see below, all of the writes were delivered to the backing datastore in under 1 second.  This ties back to the previous graphs that showed the “VM Observed” and the “Datastore” lines following very closely to each other during period with several writes.


The key to understanding the performance improvement is to look at the Latency tab.  Notice on the image below how that latency for the VM dropped way down to a low, predictable level throughout the entire workload.  Again, this is the metric that matters.


Another way to think of this is that the IOPS and Throughput performance charts can typically show the visual results for read caching better than write buffering.  This is because:

  • Cached reads never come from the backing datastore, where buffered writes always hit the backing datastore.
  • Reads may be smaller I/O sizes than writes, which visually skews the impact if only looking at the IOPS tab.

Therefore, the ultimate measurement for both reads and writes is the latency metric.

VM based Metrics – Latency
Latency is arguably one of the most important metrics to look at.  This is what matters most to an active VM and the applications that live on it.  Now that you’ve looked at the IOPS and Throughput, take a look at the Latency tab. The “Storage type” breakdown is a good place to start, as it gives an overall sense of the effective VM latency against the backing datastore.  Much like the other metrics, it is good to look for separation between the “VM Observed” and “Datastore” where “VM Observed” latency should be lower than the “Datastore” line.

In the image above, the latency is dramatically improved, which again is the real measurement of impact.  A more detailed view of this same data can be viewed by selecting a “custom ” breakdown.  Tick the following checkboxes as shown below


Now take a look at the latency for the VM again. Hover anywhere on the chart that you might find interesting. The pop-up dialog will show you the detailed information that really tells you valuable information:

  • Where would have the latency come from if it had originated from the datastore (datastore read or write)
  • What has contributed to the effective “VM Observed” latency.


What to look for
The desired result for the Latency tab is to have the “VM Observed” line as low and as consistent as possible.  There may be times where the VM observed latency is not quite as low as you might expect.  The causes for this are numerous, and subject for another post, but FVP will provide some good indications as to some of the sources of that latency.  Switching over to the “Custom Breakdown” described earlier, you can see this more clearly.  This view can be used as an effective tool to help better understand any causes related to an occasional latency spike.

Hit & Eviction rate
Hit rate is the percentage of reads that are serviced by the acceleration tier, and not by the datastore.  It is great to see this measurement high, but is not the exclusive indicator of how well the environment is operating.  It is a metric that is complimentary to the other metrics, and shouldn’t be looked at in isolation.  It is only focused on displaying read caching hit rates, and conveys that as a percentage; whether there are 2,000 IOPS coming from the VM, or 2 IOPS coming from the VM.

There are times where this isn’t as high as you’d think.  Some of the causes to a lower than expected hit rate include:

  • Large amounts of sequential writes.  The graph is measuring read “hits” and will see a write as a “read miss”
  • Little or no I/O activity on the VM monitored.
  • In-guest activity that you are unaware of.  For instance, an in-guest SQL backup job might flush out the otherwise good cache related to that particular VM.  This is a leading indicator of such activity.  Thanks to the new Intelligent I/O profiling feature in FVP 2.5, one has the ability to optimize the cache for these types of scenarios.  See Frank Denneman’s post for more information about this feature.

Lets look at the Hit Rate for the period we are interested in.


You can see from above that the period of activity is the only part we should pay attention to.  Notice on the previous graphs that outside of the active period we were interested in, there was very little to no I/O activity

A low hit rate does not necessarily mean that a workload hasn’t been accelerated. It simply provides and additional data point for understanding.  In addition to looking at the hit rate, a good strategy is to look at the amount of reads from the IOPS or Throughput tab by creating the custom view settings of:


Now we can better see how many reads are actually occurring, and how many are coming from cache versus the backing datastore.  It puts much better context around the situation than relying entirely on Hit Rate.


Eviction Rate will tell us the percentage of blocks that are being evicted at any point and time.  A very low eviction rate indicates that FVP is lazily evicting data on an as needed based to make room for new incoming hot data, and is a good sign that the acceleration tier size is sized large enough to handle the general working set of data.  If this ramps upward, then that tells you that otherwise hot data will no longer be in the acceleration tier.  Eviction rates are a good indicator to help you determine of your acceleration tier is large enough.

The importance of context and the correlation to CPU cycles
When viewing performance metrics, context is everything.  Performance metrics are guilty of standing on their own far too often.  Or perhaps, it is human nature to want to look at these in isolation.  In the previous graphs, notice the relationship between the IOPS, Throughput, and Latency tabs.  They all play a part in delivering storage payload.

Viewing a VM’s ability to generate high IOPS and Throughput are good, but this can also be misleading.  A common but incorrect assumption is that once a VM is on fast storage that it will start doing some startling number of IOPS.  That is simply untrue. It is the application (and the OS that it is living on) that is dictating how many I/Os it will be pushing at any given time. I know of many single threaded applications that are very I/O intensive, and several multithreaded applications that aren’t.  Thus, it’s not about chasing IOPS, but rather, the ability to deliver low latency in a consistent way.  It is that low latency that lets the CPU breath freely, and not wait for the next I/O to be executed.

What do I mean by “breath freely?”  With real world workloads, the difference between fast and slow storage I/O is that CPU cycles can satisfy the I/O request without waiting.  A given workload may be performing some defined activity.  It may take a certain number of CPU cycles, and a certain number of storage I/Os to accomplish this.  An infrastructure that allows those I/Os to complete more quickly will let more CPU cycles to take part in completing the request, but in a shorter amount of time.


Looking at CPU utilization can also be a helpful indicator of your storage infrastructure’s ability to deliver the I/O. A VM’s ability to peak at 100% CPU is often a good thing from a storage I/O perspective.  It means that VM is less likely to be storage I/O constrained.

The only thing better than a really fast infrastructure for your workloads is understanding how well it is performing.  Hopefully this post offers up a few good tips when you look at your own workloads leveraging PernixData FVP.

Your Intel NUC Home Lab questions answered

With my recent post on what’s currently running in my vSphere Home Lab, I received a number of questions about one particular part of the lab; that being my Management Cluster built with Intel NUCs. So here is a quick compilation of those questions (with answers) I’ve had around this topic.

Why did you go with a NUC?  There are cheaper options.
My approach for a Home Lab Management Cluster was a bit different than my regular Lab Cluster. I wanted to take a minimalist approach, and provide just enough resources to get my primary VMs off of my other two hosts that I do a majority of my testing against. In other words, less is more. There is a bit of a price premium with a NUC, but there also is a distinct payoff with them that often gets overlooked. If they do not keep up with your needs in the Home Lab, they can be easily repurposed as a workstation or a media PC. The same can’t be said for most Home Lab gear.

Is there anything special you have to do to run ESXi on a NUC?
Nothing terribly difficult. The buildup of ESXi on the NUC is relatively straightforward, and there is a growing number of posts that walk through this nicely. The primary steps are:

  1. Build a customized ISO by packing up an Intel NIC driver, and a SATA Controller driver, and place it on a bootable USB.
  2. Temporarily disable AHCI in the BIOS for just the installation process
  3. Install ESXi
  4. Re-enable AHCI in the BIOS after the installation of ESXi is complete.

How many cores?
The 3rd generation NUC is built off of the Intel Core i5-4250U (Haswell) chipset. It has two physical cores, and will present 4CPUs with Hyper-Threading. After managing and watching real workloads for years, my position on Hyper-Threading is a bit more conservative than many others. It is certainly better than nothing, but many times the effective performance gain is limited, and varies with workload characteristics. It’s primary benefit with the NUC is that you can run a 4vCPU VM if you need to. Utilization of the CPU from a cluster perspective are often hovering below 10%.

Is working with 16GB of RAM painful?
Having just 16GB of RAM might be a more visible pain if it were serving something other than Management VMs. The biggest issue with a "skinny" two node Management Cluster usually comes in when you have to throw one into maintenance mode. But much like having a single switch in a Home Lab, you just deal with it. Below is what the Memory usage on these NUCs look like.


There are a few options to improve this situation.

1. Trimming up some of your VMs might be a good start. Virtual Appliances like the VCSA are built with a healthy chunk of RAM configured by default (supporting all of that Java goodness). Here is a way to trim up memory resources on the VCSA, although, I have not done this yet because I haven’t needed to. Just don’t use the Active Memory metric as your sole data point to trim up a VM Memory configuration. See Observations with the Active Memory metric in vSphere on how easily that metric can be misinterpreted.

2. Look into a new option for increasing RAM density on the NUC. Yeah, this might blow your budget, but if you really want 32GB of RAM in a NUC, you can do it. At this time, the cost delta for making a modification like this is pretty steep, and it may make more sense to purchase a 3rd NUC for a Management Cluster.

3. Adjust expectations and accept the fact that you might have a little memory ballooning or swapping going on. This is by far the easiest, and most affordable way to go.

How is ESXi on a single NIC?
Well, there is no getting around the fact that the NUC comes with a single 1GbE NIC. This means no redundancy, and limited bandwidth. The good news is that with just one NIC, you can monitor this quite easily in vCenter!   Since you are running all services and data across a single uplink, it may be in your best interest to run a Virtual Distributed Switch (VDS) to properly control ingress and egress traffic, and make sure something like a vMotion isn’t going to wreak havoc on your environment. However, transitioning a vCenter VM to a VDS with a single uplink can sometimes be a little adventurous, so you might want to plan ahead.

If you must have a 2nd NIC to the host, take a look here. Nicholas Farmer showed quite a bit of ingenuity in coming up with a second uplink. Also, don’t forget to look at his great mini-rack he made for the NUCs out of Legos. Great stuff.

How do they perform?
Exactly the way I want them to perform. Out of sight, and out of mind.  Again, my primary lab work is performed on my Micro-ATX style hosts, so as long as the NUCs can keep the various Management and infrastructure VMs running, then that is good with me. Some VMs are easy to trim up to provide minimal resources (Linux based syslog servers, DNS, etc.) while others are more difficult or not worth the hassle.

Why did you put two SSDs in them?
This was for flexibility. I wanted one (mSATA) drive for the possibility of local storage if I decided to place any VMs locally, as well as another device (2.5" SSD) for other uses. In this case, I decided to apply a little PernixData FVP magic on them and use one of them in each host to accelerate the VMs. The image below shows the latency of the VCSA, which has about 99% writes. Note how the latency dropped after transitioning the VM from Write-Through (read caching) to Write-Back (read and write caching) to a consistently low level. Not bad considering all traffic is riding across a single link, and the flash device is an old SSD.

(click on image to enlarge)



Would you recommend them?
I think the Intel NUCs serve as a great little host for a Management Cluster in a Home Lab. They won’t be replacing my Micro-ATX style boxes any time soon, nor should they ever be part of a real environment, but they allow me the freedom to experiment and test on the primary part of the Lab, which is what a Home Lab is all about.

Thanks for reading.

– Pete

My vSphere Home Lab. 2015 edition

The vSphere Home Lab. For some, it is a tool for learning. For others it is a hobby. And for the rest of us, it is a weird addiction rationalized as one of the first two reasons. Home Labs come in all shapes and sizes, and there really is no right or wrong way to create one. Apparently interest in vSphere Home Labs hasn’t waned, as there are now countless resources available online illustrating various designs. At one of our recent Seattle VMUG meetings, we gave a presentation on Home Lab arrangements and ideas. There was great interaction from the audience, and we received several comments afterward on how much they enjoyed the discussion and learning about what others were doing. If you are a VMUG leader and are looking for ideas for presentations, I’d recommend this topic at one of your own local meetings.

Much like a real Data Center, Home Labs are a continual work in progress. Shiny new gear often sits by the warts. Replacing the old equipment with new gear usually correlates to how much time and money you wish to dedicate to the effort. I marvel at some setups by others in the industry. A few of the more recent ones to keep an eye on is the work Erik Bussink does with his high speed networking, and the cool setup Jason Langer has with his half height cabinet and rack mounted hosts all on 10GbE. Pretty funny considering how many companies are still running 3 hosts with 1GbE networking.

In my conversations with others in the community, I realized I didn’t have a post I could direct someone to when they would ask what I used in my own environment. Well, let me lay it out for you, as of February, 2015.

Primary vSphere cluster
2 hosts currently make up this cluster, and consist of the following:

  • Lian LI PC-V351B chassis paired with a Scythe SY 1225SL 12L 120mm case fan.
  • SuperMicro MBD-X9SCM-F-O LGA Motherboard with IPMI (a must!)
  • Intel E3-1230 Sandy Bridge 3.2Ghz CPU (single socket, 4 physical cores)
  • 32GB RAM
  • Seasonic X series SS-400FL Power supply
  • Qty 3, Intel E1G42ETBLK dual port NIC
  • Mellanox MT25418 DDR 2 port InfiniBand HCA (10Gb per connection)
  • 8GB USB drive (boot)
  • 2TB SATA disk for local storage (testing)
  • Qty 2.  SATA based SSDs.  (varies with testing)

Management Cluster
At this time, a single host makes up this cluster, but intend to add a second unit.

  • Intel NUC BOXD54250WYKH1 Intel Core i5-4250U
  • Intel 530 240GB mSATA SSD
  • Crucial 16GB Kit (2x8GB)
  • Extra drive bay (for additional 2.5" SSD if needed)

The ATX style hosts have served quite well over the last 2 1/2 years. They are starting to show their age, but are quiet, and power efficient (read: low heat). Unfortunately they max out at just 32GB of RAM, which gets eaten up pretty quickly these days. The chassis started out very empty at first, but as I started to add SSDs and spinning disks for additional testing, InfiniBand cards, along with the occasional PCIe flash card or storage controller, I don’t have much room to spare anymore.

The Intel NUC is an interesting solution. In a vSphere Home Lab, the biggest constraints are that they are limited to 16GB of RAM, and a single 1GbE NIC. Since these units will serve as my management cluster, it should be fine, and it allows me to be more destructive on the primary two host cluster. They also fit into the small server rack quite nicely. I prefer the slightly thicker D54250WYKH as opposed to the traditional Intel D54250WYK model. It’s slightly thicker, but allows for an additional internal 2.5" drive. This offers a lot of flexibility if you wanted to keep some VMs on local storage, or possibly do some limited testing with host based caching. If they ever become too underpowered, they will always find use as a media server or workstation.

Most of my networking needs flow through a Cisco SG300-20. This is a feature rich, layer 3 switch that I’ve written about in the past (Using the Cisco SG300-20 Layer 3 switch in a home lab) I’ve used up all 20 ports, and really need another one. However, with the advent of other good layer 3 switches out there, and with the possibility of eventually moving my Lab to 10GbE, I’ve been making do with what I have.

As noted on my post Testing Infiniband in the home lab with PernixData FVP I introduced InfiniBand as a relatively affordable way to test high speed interconnects between hosts. I avoid the need to have an InfiniBand switch because with only two hosts, I can simply directly connect them. They are only passing vMotion and PernixData FVP traffic, so there is no need worry about routing. A desire to add a 3rd or 4th host gets complex, as I’d have to take the plunge and invest in an IB switch. (loud and not cheap).

Persistent storage comes from a Synology DS1512+ and a Synology DS1514+ NAS. unit. Both are 5 bay units, and have a mix of spinning disk, and SSDs. The primary difference being that the DS1514+ has four, 1GbE ports on it versus the older DS1512+ has two. One unit is used for housing the majority of my Lab VMs, and non-lab based file storage, while the other is used for experimentation and performance testing. Realistically I only need one Synology unit, but I was able to pick up the newer model for a great price, and I couldn’t refuse. My plan is to split out lab duties and general storage needs to the separate units.

Synology has seemed to have won the battle of storage in the home labs. Those who own them know that while they are a little pricey, they are well worth it, and offer so many other benefits beyond just serving up block or file storage for a vSphere cluster.

Battery Backup
My luck with UPS units in the home has not been anything to brag about. It’s usually a case of looking like they work until you really need them. So far the best luck I’ve had is with the unit I’m currently using. It is a CyberPower 1500AVR. With the entire lab drawing around 200 watts, this means that there is only about a 25% load on the UPS.

Server rack
A two shelf wire utility rack from Lowe’s fit the bill quite nicely. It is small, affordable ($25), and seemed to house the goofy form factor of the Lian Li ATX chassis. The only problem is that if I add a another ATX style host, I may have to come up with a better rack solution.

While I had a good lab environment to test with, up until a few months ago, the workstation sitting next to the lab was old, tired, and no longer functional. I found myself not even using it. So I replaced it with an Intel NUC as well. There is a bit of a price premium when buying the NUC, but the form factor, performance, simplicity, and power consumption all make it a no-brainer in my book. The limitations it has as a vSphere host (single NIC, and 16GB of RAM max) is not an issue when used as a workstation. It performs great, and powers a dual Monitor setup really well.

What it looks like
Standing at just 35” high, you can see that it is pretty self contained.



    The Home Lab Road Map / Wish List
    When you have a Home Lab, you have plenty of time to think about what you want next.  The "what you have" is never quite the same as the "what you want."  So here is the path I’ll probably be taking:

  • A second Intel NUC to serve as a 2 node Management cluster. (Done.  See here)
  • 10GbE switch.  My primary hesitation on this is cost, and noise. 
  • New hosts.  I tempted to go the route of a 2U rack mounted chassis so that I can go to three or four hosts more efficiently. With SuperMicro offering some motherboards with a built-in 10GbE port, that is pretty enticing.
  • New gateway.  As the lab grows more sophisticated, the more the network topology looks like a small production environment.  That is why a proper router/firewall on this wish list.
  • New wireless AP.  Not technically part of the Home Lab, but plays an important role for obvious reasons.  I need a wireless AP that is not prone to memory leaks and manual reboots every three or four days.
  • Affordable PCIe based flash is really making inroads in the enterprise, but it’s still not affordable enough for the home.  I hope this changes, as PCIe avoids so many headaches with flash that runs through a traditional storage controller.

Lessons learned over the years
A few takeaways have come from spending many hours working with my Home Lab.  These reflect personal preferences more than anything, but they might save you some effort along the way as well.

1. The best Home Lab is the one that you use.  For quite some time, I used a nested lab on a burley laptop, in addition to the physical setup.  Ultimately the physical Home Lab won out because it fit more of what I wanted to test and work on.  If my interests were more focused on scripting or workflow automation, perhaps a nested lab would be fine.  But I’m a bit too much of a gear-head, and my job now focuses in performance on top of real hardware.  I also didn’t care to power up and power down the entire nested lab each time I wanted to work on the laptop.

2.  While "lab" implies all things experimental, it is common to have a desire for some services to be running all the time.  Perhaps your lab has some responsibilities as a media server.  Or in my case, it also runs my Horizon View environment that I use for remote access.  This makes the idea of tearing down a lab on a whim a bit more complex.  It’s where a Management cluster can come in handy. Having it physically segregated helps to keep things operational when you want to do a complete rebuild, or experiment with a beta version of vSphere.

3.  I stay away from the cheap SSDs.  They have no place in a real Data Center, and aren’t much better in the home.  When it comes to flash, you get what you pay for.  And sometimes, even when you pay, you still don’t get good performing SSDs.  Spend your money wisely.  Buying something multiple times over doesn’t save much money in the end.  And remember, controllers matter too.

4.  Initially I wanted to configure an arrangement that consumed as little power as possible.  Keeping the power down means keeping the heat generated down, and thus the noise.  Since my entire sits just an arms-length from where I work, it was important in the beginning, and important now.  The entire setup draws about 200 watts of power and makes 38dB of noise 3 feet away.  I’ve refused to add anything loud or hot, and if I’m forced to, the lab will have to be relocated into a new area.

5.  There is always a way to do things a little cheaper.  But consider what your time is worth, and remember the reason why you have a Home Lab in the first place.  That has driven several of my purchasing decisions, and helps remove some of the petty obstacles that can sidetrack the best of us from working on what we intended to.

6.  While some technologies and practices trickle down from production environments to the Home Lab.  Sometimes the opposite happens.  Two good examples of this might be the use of the VCSA (vSphere 5.5 or later), and letting ESXi run on a USB or MicroSD card.  And that is the beauty of a lab.  It invites experimentation, and filters out what looks good on paper, versus what actually works.  Keep an open mind, and use it for what it is good for; making mistakes, and learning .

Thanks for reading

– Pete


Applying Innovation in the Workplace


Those of us in this IT industry need not be reminded that IT is as much of a consumer of solutions as it is a provider of services.  Advancing technologies attempt to provide services that are faster, more resilient, and feature rich. Sometimes advancement brings unintended consequences while simultaneously creating even more room to innovate. Exciting for sure, but all of this can become a bit precarious if you hold decision making responsibilities on what technologies best suite your environment. Go too deep on the unproven edge, and risk the well-being of your company, and possibly your career. Arguably more dangerous is to stay too conservative, and risk being a Luddite holding onto unused, outdated or failed vestiges of your IT past. It is a delicate balance, but rarely does it reward the status quo. There is an IT administrator out there somewhere that still doesn’t trust x86 servers, let alone virtualization.

Nobody in this industry is the sole proprietor of innovation, and we are all better off for it. Good ideas are everywhere, and it is fun to see them materialize into functional products. IT departments get a front row seat in seeing how different solutions solve problems. Some ideas are better than others. Others create more problems than they fix. Many companies providing a solution are victims of bad timing, bad marketing, or poor execution. In the continuum of progress and changing market conditions, others fail to acknowledge the change and course correct, or simply lost sight of why they exist in the first place.

Then, there are some solutions that show up as a gem. Perhaps they are transformational to how a problem is solved. Maybe they win you over by their elegance in masking the terribly complex with clean and simple. Maybe the solution has a bit of both. Those who are responsible for running environments get pretty good at recognizing the standouts.

Innovation’s impact on thinking differently
"If I had asked my customers what they wanted they would have said a faster horse" — Henry Ford

It started out as a simple trial of some beta software. It ended up as an integral component of my infrastructure; viewed in the same way as my compute, switchgear, storage, and hypervisor. That is basically the story of how PernixData FVP became a part of my environment. For the next 18 months I would watch daily, hourly, even by the minute as to how my very demanding workloads were improved because of this new approach to solving a common problem. The results were immediate, and obvious. Faster code compiling times. Lower latencies and more predictable performance for all of our applications. All while gaining better visibility to the behavior and needs of our workloads. And of course, storage arrays that were no longer paralyzed by I/O requests. Even the best of slide decks couldn’t convey what I was seeing. I got to see it happen every day, and much like the magic of virtualization in general, it never got old.

It is for that reason that I’ve joined the team at PernixData. I get the chance to help others understand how the PernixData approach can help their environment, and is more than just a faster horse. I’m no longer responsible for my own workloads, but now get to help people better understand their own. Since I’ve always had a passion for virtualization, IT infrastructures, and how real application workloads impact an environment, I think it’s going to be a great fit. I look forward to working with an unbelievably talented group of people. It is quite an honor.

A tip of the hat
I leave an organization that is top notch. Tecplot is a market leader in data visualization, and is routinely voted in the top 100 companies to work for. This doesn’t happen by accident. It comes from great people, great leadership, and has resulted in trusted, innovative products. I would like to thank the ownership group for allowing me the opportunity to be a part of their team, as it has been an absolute pleasure to work there. I’ve learned a lot from smart, principled folks that make up that company, and am better off for it. I leave behind the day to day administrative duties and challenges of a virtualized environment, but I am very excited to join a great team of really smart people who have helped change how challenges in modern IT infrastructures are viewed, and addressed.

Happy New Year.

– Pete

Sustained power outages in the datacenter

Ask any child about a power outage, and you can tell it is a pretty exciting thing. Flashlights. Candles. The whole bit. The excitement is an unexplainable reaction to an inconvenient, if not frustrating event when seen through the eyes of adulthood. When you are responsible for a datacenter of any size, there is no joy that comes from a power outage. Depending on the facility the infrastructure lives in, and the tools put in place to address the issue, it can be a minor inconvenience, or a real mess.

Planning for failure is one of the primary tenants of IT. It touches as much on operational decisions as it does design. Mitigation steps from failure events follow in the wake of the actual design itself, and define if or when further steps need to be taken to become fully operational again. There are some events that require a series of well-defined actions (automated, manual, or somewhere in between) in order to ensure a predictable result. Classic DR scenarios generally come to mind most often, but shoring up steps on how to react to certain events should also include sustained power outages. The amount of good content on the matter is sparse at best, so I will share a few bits of information I have learned over the years.

The Challenges
One of the limitations with a physical design of redundancy when it comes to facility power is, well, the facility. It is likely served by a single utility district, and the customer simply doesn’t have options to bring in other power. The building also may have limited or no backup power. Generators may be sized large enough to keep the elevators and a few lights running, but that is about it. Many cannot, or do not provide power conditioned good enough that is worthy of running expensive equipment. The option to feed PDUs using different circuits from the power closet might also be limited.

Defining the intent of your UPS units is often an overlooked consideration. Are they sized in such a way just to provide enough time for a simple graceful shutdown? …And how long is that? Or are they sized to meet some SLA decided upon by management and budget line owners? Those are good questions, but inevitably, if the power it out for long enough, you have to deal with how a graceful shutdown will be orchestrated.

SMBs fall in a particularly risky category, as they often have a set of disparate, small UPS units supplying battery backed power, with no unified management system to orchestrate what should happen in an "on battery" event. It is not uncommon to see an SMB well down the road of virtualization, but their UPS units do not have the smarts to handle information from the items they are powering. Picking the winning number on a roulette wheel might give better odds than figuring out which is going to go first, and which is going to go last.

Not all power outages are a simple power versus no power issue. A few years back our building lost one leg of the three-phase power coming in from the electric vault under the nearby street. This caused a voltage "back feed" on one of the legs, which cut nominal voltage severely. This dirty power/brown-out scenario was one of the worst I’ve seen. It lasted for 7 very long hours during the middle of the night. While the primary infrastructure was able to be safely shutdown, workstations and other devices were toggling off and one due to this scenario. Several pieces of equipment were ruined, but many others ended up worse off than we were.

It’s all about the little mistakes
"Sometimes I lie awake at night, and I ask, ‘Where have I gone wrong?’  Then a voice says to me, ‘This is going to take more than one night" –Charlie Brown, Peanuts [Charles Schulz]

A sequence of little mistakes in an otherwise good plan can kill you. This transcends IT. I was a rock climber for many years, and a single tragic mistake was almost always the result of a series of smaller mistakes. It often stemmed from poor assumptions, bad planning, trivializing variables, or not acknowledging the known unknowns. Don’t let yourself be the IT equivalent to the climber that cratered on the ground.

One of the biggest potential risks is a running VM not fully committing I/Os from its own queues or anywhere in the data path (all the way down to the array controllers) before the batteries fully deplete. When the VMs are properly shutdown before the batteries deplete, you can be assured that all data has been committed, and the integrity of your systems and data remain intact.

So where does one begin? Properly dealing with a sustained outage is recognizing that it is a sequence driven event.

1. Determine what needs to stay on the longest. Often times it is not how long the a VM or system stays up on battery, but that they are gracefully shutoff before a hard power failure. Your UPS units buy you a finite amount of time. It takes more than "hope" to make your systems go down gracefully, and in the correct order.

2. Determine your hardware dependency chain. Work through what is the most logical order of shutdown for your physical equipment, and identify the last pieces of physical equipment that need to stay on. (Your answer better be switches).

3. Determine your software dependency chain. Many systems can be shut down at any time, but many others rely on other services to support their needs. Map it out. Also recognize that hardware can be affected by the lack of availability of software based services (e.g. DNS, SMTP, AD, etc.).

4. Determine what equipment might need a graceful shutdown, and what can drop when the UPS units run dry. Check with each Manufacturer for the answers.

Once you begin to make progress on better understanding the above, then you can look into how you can make it happen.

Making a retrospective work for you
It’s not uncommon to just be grateful that after the sustained power failure has ended, that you are grateful that everything came back up without issue. As a result, one leaves valuable information on the table on how to improve the process in the future. Seize the moment! Take notes during this event so that they can be remembered better during a retrospective. After all, the retrospective’s purpose is to define what went well and what didn’t. Stressful situations can play tricks on memory. Perhaps you couldn’t identify power cables easily, or wondered why your Exchange server took a long time to shut down, or didn’t know if or when vCenter shut down gracefully. This is a great method for capturing valuable information. In the "dirty power" story above, the UPS power did not last as long as I had anticipated because the server room’s dedicated AC unit shut down. The room heated up, and all of the variable speed fans kicked into high gear, draining the power faster than I thought. Lesson learned.

The planning process is served well by mocking up a power failure event on paper. Remember, thinking about it is free, and is a nice way to kick off the planning. Clearly, the biggest challenge around developing and testing power down and power up scenarios is that it has to be tested at some point. How do you test this? Very carefully. In fact, if you have any concerns at all, save it for a lab. Then introduce it into production in such a way that you can statically control or limit the shutdown event to just a few test machine, etc. The only scenario I can imagine on par with a sustained power outage is kicking off a domino-effect workflow that shuts down your entire datacenter.

The run book
Having a plan located only in your head will accomplish only two things.  It will be a guaranteed failure.  It can put your organization’s systems and data at risk.  This is why there is a need to define and publish a sustained power outage run book. Sometimes known as a "play chart" in the sports world, it is intended to define a reaction to an event under a given set of circumstances. The purpose is to 1.) vet out the process before hand, and 2.) avoid "heat of the moment" decisions under times of great stress that end up being the wrong decision.

The run book also serves as a good planning tool for determining if you have the tools or methods available to orchestrate a graceful, orderly shutdown of VMs and equipment based on the data provided by the UPS units. The run book is not just about graceful power down scenarios, but the steps required for a successful power-up. Sometimes this can be more well known, as an occasional lights out maintenance window may need to occur on some storage or firmware updates, replacement, etc. Power-up planning can also be important, including making sure you have some basic services available for the infrastructure as it powers up. For example, see "Using a Synology NAS as an emergency backup DNS server for vSphere" for a few tips on a simple way to serve up DNS to your infrastructure.

And don’t forget to make sure the run book is still accessible when you need it most (when there is no power). :-)

Tools and tips
I’ve stayed away from discussing specific scripts or tools for this because each environment is different, and may have different tools available to them. For instance, I use Emerson-Liebert UPS units, and have a controlling VM that will orchestrate many of the automated shutdown steps of VMs. Using PowerCLI, Python, or bash can be a complementary, or a critical part of a shutdown process. It is up to you. The key is to have some entity that will be able to interpret how much power remains on battery, and how one can trigger event driven actions from that information.

1. Remember that graceful shutdowns can create a bit of their own CPU and storage I/O storm. While not as significant as some boot storm upon power up, and generally is only noticeable at the beginning of the shutdown process when all systems are up, but it can be noticeable.

2. Ask your coworkers or industry colleagues for feedback. Learn about what they have in place, and share some stories about what went wrong, and what went right. It’s good for the soul, and your job security.

3. Focus more on the correct steps, sequence, and procedure, before thinking about automating it. You can’t automate something when you do not clearly understand the workflow.

4. Determine how you are going to make this effort a priority, and important to key stakeholders. Take it to your boss, or management. Yes, you heard me right. It won’t ever be addressed until it is given visibility, and identified as a risk. It is not about potential self-incrimination. It is about improving the plan of action around these types of events. Help them understand the implications for not handling in the correct way.

It is a very strange experience to be in an server room that is whisper quiet from a sustained power outage. There is an opportunity to make it a much less stressful experience with a little planning and preparation. Good luck!

– Pete


Get every new post delivered to your Inbox.

Join 46 other followers