Interpreting Performance Metrics in PernixData FVP

In the simplest of terms, performance charts and graphs are nothing more than lines with pretty colors.  They exist to provide insight and enable smart decision making.  Yet, accurate interpretation is a skill often trivialized, or worse, completely overlooked.  Depending on how well the data is presented, performance graphs can amaze or confuse, with hardly a difference between the two.

A vSphere environment provides ample opportunity to stare at all types of performance graphs, but the techniques for interpreting the actual data are often lost.  The counterpoint is that most graphs are self-explanatory – perhaps a valid point if they were not misinterpreted and underutilized so often.  To appeal to those averse to performance graph overload, many well-intentioned solutions offer overly simplified, dashboard-like insights.  They might serve as a good starting point, but this distilled data often falls short in providing the detail necessary to understand real performance conditions.  Variables that impact performance can be complex, and deserve more insight than a green/yellow/red indicator over a large sampling period.

Most vSphere Administrators can quickly view the “heavy hitters” of an environment by sorting the VMs by CPU to see the big offenders, and then drill down from there.  vCenter does not naturally provide a good visual representation of storage I/O, which is interesting because storage performance is the culprit behind so many performance issues in a virtualized environment.  PernixData FVP accelerates your storage I/O, but it also fills that void nicely by helping you understand your storage I/O.
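For those who like to sanity check the GUI from the command line, a rough PowerCLI equivalent of that "heavy hitters" sort might look like the sketch below. The vCenter name is a placeholder, and the counter names ("cpu.usage.average" and "disk.usage.average") assume the default realtime statistics are being collected in your environment.

```powershell
# Minimal sketch: rank powered-on VMs by average CPU and disk throughput
# over the last hour of realtime (20-second) samples.
Connect-VIServer -Server "vcenter01.lab.local"   # hypothetical vCenter

$report = foreach ($vm in (Get-VM | Where-Object { $_.PowerState -eq "PoweredOn" })) {
    $cpu  = Get-Stat -Entity $vm -Stat "cpu.usage.average"  -Realtime -MaxSamples 180 |
            Measure-Object -Property Value -Average
    $disk = Get-Stat -Entity $vm -Stat "disk.usage.average" -Realtime -MaxSamples 180 |
            Measure-Object -Property Value -Average
    [pscustomobject]@{
        Name        = $vm.Name
        AvgCpuPct   = [math]::Round($cpu.Average, 1)
        AvgDiskKBps = [math]::Round($disk.Average, 0)
    }
}

# Sorting by disk throughput instead of CPU surfaces the storage offenders
$report | Sort-Object AvgDiskKBps -Descending | Select-Object -First 10
```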

FVP’s metrics leverage VMkernel statistics, but in my opinion make them more consumable.  These statistics reported by the hypervisor are particularly important because they are the measurements your VMs and applications actually feel.  Keep that in mind when other components in your infrastructure (storage arrays, network fabrics, etc.) advertise good performance numbers that don’t align with what the applications are seeing.

Interpreting performance metrics is a big topic, so the goal of this post is to provide some tips to help you interpret PernixData FVP performance metrics more accurately.

Starting at the top
To quickly look for the busiest VMs, one can start at the top of the FVP cluster.  Click on the “Performance Map,” which is similar to a heat map. Rather than projecting VM I/O activity by color, the view projects each VM on its respective host at a size proportional to how much I/O it is generating for the given time period.  More active VMs will show up larger than less active VMs.

(click on images to enlarge)

PerformanceMap

From here, you can click on the targets of the VMs to get a feel for what activity is going on – serving as a convenient way to drill into the common I/O metrics of each VM: Latency, IOPS, and Throughput.

FromFVPcluster

As shown below, these same metrics are available if the VM on the left hand side of the vSphere client is highlighted, and will give a larger view of each one of the graphs.  I tend to like this approach because it is a little easier on the eyes.

fromVMlevel

VM based Metrics – IOPS and Throughput
When drilling down into the VM’s FVP performance statistics, it will default to the Latency tab.  This makes sense considering how important latency is, but I find it most helpful to first click on the IOPS tab to get a feel for how many I/Os this VM is generating or requesting.  The primary reason I don’t initially look at the Latency tab is that latency is a metric that requires context.  Oftentimes VM workloads are bursty, and there may be times where there are little to no IOPS.  The VMkernel can report latency somewhat inaccurately against little or no I/O activity, so looking at the IOPS and Throughput tabs first brings context to the Latency tab.

The default “Storage Type” breakdown view is a good view to start with when looking at IOPS and Throughput. To simplify the view even more, tick the boxes so that only the “VM Observed” and the “Datastore” lines show, as displayed below.

IOPSgettingstarted

The predefined “read/write” breakdown is also helpful for the IOPS and Throughput tabs as it gives a feel for the proportion of reads versus writes.  More on this in a minute.

What to look for
When viewing the IOPS and Throughput in an FVP accelerated environment, there may be times when you see large amounts of separation between the “VM Observed” line (blue) and the “Datastore” line (magenta). Similar to what is shown below, having this separation where the “VM Observed” line is much higher than the “Datastore” line is a clear indication that FVP is accelerating those I/Os and driving down the latency.  It doesn’t take long to begin looking for this visual cue.

IOPSread

But there are times when there may be little to no separation between these lines, such as what you see below.

IOPSwrite

So what is going on?  Does this mean FVP is no longer accelerating?  No, it is still working.  It is about interpreting the graphs correctly.  Since FVP is an acceleration tier only, cached reads come from the acceleration tier on the hosts – creating the large separation between the “Datastore” and the “VM Observed” lines.  When FVP accelerates writes, they are synchronously buffered to the acceleration tier, then destaged to the backing datastore as soon as possible – often within milliseconds.  Because those destaged writes land on the backing datastore within the same sampling interval that renders the graph, the “VM Observed” and “Datastore” lines report very similar values.

By toggling the “Breakdown” to “read/write” we can confirm in this case that the change in appearance in the IOPS graph above came from the workload transitioning from mostly reads to mostly writes.  Note how the magenta “Datastore” line above matches up with the cyan “Write” line below.

IOPSreadwrite

The graph above still might imply that the performance went down as the workload transitioned from reads to writes. Is that really the case?  Well, let’s take a look at the “Throughput” tab.  As you can see below, the graph shows that in fact there was the same amount of data being transmitted during both phases of the workload, yet the IOPS tab shows far fewer I/Os at the time the writes were occurring.

IOPSvsThroughput

The most common reason for this sort of behavior is OS file system buffer caching inside the guest VM, which will assemble writes into larger I/O sizes.  The amount of data read in this example was the same as the amount of data that was written, but measuring that by only IOPS (aka I/O commands per second) can be misleading. I/O sizes are underappreciated for their impact on storage performance, and this is a good example of how often they can change, and how misleading IOPS can be as a measurement on its own.
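A quick back-of-the-napkin calculation shows why two phases of a workload can move the same amount of data with wildly different IOPS counts. The numbers below are purely illustrative, not taken from the graphs above:

```powershell
# Throughput = IOPS x I/O size, so identical throughput can hide very
# different IOPS counts when the I/O sizes differ. Hypothetical figures.
$readIops   = 8000; $readIoSizeKB  = 8      # many small reads
$writeIops  = 250;  $writeIoSizeKB = 256    # fewer, larger coalesced writes

$readMBps  = ($readIops  * $readIoSizeKB)  / 1024
$writeMBps = ($writeIops * $writeIoSizeKB) / 1024

"Reads:  {0} IOPS x {1}KB = {2} MB/s" -f $readIops,  $readIoSizeKB,  $readMBps
"Writes: {0} IOPS x {1}KB = {2} MB/s" -f $writeIops, $writeIoSizeKB, $writeMBps
# Both phases land at 62.5 MB/s, yet the IOPS chart makes the write phase look idle.
```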

If the example above doesn’t make you question conventional wisdom on industry standard read/write ratios, or common methods for testing storage systems, it should.

We can also see from the Write Back Destaging tab that FVP destages the writes as aggressively as the array will allow.  As you can see below, all of the writes were delivered to the backing datastore in under 1 second.  This ties back to the previous graphs that showed the “VM Observed” and the “Datastore” lines following each other very closely during periods of heavy writes.

WBdestaging

The key to understanding the performance improvement is to look at the Latency tab.  Notice in the image below how the latency for the VM dropped down to a low, predictable level throughout the entire workload.  Again, this is the metric that matters.

LatencybytheVM

Another way to think of this is that the IOPS and Throughput performance charts can typically show the visual results for read caching better than write buffering.  This is because:

  • Cached reads never come from the backing datastore, whereas buffered writes always hit the backing datastore.
  • Reads may be smaller I/O sizes than writes, which visually skews the impact if only looking at the IOPS tab.

Therefore, the ultimate measurement for both reads and writes is the latency metric.

VM based Metrics – Latency
Latency is arguably one of the most important metrics to look at.  This is what matters most to an active VM and the applications that live on it.  Now that you’ve looked at the IOPS and Throughput, take a look at the Latency tab. The “Storage type” breakdown is a good place to start, as it gives an overall sense of the effective VM latency against the backing datastore.  Much like the other metrics, it is good to look for separation between the “VM Observed” and “Datastore” where “VM Observed” latency should be lower than the “Datastore” line.

In the image above, the latency is dramatically improved, which again is the real measurement of impact.  A more detailed view of this same data is available by selecting a “custom” breakdown.  Tick the following checkboxes, as shown below:

customlatencysetting

Now take a look at the latency for the VM again. Hover anywhere on the chart that you might find interesting. The pop-up dialog will show you the details that really tell the story:

  • Where the latency would have come from had it been served by the datastore (datastore read or write)
  • What has contributed to the effective “VM Observed” latency.

custombreakdown-latency-result

What to look for
The desired result for the Latency tab is to have the “VM Observed” line as low and as consistent as possible.  There may be times where the VM observed latency is not quite as low as you might expect.  The causes for this are numerous, and a subject for another post, but FVP will provide some good indications as to the sources of that latency.  Switching over to the “Custom Breakdown” described earlier, you can see this more clearly.  This view can be used as an effective tool to help better understand the causes behind an occasional latency spike.

Hit & Eviction rate
Hit rate is the percentage of reads that are serviced by the acceleration tier, and not by the datastore.  It is great to see this measurement high, but it is not the exclusive indicator of how well the environment is operating.  It is a metric that is complementary to the other metrics, and shouldn’t be looked at in isolation.  It is only focused on displaying read caching hit rates, and conveys that as a percentage – whether there are 2,000 IOPS coming from the VM, or 2.
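A contrived example, using made-up numbers, shows why the percentage needs the read volume next to it before it means much:

```powershell
# Hit rate = reads served from the acceleration tier / total reads.
# The same 90% can describe a nearly idle VM or a very busy one.
function Get-HitRate([long]$HitReads, [long]$TotalReads) {
    if ($TotalReads -eq 0) { return 0 }
    [math]::Round(($HitReads / $TotalReads) * 100, 1)
}

Get-HitRate -HitReads 9    -TotalReads 10      # 90% ... across only 10 reads
Get-HitRate -HitReads 1800 -TotalReads 2000    # 90% ... across 2,000 reads
```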

There are times where this isn’t as high as you’d think.  Some of the causes of a lower than expected hit rate include:

  • Large amounts of sequential writes.  The graph is measuring read “hits” and will see a write as a “read miss”
  • Little or no I/O activity on the VM monitored.
  • In-guest activity that you are unaware of.  For instance, an in-guest SQL backup job might flush out the otherwise good cache related to that particular VM.  This is a leading indicator of such activity.  Thanks to the new Intelligent I/O profiling feature in FVP 2.5, one has the ability to optimize the cache for these types of scenarios.  See Frank Denneman’s post for more information about this feature.

Let’s look at the Hit Rate for the period we are interested in.

hitandeviction

You can see from above that the period of activity is the only part we should pay attention to.  Notice in the previous graphs that outside of the active period we were interested in, there was little to no I/O activity.

A low hit rate does not necessarily mean that a workload hasn’t been accelerated. It simply provides an additional data point for understanding.  In addition to looking at the hit rate, a good strategy is to look at the amount of reads from the IOPS or Throughput tab by creating the custom view settings of:

custombreakdown-read

Now we can better see how many reads are actually occurring, and how many are coming from cache versus the backing datastore.  It puts much better context around the situation than relying entirely on Hit Rate.

custombreakdown-read-result

Eviction Rate tells us the percentage of blocks being evicted at any point in time.  A very low eviction rate indicates that FVP is lazily evicting data on an as-needed basis to make room for new incoming hot data, and is a good sign that the acceleration tier is sized large enough to handle the general working set of data.  If this ramps upward, it tells you that otherwise hot data will no longer be in the acceleration tier.  Eviction rates are a good indicator to help you determine whether your acceleration tier is large enough.

The importance of context and the correlation to CPU cycles
When viewing performance metrics, context is everything.  Performance metrics are guilty of standing on their own far too often.  Or perhaps, it is human nature to want to look at these in isolation.  In the previous graphs, notice the relationship between the IOPS, Throughput, and Latency tabs.  They all play a part in delivering storage payload.

Viewing a VM’s ability to generate high IOPS and Throughput is good, but it can also be misleading.  A common but incorrect assumption is that once a VM is on fast storage, it will start doing some startling number of IOPS.  That is simply untrue. It is the application (and the OS that it is living on) that dictates how many I/Os it will be pushing at any given time. I know of many single threaded applications that are very I/O intensive, and several multithreaded applications that aren’t.  Thus, it’s not about chasing IOPS, but rather the ability to deliver low latency in a consistent way.  It is that low latency that lets the CPU breathe freely, and not wait for the next I/O to be executed.

What do I mean by “breathe freely?”  With real world workloads, the difference between fast and slow storage I/O is whether CPU cycles can satisfy the I/O request without waiting.  A given workload may be performing some defined activity.  It may take a certain number of CPU cycles, and a certain number of storage I/Os, to accomplish this.  An infrastructure that allows those I/Os to complete more quickly lets those CPU cycles take part in completing the request in a shorter amount of time.

CPU

Looking at CPU utilization can also be a helpful indicator of your storage infrastructure’s ability to deliver the I/O. A VM’s ability to peak at 100% CPU is often a good thing from a storage I/O perspective.  It means that VM is less likely to be storage I/O constrained.

Summary
The only thing better than a really fast infrastructure for your workloads is understanding how well it is performing.  Hopefully this post offers up a few good tips when you look at your own workloads leveraging PernixData FVP.

Sustained power outages in the datacenter

Ask any child about a power outage, and you can tell it is a pretty exciting thing. Flashlights. Candles. The whole bit. The excitement is an unexplainable reaction to an inconvenient, if not frustrating event when seen through the eyes of adulthood. When you are responsible for a datacenter of any size, there is no joy that comes from a power outage. Depending on the facility the infrastructure lives in, and the tools put in place to address the issue, it can be a minor inconvenience, or a real mess.

Planning for failure is one of the primary tenets of IT. It touches as much on operational decisions as it does design. Mitigation steps from failure events follow in the wake of the actual design itself, and define if or when further steps need to be taken to become fully operational again. There are some events that require a series of well-defined actions (automated, manual, or somewhere in between) in order to ensure a predictable result. Classic DR scenarios generally come to mind most often, but shoring up steps on how to react to certain events should also include sustained power outages. The amount of good content on the matter is sparse at best, so I will share a few bits of information I have learned over the years.

The Challenges
One of the limitations with a physical design for redundancy when it comes to facility power is, well, the facility. It is likely served by a single utility district, and the customer simply doesn’t have options to bring in other power. The building also may have limited or no backup power. Generators may be sized large enough to keep the elevators and a few lights running, but that is about it. Many cannot, or do not, provide power conditioned well enough to be worthy of running expensive equipment. The option to feed PDUs using different circuits from the power closet might also be limited.

Defining the intent of your UPS units is often an overlooked consideration. Are they sized in such a way just to provide enough time for a simple graceful shutdown? …And how long is that? Or are they sized to meet some SLA decided upon by management and budget line owners? Those are good questions, but inevitably, if the power is out for long enough, you have to deal with how a graceful shutdown will be orchestrated.

SMBs fall in a particularly risky category, as they often have a set of disparate, small UPS units supplying battery backed power, with no unified management system to orchestrate what should happen in an "on battery" event. It is not uncommon to see an SMB well down the road of virtualization whose UPS units do not have the smarts to exchange information with the items they are powering. Picking the winning number on a roulette wheel might give better odds than figuring out which is going to go first, and which is going to go last.

Not all power outages are a simple power versus no power issue. A few years back our building lost one leg of the three-phase power coming in from the electric vault under the nearby street. This caused a voltage "back feed" on one of the legs, which cut nominal voltage severely. This dirty power/brown-out scenario was one of the worst I’ve seen. It lasted for 7 very long hours during the middle of the night. While the primary infrastructure was able to be safely shut down, workstations and other devices were toggling off and on due to this scenario. Several pieces of equipment were ruined, but many others ended up worse off than we were.

It’s all about the little mistakes
"Sometimes I lie awake at night, and I ask, ‘Where have I gone wrong?’  Then a voice says to me, ‘This is going to take more than one night" –Charlie Brown, Peanuts [Charles Schulz]

A sequence of little mistakes in an otherwise good plan can kill you. This transcends IT. I was a rock climber for many years, and a single tragic mistake was almost always the result of a series of smaller mistakes. It often stemmed from poor assumptions, bad planning, trivializing variables, or not acknowledging the known unknowns. Don’t let yourself be the IT equivalent to the climber that cratered on the ground.

One of the biggest potential risks is a running VM not fully committing I/Os from its own queues or anywhere in the data path (all the way down to the array controllers) before the batteries fully deplete. When the VMs are properly shut down before the batteries deplete, you can be assured that all data has been committed, and the integrity of your systems and data remains intact.

So where does one begin? Properly dealing with a sustained outage starts with recognizing that it is a sequence-driven event.

1. Determine what needs to stay on the longest. Oftentimes it is not about how long a VM or system stays up on battery, but that it is gracefully shut off before a hard power failure. Your UPS units buy you a finite amount of time. It takes more than "hope" to make your systems go down gracefully, and in the correct order.

2. Determine your hardware dependency chain. Work through what is the most logical order of shutdown for your physical equipment, and identify the last pieces of physical equipment that need to stay on. (Your answer better be switches).

3. Determine your software dependency chain. Many systems can be shut down at any time, but many others rely on other services to support their needs. Map it out. Also recognize that hardware can be affected by the lack of availability of software based services (e.g. DNS, SMTP, AD, etc.).

4. Determine what equipment might need a graceful shutdown, and what can drop when the UPS units run dry. Check with each Manufacturer for the answers.

Once you begin to make progress on understanding the above, you can look into how to make it happen.
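Even a rough sketch of that sequence, written down as data rather than kept in your head, helps make the dependency chain explicit. The tiers and VM names below are hypothetical placeholders:

```powershell
# Hypothetical shutdown tiers, ordered from "shut down first" to "shut down last".
# Core services (DNS, AD) and vCenter stay up longest; switches outlast everything.
$shutdownPlan = @(
    @{ Tier = 1; Description = "General application VMs";        VMs = @("app01", "app02", "web01") },
    @{ Tier = 2; Description = "Database VMs";                    VMs = @("sql01", "sql02") },
    @{ Tier = 3; Description = "Core services (DNS, AD, etc.)";   VMs = @("dc01", "dns01") },
    @{ Tier = 4; Description = "vCenter, then hosts and storage"; VMs = @("vcenter01") }
)

$shutdownPlan | ForEach-Object { "Tier {0}: {1}" -f $_.Tier, $_.Description }
```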

Making a retrospective work for you
It’s not uncommon, after the sustained power failure has ended, to simply be grateful that everything came back up without issue. As a result, valuable information on how to improve the process in the future gets left on the table. Seize the moment! Take notes during the event so that the details can be recalled during a retrospective. After all, the retrospective’s purpose is to define what went well and what didn’t. Stressful situations can play tricks on memory. Perhaps you couldn’t identify power cables easily, or wondered why your Exchange server took a long time to shut down, or didn’t know if or when vCenter shut down gracefully. Notes are a great method for capturing this valuable information. In the "dirty power" story above, the UPS power did not last as long as I had anticipated because the server room’s dedicated AC unit shut down. The room heated up, and all of the variable speed fans kicked into high gear, draining the power faster than I thought. Lesson learned.

The planning process is served well by mocking up a power failure event on paper. Remember, thinking about it is free, and is a nice way to kick off the planning. Clearly, the biggest challenge around developing power-down and power-up scenarios is that they have to be tested at some point. How do you test this? Very carefully. In fact, if you have any concerns at all, save it for a lab. Then introduce it into production in such a way that you can strictly control or limit the shutdown event to just a few test machines. The only scenario I can imagine on par with a sustained power outage is kicking off a domino-effect workflow that shuts down your entire datacenter.

The run book
Having a plan that lives only in your head will accomplish only two things: it will be a guaranteed failure, and it will put your organization’s systems and data at risk.  This is why there is a need to define and publish a sustained power outage run book. Sometimes known as a "play chart" in the sports world, it is intended to define a reaction to an event under a given set of circumstances. The purpose is to 1.) vet out the process beforehand, and 2.) avoid "heat of the moment" decisions, made under times of great stress, that end up being the wrong ones.

The run book also serves as a good planning tool for determining if you have the tools or methods available to orchestrate a graceful, orderly shutdown of VMs and equipment based on the data provided by the UPS units. The run book is not just about graceful power-down scenarios, but also the steps required for a successful power-up. Sometimes this is better known, as an occasional lights-out maintenance window may be needed for storage or firmware updates, replacements, etc. Power-up planning can also be important, including making sure you have some basic services available for the infrastructure as it powers up. For example, see "Using a Synology NAS as an emergency backup DNS server for vSphere" for a few tips on a simple way to serve up DNS to your infrastructure.

And don’t forget to make sure the run book is still accessible when you need it most (when there is no power). 🙂

Tools and tips
I’ve stayed away from discussing specific scripts or tools for this because each environment is different, and may have different tools available. For instance, I use Emerson-Liebert UPS units, and have a controlling VM that will orchestrate many of the automated shutdown steps of VMs. Using PowerCLI, Python, or bash can be a complementary, or a critical, part of a shutdown process. It is up to you. The key is to have some entity that can interpret how much power remains on battery, and trigger event-driven actions from that information.
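As one example of the PowerCLI piece, a controlling script triggered by the UPS "on battery" event might walk a tier list like the one sketched earlier. This is only a sketch; the vCenter name and timeout are placeholders, and how the UPS software invokes the script is entirely vendor-specific.

```powershell
# Hypothetical: invoked by the UPS management software once remaining runtime
# drops below a chosen threshold. Walks each tier, requests a graceful guest
# shutdown via VMware Tools, and waits before moving to the next tier.
Connect-VIServer -Server "vcenter01.lab.local"

foreach ($tier in $shutdownPlan) {
    Write-Host ("Shutting down tier {0}: {1}" -f $tier.Tier, $tier.Description)
    Get-VM -Name $tier.VMs -ErrorAction SilentlyContinue |
        Where-Object { $_.PowerState -eq "PoweredOn" } |
        Shutdown-VMGuest -Confirm:$false | Out-Null

    # Give the tier up to 5 minutes to power off before continuing
    $deadline = (Get-Date).AddMinutes(5)
    while ((Get-Date) -lt $deadline -and
           (Get-VM -Name $tier.VMs -ErrorAction SilentlyContinue |
            Where-Object { $_.PowerState -eq "PoweredOn" })) {
        Start-Sleep -Seconds 15
    }
}
# Hosts, storage, and switches would follow, per the hardware dependency chain.
```

The same structure also makes it easy to limit a test run to a single tier of expendable test VMs before trusting it with anything else.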

1. Remember that graceful shutdowns can create a bit of their own CPU and storage I/O storm. It is not as significant as a boot storm upon power-up, and is generally only noticeable at the beginning of the shutdown process when all systems are still up, but it can be noticeable.

2. Ask your coworkers or industry colleagues for feedback. Learn about what they have in place, and share some stories about what went wrong, and what went right. It’s good for the soul, and your job security.

3. Focus more on the correct steps, sequence, and procedure, before thinking about automating it. You can’t automate something when you do not clearly understand the workflow.

4. Determine how you are going to make this effort a priority, and important to key stakeholders. Take it to your boss, or management. Yes, you heard me right. It won’t ever be addressed until it is given visibility, and identified as a risk. It is not about potential self-incrimination. It is about improving the plan of action around these types of events. Help them understand the implications of not handling them in the correct way.

It is a very strange experience to be in a server room that is whisper quiet during a sustained power outage. There is an opportunity to make it a much less stressful experience with a little planning and preparation. Good luck!

– Pete

A look at FVP 2.0’s new features in a production environment

I love a good benchmark as much as the next guy. But success in the datacenter is not solely predicated on the results of a synthetic benchmark, especially those that do not reflect a real workload. This was the primary motivation in upgrading my production environment to FVP 2.0 as quickly as possible. After plenty of testing in the lab, I wanted to see how the new and improved features of FVP 2.0 impacted a production workload. The easiest way to do this is to sit back and watch, then share some screen shots.

All of the images below are from my production code compiling machines running at random points of the day. The workloads will always vary somewhat, so take them as more "observational differences" than benchmark results. Also note that these are much more than the typical busy VM. The code compiling VMs often hit the triple crown in the "difficult to design for" department.

  • Large I/O sizes. (32K to 512K, with most being around 256K)
  • Heavy writes (95% to 100% writes during a full compile)
  • Sustained use of compute, networking, and storage resources during the compiling.

The characteristics of flash under these circumstances can be a surprise to many. Heavy writes with large I/Os can turn flash into molasses, and it is not uncommon to see sporadic latencies well above 50ms. Flash has been a boon for the industry, and has changed almost everything for the better. But contrary to conventional wisdom, it is not a panacea. The characteristics of flash need to be taken into consideration, and expectations should be adjusted, whether it is used as an acceleration resource or for persistent data storage. If you think large I/O sizes do not apply to you, just look at the average I/O size when copying some files to a file server.

One important point is that the comparisons I provide did not include any physical changes to my infrastructure. Unfortunately, my peering network for replica traffic is still using 1GbE, and my blades are only capable of leveraging Intel S3700 SSDs via embedded SAS/SATA controllers. The VMs are still backed by a near end-of-life 1GbE based storage array.

Another item worth mentioning is that due to my workload, my numbers usually reflect worst case scenarios. You may have latencies that are drastically lower than mine. The point being that if FVP can adequately accelerate my workloads, it will likely do even better with yours. Now let’s take a look and see the results.

Adaptive Network Compression
Specific to customers using 1GbE as their peering network, FVP 2.0 offers a bit of relief in the form of Adaptive Network Compression. While there is no way for one to toggle this feature off or on for comparison, I can share what previous observations had shown.

FVP 1.x
Here is an older image of a build machine during a compile. This was in WB+1 mode (replicating to 1 peer). As you can see from the blue line (Observed VM latency), the compounding effect of trying to push large writes across a 1GbE pipe to SATA/SAS based flash devices was not as good as one would hope. The characteristics of flash itself, along with the constraints of 1GbE, were conspiring to make acceleration difficult.

image

 

FVP 2.0 using Adaptive Network Compression
Before I show the comparison of effective latencies between 1.x and 2.0, I want to illustrate the workload a bit better. Below is a zoomed in view (about a 20 minute window) showing the throughput of a single VM during a compile job. As you can see, it is almost all writes.

image

Below is the relative number of IOPS. Almost all are write IOPS, and again, the low number of IOPS relative to the throughput is an indicator of large I/O sizes. Remember that with 512K I/O sizes, it only takes a couple of hundred IOPS to nearly saturate a 1GbE link – not to mention the problems that flash has with them.
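The arithmetic behind that statement is worth a quick look (rounded figures, ignoring protocol overhead):

```powershell
# 1GbE is roughly 125 MB/s of raw bandwidth. At 512KB per I/O, it takes
# only about 250 IOPS to fill the pipe.
$linkMBps       = 1000 / 8          # 1 Gbps ~= 125 MB/s
$ioSizeKB       = 512
$iopsToSaturate = ($linkMBps * 1024) / $ioSizeKB

"{0} MB/s link / {1}KB I/Os = ~{2} IOPS to saturate" -f $linkMBps, $ioSizeKB, $iopsToSaturate
```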

image

Now let’s look at latency on that same VM, during that same time frame. In the image below, the blue line shows that the VM observed latency has now improved to the 6 to 8ms range during heavy writes (ignore the spike on the left, as that was from a cold read). The 6 to 8ms of latency is very close to the effective latency of a WB+0, local flash device only configuration.

image

Using the same accelerator device (Intel S3700 on embedded Patsburg controllers) as in 1.x, the improvements are dramatic. The "penalty" for the redundancy is greatly reduced to the point that the backing flash may be the larger contributor to the overall latency. What has really been quite an eye opener is how well the compression is helping. In just three business days, it has saved 1.5 TB of data running over the peer network.  (350 GB of savings coming from another FVP cluster not shown)

image

Distributed Fault Tolerant Memory
If there is one thing that flash doesn’t do well with, it is writes using large I/O sizes. Think about all of the overhead that comes from flash (garbage collection, write amplification, etc.), and that in my case, it still needs to funnel through an overwhelmed storage controller. This is where I was looking forward to seeing how Distributed Fault Tolerant Memory (DFTM) impacted performance in my environment. For this test, I carved out 96GB of RAM on each host (384GB total) for the DFTM Cluster.

Let’s look at a similar build run accelerated using write-back, but with DFTM. This VM is configured for WB+1, meaning that it is using DFTM, but still must push the replica traffic across a 1GbE pipe. The image below shows the effective latency of the WB+1 configuration using DFTM.

image

The image above shows that using DFTM in a WB+1 mode eliminated some of that overhead inherent with flash, and was able to drop latencies below 4ms with just a single 1GbE link. Again, these are massive 256K and 512K I/Os. I was curious to know how 10GbE would have compared, but didn’t have this in my production environment.

Now let’s try DFTM in a WB+0 mode, meaning it has no peering traffic to send. What do the latencies look like for that same time frame?

image

If you can’t see the blue line showing the effective (VM observed) latencies, it is because it is hovering quite close to 0 for the entire sampling period. Local acceleration was 0.10ms, and the effective latency to the VM under the heaviest of writes was just 0.33ms. I’ll take that.

Here is another image of when I turned a DFTM accelerated VM from WB+1 to WB+0. You can see what happened to the latency.

image

Keep in mind that the accelerated performance I show in the images above come from a VM that is living on a very old Dell EqualLogic PS6000e. Just fourteen 7,200 RPM SATA drives that can only serve up about 700 IOPS on a good day.

An unintended, but extremely useful benefit of DFTM is to troubleshoot replica traffic that has higher than expected latencies. A WB+1 configuration using DFTM eliminates any notion of latency introduced by flash devices or offending controllers, and limits the possibilities to NICs on the host, or switches. Something I’ve already found useful with another vSphere cluster.

Simply put, DFTM is a clear winner. It can address all of the things that flash cannot do well. It avoids storage buses, drive controllers, and NAND overhead, and it doesn’t wear out. And it sits as close to the CPU as anything, with as much bandwidth as anything. But make no mistake, memory is volatile. With the exception of some specific use cases such as non persistent VDI, or other ephemeral workloads, one should take advantage of the "FT" part of DFTM. Set it to 1 or more peers. You may give back a bit of latency, but the superior performance is perfect for those difficult tier one workloads.

When configuring an FVP cluster, the current implementation limits your selection to a single acceleration type per host. So, if you have flash already installed in your servers, and want to use RAM for some VMs, what do you do? …Make another FVP cluster. Frank Denneman’s post: Multi-FVP cluster design – using RAM and FLASH in the same vSphere Cluster describes how to configure VMs in the same vSphere cluster to use different accelerators. Borrowing those tips, this is how my FVP clusters inside of a vSphere cluster look.

image

Write Buffer and destaging mechanism
This is a feature not necessarily listed on the bullet points of improvements, but deserves a mention. At Storage Field Day 5, Satyam Vaghani mentioned the improvements with the destaging mechanism. I will let the folks at PernixData provide the details on this, but there were corner cases in which VMs could bump up against some limits of the destager. It was relatively rare, but it did happen in my environment. As far as I can tell, this does seem to be improved.

Destaging visibility has also been improved. Ever since the pre-1.0 beta days, I’ve wanted more visibility on the destaging buffer. After all, we know that all writes eventually have to hit the backing physical datastore (see Effects of introducing write-back caching with PernixData FVP) and can be a factor in design. FVP 2.0 now gives two key metrics: the amount of writes left to destage (in MB), and the time it will take to destage to the backing datastore. This will allow you to see if your backing storage can or cannot keep up with your steady state writes. From my early impressions, the current mechanism doesn’t quite capture the metric data at a high enough frequency for my liking, but it’s a good start to giving more visibility.

Honorable mentions
NFS support is a fantastic improvement. While I don’t have it currently in production, it doesn’t mean that I may not have it in the future. Many organizations use it and love it. And I’m quite partial to it in the old home lab. Let us also not dismiss the little things. One of my favorite improvements is simply the pre-canned 8 hour time window for observing performance data. This gets rid of the “1 day is too much, 1 hour is not enough” conundrum.

Conclusion
There is a common theme to almost every feature evaluation above. The improvements I showcase cannot be adequately displayed or quantified with a synthetic workload. It took real data to appreciate the improvements in FVP 2.0. Although 10GbE is the ideal minimum, Adaptive Network Compression really buys a lot of time for legacy 1GbE networks. And DFTM is incredible.

The functional improvements to FVP 2.0 are significant. So significant that with an impending refresh of my infrastructure, I am now taking a fresh look at what is actually needed for physical storage on the back end. Perhaps some new compute with massive amounts of PCIe based flash, and RAM to create large tiered acceleration pools. Then backing spindles supporting our capacity requirements, with relatively little data services, and just enough performance to keep up with the steady-state writes.

Working at a software company myself, I know all too well that software is never "complete."  But FVP 2.0 is a great leap forward for PernixData customers.

Getting the big IT purchase approved

IT organizations are faced with a tantalizing array of options when it comes to hardware and software solutions. But long before anything can ever be deployed, it has to be purchased, which means at some point it had to be approved. Sometimes deploying a solution is easy compared to getting it approved. But how does one go about getting the big ticket item through? Well, here is my attempt at demystifying the process.

First, let’s just say that "big purchase" is without a doubt a relative term. For an SMB, $10,000 might be a show stopper, while seven figures for a large enterprise may be part of the routine. Both offer unique challenges, but share similar tactics. Getting a big IT purchase approved typically requires a unique set of skills and experience. A mix of preparation, clarity, delivery, timing, and attitude makes up the chaotic formula that, when done well, will improve the odds of success. It is a skill that can be equally important to anything in your technical arsenal.

Preparation
You will serve yourself well if you think and deliver like a consultant. Life in Ops can get muddied down by internal strife, whack-a-mole fire fighting, and the occasional "look at this new feature" deployment even though nobody asked for it. Take notice of how a good consultant does things. Step back to understand the desired result, then build out your own statement defining the typical design inputs like requirements, constraints, assumptions and risks.

At some point, you will need to prioritize your own wants, and pick your battles. You typically can’t have everything, so start from the ground up of what IT’s mission statement is, and work from there. Start with bet-the-business elements like high availability, and data/system protection that won’t be spoken up for by anyone but IT. Then, if there are other needs, they may in fact be a departmental need that impacts productivity and revenue. While IT may be the enabler of the request, make sure the identity of the requester is clear.

It’s not uncommon for an SMB to have very little money allocated to IT, but this isn’t an excuse for lack of diligence in preparation. Large organizations have more money, but proportionally much more complex problems to solve, SLAs to adhere to, and regulations to comply with. If you have no idea how your organization’s IT spending compares to peers in your industry, it is time to learn, and communicate that as a part of your presentation if your funds are abnormally low.

This is also an opportunity for you to project yourself as the "solution provider" in your organization. Embrace this. Help them understand why technology costs have increased over the past 10 years. If someone says, "Why don’t we just use the cloud for this?" Rather than let smoke pour out of your ears, respond with "That is a great question Joe. IT is constantly looking for the best ways to deliver services that meets the requirements of the organization." And then go into an appropriate level of detail on why it may or may not be a good fit. (If it is a good fit, then say so!). The point here is to embrace the solution provider role for the organization.

Your biggest competitor to your proposal will be, you guessed it, doing nothing. But there is a cost of doing nothing. The key stakeholders might look at this proposed expenditure and compare it to $0. In most cases, this is completely wrong, and it is up to you to help them understand what the real cost comparison is.

One opportunity sometimes overlooked is the power of a cost deferral. Does the unbudgeted solution you are proposing delay a much larger budgeted purchase until perhaps next year? Showcase this. Good proposals typically show a TCO of 3 to 5 years. But do not underestimate the allure an immediate cost deferral has to your friendly CFO.

Get input on defining the "what" of a problem, and its impacts. The "how" is usually reserved for the Subject Matter Expert (e.g. you). This will minimize silly ideas from others suggesting your storage capacity issues can be solved by the Friday flier for Best Buy.

Learn to prime the pump. Do a little one-on-one campaigning. This is a common method suggested in many books on successful leadership. It is your chance to win over your constituents before any formal proposal. Try holding an internal "Lunch and Learn" about trends in technology. Share a little about how amazing virtualization is, and help them understand some basic challenges of IT. These techniques will engage key personnel, and help in establishing a trusting relationship with IT.

The presentation – IT Shark Tank
I’m a big fan of the show, ‘Shark Tank.’ If you aren’t familiar with it, four very successful investors hear pitches by would-be entrepreneurs who are looking for investment funds in exchange for a stake in equity. The investors bring their own wealth, smarts and competitive nature to the table, and can be quite tough on prospective entrepreneurs. A few things can be gleaned from this, and applied directly to your ability to deliver a successful proposal.

  • Come prepared. Nothing kills a proposal like lack of preparation, and not knowing your facts. Let’s say you are requesting more storage: you’d better believe some of the simplest questions will be asked – many that you might overlook walking into the room. "How much storage do we have?" "How much do we have left?" "How much do we need?" "Why does it cost so much?" "What are the alternatives?"
  • Clearly state the problem, the impacts to the business, the options, and your recommendations.
  • Learn to answer the simplest of questions in the simplest of ways. "Does this proposal save us money?" "Is there a less expensive way to do this?"
  • Craft your message to your audience and appeal to their sensibilities. Flog yourself upside the head if you use any IT acronyms, or assume that technical gymnastics is going to impress them. It won’t. What will is being concise. Every word has a purpose.
  • Provide a little (but not too much) context to the problem that you are trying to solve. Leverage an analogy if you need to.
  • Know the counterpoints, and how to respond. Know how you are going to answer a question you don’t know the answer to.
  • Seek to understand their position. What might they dislike (e.g. unpredictable expenses, obligated debt, investments they don’t understand, etc.)
  • Respect everyone’s time. Make it quick, make it concise, and if they would like more detail, you can certainly do that, but don’t make it a part of the pitch.

How to deal with everyone else in the food chain
Be honest with your vendors. They have a job to do, and are trying to help you. If you show interest in a solution that is 10x more than what you can afford, it isn’t going to do anyone good to bring them in for an onsite demonstration. They will appreciate your honesty so they can perhaps focus on more cost appropriate solutions. Believe it or not, most want the right solution for you in the first place, as repeat business is the most important value they can bring back to their own organization.

If you are someone who doesn’t have deep-dive knowledge on the solution you are proposing, take advantage of the SE for the VAR or channel partner as a resource. Many of my friends in the industry are SEs and are some of the best and the brightest folks I know, and they all came from the Ops side at some point. Use them as a resource to learn about the solutions they are proposing, and ask them challenging questions.

Be honest with your organization. This isn’t about what you want. Your value will increase when you can demonstrate repeatedly that you have their best interests in mind.

After the decision
If the proposal was approved, focus on delivering at least some results fast. Then showcase the win and how IT can help solve organizational challenges. This may sound like self promotion, but it is not if done right. The wins are for the organization, not you. This establishes trust, and lays the groundwork for the future. Use company newsletters, or establish a monthly IT Review to share updates.

If it was denied, don’t take it personally. It is great to show passion, but don’t confuse passion for what you are really trying to do: helping your organization make the best strategic and financial decision for them. Would it be gratifying to get a new datacenter revamp through only to realize it was the financial tipping point of the organization just a few months later? Keep it all in perspective. Besides, some of the best purchasing decisions I’ve been involved with were the ones that were ultimately rejected, which gave solutions a chance to mature, and me an opportunity to find a different way to solve a problem.

Try doing your own proposal or presentation retrospective. What went well and what didn’t. Ask for feedback on how it went. You might be surprised at the responses you get.

Conclusion
You have the unique opportunity to be the technology advocate for the organization rather than simply a burden to the budget.  Do I get everything approved?  Of course I don’t, but a well prepared proposal will allow you, and your organization to make the smartest decisions possible, and help IT deliver great results.

Practical tips for a Veeam Backup and Recovery deployment

I’ve been using Veeam Backup and Recovery in my production environment for a while now, and in hindsight, it was one of the best investments we’ve ever made in our IT infrastructure. It has completely changed the operational overhead of protecting our VMs, and the data they serve up. Using a data protection solution that utilizes VMware’s APIs provides the simplicity and flexibility that was always desired. Moving away from array based features for protection has enabled the protection of VMs to better reflect desired RPO and RTO requirements – not by the limitations imposed by LUN sizes, array capacity, or functionality.

While Veeam is extremely simple in many respects, it is also a versatile, feature packed application that can be configured a variety of different ways. The versatility and the features can be a little confusing to the new user, so I wanted to share 25 tips that will help make for a quick and successful deployment of Veeam Backup and Recovery in your environment.

First, let’s go over a few assumptions that will be the basis for my recommendations:

  • There are two sites that need protection.
  • VMs and data need to be protected at each site, locally.
  • VMs and data need to be protected at each site, remotely.
  • A NAS target exists at each site.
  • Quick deployment is important.
  • You’ve already read all of the documentation. ;)

Architecture
There are a number of different ways to set up the architecture for Veeam. I will show a few of the simplest arrangements:

In the arrangement below there would be no physical servers – only a NAS device. This is a simplified arrangement of what I use. If one wanted a rebuilt server (Windows or Linux) acting purely as a storage target, that could be in place of where you see the NAS. The architecture would stay the same.

image

Optionally, a physical server acting not just as a storage target but also as a physical proxy would look something like this:

image

Below is a combination of both, where a physical server is acting as the proxy but, like the virtual proxy, is using an SMB share to house the data – in this case, a NAS unit.

image

Implementation tips
These tips focus not so much on what may ultimately suit your environment best (only you know that) or on leveraging all of the features inside the product, but rather on getting you up and running as quickly as possible so you can start returning great results.

Job Manager Servers & Proxies

1.  Have the Job Manager server, any proxies, and the backup targets living on their own VLAN for a dedicated backup network.

2.  Set up SNMP monitoring on any physical ports used in the backup arrangement.  It will be helpful to understand how utilized the physical links get, and for how long.

3.  Make sure to give the Job Manager VM enough resources to play with – especially if it will have any data mover/proxy responsibilities.  The deployment documentation has good information on this, but for starters, make it 4 vCPUs with 5GB of RAM.

4.  If there is more than one cluster to protect, consider building a virtual proxy inside each cluster it will be responsible for protecting, then assign it to jobs that protect VMs in that cluster.  In my case, I use PernixData FVP in two clusters.  I have the datastores that house those VMs accessible only by their own cluster (a constraint of FVP).  Because of that, I have a virtual proxy living in each cluster, with backup jobs configured so that each will use a specific virtual proxy.  These virtual proxies have a special setting in FVP that will instruct the VMs being backed up to flush their write cache to the backing storage.

image

Storage and Design

5.  Keep the design simple, even if you know you will need to adjust it at a later time.  Architectural adjustments are easy to do with Veeam, so go ahead and get Veeam pointed to the target, and start running some jobs.  Use this time to get familiar with the product, and begin protecting the jewels as quickly as possible.

6.  Let Veeam use the default SQL Server Express instance on the Veeam Job Manager VM.  This is a very reasonable, simple configuration that should be adequate for a lot of environments.

7.  Question whether a physical proxy is needed.  Typically physical proxies are used for one of three reasons.  1.) They offload job processing CPU cycles from your cluster.  2.) In simple arrangements a Windows based physical proxy might also be the Repository (aka storage target).  3.) They allow one to leverage a "direct-from-SAN" feature by plugging the system into your SAN fabric.  The last one, in my opinion, introduces the most hesitation.  Here is why:

  • Some storage arrays do not have a "read-only" iSCSI connection type.  When this is the case, special care needs to be taken on the physical server directly attached to the SAN to ensure that it cannot initialize the datastore.  The reality is that you are one mistake away from having a very long day in front of you.  I do not like this option when there is no secondary safety mechanism from the array on a "read-only" connection type.
  • Direct-from-SAN access can be a very good method for moving data to your target.  So good that it may stress your backing storage enough (via link saturation or physical disk limits) to interfere with your production I/O requirements.
  • Additional efforts must be taken when using write buffering mechanisms that do not live on the storage array (e.g., PernixData).

8.  Veeam has the ability to back up to an SMB share or an NFS mount.  If an NFS mount is chosen, make sure that it is a storage target running native Linux.  Most NAS units like a Synology are indeed just a tweaked version of Linux, and it would be easy to conclude that one should just use NFS.  However, in this case, you may run into two problems.

  • The SMB connection to a NAS unit will likely be faster (which most certainly is the first time in history that an SMB connection is faster than an NFS connection).
  • The Job Manager might not be able to manage the jobs on that NAS unit (connected via NFS) properly.  This is due to BusyBox and Perl on the Synology not really liking each other.  For me, this resulted in Veeam being unable to remove backups that had aged past their retention.  Changing over to an SMB connection on the NAS improved the performance significantly, and allowed job handling to work as desired.

9.  Veeam has a great new feature (version 7.x) called a "Backup Copy" job, which allows the backup made locally to be shipped to a remote site.  The "Backup Copy" job achieves one of the most basic requirements of data protection in the simplest of ways:  two copies of the data at two different locations, with the benefit of only processing the backup job once.  It is a new feature of version 7, and although it is a great feature, it behaves differently, and warrants some time spent with it before putting it into production.  For a speedy deployment, it might be best simply to configure two jobs:  one to a local target, and one to a remote target.  This will give you the time to experiment with the Backup Copy job feature.

10.  There are compelling reasons both for using a rebuilt server as a storage target and for using a NAS unit.  Both are attractive options.  I ended up using a dedicated NAS unit.  Its form factor, drive bay count, and overall cost of provisioning made it the only option that could match my requirements.

Operations

11.  In Veeam B&R, "Replication Jobs" are different from "Backup Jobs."  Instead of trying to figure out all of the nuances of both right away, use just the "Backup Job" function with both local and remote targets.  This will give you time to better understand the characteristics of the replication functionality.  One might also find that the "Backup Job" suits the environment and its needs better than the replication option.

12.  If there are daily backups going to both local and offsite targets (and you are not using the "Backup Copy" option), have them run 12 hours apart from one another to reduce RPOs.

13.  Build up a test VM to do your testing of a backup and restore.  Restore it in the many ways that Veeam has to offer.  Best to understand this now rather than when you really need to.

14.  I like the job chaining/dependency feature, which allows you to chain multiple jobs together.  But remember that if a job is manually started, it will run through the rest of the jobs too.  The easiest way to accommodate this is to temporarily remove it from the job chain.

15.  Your "Backup Repository" is just that, a repository for data.  It can be a Windows server, a Linux server, or an SMB share.  If you don’t have a NAS unit, stuff an old server (Windows or Linux) with some drives and it will work quite well for you.

16.  Devise a simple, clear job naming scheme.  Something like [BackupType]-[Descriptive Name]-[TargetLocation] will quickly tell you what it is and where it is going.  If you use folders in vCenter to organize your VMs, and your backups reflect the same, you could also choose to use the folder name.  An example would be "Backup-SharePointFarm-LOCAL," which quickly and accurately describes the job.

17.  Start with a simple schedule.  Say, once per day, then watch the daily backup jobs and the synthetic fulls to see what sort of RPO/RTOs are realistic.

18.  Repository naming.  Be descriptive, but come up with some naming scheme that remains clear even if you aren’t in the application for several weeks.  I like indicating the location of the repository, whether it is intended for local jobs or remote jobs, and what kind of repository it is (Windows, Linux, or SMB).  For example:  VeeamRepo-[LOCATION]-for-Local(SMB)

19.  Repository organization.  Create a good tree structure for organization and scalability.  Veeam will do a very good job of handling the organization of the backups once you assign a specific location (share name) on a repository.  However, create a structure that provides the ability to continue with the same naming convention as your needs evolve.  For instance, a logical share name assigned to a repository might be \\nas01\backups\veeam\local\cluster1.  This arrangement allows different types of backups to live in different branches.

20.  Veeam might prevent you from creating more than one repository going to the same share name (it would see \\nas01\backups\veeam\local\cluster1 and \\nas01\backups\veeam\local\cluster2 as the same).  Create DNS aliases to fool it, then make those two targets something like \\nascluster1\backups\veeam\local\cluster1 and \\nascluster2\backups\veeam\local\cluster2.  One way to create those aliases is sketched below.
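If the environment uses Windows DNS, those aliases can be created with the DnsServer PowerShell module, as in the sketch below. The zone and host names are placeholders, and depending on the NAS and its SMB/Kerberos settings, an alias may need additional configuration (or a host (A) record instead) before the share answers on the new name.

```powershell
# Hypothetical zone and host names. Each alias points back at the same NAS,
# giving Veeam two distinct-looking UNC paths for two repositories.
Import-Module DnsServer

Add-DnsServerResourceRecordCName -ZoneName "corp.local" -Name "nascluster1" -HostNameAlias "nas01.corp.local"
Add-DnsServerResourceRecordCName -ZoneName "corp.local" -Name "nascluster2" -HostNameAlias "nas01.corp.local"

# Repositories can then target \\nascluster1\backups\veeam\local\cluster1
# and \\nascluster2\backups\veeam\local\cluster2 respectively.
```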

    21.  When in doubt, leave the defaults.  Veeam put in great effort to make sure that neither you nor the software trips over itself.  Uncertain of job concurrency numbers?  Stick with the default.  Wondering about which backup mode to use (reverse incremental versus incrementals with synthetic fulls)?  Stay with the defaults, and save the experimentation for later.

    22.  Don’t overcomplicate the schedule (at least initially).  Veeam might give you flexibility that you never had with array based protection tools, but at the same time, there is no need to make it complicated.  Perhaps group the VMs by something that you can keep track of, such as the folders they are contained in within vCenter.

    23.  Each backup job can be adjusted so that, whatever target you are using, it is optimized for a preset storage optimization type: WAN target, LAN target, or local target.  This can easily be overlooked, but will make a difference in backup performance.

    24.  How many backups you can keep is a function of change rate, frequency, dedupe and compression, and the size of your target.  Yep, that is a lot of variables (a rough sizing sketch follows this list).  If nothing else, find some storage that can serve as the target for say, 2 weeks.  That should give a pretty good sampling of all of the above.

    25.  Take one item/feature once a week, and spend an hour or two looking into it.  This will allow you to find out more about say, Changed block tracking, or what the application aware image processing feature does.  Your reputation (and perhaps, your job) may rely on your ability to recover systems and data.  Come up with a handful of scenarios and see if they work.
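
    Tip 24 lends itself to a quick back-of-the-napkin estimate. Below is a minimal sketch (Python, with entirely made-up numbers) of how those variables interact; the daily change rate, the dedupe/compression ratio, and the repository size are assumptions you would replace with what you actually observe over that two week sample.

        # Rough estimate of how many daily restore points fit on a backup target.
        # All numbers are hypothetical placeholders; substitute your own observations.

        def restore_points_that_fit(source_tb, daily_change_rate, reduction_ratio, target_tb):
            """Estimate how many daily restore points a repository can hold.

            source_tb         -- total size of the protected VMs (TB)
            daily_change_rate -- fraction of data that changes per day (0.05 = 5%)
            reduction_ratio   -- combined dedupe/compression factor (0.5 = stored at half size)
            target_tb         -- usable capacity of the repository (TB)
            """
            full_backup = source_tb * reduction_ratio
            daily_increment = source_tb * daily_change_rate * reduction_ratio
            if target_tb < full_backup:
                return 0
            return 1 + int((target_tb - full_backup) // daily_increment)

        # Example: 10 TB of VMs, 5% daily change, 2:1 reduction, 20 TB NAS target.
        print(restore_points_that_fit(10, 0.05, 0.5, 20))   # -> 61 daily restore points

    The two week sample suggested above will give you far better numbers for the change rate and reduction ratio than any guess will.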

    Veeam is an extremely powerful tool that will simplify your layers of protection in your environment. Features like SureBackup, Virtual Labs, and their Replication offerings are all very good. But more than likely, they do not need to be a part of your initial deployment plan. Stay focused, and get that new backup software up and running as quickly as possible. You, and your organization, will be better off for it.

    – Pete

    Effects of introducing write-back caching with PernixData FVP

    Implementing new technology that solves real problems is great. It is exciting, and you get to stand on the shoulders of the smart folks who dreamed up the solution. But with all of that glory come new design and operational elements that may have been introduced. This isn’t a bad thing. It is just different. The magic of virtualization didn’t excuse us from needing to understand the design and operational considerations of the new paradigm. The same goes for implementing host based caching in a virtualized environment.

    Implementing FVP is simple and the results can be impressive. For many, that is about all the effort they may end up putting into it. But there are design considerations that will help maximize the investment, and minimize false impressions or costly mistakes. I want to share what I have learned against my real world workloads, so that you can understand what to look for, and possibly how to get more out of your investment. While FVP accelerates both reads and writes, it is the latter that warrants the most consideration, so that will be the focus of this post.

    When accelerating storage using FVP, the factors that I’ve found to have the most influence on how much your storage I/O is accelerated are:

    • Interconnect speed between hosts of your pooled flash
    • Performance delta between your flash tier, and your storage tier.
    • Working set size of your data
    • Duty cycle write I/O profile of your VMs (including peak writes, and duration)
    • I/O size of your writes (which can vary within each workload)
    • Likelihood or frequency of DRS or manual vMotion activities
    • Native speed and consistency of your flash (the flash itself, and the bus speed)
    • Capacity of your flash (more of an influence on read caching, but can have some impact on writes too)

    Write-back caching & vMotion
    Most know by now that to guard against any potential data loss in the event of a host failure, FVP provides redundancy of write-back caching through the use of one or more peers. The interconnect used is the vMotion network. While FVP does a good job of decoupling the VM’s need to wait for the backing datastore, a VM configured for write-back with redundancy must acknowledge the write I/O of the VM from its local flash AND the one or more peers before it returns the write ACK to the VM.

    What does this mean to your environment? More traffic on your vMotion network. Take a look at the image below. In a cluster NOT accelerated by FVP, the host uplinks that serve a vMotion network might see relatively little traffic, with bursts of traffic only during vMotion activities. That would also be the case if you were running FVP in write-back mode with no peers (WB+0). The image below shows the activity on the vMotion network as perceived by one of the hosts after the VMs were configured for write-back with redundancy of one peer. In this case the writes were averaging about 12MBps across the vMotion network. The spike is where a vMotion kicked off; it is the peak output of a 1GbE interface, about 125MBps.

    image

    Is this bad that the traffic is running over your vMotion network? No, not necessarily. It has to run over something. But with this knowledge, it is easy to see that bandwidth for inter-server communication will be more important than ever before. Your infrastructure design may need to be tweaked to accommodate the new role that the vMotion network plays.

    Can one get away with a 1GbE link for cross server communication? Perhaps. It really depends on the factors above, which can sometimes be hard to determine. So with all of the variables to consider, it is sometimes easiest to circle back to what we do know:

    • Redundant write back caching with FVP will be using network connectivity (via vMotion network) for every single write that occurs for an accelerated VM.
    • Redundant write back caching writes are multiplied by the number of peers that are configured per accelerated VM.
    • The write accelerated I/O commit time (latency) will only be as fast as the slowest link in the path.  Your vMotion network will likely be slower than the local bus.  A poor quality SSD or an older generation bus could be a bottleneck too.
    • vMotion activities will happily consume every bit of bandwidth available to them.
    • VMs that are committing a lot of writes might also be taxing CPU resources, which may kick in DRS rules to rebalance the load – thus creating more vMotion traffic.  Those busy VMs may be using more active memory pages as well, which may increase the amount of data to move during the vMotion process.

    The multiplier of redundancy
    Let’s run through a simple scenario to better understand the potential impact an undersized vMotion network can have on the performance of write-back caching with redundancy. The example addresses writes only.

    • 4 hosts each have a group of 6 VMs that consistently write 5MBps per VM.  Traditionally, these 24 VMs would be sending a total of 120MBps to the backing physical storage.
    • When write back is enabled without any redundancy (WB+0), the backing storage will still see the same amount of writes committed, but in a slightly different way: sequential, and smoothed out as data is flushed to the backing physical storage.
    • When write back is enabled and a write redundancy of “local flash and 1 network flash device” (WB+1) is chosen, the backing storage will still see 120MBps go to it eventually, but there will be an additional 120MBps of data going to the host peers, traversing the vMotion network.
    • When write back is enabled and a write redundancy of “local flash and 2 network flash devices” (WB+2) is chosen, the backing storage will still see 120MBps go to it, but there will be an additional 240MBps of data going to the host peers, traversing the vMotion network.

    image
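
    To make the arithmetic of that scenario explicit, here is a minimal sketch (Python, using the hypothetical numbers from the example above) of how the peer count multiplies write traffic on the vMotion network, and how quickly that approaches the roughly 125MBps ceiling of a single 1GbE link.

        # Estimate the extra vMotion network traffic created by write-back redundancy.
        # Host count, VM count, and per-VM write rates are the hypothetical ones above.

        GBE_LINK_MBPS = 125   # approximate peak throughput of a single 1GbE interface

        def peer_traffic_mbps(hosts, vms_per_host, write_mbps_per_vm, peers):
            """Total additional write traffic traversing the vMotion network."""
            return hosts * vms_per_host * write_mbps_per_vm * peers

        for peers in (0, 1, 2):   # WB+0, WB+1, WB+2
            extra = peer_traffic_mbps(hosts=4, vms_per_host=6, write_mbps_per_vm=5, peers=peers)
            print(f"WB+{peers}: {extra} MBps of peer traffic, "
                  f"{extra / GBE_LINK_MBPS:.1f}x the capacity of one 1GbE link")

    Run the same numbers with your own VM write rates, and the case for faster interconnects (or for being selective about which VMs get redundancy) makes itself.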

    The write-back redundancy configuration is a per-VM setting, so there is not necessarily a need to change them all to one setting. Your VMs will most likely not have the same write workload either. But as the example shows, it is not hard to saturate a 1GbE interface. Assuming an approximate 125MBps on a single 1GbE interface, under the described arrangement, saturation would occur with each VM configured for write-back with redundancy of one peer (WB+1). This leaves little headroom for other traffic that might be traversing that network, such as vMotions, or heartbeats.

    Fortunately FVP has the smarts built in to ensure that vMotion activities and write-back caching get along. However, there is no denying the physics associated with the matter. If you have a lot of writes, and you really want to leverage the full beauty of FVP, you are best served by fast interconnects between hosts. It is a small price to pay for supreme performance. FVP might expose the fact that 1GbE may not be ideal in an accelerated environment, but consider what else has changed over the years. Standard memory sizes of deployed VMs have increased significantly (the vOpenData Public Dashboard confirms this). That 1GbE vMotion network might have been good for VMs with 512MB of RAM, but what about those with 4, 8, or 12GB of RAM?  That 1GbE vMotion network has become outdated even for what it was originally designed for.

    Destaging
    One characteristic common to any type of write-back caching is that eventually, the data needs to be destaged to the backing physical datastore. The server-side flash that is now decoupled from the backing storage has the potential to accommodate a lot of write I/Os with minimal latency. You may or may not have backing spindles, or a conduit large enough, to absorb that write I/O if the high write rate lasts long enough. Destaging issues can occur with an arrangement like FVP, or with storage arrays and DAS arrangements that front slower spindles with flash for performance I/O.

    The impact of this depends on the workload and the environment it runs in.

    • If the duty cycle of the write workload that is above the physical storage I/O limit allows for enough “rest time” (defined as any moment that the I/O to the backing physical storage is below 100% of its maximum) to destage before the next overcommitment begins, then you have effectively increased your ability to deliver more write I/Os with less latency.
    • If the duty cycle of the write workload that is above the physical storage I/O limit is sustained for too long, the destager for that given VM will fill to capacity, and the VM will not be able to accelerate any faster than its ability to destage.

    Huh?  Okay, a picture might be a better way to describe this.  The callouts below point to the two scenarios described.

    image
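
    For those who prefer numbers to pictures, below is a toy simulation (Python, with completely made-up rates and sizes) of the two scenarios called out above: a bursty write workload that gets enough rest time for the destager to drain, and a sustained one that fills the destager and forces the effective write rate back down to the speed of the backing storage.

        # Toy model of a write-back destager. All rates and sizes are illustrative only.

        def simulate(write_pattern_mbps, destage_mbps, destager_capacity_mb):
            """Walk a per-second write pattern and show when the destager fills.

            write_pattern_mbps   -- offered write rate for each second of the run
            destage_mbps         -- rate the backing physical storage can absorb
            destager_capacity_mb -- size of the destaging area for this VM
            """
            backlog = 0.0
            for second, offered in enumerate(write_pattern_mbps):
                accepted = offered
                if backlog >= destager_capacity_mb:
                    # Destager is full: writes are throttled to the destage rate.
                    accepted = min(offered, destage_mbps)
                backlog = max(0.0, backlog + accepted - destage_mbps)
                print(f"t={second:3d}s offered={offered:4.0f} accepted={accepted:4.0f} "
                      f"backlog={backlog:6.0f} MB")

        # Scenario 1: a burst followed by rest time -- the backlog drains afterward.
        simulate([300] * 10 + [20] * 30, destage_mbps=100, destager_capacity_mb=4000)

        # Scenario 2: sustained overcommitment -- the backlog hits capacity at ~20
        # seconds, and accepted writes drop to the 100 MBps the datastore can absorb.
        simulate([300] * 60, destage_mbps=100, destager_capacity_mb=4000)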

     

    So when looking at this write I/O duty cycle, there becomes a concept of amplitude of the maximum write I/O, and frequency of the times in which it is overcommitting. When evaluating an environment, you might see this crude sine wave show up. This write I/O duty cycle, coupled with your physical components, is the key to how much FVP can accelerate your environment.

    What happens when the writes to the destager surpass the ability of your backing storage to keep up? Once the destager for that given VM fills up, its acceleration will reduce to the rate at which it can evacuate the data to the backing storage.  One may never see this in production, but it is possible.  It really depends on the factors listed at the beginning of the post.  The only way to clearly see this is with a synthetic workload, where I show it was able to push 5 times the write I/Os (blue line) before eventually filling up the destager to the point where it was throttled back to the rate of the datastore (purple line).

    SNAGHTML329ee44

    This will have an impact on the effective latency, shown below (blue line).  While the destager is full, it will not be able to fulfill the write at the low latency typically associated with flash, reflecting latency closer to the backing datastore (purple line).

    image

    Many workloads would never see this behavior, but those that are very write intensive (like mine), and that have a big delta between their acceleration tier and their backing storage may run into this.

    The good news is that workloads have a tendency to be bursty, which is a perfect match for an acceleration tier. In a clustered arrangement, this is much harder to predict, and bursty can turn into steady-state quite quickly. What this demonstrates is that if there is enough of a performance delta between your acceleration tier and your storage tier, under cases of sustained writes there may be times where it doesn’t have the opportunity to flush enough writes to maintain its ability to accelerate.

    Recommendations
    My recommendations (and let me clarify that these are my opinions only) on implementing FVP would include:

    • Initially, run the VMs in write-through mode so that you can leverage the FVP analytics to better understand your workload (duty cycles, read/write ratios, maximum write throughput for a VM, IOPS, latency, etc.)
    • As you gain a better understanding of the behavior of these workloads, introduce write-back caching to see how it changes those systems.
    • Keep an eye on your vMotion network (in particular, in 1GbE environments with limited physical ports) and see if it ever comes close to saturation.  Another leading indicator will be increased latency on accelerated writes.
    • Run out and buy some 10GbE NICs for your vMotion network.  If you are in a situation with an all-1GbE legacy fabric for your SAN and your vMotion network, and perhaps you have limits on form factors that may make upgrading difficult (think blades here), consider investing in 10GbE for your vMotion network, as opposed to your backing storage. Your read caching has probably already relieved quite a bit of I/O pressure on your storage, and addressing your cross server bandwidth is ultimately a more affordable, and simpler task.
    • If possible, allocate more than one link and configure for Multi-NIC vMotion. At this time, FVP will not be able to leverage this, but it will allow vMotion to use another link if the other link is busy. Another possible option would be to bond multiple 1GbE links for vMotion. This may or may not be suitable for your environment.

    So if you haven’t done so already, plan to incorporate 10GbE for cross-server communication on your vMotion network. Not only will your vMotioning VMs thank you, so will the performance of FVP.

    – Pete

    Helpful links:

    Fault Tolerant Write acceleration
    http://frankdenneman.nl/2013/11/05/fault-tolerant-write-acceleration/

    Destaging Writes from Acceleration Tier to Primary Storage
    http://voiceforvirtual.com/2013/08/14/destaging-writes-i/

    Using a new tool to discover old problems

    It is interesting what can be discovered when storage is accelerated. Virtual machines that were previously restricted by underperforming arrays now get to breathe freely.  They are given the ability to pass storage I/O as quickly as the processor needs. In other words, the applications that need the CPU cycles get to dictate your storage requirements, rather than your storage imposing artificial limits on your CPU.

    With that idea in mind, a few things revealed themselves during the process of implementing PernixData FVP.  Early on, it was all about implementing and understanding the solution.  However, once the real world workloads began accelerating, the analytics that FVP was providing became intriguing.  What was generating the I/O that was being accelerated?  What processes were associated with the other traffic not being accelerated, and why?  What applications were behind the changing I/O sizes?  And what was causing the peculiar I/O patterns that were showing up?  Some of these were questions raised at an earlier time (see: Hunting down unnecessary I/O before you buy that next storage solution).  The trouble was, the tools I had to discover the pattern of data I/O were limited.

    Why is this so important? In the spirit of reminding ourselves that no resource is an island, here is an example of a production code compile run, as seen from the perspective of the guest CPU. The first screen capture is the code compile with adequate storage I/O to support the application’s needs. A full build is running at nearly perfect CPU utilization across all 8 of its vCPUs.  (Screen shots taken from my earlier post: Vroom! Scaling up Virtual Machines in vSphere to meet performance requirements-Part 2.)

    image

    Below is that very same code compile, under stressed backend storage. It took 46% longer to complete, and as you can see, it changed the CPU utilization of the build run.

    image

    The primary goal for this environment was to accelerate the storage. However, it would have been a bit presumptuous to conclude that all existing storage traffic is good, useful I/O. There is a significant amount of traffic originating from outside of IT, and the I/O generated needed to be understood better.  With the traffic passing more freely thanks to FVP acceleration, patterns that previously could not expose themselves should be more visible. This was the basis for the first discovery.

    A little “CSI” work on the IOPS
    Many continuous build systems use some variation of a polling mechanism to understand when there is newly checked in code that needs to be compiled. This should be a very lightweight process.  However, once storage performance was allowed to breathe better, the following patterns started showing up on all of the build VMs.

    The image below shows the IOPS for one build VM during a one hour period of no compiling for that particular VM.  The VMs were polling for new builds every 5 minutes.  Yep, that “build heartbeat” was as high as 450 IOPS on each VM.

    high-IOPS-heartbeat

    Why wasn’t this noticed before?  These spikes were being suppressed by my previously overtaxed storage, which made them more difficult to see. These were all writes, and were translating into 500 to 600 steady state IOPS just to sit idle (as seen below from the perspective of the backing storage).

    Array-VMFSvolumeIOPS

    So what was the cause? As it turned out, the polling mechanism was using some source code control (SVN) calls to help the build machines understand if they needed to execute a build. Programmatically, the Development Team has no idea whether the script they develop is going to be efficient or not. They are separated by that layer of the infrastructure. (Sadly, I have a feeling this happens more often than not in general application development.) This resulted in a horribly inefficient method. After helping them understand the matter, it was revamped, and polling for each VM now takes only 1 to 2 IOPS every 5 minutes.

    Idle-IOPS2
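
    The exact scripts involved are not mine to share, but the shape of the fix is simple enough to sketch. Below is a hypothetical example (Python; the URL and the build hand-off are placeholders) of the general idea: ask the SVN server for its head revision, which is a tiny read-only request, and only do the heavy lifting when that number actually changes.

        # Hypothetical lightweight build poller: do expensive work only when the
        # repository head revision has moved. URL and build trigger are placeholders.
        import subprocess
        import time

        REPO_URL = "http://svn.example.local/repo/trunk"   # placeholder URL
        POLL_SECONDS = 300                                  # poll every 5 minutes

        def head_revision(url):
            """Parse the head revision out of 'svn info' run against the repository URL."""
            out = subprocess.run(["svn", "info", url], capture_output=True,
                                 text=True, check=True)
            for line in out.stdout.splitlines():
                if line.startswith("Revision:"):
                    return int(line.split(":", 1)[1])
            raise RuntimeError("could not parse revision from svn info output")

        last_seen = None
        while True:
            rev = head_revision(REPO_URL)
            if last_seen is not None and rev != last_seen:
                print(f"new revision {rev} detected, kicking off a build")
                # trigger_build(rev)   # placeholder for the real build hand-off
            last_seen = rev
            time.sleep(POLL_SECONDS)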

    The image below shows how the accelerated cluster of 30 build VMs looks when there are no builds running.

    Idle-IOPS

    The inefficient polling mechanism wasn’t the only thing found. A few of the Linux build VMs had a rogue “Beagle” search daemon running on them. This crawler did just that, indexing data on these Linux machines, and creating unnecessary I/O.  With Windows, indexers and other CPU and I/O hogs are typically controlled quite easily by GPO, but the equivalent services can creep into Linux systems if you are not careful.  It was an easy fix at least.

    The cumulative benefit
    Prior to the efforts of accelerating the storage, and looking to make it more efficient, the utilization of the arrays looked as the image shows.  (6 hour period, from the perspective of the arrays)

    Array-IOPS-before

    Now, with the combination of understanding my workload better, and acceleration through FVP, that same workload looks like this (6 hour period, from the perspective of the arrays):

    Array-IOPS-after

    Notice that the estimated workload is far under the 100% it was regularly pegged at for 24 hours a day, 6 days a week.  In fact, during the workday, the arrays might only peak at 50% to 60% utilization.  When no builds are running, the continuous build system may only be drawing 25 IOPS from the VMFS volumes that contain the build machines, which is much more reasonable than where it was at.

    With the combination of less pressure on the backing physical storage, and the magic of pooled flash on the hosts, the applications and CPU get to dictate how much storage I/O is needed.  Below is a screen capture of IOPS on a production build VM while compiling was being performed.  It was not known up until this point that a single build VM needed as much as 4,000 IOPS to compile code because the physical storage was never capable of satisfying that type of need.

    IOPS-single-VM

    Conclusion
    Could some of these discoveries have been made without FVP?  Yes, perhaps some of them. But good analysis comes from being able to interpret data in a consumable way. It’s why various methods of data visualization such as bar graphs, pie charts, and X-Y-Z plots exist. FVP certainly has been doing a good job of accelerating workloads, but it also helps the administrator understand the I/O better.  I look forward to seeing how the analytics might expand in future tools or releases from PernixData.

    A friend once said to me that the only thing better than a new tractor is a reason to use it. In many ways, the same thing goes for technology. Virtualization might not even be that fascinating unless you had real workloads to run on top of it. Ditto for PernixData FVP. When applied to real workloads, the magic begins to happen, and you learn a lot about your data in the process.

    Accelerating storage using PernixData’s FVP. A perspective from customer #0001

    Recently, I described in "Hunting down unnecessary I/O before you buy that next storage solution" the efforts around addressing "technical debt" that was contributing to unnecessary I/O. The goal was to get better performance out of my storage infrastructure. It’s been a worthwhile endeavor that I would recommend to anyone, but at the end of the day, one might still need faster storage. That usually means freeing up another 3U of rack space, and opening the checkbook.

    Or does it?  Do I have to go the traditional route of adding more spindles, or investing heavily in a faster storage fabric?  Well, the answer was an unequivocal "yes" not too long ago, but times are a-changing, and here is my way of tackling the problem in a radically different way.

    I’ve chosen to delay any purchases of an additional storage array, or the infrastructure backing it, and opted to go PernixData FVP.  In fact, I was customer #0001 after PernixData announced GA of FVP 1.0.  So why did I go this route?

    1.  Clustered host based caching.  Leveraging server side flash brings compute and data closer together, but thanks to FVP, it does so in such a way that works in a highly available clustered fashion that aligns perfectly with the feature sets of the hypervisor.

    2.  Write-back caching. The ability to deliver writes to flash is really important. Write-through caching, which waits for the acknowledgement from the underlying storage, just wasn’t good enough for my environment. Rotational latencies, as well as physical transport latencies would still be there on over 80% of all of my traffic. I needed true write-back caching that would acknowledge the write immediately, while eventually de-staging it down to the underlying storage.

    3.  Cost. The gold plated dominoes of upgrading storage are not fun for anyone on the paying side of the equation. Going with PernixData FVP was going to address my needs for a fraction of the cost of a traditional solution.

    4.  It allows for a significant decoupling of the "storage for capacity" versus "storage for performance" dilemma when addressing additional storage needs.

    5.  Another array would have been to a certain degree, more of the same. Incremental improvement, with less than enthusiastic results considering the amount invested.  I found myself not very excited to purchase another array. With so much volatility in the storage market, it almost seemed like an antiquated solution.

    6.  Quick to implement. FVP installation consists of installing a VIB via Update Manager or the command line, installing the Management services and vCenter plugin, and you are off to the races.

    7.  Hardware independent.  I didn’t have to wait for a special controller upgrade or firmware update, or wonder if my hardware would work with it (a common problem with storage array solutions). Nor did I have to make a decision to perhaps go with a different storage vendor if I wanted to try a new technology.  It is purely a software solution with the flexibility of working with multiple types of flash; SSDs, or PCIe based.

    A different way to solve a classic problem
    While my write intensive workload is pretty unique, my situation is not.  Our storage performance needs outgrew what the environment was designed for: capacity at a reasonable cost. This is an all too common problem.  The increased capacity of spinning disks has actually made this problem worse, not better.  Fewer and fewer spindles are serving up more and more data.

    My goal was to deliver the results our build VMs were capable of delivering with faster storage, but unable to because of my existing infrastructure.  For me it was about reducing I/O contention to allow the build system CPU cycles to deliver the builds without waiting on storage.  For others it might be delivering lower latencies to their SQL backed ERP or CRM servers.

    The allure of utilizing flash has been an intriguing one.  I often found myself looking at my vSphere hosts and all of their processing goodness, but disappointed that the SSDs sitting in the hosts couldn’t help augment my storage performance needs.  Being an active participant in the PernixData beta program allowed me to see how it would help me in my environment, and whether it would deliver on the needs of the business.

    Lessons learned so far
    Don’t skimp on quality SSDs.  Would you buy an ESXi host with one physical core?  Of course you wouldn’t. Same thing goes with SSDs.  Quality flash is a must! I can tell you from first hand experience that it makes a huge difference.  I thought the Dell OEM SSDs that came with my M620 blades were fine, but by way of comparison, they were terrible. Don’t cripple a solution by going with cheap flash.  In this 4 node cluster, I went with 4 EMLC based, 400GB Intel S3700s. I also had the opportunity to test some Micron P400M EMLC SSDs, which also seemed to perform very well.

    While I went with 400GB SSDs in each host (giving approximately 1.5TB of cache space for a 4 node cluster), I did most of my testing using 100GB SSDs. They seemed adequate in that they were not showing a significant amount of cache eviction, but I wanted to leverage my purchasing opportunity to get larger drives. Knowing the best size can be a bit of a mystery until you get things in place, but a larger cache size allows for a larger working set of data available for future reads, as well as giving headroom for the per-VM write-back redundancy setting.

    An unexpected surprise is how FVP has given me visibility into the one area of I/O monitoring that is traditionally very difficult to see: I/O patterns. (See Iometer. As good as you want to make it.)  Understanding this element of your I/O needs is critical, and the analytics in FVP have helped me discover some very interesting things about my I/O patterns that I will surely be investigating in the near future.

    In the read-caching world, the saying goes that the fastest storage I/O is the I/O the array never will see. Well, with write caching, it eventually needs to be de-staged to the array.  While FVP will improve delivery of storage to the array by absorbing the I/O spikes and turning random writes to sequential writes, the I/O will still eventually have to be delivered to the backend storage. In a more write intensive environment, if the delta between your fast flash and your slow storage is significant, and your duty cycle of your applications driving the I/O is also significant, there is a chance it might not be able to keep up.  It might be a corner case, but it is possible.

    What’s next
    I’ll be posting more specifics on how running PernixData FVP has helped our environment.  So, is it really "disruptive" technology?  Time will ultimately tell.  But I chose not to purchase an array along with new SAN switchgear because of it.  Using FVP has led to less traffic on my arrays, with higher throughput and lower read and write latencies for my VMs.  Yeah, I would qualify that as disruptive.

     

    Helpful Links

    Frank Denneman – Basic elements of the flash virtualization platform – Part 1
    http://frankdenneman.nl/2013/06/18/basic-elements-of-the-flash-virtualization-platform-part-1/

    Frank Denneman – Basic elements of the flash virtualization platform – Part 2
    http://frankdenneman.nl/2013/07/02/basic-elements-of-fvp-part-2-using-own-platform-versus-in-place-file-system/

    Frank Denneman – FVP Remote Flash Access
    http://frankdenneman.nl/2013/08/07/fvp-remote-flash-access/

    Frank Denneman – Design considerations for the host local FVP architecture
    http://frankdenneman.nl/2013/08/16/design-considerations-for-the-host-local-architecture/

    Satyam Vaghani introducing PernixData FVP at Storage Field Day 3
    http://www.pernixdata.com/SFD3/

    Write-back deepdive by Frank and Satyam
    http://www.pernixdata.com/files/wb-deepdive.html

    Iometer. As good as you want to make it.

    Most know Iometer as the go-to synthetic I/O measuring tool used to simulate real workload conditions. Well, somewhere, somehow, someone forgot the latter part of that sentence, which is why it ends up being so misused and abused.  How many of us have seen a storage solution delivering 6 figure IOPS using Iometer, only to find that it is running a 100% read, 512 byte, 100% sequential access workload simulation?  Perfect for the two people on the planet that those specifications might apply to.  For the rest of us, it doesn’t help much.  So why would they bother running that sort of unrealistic test?   Pure, unapologetic number chasing.

    The unfortunate part is that sometimes this leads many to simply dismiss Iometer results.  That is a shame really, as it can provide really good data if used in the correct way.  Observing real world data will tell you a lot of things, but the sporadic nature of real workloads make it difficult to use for empirical measurement – hence the need for simulation.

    So, what are the correct settings to use in Iometer?  The answer is completely dependent on what you are trying to accomplish.  The race for a million IOPS by your favorite storage vendor really means nothing if there is no correlation between their simulated workload, and your real workload.  Maybe IOPS isn’t even an issue for you.  Perhaps your applications are struggling with poor latency.  The challenge is to emulate your environment with a synthetic workload that helps you understand how a potential upgrade, new array, or optimization might be of benefit.

    The mysteries of workloads
    Creating a synthetic workload representing your real workload assumes one thing; that you know what your real workload really is. This can be more challenging than one might think, as many storage monitoring tools do not help you understand the subtleties of patterns in the data that is being read or written.

    Most monitoring tools tend to treat all I/O equally. By that I mean: over a given period of time, say you have 10 million I/Os occur, and your monitoring tells you that you average 60% reads and 40% writes. What is not clear is how many of those reads are multiple reads of the same data and how many are of completely different, untouched data. It also doesn’t tell you if the writes are overwriting existing blocks (which might be read again shortly thereafter) or generating new data. As more and more tiered storage mechanisms come into play, understanding this aspect of your workload is becoming extremely important. You may be treating your I/Os equally, but a tiered storage system using sophisticated caching algorithms certainly does not.
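
    Below is a minimal sketch (Python, against a fabricated trace) of the kind of question most monitoring tools skip: of all the reads in a window, how many touch blocks that were already read or written, and how many writes overwrite existing blocks versus creating new data? With a real block trace, however you manage to obtain one, this distinction says far more about cache friendliness than a simple read/write ratio does.

        # Classify a (fabricated) block-level trace into first reads, re-reads,
        # new writes, and overwrites. The trace format is purely illustrative.
        from collections import Counter

        trace = [                      # (operation, logical block address)
            ("read", 100), ("read", 101), ("read", 100), ("write", 200),
            ("write", 200), ("read", 200), ("write", 300), ("read", 100),
        ]

        seen_read, seen_written = set(), set()
        stats = Counter()

        for op, lba in trace:
            if op == "read":
                stats["re-read" if lba in (seen_read | seen_written) else "first read"] += 1
                seen_read.add(lba)
            else:
                stats["overwrite" if lba in seen_written else "new write"] += 1
                seen_written.add(lba)

        print(dict(stats))
        # A 60/40 read/write ratio looks very different to a caching tier when
        # most of those reads are re-reads of a small working set.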

    How can you gain more insight?  Use every tool at your disposal.  Get to know your applications, and the duty cycles around them. What are your peak hours? Are they in the middle of the day, or in the middle of the night when backups are running?

    Suggestions on Iometer settings
    You may find that the settings you choose for Iometer yield results from your shared storage that aren’t nearly as good as you thought.  But does it matter?  If it is an accurate representation of your real workload, not really.  What matters is whether you are able to deliver the payload from point A to point B to meet your acceptance criteria (such as latency, throughput, etc.).  The goal would be to represent that in a synthetic workload for accurate measurement and comparison.

    With that in mind, here are some suggestions for the next time you set up those Iometer runs.

    1.  Read/write ratio.  Choose a realistic read/write ratio representing your workload. With writes, RAID penalties can hurt your effective performance by quite a bit, so if you don’t have an idea of what this ratio currently is, it’s time for you to find out.

    2.  Transfer request size. Is your payload the size of a ball bearing, or a bowling ball? Applications and operating systems vary on what size is used. Use your monitoring systems to best determine what your environment consists of.

    3.  Disk size.  Use the "maximum disk size" in multiples of 1048576, which is a 1GB file. Throwing a bunch of zeros in there might fill up your disk with Iometer’s test file. Depending on your needs, a setting of 2 to 20 GB might be a good range to work with.

    4.  Number of outstanding I/Os.  This needs to be high enough that the test can keep sending I/O requests while the storage is fulfilling them. A setting of 32 is pretty common (see the short sketch after this list for how outstanding I/Os, latency, and IOPS relate).

    5.  Alignment of I/O. Many of the standard Iometer ICF files you find were built for physical drives, with the "Align I/Os on:" setting left at "Sector boundaries."  When running tests against a storage array, this can lead to inconsistent results, so it is best to align on 4K or 512 bytes.

    6.  Ramp up time. Offer at least a minute of ramp up time.

    7.  Run time. Some might suggest running simulations long enough to exhaust all caching, so that you can see "real" throughput.  While I understand the underlying reason for this statement, I believe this is missing the point.  Caching is there in the first place to take advantage of a working set of warm and cold data, bursts, etc. If you have a storage solution that satisfies the duty cycles that exists in your environment, that is the most important part.

    8.  Number of workers.  Let this spawn automatically to the number of logical processors in your VM. It might be overkill in many cases because of the terrible multithreading abilities of most applications, but it’s a pretty conventional practice.

    9.  Multiple Iometer instances.  Not really a setting, but more of a practice.  I’ve found that running multiple tests at once is a good way to better understand how a storage solution will react under load as opposed to on its own. It is shared storage after all.
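
    On item 4, the relationship between outstanding I/Os, latency, and throughput is worth keeping in mind; it is just Little’s Law. A minimal sketch with illustrative numbers:

        # Little's Law applied to storage: concurrency = IOPS x latency.
        # The queue depths and latencies below are illustrative, not measurements.

        def iops_ceiling(outstanding_ios, latency_ms):
            """Upper bound on IOPS achievable at a given queue depth and latency."""
            return outstanding_ios / (latency_ms / 1000.0)

        print(iops_ceiling(32, 2.0))    # 32 outstanding I/Os at 2 ms  -> 16000 IOPS ceiling
        print(iops_ceiling(32, 20.0))   # 32 outstanding I/Os at 20 ms ->  1600 IOPS ceiling

    If an Iometer run reports fewer IOPS than expected, check whether the queue depth or the latency is the limiter before blaming the storage.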

    Disclaimer
    If you were looking for this to be the definitive post on Iometer, that isn’t what I was shooting for.  There are many others who are much more qualified to speak to the nuances of Iometer than me.  What I hope to do is offer a little practical perspective on its use, and how it can help you.  So next time you run Iometer, think about what you are trying to accomplish, and let go of the number chasing.  Understand your workloads, and use the tool to help you improve your environment.

    Hunting down unnecessary I/O before you buy that next storage solution

    Are legacy processes and workflows sabotaging your storage performance? If you are on the verge of committing good money for more IOPS, or lower latency, it might be worth taking a look at what is sucking up all of those I/Os.

    In my previous posts about improving the performance of our virtualized code compiling systems, it was identified that storage performance was a key factor in our ability to leverage our scaled up compute resources. The classic response to this dilemma has been to purchase faster storage. While that might be a part of the ultimate solution, there is another factor worth looking into; legacy processes, and how they might be impacting your environment.

    Even though new technologies are helping deliver performance improvements, one constant is that traditional, enterprise class storage is expensive. Committing to faster storage usually means committing large chunks of dollars to the endeavor. This can be hard to swallow at budget time, or may not align well with immediate needs. And there can certainly be a domino effect when improving storage performance. If your fabric cannot support a fancy new array, the protocol type, or speed, get ready to spend even more money.

    Calculated Indecision
    In the optimization world, there is an approach called "delay until the last responsible moment" (LRM). Do not mistake this for a procrastinator’s creed of "kicking the can." It is a pretty effective, Agile-esque strategy for hedging against poor or premature purchasing decisions in, in this case, the rapidly changing world of enterprise infrastructures. Even within the last few years, some companies have challenged traditional thinking when it comes to how storage and compute is architected. LRM helps with this rapid change, and has the ability to save a lot of money in the process.

    Look before you buy
    Writes are what you design around and pay big money for, so wouldn’t it be logical to look at your infrastructure to see if legacy processes are undermining your ability to deliver I/O? That is the step I took in an effort to squeeze out every bit of performance that I could with my existing arrays before I commit to a new solution. My quick evaluation resulted in this:

    • Using array based snapshotting for short term protection was eating up way too much capacity; 25 to 30TB. That is almost half of my total capacity, and all for a retention time that wasn’t very good. How does capacity relate to performance? Well, if one doesn’t need all of that capacity for snapshot or replica reserves, one might be able to run at a better performing RAID level. Imagine being able to cut the write penalty by 2 to 3 times if you were currently running RAID levels focused on capacity. For a write-intensive environment like mine, that is a big deal.
    • Legacy I/O intensive applications and processes were identified. What are they, can they be adjusted, or are they even needed anymore?

    Much of this I didn’t need to do a formal analysis of. I knew the environment well enough to know what needed to be done. Here is what the plan of action has consisted of.

    • Ditch the array based snapshots and remote replicas in favor of Veeam. This is something that I wanted to do for some time. Local and remote protection is now the responsibility of some large Synology NAS units as the backup target for Veeam. Everything about this combination has worked incredibly well. For those interested, I’ll be writing about this arrangement in the near future.
    • Convert existing Guest Attached Volumes to native VMDKs. My objective with this is to make Veeam see the data so that it can protect it. Integrated, compressed and deduped. What it does best.
    • Reclaim all of the capacity gained from no longer using snaps and replicas, and rebuild one of the arrays from RAID 50 to RAID 10. This will cut the write penalty from 4 to 2 (a quick arithmetic sketch follows this list).
    • Adjust or eliminate legacy I/O intensive apps.
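
    The RAID level change above is easy to quantify. Here is a minimal sketch (Python, with an illustrative spindle count and per-disk IOPS figure rather than my array’s actual specifications) of how the write penalty eats into the raw IOPS of an array:

        # Effective front-end IOPS after the RAID write penalty, for a given write mix.
        # Spindle count and per-disk IOPS are illustrative placeholders.

        def effective_iops(raw_iops, write_ratio, write_penalty):
            """Front-end IOPS an array can deliver once the write penalty is applied."""
            return raw_iops / ((1 - write_ratio) + write_ratio * write_penalty)

        raw = 24 * 150   # e.g. 24 spindles at ~150 IOPS each = 3,600 raw IOPS

        for penalty, label in [(4, "RAID 50 (write penalty 4)"), (2, "RAID 10 (write penalty 2)")]:
            print(label, round(effective_iops(raw, write_ratio=0.7, write_penalty=penalty)))

    With a write-heavy mix like the 70% used here, halving the penalty comes close to doubling the usable IOPS from the very same spindles.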

    The Culprits
    Here were the biggest legacy I/O intensive applications (“legacy” meaning after the incorporation of Veeam).  Total time per day is shown below, and may reflect different backup frequencies.

    Source:  Legacy SharePoint backup solution
    Cost:  300 write IOPS for 8 hours per day
    Action:  This can be eliminated because of Veeam

    Source:  Legacy Exchange backup solution
    Cost:  300 write IOPS for 1 hour per day
    Action:  This can be eliminated because of Veeam

    Source:  SourceCode (SVN) hotcopies and dumps
    Cost:  200-700 IOPS for 12 hours per day.
    Action:  Hot copies will be eliminated, but SVN dumps will be redirected to an external target.  An optional method of protection that in a sense is unnecessary, but source code is the lifeblood of a software company, so it is worth the overhead right now.

    Source:  Guest attached Volume Copies
    Cost:  Heavy read IOPS on mounted array snapshots when dumping to external disk or tape.
    Action:  Guest attached volumes will be converted to native VMDKs so that Veeam can see and protect the data.

    Notice the theme here? Much of the opportunity for improvement in reducing I/O had to do with legacy “in-guest” methods of protecting the data.  Moving to a hypervisor centric backup solution like Veeam has also reinforced a growing feeling I’ve had about storage array specific features that focus on data protection.  I’ve grown to be uninterested in them.  Here are a few reasons why.

    • It creates an indelible tie between your virtualization infrastructure, protection, and your storage. We all love the virtues of virtualizing compute. Why do I want to make my protection mechanisms dependent on a particular kind of storage? Abstract it out, and it becomes way easier.
    • Need more replica space? Buy more arrays. Need more local snapshot space? Buy more arrays. You end up keeping protection on pretty expensive storage.
    • Modern backup solutions protect the VMs, the applications, and the data better. Application awareness may have been lacking years ago, but not anymore.
    • Vendor lock-in. I’ve never bought into this argument much, mostly because you end up having to make a commitment at some point with just about every financial decision you make. However, adding more storage arrays can eat up an entire budget in an SMB/SME world. There has to be a better way.
    • Complexity. You end up having a mess of methods of how some things are protected, while other things are protected in a different way. Good protection often comes in layers, but choosing a software based solution simplifies the effort.

    I used to live by array specific tools for protecting data. It was all I had, and they served a very good purpose.  I leveraged them as much as I could, but in hindsight, they can make a protection strategy very complex, fragile, and completely dependent on sticking with that line of storage solutions. Use a solution that hooks into the hypervisor via the vCenter API, and let it do the rest.  Storage vendors should focus on what they do best, which is figuring out ways to deliver bigger, better, and faster storage.

    What else to look for
    Other possible sources that are robbing your array of I/Os:

    • SQL maintenance routines (dumps, indexing, etc.). While necessary, you may choose to run these at non peak hours.
    • Defrags. Surely you have a GPO shutting off this feature on all your VMs, correct? (hint, hint)
    • In-guest legacy anything. Traditional backup agents are terrible. Virus scans aren’t much better.
    • User practices.  Don’t be surprised if you find some department doing all sorts of silly things that translate into heavy writes.  (e.g. “We copy all of our data to this other directory hourly to back it up.”)
    • Guest attached volumes. While they can be technically efficient, one would have to rely on protecting these in other ways because they are not visible from vCenter. Often this results in some variation of making an array based snapshot available to a backup application. While it is "off-host" to the production system, this method takes a very long time, whether the target is disk or tape.

    One might think that eventually, the data has to be committed to external disk or tape anyway, so what does it matter?  When it is file level backups, it matters a lot.  For instance, committing 9TB of guest attached volume data (millions of smaller files) directly to tape takes nearly 6 days to complete.  Committing 9TB of Veeam backups to tape takes just a little over 1 day.
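
    The difference is starker when expressed as effective throughput. A quick calculation using the durations quoted above:

        # Effective throughput of committing 9 TB to tape, using the durations above.
        def mb_per_sec(terabytes, days):
            return terabytes * 1024 * 1024 / (days * 24 * 3600)

        print(round(mb_per_sec(9, 6)))   # file level, guest attached volumes: ~18 MB/s
        print(round(mb_per_sec(9, 1)))   # Veeam backup files:                ~109 MB/s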

    The Results

    So how much did these steps improve the average load on the arrays? This is a work in progress, so I don’t have the final numbers yet. But with each step, contention is decreased on my arrays, and my protection strategy has become several orders of magnitude simpler in the process.

    With all of that said, I will be addressing my storage performance needs with *something* new. What might that be?  Stay tuned.