Are legacy processes and workflows sabotaging your storage performance? If you are on the verge of committing good money to more IOPS or lower latency, it might be worth taking a look at what is sucking up all of those I/Os first.
In my previous posts about improving the performance of our virtualized code-compiling systems, storage performance stood out as the key factor limiting our ability to leverage our scaled-up compute resources. The classic response to this dilemma has been to purchase faster storage. While that might be part of the ultimate solution, there is another factor worth looking into: legacy processes, and how they might be impacting your environment.
Even though new technologies are helping deliver performance improvements, one constant is that traditional, enterprise-class storage is expensive. Committing to faster storage usually means committing large chunks of dollars to the endeavor. That can be hard to swallow at budget time, or may not align well with immediate needs. And there can certainly be a domino effect when improving storage performance: if your fabric cannot support a fancy new array's protocol or speed, get ready to spend even more money.
Calculated Indecision
In the optimization world, there is an approach of deferring decisions until the "last responsible moment" (LRM). Do not mistake this for a procrastinator's creed of kicking the can down the road. It is a pretty effective, Agile-esque strategy for hedging against poor or premature purchasing decisions, which in this case means the rapidly changing world of enterprise infrastructure. Even within the last few years, some companies have challenged traditional thinking about how storage and compute are architected. LRM helps you ride out that rapid change, and it can save a lot of money in the process.
Look before you buy
Writes are what you design around and pay big money for, so wouldn't it be logical to look at your infrastructure to see if legacy processes are undermining your ability to deliver I/O? That is the step I took in an effort to squeeze every bit of performance out of my existing arrays before committing to a new solution. My quick evaluation turned up the following:
- Using array-based snapshotting for short-term protection was eating up way too much capacity: 25 to 30TB. That is almost half of my total capacity, and all for a retention window that wasn't very good. How does capacity relate to performance? Well, if one doesn't need all of that capacity for snapshot or replica reserves, one might be able to run at a better-performing RAID level. Imagine being able to cut the write penalty by 2 to 3 times if you were currently running RAID levels focused on capacity (see the sketch after this list). For a write-intensive environment like mine, that is a big deal.
- Legacy I/O-intensive applications and processes needed to be identified. What are they, can they be adjusted, and are they even needed anymore?
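To put some rough numbers behind that write-penalty point, here is a minimal sketch of how the penalty inflates backend I/O for a write-heavy workload. The frontend IOPS figure and the read/write mix are illustrative assumptions, not measurements from my arrays.

```python
# Rough sketch: how the RAID write penalty inflates backend I/O.
# The frontend load and read/write mix below are assumptions for illustration.

def backend_iops(frontend_iops, write_ratio, write_penalty):
    """Backend I/Os the disks must service for a given frontend load."""
    reads = frontend_iops * (1 - write_ratio)
    writes = frontend_iops * write_ratio
    return reads + writes * write_penalty

frontend = 2000      # assumed frontend IOPS hitting the array
write_ratio = 0.7    # assumed write-heavy mix, typical of a build environment

for name, penalty in [("RAID 50 (parity)", 4), ("RAID 10 (mirror)", 2)]:
    print(f"{name}: {backend_iops(frontend, write_ratio, penalty):,.0f} backend IOPS")

# RAID 50 (parity): 6,200 backend IOPS
# RAID 10 (mirror): 3,400 backend IOPS
```

Same frontend load, but the disks behind a parity layout have to do nearly twice the work. That is the headroom I was looking to reclaim.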
Much of this I didn’t need to do a formal analysis of. I knew the environment well enough to know what needed to be done. Here is what the plan of action has consisted of.
- Ditch the array-based snapshots and remote replicas in favor of Veeam. This is something that I had wanted to do for some time. Local and remote protection is now the responsibility of some large Synology NAS units serving as the backup target for Veeam. Everything about this combination has worked incredibly well. For those interested, I'll be writing about this arrangement in the near future.
- Convert existing guest attached volumes to native VMDKs. My objective here is to make the data visible to Veeam so that it can protect it. Integrated, compressed, and deduped, which is what it does best.
- Reclaim all of the capacity gained from no longer using snaps and replicas, and rebuild one of the arrays from RAID 50 to RAID 10. This will cut the write penalty from 4 to 2 (see the capacity sketch after this list).
- Adjust or eliminate legacy I/O-intensive apps.
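The RAID 50 to RAID 10 rebuild trades capacity for write performance, and the reclaimed snapshot and replica reserves are what make that trade possible. Here is a back-of-the-envelope sketch; the disk count, disk size, and RAID 5 group size are hypothetical, not the actual layout of my arrays.

```python
# Back-of-the-envelope usable capacity for the same disk pool laid out as
# RAID 50 versus RAID 10. All figures below are hypothetical, for illustration.

disks = 24            # assumed total disks in the pool
disk_tb = 3           # assumed TB per disk
raid5_group = 8       # assumed disks per RAID 5 group within the RAID 50 set

raid50_usable = disks * disk_tb * (raid5_group - 1) / raid5_group  # one parity disk per group
raid10_usable = disks * disk_tb / 2                                # mirrored pairs

print(f"RAID 50 usable: {raid50_usable:.0f} TB (write penalty 4)")
print(f"RAID 10 usable: {raid10_usable:.0f} TB (write penalty 2)")
# RAID 50 usable: 63 TB (write penalty 4)
# RAID 10 usable: 36 TB (write penalty 2)
```

In this hypothetical layout the capacity given up is roughly what the snapshot and replica reserves were consuming, which is exactly why ditching those comes first.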
The Culprits
Here were the biggest legacy I/O-intensive offenders ("legacy" meaning whatever remained after the incorporation of Veeam). The total time per day is shown for each, and may reflect different backup frequencies; a rough tally of the daily I/O they consume follows the list.
Source: Legacy SharePoint backup solution
Cost: 300 write IOPS for 8 hours per day
Action: This can be eliminated because of Veeam
Source: Legacy Exchange backup solution
Cost: 300 write IOPS for 1 hour per day
Action: This can be eliminated because of Veeam
Source: Source code repository (SVN) hotcopies and dumps
Cost: 200-700 IOPS for 12 hours per day.
Action: Hot copies will be eliminated, but SVN dumps will be redirected to an external target. This is an optional layer of protection that is, strictly speaking, unnecessary, but source code is the lifeblood of a software company, so it is worth the overhead for now.
Source: Guest attached volume copies
Cost: Heavy read IOPS on mounted array snapshots when dumping to external disk or tape.
Action: Guest attached volumes will be converted to native VMDKs so that Veeam can see and protect the data.
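To show why these stood out, here is a rough tally of the raw I/O each one pushes to the arrays per day, treating the rates above as sustained averages over their run windows (the SVN figure uses the midpoint of its stated range).

```python
# Rough tally of daily I/O for each legacy process, treating the IOPS
# figures above as sustained averages over their daily run windows.

culprits = {
    "SharePoint backup": (300, 8),    # (write IOPS, hours per day)
    "Exchange backup":   (300, 1),
    "SVN hotcopy/dump":  (450, 12),   # midpoint of the 200-700 IOPS range
}

for name, (iops, hours) in culprits.items():
    daily = iops * hours * 3600
    print(f"{name:18s} ~{daily / 1e6:4.1f} million I/Os per day")

# SharePoint backup  ~ 8.6 million I/Os per day
# Exchange backup    ~ 1.1 million I/Os per day
# SVN hotcopy/dump   ~19.4 million I/Os per day
```

The SVN jobs alone account for the lion's share, which is why they get redirected to an external target rather than simply tolerated.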
Notice the theme here? Most of the opportunities for reducing I/O had to do with legacy "in-guest" methods of protecting the data. Moving to a hypervisor-centric backup solution like Veeam has also reinforced a growing feeling I've had about storage-array-specific features that focus on data protection: I've grown uninterested in them. Here are a few reasons why.
- It creates an inextricable tie between your virtualization infrastructure, your protection, and your storage. We all love the virtues of virtualizing compute, so why would I want to make my protection mechanisms dependent on a particular kind of storage? Abstract it out, and everything becomes far easier.
- Need more replica space? Buy more arrays. Need more local snapshot space? Buy more arrays. You end up keeping your protection data on pretty expensive storage.
- Modern backup solutions protect the VMs, the applications, and the data better. Application awareness may have been lacking years ago, but not anymore.
- Vendor lock-in. I’ve never bought into this argument much, mostly because you end up having to make a commitment at some point with just about every financial decision you make. However, adding more storage arrays can eat up an entire budget in an SMB/SME world. There has to be a better way.
- Complexity. You end up with a mess of methods, where some things are protected one way and other things another. Good protection often comes in layers, but choosing a software-based solution simplifies the effort.
I used to live by array-specific tools for protecting data. They were all I had, and they served a very good purpose. I leveraged them as much as I could, but in hindsight, they can make a protection strategy very complex, fragile, and completely dependent on sticking with that line of storage solutions. Use a solution that hooks into the hypervisor via the vCenter API, and let it do the rest. Storage vendors should focus on what they do best: figuring out ways to deliver bigger, better, and faster storage.
What else to look for
Other possible sources that are robbing your array of I/Os:
- SQL maintenance routines (dumps, index rebuilds, etc.). While necessary, you may choose to run these during off-peak hours.
- Defrags. Surely you have a GPO shutting off this feature on all your VMs, correct? (hint, hint)
- In-guest legacy anything. Traditional backup agents are terrible. Virus scans aren't much better.
- User practices. Don't be surprised if you find some department doing all sorts of silly things that translate into heavy writes (e.g., "We copy all of our data to this other directory hourly to back it up.").
- Guest attached volumes. While they can be technically efficient, one has to rely on protecting them in other ways because they are not visible from vCenter. Often this results in some variation of making an array-based snapshot available to a backup application. While it is "off-host" to the production system, this method takes a very long time, whether the target is disk or tape.
One might think that the data has to be committed to external disk or tape eventually anyway, so what does it matter? When it is file-level backups, it matters a lot. For instance, committing 9TB of guest attached volume data (millions of smaller files) directly to tape takes nearly 6 days to complete. Committing 9TB of Veeam backups to tape takes just a little over 1 day.
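Worked out as sustained throughput, the gap is stark. This is just the arithmetic implied by the two durations above, using decimal units; it says nothing about the tape hardware itself.

```python
# Effective throughput implied by the two tape jobs above:
# 9 TB of small files over ~6 days versus 9 TB of Veeam backup files over ~1 day.

total_mb = 9 * 1_000_000   # 9 TB expressed in MB (decimal units)

def mb_per_sec(days):
    return total_mb / (days * 24 * 3600)

print(f"File-level data to tape: ~{mb_per_sec(6):.0f} MB/s sustained")
print(f"Veeam backups to tape:   ~{mb_per_sec(1):.0f} MB/s sustained")
# File-level data to tape: ~17 MB/s sustained
# Veeam backups to tape:   ~104 MB/s sustained
```

Millions of small files keep the tape drive waiting; a handful of large backup files let it stream.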
The Results
So how much did these steps improve the average load on the arrays? This is a work in progress, so I don't have the final numbers yet. But with each step, contention on my arrays decreases, and my protection strategy has become dramatically simpler in the process.
With all of that said, I will be addressing my storage performance needs with *something* new. What might that be? Stay tuned.