Accelerating storage using PernixData’s FVP. A perspective from customer #0001

Recently, I described in "Hunting down unnecessary I/O before you buy that next storage solution" the efforts around addressing "technical debt" that was contributing to unnecessary I/O. The goal was to get better performance out of my storage infrastructure. It’s been a worthwhile endeavor that I would recommend to anyone, but at the end of the day, one might still need faster storage. That usually means, free up another 3U of rack space, and open checkbook

Or does it?  Do I have to go the traditional route of adding more spindles, or investing heavily in a faster storage fabric?  Well, the answer was an unequivocal "yes" not too long ago, but times are a changing, and here is my way to tackle the problem in a radically different way.

I’ve chosen to delay any purchases of an additional storage array, or the infrastructure backing it, and opted to go PernixData FVP.  In fact, I was customer #0001 after PernixData announced GA of FVP 1.0.  So why did I go this route?

1.  Clustered host based caching.  Leveraging server side flash brings compute and data closer together, but thanks to FVP, it does so in such a way that works in a highly available clustered fashion that aligns perfectly with the feature sets of the hypervisor.

2.  Write-back caching. The ability to deliver writes to flash is really important. Write-through caching, which waits for the acknowledgement from the underlying storage, just wasn’t good enough for my environment. Rotational latencies, as well as physical transport latencies would still be there on over 80% of all of my traffic. I needed true write-back caching that would acknowledge the write immediately, while eventually de-staging it down to the underlying storage.

3.  Cost. The gold plated dominos of upgrading storage is not fun for anyone on the paying side of the equation. Going with PernixData FVP was going to address my needs for a fraction of the cost of a traditional solution.

4.  It allows for a significant decoupling of "storage for capacity" versus "storage for performance" dilemma when addressing additional storage needs.

5.  Another array would have been to a certain degree, more of the same. Incremental improvement, with less than enthusiastic results considering the amount invested.  I found myself not very excited to purchase another array. With so much volatility in the storage market, it almost seemed like an antiquated solution.

6.  Quick to implement. FVP installation consists of installing a VIB via Update Manager or the command line, installing the Management services and vCenter plugin, and you are off to the races.

7.  Hardware independent.  I didn’t have to wait for a special controller upgrade, firmware update, or wonder if my hardware would work with it. (a common problem with storage array solutions). Nor did I have to make a decision to perhaps go with a different storage vendor if I wanted to try a new technology.  It is purely a software solution with the flexibility of working with multiple types of flash; SSDs, or PCIe based. 

A different way to solve a classic problem
While my write intensive workload is pretty unique, my situation is not.  Our storage performance needs outgrew what the environment was designed for; capacity at a reasonable cost. This is an all too common problem.  With the increased capacities of spinning disks, it has actually made this problem worse, not better.  Fewer and fewer spindles are serving up more and more data.

My goal was to deliver the results our build VMs were capable of delivering with faster storage, but unable to because of my existing infrastructure.  For me it was about reducing I/O contention to allow the build system CPU cycles to deliver the builds without waiting on storage.  For others it might delivering lower latencies to their SQL backed ERP or CRM servers.

The allure of utilizing flash has been an intriguing one.  I often found myself looking at my vSphere hosts and all of it’s processing goodness, but disappointed those SSD sitting in the hosts couldn’t help to augment my storage performance needs.  Being an active participant in the PernixData beta program allowed me to see how it would help me in my environment, and if it would deliver the needs of the business.

Lessons learned so far
Don’t skimp on quality SSDs.  Would you buy an ESXi host with one physical core?  Of course you wouldn’t. Same thing goes with SSDs.  Quality flash is a must! I can tell you from first hand experience that it makes a huge difference.  I thought the Dell OEM SSDs that came with my M620 blades were fine, but by way of comparison, they were terrible. Don’t cripple a solution by going with cheap flash.  In this 4 node cluster, I went with 4 EMLC based, 400GB Intel S3700s. I also had the opportunity to test some Micron P400M EMLC SSDs, which also seemed to perform very well.

While I went with 400GB SSDs in each host (giving approximately 1.5TB of cache space for a 4 node cluster), I did most of my testing using 100GB SSDs. They seemed adequate in that they were not showing a significant amount of cache eviction, but I wanted to leverage my purchasing opportunity to get larger drives. Knowing the best size can be a bit of a mystery until you get things in place, but having a larger cache size allows for a larger working set of data available for future reads, as well as giving head room for the per-VM write-back redundancy setting available.

An unexpected surprise is how FVP has given me visibility into the one area of I/O monitoring that is traditional very difficult to see;  I/O patterns. See Iometer. As good as you want to make it.  Understanding this element of your I/O needs is critical, and the analytics in FVP has helped me discover some very interesting things about my I/O patterns that I will surely be investigating in the near future.

In the read-caching world, the saying goes that the fastest storage I/O is the I/O the array never will see. Well, with write caching, it eventually needs to be de-staged to the array.  While FVP will improve delivery of storage to the array by absorbing the I/O spikes and turning random writes to sequential writes, the I/O will still eventually have to be delivered to the backend storage. In a more write intensive environment, if the delta between your fast flash and your slow storage is significant, and your duty cycle of your applications driving the I/O is also significant, there is a chance it might not be able to keep up.  It might be a corner case, but it is possible.

What’s next
I’ll be posting more specifics on how running PernixData FVP has helped our environment.  So, is it really "disruptive" technology?  Time will ultimately tell.  But I chose to not purchase an array along with new SAN switchgear because of it.  Using FVP has lead to less traffic on my arrays, with higher throughput and lower read and write latencies for my VMs.  Yeah, I would qualify that as disruptive.

 

Helpful Links

Frank Denneman – Basic elements of the flash virtualization platform – Part 1
http://frankdenneman.nl/2013/06/18/basic-elements-of-the-flash-virtualization-platform-part-1/

Frank Denneman – Basic elements of the flash virtualization platform – Part 2
http://frankdenneman.nl/2013/07/02/basic-elements-of-fvp-part-2-using-own-platform-versus-in-place-file-system/

Frank Denneman – FVP Remote Flash Access
http://frankdenneman.nl/2013/08/07/fvp-remote-flash-access/

Frank Dennaman – Design considerations for the host local FVP architecture
http://frankdenneman.nl/2013/08/16/design-considerations-for-the-host-local-architecture/

Satyam Vaghani introducing PernixData FVP at Storage Field Day 3
http://www.pernixdata.com/SFD3/

Write-back deepdive by Frank and Satyam
http://www.pernixdata.com/files/wb-deepdive.html

Iometer. As good as you want to make it.

Most know Iometer as the go-to synthetic I/O measuring tool used to simulate real workload conditions. Well, somewhere, somehow, someone forgot the latter part of that sentence, which is why it ends up being so misused and abused.  How many of us have seen a storage solution delivering 6 figure IOPS using Iometer, only to find that they are running a 100% read, 512 byte 100% sequential access workload simulation.  Perfect for the two people on the planet that those specifications might apply to.  For the rest of us, it doesn’t help much.  So why would they bother running that sort of unrealistic test?   Pure, unapologetic number chasing.

The unfortunate part is that sometimes this leads many to simply dismiss Iometer results.  That is a shame really, as it can provide really good data if used in the correct way.  Observing real world data will tell you a lot of things, but the sporadic nature of real workloads make it difficult to use for empirical measurement – hence the need for simulation.

So, what are the correct settings to use in Iometer?  The answer is completely dependent on what you are trying to accomplish.  The race for a million IOPS by your favorite storage vendor really means nothing if their is no correlation between their simulated workload, and your real workload.  Maybe IOPS isn’t even an issue for you.  Perhaps your applications are struggling with poor latency.  The challenge is to emulate your environment with a synthetic workload that helps you understand how a potential upgrade, new array, or optimization might be of benefit.

The mysteries of workloads
Creating a synthetic workload representing your real workload assumes one thing; that you know what your real workload really is. This can be more challenging that one might think, as many storage monitoring tools do not help you understand the subtleties of patterns to the data that is being read or written.

Most monitoring tools tend to treat all I/O equally. By that I mean, if over a given period of time, say you have 10 million I/Os occur.  Let’s say your monitoring tells you that you average 60% reads and 40% writes. What is not clear is how many of those reads are multiple reads of the same data or completely different, untouched data. It also doesn’t tell you if the writes are overwriting existing blocks (which might be read again shortly thereafter) or generating new data. As more and more tiered storage mechanisms comes into play, understanding this aspect of your workload is becoming extremely important. You may be treating your I/Os equally, but the tiered storage system using sophisticated caching algorithms certainly do not.

How can you gain more insight?  Use every tool at your disposal.  Get to know your applications, and the duty cycles around them. What are your peak hours? Are they in the middle of the day, or in the middle of the night when backups are running?

Suggestions on Iometer settings
You may find that the settings you choose for Iometer yields results from your shared storage that isn’t nearly as good as you thought.  But does it matter?  If it is an accurate representation of your real workload, not really.  What matters is if are you able to deliver the payload from point a to point b to meet your acceptance criteria (such as latency, throughput, etc.).  The goal would be to represent that in a synthetic workload for accurate measurement and comparison.

With that in mind, here are some suggestions for the next time you set up those Iometer runs.

1.  Read/write ratio.  Choose a realistic read/write ratio representing your workload. With writes, RAID penalties can hurt your effective performance by quite a bit, so if you don’t have an idea of what this ratio currently is, it’s time for you to find out.

2.  Transfer request size. Is your payload the size of a ball bearing, or a bowling ball? Applications and operating systems vary on what size is used. Use your monitoring systems to best determine what your environment consists of.

3.  Disk size.  Use the "maximum disk size" in multiples of 1048576, which is a 1GB file. Throwing a bunch of zeros in there might fill up your disk with Iometer’s test file. Depending on your needs, a setting of 2 to 20 GB might be a good range to work with.

4.  Number of outstanding I/Os.  This needs to be high enough so that the test can keep sending I/O requests to it as the storage is fulfilling requests to it. A setting of 32 is pretty common.

5.  Alignment of I/O. Many of the standard Iometer ICF files you find were built for physical drives. It has the "Align I/Os on:" setting to "Sector boundaries"   When running tests on a storage array, this can lead to inconsistent results, so it is best to align on 4K or 512 bytes.

6.  Ramp up time. Offer at least a minute of ramp up time.

7.  Run time. Some might suggest running simulations long enough to exhaust all caching, so that you can see "real" throughput.  While I understand the underlying reason for this statement, I believe this is missing the point.  Caching is there in the first place to take advantage of a working set of warm and cold data, bursts, etc. If you have a storage solution that satisfies the duty cycles that exists in your environment, that is the most important part.

8.  Number of workers.  Let this spawn automatically to the number of logical processors in your VM. It might be overkill in many cases because of terrible multithreading abilities of most applications, but its a pretty conventional practice.

9.  Multiple Iometer instances.  Not really a setting, but more of a practice.  I’ve found running multiple tests a way to better understand how a storage solution will react under load as opposed to on it’s own. It is shared storage after all.

Disclaimer
If you were looking for this to be the definitive post on Iometer, that isn’t what I was shooting for.  There are many others who are much more qualified to speak to the nuances of Iometer than me.  What I hope to do is to offer a little practical perspective on it’s use, and how it can help you.  So next time you run Iometer, think about what you are trying to accomplish, and let go of the number chasing.  Understand your workloads, and use the tool to help you improve your environment.

Seattle VMUG Meeting for August 2013

VMworld 2013 is just a few short weeks away, and with that comes a lot of excitement in the virtualization community. Whether or not you plan on attending VMworld, there is plenty to be excited for with the next Seattle VMUG on Wednesday, August 14th (5:00pm to 8:00pm) at the Bellevue Art Museum. If you are located in the greater Seattle area, here are a few reasons why you should block out your calendar for this event.

Great content is what the Seattle VMUG has been striving for, and I think this next one is a fine example of that. Two fantastic sponsors will be presenting.

  • Veeam Backup & Replication has a tremendously loyal following. If you get a chance to use it, you quickly understand why. Big improvements are coming in Veeam Backup & Replication v7 that may alter your onsite and offsite protection strategies for the better. This is a great chance to find out more about it.
  • PernixData. Billed as the first flash hypervisor, see how this solution (known as FVP) leverages server-side cache to accelerate storage performance for vSphere environments, and why it is turning heads in the industry. It just may change how you design your infrastructure

Need a few more reasons to go? Okay, well how about:

  • You have a great chance at one of two iPad Minis that will be given away. Yes, the Seattle VMUG manages to find some pretty classy prizes to give away. You can’t win if you don’t show up.

  • Chat it up with your fellow VMUGers. Come on over and introduce yourself. Find out what others are doing in their environments, and share a little bit about what your environment is like. If you work with virtualization or manage infrastructures, and claim that nobody understands what you do, here is your chance to surround yourself with others who do!

  • Soak up the sights at the Bellevue Art Museum. Located in the heart of downtown Bellevue, it’s a great venue that has plenty of parking, and will give you an opportunity to see the exhibits
    So be sure to register and come by and say hi. You might walk away with a new iPad mini, and new ideas on how to improve your environment.