Dogs, Rush hour traffic, and the history of storage I/O benchmarking–Part 1

Evaluating performance of x86 based servers and workstations has had a history of deficiency. Twenty years ago, Administrators who tested system performance usually did little more than run a simple CPU benchmark to see how much faster a 50MHz system was than a 25 MHz system. Rarely did testing go beyond this. Nostalgia aside, it really was a simpler time.

Fast forward a few years, and testing became slightly more sophisticated. Someone figured out it might be good to test the slowest part of the system (storage), so methods and tools were created to accommodate. Storage moved beyond the physical confines of the server by using dedicated LUNS in a SAN array. The LUNS may have not been shared, but the fabric, and entry points to the array were. However, testing storage generally marched forward with little change. Virtualization changed the landscape even further by changing the notion of a dedicated LUN for a single system. Now, the fabric and every component on the storage system was shared.

Testing tools came and went, with some being nothing more than orphaned side projects. Some tools have more dials to turn, but many still run under the assumption that they are testing a physical host on local spinning disk. They do little to try to emulate a real workload, as they have no idea what that means. Many times these tools try to combine load generation with a single, final number for performance measurement. Almost as if whatever happened in between the start and finish didn’t matter.

Testing methods didn’t evolve much either. The quest for “top speed” was never supplanted by any other method. Noteworthy considering a critical measurement of anything shared is performance under load or contention. Storage architectures and the media used has evolved, but rarely is it properly accounted for in testing. Often lost in the speeds and feeds discussion is the part that really counts – The performance of the applications and the VMs they live on.

This post will point out the flaws of synthetic testing of storage performance (the tools, and the techniques), but it may incorrectly give the impression that they are useless.  Quite the contrary actually.  They can be very helpful when used the correct way, and for the right reasons.  More on this later.

Deficiencies of benchmarks as a meaningful measuring stick
“I don’t use benchmarks. I have users” — Ancient Twitter Proverb

There is no substitute for the value of observing real world performance characteristics, but it does little to address the difficulty with measuring that performance in a repeatable way. Real workloads are a collection of widely moving variables that all have different types of impact on an environment and a user experience. Testing system performance is important, but only when it is properly understood what the testing tools are producing.

Synthetic benchmarks offer a number of benefits. They are typically very easy to run, and often produce some dashboard result that can be referenced later. But these tools and test methods share common characteristics that rarely generate anything resembling real data patterns. Among those distinctions are.

  • I/O generated from them is not a closed loop dialogue
  • They do not mimic dynamic variables of real workloads
  • Improper testing practices in a clustered compute environment

All of these warrant more detail, so let’s elaborate on each one.

I/O generated from them is not a closed loop dialogue
A simplified way to describe typical I/O dialogue be this; Data is fetched, it is processed in some way, then is sent on its way. The I/O “signature” of a workload could be described as to what pattern and degree this dialogue occurs in.  It can be a pattern that is often repeated frequently if you observe workloads long enough.

Consider the fetching of some data, the processing of some data, and the writing of data. One might liken this process to a dog fetching a ball, you wiping the slobber off, and throwing it out again. Over and over again, and in that order. Single threaded of course.

n1whm

Synthetic load generators attack this quite differently. The one and only job of a synthetic I/O generator is to fill up the queues as fast as possible using every CPU cycle. The I/O generator has no regard for anything. The data is not processed in any way because that is not what was asked of it. By comparison, Synthetic I/O pretty much looks like this:

n1wck

With a Synthetic I/O generator, every CPU cycle will be pushed to perform a singular action. Reads are requested and writes are issued in an all or nothing fashion. Sure, some generators allow you to mix reads and writes, but the problem still remains.  They do not reflect any meaningful dialogue, and cannot mimic a real workload.

They do not mimic dynamic variables of real workloads
Real workloads consume resources (CPU, Memory, Storage) far differently than their synthetic counterparts. At any given time, storage I/Os will be varying mixes of reads versus writes, I/O sizes, and coming from one or many CPU threads. The two images below shows a 6 1/2 minute snippet of real I/O taken from a single VM in a production environment (using vscsiStats and Excel surface charts). During this time of heavy activity, notice how much the type of I/O that is in play varies.

Below you see the number of read I/Os, and the respective I/O sizes. Typically between 4K and 32K in size.

image

Below you see the number of write I/Os, and the respective I/O sizes. The majority of sizes range from 32K to 512K in size. This is occurring at the same time on the same VM.

image

Here you can see that read/write ratios vary for just this single VM, and more importantly, the size and the number of I/Os are all over the place. I/O sizes can have an enormous impact on storage performance, so one can imagine the difficulty in emulating it accurately. VMware’s I/O Analyzer attempts to simulate courtesy of a trace file created and replayed from a real workload, but it still will not behave in the same way as multiple VMs from multiple hosts generating widely varying I/O patterns.  Analytics from storage arrays doesn’t help much either, as they are unable to see the pattern of data in this way.

Improper testing practices in a clustered compute environment
A typical Administrator who tests storage performance usually does so by setting up a single system (VM) to test peak performance (IOPS, Throughput & Latency) on a shared storage backend. It sounds logical at first, but this method doesn’t reflect the way data is handled in a clustered compute environment. Storage I/O from a clustered compute arrangement behaves in a way that is not unlike congestion on an interstate freeway. The performance of a freeway cannot be evaluated by a single car driving on it. It’s performance measurement is derived when it is under load with multiple cars, with different intentions, destinations, sizes, and all of the other variables that introduce congestion. Modern traffic simulation and modeling solutions account for all of these variables to measure and improve what matters most – real traffic.

Unfortunately, most testers and tools take this same “single car” approach, and do not account for one of the most important elements in modern virtualized infrastructures; the clustered compute layer. A fast storage infrastructure needs to be able to handle the given number of compute nodes (physical hosts) now and in the future. After all, the I/O’s are ALWAYS generated by the VMs and the collection of hosts they live on – not the backend storage. Painfully obvious, but often overlooked.

Take a look below. This is an illustration of I/O activity in a real environment (traditional clustered compute with SAN architecture). Green lines represent read I/Os and red lines representing write I/Os.

image

Now, let’s look at I/O activity from a traditional synthetic benchmarking approach. As you can see below, it looks pretty different.

image

Storage I/O generated in a real environment is a result of the number of nodes in a cluster and the workloads running on them. So a better (but far from perfect) way to test, at the very minimum, would be:

image

In a traditional architecture with an array that exceeds the capabilities of I/O generated from a single host, this is the method most commonly used to measure the absolute high water mark numbers of that array. Typically storage manufactures quote the numbers from the array because it is a single point of measurement that fits nicely in Marketing materials. Unfortunately it doesn’t measure what really counts; the reported numbers as seen by the VMs – a fact that must not be forgotten when performing any sort of performance testing. The array of course always hits a limit at some point due to the characteristics of the array, or the fabric it has to traverse.

In any sort of clustered compute system, you cannot recognize the full power of the compute platform by testing off of one VM. The same thing goes for any type of distributed storage architecture. With clustered host based acceleration solutions like PernixData FVP, or even Hyper Converged solutions, the approach will have to be similar to above in order to measure correctly. These are different architectures that reshape the traditional data path, and with the testing recommendations above, should help in evaluating their performance. This approach also puts the focus back where it should be; the performance of the VMs, and not some irrelevant numbers from a physical storage array.

 

Proper simulation of I/O across all hosts will allow you to adequately factor in the performance of the storage fabric and all connection points. Most fabrics are quite fast when there isn’t any traffic on them. Unfortunately, that isn’t very realistic. It is important to understand the impact the fabric introduces as the environment is scaled. Since the fabric is what connects all of the hosts fetching and committing data to a storage array, we need to simulate how everything (HBAs, storage array controllers, switches, etc.) performs under contention. If you haven’t already done so, take a few moments to read one of my favorite posts from Frank Denneman, Data Path is not managed as a clustered resource.

Testing with multiple VMs on multiple hosts with FVP also allows you to takes advantage of the per VM acceleration (write buffering & read caching) capabilities across a clustered compute environment. It is one of the reasons why the FVP’s decoupled architecture can scale so well, and why real workloads become a such a beneficiary of the architecture.

You thought we were finished, didn’t you
There are just too many ways Synthetic Benchmarks are misused to cover in just one post. Stay tuned for Part 2 for more observations on why they are inadequate as a single test for modern environments, and most importantly, when and how they can actually be useful.

4 thoughts on “Dogs, Rush hour traffic, and the history of storage I/O benchmarking–Part 1”

  1. This notion of testing I/O and latency the way things now really are in our virtualized datacenters certainly sounds obvious though many continue to perform benchmarks in the “tennis ball dump” manner depicted. Also seemingly obvious is the follow on if you are going to consider all parts of the storage stack (HBAs etc.) as indicated during testing then clearly you need to consider all of that during architecture, design, and implementation of your environment. That related notion is what brings me to an observation and follow on idea many in my organization struggle with right now.

    With FVP of primary concern is the ability of back end storage to de-stage writes and perform required cache miss reads acceptably. As long as peak sustained write traffic never “overflows” any single host cache you have sized your cache layer and back “peak de-stage capacity” adequately. (Be that load during annual book closing or whatever) While getting that “just right” may still be a bit of black magic it still holds true. The same holds for delivering acceptable read performance on cache miss. The salient characteristic of both read and write traffic “offloading” performed by FVP being the need for MUCH less performance from the back end in most cases. (Massive sequential writes and reads for those few with such use cases clearly being exceptions)

    As we are new to FVP (after many, many months of campaigning) how little those pricey arrays are loaded after migration is something most will simply have to see for themselves. That brings me to the follow on which is somewhat related to a recent Frank Denneman post on performance isolation.

    In the environments I have seen and discussed with others many lump all resources of a given type (compute, RAM, storage) into a single large collection/cluster. The thinking goes the one large pool can then be cut up as needed because you “just never know” what requirements might come along. While that last maxim may be true real world workload rarely benefits from such a monolithic approach. This brings me to the idea with which many of my work colleagues continue to struggle at this moment.

    There is general recognition of the widely varying workloads among the 800+ VMs at our main site as well as the fact relatively few of them rely on another locally hosted VM. If keeping an application front end and dedicated database back end proximate (ideally on the same host) is so desirable then surely the converse is true of there being no need for any proximity between guest VMs which have nothing to do with each other. The implications of that have lead to our initial implementation of dedicated clusters and (hopefully) eventually dedicated shared storage per cluster.

    Among those 800+ VMs mentioned there are some which would benefit greatly from a significantly higher CPU clock rate, but gain almost nothing from adding vCPUs after a certain point. (In my experience that point of diminishing returns is around 6 or 8 despite vendors and application “owners” belief they NEED 32 vCPUs) As such we are deploying a “fast cluster with 6 core 3.4GHz CPUs and will limit the number of vCPUs allowed on VMs hosted there. We are also implementing a test/dev/QA cluster to separate that workload from the two new “wide” 12 core 2.5GHz CPU clusters.

    The concern remains as we move away from a single monolithic “one size fits all” cluster what if load on one of the new clusters becomes too large while another remains quite idle? Time will tell, but I have never been one to guarantee limitations in my steady state operational domain over concerns of remotely possible occurrences. Risk analysis is not simply the implications of something happening, it is also the probability of it happening.

    This reply is already way too long so I will not go into plans for dedicated per cluster storage right now. That is far more related to this post though, so perhaps another time?

Leave a comment