Dogs, Rush hour traffic, and the history of storage I/O benchmarking–Part 2

Part one of "History of storage I/O Benchmarking" attempted to demonstrate how Synthetic Benchmarks on their own simply cannot generate or measure storage performance in a meaningful way. Emulating real workloads seems like such a simple matter to solve, but in truth it is a complex problem involving technical, and non-technical challenges.

  • Assessing workload characteristics is often left to conjecture.  Understanding the correct elements to observe is the first step to simulating them for testing, but how the problem is viewed is often dictated by whatever tool is readily available.  Many of these tools may look at the wrong variables.  A storage array might have great tools for monitoring the array, but they provide an incomplete view as it relates to VM or application performance.
  • Understanding performance in a Datacenter crosses boundaries of subject matter expertise.  A traditional Storage Administrator will see the world in the same way the array views it: blocks, LUNs, queues, and transport protocols.  Ask them about performance and be prepared for a monologue on rotational latencies, RAID striping efficiencies and read/write handling.  What about the latency as seen by the VM?  Don’t be surprised if that is never mentioned.  It may not even be their fault, since their view of the infrastructure may be limited by access control.
  • When introducing a new solution that uses a technology like Flash, the word itself is seen as a superlative, not a technology.  The name implies instant, fast, and other superhero-like qualities.  Brilliant industry marketing, but it comes at a cost.  Storage solutions are often improperly tested after Flash is introduced, because conventional wisdom says it is universally faster than anything in the past.  That is a simplified and incorrect assertion.

Evaluating performance demands a balance of understanding the infrastructure, the workloads, and the software platforms they run on. This takes time and the correct tools for insight – something most environments lack. Part one described the characteristics of real workloads that are difficult to emulate, plus the flawed approach of testing in a clustered compute environment. Unfortunately, it doesn’t end there. There is another factor to consider: the physical characteristics of storage performance tiering layers, and the logic that moves data between those layers.

Storage Performance tiering
Most Datacenters deliver storage performance using multiple persistent storage tiers and various forms of caching and buffering. Synthetic benchmarks force a behavior on these tiers that may be unrealistic. Many times this is difficult to decipher, as the tier sizes and data handling can be obfuscated by a storage vendor or unknown to the tester. What we do know is that storage tiering comes in all shapes and sizes, whether it is a traditional array with data progression techniques, a hybrid array, a decoupled architecture like PernixData FVP, or a Hyper Converged solution. The reality is that this tiering occurs all the time.

With that in mind, there are two distinct approaches to test these environments.

  • Testing storage in a way to guarantee no I/O data comes from and goes to a top performing tier.
  • Testing storage in a way to guarantee that all I/O data comes from and goes to a top performing tier.

Which method is right for you? Neither is inherently right or wrong, as each can serve a purpose. Let’s use the car analogy again.

  • Some might be averse to driving an electric car that only has a 100 mile range.  But what if you had a commute that rarely went more than 30 miles a day?  Think of that range as a caching/buffering tier.  If a caching layer is large enough to serve the workload’s I/O 95% of the time, then it may not be necessary to focus on testing performance from the lower tier of storage.
  • In that same spirit, let’s say that same owner changed jobs and now drives 200 miles a day.  That same car is a pretty poor solution for the job.  Similarly, if a storage array had just 20GB of caching/buffering in front of 100TB of persistent storage, the realistic working set sizes of the VMs that live on that storage would see very little benefit from that 20GB of caching space.  In that case, it would be better to test the performance of the lower tier of storage.  The sketch after this list shows how quickly the math works against a small cache.
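
To put rough numbers behind the second scenario, here is a minimal back-of-the-envelope sketch in Python. It assumes uniformly random access across a working set, which real workloads are not (they skew toward hot data, which helps the cache), and the sizes used are hypothetical; it is only meant to show how quickly a small cache in front of a large working set stops mattering.

```python
def estimated_hit_ratio(cache_gb: float, working_set_gb: float) -> float:
    """Rough hit-ratio estimate assuming uniformly random access across
    the working set. Real workloads are skewed toward hot data, so this
    is a pessimistic floor rather than a prediction."""
    if working_set_gb <= cache_gb:
        return 1.0
    return cache_gb / working_set_gb

# Hypothetical numbers: a cache sized to cover most of the working set,
# versus 20 GB of cache in front of a working set measured in terabytes.
print(estimated_hit_ratio(cache_gb=500, working_set_gb=450))   # 1.0
print(estimated_hit_ratio(cache_gb=20,  working_set_gb=2000))  # 0.01
```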

What about testing the storage in a way that guarantees data comes from all tiers?  Mixing the two sounds ideal, but it often will not simulate the way real data resides on the tiers, and it produces a result that is difficult to map back to how a real workload will behave. Because caching tier sizes are rarely identified, and there is often no true way to isolate a tier, this ironically ends up being the approach most commonly used – by accident alone.

When generating synthetic workloads that have a large proportion of writes, it can be quite easy to hit buffer limit thresholds. Once again, this is because the benchmark commits every CPU cycle to issuing write I/O, and does so for unrealistic periods of time. Even in extremely write-intensive environments, this pattern does not occur. For that reason, a synthetic benchmark can drive a tiered storage solution into a behavior that rarely, if ever, happens in a real world environment.

When generating read-based synthetic tests using a large test file, those reads may sometimes hit the caching tier and other times hit the slowest tier, which can produce sporadic results. The common reaction is to run the test longer. The problem, however, is the testing approach, not the length of the test. Understanding the working set size of a VM is key, and should dictate how best to test in your environment. How do you determine a working set size? Let’s save that for a future post. Ultimately it is real workloads that matter, so the more closely you can emulate them, the better.

Storage caching population and eviction. Not all caching is the same
Caching layers in storage solutions can come in all shapes and sizes, but they depend on rules of engagement that may be difficult to determine. An example of two very important characteristics would be:

  • How they place data in cache.  Is some sort of predictive "data progression" algorithm being used?  Are the tiers using Write-Through caching to populate the cache, in addition to populating it with data fetched from the backend storage?
  • How they choose to evict data from cache.  Does the tier use First-In-First-Out (FIFO), Least Recently Used (LRU), Least Frequently Used (LFU), or some other approach for eviction?

Synthetic benchmarks do not accommodate any of this well.  Real world workloads depend highly on these behaviors, however, and the differences often show up only in production.
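
To make the eviction question concrete, here is a toy sketch of FIFO versus LRU caches. This is not any vendor's actual algorithm, and the block stream is contrived; it simply shows that the same cache size and the same access pattern can produce very different hit rates depending on the eviction policy.

```python
from collections import OrderedDict

class FIFOCache:
    """Evicts the block inserted earliest, regardless of how often it is re-read."""
    def __init__(self, capacity):
        self.capacity, self.blocks = capacity, OrderedDict()
    def access(self, block):
        if block in self.blocks:
            return True                          # hit
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)      # oldest insertion out
        self.blocks[block] = True
        return False                             # miss

class LRUCache:
    """Evicts the block that has gone longest without being accessed."""
    def __init__(self, capacity):
        self.capacity, self.blocks = capacity, OrderedDict()
    def access(self, block):
        if block in self.blocks:
            self.blocks.move_to_end(block)       # refresh recency
            return True                          # hit
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)      # least recently used out
        self.blocks[block] = True
        return False                             # miss

# Two hot blocks (A, B) re-read constantly, interleaved with cold blocks.
stream = ["A", "B", "A", "B", "C", "A", "B", "D",
          "A", "B", "E", "A", "B", "F", "A", "B"]
for cache in (FIFOCache(3), LRUCache(3)):
    hits = sum(cache.access(b) for b in stream)
    print(type(cache).__name__, "hits:", hits, "of", len(stream))
# FIFOCache hits: 6 of 16  -- the hot blocks keep getting cycled out
# LRUCache hits: 10 of 16  -- recency keeps the hot blocks resident
```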

Other testing mistakes
As if there weren’t enough ways to screw up a test, here are a few other common storage performance testing mistakes.

  • Not testing as close to the application level as possible.  This sort of functional testing will be more in line with how the application (and OS it lives on) handles real world data.
  • Long test durations.  Synthetic benchmarks are of little use when run as an exhaustive (multi-hour) set of tests.  They tell you very little, and just waste time.
  • Overlooking a parameter on a benchmark.  Settings matter because they can yield very different results.
  • Misunderstanding the read/write ratios of an environment.  Are you calculating your ratio by IOPS, or by throughput?  This alone can lead to two very different results, as the worked example after this list shows.
  • Misunderstanding the typical I/O sizes in your organization for reads and writes.  How are you choosing to determine what the typical I/O size is?
  • Testing reads and writes like two independent objectives.  Real workloads do not work like this, so there is little reason to test like this.
  • Using a final ‘score’ provided by a benchmark.  The focus should be on the behavior for the duration of the test.  Especially with technologies like Flash, careful attention should be paid to side effects from garbage collection techniques and other events that cause latency spikes. Those spikes matter.
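
Here is the worked example referenced in the read/write ratio item above. The numbers are hypothetical, but they show how the same one-second sample can look read-heavy when measured by IOPS and write-heavy when measured by throughput, simply because the write I/Os are larger.

```python
# Hypothetical one-second sample: many small reads, fewer large writes.
read_iops, read_io_size_kb   = 2000, 8    # 8 KB reads
write_iops, write_io_size_kb = 500, 64    # 64 KB writes

read_mbps  = read_iops  * read_io_size_kb  / 1024
write_mbps = write_iops * write_io_size_kb / 1024

print(f"By IOPS:       {read_iops / (read_iops + write_iops):.0%} read")
print(f"By throughput: {read_mbps / (read_mbps + write_mbps):.0%} read")
# By IOPS:       80% read
# By throughput: 33% read
```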

Testing organizations are often vying for position as a testing authority, or pushing methods and standards that claim to eliminate the mistakes described in this blog post series. Unfortunately, they do not. It does not matter much anyway, because it is your data and your workloads that count.

Making good use of synthetic benchmarks
It may come across that Synthetic Benchmarks or Synthetic Load Generators are useless. That is untrue. In fact, I use them all the time. Just not in the way conventional wisdom indicates. The real benefit comes once you accept the fact that they do not simulate real workloads. Here are a few scenarios in which they are quite useful.

  • Steady-state load generation.  This is especially useful in virtualized environments when you are trying to create load against a few systems.  It can be a great way to learn and troubleshoot.
  • Micro-benchmarking.  This is really about taking a small snippet of a workload and attempting to emulate it for testing and evaluation.  Oftentimes the test may only be 5 to 30 seconds, but it provides a chance to capture what is needed.  It’s more about generating I/O to observe behavior than testing absolute performance.  Look here for a good example, and see the sketch after this list.
  • Comparing independent hardware components.  This is a great way to show the differences between an old and new SSD.
  • Help provide broader insight to the bigger architectural picture.
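
As a rough illustration of the micro-benchmarking idea, here is a minimal Python sketch that issues random reads against an existing file for a few seconds and reports latency percentiles rather than a single score. The file path is a placeholder, and without something like O_DIRECT the reads may be served from the OS page cache, so treat it as a way to observe behavior, not as an authoritative benchmark.

```python
import os, random, statistics, time

def micro_read_test(path, duration_s=30, io_size=4096):
    """Issue random reads of io_size bytes against an existing file for a
    short, fixed duration and report simple latency statistics (ms)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        max_offset = max(os.fstat(fd).st_size - io_size, 0)
        latencies = []
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            offset = random.randrange(0, max_offset + 1, io_size) if max_offset else 0
            start = time.perf_counter()
            os.pread(fd, io_size, offset)          # may be served from page cache
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
    latencies.sort()
    pct = lambda q: latencies[int(q * (len(latencies) - 1))]
    print(f"I/Os: {len(latencies)}  avg: {statistics.mean(latencies):.3f} ms  "
          f"p50: {pct(0.50):.3f} ms  p99: {pct(0.99):.3f} ms")

# micro_read_test("/path/to/testfile", duration_s=15)   # hypothetical path
```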

Observe, Learn and Test
To avoid wasting time "testing" meaningless conditions, spend some time in vCenter, esxtop, and other tools to capture statistics. Learn about your existing workloads before running a benchmark. Collaborating with an internal application owner can make better use of your testing efforts. For instance, if you are looking to improve your SQL performance, create a series of tests or modify an existing batch job to run inside of SQL to establish baselines and observe behavior. Test at the busiest time and the quietest time of the day, as they both provide great data points. This approach was incredibly helpful for me when I was optimizing an environment for code compiling.

Summary
Try not to lose sight of the fact that testing storage performance is not about testing an array. It’s about testing how your workloads behave against your storage architecture. Real applications always tell the real story. The reason most dislike this answer is that it is difficult to repeat, and challenging to measure the right way. Testing the correct way may mean spending a little more time understanding the demand your applications put on your environment.

And here you thought you ran out of things to do for the day. Happy testing.

Dogs, Rush hour traffic, and the history of storage I/O benchmarking–Part 1

Evaluating performance of x86 based servers and workstations has had a history of deficiency. Twenty years ago, Administrators who tested system performance usually did little more than run a simple CPU benchmark to see how much faster a 50MHz system was than a 25 MHz system. Rarely did testing go beyond this. Nostalgia aside, it really was a simpler time.

Fast forward a few years, and testing became slightly more sophisticated. Someone figured out it might be good to test the slowest part of the system (storage), so methods and tools were created to accommodate that. Storage moved beyond the physical confines of the server by using dedicated LUNs in a SAN array. The LUNs may not have been shared, but the fabric and the entry points to the array were. However, testing storage generally marched forward with little change. Virtualization changed the landscape even further by doing away with the notion of a dedicated LUN for a single system. Now, the fabric and every component of the storage system was shared.

Testing tools came and went, with some being nothing more than orphaned side projects. Some tools have more dials to turn, but many still run under the assumption that they are testing a physical host on local spinning disk. They do little to try to emulate a real workload, as they have no idea what that means. Many times these tools try to combine load generation with a single, final number for performance measurement. Almost as if whatever happened in between the start and finish didn’t matter.

Testing methods didn’t evolve much either. The quest for “top speed” was never supplanted by any other approach, which is noteworthy considering that a critical measurement of anything shared is performance under load or contention. Storage architectures and the media they use have evolved, but that is rarely accounted for properly in testing. Often lost in the speeds-and-feeds discussion is the part that really counts – the performance of the applications and the VMs they live on.

This post will point out the flaws of synthetic testing of storage performance (the tools, and the techniques), but it may incorrectly give the impression that they are useless.  Quite the contrary actually.  They can be very helpful when used the correct way, and for the right reasons.  More on this later.

Deficiencies of benchmarks as a meaningful measuring stick
“I don’t use benchmarks. I have users” — Ancient Twitter Proverb

There is no substitute for the value of observing real world performance characteristics, but that does little to address the difficulty of measuring that performance in a repeatable way. Real workloads are a collection of widely moving variables that all have different types of impact on an environment and a user experience. Testing system performance is important, but only when it is properly understood what the testing tools are producing.

Synthetic benchmarks offer a number of benefits. They are typically very easy to run, and often produce some dashboard result that can be referenced later. But these tools and test methods share common characteristics that rarely generate anything resembling real data patterns. Among those shortcomings are:

  • I/O generated from them is not a closed loop dialogue
  • They do not mimic dynamic variables of real workloads
  • Improper testing practices in a clustered compute environment

All of these warrant more detail, so let’s elaborate on each one.

I/O generated from them is not a closed loop dialogue
A simplified way to describe a typical I/O dialogue would be this: data is fetched, it is processed in some way, then it is sent on its way. The I/O “signature” of a workload could be described by the pattern and degree to which this dialogue occurs.  It is a pattern that repeats frequently if you observe workloads long enough.

Consider the fetching of some data, the processing of some data, and the writing of data. One might liken this process to a dog fetching a ball, you wiping the slobber off, and throwing it out again. Over and over again, and in that order. Single threaded of course.

[Image: a real workload’s I/O dialogue – fetch, process, write, repeated in order]

Synthetic load generators attack this quite differently. The one and only job of a synthetic I/O generator is to fill up the queues as fast as possible using every CPU cycle. The I/O generator has no regard for the data itself. The data is not processed in any way, because that is not what was asked of it. By comparison, synthetic I/O pretty much looks like this:

[Image: a synthetic I/O generator – a one-way stream of requests with no processing in between]

With a synthetic I/O generator, every CPU cycle is pushed to perform a singular action. Reads are requested and writes are issued in an all-or-nothing fashion. Sure, some generators allow you to mix reads and writes, but the problem remains: they do not reflect any meaningful dialogue, and cannot mimic a real workload.
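
The difference is easier to see when sketched out. The snippet below is a deliberately simplified illustration rather than a real load generator: the closed-loop worker cannot issue its next read until it has processed and written the previous result, while the open-loop generator just keeps issuing I/O for as long as the clock allows.

```python
import time

def closed_loop_worker(fetch, process, write, iterations):
    """Real workload shape: each I/O depends on the previous one, and the
    'process' step (think time) naturally paces the I/O rate."""
    for _ in range(iterations):
        data = fetch()            # read, wait for completion
        result = process(data)    # CPU/memory work between I/Os
        write(result)             # write back, wait for completion

def open_loop_generator(issue_io, duration_s):
    """Synthetic benchmark shape: issue I/O as fast as the CPU allows,
    with no dependency on, or processing of, what comes back."""
    deadline = time.monotonic() + duration_s
    count = 0
    while time.monotonic() < deadline:
        issue_io()                # keep the queue full; nothing is consumed
        count += 1
    return count
```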

They do not mimic dynamic variables of real workloads
Real workloads consume resources (CPU, memory, storage) far differently than their synthetic counterparts. At any given time, storage I/Os will be a varying mix of reads versus writes and I/O sizes, coming from one or many CPU threads. The two images below show a 6 1/2 minute snippet of real I/O taken from a single VM in a production environment (using vscsiStats and Excel surface charts). During this time of heavy activity, notice how much the type of I/O in play varies.

Below you see the number of read I/Os, and the respective I/O sizes. Typically between 4K and 32K in size.

[Image: vscsiStats surface chart of read I/O counts and sizes over time]

Below you see the number of write I/Os, and the respective I/O sizes. The majority range from 32K to 512K in size. This is occurring at the same time on the same VM.

[Image: vscsiStats surface chart of write I/O counts and sizes over time]

Here you can see that read/write ratios vary for just this single VM, and more importantly, the size and the number of I/Os are all over the place. I/O sizes can have an enormous impact on storage performance, so one can imagine the difficulty in emulating them accurately. VMware’s I/O Analyzer attempts to simulate this by replaying a trace file captured from a real workload, but it still will not behave in the same way as multiple VMs on multiple hosts generating widely varying I/O patterns.  Analytics from storage arrays don’t help much either, as they are unable to see the pattern of data in this way.
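
To give a sense of what "dynamic" means here, the sketch below generates a hypothetical I/O stream whose read/write mix, I/O rate, and I/O sizes drift from one second to the next. The distributions are made up for illustration (they are not derived from the vscsiStats charts above); the point is that a fixed-parameter synthetic test holds all of these values constant for the entire run, which real workloads never do.

```python
import random

# Hypothetical, illustrative size distributions only.
READ_SIZES  = [4096, 8192, 16384, 32768]                 # bytes
WRITE_SIZES = [32768, 65536, 131072, 262144, 524288]

def per_second_profile(seconds):
    """Summarize a made-up workload whose I/O rate, read/write mix, and
    I/O sizes drift from one second to the next."""
    for t in range(seconds):
        read_fraction = random.uniform(0.2, 0.9)         # mix drifts
        iops = random.randint(200, 3000)                 # rate drifts
        reads = writes = read_bytes = write_bytes = 0
        for _ in range(iops):
            if random.random() < read_fraction:
                reads += 1
                read_bytes += random.choice(READ_SIZES)   # sizes drift
            else:
                writes += 1
                write_bytes += random.choice(WRITE_SIZES)
        yield t, reads, writes, read_bytes, write_bytes

for t, r, w, rb, wb in per_second_profile(5):
    print(f"t={t}s  {r + w} IOPS  "
          f"{r / (r + w):.0%} read by IOPS  "
          f"{rb / (rb + wb):.0%} read by throughput")
```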

Improper testing practices in a clustered compute environment
A typical Administrator who tests storage performance usually does so by setting up a single system (VM) to test peak performance (IOPS, throughput & latency) against a shared storage backend. It sounds logical at first, but this method doesn’t reflect the way data is handled in a clustered compute environment. Storage I/O from a clustered compute arrangement behaves in a way that is not unlike congestion on an interstate freeway. The performance of a freeway cannot be evaluated by a single car driving on it. Its performance measurement is derived when it is under load, with multiple cars of different intentions, destinations, and sizes, and all of the other variables that introduce congestion. Modern traffic simulation and modeling solutions account for all of these variables to measure and improve what matters most – real traffic.

Unfortunately, most testers and tools take this same “single car” approach, and do not account for one of the most important elements in modern virtualized infrastructures: the clustered compute layer. A fast storage infrastructure needs to be able to handle the given number of compute nodes (physical hosts) now and in the future. After all, the I/Os are ALWAYS generated by the VMs and the collection of hosts they live on – not by the backend storage. Painfully obvious, but often overlooked.

Take a look below. This is an illustration of I/O activity in a real environment (traditional clustered compute with SAN architecture), with green lines representing read I/Os and red lines representing write I/Os.

[Image: read and write I/O from multiple hosts and VMs in a real clustered environment]

Now, let’s look at I/O activity from a traditional synthetic benchmarking approach. As you can see below, it looks pretty different.

[Image: I/O generated by a traditional synthetic benchmark – a single VM on a single host driving the array]

Storage I/O generated in a real environment is a result of the number of nodes in a cluster and the workloads running on them. So a better (but far from perfect) way to test, at the very minimum, would be:

[Image: a better approach – synthetic workloads generated from multiple VMs across multiple hosts in the cluster]

In a traditional architecture, where an array exceeds the capabilities of the I/O generated from a single host, this is the method most commonly used to measure the absolute high water mark numbers of that array. Storage manufacturers typically quote numbers from the array because it is a single point of measurement that fits nicely in marketing materials. Unfortunately, it doesn’t measure what really counts: the numbers as seen by the VMs – a fact that must not be forgotten when performing any sort of performance testing. The array, of course, always hits a limit at some point due to its own characteristics, or the fabric it has to traverse.

In any sort of clustered compute system, you cannot recognize the full power of the compute platform by testing from one VM. The same goes for any type of distributed storage architecture. With clustered host-based acceleration solutions like PernixData FVP, or even Hyper Converged solutions, the approach will have to be similar to the above in order to measure correctly. These are different architectures that reshape the traditional data path, and the testing recommendations above should help in evaluating their performance. This approach also puts the focus back where it should be: the performance of the VMs, and not some irrelevant numbers from a physical storage array.
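
One practical consequence of measuring at the VM level: report results per VM, not just as an array-wide aggregate. The latency samples below are hypothetical, but they show how a reasonable-looking array-wide average can hide a VM that is suffering under contention.

```python
import statistics

# Hypothetical per-VM effective latency samples (ms) collected during a
# multi-VM, multi-host test, e.g. exported from vCenter performance data.
per_vm_latency_ms = {
    "vm-app01": [1.1, 1.3, 1.2, 1.4, 1.2],
    "vm-db01":  [1.5, 1.8, 1.6, 1.9, 1.7],
    "vm-batch": [1.2, 9.8, 14.5, 12.1, 1.3],   # suffering under contention
}

all_samples = [s for samples in per_vm_latency_ms.values() for s in samples]
print(f"Array-wide average: {statistics.mean(all_samples):.1f} ms")  # looks tolerable

for vm, samples in per_vm_latency_ms.items():
    print(f"{vm}: avg {statistics.mean(samples):.1f} ms, max {max(samples):.1f} ms")
# vm-batch's average and maximum stand out, even though the array-wide
# average looks fine.
```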


Proper simulation of I/O across all hosts will allow you to adequately factor in the performance of the storage fabric and all connection points. Most fabrics are quite fast when there isn’t any traffic on them. Unfortunately, that isn’t very realistic. It is important to understand the impact the fabric introduces as the environment is scaled. Since the fabric is what connects all of the hosts fetching and committing data to a storage array, we need to simulate how everything (HBAs, storage array controllers, switches, etc.) performs under contention. If you haven’t already done so, take a few moments to read one of my favorite posts from Frank Denneman, Data Path is not managed as a clustered resource.

Testing with multiple VMs on multiple hosts with FVP also allows you to take advantage of the per-VM acceleration (write buffering & read caching) capabilities across a clustered compute environment. It is one of the reasons why FVP’s decoupled architecture can scale so well, and why real workloads benefit so much from the architecture.

You thought we were finished, didn’t you
There are just too many ways Synthetic Benchmarks are misused to cover in just one post. Stay tuned for Part 2 for more observations on why they are inadequate as a single test for modern environments, and most importantly, when and how they can actually be useful.