Viewing the impact of block sizes with PernixData Architect

In the post Understanding block sizes in a virtualized environment, I describe what block sizes are as they relate to storage I/O, and how they became one of the most overlooked metrics of virtualized environments.  The first step in recognizing their importance is providing visibility to them across the entire Data Center, but with enough granularity to view individual workloads.  However, visibility into a metric like block sizes isn’t enough. The data itself has little value if it cannot be interpreted correctly. The data must be:

  • Easy to access
  • Easy to interpret
  • Accurate
  • Easy to understand how it relates to other relevant metrics

Future posts will cover specific scenarios detailing how this information can be used to better understand, and tune your environment for better application performance. Let’s first learn how PernixData Architect presents block size when looking at very basic, but common read/write activity.

Block size frequencies at the Summary View
When looking at a particular VM, the "Overview" view in Architect can be used to show the frequency of I/O sizes across the spectrum of small blocks to large blocks for any time period you wish. This I/O frequency will show up on the main Summary page when viewing across the entire cluster. The image below focuses on just a single VM.


vmpete-summary

What is most interesting about the image above is that the frequency of block sizes differs between reads and writes. This is a common characteristic that has largely gone unnoticed because there has been no easy way to view that data in the first place.

Block sizes using a "Workload" View
The "Workload" view in Architect presents a distribution of block sizes in a percentage form as they occur on a single workload, a group of workloads, or across an entire vSphere cluster. The time frame can be any period that you wish. This view tells more clearly and quickly than any other view as to how complex, unique, and dynamic the distribution of block sizes are for any given VM. The example below represents a single workload, across a 15 minute period of time. Much like read/write ratios, or other metrics like CPU utilization, it’s important to understand these changes as they occur, and not just a single summation or percentage over a long period of time.

vmpete-workloadview

When viewing the "Workload" view in any real world environment, it will instantly provide new perspective on the limitations of Synthetic I/O generators. Their general lack of ability to emulate the very complex distribution of block sizes in your own environment limit their value for that purpose. The "Workload" view also shows how dramatic, and continuous the changes in workloads can be. This speaks volumes as to why one-time storage assessments are not enough. Nobody treats their CPU or memory resources in that way. Why would we limit ourselves that way for storage?

Keep in mind that the "Workload" view illustrates this distribution in a percentage form. Views that are percentage based aim to illustrate proportion relative to a whole. They do not show the absolute values behind those percentages. The distribution could represent 50 IOPS, or 5,000 IOPS. However, this type of view can be incredibly effective in identifying subtle changes in a workload or across an environment, for short term analysis or long term trending.
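
To make the proportional nature of this view concrete, here is a minimal sketch (the block sizes and I/O counts are invented purely for illustration) showing how two workloads with wildly different absolute I/O counts can produce an identical percentage distribution.

```python
# Hypothetical per-block-size I/O counts for two sample intervals.
# The absolute volume differs by 100x, yet the percentage view looks identical.
light_load = {"4K": 20, "8K": 10, "32K": 15, "64K": 5}          # 50 I/Os total
heavy_load = {"4K": 2000, "8K": 1000, "32K": 1500, "64K": 500}  # 5,000 I/Os total

def distribution(io_counts):
    """Convert absolute I/O counts per block size into a percentage distribution."""
    total = sum(io_counts.values())
    return {size: round(100.0 * count / total, 1) for size, count in io_counts.items()}

print(distribution(light_load))  # {'4K': 40.0, '8K': 20.0, '32K': 30.0, '64K': 10.0}
print(distribution(heavy_load))  # identical percentages, 100x the I/O
```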

Block sizes in an IOPS and Throughput view
First, let’s look at this VM’s IOPS using the default "Read/Write" breakdown. The image below shows a series of reads followed by a series of mostly writes. When you look back at the "Workload" view above, you can see how these I/Os were represented by block size distribution.

vmpete-IOPSrw

Staying on this IOPS view and selecting the predefined "Block Size" breakdown, we can see the absolute numbers that are occurring based on block size. Unlike the "Workload" view above, the image below shows the actual number of I/Os issued for each block size.

vmpete-IOPSblocksize

But that doesn’t tell the whole story. A block size is an attribute of a single I/O. So in an IOPS view, 10 IOPS of 4K blocks looks the same as 10 IOPS of 256K blocks. In reality, the latter is 64 times the amount of data. The way to view this from a "payload amount transmitted" perspective is to use the Throughput view with the "Block Size" breakdown, as shown below.

vmpete-TPblocksize

Viewing block sizes by payload size (Throughput), as shown above, provides a much better representation of the dominance of the large block sizes, and the relatively small payload delivered by the smaller block sizes.
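
To put numbers behind that 4K versus 256K comparison, here is a minimal sketch (the IOPS figures are invented) that converts an IOPS breakdown by block size into the throughput each block size actually delivers.

```python
# Invented IOPS figures; block sizes expressed in KB.
iops_by_block_size_kb = {4: 10, 64: 10, 256: 10}

for size_kb, iops in iops_by_block_size_kb.items():
    # Throughput contributed by a block size = IOPS x payload per I/O.
    throughput_mb_s = iops * size_kb / 1024
    print(f"{size_kb}K blocks: {iops} IOPS -> {throughput_mb_s:.2f} MB/s")

# Output: the 256K blocks move 64 times the data of the 4K blocks
# (2.50 MB/s versus 0.04 MB/s) for exactly the same IOPS.
```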

Here is another way Architect can help you visualize this data. We can click on the "Performance Grid" view and change the view so that we have IOPS and Throughput but for specific block sizes. As the image below illustrates, the top row shows IOPS and Throughput for 4K to <8K blocks, while the bottom row shows IOPS and Throughput for blocks over 256K in size.

vmpete-performancegrid-IOPSTP

What the image above shows is that while the number of IOPS for block sizes in the 4K to <8K range at its peak was similar to the number of IOPS for block sizes of 256K and above, the larger blocks delivered an enormously larger payload.

Why does it matter?
Let’s let PernixData Architect tell us why all of this matters. We will look at the effective latency of the VM over that same time period. We can see from the image below that the effective latency of the VM definitely increased as it transitioned to predominantly writes. (Read/Write boxes unticked for clarity).

vmpete-Latencyrw

Now, let’s look at the image below, which shows latency by block size using the "Block Size" breakdown.

vmpete-Latency-blocksize

There you see it. Latency was by and large a result of the larger block sizes. The flexibility of these views can take an otherwise innocent looking latency metric and tell you what was contributing most to that latency.

Now let’s take it a step further. With Architect, the "Block Size" breakdown is a predefined view that shows the block size characteristics of reads and writes combined, whether you are looking at Latency, IOPS, or Throughput. However, you can use a custom breakdown to not only show block sizes, but show them specifically for reads or writes, as shown in the image below.

vmpete-Latency-blocksize-custom

The "Custom" Breakdown for the Latency view shown above had all of the reads and writes of individual block sizes enabled, but some of them were simply "unticked" for clarity. This view confirms that the majority of latency was the result of writes that were 64K and above. In this case, we can clearly demonstrate that latency seen by the VM was the result of larger block sizes issued by writes. It’s impact however is not limited to just the higher latency of those larger blocks, as those large block latencies can impact the smaller block I/Os as well. Stay tuned for more information on that subject.

As shown in the image below, Architect also allows you to simply click on a single point, and drill in for more insight. This can be done on a per VM basis, or across the entire cluster. Hovering over each vertical bar representing the various block sizes will tell you how many I/Os were issued at that time, and the corresponding latency.

vmpete-Latency-Drilldown

Flash to the rescue?
It’s pretty clear that block size can have a significant impact on the latency your applications see. Flash to the rescue, right?  Well, not exactly. All of the examples above come from VMs running on Flash. Flash, and how it is implemented in a storage solution, is part of what makes this so interesting, and so impactful to the performance of your VMs. We also know that the storage media is just one component of your storage infrastructure. These components, and their abilities to hinder performance, exist regardless of whether one is using a traditional three-tier architecture, or a distributed storage architecture like a Hyper Converged environment.

Block sizes in the Performance Matrix
One unique view in Architect is the Performance Matrix. Unique in what it presents, and how it can be used. Your storage solution might have been optimized by the manufacturer based on certain assumptions that may not align with your workloads. Typically there is no way of knowing that. As shown below, Architect can help you understand the workload characteristics under which the array begins to suffer.

vmpete-PeformanceMatrix

The Performance Matrix can be viewed on a per VM basis (as shown above) or in an aggregate form. It’s a great view for seeing at what block size thresholds your storage infrastructure may be suffering, as the VMs see it. This is very different from the statistics provided by an array, as Architect offers a complete, end-to-end understanding of these metrics with extraordinary granularity. Arrays are not in the correct place to accurately understand, or measure this type of data.

Summary
Block sizes have a profound impact on the performance of your VMs, and are a metric that should be treated as a first class citizen just like compute and other storage metrics. The stakes are far too high to leave this up to speculation, or words from a vendor that say little more than "Our solution is fast. Trust us." Architect leverages its visibility of block sizes in ways that have never been possible. It takes advantage of this visibility to help you translate what it is into what it means for your environment.

Understanding block sizes in a virtualized environment

Cracking the mysteries of the Data Center is a bit like space exploration. You think you understand what everything is, and how it all works together, but struggle to understand where fact and speculation intersect. The topic of block sizes, as they relate to storage infrastructures, is one such mystery. The term is familiar to some, but elusive enough that many remain uncertain as to what it is, or why it matters.

This inconspicuous, but all too important characteristic of storage I/O has often been misunderstood (if not completely overlooked) by well-intentioned Administrators attempting to design, optimize, or troubleshoot storage performance. Much like the topic of Working Set Sizes, block sizes are not of great concern to an Administrator or Architect because of this lack of visibility and understanding. Sadly, myth turns into conventional wisdom – in not only what is typical in an environment, but how applications and storage systems behave, and how to design, optimize, and troubleshoot for such conditions.

Let’s step through this process to better understand what a block is, and why it is so important to understand its impact on the Data Center.

What is it?
Without diving deeper than necessary, a block is simply a chunk of data. In the context of storage I/O, it would be a unit in a data stream; a read or a write from a single I/O operation. Block size refers to the payload size of a single unit. We can blame a bit of the confusion about what a block is on overlap in industry nomenclature. Commonly used terms like block sizes, cluster sizes, pages, latency, etc. may be used in disparate conversations, but what is being referred to, how it is measured, and by whom can often vary. Within the context of discussing file systems, storage media characteristics, hypervisors, or Operating Systems, these terms are used interchangeably, but do not have universal meaning.

Most who are responsible for Data Center design and operation know the term as an asterisk on a performance specification sheet of a storage system, or a configuration setting in a synthetic I/O generator. Performance specifications on a storage system are often the result of a synthetic test using the most favorable block size (often 4K or smaller) for an array to maximize the number of IOPS that an array can service. Synthetic I/O generators typically allow one to set this, but users often have no idea what the distribution of block sizes is across their workloads, or if it is even possible to simulate that with synthetic I/O. The reality is that many applications draw a unique mix of block sizes at any given time, depending on the activity.

I first wrote about the impact of block sizes back in 2013 when introducing FVP into my production environment at the time. (See section "The IOPS, Throughput & Latency relationship")  FVP provided a tertiary glimpse of the impact of block sizes in my environment. Countless hours with the performance graphs, and using vscsistats provided new insight about those workloads, and the environment in which they ran. However, neither tool was necessarily built for real time analysis or long term trending of block sizes for a single VM, or across the Data Center. I had always wished for an easier way.

Why does it matter?
The best way to think of block size is as the amount of storage payload carried in a single unit.  The physics of it becomes obvious when you think about the size of a 4KB payload versus a 256KB payload, or even a 512KB payload. Since we refer to them as blocks, let’s use a square to represent their relative capacities.

image

Throughput is the result of IOPS and the block size of each I/O being sent or received. It’s not just the fact that a 256KB block carries 64 times the amount of data that a 4K block does; it is also the additional effort it takes throughout the storage stack to handle it, whether that be bandwidth on the fabric, the protocol, or processing overhead on the HBAs, switches, or storage controllers. And let’s not forget the burden it places on the persistent media.
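
Since throughput is just IOPS multiplied by block size, the same math works in reverse: a fixed amount of bandwidth supports very different IOPS ceilings depending on block size. A minimal sketch, assuming a hypothetical 400 MB/s of usable fabric bandwidth:

```python
# Hypothetical figure: 400 MB/s of usable fabric bandwidth.
fabric_budget_mb_s = 400

for block_size_kb in (4, 64, 256, 512):
    # Throughput = IOPS x block size, so the IOPS ceiling is bandwidth / block size.
    max_iops = fabric_budget_mb_s * 1024 // block_size_kb
    print(f"{block_size_kb:>3}K blocks: at most {max_iops:,} IOPS within {fabric_budget_mb_s} MB/s")
```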

This variability in performance is more prominent with Flash than traditional spinning disk.  Reads are relatively easy for Flash, but the methods used for writing to NAND Flash can prevent writes from achieving the same performance as reads, especially with large block writes. (For more detail on the basic anatomy and behavior of Flash, take a look at Frank Denneman’s post on Flash wear leveling, garbage collection, and write amplification. Here is another primer on the basics of Flash.)  A very small number of writes using large blocks can trigger all sorts of activity on the Flash devices that prevents the effective performance from behaving as it does with smaller block I/O. This volatility in performance is a surprise to just about everyone when they first see it.

Block size can impact storage performance regardless of the type of storage architecture used. Whether it is a traditional SAN infrastructure, or a distributed storage solution used in a Hyper Converged environment, the factors, and the challenges, remain. Storage systems may be optimized for a block size that may not necessarily align with your workloads. This could be the result of design assumptions of the storage system, or limits of its architecture.  The ability of storage solutions to cope with certain workload patterns varies greatly as well.  The difference between a good storage system and a poor one often comes down to its ability to handle large block I/O.  Insight into this information should be a part of the design and operation of any environment.

The applications that generate them
What makes the topic of block sizes so interesting are the Operating Systems, the applications, and the workloads that generate them. The block sizes are often dictated by the processes of the OS and the applications that are running in them.

Contrary to what many might think, there is often a wide mix of block sizes in use at any given time on a single VM, and it can change dramatically by the second. These changes have a profound impact on the ability of the VM, and the infrastructure it lives on, to deliver the I/O in a timely manner. It’s not enough to know that perhaps 30% of the blocks are 64KB in size. One must understand how they are distributed over time, and how latencies or other attributes of those blocks of various sizes relate to each other. Stay tuned for future posts that dive deeper into this topic.
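
One simple way to reason about that "over time" dimension is to bucket I/Os into short time windows and keep a block size histogram per window. The sketch below uses a fabricated I/O trace to show the idea; it is not how any particular product implements it.

```python
from collections import Counter, defaultdict

# Fabricated I/O trace: (timestamp in seconds, block size in KB).
trace = [(0.2, 4), (0.7, 4), (1.1, 64), (1.4, 256), (2.3, 4), (2.8, 256)]

window_s = 1.0                     # one-second windows
histograms = defaultdict(Counter)  # window index -> block size histogram

for timestamp, size_kb in trace:
    histograms[int(timestamp // window_s)][size_kb] += 1

for window, counts in sorted(histograms.items()):
    print(f"t={window}s..{window + 1}s: {dict(counts)}")
# The mix in each window can (and usually does) look very different from the last.
```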

Traditional methods capable of visibility
The traditional methods for viewing block sizes have been limited. They provide an incomplete picture of the impact of block sizes, whether it be across the Data Center, or against a single workload.

1. Kernel statistics courtesy of vscsiStats. This utility is a part of ESXi, and can be executed via the command line of an ESXi host. The utility provides a summary of block sizes for a given period of time, but suffers from a few significant problems:

  • Not ideal for anything but a very short snippet of time, against a specific vmdk.
  • Cannot present data in real-time.  It is essentially a post-processing tool.
  • Not intended to show data over time.  vscsiStats will show a sum total of I/O metrics for a given period of time, but only for a single sample period.  It has no way to track this over time.  One must script this to create results for more than a single period of time (a sketch of such a sampling loop appears below).
  • No context.  It treats that workload (actually, just the VMDK) in isolation.  It is missing the context necessary to properly interpret the data.
  • No way to visually understand the data.  This requires the use of other tools to help visualize the data.

The result, especially at scale, is a very labor intensive exercise that is an incomplete solution. It is extremely rare that an Administrator runs through this exercise on even a single VM to understand their I/O characteristics.
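
To give a sense of the scripting involved, here is a minimal sketch of such a sampling loop. It assumes it runs on an ESXi host where vscsiStats and a Python interpreter are available, that the VM’s world group ID has already been found with vscsiStats -l, and that the commonly documented -s, -p ioLength, and -x options behave as described; treat it as a sketch rather than a finished tool.

```python
import subprocess
import time

WORLD_GROUP_ID = "12345"  # hypothetical; find the real ID with: vscsiStats -l
SAMPLES = 5               # number of sample dumps to collect
INTERVAL_S = 60           # seconds between dumps

# Start collection for the chosen VM (world group).
subprocess.run(["vscsiStats", "-s", "-w", WORLD_GROUP_ID], check=True)
try:
    for sample in range(SAMPLES):
        time.sleep(INTERVAL_S)
        # Dump the I/O length (block size) histogram. The counters accumulate
        # from the moment collection starts, so successive dumps must be
        # diffed to get per-interval figures.
        result = subprocess.run(
            ["vscsiStats", "-p", "ioLength", "-w", WORLD_GROUP_ID],
            capture_output=True, text=True, check=True)
        with open(f"ioLength_sample_{sample}.txt", "w") as f:
            f.write(result.stdout)
finally:
    # Stop collection when finished.
    subprocess.run(["vscsiStats", "-x", "-w", WORLD_GROUP_ID], check=True)
```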

2. Storage array. This would be a vendor specific "value add" feature that might present some simplified summary of data with regards to block sizes, but this too is an incomplete solution:

  • Not VM aware.  Since most intelligence is lost the moment storage I/O leaves a host HBA, a storage array would have no idea what block sizes were associated with a VM, or what order they were delivered in.
  • Measuring at the wrong place.  The array is simply the wrong place to measure the impact of block sizes in the first place.  Think about all of the queues storage traffic must go through before the writes are committed to the storage, and reads are fetched. (It also assumes no caching tiers outside of the storage system exist).  The desire would be to measure at a location that takes all of this into consideration; the hypervisor.  Incidentally, this is often why an array can show great performance on the array, but suffer in the observed latency of the VM.  This speaks to the importance of measuring data at the correct location. 
  • Unknown and possibly inconsistent method of measurement.  Showing any block size information is not a storage array’s primary mission, and doesn’t necessarily provide the same method of measurement as where the I/O originates (the VM, and the host it lives on). Therefore, how it is measured, and how often it is measured is generally of low importance, and not disclosed.
  • Dependent on the storage array.  If different types of storage are used in an environment, this doesn’t provide adequate coverage for all of the workloads.

The Hypervisor is an ideal control plane to analyze the data. It focuses on the results of the VMs without being dependent on nuances of in-guest metrics or a feature of a storage solution. It is inherently the ideal position in the Data Center for proper, holistic understanding of your environment.

Eyes wide shut – Storage design mistakes from the start
The flaw with many design exercises is that we assume we know what our assumptions are. Let’s consider the typical inputs when it comes to storage design. These include factors such as:

  • Peak IOPS and Throughput.
  • Read/Write ratios
  • RAID penalties
  • Perhaps some physical latencies of components, if we wanted to get fancy.

Most who have designed or managed environments have gone through some variation of this exercise, followed by a little math to come up with the correct blend of disks, RAID levels, and fabric to support the desired performance. Known figures are used when they are available, and the others might be filled in with assumptions.  And yet, block sizes, and everything they impact, are nowhere to be found. Why? Lack of visibility, and understanding.
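
For reference, that "little math" usually looks something like the sketch below (all figures invented): front-end IOPS, a read/write ratio, and a RAID write penalty go in, and a back-end IOPS requirement and spindle count come out. Notice that block size appears nowhere in it.

```python
import math

# Invented design inputs of the traditional kind.
peak_frontend_iops = 5000
read_ratio, write_ratio = 0.7, 0.3
raid_write_penalty = 4   # e.g. RAID 5: one logical write becomes 4 back-end I/Os
iops_per_disk = 180      # assumed per-spindle figure

# Classic back-end sizing: reads pass through, writes are multiplied by the RAID penalty.
backend_iops = peak_frontend_iops * (read_ratio + write_ratio * raid_write_penalty)
disks_needed = math.ceil(backend_iops / iops_per_disk)

print(f"Back-end IOPS required: {backend_iops:.0f}")  # 9500
print(f"Spindles required:      {disks_needed}")      # 53
# Block size never enters this calculation, yet it drives throughput and latency.
```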

If we know that block sizes can dramatically impact the performance of a storage system (as will be shown in future posts), shouldn’t they be a part of any design, optimization, or troubleshooting exercise?  Of course they should.  Just as with working set sizes, lack of visibility doesn’t excuse lack of consideration.  An infrastructure only exists because of the need to run services and applications on it. Let those applications and workloads help tell you what type of storage fits your environment best. Not the other way around.

Is there a better way?
The ideal approach for measuring the impact of block sizes will always include measuring from the location of the hypervisor, as this provides the measurements in the right way, and from the right place.  vscsiStats and vCenter related metrics are an incredible resource to tap into, and will provide the best understanding of the impact of block sizes on a storage system.  There may be some time investment in deciphering the block size characteristics of a workload, but the payoff is generally worth the effort.