Using vscsiStats to better visualize storage I/O

As the saying goes, a picture is worth a thousand words. This isn’t more true than in the world of data visualization. Raw data has its place, but good visualization methods help translate numbers into a meaningful story, and assist with overcoming the deficiencies associated with looking at a spreadsheet of raw numbers. A good visual representation of data gives context, establishes relationships between the numbers, communicates results more clearly; making it easier for you and others to remember. The difference for you as an Administrator can be better approaches to trouble shooting, or helping you in your ability to make smart design and purchasing decisions.

Virtualization Administrators are faced with digesting performance information quite often. vCenter does a pretty good job of letting the Administrator skip the data collection nonsense, and jump into viewing relevant metrics in an easy to read manner. But the vCenter metrics do not always give a complete view of information available, and occasionally needs a little help when one is trying to better understand key performance indicators.

A different way to use vscsiStats
“Some people see the glass half full. Others see it half empty. I see a glass that’s twice as big as it needs to be.” — George Carlin.

VMware’s vscsiStats is a great tool to collect and view storage I/O data in a different way. It can help to harvest a wealth of information about VMs that can be manipulated in a number of ways. For as good as it is, I believe it suffers a bit in that it is geared toward providing summations of a single sample period of time. One can collect all sorts of great information during a specific period, but it gives you no idea of what happened when, and why. To be truly useful, it needs to handle continuous, adjacent sampling periods.

But fear not, with a little extra effort, vscsiStats can be manipulated to factor in time. Combine those results with an Excel 3D surface chart, and you have some neat new ways to interpret the data. Erik Zandboer has fantastic information on how to leverage vscsiStats to generate multiple sampling periods. Combine this with a nice template he provides, and most of the heavy lifting is done for you already. Having that created already was great for me, as I find that the fun is not in generating the graphs, but interpreting, and learning from them.

In an effort to see how similar data can look a bit differently using other tools, let us take a look at a production VM running a real code compiling workload. The area in the red bubble is the time period we will be concentrating on. The screen capture below shows the CPU utilization for the 8 vCPU VM.

image

The screen capture below shows the storage related metrics for the specific VMDK of the VM, such as read and write IOPS, latency, and number of outstanding commands. In this particular case, the VM is being accelerated by PernixData FVP, but I changed the configuration so that it was only accelerating reads via its "write-through" configuration. Write I/Os are limited to the speed of the backing physical infrastructure. I did this to provide some more interesting graphs, as you will see in a bit.

image

Now it is time to use vscsiStats to look at similar storage related metrics. In this case, vscsiStats sampled the data in 20 second intervals, for a duration of 400 seconds, and reflects the time period within the red bubbles in the screen captures above. It is a relatively short amount of time for observations, but I didn’t want to smooth the data too much by choosing a long sample interval. In the charts below, read related activity is in green, and write related activity is in dark red. Note that on values such as latency, and I/O size, the axis will use a logarithmic scale.

I/O Size
First, lets take a look at I/O size for reads

image

You see from above that read I/Os from this period of time were mostly 4K and 32K in size. Contrast this with the write I/Os that are shown for the same sample period below.

image

The image above shows a significant amount of write I/Os at 32K, 64K, 128K, 256K, and 512K. Notice how much different that looks as compared to the read I/Os. Unlike read I/Os, we know write I/O sizes tend to have a more significant impact on latency.

Latency
Now let’s take a look at latency.

image

Many of the read I/Os shown above come in at around .5ms to 1ms latency. Reads I/Os can be an easier I/O to satisfy, and the latency reflects that. The image below shows many of the writes coming in between 5ms and 15ms or higher. Just like with the other graphs, we get a better understanding of the magnitude of I/Os (z axis) that come in at a given measurement.

image

Outstanding I/Os
This shows the number of outstanding read I/Os when a new read I/O is issued. As you can see below, the reads are being served pretty fast, and does not have more than around 1 or 2 outstanding read I/Os. In an ideal world we would want this to be as low as possible for all reads and writes.

image

However, you can see that with writes, it is quite a different story. The increased latency, which comes in part due to the larger I/O sizes used, impacts the number of outstanding write I/Os waiting. The image shows a several points in which the number of outstanding write I/Os surpassed 20. I find this image below visually one of the most impactful.

image

Sequential versus Random
vscsiStats also demonstrates whether the I/O of a given workload is sequential, or random.

image

With both reads, and writes, you can see that this particular snippet of a workload is predominately random I/O. Sequential I/O would all be very closely aligned with the ‘0’ value near the middle of the graph.

image

Conclusion
You can see that from this very small, 6 1/2 minute time period on one VM, the workload demanded different things at different times from the backing storage. Differences that were not readily apparent from the traditional vCenter metrics. Now imagine what other workloads on the same system may look like, or even what other systems may look like. As an aggregate, how might all of these systems be taxing your hosts and storage infrastructures? These are all very good questions with answers specific to each and every environment.

As demonstrated above, using vscsiStats can be a great way to compliment other monitoring metrics found in vCenter, and will surely give you a better understanding of the behavior of your virtualized environment.

Thanks for reading.

– Pete