Using a new tool to discover old problems

It is interesting what can be discovered when storage is accelerated. Virtual machines that were previously restricted by underperforming arrays now get to breathe freely. They are given the ability to pass storage I/O as quickly as the processor needs. In other words, the applications that need the CPU cycles get to dictate your storage requirements, rather than your storage imposing artificial limits on your CPU.

With that idea in mind, a few things revealed themselves during the process of implementing PernixData FVP. Early on, it was all about implementing and understanding the solution. However, once real-world workloads began accelerating, the analytics that FVP provided became intriguing. What was generating the I/O that was being accelerated? What processes were associated with the other traffic that was not being accelerated, and why? What applications were behind the changing I/O sizes? And what was causing the peculiar I/O patterns that were showing up? Some of these questions had been raised at an earlier time (see: Hunting down unnecessary I/O before you buy that next storage solution). The trouble was, the tools I had for discovering the patterns of data I/O were limited.

Why is this so important? In the spirit of reminding ourselves that no resource is an island, here is an example of a production code compile run, as seen from the perspective of the guest CPU. The first screen capture shows the compile with adequate storage I/O to support the application’s needs. A full build runs at nearly perfect CPU utilization across all 8 of its vCPUs. (Screen shots taken from my earlier post, Vroom! Scaling up Virtual Machines in vSphere to meet performance requirements-Part 2.)

image

Below is that very same code compile, under stressed backend storage. It took 46% longer to complete, and as you can see, it changed the CPU utilization of the build run.

image

The primary goal for this environment was to accelerate the storage. However, it would have been a bit presumptuous to conclude that all of the existing storage traffic was good, useful I/O. There is a significant amount of traffic originating from outside of IT, and the I/O it generated needed to be understood better. With the traffic passing more freely thanks to FVP acceleration, patterns that previously could not surface should now be more visible. This was the basis for the first discovery.

A little “CSI” work on the IOPS
Many continuous build systems use some variation of a polling mechanism to detect when newly checked-in code needs to be compiled. This should be a very lightweight process. However, once storage performance was allowed to breathe, the following patterns started showing up on all of the build VMs.

The image below shows the IOPS for one build VM during a one-hour period of no compiling for that particular VM. The VMs were polling for new builds every 5 minutes. Yep, that “build heartbeat” was as high as 450 IOPS on each VM.

high-IOPS-heartbeat

Why wasn’t this noticed before? These spikes were being suppressed by my previously overtaxed storage, which made them more difficult to see. These were all writes, and they were translating into 500 to 600 steady-state IOPS just to sit idle (as seen below from the perspective of the backing storage).

Array-VMFSvolumeIOPS

So what was the cause? As it turned out, the polling mechanism was using some source code control (SVN) calls to help the build machines understand whether they needed to execute a build. Programmatically, the Development Team had no way of knowing whether the script they developed was efficient or not; they are separated from that layer of the infrastructure. (Sadly, I have a feeling this happens more often than not in general application development.) The result was a horribly inefficient method. After helping them understand the matter, it was revamped, and polling on each VM now takes only 1 to 2 IOPS every 5 minutes.
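The revamped script itself isn’t shown here, but a minimal sketch of the general idea might look like the following, assuming a hypothetical repository URL and state file. Rather than walking or churning the working copy on disk, it asks the SVN server for the latest revision number and compares it against the revision of the last completed build, so the only guest-side storage I/O per poll is one tiny read of a state file.

```python
#!/usr/bin/env python
# Hypothetical sketch of a lightweight build poll. Instead of touching
# the working copy on disk, ask the SVN server for HEAD's revision and
# compare it to the last revision that was built.
import re
import subprocess

REPO_URL = "http://svn.example.com/repo/trunk"  # assumed repository URL
STATE_FILE = "/var/tmp/last_built_revision"     # assumed state location

def latest_revision():
    """Ask the server for the newest revision; no working-copy I/O."""
    out = subprocess.check_output(["svn", "info", REPO_URL])
    match = re.search(r"^Revision:\s*(\d+)", out.decode(), re.MULTILINE)
    return int(match.group(1))

def last_built_revision():
    try:
        with open(STATE_FILE) as f:
            return int(f.read().strip())
    except (IOError, ValueError):
        return -1  # nothing recorded yet

if __name__ == "__main__":
    if latest_revision() > last_built_revision():
        print("new revision checked in; trigger a build")
    else:
        print("nothing new; stay idle")
```

The specific calls matter less than the shape of the check: comparing a single revision number costs almost nothing, which is how a poll can drop from hundreds of IOPS to 1 or 2.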

Idle-IOPS2

The image below shows how the accelerated cluster of 30 build VMs looks when there are no builds running.

Idle-IOPS

The inefficient polling mechanism wasn’t the only thing found. A few of the Linux build VMs had a rogue “Beagle” search daemon running on them. This crawler did just that: indexing data on these Linux machines and creating unnecessary I/O. With Windows, indexers and other CPU and I/O hogs are typically controlled quite easily by GPO, but the equivalent services can creep into Linux systems if one isn’t careful. At least it was an easy fix.
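For what it’s worth, spotting a culprit like this from inside the guest doesn’t require anything exotic. Here is a quick triage sketch of my own (not a tool referenced in this post) that ranks processes by cumulative bytes written using Linux’s per-process I/O accounting; it assumes the kernel has I/O accounting enabled, and it usually needs root to read other users’ counters.

```python
#!/usr/bin/env python
# Illustrative triage script: rank Linux processes by cumulative bytes
# written, using the per-process I/O accounting in /proc/<pid>/io.
# A rogue indexer such as beagled tends to float to the top of this list.
import os

def write_bytes(pid):
    """Cumulative bytes this process has caused to be written to storage."""
    try:
        with open("/proc/%s/io" % pid) as f:
            for line in f:
                if line.startswith("write_bytes:"):
                    return int(line.split()[1])
    except IOError:
        pass  # process exited or permission denied; skip it
    return 0

def proc_name(pid):
    try:
        with open("/proc/%s/comm" % pid) as f:
            return f.read().strip()
    except IOError:
        return "?"

pids = [p for p in os.listdir("/proc") if p.isdigit()]
top = sorted(((write_bytes(p), p) for p in pids), reverse=True)[:10]
for wbytes, pid in top:
    print("%6s  %-20s %15d bytes written" % (pid, proc_name(pid), wbytes))
```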

The cumulative benefit
Prior to the efforts of accelerating the storage and making it more efficient, the utilization of the arrays looked as shown below (6-hour period, from the perspective of the arrays).

Array-IOPS-before

Now, with the combination of understanding my workload better and accelerating it through FVP, that same workload looks like this (6-hour period, from the perspective of the arrays):

Array-IOPS-after

Notice that the estimated workload is far under the 100% at which it was regularly pegged 24 hours a day, 6 days a week. In fact, during the workday, the arrays might only peak at 50% to 60% utilization. When no builds are running, the continuous build system may only be drawing 25 IOPS from the VMFS volumes that contain the build machines, which is much more reasonable than where it was.

With the combination of less pressure on the backing physical storage and the magic of pooled flash on the hosts, the applications and CPU get to dictate how much storage I/O is needed. Below is a screen capture of IOPS on a production build VM while a compile was being performed. Up until this point, it was not known that a single build VM could demand as much as 4,000 IOPS to compile code, because the physical storage was never capable of satisfying that kind of need.

IOPS-single-VM

Conclusion
Could some of these discoveries have been made without FVP? Yes, perhaps some of them. But good analysis comes from being able to interpret data in a consumable way. It’s why various methods of data visualization, such as bar graphs, pie charts, and X-Y-Z plots, exist. FVP has certainly been doing a good job of accelerating workloads, but it also helps the administrator understand the I/O better. I look forward to seeing how the analytics might expand in future tools or releases from PernixData.

A friend once said to me that the only thing better than a new tractor is a reason to use it. In many ways, the same goes for technology. Virtualization might not even be that fascinating unless you had real workloads to run on top of it. Ditto for PernixData FVP. When applied to real workloads, the magic begins to happen, and you learn a lot about your data in the process.