September 13, 2013 3 Comments
Since my last post, "Accelerating storage using PernixData’s FVP. A perspective from customer #0001" I’ve had a number of people ask me questions on what type of improvements I’ve seen with FVP. Well, let’s take a look at how it is performing.
The cluster I’ve applied FVP to is dedicated for the purpose of compiling code. Over two dozen 8 vCPU Linux and Windows VM’s churning out code 24 hours a day. It is probably one of the more challenging environments to improve, as accelerating code compiling is inherently a very difficult task. Massive amounts of CPU using a highly efficient, multithreaded compiler, a ton of writes, and throw in some bursts of reads for good measure. All of this occurs in various order depending on the job. Sounds like fun, huh.
Our full builds benefited the most by our investment in additional CPU power earlier in the year. This is because full compiles are almost always CPU bound. But incremental builds are much more challenging to improve because of the dialog that occurs between CPU and disk. The compiler is doing all sorts of checking, throughout the compile. Some of the phases of an incremental build are not multithreaded, so while a full build offers nearly perfect multithreading on these 8 vCPU build VMs, this just isn’t the case on an incremental build.
The screen shots below will step you through how FVP is improving these very difficult to accelerate incremental builds. They will be broken down into the categories that FVP divides them into; IOPS, Latency, and Throughput. Also included will be a CPU utilization metric, because they all have an indelible tie to each other. Some of the screen shots are from the same compile run, while others are not. The point here it to show how it is accelerating, and more importantly how to interpret the data. The VM being used here is our standard 8 vCPU Windows VM with 8GB of RAM. It has write-back caching enabled, with a write redundancy setting of "Local flash and 1 network flash device."
Click on each image to see a larger version
Below is an incremental compile on a build VM during the middle of the day. The magenta line is showing what is being satisfied by the backing data store, and the blue line shows the Total Effective IOPS after flash is leveraged. The key to remember on this view is that it does not distinguish between reads and writes. If you are doing a lot of "cold reads" the magenta "data store" line and blue "Total effective" line may very well overlap.
This is the same metric, but toggled to the read/write view. In this case, you can see below that a significant amount of acceleration came from reads (orange). For as much writing as a build run takes, I never knew a single build VM could use 1,600 IOPS or more of reads, because my backing storage could never satisfy the request.
Allowing the CPU to pass the I/O as quickly as it needs to does one thing, it allows the multithreaded compiler to maximize CPU usage. During a full compile, it is quite easy to max out an 8 vCPU system and have a sustained 100% CPU usage, but again, these incremental compiles were much more challenging. What you see below is the CPU utilization associated with the VM running the build. It is a significant improvement of an incremental build by using acceleration. A non accelerated build would rarely get above 60% CPU utilization.
At a distance, this screen grab probably looks like a total mess, but it has really great data behind it. Why? The need for high IOPS is dictated by the VMs demanding it. If it doesn’t demand it, you won’t see it. But where acceleration comes in more often is reduced latency, on both reads and writes. The most important line here is the blue line, which represents the total effective latency.
Just as with other metrics, the latency reading can often times be a bit misleading with the default "Flash/Datastore" view. This view does not distinguish between reads and writes, so a cold read pulling off of spinning disk will have traditional amounts of latency you are familiar with. This can skew your interpretation of the numbers in the default view. For all measurements (IOPS, Throughput, Latency) I often find myself toggling between this view, and the read/write view. Here you can see how a cold read sticks out like a sore thumb. The read/write view is where you would go to understand individual read and write latencies.
While a throughput chart can often look very similar to the IOPS chart, you might want to spend a moment and dig a little deeper. You might find some interesting things about your workload. Here, you can see the total effective throughput significantly improved by caching.
Just as with the other metrics, toggling it into read/write view will help you better understand your reads and writes.
The IOPS, Throughput & Latency relationship
It is easy to overlook the relationship that IOPS, throughput, and latency have to each other. Let me provide an real world example of how one can influence the other. The following represents the early, and middle phases of a code compile run. This is the FVP "read/write" view of this one VM. Green indicates writes. Orange indicates reads. Blue indicates "Total Effective" (often hidden by the other lines).
First, IOPS (green). High write IOPS at the beginning, yet relatively low write IOPS later on.
Now, look at write throughput (green) below for that same time period of the build. A modest amount of throughput at the beginning where the higher IOPS were at, then followed by much higher throughput later on when IOPS had been low. This is the indicator of changing I/O sizes from the applications generating the data.
Now look at write latency (green) below. Extremely low latency (sub 1ms) with smaller I/O sizes. Higher latency on the much larger I/O sizes later on. By the way, the high read latencies generally come from cold reads that were served from the backing spindles.
The findings here show that early on in the workflow where SVN is doing a lot of it’s prep work, a 32KB I/O size for writes is typically used. The write IOPS are high, Throughput is modest, and latency comes in at sub 1ms. Later on in the run, the compiler itself uses much larger I/O sizes (128KB to 256KB). IOPS are lower, but throughput is very high. Latency suffers (approaching 8ms) with the significantly larger I/O sizes. There are other factors influencing this, to which I will address in an upcoming post.
This is one of the methods to determine your typical I/O size to provide a more accurate test configuration for Iometer, if you choose to do additional benchmarking. (See: Iometer. As good as you want to make it.)
1. After you have deployed an FVP cluster into production, your SAN array monitoring tool will most likely show you an increase in your write percentage compared to your historical numbers . This is quite logical when you think about it.. All writes, even when accelerated, eventually make it to the data store (albeit in a much more efficient way). Many of your reads may be satisfied by FVP, and never hit the array.
2. When looking at a summary of the FVP at the cluster level, I find it helpful to click on the "Performance Map" view. This gives me a weighted view of how to distinguish what is being accelerated most during the given sampling period.
3. In addition to the GUI, controlling the VM write caching settings can easily managed via PowerShell. This might be a good step to take if the cluster tripped over to UPS power. Backup infrastructures that do not have a VADP capable proxy living in the accelerated cluster might also need to rely on some PowerShell scripts. PernixData has some good documentation on the matter.
PernixData FVP is doing a very good job of accelerating a verify difficult workload. I would have loved to show you data from accelerating a more typical workload such as Exchange or SQL, but my other cluster containing these systems is not accelerated at this time. Stay tuned for the next installment, as I will show you what was discovered as I started looking at my workload more closely.