August 2015

When PernixData debuted FVP back in August 2013, for me there was one innovation in particular that stood out above the rest. The ability to accelerate writes (known as “Write Back” caching) on the server side, and do so in a fault tolerant way. Leverage fast media on the server side to drive microsecond write latencies to a VM while enjoying all of the benefits of VMware clustering (vMotion, HA, DRS, etc.). Give the VM the advantage of physics by presenting a local acknowledgement of the write, but maintain all of the benefits of keeping your compute and storage layers separate.

But sometimes overlooked with this innovation is the effectiveness that comes with how FVP clusters acceleration devices to create a pool of resources for read caching (known as “Write Through” caching with FVP). For new and existing FVP users, it is good to get familiar with the basics of how to interpret the effectiveness of clustered read caching, and how to look for opportunities to improve the results of it in an environment. For those who will be trying out the upcoming FVP Freedom edition, this will also serve as an additional primer for interpreting the metrics. Announced at Virtualization Field Day 5, the Freedom Edition is a free edition of FVP with a few limitations, such as read caching only, and a maximum of 128GB tier size using RAM.

The power of read caching done the right way
Read caching alone can sometimes be perceived as a helpful way to improve performance, but temporary, and only addressing one side of the I/O dialogue. Unfortunately, this assertion tells an incomplete story. It is often criticized, but let’s remember that caching in some form is used by almost everyone, and everything. Storage arrays of all types, Hyper Converged solutions, and even DAS. Dig a little deeper, and you realize its perceived shortcomings are most often attributed to how it has been implemented. By that I mean:

Limited, non-adjustable cache sizes in arrays or Hyper Converged environments.
Limited to a single host in server side solutions. (operations like vMotion undermining its effectiveness)
Not VM or workload aware.

Existing solutions address some of these shortcomings, but fall short of addressing all three, which is what it takes to deliver read caching in a truly effective way. FVP’s architecture addresses all three, giving you the agility to quickly adjust the performance tier while letting your centralized storage do what it does best: store data.

Since FVP allows you to choose the size of the acceleration tier, this impact alone can be profound. For instance, current NVMe based Flash cards are 2TB in size, and are expected to grow dramatically in the near future. Imagine a 10 node cluster that would have perhaps 20-40TB of an acceleration tier that may be serving up just 50TB of persistent storage. Compare this to a hybrid array that may only put a few hundred GB of flash devices in an array serving up that same 50TB, all funneled through a pair of array controllers. That is flash the I/Os would still have to traverse the network and storage stack to reach, and cached data that is arbitrarily evicted for new incoming hot blocks.

Unlike other host side caching solutions, FVP treats the collection of acceleration devices on each host as a pool. As workloads are being actively moved across hosts in the vSphere cluster, those workloads will still be able to fetch the cached content from that pool using a light weight protocol. Traditionally host based caching would have to re-warm the data from the backend storage using the entire storage stack and traditional protocols if something like a vMotion event occurred.

FVP is also VM aware. This means it understands the identity of each cached block – where it is coming from, and going to – and has many ways to maintain cache coherency (See Frank Denneman’s post Solving Cache Pollution). Traditional approaches to providing a caching tier were largely unaware of which VM the blocks of data were associated with. Intelligence was typically lost the moment the block exited the HBA on the host. This sets up one of the most common but often overlooked scenarios in a real environment. One or more noisy neighbor VMs can easily pollute the cache, and force eviction of hot blocks used by other VMs. The arbitrary nature of this means potentially unpredictable performance with these traditional approaches.

How it works
The logic behind FVP’s clustered read caching approach is incredibly resilient and efficient. Cached reads for a VM can be fetched from any host participating in the cluster, which allows for a seamless leveraging of cache content regardless of where the VM lives in the cluster. Frank Denneman’s post on FVP’s remote cache access describes this in great detail.
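To make the idea of a clustered read more concrete, here is a minimal sketch in Python of the lookup order described above: local acceleration device first, then a peer host's device, then the backing datastore. This is only an illustration under my own assumptions, not FVP's actual implementation or protocol. The names `Host` and `read_block`, the dictionary-based cache, and the choice to leave a remotely cached block where it is are all hypothetical.

```python
from typing import Dict, List, Tuple

BlockKey = Tuple[str, int]  # (vm_id, block_number); hypothetical key format


class Host:
    """A hypothetical host whose acceleration device is modeled as a dict."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.cache: Dict[BlockKey, bytes] = {}


def read_block(
    vm_id: str,
    block: int,
    local: Host,
    peers: List[Host],
    datastore: Dict[BlockKey, bytes],
) -> Tuple[bytes, str]:
    """Return (data, source), where source is 'local', 'network' or 'datastore'."""
    key = (vm_id, block)

    # 1. Cheapest path: the acceleration device in the host running the VM.
    if key in local.cache:
        return local.cache[key], "local"

    # 2. Next: an acceleration device in another host of the cluster.
    #    Assumption: the block is served remotely and not copied locally.
    for peer in peers:
        if key in peer.cache:
            return peer.cache[key], "network"

    # 3. Last resort: the backing datastore. The block is placed in the
    #    local cache as it is fetched, so the next read is a local hit.
    data = datastore[key]
    local.cache[key] = data
    return data, "datastore"


if __name__ == "__main__":
    esx_a, esx_b = Host("esx-a"), Host("esx-b")
    datastore = {("vm1", 0): b"payload"}

    print(read_block("vm1", 0, esx_a, [esx_b], datastore)[1])  # first read
    print(read_block("vm1", 0, esx_a, [esx_b], datastore)[1])  # repeat read
    # Simulate a vMotion: the VM now runs on esx-b, its cached block is on esx-a.
    print(read_block("vm1", 0, esx_b, [esx_a], datastore)[1])
```

The last call is the interesting one. After the simulated vMotion, the read is satisfied from the peer host instead of going back to the array, which is the behavior a single-host cache cannot offer.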

Adjusting the charts
Since we will be looking at the FVP charts to better understand the benefit of just read caching alone, let’s create a custom view. This will allow us to really focus on read I/Os and not get them confused with any other write I/O activity occurring at the same time.

Note that when you choose a "Custom Breakdown", the same colors used to represent both reads and writes in the default "Storage Type" view will now be representing ONLY reads from their respective resource type. Something to keep in mind as you toggle between the default "Storage Type" view, and this custom view.

Looking at Offload
The goal for any well designed storage system is to deliver optimal performance to the applications. With FVP, I/Os are offloaded from the array to the acceleration tier on the server side. Read requests will be delivered to the VMs faster, reducing latency, and speeding up your applications.

From a financial investment perspective, let’s not forget the benefit of I/O “offload,” or in other words, read requests that were satisfied from the acceleration tier. With FVP, that work is offloaded from the storage arrays serving the persistent storage tier, from the array controllers, from the fabric, and from the HBAs. The more offload there is, the less work for your storage arrays and fabric, which means you can target more affordable backend storage. The hero numbers showcase the sum of this offload nicely.

Looking at Network acceleration reads
Unlike other host based solutions, FVP allows for common activities such as vMotions, DRS, and HA to work seamlessly without forcing any sort of rewarming of the cache from the backend storage. Below is an example of read I/O from 3 VMs in a production environment, and their ability to access cached reads on an acceleration device on a remote host.

Note how latency stays low on those read requests that came from a remote acceleration device (the green line).

How well is my read caching working?
Regardless of which write policy (Write Through or Write Back) is being used in FVP, the cache is populated in the same way.

Every read request serviced by the backing array places the data into the acceleration tier as it is fetched from the backing storage.
All write I/O is placed in the cache as it is written to the physical storage.

Therefore, it is easy to conclude that if read I/Os did NOT come from the acceleration tier, it is for one of three reasons.

A block of data had been requested that had never been requested before.
The block of data had not been written recently, and thus was not residing in cache.
A block of data had once lived in the cache (via a read or write), but had been evicted due to cache size.

The first two items reflect the workload characteristics, while the last one is a result of a design decision – that being the cache size. With FVP you get to choose how large the devices are that make up the caching tier, so you can determine ultimately how much the solution will benefit you. Cache size can have a dramatic impact on performance because there is less pressure to evict previous data that have already been cached to make room for new data.
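The population rules and the three miss reasons above can be sketched with a small model. This is a toy in Python, not FVP's logic. It assumes a least-recently-used eviction policy (this post does not say which policy FVP uses), counts capacity in blocks, and folds the first two reasons into a single "cold_miss" bucket, since in both cases the block never entered the cache. The class name `ReadCacheModel` is hypothetical.

```python
from collections import OrderedDict


class ReadCacheModel:
    """Toy cache: populated by reads and writes, LRU eviction (an assumption)."""

    def __init__(self, capacity_blocks: int) -> None:
        self.capacity = capacity_blocks
        self.cache: "OrderedDict[int, bool]" = OrderedDict()
        self.seen = set()  # every block that has ever been placed in the cache
        self.counts = {"hit": 0, "cold_miss": 0, "evicted_miss": 0}

    def _insert(self, block: int) -> None:
        self.cache[block] = True
        self.cache.move_to_end(block)  # mark as most recently used
        self.seen.add(block)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used

    def write(self, block: int) -> None:
        # Writes populate the cache as they go to physical storage.
        self._insert(block)

    def read(self, block: int) -> str:
        if block in self.cache:
            self.cache.move_to_end(block)
            self.counts["hit"] += 1
            return "hit"
        # Miss: either the block was never cached (workload characteristic)
        # or it was cached once and pushed out (cache size decision).
        reason = "evicted_miss" if block in self.seen else "cold_miss"
        self.counts[reason] += 1
        self._insert(block)  # reads from backing storage populate the cache
        return reason


if __name__ == "__main__":
    # Replay the same read pattern against two cache sizes.
    pattern = [1, 2, 3, 4, 1, 2, 3, 4]
    for size in (2, 4):
        model = ReadCacheModel(size)
        for blk in pattern:
            model.read(blk)
        print(size, model.counts)
```

Replaying the same pattern against the larger cache turns the "evicted_miss" reads into hits, while the "cold_miss" count stays the same. Only the last of the three reasons is something you can change by adjusting the tier size.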

Visualizing the read cache usage
This is where the FVP metrics can tell the story. When looking at the "Custom Breakdown" view described earlier in this post, you can clearly see on the image below that while a sizable amount of reads were being serviced from the caching tier, the majority of reads (3,500+ IOPS sustained) in this time frame (1 week) came from the backing datastore.

Now, let’s contrast this to another environment and another workload. The image below clearly shows a large amount of data over the period of 1 day that is served from the acceleration tier. Nearly all of the read I/Os, and over 60MBps of throughput, never touched the array.

When evaluating read cache sizing, this is one of the reasons why I like this particular “Custom Breakdown” view so much. Not only does it tell you how well FVP is working at offloading reads, it also tells you the POTENTIAL of all reads that *could* be offloaded from the array. You get to choose how much offload occurs, because you decide on how large your tier size is, or how many VMs participate in that tier.

Hit Rate will also tell you the percentage of reads that are coming from the acceleration tier at any point in time. This can be an effective way to view cache hit frequency, but to gain more insight, I often rely on this "Custom Breakdown" to get better context of how much data is coming from the cache and backing datastores at any point in time. Eviction rate can also provide complementary information if it shows the eviction rate creeping upward. But there can be cases where lower eviction percentages may still evict enough cached data over time to affect whether the data you need is still in cache. Thus the reason why this particular "Custom Breakdown" is my favorite for evaluating reads.
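As a rough illustration of why a percentage alone can hide context, here is a small Python sketch that computes a read hit rate from the three read sources in the custom view. The function name `read_hit_rate` and the IOPS figures are made up for the example. They are not taken from FVP or from the charts in this post.

```python
def read_hit_rate(local_flash_iops: float,
                  network_flash_iops: float,
                  datastore_iops: float) -> float:
    """Percentage of read IOPS served by the acceleration tier (local + remote)."""
    total = local_flash_iops + network_flash_iops + datastore_iops
    if total == 0:
        return 0.0  # no reads in this sample, so nothing to report
    return 100.0 * (local_flash_iops + network_flash_iops) / total


if __name__ == "__main__":
    # Two hypothetical samples with the same hit rate but very different
    # amounts of read I/O still landing on the backing datastore.
    quiet = (90.0, 10.0, 100.0)
    busy = (3150.0, 350.0, 3500.0)
    for name, sample in (("quiet", quiet), ("busy", busy)):
        print(name, read_hit_rate(*sample), "datastore IOPS:", sample[2])
```

Both samples report the same hit rate, yet the second one still sends 35 times more read I/O to the array. That is the context a breakdown by source gives you and a single percentage does not.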

What might be a scenario for seeing a lot of reads coming from a backing datastore, and not from cache? Imagine running 500 VMs in an acceleration tier size of just a few GB. The working set sizes are likely much larger than the cache size, which will result in churning through the cache without showing significant benefit. Something to keep in mind if you are trying out FVP with a very small amount of RAM as an acceleration resource. Two effective ways to make this more efficient would be to 1.) increase the cache size or 2.) decrease the number of VMs participating in acceleration. Both will achieve the same thing: providing more potential cache tier size for each VM accelerated. The idea for any caching layer is to have it large enough to hold most of the active data (aka "working set") in the tier. With FVP, you get to easily adjust the tier size, or the VMs participating in it.
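The arithmetic behind that scenario is simple enough to sketch in Python. The helper `cache_per_vm_gb` is hypothetical, and it assumes the tier is shared evenly across VMs, which is a simplification for illustration only. A pooled cache does not have to be divided that way. The 4GB figure stands in for "a few GB".

```python
def cache_per_vm_gb(tier_size_gb: float, vm_count: int) -> float:
    """Average cache capacity per accelerated VM, assuming an even split."""
    if vm_count <= 0:
        raise ValueError("vm_count must be a positive integer")
    return tier_size_gb / vm_count


if __name__ == "__main__":
    # The scenario above: 500 VMs sharing a tier of a few GB (4GB assumed).
    print(cache_per_vm_gb(4, 500) * 1024, "MB per VM")

    # Option 1: increase the cache size.
    print(cache_per_vm_gb(128, 500) * 1024, "MB per VM")

    # Option 2: decrease the number of VMs participating in acceleration.
    print(cache_per_vm_gb(4, 20) * 1024, "MB per VM")
```

With 4GB across 500 VMs, each VM averages only about 8MB of cache, far less than any realistic working set. Either lever raises the per-VM share, which is what gives the working set a chance to stay resident.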

Don’t know what your working set sizes are? Stay tuned for PernixData Architect!

Summary
Once you have a good plan for read caching with FVP, and arrange for a setup with maximum offload, you can drive the best performance possible from clustered read caching. On its own, clustered read caching implemented the way FVP does it can change the architectural discussion of how you design and spend those IT dollars. Pair this with write-buffering with the full edition of FVP, and it can change the game completely.

Understanding PernixData FVP’s clustered read caching functionality