October 2014

A look at FVP 2.0’s new features in a production environment

I love a good benchmark as much as the next guy. But success in the datacenter is not solely predicated on the results of a synthetic benchmark, especially those that do not reflect a real workload. This was the primary motivation in upgrading my production environment to FVP 2.0 as quickly as possible. After plenty of testing in the lab, I wanted to see how the new and improved features of FVP 2.0 impacted a production workload. The easiest way to do this is to sit back and watch, then share some screen shots.

All of the images below are from my production code compiling machines running at random points of the day. The workloads will always vary somewhat, so take them as more "observational differences" than benchmark results. Also note that these are much more than the typical busy VM. The code compiling VMs often hit the triple crown in the "difficult to design for" department.

Large I/O sizes. (32K to 512K, with most being around 256K)
Heavy writes (95% to 100% writes during a full compile)
Sustained use of compute, networking, and storage resources during the compiling.

The characteristics of flash under these circumstances can be a surprise to many. Heavy writes with large I/Os can turn flash into molasses, and it is not uncommon to have sporadic latencies well above 50ms. Flash has been a boon for the industry, and has changed almost everything for the better. But contrary to conventional wisdom, it is not a panacea. The characteristics of flash need to be taken into consideration, and expectations should be adjusted, whether it be used as an acceleration resource, or for persistent data storage. If you think large I/O sizes do not apply to you, just look at the average I/O size when copying some files to a file server.

One important point is that the comparisons I provide did not include any physical changes to my infrastructure. Unfortunately, my peering network for replica traffic is still using 1GbE, and my blades are only capable of leveraging Intel S3700 SSDs via embedded SAS/SATA controllers. The VMs are still backed by a near end-of-life 1GbE based storage array.

Another item worth mentioning is that due to my workload, my numbers usually reflect worst case scenarios. You may have latencies that are drastically lower than mine. The point being that if FVP can adequately accelerate my workloads, it will likely do even better with yours. Now let’s take a look and see the results.

Adaptive Network Compression
Specific to customers using 1GbE as their peering network, FVP 2.0 offers a bit of relief in the form of Adaptive Network Compression. While there is no way for one to toggle this feature off or on for comparison, I can share what previous observations had shown.

FVP 1.x
Here is an older image of a build machine during a compile. This was in WB+1 mode (replicating to 1 peer). As you can see, the blue line (Observed VM latency) shows the compounding effect of trying to push large writes across a 1GbE pipe to SATA/SAS based flash devices, and the result was not as good as one would hope. The characteristics of flash itself, along with the constraints of 1GbE, were conspiring with each other to make acceleration difficult.

FVP 2.0 using Adaptive Network Compression
Before I show the comparison of effective latencies between 1.x and 2.0, I want to illustrate the workload a bit better. Below is a zoomed in view (about a 20 minute window) showing the throughput of a single VM during a compile job. As you can see, it is almost all writes.

Below shows the relative number of IOPS. Almost all are write IOPS, and again, the low number of IOPS relative to the throughput is an indicator of large I/O sizes. Remember that with 512K I/O sizes, it only takes a couple of hundred IOPS to nearly saturate a 1GbE link – not to mention the problems that flash has with it.
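As a rough back-of-the-envelope check on that last point, the sketch below multiplies IOPS by I/O size and compares the result to the nominal line rate of a 1GbE link. It assumes that 512K means 512 KiB and ignores Ethernet and TCP overhead, so the usable bandwidth in practice would be somewhat lower. The function name is just for illustration.

```python
# Back-of-the-envelope: how much of a 1GbE link do large writes consume?
# Assumptions (not measurements): 512K = 512 KiB, nominal 1 Gbit/s line rate,
# and no allowance for Ethernet/TCP overhead.

LINK_BITS_PER_SEC = 1_000_000_000  # nominal 1GbE line rate


def link_utilization(iops: float, io_size_kib: float) -> float:
    """Return the fraction of a 1GbE link consumed by the given workload."""
    bytes_per_sec = iops * io_size_kib * 1024
    return bytes_per_sec * 8 / LINK_BITS_PER_SEC


if __name__ == "__main__":
    for iops in (100, 200, 240):
        pct = link_utilization(iops, 512) * 100
        print(f"{iops} IOPS at 512 KiB -> {pct:.0f}% of 1GbE")
```

Under these assumptions, 200 IOPS at 512 KiB works out to roughly 84% of the nominal line rate, before any protocol overhead is counted.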

Now let’s look at latency on that same VM, during that same time frame. In the image below, the blue line shows that the VM observed latency has now improved to the 6 to 8ms range during heavy writes (ignore the spike on the left, as that was from a cold read). The 6 to 8ms of latency is very close to the effective latency of a WB+0, local flash device only configuration.

Using the same accelerator device (Intel S3700 on embedded Patsburg controllers) as in 1.x, the improvements are dramatic. The "penalty" for the redundancy is greatly reduced to the point that the backing flash may be the larger contributor to the overall latency. What has really been quite an eye opener is how well the compression is helping. In just three business days, it has saved 1.5 TB of data running over the peer network. (350 GB of savings coming from another FVP cluster not shown)

Distributed Fault Tolerant Memory
If there is one thing that flash doesn’t do well with, it is writes using large I/O sizes. Think about all of the overhead that comes from flash (garbage collection, write amplification, etc.), and that in my case, it still needs to funnel through an overwhelmed storage controller. This is where I was looking forward to seeing how Distributed Fault Tolerant Memory (DFTM) impacted performance in my environment. For this test, I carved out 96GB of RAM on each host (384GB total) for the DFTM Cluster.

Let’s look at a similar build run accelerated using write-back, but with DFTM. This VM is configured for WB+1, meaning that it is using DFTM, but still must push the replica traffic across a 1GbE pipe. The image below shows the effective latency of the WB+1 configuration using DFTM.

The image above shows that using DFTM in a WB+1 mode eliminated some of that overhead inherent with flash, and was able to drop latencies below 4ms with just a single 1GbE link. Again, these are massive 256K and 512K I/Os. I was curious to know how 10GbE would have compared, but didn’t have this in my production environment.

Now, let’s try DFTM in a WB+0 mode, meaning that it has no replica traffic to send to a peer. What do the latencies look like then for that same time frame?

If you can’t see the blue line showing the effective (VM observed) latencies, it is because it is hovering quite close to 0 for the entire sampling period. Local acceleration was 0.10ms, and the effective latency to the VM under the heaviest of writes was just 0.33ms. I’ll take that.

Here is another image of when I turned a DFTM accelerated VM from WB+1 to WB+0. You can see what happened to the latency.

Keep in mind that the accelerated performance I show in the images above comes from a VM that is living on a very old Dell EqualLogic PS6000e. Just fourteen 7,200 RPM SATA drives that can only serve up about 700 IOPS on a good day.

An unintended, but extremely useful benefit of DFTM is to troubleshoot replica traffic that has higher than expected latencies. A WB+1 configuration using DFTM eliminates any notion of latency introduced by flash devices or offending controllers, and limits the possibilities to NICs on the host, or switches. Something I’ve already found useful with another vSphere cluster.

Simply put, DFTM is a clear winner. It can address all of the things that flash cannot do well. It avoids storage buses, drive controllers, NAND overhead, and doesn’t wear out. And it sits as close to the CPU with as much bandwidth as anything. But make no mistake, memory is volatile. With the exception of some specific use cases such as non persistent VDI, or other ephemeral workloads, one should take advantage of the "FT" part of DFTM. Set it to 1 or more peers. You may give back a bit of latency, but the superior performance is perfect for those difficult tier one workloads.

When configuring an FVP cluster, the current implementation limits your selection to a single acceleration type per host. So, if you have flash already installed in your servers, and want to use RAM for some VMs, what do you do? …Make another FVP cluster. Frank Denneman’s post: Multi-FVP cluster design – using RAM and FLASH in the same vSphere Cluster describes how to configure VMs in the same vSphere cluster to use different accelerators. Borrowing those tips, this is how my FVP clusters inside of a vSphere cluster look.

Write Buffer and destaging mechanism
This is a feature not necessarily listed on the bullet points of improvements, but it deserves a mention. At Storage Field Day 5, Satyam Vaghani mentioned the improvements with the destaging mechanism. I will let the folks at PernixData provide the details on this, but there were corner cases in which VMs could bump up against some limits of the destager. It was relatively rare, but it did happen in my environment. As far as I can tell, this does seem to be improved.

Destaging visibility has also been improved. Ever since the pre 1.0, beta days, I’ve wanted more visibility on the destaging buffer. After all, we know that all writes eventually have to hit the backing physical datastore (see Effects of introducing write-back caching with PernixData FVP), and that can be a factor in design. FVP 2.0 now gives two key metrics: the amount of writes to destage (in MB), and the time to the backing datastore. This will allow you to see whether your backing storage can keep up with your steady state writes. From my early impressions, the current mechanism doesn’t quite capture the metric data at a high enough frequency for my liking, but it’s a good start to giving more visibility.
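One way to reason about "keeping up" is to look at the trend of the writes-to-destage figure over time: a backlog that keeps growing suggests the backing storage is falling behind. The sketch below is purely hypothetical. It is not an FVP API, the function names are my own, and the sample values are invented for illustration. It fits a least-squares slope to (minute, MB awaiting destage) samples that you would have to note down yourself.

```python
# Hypothetical sketch: judge whether backing storage keeps up with writes by
# looking at the trend of a "writes to destage" metric over time.
# The sample values are invented for illustration; this is not an FVP API.

from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (minute, MB awaiting destage)


def destage_trend_mb_per_min(samples: Sequence[Sample]) -> float:
    """Least-squares slope of the backlog, in MB per minute."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_mb = sum(mb for _, mb in samples) / n
    num = sum((t - mean_t) * (mb - mean_mb) for t, mb in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one point in time")
    return num / den


def keeping_up(samples: Sequence[Sample], tolerance_mb_per_min: float = 1.0) -> bool:
    """True when the backlog is flat or shrinking, within the tolerance."""
    return destage_trend_mb_per_min(samples) <= tolerance_mb_per_min


if __name__ == "__main__":
    growing = [(0, 100), (5, 400), (10, 700), (15, 1000)]
    steady = [(0, 300), (5, 280), (10, 310), (15, 295)]
    print("growing backlog keeps up:", keeping_up(growing))
    print("steady backlog keeps up:", keeping_up(steady))
```

The tolerance is an arbitrary knob. The point is only that a sustained positive slope, not a single large value, is the signal worth watching.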

Honorable mentions
NFS support is a fantastic improvement. While I don’t currently have it in production, that doesn’t mean I won’t have it in the future. Many organizations use it and love it. And I’m quite partial to it in the old home lab. Let us also not dismiss the little things. One of my favorite improvements is simply the pre-canned 8 hour time window for observing performance data. This gets rid of the “1 day is too much, 1 hour is not enough” conundrum.

Conclusion
There is a common theme to almost every feature evaluation above. The improvements I showcase cannot be adequately displayed or quantified with a synthetic workload. It took real data to appreciate the improvements in FVP 2.0. Although 10GbE is the minimum ideal, Adaptive Network Compression really buys a lot of time for legacy 1GbE networks. And DFTM is incredible.

The functional improvements to FVP 2.0 are significant. So significant that with an impending refresh of my infrastructure, I am now taking a fresh look at what is actually needed for physical storage on the back end. Perhaps some new compute with massive amounts of PCIe based flash, and RAM to create large tiered acceleration pools. Then backing spindles supporting our capacity requirements, with relatively little data services, and just enough performance to keep up with the steady-state writes.

Working at a software company myself, I know all too well that software is never "complete." But FVP 2.0 is a great leap forward for PernixData customers.

Using FVP in multi-NIC vMotion environments

In FVP version 1.5, PernixData introduced a nice little feature that allows a user to specify the network to use for all FVP peering/replica traffic. This added quite a bit of flexibility in adapting FVP to a wider variety of environments. It can also come in handy when testing performance characteristics of different network speeds, similar to what I did when testing FVP over Infiniband. While the “network configuration” setting is self-explanatory, and ultra-simple, it is ESXi that makes it a little more adventurous.

VMkernels and rules to abide by. …Sort of.
“In theory there is no difference between theory and practice. In practice, there is.” — Yogi Berra

Under the simplest of arrangements, FVP will use the vMotion network for its replica traffic. If your vMotion works, then FVP works. FVP will also work in a multi-NIC vMotion arrangement. While it can’t use more than one VMkernel, vMotion certainly can. Properly configured, vMotion will use whatever links are available, leaving more opportunity and bandwidth for FVP’s replica traffic. This can be especially helpful in 1GbE environments. Okay, so far, so good. The problem can arise when an ESXi host has multiple VMkernels in the same subnet.

The issues around having multiple VMkernels on a single host in one IP subnet are nothing new. The accepted practice has been to generally stay away from multiple VMkernels in a single subnet, but the lines blur a bit when factoring in the VMkernel’s intended purpose.

A VMware Support Insider post states to use only one VMkernel per IP subnet. Well, except for iSCSI storage, and vMotion.
VMware KB 2007467 states: “Ensure that both VMkernel interfaces participating in the vMotion have the IP address from the same IP subnet.”

The motive for recommending isolation of VMkernels is pretty simple. The VMkernel network stack uses a single routing table to route traffic. Two hosts talking to each other on one subnet with multiple VMkernels may not know what interface to use. The result can be unexpected behavior, and depending on what service is sitting in the same network, even a loss of host connectivity. This behavior can also vary depending on the version of ESXi being used. ESXi 5.0 may act differently than 5.1, and 5.5 changes the game even more with the ability to create custom TCP/IP stacks per VMkernel adapter, which could give each VMkernel its own routing table.
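To make the ambiguity a little more concrete, here is a toy model. It is not how ESXi is actually implemented. It simply assumes a single routing table with one route per subnet, where the lowest numbered VMkernel in a subnet ends up owning that route. The interface names, addresses, and function names are all made up for illustration.

```python
# Toy model of a single routing table shared by all VMkernel interfaces.
# Assumption: one route per subnet, owned by the lowest numbered vmk in it.
# This illustrates the ambiguity only; it is not how ESXi is implemented.

import ipaddress
from typing import Dict


def build_routing_table(vmks: Dict[str, str]) -> Dict[str, str]:
    """Map each subnet (as a string) to a single vmk, lowest number first."""
    table: Dict[str, str] = {}
    for name in sorted(vmks, key=lambda n: int(n[3:])):
        subnet = str(ipaddress.ip_interface(vmks[name]).network)
        table.setdefault(subnet, name)  # later vmks in the subnet are ignored
    return table


def egress_interface(table: Dict[str, str], destination: str) -> str:
    """Return the vmk whose subnet contains the destination address."""
    addr = ipaddress.ip_address(destination)
    for subnet, name in table.items():
        if addr in ipaddress.ip_network(subnet):
            return name
    raise LookupError(f"no route to {destination}")


if __name__ == "__main__":
    host = {
        "vmk0": "192.168.10.11/24",  # management (made-up address)
        "vmk1": "192.168.50.11/24",  # first vMotion VMkernel
        "vmk5": "192.168.50.21/24",  # second vMotion VMkernel, same subnet
    }
    table = build_routing_table(host)
    # Traffic aimed at a peer in the shared subnet resolves to vmk1,
    # no matter which vMotion VMkernel was intended.
    print(egress_interface(table, "192.168.50.22"))
```

In this simplified picture, nothing in the table even mentions vmk5, which is the essence of why a second VMkernel in the same subnet can be passed over.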

So what about FVP?
How does any of this relate to FVP? For me, this initial investigation stemmed from some abnormally high latencies I was seeing on my VMs. This is quite the opposite of the effect I’m used to having with FVP. As it turns out, when FVP was pinned to my vMotion-2 network, it was sending out of the correct interface on my multi-NIC vMotion setup, but the receiving ESXi host was using the wrong target interface (vMotion-1 VMkernel on target host), which caused the latency. Just like other VMkernel behavior, it naturally wanted to always choose the lower vmk number. Configuring FVP to use the vMotion-1 network resolved the issue instantly, as vMotion-1 in my case was using vmk1 instead of vmk5. Many thanks to the support team for noticing the goofy communication path it was taking.

Testing similar behavior with vMotion
While the symptoms showed up in FVP, the cause is an ESXi matter. While not an exact comparison, one can simulate behavior similar to what I was seeing with FVP by doing a little experimenting with vMotion. The experiment simply involves taking an arrangement originally configured for multi-NIC vMotion, disabling vMotion on the network with the lowest vmk number on both hosts, kicking off a vMotion, and observing the traffic via esxtop. (Warning: keep this experiment to your lab only.)

For the test, two ESXi 5.5 hosts were used, and multi-NIC vMotion was set up in accordance with KB 2007467. One vSwitch. Two VMkernel ports (vMotion-0 & vMotion-1 respectively) in an active/standby arrangement. The uplinks are flopped on the other VMkernel. Below is an example of ESX01:

And both displayed what I’d expect in the routing table.

The tests below will show what the traffic looks like using just one of the vMotion networks, but only where the “vMotion” service is enabled on one of the VMkernel ports.

Test 1: Verify what correct vMotion traffic looks like
First, let’s establish what correct vMotion traffic will look like. This is on a dual NIC vMotion arrangement in which only the network with the lowest numbered vmk is ticked with the “vMotion” service.

The screenshot below is how the traffic looks from the source on ESX01. The green bubble indicates the anticipated/correct VMkernel to be used. Success!

The screenshot below is how traffic looks from the target on ESX02. The green bubble indicates the anticipated/correct VMkernel to be used. Success!

As you can see, the traffic is exactly as expected, with no other traffic occurring on the other VMkernel, vmk2.

Test 2: Verify what incorrect vMotion traffic looks like
Now let’s look at what happens on those same hosts when trying to use only the higher numbered vMotion network. The “vMotion” service was changed on both hosts to the other VMkernel, and both hosts were restarted. What is shown below is how the traffic looks on a dual NIC vMotion arrangement in which the network with the lowest numbered vmk has the “vMotion” service unticked, and the higher numbered vMotion network has the service enabled.

The screenshot below is how the traffic looks from the source on ESX01. The green bubble indicates the anticipated/correct VMkernel to be used. The red bubble indicates the VMkernel it is actually using. Uh oh. Notice how there is no traffic coming from vmk2, where it should be coming from? It’s coming from vmk1, exactly like the first test.

The screenshot below is how traffic looks from the target on ESX02. The green bubble indicates the anticipated/correct VMkernel to be used.

As you can see, under the described test arrangement, ESXi can and may use the incorrect VMkernel on the source, when vMotion is disabled on the vMotion network with the lowest VMkernel number, and active on the other vMotion network. It was repeatable with both ESXi 5.0 and ESXi 5.5. The results were consistent in tests with host uplinks connected to the same switch versus two stacked switches. The tests were also consistent using both standard vSwitches and Distributed vSwitches.

The experiment above is just a simple test to better understand how the path from the source to the target can get confused. From my interpretation, it is not unlike what is described in Frank Denneman’s post on why a vMotion network may accidentally use a Management Network. (His other post on Designing your vMotion Network is also a great read, and applicable to the topic here.) Since FVP can only use one specific VMkernel on each host, I believe I was able to simulate the basics of why ESXi was making it difficult for FVP when pinning the FVP replica traffic on my higher numbered vMotion network in my production network. Knowing this lends itself to the first recommendation below.

A few different ways to configure FVP
After looking at the behavior of all of this, here are a few recommendations on using FVP with your VMkernel ports. Let me be clear that these are my recommendations only.

Ideally, create an isolated, non-routable network using a dedicated VLAN with a single VMkernel on each host, and assign only FVP to that network. It can live in whatever vSwitch is most suitable for your environment. (The faster the uplinks, the better). This will isolate the peer traffic, ensure it is flowing as designed, and let a multi-NIC vMotion arrangement work by itself. Here is an example of what that might look like:

If for some reason you can’t follow the recommendation above (maybe you need to wait on getting a new VLAN provisioned by your network team), use a vMotion network, but if it is a multi-NIC vMotion, set FVP to run on the vMotion network with the lowest numbered VMkernel. According to, yes, another great post from Frank, this was the default approach for FVP prior to exposing the ability to assign FVP traffic to a specific network.

Remember that if there is ever a need to modify anything related to the VMkernel ports (unticking the “vMotion” configuration box, adding or removing VMkernels), be aware that the routing interface (as seen via esxcfg-route -l) may not change until there is a host restart. You may also find esxcfg-route -n handy for viewing the host’s ARP table.
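Before deciding where to pin FVP, it can help to write down each host's VMkernels and see which ones share a subnet. The sketch below is a hypothetical audit helper that works from a hand-typed inventory. It does not query ESXi in any way, and the names and addresses are invented for illustration.

```python
# Hypothetical audit helper: flag subnets that contain more than one VMkernel
# on a host. The inventory is typed in by hand; nothing here queries ESXi,
# and the names and addresses are invented for illustration.

import ipaddress
from collections import defaultdict
from typing import Dict, List


def shared_subnets(vmks: Dict[str, str]) -> Dict[str, List[str]]:
    """Return {subnet: [vmk, ...]} for subnets holding two or more vmks."""
    by_subnet: Dict[str, List[str]] = defaultdict(list)
    for name, cidr in vmks.items():
        subnet = str(ipaddress.ip_interface(cidr).network)
        by_subnet[subnet].append(name)
    return {
        subnet: sorted(names, key=lambda n: int(n[3:]))
        for subnet, names in by_subnet.items()
        if len(names) > 1
    }


if __name__ == "__main__":
    esx01 = {
        "vmk0": "10.0.0.11/24",   # management
        "vmk1": "10.0.50.11/24",  # first vMotion VMkernel
        "vmk5": "10.0.50.21/24",  # second vMotion VMkernel, same subnet
        "vmk6": "10.0.60.11/24",  # dedicated FVP network
    }
    for subnet, names in shared_subnets(esx01).items():
        print(f"{subnet}: {', '.join(names)} (lowest numbered: {names[0]})")
```

A subnet that shows up in the output is one where the lowest numbered VMkernel deserves extra attention. A dedicated FVP network with a single VMkernel, as in the first recommendation, would not appear at all.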

The ability to deliver your FVP traffic to its East-West peers in the fastest, most reliable way will allow you to make the most of what FVP offers. Treat FVP like a first class citizen in your network, and it will pay off with better performing VMs.

And a special thank you to Erik Bussink and Frank Denneman for confirming my vMotion test results, and my sanity.

Thanks for reading.

– Pete