PernixData – Page 2

Applying Innovation in the Workplace

Those of us in this IT industry need not be reminded that IT is as much of a consumer of solutions as it is a provider of services. Advancing technologies attempt to provide services that are faster, more resilient, and feature rich. Sometimes advancement brings unintended consequences while simultaneously creating even more room to innovate. Exciting for sure, but all of this can become a bit precarious if you hold decision making responsibilities on what technologies best suite your environment. Go too deep on the unproven edge, and risk the well-being of your company, and possibly your career. Arguably more dangerous is to stay too conservative, and risk being a Luddite holding onto unused, outdated or failed vestiges of your IT past. It is a delicate balance, but rarely does it reward the status quo. There is an IT administrator out there somewhere that still doesn’t trust x86 servers, let alone virtualization.

Nobody in this industry is the sole proprietor of innovation, and we are all better off for it. Good ideas are everywhere, and it is fun to see them materialize into functional products. IT departments get a front row seat in seeing how different solutions solve problems. Some ideas are better than others. Others create more problems than they fix. Many companies providing a solution are victims of bad timing, bad marketing, or poor execution. In the continuum of progress and changing market conditions, others fail to acknowledge the change and course correct, or simply lost sight of why they exist in the first place.

Then, there are some solutions that show up as a gem. Perhaps they are transformational to how a problem is solved. Maybe they win you over by their elegance in masking the terribly complex with clean and simple. Maybe the solution has a bit of both. Those who are responsible for running environments get pretty good at recognizing the standouts.

Innovation’s impact on thinking differently
"If I had asked my customers what they wanted they would have said a faster horse" — Henry Ford

It started out as a simple trial of some beta software. It ended up as an integral component of my infrastructure; viewed in the same way as my compute, switchgear, storage, and hypervisor. That is basically the story of how PernixData FVP became a part of my environment. For the next 18 months I would watch daily, hourly, even by the minute as to how my very demanding workloads were improved because of this new approach to solving a common problem. The results were immediate, and obvious. Faster code compiling times. Lower latencies and more predictable performance for all of our applications. All while gaining better visibility to the behavior and needs of our workloads. And of course, storage arrays that were no longer paralyzed by I/O requests. Even the best of slide decks couldn’t convey what I was seeing. I got to see it happen every day, and much like the magic of virtualization in general, it never got old.

It is for that reason that I’ve joined the team at PernixData. I get the chance to help others understand how the PernixData approach can help their environment, and is more than just a faster horse. I’m no longer responsible for my own workloads, but now get to help people better understand their own. Since I’ve always had a passion for virtualization, IT infrastructures, and how real application workloads impact an environment, I think it’s going to be a great fit. I look forward to working with an unbelievably talented group of people. It is quite an honor.

A tip of the hat
I leave an organization that is top notch. Tecplot is a market leader in data visualization, and is routinely voted in the top 100 companies to work for. This doesn’t happen by accident. It comes from great people, great leadership, and has resulted in trusted, innovative products. I would like to thank the ownership group for allowing me the opportunity to be a part of their team, as it has been an absolute pleasure to work there. I’ve learned a lot from smart, principled folks that make up that company, and am better off for it. I leave behind the day to day administrative duties and challenges of a virtualized environment, but I am very excited to join a great team of really smart people who have helped change how challenges in modern IT infrastructures are viewed, and addressed.

Happy New Year.

– Pete

A look at FVP 2.0’s new features in a production environment

I love a good benchmark as much as the next guy. But success in the datacenter is not solely predicated on the results of a synthetic benchmark, especially those that do not reflect a real workload. This was the primary motivation in upgrading my production environment to FVP 2.0 as quickly as possible. After plenty of testing in the lab, I wanted to see how the new and improved features of FVP 2.0 impacted a production workload. The easiest way to do this is to sit back and watch, then share some screen shots.

All of the images below are from my production code compiling machines running at random points of the day. The workloads will always vary somewhat, so take them as more "observational differences" than benchmark results. Also note that these are much more than the typical busy VM. The code compiling VMs often hit the triple crown in the "difficult to design for" department.

Large I/O sizes. (32K to 512K, with most being around 256K)
Heavy writes (95% to 100% writes during a full compile)
Sustained use of compute, networking, and storage resources during the compiling.

The characteristics of flash under these circumstances can be a surprise to many. Heavy writes with large I/Os can turn flash into molasses, and is not uncommon to have sporadic latencies well above 50ms. Flash has been a boon for the industry, and has changed almost everything for the better. But contrary to conventional wisdom, it is not a panacea. The characteristics of flash need to be taken into consideration, and expectations should be adjusted, whether it be used as an acceleration resource, or for persistent data storage. If you think large I/O sizes do not apply to you, just look at the average I/O size when copying some files to a file server.

One important point is that the comparisons I provide did not include any physical changes to my infrastructure. Unfortunately, my peering network for replica traffic is still using 1GbE, and my blades are only capable of leveraging Intel S3700 SSDs via embedded SAS/SATA controllers. The VMs are still backed by a near end-of-life 1GbE based storage array.

Another item worth mentioning is that due to my workload, my numbers usually reflect worst case scenarios. You may have latencies that are drastically lower than mine. The point being that if FVP can adequately accelerate my workloads, it will likely do even better with yours. Now let’s take a look and see the results.

Adaptive Network Compression
Specific to customers using 1GbE as their peering network, FVP 2.0 offers a bit of relief in the form of Adaptive Network Compression. While there is no way for one to toggle this feature off or on for comparison, I can share what previous observations had shown.

FVP 1.x
Here is an older image a build machine during a compile. This was in WB+1 mode (replicating to 1 peer). As you can see, the blue line (Observed VM latency) shows the compounding effect of trying to push large writes across a 1GbE pipe, to SATA/SAS based Flash devices was not as good as one would hope. The characteristics of flash itself, along with the constraints of 1GbE were conspiring with each other to make acceleration difficult.

FVP 2.0 using Adaptive Network Compression
Before I show the comparison of effective latencies between 1.x and 2.0, I want to illustrate the workload a bit better. Below is a zoomed in view (about a 20 minute window) showing the throughput of a single VM during a compile job. As you can see, it is almost all writes.

Below shows the relative number of IOPS. Almost all are write IOPS, and again, the low number of IOPS relative to the throughput is an indicator of large I/O sizes. Remember that with 512K I/O sizes, it only takes a couple of hundred IOPS to nearly saturate a 1GbE link – not to mention the problems that flash has with it.

Now let’s look at latency on that same VM, during that same time frame. In the image below, the blue line shows that the VM observed latency has now improved to the 6 to 8ms range during heavy writes (ignore the spike on the left, as that was from a cold read). The 6 to 8ms of latency is very close to the effective latency of a WB+0, local flash device only configuration.

Using the same accelerator device (Intel S3700 on embedded Patsburg controllers) as in 1.x, the improvements are dramatic. The "penalty" for the redundancy is greatly reduced to the point that the backing flash may be the larger contributor to the overall latency. What has really been quite an eye opener is how well the compression is helping. In just three business days, it has saved 1.5 TB of data running over the peer network. (350 GB of savings coming from another FVP cluster not shown)

Distributed Fault Tolerant Memory
If there is one thing that flash doesn’t do well with, it is writes using large I/O sizes. Think about all of the overhead that comes from flash (garbage collection, write amplification, etc.), and that in my case, it still needs to funnel through an overwhelmed storage controller. This is where I was looking forward to seeing how Distributed Fault Tolerant Memory (DFTM) impacted performance in my environment. For this test, I carved out 96GB of RAM on each host (384GB total) for the DFTM Cluster.

Let’s look at a similar build run accelerated using write-back, but with DFTM. This VM is configured for WB+1, meaning that it is using DFTM, but still must push the replica traffic across a 1GbE pipe. The image below shows the effective latency of the WB+1 configuration using DFTM.

The image above shows that using DFTM in a WB+1 mode eliminated some of that overhead inherent with flash, and was able to drop latencies below 4ms with just a single 1GbE link. Again, these are massive 256K and 512K I/Os. I was curious to know how 10GbE would have compared, but didn’t have this in my production environment.

Now, let’s try DFTM in a WB+0 mode. Meaning that it has no peering traffic to send it to. What do the latencies look like then for that same time frame?

If you can’t see the blue line showing the effective (VM observed) latencies, it is because it is hovering quite close to 0 for the entire sampling period. Local acceleration was 0.10ms, and the effective latency to the VM under the heaviest of writes was just 0.33ms. I’ll take that.

Here is another image of when I turned a DFTM accelerated VM from WB+1 to WB+0. You can see what happened to the latency.

Keep in mind that the accelerated performance I show in the images above come from a VM that is living on a very old Dell EqualLogic PS6000e. Just fourteen 7,200 RPM SATA drives that can only serve up about 700 IOPS on a good day.

An unintended, but extremely useful benefit of DFTM is to troubleshoot replica traffic that has higher than expected latencies. A WB+1 configuration using DFTM eliminates any notion of latency introduced by flash devices or offending controllers, and limits the possibilities to NICs on the host, or switches. Something I’ve already found useful with another vSphere cluster.

Simply put, DFTM is a clear winner. It can address all of the things that flash cannot do well. It avoids storage buses, drive controllers, NAND overhead, and doesn’t wear out. And it sits as close to the CPU with as much bandwidth as anything. But make no mistake, memory is volatile. With the exception of some specific use cases such as non persistent VDI, or other ephemeral workloads, one should take advantage of the "FT" part of DFTM. Set it to 1 or more peers. You may give back a bit of latency, but the superior performance is perfect for those difficult tier one workloads.

When configuring an FVP cluster, the current implementation limits your selection to a single acceleration type per host. So, if you have flash already installed in your servers, and want to use RAM for some VMs, what do you do? …Make another FVP cluster. Frank Denneman’s post: Multi-FVP cluster design – using RAM and FLASH in the same vSphere Cluster describes how to configure VMs in the same vSphere cluster to use different accelerators. Borrowing those tips, this is how my FVP clusters inside of a vSphere cluster look.

Write Buffer and destaging mechanism
This is a feature not necessarily listed on the bullet points of improvements, but deserves a mention. At Storage Field Day 5, Satyam Vaghani mentioned the improvements with the destaging mechanism. I will let the folks at PernixData provide the details on this, but there were corner cases in which VMs could bump up against some limits of the destager. It was relatively rare, but it did happen in my environment. As far as I can tell, this does seem to be improved.

Destaging visibility has also been improved. Ever since the pre 1.0, beta days, I’ve wanted more visibility on the destaging buffer. After all, we know that all writes eventually have to hit the backing physical datastore (see Effects of introducing write-back caching with PernixData FVP) and can be a factor in design. FVP 2.0 now gives two key metrics; the amount of writes to destage (in MB), and the time to the backing datastore. This will allow you to see if your backing storage can or cannot keep up with your steady state writes. From my early impressions, the current mechanism doesn’t quite capture the metric data at a high enough frequency for my liking, but it’s a good start to giving more visibility.

Honorable mentions
NFS support is a fantastic improvement. While I don’t have it currently in production, it doesn’t mean that I may not have it in the future. Many organizations use it and love it. And I’m quite partial to it in the old home lab. Let us also not dismiss the little things. One of my favorite improvements is simply the pre-canned 8 hour time window for observing performance data. This gets rid of the “1 day is too much, 1 hour is not enough” conundrum.

Conclusion
There is a common theme to almost every feature evaluation above. The improvements I showcase cannot by adequately displayed or quantified with a synthetic workload. It took real data to appreciate the improvements in FVP 2.0. Although 10GbE is the minimum ideal, Adaptive Network Compression really buys a lot of time for legacy 1GbE networks. And DFTM is incredible.

The functional improvements to FVP 2.0 are significant. So significant that with an impending refresh of my infrastructure, I am now taking a fresh look at what is actually needed for physical storage on the back end. Perhaps some new compute with massive amounts of PCIe based flash, and RAM to create large tiered acceleration pools. Then backing spindles supporting our capacity requirements, with relatively little data services, and just enough performance to keep up with the steady-state writes.

Working at a software company myself, I know all too well that software is never "complete." But FVP 2.0 is a great leap forward for PernixData customers.

Using FVP in multi-NIC vMotion environments

In FVP version 1.5, PernixData introduced a nice little feature that allows a user to specify the network to use for all FVP peering/replica traffic. This added quite a bit of flexibility in adapting FVP to a wider variety of environments. It can also come in handy when testing performance characteristics of different network speeds, similar to what I did when testing FVP over Infiniband. While the “network configuration” setting is self-explanatory, and ultra-simple, it is ESXi that makes it a little more adventurous.

VMkernels and rules to abide by. …Sort of.
“In theory there is no difference between theory and practice. In practice, there is.” — Yogi Berra

Under the simplest of arrangements, FVP will use the vMotion network for its replica traffic. If your vMotion works, then FVP works. FVP will also work in a multi-NIC vMotion arrangement. While it can’t use more than one VMkernel, vMotion certainly can. Properly configured, vMotion will use whatever links are available, leaving more opportunity and bandwidth for FVP’s replica traffic. This can be especially helpful in 1GbE environments. Okay, so far, so good. The problem can become when an ESXi host has multiple VMkernels in the same subnet.

The issues around having multiple VMkernels on a single host in one IP subnet is nothing new. The accepted practice has been to generally stay away from multiple VMkernels in a single subnet, but the lines blur a bit when factoring the VMkernel’s intended purpose.

In VMware Support Insider Post, it states to only use one VMkernel per IP Subnet. Well, except for iSCSI storage, and vMotion.
In VMware KB 2007467, it states: “Ensure that both VMkernel interfaces participating in the vMotion have the IP address from the same IP subnet.“

The motives for recommending isolation of VMkernels is pretty simple. The VMkernel network stack uses a single routing table to route traffic. Two hosts talking to each other on one subnet with multiple VMkernels may not know what interface to use. The result can be unexpected behavior, and depending on what service is sitting in the same network, even a loss of host connectivity. This behavior can also vary depending on the version of ESXi being used. ESXi 5.0 may act differently than 5.1, and 5.5 changes the game even more with the ability to create Custom TCP/IP stacks per VMkernel adapter.which could give each VMkernel its own routing table.

So what about FVP?
How does any of this relate to FVP? For me, this initial investigation stemmed from some abnormally high latencies I was seeing on my VMs. This is quite the opposite effect I’m used to having with FVP. As it turns out, when FVP was pinned to my vMotion-2 network, it was correctly sending out of the correct interface on my multi-NIC vMotion setup, but the receiving ESXi host was using the wrong target interface (vMotion-1 VMkernel on target host), which caused the latency. Just like other VMkernel behavior, it naturally wanted to always choose the lower vmk number. Configuring FVP to use vMotion-1 network resolved the issue instantly, as vMotion-1 in my case it was using vmk1 instead of vmk5. Many thanks to the support team for noticing the goofy communication path it was taking.

Testing similar behavior with vMotion
While the symptoms showed up in FVP, the cause is an ESXi matter. While not an exact comparison, one can simulate a similar behavior that I was seeing with FVP by doing a little experimenting with vMotion. The experiment simply involves taking an arrangement originally configured for Multi-NIC vMotion, disabling vMotion on the network with the lowest vmk number on both hosts, kicking off a vMotion, and observing the traffic via esxtop. (Warning. Keep this experiment to your lab only).

For the test, two ESXi 5.5 hosts were used, and mult-NIC vMotion was set up in accordance to KB 2007467. One vSwitch. Two VMkernel ports (vMotion-0 & vMotion-1 respectively) in an active/standby arrangement. The uplinks are flopped on the other VMkernel. Below is an example of ESX01:

And both displayed what I’d expect in the routing table.

The tests below will show what the traffic looks like using just one of the vMotion networks, but only where the “vMotion” service is enabled on one of the VMkernel ports.

Test 1: Verify what correct vMotion traffic looks like
First, let’s establish what correct vMotion traffic will look like. This is on a dual NIC vMotion arrangement in which only the network with the lowest numbered vmk is ticked with the “vMotion” service.

The screenshot below is how the traffic looks from the source on ESX01. The green bubble indicates the anticipated/correct VMkernel to be used. Success!

The screenshot below is how traffic looks from the target on ESX02. The green bubble indicates the anticipated/correct VMkernel to be used. Success!

As you can see, the traffic is exactly as expected, with no other traffic occurring on the other VMkernel, vmk2.

Test 2: Verify what incorrect vMotion traffic looks like
Now let’s look at what happens on those same hosts when trying to use only the higher numbered vMotion network. The “vMotion” service was changed on both hosts to the other VMkernel, and both hosts were restarted. What is shown below is how the traffic looks on a dual NIC vMotion arrangement in which the network with the lowest numbered vmk has the “vMotion” service unticked, and the higher numbered vMotion network has the service enabled.

The screenshot below is how the traffic looks from the source on ESX01. The green bubble indicates the anticipated/correct VMkernel to be used. The red bubble indicates the VMkernel it is actually using. Uh oh. Notice how there is no traffic coming from vmk2, where it should be coming from? It’s coming from vmk1, exactly like the first test.

The screenshot below is how traffic looks from the target on ESX02. The green bubble indicates the anticipated/correct VMkernel to be used.

As you can see, under the described test arrangement, ESXi can and may use the incorrect VMkernel on the source, when vMotion is disabled on the vMotion network with the lowest VMkernel number, and active on the other vMotion network. It was repeatable with both ESXi 5.0 and ESXi 5.5. The results were consistent in tests with host uplinks connected to the same switch versus two stacked switches. The tests were also consistent using both standard vSwitches and Distributed vSwitches.

The experiment above is just a simple test to better understand how the path from the source to the target can get confused. From my interpretation, it is not unlike that of which is described in Frank Denneman’s post on why a vMotion network may accidently use a Management Network. (His other post on Designing your vMotion Network is also a great read, and applicable to the topic here.) Since FVP can only use one specific VMkernel on each host, I believe I was able to simulate the basics of why ESXi was making it difficult for FVP when pinning the FVP replica traffic on my higher numbered vMotion network in my production network. Knowing this lends itself to the first recommendation below.

A few different ways to configure FVP
After looking at the behavior of all of this, here are a few recommendations on using FVP with your VMkernel ports. Let me be clear that these are my recommendations only.

Ideally, create an isolated, non-routable network using a dedicated VLAN with a single VMkernel on each host, and assign only FVP to that network. It can live in whatever vSwitch is most suitable for your environment. (The faster the uplinks, the better). This will isolate, and insure the peer traffic is flowing as designed, and will let a multi-NIC vMotion arrangement work by itself. Here is an example of what that might look like:

If for some reason you can’t do the recommendation above, (maybe you need to wait on getting a new VLAN provisioned by your network team) use a vMotion network, but if it is a multi-NIC vMotion, set FVP to run on the vMotion network with lowest numbered VMkernel. According to, yes, another great post from Frank, this was the default approach for FVP prior to exposing the ability to assign FVP traffic to a specific network.

Remember that if there is ever a need to modify anything related to the VMkernel ports (untick the “vMotion” configuration box, adding or removing VMkernels), be aware that the routing interface (as seen via esxcfg-route -l ) may not change until there is a host restart. You may also find that using esxcfg-route -n to view the host’s arp table handy.

The ability for you to deliver your FVP traffic to it’s East-West peers in the fastest, most reliable way will allow you to make the most of FVP offers. Treat the FVP like a first class citizen in your network, and it will pay off with better performing VMs.

And a special thank you to Erik Bussink and Frank Denneman for confirming my vMotion test results, and my sanity.

Thanks for reading.

– Pete

Using vscsiStats to better visualize storage I/O

As the saying goes, a picture is worth a thousand words. This isn’t more true than in the world of data visualization. Raw data has its place, but good visualization methods help translate numbers into a meaningful story, and assist with overcoming the deficiencies associated with looking at a spreadsheet of raw numbers. A good visual representation of data gives context, establishes relationships between the numbers, communicates results more clearly; making it easier for you and others to remember. The difference for you as an Administrator can be better approaches to trouble shooting, or helping you in your ability to make smart design and purchasing decisions.

Virtualization Administrators are faced with digesting performance information quite often. vCenter does a pretty good job of letting the Administrator skip the data collection nonsense, and jump into viewing relevant metrics in an easy to read manner. But the vCenter metrics do not always give a complete view of information available, and occasionally needs a little help when one is trying to better understand key performance indicators.

A different way to use vscsiStats
“Some people see the glass half full. Others see it half empty. I see a glass that’s twice as big as it needs to be.” — George Carlin.

VMware’s vscsiStats is a great tool to collect and view storage I/O data in a different way. It can help to harvest a wealth of information about VMs that can be manipulated in a number of ways. For as good as it is, I believe it suffers a bit in that it is geared toward providing summations of a single sample period of time. One can collect all sorts of great information during a specific period, but it gives you no idea of what happened when, and why. To be truly useful, it needs to handle continuous, adjacent sampling periods.

But fear not, with a little extra effort, vscsiStats can be manipulated to factor in time. Combine those results with an Excel 3D surface chart, and you have some neat new ways to interpret the data. Erik Zandboer has fantastic information on how to leverage vscsiStats to generate multiple sampling periods. Combine this with a nice template he provides, and most of the heavy lifting is done for you already. Having that created already was great for me, as I find that the fun is not in generating the graphs, but interpreting, and learning from them.

In an effort to see how similar data can look a bit differently using other tools, let us take a look at a production VM running a real code compiling workload. The area in the red bubble is the time period we will be concentrating on. The screen capture below shows the CPU utilization for the 8 vCPU VM.

The screen capture below shows the storage related metrics for the specific VMDK of the VM, such as read and write IOPS, latency, and number of outstanding commands. In this particular case, the VM is being accelerated by PernixData FVP, but I changed the configuration so that it was only accelerating reads via its "write-through" configuration. Write I/Os are limited to the speed of the backing physical infrastructure. I did this to provide some more interesting graphs, as you will see in a bit.

Now it is time to use vscsiStats to look at similar storage related metrics. In this case, vscsiStats sampled the data in 20 second intervals, for a duration of 400 seconds, and reflects the time period within the red bubbles in the screen captures above. It is a relatively short amount of time for observations, but I didn’t want to smooth the data too much by choosing a long sample interval. In the charts below, read related activity is in green, and write related activity is in dark red. Note that on values such as latency, and I/O size, the axis will use a logarithmic scale.

I/O Size
First, lets take a look at I/O size for reads

You see from above that read I/Os from this period of time were mostly 4K and 32K in size. Contrast this with the write I/Os that are shown for the same sample period below.

The image above shows a significant amount of write I/Os at 32K, 64K, 128K, 256K, and 512K. Notice how much different that looks as compared to the read I/Os. Unlike read I/Os, we know write I/O sizes tend to have a more significant impact on latency.

Latency
Now let’s take a look at latency.

Many of the read I/Os shown above come in at around .5ms to 1ms latency. Reads I/Os can be an easier I/O to satisfy, and the latency reflects that. The image below shows many of the writes coming in between 5ms and 15ms or higher. Just like with the other graphs, we get a better understanding of the magnitude of I/Os (z axis) that come in at a given measurement.

Outstanding I/Os
This shows the number of outstanding read I/Os when a new read I/O is issued. As you can see below, the reads are being served pretty fast, and does not have more than around 1 or 2 outstanding read I/Os. In an ideal world we would want this to be as low as possible for all reads and writes.

However, you can see that with writes, it is quite a different story. The increased latency, which comes in part due to the larger I/O sizes used, impacts the number of outstanding write I/Os waiting. The image shows a several points in which the number of outstanding write I/Os surpassed 20. I find this image below visually one of the most impactful.

Sequential versus Random
vscsiStats also demonstrates whether the I/O of a given workload is sequential, or random.

With both reads, and writes, you can see that this particular snippet of a workload is predominately random I/O. Sequential I/O would all be very closely aligned with the ‘0’ value near the middle of the graph.

Conclusion
You can see that from this very small, 6 1/2 minute time period on one VM, the workload demanded different things at different times from the backing storage. Differences that were not readily apparent from the traditional vCenter metrics. Now imagine what other workloads on the same system may look like, or even what other systems may look like. As an aggregate, how might all of these systems be taxing your hosts and storage infrastructures? These are all very good questions with answers specific to each and every environment.

As demonstrated above, using vscsiStats can be a great way to compliment other monitoring metrics found in vCenter, and will surely give you a better understanding of the behavior of your virtualized environment.

Thanks for reading.

– Pete

Testing InfiniBand in the home lab with PernixData FVP

One of the reasons I find the latest trends in datacenter architectures so interesting is the innovative approaches used to address deficiencies associated with more traditional arrangements. These innovations have been able to drive more of what almost everyone needs; better storage performance and better scalability.

The caveat to some of these newer arrangements is that it can put heavy stress on the plumbing that connects these servers. Distributed storage technologies like VMware VSAN, or clustered write buffering techniques used by PernixData FVP and Atlantis Computing’s USX leverage these interconnects to accelerate storage traffic. Turn-key Hyperconverged solutions do too, but they enjoy the luxury of having full control over the hardware used. Some of these software based solutions might need some retrofitting of an environment to run optimally or meet their requirements (read: 10GbE or better). The desire for the fastest interconnect possible between hosts doesn’t always align with budget or technical constraints, so it makes most sense to first see what impact there really is.

I wanted to test the impact better bandwidth would have between servers a bit more, but do to constraints in my production environment, I needed to rely on my home lab. As much as I wanted to throw 10GbE NICs in my home lab, the price points were too high. I had to do it another way. Enter InfiniBand. I’m certainly not the only one to try InfiniBand in a home lab, but I wanted to focus on two elements that are critical to the effectiveness of replica traffic. The overall bandwidth of the pipe, and equally important, the latency. While I couldn’t simulate an exact workload that I see in my production environment, I could certainly take smaller snippets of I/O patterns that I see, and model them the best I can.

InfiniBand is really interesting. As Joeb Jackson put it in a NetworkWorld.com article, "InfiniBand is architecturally sacrilegious" as it combines many layers of the OSI model. The results can be stunning. Transport latencies in the 2 microsecond neighborhood, and a healthy roadmap to 200Gbps and beyond. It’s sort of like the ’66 AC Shelby Cobra of data transports. Simple, and perhaps a little rough around the edges, but brutally fast. While it doesn’t have the ubiquity of RJ/Ethernet, it also doesn’t have the latencies that are still a part of those faster forms of Ethernet.

At the time of this writing, the InfiniBand drivers for ESXi 5.5 weren’t quite ready for VSAN testing yet, so the focus of this testing is to see how InfiniBand behaves when used in a PernixData FVP deployment. I hope to publish a VSAN edition in the future. I simply wanted to better understand if (and how much) a faster connection would improve the transmission of replica traffic when using FVP in WB+1 mode (local flash, and 1 peer). My production environment is very write intensive, and uses 1GbE for the interconnects. Any insight gained here will help in my design and purchasing roadmap for my production environment.

Testing:
Testing occurred on a two host cluster backed by a Synology DS1512+. Local flash leveraged SATA III based EMLC SSD drives using an onboard controller. 1GbE interconnects traversed a Cisco SG300-20 using a 1500 byte MTU size. For InfiniBand, each host used a Mellanox MT25418 DDR 2 port HCA that offered 10Gb per connection. They were directly connected to each other, and used a 2044 byte MTU size. InfiniBand can be set to 4092 bytes but for compatibility reasons under ESXi 5.5, 2044 is the desired size.

I tend to prefer testing that relies on observational patterns versus one final, empirical number. These tests were no different, and while they attempt to simulate a very brief snippet of a workload in my production environment, I find that I still gain a much better understanding from a time based performance graph than an insulated final number.

The test case was a simple one, but would be enough to illustrate the differences I was hoping to see. The test comprised of a 2vCPU VM using 2 workers on a 100% write, 100% random workload lasting for 1 minute. The test was run three times. First with WB+0 (no peer/replica traffic), then WB+1 (one peer) using a 1GbE connection, and finally WB+1 over a single 10Gb InfiniBand connection. Each screen capture I provide will show them in that order. That test case was repeated 3 times. First with 256KB I/O sizes, followed by 32KB, then onto 4KB. I ran the tests several times in different order to ensure I wasn’t introducing inflated or deflated performance due to previous tests or caching. All were repeated several times to flush out any anomalies.

(Click on each image for a larger view)

256KB I/O size test
Testing results using this I/O size is rarely published anywhere because it never bodes well in comparison to a smaller I/O size like 4KB. But my production workloads (compiling) often deal with these I/O sizes, so it is important for me to understand their behavior.

IOPS with 256KB I/O

Latency with 256KB I/O

Throughput using 256KB I/O

Observations from 256KB I/O test
Note that the IOPS and effective throughput on the WB+1 using InfiniBand was nearly identical to that of a WB+0 (local flash only) scenario. You can also see how much the 1GbE interface throttled down the performance, driving just half of the IOPS and throughput compared to InfiniBand. But also take a look at the terrible native latency (70ms) of large I/O sizes even when using WB+0 (no peer traffic. Just local flash). Also note that when peer traffic performance is improved, the larger backlog of data in the destager occurs.

32KB I/O size test
Just 1/8th the size of a 256KB I/O, this is still larger than most storage vendors like to advertise in their testing. My production workload often oscillates between 32KB and 256KB I/Os.

IOPS with 32KB I/O

Latency with 32KB I/O

Throughput using 32KB I/O

Observations from 32KB I/O test
Once again, the IOPS and effective throughput on the WB+1 using InfiniBand was nearly identical to that of a WB+0 (local flash only) scenario. You can also see how much the 1GbE interface throttled down the performance on throughput. Latency had only a minor improvement moving away from 1GbE, as the latency of the flash was about 6ms.

4KB I/O size test
The most common of I/O sizes that you might see, although it is more common on reads than writes. 1/64th the size of a 256KB I/O, it is tiny compared to the others, but important to test because of the attempt to learn if and how much a fatter, lower latency pipe helps in various I/O sizes.

IOPS with 4KB I/O

Latency with 4KB I/O

Throughput using 4KB I/O

Observations from 4KB I/O test
IOPS and effective throughput on the WB+1 using InfiniBand was nearly identical to that of a WB+0 (local flash only) scenario. But as the I/O sizes shrink, so does the effective total/concurrent payload size. So the differences between InfiniBand and 1GbE were less than on tests with larger I/O. Latencies of this I/O size were around 2ms.

Other observations that stood out
One of the first things that stood is illustrated below, with two 5 minutes test runs. Look at where the two arrows point. The arrow on the left points to the number of packets sent while using 1GbE. The arrow on the right shows the number of packets sent while using 10Gb InfiniBand. Quite a difference. Also notice that the effective throughput started out higher, but had to throttle back

Findings:
The key takeaways from these tests:

A high bandwidth, low latency interconnect like InfiniBand can virtually eliminate any write redundancy penalty incurred in WB+1 mode.

From a single workload, I/O sizes of 32KB and 256KB saw between 65% and 90% improvement on IOPS and throughput. I/O sizes of 4KB saw essentially no improvement (many concurrent 4KB workloads likely would see a benefit however)..

Writes using larger I/O sizes were the clear beneficiary of a fatter pipe between servers. However, the native latencies of the flash devices under larger I/O sizes could not take advantage of the low latencies of InfiniBand. In other words, with large I/O sizes, the flash device themselves, or the bus they were using were by far the major impediment lower latency and faster I/O delivery
The smaller pipe of 1GbE throttled back the flash device’s ability to ingest the data as fast as InfiniBand. There was always a smaller amount of outstanding writes once the test was complete, but it came at the cost of poorer performance for 1GbE.

A few other matters can come up when attempting to accurately interpret latencies. As

VMware KB 2036863

points out, reporting of latencies accurately can sometimes be a challenge. Just something to be aware of.

Conclusion
InfiniBand was my affordable way to test how a faster interconnect would improve the abilities of FVP to accelerate replica storage I/O. It lived up to the promise of high bandwidth with low latency. However, effective latencies were ultimately crippled by the SSDs, the controller, or the bus it was using. I did not have the opportunity to test other flash technologies such as PCIe based solutions from Fusion-IO or Virident, or the memory channel based solution from Diablo Technologies. But based on the above, it seems to be clear that how the flash is able to ingest the data is crucial to the overall performance of whatever solution that is using it.

Helpful Links
Erik Bussink’s great post on using InfiniBand with vSphere 5.5
http://www.bussink.ch/?p=1306

Vladen Seget’s post on incorporating InfiniBand into his backing storage
http://www.vladan.fr/homelab-storage-network-speedup/

Mellanox, OFED and OpenSM bundles
https://my.vmware.com/web/vmware/details/dt_esxi50_mellanox_connectx/dHRAYnRqdEBiZHAlZA==
http://www.mellanox.com/downloads/Drivers/MLNX-OFED-ESX-1.8.1.0.zip
http://files.hypervisor.fr/zip/ib-opensm-3.3.16-64.x86_64.vib

Practical tips for a Veeam Backup and Recovery deployment

I’ve been using Veeam Backup and Recovery in my production environment for a while now, and in hindsight, it was one of the best investments we’ve ever made in our IT infrastructure. It has completely changed the operational overhead of protecting our VMs, and the data they serve up. Using a data protection solution that utilizes VMware’s APIs provides the simplicity and flexibility that was always desired. Moving away from array based features for protection has enabled the protection of VMs to better reflect desired RPO and RTO requirements – not by the limitations imposed by LUN sizes, array capacity, or functionality.

While Veeam is extremely simple in many respects, it is also a versatile, feature packed application that can be configured a variety of different ways. The versatility and the features can be a little confusing to the new user, so I wanted to share 25 tips that will help make for a quick and successful deployment of Veeam Backup and Recovery in your environment.

First lets go over a few assumptions that will be the basis for my recommendations:

There are two sites that need protection.
VMs and data need to be protected at each site, locally.
VMs and data need to be protected at each site, remotely.
A NAS target exists at each site.
Quick deployment is important.
You’ve already read all of the documentation.

Architecture
There are a number of different ways to set up the architecture for Veeam. I will show a few of the simplest arrangements:

In this arrangement below there would be no physical servers – only a NAS device. This is a simplified arrangement of what I use. If one wanted a rebuilt server (Windows or Linux) acting purely as a storage target, that could be in place of where you see the NAS. The architecture would stay the same.

Optionally, a physical server not just acting as a storage target, but also as a physical proxy would look something like this below:

Below is a combination of both, where a physical server is acting as the Proxy, but like the virtual proxy, is using an SMB share to house the data. In this case, a NAS unit.

Implementation tips
These tips focus not so much on ultimately what may suite your environment best (only you know that) or leveraging all of the features inside the product, but rather, getting you up and running as quickly as possible so you can start returning great results.

Job Manager Servers & Proxies

1. Have the job Manager server, any proxies, and the backup targets living on their own VLAN for a dedicated backup network.

2. Set up SNMP monitoring on any physical ports used in the backup arrangement. It will be helpful to understand how utilized the physical links get, and for how long.

3. Make sure to give the Job Manager VM enough resources to play with – especially if it will have any data mover/proxy responsibilities. The deployment documentation has good information on this, but for starters, make it 4vCPU with 5GB of RAM.

4. If there is more than one cluster to protect, consider building a virtual proxy inside each cluster that it will be responsible for protecting, then assign it to jobs that protect VMs in that cluster. In my case, I use PernixData FVP in two clusters. I have the data stores that house those VMs only accessible by their own cluster (a constraint of FVP). Because of that, I have a virtual proxy living in each cluster, with backup jobs configured so that it will use a specific virtual proxy. These virtual proxies have a special setting in FVP that will instruct the VMs being backed up to flush their write cache to the backing storage

Storage and Design

5. Keep the design simple, even if you know you will need to adjust at a later time. Architectural adjustments are easy to do with Veeam, so go ahead and get Veeam pointed to the target, and start running some jobs. Use this time to get familiar with the product, and begin protecting the jewels as quickly as possible.

6. Let Veeam use the default SQL Server Express instance on the Veeam Job Manager VM. This is a very reasonable, and simple configuration that should be adequate for a lot of environments.

7. Question whether a physical proxy is needed. Typically physical proxies are used for one of three reasons. 1.) They offload job processing CPU cycles from your cluster. 2.) In simple arrangements a Windows based Physical proxy might also be the Repository (aka storage target). 3.) They allow for one to leverage a "direct-from-SAN" feature by plugging in the system to your SAN fabric. The last one in my opinion introduces the most hesitation. Here is why:

Some storage arrays do not have a "read-only" iSCSI connection type. When this is the case, special care needs to be taken on the physical server directly attached to the SAN to ensure that it cannot initialize the data store. The reality is that you are one mistake away from having a very long day in front of you. I do not like this option when there is no secondary safety mechanism from the array on a "read-only" connection type.
Direct-from-SAN access can be a very good method for moving data to your target. So good that it may stress your backing storage enough (via link saturation or physical disk limits) to perhaps interfere with your production I/O requirements.
Additional efforts must be taken when using write buffering mechanisms that do not live on the storage array (e.g. PernixData) .

8. Veeam has the ability to back up to an SMB share, or an NFS mount. If an NFS mount is chosen, make sure that it is a storage target running native Linux. Most NAS units like a Synology are indeed just a tweaked version of Linux, and it would be easy to conclude that one should just use NFS. However, in this case, you may run into two problems.

The SMB connection to a NAS unit will likely be faster (which most certainly is the first time in history that an SMB connection is faster than an NFS connection) .
The Job Manager might not be able to manage the jobs on that NAS unit (connected via NFS) properly. This is due to BusyBox and Perl on the Synology not really liking each other. For me, this resulted in Veeam being unable to remove sun setting backups. Changing over to an SMB connection on the NAS improved the performance significantly, and allowed for job handling to work as desired.

9. Veeam has a great new feature (version 7.x) called a "Backup Copy" job, which allows for the backup made locally to be shipped to a remote site. The "Backup Copy" job achieves one of the most basic requirements of data protection in the simplest of ways. Two copies of the data at two different locations, but with the benefit of only processing the backup job once. It is a new feature of Version 7, and although it is a great feature, it behaves differently, and warrants some time spent before putting into production. For a speedy deployment, it might be best simply to configure two jobs. One to a local target, and one to a remote target. This will give you the time to experiment with the Backup Copy job feature.

10. There are compelling reasons for and against using a rebuilt server as a storage target, or using a NAS unit. Both are attractive options. I ended using a dedicated NAS unit. It’s form factor, drive bay count, and the overall cost of provisioning was the only option that could match my requirements.

Operations

11. In Veeam B&R, "Replication Jobs" are different than "Backup Jobs." Instead of trying to figure out all of the nuances of both right away, use just the "Backup job" function with both local and remote targets. This will give you time to better understand the characteristics of the replication functionality. One also might find that the "Backup Job" suites the environment and need better than the replication option.

12. If there are daily backups going to both local and offsite targets (and you are not using the "Backup Copy" option, have them run 12 hours apart from one another to reduce RPOs.

13. Build up a test VM to do your testing of a backup and restore. Restore it in the many ways that Veeam has to offer. Best to understand this now rather than when you really need to.

14. I like the job chaining/dependency feature, which allows you to chain multiple jobs together. But remember that if a job is manually started, it will run through the rest of the jobs too. The easiest way to accommodate this is to temporarily remove it from the job chain.

15. Your "Backup Repository" is just that, a repository for data. It can be a Windows Server, a Linux Server, or an SMB share. If you don’t have a NAS unit, stuff an old server (Windows or Linux) with some drives in it and it will work quite well for you.

16. Devise a simple, clear job naming scheme. Something like [BackupType]-[Descriptive Name]-[TargetLocation] will quickly tell you what it is and where it is going to. If you use folders in vCenter to organize your VMs, and your backups reflect the same, you could also choose to use the folder name. An example would be "Backup-SharePointFarm-LOCAL" which quickly and accurately describes the job.

17. Start with a simple schedule. Say, once per day, then watch the daily backup jobs and the synthetic fulls to see what sort of RPO/RTOs are realistic.

18. Repository naming. Be descriptive, but come up with some naming scheme that remains clear even if you aren’t in the application for several weeks. I like indicating the location of the repository, if it is intended for local jobs, or remote jobs, and what kind of repository it is (Windows, Linux, or SMB). For example: VeeamRepo-[LOCATION]-for-Local(SMB)

19. Repository organization. Create a good tree structure for organization and scalability. Veeam will do a very good job at handling the organization of the backups once you assign a specific location (share name) on a repository. However, create a structure that provides the ability to continue with the same naming convention as your needs evolve. For instance, a logical share name assigned to a repository might be \\nas01\backups\veeam\local\cluster1 This arrangement allows for different types of backups to live in different branches.

20. Veeam might prevent the ability of creating more than one repository going to the same share name (it would see \\nas01\backups\veeam\local\cluster1 and \\nas01\backups\veeam\local\cluster2 as the same). Create DNS aliases to fool it, then make those two targets something like \\nascluster1\backups\veeam\local\cluster1 and \\nascluster2\backups\veeam\local\cluster2

21. When in doubt, leave the defaults. Veeam put in great efforts to make sure that you, or the software doesn’t trip over itself. Uncertain of job number concurrency? Stick to the default. Wondering about which backup mode to use? (Reverse Incremental versus Incrementals with synthetic fulls). Stay with the defaults, and save the experimentation for later.

22. Don’t overcomplicate the schedule (at least initially). Veeam might give you flexibility that you never had with array based protection tools, but at the same time, there is no need to make it complicated. Perhaps group the VMs by something that you can keep track of, such as the folders they are contained in within vCenter.

23. Each backup job can be adjusted so that whatever target you are using, you can optimize it for preset storage optimization type. WAN target, LAN target, or local target. This can easily be overlooked, but will make a difference in backup performance.

24. How many backups you can keep is a function of change range, frequency, dedupe and compression, and the size of your target. Yep, that is a lot of variables. If nothing else, find some storage that can serve as the target for say, 2 weeks. That should give a pretty good sampling of all of the above.

25. Take one item/feature once a week, and spend an hour or two looking into it. This will allow you to find out more about say, Changed block tracking, or what the application aware image processing feature does. Your reputation (and perhaps, your job) may rely on your ability to recover systems and data. Come up with a handful of scenarios and see if they work.

Veeam is an extremely powerful tool that will simplify your layers of protection in your environment. Features like SureBackup, Virtual Labs, and their Replication offerings are all very good. But more than likely, they do not need to be a part of your initial deployment plan. Stay focused, and get that new backup software up and running as quickly as possible. You, and your organization, will be better off for it.

– Pete

Effects of introducing write-back caching with PernixData FVP

Implementing new technology that solves real problems is great. It is exciting, and you get to stand on the shoulders of the smart folks who dreamed up the solution. But with all of that glory comes new design and operation elements that may have been introduced. This isn’t a bad thing. It is just different. The magic of virtualization didn’t excuse the requirement of needing to understand the design and operational considerations of the new paradigm. The same goes for implementing host based caching in a virtualized environment.

Implementing FVP is simple and the results can be impressive. For many, that is about all the effort they may end up putting into it. But there are design considerations that will help maximize the investment, and minimize false impressions, or costly mistakes. I want to share what has been learned against my real world workloads, so that you can understand what to look for, and possibly how to get more out of your investment. While FVP accelerates both reads and writes, it is the latter that warrants the most consideration, so that will be the focus of this post.

When accelerating storage using FVP, the factors that I’ve found to have the most influence on how much your storage I/O is accelerated are:

Interconnect speed between hosts of your pooled flash
Performance delta between your flash tier, and your storage tier.
Working set size of your data
Duty cycle write I/O profile of your VMs (including peak writes, and duration)
I/O size of your writes (which can vary within each workload)
Likelihood or frequency of DRS or manual vMotion activities
Native speed and consistency of your flash (the flash itself, and the bus speed)
Capacity of your flash (more of an influence on read caching, but can have some impact on writes too)

Write-back caching & vMotion
Most know by now that to guard against any potential data loss in the event of a host failure, FVP provides redundancy of write-back caching through the use of one or more peers. The interconnect used is the vMotion network. While FVP does a good job of decoupling the VM’s need to wait for the backing datastore, a VM configured for write-back with redundancy must acknowledge the write I/O of the VM from it’s local flash, AND the one or more peers before it returns the write ACK to the VM.

What does this mean to your environment? More traffic on your vMotion network. Take a look at the image below. In a cluster NOT accelerated by FVP, the host uplinks that serve a vMotion network might see relatively little traffic, with bursts of traffic only during vMotion activities. That would also be the case if you were running FVP in write-back mode with no peers (WB+0). This image below is what the activity on the vMotion network looks like as perceived by one of the hosts after the VMs had write-back with redundancy of one peer. In this case the writes were averaging about 12MBps across the vMotion network. You will see that the spike is where a vMotion kicked off: The spike is the peak output of a 1GbE interface; about 125MBps.

Is this bad that the traffic is running over your vMotion network? No, not necessarily. It has to run over something. But with this knowledge, it is easy to see that bandwidth for inter-server communication will be more important than ever before. Your infrastructure design may need to be tweaked to accommodate the new role that the vMotion network plays.

Can one get away with a 1GbE link for cross server communication? Perhaps. It really depends on the factors above, which can sometimes be hard to determine. So with all of the variables to consider, it is sometimes easiest to circle back to what we do know:

Redundant write back caching with FVP will be using network connectivity (via vMotion network) for every single write that occurs for an accelerated VM.
Redundant write back caching writes are multiplied by the number of peers that are configured per accelerated VM.
The write accelerated I/O commit time (latency) will be as fast as the slowest connection. Your vMotion network will likely be slower than the local bus. A poor quality SSD or an older generation bus could be a bottleneck too.
vMotion activities enjoy using every bit of bandwidth it has available to it.
VM’s that are committing a lot of writes might also be taxing CPU resources, which may kick in DRS rules to rebalance the load – thus creating more vMotion traffic. Those busy VMs may be using more active memory pages as well, which may increase the amount of data to move during the vMotion process.

The multiplier of redundancy
Lets run through a simple scenario to better understand the potential impact an undersized vMotion network can have on the performance of write-back caching with redundancy. The example is addressing writes only.

4 hosts each have a group of 6 VM’s that consistently write 5MBps per VM. Traditionally, these 24 VMs would be sending a total of 120MBps to the backing physical storage.
When write back is enabled without any redundancy (WB+0), the backing storage will still see the same amount of writes committed, but it will be in a slightly different way. Sequential, and smoothed out as data is flushed to the backing physical storage.
When write back is enabled and a write redundancy of “local flash and 1 network flash device” (WB+1) is chosen, the backing storage will still see 120MBps go to it eventually, but there will be an additional 120MBPs of data going to the host peers, traversing the vMotion network.
When write back is enabled and a write redundancy of “local flash and 2 network flash devices” (WB+2) is chosen, the backing storage will still see 120MBps to it, but there will be an additional 240MBps of data going to the host peers, traversing the vMotion network.

The write-back redundancy configuration is a per-VM setting, so there not necessarily a need to change them all to one setting. Your VMs will most likely not have the same write workload either. But this is to illustrate the point that as the example shows, it is not hard to saturate a 1GbE interface. Assuming an approximate 125MBps on a single 1GbE interface, under the described arrangement, saturation would occur with each VM configured for write-back with redundancy of one peer (WB+1). This leaves little headroom for other traffic that might be traversing that network, such as vMotions, or heartbeats.

Fortunately FVP has the smarts built in to ensure that vMotion activities and write-back caching get along. However, there is no denying the physics associated with the matter. If you have a lot of writes, and you really want to leverage the full beauty of FVP, you are best served by fast interconnects between hosts. It is a small price to pay for supreme performance. FVP might expose the fact that 1GbE not be ideal in an accelerated environment, but consider what else has changed over the years. Standard memory sizes of deployed VMs have increased significantly (The vOpenData Public Dashboard confirms this). That 1GbE vMotion network might have been good for VM’s with 512MB of RAM, but what about those with 4, 8, or 12GB of RAM? That 1GbE vMotion network has become outdated even for what it was originally designed for.

Destaging
One characteristic unique with any type of write-back caching is that eventually, the data needs to be destaged to the backing physical datastore. The server-side flash that is now decoupled from the backing storage has the potential to accommodate a lot of write I/Os with minimal latency. One may or may not have the backing spindles, or conduit large enough to be sending your write I/O to the backing physical storage if this high write I/O lasts long enough. Destaging issues can occur on an arrangement like FVP, or with storage arrays and DAS arrangements that front performance I/O with flash that get pushed to slower spindles.

Knowing the impact of this depends on the workload and the environment it runs in.

If the duty cycle of the write workload that is above the physical storage I/O limit allows for enough “rest time” (defined as any moment that the max I/O to the backing physical storage is below 100%) to destage before the next over commitment begins, then you have effectively increased your ability to deliver more write I/Os with less latency.
If the duty cycle of the write workload that is above the physical storage I/O limit is sustained for too long, the destager of that given VM will fill to capacity, and will not be able to accelerate any faster than it’s ability to destage.

Huh? Okay, a picture might be a better way to describe this. The callouts below point to the two scenarios described.

So when looking at this write I/O duty cycle, there becomes a concept of amplitude of the maximum write I/O, and frequency of those times in which is it overcommitting. When evaluating an environment, you might see this crude sine-wave show up. This write I/O duty cycle, coupled with your physical components is the key to how much FVP can accelerate your environment.

What happens when the writes to the destager surpass the ability of your backing storage to keep up with the writes? Once the destager for that given VM fills up, it’s acceleration will reduce to the rate that it can evacuate the data to the backing storage. One may never see this in production, but it is possible. It really depends on the factors listed at the beginning of the post. The only way to clearly see this is from a synthetic workload, where I show it was able to push 5 times the write I/Os (blue line) before eventually filling up the destager to the point where it was throttled back to the rate of the datastore (purple line)

This will have an impact on the effective latency, shown below (blue line). While the destager is full, it will not be able to fulfill the write at the low latency typically associated with flash, reflecting latency closer to the backing datastore (purple line).

Many workloads would never see this behavior, but those that are very write intensive (like mine), and that have a big delta between their acceleration tier and their backing storage may run into this.

The good news is that workloads have a tendency to be bursty, which is a perfect match for an acceleration tier. In a clustered arrangement, this is much harder to predict, and bursty can be changed to steady-state quite quickly. What this demonstrates is that if there is enough of a performance delta between your acceleration tier, and your storage tier, under cases of sustained writes, there may be times where it doesn’t have the opportunity to flush enough writes to maintain it’s ability to accelerate.

Recommendations
My recommendations (and let me clarify that these are my opinions only) on implementing FVP would include.

Initially, run the VMs in write-through mode so that you can leverage the FVP analytics to better understand your workload (duty cycles, read/write ratios, maximum write throughput for a VM, IOPS, latency, etc.)
As you gain a better understanding of the behavior of these workloads, introduce write-back caching to see how it helps the systems changed.
Keep and eye on your vMotion network (in particular, those with 1GbE environments and limited physical ports) and see if one ever comes close to saturation. Other leading indicators will be increased latency on accelerated writes.
Run out and buy some 10GbE NICs for your vMotion network. If you are in a situation with a total 1GbE legacy fabric for your SAN, and your vMotion network, and perhaps you have limits on form factors that may make upgrading difficult (think blades here), consider investing in 10GbE for your vMotion network, as opposed to your backing storage. Your read caching has probably already relieved quite a bit of I/O pressure on your storage, and addressing your cross server bandwidth is ultimately a more affordable, and simpler task.
If possible, allocate more than one link and configure for Multi-NIC vMotion. At this time, FVP will not be able to leverage this, but it will allow vMotion to use another link if the other link is busy. Another possible option would be to bond multiple 1GbE links for vMotion. This may or may not be suitable for your environment.

So if you haven’t done so already, plan to incorporate 10GbE for cross-server communication for your vMotion Network. Not only will your vMotioning VM’s thank you, so will the performance of FVP.

– Pete

Helpful links:

Fault Tolerant Write acceleration
http://frankdenneman.nl/2013/11/05/fault-tolerant-write-acceleration/

Destaging Writes from Acceleration Tier to Primary Storage
http://voiceforvirtual.com/2013/08/14/destaging-writes-i/

Using a new tool to discover old problems

It is interesting what can be discovered when storage is accelerated. Virtual machines that were previously restricted by the underperforming arrays now get to breath freely. They are given the ability to pass storage I/O as quickly as the processor needs. In other words, the applications that need the CPU cycles get to dictate your storage requirements, rather than your storage imposing artificial limits on your CPU.

With that idea in mind, a few things revealed themselves during the process of implementing PernixData FVP. Early on, it was all about implementing and understanding the solution. However, once the real world workloads began accelerating, there was intrigue on the analytics that FVP was providing. What was generating the I/O that was being accelerated? What processes were associated with the other traffic not being accelerated, and why? What applications were behind the changing I/O sizes? And what was causing the peculiar I/O patterns that were showing up? Some of these were questions raised at an earlier time (see: Hunting down unnecessary I/O before you buy that next storage solution ). The trouble was, the tools I had to discover the pattern of data I/O were limited.

Why is this so important? In the spirit of reminding ourselves that no resource is an island, here is an example of a production code compile run, as looking from the perspective of the guest CPU. The first screen capture is the code with adequate storage I/O to support the application’s needs. A full build and is running nearly perfect CPU utilization of all 8 of it’s vCPUs. (screen shots taken from my earlier post; Vroom! Scaling up Virtual Machines in vSphere to meet performance requirements-Part 2)

Below is that very same code compile, under stressed backend storage. It took 46% longer to complete, and as you can see, changes the CPU utilization of the build run.

The primary goal for this environment was to accelerate the storage. However, it would have been a bit presumptuous to conclude that all existing storage traffic is good, useful I/O. There is a significant amount of traffic originating from outside of IT, and the I/O generated needed to be understood better. With the traffic passing more freely thanks to FVP acceleration, patterns that previously could not expose themselves should be more visible. This was the basis for the first discovery

A little “CSI” work on the IOPS
Many continuous build systems use some variation of a polling mechanism to understand when there is new checked in code that needs to be compiled. This should be a very light weight process. However, once storage performance was allowed to breath better, the following patterns started showing up on all of the build VMs.

The image below shows the IOPS for one build VM during a one hour period of no compiling for that particular VM. The VM’s were polling for new builds every 5 minutes. Yep, that “build heartbeat” was as high as 450 IOPS on each VM.

Why wasn’t this noticed before? These spikes were being suppressed by my previously overtaxed storage, which made them more difficult to see. These were all writes, and were translating into 500 to 600 steady state IOPS just to sit idle (as seen below from the perspective of the backing storage)

So what was the cause? As it turned out, the polling mechanism was using some source code control (SVN) calls to help the build machines understand if it needed to execute a build. Programmatically, the Development Team has no idea that the script that they develop is going to be efficient, or not efficient. They are separated by that layer of the infrastructure. (Sadly, I have a feeling this happens more often than not in general Application Development). This resulted in a horribly inefficient method. After helping them understand the matter, it was revamped, and now polling for each VM only takes 1 to 2 IOPS every 5 minutes.

The image below shows how the accelerated cluster of 30 build VMs looks when there are no builds running.

The inefficient polling mechanism wasn’t the only thing found. A few of the Linux build VMs had a rouge “Beagle” search daemon running on them. This crawler did just that, indexing data on these Linux machines, and creating unnecessary I/O. With Windows, Indexers and other CPU and I/O hogs are typically controlled quite easily by GPO, but the equivalent services can creep into Linux systems if not careful. It was an easy fix at least.

The cumulative benefit
Prior to the efforts of accelerating the storage, and looking to make it more efficient, the utilization of the arrays looked as the image shows. (6 hour period, from the perspective of the arrays)

Now, with the combination of understanding my workload better, and acceleration through FVP, that same workload looks like this (6 hour period, from the perspective of the arrays):

Notice that the estimated workload is far under the 100% it was regularly pegged at for 24 hours a day, 6 days a week. In fact, during the workday, the arrays might only peak at 50% to 60% utilization. When no builds are running, the continuous build system may only be drawing 25 IOPS from the VMFS volumes that contain the build machines, which is much more reasonable than where it was at.

With the combination of less pressure on the backing physical storage, and the magic of pooled flash on the hosts, the applications and CPU get to dictate how much storage I/O is needed. Below is a screen capture of IOPS on a production build VM while compiling was being performed. It was not known up until this point that a single build VM needed as much as 4,000 IOPS to compile code because the physical storage was never capable of satisfying that type of need.

Conclusion
Could some of these discoveries have been made without FVP? Yes, perhaps some of it. But good analysis comes from being able to interpret data in a consumable way. Its why various methods of data visualization such as bar graphs, pie charts, and X-Y-Z plots exist. FVP certainly has been doing a good job of accelerating workloads, but it is also helps the administrator understand the I/O better. I look forward to seeing how the analytics might expand in future tools or releases from PernixData.

A friend once said to me that the only thing better than a new tractor is a reason to use it. In many ways, the same thing goes for technology. Virtualization might not even be that fascinating unless you had real workloads to run on top of it. Ditto for for PernixData FVP. When applied to real workloads, the magic begins to happen, and you learn a lot about your data in the process.

Observations of PernixData FVP in a production environment

Since my last post, "Accelerating storage using PernixData’s FVP. A perspective from customer #0001" I’ve had a number of people ask me questions on what type of improvements I’ve seen with FVP. Well, let’s take a look at how it is performing.

The cluster I’ve applied FVP to is dedicated for the purpose of compiling code. Over two dozen 8 vCPU Linux and Windows VM’s churning out code 24 hours a day. It is probably one of the more challenging environments to improve, as accelerating code compiling is inherently a very difficult task. Massive amounts of CPU using a highly efficient, multithreaded compiler, a ton of writes, and throw in some bursts of reads for good measure. All of this occurs in various order depending on the job. Sounds like fun, huh.

Our full builds benefited the most by our investment in additional CPU power earlier in the year. This is because full compiles are almost always CPU bound. But incremental builds are much more challenging to improve because of the dialog that occurs between CPU and disk. The compiler is doing all sorts of checking, throughout the compile. Some of the phases of an incremental build are not multithreaded, so while a full build offers nearly perfect multithreading on these 8 vCPU build VMs, this just isn’t the case on an incremental build.

Enter FVP
The screen shots below will step you through how FVP is improving these very difficult to accelerate incremental builds. They will be broken down into the categories that FVP divides them into; IOPS, Latency, and Throughput. Also included will be a CPU utilization metric, because they all have an indelible tie to each other. Some of the screen shots are from the same compile run, while others are not. The point here it to show how it is accelerating, and more importantly how to interpret the data. The VM being used here is our standard 8 vCPU Windows VM with 8GB of RAM. It has write-back caching enabled, with a write redundancy setting of "Local flash and 1 network flash device."

Click on each image to see a larger version

IOPS
Below is an incremental compile on a build VM during the middle of the day. The magenta line is showing what is being satisfied by the backing data store, and the blue line shows the Total Effective IOPS after flash is leveraged. The key to remember on this view is that it does not distinguish between reads and writes. If you are doing a lot of "cold reads" the magenta "data store" line and blue "Total effective" line may very well overlap.

This is the same metric, but toggled to the read/write view. In this case, you can see below that a significant amount of acceleration came from reads (orange). For as much writing as a build run takes, I never knew a single build VM could use 1,600 IOPS or more of reads, because my backing storage could never satisfy the request.

CPU
Allowing the CPU to pass the I/O as quickly as it needs to does one thing, it allows the multithreaded compiler to maximize CPU usage. During a full compile, it is quite easy to max out an 8 vCPU system and have a sustained 100% CPU usage, but again, these incremental compiles were much more challenging. What you see below is the CPU utilization associated with the VM running the build. It is a significant improvement of an incremental build by using acceleration. A non accelerated build would rarely get above 60% CPU utilization.

Latency
At a distance, this screen grab probably looks like a total mess, but it has really great data behind it. Why? The need for high IOPS is dictated by the VMs demanding it. If it doesn’t demand it, you won’t see it. But where acceleration comes in more often is reduced latency, on both reads and writes. The most important line here is the blue line, which represents the total effective latency.

Just as with other metrics, the latency reading can often times be a bit misleading with the default "Flash/Datastore" view. This view does not distinguish between reads and writes, so a cold read pulling off of spinning disk will have traditional amounts of latency you are familiar with. This can skew your interpretation of the numbers in the default view. For all measurements (IOPS, Throughput, Latency) I often find myself toggling between this view, and the read/write view. Here you can see how a cold read sticks out like a sore thumb. The read/write view is where you would go to understand individual read and write latencies.

Throughput
While a throughput chart can often look very similar to the IOPS chart, you might want to spend a moment and dig a little deeper. You might find some interesting things about your workload. Here, you can see the total effective throughput significantly improved by caching.

Just as with the other metrics, toggling it into read/write view will help you better understand your reads and writes.

The IOPS, Throughput & Latency relationship
It is easy to overlook the relationship that IOPS, throughput, and latency have to each other. Let me provide an real world example of how one can influence the other. The following represents the early, and middle phases of a code compile run. This is the FVP "read/write" view of this one VM. Green indicates writes. Orange indicates reads. Blue indicates "Total Effective" (often hidden by the other lines).

First, IOPS (green). High write IOPS at the beginning, yet relatively low write IOPS later on.

Now, look at write throughput (green) below for that same time period of the build. A modest amount of throughput at the beginning where the higher IOPS were at, then followed by much higher throughput later on when IOPS had been low. This is the indicator of changing I/O sizes from the applications generating the data.

Now look at write latency (green) below. Extremely low latency (sub 1ms) with smaller I/O sizes. Higher latency on the much larger I/O sizes later on. By the way, the high read latencies generally come from cold reads that were served from the backing spindles.

The findings here show that early on in the workflow where SVN is doing a lot of it’s prep work, a 32KB I/O size for writes is typically used. The write IOPS are high, Throughput is modest, and latency comes in at sub 1ms. Later on in the run, the compiler itself uses much larger I/O sizes (128KB to 256KB). IOPS are lower, but throughput is very high. Latency suffers (approaching 8ms) with the significantly larger I/O sizes. There are other factors influencing this, to which I will address in an upcoming post.

This is one of the methods to determine your typical I/O size to provide a more accurate test configuration for Iometer, if you choose to do additional benchmarking. (See: Iometer. As good as you want to make it.)

Other observations

1. After you have deployed an FVP cluster into production, your SAN array monitoring tool will most likely show you an increase in your write percentage compared to your historical numbers . This is quite logical when you think about it.. All writes, even when accelerated, eventually make it to the data store (albeit in a much more efficient way). Many of your reads may be satisfied by FVP, and never hit the array.

2. When looking at a summary of the FVP at the cluster level, I find it helpful to click on the "Performance Map" view. This gives me a weighted view of how to distinguish what is being accelerated most during the given sampling period.

3. In addition to the GUI, controlling the VM write caching settings can easily managed via PowerShell. This might be a good step to take if the cluster tripped over to UPS power. Backup infrastructures that do not have a VADP capable proxy living in the accelerated cluster might also need to rely on some PowerShell scripts. PernixData has some good documentation on the matter.

Conclusion
PernixData FVP is doing a very good job of accelerating a verify difficult workload. I would have loved to show you data from accelerating a more typical workload such as Exchange or SQL, but my other cluster containing these systems is not accelerated at this time. Stay tuned for the next installment, as I will show you what was discovered as I started looking at my workload more closely.

– Pete

Accelerating storage using PernixData’s FVP. A perspective from customer #0001

Recently, I described in "Hunting down unnecessary I/O before you buy that next storage solution" the efforts around addressing "technical debt" that was contributing to unnecessary I/O. The goal was to get better performance out of my storage infrastructure. It’s been a worthwhile endeavor that I would recommend to anyone, but at the end of the day, one might still need faster storage. That usually means, free up another 3U of rack space, and open checkbook

Or does it? Do I have to go the traditional route of adding more spindles, or investing heavily in a faster storage fabric? Well, the answer was an unequivocal "yes" not too long ago, but times are a changing, and here is my way to tackle the problem in a radically different way.

I’ve chosen to delay any purchases of an additional storage array, or the infrastructure backing it, and opted to go PernixData FVP. In fact, I was customer #0001 after PernixData announced GA of FVP 1.0. So why did I go this route?

1. Clustered host based caching. Leveraging server side flash brings compute and data closer together, but thanks to FVP, it does so in such a way that works in a highly available clustered fashion that aligns perfectly with the feature sets of the hypervisor.

2. Write-back caching. The ability to deliver writes to flash is really important. Write-through caching, which waits for the acknowledgement from the underlying storage, just wasn’t good enough for my environment. Rotational latencies, as well as physical transport latencies would still be there on over 80% of all of my traffic. I needed true write-back caching that would acknowledge the write immediately, while eventually de-staging it down to the underlying storage.

3. Cost. The gold plated dominos of upgrading storage is not fun for anyone on the paying side of the equation. Going with PernixData FVP was going to address my needs for a fraction of the cost of a traditional solution.

4. It allows for a significant decoupling of "storage for capacity" versus "storage for performance" dilemma when addressing additional storage needs.

5. Another array would have been to a certain degree, more of the same. Incremental improvement, with less than enthusiastic results considering the amount invested. I found myself not very excited to purchase another array. With so much volatility in the storage market, it almost seemed like an antiquated solution.

6. Quick to implement. FVP installation consists of installing a VIB via Update Manager or the command line, installing the Management services and vCenter plugin, and you are off to the races.

7. Hardware independent. I didn’t have to wait for a special controller upgrade, firmware update, or wonder if my hardware would work with it. (a common problem with storage array solutions). Nor did I have to make a decision to perhaps go with a different storage vendor if I wanted to try a new technology. It is purely a software solution with the flexibility of working with multiple types of flash; SSDs, or PCIe based.

A different way to solve a classic problem
While my write intensive workload is pretty unique, my situation is not. Our storage performance needs outgrew what the environment was designed for; capacity at a reasonable cost. This is an all too common problem. With the increased capacities of spinning disks, it has actually made this problem worse, not better. Fewer and fewer spindles are serving up more and more data.

My goal was to deliver the results our build VMs were capable of delivering with faster storage, but unable to because of my existing infrastructure. For me it was about reducing I/O contention to allow the build system CPU cycles to deliver the builds without waiting on storage. For others it might delivering lower latencies to their SQL backed ERP or CRM servers.

The allure of utilizing flash has been an intriguing one. I often found myself looking at my vSphere hosts and all of it’s processing goodness, but disappointed those SSD sitting in the hosts couldn’t help to augment my storage performance needs. Being an active participant in the PernixData beta program allowed me to see how it would help me in my environment, and if it would deliver the needs of the business.

Lessons learned so far
Don’t skimp on quality SSDs. Would you buy an ESXi host with one physical core? Of course you wouldn’t. Same thing goes with SSDs. Quality flash is a must! I can tell you from first hand experience that it makes a huge difference. I thought the Dell OEM SSDs that came with my M620 blades were fine, but by way of comparison, they were terrible. Don’t cripple a solution by going with cheap flash. In this 4 node cluster, I went with 4 EMLC based, 400GB Intel S3700s. I also had the opportunity to test some Micron P400M EMLC SSDs, which also seemed to perform very well.

While I went with 400GB SSDs in each host (giving approximately 1.5TB of cache space for a 4 node cluster), I did most of my testing using 100GB SSDs. They seemed adequate in that they were not showing a significant amount of cache eviction, but I wanted to leverage my purchasing opportunity to get larger drives. Knowing the best size can be a bit of a mystery until you get things in place, but having a larger cache size allows for a larger working set of data available for future reads, as well as giving head room for the per-VM write-back redundancy setting available.

An unexpected surprise is how FVP has given me visibility into the one area of I/O monitoring that is traditional very difficult to see; I/O patterns. See Iometer. As good as you want to make it. Understanding this element of your I/O needs is critical, and the analytics in FVP has helped me discover some very interesting things about my I/O patterns that I will surely be investigating in the near future.

In the read-caching world, the saying goes that the fastest storage I/O is the I/O the array never will see. Well, with write caching, it eventually needs to be de-staged to the array. While FVP will improve delivery of storage to the array by absorbing the I/O spikes and turning random writes to sequential writes, the I/O will still eventually have to be delivered to the backend storage. In a more write intensive environment, if the delta between your fast flash and your slow storage is significant, and your duty cycle of your applications driving the I/O is also significant, there is a chance it might not be able to keep up. It might be a corner case, but it is possible.

What’s next
I’ll be posting more specifics on how running PernixData FVP has helped our environment. So, is it really "disruptive" technology? Time will ultimately tell. But I chose to not purchase an array along with new SAN switchgear because of it. Using FVP has lead to less traffic on my arrays, with higher throughput and lower read and write latencies for my VMs. Yeah, I would qualify that as disruptive.

Helpful Links

Frank Denneman – Basic elements of the flash virtualization platform – Part 1
http://frankdenneman.nl/2013/06/18/basic-elements-of-the-flash-virtualization-platform-part-1/

Frank Denneman – Basic elements of the flash virtualization platform – Part 2
http://frankdenneman.nl/2013/07/02/basic-elements-of-fvp-part-2-using-own-platform-versus-in-place-file-system/

Frank Denneman – FVP Remote Flash Access
http://frankdenneman.nl/2013/08/07/fvp-remote-flash-access/

Frank Dennaman – Design considerations for the host local FVP architecture
http://frankdenneman.nl/2013/08/16/design-considerations-for-the-host-local-architecture/

Satyam Vaghani introducing PernixData FVP at Storage Field Day 3
http://www.pernixdata.com/SFD3/

Write-back deepdive by Frank and Satyam
http://www.pernixdata.com/files/wb-deepdive.html