Testing InfiniBand in the home lab with PernixData FVP

One of the reasons I find the latest trends in datacenter architectures so interesting is the innovative approaches used to address deficiencies associated with more traditional arrangements. These innovations have been able to drive more of what almost everyone needs: better storage performance and better scalability.

The caveat to some of these newer arrangements is that they can put heavy stress on the plumbing that connects these servers. Distributed storage technologies like VMware VSAN, or clustered write buffering techniques used by PernixData FVP and Atlantis Computing’s USX leverage these interconnects to accelerate storage traffic. Turn-key Hyperconverged solutions do too, but they enjoy the luxury of having full control over the hardware used. Some of these software based solutions might need some retrofitting of an environment to run optimally or meet their requirements (read: 10GbE or better). The desire for the fastest interconnect possible between hosts doesn’t always align with budget or technical constraints, so it makes the most sense to first see what impact there really is.

I wanted to test the impact better bandwidth would have between servers a bit more, but due to constraints in my production environment, I needed to rely on my home lab. As much as I wanted to throw 10GbE NICs in my home lab, the price points were too high. I had to do it another way. Enter InfiniBand. I’m certainly not the only one to try InfiniBand in a home lab, but I wanted to focus on two elements that are critical to the effectiveness of replica traffic: the overall bandwidth of the pipe and, equally important, the latency. While I couldn’t simulate the exact workload that I see in my production environment, I could certainly take smaller snippets of the I/O patterns that I see, and model them as best I can.

InfiniBand is really interesting. As Joeb Jackson put it in a NetworkWorld.com article, "InfiniBand is architecturally sacrilegious" as it combines many layers of the OSI model. The results can be stunning. Transport latencies in the 2 microsecond neighborhood, and a healthy roadmap to 200Gbps and beyond. It’s sort of like the ’66 AC Shelby Cobra of data transports. Simple, and perhaps a little rough around the edges, but brutally fast. While it doesn’t have the ubiquity of RJ/Ethernet, it also doesn’t have the latencies that are still a part of those faster forms of Ethernet.

At the time of this writing, the InfiniBand drivers for ESXi 5.5 weren’t quite ready for VSAN testing yet, so the focus of this testing is to see how InfiniBand behaves when used in a PernixData FVP deployment. I hope to publish a VSAN edition in the future. I simply wanted to better understand if (and how much) a faster connection would improve the transmission of replica traffic when using FVP in WB+1 mode (local flash, and 1 peer). My production environment is very write intensive, and uses 1GbE for the interconnects. Any insight gained here will help in my design and purchasing roadmap for my production environment.

Testing:
Testing occurred on a two host cluster backed by a Synology DS1512+. Local flash leveraged SATA III based EMLC SSD drives using an onboard controller. 1GbE interconnects traversed a Cisco SG300-20 using a 1500 byte MTU size. For InfiniBand, each host used a Mellanox MT25418 DDR 2 port HCA that offered 10Gb per connection. They were directly connected to each other, and used a 2044 byte MTU size. InfiniBand can be set to 4092 bytes but for compatibility reasons under ESXi 5.5, 2044 is the desired size.

I tend to prefer testing that relies on observational patterns versus one final, empirical number. These tests were no different, and while they attempt to simulate a very brief snippet of a workload in my production environment, I find that I still gain a much better understanding from a time based performance graph than from an isolated final number.

The test case was a simple one, but would be enough to illustrate the differences I was hoping to see. The test consisted of a 2vCPU VM using 2 workers on a 100% write, 100% random workload lasting for 1 minute. The test was run in three configurations: first with WB+0 (no peer/replica traffic), then WB+1 (one peer) using a 1GbE connection, and finally WB+1 over a single 10Gb InfiniBand connection. Each screen capture I provide will show them in that order. That set of three runs was repeated for three I/O sizes: first 256KB, followed by 32KB, then 4KB. I ran the tests several times in different order to ensure I wasn’t introducing inflated or deflated performance due to previous tests or caching. All were repeated several times to flush out any anomalies.
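The test plan described above can be written down as a small matrix of cases. This is only a sketch of the bookkeeping, not the tooling used for the tests; the dictionary keys and the function name are hypothetical, while the values come from the description above.

```python
from itertools import product

# Values taken from the test description; the names are illustrative only.
IO_SIZES_KB = [256, 32, 4]
MODES = [
    "WB+0 (local flash only)",
    "WB+1 over 1GbE",
    "WB+1 over 10Gb InfiniBand",
]


def build_test_matrix():
    """Return one entry per (I/O size, mode) pair, in the order the captures are shown."""
    return [
        {
            "io_size_kb": size_kb,
            "mode": mode,
            "write_pct": 100,    # 100% write
            "random_pct": 100,   # 100% random
            "workers": 2,
            "duration_s": 60,    # 1 minute per run
        }
        for size_kb, mode in product(IO_SIZES_KB, MODES)
    ]


matrix = build_test_matrix()
for case in matrix:
    print(f"{case['io_size_kb']:>3}KB  {case['mode']}")
```

Three I/O sizes across three configurations gives nine runs per pass, which is why repeating the passes in a different order is a cheap way to rule out caching effects.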

256KB I/O size test
Test results using this I/O size are rarely published anywhere because they never compare well against a smaller I/O size like 4KB. But my production workloads (compiling) often deal with these I/O sizes, so it is important for me to understand their behavior.

IOPS with 256KB I/O
256KB-IOPS

Latency with 256KB I/O
256KB-Latency

Throughput using 256KB I/O
256KB-Throughput

Observations from 256KB I/O test
Note that the IOPS and effective throughput on the WB+1 using InfiniBand were nearly identical to those of a WB+0 (local flash only) scenario. You can also see how much the 1GbE interface throttled down the performance, driving just half of the IOPS and throughput compared to InfiniBand. But also take a look at the terrible native latency (70ms) of large I/O sizes even when using WB+0 (no peer traffic, just local flash). Also note that when peer traffic performance is improved, a larger backlog of data builds up in the destager.

32KB I/O size test
Just 1/8th the size of a 256KB I/O, this is still larger than most storage vendors like to advertise in their testing. My production workload often oscillates between 32KB and 256KB I/Os.

IOPS with 32KB I/O
32KB-IOPS

Latency with 32KB I/O
32KB-Latency

Throughput using 32KB I/O
32KB-Throughput

Observations from 32KB I/O test
Once again, the IOPS and effective throughput on the WB+1 using InfiniBand were nearly identical to those of a WB+0 (local flash only) scenario. You can also see how much the 1GbE interface throttled down the performance on throughput. Latency had only a minor improvement moving away from 1GbE, as the latency of the flash was about 6ms.

4KB I/O size test
This is the most common I/O size that you might see, although it is more common on reads than writes. At 1/64th the size of a 256KB I/O, it is tiny compared to the others, but it is important to test in order to learn if, and how much, a fatter, lower latency pipe helps across various I/O sizes.

IOPS with 4KB I/O
4KB-IOPS

Latency with 4KB I/O
4KB-Latency

Throughput using 4KB I/O
4KB-Throughput

Observations from 4KB I/O test
IOPS and effective throughput on the WB+1 using InfiniBand were nearly identical to those of a WB+0 (local flash only) scenario. But as the I/O sizes shrink, so does the effective total/concurrent payload size. So the differences between InfiniBand and 1GbE were smaller than on tests with larger I/O. Latencies of this I/O size were around 2ms.
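The shrinking payload is easy to see with a little arithmetic, since throughput is simply IOPS multiplied by I/O size. The sketch below uses a made-up IOPS figure purely for illustration; it is not a measured result from these tests.

```python
def throughput_mbps(iops: float, io_size_kb: int) -> float:
    """Throughput in MB/s for a given IOPS rate and I/O size (1 MB = 1024 KB)."""
    return iops * io_size_kb / 1024


# Hypothetical, constant IOPS rate so that only the I/O size varies.
HYPOTHETICAL_IOPS = 1000

for size_kb in (256, 32, 4):
    mbps = throughput_mbps(HYPOTHETICAL_IOPS, size_kb)
    print(f"{size_kb:>3}KB at {HYPOTHETICAL_IOPS} IOPS -> {mbps:.1f} MB/s")
```

At the same IOPS rate, a 256KB workload pushes 64 times the data of a 4KB workload, so a single small-block workload is far less likely to fill a 1GbE pipe.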

Other observations that stood out
One of the first things that stood out is illustrated below, with two 5 minute test runs. Look at where the two arrows point. The arrow on the left points to the number of packets sent while using 1GbE. The arrow on the right shows the number of packets sent while using 10Gb InfiniBand. Quite a difference. Also notice that the effective throughput started out higher, but had to throttle back.

Packetstransmitted

Findings:
The key takeaways from these tests:

    • A high bandwidth, low latency interconnect like InfiniBand can virtually eliminate any write redundancy penalty incurred in WB+1 mode.
    • From a single workload, I/O sizes of 32KB and 256KB saw between 65% and 90% improvement on IOPS and throughput. I/O sizes of 4KB saw essentially no improvement (although many concurrent 4KB workloads likely would see a benefit).
    • Writes using larger I/O sizes were the clear beneficiary of a fatter pipe between servers. However, the native latencies of the flash devices under larger I/O sizes could not take advantage of the low latencies of InfiniBand. In other words, with large I/O sizes, the flash devices themselves, or the bus they were using, were by far the major impediment to lower latency and faster I/O delivery.
    • The smaller pipe of 1GbE meant the flash device could not ingest data as fast as it did with InfiniBand. With 1GbE there was always a smaller amount of outstanding writes once the test was complete, but it came at the cost of poorer performance.
    A few other matters can come up when attempting to accurately interpret latencies. As VMware KB 2036863 points out, reporting of latencies accurately can sometimes be a challenge. Just something to be aware of.

Conclusion
InfiniBand was my affordable way to test how a faster interconnect would improve the abilities of FVP to accelerate replica storage I/O.  It lived up to the promise of high bandwidth with low latency. However, effective latencies were ultimately crippled by the SSDs, the controller, or the bus it was using. I did not have the opportunity to test other flash technologies such as PCIe based solutions from Fusion-IO or Virident, or the memory channel based solution from Diablo Technologies. But based on the above, it seems to be clear that how the flash is able to ingest the data is crucial to the overall performance of whatever solution that is using it.

Helpful Links
Erik Bussink’s great post on using InfiniBand with vSphere 5.5
http://www.bussink.ch/?p=1306 

Vladan Seget’s post on incorporating InfiniBand into his backing storage
http://www.vladan.fr/homelab-storage-network-speedup/

Mellanox, OFED and OpenSM bundles
https://my.vmware.com/web/vmware/details/dt_esxi50_mellanox_connectx/dHRAYnRqdEBiZHAlZA==
http://www.mellanox.com/downloads/Drivers/MLNX-OFED-ESX-1.8.1.0.zip
http://files.hypervisor.fr/zip/ib-opensm-3.3.16-64.x86_64.vib

Practical tips for a Veeam Backup and Recovery deployment

I’ve been using Veeam Backup and Recovery in my production environment for a while now, and in hindsight, it was one of the best investments we’ve ever made in our IT infrastructure. It has completely changed the operational overhead of protecting our VMs, and the data they serve up. Using a data protection solution that utilizes VMware’s APIs provides the simplicity and flexibility that was always desired. Moving away from array based features for protection has enabled the protection of VMs to better reflect desired RPO and RTO requirements – not by the limitations imposed by LUN sizes, array capacity, or functionality.

While Veeam is extremely simple in many respects, it is also a versatile, feature packed application that can be configured a variety of different ways. The versatility and the features can be a little confusing to the new user, so I wanted to share 25 tips that will help make for a quick and successful deployment of Veeam Backup and Recovery in your environment.

First, let’s go over a few assumptions that will be the basis for my recommendations:

  • There are two sites that need protection.
  • VMs and data need to be protected at each site, locally.
  • VMs and data need to be protected at each site, remotely.
  • A NAS target exists at each site.
  • Quick deployment is important.
  • You’ve already read all of the documentation. ;)

    Architecture
    There are a number of different ways to set up the architecture for Veeam. I will show a few of the simplest arrangements:

    In this arrangement below there would be no physical servers – only a NAS device. This is a simplified arrangement of what I use. If one wanted a rebuilt server (Windows or Linux) acting purely as a storage target, that could be in place of where you see the NAS. The architecture would stay the same.

    image

    Optionally, a physical server not just acting as a storage target, but also as a physical proxy would look something like this below:

    image

    Below is a combination of both, where a physical server is acting as the Proxy, but like the virtual proxy, is using an SMB share to house the data. In this case, a NAS unit.

     

    image

    Implementation tips
    These tips focus not so much on what may ultimately suit your environment best (only you know that) or on leveraging all of the features inside the product, but rather on getting you up and running as quickly as possible so you can start returning great results.

    Job Manager Servers & Proxies

    1.  Have the Job Manager server, any proxies, and the backup targets living on their own VLAN for a dedicated backup network.

    2.  Set up SNMP monitoring on any physical ports used in the backup arrangement.  It will be helpful to understand how utilized the physical links get, and for how long.

    3.  Make sure to give the Job Manager VM enough resources to play with – especially if it will have any data mover/proxy responsibilities.  The deployment documentation has good information on this, but for starters, make it 4vCPU with 5GB of RAM.

    4.  If there is more than one cluster to protect, consider building a virtual proxy inside each cluster that it will be responsible for protecting, then assign it to jobs that protect VMs in that cluster.  In my case, I use PernixData FVP in two clusters.  I have the datastores that house those VMs only accessible by their own cluster (a constraint of FVP).  Because of that, I have a virtual proxy living in each cluster, with backup jobs configured so that each uses a specific virtual proxy.  These virtual proxies have a special setting in FVP that will instruct the VMs being backed up to flush their write cache to the backing storage.

    image

    Storage and Design

    5.  Keep the design simple, even if you know you will need to adjust at a later time.  Architectural adjustments are easy to do with Veeam, so  go ahead and get Veeam pointed to the target, and start running some jobs.  Use this time to get familiar with the product, and begin protecting the jewels as quickly as possible.

    6.  Let Veeam use the default SQL Server Express instance on the Veeam Job Manager VM.  This is a very reasonable, and simple configuration that should be adequate for a lot of environments.

    7.  Question whether a physical proxy is needed.  Typically physical proxies are used for one of three reasons.  1.)  They offload job processing CPU cycles from your cluster.  2.)  In simple arrangements a Windows based Physical proxy might also be the Repository (aka storage target).   3.) They allow for one to leverage a "direct-from-SAN" feature by plugging in the system to your SAN fabric.  The last one in my opinion introduces the most hesitation.  Here is why:

    • Some storage arrays do not have a "read-only" iSCSI connection type.  When this is the case, special care needs to be taken on the physical server directly attached to the SAN to ensure that it cannot initialize the data store.  The reality is that you are one mistake away from having a very long day in front of you.  I do not like this option when there is no secondary safety mechanism from the array on a "read-only" connection type.
    • Direct-from-SAN access can be a very good method for moving data to your target.  So good that it may stress your backing storage enough (via link saturation or physical disk limits) to perhaps interfere with your production I/O requirements.
    • Additional efforts must be taken when using write buffering mechanisms that do not live on the storage array (e.g. PernixData) .

    8.  Veeam has the ability to back up to an SMB share, or an NFS mount.  If an NFS mount is chosen, make sure that it is a storage target running native Linux.  Most NAS units like a Synology are indeed just a tweaked version of Linux, and it would be easy to conclude that one should just use NFS.  However, in this case, you may run into two problems.

    • The SMB connection to a NAS unit will likely be faster (which most certainly is the first time in history that an SMB connection is faster than an NFS connection) .
    • The Job Manager might not be able to manage the jobs on that NAS unit (connected via NFS) properly.  This is due to BusyBox and Perl on the Synology not really liking each other.  For me, this resulted in Veeam being unable to remove sunsetting backups.  Changing over to an SMB connection on the NAS improved the performance significantly, and allowed for job handling to work as desired.

    9.  Veeam has a great new feature in version 7.x called a "Backup Copy" job, which allows the backup made locally to be shipped to a remote site.  The "Backup Copy" job achieves one of the most basic requirements of data protection in the simplest of ways: two copies of the data at two different locations, but with the benefit of only processing the backup job once.  Although it is a great feature, it behaves differently from a regular backup job, and warrants some time spent with it before putting it into production.  For a speedy deployment, it might be best simply to configure two jobs: one to a local target, and one to a remote target.  This will give you the time to experiment with the Backup Copy job feature.

    10.  There are compelling reasons for and against using a rebuilt server as a storage target, or using a NAS unit.  Both are attractive options.  I ended up using a dedicated NAS unit.  Its form factor, drive bay count, and overall cost of provisioning made it the only option that could match my requirements.

    Operations

    11.  In Veeam B&R, "Replication Jobs" are different than "Backup Jobs."  Instead of trying to figure out all of the nuances of both right away, use just the "Backup Job" function with both local and remote targets.  This will give you time to better understand the characteristics of the replication functionality. One also might find that the "Backup Job" suits the environment and need better than the replication option.

    12.  If there are daily backups going to both local and offsite targets (and you are not using the "Backup Copy" option), have them run 12 hours apart from one another to reduce RPOs.

    13.  Build up a test VM to do your testing of a backup and restore.  Restore it in the many ways that Veeam has to offer.  Best to understand this now rather than when you really need to.

    14.  I like the job chaining/dependency feature, which allows you to chain multiple jobs together.  But remember that if a job is manually started, it will run through the rest of the jobs too.  The easiest way to accommodate this is to temporarily remove it from the job chain.

    15.  Your "Backup Repository" is just that, a repository for data.  It can be a Windows Server, a Linux Server, or an SMB share.  If you don’t have a NAS unit, stuff an old server (Windows or Linux) with some drives in it and it will work quite well for you.

    16.  Devise a simple, clear job naming scheme.  Something like [BackupType]-[Descriptive Name]-[TargetLocation] will quickly tell you what it is and where it is going to.  If you use folders in vCenter to organize your VMs, and your backups reflect the same, you could also  choose to use the folder name.  An example would be "Backup-SharePointFarm-LOCAL" which quickly and accurately describes the job.

    17.  Start with a simple schedule.  Say, once per day, then watch the daily backup jobs and the synthetic fulls to see what sort of RPO/RTOs are realistic.

    18.  Repository naming.  Be descriptive, but come up with some naming scheme that remains clear even if you aren’t in the application for several weeks.  I like indicating the location of the repository, if it is intended for local jobs, or remote jobs, and what kind of repository it is (Windows, Linux, or SMB).  For example:  VeeamRepo-[LOCATION]-for-Local(SMB)

    19.  Repository organization.  Create a good tree structure for organization and scalability.  Veeam will do a very good job at handling the organization of the backups once you assign a specific location (share name) on a repository.  However, create a structure that provides the ability to continue with the same naming convention as your needs evolve.  For instance, a logical share name assigned to a repository might be \\nas01\backups\veeam\local\cluster1  This arrangement allows for different types of backups to live in different branches.

    20.  Veeam might prevent you from creating more than one repository going to the same share name (it would see \\nas01\backups\veeam\local\cluster1 and \\nas01\backups\veeam\local\cluster2 as the same).  Create DNS aliases to fool it, then make those two targets something like \\nascluster1\backups\veeam\local\cluster1  and  \\nascluster2\backups\veeam\local\cluster2

    21.  When in doubt, leave the defaults.  Veeam put in great efforts to make sure that you, or the software doesn’t trip over itself.  Uncertain of job number concurrency?  Stick to the default.  Wondering about which backup mode to use? (Reverse Incremental versus Incrementals with synthetic fulls). Stay with the defaults, and save the experimentation for later.

    22.  Don’t overcomplicate the schedule (at least initially).  Veeam might give you flexibility that you never had with array based protection tools, but at the same time, there is no need to make it complicated.  Perhaps group the VMs by something that you can keep track of, such as the folders they are contained in within vCenter.

    23.  Each backup job can be tuned for the kind of target it writes to by choosing a preset storage optimization type: WAN target, LAN target, or local target.  This can easily be overlooked, but will make a difference in backup performance.

    24.  How many backups you can keep is a function of change rate, frequency, dedupe and compression, and the size of your target.  Yep, that is a lot of variables.  If nothing else, find some storage that can serve as the target for say, 2 weeks.  That should give a pretty good sampling of all of the above.

    25.  Take one item/feature once a week, and spend an hour or two looking into it.  This will allow you to find out more about say, Changed block tracking, or what the application aware image processing feature does.  Your reputation (and perhaps, your job) may rely on your ability to recover systems and data.  Come up with a handful of scenarios and see if they work.
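Two of the tips above lend themselves to a quick sketch: the naming schemes from tips 16 and 18, and the retention question from tip 24. The helper names below are hypothetical and are not part of Veeam. The retention estimate assumes a very simplified model (one full backup plus one incremental per day, with a single combined dedupe/compression ratio), so treat it as a rough starting point and let a couple of weeks of real jobs supply the actual numbers.

```python
def job_name(backup_type: str, description: str, target: str) -> str:
    """Build a job name in the [BackupType]-[Descriptive Name]-[TargetLocation] form."""
    return f"{backup_type}-{description}-{target}"


def repository_name(location: str, scope: str, kind: str) -> str:
    """Build a repository name such as VeeamRepo-[LOCATION]-for-Local(SMB)."""
    return f"VeeamRepo-{location}-for-{scope}({kind})"


def estimate_restore_points(target_gb, protected_gb, daily_change_pct, reduction_ratio):
    """Rough count of restore points that fit on a target.

    Assumed model (not Veeam's): one full backup plus one incremental per day,
    each shrunk by the same dedupe/compression ratio.
    """
    full_gb = protected_gb / reduction_ratio
    incremental_gb = protected_gb * daily_change_pct / 100 / reduction_ratio
    if incremental_gb <= 0:
        raise ValueError("daily_change_pct and protected_gb must be positive")
    if full_gb > target_gb:
        return 0  # not even the full backup fits
    return 1 + int((target_gb - full_gb) // incremental_gb)


print(job_name("Backup", "SharePointFarm", "LOCAL"))
# "SEA" is a made-up site code for illustration.
print(repository_name("SEA", "Local", "SMB"))
# Made-up sizing inputs: 4TB target, 2TB protected, 5% daily change, 2:1 reduction.
print(estimate_restore_points(4000, 2000, 5, 2))
```

Keeping the naming logic in one place makes it harder for a one-off job to drift from the convention, and the estimate shows how sensitive retention is to the change rate.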

    Veeam is an extremely powerful tool that will simplify your layers of protection in your environment. Features like SureBackup, Virtual Labs, and their Replication offerings are all very good. But more than likely, they do not need to be a part of your initial deployment plan. Stay focused, and get that new backup software up and running as quickly as possible. You, and your organization, will be better off for it.

    - Pete

    Shameless Seattle VMUG meeting plug–January 2014 edition

    For all of you VMware Admins and enthusiasts in the greater Seattle area, here is a fantastic opportunity to get out of the office for a little bit, meet some new people that do the same thing you do, and learn a little something along the way. First, here are the details so you can carve out time on your calendar.

    Date: Thursday, January 30th, 2014
    Time: 12:00pm – 4:30pm
    Location:
    Seattle Museum of Flight
    Registration: http://www.vmug.com/e/in/eid=1181
    Event Sponsors: Silver Peak & Zerto

    What’s in store for this VMUG? I’m glad you asked…

    Ask the Experts panel
    One of the most popular sessions each year at VMworld in San Francisco is the "Ask the Experts" Q&A session. If you have attended one, you quickly figure out why. No canned slide decks, or product promoting undertones. Just real questions from the audience to a panel of highly experienced and credentialed design experts. It is always a packed house, entertaining, and informative.

    In the spirit of copying a really good idea, the next Seattle VMUG meeting will be doing the same thing. Do you have some questions that you’d like the panel to answer?  Here is your chance!  Our panel includes:

    Jason Horn. Jason is a Principal Systems Engineer with Starbucks Coffee Company. He was recently awarded his VCDX (#113).

    Peter Chang. Peter is a Senior Systems Engineer at PernixData. He currently is a VCAP-DCA/DCD/DTD, and vExpert for 2013.

    I may also be taking part in the panel, but will focus on doing my best to moderate, and prevent any fights from breaking out. If you prefer not to raise your hand in public, but have some burning questions for our panel, please send me a note, or submit a question on the VMUG forums at: http://www.vmug.com/p/fo/st/thread=1826

    Product Testimonials from customers
    We have two great sponsors for this VMUG. Silver Peak’s WAN optimization products and Zerto’s BC/DR solution have garnered a lot of attention in their respective market segments. But here is your chance to hear a little bit more on how their products have impacted real customers. Customer stories are a great way to see if solutions deliver on their promises, and help you to see if they might be a good fit for your organization.

    So come on out to the next Seattle VMUG. Bring your questions for our "Ask the Experts" panel, and hear stories from real customers of Zerto and Silver Peak. And who knows. You might just win something too.

    - Pete

    Using the Cisco SG300-20 Layer 3 switch in a home lab

    One of the goals when building up my home lab a few years ago was to emulate a simple production environment that would give me a good platform to learn and experiment with. I’m a big fan of nested labs, and use one on my laptop often. But there are times when you need real hardware to interact with. This has come up even more than I expected, as recent trends with leveraging flash on the host have resulted in me stuffing more equipment back in the hosts for testing and product evaluations.

    Networking is the other area where it helps to have equipment that at least tries to mimic what you’d see in a production environment. Yet the options for networking in a home lab have typically been limited for a variety of reasons.

    • The real equipment is far too expensive, or too loud for most home lab needs.
    • Searching on eBay or Craigslist for a retired production unit can be risky. Some might opt for this strategy, but this can result in a power sucking, 1U noise maker that may have some dead ports on it, or worse, bricked upon arrival.
    • Consumer switches can be disappointing. Rig up a consumer switch that is lacking in features, and port count, and be left wishing you hadn’t gone this route.

    I wanted a fanless, full Layer 3 managed switch with a feature set similar to what you might find on an enterprise grade switch, but not at an enterprise grade price. I chose to go with a Cisco SG300-20. This is a 20 port, 1GbE, Layer 3 switch. With no fans, the unit draws as little as 10 watts.

    Effects of introducing write-back caching with PernixData FVP

    Implementing new technology that solves real problems is great. It is exciting, and you get to stand on the shoulders of the smart folks who dreamed up the solution. But with all of that glory comes new design and operation elements that may have been introduced. This isn’t a bad thing. It is just different. The magic of virtualization didn’t excuse the requirement of needing to understand the design and operational considerations of the new paradigm. The same goes for implementing host based caching in a virtualized environment.

    Implementing FVP is simple and the results can be impressive. For many, that is about all the effort they may end up putting into it. But there are design considerations that will help maximize the investment, and minimize false impressions, or costly mistakes. I want to share what has been learned against my real world workloads, so that you can understand what to look for, and possibly how to get more out of your investment. While FVP accelerates both reads and writes, it is the latter that warrants the most consideration, so that will be the focus of this post.

    When accelerating storage using FVP, the factors that I’ve found to have the most influence on how much your storage I/O is accelerated are:

    • Interconnect speed between hosts of your pooled flash
    • Performance delta between your flash tier, and your storage tier.
    • Working set size of your data
    • Duty cycle write I/O profile of your VMs (including peak writes, and duration)
    • I/O size of your writes (which can vary within each workload)
    • Likelihood or frequency of DRS or manual vMotion activities
    • Native speed and consistency of your flash (the flash itself, and the bus speed)
    • Capacity of your flash (more of an influence on read caching, but can have some impact on writes too)

    Write-back caching & vMotion
    Most know by now that to guard against any potential data loss in the event of a host failure, FVP provides redundancy of write-back caching through the use of one or more peers. The interconnect used is the vMotion network. While FVP does a good job of decoupling the VM’s need to wait for the backing datastore, a VM configured for write-back with redundancy must have each write I/O acknowledged by its local flash AND by the one or more peers before the write ACK is returned to the VM.

    What does this mean to your environment? More traffic on your vMotion network. Take a look at the image below. In a cluster NOT accelerated by FVP, the host uplinks that serve a vMotion network might see relatively little traffic, with bursts of traffic only during vMotion activities. That would also be the case if you were running FVP in write-back mode with no peers (WB+0). This image below is what the activity on the vMotion network looks like as perceived by one of the hosts after the VMs had write-back with redundancy of one peer. In this case the writes were averaging about 12MBps across the vMotion network. You will see that the spike is where a vMotion kicked off: The spike is the peak output of a 1GbE interface; about 125MBps.

    image

    Is this bad that the traffic is running over your vMotion network? No, not necessarily. It has to run over something. But with this knowledge, it is easy to see that bandwidth for inter-server communication will be more important than ever before. Your infrastructure design may need to be tweaked to accommodate the new role that the vMotion network plays.

    Can one get away with a 1GbE link for cross server communication? Perhaps. It really depends on the factors above, which can sometimes be hard to determine. So with all of the variables to consider, it is sometimes easiest to circle back to what we do know:

    • Redundant write back caching with FVP will be using network connectivity (via vMotion network) for every single write that occurs for an accelerated VM.
    • Redundant write back caching writes are multiplied by the number of peers that are configured per accelerated VM.
    • The write accelerated I/O commit time (latency) will be as fast as the slowest connection.  Your vMotion network will likely be slower than the local bus.  A poor quality SSD or an older generation bus could be a bottleneck too.
    • vMotion activities enjoy using every bit of bandwidth it has available to it.
    • VM’s that are committing a lot of writes might also be taxing CPU resources, which may kick in DRS rules to rebalance the load – thus creating more vMotion traffic.  Those busy VMs may be using more active memory pages as well, which may increase the amount of data to move during the vMotion process.
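The "slowest connection" point above can be expressed in a couple of lines. This is a simplified model, assuming the local write and the peer writes are issued in parallel; the function name and the latency figures are hypothetical and are not measurements or FVP internals.

```python
def write_ack_latency_ms(local_flash_ms, peer_path_ms):
    """Latency seen by the VM: the ACK waits for local flash and every peer.

    peer_path_ms holds one value per peer, each covering the network hop plus
    the remote flash write. An empty list models WB+0.
    """
    return max([local_flash_ms, *peer_path_ms])


# Illustrative values only.
print(write_ack_latency_ms(2.0, []))          # WB+0: local flash only
print(write_ack_latency_ms(2.0, [3.5]))       # WB+1: the peer path is slower
print(write_ack_latency_ms(6.0, [3.5, 4.0]))  # WB+2: local flash is the slow part
```

Under this model, a faster interconnect only helps until the peer path is quicker than the local flash, after which the flash and its bus set the floor.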

    The multiplier of redundancy
    Let's run through a simple scenario to better understand the potential impact an undersized vMotion network can have on the performance of write-back caching with redundancy. The example addresses writes only.

    • 4 hosts each have a group of 6 VMs that consistently write 5MBps per VM.  Traditionally, these 24 VMs would be sending a total of 120MBps to the backing physical storage.
    • When write back is enabled without any redundancy (WB+0), the backing storage will still see the same amount of writes committed, but in a slightly different way: sequential, and smoothed out as data is flushed to the backing physical storage.
    • When write back is enabled and a write redundancy of “local flash and 1 network flash device” (WB+1) is chosen, the backing storage will still see 120MBps go to it eventually, but there will be an additional 120MBps of data going to the host peers, traversing the vMotion network.
    • When write back is enabled and a write redundancy of “local flash and 2 network flash devices” (WB+2) is chosen, the backing storage will still see 120MBps go to it, but there will be an additional 240MBps of data going to the host peers, traversing the vMotion network.

    image

    The write-back redundancy configuration is a per-VM setting, so there is not necessarily a need to change them all to one setting. Your VMs will most likely not have the same write workload either. But as the example shows, it is not hard to saturate a 1GbE interface. Assuming an approximate 125MBps on a single 1GbE interface, under the described arrangement, saturation would occur with each VM configured for write-back with redundancy of one peer (WB+1). This leaves little headroom for other traffic that might be traversing that network, such as vMotions, or heartbeats.
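    The arithmetic in the scenario above can be worked through in a few lines of Python. This is only a sketch of the example's numbers: the function and constant names are made up for illustration, and, like the example itself, it compares the cluster-wide replica traffic against a single 1GbE link rather than modeling how peers are actually distributed across hosts.

```python
# Aggregate write and replica traffic for the 4-host scenario described above.
# All names are illustrative; nothing here is part of FVP itself.

GBE_1_USABLE_MBPS = 125  # approximate peak of a single 1GbE interface, per the text


def replica_traffic_mbps(hosts, vms_per_host, write_mbps_per_vm, peers):
    """Return (writes to backing storage, replica traffic on the vMotion network)."""
    total_writes = hosts * vms_per_host * write_mbps_per_vm
    # Each write is copied once per configured network peer.
    return total_writes, total_writes * peers


for peers in (0, 1, 2):
    to_storage, to_peers = replica_traffic_mbps(4, 6, 5, peers)
    share = to_peers / GBE_1_USABLE_MBPS
    print(f"WB+{peers}: {to_storage} MBps to storage, "
          f"{to_peers} MBps to peers ({share:.0%} of one 1GbE link)")
```

    Swapping in your own VM counts and per-VM write rates gives a rough first estimate of how much replica traffic a given redundancy setting would add.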

    Fortunately FVP has the smarts built in to ensure that vMotion activities and write-back caching get along. However, there is no denying the physics associated with the matter. If you have a lot of writes, and you really want to leverage the full beauty of FVP, you are best served by fast interconnects between hosts. It is a small price to pay for supreme performance. FVP might expose the fact that 1GbE may not be ideal in an accelerated environment, but consider what else has changed over the years. Standard memory sizes of deployed VMs have increased significantly (The vOpenData Public Dashboard confirms this). That 1GbE vMotion network might have been good for VMs with 512MB of RAM, but what about those with 4, 8, or 12GB of RAM?  That 1GbE vMotion network has become outdated even for what it was originally designed for.

    Destaging
    One characteristic unique to any type of write-back caching is that eventually, the data needs to be destaged to the backing physical datastore. The server-side flash that is now decoupled from the backing storage has the potential to accommodate a lot of write I/Os with minimal latency. If this high write I/O lasts long enough, you may or may not have backing spindles, or a conduit, large enough to keep sending your write I/O to the backing physical storage. Destaging issues can occur on an arrangement like FVP, or with storage arrays and DAS arrangements that front performance I/O with flash that then gets pushed to slower spindles.

    Knowing the impact of this depends on the workload and the environment it runs in.

    • If the duty cycle of the write workload that is above the physical storage I/O limit allows for enough “rest time” (defined as any moment that the max I/O to the backing physical storage is below 100%) to destage before the next over commitment begins, then you have effectively increased your ability to deliver more write I/Os with less latency.
    • If the duty cycle of the write workload that is above the physical storage I/O limit is sustained for too long, the destager of that given VM will fill to capacity, and will not be able to accelerate any faster than its ability to destage.

    Huh?  Okay, a picture might be a better way to describe this.  The callouts below point to the two scenarios described.

    image

    So when looking at this write I/O duty cycle, there becomes a concept of amplitude of the maximum write I/O, and frequency of those times in which it is overcommitting. When evaluating an environment, you might see this crude sine-wave show up. This write I/O duty cycle, coupled with your physical components, is the key to how much FVP can accelerate your environment.

    What happens when the writes to the destager surpass the ability of your backing storage to keep up with the writes? Once the destager for that given VM fills up, its acceleration will reduce to the rate that it can evacuate the data to the backing storage.  One may never see this in production, but it is possible.  It really depends on the factors listed at the beginning of the post.  The only way to clearly see this is from a synthetic workload, where I show it was able to push 5 times the write I/Os (blue line) before eventually filling up the destager to the point where it was throttled back to the rate of the datastore (purple line).

    image

    This will have an impact on the effective latency, shown below (blue line).  While the destager is full, it will not be able to fulfill the write at the low latency typically associated with flash, reflecting latency closer to the backing datastore (purple line).

    image

    Many workloads would never see this behavior, but those that are very write intensive (like mine), and that have a big delta between their acceleration tier and their backing storage may run into this.

    The good news is that workloads have a tendency to be bursty, which is a perfect match for an acceleration tier. In a clustered arrangement, this is much harder to predict, and bursty can change to steady-state quite quickly. What this demonstrates is that if there is enough of a performance delta between your acceleration tier and your storage tier, then under sustained writes there may be times where the destager doesn’t have the opportunity to flush enough writes to maintain its ability to accelerate.
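    The fill-and-throttle behavior described in this section can be illustrated with a toy model. To be clear, this is not how FVP's destager is implemented; it is a minimal sketch that treats the destager as a fixed-size buffer drained at the backing storage's rate, and the function name, the one-second intervals, and all of the numbers are assumptions picked for illustration.

```python
def simulate_destager(incoming_iops, backing_iops, capacity_ios):
    """Step through one-second intervals and return the accepted write IOPS for each.

    incoming_iops: write IOPS the VM tries to issue in each interval
    backing_iops:  IOPS the backing storage can absorb while destaging
    capacity_ios:  how many un-destaged writes the buffer can hold
    """
    queued = 0
    accepted_per_interval = []
    for demand in incoming_iops:
        # Destage first; that frees room for new writes in this interval.
        queued = max(0, queued - backing_iops)
        room = capacity_ios - queued
        accepted = min(demand, room)
        queued += accepted
        accepted_per_interval.append(accepted)
    return accepted_per_interval


# Sustained demand at 5x the backing rate: full speed until the buffer fills,
# then throttled down to what the backing storage can drain.
sustained = simulate_destager([5000] * 6, backing_iops=1000, capacity_ios=10000)

# Bursty demand with enough "rest time" between bursts: every write is accepted.
bursty = simulate_destager([5000, 0, 0, 0, 0, 5000, 0, 0, 0, 0],
                           backing_iops=1000, capacity_ios=10000)
print(sustained)
print(bursty)
```

    In the sustained case the accepted rate settles at the backing storage's 1,000 IOPS once the buffer is full, while the bursty case never throttles. That is the same contrast the two duty-cycle scenarios above describe.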

    Recommendations
    My recommendations (and let me clarify that these are my opinions only) on implementing FVP would include:

    • Initially, run the VMs in write-through mode so that you can leverage the FVP analytics to better understand your workload (duty cycles, read/write ratios, maximum write throughput for a VM, IOPS, latency, etc.)
    • As you gain a better understanding of the behavior of these workloads, introduce write-back caching to see how much it helps the systems you changed.
    • Keep an eye on your vMotion network (in particular, in 1GbE environments with limited physical ports) and see if it ever comes close to saturation.  Another leading indicator will be increased latency on accelerated writes.
    • Run out and buy some 10GbE NICs for your vMotion network.  If you are in a situation with a total 1GbE legacy fabric for your SAN, and your vMotion network, and perhaps you have limits on form factors that may make upgrading difficult (think blades here), consider investing in 10GbE for your vMotion network, as opposed to your backing storage. Your read caching has probably already relieved quite a bit of I/O pressure on your storage, and addressing your cross server bandwidth is ultimately a more affordable, and simpler task.
    • If possible, allocate more than one link and configure for Multi-NIC vMotion. At this time, FVP will not be able to leverage this, but it will allow vMotion to use another link if the other link is busy. Another possible option would be to bond multiple 1GbE links for vMotion. This may or may not be suitable for your environment.

    So if you haven’t done so already, plan to incorporate 10GbE for cross-server communication for your vMotion Network. Not only will your vMotioning VMs thank you, so will the performance of FVP.

    - Pete

    Helpful links:

    Fault Tolerant Write acceleration
    http://frankdenneman.nl/2013/11/05/fault-tolerant-write-acceleration/

    Destaging Writes from Acceleration Tier to Primary Storage
    http://voiceforvirtual.com/2013/08/14/destaging-writes-i/

    Using a new tool to discover old problems

    It is interesting what can be discovered when storage is accelerated. Virtual machines that were previously restricted by the underperforming arrays now get to breathe freely.  They are given the ability to pass storage I/O as quickly as the processor needs. In other words, the applications that need the CPU cycles get to dictate your storage requirements, rather than your storage imposing artificial limits on your CPU.

    With that idea in mind, a few things revealed themselves during the process of implementing PernixData FVP.  Early on, it was all about implementing and understanding the solution.  However, once the real world workloads began accelerating, there was intrigue on the analytics that FVP was providing.  What was generating the I/O that was being accelerated?  What processes were associated with the other traffic not being accelerated, and why?  What applications were behind the changing I/O sizes?  And what was causing the peculiar I/O patterns that were showing up?  Some of these were questions raised at an earlier time (see: Hunting down unnecessary I/O before you buy that next storage solution ).  The trouble was, the tools I had to discover the pattern of data I/O were limited.

    Why is this so important? In the spirit of reminding ourselves that no resource is an island, here is an example of a production code compile run, as seen from the perspective of the guest CPU. The first screen capture is the compile with adequate storage I/O to support the application’s needs. A full build is running with nearly perfect CPU utilization on all 8 of its vCPUs.  (screen shots taken from my earlier post; Vroom! Scaling up Virtual Machines in vSphere to meet performance requirements-Part 2)

    image

    Below is that very same code compile, under stressed backend storage. It took 46% longer to complete, and as you can see, changes the CPU utilization of the build run.

    image

    The primary goal for this environment was to accelerate the storage. However, it would have been a bit presumptuous to conclude that all existing storage traffic is good, useful I/O. There is a significant amount of traffic originating from outside of IT, and the I/O generated needed to be understood better.  With the traffic passing more freely thanks to FVP acceleration, patterns that previously could not expose themselves should be more visible. This was the basis for the first discovery.

    A little “CSI” work on the IOPS
    Many continuous build systems use some variation of a polling mechanism to understand when there is new checked in code that needs to be compiled. This should be a very lightweight process.  However, once storage performance was allowed to breathe better, the following patterns started showing up on all of the build VMs.

    The image below shows the IOPS for one build VM during a one hour period of no compiling for that particular VM.  The VMs were polling for new builds every 5 minutes.  Yep, that “build heartbeat” was as high as 450 IOPS on each VM.

    high-IOPS-heartbeat

    Why wasn’t this noticed before?  These spikes were being suppressed by my previously overtaxed storage, which made them more difficult to see. These were all writes, and were translating into 500 to 600 steady state IOPS just to sit idle (as seen below from the perspective of the backing storage).

    Array-VMFSvolumeIOPS

    So what was the cause? As it turned out, the polling mechanism was using some source code control (SVN) calls to help the build machines understand whether they needed to execute a build. The Development Team has no way of knowing programmatically whether the script they develop is efficient or not; they are separated from that by a layer of the infrastructure. (Sadly, I have a feeling this happens more often than not in general Application Development.) The result was a horribly inefficient method. Once they understood the matter, the script was revamped, and now polling for each VM only takes 1 to 2 IOPS every 5 minutes.

    Idle-IOPS2
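    I won't reproduce the actual revamped script here, but the general idea behind a cheap poll can be sketched. The Python below is purely illustrative and assumes one common approach: ask the repository for its newest revision number with a single metadata query, and do nothing further unless that number has changed. Both `get_head_revision` and `start_build` are hypothetical callables standing in for whatever your build system provides.

```python
def poll_once(get_head_revision, last_built_revision, start_build):
    """Run one polling cycle and return the revision that is now considered built.

    get_head_revision: hypothetical callable returning the newest revision number
                       (one cheap metadata query, no working-copy scan)
    start_build:       hypothetical callable that kicks off a build for a revision
    """
    head = get_head_revision()
    if head == last_built_revision:
        return last_built_revision  # nothing new; no further I/O this cycle
    start_build(head)
    return head


# Two polling cycles: the first finds nothing new, the second triggers a build.
builds = []
rev = poll_once(lambda: 100, 100, builds.append)
rev = poll_once(lambda: 101, rev, builds.append)
print(rev, builds)
```

    The point of the design is that the idle path touches almost nothing, so a VM with no new code to build stays quiet between polls.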

    The image below shows how the accelerated cluster of 30 build VMs looks when there are no builds running.

    Idle-IOPS

    The inefficient polling mechanism wasn’t the only thing found. A few of the Linux build VMs had a rogue “Beagle” search daemon running on them. This crawler did just that, indexing data on these Linux machines, and creating unnecessary I/O.  With Windows, Indexers and other CPU and I/O hogs are typically controlled quite easily by GPO, but the equivalent services can creep into Linux systems if you are not careful.  It was an easy fix at least.

    The cumulative benefit
    Prior to the efforts of accelerating the storage, and looking to make it more efficient, the utilization of the arrays looked as the image shows.  (6 hour period, from the perspective of the arrays)

    Array-IOPS-before

    Now, with the combination of understanding my workload better, and acceleration through FVP, that same workload looks like this (6 hour period, from the perspective of the arrays):

    Array-IOPS-after

    Notice that the estimated workload is far under the 100% it was regularly pegged at for 24 hours a day, 6 days a week.  In fact, during the workday, the arrays might only peak at 50% to 60% utilization.  When no builds are running, the continuous build system may only be drawing 25 IOPS from the VMFS volumes that contain the build machines, which is much more reasonable than where it was at.

    With the combination of less pressure on the backing physical storage, and the magic of pooled flash on the hosts, the applications and CPU get to dictate how much storage I/O is needed.  Below is a screen capture of IOPS on a production build VM while compiling was being performed.  It was not known up until this point that a single build VM needed as much as 4,000 IOPS to compile code because the physical storage was never capable of satisfying that type of need.

    IOPS-single-VM

    Conclusion
    Could some of these discoveries have been made without FVP?  Yes, perhaps some of them. But good analysis comes from being able to interpret data in a consumable way. It’s why various methods of data visualization such as bar graphs, pie charts, and X-Y-Z plots exist. FVP certainly has been doing a good job of accelerating workloads, but it also helps the administrator understand the I/O better.  I look forward to seeing how the analytics might expand in future tools or releases from PernixData.

    A friend once said to me that the only thing better than a new tractor is a reason to use it. In many ways, the same thing goes for technology. Virtualization might not even be that fascinating unless you had real workloads to run on top of it. Ditto for PernixData FVP. When applied to real workloads, the magic begins to happen, and you learn a lot about your data in the process.

    Observations of PernixData FVP in a production environment

    Since my last post, "Accelerating storage using PernixData’s FVP. A perspective from customer #0001" I’ve had a number of people ask me questions on what type of improvements I’ve seen with FVP.  Well, let’s take a look at how it is performing.

    The cluster I’ve applied FVP to is dedicated to the purpose of compiling code. Over two dozen 8 vCPU Linux and Windows VMs churn out code 24 hours a day. It is probably one of the more challenging environments to improve, as accelerating code compiling is inherently a very difficult task.  Massive amounts of CPU using a highly efficient, multithreaded compiler, a ton of writes, and throw in some bursts of reads for good measure.  All of this occurs in various order depending on the job. Sounds like fun, huh.

    Our full builds benefited the most by our investment in additional CPU power earlier in the year. This is because full compiles are almost always CPU bound. But incremental builds are much more challenging to improve because of the dialog that occurs between CPU and disk. The compiler is doing all sorts of checking, throughout the compile. Some of the phases of an incremental build are not multithreaded, so while a full build offers nearly perfect multithreading on these 8 vCPU build VMs, this just isn’t the case on an incremental build.

    Enter FVP
    The screen shots below will step you through how FVP is improving these very difficult to accelerate incremental builds. They will be broken down into the categories that FVP divides them into: IOPS, Latency, and Throughput.  Also included will be a CPU utilization metric, because they all have an indelible tie to each other. Some of the screen shots are from the same compile run, while others are not. The point here is to show how it is accelerating, and more importantly how to interpret the data. The VM being used here is our standard 8 vCPU Windows VM with 8GB of RAM.  It has write-back caching enabled, with a write redundancy setting of "Local flash and 1 network flash device."


    IOPS
    Below is an incremental compile on a build VM during the middle of the day. The magenta line is showing what is being satisfied by the backing data store, and the blue line shows the Total Effective IOPS after flash is leveraged. The key to remember on this view is that it does not distinguish between reads and writes. If you are doing a lot of "cold reads" the magenta "data store" line and blue "Total effective" line may very well overlap.

    PDIOPS-01

    This is the same metric, but toggled to the read/write view. In this case, you can see below that a significant amount of acceleration came from reads (orange). For as much writing as a build run takes, I never knew a single build VM could use 1,600 IOPS or more of reads, because my backing storage could never satisfy the request.

    PDIOPS-02

    CPU
    Allowing the CPU to pass the I/O as quickly as it needs to does one thing: it allows the multithreaded compiler to maximize CPU usage. During a full compile, it is quite easy to max out an 8 vCPU system and have a sustained 100% CPU usage, but again, these incremental compiles were much more challenging. What you see below is the CPU utilization associated with the VM running the build. Acceleration brings a significant improvement to an incremental build. A non accelerated build would rarely get above 60% CPU utilization.

    CPU-01

    Latency
    At a distance, this screen grab probably looks like a total mess, but it has really great data behind it. Why? The need for high IOPS is dictated by the VMs demanding it. If it doesn’t demand it, you won’t see it. But where acceleration comes in more often is reduced latency, on both reads and writes. The most important line here is the blue line, which represents the total effective latency.

    PDLatency-01

    Just as with other metrics, the latency reading can often times be a bit misleading with the default "Flash/Datastore" view. This view does not distinguish between reads and writes, so a cold read pulling off of spinning disk will have traditional amounts of latency you are familiar with. This can skew your interpretation of the numbers in the default view. For all measurements (IOPS, Throughput, Latency) I often find myself toggling between this view, and the read/write view. Here you can see how a cold read sticks out like a sore thumb. The read/write view is where you would go to understand individual read and write latencies.

    PDLatency-02

    Throughput
    While a throughput chart can often look very similar to the IOPS chart, you might want to spend a moment and dig a little deeper. You might find some interesting things about your workload. Here, you can see the total effective throughput significantly improved by caching.

    PDThroughput-01

    Just as with the other metrics, toggling it into read/write view will help you better understand your reads and writes.

    PDThroughput-02

    The IOPS, Throughput & Latency relationship
    It is easy to overlook the relationship that IOPS, throughput, and latency have to each other. Let me provide a real world example of how one can influence the other. The following represents the early and middle phases of a code compile run. This is the FVP "read/write" view of this one VM. Green indicates writes. Orange indicates reads. Blue indicates "Total Effective" (often hidden by the other lines).

    First, IOPS (green). High write IOPS at the beginning, yet relatively low write IOPS later on.

    IOPS

    Now, look at write throughput (green) below for that same time period of the build.  A modest amount of throughput at the beginning where the higher IOPS were at, then followed by much higher throughput later on when IOPS had been low. This is the indicator of changing I/O sizes from the applications generating the data.

    throughput

    Now look at write latency (green) below. Extremely low latency (sub 1ms) with smaller I/O sizes. Higher latency on the much larger I/O sizes later on. By the way, the high read latencies generally come from cold reads that were served from the backing spindles.

    latency

    The findings here show that early on in the workflow where SVN is doing a lot of its prep work, a 32KB I/O size for writes is typically used.  The write IOPS are high, throughput is modest, and latency comes in at sub 1ms. Later on in the run, the compiler itself uses much larger I/O sizes (128KB to 256KB). IOPS are lower, but throughput is very high. Latency suffers (approaching 8ms) with the significantly larger I/O sizes. There are other factors influencing this, which I will address in an upcoming post.

    This is one of the methods to determine your typical I/O size to provide a more accurate test configuration for Iometer, if you choose to do additional benchmarking. (See: Iometer.  As good as you want to make it.)
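    The relationship used to back into an I/O size is simply throughput divided by IOPS over the same sampling period. Here is a small Python sketch of that calculation. The function name is made up, and the sample readings are hypothetical values chosen to land on the 32KB and 256KB sizes mentioned above; they are not numbers taken from the charts.

```python
def average_io_size_kb(throughput_mbps, iops):
    """Average I/O size in KB, given throughput in MB/s and I/O operations per second."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    return throughput_mbps * 1024 / iops


# Hypothetical readings from the same time window of the IOPS and throughput charts.
early_phase = average_io_size_kb(throughput_mbps=50, iops=1600)   # high IOPS, modest throughput
later_phase = average_io_size_kb(throughput_mbps=100, iops=400)   # low IOPS, high throughput
print(early_phase, later_phase)
```

    Because this is an average over the sampling window, a mix of small and large I/Os will blur together, so it is best applied to phases where the workload is doing one kind of thing.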

    Other observations

    1.  After you have deployed an FVP cluster into production, your SAN array monitoring tool will most likely show you an increase in your write percentage compared to your historical numbers. This is quite logical when you think about it.  All writes, even when accelerated, eventually make it to the data store (albeit in a much more efficient way). Many of your reads may be satisfied by FVP, and never hit the array.

    2.  When looking at a summary of the FVP at the cluster level, I find it helpful to click on the "Performance Map" view. This gives me a weighted view of how to distinguish what is being accelerated most during the given sampling period.

    image

    3. In addition to the GUI, the VM write caching settings can easily be managed via PowerShell. This might be a good step to take if the cluster tripped over to UPS power.  Backup infrastructures that do not have a VADP capable proxy living in the accelerated cluster might also need to rely on some PowerShell scripts. PernixData has some good documentation on the matter.

    Conclusion
    PernixData FVP is doing a very good job of accelerating a very difficult workload. I would have loved to show you data from accelerating a more typical workload such as Exchange or SQL, but my other cluster containing these systems is not accelerated at this time. Stay tuned for the next installment, as I will show you what was discovered as I started looking at my workload more closely.

    - Pete
