Observations with the Active Memory metric in vSphere

The subject of memory management of Operating Systems in vSphere is an enormously broad and complex topic that has been covered quite well over the years. Even with all of that great information, some of the metrics have characteristics that still seem to befuddle users. One of those metrics provided to us courtesy of vSphere is "Active Memory." I hope to provide a few real world examples of why this confusion occurs, and what to look out for in your own environment.

vSphere attempts to interpret how much memory is being actively used by a VM, and displays this in the form of “Active Memory.”  The VMkernel bases this estimate on the memory pages recently touched by the guest OS during a given sampling period. It then displays it as an average for that sampling period (maximums and minimums are exposed with higher logging levels). It is a metric that has proven to be quite controversial. Some have grown frustrated by its perceived inaccuracies, but I believe the problem is not in the metric’s accuracy, but in a misunderstanding of how it collects its data, and what it means. Having additional data points to understand the behavior of your workload is a good thing. It is critical to know what it really means, and how different Operating Systems and applications may provide different results to this metric.

There is a wealth of good sources (a few links at the end of this post) on defining what Active Memory is as it relates to vSphere. The two takeaways of the Active Memory metric I like to remember are that 1.) It is a statistical estimate, and 2.) It represents a single sample period. In other words, it has no relationship to previous samplings, and therefore, may or may not represent the same memory pages accessed.
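
To make the "statistical estimate" point a little more concrete, here is a minimal Python sketch of sampling-based estimation. It is not the VMkernel's actual algorithm. The page size, the sample size of 100, and the function name are all assumptions chosen for illustration. It inspects a random subset of a VM's pages, counts how many were touched during the period, and extrapolates to the whole VM.

```python
import random

PAGE_SIZE_KB = 4  # assumed guest page size


def estimate_active_mb(touched_pages, total_pages, sample_size=100, rng=None):
    """Estimate active memory (MB) from a random sample of pages.

    touched_pages: set of page numbers the guest touched this period.
    total_pages:   number of pages assigned to the VM.
    sample_size:   how many pages to inspect (hypothetical default).
    """
    rng = rng or random.Random()
    sample = rng.sample(range(total_pages), sample_size)
    hits = sum(1 for page in sample if page in touched_pages)
    fraction = hits / sample_size
    return fraction * total_pages * PAGE_SIZE_KB / 1024


# A hypothetical 4GB VM that touched 25% of its pages in one period.
total = 4 * 1024 * 1024 // PAGE_SIZE_KB
touched = set(range(total // 4))

# Periods with identical guest behavior can still report different values,
# because each period draws a fresh random sample.
for period in range(3):
    print(f"period {period}: ~{estimate_active_mb(touched, total):.0f} MB active")
```

Each run hovers around 1GB but rarely lands on it exactly, which is the behavior to keep in mind when reading a single Active Memory sample.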

The Risk
"We have met the enemy, and he is us."  — Walt Kelly as Pogo

Since Active Memory is a unique metric outside of the paradigm of the OS, translating what it means to you, the application, or the guest OS can be prone to misinterpretation. The risk is interpreting its meaning incorrectly, and perhaps using it as the primary method for right sizing a VM. Interestingly enough, this can lead to both oversized VMs, and undersized VMs.

I believe that one thing that gets Administrators off on the wrong foot is vSphere’s own baked-in alarm of "Virtual Machine Memory Usage." This "Usage" metric is a percentage of total available memory for the VM, and is tied to the Active Memory metric in vSphere. It implies that when it is high, the VM is running out of memory, and when it is low, it is performing as designed with no memory issues. I will demonstrate how under certain circumstances, both of these assumptions can be wrong.
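
As a rough sketch of the relationship described above, the Python below treats "Usage" as Active Memory divided by the memory assigned to the VM. The function names, the threshold values, and the two sample VMs are hypothetical; the thresholds in particular are not claimed to be vCenter's defaults.

```python
def memory_usage_percent(active_mb, configured_mb):
    """Usage as described above: Active Memory as a share of assigned memory."""
    if configured_mb <= 0:
        raise ValueError("configured_mb must be positive")
    return 100.0 * active_mb / configured_mb


def alarm_state(usage_percent, warning=85.0, critical=95.0):
    """Classify usage against thresholds (hypothetical values)."""
    if usage_percent >= critical:
        return "critical"
    if usage_percent >= warning:
        return "warning"
    return "normal"


# Hypothetical file server: many pages are touched during a bulk copy.
print(alarm_state(memory_usage_percent(3600, 4096)))   # 87.9% of 4GB
# Hypothetical database server: few pages touched in this sample period.
print(alarm_state(memory_usage_percent(800, 12288)))   # 6.5% of 12GB
```

Neither result says anything about whether the guest actually has enough memory, which is the trap the alarm sets.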

Oversizing
Oversizing a VM’s resources is not an uncommon occurrence. You would think spotting these systems might be easy and obvious. That is not always the case.

With respect to memory sizing, let’s do a little experiment. The example below is a bulk file copy (11 gigabytes worth of large and small files) from a Linux machine. The target can be local, or remote. The effect will be similar. We will observe the difference of Active Memory between the small VM (1GB of memory assigned), and the large VM (4GB of memory assigned), and what impacts it may or may not have on performance.

The Active Memory of the smaller Linux VM below

image

The Active Memory of the larger Linux VM below.

image

Note how the Active Memory increased on the 4GB Linux VM versus the 1GB Linux VM. This gives the impression that the file copy is using memory for the file copy job, and leaves less for the applications.

Now let us jump into ‘top’ inside the guest OS. It also shows figures that give the impression that the file copy is using most of the memory for the copy job, and may trigger a vCenter Memory usage alarm.

image

But in this case, top is not telling the entire story either. Let’s take a look at the same resource utilization inside the guest using ‘htop’.

image

Let’s look at utilization inside the guest using "free -m"

image

So what is going on here?  The Linux kernel will allocate memory that isn’t actively used by processes to other tasks like file system caches. This opportunistic use of memory will not interfere with other spawning processes. As soon as another process spawns, the Linux kernel will free that memory so that it can be used by the application. This is a clever use of resources, but as you can see, can also give the wrong impression inside the guest (via ‘top’), as well as in vSphere (via Active Memory). One can keep increasing the amount of memory assigned to a VM, and in many cases, this behavior will continue to occur. vSphere’s Active Memory metric does not attempt to distinguish what it is, beyond a change in value. In all cases, the memory statistics are not inaccurate, but just a different representation of memory usage.
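
The distinction between memory used by processes and memory used opportunistically for caching can be sketched in a few lines of Python. This is a simplified illustration, assuming the `MemTotal`, `MemFree`, `Buffers`, and `Cached` fields of `/proc/meminfo`; it treats all of `Buffers` and `Cached` as reclaimable, which is not strictly true. The sample values are hypothetical and only resemble a 4GB VM in the middle of a bulk file copy.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of values in kB."""
    values = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def memory_breakdown_mb(info):
    """Split total memory into process use, cache, and free (all in MB)."""
    cache_kb = info.get("Buffers", 0) + info.get("Cached", 0)
    used_kb = info["MemTotal"] - info["MemFree"] - cache_kb
    return {
        "total": info["MemTotal"] // 1024,
        "used_by_processes": used_kb // 1024,
        "cache": cache_kb // 1024,
        "free": info["MemFree"] // 1024,
    }


# Hypothetical sample values, not taken from the screenshots above.
SAMPLE = """\
MemTotal:        4046848 kB
MemFree:          102400 kB
Buffers:           51200 kB
Cached:          3481600 kB
"""

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as handle:   # present on Linux guests
            text = handle.read()
    except OSError:
        text = SAMPLE                           # fall back to the sample
    print(memory_breakdown_mb(parse_meminfo(text)))
```

With the sample values, almost all of the "used" memory turns out to be cache, with only a few hundred megabytes held by processes.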

The reason why I chose a bulk file copy as an experiment is because a file copy is largely perceived by the end user as being a storage I/O or network I/O matter. The behavior I described will most likely show up in Linux VMs being used as flat-file storage servers (something I see often), but is not limited to just that type of workload. I should also mention that during the testing, the ability for Linux to use memory for some of its file handling tasks was more noticeable when using slow backing storage in comparison to faster storage.

If you are purely a Windows shop, remember that this characteristic will show up with virtual appliances, as most of them are Linux VMs. Let’s take a look at that same bulk file copy in Windows, and see how it relates to Active Memory.

The Active Memory of the smaller Windows VM below.

image

The Active Memory of the larger Windows VM below.

image

Memory resources inside the guest of the larger Windows VM below.

image

The Windows Memory Manager seems to handle this same task differently.  Semantics aside, when more memory is assigned to a VM, Windows appears to carve out more for this task, but seems to cap its ability, in favor of leaving the remaining memory space for already cached applications and data (seen in the screen shots as “standby” and/or “free”).  This is a simple indicator that various Operating Systems handle their memory management differently, and that this needs to be taken into consideration when a user is observing the Active Memory metric.

Undersizing
Undersizing a VM’s memory can stem from many causes, but is most likely to show up on the following types of systems.

  • Server performing multiple roles and not sized accordingly. (e.g. Front end web services with backend databases on the same system, like small SharePoint deployments)
  • VMs right sized according to the Active Memory metric.
  • SQL Servers.
  • Exchange Servers.
  • Servers running one or more Java applications.

With a SQL server, one can easily find a server where the "Active Memory" is quite low. Then, look inside the guest, and you will see that utilization of memory is very high, and if the system’s resources were assigned conservatively, it will act sluggish.

image

Now look at it inside the guest, and you will see quite high utilization.

image

A few steps can help this matter.

  • Use the SQL Server Monitoring Tools in Perfmon to better understand the problem. Be warned that you may have to invest significant time in this in order to get the scaling right, interpret, and validate the data correctly. Don’t rely solely on one metric to determine the state. For instance, the "SQL Server Buffer Manager: Buffer Cache Hit Ratio" is supposed to indicate insufficient memory for SQL if the ratio is a low number. However, I’ve seen memory starved systems still show this as a high value.
  • Change SQL’s default configuration for managing memory. The default setting will let SQL absorb all of the memory, and leave little for the rest of the OS or the apps. Set it to a fixed number below the amount assigned to the system. For example, if one had a 12GB SQL server, assign 6GB as the maximum server memory. This will allow for sufficient resources for the server OS and any other applications that run on the system.
  • Document performance monitoring results, then increase the memory assigned to your VM. Then follow up with more performance monitoring to see any measurable results. One could simply increase the memory assigned and forget the other steps, but you’ll be relying completely on anecdotal observations to determine improvement.
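
The memory cap from the second step can be sketched as follows. This is a small Python helper that only builds the T-SQL text; it does not connect to a server. The 50% share simply reproduces the 12GB to 6GB example above and is not a sizing rule, and both function names are hypothetical. The `sp_configure` options it emits ('show advanced options' and 'max server memory (MB)') are standard SQL Server settings.

```python
def max_server_memory_mb(vm_memory_gb, sql_share=0.5):
    """Return a SQL Server memory cap in MB.

    sql_share is a hypothetical tuning knob; 0.5 reproduces the
    12GB VM -> 6GB cap example above.
    """
    if not 0 < sql_share < 1:
        raise ValueError("sql_share must be between 0 and 1")
    return int(vm_memory_gb * 1024 * sql_share)


def build_tsql(cap_mb):
    """Build the T-SQL that applies the cap via sp_configure."""
    return "\n".join([
        "EXEC sp_configure 'show advanced options', 1;",
        "RECONFIGURE;",
        f"EXEC sp_configure 'max server memory (MB)', {cap_mb};",
        "RECONFIGURE;",
    ])


print(build_tsql(max_server_memory_mb(12)))
```

Review the generated statements before running them, and pick the share based on what else lives on the server.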

Exchange is beginning to act more like SQL with each major release. Much like SQL, Exchange is now quite aggressive in its use of caching. It’s one of the reasons for the dramatic reductions in storage I/O demands over the last three major releases of Exchange. Also like SQL, having plenty of memory assigned will help compensate for slow backend storage.  Starving the system of memory will create wildly unpredictable results, as it never has an opportunity to cache what it should.

Java will use its own memory manager. Java will need available memory space in each VM for each and every JVM running. Ultimately, the JVM applications will work best when a memory reservation is, at minimum, set to the sum of the memory needed by all JVMs running on that VM. Be mindful of the implications that memory reservations can bring to the table. You can gain more insight as to the needs of Java inside the guest by using various tools.
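
As a back-of-the-envelope sketch of that sum, the Python below adds up an approximate footprint for each JVM. The 25% non-heap overhead is a guess for illustration, not a measured figure, and the three heap sizes are hypothetical. The result does not include what the guest OS itself needs.

```python
def jvm_footprint_mb(max_heap_mb, non_heap_overhead=0.25):
    """Approximate one JVM's footprint: max heap plus an assumed overhead
    for metaspace, thread stacks, and native buffers (25% is a guess)."""
    return max_heap_mb * (1 + non_heap_overhead)


def suggested_reservation_mb(max_heaps_mb, non_heap_overhead=0.25):
    """Sum the approximate footprints of every JVM running in the VM."""
    return round(sum(jvm_footprint_mb(h, non_heap_overhead) for h in max_heaps_mb))


# Hypothetical VM running three JVMs with -Xmx of 2GB, 1GB, and 512MB.
print(suggested_reservation_mb([2048, 1024, 512]))  # 4480
```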

Other observations from a Production environment
A few other notes worth mentioning

1.  Sometimes guest OS paging is monitored as an indicator of not enough memory. However, not all memory inside a guest OS will page when under pressure. If the applications or OS have pinned the memory, you won’t see memory paging coming from them. One can be starving the app for memory, but it does not show via guest OS paging.

2.  VMs with larger vCPU counts need a relative increase in memory assigned to the VM. I’ve seen this in my environment: when a VM with a high vCPU count is under tremendous load, not having enough memory will hinder performance. Simply put, more CPU cycles need more memory addresses to work with.

3.  Server memory might not be cheap, but neither is storage, and even fast storage is several orders of magnitude slower than memory. The performance gain of assigning more memory to specific VMs (assuming your hosts/cluster can support it) can be immediate, and dramatic. There is no need to induce paging when it can be avoided.

4.  Assigning more memory to a VM running a poorly designed or inefficient application will likely not help the application, and be a waste of resources. An application may be storage I/O heavy, no matter how much memory you assign it (think Exchange 2003).

One of my first and favorite VMworld breakout sessions I attended in 2010 was "Understanding Virtualization Memory Management Concepts" (TA7750) presented by Kit Colbert. Kit is now the CTO of End User Computing at VMware, but the sessions can still be found online. I recall sitting in that session, and within the first 5 minutes deciding that: 1.) I knew nothing about memory, especially with a Hypervisor, and 2.) The deep dive was so good, and the content so verbose, that any attempt at taking notes was pointless. I made it a point to attend this session each year that he presented it, as it represents the very best of what VMworld has to offer. Do yourself a favor and watch one of his sessions.

Conclusion
Memory can and will be measured differently by Hypervisors and Guest OSs. The definitions of terms related to memory may differ between the application, the guest OS, and the hypervisor. Understanding your workloads, and the characteristics of the platforms they use, will help you size your VMs to balance optimal performance with a minimal footprint. Monitoring memory in a useful way can also be a time consuming, difficult task that extends well beyond just a simple metric.

Have fun

- Pete

Helpful links
Understanding vSphere Active Memory
http://blogs.vmware.com/vsphere/2013/10/understanding-vsphere-active-memory.html

Kit Colbert’s 2011 VMworld breakout session – Understanding Virtualized Memory Performance Management
https://www.youtube.com/watch?v=YKaUtoQrLjo  

Monitor Memory Usage in SQL Server
http://msdn.microsoft.com/en-us/library/ms176018.aspx

SQL Server on VMware Best Practices guide
http://www.vmware.com/files/pdf/solutions/SQL_Server_on_VMware-Best_Practices_Guide.pdf

VMware KB 1687: Excessive Page Faults Generated by Windows applications
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1687

A vSphere & memory related post would not be complete without mention of the venerable "vSphere Clustering Deepdive"
http://www.amazon.com/VMware-vSphere-5-1-Clustering-Deepdive-ebook/dp/B0092PX72C/ref=sr_1_1?ie=UTF8&qid=1404234460&sr=8-1&keywords=vsphere+5+clustering+deep+dive

Getting the big IT purchase approved

IT organizations are faced with a tantalizing array of options when it comes to hardware and software solutions. But long before anything can ever be deployed, it has to be purchased, which means at some point it had to be approved. Sometimes deploying a solution is easy compared to getting it approved. But how does one go about getting the big ticket item through? Well, here is my attempt at demystifying the process.

First, let’s just say that "big purchase" is without a doubt a relative term. For an SMB, $10,000 might be a show stopper, while seven figures for a large enterprise may be part of the routine. Both offer unique challenges, but share similar tactics. Getting a big IT purchase approved typically requires a unique set of skills and experience. A mix of preparation, clarity, delivery, timing, and attitude make up the chaotic formula that, when done well, will improve the odds of success. It is a skill that can be as important as anything you bring in your technical arsenal.

Preparation
You will serve yourself well if you think and deliver like a consultant. Life in Ops can get muddied down by internal strife, whack-a-mole fire fighting, and the occasional "look at this new feature" deployment even though nobody asked for it. Take notice of how a good consultant does things. Step back to understand the desired result, then build out your own statement defining the typical design inputs like requirements, constraints, assumptions and risks.

At some point, you will need to prioritize your own wants, and pick your battles. You typically can’t have everything, so start from the ground up of what IT’s mission statement is, and work from there. Start with bet-the-business elements like high availability, and data/system protection that won’t be spoken up for by anyone but IT. Then, if there are other needs, they may in fact be a departmental need that impacts productivity and revenue. While IT may be the enabler of the request, make sure the identity of the requester is clear.

It’s not uncommon for an SMB to have very little money allocated to IT, but this isn’t an excuse for lack of diligence in preparation. Large organizations have more money, but proportionally much more complex problems to solve, SLAs to adhere to, and regulations to comply with. If you have no idea how your organization’s IT spending compares to peers in your industry, it is time to learn, and communicate that as a part of your presentation if your funds are abnormally low.

This is also an opportunity for you to project yourself as the "solution provider" in your organization. Embrace this. Help them understand why technology costs have increased over the past 10 years. If someone says, "Why don’t we just use the cloud for this?" Rather than let smoke pour out of your ears, respond with "That is a great question Joe. IT is constantly looking for the best ways to deliver services that meet the requirements of the organization." And then go into an appropriate level of detail on why it may or may not be a good fit. (If it is a good fit, then say so!). The point here is to embrace the solution provider role for the organization.

Your biggest competitor to your proposal will be, you guessed it, doing nothing. But there is a cost of doing nothing. The key stakeholders might look at this proposed expenditure and compare it to $0. In most cases, this is completely wrong, and it is up to you to help them understand what the real cost comparison is.

One opportunity sometimes overlooked is the power of a cost deferral. Does the unbudgeted solution you are proposing delay a much larger budgeted purchase until perhaps next year? Showcase this. Good proposals typically show a TCO of 3 to 5 years. But do not underestimate the allure an immediate cost deferral has to your friendly CFO.

Get input on defining the "what" of a problem, and its impacts. The "how" is usually reserved for the Subject Matter Expert (e.g. you). This will minimize silly ideas from others suggesting your storage capacity issues can be solved by the Friday flier for Best Buy.

Learn to prime the pump. Do a little one-on-one campaigning. This is a common method suggested in many books on successful leadership. It is your chance to win over your constituents before any formal proposal. Try holding an internal "Lunch and Learn" about trends in technology. Share a little about how amazing virtualization is, and help them understand some basic challenges of IT. These techniques will engage key personnel, and help in establishing a trusting relationship with IT.

The presentation – IT Shark Tank
I’m a big fan of the show, ‘Shark Tank.’ If you aren’t familiar with it, four very successful investors hear pitches by would-be entrepreneurs who are looking for investment funds in exchange for a stake in equity. The investors bring their own wealth, smarts and competitive nature to the table, and can be quite tough on prospective entrepreneurs. A few things can be gleaned from this, and applied directly to your ability to deliver a successful proposal.

  • Come prepared. Nothing kills a proposal like lack of preparation, and not knowing your facts. Let’s say you are requesting more storage: You’d better believe some of the simplest questions will be asked, many of which you may overlook when entering a room. "How much storage do we have?" "How much do we have left?" "How much do we need?" "Why does it cost so much?" "What are the alternatives?"
  • Clearly state the problem, the impacts to the business, the options, and your recommendations.
  • Learn to answer the simplest of questions in the simplest of ways. "Does this proposal save us money?" "Is there a less expensive way to do this?"
  • Craft your message to your audience and appeal to their sensibilities. Flog yourself upside the head if you use any IT acronyms, or assume that technical gymnastics is going to impress them. It won’t. What will is being concise. Every word has a purpose.
  • Provide a little (but not too much) context to the problem that you are trying to solve. Leverage an analogy if you need to.
  • Know the counterpoints, and how to respond. Know how you are going to answer a question you don’t know the answer to.
  • Seek to understand their position. What might they dislike (e.g. unpredictable expenses, obligated debt, investments they don’t understand, etc.)
  • Respect everyone’s time. Make it quick, make it concise, and if they would like more detail, you can certainly do that, but don’t make it a part of the pitch.

How to deal with everyone else in the food chain
Be honest with your vendors. They have a job to do, and are trying to help you. If you show interest in a solution that is 10x more than what you can afford, it isn’t going to do anyone good to bring them in for an onsite demonstration. They will appreciate your honesty so they can perhaps focus on more cost appropriate solutions. Believe it or not, most want the right solution for you in the first place, as repeat business is the most important value they can bring back to their own organization.

If you are someone who doesn’t have deep-dive knowledge on the solution you are proposing, take advantage of the SE for the VAR or channel partner as a resource. Many of my friends in the industry are SEs and are some of the best and the brightest folks I know, and they all came from the Ops side at some point. Use them as a resource to learn about the solutions they are proposing, and ask them challenging questions.

Be honest with your organization. This isn’t about what you want. Your value will increase when you can demonstrate repeatedly that you have their best interests in mind.

After the decision
If the proposal was approved, focus on delivering at least some results fast. Then showcase the win and how IT can help solve organizational challenges. This may sound like self promotion, but it is not if done right. The wins are for the organization, not you. This establishes trust, and lays the groundwork for the future. Use company newsletters, or establish a monthly IT Review to share updates.

If it was denied, don’t take it personally. It is great to show passion, but don’t confuse passion for what you are really trying to do: helping your organization make the best strategic and financial decision for them. Would it be gratifying to get a new Datacenter revamp through only to realize it was the financial tipping point of the organization just a few months later? Keep it all in perspective. Besides, some of the best purchasing decisions I’ve been involved with were the ones that were ultimately rejected, which gave solutions a chance to mature, and me an opportunity to find a different way to solve a problem.

Try doing your own proposal or presentation retrospective. What went well, and what didn’t? Ask for feedback on how it went. You might be surprised at the responses you get.

Conclusion
You have the unique opportunity to be the technology advocate for the organization rather than simply a burden to the budget.  Do I get everything approved?  Of course I don’t, but a well prepared proposal will allow you, and your organization to make the smartest decisions possible, and help IT deliver great results.

Testing InfiniBand in the home lab with PernixData FVP

One of the reasons I find the latest trends in datacenter architectures so interesting is the innovative approaches used to address deficiencies associated with more traditional arrangements. These innovations have been able to drive more of what almost everyone needs: better storage performance and better scalability.

The caveat to some of these newer arrangements is that it can put heavy stress on the plumbing that connects these servers. Distributed storage technologies like VMware VSAN, or clustered write buffering techniques used by PernixData FVP and Atlantis Computing’s USX leverage these interconnects to accelerate storage traffic. Turn-key Hyperconverged solutions do too, but they enjoy the luxury of having full control over the hardware used. Some of these software based solutions might need some retrofitting of an environment to run optimally or meet their requirements (read: 10GbE or better). The desire for the fastest interconnect possible between hosts doesn’t always align with budget or technical constraints, so it makes most sense to first see what impact there really is.

I wanted to test the impact better bandwidth would have between servers a bit more, but due to constraints in my production environment, I needed to rely on my home lab. As much as I wanted to throw 10GbE NICs in my home lab, the price points were too high. I had to do it another way. Enter InfiniBand. I’m certainly not the only one to try InfiniBand in a home lab, but I wanted to focus on two elements that are critical to the effectiveness of replica traffic: the overall bandwidth of the pipe, and equally important, the latency. While I couldn’t simulate an exact workload that I see in my production environment, I could certainly take smaller snippets of I/O patterns that I see, and model them the best I can.

InfiniBand is really interesting. As Joab Jackson put it in a NetworkWorld.com article, "InfiniBand is architecturally sacrilegious" as it combines many layers of the OSI model. The results can be stunning. Transport latencies in the 2 microsecond neighborhood, and a healthy roadmap to 200Gbps and beyond. It’s sort of like the ’66 AC Shelby Cobra of data transports. Simple, and perhaps a little rough around the edges, but brutally fast. While it doesn’t have the ubiquity of RJ/Ethernet, it also doesn’t have the latencies that are still a part of those faster forms of Ethernet.

At the time of this writing, the InfiniBand drivers for ESXi 5.5 weren’t quite ready for VSAN testing yet, so the focus of this testing is to see how InfiniBand behaves when used in a PernixData FVP deployment. I hope to publish a VSAN edition in the future. I simply wanted to better understand if (and how much) a faster connection would improve the transmission of replica traffic when using FVP in WB+1 mode (local flash, and 1 peer). My production environment is very write intensive, and uses 1GbE for the interconnects. Any insight gained here will help in my design and purchasing roadmap for my production environment.

Testing:
Testing occurred on a two host cluster backed by a Synology DS1512+. Local flash leveraged SATA III based eMLC SSD drives using an onboard controller. 1GbE interconnects traversed a Cisco SG300-20 using a 1500 byte MTU size. For InfiniBand, each host used a Mellanox MT25418 DDR 2 port HCA that offered 10Gb per connection. They were directly connected to each other, and used a 2044 byte MTU size. InfiniBand can be set to 4092 bytes, but for compatibility reasons under ESXi 5.5, 2044 is the desired size.

I tend to prefer testing that relies on observational patterns versus one final, empirical number. These tests were no different, and while they attempt to simulate a very brief snippet of a workload in my production environment, I find that I still gain a much better understanding from a time based performance graph than an isolated final number.

The test case was a simple one, but would be enough to illustrate the differences I was hoping to see. The test consisted of a 2vCPU VM using 2 workers on a 100% write, 100% random workload lasting for 1 minute. The test was run three times: first with WB+0 (no peer/replica traffic), then WB+1 (one peer) using a 1GbE connection, and finally WB+1 over a single 10Gb InfiniBand connection. Each screen capture I provide will show them in that order. That test case was repeated 3 times: first with 256KB I/O sizes, followed by 32KB, then 4KB. I ran the tests several times in different order to ensure I wasn’t introducing inflated or deflated performance due to previous tests or caching. All were repeated several times to flush out any anomalies.
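
The test matrix described above can be written down as a short Python sketch. The values come from the description; the dictionary keys and the function name are my own labels, not settings from any particular benchmarking tool.

```python
from itertools import product

# Values taken from the test description above.
MODES = ["WB+0 (local flash only)", "WB+1 over 1GbE", "WB+1 over 10Gb InfiniBand"]
IO_SIZES_KB = [256, 32, 4]

BASE = {
    "vcpus": 2,
    "workers": 2,
    "write_percent": 100,
    "random_percent": 100,
    "duration_seconds": 60,
}


def build_test_matrix():
    """Return one dict per run, grouped by I/O size, modes in the order shown."""
    return [
        {**BASE, "io_size_kb": size, "mode": mode}
        for size, mode in product(IO_SIZES_KB, MODES)
    ]


for run in build_test_matrix():
    print(f"{run['io_size_kb']:>3}KB  {run['mode']}")
```

That gives nine runs in total, three modes for each of the three I/O sizes.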

256KB I/O size test
Testing results using this I/O size are rarely published anywhere because they never bode well in comparison to a smaller I/O size like 4KB. But my production workloads (compiling) often deal with these I/O sizes, so it is important for me to understand their behavior.

IOPS with 256KB I/O
256KB-IOPS

Latency with 256KB I/O
256KB-Latency

Throughput using 256KB I/O
256KB-Throughput

Observations from 256KB I/O test
Note that the IOPS and effective throughput on the WB+1 using InfiniBand were nearly identical to those of a WB+0 (local flash only) scenario. You can also see how much the 1GbE interface throttled down the performance, driving just half of the IOPS and throughput compared to InfiniBand. But also take a look at the terrible native latency (70ms) of large I/O sizes even when using WB+0 (no peer traffic, just local flash). Also note that when peer traffic performance is improved, a larger backlog of data builds up in the destager.
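
One way to reason about why the interconnect matters more at larger I/O sizes is to work out the theoretical ceiling each link puts on replica IOPS. The Python sketch below uses raw line rate (1Gb and the 10Gb figure mentioned earlier) and ignores all protocol overhead, so it is an upper bound under stated assumptions, not a prediction of the graphs above.

```python
def max_iops(link_gbps, io_size_kb):
    """Upper bound on IOPS a link can carry at a given I/O size.

    Uses raw line rate and ignores protocol overhead, so real
    numbers will be lower.
    """
    bytes_per_second = link_gbps * 1e9 / 8
    return bytes_per_second / (io_size_kb * 1024)


for io_size_kb in (256, 32, 4):
    one_gbe = max_iops(1, io_size_kb)
    infiniband = max_iops(10, io_size_kb)
    print(f"{io_size_kb:>3}KB  1GbE <= {one_gbe:>9,.0f} IOPS   10Gb IB <= {infiniband:>9,.0f} IOPS")
```

At 256KB the ceiling for 1GbE is only a few hundred IOPS, while at 4KB it is in the tens of thousands, so a single small-block workload is much less likely to run into the limits of the pipe.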

32KB I/O size test
Just 1/8th the size of a 256KB I/O, this is still larger than most storage vendors like to advertise in their testing. My production workload often oscillates between 32KB and 256KB I/Os.

IOPS with 32KB I/O
32KB-IOPS

Latency with 32KB I/O
32KB-Latency

Throughput using 32KB I/O
32KB-Throughput

Observations from 32KB I/O test
Once again, the IOPS and effective throughput on the WB+1 using InfiniBand were nearly identical to those of a WB+0 (local flash only) scenario. You can also see how much the 1GbE interface throttled down the performance on throughput. Latency had only a minor improvement moving away from 1GbE, as the latency of the flash was about 6ms.

4KB I/O size test
The most common of I/O sizes that you might see, although it is more common on reads than writes. 1/64th the size of a 256KB I/O, it is tiny compared to the others, but important to test because of the attempt to learn if and how much a fatter, lower latency pipe helps in various I/O sizes.

IOPS with 4KB I/O
4KB-IOPS

Latency with 4KB I/O
4KB-Latency

Throughput using 4KB I/O
4KB-Throughput

Observations from 4KB I/O test
IOPS and effective throughput on the WB+1 using InfiniBand were nearly identical to those of a WB+0 (local flash only) scenario. But as the I/O sizes shrink, so does the effective total/concurrent payload size. So the differences between InfiniBand and 1GbE were less than on tests with larger I/O. Latencies of this I/O size were around 2ms.

Other observations that stood out
One of the first things that stood out is illustrated below, with two 5 minute test runs. Look at where the two arrows point. The arrow on the left points to the number of packets sent while using 1GbE. The arrow on the right shows the number of packets sent while using 10Gb InfiniBand. Quite a difference. Also notice that the effective throughput started out higher, but had to throttle back.

Packetstransmitted

Findings:
The key takeaways from these tests:

  • A high bandwidth, low latency interconnect like InfiniBand can virtually eliminate any write redundancy penalty incurred in WB+1 mode.
  • From a single workload, I/O sizes of 32KB and 256KB saw between 65% and 90% improvement on IOPS and throughput. I/O sizes of 4KB saw essentially no improvement (many concurrent 4KB workloads likely would see a benefit, however).
  • Writes using larger I/O sizes were the clear beneficiary of a fatter pipe between servers. However, the native latencies of the flash devices under larger I/O sizes could not take advantage of the low latencies of InfiniBand. In other words, with large I/O sizes, the flash devices themselves, or the bus they were using, were by far the major impediment to lower latency and faster I/O delivery.
  • The smaller pipe of 1GbE throttled back the flash device’s ability to ingest the data as fast as InfiniBand. There was always a smaller amount of outstanding writes once the test was complete, but it came at the cost of poorer performance for 1GbE.

A few other matters can come up when attempting to accurately interpret latencies. As VMware KB 2036863 points out, reporting latencies accurately can sometimes be a challenge. Just something to be aware of.

Conclusion
InfiniBand was my affordable way to test how a faster interconnect would improve the abilities of FVP to accelerate replica storage I/O.  It lived up to the promise of high bandwidth with low latency. However, effective latencies were ultimately crippled by the SSDs, the controller, or the bus it was using. I did not have the opportunity to test other flash technologies such as PCIe based solutions from Fusion-IO or Virident, or the memory channel based solution from Diablo Technologies. But based on the above, it seems to be clear that how the flash is able to ingest the data is crucial to the overall performance of whatever solution that is using it.

Helpful Links
Erik Bussink’s great post on using InfiniBand with vSphere 5.5
http://www.bussink.ch/?p=1306 

Vladan Seget’s post on incorporating InfiniBand into his backing storage
http://www.vladan.fr/homelab-storage-network-speedup/

Mellanox, OFED and OpenSM bundles
https://my.vmware.com/web/vmware/details/dt_esxi50_mellanox_connectx/dHRAYnRqdEBiZHAlZA==
http://www.mellanox.com/downloads/Drivers/MLNX-OFED-ESX-1.8.1.0.zip
http://files.hypervisor.fr/zip/ib-opensm-3.3.16-64.x86_64.vib

Practical tips for a Veeam Backup and Recovery deployment

I’ve been using Veeam Backup and Recovery in my production environment for a while now, and in hindsight, it was one of the best investments we’ve ever made in our IT infrastructure. It has completely changed the operational overhead of protecting our VMs, and the data they serve up. Using a data protection solution that utilizes VMware’s APIs provides the simplicity and flexibility that was always desired. Moving away from array based features for protection has enabled the protection of VMs to better reflect desired RPO and RTO requirements – not by the limitations imposed by LUN sizes, array capacity, or functionality.

While Veeam is extremely simple in many respects, it is also a versatile, feature packed application that can be configured a variety of different ways. The versatility and the features can be a little confusing to the new user, so I wanted to share 25 tips that will help make for a quick and successful deployment of Veeam Backup and Recovery in your environment.

First, let’s go over a few assumptions that will be the basis for my recommendations:

  • There are two sites that need protection.
  • VMs and data need to be protected at each site, locally.
  • VMs and data need to be protected at each site, remotely.
  • A NAS target exists at each site.
  • Quick deployment is important.
  • You’ve already read all of the documentation. ;)

    Architecture
    There are a number of different ways to set up the architecture for Veeam. I will show a few of the simplest arrangements:

    In this arrangement below there would be no physical servers – only a NAS device. This is a simplified arrangement of what I use. If one wanted a rebuilt server (Windows or Linux) acting purely as a storage target, that could be in place of where you see the NAS. The architecture would stay the same.

    image

    Optionally, a physical server not just acting as a storage target, but also as a physical proxy would look something like this below:

    image

    Below is a combination of both, where a physical server is acting as the Proxy, but like the virtual proxy, is using an SMB share to house the data. In this case, a NAS unit.

     

    image

    Implementation tips
    These tips focus not so much on what may ultimately suit your environment best (only you know that) or on leveraging all of the features inside the product, but rather on getting you up and running as quickly as possible so you can start returning great results.

    Job Manager Servers & Proxies

    1.  Have the Job Manager server, any proxies, and the backup targets living on their own VLAN for a dedicated backup network.

    2.  Set up SNMP monitoring on any physical ports used in the backup arrangement.  It will be helpful to understand how utilized the physical links get, and for how long.

    3.  Make sure to give the Job Manager VM enough resources to play with – especially if it will have any data mover/proxy responsibilities.  The deployment documentation has good information on this, but for starters, make it 4vCPU with 5GB of RAM.

    4.  If there is more than one cluster to protect, consider building a virtual proxy inside each cluster that it will be responsible for protecting, then assign it to jobs that protect VMs in that cluster.  In my case, I use PernixData FVP in two clusters.  I have the datastores that house those VMs only accessible by their own cluster (a constraint of FVP).  Because of that, I have a virtual proxy living in each cluster, with backup jobs configured so that each will use a specific virtual proxy.  These virtual proxies have a special setting in FVP that will instruct the VMs being backed up to flush their write cache to the backing storage.

    image

    Storage and Design

    5.  Keep the design simple, even if you know you will need to adjust at a later time.  Architectural adjustments are easy to do with Veeam, so  go ahead and get Veeam pointed to the target, and start running some jobs.  Use this time to get familiar with the product, and begin protecting the jewels as quickly as possible.

    6.  Let Veeam use the default SQL Server Express instance on the Veeam Job Manager VM.  This is a very reasonable, and simple configuration that should be adequate for a lot of environments.

    7.  Question whether a physical proxy is needed.  Typically physical proxies are used for one of three reasons.  1.)  They offload job processing CPU cycles from your cluster.  2.)  In simple arrangements a Windows based Physical proxy might also be the Repository (aka storage target).   3.) They allow for one to leverage a "direct-from-SAN" feature by plugging in the system to your SAN fabric.  The last one in my opinion introduces the most hesitation.  Here is why:

    • Some storage arrays do not have a "read-only" iSCSI connection type.  When this is the case, special care needs to be taken on the physical server directly attached to the SAN to ensure that it cannot initialize the data store.  The reality is that you are one mistake away from having a very long day in front of you.  I do not like this option when there is no secondary safety mechanism from the array on a "read-only" connection type.
    • Direct-from-SAN access can be a very good method for moving data to your target.  So good that it may stress your backing storage enough (via link saturation or physical disk limits) to perhaps interfere with your production I/O requirements.
    • Additional efforts must be taken when using write buffering mechanisms that do not live on the storage array (e.g. PernixData) .

    8.  Veeam has the ability to back up to an SMB share, or an NFS mount.  If an NFS mount is chosen, make sure that it is a storage target running native Linux.  Most NAS units like a Synology are indeed just a tweaked version of Linux, and it would be easy to conclude that one should just use NFS.  However, in this case, you may run into two problems.

    • The SMB connection to a NAS unit will likely be faster (which most certainly is the first time in history that an SMB connection is faster than an NFS connection) .
    • The Job Manager might not be able to manage the jobs on that NAS unit (connected via NFS) properly.  This is due to BusyBox and Perl on the Synology not really liking each other.  For me, this resulted in Veeam being unable to remove sunsetting backups.  Changing over to an SMB connection on the NAS improved the performance significantly, and allowed for job handling to work as desired.

    9.  Veeam has a great new feature in version 7.x called a "Backup Copy" job, which allows the backup made locally to be shipped to a remote site.  The "Backup Copy" job achieves one of the most basic requirements of data protection in the simplest of ways: two copies of the data at two different locations, with the benefit of only processing the backup job once.  Although it is a great feature, it behaves differently from a standard backup job, and warrants some time spent with it before putting it into production.  For a speedy deployment, it might be best simply to configure two jobs: one to a local target, and one to a remote target.  This will give you the time to experiment with the Backup Copy job feature.

    10.  There are compelling reasons for and against using a rebuilt server as a storage target, or using a NAS unit.  Both are attractive options.  I ended up using a dedicated NAS unit.  Its form factor, drive bay count, and overall cost of provisioning made it the only option that could match my requirements.

    Operations

    11.  In Veeam B&R, "Replication Jobs" are different than "Backup Jobs."  Instead of trying to figure out all of the nuances of both right away, use just the "Backup Job" function with both local and remote targets.  This will give you time to better understand the characteristics of the replication functionality. One also might find that the "Backup Job" suits the environment and need better than the replication option.

    12.  If there are daily backups going to both local and offsite targets (and you are not using the "Backup Copy" option), have them run 12 hours apart from one another to reduce RPOs.

    13.  Build up a test VM to do your testing of a backup and restore.  Restore it in the many ways that Veeam has to offer.  Best to understand this now rather than when you really need to.

    14.  I like the job chaining/dependency feature, which allows you to chain multiple jobs together.  But remember that if a job is manually started, it will run through the rest of the jobs too.  The easiest way to accommodate this is to temporarily remove it from the job chain.

    15.  Your "Backup Repository" is just that, a repository for data.  It can be a Windows Server, a Linux Server, or an SMB share.  If you don’t have a NAS unit, stuff an old server (Windows or Linux) with some drives in it and it will work quite well for you.

    16.  Devise a simple, clear job naming scheme.  Something like [BackupType]-[Descriptive Name]-[TargetLocation] will quickly tell you what it is and where it is going to.  If you use folders in vCenter to organize your VMs, and your backups reflect the same, you could also  choose to use the folder name.  An example would be "Backup-SharePointFarm-LOCAL" which quickly and accurately describes the job.
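A naming scheme like this is easy to enforce with a small helper. The sketch below is only an illustration of the [BackupType]-[Descriptive Name]-[TargetLocation] pattern described above; the function name job_name is hypothetical and is not part of Veeam or any Veeam API.

```python
def job_name(backup_type: str, descriptive_name: str, target_location: str) -> str:
    """Build a job name in the [BackupType]-[Descriptive Name]-[TargetLocation] pattern.

    Hypothetical helper for illustration only; Veeam itself accepts any free-form name.
    """
    parts = [backup_type, descriptive_name, target_location]
    # Remove spaces so each part reads as a single token between the dashes.
    cleaned = [part.replace(" ", "") for part in parts]
    if not all(cleaned):
        raise ValueError("every part of the job name must be non-empty")
    return "-".join(cleaned)


if __name__ == "__main__":
    # Reproduces the example name used in the tip above.
    print(job_name("Backup", "SharePoint Farm", "LOCAL"))
```

Generating the names this way keeps them consistent as more jobs are added, which matters most when you come back to the console after several weeks away.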

    17.  Start with a simple schedule.  Say, once per day, then watch the daily backup jobs and the synthetic fulls to see what sort of RPO/RTOs are realistic.

    18.  Repository naming.  Be descriptive, but come up with some naming scheme that remains clear even if you aren’t in the application for several weeks.  I like indicating the location of the repository, if it is intended for local jobs, or remote jobs, and what kind of repository it is (Windows, Linux, or SMB).  For example:  VeeamRepo-[LOCATION]-for-Local(SMB)

    19.  Repository organization.  Create a good tree structure for organization and scalability.  Veeam will do a very good job at handling the organization of the backups once you assign a specific location (share name) on a repository.  However, create a structure that provides the ability to continue with the same naming convention as your needs evolve.  For instance, a logical share name assigned to a repository might be \\nas01\backups\veeam\local\cluster1  This arrangement allows for different types of backups to live in different branches.

    20.  Veeam might prevent the ability of creating more than one repository going to the same share name (it would see \\nas01\backups\veeam\local\cluster1 and \\nas01\backups\veeam\local\cluster2 as the same).  Create DNS aliases to fool it, then make those two targets something like \\nascluster1\backups\veeam\local\cluster1  and  \\nascluster2\backups\veeam\local\cluster2 

    21.  When in doubt, leave the defaults.  Veeam put in great efforts to make sure that you, or the software doesn’t trip over itself.  Uncertain of job number concurrency?  Stick to the default.  Wondering about which backup mode to use? (Reverse Incremental versus Incrementals with synthetic fulls). Stay with the defaults, and save the experimentation for later.

    22.  Don’t overcomplicate the schedule (at least initially).  Veeam might give you flexibility that you never had with array based protection tools, but at the same time, there is no need to make it complicated.  Perhaps group the VMs by something that you can keep track of, such as the folders they are contained in within vCenter.

    23.  Each backup job can be tuned for the target you are using by selecting a preset storage optimization type: WAN target, LAN target, or local target.  This can easily be overlooked, but will make a difference in backup performance.

    24.  How many backups you can keep is a function of change rate, frequency, dedupe and compression, and the size of your target.  Yep, that is a lot of variables.  If nothing else, find some storage that can serve as the target for say, 2 weeks.  That should give a pretty good sampling of all of the above.
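Until that two week sample exists, a back-of-the-envelope estimate can at least show how the variables interact. The sketch below is my own simplification, not Veeam's sizing method: it assumes one full backup plus a chain of equally sized incrementals, and it folds dedupe and compression into a single hypothetical reduction_ratio. All parameter names are made up for illustration.

```python
def estimated_incrementals(target_gb, source_gb, daily_change_rate,
                           reduction_ratio, backups_per_day=1):
    """Roughly estimate how many incrementals fit on a target after one full backup.

    Assumptions (not from Veeam documentation): changes are spread evenly across
    the day, and dedupe plus compression shrink everything by reduction_ratio.
    """
    full_gb = source_gb / reduction_ratio
    incremental_gb = source_gb * daily_change_rate / backups_per_day / reduction_ratio
    remaining_gb = target_gb - full_gb
    if remaining_gb < 0 or incremental_gb <= 0:
        return 0
    # Whole incrementals only; a partial one is not a usable restore point.
    return int(remaining_gb // incremental_gb)


if __name__ == "__main__":
    count = estimated_incrementals(target_gb=4000, source_gb=2000,
                                   daily_change_rate=0.05, reduction_ratio=2.0)
    print(f"Roughly {count} daily incrementals fit after the first full backup")
```

Real retention will differ, since change rates and data reduction vary from day to day, which is exactly why measuring against real jobs is the better answer.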

    25.  Take one item/feature once a week, and spend an hour or two looking into it.  This will allow you to find out more about say, Changed block tracking, or what the application aware image processing feature does.  Your reputation (and perhaps, your job) may rely on your ability to recover systems and data.  Come up with a handful of scenarios and see if they work.

    Veeam is an extremely powerful tool that will simplify your layers of protection in your environment. Features like SureBackup, Virtual Labs, and their Replication offerings are all very good. But more than likely, they do not need to be a part of your initial deployment plan. Stay focused, and get that new backup software up and running as quickly as possible. You, and your organization, will be better off for it.

    - Pete

    Using the Cisco SG300-20 Layer 3 switch in a home lab

    One of the goals when building up my home lab a few years ago was to emulate a simple production environment that would give me a good platform to learn and experiment with. I’m a big fan of nested labs, and use one on my laptop often. But there are times when you need real hardware to interact with. This has come up even more than I expected, as recent trends with leveraging flash on the host have resulted in me stuffing more equipment back in the hosts for testing and product evaluations.

    Networking is the other area that can be helpful to have equipment that at least tries to mimic what you’d see in a production environment. Yet the options for networking in a home lab have typically been limited for a variety of reasons.

    • The real equipment is far too expensive, or too loud for most home lab needs.
    • Searching on eBay or Craigslist for a retired production unit can be risky. Some might opt for this strategy, but this can result in a power sucking, 1U noise maker that may have some dead ports on it, or worse, bricked upon arrival.
    • Consumer switches can be disappointing. Rig up a consumer switch that is lacking in features, and port count, and be left wishing you hadn’t gone this route.

    I wanted a fanless, full Layer 3 managed switch with a feature set similar to what you might find on an enterprise grade switch, but not at an enterprise grade price. I chose to go with a Cisco SG300-20. This is a 20 port, 1GbE, Layer 3 switch. With no fans, the unit draws as little as 10 watts.


    Effects of introducing write-back caching with PernixData FVP

    Implementing new technology that solves real problems is great. It is exciting, and you get to stand on the shoulders of the smart folks who dreamed up the solution. But with all of that glory comes new design and operation elements that may have been introduced. This isn’t a bad thing. It is just different. The magic of virtualization didn’t excuse the requirement of needing to understand the design and operational considerations of the new paradigm. The same goes for implementing host based caching in a virtualized environment.

    Implementing FVP is simple and the results can be impressive. For many, that is about all the effort they may end up putting into it. But there are design considerations that will help maximize the investment, and minimize false impressions, or costly mistakes. I want to share what has been learned against my real world workloads, so that you can understand what to look for, and possibly how to get more out of your investment. While FVP accelerates both reads and writes, it is the latter that warrants the most consideration, so that will be the focus of this post.

    When accelerating storage using FVP, the factors that I’ve found to have the most influence on how much your storage I/O is accelerated are:

    • Interconnect speed between hosts of your pooled flash
    • Performance delta between your flash tier, and your storage tier.
    • Working set size of your data
    • Duty cycle write I/O profile of your VMs (including peak writes, and duration)
    • I/O size of your writes (which can vary within each workload)
    • Likelihood or frequency of DRS or manual vMotion activities
    • Native speed and consistency of your flash (the flash itself, and the bus speed)
    • Capacity of your flash (more of an influence on read caching, but can have some impact on writes too)

    Write-back caching & vMotion
    Most know by now that to guard against any potential data loss in the event of a host failure, FVP provides redundancy of write-back caching through the use of one or more peers. The interconnect used is the vMotion network. While FVP does a good job of decoupling the VM’s need to wait for the backing datastore, a write from a VM configured for write-back with redundancy must be acknowledged by its local flash AND the one or more peers before the write ACK is returned to the VM.

    What does this mean to your environment? More traffic on your vMotion network. Take a look at the image below. In a cluster NOT accelerated by FVP, the host uplinks that serve a vMotion network might see relatively little traffic, with bursts of traffic only during vMotion activities. That would also be the case if you were running FVP in write-back mode with no peers (WB+0). This image below is what the activity on the vMotion network looks like as perceived by one of the hosts after the VMs had write-back with redundancy of one peer. In this case the writes were averaging about 12MBps across the vMotion network. You will see that the spike is where a vMotion kicked off: The spike is the peak output of a 1GbE interface; about 125MBps.

    image

    Is this bad that the traffic is running over your vMotion network? No, not necessarily. It has to run over something. But with this knowledge, it is easy to see that bandwidth for inter-server communication will be more important than ever before. Your infrastructure design may need to be tweaked to accommodate the new role that the vMotion network plays.

    Can one get away with a 1GbE link for cross server communication? Perhaps. It really depends on the factors above, which can sometimes be hard to determine. So with all of the variables to consider, it is sometimes easiest to circle back to what we do know:

    • Redundant write back caching with FVP will be using network connectivity (via vMotion network) for every single write that occurs for an accelerated VM.
    • Redundant write back caching writes are multiplied by the number of peers that are configured per accelerated VM.
    • The write accelerated I/O commit time (latency) will be as fast as the slowest connection.  Your vMotion network will likely be slower than the local bus.  A poor quality SSD or an older generation bus could be a bottleneck too.
    • vMotion activities enjoy using every bit of bandwidth it has available to it.
    • VM’s that are committing a lot of writes might also be taxing CPU resources, which may kick in DRS rules to rebalance the load – thus creating more vMotion traffic.  Those busy VMs may be using more active memory pages as well, which may increase the amount of data to move during the vMotion process.
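The "as fast as the slowest connection" point can be put into a few lines of code. This is a simplified model of my own, assuming the local write and the peer writes are issued in parallel and the ACK is returned once all of them complete; it is not taken from FVP internals, and the latency figures are made-up examples.

```python
def write_ack_latency_ms(local_flash_ms, peer_latencies_ms):
    """Latency seen by the VM for a redundant write-back write.

    peer_latencies_ms holds one entry per peer, each covering the network
    round trip plus the peer's flash write. An empty list models WB+0.
    Assumes all writes are issued in parallel (a simplification).
    """
    return max([local_flash_ms, *peer_latencies_ms])


if __name__ == "__main__":
    local = 0.1  # hypothetical local flash write latency in milliseconds
    print("WB+0:", write_ack_latency_ms(local, []))
    print("WB+1:", write_ack_latency_ms(local, [0.6]))
    print("WB+2:", write_ack_latency_ms(local, [0.6, 0.9]))
```

In this model, adding peers never lowers the latency, and a single slow link or slow SSD sets the pace for every accelerated write.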

    The multiplier of redundancy
    Let’s run through a simple scenario to better understand the potential impact an undersized vMotion network can have on the performance of write-back caching with redundancy. The example is addressing writes only.

    • 4 hosts each have a group of 6 VM’s that consistently write 5MBps per VM.  Traditionally, these 24 VMs would be sending a total of 120MBps to the backing physical storage.
    • When write back is enabled without any redundancy (WB+0), the backing storage will still see the same amount of writes committed, but it will be in a slightly different way.  Sequential, and smoothed out as data is flushed to the backing physical storage.
    • When write back is enabled and a write redundancy of “local flash and 1 network flash device” (WB+1) is chosen, the backing storage will still see 120MBps go to it eventually, but there will be an additional 120MBps of data going to the host peers, traversing the vMotion network.
    • When write back is enabled and a write redundancy of “local flash and 2 network flash devices” (WB+2) is chosen, the backing storage will still see 120MBps to it, but there will be an additional 240MBps of data going to the host peers, traversing the vMotion network.
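The arithmetic in the scenario above can be sketched in a few lines. The function name and parameters are hypothetical, the numbers come straight from the example, and the 125MBps figure is the approximate ceiling of a single 1GbE interface used elsewhere in this post.

```python
def replication_traffic_mbps(hosts, vms_per_host, write_mbps_per_vm, peers):
    """Return (writes to backing storage, peer traffic on the vMotion network) in MBps."""
    total_writes = hosts * vms_per_host * write_mbps_per_vm
    # Every write is sent once to each configured peer.
    peer_traffic = total_writes * peers
    return total_writes, peer_traffic


if __name__ == "__main__":
    GBE_LINK_MBPS = 125  # approximate peak of one 1GbE interface
    for peers in (0, 1, 2):
        backing, vmotion = replication_traffic_mbps(4, 6, 5, peers)
        print(f"WB+{peers}: {backing}MBps to backing storage, "
              f"{vmotion}MBps across the vMotion network "
              f"({vmotion / GBE_LINK_MBPS:.0%} of one 1GbE link)")
```

This treats the cluster's peer traffic as one aggregate number, as the example does; the load on any individual uplink depends on how the peers are distributed.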

    image

    The write-back redundancy configuration is a per-VM setting, so there is not necessarily a need to change them all to one setting. Your VMs will most likely not have the same write workload either. But as the example illustrates, it is not hard to saturate a 1GbE interface. Assuming an approximate 125MBps on a single 1GbE interface, under the described arrangement, saturation would occur with each VM configured for write-back with redundancy of one peer (WB+1). This leaves little headroom for other traffic that might be traversing that network, such as vMotions, or heartbeats.

    Fortunately FVP has the smarts built in to ensure that vMotion activities and write-back caching get along. However, there is no denying the physics associated with the matter. If you have a lot of writes, and you really want to leverage the full beauty of FVP, you are best served by fast interconnects between hosts. It is a small price to pay for supreme performance. FVP might expose the fact that 1GbE may not be ideal in an accelerated environment, but consider what else has changed over the years. Standard memory sizes of deployed VMs have increased significantly (The vOpenData Public Dashboard confirms this). That 1GbE vMotion network might have been good for VM’s with 512MB of RAM, but what about those with 4, 8, or 12GB of RAM?  That 1GbE vMotion network has become outdated even for what it was originally designed for.
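To put rough numbers on the memory size question, here is an idealized calculation of how long it takes just to push a VM's memory across a link. It is a deliberately naive sketch: it assumes the full 125MBps of a 1GbE interface is available, and it ignores pages dirtied during the copy, protocol overhead, and competing traffic, so real vMotion times will be longer. The 1250MBps figure for 10GbE is my own assumption by simple scaling.

```python
def ideal_transfer_seconds(memory_mb, link_mbps=125.0):
    """Best-case time to copy memory_mb of RAM over a link carrying link_mbps MB per second."""
    return memory_mb / link_mbps


if __name__ == "__main__":
    for label, memory_mb in (("512MB", 512), ("4GB", 4 * 1024),
                             ("8GB", 8 * 1024), ("12GB", 12 * 1024)):
        one_gbe = ideal_transfer_seconds(memory_mb, 125.0)
        ten_gbe = ideal_transfer_seconds(memory_mb, 1250.0)
        print(f"{label:>6}: {one_gbe:6.1f}s on 1GbE, {ten_gbe:5.1f}s on 10GbE (best case)")
```

Even in the best case, the time on the wire grows linearly with memory size, and on a 1GbE link all of that time competes with the redundant write traffic.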

    Destaging
    One characteristic unique with any type of write-back caching is that eventually, the data needs to be destaged to the backing physical datastore. The server-side flash that is now decoupled from the backing storage has the potential to accommodate a lot of write I/Os with minimal latency. One may or may not have the backing spindles, or conduit large enough to be sending your write I/O to the backing physical storage if this high write I/O lasts long enough. Destaging issues can occur on an arrangement like FVP, or with storage arrays and DAS arrangements that front performance I/O with flash that get pushed to slower spindles.

    Knowing the impact of this depends on the workload and the environment it runs in.

    • If the duty cycle of the write workload that is above the physical storage I/O limit allows for enough “rest time” (defined as any moment that the max I/O to the backing physical storage is below 100%) to destage before the next over commitment begins, then you have effectively increased your ability to deliver more write I/Os with less latency.
    • If the duty cycle of the write workload that is above the physical storage I/O limit is sustained for too long, the destager of that given VM will fill to capacity, and will not be able to accelerate any faster than its ability to destage.
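The two scenarios above can be modeled with a toy simulation. This is my own simplified model, not how FVP implements its destager: it treats the destager as a fixed-size buffer that drains to the backing storage at a constant rate, with one step per second. The rates and buffer size are made-up numbers chosen only to show the behavior.

```python
def simulate_destager(demand_mbps, backing_mbps, buffer_mb):
    """Return the write rate actually accepted each second, in MBps.

    demand_mbps  -- per-second write demand from the VM
    backing_mbps -- rate at which the backing storage can absorb destaged writes
    buffer_mb    -- capacity of the (hypothetical) destager buffer
    """
    queued = 0.0
    accepted = []
    for demand in demand_mbps:
        # Room this second: free buffer space plus what will drain to backing storage.
        available = (buffer_mb - queued) + backing_mbps
        taken = min(demand, available)
        accepted.append(taken)
        # Whatever was not drained this second stays queued (never below zero).
        queued = max(0.0, queued + taken - backing_mbps)
    return accepted


if __name__ == "__main__":
    bursty = ([50] * 10 + [5] * 20) * 3   # bursts with enough rest time to drain
    sustained = [50] * 60                 # over commitment that never lets up
    print("bursty, lowest accepted during a burst:",
          min(a for a, d in zip(simulate_destager(bursty, 20, 400), bursty) if d == 50))
    print("sustained, accepted rate at the end:",
          simulate_destager(sustained, 20, 400)[-1])
```

With enough rest time the bursts are accepted in full, while the sustained workload is eventually throttled down to the rate of the backing storage.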

    Huh?  Okay, a picture might be a better way to describe this.  The callouts below point to the two scenarios described.

    image

     

    So when looking at this write I/O duty cycle, there becomes a concept of amplitude of the maximum write I/O, and frequency of those times in which it is overcommitting. When evaluating an environment, you might see this crude sine-wave show up. This write I/O duty cycle, coupled with your physical components, is the key to how much FVP can accelerate your environment.

    What happens when the writes to the destager surpass the ability of your backing storage to keep up with the writes? Once the destager for that given VM fills up, its acceleration will reduce to the rate that it can evacuate the data to the backing storage.  One may never see this in production, but it is possible.  It really depends on the factors listed at the beginning of the post.  The only way to clearly see this is from a synthetic workload, where I show it was able to push 5 times the write I/Os (blue line) before eventually filling up the destager to the point where it was throttled back to the rate of the datastore (purple line).

    image

    This will have an impact on the effective latency, shown below (blue line).  While the destager is full, it will not be able to fulfill the write at the low latency typically associated with flash, reflecting latency closer to the backing datastore (purple line).

    image

    Many workloads would never see this behavior, but those that are very write intensive (like mine), and that have a big delta between their acceleration tier and their backing storage may run into this.

    The good news is that workloads have a tendency to be bursty, which is a perfect match for an acceleration tier. In a clustered arrangement, this is much harder to predict, and bursty can change to steady-state quite quickly. What this demonstrates is that if there is enough of a performance delta between your acceleration tier, and your storage tier, under cases of sustained writes, there may be times where it doesn’t have the opportunity to flush enough writes to maintain its ability to accelerate.

    Recommendations
    My recommendations (and let me clarify that these are my opinions only) on implementing FVP would include:

    • Initially, run the VMs in write-through mode so that you can leverage the FVP analytics to better understand your workload (duty cycles, read/write ratios, maximum write throughput for a VM, IOPS, latency, etc.)
    • As you gain a better understanding of the behavior of these workloads, introduce write-back caching to see how it helps the systems you changed.
    • Keep an eye on your vMotion network (in particular, those with 1GbE environments and limited physical ports) and see if it ever comes close to saturation.  Another leading indicator will be increased latency on accelerated writes.
    • Run out and buy some 10GbE NICs for your vMotion network.  If you are in a situation with a total 1GbE legacy fabric for your SAN, and your vMotion network, and perhaps you have limits on form factors that may make upgrading difficult (think blades here), consider investing in 10GbE for your vMotion network, as opposed to your backing storage. Your read caching has probably already relieved quite a bit of I/O pressure on your storage, and addressing your cross server bandwidth is ultimately a more affordable, and simpler task.
    • If possible, allocate more than one link and configure for Multi-NIC vMotion. At this time, FVP will not be able to leverage this, but it will allow vMotion to use another link if the other link is busy. Another possible option would be to bond multiple 1GbE links for vMotion. This may or may not be suitable for your environment.

    So if you haven’t done so already, plan to incorporate 10GbE for cross-server communication for your vMotion Network. Not only will your vMotioning VM’s thank you, so will the performance of FVP.

    - Pete

    Helpful links:

    Fault Tolerant Write acceleration
    http://frankdenneman.nl/2013/11/05/fault-tolerant-write-acceleration/

    Destaging Writes from Acceleration Tier to Primary Storage
    http://voiceforvirtual.com/2013/08/14/destaging-writes-i/

    Using a new tool to discover old problems

    It is interesting what can be discovered when storage is accelerated. Virtual machines that were previously restricted by underperforming arrays now get to breathe freely.  They are given the ability to pass storage I/O as quickly as the processor needs. In other words, the applications that need the CPU cycles get to dictate your storage requirements, rather than your storage imposing artificial limits on your CPU.

    With that idea in mind, a few things revealed themselves during the process of implementing PernixData FVP.  Early on, it was all about implementing and understanding the solution.  However, once the real world workloads began accelerating, there was intrigue on the analytics that FVP was providing.  What was generating the I/O that was being accelerated?  What processes were associated with the other traffic not being accelerated, and why?  What applications were behind the changing I/O sizes?  And what was causing the peculiar I/O patterns that were showing up?  Some of these were questions raised at an earlier time (see: Hunting down unnecessary I/O before you buy that next storage solution ).  The trouble was, the tools I had to discover the pattern of data I/O were limited.

    Why is this so important? In the spirit of reminding ourselves that no resource is an island, here is an example of a production code compile run, as seen from the perspective of the guest CPU. The first screen capture is the code compile with adequate storage I/O to support the application’s needs. It is a full build, running at nearly perfect CPU utilization across all 8 of its vCPUs.  (screen shots taken from my earlier post; Vroom! Scaling up Virtual Machines in vSphere to meet performance requirements-Part 2)

    image

    Below is that very same code compile, under stressed backend storage. It took 46% longer to complete, and as you can see, changes the CPU utilization of the build run.

    image

    The primary goal for this environment was to accelerate the storage. However, it would have been a bit presumptuous to conclude that all existing storage traffic is good, useful I/O. There is a significant amount of traffic originating from outside of IT, and the I/O generated needed to be understood better.  With the traffic passing more freely thanks to FVP acceleration, patterns that previously could not expose themselves should be more visible. This was the basis for the first discovery.

    A little “CSI” work on the IOPS
    Many continuous build systems use some variation of a polling mechanism to understand when there is new checked in code that needs to be compiled. This should be a very lightweight process.  However, once storage performance was allowed to breathe better, the following patterns started showing up on all of the build VMs.

    The image below shows the IOPS for one build VM during a one hour period of no compiling for that particular VM.  The VM’s were polling for new builds every 5 minutes.  Yep, that “build heartbeat” was as high as 450 IOPS on each VM.

    high-IOPS-heartbeat

    Why wasn’t this noticed before?  These spikes were being suppressed by my previously overtaxed storage, which made them more difficult to see. These were all writes, and were translating into 500 to 600 steady state IOPS just to sit idle (as seen below from the perspective of the backing storage).

    Array-VMFSvolumeIOPS

    So what was the cause? As it turned out, the polling mechanism was using some source code control (SVN) calls to help each build machine understand if it needed to execute a build. Programmatically, the Development Team has no idea whether the script that they develop is going to be efficient or not. They are separated by that layer of the infrastructure. (Sadly, I have a feeling this happens more often than not in general Application Development). This resulted in a horribly inefficient method. After helping them understand the matter, it was revamped, and now polling for each VM only takes 1 to 2 IOPS every 5 minutes.

    Idle-IOPS2

    The image below shows how the accelerated cluster of 30 build VMs looks when there are no builds running.

    Idle-IOPS

    The inefficient polling mechanism wasn’t the only thing found. A few of the Linux build VMs had a rogue “Beagle” search daemon running on them. This crawler did just that, indexing data on these Linux machines, and creating unnecessary I/O. With Windows, Indexers and other CPU and I/O hogs are typically controlled quite easily by GPO, but the equivalent services can creep into Linux systems if you are not careful. It was an easy fix at least.

    The cumulative benefit
    Prior to the efforts of accelerating the storage, and looking to make it more efficient, the utilization of the arrays looked like the image below. (6 hour period, from the perspective of the arrays)

    Array-IOPS-before

    Now, with the combination of understanding my workload better, and acceleration through FVP, that same workload looks like this (6 hour period, from the perspective of the arrays):

    Array-IOPS-after

    Notice that the estimated workload is far under the 100% it was regularly pegged at for 24 hours a day, 6 days a week. In fact, during the workday, the arrays might only peak at 50% to 60% utilization. When no builds are running, the continuous build system may only be drawing 25 IOPS from the VMFS volumes that contain the build machines, which is much more reasonable than where it was.

    With the combination of less pressure on the backing physical storage, and the magic of pooled flash on the hosts, the applications and CPU get to dictate how much storage I/O is needed.  Below is a screen capture of IOPS on a production build VM while compiling was being performed.  It was not known up until this point that a single build VM needed as much as 4,000 IOPS to compile code because the physical storage was never capable of satisfying that type of need.

    IOPS-single-VM

    Conclusion
    Could some of these discoveries have been made without FVP? Yes, perhaps some of them. But good analysis comes from being able to interpret data in a consumable way. It’s why various methods of data visualization such as bar graphs, pie charts, and X-Y-Z plots exist. FVP certainly has been doing a good job of accelerating workloads, but it also helps the administrator understand the I/O better. I look forward to seeing how the analytics might expand in future tools or releases from PernixData.

    A friend once said to me that the only thing better than a new tractor is a reason to use it. In many ways, the same thing goes for technology. Virtualization might not even be that fascinating unless you had real workloads to run on top of it. Ditto for PernixData FVP. When applied to real workloads, the magic begins to happen, and you learn a lot about your data in the process.

    Observations of PernixData FVP in a production environment

    Since my last post, "Accelerating storage using PernixData’s FVP. A perspective from customer #0001" I’ve had a number of people ask me questions on what type of improvements I’ve seen with FVP.  Well, let’s take a look at how it is performing.

    The cluster I’ve applied FVP to is dedicated for the purpose of compiling code. Over two dozen 8 vCPU Linux and Windows VMs churning out code 24 hours a day. It is probably one of the more challenging environments to improve, as accelerating code compiling is inherently a very difficult task. Massive amounts of CPU using a highly efficient, multithreaded compiler, a ton of writes, and throw in some bursts of reads for good measure. All of this occurs in various order depending on the job. Sounds like fun, huh.

    Our full builds benefited the most by our investment in additional CPU power earlier in the year. This is because full compiles are almost always CPU bound. But incremental builds are much more challenging to improve because of the dialog that occurs between CPU and disk. The compiler is doing all sorts of checking, throughout the compile. Some of the phases of an incremental build are not multithreaded, so while a full build offers nearly perfect multithreading on these 8 vCPU build VMs, this just isn’t the case on an incremental build.

    Enter FVP
    The screen shots below will step you through how FVP is improving these very difficult to accelerate incremental builds. They will be broken down into the categories that FVP divides them into: IOPS, Latency, and Throughput. Also included will be a CPU utilization metric, because they all have an indelible tie to each other. Some of the screen shots are from the same compile run, while others are not. The point here is to show how it is accelerating, and more importantly how to interpret the data. The VM being used here is our standard 8 vCPU Windows VM with 8GB of RAM. It has write-back caching enabled, with a write redundancy setting of "Local flash and 1 network flash device."

    IOPS
    Below is an incremental compile on a build VM during the middle of the day. The magenta line is showing what is being satisfied by the backing data store, and the blue line shows the Total Effective IOPS after flash is leveraged. The key to remember on this view is that it does not distinguish between reads and writes. If you are doing a lot of "cold reads" the magenta "data store" line and blue "Total effective" line may very well overlap.

    PDIOPS-01

    This is the same metric, but toggled to the read/write view. In this case, you can see below that a significant amount of acceleration came from reads (orange). For as much writing as a build run takes, I never knew a single build VM could use 1,600 IOPS or more of reads, because my backing storage could never satisfy the request.

    PDIOPS-02

    CPU
    Allowing the CPU to pass the I/O as quickly as it needs to does one thing: it allows the multithreaded compiler to maximize CPU usage. During a full compile, it is quite easy to max out an 8 vCPU system and have a sustained 100% CPU usage, but again, these incremental compiles were much more challenging. What you see below is the CPU utilization associated with the VM running the build. Acceleration delivers a significant improvement for an incremental build. A non accelerated build would rarely get above 60% CPU utilization.

    CPU-01

    Latency
    At a distance, this screen grab probably looks like a total mess, but it has really great data behind it. Why? The need for high IOPS is dictated by the VMs demanding it. If it doesn’t demand it, you won’t see it. But where acceleration comes in more often is reduced latency, on both reads and writes. The most important line here is the blue line, which represents the total effective latency.

    PDLatency-01

    Just as with other metrics, the latency reading can often times be a bit misleading with the default "Flash/Datastore" view. This view does not distinguish between reads and writes, so a cold read pulling off of spinning disk will have traditional amounts of latency you are familiar with. This can skew your interpretation of the numbers in the default view. For all measurements (IOPS, Throughput, Latency) I often find myself toggling between this view, and the read/write view. Here you can see how a cold read sticks out like a sore thumb. The read/write view is where you would go to understand individual read and write latencies.

    PDLatency-02

    Throughput
    While a throughput chart can often look very similar to the IOPS chart, you might want to spend a moment and dig a little deeper. You might find some interesting things about your workload. Here, you can see the total effective throughput significantly improved by caching.

    PDThroughput-01

    Just as with the other metrics, toggling it into read/write view will help you better understand your reads and writes.

    PDThroughput-02

    The IOPS, Throughput & Latency relationship
    It is easy to overlook the relationship that IOPS, throughput, and latency have to each other. Let me provide a real world example of how one can influence the other. The following represents the early, and middle phases of a code compile run. This is the FVP "read/write" view of this one VM. Green indicates writes. Orange indicates reads. Blue indicates "Total Effective" (often hidden by the other lines).

    First, IOPS (green). High write IOPS at the beginning, yet relatively low write IOPS later on.

    IOPS

    Now, look at write throughput (green) below for that same time period of the build. A modest amount of throughput at the beginning where the higher IOPS were, then followed by much higher throughput later on when IOPS had been low. This is the indicator of changing I/O sizes from the applications generating the data.

    throughput

    Now look at write latency (green) below. Extremely low latency (sub 1ms) with smaller I/O sizes. Higher latency on the much larger I/O sizes later on. By the way, the high read latencies generally come from cold reads that were served from the backing spindles.

    latency

    The findings here show that early on in the workflow where SVN is doing a lot of its prep work, a 32KB I/O size for writes is typically used. The write IOPS are high, throughput is modest, and latency comes in at sub 1ms. Later on in the run, the compiler itself uses much larger I/O sizes (128KB to 256KB). IOPS are lower, but throughput is very high. Latency suffers (approaching 8ms) with the significantly larger I/O sizes. There are other factors influencing this, which I will address in an upcoming post.

    This is one of the methods to determine your typical I/O size to provide a more accurate test configuration for Iometer, if you choose to do additional benchmarking. (See: Iometer.  As good as you want to make it.)
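
    The arithmetic behind that relationship is simple: throughput is IOPS multiplied by I/O size, so an average I/O size can be backed out of any pair of IOPS and throughput readings. Here is a minimal sketch. The function names and the sample readings are made up for illustration and are not taken from the charts above.

```python
def throughput_mb_s(iops: float, io_size_kb: float) -> float:
    """Throughput in MB/s (1 MB = 1024 KB) for a given IOPS rate and I/O size."""
    return iops * io_size_kb / 1024


def avg_io_size_kb(throughput_mb_s_value: float, iops: float) -> float:
    """Average I/O size in KB implied by a throughput reading and an IOPS reading."""
    return throughput_mb_s_value * 1024 / iops


# Hypothetical readings: many small writes versus fewer large writes.
print(throughput_mb_s(2000, 32))   # 62.5 MB/s at 32KB per I/O
print(throughput_mb_s(600, 256))   # 150.0 MB/s at 256KB per I/O
print(avg_io_size_kb(150, 600))    # 256.0 KB average I/O size
```

    Reading the IOPS and throughput values off the charts for the same moment in time and feeding them to `avg_io_size_kb` gives a transfer request size to plug into Iometer.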

    Other observations

    1.  After you have deployed an FVP cluster into production, your SAN array monitoring tool will most likely show you an increase in your write percentage compared to your historical numbers. This is quite logical when you think about it. All writes, even when accelerated, eventually make it to the data store (albeit in a much more efficient way). Many of your reads may be satisfied by FVP, and never hit the array.

    2.  When looking at a summary of the FVP at the cluster level, I find it helpful to click on the "Performance Map" view. This gives me a weighted view of how to distinguish what is being accelerated most during the given sampling period.

    image

    3. In addition to the GUI, controlling the VM write caching settings can easily be managed via PowerShell. This might be a good step to take if the cluster tripped over to UPS power. Backup infrastructures that do not have a VADP capable proxy living in the accelerated cluster might also need to rely on some PowerShell scripts. PernixData has some good documentation on the matter.
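
    To put a number on the first observation, here is a rough sketch of why the array-side write percentage climbs. It assumes every write eventually reaches the array as one I/O (it ignores any coalescing during de-staging) and that a fixed fraction of reads is served from host flash. The function name and the sample figures are hypothetical.

```python
def array_write_pct(read_iops: float, write_iops: float, read_hit_rate: float) -> float:
    """Percentage of array-side I/O that is writes.

    read_hit_rate is the fraction of reads served from host flash (0.0 to 1.0).
    All writes are assumed to reach the array eventually.
    """
    reads_at_array = read_iops * (1.0 - read_hit_rate)
    total = reads_at_array + write_iops
    return 100.0 * write_iops / total if total else 0.0


# A VM-side mix of 60% reads and 40% writes.
print(array_write_pct(600, 400, 0.0))             # 40.0 with no acceleration
print(round(array_write_pct(600, 400, 0.75), 1))  # 72.7 when 75% of reads hit flash
```

    The workload inside the VM has not changed at all. The array simply sees fewer of the reads.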

    Conclusion
    PernixData FVP is doing a very good job of accelerating a very difficult workload. I would have loved to show you data from accelerating a more typical workload such as Exchange or SQL, but my other cluster containing these systems is not accelerated at this time. Stay tuned for the next installment, as I will show you what was discovered as I started looking at my workload more closely.

    - Pete

    Accelerating storage using PernixData’s FVP. A perspective from customer #0001

    Recently, I described in "Hunting down unnecessary I/O before you buy that next storage solution" the efforts around addressing "technical debt" that was contributing to unnecessary I/O. The goal was to get better performance out of my storage infrastructure. It’s been a worthwhile endeavor that I would recommend to anyone, but at the end of the day, one might still need faster storage. That usually means, free up another 3U of rack space, and open the checkbook.

    Or does it?  Do I have to go the traditional route of adding more spindles, or investing heavily in a faster storage fabric?  Well, the answer was an unequivocal "yes" not too long ago, but times are a changing, and here is my way to tackle the problem in a radically different way.

    I’ve chosen to delay any purchases of an additional storage array, or the infrastructure backing it, and opted to go PernixData FVP.  In fact, I was customer #0001 after PernixData announced GA of FVP 1.0.  So why did I go this route?

    1.  Clustered host based caching.  Leveraging server side flash brings compute and data closer together, but thanks to FVP, it does so in such a way that works in a highly available clustered fashion that aligns perfectly with the feature sets of the hypervisor.

    2.  Write-back caching. The ability to deliver writes to flash is really important. Write-through caching, which waits for the acknowledgement from the underlying storage, just wasn’t good enough for my environment. Rotational latencies, as well as physical transport latencies would still be there on over 80% of all of my traffic. I needed true write-back caching that would acknowledge the write immediately, while eventually de-staging it down to the underlying storage.

    3.  Cost. The gold plated dominos of upgrading storage are not fun for anyone on the paying side of the equation. Going with PernixData FVP was going to address my needs for a fraction of the cost of a traditional solution.

    4.  It allows for a significant decoupling of "storage for capacity" from "storage for performance" when addressing additional storage needs.

    5.  Another array would have been, to a certain degree, more of the same. Incremental improvement, with less than enthusiastic results considering the amount invested. I found myself not very excited to purchase another array. With so much volatility in the storage market, it almost seemed like an antiquated solution.

    6.  Quick to implement. FVP installation consists of installing a VIB via Update Manager or the command line, installing the Management services and vCenter plugin, and you are off to the races.

    7.  Hardware independent. I didn’t have to wait for a special controller upgrade, firmware update, or wonder if my hardware would work with it (a common problem with storage array solutions). Nor did I have to make a decision to perhaps go with a different storage vendor if I wanted to try a new technology. It is purely a software solution with the flexibility of working with multiple types of flash: SSDs, or PCIe based.
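
    The write-back point in item 2 is easier to see with a toy model of when the VM gets its acknowledgement. This is only a sketch under simplifying assumptions: write-through acknowledges once the array has the data, write-back acknowledges once local flash has it, and a redundant write-back also waits for one network round trip plus a remote flash write. The function names and latency figures are hypothetical, and this is not a description of how FVP is implemented internally.

```python
def write_through_ack_ms(flash_ms: float, array_ms: float) -> float:
    """Write-through: the VM is acknowledged only after the array confirms the write.

    The flash copy is assumed to be written in parallel, so the slower of the
    two determines the acknowledgement time.
    """
    return max(flash_ms, array_ms)


def write_back_ack_ms(flash_ms: float, replica_rtt_ms: float = 0.0, replicas: int = 0) -> float:
    """Write-back: acknowledged once flash (and any flash replica) holds the write."""
    if replicas == 0:
        return flash_ms
    # Replicas are assumed to be written in parallel with the local copy.
    remote_ms = replica_rtt_ms + flash_ms
    return max(flash_ms, remote_ms)


# Hypothetical latencies: 0.2ms flash write, 8ms array write, 0.3ms network round trip.
print(write_through_ack_ms(0.2, 8.0))    # 8.0
print(write_back_ack_ms(0.2))            # 0.2
print(write_back_ack_ms(0.2, 0.3, 1))    # 0.5
```

    In this model the array latency disappears from the write path entirely with write-back, which is the whole appeal for a write heavy workload.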

    A different way to solve a classic problem
    While my write intensive workload is pretty unique, my situation is not. Our storage performance needs outgrew what the environment was designed for: capacity at a reasonable cost. This is an all too common problem. With the increased capacities of spinning disks, it has actually made this problem worse, not better. Fewer and fewer spindles are serving up more and more data.

    My goal was to deliver the results our build VMs were capable of delivering with faster storage, but unable to because of my existing infrastructure. For me it was about reducing I/O contention to allow the build system CPU cycles to deliver the builds without waiting on storage. For others it might be delivering lower latencies to their SQL backed ERP or CRM servers.

    The allure of utilizing flash has been an intriguing one. I often found myself looking at my vSphere hosts and all of their processing goodness, but disappointed those SSDs sitting in the hosts couldn’t help to augment my storage performance needs. Being an active participant in the PernixData beta program allowed me to see how it would help me in my environment, and if it would deliver the needs of the business.

    Lessons learned so far
    Don’t skimp on quality SSDs.  Would you buy an ESXi host with one physical core?  Of course you wouldn’t. Same thing goes with SSDs.  Quality flash is a must! I can tell you from first hand experience that it makes a huge difference.  I thought the Dell OEM SSDs that came with my M620 blades were fine, but by way of comparison, they were terrible. Don’t cripple a solution by going with cheap flash.  In this 4 node cluster, I went with 4 EMLC based, 400GB Intel S3700s. I also had the opportunity to test some Micron P400M EMLC SSDs, which also seemed to perform very well.

    While I went with 400GB SSDs in each host (giving approximately 1.5TB of cache space for a 4 node cluster), I did most of my testing using 100GB SSDs. They seemed adequate in that they were not showing a significant amount of cache eviction, but I wanted to leverage my purchasing opportunity to get larger drives. Knowing the best size can be a bit of a mystery until you get things in place, but having a larger cache size allows for a larger working set of data available for future reads, as well as giving head room for the per-VM write-back redundancy setting available.

    An unexpected surprise is how FVP has given me visibility into the one area of I/O monitoring that is traditionally very difficult to see: I/O patterns. See Iometer. As good as you want to make it. Understanding this element of your I/O needs is critical, and the analytics in FVP have helped me discover some very interesting things about my I/O patterns that I will surely be investigating in the near future.

    In the read-caching world, the saying goes that the fastest storage I/O is the I/O the array never will see. Well, with write caching, it eventually needs to be de-staged to the array.  While FVP will improve delivery of storage to the array by absorbing the I/O spikes and turning random writes to sequential writes, the I/O will still eventually have to be delivered to the backend storage. In a more write intensive environment, if the delta between your fast flash and your slow storage is significant, and your duty cycle of your applications driving the I/O is also significant, there is a chance it might not be able to keep up.  It might be a corner case, but it is possible.
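
    That corner case can be reasoned about with some back of the envelope arithmetic. The sketch below treats the flash as a simple buffer: data arrives at a burst write rate for some fraction of the time (the duty cycle) and drains to the array at a steady de-stage rate. The function names and every number in the example are hypothetical.

```python
from typing import Optional


def backlog_growth_mb_s(write_mb_s: float, duty_cycle: float, destage_mb_s: float) -> float:
    """Average rate at which not-yet-de-staged data accumulates (negative means it drains)."""
    return write_mb_s * duty_cycle - destage_mb_s


def hours_until_full(cache_gb: float, write_mb_s: float, duty_cycle: float,
                     destage_mb_s: float) -> Optional[float]:
    """Hours until the flash is full of un-de-staged writes, or None if the array keeps up."""
    growth = backlog_growth_mb_s(write_mb_s, duty_cycle, destage_mb_s)
    if growth <= 0:
        return None
    return cache_gb * 1024 / growth / 3600


# 400GB of flash, 300 MB/s write bursts, array de-stages at 100 MB/s.
print(hours_until_full(400, 300, 0.25, 100))           # None: the array keeps up
print(round(hours_until_full(400, 300, 0.5, 100), 2))  # 2.28 hours at a 50% duty cycle
```

    The two inputs that matter are the ones called out above: the gap between flash and array speed, and how much of the time the application is actually driving writes.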

    What’s next
    I’ll be posting more specifics on how running PernixData FVP has helped our environment. So, is it really "disruptive" technology? Time will ultimately tell. But I chose to not purchase an array along with new SAN switchgear because of it. Using FVP has led to less traffic on my arrays, with higher throughput and lower read and write latencies for my VMs. Yeah, I would qualify that as disruptive.

     

    Helpful Links

    Frank Denneman – Basic elements of the flash virtualization platform – Part 1
    http://frankdenneman.nl/2013/06/18/basic-elements-of-the-flash-virtualization-platform-part-1/

    Frank Denneman – Basic elements of the flash virtualization platform – Part 2
    http://frankdenneman.nl/2013/07/02/basic-elements-of-fvp-part-2-using-own-platform-versus-in-place-file-system/

    Frank Denneman – FVP Remote Flash Access
    http://frankdenneman.nl/2013/08/07/fvp-remote-flash-access/

    Frank Denneman – Design considerations for the host local FVP architecture
    http://frankdenneman.nl/2013/08/16/design-considerations-for-the-host-local-architecture/

    Satyam Vaghani introducing PernixData FVP at Storage Field Day 3
    http://www.pernixdata.com/SFD3/

    Write-back deepdive by Frank and Satyam
    http://www.pernixdata.com/files/wb-deepdive.html

    Iometer. As good as you want to make it.

    Most know Iometer as the go-to synthetic I/O measuring tool used to simulate real workload conditions. Well, somewhere, somehow, someone forgot the latter part of that sentence, which is why it ends up being so misused and abused. How many of us have seen a storage solution delivering 6 figure IOPS using Iometer, only to find that they are running a 100% read, 512 byte, 100% sequential access workload simulation? Perfect for the two people on the planet that those specifications might apply to. For the rest of us, it doesn’t help much. So why would they bother running that sort of unrealistic test? Pure, unapologetic number chasing.

    The unfortunate part is that sometimes this leads many to simply dismiss Iometer results. That is a shame really, as it can provide really good data if used in the correct way. Observing real world data will tell you a lot of things, but the sporadic nature of real workloads makes it difficult to use for empirical measurement, hence the need for simulation.

    So, what are the correct settings to use in Iometer? The answer is completely dependent on what you are trying to accomplish. The race for a million IOPS by your favorite storage vendor really means nothing if there is no correlation between their simulated workload, and your real workload. Maybe IOPS isn’t even an issue for you. Perhaps your applications are struggling with poor latency. The challenge is to emulate your environment with a synthetic workload that helps you understand how a potential upgrade, new array, or optimization might be of benefit.

    The mysteries of workloads
    Creating a synthetic workload representing your real workload assumes one thing: that you know what your real workload really is. This can be more challenging than one might think, as many storage monitoring tools do not help you understand the subtleties of patterns to the data that is being read or written.

    Most monitoring tools tend to treat all I/O equally. By that I mean, say that over a given period of time you have 10 million I/Os occur. Let’s say your monitoring tells you that you average 60% reads and 40% writes. What is not clear is how many of those reads are multiple reads of the same data or completely different, untouched data. It also doesn’t tell you if the writes are overwriting existing blocks (which might be read again shortly thereafter) or generating new data. As more and more tiered storage mechanisms come into play, understanding this aspect of your workload is becoming extremely important. You may be treating your I/Os equally, but a tiered storage system using sophisticated caching algorithms certainly does not.
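
    If you can get hold of a block-level trace, the distinction is easy to quantify. The following is a minimal sketch that assumes a hypothetical trace format of (operation, block number) pairs. It reports the usual read percentage alongside the two numbers most tools leave out: how many reads touch data already seen in the trace, and how many writes overwrite blocks already written.

```python
from typing import Dict, Iterable, Tuple


def summarize_trace(trace: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Summarize a trace of (op, block) pairs, where op is "R" or "W"."""
    touched = set()   # blocks read or written so far
    written = set()   # blocks written so far
    reads = writes = rereads = overwrites = 0
    for op, block in trace:
        if op == "R":
            reads += 1
            if block in touched:
                rereads += 1      # warm data: a cache could serve this
        elif op == "W":
            writes += 1
            if block in written:
                overwrites += 1   # rewriting existing data, not new data
            written.add(block)
        else:
            raise ValueError(f"unknown operation: {op!r}")
        touched.add(block)

    def pct(part: int, whole: int) -> float:
        return 100.0 * part / whole if whole else 0.0

    return {
        "read_pct": pct(reads, reads + writes),
        "reread_pct": pct(rereads, reads),
        "overwrite_pct": pct(overwrites, writes),
    }


sample = [("R", 1), ("R", 1), ("W", 2), ("R", 2), ("W", 2), ("R", 3)]
print(summarize_trace(sample))
```

    Two workloads with the same 60/40 read/write split can have very different re-read and overwrite percentages, and that difference is what a caching tier responds to.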

    How can you gain more insight?  Use every tool at your disposal.  Get to know your applications, and the duty cycles around them. What are your peak hours? Are they in the middle of the day, or in the middle of the night when backups are running?

    Suggestions on Iometer settings
    You may find that the settings you choose for Iometer yield results from your shared storage that aren’t nearly as good as you thought. But does it matter? If it is an accurate representation of your real workload, not really. What matters is whether you are able to deliver the payload from point a to point b to meet your acceptance criteria (such as latency, throughput, etc.). The goal would be to represent that in a synthetic workload for accurate measurement and comparison.

    With that in mind, here are some suggestions for the next time you set up those Iometer runs.

    1.  Read/write ratio.  Choose a realistic read/write ratio representing your workload. With writes, RAID penalties can hurt your effective performance by quite a bit, so if you don’t have an idea of what this ratio currently is, it’s time for you to find out.

    2.  Transfer request size. Is your payload the size of a ball bearing, or a bowling ball? Applications and operating systems vary on what size is used. Use your monitoring systems to best determine what your environment consists of.

    3.  Disk size. Set the "maximum disk size" in multiples of 2097152. The field is counted in sectors, so with 512 byte sectors that is a 1GB file. Throwing a bunch of zeros in there might fill up your disk with Iometer’s test file. Depending on your needs, a setting of 2 to 20 GB might be a good range to work with.

    4.  Number of outstanding I/Os. This needs to be high enough so that the test can keep sending I/O requests to the storage while it is still fulfilling earlier requests. A setting of 32 is pretty common.

    5.  Alignment of I/O. Many of the standard Iometer ICF files you find were built for physical drives. They have the "Align I/Os on:" setting set to "Sector boundaries". When running tests on a storage array, this can lead to inconsistent results, so it is best to align on 4K or 512 bytes.

    6.  Ramp up time. Offer at least a minute of ramp up time.

    7.  Run time. Some might suggest running simulations long enough to exhaust all caching, so that you can see "real" throughput. While I understand the underlying reason for this statement, I believe this is missing the point. Caching is there in the first place to take advantage of a working set of warm and cold data, bursts, etc. If you have a storage solution that satisfies the duty cycles that exist in your environment, that is the most important part.

    8.  Number of workers. Let this spawn automatically to the number of logical processors in your VM. It might be overkill in many cases because of terrible multithreading abilities of most applications, but it’s a pretty conventional practice.

    9.  Multiple Iometer instances. Not really a setting, but more of a practice. I’ve found running multiple tests at once a good way to better understand how a storage solution will react under load as opposed to on its own. It is shared storage after all.
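
    To show why the read/write ratio in item 1 matters so much, here is a sketch of the commonly used RAID write penalty arithmetic. The penalty values are the usual rule-of-thumb figures (each host write costs 2 back-end I/Os on RAID 10, 4 on RAID 5, 6 on RAID 6), and real arrays with write caches will deviate from them. The function name and the 2,000 IOPS figure are hypothetical.

```python
# Rule-of-thumb back-end I/Os generated by one host write.
RAID_WRITE_PENALTY = {"raid0": 1, "raid10": 2, "raid5": 4, "raid6": 6}


def frontend_iops(backend_iops: float, read_pct: float, raid: str) -> float:
    """Host-visible IOPS a set of spindles can sustain for a given read percentage."""
    r = read_pct / 100.0
    penalty = RAID_WRITE_PENALTY[raid]
    return backend_iops / (r + (1.0 - r) * penalty)


# Spindles good for 2,000 back-end IOPS, tested at different read/write mixes.
print(round(frontend_iops(2000, 100, "raid5")))   # 2000 for a 100% read test
print(round(frontend_iops(2000, 60, "raid5")))    # 909 at 60% reads / 40% writes
print(round(frontend_iops(2000, 60, "raid10")))   # 1429 for the same mix on RAID 10
```

    A 100% read test on the same hardware reports more than double what a 60/40 mix would on RAID 5, which is exactly the kind of number chasing to avoid.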

    Disclaimer
    If you were looking for this to be the definitive post on Iometer, that isn’t what I was shooting for. There are many others who are much more qualified to speak to the nuances of Iometer than me. What I hope to do is to offer a little practical perspective on its use, and how it can help you. So next time you run Iometer, think about what you are trying to accomplish, and let go of the number chasing. Understand your workloads, and use the tool to help you improve your environment.
