Observations with the Active Memory metric in vSphere

Memory management of guest Operating Systems in vSphere is an enormously broad and complex topic that has been covered quite well over the years. Even with all of that great information, some of the metrics provided still seem to befuddle users. One of those metrics, provided to us courtesy of vSphere, is "Active Memory." I hope to provide a few real world examples of why this confusion occurs, and what to look out for in your own environment.

vSphere attempts to interpret how much memory is being actively used by a VM, and displays this in the form of “Active Memory.”  The VMkernel bases this estimate on memory pages recently touched by the guest OS during a given sampling period, then displays it as an average for that sampling period (maximums and minimums are exposed at higher logging levels). It is a metric that has proven to be quite controversial. Some have grown frustrated by its perceived inaccuracies, but I believe the problem is not the metric’s accuracy, but a misunderstanding of how it collects its data, and what it means. Having additional data points to understand the behavior of your workload is a good thing. It is critical to know what this metric really means, and how different Operating Systems and applications may produce different results for it.

There is a wealth of good sources (a few links at the end of this post) defining what Active Memory is as it relates to vSphere. The two takeaways about the Active Memory metric I like to remember are that 1.) it is a statistical estimate, and 2.) it represents a single sample period. In other words, it has no relationship to previous samplings, and therefore may or may not represent the same memory pages being accessed.

The Risk
"We have met the enemy, and he is us."  — Walt Kelly as Pogo

Since Active Memory is a unique metric outside of the paradigm of the OS, translating what it means for you, the application, or the guest OS is prone to misinterpretation. The risk is interpreting its meaning incorrectly, and perhaps using it as the primary method for right sizing a VM. Interestingly enough, this can lead to both oversized VMs and undersized VMs.

I believe one thing that gets Administrators off on the wrong foot is vSphere’s own baked-in "Virtual Machine Memory Usage" alarm. This "Usage" metric is a percentage of the total memory assigned to the VM, and is tied to the Active Memory metric in vSphere. It implies that when it is high, the VM is running out of memory, and when it is low, the VM is performing as designed with no memory issues. I will demonstrate how, under certain circumstances, both of these assumptions can be wrong.

Oversizing
Oversizing a VM’s resources is not an uncommon occurrence. You would think spotting these systems might be easy and obvious. That is not always the case.

With respect to memory sizing, let’s do a little experiment. The example below is a bulk file copy (11 gigabytes worth of large and small files) from a Linux machine. The target can be local or remote; the effect will be similar. We will observe the difference in Active Memory between the small VM (1GB of memory assigned) and the large VM (4GB of memory assigned), and what impact it may or may not have on performance.

The Active Memory of the smaller Linux VM below

image

The Active Memory of the larger Linux VM below.

image

Note how the Active Memory increased on the 4GB Linux VM versus the 1GB Linux VM. This gives the impression that the file copy is consuming memory for the copy job, leaving less for the applications.

Now let us jump into ‘top’ inside the guest OS. It also shows figures that give the impression that the file copy is using most of the memory for the copy job, and this may trigger a vCenter memory usage alarm.

image

But in this case, top is not telling the entire story either. Let’s take a look at the same resource utilization inside the guest using ‘htop’.

image

Let’s look at utilization inside the guest using "free -m"

image

So what is going on here?  The Linux kernel will allocate memory that isn’t actively used by processes to other tasks, such as the file system (page) cache. This opportunistic use of memory will not interfere with other spawning processes; as soon as another process needs that memory, the Linux kernel will free it so that it can be used by the application. This is a clever use of resources, but as you can see, it can also give the wrong impression inside the guest (via ‘top’), as well as in vSphere (via Active Memory). One can keep increasing the amount of memory assigned to a VM, and in many cases this behavior will continue to occur. vSphere’s Active Memory metric does not attempt to distinguish what the touched memory is being used for; it only registers a change in value. In all cases, the memory statistics are not inaccurate, they are just different representations of memory usage.
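
If you want to see this behavior for yourself inside a test Linux VM, a minimal way to reproduce it looks something like the following (the source and target paths are placeholders, and dropping caches should only ever be done on a test system):

    free -m                                   # note the "cached" figure before the copy
    cp -r /data/source /data/target           # bulk file copy; the page cache grows as files pass through
    free -m                                   # "used" looks high, but most of it is reclaimable cache
    sync; echo 3 > /proc/sys/vm/drop_caches   # as root: flush clean caches (test systems only)
    free -m                                   # memory is handed back with no harm to running processes

Watch the Active Memory chart in vSphere while doing this, and you will see it climb right along with the page cache.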

The reason why I chose a bulk file copy as an experiment is because a file copy is largely perceived by the end user as being a storage I/O or network I/O matter. The behavior I described will most likely show up in Linux VMs being used as flat-file storage servers (something I see often), but is not limited to just that type of workload. I should also mention that during the testing, Linux’s use of memory for some of its file handling tasks was more noticeable when using slow backing storage in comparison to faster storage.

If you are purely a Windows shop, remember that this characteristic will show up with virtual appliances, as they are almost always Linux VMs. Let’s take a look at that same bulk file copy in Windows, and see how it relates to Active Memory.

The Active Memory of the smaller Windows VM below.

image

The Active Memory of the larger Windows VM below.

image

Memory resources inside the guest of the larger Windows VM below.

image

The Windows Memory Manager seems to handle this same task differently.  Semantics aside, when more memory is assigned to a VM, Windows appears to carve out more for this task, but seems to cap it, in favor of leaving the remaining memory space for already cached applications and data (seen in the screen shots as “standby” and/or “free”).  This is a simple indicator that various Operating Systems handle their memory management differently, and this needs to be taken into consideration when observing the Active Memory metric.

Undersizing
Undersizing a VM’s memory can stem from many causes, but it is most likely to show up on the following types of systems.

  • Server performing multiple roles and not sized accordingly. (e.g. Front end web services with backend databases on the same system, like small SharePoint deployments)
  • VMs right sized according to the Active Memory metric.
  • SQL Servers.
  • Exchange Servers.
  • Servers running one or more Java applications.

With a SQL Server, one can easily find a VM where the "Active Memory" is quite low. Then look inside the guest, and you will see that memory utilization is very high, and if the system resources were assigned conservatively, the system will act sluggish.

image

Now look at it inside the guest, and you will see quite high utilization.

clip_image002

A few steps can help this matter.

  • Use the SQL Server Monitoring Tools in Perfmon to better understand the problem. Be warned that you may have to invest significant time in order to get the scaling right, and to interpret and validate the data correctly. Don’t rely solely on one metric to determine the state. For instance, the "SQL Server Buffer Manager: Buffer Cache Hit Ratio" is supposed to indicate insufficient memory for SQL if the ratio is a low number. However, I’ve seen memory starved systems still show this as a high value.
  • Change SQL’s default configuration for managing memory. The default setting will let SQL absorb all of the memory, and leave little for the rest of the OS or the other applications. Set it to a fixed number below the amount assigned to the system. For example, if one had a 12GB SQL server, assign 6GB as the maximum server memory. This will allow for sufficient resources for the server OS and any other applications that run on the system (see the sketch after this list).
  • Document performance monitoring results, then increase the memory assigned to your VM, and follow up with more monitoring to look for measurable improvement. One could simply increase the memory assigned and skip the other steps, but then you are relying completely on anecdotal observations to determine improvement.
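
As a rough sketch of the second item above, capping a 12GB SQL Server VM at 6GB of maximum server memory can be done with sp_configure, shown here invoked through sqlcmd. Adjust the server/instance name and the value to your own environment; this is illustrative, not a recommendation for every workload.

    sqlcmd -S localhost -Q "EXEC sp_configure 'show advanced options', 1; RECONFIGURE;"
    sqlcmd -S localhost -Q "EXEC sp_configure 'max server memory (MB)', 6144; RECONFIGURE;"
    # Verify the running value
    sqlcmd -S localhost -Q "EXEC sp_configure 'max server memory (MB)';"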

Exchange is beginning to act more like SQL with each major release. Much like SQL, Exchange is now quite aggressive in its use of caching. It’s one of the reasons behind the dramatic reductions in storage I/O demands over the last three major releases of Exchange. Also like SQL, having plenty of memory assigned will help compensate for slow backend storage.  Starving the system of memory will create wildly unpredictable results, as it never has an opportunity to cache what it should.

Java uses its own memory manager. Java will need available memory space in the VM for each and every JVM running. Ultimately, JVM applications will work best when a memory reservation is set, at a minimum, to the sum of the memory used by all JVMs running on that VM. Be mindful of the implications that memory reservations can bring to the table. You can gain more insight into the needs of Java inside the guest by using various tools.
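
For a sense of the arithmetic involved, each JVM’s heap is bounded by its startup flags, so the VM has to accommodate the sum of those heaps plus JVM and guest OS overhead. A hypothetical example (the jar names and sizes are placeholders):

    # Two JVMs on one VM, each pinned to a 2GB heap
    java -Xms2g -Xmx2g -jar app1.jar &
    java -Xms2g -Xmx2g -jar app2.jar &
    # This VM now needs at least 4GB for the heaps alone, plus JVM and guest OS overhead,
    # which is the kind of figure worth reflecting in the VM's memory reservation.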

Other observations from a Production environment
A few other notes worth mentioning

1.  Sometimes guest OS paging is monitored as an indicator of not enough memory. However, not all memory inside a guest OS will page when under pressure. If the application or OS has pinned the memory, you won’t see paging coming from it. One can be starving the app for memory, yet it does not show up as guest OS paging.

2.  VMs with larger vCPU counts need a relative increase in memory assigned to the VM. I’ve seen this in my environment: when a VM with a high vCPU count is under tremendous load, not having enough memory will hinder performance. Simply put, more CPU cycles need more memory addresses to work with.

3.  Server memory might not be cheap, but neither is storage, and even fast storage is several orders of magnitude slower than memory. The performance gain of assigning more memory to specific VMs (assuming your hosts/cluster can support it) can be immediate and dramatic. There is no need to induce paging if it can be avoided.

4.  Assigning more memory to a VM running a poorly designed or inefficient application will likely not help the application, and be a waste of resources. An application may be storage I/O heavy, no matter how much memory you assign it (think Exchange 2003).

One of my first and favorite VMworld breakout sessions, which I attended in 2010, was "Understanding Virtualization Memory Management Concepts" (TA7750), presented by Kit Colbert. Kit is now the CTO of End User Computing at VMware, but the sessions can still be found online. I recall sitting in that session, and within the first 5 minutes deciding that: 1.) I knew nothing about memory, especially with a Hypervisor, and 2.) the deep dive was so good, and the content so dense, that any attempt at taking notes was pointless. I made it a point to attend this session each year that he presented it, as it represents the very best of what VMworld has to offer. Do yourself a favor and watch one of his sessions.

Conclusion
Memory can and will be measured differently by Hypervisors and guest OSs. The definitions of memory related terms may differ between the application, the guest OS, and the hypervisor. Understanding your workloads, and the characteristics of the platforms they use, will help you better size your VMs to balance optimal performance with a minimal footprint. Monitoring memory in a useful way can also be a time consuming, difficult task that extends well beyond just a simple metric.

Have fun

- Pete

Helpful links
Understanding vSphere Active Memory
http://blogs.vmware.com/vsphere/2013/10/understanding-vsphere-active-memory.html

Kit Colbert’s 2011 VMworld breakout session – Understanding Virtualized Memory Performance Management
https://www.youtube.com/watch?v=YKaUtoQrLjo  

Monitor Memory Usage in SQL Server
http://msdn.microsoft.com/en-us/library/ms176018.aspx

SQL Server on VMware Best Practices guide
http://www.vmware.com/files/pdf/solutions/SQL_Server_on_VMware-Best_Practices_Guide.pdf

VMware KB 1687: Excessive Page Faults Generated by Windows applications
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1687

A vSphere & memory related post would not be complete without mention of the venerable "vSphere Clustering Deepdive"
http://www.amazon.com/VMware-vSphere-5-1-Clustering-Deepdive-ebook/dp/B0092PX72C/

Getting the big IT purchase approved

IT organizations are faced with a tantalizing array of options when it comes to hardware and software solutions. But long before anything can ever be deployed, it has to be purchased, which means at some point it had to be approved. Sometimes deploying a solution is easy compared to getting it approved. But how does one go about getting the big ticket item through? Well, here is my attempt at demystifying the process.

First, let’s just say that "big purchase" is without a doubt a relative term. For an SMB, $10,000 might be a show stopper, while seven figures for a large enterprise may be part of the routine. Both offer unique challenges, but share similar tactics. Getting a big IT purchase approved typically requires a unique set of skills and experience. A mix of preparation, clarity, delivery, timing, and attitude makes up the chaotic formula that, when done well, will improve the odds of success. It is a skill that can be as important as anything else in your technical arsenal.

Preparation
You will serve yourself well if you think and deliver like a consultant. Life in Ops can get bogged down by internal strife, whack-a-mole fire fighting, and the occasional "look at this new feature" deployment even though nobody asked for it. Take notice of how a good consultant does things. Step back to understand the desired result, then build out your own statement defining the typical design inputs: requirements, constraints, assumptions, and risks.

At some point, you will need to prioritize your own wants, and pick your battles. You typically can’t have everything, so start from IT’s mission statement and work from there. Start with bet-the-business elements like high availability and data/system protection that won’t be spoken up for by anyone but IT. Other needs may in fact be departmental needs that impact productivity and revenue. While IT may be the enabler of the request, make sure the identity of the requester is clear.

It’s not uncommon for an SMB to have very little money allocated to IT, but this isn’t an excuse for lack of diligence in preparation. Large organizations have more money, but proportionally much more complex problems to solve, SLAs to adhere to, and regulations to comply with. If you have no idea how your organization’s IT spending compares to peers in your industry, it is time to learn, and communicate that as a part of your presentation if your funds are abnormally low.

This is also an opportunity for you to project yourself as the "solution provider" in your organization. Embrace this. Help them understand why technology costs have increased over the past 10 years. If someone says, "Why don’t we just use the cloud for this?", rather than let smoke pour out of your ears, respond with "That is a great question, Joe. IT is constantly looking for the best ways to deliver services that meet the requirements of the organization." Then go into an appropriate level of detail on why it may or may not be a good fit (and if it is a good fit, say so!).

The biggest competitor to your proposal will be, you guessed it, doing nothing. But there is a cost to doing nothing. The key stakeholders might look at this proposed expenditure and compare it to $0. In most cases, that comparison is completely wrong, and it is up to you to help them understand what the real cost comparison is.

One opportunity sometimes overlooked is the power of a cost deferral. Does the unbudgeted solution you are proposing delay a much larger budgeted purchase until perhaps next year? Showcase this. Good proposals typically show a TCO of 3 to 5 years. But do not underestimate the allure an immediate cost deferral has to your friendly CFO.

Get input on defining the "what" of a problem, and its impacts. The "how" is usually reserved for the Subject Matter Expert (e.g. you). This will minimize silly ideas from others suggesting that your storage capacity issues can be solved by the Friday flier for Best Buy.

Learn to prime the pump. Do a little one-on-one campaigning. This is a common method suggested in many books on successful leadership. It is your chance to win over your constituents before any formal proposal. Try holding an internal "Lunch and Learn" about trends in technology. Share a little about how amazing virtualization is, and help them understand some basic challenges of IT. These techniques will engage key personnel, and help establish a trusting relationship with IT.

The presentation – IT Shark Tank
I’m a big fan of the show ‘Shark Tank.’ If you aren’t familiar with it, a panel of very successful investors hear pitches from would-be entrepreneurs who are looking for investment funds in exchange for a stake in equity. The investors bring their own wealth, smarts, and competitive nature to the table, and can be quite tough on prospective entrepreneurs. A few things can be gleaned from this and applied directly to your ability to deliver a successful proposal.

  • Come prepared. Nothing kills a proposal like lack of preparation, and not knowing your facts. Let’s say you are requesting more storage: you’d better believe some of the simplest questions will be asked, many of which are easy to overlook when entering the room. "How much storage do we have?" "How much do we have left?" "How much do we need?" "Why does it cost so much?" "What are the alternatives?"
  • Clearly state the problem, the impacts to the business, the options, and your recommendations.
  • Learn to answer the simplest of questions in the simplest of ways. "Does this proposal save us money?" "Is there a less expensive way to do this?"
  • Craft your message to your audience and appeal to their sensibilities. Flog yourself upside the head if you use any IT acronyms, or assume that technical gymnastics is going to impress them. It won’t. What will is being concise. Every word has a purpose.
  • Provide a little (but not too much) context to the problem that you are trying to solve. Leverage an analogy if you need to.
  • Know the counterpoints, and how to respond. Know how you are going to answer a question you don’t know the answer to.
  • Seek to understand their position. What might they dislike? (e.g. unpredictable expenses, obligated debt, investments they don’t understand, etc.)
  • Respect everyone’s time. Make it quick, make it concise, and if they would like more detail, you can certainly do that, but don’t make it a part of the pitch.

How to deal with everyone else in the food chain
Be honest with your vendors. They have a job to do, and are trying to help you. If you show interest in a solution that is 10x more than what you can afford, it isn’t going to do anyone any good to bring them in for an onsite demonstration. They will appreciate your honesty so they can focus on more cost appropriate solutions. Believe it or not, most want the right solution for you in the first place, as repeat business is the most important value they can bring back to their own organization.

If you are someone who doesn’t have deep-dive knowledge on the solution you are proposing, take advantage of the SE for the VAR or channel partner as a resource. Many of my friends in the industry are SEs and are some of the best and the brightest folks I know, and they all came from the Ops side at some point. Use them as a resource to learn about the solutions they are proposing, and ask them challenging questions.

Be honest with your organization. This isn’t about what you want. Your value will increase when you can demonstrate repeatedly that you have their best interests in mind.

After the decision
If the proposal was approved, focus on delivering at least some results fast. Then showcase the win and how IT can help solve organizational challenges. This may sound like self promotion, but it is not if done right. The wins are for the organization, not you. This establishes trust, and lays the groundwork for the future. Use company newsletters, or establish a monthly IT Review to share updates.

If it was denied, don’t take it personally. It is great to show passion, but don’t confuse passion with what you are really trying to do: helping your organization make the best strategic and financial decision for them. Would it be gratifying to get a new Datacenter revamp through, only to realize it was the financial tipping point of the organization just a few months later? Keep it all in perspective. Besides, some of the best purchasing decisions I’ve been involved with were the ones that were ultimately rejected, which gave solutions a chance to mature, and gave me an opportunity to find a different way to solve a problem.

Try doing your own proposal or presentation retrospective. What went well and what didn’t. Ask for feedback on how it went. You might be surprised at the responses you get.

Conclusion
You have the unique opportunity to be the technology advocate for the organization rather than simply a burden to the budget.  Do I get everything approved?  Of course I don’t, but a well prepared proposal will allow you and your organization to make the smartest decisions possible, and help IT deliver great results.

Testing InfiniBand in the home lab with PernixData FVP

One of the reasons I find the latest trends in datacenter architectures so interesting is the innovative approaches used to address deficiencies associated with more traditional arrangements. These innovations have been able to drive more of what almost everyone needs; better storage performance and better scalability.

The caveat to some of these newer arrangements is that they can put heavy stress on the plumbing that connects the servers. Distributed storage technologies like VMware VSAN, or clustered write buffering techniques used by PernixData FVP and Atlantis Computing’s USX, leverage these interconnects to accelerate storage traffic. Turn-key Hyperconverged solutions do too, but they enjoy the luxury of having full control over the hardware used. Some of these software based solutions might need some retrofitting of an environment to run optimally or meet their requirements (read: 10GbE or better). The desire for the fastest interconnect possible between hosts doesn’t always align with budget or technical constraints, so it makes the most sense to first see what impact there really is.

I wanted to test the impact better bandwidth between servers would have, but due to constraints in my production environment, I needed to rely on my home lab. As much as I wanted to throw 10GbE NICs in my home lab, the price points were too high. I had to do it another way. Enter InfiniBand. I’m certainly not the only one to try InfiniBand in a home lab, but I wanted to focus on two elements that are critical to the effectiveness of replica traffic: the overall bandwidth of the pipe and, equally important, the latency. While I couldn’t simulate the exact workload that I see in my production environment, I could certainly take smaller snippets of I/O patterns that I see, and model them the best I can.

InfiniBand is really interesting. As Joeb Jackson put it in a NetworkWorld.com article, "InfiniBand is architecturally sacrilegious" as it combines many layers of the OSI model. The results can be stunning: transport latencies in the 2 microsecond neighborhood, and a healthy roadmap to 200Gbps and beyond. It’s sort of like the ’66 AC Shelby Cobra of data transports. Simple, and perhaps a little rough around the edges, but brutally fast. While it doesn’t have the ubiquity of Ethernet over RJ-45, it also doesn’t have the latencies that are still a part of even the faster forms of Ethernet.

At the time of this writing, the InfiniBand drivers for ESXi 5.5 weren’t quite ready for VSAN testing yet, so the focus of this testing is to see how InfiniBand behaves when used in a PernixData FVP deployment. I hope to publish a VSAN edition in the future. I simply wanted to better understand if (and how much) a faster connection would improve the transmission of replica traffic when using FVP in WB+1 mode (local flash, and 1 peer). My production environment is very write intensive, and uses 1GbE for the interconnects. Any insight gained here will help in my design and purchasing roadmap for my production environment.

Testing:
Testing occurred on a two host cluster backed by a Synology DS1512+. Local flash leveraged SATA III based EMLC SSD drives using an onboard controller. 1GbE interconnects traversed a Cisco SG300-20 using a 1500 byte MTU size. For InfiniBand, each host used a Mellanox MT25418 DDR 2 port HCA that offered 10Gb per connection. They were directly connected to each other, and used a 2044 byte MTU size. InfiniBand can be set to 4092 bytes but for compatibility reasons under ESXi 5.5, 2044 is the desired size.
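
For reference, the non-default MTU is applied to the vSwitch and VMkernel port carrying the interconnect traffic. On ESXi, the general commands look like this (vSwitch1 and vmk1 are placeholders for whatever carries your replica traffic):

    esxcli network vswitch standard set -v vSwitch1 -m 2044    # vSwitch MTU
    esxcli network ip interface set -i vmk1 -m 2044            # matching VMkernel port MTU
    esxcli network vswitch standard list -v vSwitch1           # verify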

I tend to prefer testing that relies on observational patterns versus one final, empirical number. These tests were no different, and while they attempt to simulate a very brief snippet of a workload in my production environment, I find that I still gain a much better understanding from a time based performance graph than from an isolated final number.

The test case was a simple one, but would be enough to illustrate the differences I was hoping to see. The test consisted of a 2 vCPU VM using 2 workers on a 100% write, 100% random workload lasting for 1 minute. The test was run three times: first with WB+0 (no peer/replica traffic), then WB+1 (one peer) using a 1GbE connection, and finally WB+1 over a single 10Gb InfiniBand connection. Each screen capture I provide will show them in that order. That test case was repeated 3 times: first with 256KB I/O sizes, followed by 32KB, then 4KB. I ran the tests several times in different order to ensure I wasn’t introducing inflated or deflated performance due to previous tests or caching. All were repeated several times to flush out any anomalies.
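
For those who want to approximate the same I/O pattern, a roughly equivalent workload definition expressed with fio is below. This is not the tool used for these tests, and the target file path is a placeholder; it simply describes the shape of the workload (swap the block size for the 32KB and 4KB runs).

    fio --name=wb-test --filename=/test/fio-testfile --size=4g \
        --rw=randwrite --bs=256k --numjobs=2 --iodepth=4 \
        --direct=1 --ioengine=libaio --time_based --runtime=60 --group_reporting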


256KB I/O size test
Testing results using this I/O size are rarely published anywhere, because they never bode well in comparison to a smaller I/O size like 4KB. But my production workloads (compiling) often deal with these I/O sizes, so it is important for me to understand their behavior.

IOPS with 256KB I/O
256KB-IOPS

Latency with 256KB I/O
256KB-Latency

Throughput using 256KB I/O
256KB-Throughput

Observations from 256KB I/O test
Note that the IOPS and effective throughput on the WB+1 using InfiniBand was nearly identical to that of a WB+0 (local flash only) scenario. You can also see how much the 1GbE interface throttled down the performance, driving just half of the IOPS and throughput compared to InfiniBand. But also take a look at the terrible native latency (70ms) of large I/O sizes even when using WB+0 (no peer traffic. Just local flash). Also note that when peer traffic performance is improved, the larger backlog of data in the destager occurs.

32KB I/O size test
Just 1/8th the size of a 256KB I/O, this is still larger than most storage vendors like to advertise in their testing. My production workload often oscillates between 32KB and 256KB I/Os.

IOPS with 32KB I/O
32KB-IOPS

Latency with 32KB I/O
32KB-Latency

Throughput using 32KB I/O
32KB-Throughput

Observations from 32KB I/O test
Once again, the IOPS and effective throughput on WB+1 using InfiniBand were nearly identical to the WB+0 (local flash only) scenario. You can also see how much the 1GbE interface throttled down the throughput. Latency saw only a minor improvement moving away from 1GbE, as the latency of the flash itself was about 6ms.

4KB I/O size test
This is the most common of I/O sizes you might see, although it is more common on reads than writes. At 1/64th the size of a 256KB I/O, it is tiny compared to the others, but important to test because the goal was to learn if, and how much, a fatter, lower latency pipe helps at various I/O sizes.

IOPS with 4KB I/O
4KB-IOPS

Latency with 4KB I/O
4KB-Latency

Throughput using 4KB I/O
4KB-Throughput

Observations from 4KB I/O test
IOPS and effective throughput on WB+1 using InfiniBand were nearly identical to the WB+0 (local flash only) scenario. But as the I/O sizes shrink, so does the effective total/concurrent payload size, so the differences between InfiniBand and 1GbE were smaller than on the tests with larger I/O. Latencies at this I/O size were around 2ms.

Other observations that stood out
One of the first things that stood out is illustrated below, with two 5 minute test runs. Look at where the two arrows point. The arrow on the left points to the number of packets sent while using 1GbE. The arrow on the right shows the number of packets sent while using 10Gb InfiniBand. Quite a difference. Also notice that the effective throughput started out higher, but had to throttle back.

Packetstransmitted

Findings:
The key takeaways from these tests:

  • A high bandwidth, low latency interconnect like InfiniBand can virtually eliminate any write redundancy penalty incurred in WB+1 mode.
  • For a single workload, I/O sizes of 32KB and 256KB saw between 65% and 90% improvement in IOPS and throughput. I/O sizes of 4KB saw essentially no improvement (though many concurrent 4KB workloads likely would see a benefit).
  • Writes using larger I/O sizes were the clear beneficiary of a fatter pipe between servers. However, the native latencies of the flash devices under larger I/O sizes could not take advantage of the low latencies of InfiniBand. In other words, with large I/O sizes, the flash devices themselves, or the bus they were using, were by far the major impediment to lower latency and faster I/O delivery.
  • The smaller pipe of 1GbE throttled back the flash device’s ability to ingest the data as fast as InfiniBand. There was always a smaller amount of outstanding writes left in the destager once the test was complete, but it came at the cost of poorer performance for 1GbE.

A few other matters can come up when attempting to accurately interpret latencies. As VMware KB 2036863 points out, reporting latencies accurately can sometimes be a challenge. Just something to be aware of.

Conclusion
InfiniBand was my affordable way to test how a faster interconnect would improve the ability of FVP to accelerate replica storage I/O.  It lived up to the promise of high bandwidth with low latency. However, effective latencies were ultimately capped by the SSDs, the controller, or the bus they were using. I did not have the opportunity to test other flash technologies, such as PCIe based solutions from Fusion-io or Virident, or the memory channel based solution from Diablo Technologies. But based on the above, it seems clear that how quickly the flash can ingest the data is crucial to the overall performance of whatever solution is using it.

Helpful Links
Erik Bussink’s great post on using InfiniBand with vSphere 5.5
http://www.bussink.ch/?p=1306 

Vladan Seget’s post on incorporating InfiniBand into his backing storage
http://www.vladan.fr/homelab-storage-network-speedup/

Mellanox, OFED and OpenSM bundles
https://my.vmware.com/web/vmware/details/dt_esxi50_mellanox_connectx/dHRAYnRqdEBiZHAlZA==
http://www.mellanox.com/downloads/Drivers/MLNX-OFED-ESX-1.8.1.0.zip
http://files.hypervisor.fr/zip/ib-opensm-3.3.16-64.x86_64.vib

Using the Cisco SG300-20 Layer 3 switch in a home lab

One of the goals when building up my home lab a few years ago was to emulate a simple production environment that would give me a good platform to learn and experiment with. I’m a big fan of nested labs, and use one on my laptop often. But there are times when you need real hardware to interact with. This has come up even more than I expected, as recent trends with leveraging flash on the host have resulted in me stuffing more equipment back in the hosts for testing and product evaluations.

Networking is the other area that can be helpful to have equipment that at least tries to mimic what you’d see in a production environment. Yet the options for networking in a home lab have typically been limited for a variety of reasons.

  • The real equipment is far too expensive, or too loud for most home lab needs.
  • Searching on eBay or Craigslist for a retired production unit can be risky. Some might opt for this strategy, but this can result in a power sucking, 1U noise maker that may have some dead ports on it, or worse, bricked upon arrival.
  • Consumer switches can be disappointing. Rig up one that is lacking in features and port count, and you will be left wishing you hadn’t gone this route.

I wanted a fanless, full Layer 3 managed switch with a feature set similar to what you might find on an enterprise grade switch, but not at an enterprise grade price. I chose to go with a Cisco SG300-20. This is a 20 port, 1GbE, Layer 3 switch. With no fans, the unit draws as little as 10 watts.


Using a new tool to discover old problems

It is interesting what can be discovered when storage is accelerated. Virtual machines that were previously restricted by underperforming arrays now get to breathe freely.  They are given the ability to pass storage I/O as quickly as the processor needs. In other words, the applications that need the CPU cycles get to dictate your storage requirements, rather than your storage imposing artificial limits on your CPU.

With that idea in mind, a few things revealed themselves during the process of implementing PernixData FVP.  Early on, it was all about implementing and understanding the solution.  However, once the real world workloads began accelerating, there was real intrigue around the analytics that FVP was providing.  What was generating the I/O that was being accelerated?  What processes were associated with the other traffic not being accelerated, and why?  What applications were behind the changing I/O sizes?  And what was causing the peculiar I/O patterns that were showing up?  Some of these were questions raised at an earlier time (see: Hunting down unnecessary I/O before you buy that next storage solution ).  The trouble was, the tools I had to discover the pattern of data I/O were limited.

Why is this so important? In the spirit of reminding ourselves that no resource is an island, here is an example of a production code compile run, as seen from the perspective of the guest CPU. The first screen capture is the code compile with adequate storage I/O to support the application’s needs. A full build is running nearly perfect CPU utilization across all 8 of its vCPUs.  (screen shots taken from my earlier post; Vroom! Scaling up Virtual Machines in vSphere to meet performance requirements-Part 2)

image

Below is that very same code compile, under stressed backend storage. It took 46% longer to complete, and as you can see, it changes the CPU utilization of the build run.

image

The primary goal for this environment was to accelerate the storage. However, it would have been a bit presumptuous to conclude that all existing storage traffic is good, useful I/O. There is a significant amount of traffic originating from outside of IT, and the I/O generated needed to be understood better.  With the traffic passing more freely thanks to FVP acceleration, patterns that previously could not expose themselves should be more visible. This was the basis for the first discovery.

A little “CSI” work on the IOPS
Many continuous build systems use some variation of a polling mechanism to understand when there is newly checked in code that needs to be compiled. This should be a very lightweight process.  However, once storage performance was allowed to breathe better, the following patterns started showing up on all of the build VMs.

The image below shows the IOPS for one build VM during a one hour period of no compiling for that particular VM.  The VMs were polling for new builds every 5 minutes.  Yep, that “build heartbeat” was as high as 450 IOPS on each VM.

high-IOPS-heartbeat

Why wasn’t this noticed before?  These spikes were being suppressed by my previously overtaxed storage, which made them more difficult to see. They were all writes, and were translating into 500 to 600 steady state IOPS just to sit idle (as seen below from the perspective of the backing storage).

Array-VMFSvolumeIOPS

So what was the cause? As it turned out, the polling mechanism was using some source code control (SVN) calls to help the build machines understand if they needed to execute a build. Programmatically, the Development Team has no way of knowing whether the script they develop is efficient or not; they are separated from that layer of the infrastructure. (Sadly, I have a feeling this happens more often than not in general Application Development.) The result was a horribly inefficient method. After helping them understand the matter, it was revamped, and now polling for each VM only takes 1 to 2 IOPS every 5 minutes.
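
The scripts themselves aren’t mine to share, but the general shape of a lightweight poll is easy to illustrate. A hypothetical sketch (the repository URL and state file are placeholders) that compares the repository head revision against the last revision built, rather than exercising a working copy on every cycle:

    REPO="https://svn.example.com/repo/trunk"
    STATE="/var/tmp/last_built_rev"

    # One small query against the repository; no working copy churn on local disk
    HEAD_REV=$(svn info "$REPO" | awk '/^Revision:/ {print $2}')
    LAST_REV=$(cat "$STATE" 2>/dev/null || echo 0)

    if [ "$HEAD_REV" -gt "$LAST_REV" ]; then
        echo "$HEAD_REV" > "$STATE"
        # trigger the build here
    fi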

Idle-IOPS2

The image below shows how the accelerated cluster of 30 build VMs looks when there are no builds running.

Idle-IOPS

The inefficient polling mechanism wasn’t the only thing found. A few of the Linux build VMs had a rogue “Beagle” search daemon running on them. This crawler did just that: crawled, indexing data on these Linux machines and creating unnecessary I/O.  With Windows, indexers and other CPU and I/O hogs are typically controlled quite easily by GPO, but the equivalent services can creep into Linux systems if one is not careful.  It was an easy fix at least.

The cumulative benefit
Prior to the efforts of accelerating the storage and making the workload more efficient, the utilization of the arrays looked as the image below shows (6 hour period, from the perspective of the arrays).

Array-IOPS-before

Now, with the combination of understanding my workload better, and acceleration through FVP, that same workload looks like this (6 hour period, from the perspective of the arrays):

Array-IOPS-after

Notice that the estimated workload is far under the 100% it was regularly pegged at, 24 hours a day, 6 days a week.  In fact, during the workday, the arrays might only peak at 50% to 60% utilization.  When no builds are running, the continuous build system may only be drawing 25 IOPS from the VMFS volumes that contain the build machines, which is much more reasonable than where it was at.

With the combination of less pressure on the backing physical storage, and the magic of pooled flash on the hosts, the applications and CPU get to dictate how much storage I/O is needed.  Below is a screen capture of IOPS on a production build VM while compiling was being performed.  It was not known up until this point that a single build VM needed as much as 4,000 IOPS to compile code because the physical storage was never capable of satisfying that type of need.

IOPS-single-VM

Conclusion
Could some of these discoveries have been made without FVP?  Yes, perhaps some of them. But good analysis comes from being able to interpret data in a consumable way. It’s why various methods of data visualization such as bar graphs, pie charts, and X-Y-Z plots exist. FVP certainly has been doing a good job of accelerating workloads, but it also helps the administrator understand the I/O better.  I look forward to seeing how the analytics might expand in future tools or releases from PernixData.

A friend once said to me that the only thing better than a new tractor is a reason to use it. In many ways, the same goes for technology. Virtualization might not even be that fascinating unless you had real workloads to run on top of it. Ditto for PernixData FVP. When applied to real workloads, the magic begins to happen, and you learn a lot about your data in the process.

Observations of PernixData FVP in a production environment

Since my last post, "Accelerating storage using PernixData’s FVP. A perspective from customer #0001" I’ve had a number of people ask me questions on what type of improvements I’ve seen with FVP.  Well, let’s take a look at how it is performing.

The cluster I’ve applied FVP to is dedicated to the purpose of compiling code. Over two dozen 8 vCPU Linux and Windows VMs churn out code 24 hours a day. It is probably one of the more challenging environments to improve, as accelerating code compiling is inherently a very difficult task.  Massive amounts of CPU using a highly efficient, multithreaded compiler, a ton of writes, and throw in some bursts of reads for good measure.  All of this occurs in various orders depending on the job. Sounds like fun, huh?

Our full builds benefited the most by our investment in additional CPU power earlier in the year. This is because full compiles are almost always CPU bound. But incremental builds are much more challenging to improve because of the dialog that occurs between CPU and disk. The compiler is doing all sorts of checking, throughout the compile. Some of the phases of an incremental build are not multithreaded, so while a full build offers nearly perfect multithreading on these 8 vCPU build VMs, this just isn’t the case on an incremental build.

Enter FVP
The screen shots below will step you through how FVP is improving these very difficult to accelerate incremental builds. They are broken down into the categories that FVP divides them into: IOPS, Latency, and Throughput.  Also included is a CPU utilization metric, because they all have an indelible tie to each other. Some of the screen shots are from the same compile run, while others are not. The point here is to show how it is accelerating, and more importantly, how to interpret the data. The VM being used here is our standard 8 vCPU Windows VM with 8GB of RAM.  It has write-back caching enabled, with a write redundancy setting of "Local flash and 1 network flash device."


IOPS
Below is an incremental compile on a build VM during the middle of the day. The magenta line shows what is being satisfied by the backing data store, and the blue line shows the Total Effective IOPS after flash is leveraged. The key to remember in this view is that it does not distinguish between reads and writes. If you are doing a lot of "cold reads," the magenta "data store" line and the blue "Total Effective" line may very well overlap.

PDIOPS-01

This is the same metric, but toggled to the read/write view. In this case, you can see below that a significant amount of acceleration came from reads (orange). For as much writing as a build run takes, I never knew a single build VM could use 1,600 IOPS or more of reads, because my backing storage could never satisfy the request.

PDIOPS-02

CPU
Allowing the CPU to pass the I/O as quickly as it needs to does one thing: it allows the multithreaded compiler to maximize CPU usage. During a full compile, it is quite easy to max out an 8 vCPU system and have sustained 100% CPU usage, but again, these incremental compiles are much more challenging. What you see below is the CPU utilization associated with the VM running the build. It shows a significant improvement in an incremental build when using acceleration. A non accelerated build would rarely get above 60% CPU utilization.

CPU-01

Latency
At a distance, this screen grab probably looks like a total mess, but there is really great data behind it. Why? The need for high IOPS is dictated by the VMs demanding it. If a VM doesn’t demand it, you won’t see it. But where acceleration comes in more often is reduced latency, on both reads and writes. The most important line here is the blue line, which represents the total effective latency.

PDLatency-01

Just as with other metrics, the latency reading can often be a bit misleading in the default "Flash/Datastore" view. This view does not distinguish between reads and writes, so a cold read pulling off of spinning disk will have the traditional amounts of latency you are familiar with. This can skew your interpretation of the numbers in the default view. For all measurements (IOPS, Throughput, Latency) I often find myself toggling between this view and the read/write view. Here you can see how a cold read sticks out like a sore thumb. The read/write view is where you would go to understand individual read and write latencies.

PDLatency-02

Throughput
While a throughput chart can often look very similar to the IOPS chart, you might want to spend a moment and dig a little deeper. You might find some interesting things about your workload. Here, you can see the total effective throughput significantly improved by caching.

PDThroughput-01

Just as with the other metrics, toggling it into read/write view will help you better understand your reads and writes.

PDThroughput-02

The IOPS, Throughput & Latency relationship
It is easy to overlook the relationship that IOPS, throughput, and latency have to each other. Let me provide a real world example of how one can influence the other. The following represents the early and middle phases of a code compile run. This is the FVP "read/write" view of this one VM. Green indicates writes. Orange indicates reads. Blue indicates "Total Effective" (often hidden by the other lines).
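
Before digging into the charts, it helps to keep the basic arithmetic in mind: throughput is simply IOPS multiplied by I/O size. Using round, illustrative numbers (not the exact figures from these charts):

    2,000 IOPS x 32KB  =  ~62 MB/s
      500 IOPS x 256KB = ~125 MB/s

So a drop in IOPS can coincide with a rise in throughput if the I/O sizes grow, which is exactly what the charts below show.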

First, IOPS (green). High write IOPS at the beginning, yet relatively low write IOPS later on.

IOPS

Now, look at write throughput (green) below for that same time period of the build.  A modest amount of throughput at the beginning where the higher IOPS were, followed by much higher throughput later on when IOPS had dropped. This is the indicator of changing I/O sizes from the applications generating the data.

throughput

Now look at write latency (green) below. Extremely low latency (sub 1ms) with smaller I/O sizes. Higher latency on the much larger I/O sizes later on. By the way, the high read latencies generally come from cold reads that were served from the backing spindles.

latency

The findings here show that early on in the workflow, where SVN is doing a lot of its prep work, a 32KB I/O size is typically used for writes.  The write IOPS are high, throughput is modest, and latency comes in at sub 1ms. Later on in the run, the compiler itself uses much larger I/O sizes (128KB to 256KB). IOPS are lower, but throughput is very high. Latency suffers (approaching 8ms) with the significantly larger I/O sizes. There are other factors influencing this, which I will address in an upcoming post.

This is one of the methods to determine your typical I/O size to provide a more accurate test configuration for Iometer, if you choose to do additional benchmarking. (See: Iometer.  As good as you want to make it.)

Other observations

1.  After you have deployed an FVP cluster into production, your SAN array monitoring tool will most likely show an increase in your write percentage compared to your historical numbers. This is quite logical when you think about it: all writes, even when accelerated, eventually make it to the data store (albeit in a much more efficient way), while many of your reads may be satisfied by FVP, and never hit the array.

2.  When looking at a summary of the FVP at the cluster level, I find it helpful to click on the "Performance Map" view. This gives me a weighted view of how to distinguish what is being accelerated most during the given sampling period.

image

3. In addition to the GUI, the VM write caching settings can easily be managed via PowerShell. This might be a good step to take if the cluster tripped over to UPS power.  Backup infrastructures that do not have a VADP capable proxy living in the accelerated cluster might also need to rely on some PowerShell scripts. PernixData has some good documentation on the matter.

Conclusion
PernixData FVP is doing a very good job of accelerating a very difficult workload. I would have loved to show you data from accelerating a more typical workload such as Exchange or SQL, but my other cluster containing those systems is not accelerated at this time. Stay tuned for the next installment, as I will show you what was discovered as I started looking at my workload more closely.

- Pete

Accelerating storage using PernixData’s FVP. A perspective from customer #0001

Recently, in "Hunting down unnecessary I/O before you buy that next storage solution," I described the efforts around addressing "technical debt" that was contributing to unnecessary I/O. The goal was to get better performance out of my storage infrastructure. It’s been a worthwhile endeavor that I would recommend to anyone, but at the end of the day, one might still need faster storage. That usually means freeing up another 3U of rack space, and opening up the checkbook.

Or does it?  Do I have to go the traditional route of adding more spindles, or investing heavily in a faster storage fabric?  Well, the answer was an unequivocal "yes" not too long ago, but times are a changing, and here is how I chose to tackle the problem in a radically different way.

I’ve chosen to delay any purchases of an additional storage array, or the infrastructure backing it, and opted to go with PernixData FVP.  In fact, I was customer #0001 after PernixData announced GA of FVP 1.0.  So why did I go this route?

1.  Clustered host based caching.  Leveraging server side flash brings compute and data closer together, but thanks to FVP, it does so in a way that works in a highly available clustered fashion that aligns perfectly with the feature sets of the hypervisor.

2.  Write-back caching. The ability to deliver writes to flash is really important. Write-through caching, which waits for the acknowledgement from the underlying storage, just wasn’t good enough for my environment. Rotational latencies, as well as physical transport latencies would still be there on over 80% of all of my traffic. I needed true write-back caching that would acknowledge the write immediately, while eventually de-staging it down to the underlying storage.

3.  Cost. The gold plated dominos of upgrading storage is not fun for anyone on the paying side of the equation. Going with PernixData FVP was going to address my needs for a fraction of the cost of a traditional solution.

4.  It allows for a significant decoupling of the "storage for capacity" versus "storage for performance" dilemma when addressing additional storage needs.

5.  Another array would have been, to a certain degree, more of the same. Incremental improvement, with less than enthusiastic results considering the amount invested.  I found myself not very excited to purchase another array. With so much volatility in the storage market, it almost seemed like an antiquated solution.

6.  Quick to implement. FVP installation consists of installing a VIB via Update Manager or the command line, installing the Management services and vCenter plugin, and you are off to the races (see the command sketch after this list).

7.  Hardware independent.  I didn’t have to wait for a special controller upgrade or firmware update, or wonder if my hardware would work with it (a common problem with storage array solutions). Nor did I have to decide to go with a different storage vendor just to try a new technology.  It is purely a software solution with the flexibility of working with multiple types of flash: SSDs, or PCIe based.
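
For what it’s worth, the command line portion of the installation in item 6 follows the standard ESXi pattern for an offline bundle. The bundle path and name below are placeholders; follow the vendor documentation for the actual package.

    esxcli software vib install -d /vmfs/volumes/datastore1/bundles/vendor-bundle.zip
    esxcli software vib list | grep -i pernix    # confirm the extension is present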

A different way to solve a classic problem
While my write intensive workload is pretty unique, my situation is not.  Our storage performance needs outgrew what the environment was designed for: capacity at a reasonable cost. This is an all too common problem.  With the increased capacities of spinning disks, the problem has actually gotten worse, not better.  Fewer and fewer spindles are serving up more and more data.

My goal was to deliver the results our build VMs were capable of delivering with faster storage, but unable to because of my existing infrastructure.  For me it was about reducing I/O contention to allow the build system CPU cycles to deliver the builds without waiting on storage.  For others it might be delivering lower latencies to their SQL backed ERP or CRM servers.

The allure of utilizing flash has been an intriguing one.  I often found myself looking at my vSphere hosts and all of their processing goodness, but disappointed that those SSDs sitting in the hosts couldn’t help to augment my storage performance needs.  Being an active participant in the PernixData beta program allowed me to see how it would help in my environment, and whether it would deliver on the needs of the business.

Lessons learned so far
Don’t skimp on quality SSDs.  Would you buy an ESXi host with one physical core?  Of course you wouldn’t. Same thing goes with SSDs.  Quality flash is a must! I can tell you from first hand experience that it makes a huge difference.  I thought the Dell OEM SSDs that came with my M620 blades were fine, but by way of comparison, they were terrible. Don’t cripple a solution by going with cheap flash.  In this 4 node cluster, I went with 4 EMLC based, 400GB Intel S3700s. I also had the opportunity to test some Micron P400M EMLC SSDs, which also seemed to perform very well.

While I went with 400GB SSDs in each host (giving approximately 1.5TB of cache space for a 4 node cluster), I did most of my testing using 100GB SSDs. They seemed adequate in that they were not showing a significant amount of cache eviction, but I wanted to leverage my purchasing opportunity to get larger drives. Knowing the best size can be a bit of a mystery until you get things in place, but having a larger cache size allows for a larger working set of data available for future reads, as well as giving head room for the per-VM write-back redundancy setting available.

An unexpected surprise is how FVP has given me visibility into the one area of I/O monitoring that is traditionally very difficult to see: I/O patterns. See Iometer. As good as you want to make it.  Understanding this element of your I/O needs is critical, and the analytics in FVP have helped me discover some very interesting things about my I/O patterns that I will surely be investigating in the near future.

In the read-caching world, the saying goes that the fastest storage I/O is the I/O the array never sees. Well, with write caching, the I/O eventually needs to be de-staged to the array.  While FVP will improve delivery of storage to the array by absorbing the I/O spikes and turning random writes into sequential writes, the I/O will still eventually have to be delivered to the backend storage. In a more write intensive environment, if the delta between your fast flash and your slow storage is significant, and the duty cycle of the applications driving the I/O is also significant, there is a chance it might not be able to keep up.  It might be a corner case, but it is possible.

What’s next
I’ll be posting more specifics on how running PernixData FVP has helped our environment.  So, is it really "disruptive" technology?  Time will ultimately tell.  But I chose not to purchase an array, along with new SAN switchgear, because of it.  Using FVP has led to less traffic on my arrays, with higher throughput and lower read and write latencies for my VMs.  Yeah, I would qualify that as disruptive.

 

Helpful Links

Frank Denneman – Basic elements of the flash virtualization platform – Part 1
http://frankdenneman.nl/2013/06/18/basic-elements-of-the-flash-virtualization-platform-part-1/

Frank Denneman – Basic elements of the flash virtualization platform – Part 2
http://frankdenneman.nl/2013/07/02/basic-elements-of-fvp-part-2-using-own-platform-versus-in-place-file-system/

Frank Denneman – FVP Remote Flash Access
http://frankdenneman.nl/2013/08/07/fvp-remote-flash-access/

Frank Denneman – Design considerations for the host local FVP architecture
http://frankdenneman.nl/2013/08/16/design-considerations-for-the-host-local-architecture/

Satyam Vaghani introducing PernixData FVP at Storage Field Day 3
http://www.pernixdata.com/SFD3/

Write-back deepdive by Frank and Satyam
http://www.pernixdata.com/files/wb-deepdive.html

Hunting down unnecessary I/O before you buy that next storage solution

Are legacy processes and workflows sabotaging your storage performance? If you are on the verge of committing good money for more IOPS, or lower latency, it might be worth taking a look at what is sucking up all of those I/Os.

In my previous posts about improving the performance of our virtualized code compiling systems, storage performance was identified as a key factor in our ability to leverage our scaled-up compute resources. The classic response to this dilemma has been to purchase faster storage. While that might be a part of the ultimate solution, there is another factor worth looking into: legacy processes, and how they might be impacting your environment.

Even though new technologies are helping deliver performance improvements, one constant is that traditional, enterprise class storage is expensive. Committing to faster storage usually means committing large chunks of dollars to the endeavor. That can be hard to swallow at budget time, or may not align well with immediate needs. And there can certainly be a domino effect when improving storage performance: if your fabric cannot support a fancy new array, the protocol type, or the speed, get ready to spend even more money.

Calculated Indecision
In the optimization world, there is an approach called "delay until the last responsible moment" (LRM). Do not mistake this for a procrastinator’s creed of "kicking the can down the road." It is a pretty effective, Agile-esque strategy for hedging against poor or premature purchasing decisions, in this case against the rapidly changing world of enterprise infrastructure. Even within the last few years, some companies have challenged traditional thinking about how storage and compute are architected. LRM accommodates this rapid change, and has the ability to save a lot of money in the process.

Look before you buy
Writes are what you design around and pay big money for, so wouldn’t it be logical to look at your infrastructure to see if legacy processes are undermining your ability to deliver I/O? That is the step I took in an effort to squeeze every bit of performance that I could out of my existing arrays before committing to a new solution. My quick evaluation resulted in this:

  • Array based snapshotting for short term protection was eating up way too much capacity: 25 to 30TB. That is almost half of my total capacity, and all for a retention window that wasn’t very good. How does capacity relate to performance? Well, if one doesn’t need all of that capacity for snapshot or replica reserves, one might be able to run at a better performing RAID level. Imagine being able to cut the write penalty by 2 to 3 times if you are currently running RAID levels focused on capacity. For a write-intensive environment like mine, that is a big deal.
  • Legacy I/O intensive applications and processes needed to be identified. What are they? Can they be adjusted, or are they even needed anymore?

I didn’t need to do a formal analysis for much of this; I knew the environment well enough to know what needed to be done. Here is what the plan of action has consisted of:

  • Ditch the array based snapshots and remote replicas in favor of Veeam. This is something that I had wanted to do for some time. Local and remote protection is now the responsibility of some large Synology NAS units serving as the backup target for Veeam. Everything about this combination has worked incredibly well. For those interested, I’ll be writing about this arrangement in the near future.
  • Convert existing guest attached volumes to native VMDKs. My objective is to let Veeam see the data so that it can protect it: integrated, compressed, and deduplicated, which is what it does best.
  • Reclaim all of the capacity gained from no longer using snaps and replicas, and rebuild one of the arrays from RAID 50 to RAID 10. This will cut the write penalty from 4 to 2 (a quick bit of math on what that means follows this list).
  • Adjust or eliminate legacy I/O intensive apps.
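
As a hypothetical illustration of that RAID change: with a write penalty of 4 (RAID 50), a set of spindles capable of 4,000 raw back-end IOPS can service only about 1,000 IOPS of front-end writes; at a penalty of 2 (RAID 10), those same spindles can service roughly 2,000 front-end write IOPS. The disks don’t change, only how many back-end operations each guest write costs.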

The Culprits
Here were the biggest legacy I/O intensive offenders ("legacy" meaning after the incorporation of Veeam).  Total time per day is shown below, and may reflect different backup frequencies.

Source:  Legacy SharePoint backup solution
Cost:  300 write IOPS for 8 hours per day
Action:  This can be eliminated because of Veeam

Source:  Legacy Exchange backup solution
Cost:  300 write IOPS for 1 hour per day
Action:  This can be eliminated because of Veeam

Source:  Source code (SVN) hotcopies and dumps
Cost:  200-700 IOPS for 12 hours per day.
Action:  Hotcopies will be eliminated, but SVN dumps will be redirected to an external target (a quick sketch of that redirect follows this list).  It is an optional, arguably redundant layer of protection, but source code is the lifeblood of a software company, so it is worth the overhead right now.

Source:  Guest attached Volume Copies
Cost:  Heavy read IOPS on mounted array snapshots when dumping to external disk or tape.
Action:  Guest attached volumes will be converted to native VMDKs so that Veeam can see and protect the data.
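
Regarding that SVN dump redirect, a minimal sketch of what it can look like, assuming a hypothetical repository path and external mount point:

svnadmin dump /srv/svn/myrepo | gzip > /mnt/external-backup/myrepo-$(date +%F).svndump.gz   # full dump, compressed, written off-array

Incremental dumps (svnadmin dump with -r and --incremental) can reduce the I/O further if a full dump every cycle is overkill.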

Notice the theme here? Many of the opportunities for reducing I/O had to do with legacy “in-guest” methods of protecting data.  Moving to a hypervisor centric backup solution like Veeam has also reinforced a growing feeling I’ve had about storage array specific features that focus on data protection: I’ve grown to be uninterested in them.  Here are a few reasons why.

  • It creates an inseparable tie between your virtualization infrastructure, your protection, and your storage. We all love the virtues of virtualizing compute, so why make protection mechanisms dependent on a particular kind of storage? Abstract it out, and it becomes much easier.
  • Need more replica space? Buy more arrays. Need more local snapshot space? Buy more arrays. You end up keeping protection data on pretty expensive storage.
  • Modern backup solutions protect the VMs, the applications, and the data better. Application awareness may have been lacking years ago, but not anymore.
  • Vendor lock-in. I’ve never bought into this argument much, mostly because you end up having to make a commitment at some point with just about every financial decision you make. However, adding more storage arrays can eat up an entire budget in an SMB/SME world. There has to be a better way.
  • Complexity. You end up having a mess of methods of how some things are protected, while other things are protected in a different way. Good protection often comes in layers, but choosing a software based solution simplifies the effort.

I used to live by array specific tools for protecting data. They were all I had, and they served a very good purpose.  I leveraged them as much as I could, but in hindsight, they can make a protection strategy very complex, fragile, and completely dependent on sticking with that line of storage solutions. Use a solution that hooks into the hypervisor via the vCenter API, and let it do the rest.  Storage vendors should focus on what they do best: figuring out ways to deliver bigger, better, and faster storage.

What else to look for
Other possible sources that are robbing your array of I/Os:

  • SQL maintenance routines (dumps, indexing, etc.). While necessary, you may choose to run these during off-peak hours.
  • Defrags. Surely you have a GPO shutting off this feature on all of your VMs, correct? (hint, hint)
  • In-guest legacy anything. Traditional backup agents are terrible. Virus scans aren’t much better.
  • User practices.  Don’t be surprised if you find some department doing all sorts of silly things that translate into heavy writes (e.g. “We copy all of our data to this other directory hourly to back it up.”).
  • Guest attached volumes. While they can be technically efficient, you have to protect them in other ways because they are not visible from vCenter. Often this results in some variation of making an array based snapshot available to a backup application. While it is "off-host" to the production system, this method takes a very long time, whether the target is disk or tape.

One might think that eventually the data has to be committed to external disk or tape anyway, so what does it matter?  With file level backups, it matters a lot.  For instance, committing 9TB of guest attached volume data (millions of smaller files) directly to tape takes nearly 6 days to complete.  Committing 9TB of Veeam backups to tape takes just a little over 1 day.
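
The back-of-the-envelope math shows why: 9TB over roughly 6 days works out to somewhere around 17 MB/s sustained, which is about what millions of small files written straight to tape will get you. The same 9TB as Veeam backup files in a little over a day is closer to 100 MB/s, because the tape drive is streaming a handful of large sequential files instead of paying file level overhead on every object.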

The Results

So how much did these steps improve the average load on the arrays? This is a work in progress, so I don’t have the final numbers yet. But with each step, contention on my arrays decreases, and my protection strategy has become dramatically simpler in the process.

With all of that said, I will be addressing my storage performance needs with *something* new. What might that be?  Stay tuned.

Fixing host connection issues on Dell servers in vSphere 5.x

I had a conversation recently with a few colleagues at the Dell Enterprise Forum, and as they were describing the symptoms they were having with some Dell servers in their vSphere cluster, it sounded vaguely similar to what I had experienced recently with my new M620 hosts running vSphere 5.0 Update 2.  While I’m uncertain if their issues were related in any way to mine, it occurred to me that I might not have been the only one out there who ran into this problem.  So I thought I’d provide a post to help anyone else experiencing the behavior I encountered.

Symptoms
The new cluster of Dell M620 blades running vSphere 5.0 U2, used as our Development Team’s code compiling cluster, was randomly dropping connections.  Yep, not good.  This wasn’t normal behavior of course, and the effects ranged anywhere from the host still being up (but acting odd) to complete isolation of the host, with no success at a soft recovery.  The hosts had the latest firmware applied to them, and I used the custom Dell ESXi ISO when building them.  Each service (Mgmt, LAN, vMotion, storage) was meshed so that no service depended on a single, multiport NIC adapter, but connections still went down.  What was creating the problem?  I won’t leave you hanging.  It was the Broadcom network drivers for ESXi.

Before I figured out what the problem was, here is what I knew:

  • The behavior was only occurring on a cluster of 4 Dell M620 hosts.  The other cluster, containing M610s, never experienced this issue.
  • It had occurred on each host at least once, typically when there was a higher likelihood of heavy traffic.
  • Various services had been impacted.  One time it was storage, while another time it was the LAN side.

Blade configuration background
To understand the symptoms and the correction a bit better, it is worth getting an overview of what the Dell M620 blade looks like in terms of network connectivity.  What I show below reflects my 1GbE environment, and would look different with 10GbE, or with switch modules instead of passthrough modules.

The M620 blades come with a built-in Broadcom NetXtreme II BCM57810 10Gbps Ethernet adapter.  This provides two 10Gbps ports on fabric A of the blade enclosure.  These will negotiate down to 1GbE if you have passthroughs on the back of the enclosure, as I do.

image

There are two spots in each blade that accept additional mezzanine adapters, for fabric B and fabric C respectively.  In my case, since I also have 1GbE passthroughs on those fabrics, I chose the Broadcom NetXtreme BCM5719 1GbE adapter.  Each provides four 1GbE ports, but with passthroughs, only two of the four on each adapter are reachable.  The end result is six 1GbE ports available for use on each blade: two for storage, two for production LAN traffic, and two for vSphere Mgmt and vMotion.  All needed services (iSCSI, Mgmt, etc.) are assigned so that in the event of a single adapter failure, you’re still good to go.

image

And yes, I’d love to go to 10GbE as much as anyone, but that is a larger matter especially when dealing with blades and the enclosure that they reside in.  Feel free to send me a check, and I’ll return the favor with a nice post.

How to diagnose, and correct
In one of the cases, this event caused an All Paths Down from the host to my storage.  I looked in /scratch/log on the host, with the intent of looking into the vmkernel and vobd.log files to see what was up.  The following command returned several entries that looked like the ones below.

less /scratch/log/vobd.log

2013-04-03T16:17:33.849Z: [iscsiCorrelator] 6384105406222us: [esx.problem.storage.iscsi.target.connect.error] Login to iSCSI target iqn.2001-05.com.equallogic:0-8a0906-d0a034d04-d6b3c92ecd050e84-vmfs001 on vmhba40 @ vmk3 failed. The iSCSI initiator could not establish a network connection to the target.

2013-04-03T16:17:44.829Z: [iscsiCorrelator] 6384104156862us: [vob.iscsi.target.connect.error] vmhba40 @ vmk3 failed to login to iqn.2001-05.com.equallogic:0-8a0906-e98c21609-84a00138bf64eb18-vmfs002 because of a network connection failure.
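
If you want to pull just the iSCSI connection failures out of the noise, a quick grep against those same log files works too (the search strings here are just examples):

grep -i "iscsi" /scratch/log/vobd.log | less
grep -i "connection" /scratch/log/vmkernel.log | less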

Then I ran the following just to verify what I had for NICs and their associations

esxcfg-nics -l

Name    PCI           Driver      Link Speed     Duplex MAC Address       MTU    Description
vmnic0  0000:01:00.00 bnx2x       Up   1000Mbps  Full   00:22:19:9e:64:9b 1500   Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet
vmnic1  0000:01:00.01 bnx2x       Up   1000Mbps  Full   00:22:19:9e:64:9e 1500   Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet
vmnic2  0000:03:00.00 tg3         Up   1000Mbps  Full   00:22:19:9e:64:9f 1500   Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet
vmnic3  0000:03:00.01 tg3         Up   1000Mbps  Full   00:22:19:9e:64:a0 1500   Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet
vmnic4  0000:03:00.02 tg3         Down 0Mbps     Half   00:22:19:9e:64:a1 1500   Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet
vmnic5  0000:03:00.03 tg3         Down 0Mbps     Half   00:22:19:9e:64:a2 1500   Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet
vmnic6  0000:04:00.00 tg3         Up   1000Mbps  Full   00:22:19:9e:64:a3 1500   Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet
vmnic7  0000:04:00.01 tg3         Up   1000Mbps  Full   00:22:19:9e:64:a4 1500   Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet
vmnic8  0000:04:00.02 tg3         Down 0Mbps     Half   00:22:19:9e:64:a5 1500   Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet
vmnic9  0000:04:00.03 tg3         Down 0Mbps     Half   00:22:19:9e:64:a6 1500   Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet

Knowing what vmnics were being used for storage traffic, I took a look at the driver version for vmnic3

ethtool -i vmnic3

driver: tg3
version: 3.124c.v50.1
firmware-version: FFV7.4.8 bc 5719-v1.31
bus-info: 0000:03:00.1
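
For those who prefer to stay within esxcli rather than ethtool, the same driver and firmware details should also be retrievable this way (shown as an alternative, not the path I took above):

esxcli network nic get -n vmnic3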

Time to check and see if there were updated drivers.

Finding and updating the drivers
The first step was to check the VMware Compatibility Guide for this particular NIC.  The good news was that there was an updated driver for this adapter: 3.129d.v50.1.  I downloaded the latest driver (vib) for the NIC to a datastore accessible to the host so that it could be installed.  Making the driver available for installation, as well as the installation itself, can certainly be done with VMware Update Manager, but for my example, I’m performing these steps from the command line.  Remember to go into maintenance mode first.
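
Entering maintenance mode can also be done right from the shell if you are already there; either of the following should do the trick (evacuate or shut down the VMs on the host first):

vim-cmd hostsvc/maintenance_mode_enter
esxcli system maintenanceMode set --enable true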

esxcli software vib install -v /vmfs/volumes/VMFS001/drivers/broadcom/net-tg3-3.129d.v50.1-1OEM.500.0.0.472560.x86_64.vib

Installation Result
Message: The update completed successfully, but the system needs to be rebooted for the changes to be effective.
Reboot Required: true
VIBs Installed: Broadcom_bootbank_net-tg3_3.129d.v50.1-1OEM.500.0.0.472560
VIBs Removed: Broadcom_bootbank_net-tg3_3.124c.v50.1-1OEM.500.0.0.472560
VIBs Skipped:

The final steps are to reboot the host and verify the results.

ethtool -i vmnic3

driver: tg3
version: 3.129d.v50.1
firmware-version: FFV7.4.8 bc 5719-v1.31
bus-info: 0000:03:00.0
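
To confirm the package side of things as well, the newly installed VIB should show up in the software inventory:

esxcli software vib list | grep tg3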

Conclusion
I initially suspected that the problems were driver related, but the symptoms generated by the bad drivers gave the impression that a larger issue was at play.  Nevertheless, I couldn’t get the new drivers loaded fast enough, and since then (about 3 months), the hosts have been rock solid and behaving normally.

Helpful links
Determining Network/Storage firmware and driver version in ESXi
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1027206

VMware Compatibility Guide
http://www.vmware.com/resources/compatibility/search.php?deviceCategory=io&productid=19946&deviceCategory=io&releases=187&keyword=bcm5719&page=1&display_interval=10&sortColumn=Partner&sortOrder=Asc

My VMworld “Call for Papers” submission, and getting more involved

It is a good sign that you are in the right business when you get tremendous satisfaction from your career – whether it be from the daily challenges at work, or through professional growth, learning, or sharing.  It’s been an exciting month for me, as I’ve taken a few steps to get more involved.

First, I decided to submit my application for the 2013 VMware vExpert program.  I’ve sat on the sidelines, churning out blog posts for 4 years now, but with the encouragement of a few of my fellow VMUG comrades and friends, I decided to throw my hat in the ring with others equally enthusiastic about what many of us do for a living.  The list has not been announced yet, so we’ll see what happens.  I’m also now officially part of the Seattle VMUG steering committee, contributing where I can to provide more value to the local VMUG community.

Next, I was honored to be recognized as a 2013 Dell TechCenter Rockstar.  Started in 2012, the DTC Rockstar program recognizes subject matter experts and enthusiasts who share their knowledge on the portfolio of Dell solutions in the enterprise.  I am flattered to be in great company with the others who have been recognized for their efforts.  Congratulations to them as well.

And finally, I took a stab at submitting an abstract for consideration as a possible session at this year’s VMworld.  I can’t say I ever imagined a scenario in which I would be responding to VMware’s annual “Call for Papers,” but with real-life use cases come really interesting stories, and I had one.  My session title is:

4370 – Compiling code in virtual machines: Identifying bottlenecks and optimizing performance to scale out development environments

image

This session was inspired by part 1 and part 2 of “Vroom! Scaling up Virtual Machines in vSphere to meet performance requirements.”  What transpired from the project was a fascinating exercise in assumptions, bottleneck chasing, and a modern virtualized infrastructure’s ability to immediately scale up computational power for an organization.  I’ve received great feedback on those posts, but they just skimmed the surface of what was learned. What better way to demonstrate a unique use case than to share the details with those who really care?  Take a look at http://www.vmworld.com/cfp.jspa.  My submission is under the “Customer Case Studies” track, number 4730.  Public voting is now open.  If you don’t have a VMworld account, just create one – it’s free.  Click on the session to read the abstract, and if you like what you see, click the “thumbs up” button to vote for it.

Spend enough time in IT, and it turns out you might have an opinion or two on things.  How to make it all work, and how to keep your sanity.  I haven’t quite figured out the definitive answers to either one of those yet, but when there is an opportunity to contribute, I try my best to pay it forward to the great communities of geeks out there.  Thanks for reading.
