Accelerating storage using PernixData’s FVP. A perspective from customer #0001

Recently, I described in "Hunting down unnecessary I/O before you buy that next storage solution" the efforts around addressing "technical debt" that was contributing to unnecessary I/O. The goal was to get better performance out of my storage infrastructure. It’s been a worthwhile endeavor that I would recommend to anyone, but at the end of the day, one might still need faster storage. That usually means freeing up another 3U of rack space and opening the checkbook.

Or does it? Do I have to go the traditional route of adding more spindles, or investing heavily in a faster storage fabric? Well, the answer was an unequivocal "yes" not too long ago, but times are a-changing, and here is how I tackled the problem in a radically different way.

I’ve chosen to delay any purchases of an additional storage array, or the infrastructure backing it, and opted to go with PernixData FVP. In fact, I was customer #0001 after PernixData announced GA of FVP 1.0. So why did I go this route?

1. Clustered host based caching. Leveraging server side flash brings compute and data closer together, but thanks to FVP, it does so in such a way that works in a highly available clustered fashion that aligns perfectly with the feature sets of the hypervisor.

2. Write-back caching. The ability to deliver writes to flash is really important. Write-through caching, which waits for the acknowledgement from the underlying storage, just wasn’t good enough for my environment. Rotational latencies, as well as physical transport latencies would still be there on over 80% of all of my traffic. I needed true write-back caching that would acknowledge the write immediately, while eventually de-staging it down to the underlying storage.

3. Cost. The gold-plated dominoes of upgrading storage are not fun for anyone on the paying side of the equation. Going with PernixData FVP was going to address my needs for a fraction of the cost of a traditional solution.

4. It allows for a significant decoupling of the "storage for capacity" versus "storage for performance" dilemma when addressing additional storage needs.

5. Another array would have been, to a certain degree, more of the same. Incremental improvement, with less than enthusiastic results considering the amount invested. I found myself not very excited to purchase another array. With so much volatility in the storage market, it almost seemed like an antiquated solution.

6. Quick to implement. FVP installation consists of installing a VIB via Update Manager or the command line, installing the Management services and vCenter plugin, and you are off to the races.

7. Hardware independent. I didn’t have to wait for a special controller upgrade or firmware update, or wonder if my hardware would work with it (a common problem with storage array solutions). Nor did I have to make a decision to perhaps go with a different storage vendor if I wanted to try a new technology. It is purely a software solution with the flexibility of working with multiple types of flash: SSDs or PCIe-based devices.
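
The write-back point in item 2 can be shown with a toy latency model. This is only a sketch, not how FVP is implemented: the function name and the 0.2 ms and 8 ms device latencies are assumptions picked for illustration, and the write-through case assumes flash and the array are written in parallel.

```python
def write_ack_latency_ms(policy, flash_ms, array_ms):
    """Latency the VM sees before a write is acknowledged (toy model)."""
    if policy == "write-through":
        # The acknowledgement waits for the backing array, so the slower
        # of the two devices sets the latency.
        return max(flash_ms, array_ms)
    if policy == "write-back":
        # The write is acknowledged once flash has it; de-staging to the
        # array happens later, off the VM's critical path.
        return flash_ms
    raise ValueError(f"unknown policy: {policy}")


# Hypothetical device latencies, chosen only for illustration.
FLASH_MS, ARRAY_MS = 0.2, 8.0

for policy in ("write-through", "write-back"):
    print(policy, write_ack_latency_ms(policy, FLASH_MS, ARRAY_MS), "ms")
```

With a workload that is mostly writes, the write-through number is what most of the traffic would still experience, which is why read-only or write-through caching was not enough here.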

A different way to solve a classic problem
While my write intensive workload is pretty unique, my situation is not. Our storage performance needs outgrew what the environment was designed for: capacity at a reasonable cost. This is an all too common problem. The increased capacities of spinning disks have actually made this problem worse, not better. Fewer and fewer spindles are serving up more and more data.

My goal was to deliver the results our build VMs were capable of with faster storage, but could not reach because of my existing infrastructure. For me it was about reducing I/O contention to allow the build system CPU cycles to deliver the builds without waiting on storage. For others it might be delivering lower latencies to their SQL backed ERP or CRM servers.

The allure of utilizing flash has been an intriguing one. I often found myself looking at my vSphere hosts and all of their processing goodness, but disappointed that those SSDs sitting in the hosts couldn’t help to augment my storage performance needs. Being an active participant in the PernixData beta program allowed me to see how it would help me in my environment, and if it would deliver on the needs of the business.

Lessons learned so far
Don’t skimp on quality SSDs. Would you buy an ESXi host with one physical core? Of course you wouldn’t. Same thing goes with SSDs. Quality flash is a must! I can tell you from firsthand experience that it makes a huge difference. I thought the Dell OEM SSDs that came with my M620 blades were fine, but by way of comparison, they were terrible. Don’t cripple a solution by going with cheap flash. In this 4 node cluster, I went with 4 eMLC based, 400GB Intel S3700s. I also had the opportunity to test some Micron P400M eMLC SSDs, which also seemed to perform very well.

While I went with 400GB SSDs in each host (giving approximately 1.5TB of cache space for a 4 node cluster), I did most of my testing using 100GB SSDs. They seemed adequate in that they were not showing a significant amount of cache eviction, but I wanted to leverage my purchasing opportunity to get larger drives. Knowing the best size can be a bit of a mystery until you get things in place, but having a larger cache size allows for a larger working set of data available for future reads, as well as giving head room for the per-VM write-back redundancy setting available.

An unexpected surprise is how FVP has given me visibility into the one area of I/O monitoring that is traditionally very difficult to see: I/O patterns. See "Iometer. As good as you want to make it." Understanding this element of your I/O needs is critical, and the analytics in FVP have helped me discover some very interesting things about my I/O patterns that I will surely be investigating in the near future.

In the read-caching world, the saying goes that the fastest storage I/O is the I/O the array never will see. Well, with write caching, the data eventually needs to be de-staged to the array. While FVP will improve delivery of storage to the array by absorbing the I/O spikes and turning random writes into sequential writes, the I/O will still eventually have to be delivered to the backend storage. In a more write intensive environment, if the delta between your fast flash and your slow storage is significant, and the duty cycle of the applications driving the I/O is also significant, there is a chance the array might not be able to keep up. It might be a corner case, but it is possible.
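
That corner case can be reasoned about with a rough averaging model. This is a sketch under stated assumptions, not anything FVP reports: the function name, the 200 MB/s burst rate, and the 80 MB/s de-stage rate are all hypothetical, and decimal units (1 GB = 1000 MB) are used.

```python
def destage_backlog_gb(burst_mb_s, duty_cycle, destage_mb_s, hours):
    """Dirty data (GB) left in flash after `hours`, averaged over time.

    burst_mb_s   -- write rate while the applications are busy
    duty_cycle   -- fraction of time (0..1) the applications are busy
    destage_mb_s -- rate at which the array can absorb de-staged writes
    """
    average_in_mb_s = burst_mb_s * duty_cycle
    net_mb_s = average_in_mb_s - destage_mb_s
    if net_mb_s <= 0:
        return 0.0  # on average the array keeps up; no sustained backlog
    return net_mb_s * 3600 * hours / 1000


# Hypothetical numbers: the same burst rate at two different duty cycles.
print(destage_backlog_gb(200, 0.25, 80, hours=1))
print(destage_backlog_gb(200, 0.60, 80, hours=1))
```

At a 25% duty cycle the average incoming rate (50 MB/s) is below what the array can absorb, so the flash drains between bursts. At 60% the average (120 MB/s) exceeds it, and the backlog grows for as long as that load lasts.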

What’s next
I’ll be posting more specifics on how running PernixData FVP has helped our environment. So, is it really "disruptive" technology? Time will ultimately tell. But I chose to not purchase an array along with new SAN switchgear because of it. Using FVP has led to less traffic on my arrays, with higher throughput and lower read and write latencies for my VMs. Yeah, I would qualify that as disruptive.

Helpful Links

Frank Denneman – Basic elements of the flash virtualization platform – Part 1
http://frankdenneman.nl/2013/06/18/basic-elements-of-the-flash-virtualization-platform-part-1/

Frank Denneman – Basic elements of the flash virtualization platform – Part 2
http://frankdenneman.nl/2013/07/02/basic-elements-of-fvp-part-2-using-own-platform-versus-in-place-file-system/

Frank Denneman – FVP Remote Flash Access
http://frankdenneman.nl/2013/08/07/fvp-remote-flash-access/

Frank Denneman – Design considerations for the host local FVP architecture
http://frankdenneman.nl/2013/08/16/design-considerations-for-the-host-local-architecture/

Satyam Vaghani introducing PernixData FVP at Storage Field Day 3
http://www.pernixdata.com/SFD3/

Write-back deepdive by Frank and Satyam
http://www.pernixdata.com/files/wb-deepdive.html

Iometer. As good as you want to make it.

Most know Iometer as the go-to synthetic I/O measuring tool used to simulate real workload conditions. Well, somewhere, somehow, someone forgot the latter part of that sentence, which is why it ends up being so misused and abused. How many of us have seen a storage solution delivering 6 figure IOPS using Iometer, only to find that they are running a 100% read, 512 byte, 100% sequential access workload simulation? Perfect for the two people on the planet that those specifications might apply to. For the rest of us, it doesn’t help much. So why would they bother running that sort of unrealistic test? Pure, unapologetic number chasing.

The unfortunate part is that sometimes this leads many to simply dismiss Iometer results. That is a shame really, as it can provide really good data if used in the correct way. Observing real world data will tell you a lot of things, but the sporadic nature of real workloads makes it difficult to use for empirical measurement – hence the need for simulation.

So, what are the correct settings to use in Iometer? The answer is completely dependent on what you are trying to accomplish. The race for a million IOPS by your favorite storage vendor really means nothing if there is no correlation between their simulated workload and your real workload. Maybe IOPS isn’t even an issue for you. Perhaps your applications are struggling with poor latency. The challenge is to emulate your environment with a synthetic workload that helps you understand how a potential upgrade, new array, or optimization might be of benefit.

The mysteries of workloads
Creating a synthetic workload representing your real workload assumes one thing: that you know what your real workload really is. This can be more challenging than one might think, as many storage monitoring tools do not help you understand the subtleties of the patterns in the data that is being read or written.

Most monitoring tools tend to treat all I/O equally. By that I mean, say that over a given period of time you have 10 million I/Os occur. Let’s say your monitoring tells you that you average 60% reads and 40% writes. What is not clear is how many of those reads are multiple reads of the same data or completely different, untouched data. It also doesn’t tell you if the writes are overwriting existing blocks (which might be read again shortly thereafter) or generating new data. As more and more tiered storage mechanisms come into play, understanding this aspect of your workload is becoming extremely important. You may be treating your I/Os equally, but a tiered storage system using sophisticated caching algorithms certainly does not.
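
If you can get a block-level trace out of your environment, the missing detail is easy to compute. The sketch below assumes a hypothetical trace format of (operation, block address) pairs; the function name and the ten-entry sample trace are made up for illustration and do not come from any particular monitoring tool.

```python
from collections import Counter


def pct(part, whole):
    """Percentage helper that tolerates an empty denominator."""
    return 100 * part / whole if whole else 0.0


def io_reuse_summary(trace):
    """Summarize reuse in a trace of (op, block) pairs, op being "R" or "W"."""
    reads, writes = Counter(), Counter()
    for op, block in trace:
        (reads if op == "R" else writes)[block] += 1

    n_reads, n_writes = sum(reads.values()), sum(writes.values())
    return {
        "read_pct": pct(n_reads, n_reads + n_writes),
        # Reads that touch a block already read earlier in the trace.
        "reread_pct": pct(n_reads - len(reads), n_reads),
        # Writes that land on a block already written earlier in the trace.
        "overwrite_pct": pct(n_writes - len(writes), n_writes),
    }


# Made-up trace: 60% reads and 40% writes, like the example in the text.
sample = [("R", 1), ("R", 1), ("R", 1), ("R", 2), ("R", 3), ("R", 3),
          ("W", 7), ("W", 7), ("W", 8), ("W", 9)]
print(io_reuse_summary(sample))
```

Two workloads with the same 60/40 split can differ wildly on the re-read and overwrite figures, and those are the figures a caching tier actually responds to.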

How can you gain more insight? Use every tool at your disposal. Get to know your applications, and the duty cycles around them. What are your peak hours? Are they in the middle of the day, or in the middle of the night when backups are running?

Suggestions on Iometer settings
You may find that the settings you choose for Iometer yield results from your shared storage that aren’t nearly as good as you thought. But does it matter? If it is an accurate representation of your real workload, not really. What matters is whether you are able to deliver the payload from point A to point B to meet your acceptance criteria (such as latency, throughput, etc.). The goal would be to represent that in a synthetic workload for accurate measurement and comparison.

With that in mind, here are some suggestions for the next time you set up those Iometer runs.

1. Read/write ratio. Choose a realistic read/write ratio representing your workload. With writes, RAID penalties can hurt your effective performance by quite a bit, so if you don’t have an idea of what this ratio currently is, it’s time for you to find out.

2. Transfer request size. Is your payload the size of a ball bearing, or a bowling ball? Applications and operating systems vary on what size is used. Use your monitoring systems to best determine what your environment consists of.

3. Disk size. The "maximum disk size" field is a count of sectors, so with 512-byte sectors use multiples of 2097152, which is a 1GB file (1048576 works out to 512MB). Throwing a bunch of extra zeros in there might fill up your disk with Iometer’s test file. Depending on your needs, a setting of 2 to 20 GB might be a good range to work with.

4. Number of outstanding I/Os. This needs to be high enough that the test can keep sending I/O requests while the storage is still fulfilling earlier ones. A setting of 32 is pretty common.

5. Alignment of I/O. Many of the standard Iometer ICF files you find were built for physical drives. They have the "Align I/Os on:" setting set to "Sector boundaries." When running tests on a storage array, this can lead to inconsistent results, so it is best to align on 4K or 512 bytes.

6. Ramp up time. Offer at least a minute of ramp up time.

7. Run time. Some might suggest running simulations long enough to exhaust all caching, so that you can see "real" throughput. While I understand the underlying reason for this statement, I believe this is missing the point. Caching is there in the first place to take advantage of a working set of warm and cold data, bursts, etc. If you have a storage solution that satisfies the duty cycles that exist in your environment, that is the most important part.

8. Number of workers. Let this spawn automatically to the number of logical processors in your VM. It might be overkill in many cases because of terrible multithreading abilities of most applications, but it’s a pretty conventional practice.

9. Multiple Iometer instances. Not really a setting, but more of a practice. I’ve found running multiple tests a good way to better understand how a storage solution will react under load as opposed to on its own. It is shared storage after all.
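
To keep runs comparable, it can help to write the chosen settings down in one place before touching the GUI. The sketch below is only a notebook-style record, not an Iometer ICF file, and every name in it is an assumption. The sector helper assumes the disk size field is counted in 512-byte sectors, as the Iometer user guide describes it, which makes 1GB equal to 2,097,152 sectors; confirm the resulting test file size on your own system. The read percentage and transfer size are placeholders to be replaced with what your monitoring shows.

```python
SECTOR_BYTES = 512  # assumption: the target reports 512-byte sectors


def max_disk_size_sectors(size_gib, sector_bytes=SECTOR_BYTES):
    """Convert a test-file size in GiB to a sector count."""
    return size_gib * 2**30 // sector_bytes


# Placeholder profile; replace the values with what your monitoring shows.
profile = {
    "read_pct": 20,                                      # hypothetical write-heavy mix
    "transfer_size_bytes": 32 * 1024,                    # hypothetical payload size
    "max_disk_size_sectors": max_disk_size_sectors(4),   # inside the 2 to 20 GB range
    "outstanding_ios": 32,                               # the common setting noted above
    "align_bytes": 4096,                                 # 4K rather than sector boundaries
    "ramp_up_seconds": 60,                               # at least a minute
}

for name, value in profile.items():
    print(f"{name}: {value}")
```

Keeping a record like this next to each result makes it obvious when two runs were not measuring the same thing.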

Disclaimer
If you were looking for this to be the definitive post on Iometer, that isn’t what I was shooting for. There are many others who are much more qualified to speak to the nuances of Iometer than me. What I hope to do is to offer a little practical perspective on its use, and how it can help you. So next time you run Iometer, think about what you are trying to accomplish, and let go of the number chasing. Understand your workloads, and use the tool to help you improve your environment.

Seattle VMUG Meeting for August 2013

VMworld 2013 is just a few short weeks away, and with that comes a lot of excitement in the virtualization community. Whether or not you plan on attending VMworld, there is plenty to be excited for with the next Seattle VMUG on Wednesday, August 14th (5:00pm to 8:00pm) at the Bellevue Art Museum. If you are located in the greater Seattle area, here are a few reasons why you should block out your calendar for this event.

Great content is what the Seattle VMUG has been striving for, and I think this next one is a fine example of that. Two fantastic sponsors will be presenting.

Veeam Backup & Replication has a tremendously loyal following. If you get a chance to use it, you quickly understand why. Big improvements are coming in Veeam Backup & Replication v7 that may alter your onsite and offsite protection strategies for the better. This is a great chance to find out more about it.
PernixData. Billed as the first flash hypervisor, see how this solution (known as FVP) leverages server-side cache to accelerate storage performance for vSphere environments, and why it is turning heads in the industry. It just may change how you design your infrastructure.

Need a few more reasons to go? Okay, well how about:

You have a great chance at one of two iPad Minis that will be given away. Yes, the Seattle VMUG manages to find some pretty classy prizes to give away. You can’t win if you don’t show up.
Chat it up with your fellow VMUGers. Come on over and introduce yourself. Find out what others are doing in their environments, and share a little bit about what your environment is like. If you work with virtualization or manage infrastructures, and claim that nobody understands what you do, here is your chance to surround yourself with others who do!
Soak up the sights at the Bellevue Art Museum. Located in the heart of downtown Bellevue, it’s a great venue that has plenty of parking, and will give you an opportunity to see the exhibits.

So be sure to register and come by and say hi. You might walk away with a new iPad mini, and new ideas on how to improve your environment.

Hunting down unnecessary I/O before you buy that next storage solution

Are legacy processes and workflows sabotaging your storage performance? If you are on the verge of committing good money for more IOPS, or lower latency, it might be worth taking a look at what is sucking up all of those I/Os.

In my previous posts about improving the performance of our virtualized code compiling systems, it was identified that storage performance was a key factor in our ability to leverage our scaled up compute resources. The classic response to this dilemma has been to purchase faster storage. While that might be a part of the ultimate solution, there is another factor worth looking into: legacy processes, and how they might be impacting your environment.

Even though new technologies are helping deliver performance improvements, one constant is that traditional, enterprise class storage is expensive. Committing to faster storage usually means committing large chunks of dollars to the endeavor. This can be hard to swallow at budget time, or doesn’t align well with the immediate needs. And there can certainly be a domino effect when improving storage performance. If your fabric cannot support a fancy new array, the protocol type, or speed, get ready to spend even more money.

Calculated Indecision
In the optimization world, there is an approach called "delay until the last responsible moment" (LRM). Do not mistake this for a procrastinator’s creed of "kicking the can." It is a pretty effective, Agile-esque strategy for hedging against poor or premature purchasing decisions in, in this case, the rapidly changing world of enterprise infrastructures. Even within the last few years, some companies have challenged traditional thinking when it comes to how storage and compute are architected. LRM helps with this rapid change, and has the ability to save a lot of money in the process.

Look before you buy
Writes are what you design around and pay big money for, so wouldn’t it be logical to look at your infrastructure to see if legacy processes are undermining your ability to deliver I/O? That is the step I took in an effort to squeeze out every bit of performance that I could with my existing arrays before I commit to a new solution. My quick evaluation resulted in this:

Using array based snapshotting for short term protection was eating up way too much capacity; 25 to 30TB. That is almost half of my total capacity, and all for a retention time that wasn’t very good. How does capacity relate to performance? Well, if one doesn’t need all of that capacity for snapshot or replica reserves, one might be able to run at a better performing RAID level. Imagine being able to cut the write penalty by 2 to 3 times if you were currently running RAID levels focused on capacity. For a write-intensive environment like mine, that is a big deal.
Legacy I/O intensive applications and processes identified. What are they, can they be adjusted, or are they even needed anymore?

Much of this I didn’t need to do a formal analysis of. I knew the environment well enough to know what needed to be done. Here is what the plan of action has consisted of.

Ditch the array based snapshots and remote replicas in favor of Veeam. This is something that I wanted to do for some time. Local and remote protection is now the responsibility of some large Synology NAS units as the backup target for Veeam. Everything about this combination has worked incredibly well. For those interested, I’ll be writing about this arrangement in the near future.
Convert existing Guest Attached Volumes to native VMDKs. My objective with this is to make Veeam see the data so that it can protect it. Integrated, compressed and deduped. What it does best.
Reclaim all of the capacity gained from no longer using snaps and replicas, and rebuild one of the arrays from RAID 50 to RAID 10. This will cut the write penalty from 4 to 2.
Adjust or eliminate legacy I/O intensive apps.
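
To see what halving the write penalty means, here is a small worked example. The penalty values of 4 and 2 come from the plan above; the function name and the front-end load of 1,000 IOPS at 80% writes are assumptions chosen only to make the arithmetic concrete.

```python
# Write penalties as stated in the plan above.
RAID_WRITE_PENALTY = {"RAID 50": 4, "RAID 10": 2}


def backend_iops(frontend_iops, write_pct, write_penalty):
    """Disk-level IOPS needed to serve a given front-end load."""
    reads = frontend_iops * (100 - write_pct) / 100
    writes = frontend_iops * write_pct / 100
    # Each front-end write costs `write_penalty` back-end operations.
    return reads + writes * write_penalty


# Hypothetical load: 1,000 front-end IOPS, 80% of them writes.
for level, penalty in RAID_WRITE_PENALTY.items():
    print(level, backend_iops(1000, 80, penalty))
```

For that load the spindles have to deliver 3,400 IOPS on RAID 50 but only 1,800 on RAID 10, so the same disks have roughly twice the headroom for a write-heavy workload.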

The Culprits
Here were the biggest influencers of legacy I/O intensive applications (“legacy” after the incorporation of Veeam). Total time per day shown below, and may reflect different backup frequencies.

Source: Legacy SharePoint backup solution
Cost: 300 write IOPS for 8 hours per day
Action: This can be eliminated because of Veeam

Source: Legacy Exchange backup solution
Cost: 300 write IOPS for 1 hour per day
Action: This can be eliminated because of Veeam

Source: SourceCode (SVN) hotcopies and dumps
Cost: 200-700 IOPS for 12 hours per day.
Action: Hot copies will be eliminated, but SVN dumps will be redirected to an external target. An optional method of protection that in a sense is unnecessary, but source code is the lifeblood of a software company, so it is worth the overhead right now.

Source: Guest attached Volume Copies
Cost: Heavy read IOPS on mounted array snapshots when dumping to external disk or tape.
Action: Guest attached volumes will be converted to native VMDKs so that Veeam can see and protect the data.
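
Sustained rates like these are easier to appreciate as daily totals. The sketch below uses only the IOPS and hours listed above (the guest attached volume copies are left out because no rate was given for them); the function and variable names are assumptions.

```python
SECONDS_PER_HOUR = 3600


def ios_per_day(iops, hours_per_day):
    """Total I/Os generated per day by a process running at a steady rate."""
    return iops * hours_per_day * SECONDS_PER_HOUR


# (source, low IOPS, high IOPS, hours per day), taken from the list above.
culprits = [
    ("Legacy SharePoint backup", 300, 300, 8),
    ("Legacy Exchange backup", 300, 300, 1),
    ("SVN hotcopies and dumps", 200, 700, 12),
]

total_low = sum(ios_per_day(low, hours) for _, low, _, hours in culprits)
total_high = sum(ios_per_day(high, hours) for _, _, high, hours in culprits)

for name, low, high, hours in culprits:
    print(f"{name}: {ios_per_day(low, hours):,} to {ios_per_day(high, hours):,} I/Os per day")
print(f"Total: {total_low:,} to {total_high:,} I/Os per day")
```

That works out to somewhere between roughly 18 and 40 million I/Os per day that the arrays were servicing for legacy protection alone.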

Notice the theme here? Many of the opportunities for reducing I/O had to do with legacy “in-guest” methods of protecting the data. Moving to a hypervisor centric backup solution like Veeam has also reinforced a growing feeling I’ve had about storage array specific features that focus on data protection. I’ve grown to be disinterested in them. Here are a few reasons why.

It creates an indelible tie between your virtualization infrastructure, protection, and your storage. We all love the virtues of virtualizing compute. Why do I want to make my protection mechanisms dependent on a particular kind of storage? Abstract it out, and it becomes way easier.
Need more replica space? Buy more arrays. Need more local snapshot space? Buy more arrays. You end up keeping protection on pretty expensive storage.
Modern backup solutions protect the VMs, the applications, and the data better. Application awareness may have been lacking years ago, but not anymore.
Vendor lock-in. I’ve never bought into this argument much, mostly because you end up having to make a commitment at some point with just about every financial decision you make. However, adding more storage arrays can eat up an entire budget in an SMB/SME world. There has to be a better way.
Complexity. You end up having a mess of methods of how some things are protected, while other things are protected in a different way. Good protection often comes in layers, but choosing a software based solution simplifies the effort.

I used to live by array specific tools for protecting data. It was all I had, and they served a very good purpose. I leveraged them as much as I could, but in hindsight, they can make a protection strategy very complex, fragile, and completely dependent on sticking with that line of storage solutions. Use a solution that hooks into the hypervisor via the vCenter API, and let it do the rest. Storage vendors should focus on what they do best, which is figuring out ways to deliver bigger, better, and faster storage.

What else to look for.
Other possible sources that are robbing your array of I/Os:

SQL maintenance routines (dumps, indexing, etc.). While necessary, you may choose to run these at non peak hours.
Defrags. Surely you have a GPO shutting off this feature on all your VMs, correct? (hint, hint)
In-guest legacy anything. Traditional backup agents are terrible. Virus scans aren’t much better.
User practices. Don’t be surprised if you find out some department is doing all sorts of silly things that translate into heavy writes (e.g., “We copy all of our data to this other directory hourly to back it up.”).
Guest attached volumes. While they can be technically efficient, one would have to rely on protecting these in other ways because they are not visible from vCenter. Often this results in some variation of making an array based snapshot available to a backup application. While it is "off-host" to the production system, this method takes a very long time, whether the target is disk or tape.

One might think that eventually, the data has to be committed to external disk or tape anyway, so what does it matter? When it is file level backups, it matters a lot. For instance, committing 9TB of guest attached volume data (millions of smaller files) directly to tape takes nearly 6 days to complete. Committing 9TB of Veeam backups to tape takes just a little over 1 day.
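
Expressed as effective throughput, the gap is easier to compare with what a tape drive can sustain. This is a back-of-the-envelope sketch: it rounds the durations to exactly 6 days and 1 day, uses decimal units (1 TB = 1,000,000 MB), and the function name is an assumption.

```python
SECONDS_PER_DAY = 86_400


def throughput_mb_s(terabytes, days):
    """Average throughput in MB/s for moving `terabytes` in `days`."""
    return terabytes * 1_000_000 / (days * SECONDS_PER_DAY)


# Durations rounded from "nearly 6 days" and "a little over 1 day".
print(round(throughput_mb_s(9, 6), 1), "MB/s for the file level copy")
print(round(throughput_mb_s(9, 1), 1), "MB/s for the Veeam backup files")
```

Millions of small files keep the file level copy at roughly 17 MB/s, while a handful of large backup files stream at around 104 MB/s.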

The Results

So how much did these steps improve the average load on the arrays? This is a work in progress, so I don’t have the final numbers yet. But with each step, contention is decreased on my arrays, and my protection strategy has become several orders of magnitude simpler in the process.

With all of that said, I will be addressing my storage performance needs with *something* new. What might that be? Stay tuned.

Fixing host connection issues on Dell servers in vSphere 5.x

I had a conversation recently with a few colleagues at the Dell Enterprise Forum, and as they were describing the symptoms they were having with some Dell servers in their vSphere cluster, it sounded vaguely similar to what I had experienced recently with my new M620 hosts running vSphere 5.0 Update 2. While I’m uncertain if their issues were related in any way to mine, it occurred to me that I might not have been the only one out there who ran into this problem. So I thought I’d provide a post to help anyone else experiencing the behavior I encountered.

Symptoms
The new cluster of Dell M620 blades running vSphere 5.0 U2, used as our Development Team’s code compiling cluster, had hosts randomly dropping their connections. Yep, not good. This wasn’t normal behavior of course, and the effects ranged anywhere from still being up (but acting odd) to complete isolation of the host with no success at a soft recovery. The hosts themselves had the latest firmware applied to them, and I used the custom Dell ESXi ISO when building the hosts. Each service (Mgmt, LAN, vMotion, storage) was meshed so that one service didn’t depend on a single, multiport NIC adapter, but they still went down. What was creating the problem? I won’t leave you hanging. It was the Broadcom network drivers for ESXi.

Before I figured out what the problem was, here is what I knew:

The behavior was only occurring on a cluster of 4 Dell M620 hosts. The other cluster containing M610’s never experienced this issue.
It had occurred on each host at least once, typically when there was a higher likelihood for heavy traffic.
Various services had been impacted. One time it was storage, while the other time it was the LAN side.

Blade configuration background
To understand the symptoms, and the correction a bit better, it is worth getting an overview of what the Dell M620 blade looks like in terms of network connectivity. What I show below reflects my 1GbE environment, and would look different if I was using 10GbE, or with switch modules instead of passthrough modules.

The M620 blades come with a built in Broadcom NetXtreme II BCM57810 10Gbps Ethernet adapter. This provides two 10Gbps ports on fabric A of the blade enclosure. These will negotiate down to 1GbE if you have passthroughs on the back of the enclosure, as I do.

There are two spots in each blade that will accept additional mezzanine adapters for fabric B and fabric C respectively. In my case, since I also have 1GbE passthroughs on these fabrics, I chose to use the Broadcom NetXtreme BCM5719 GbE adapter. Each provides four 1GbE ports. With passthroughs, only two of the four on each adapter are reachable. The end result is six 1GbE ports available for use on each blade: two for storage, two for Production LAN traffic, and two for vSphere Mgmt and vMotion. All services needed (iSCSI, Mgmt, etc.) are assigned so that in the event of a single adapter failure, you’re still good to go.

And yes, I’d love to go to 10GbE as much as anyone, but that is a larger matter especially when dealing with blades and the enclosure that they reside in. Feel free to send me a check, and I’ll return the favor with a nice post.

How to diagnose, and correct
In one of the cases, this event caused an All Paths Down from the host to my storage. I looked in /scratch/log on the host, with the intent of looking into the vmkernel and vobd.log files to see what was up. The following command returned several entries that looked like the ones below.

less /scratch/log/vobd.log

2013-04-03T16:17:33.849Z: [iscsiCorrelator] 6384105406222us: [esx.problem.storage.iscsi.target.connect.error] Login to iSCSI target iqn.2001-05.com.equallogic:0-8a0906-d0a034d04-d6b3c92ecd050e84-vmfs001 on vmhba40 @ vmk3 failed. The iSCSI initiator could not establish a network connection to the target.

2013-04-03T16:17:44.829Z: [iscsiCorrelator] 6384104156862us: [vob.iscsi.target.connect.error] vmhba40 @ vmk3 failed to login to iqn.2001-05.com.equallogic:0-8a0906-e98c21609-84a00138bf64eb18-vmfs002 because of a network connection failure.
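
Paging through vobd.log works for one host, but when several hosts are affected it helps to count the failures per adapter and VMkernel port. The sketch below assumes you have copied vobd.log off the host to a workstation with Python; the function name and the regular expression are mine, written only against the two entry formats shown above, so other log variants may need a looser pattern.

```python
import re
from collections import Counter

# Matches the two iSCSI connect-error formats shown above.
ISCSI_ERROR = re.compile(
    r"^(?P<time>\S+): \[iscsiCorrelator\].*?"
    r"\[(?P<event>[\w.]+connect\.error)\].*?"
    r"(?P<hba>vmhba\d+) @ (?P<vmk>vmk\d+)"
)


def count_iscsi_connect_errors(lines):
    """Count iSCSI connect errors per (vmhba, vmk) pair."""
    counts = Counter()
    for line in lines:
        match = ISCSI_ERROR.search(line.strip())
        if match:
            counts[(match["hba"], match["vmk"])] += 1
    return counts
```

Passing an open file object for the copied log to count_iscsi_connect_errors returns a Counter keyed by adapter and VMkernel port, which shows at a glance whether the failures cluster on one path.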

Then I ran the following just to verify what I had for NICs and their associations

esxcfg-nics -l

Name    PCI           Driver      Link Speed     Duplex MAC Address       MTU    Description
vmnic0 0000:01:00.00 bnx2x       Up   1000Mbps Full   00:22:19:9e:64:9b 1500   Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet
vmnic1 0000:01:00.01 bnx2x       Up   1000Mbps Full   00:22:19:9e:64:9e 1500   Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet
vmnic2 0000:03:00.00 tg3         Up   1000Mbps Full   00:22:19:9e:64:9f 1500   Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet
vmnic3 0000:03:00.01 tg3         Up   1000Mbps Full   00:22:19:9e:64:a0 1500   Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet
vmnic4 0000:03:00.02 tg3         Down 0Mbps     Half   00:22:19:9e:64:a1 1500   Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet
vmnic5 0000:03:00.03 tg3         Down 0Mbps     Half   00:22:19:9e:64:a2 1500   Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet
vmnic6 0000:04:00.00 tg3         Up   1000Mbps Full   00:22:19:9e:64:a3 1500   Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet
vmnic7 0000:04:00.01 tg3         Up   1000Mbps Full   00:22:19:9e:64:a4 1500   Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet
vmnic8 0000:04:00.02 tg3         Down 0Mbps     Half   00:22:19:9e:64:a5 1500   Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet
vmnic9 0000:04:00.03 tg3         Down 0Mbps     Half   00:22:19:9e:64:a6 1500   Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet

Knowing what vmnics were being used for storage traffic, I took a look at the driver version for vmnic3

ethtool -i vmnic3

driver: tg3
version: 3.124c.v50.1
firmware-version: FFV7.4.8 bc 5719-v1.31
bus-info: 0000:03:00.1

Time to check and see if there were updated drivers.

Finding and updating the drivers
The first step was to check the VMware Compatibility Guide for this particular NIC. The good news was that there was an updated driver for this adapter: 3.129d.v50.1. I downloaded the latest driver (vib) for that NIC to a datastore that was accessible to the host, so that it could be installed. The process of making the driver available for installation, as well as the installation itself, can certainly be done with the VMware Update Manager, but for my example, I’m performing these steps from the command line. Remember to go into maintenance mode first.

esxcli software vib install -v /vmfs/volumes/VMFS001/drivers/broadcom/net-tg3-3.129d.v50.1-1OEM.500.0.0.472560.x86_64.vib

Installation Result
Message: The update completed successfully, but the system needs to be rebooted for the changes to be effective.
Reboot Required: true
VIBs Installed: Broadcom_bootbank_net-tg3_3.129d.v50.1-1OEM.500.0.0.472560
VIBs Removed: Broadcom_bootbank_net-tg3_3.124c.v50.1-1OEM.500.0.0.472560
VIBs Skipped:
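The maintenance mode reminder and the install command can be wrapped together so that neither gets skipped when repeating this across hosts. This is only a sketch: the `run` and `install_driver_vib` helpers and the `DRY_RUN` variable are made up for illustration, and it assumes `vim-cmd hostsvc/maintenance_mode_enter` is available in the host's shell.

```shell
# Set DRY_RUN=1 to print each command instead of running it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Enter maintenance mode, then install the driver VIB given as $1.
# The install is skipped if entering maintenance mode fails.
install_driver_vib() {
    run vim-cmd hostsvc/maintenance_mode_enter &&
    run esxcli software vib install -v "$1"
}

# Example:
#   DRY_RUN=1
#   install_driver_vib /vmfs/volumes/VMFS001/drivers/broadcom/net-tg3-3.129d.v50.1-1OEM.500.0.0.472560.x86_64.vib
```

Running it once with `DRY_RUN=1` shows exactly what would be executed before anything touches the host.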

The final steps will be to reboot the host, and verify the results.

ethtool -i vmnic3

driver: tg3
version: 3.129d.v50.1
firmware-version: FFV7.4.8 bc 5719-v1.31
bus-info: 0000:03:00.1
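If there are several hosts or adapters to verify, the version check can be scripted instead of read by eye. A minimal sketch, assuming `ethtool -i` prints one `key: value` pair per line as in the output above; both function names are made up for illustration:

```shell
# Extract the "version:" field from `ethtool -i` output on stdin.
# The "firmware-version:" line is not matched because the whole key is compared.
driver_version() {
    awk -F': ' '$1 == "version" { print $2 }'
}

# Succeeds only if the adapter ($1) reports the expected driver version ($2).
check_driver() {
    actual=$(ethtool -i "$1" | driver_version)
    [ "$actual" = "$2" ]
}

# Example: check_driver vmnic3 3.129d.v50.1 || echo "vmnic3 is not on the new driver"
```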

Conclusion
I initially suspected that the problems were driver related, but the symptoms produced by the bad drivers gave the impression that there was a larger issue at play. Nevertheless, I couldn’t get these drivers loaded up fast enough, and since that time (about 3 months), they have been rock solid, and behaving normally.

Helpful links
Determining Network/Storage firmware and driver version in ESXi
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1027206

VMware Compatibility Guide
http://www.vmware.com/resources/compatibility/search.php?deviceCategory=io&productid=19946&deviceCategory=io&releases=187&keyword=bcm5719&page=1&display_interval=10&sortColumn=Partner&sortOrder=Asc

My VMworld “Call for Papers” submission, and getting more involved

It is a good sign that you are in the right business when you get tremendous satisfaction from your career – whether it be from the daily challenges at work, or through professional growth, learning, or sharing. It’s been an exciting month for me, as I’ve taken a few steps to get more involved.

First, I decided to submit my application for the 2013 VMware vExpert program. I’ve sat on the sidelines, churning out blog posts for 4 years now, but with the encouragement of a few of my fellow VMUG comrades and friends, decided to throw my hat in the ring with others equally as enthusiastic as I am about what many of us do for a living. The list has not been announced yet, so we’ll see what happens. I’m also now officially part of the Seattle VMUG steering committee, contributing where I can to provide more value to the local VMUG community.

Next, I was honored to be recognized as a 2013 Dell TechCenter Rockstar. Started in 2012, the DTC Rockstar program recognizes those Subject Matter Experts and enthusiasts who share their knowledge on the portfolio of Dell solutions in the Enterprise. And I am flattered to be in great company with the others who have been recognized for their efforts. Congratulations to the others who were recognized as well.

And finally, I took a stab at submitting an abstract for consideration as a possible session at this year’s VMworld. I can’t say I ever imagined a scenario in which I would be responding to VMware’s annual “Call for Papers”, but with real-life use cases comes really interesting stories. I had a really interesting story. My session title is:

4370 – Compiling code in virtual machines: Identifying bottlenecks and optimizing performance to scale out development environments

This session was inspired by part 1 and part 2 of “Vroom! Scaling up Virtual Machines in vSphere to meet performance requirements.” What transpired from the project was a fascinating exercise in assumptions, bottleneck chasing, and a modern virtualized infrastructure’s ability to scale up computational power immediately for an organization. I’ve received great feedback from those posts, but the posts just skimmed the surface of what was learned. What better way to demonstrate a unique use-case than to share the details with those who really care? Take a look at: http://www.vmworld.com/cfp.jspa. My submission is under the “Customer Case Studies” track, number 4730. Public voting is now open. If you don’t have a VMworld account, just create one – it’s free. Click on the session to read the abstract, and if you like what you see, click on the “thumbs up” button to put in a vote for it.

Spend enough time in IT, and it turns out you might have an opinion or two on things. How to make it all work, and how to keep your sanity. I haven’t quite figured out the definitive answers to either one of those yet, but when there is an opportunity to contribute, I try my best to pay it forward to the great communities of geeks out there. Thanks for reading.

Configuring a VM for SNMP monitoring using Cacti

There are a number of things that I don’t miss with old physical infrastructures. One near the top of the list is a general lack of visibility for each and every system. Horribly underutilized hardware ran happily alongside overtaxed or misconfigured systems, and it all looked the same. Fortunately, virtualization has changed much of that nonsense, and performance trending data of VMs and hosts is a given.

Partners in the VMware ecosystem are able to take advantage of the extensibility by offering useful tools to improve management and monitoring of other components throughout the stack. The Dell Management Plug-in for VMware vCenter is a great example of that. It does a good job of integrating side-band management and event driven alerting inside of vCenter. However, in many cases you still need to look at performance trending data of devices that may not inherently have that ability on their own. Switchgear is a great example of a resource that can be left in the dark. SNMP can be used to monitor switchgear and other types of devices, but its use is almost always absent in smaller environments. But there are simple options to help provide better visibility even for the smallest of shops. This post will provide what you need to know to get started.

In this example, I will be setting up a general purpose SNMP management system running Cacti to monitor the performance of some Dell PowerConnect switchgear. Cacti leverages RRDTool’s framework to deliver time based performance monitoring and graphing. It can monitor a number of different types of systems supporting SNMP, but switchgear provides the best example that most everyone can relate to. At a very affordable price (free), Cacti will work just fine in helping with these visibility gaps.

Monitoring VM
The first thing to do is to build a simple Linux VM for the purpose of SNMP management. One would think there would be a free Virtual Appliance out on the VMware Virtual Appliance Marketplace for this purpose, but if there is, I couldn’t find it. Any distribution will work, but my instructions will cater toward the Debian distributions – particularly Ubuntu, or an Ubuntu clone like Linux Mint (my personal favorite). Set it for 1 vCPU and 512 MB of RAM. Assign it a static address on your network management VLAN (if you have one). Otherwise, your production LAN will be fine. While it is a single purpose built VM, you still have to live with it, so no need to punish yourself by leaving it bare bones. Go ahead and install the typical packages (e.g. vim, ssh, ntp, etc.) for convenience or functionality.

Templates are an option that extends the functionality in Cacti. In the case of the PowerConnect switches, the template will assist in providing information on CPU, memory, and temperature. A template for the PowerConnect 6200 line of switches can be found at http://docs.cacti.net/usertemplate:host:dell:powerconnect:62xx. The instructions below will include how to install this.

Prepping SNMP on the switchgear

In the simplest of configurations (which I will show here), there really isn’t much to SNMP. For this scenario, one will be providing read-only access of SNMP via a shared community name. The monitoring VM will poll these devices and update the database accordingly.

If your switchgear is isolated, as your SAN switchgear might be, then there are a few options to make the switches visible in the right way. Regardless of what option you use, the key is to make sure that your iSCSI storage traffic lives on a different VLAN from your management interface of the device. I outline a good way to do this at “Reworking my PowerConnect 6200 switches for my iSCSI SAN”

There are a couple of options in connecting the isolated storage switches to gather SNMP data:

Option 1: Connect a dedicated management port on your SAN switch stack back to your LAN switch stack.

Option 2: Expose the SAN switch management VLAN using a port group on your iSCSI vSwitch.

I prefer option 1, but regardless, if it is iSCSI switches you are dealing with, you will want to make sure that management traffic is on a different VLAN than your iSCSI traffic to maintain the proper isolation of iSCSI traffic.

Once the communication is in place, just make a few changes to your PowerConnect switchgear. Note that community names are case sensitive, so decide on a name, and stick with it.

enable

configure

snmp-server location "Headquarters"

snmp-server contact "IT"

snmp-server community mycompany ro ipaddress 192.168.10.12

Monitoring VM – Pre Cacti configuration
Perform the following steps on the VM you will be using to install Cacti.

1. Install and configure SNMPD

apt-get update

apt-get install snmpd

mv /etc/snmp/snmpd.conf /etc/snmp/snmpd.conf.old

2. Create a new /etc/snmp/snmpd.conf with the following contents:

rocommunity mycompany

syslocation Headquarters

syscontact IT

3. Edit /etc/default/snmpd to allow snmpd to listen on all interfaces and use the config file. Comment out the first line below and replace it with the second line:

SNMPDOPTS='-Lsd -Lf /dev/null -u snmp -g snmp -I -smux -p /var/run/snmpd.pid 127.0.0.1'

SNMPDOPTS='-Lsd -Lf /dev/null -u snmp -g snmp -I -smux -p /var/run/snmpd.pid -c /etc/snmp/snmpd.conf'

4. Restart the snmpd daemon.

sudo /etc/init.d/snmpd restart

5. Install additional perl packages:

apt-get install libsnmp-perl

apt-get install libnet-snmp-perl
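Steps 2 and 3 above are the ones most likely to pick up a typo when done by hand, so they can be scripted. The following is a sketch under a few assumptions: the two function names are made up for illustration, `sed -i.bak` is assumed to be supported by the installed sed, and the community name passed in must match the one configured on the switchgear exactly, including case.

```shell
# Step 2: write a minimal read-only snmpd.conf.
# Arguments: target path, community name, location, contact.
write_snmpd_conf() {
    cat > "$1" <<EOF
rocommunity $2
syslocation $3
syscontact $4
EOF
}

# Step 3: comment out the active SNMPDOPTS line and append the replacement.
# The original file is kept alongside it with a .bak suffix.
patch_snmpd_defaults() {
    sed -i.bak 's/^SNMPDOPTS=/#SNMPDOPTS=/' "$1" &&
    printf '%s\n' "SNMPDOPTS='-Lsd -Lf /dev/null -u snmp -g snmp -I -smux -p /var/run/snmpd.pid -c /etc/snmp/snmpd.conf'" >> "$1"
}

# Example (as root):
#   write_snmpd_conf /etc/snmp/snmpd.conf mycompany Headquarters IT
#   patch_snmpd_defaults /etc/default/snmpd
```

Note that `patch_snmpd_defaults` is not idempotent: running it a second time would comment out the new line and append another copy, so run it once and then restart snmpd as in step 4.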

Monitoring VM – Cacti Installation
6. Perform the following steps on the VM you will be using to install Cacti.

apt-get update

apt-get install cacti

During the installation process, MySQL will be installed, and the installation will ask what you would like the MySQL root password to be. Then the installer will ask what you would like cacti’s MySQL password to be. Choose passwords as desired.

Now, the Cacti installation is available via http://[cactiservername]/cacti with a username and password of "admin". Cacti will now ask you to change the admin password. Choose whatever you wish.

7. Download PowerConnect add-on from http://docs.cacti.net/usertemplate:host:dell:powerconnect:62xx and unpack both zip files

8. Import the host template via the GUI. Log into Cacti, and go to Console > Import Templates, select the desired file (in this case, cacti_host_template_dell_powerconnect_62xx_switch.xml), and click Import.

9. Copy the 62xx_cpu.pl script into the Cacti script directory on the server (/usr/share/cacti/site/scripts). This may need executable permissions. If you downloaded it to a Windows machine, but need to copy it to the Linux VM, WinSCP works nicely for this.

10. Depending on how things were copied, the .pl file might contain Windows-style (CRLF) line endings. You can clean up that 62xx_cpu.pl file by running the following:

dos2unix 62xx_cpu.pl
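To find out whether the file needs cleaning at all, or to clean it where dos2unix is not installed, something like the following would do. This is a sketch, not part of the Cacti template; the function names are made up for illustration.

```shell
# Succeeds if the file contains at least one carriage return.
has_crlf() {
    grep -q "$(printf '\r')" "$1"
}

# Strip carriage returns in place; a stand-in for dos2unix.
# Writing back with cat keeps the file's existing permissions.
strip_cr() {
    tmp=$(mktemp) &&
    tr -d '\r' < "$1" > "$tmp" &&
    cat "$tmp" > "$1" &&
    rm -f "$tmp"
}

# Example:
#   cd /usr/share/cacti/site/scripts
#   has_crlf 62xx_cpu.pl && strip_cr 62xx_cpu.pl
#   chmod +x 62xx_cpu.pl
```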

Using Cacti
You are now ready to run Cacti so that you can connect and monitor your devices. This example shows how to add the device to Cacti, then monitor CPU and a specific data port on the switch.

1. Launch Cacti from your workstation by browsing out to http://[cactiservername]/cacti and enter your credentials.

2. Create a new Graph Tree via Console > Graph Trees > Add. You can call it something like “Switches” then click Create.

3. Create a new device via Console > Devices > Add. Give it a friendly description, and the host name of the device. Enter the SNMP Community name you decided upon earlier. In my example above, I show the community name as being “mycompany” but choose whatever fits. Remember that community names are case sensitive.

4. To create a graph for monitoring CPU of the switch, click Console > Create New Graphs. In the host box, select the device you just added. In the “Create” box, select “Dell Powerconnect 62xx – CPU” and click Create to complete.

5. To create a graph for monitoring a specific Ethernet port, click Console > Create New Graphs. In the Host box, select the device you just added. Put a check mark next to the port number desired, and select In/Out bits with total bandwidth. Click Create > Create to complete.

6. To add the chart to the proper graph tree, click Console > Graph Management. Put a check mark next to the Graphs desired, and change the “Choose an action” box to “Place on a Tree [Tree name]”.

Now when you click on Graphs, you will see your two items to be monitored.

By clicking on the magnifying glass icon, or by using the “Graph Filters” near the top of the screen, one can easily zoom in or out to various sampling periods to suit your needs.

Conclusion
Using SNMP and a tool like Cacti can provide historical performance data for non-virtualized devices and systems in ways you’ve grown accustomed to in vSphere environments. How hard are your switches running? How much internet bandwidth does your organization use? This will tell you. Give it a try. You might be surprised at what you find.

Vroom! Scaling up Virtual Machines in vSphere to meet performance requirements–Part 2

In my original post, Scaling up Virtual Machines in vSphere to meet performance requirements, I described a unique need for the Software Development Team to have a lot of horsepower to improve the speed of their already virtualized code compiling systems. My plan of attack was simple. Address the CPU bound systems with more powerful blades, and scale up the VMs accordingly. Budget constraints axed the storage array included in my proposal, and also kept this effort limited to keeping the same number of vSphere hosts for the task.

The four new Dell M620 blades arrived and were quickly built up with vSphere 5.0 U2 (Enterprise Plus Licensing) with the EqualLogic MEM installed. A separate cluster was created to ensure all build systems were kept separate, and so that I didn’t have to mask any CPU features to make them work with previous generation blades. Next up was to make sure each build VM was running VM hardware level 8. Prior to vSphere 5, the guest VM was unaware of the NUMA architecture behind it. Without the guest OS understanding memory locality, one could introduce problems into otherwise efficient processes. While I could find no evidence that the compilers for either OS are NUMA aware, I knew the Operating Systems understood NUMA.

Each build VM has a separate vmdk for its compiling activities. Their D:\ drive (or /home for Linux) is where the local sandboxes live. I typically have this second drive on a “Virtual Device Node” changed to something other than 0:x. This has proven beneficial in previous performance optimization efforts.

I figured the testing would be somewhat trivial, and would be wrapped up in a few days. After all, the blades were purchased to quickly deliver CPU power for a production environment, and I didn’t want to hold that up. But the data the tests returned had some interesting surprises. It is not every day that you get to test 16vCPU VMs for a production environment that can actually use the power. My home lab certainly doesn’t allow me to do this, so I wanted to make this count.

Testing
The baseline tests would be to run code compiling on two of the production build systems (one Linux, and the other Windows) on an old blade, then the same set with the same source code on the new blades. This would help in better understanding if there were speed improvements from the newer generation chips. Most of the existing build VMs are similar in their configuration. The two test VMs would start out with 4vCPUs and 4GB of RAM. Once the baselines were established, the virtual resources of each VM would be dialed up to see how they responded. The systems would be compiling the very same source code.

For the tests, I isolated each blade so they were not serving up other needs. The test VMs resided in an isolated datastore, but lived on a group of EqualLogic arrays that were part of the production environment. Tests were run at all times during the day and night to simulate real world scenarios, as well as demonstrate any variability in SAN performance.

Build times would be officially recorded in the Developers Build Dashboard. All resources would be observed in vSphere in real time, with screen captures made of things like CPU, disk and memory, and dumped into my favorite brain-dump application; Microsoft OneNote. I decided to do this on a whim when I began testing, but it immediately proved incredibly valuable later on as I found myself looking at dozens of screen captures constantly.

The one thing I didn’t have time to test was the nearly limitless possible scenarios in which multiple monster VMs were contending for CPUs at the same time. But the primary interest for now was to see how the build systems scaled. I would then make my sizing judgments off of the results, and off of previous experience with smaller build VMs on smaller hosts.

The [n/n] title of each test result column indicates the number of vCPUs followed by the amount of vRAM associated. Stacked bar graphs show a lighter color at the top of each bar. This indicates the difference in time between the best result and the worst result. The biggest factor of course would be the SAN.

Bottleneck cat and mouse
Performance testing is a great exercise for anyone, because it helps challenge your own assumptions on where the bottleneck really is. No resource lives as an island, and this project showcased that perfectly. Improving the performance of these CPU bound systems may very well shift the contention elsewhere. However, it may expose other bottlenecks that you were not aware of, as resources are just one element of bottleneck chasing. Applications and the Operating Systems they run on are not perfect, nor are the scripts that kick them off. Keep this in mind when looking at the results.

Test Results – Windows
The following test results are from Windows 7, running the Visual Studio compiler, across three generations of blades: the Dell M600 (Harpertown), M610 (Nehalem), and M620 (Sandy Bridge).

Comparing a Windows code compile across blades without any virtual resource modifications.

Yes, that is right. The old M600 blades were that terrible when it came to running VMs that were compiling. This would explain the inconsistent build time results we had seen in the past. While there was improvement in the M620 over the M610s, the real power of the M620s is that they have double the number of physical cores (16) of the previous generations. Also noteworthy is how significantly the SAN (up to 50%) was affecting the end result.

Comparing a Windows code compile on new blade, but scaling up virtual resources

Several interesting observations about this image (above).

When the SAN can’t keep up, it can easily give back the improvements made in raw compute power.
Performance degraded when compiling with more than 8vCPUs. It was so bad that I quit running tests when it became clear they weren’t compiling efficiently (which is why you do not see SAN variability once I started getting negative returns).
Doubling the vCPUs from 4 to 8, and the vRAM from 4 to 8 only improved the build time by about 30%, even though the compile showed nearly perfect multithreading (shown below) and 100% CPU usage. Why the degradation? Keep reading!

On a different note, it was becoming quite clear already that I needed to take a little corrective action in my testing. The SAN was being overworked at all times of the day, and it was impacting my ability to get accurate test results in raw compute power. The more samples I ran, the more consistent the inconsistency was. Each of the M620s had a 100GB SSD, so I decided to run the D:\ drive (where the build sandbox lives) on there to see how a lack of storage contention impacted build times. The purple line indicates the build times of the given configuration, but with the D:\ drive of the VM living on the local SSD drive.

The difference between a slow run on the SAN and a run with faster storage was spreading.

Test Results – Linux
The following test results are from Linux, running the GCC compiler, across three generations of blades: the Dell M600 (Harpertown), M610 (Nehalem), and M620 (Sandy Bridge).

Comparing a Linux code compile across blades without any virtual resource modifications.

The Linux compiler showed a much more linear improvement, along with being faster than its Windows counterpart. There were noticeable improvements across the newer generations of blades, with no modifications in virtual resources. However, the margin of variability from the SAN is a concern.

Comparing a Linux code compile on new blade, but scaling up virtual resources

At first glance it looks as if the Linux GCC compiler scales up well, but not in a linear way. But take a look at the next graph, where similar to the experiment with the Windows VM, I changed the location of the vmdk file used for the /home drive (where the build sandbox lives) over to the local SSD drive.

This shows very linear scalability with Linux and a GCC compiler. Compared to the original 4vCPU, 4GB RAM configuration, the VM was able to compile 2.2x faster with 8vCPUs and 8GB of RAM. Total build time was just 12 minutes. Triple the virtual resources to 12/12, and it is an almost linear 2.9x faster than the original configuration. Bump it up to 16vCPUs, and diminishing returns begin to show up, where it is 3.4x faster than the original configuration. I suspect crossing NUMA nodes and the architecture of the code itself were impacting this a bit. Still, don’t lose sight of the fact that a build that could take up to 45 minutes on the old configuration took only 7 minutes with 16vCPUs.
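One way to put a number on "almost linear" is scaling efficiency: the measured speedup divided by the factor by which the vCPU count grew. A small sketch using only the speedups quoted above; the function name is made up for illustration, and awk is assumed to be available.

```shell
# Scaling efficiency as a whole percentage; 100 means perfectly linear scaling.
# Arguments: baseline vCPUs, scaled vCPUs, measured speedup.
scaling_efficiency() {
    awk -v base="$1" -v scaled="$2" -v speedup="$3" \
        'BEGIN { printf "%.0f\n", 100 * speedup / (scaled / base) }'
}

scaling_efficiency 4 8 2.2    # prints 110
scaling_efficiency 4 12 2.9   # prints 97
scaling_efficiency 4 16 3.4   # prints 85
```

The drop from 97 to 85 between 12 and 16 vCPUs is where the diminishing returns show up. The 8/8 case landing slightly above 100 may reflect the extra RAM that was added along with the vCPUs, or run-to-run variation; the quoted figures alone don't say which.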

The big takeaways from these results are the differences in scalability in compilers, and how overtaxed the storage is. Let’s take a look at each one of these.

The compilers
Internally it had long been known that Linux compiled the same code faster than Windows. Way faster. But for various reasons it had been difficult to pinpoint why. The data returned made it obvious. It was the compiler.

While it was clear that the real separation in multithreaded compiling occurred after 8vCPUs, the real problem with the Windows Visual Studio compiler begins after 4vCPUs. This surprised me a bit because when monitoring the vCPU usage (in stacked graph format) in vCenter, it was using every CPU cycle given to it, and multithreading quite evenly. The testing used Visual Studio 2008, but I also tested newer versions of Visual Studio, with nearly the same results.

Storage
The original proposal included storage to support the additional compute horsepower. The existing set of arrays had served our needs very well, but were really targeted at general purpose I/O needs with a focus of capacity in mind. During the budget review process, I had received many questions as to why we needed a storage array. Boiling it down to even the simplest of terms didn’t allow for that line item to survive the last round of cuts. Sure, there was a price to pay for the array, but the results show there is a price to pay for not buying the array.

I knew storage was going to be an issue, but when contention occurs, it’s hard to determine how much of an impact it will have. Think of a busy freeway, where throughput is pretty easy to predict up to a certain threshold. Hit critical mass, and predicting commute times becomes very difficult. Same thing with storage. But how did I know storage was going to be an issue? The free tool provided to all Dell EqualLogic customers: SAN HQ. This tool has been a trusted resource for me in the past, and removes ALL speculation when it comes to historical usage of the arrays, and other valuable statistics. IOPS, read/write ratios, latency etc. You name it.

Historical data of Estimated Workload over the period of 1 month

Historical data of Estimated Workload over the period of 12 months

Both images show that with the exception of weekends, the SAN arrays are maxed out to 100% of their estimated workload. The overtaxing shows up on the lower part of each screen capture, with the reads and writes surpassing the brown line indicating the estimated maximum IOPS of the array. The 12 month history showed that our storage performance needs were trending upward.

Storage contention and how it relates to used CPU cycles is also worth noting. Look at how inadequate storage I/O influences compute. The image below shows the CPU utilization for one of the Linux builds using 8vCPUs and 8GB RAM when the /home drive was using fast storage (the local SSD on the vSphere host)

Now look at the same build when running against a busy SAN array. It completely changes the CPU usage profile, and thus took 46% longer to complete.
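To translate "46% longer" into a share of the total build time: if the fast run takes T, the slow run takes 1.46 T, so the extra 0.46 T makes up 0.46 / 1.46 of the slow run. A small sketch of that arithmetic, with a function name made up for illustration:

```shell
# Share of a slow run's wall-clock time that is attributable to the slowdown,
# given how much longer (in percent) the slow run took than the fast one.
slowdown_share() {
    awk -v pct="$1" 'BEGIN { printf "%.0f\n", 100 * pct / (100 + pct) }'
}

slowdown_share 46   # prints 32
```

In other words, roughly a third of the SAN-backed build's wall-clock time went to the slowdown.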

General Observations and lessons

If you are running any hosts using pre-Nehalem architectures, now is a good time to question why. They may not be worth wasting vSphere licensing on. The core count and architectural improvements on the newer chips put the nails in the coffin on these older chips.
Storage Storage Storage. If you have CPU intensive operations, deal with the CPU, but don’t neglect storage. The test results above demonstrate how one can easily give back the entire amount of performance gains in CPU by not having storage performance to support it.
Giving a Windows code compiling VM a lot of CPU, but not increasing the RAM, seemed to make the compiler trip on its own toes. This makes sense, as more CPUs need more memory addresses to work with.
The testing showcased another element of virtualization that I love. It often helps you understand problems that you might otherwise be blind to. After establishing baseline testing, I noticed some of the Linux build systems were not multithreading the way they should. Turns out it was some scripting errors by our Developers. Easily corrected.

Conclusion
The new Dell M620 blades provided an immediate performance return. All of the build VMs have been scaled up to 8vCPUs and 8GB of RAM to get the best return while providing good scalability of the cluster. Even with that modest doubling of virtual resources, we now have nearly 30 build VMs that, when storage performance is no longer an issue, will run between 4 and 4.3 times faster than the same VMs on the old M600 blades. The primary objective moving forward is to target storage that will adequately support these build VMs, as well as looking into ways to improve multithreaded code compiling in Windows.

Helpful Links
Kitware blog post on multithreaded code compiling options
http://www.kitware.com/blog/home/post/434

Using a Synology NAS as an emergency backup DNS server for vSphere

Powering up a highly virtualized infrastructure can sometimes be an interesting experience. Interesting in that “crossing-the-fingers” sort of way. Maybe it’s an outdated run book, or an automated power-on of particular VMs that didn’t occur as planned. Sometimes it is nothing more than a lack of patience between each power-on/initialization step. Whatever the case, if it is a production environment, there is at least a modest amount of anxiety that goes along with this task. How often does this even happen? For those who have extended power outages, far too often.

One element that can affect power-up scenarios is availability of DNS. A funny thing happens though when everything is virtualized. Equipment that powers the infrastructure may need DNS, but DNS is inside of the infrastructure that needs to be powered up. A simple way around this circular referencing problem is to have another backup DNS server that supplements your normal DNS infrastructure. This backup DNS server acts as a slave to the server having authoritative control for that DNS zone, and would handle at minimum recursive DNS queries for critical infrastructure equipment, and vSphere hosts. While all production systems would use your normal primary and secondary DNS, this backup DNS server could be used as the secondary name server for a few key components:

vSphere hosts
Server and enclosure Management for IPMI or similar side-band needs
Monitoring nodes
SAN components (optional)
Switchgear (optional)

vSphere certainly isn’t as picky as it once was when it comes to DNS. Thank goodness. But guaranteeing immediate availability of name resolution will help your environment during these planned, or unplanned power-up scenarios. Those that do not have to deal with this often have at least one physical Domain Controller with integrated DNS in place. That option is fine for many organizations, and certainly accomplishes more than just availability of name resolution. AD design is a pretty big subject all by itself, and way beyond the scope of this post. But running a spare physical AD server isn’t my favorite option for a number of different reasons, especially for smaller organizations. Some folks way smarter than me might disagree with my position. Here are a few reasons why it isn’t my preferred option.

One may be limited in Windows licensing
There might be limited availability of physical enterprise grade servers.
One may have no clue as to if, or how a physical AD server might fit into their DR strategy.

As time marches on, I also have a feeling that this approach will be falling out of favor anyway. During a breakout session for optimizing virtualized AD infrastructures at the 2012 VMWorld, it was interesting to hear that the VMware Mothership still has some physical AD servers running the PDCe role. However, they were actively in the process of eliminating this final, physical element, and building recommendations around doing so. And let’s face it, a physical DC doesn’t align with the vision of elastic, virtualized datacenters anyway.

To make DNS immediately available during these power-up scenarios, the prevailing method in the “Keep it Simple Stupid” category has been running a separate physical DNS server. Either a Windows member server with a DNS role, or a Linux server with BIND. But it is a physical server, and us virtualization nuts hate that sort of thing. But wait! …There is one more option. Use your Synology NAS as an emergency backup DNS server. The intention of this is not to supplant your normal DNS infrastructure. It’s simply to help a few critical pieces of equipment start up.

The latest version of Synology’s DSM (4.1) comes with a beta version of a DNS package. It is pretty straightforward, but I will walk you through the steps of setting it up anyway.

1. Verify that your Windows DNS servers allow zone transfers to the IP address of the NAS. Jump into the Windows Server DNS MMC snap-in, highlight the zone you want to set up a zone transfer for, and click Properties. Add or verify that the settings allow a zone transfer to the new slave server.

2. In the Synology DSM, open the Package Center, and install the DNS Server package.

3. Enable the Synology DSM Firewall to allow for DNS traffic. In the Synology DSM, open the Control Panel > Firewall. Highlight the interface desired, and click Create. Choose “Select from a built in list of applications” and choose “DNS Server”. Save the rule, and exit out of the Firewall application.

4. Open up “DNS Server” from the Synology launch menu.

5. Click on “Zones” and click Create > Slave Zone. Choose a “Forward Zone” type, and select the desired domain name, and Master DNS server.

6. Verify the population of resource records by selecting the new zone, clicking Edit > Resource Records.

7. If you want, or need, to have this server forward DNS requests, enable the forwarders checkbox. (In my Home Lab, I enable this. In my production environment, I do not.)

8. Complete the configuration, and test with a client using this IP address only for DNS, simply to verify that it is operating correctly. Then, go back and tighten up some of the security mechanisms as you see fit. Once that is completed, jump back into your ESXi hosts (and any other equipment) and configure your secondary DNS to use this server.
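The test in step 8 can also be done without changing a client's DNS settings, by pointing a query directly at the NAS. A minimal sketch, assuming `dig` is installed on the machine doing the check; the function name, the `DIG` override variable, the IP address, and the host name below are all placeholders made up for illustration.

```shell
# Resolver command; DIG can be overridden to substitute another tool.
DIG=${DIG:-dig}

# Succeeds if the DNS server ($1) returns at least one answer for the name ($2).
resolves_via() {
    answer=$($DIG +short "@$1" "$2") || return 1
    [ -n "$answer" ]
}

# Example (placeholder address and name):
#   resolves_via 192.168.10.20 esx01.example.local && echo "NAS is answering for the zone"
```

Once that passes, the NAS can be added as the secondary DNS server on each host. On ESXi 5.x I believe this can be done from the command line with `esxcli network ip dns server add --server=<NAS IP>`, but check that against your build before relying on it; the vSphere Client works just as well.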

In my case, I had a Synology NAS in my home lab to try this out on, as well as a newly deployed unit at work (serving the primary purpose of a Veeam backup target). In both cases, it has worked exactly as expected, and allowed me to junk an old server at work running BIND.

If the NAS lived on an isolated storage network that wasn’t routable, then this option wouldn’t work, but if you have one living on a routable network somewhere, then it’s a great option. The arrangement reduces the number of components in the infrastructure while ensuring service availability.

Even if you have multiple internal zones, you may want to have this slave server only handling your primary zone. No need to make it more complicated than it needs to be. You also may choose to set up the respective reverse lookup zone as a slave. Possible, but not necessary for this purpose.

There you have it. Nothing ground breaking, but a simple way to make a more resilient environment during power-up scenarios.

Helpful Links:

VMWorld 2012. Virtualizing Active Directory Best Practices (APP-BCA1373). (Accessible by VMWorld attendees only)
http://www.vmworld.com/community/sessions/2012/

Vroom! Scaling up Virtual Machines in vSphere to meet performance requirements

A typical conversation with one of our Developers goes like this. “Hey, that new VM you gave us is great, but can you make it say, 10 times faster?” Another day, and another request by our Development Team to make our build infrastructure faster. What is a build infrastructure, and what does it have to do with vSphere? I’ll tell you…

Software Developers have to compile, or “build” their source code before it is really usable by anyone. Compiling can involve just a small bit of code, or millions of lines. Developers will often perform builds on their own workstations, as well as designated “build” systems. These dedicated build systems are often part of a farm of systems that are churning out builds on a fixed schedule, or on demand. Each might be responsible for different products, platforms, versions, or build purposes. This can result in dozens of build machines. Most of this is orchestrated by a lot of scripting or build automation tools. This type of practice is often referred to as Continuous Integration (CI), and is driven by Test Driven Development and Lean/Agile Development practices.

In the software world, waiting for builds is wasting money. Slower turnaround time and longer cycles leave less time or willingness to validate that changes to the code didn’t break anything. So there is a constant desire to make all of this faster.

Not long after I started virtualizing our environment, I demonstrated the benefits of virtualizing our build systems. Oftentimes the physical build systems were on tired old machines lacking uniformity, protection, revision control, or performance monitoring. That is not exactly a desired recipe for business critical systems. We have benefited in so many ways from these systems being virtualized, whether it is cloning a system in just a couple of minutes, or knowing they are replicated offsite without even thinking about it.

But one problem. Code compiling takes CPU. Massive amounts of it. It has been my observation that nothing makes better use of multiple cores than compilers. Many applications simply aren’t able to multi-thread, while other applications can, but don’t do it very well – including well known enterprise application software. Throw the right command line switch on a compiler, and it will peg out your latest rocket of a workstation.
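As an illustration of that kind of switch, here is a small sketch that assumes a build driven by GNU make, whose `-j` option sets how many jobs run at once. The post does not say which build tools are actually in use, so both the make-based build and the helper name are assumptions. The helper defaults to one job per logical CPU that the operating system (or the VM) exposes.

```python
import os


def parallel_build_command(jobs=None, target="all"):
    """Return a GNU make command line that runs `jobs` compile jobs at once."""
    if jobs is None:
        # Default to one job per logical CPU visible to this machine or VM.
        jobs = os.cpu_count() or 1
    return ["make", f"-j{jobs}", target]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(parallel_build_command()))
```

On a 4 vCPU build VM this would produce `make -j4 all`, which is the sort of invocation that keeps every assigned vCPU busy for the length of the compile.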

Take a look below. This is a 4vCPU VM. That solid line pegged at 100% nearly the entire time is pretty much the way the system will run during the compile. There are exceptions, as tasks like linking are single threaded. What you see here can go on for hours at a time.

This is a different view of that same VM above, showing a nearly perfect distribution of threading across the vCPUs assigned to the VM.

So, as you can see, the efficiency of the compilers actually presents a bit of a problem in the virtualized world. Let’s face it, one of the values virtualization provides is the unbelievable ability to use otherwise wasted CPU cycles for other systems that really need them. But what happens if the build system itself really needs them? Well, consolidation ratios go down, and sizing becomes really important.

Compiling from source code can involve handling literally millions of little tiny files. You might think there is a ton of disk activity. There certainly can be I/O, but the build is rarely disk bound. This stuck out loud and clear after some of the Developers’ physical workstations had SSDs installed. After an initial hiccup with some bad SSDs, further testing showed almost no speed improvement. Looking at some of the performance data on those workstations showed that SSDs had no effect because the systems were always CPU bound.

Even with the above, some evidence suggests that the pool of Dell EqualLogic arrays (PS6100 and PS600) used in this environment is nearing its performance threshold. Ideally, I would like to incorporate the EqualLogic hybrid array. The SSD/SAS combo would give me the IOPS needed if I started running into I/O issues. Unfortunately, I have to plan for incorporating this into the mix perhaps a bit later in the year.

RAM for each build system is a bit more predictable. Most systems are not memory hogs when compiling. 4 to 6 Gigabytes of RAM used during a build is quite typical. Linux has a tendency to utilize it more if it has it available, especially when it comes to file IO.

The other variable is the compiler. Windows platforms may use something like Visual Studio, while Linux will use a GCC compiler. The differences in performance can be startling. Compile the exact same source code on two machines with the exact same specs, with one running Windows/Visual Studio, and the other running Linux/GCC, and the Linux machine will finish the build in one third of the time. I can’t do anything about that, but it is a worthy data point when trying to speed up builds.

The Existing Arrangement
All of the build VMs (along with the rest of the VMs) currently run in a cluster of 7 Dell M6xx blades inside a Dell M1000e enclosure. Four of them are Dell M600s with dual socket, Harpertown based chips. Three others are Dell M610s running Nehalem chips. The Harpertown chips didn’t support Hyper-Threading, so vSphere sees a total of just 8 logical cores on each of those hosts. The Nehalem based systems show 16 logical cores.

All of the build systems (25 as of right now, running a mix of Windows and Linux) run no greater than 4vCPUs. I’ve held firm on this limit of going no greater than 50% of the total physical core count of a host. I’ve gotten some heat for it, but I’ve been rewarded with very acceptable CPU Ready times. After all, this cluster had to support the rest of our infrastructure as well. By physical workstation standards (especially expensive Development workstations), these VMs are pathetically slow. Time to do something about it.
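The 50% rule and the CPU Ready check can be written down as two small calculations. This is a sketch and not a sizing tool: the function names are made up, and the CPU Ready conversion assumes you are reading the "summation" counter (in milliseconds) from a vSphere real-time performance chart, which samples every 20 seconds. Other chart intervals need a different `interval_s`.

```python
def max_vcpus_per_vm(physical_cores: int, fraction: float = 0.5) -> int:
    """Largest vCPU count per VM under a 'fraction of physical cores' cap."""
    return max(1, int(physical_cores * fraction))


def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0) -> float:
    """Convert a CPU Ready summation value (ms) to a percentage of the interval.

    Assumes the value covers one sampling interval of `interval_s` seconds.
    The result is the VM-wide total; divide by the VM's vCPU count for a
    per-vCPU average.
    """
    return ready_ms * 100.0 / (interval_s * 1000.0)


if __name__ == "__main__":
    # Hosts in the existing cluster have 8 physical cores each.
    print("vCPU cap on an 8 core host:", max_vcpus_per_vm(8))
    # Hypothetical sample: 1000 ms of ready time in a 20 second interval.
    print("CPU Ready:", cpu_ready_percent(1000), "%")
```

With 8 physical cores per host, the cap works out to the 4 vCPUs described above.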

The Plan
The plan is simple. Increase CPU resources. For the cluster, I could either scale up (bigger hosts) or scale out (more hosts). In my case, I was really limited by the capabilities of the existing hosts, plus I wanted to refrain from buying more vSphere licenses unless I had to, so it was well worth it to replace the 4 oldest M600 blades (using Intel Harpertown chips). The new blades, which will be Dell M620s, will have 192GB of RAM versus just 32GB in the old M600s. And lastly, in order to take advantage of some of the new chip architectures in the new blades, I will be splitting this off into a dedicated 4 host cluster.

	New M620 Blades	Old M600 Blades
Chip	Intel Xeon E5-2680	Intel Xeon E5430
Clock Speed	2.7GHz (or faster)	2.66GHz
# of physical cores	16	8
# of logical cores	32	8
RAM	192 GB	32 GB

The new blades will have dual 8 core Sandy Bridge processors, giving me 16 physical cores, and 32 logical cores with Hyper-Threading, for each host. This is double the physical cores, and 4 times the logical cores, compared with the older hosts. I will also be paying the premium price for clock speed. I rarely get the fastest clock speed of anything, but in this case, it can truly make a difference.

I have to resist throwing in the blades and just turning up the dials on the VMs. I want to understand at what level I will be getting the greatest return. I also want to see at what level the dreaded CPU Ready value starts cranking up. I’m under no illusion here: a given host has only so many CPU cycles, no matter how powerful it is. But in this case, it might be worth tolerating some degree of contention if it means that the majority of the time it finishes the builds some measurable amount faster.
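One way to frame the "greatest return" question before testing is Amdahl's law, which estimates the speedup from adding cores when part of the work (linking, for example) stays single threaded. The sketch below is only a back-of-the-envelope model. The 95% parallel fraction is an invented placeholder and not a measurement from these build systems, and the model ignores CPU Ready, memory, and I/O entirely.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Theoretical speedup over one core when only part of the job scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    assumed_parallel = 0.95  # hypothetical: 95% of the build scales with cores
    for vcpus in (4, 8, 12, 16):
        speedup = amdahl_speedup(assumed_parallel, vcpus)
        print(f"{vcpus:>2} vCPUs -> about {speedup:.1f}x over a single vCPU")
```

Whatever the real fraction turns out to be, the model's ceiling is 1 / (1 - parallel fraction), so each additional vCPU buys less than the one before it. That is why measuring at several VM sizes matters more than simply assigning the maximum.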

So how powerful can I make these VMs? Do I dare go past 8 vCPUs? 12 vCPUs? How about 16? Any guesses? What about NUMA, and the negative impact that might occur if one goes beyond a NUMA node? Stay tuned! …I intend to find out.
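For the NUMA question, a rough first check is whether a VM's vCPU count even fits inside one NUMA node. The sketch below assumes one NUMA node per socket with 8 physical cores each, which matches the dual 8 core hosts described above. It counts cores only and says nothing about memory locality or how the ESXi scheduler will actually place the VM.

```python
import math


def numa_nodes_spanned(vcpus: int, cores_per_node: int = 8) -> int:
    """Minimum number of NUMA nodes needed to give each vCPU a physical core."""
    return math.ceil(vcpus / cores_per_node)


if __name__ == "__main__":
    for vcpus in (4, 8, 12, 16):
        nodes = numa_nodes_spanned(vcpus)
        note = "fits in one node" if nodes == 1 else f"spans {nodes} nodes"
        print(f"{vcpus:>2} vCPUs: {note}")
```

By this count, 8 vCPUs is the largest size that stays within a single node on these hosts, so 12 and 16 vCPU VMs are where any cross-node penalty would first show up.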