How CPU-related metrics in vSphere may be misinterpreted

Most Data Center Administrators are accustomed to looking for high CPU utilization rates on VMs, and on the hosts in which they reside. This shouldn’t be a big surprise. After all, vCenter and other monitoring tools have default alarms that alert on high CPU usage. Features like DRS, or products that claim DRS-like functionality, factor in CPU-related metrics as a part of their ability to redistribute VMs under periods of contention. All of these alerts and activities suggest that high CPU values are bad, and low values are good. But what if conventional wisdom on the consumption of CPU resources is wrong?

Why should you care
Infrastructure metrics can certainly be a good leading indicator of a problem. Over the years, high CPU usage alarms have helped correctly identify many rogue processes on VMs ("Hey, who enabled the screen saver via GPO?…"). But a CPU alarm trigger assumes that high CPU usage is always bad. It also implies that the absence of an alarm condition means there is not an issue. Both assumptions can be incorrect, which may lead to bad decision making in the Data Center.

The subtleties of performance metrics can reveal problems somewhere else in the stack – if you know how and where to look. Unfortunately, when metrics are looked at in isolation, the problems remain hidden in plain sight. This post will demonstrate how a few common metrics related to CPU utilization can be misinterpreted. Take a look at the post Observations with the Active Memory metric in vSphere to see how this can happen with other metrics as well.

The testing
There are a number of CPU-related metrics to monitor in the hypervisor, and at least a couple of different ways to look at them (vCenter and esxtop). For brevity, let’s focus on two metrics that are readily visible in vCenter: CPU Usage and CPU Ready. This doesn’t dismiss the importance of other CPU-related metrics, or the various ways to gather them, but it is a good start to understanding the relationship between metrics. As a quick refresher, CPU Usage as it relates to vCenter has two definitions. On the host, usage is the percentage of CPU cycles in use against the total CPU cycles available on the host. On the VM, usage shows the percentage of CPU resources in use against the total available CPU cycles of the vCPUs visible to the VM. CPU Ready in vCenter measures, in summation form, the amount of time that the virtual machine was ready to run, but could not get scheduled onto the physical CPU.
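
Since vCenter reports CPU Ready as milliseconds of accrued ready time per sample interval rather than as a percentage, it often helps to normalize it before comparing VMs. Below is a minimal sketch in Python of the commonly used conversion; the 20 second interval matches vCenter’s real-time charts, and the sample values are purely illustrative, not measurements from these tests.

```python
# Convert vCenter's CPU Ready "summation" value (milliseconds accrued per
# sample) into a percentage of the sample interval. The 20 second interval
# is vCenter's real-time sampling rate; the values below are hypothetical.

def cpu_ready_percent(ready_ms: float, interval_s: int = 20, vcpus: int = 1) -> float:
    """Percentage of the sample interval spent ready but unscheduled,
    averaged across the VM's vCPUs."""
    return ready_ms / (interval_s * 1000.0 * vcpus) * 100.0

# 400 ms of accrued ready time in one 20 second sample, on a 4 vCPU VM:
print(cpu_ready_percent(400, vcpus=4))   # 0.5 (% per vCPU)
```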

A few notes about the test conditions and results:

  • The tests here consist of activities that are scheduled inside each guest, and are repeated 5 times over a 1 hour period.
  • There are no synthetic tools used here to generate storage I/O load or consume CPU cycles. (iometer, StressLinux, etc.)
  • The activities performed use processes that are only partially multithreaded. This approach is most reflective of real world environments.
  • The "slower" storage depicted in the testing were actually SSDs, while the "faster" storage was by leveraging PernixData FVP and distributed fault tolerant memory (DFTM) as a storage acceleration tier.
  • The absolute numbers are not necessarily important for this testing. The focus is more about comparing values when a variable like storage performance changes.
  • No shares, reservations, or limits were used on the test VMs.

The complex demands of real world environments may exhibit a much greater impact than what the testing below reveals. I reference a few actual cases of production workloads later on in the post. Synthetic load generators were not used here because they cannot properly simulate a pattern of activity that is reflective of a real environment. Synthetic load generators are good at stressing resources – not simulating real world workloads, or the time it takes for those workloads to complete their tasks.

Interpreting impacts on CPU usage and CPU Ready with changing storage performance
Looking at CPU utilization can be challenging because not all applications, nor the workloads they generate, are the same. Most applications are a complex mix of some processes being multithreaded while others are not, and some processes initiating storage I/O while others do not. It is for this reason that we will look at CPU Usage and CPU Ready over a task that is repeated on the same sets of VMs, but using storage that performs differently.

For all practical purposes, CPU Ready doesn’t become meaningful until a host is running a large number of single vCPU VMs concurrently, or a number of multi-vCPU VMs concurrently. CPU Ready can sometimes be terribly tricky to decipher because it can be influenced in so many ways. Sometimes it may align with CPU utilization, while other times it may not. It may be affected by other resources, or it may not. It really depends on the environmental conditions. I find it a good supporting metric, but definitely not one that should stand on its own merit, without the proper context of other metrics. We are measuring it here because it is generally regarded as important, and is one that may contribute to load distribution activities.

Test 1: Single vCPU VM on a Host with no other activity
First let’s look at one of the very simplest of comparisons: a single vCPU VM with no other activity occurring on the host, where one test is using slower storage (blue), and the other test is using faster storage (orange). A task was completed 5 times over the course of one hour. The image below shows that from the host perspective, peak CPU utilization increased by 79% when using the faster storage. CPU Ready demonstrated very little change, which was expected due to the nature of this test (no other VMs running on the host).

hoststats-1vcpu

When we look at the individual VMs, the results are similar. The images below show that CPU usage maximums for the VM increased by 24% when using the faster storage. CPU Ready demonstrated very little change here because there were no other VMs to contend with on that host. The "Storage Latency" column shows the average storage latency the VM was seeing during this time period.

vmstats-1vcpu

You might think that the higher latency shown may not be representative of today’s storage technologies. The "slower" storage in this case did in fact come from SSD based storage. But remember that Flash of any kind can suffer in performance when committing larger block I/O, which is quite common with real workloads. Take a look at "Understanding block sizes in a virtualized environment" for more information.

But wait… how long did the task, set to run 5 times over the period of one hour, take? Well, the task took just half the time to run with the faster storage. The same number of cycles were processing the same number of I/Os, just for a shorter period of time. This faster completion of a task frees up those CPU cycles for other VMs. This is the primary reason why the averages for CPU Usage and CPU Ready changed very little. Looking at this data in timeline form in vCenter illustrates it quite clearly: there is a clear distinction in the characteristics of the task on the fast storage, while the run on the slower storage is much more difficult to decipher.

combined-singlevCPU

Test 2: Multiple vCPU VM on a host with other activity
Now let’s run the same workload on VMs assigned multiple (4) vCPUs, along with other multi-vCPU VMs running in the background. This simulates a bit of "chatter," or activity that one might experience in a production environment.

As we can see from the images below, on the host level, both CPU Usage and CPU Ready values increased as storage performance increased. CPU usage maximums increased by 39% on the host. CPU Ready maximums increased by 34% on the host, a noticeable difference from the testing without any other systems running.

hoststats-4vcpu

When we look at the individual VMs, the results are similar. The images below show that CPU usage maximums increased by 39% with the faster storage. CPU Ready maximums increased by 51% while running on the faster storage. Considering the typical VM to host consolidation ratio, the effects can be profound.

vmstats-4vcpu

Now let’s take a look at the timeline in vCenter to get an appreciation of how those CPU cycles were used. In the image below, you can see that like the single vCPU VM testing, the VM running on faster storage allowed for much higher CPU usage than when running on slower storage, but for a much shorter period of time (about half). You will notice that in this test, the CPU Ready measurements generally increased as the CPU usage increased.

combined-multivCPU

Real world examples
This all brings me back to what I witnessed years ago while administering a vSphere environment consisting of extremely CPU and storage I/O intensive workloads: dozens of resource intensive VMs built for the purpose of compiling code. These were systems that could multithread to near perfection – assuming storage performance was sufficient – as the image below shows.

fasterprocess

Now let’s look at what CPU utilization rates looked like on that same VM, running the same code compiling job where the storage environment wasn’t able to satisfy reads and writes fast enough. The same job took 46% longer to complete, all because the available CPU cycles couldn’t be used.

longprocess

Still not a believer? Take a look at a presentation from the OpenStack Summit in April 2016, where Charter Communications demonstrates exactly the effect I describe: their Cassandra cluster deployed with VMware Integrated OpenStack, and the effect on CPU utilization when provided lower latency, higher performing storage (key information beginning at 17:10). Their more freely breathing storage allowed CPU cycles related to storage I/O to complete more quickly, thereby finishing the tasks much more quickly. High CPU usage was a desired result of theirs.

You might be thinking to yourself, "Won’t I have more CPU contention with faster storage?"  Well, yes and no. Faster storage gives power back to the Administrator to control the usage of resources as needed, and to deliver the SLAs required. And moving the point of contention to the CPU lets it do what it does best: time slicing processes to complete tasks as quickly as possible.

Sample what?
The rate at which telemetry data is sampled is a factor that can dramatically change your impression of how these Data Center resources are being used. It’s a big topic, and one that will be touched on in an upcoming post, but there is one thing to note here. When leveraging faster, lower latency storage, there are many times when CPU utilization and CPU Ready will appear to stay the same. Why? In a real workload that involves CPU cycles executing to commit storage I/O, a workflow may consist of a given number of those I/Os, regardless of how long it takes. If that process took 18 seconds on slow storage, but 5 seconds on faster storage, the 20 second sampling rate within vCenter may render both runs in the same way. One often has to employ other tools to see these figures at a higher sampling rate; tools such as vscsiStats and esxtop are good examples of this. The sketch below illustrates the sampling effect.
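
To make the sampling effect concrete, here is a small, hypothetical model of a task that needs a fixed amount of CPU work. Whether it runs for 18 seconds at modest utilization or 5 seconds at full utilization, the single 20 second sample containing it reports the same average. The numbers are made up purely for illustration.

```python
# Why a 20 second sampling interval can render a fast and a slow run of the
# same task identically: the total CPU work is fixed, only its duration
# changes. All numbers here are illustrative.

SAMPLE_S = 20        # vCenter real-time sampling interval (seconds)
CAPACITY = 100.0     # host capacity, in work units per second
TOTAL_WORK = 500.0   # work units the task must complete either way

def report(duration_s: float) -> None:
    peak_pct = (TOTAL_WORK / duration_s) / CAPACITY * 100   # while running
    sample_pct = TOTAL_WORK / SAMPLE_S / CAPACITY * 100     # what the sample shows
    print(f"{duration_s:>4.0f}s run: {peak_pct:5.1f}% while busy, "
          f"{sample_pct:4.1f}% averaged over the 20s sample")

report(18)   # slow storage:  27.8% while busy, 25.0% in the sample
report(5)    # fast storage: 100.0% while busy, 25.0% in the sample
```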

Takeaways
The testing and examples above should make it easy to imagine a scenario in which a storage system is upgraded, and CPU-related alarms are tripped more frequently, even though the processes that support a workflow complete much more quickly. So with that, it’s good to keep the following in mind.

  • Slow storage will suppress CPU utilization rates – giving you the impression that from a host or VM perspective, everything is fine.
  • Conversely, fast storage will allow those CPU cycles related to storage I/O to execute, thereby increasing utilization rates – albeit for a shorter period of time. High CPU statistics are not necessarily a bad thing.
  • Averages and peaks can be misleading, because increased utilization rates may not be recognizable in the vCenter CPU charts if the activity completes within the smallest sampling size (20 seconds).
  • Traditional methods of monitoring and balancing host resources can be misleading.
  • Higher CPU utilization rates may not be a leading indicator of an issue. They are often a trailing indicator of well-designed processes, or free breathing storage. Again, high CPU can be a good thing!
  • Application behavior, and its results, are what count. If a batch job in SQL takes 30 minutes, success should be defined around the desired time of that batch job. Infrastructure related metrics should help you diagnose issues and assist with achieving a desired result, but not be the one and only KPI.
  • Storage performance will generally impact every VM and host accessing the cluster, whereas host based resource contention will only impact other VMs living on that same host.

Thanks for reading

– Pete

Viewing the impact of block sizes with PernixData Architect

In the post Understanding block sizes in a virtualized environment, I describe what block sizes are as they relate to storage I/O, and how they became one of the most overlooked metrics of virtualized environments. The first step in recognizing their importance is providing visibility to them across the entire Data Center, but with enough granularity to view individual workloads. However, visibility into a metric like block sizes isn’t enough. The data itself has little value if it cannot be interpreted correctly. The data must be:

  • Easy to access
  • Easy to interpret
  • Accurate
  • Easy to understand how it relates to other relevant metrics

Future posts will cover specific scenarios detailing how this information can be used to better understand and tune your environment for better application performance. Let’s first learn how PernixData Architect presents block sizes when looking at very basic, but common, read/write activity.

Block size frequencies at the Summary View
When looking at a particular VM, the "Overview" view in Architect can be used to show the frequency of I/O sizes across the spectrum of small blocks to large blocks for any time period you wish. This I/O frequency will show up on the main Summary page when viewing across the entire cluster. The image below focuses on just a single VM.


vmpete-summary

What is most interesting about the image above is that the frequencies of block sizes for reads and for writes are different. This is a common characteristic that has largely gone unnoticed, because there has been no way to easily view the data in the first place.

Block sizes using a "Workload" View
The "Workload" view in Architect presents a distribution of block sizes in a percentage form as they occur on a single workload, a group of workloads, or across an entire vSphere cluster. The time frame can be any period that you wish. This view tells more clearly and quickly than any other view as to how complex, unique, and dynamic the distribution of block sizes are for any given VM. The example below represents a single workload, across a 15 minute period of time. Much like read/write ratios, or other metrics like CPU utilization, it’s important to understand these changes as they occur, and not just a single summation or percentage over a long period of time.

vmpete-workloadview

Viewing the "Workload" view in any real world environment will instantly provide new perspective on the limitations of synthetic I/O generators. Their general inability to emulate the very complex distribution of block sizes in your own environment limits their value for that purpose. The "Workload" view also shows how dramatic, and continuous, the changes in workloads can be. This speaks volumes as to why one-time storage assessments are not enough. Nobody treats their CPU or memory resources that way. Why would we limit ourselves that way for storage?

Keep in mind that the "Workload" view illustrates this distribution in percentage form. Views that are percentage based aim to illustrate proportion relative to a whole; they do not show the absolute values behind those percentages. The distribution could represent 50 IOPS, or 5,000 IOPS. However, this type of view can be incredibly effective in identifying subtle changes in a workload, or across an environment, for short term analysis or long term trending.

Block sizes in an IOPS and Throughput view
First let’s take a look at this VM through the IOPS view, using the default "Read/Write" breakdown. The image below shows a series of reads before a series of mostly writes. When you look back at the "Workload" view above, you can see how these I/Os were represented by their block size distribution.

vmpete-IOPSrw

Staying on this IOPS view and selecting the predefined "Block Size" breakdown, we can see the absolute numbers that are occurring, based on block size. Unlike the "Workload" view above, the image below shows the actual number of I/Os issued for each given block size.

vmpete-IOPSblocksize

But that doesn’t tell the whole story. A block size is an attribute of a single I/O, so in an IOPS view, 10 IOPS of 4K blocks looks the same as 10 IOPS of 256K blocks. In reality, the latter is 64 times the amount of data. The way to view this from a "payload amount transmitted" perspective is to use the Throughput view with the "Block Size" breakdown, as shown below.

vmpete-TPblocksize

Viewing block sizes by their payload (Throughput), as shown above, provides a much better representation of the dominance of the large block sizes, and the relatively small payload of the smaller block sizes. The short sketch below restates the arithmetic.
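
The arithmetic behind this is simple enough to put into code. The IOPS figures below are hypothetical, and serve only to restate the 4K versus 256K comparison above.

```python
# Throughput is simply IOPS multiplied by block size, which is why equal
# IOPS figures can hide a 64x difference in payload. Values are illustrative.

def throughput_kb_per_s(iops: int, block_kb: int) -> int:
    """Payload moved per second, in KB."""
    return iops * block_kb

small = throughput_kb_per_s(10, 4)      # 10 IOPS of 4K blocks   ->   40 KB/s
large = throughput_kb_per_s(10, 256)    # 10 IOPS of 256K blocks -> 2560 KB/s
print(small, large, large // small)     # 40 2560 64
```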

Here is another way Architect can help you visualize this data. We can click on the "Performance Grid" view and change it so that we see IOPS and Throughput for specific block sizes. As the image below illustrates, the top row shows IOPS and Throughput for 4K to <8K blocks, while the bottom row shows IOPS and Throughput for blocks 256K and larger.

vmpete-performancegrid-IOPSTP

What the image above shows is that while the number of IOPS for block sizes in the 4K to <8K range was, at its peak, similar to the number of IOPS for block sizes of 256K and above, the larger blocks delivered an enormously larger payload.

Why does it matter?
Let’s let PernixData Architect tell us why all of this matters. We will look at the effective latency of the VM over that same time period. We can see from the image below that the effective latency of the VM definitely increased as it transitioned to predominantly writes. (Read/Write boxes unticked for clarity.)

vmpete-Latencyrw

Now, let’s look at the image below, which shows latency by block size using the "Block Size" breakdown.

vmpete-Latency-blocksize

There you see it. Latency was, by and large, a result of the larger block sizes. The flexibility of these views can take an otherwise innocent looking latency metric and tell you what was contributing most to that latency.

Now let’s take it a step further. With Architect, the "Block Size" breakdown is a predefined view that shows the block size characteristics of reads and writes combined – whether you are looking at Latency, IOPS, or Throughput. However, you can use a custom breakdown to not only show block sizes, but to show them specifically for reads or for writes, as in the image below.

vmpete-Latency-blocksize-custom

The "Custom" Breakdown for the Latency view shown above had all of the reads and writes of individual block sizes enabled, but some of them were simply "unticked" for clarity. This view confirms that the majority of latency was the result of writes that were 64K and above. In this case, we can clearly demonstrate that latency seen by the VM was the result of larger block sizes issued by writes. It’s impact however is not limited to just the higher latency of those larger blocks, as those large block latencies can impact the smaller block I/Os as well. Stay tuned for more information on that subject.

As shown in the image below, Architect also allows you to simply click on a single point and drill in for more insight. This can be done on a per VM basis, or across the entire cluster. Hovering over each vertical bar representing the various block sizes will tell you how many I/Os were issued at that time, and the corresponding latency.

vmpete-Latency-Drilldown

Flash to the rescue?
It’s pretty clear that block size can have a significant impact on the latency your applications see. Flash to the rescue, right?  Well, not exactly. All of the examples above come from VMs running on Flash. Flash, and how it is implemented in a storage solution, is part of what makes this so interesting, and so impactful to the performance of your VMs. We also know that the storage media is just one component of your storage infrastructure. These components, and their abilities to hinder performance, exist regardless of whether one is using a traditional three-tier architecture, or a distributed storage architecture like Hyper Converged environments.

Block sizes in the Performance Matrix
One unique view in Architect is the Performance Matrix – unique in what it presents, and in how it can be used. Your storage solution might have been optimized by the manufacturer based on certain assumptions that may not align with your workloads, and typically there is no way of knowing that. As shown below, Architect can help you understand the workload characteristics under which the array begins to suffer.

vmpete-PeformanceMatrix

The Performance Matrix can be viewed on a per VM basis (as shown above) or in aggregate form. It’s a great view for seeing at what block size thresholds your storage infrastructure may be suffering, as the VMs see it. This is very different from statistics provided by an array, as Architect offers a complete, end-to-end understanding of these metrics with extraordinary granularity. Arrays are simply not in the correct place to accurately understand, or measure, this type of data.

Summary
Block sizes have a profound impact on the performance of your VMs, and are a metric that should be treated as a first class citizen, just like compute and other storage metrics. The stakes are far too high to leave this up to speculation, or to words from a vendor that say little more than "Our solution is fast. Trust us."  Architect leverages its visibility of block sizes in ways that have never been possible. It takes advantage of this visibility to help you translate what it is into what it means for your environment.

Understanding block sizes in a virtualized environment

Cracking the mysteries of the Data Center is a bit like space exploration. You think you understand what everything is, and how it all works together, but struggle to understand where fact and speculation intersect. The topic of block sizes, as they relate to storage infrastructures, is one such mystery. The term is familiar to some, but elusive enough that many remain uncertain as to what it is, or why it matters.

This inconspicuous, but all too important characteristic of storage I/O has often been misunderstood (if not completely overlooked) by well-intentioned Administrators attempting to design, optimize, or troubleshoot storage performance. Much like the topic of Working Set Sizes, block sizes are given little attention by Administrators and Architects because of this lack of visibility and understanding. Sadly, myth turns into conventional wisdom – not only about what is typical in an environment, but about how applications and storage systems behave, and how to design, optimize, and troubleshoot for such conditions.

Let’s step through this process to better understand what a block is, and why it is so important to understand its impact on the Data Center.

What is it?
Without diving deeper than necessary, a block is simply a chunk of data. In the context of storage I/O, it is a unit in a data stream; a read or a write from a single I/O operation. Block size refers to the payload size of that single unit. We can blame a bit of the confusion about what a block is on overlapping industry nomenclature. Commonly used terms like block sizes, cluster sizes, pages, latency, etc. may be used in disparate conversations, but what is being referred to, how it is measured, and by whom, may often vary. Within the context of discussing file systems, storage media characteristics, hypervisors, or Operating Systems, these terms are used interchangeably, but do not have universal meaning.

Most who are responsible for Data Center design and operation know the term as an asterisk on the performance specification sheet of a storage system, or as a configuration setting in a synthetic I/O generator. Performance specifications on a storage system are often the result of a synthetic test using the most favorable block size (often 4K or smaller) to maximize the number of IOPS that an array can service. Synthetic I/O generators typically allow one to set this, but users often have no idea what the distribution of block sizes is across their workloads, or whether it is even possible to simulate that with synthetic I/O. The reality is that many applications draw a unique mix of block sizes at any given time, depending on the activity.

I first wrote about the impact of block sizes back in 2013, when introducing FVP into my production environment at the time. (See the section "The IOPS, Throughput & Latency relationship.")  FVP provided a glimpse of the impact of block sizes in my environment. Countless hours with the performance graphs, and with vscsiStats, provided new insight into those workloads, and the environment in which they ran. However, neither tool was necessarily built for real time analysis or long term trending of block sizes for a single VM, or across the Data Center. I had always wished for an easier way.

Why does it matter?
The best way to think of block size is as the amount of storage payload carried in a single unit. The physics of it becomes obvious when you think about the size of a 4KB payload versus a 256KB payload, or even a 512KB payload. Since we refer to them as blocks, let’s use a square to represent their relative capacities.

image

Throughput is the result of IOPS and the block size of each I/O being sent or received. It’s not just the fact that a 256KB block carries 64 times the amount of data of a 4K block; it is the additional effort throughout the storage stack it takes to handle it – whether that be bandwidth on the fabric, the protocol, or processing overhead on the HBAs, switches, or storage controllers. And let’s not forget the burden it places on the persistent media.
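
To see how this plays out across a whole workload rather than a single I/O, here is a short sketch that weights a block size mix by payload. The mix is invented for illustration; the point is that a small share of the I/O count can dominate the bytes moved through the stack.

```python
# Weighting an I/O mix by payload. The block size mix below is invented;
# real distributions vary by the second, as described elsewhere in the post.

io_mix = {4: 5000, 8: 2000, 32: 800, 64: 400, 256: 200}   # block KB -> IOPS

total_iops = sum(io_mix.values())
total_kb_per_s = sum(kb * iops for kb, iops in io_mix.items())

for kb, iops in io_mix.items():
    io_share = iops / total_iops * 100
    payload_share = kb * iops / total_kb_per_s * 100
    print(f"{kb:>3}K blocks: {io_share:4.1f}% of I/Os, "
          f"{payload_share:4.1f}% of payload")

# The 256K blocks are only ~2.4% of the I/Os, yet ~37% of the payload.
```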

This variability in performance is more prominent with Flash than with traditional spinning disk. Reads are relatively easy for Flash, but the methods used for writing to NAND Flash can prevent writes from achieving the performance of reads, especially with writes using large blocks. (For more detail on the basic anatomy and behavior of Flash, take a look at Frank Denneman’s post on Flash wear leveling, garbage collection, and write amplification. Here is another primer on the basics of Flash.)  A very small number of writes using large blocks can trigger all sorts of activity on the Flash devices that prevents the effective performance from behaving as it does with smaller block I/O. This volatility in performance is a surprise to just about everyone when they first see it.

Block size can impact storage performance regardless of the type of storage architecture used. Whether it is a traditional SAN infrastructure, or a distributed storage solution used in a Hyper Converged environment, the factors and the challenges remain. Storage systems may be optimized for a block size that does not necessarily align with your workloads. This could be the result of design assumptions of the storage system, or the limits of its architecture. The abilities of storage solutions to cope with certain workload patterns vary greatly as well. The difference between a good storage system and a poor one often comes down to its ability to handle large block I/O. Insight into this information should be a part of the design and operation of any environment.

The applications that generate them
What makes the topic of block sizes so interesting are the Operating Systems, the applications, and the workloads that generate them. Block sizes are often dictated by the processes of the OS and of the applications running within it.

Unlike what many might think, there is often a wide mix of block sizes in use at any given time on a single VM, and it can change dramatically by the second. These changes have a profound impact on the ability of the VM, and the infrastructure it lives on, to deliver the I/O in a timely manner. It’s not enough to know that perhaps 30% of the blocks are 64KB in size. One must understand how they are distributed over time, and how the latencies or other attributes of those various block sizes relate to each other. Stay tuned for future posts that dive deeper into this topic.

Traditional methods of gaining visibility
The traditional methods for viewing block sizes have been limited. They provide an incomplete picture of their impact – whether it be across the Data Center, or against a single workload.

1. Kernel statistics courtesy of vscsiStats. This utility is a part of ESXi, and can be executed via the command line of an ESXi host. It provides a summary of block sizes for a given period of time, but suffers from a few significant problems.

  • Not ideal for anything but a very short snippet of time, against a specific vmdk.
  • Cannot present data in real-time.  It is essentially a post-processing tool.
  • Not intended to show data over time.  vscsiStats will show a sum total of I/O metrics for a given period, but only for a single sample period.  It has no way to track this over time; one must script it to create results for more than a single period (a rough scripting sketch follows this section).
  • No context.  It treats that workload (actually, just the VMDK) in isolation.  It is missing the context necessary to properly interpret.
  • No way to visually understand the data.  This requires the use of other tools to help visualize the data.

The result, especially at scale, is a very labor intensive exercise, and still an incomplete solution. It is extremely rare that an Administrator runs through this exercise on even a single VM to understand its I/O characteristics.
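
As a rough illustration of the scripting burden described above, here is a sketch of how one might repeatedly sample vscsiStats on an ESXi host to approximate trending over time. The worldGroupID is hypothetical (list real ones with vscsiStats -l), and the flags used (-s to start collection, -p ioLength to print the I/O size histogram, -x to stop) should be verified against your ESXi build. Treat it as a starting point, not a finished tool.

```python
# A sketch of wrapping vscsiStats to collect block size histograms over
# time, since the utility itself reports only a single sample period.
# Meant to run on the ESXi host itself; verify flags against your build.

import subprocess
import time

WORLD_GROUP_ID = "12345"   # hypothetical; list real IDs with: vscsiStats -l
SAMPLES = 12               # 12 samples...
INTERVAL_S = 300           # ...of 5 minutes each = 1 hour of trending

def vscsistats(*args: str) -> str:
    result = subprocess.run(["vscsiStats", *args],
                            capture_output=True, text=True)
    return result.stdout

for i in range(SAMPLES):
    vscsistats("-s", "-w", WORLD_GROUP_ID)                # start collection
    time.sleep(INTERVAL_S)
    histogram = vscsistats("-p", "ioLength", "-w", WORLD_GROUP_ID)
    with open(f"ioLength_{i:02d}.txt", "w") as f:
        f.write(histogram)                                # graph/post-process later
    vscsistats("-x", "-w", WORLD_GROUP_ID)                # stop collection
```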

2. Storage array. This would be a vendor specific "value add" feature that might present some simplified summary of data with regard to block sizes, but this too is an incomplete solution:

  • Not VM aware.  Since most intelligence is lost the moment storage I/O leaves the host HBA, a storage array has no idea what block sizes were associated with a VM, or in what order they were delivered.
  • Measuring at the wrong place.  The array is simply the wrong place to measure the impact of block sizes. Think about all of the queues storage traffic must go through before writes are committed to the storage, and before reads are fetched. (This also assumes no caching tiers outside of the storage system exist.)  The desire would be to measure at a location that takes all of this into consideration: the hypervisor. Incidentally, this is often why an array can show great performance in its own statistics, while the observed latency of the VM still suffers. It speaks to the importance of measuring data at the correct location.
  • Unknown and possibly inconsistent method of measurement.  Showing block size information is not a storage array’s primary mission, and doesn’t necessarily use the same method of measurement as where the I/O originates (the VM, and the host it lives on). Therefore, how it is measured, and how often it is measured, is generally of low importance, and not disclosed.
  • Dependent on the storage array.  If different types of storage are used in an environment, this doesn’t provide adequate coverage for all of the workloads.

The hypervisor is an ideal control plane for analyzing this data. It focuses on the results as the VMs see them, without being dependent on the nuances of in-guest metrics or the features of a storage solution. It is inherently the ideal position in the Data Center for a proper, holistic understanding of your environment.

Eyes wide shut – Storage design mistakes from the start
The flaw with many design exercises is that we assume we know what our assumptions are. Let’s consider the typical inputs when it comes to storage design. These include factors such as:

  • Peak IOPS and Throughput.
  • Read/Write ratios
  • RAID penalties
  • Perhaps some physical latencies of components, if we wanted to get fancy.

Most who have designed or managed environments have gone through some variation of this exercise, followed by a little math to come up with the correct blend of disks, RAID levels, and fabric to support the desired performance. Known figures are used when they are available, and the others might be filled in with assumptions. And yet, block sizes, and everything they impact, are nowhere to be found. Why? A lack of visibility, and a lack of understanding.

If we know that block sizes can dramatically impact the performance of a storage system (as will be shown in future posts), shouldn’t they be a part of any design, optimization, or troubleshooting exercise?  Of course they should. Just as with working set sizes, lack of visibility doesn’t excuse lack of consideration. An infrastructure only exists because of the need to run services and applications on it. Let those applications and workloads help tell you what type of storage fits your environment best – not the other way around.

Is there a better way?
The ideal approach for measuring the impact of block sizes will always include measuring from the location of the hypervisor, as this provides the measurements in the right way, and from the right location. vscsiStats and vCenter related metrics are an incredible resource to tap into, and will provide the best understanding of the impact of block sizes on a storage system. There may be some time investment to decipher the block size characteristics of a workload, but the payoff is generally worth the effort.

My vSphere Home Lab. 2016 edition

Here we go again. I had no intention of writing a follow-up to my "Home Lab 2015 edition" post last year, as I didn’t foresee any changes to the lab in the coming year that would be interesting enough to write about. 

So much for predicting the future.

Sometimes Home Lab environments tend to border on vanity projects. I would like to think the recent changes in my lab were done out of need, but rationalizing wants into needs is common enough to be considered a national pastime. Nevertheless, my profession now has me testing workloads and new technologies on a daily basis, and this was a driving force behind these upgrades. Honest.

Demand often drives change, and this is where the evolution of my Home Lab continues to mimic a production environment – just at a smaller scale. Budget, performance, capacity, space, and heat are all elements of a Home Lab design that are almost laughably similar to those of a production environment. Workloads evolve, and needs grow – quickly rendering previously used design inputs inadequate. That is exactly what happened to me, and I knew I had to invest in a few upgrades.

Compute – Performance/Testing Cluster
It was finally time to replace a few of the oldest components of the lab. My primary hosts were built on Intel Sandy Bridge processors, with motherboards limited to just 32GB of RAM and PCIe 2.0. I didn’t have any 10Gb connectivity aside from my old InfiniBand gear, and I was consistently pushing the CPUs to their limit.

I decided to go with a pair of SuperMicro 5018D-FN4T rack mounted units. These have an incredibly small 1U form factor, and feature built-in dual 10GbE and dual 1GbE interfaces, a dedicated IPMI port, a PCIe 3.0 slot, and 4 drive bays, and can pack in up to 128GB of DDR4 memory. The motherboard uses the soldered-on 8 core Xeon D-1540 chip, and the power supply is built into the chassis. Both items reduce flexibility, but improve the no-brainer simplicity of the unit. What is most surprising when you get your hands on them is that they are incredibly small, yet still half empty when the case is cracked open. A third host will probably be in the works at some point, but it’s not necessary at this time.

image

It probably will come as no surprise that multiple PernixData FVP based acceleration tiers are an integral component of my infrastructure, so a few changes occurred in that realm.

1. Adding NVMe cards to use as a Flash based acceleration tier for FVP. For this lab arrangement, I used the Intel 750 NVMe based PCIe 3.0 card. While they are not officially on the VMware HCL, they are fine for the Home Lab, as they borrow heavily from the Intel DC P3xxxx line of NVMe cards that are on the VMware HCL. Intel NVMe cards are outstanding performers, enjoying the benefits of completely bypassing legacy elements of the traditional storage stack on a host, such as storage controllers and SCSI commands. An NVMe based Flash device is still limited by the physics of NAND Flash, but it is an incredible performer that can make any SSD based Flash drive look quite feeble in comparison. Just make sure to use Intel’s driver for vSphere.

2. More RAM to use as a DFTM acceleration tier in FVP. I added 64GB of Micron memory, which allows me to allocate a nice chunk of RAM for FVP acceleration. The beauty of using memory as an acceleration tier is that it avoids all of the characteristics of NAND Flash, and it can leverage compression techniques. This typically increases the effective tier size between 30% and 70%, depending on the workload. The larger the tier size, the more content can live in the tier, and the less eviction occurs against the working set of data.

Compute – Management Cluster
A management cluster in a Home Lab is great. It has allowed me to really experiment with testing workloads and new technologies without any impact on the components that run the infrastructure. My Management Cluster now comprises three Intel NUCs. I would have been perfectly happy with just a couple of NUCs as a Management Cluster, but unfortunately their 16GB RAM limitation makes that a bit tough. Eventually, the NUCs will outlive their usefulness in the lab, but the great part about them is that they can easily be repurposed as a desktop workstation, or a media server. For now, they will continue to serve their purpose as a Management Cluster.

Switching
Upgrading my network meant adding 10GbE connectivity. For this, I chose a Netgear XS708E, an 8 port 10GbE switch, to serve as a fast interconnect for east-west traffic between hosts. My adventures with InfiniBand were always interesting and educational. It’s an amazing technology, but there was just too much administrative overhead with the gear I was using. Unfortunately, there are not too many small, affordable 10GbE switches out there. The Dell 12 port X4012 10GbE switch looked really appealing based on the specs, but its ports are SFP+, so that would have meant rethinking a number of things. As for the Netgear, what do I think of it?  After configuring the product, I’m convinced the folks at Netgear wanted to punish anyone who buys the unit. All of the configuration items that should be so basic in a CLI or web based UI are obfuscated in a proprietary interface that seems to be missing half of the options you’d expect. Dear Netgear: please let me configure LAGs, trunks, MTU size, and VLANs with something remotely resembling common sense. It does work, but if I could do it over, I’d choose something else.

My network core still consists of a Cisco SG300-20 Layer 3 switch. Moving away from hosts with six 1GbE ports to hosts with just two 1GbE ports and two 10GbE ports meant I was able to free up space on this switch. It still carries a bit of a premium price for a 20 port L3 switch, but it has been a rock solid component of my lab for over 4 years now.

Ancillary Components
One thing I was tired of dealing with was my wireless gateway. I’ve grown sour on the consumer WiFi/router solutions available. Most aren’t stable, and lack features unless one cracks them open with a DD-WRT build. Memory leaks and other reboot inducing behaviors are not what you want to deal with when attempting to access the lab remotely, so it was time to take a new approach. I went with the following for my gateway and wireless needs.

Motorola SB6121 DOCSIS 3.0 Cable Modem. This was purchased to replace the oversized cable modem provided by the service provider. It’s small, affordable, and prevents the cable company from changing settings on me, as they often would with their own unit.

Ubiquiti EdgeRouter PoE. This 5 port unit serves as my gateway, where one leg feeds downstream to my core switch, and another leg is used as a DMZ for my WiFi. This is a great unit that offers everything I was looking for: trunking, static routes, NAT, and firewalling. The multiple PoE ports make it easy to add new wireless access points.

Ubiquiti UniFi AP Wireless Access Point. These access points pair nicely with the PoE based router above.

It’s been a rock solid, winning combination. Always on, with no random need to reboot. Total control over configuration, and no silliness from the cable provider. Mission accomplished.

Storage
This was one of the few components that didn’t change. Storage is served up by two 5-bay Synology units, with a mix of SSDs and spinning disk. I had plenty of capacity, with enough options to test various media if needed.

Mounting
Until this latest refresh, a $25 utility rack had housed the assortment of oddly shaped lab gear pretty well. With the changeover to small 1U rackmount servers and additional switchgear, it was time for an official enclosure. I went with a Tripp Lite 9U Wall Mount Cabinet. It will eventually be wall mounted, but for the time being, sits perfectly on a $12 moving dolly from Harbor Freight. The cabinet has some nice mounting ports for supplementary exhaust fans should the need arise.

Relocation
Within the first few minutes of powering up the new hosts, I realized the arrangement was going to need a new home. Server room loud?  No. But moving from 38dB to 50+dB is loud enough that you wouldn’t want to work next to it all day. There is no way 1U fans spinning at 8,000 RPM will ever be soothing. I had been quite proud of how quiet my lab gear had been up to this point: I stayed away from 1U anything, and went with quiet fans wherever I could. I tried desperately to suppress the noise, replacing all of the fans with ultra-quiet Noctua fans. Unfortunately, ultra-quiet can also mean they don’t move much air, and it’s not good to disregard the delta in CFM between fans. The heat alarms made it very clear this wasn’t going to work, and I didn’t want to burn up perfectly good gear. I chose to place all of the factory fans back in the 1U servers and the 10GbE switch, and used the Noctua fans as supplementary fans in each device. They do help the primary fans spin at a lower rate, so the effort wasn’t a total waste. The 9U cabinet will eventually be relocated to a more permanent location, but for the time being, it’s making a coat closet nice and warm.

What it looks like
The entire lab, including the UPS, is now self-contained, which should make its final relocation straightforward. The entire arrangement (5 hosts, 2 switches, 2 Synology NAS units, etc.) draws between 250 and 300 watts, depending upon the load. Considering the old, much less capable arrangement ran at about 200 watts, I was pretty happy with the result.

image

In the spirit of full disclosure, the cabinet door does cover up some rather careless cable management practices. Regardless, I am thrilled with the end result and how it performs. A space efficient arrangement that is extremely powerful.

No matter how little, or how much you decide to invest in a Home Lab, I’ve learned that the satisfaction seems to be directly proportional to how much value it brings to you. Whether it be a hobby, used for professional growth, or a part of your day-to-day job duties, any sense of buyer’s remorse only seems to creep in when it’s not used. For my circumstances, that doesn’t seem to be a problem.

A closer look at the new UI for PernixData FVP, and beyond

In many ways, making a good User Interface (UI) seems like a simple task. As evidenced by so many software makers over the years, it is anything but simple. A good UI looks elegant to the eye, and becomes a part of muscle memory without the user even realizing it. A bad UI can feel like a cruel joke; designed to tease the brain, and frustrate the user. It’s never done intentionally, of course. In fact, bad visual and functional designs happen in every industry, all the time. Just think of your favorite ugly car. At some point there was an entire committee that gave it a thumbs up. User Experience (UX) design is also an imperfect science, and impressions are subject to the eyes of the beholder.

A good UI should present function effortlessly: make the complex simple. However, more than just buttons and menus factor into a user experience. That larger, encompassing UX design incorporates, among other things, functional requirements with a visual interface that is productive and intuitive. PernixData FVP has always received high marks for that user experience. The product not only accelerated storage I/O, but presented itself in such a way that made it informative and desirable to use.

Why the change?
PernixData products (FVP, and the up-and-coming Architect) now have a standalone HTML5 interface, using your favorite browser. Moving away from the vSphere Web Client was a deliberate move that at first impression might be a bit surprising. With changes in needs and expectations comes the challenge of understanding the best way to achieve a desired result. Standalone, traditionally compiled clients are not as appealing as they once were, for numerous reasons, so adopting a modern web based framework was important.

Moving to a standalone, pure HTML5 UI built from the ground up allowed for these interactions to be built just the way they should be. It removes limits explicitly or implicitly imposed by someone else’s standards. PernixData gets to step away from the shadows of VMware’s current implementation of FLEX. Removing limitations allows for more flexibility now, and in the future.

UI characteristics
One of the first impressions you will get is the performance of the UI. It is quick and snappy. UX pain often begins with performance – whether that is technical speed, or the ability for a user to find what they want quickly. The new UI continues where the older UI left off: telling more with less, and doing so very quickly.

Looking at the image below, you will also see that the UI was designed for the use with multiple products. The framework is used not only for FVP, but for the upcoming release of PernixData Architect. This allows for transitions between products to be fluid, and intuitive.

vmpete-Hub

New search capabilities
In larger environments, isolating and filtering VMs for deeper review is a valuable feature. It’s not that big of a deal with a few dozen VMs, but with a few hundred or more, it becomes difficult to keep track. The quick search abilities allow for real time filtering of VMs based on search criteria. Highlighting those VMs then allows for easy comparison.

vmpete-quicksearch

More granularity with the hero numbers
Hero numbers have been a great way to see how much offload has occurred in an infrastructure: how many I/Os were offloaded from the datastore, how much bandwidth never touched your storage infrastructure due to this offload, and how many writes were accelerated. In previous versions, those numbers started counting from the moment the FVP cluster was created. In FVP 3.0, you can choose to see how much offload has occurred over a more granular period of time.

vmpete-heronumbers

New graphs to show cache handling
Previously, the "Hit Rate and Eviction Rate" metric helped express cache usage, and were combined in a single graph. Hit Rate indicated the percentage of reads that were serviced by the acceleration tier. It didn’t measure writes in any way. Eviction Rate indicated the percentage of data that was being evicted from the acceleration tier to make room for new incoming hot data. Each of them now have their own graphs that are more expansive in the information they provide.

As shown below, "Acceleration Rate" is in place of "Hit Rate."  This new metric accounts for both reads and writes. One thing to note is that writes will only show as "accelerated" here when in Write Back mode. Even though Write Back and Write Through populate the cache in the same way, the green "write" line will only indicate acceleration when the VM or VMs are using a Write Back policy.

vmpete-accelerationrate

"Population and Eviction" (as shown below) replaces the latter half of the "Hit Rate and Eviction Rate" metric. Note that Eviction Rate is no longer measured as a percentage, but by actual amount in GB. This is a better way to view it, as the sizes of acceleration tiers vary, and thus the percentage value varied. Now you can tell more accurately how much data is being evicted at any given time. Population rate is exactly as it sounds. This is going to account for write data being placed into the cache regardless of its Write Policy (Write Back or Write Through), as well as data read for the first time from the backing storage, and placed into the cache (known as a "false write"). This graph provides much more detail about how the cache is being utilized in your environment.

vmpete-populationeviction

Now, if you really want to see some magical charts, and the insights that can be gleaned from them, go take a look at PernixData Architect.  I’ll be covering those graphs in more detail in upcoming posts.

Summary
A lot of new goodies have been packed into the latest version of FVP, but this covers a bit about why the UI was changed, and how PernixData products are in a great position to evolve and meet the demands of the user and the environment.

Related Links

Inside PernixData Engineering – User Interaction Design

Inside PernixData Engineering – UI and Web Technologies

Understanding PernixData FVP’s clustered read caching functionality

When PernixData debuted FVP back in August 2013, there was one innovation in particular that stood out for me above the rest: the ability to accelerate writes (known as "Write Back" caching) on the server side, and to do so in a fault tolerant way. Leverage fast media on the server side to drive microsecond write latencies to a VM, while enjoying all of the benefits of VMware clustering (vMotion, HA, DRS, etc.). Give the VM the advantage of physics by presenting a local acknowledgement of the write, but maintain all of the benefits of keeping your compute and storage layers separate.

But sometimes overlooked with this innovation is the effectiveness that comes with how FVP clusters acceleration devices to create a pool of resources for read caching (known as “Write Through” caching with FVP). For new and existing FVP users, it is good to get familiar with the basics of how to interpret the effectiveness of clustered read caching, and how to look for opportunities to improve the results of it in an environment. For those who will be trying out the upcoming FVP Freedom edition, this will also serve as an additional primer for interpreting the metrics. Announced at Virtualization Field Day 5, the Freedom Edition is a free edition of FVP with a few limitations, such as read caching only, and a maximum of 128GB tier size using RAM.

The power of read caching done the right way
Read caching alone can sometimes be perceived as helpful but temporary, addressing only one side of the I/O dialogue. Unfortunately, that assertion tells an incomplete story. Read caching is often criticized, but let’s remember that caching in some form is used by almost everyone, and everything: storage arrays of all types, Hyper Converged solutions, and even DAS. Dig a little deeper, and you realize its perceived shortcomings are most often attributable to how it has been implemented. By that I mean:

  • Limited, non-adjustable cache sizes in arrays or Hyper Converged environments.
  • Limited to a single host in server side solutions (with operations like vMotion undermining their effectiveness).
  • Not VM or workload aware.

Existing solutions address some of these shortcomings, but fall short of addressing all three in order to deliver read caching in a truly effective way. FVP’s architecture addresses all three, giving you the agility to quickly adjust the performance tier while letting your centralized storage do what it does best: store data.

Since FVP allows you to choose the size of the acceleration tier, this impact alone can be profound. For instance, current NVMe based Flash cards are 2TB in size, and are expected to grow dramatically in the near future. Imagine a 10 node cluster with perhaps 20-40TB of an acceleration tier serving up just 50TB of persistent storage. Compare this to a hybrid array that may only put a few hundred GB of Flash devices in an array serving up that same 50TB, funneled through a pair of array controllers – Flash that the I/Os would still have to traverse the network and storage stack to get to, and cached data that is arbitrarily evicted for new incoming hot blocks.

Unlike other host side caching solutions, FVP treats the collection of acceleration devices on each host as a pool. As workloads are actively moved across hosts in the vSphere cluster, those workloads will still be able to fetch the cached content from that pool using a lightweight protocol. Traditional host based caching would have to re-warm the data from the backend storage, using the entire storage stack and traditional protocols, if something like a vMotion event occurred.

FVP is also VM aware. This means it understands the identity of each cached block – where it is coming from, and where it is going – and has many ways to maintain cache coherency (see Frank Denneman’s post Solving Cache Pollution). Traditional approaches to providing a caching tier were largely unaware of which VM the blocks of data were associated with; intelligence was typically lost the moment the block exited the HBA on the host. This sets up one of the most common, but often overlooked, scenarios in a real environment: one or more noisy neighbor VMs can easily pollute, and force eviction of, hot blocks in the cache used by other VMs. The arbitrary nature of this means potentially unpredictable performance with these traditional approaches.

How it works
The logic behind FVP’s clustered read caching approach is incredibly resilient and efficient. Cached reads for a VM can be fetched from any host participating in the cluster, which allows for a seamless leveraging of cache content regardless of where the VM lives in the cluster. Frank Denneman’s post on FVP’s remote cache access describes this in great detail.

Adjusting the charts
Since we will be looking at the FVP charts to better understand the benefit of read caching alone, let’s create a custom view. This will allow us to really focus on read I/Os, and not confuse them with any other write I/O activity occurring at the same time.

image


Note that when you choose a "Custom Breakdown", the same colors used to represent both reads and writes in the default "Storage Type" view will now be representing ONLY reads from their respective resource type. Something to keep in mind as you toggle between the default "Storage Type" view, and this custom view.

SNAGHTML16994fda

Looking at Offload
The goal for any well designed storage system is to deliver optimal performance to the applications.  With FVP, I/Os are offloaded from the array to the acceleration tier on the server side.  Read requests will be delivered to the VMs faster, reducing latency, and speeding up your applications. 

From a financial investment perspective, let’s not forget the benefit of I/O "offload" – in other words, read requests that were satisfied from the acceleration tier. Using FVP, that offload spares the storage arrays serving the persistent storage tier, the array controllers, the fabric, and the HBAs. The more offload there is, the less work for your storage arrays and fabric, which means you can target more affordable backend storage. The hero numbers showcase the sum of this offload nicely.

image

Looking at Network acceleration reads
Unlike other host based solutions, FVP allows for common activities such as vMotions, DRS, and HA to work seamlessly without forcing any sort of rewarming of the cache from the backend storage. Below is an example of read I/O from 3 VMs in a production environment, and their ability to access cached reads on an acceleration device on a remote host.

[image]

Note how latency remains low on those read requests that came from a remote acceleration device (the green line).

How well is my read caching working?
Regardless of which write policy (Write Through or Write Back) is being used in FVP, the cache is populated in the same way.

  • All read requests fetched from the backing storage will place that data into the acceleration tier.
  • All write I/O is placed in the cache as it is written to the physical storage.

Therefore, it is easy to conclude that if read I/Os did NOT come from the acceleration tier, it is for one of three reasons:

  • A block of data had been requested that had never been requested before.
  • The block of data had not been written recently, and thus was not residing in cache.
  • A block of data had once lived in the cache (via a read or write), but had been evicted due to cache size.

The first two items reflect the workload characteristics, while the last one is the result of a design decision – the cache size. With FVP you get to choose how large the devices are that make up the caching tier, so you ultimately determine how much the solution will benefit you. Cache size can have a dramatic impact on performance because a larger tier reduces the pressure to evict data that has already been cached in order to make room for new data.
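To make those population and eviction rules concrete, below is a toy sketch in Python. This is not FVP code – just a minimal write-through cache with LRU-style eviction and hypothetical block addresses – but it demonstrates how an undersized tier turns otherwise hot data into read misses.

```python
from collections import OrderedDict

class WriteThroughCache:
    """Toy model: reads and writes both populate the cache; LRU eviction
    when the configured capacity (the 'tier size') is reached."""
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()  # block address -> data

    def read(self, addr):
        if addr in self.blocks:
            self.blocks.move_to_end(addr)   # hit: refresh recency
            return "hit"
        self._insert(addr)                  # miss: fetch from backing
        return "miss"                       # storage populates the cache

    def write(self, addr):
        self._insert(addr)                  # writes populate the cache too

    def _insert(self, addr):
        self.blocks[addr] = object()
        self.blocks.move_to_end(addr)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the oldest entry

cache = WriteThroughCache(capacity_blocks=2)
print(cache.read(1))    # miss: block never requested before
cache.write(2)          # write populates the cache too
cache.read(3)           # miss: and block 1 is evicted to make room
print(cache.read(1))    # miss: block 1 was cached once, but evicted
```

Doubling capacity_blocks in this toy example turns that second miss into a hit – the same effect you get by growing the acceleration tier, or reducing the number of VMs participating in it.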

Visualizing the read cache usage
This is where the FVP metrics can tell the story. When looking at the "Custom Breakdown" view described earlier in this post, you can clearly see on the image below that while a sizable amount of reads were being serviced from the caching tier, the majority of reads (3,500+ IOPS sustained) in this time frame (1 week) came from the backing datastore.

[image]

Now, let’s contrast this to another environment and another workload. The image below clearly shows a large amount of data over the period of 1 day being served from the acceleration tier – nearly all of the read I/Os, and over 60MBps of throughput, never touched the array.

[image]

When evaluating read cache sizing, this is one of the reasons why I like this particular “Custom Breakdown” view so much. Not only does it tell you how well FVP is offloading reads; it also tells you the POTENTIAL of all reads that *could* be offloaded from the array. You get to choose how much offload occurs, because you decide how large your tier size is, and how many VMs participate in that tier.

Hit Rate will also tell you the percentage of reads that are coming from the acceleration tier at any point in time. This can be an effective way to view cache hit frequency, but to gain more insight, I often rely on this "Custom Breakdown" to get better context of how much data is coming from the cache and the backing datastores at any point in time. Eviction rate can also provide complementary information if it shows the eviction rate creeping upward. But there can be cases where a lower eviction percentage still evicts enough cached data over time to impact whether hot data remains in cache. That is why this particular "Custom Breakdown" is my favorite for evaluating reads.
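The arithmetic behind that context is simple. Here is a quick sketch (plain Python, hypothetical numbers) showing why a hit rate percentage alone says nothing about how much read I/O was actually offloaded:

```python
# Hypothetical samples: reads served from the acceleration tier vs. the
# backing datastore, taken at two different points in time.
samples = [
    {"cache_reads": 1900, "datastore_reads": 100},   # busy period
    {"cache_reads": 19,   "datastore_reads": 1},     # nearly idle period
]

for s in samples:
    total = s["cache_reads"] + s["datastore_reads"]
    hit_rate = 100.0 * s["cache_reads"] / total
    # Both periods report a 95% hit rate, but the absolute offload
    # (reads the array never saw) differs by two orders of magnitude.
    print(f"hit rate: {hit_rate:.0f}%  offloaded reads: {s['cache_reads']}")
```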

What might be a scenario for seeing a lot of reads coming from a backing datastore, and not from cache? Imagine running 500 VMs against an acceleration tier of just a few GB. The working set sizes are likely much larger than the cache size, which will churn through the cache and show little demonstrable benefit. Something to keep in mind if you are trying out FVP with a very small amount of RAM as an acceleration resource. Two effective ways to make this more efficient would be to 1.) increase the cache size or 2.) decrease the number of VMs participating in acceleration. Both achieve the same thing: providing more potential cache tier size for each accelerated VM. The idea for any caching layer is to have it large enough to hold most of the active data (aka "working set") in the tier. With FVP, you can easily adjust the tier size, or the VMs participating in it.

Don’t know what your working set sizes are?  Stay tuned for PernixData Architect!

Summary
Once you have a good plan for read caching with FVP, and arrange for a setup with maximum offload, you can drive the best performance possible from clustered read caching. On its own, clustered read caching implemented the way FVP does it can change the architectural discussion of how you design and spend those IT dollars. Pair this with write buffering in the full edition of FVP, and it can change the game completely.

Dogs, Rush hour traffic, and the history of storage I/O benchmarking–Part 2

Part one of "History of storage I/O Benchmarking" attempted to demonstrate how Synthetic Benchmarks on their own simply cannot generate or measure storage performance in a meaningful way. Emulating real workloads seems like such a simple matter to solve, but in truth it is a complex problem involving technical and non-technical challenges.

  • Assessing workload characteristics is often left to conjecture.  Understanding the correct elements to observe is the first step to simulating them for testing, but how the problem is viewed is often shaped by whatever tool is readily available.  Many of these tools may look at the wrong variables.  A storage array might have great tools for monitoring the array, but they offer an incomplete view as it relates to the VM or application performance.
  • Understanding performance in a Datacenter crosses boundaries of subject matter expertise.  A traditional Storage Administrator will see the world in the same way the array views it: blocks, LUNs, queues, and transport protocols.  Ask them about performance and be prepared for a monologue on rotational latencies, RAID striping efficiencies and read/write handling.  What about the latency as seen by the VM?  Don’t be surprised if that is never mentioned.  It may not even be their fault, since their view of the infrastructure may be limited by access control.
  • When introducing a new solution that uses a technology like Flash, the word itself is seen as a superlative, not a technology.  The name implies instant, fast, and other super-hero like qualities.  Brilliant industry marketing, but it comes at a cost.  Storage solutions are often improperly tested after some technology with Flash is introduced because conventional wisdom says it is universally faster than anything in the past.  A simplified and incorrect assertion.

Evaluating performance demands a balance of understanding the infrastructure, the workloads, and the software platforms they run on. This takes time and the correct tools for insight – something most are lacking. Part one described the characteristics of real workloads that are difficult to emulate, plus the flawed approach of testing in a clustered compute environment. Unfortunately, it doesn’t end there. There is another factor to be considered; the physical characteristics of storage performance tiering layers, and the logic moving data between those layers.

Storage Performance tiering
Most Datacenters deliver storage performance using multiple persistent storage tiers and various forms of caching and buffering. Synthetic benchmarks force a behavior on these tiers that may be unrealistic. Many times this is difficult to decipher, as the tier sizes and data handling can be obfuscated by a storage vendor or unknown to the tester. What we do know is that storage tiering can come in all shapes and sizes, whether it is a traditional array with data progression techniques, a hybrid array, a decoupled architecture like PernixData FVP, or a Hyper Converged solution. The reality is that this tiering occurs all the time.

With that in mind, there are two distinct approaches to test these environments.

  • Testing storage in a way to guarantee no I/O data comes from and goes to a top performing tier.
  • Testing storage in a way to guarantee that all I/O data comes from and goes to a top performing tier.

Which method is right for you? Neither method is right or wrong on its own, as each can serve a purpose. Let’s use the car analogy again:

  • Some might be averse to driving an electric car that only has a 100 mile range.  But what if you had a commute that rarely ever went more than 30 miles a day?  Think of that as being like a caching/buffering tier.  If a caching layer is large enough that it might serve that I/O 95% of the time, well then, it may not be necessary to focus on testing performance from that lower tier of storage.
  • In that same spirit, let’s say that same owner changed jobs and drove 200 miles a day.  That same car is a pretty poor solution for the job.  Similarly, if a storage array had just 20GB of caching/buffering for 100TB of persistent storage, the realistic working set sizes of the VMs that live on that storage would realize very little benefit from that 20GB of caching space.  In that case, it would be better to test the performance of the lower tier of storage.

What about testing the storage in a way that guarantees data comes from all tiers? Mixing a combination of the two sounds ideal, but it often will not simulate the way real data will reside on the tiers, and produces a result that is difficult to map to the way a real workload will behave. Because caching tier sizes are rarely identified, and there is often no true way to isolate a tier, this ironically ends up being the approach most commonly used – by accident alone.

When generating synthetic workloads that have a large proportion of writes, it can often be quite easy to hit buffer limit thresholds. Once again, this is due to a benchmark committing every CPU cycle to write I/O for sustained periods of time – something that does not happen even in extremely write intensive environments. It is for that reason that one can create a behavior with a synthetic benchmark against a tiered storage solution that rarely, if ever, happens in a real world environment.

When generating read I/O based synthetic tests using a large test file, those reads may sometimes hit the caching tier, and other times hit the slowest tier, which may show sporadic results. The reaction to this result often leads to running the test longer. The problem however is the testing approach, not the length of the test. Understanding the working set size of a VM is key, and should dictate how best to test in your environment. How do you determine a working set size? Let’s save that for a future post. Ultimately it is real workloads that matter, so the more you can emulate the real workloads, the better.

Storage caching population and eviction. Not all caching is the same
Caching layers in storage solutions can come in all shapes and sizes, but they depend on rules of engagement that may be difficult to determine. An example of two very important characteristics would be:

  • How they place data in cache.  Is some sort of predictive "data progression" algorithm being used?  Are the tiers using Write-Through caching to populate the cache in addition to populating it with data fetched from the backend storage?
  • How they choose to evict data from cache.  Does the tier use "First-in-First-Out" (FIFO), Least Recently Used (LRU), Least Frequently Used (LFU) or some other approach for eviction?

Synthetic benchmarks do not accommodate this well.  Real world workloads, however, depend highly on these rules of engagement, and the differences show up only in production environments.
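To illustrate how much those rules of engagement matter, here is a small sketch in Python. It replays the same hypothetical access trace – one hot block being re-read amid a stream of cold blocks – through a tiny two-entry cache, changing nothing but the eviction policy:

```python
from collections import OrderedDict, Counter

def run(trace, capacity, policy):
    """Replay a block access trace through a tiny cache and count hits.
    Population is identical in all cases; only the eviction rule differs."""
    cache, freq, hits = OrderedDict(), Counter(), 0
    for addr in trace:
        freq[addr] += 1
        if addr in cache:
            hits += 1
            if policy == "LRU":
                cache.move_to_end(addr)       # refresh recency on a hit
            continue
        if len(cache) >= capacity:
            if policy == "LFU":
                victim = min(cache, key=lambda a: freq[a])
            else:                             # FIFO and LRU share this:
                victim = next(iter(cache))    # evict the oldest entry
            del cache[victim]
        cache[addr] = True
    return hits / len(trace)

# Hypothetical trace: a hot block (0) re-read amid a scan of cold blocks.
trace = [0, 1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 6]
for policy in ("FIFO", "LRU", "LFU"):
    print(policy, f"{run(trace, capacity=2, policy=policy):.0%}")
```

Same trace, same cache size, yet FIFO hits 25% of the time while LRU and LFU hit roughly 42%. A synthetic benchmark that churns uniformly through a test file will never expose these differences; real workloads with skewed access patterns will.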

Other testing mistakes
As if there weren’t enough ways to screw up a test, here are a few other common storage performance testing mistakes.

  • Not testing as close to the application level as possible.  This sort of functional testing will be more in line with how the application (and the OS it lives on) handles real world data.
  • Long test durations.  Synthetic benchmarks are of little use when running an exhaustive (multi-hour) set of tests.  They tell very little, and just waste time.
  • Overlooking a parameter on a benchmark.  Settings matter because they can yield very different results.
  • Misunderstanding the read/write ratios of an environment.  Are you calculating your ratio by IOPS, or Throughput?  This alone can lead to two very different results (see the sketch after this list).
  • Misunderstanding the typical I/O sizes in your organization for reads and writes.  How are you choosing to determine what the typical I/O size is?
  • Testing reads and writes as two independent objectives.  Real workloads do not work like this, so there is little reason to test like this.
  • Using a final ‘score’ provided by a benchmark.  The focus should be on the behavior for the duration of the test.  Especially with technologies like Flash, careful attention should be paid to side effects from garbage collection techniques and other events that cause latency spikes. Those spikes matter.
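On the read/write ratio point above, here is a quick sketch (plain Python, hypothetical numbers) showing how the very same one-second sample can be called "read heavy" or "write heavy" depending on whether you calculate the ratio by IOPS or by throughput:

```python
# Hypothetical one-second sample: many small reads, fewer large writes.
reads  = {"iops": 1000, "io_size_kb": 4}     # 1000 x 4KB reads
writes = {"iops": 200,  "io_size_kb": 256}   # 200 x 256KB writes

read_tput  = reads["iops"]  * reads["io_size_kb"]    # KB/s
write_tput = writes["iops"] * writes["io_size_kb"]   # KB/s

by_iops = 100.0 * reads["iops"] / (reads["iops"] + writes["iops"])
by_tput = 100.0 * read_tput / (read_tput + write_tput)

print(f"read ratio by IOPS:       {by_iops:.0f}%")   # ~83%: "read heavy"
print(f"read ratio by throughput: {by_tput:.0f}%")   # ~7%:  "write heavy"
```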

Testing organizations are often vying for a position as a testing authority, or push methods and standards that somehow eliminate the mistakes described in this blog post series. Unfortunately they do not, but that does not matter anyway, as it is your data, and your workloads, that count.

Making good use of synthetic benchmarks
It may come across that Synthetic Benchmarks or Synthetic Load Generators are useless. That is untrue. In fact, I use them all the time. Just not in the way conventional wisdom indicates. The real benefit comes once you accept the fact that they do not simulate real workloads. Here are a few scenarios in which they are quite useful.

  • Steady-state load generation.  This is especially useful in virtualized environments when you are trying to create load against a few systems.  It can be a great way to learn and troubleshoot.
  • Micro-benchmarking.  This is really about taking a small snippet of a workload, and attempting to emulate it for testing and evaluation.  Often times the test may only be 5 to 30 seconds, but will provide a chance to capture what is needed.  It’s more about generating I/O to observe behavior than testing absolute performance.  Look here for a good example.
  • Comparing independent hardware components.  This is a great way to show the differences between an old and new SSD.
  • Help provide broader insight to the bigger architectural picture.

Observe, Learn and Test
To avoid wasting time "testing" meaningless conditions, spend some time in vCenter, esxtop, and other methods to capture statistics. Learn about your existing workloads before running a benchmark. Collaborating with an internal application owner can make better use of your testing efforts. For instance, if you are looking to improve your SQL performance, create a series of tests or modify an existing batch job to run inside of SQL to establish baselines and observe behavior. Test at the busiest time and the quietest time of the day, as they both provide great data points. This approach was incredibly helpful for me when I was optimizing an environment for code compiling.

Summary
Try not to lose sight of the fact that testing storage performance is not about testing an array. It’s about testing how your workloads behave against your storage architecture. Real applications always tell the real story. The reason why most dislike this answer is that it is difficult to repeat, and challenging to measure the right way. Testing the correct way can mean you might spend a little time better understanding the demand your applications put on your environment.

And here you thought you ran out of things to do for the day. Happy testing.


Dogs, Rush hour traffic, and the history of storage I/O benchmarking–Part 1

Evaluating performance of x86 based servers and workstations has had a history of deficiency. Twenty years ago, Administrators who tested system performance usually did little more than run a simple CPU benchmark to see how much faster a 50MHz system was than a 25 MHz system. Rarely did testing go beyond this. Nostalgia aside, it really was a simpler time.

Fast forward a few years, and testing became slightly more sophisticated. Someone figured out it might be good to test the slowest part of the system (storage), so methods and tools were created to accommodate. Storage moved beyond the physical confines of the server by using dedicated LUNs in a SAN array. The LUNs may not have been shared, but the fabric and the entry points to the array were. However, testing storage generally marched forward with little change. Virtualization changed the landscape even further by changing the notion of a dedicated LUN for a single system. Now, the fabric and every component on the storage system was shared.

Testing tools came and went, with some being nothing more than orphaned side projects. Some tools have more dials to turn, but many still run under the assumption that they are testing a physical host on local spinning disk. They do little to try to emulate a real workload, as they have no idea what that means. Many times these tools try to combine load generation with a single, final number for performance measurement. Almost as if whatever happened in between the start and finish didn’t matter.

Testing methods didn’t evolve much either. The quest for “top speed” was never supplanted by any other method – noteworthy, considering that a critical measurement of anything shared is performance under load or contention. Storage architectures and the media used have evolved, but this is rarely accounted for properly in testing. Often lost in the speeds and feeds discussion is the part that really counts – the performance of the applications and the VMs they live on.

This post will point out the flaws of synthetic testing of storage performance (the tools, and the techniques), but it may incorrectly give the impression that they are useless.  Quite the contrary actually.  They can be very helpful when used the correct way, and for the right reasons.  More on this later.

Deficiencies of benchmarks as a meaningful measuring stick
“I don’t use benchmarks. I have users” — Ancient Twitter Proverb

There is no substitute for the value of observing real world performance characteristics, but it does little to address the difficulty with measuring that performance in a repeatable way. Real workloads are a collection of widely moving variables that all have different types of impact on an environment and a user experience. Testing system performance is important, but only when it is properly understood what the testing tools are producing.

Synthetic benchmarks offer a number of benefits. They are typically very easy to run, and often produce some dashboard result that can be referenced later. But these tools and test methods share common characteristics that rarely generate anything resembling real data patterns. Among those distinctions are:

  • I/O generated from them is not a closed loop dialogue
  • They do not mimic dynamic variables of real workloads
  • Improper testing practices in a clustered compute environment

All of these warrant more detail, so let’s elaborate on each one.

I/O generated from them is not a closed loop dialogue
A simplified way to describe a typical I/O dialogue would be this: data is fetched, it is processed in some way, then it is sent on its way. The I/O “signature” of a workload could be described by the pattern and degree to which this dialogue occurs.  It is a pattern that is often repeated frequently if you observe workloads long enough.

Consider the fetching of some data, the processing of some data, and the writing of data. One might liken this process to a dog fetching a ball, you wiping the slobber off, and throwing it out again. Over and over again, and in that order. Single threaded of course.

[image]

Synthetic load generators attack this quite differently. The one and only job of a synthetic I/O generator is to fill up the queues as fast as possible using every CPU cycle. The I/O generator has no regard for anything. The data is not processed in any way because that is not what was asked of it. By comparison, Synthetic I/O pretty much looks like this:

[image]

With a Synthetic I/O generator, every CPU cycle will be pushed to perform a singular action. Reads are requested and writes are issued in an all or nothing fashion. Sure, some generators allow you to mix reads and writes, but the problem still remains: they do not reflect any meaningful dialogue, and cannot mimic a real workload.
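Some rough queueing arithmetic shows just how differently these two behave. The sketch below (plain Python, hypothetical numbers) models a single-threaded closed-loop workload, where every I/O waits on the previous one plus some processing time, against an open-loop generator that simply keeps a deep queue full:

```python
# Hypothetical numbers: 1 ms device latency, 0.5 ms of guest processing
# between dependent I/Os (the "wipe the slobber off" step).
latency_s, think_s, queue_depth = 0.001, 0.0005, 32

# Closed loop (real workload): each I/O waits for the previous one plus
# the processing step, so a single thread is bounded by round-trip time.
closed_loop_iops = 1 / (latency_s + think_s)

# Open loop (synthetic generator): no dependencies, no processing; it
# just keeps the queue full, so concurrency multiplies the rate.
open_loop_iops = queue_depth / latency_s

print(f"closed loop: {closed_loop_iops:,.0f} IOPS")   # ~667
print(f"open loop:   {open_loop_iops:,.0f} IOPS")     # 32,000
```

A 48x difference in IOPS from the same device, and neither number is wrong. They simply describe two completely different dialogues.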

They do not mimic dynamic variables of real workloads
Real workloads consume resources (CPU, Memory, Storage) far differently than their synthetic counterparts. At any given time, storage I/Os will be varying mixes of reads versus writes, I/O sizes, and coming from one or many CPU threads. The two images below show a 6 1/2 minute snippet of real I/O taken from a single VM in a production environment (using vscsiStats and Excel surface charts). During this time of heavy activity, notice how much the type of I/O in play varies.

Below you see the number of read I/Os, and the respective I/O sizes. Typically between 4K and 32K in size.

[image]

Below you see the number of write I/Os, and the respective I/O sizes. The majority of sizes range from 32K to 512K in size. This is occurring at the same time on the same VM.

[image]

Here you can see that read/write ratios vary for just this single VM, and more importantly, the size and the number of I/Os are all over the place. I/O sizes can have an enormous impact on storage performance, so one can imagine the difficulty in emulating them accurately. VMware’s I/O Analyzer attempts to address this by replaying a trace file captured from a real workload, but it still will not behave in the same way as multiple VMs from multiple hosts generating widely varying I/O patterns.  Analytics from storage arrays don’t help much either, as they are unable to see the pattern of data in this way.
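If you want to build a similar surface chart from your own environment, the general recipe is to pivot I/O counts into a time-by-size grid. Below is one possible sketch in Python; the input file name and column layout (one row per sampling interval and I/O size bucket) are assumptions for illustration, not the native vscsiStats output format:

```python
import csv
from collections import defaultdict

# Hypothetical CSV of sampled I/O: one row per (interval, I/O size) pair.
# Assumed columns: interval, io_size_kb, count
matrix = defaultdict(dict)   # interval -> {io_size_kb: count}
sizes = set()

with open("vm_io_samples.csv") as f:
    for row in csv.DictReader(f):
        size = int(row["io_size_kb"])
        matrix[int(row["interval"])][size] = int(row["count"])
        sizes.add(size)

# Pivot into a grid (rows = time, columns = I/O size) that pastes
# straight into an Excel surface chart.
cols = sorted(sizes)
print("interval," + ",".join(f"{s}KB" for s in cols))
for interval in sorted(matrix):
    print(f"{interval}," + ",".join(str(matrix[interval].get(s, 0)) for s in cols))
```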

Improper testing practices in a clustered compute environment
A typical Administrator who tests storage performance usually does so by setting up a single system (VM) to test peak performance (IOPS, Throughput & Latency) on a shared storage backend. It sounds logical at first, but this method doesn’t reflect the way data is handled in a clustered compute environment. Storage I/O from a clustered compute arrangement behaves in a way that is not unlike congestion on an interstate freeway. The performance of a freeway cannot be evaluated by a single car driving on it. Its performance measurement is derived when it is under load with multiple cars, with different intentions, destinations, sizes, and all of the other variables that introduce congestion. Modern traffic simulation and modeling solutions account for all of these variables to measure and improve what matters most – real traffic.

Unfortunately, most testers and tools take this same “single car” approach, and do not account for one of the most important elements in modern virtualized infrastructures: the clustered compute layer. A fast storage infrastructure needs to be able to handle the given number of compute nodes (physical hosts) now and in the future. After all, the I/Os are ALWAYS generated by the VMs and the collection of hosts they live on – not the backend storage. Painfully obvious, but often overlooked.

Take a look below. This is an illustration of I/O activity in a real environment (traditional clustered compute with SAN architecture), with green lines representing read I/Os and red lines representing write I/Os.

[image]

Now, let’s look at I/O activity from a traditional synthetic benchmarking approach. As you can see below, it looks pretty different.

[image]

Storage I/O generated in a real environment is a result of the number of nodes in a cluster and the workloads running on them. So a better (but far from perfect) way to test, at the very minimum, would be:

[image]

In a traditional architecture with an array that exceeds the capabilities of I/O generated from a single host, this is the method most commonly used to measure the absolute high water mark numbers of that array. Typically storage manufacturers quote the numbers from the array because it is a single point of measurement that fits nicely in Marketing materials. Unfortunately it doesn’t measure what really counts: the numbers as seen by the VMs – a fact that must not be forgotten when performing any sort of performance testing. The array of course always hits a limit at some point due to the characteristics of the array, or the fabric it has to traverse.

In any sort of clustered compute system, you cannot recognize the full power of the compute platform by testing off of one VM. The same thing goes for any type of distributed storage architecture. With clustered host based acceleration solutions like PernixData FVP, or even Hyper Converged solutions, the approach will have to be similar to above in order to measure correctly. These are different architectures that reshape the traditional data path, and with the testing recommendations above, should help in evaluating their performance. This approach also puts the focus back where it should be; the performance of the VMs, and not some irrelevant numbers from a physical storage array.

 

Proper simulation of I/O across all hosts will allow you to adequately factor in the performance of the storage fabric and all connection points. Most fabrics are quite fast when there isn’t any traffic on them. Unfortunately, that isn’t very realistic. It is important to understand the impact the fabric introduces as the environment is scaled. Since the fabric is what connects all of the hosts fetching and committing data to a storage array, we need to simulate how everything (HBAs, storage array controllers, switches, etc.) performs under contention. If you haven’t already done so, take a few moments to read one of my favorite posts from Frank Denneman, Data Path is not managed as a clustered resource.

Testing with multiple VMs on multiple hosts with FVP also allows you to take advantage of the per VM acceleration (write buffering & read caching) capabilities across a clustered compute environment. It is one of the reasons why FVP’s decoupled architecture can scale so well, and why real workloads become such a beneficiary of the architecture.

You thought we were finished, didn’t you
There are just too many ways Synthetic Benchmarks are misused to cover in just one post. Stay tuned for Part 2 for more observations on why they are inadequate as a single test for modern environments, and most importantly, when and how they can actually be useful.

Interpreting Performance Metrics in PernixData FVP

In the simplest of terms, performance charts and graphs are nothing more than lines with pretty colors.  They exist to provide insight and enable smart decision making.  Yet, accurate interpretation is a skill often trivialized, or worse, completely overlooked.  Depending on how well the data is presented, performance graphs can amaze or confuse, with hardly a difference between the two.

A vSphere environment provides ample opportunity to stare at all types of performance graphs, but often lost are techniques in how to interpret the actual data.  The counterpoint to this is that most are self-explanatory. Perhaps a valid point if they were not misinterpreted and underutilized so often.  To appeal to those averse to performance graph overload, many well intentioned solutions offer overly simplified dashboard-like insights.  They might serve as a good starting point, but this distilled data often falls short in providing the detail necessary to understand real performance conditions.  Variables that impact performance can be complex, and deserve more insight than a green/yellow/red indicator over a large sampling period.

Most vSphere Administrators can quickly view the “heavy hitters” of an environment by sorting the VMs by CPU in order to see the big offenders, and then drill down from there.  vCenter does not naturally provide good visual representation for storage I/O.  Interesting because storage performance can be the culprit for so many performance issues in a virtualized environment.  PernixData FVP accelerates your storage I/O, but also fills the void nicely in helping you understand your storage I/O.

FVP’s metrics leverage VMkernel statistics, but in my opinion make them more consumable.  These statistics reported by the hypervisor are particularly important because they are the measurements your VMs and applications feel.  Something to keep in mind when other components in your infrastructure (storage arrays, network fabrics, etc.) may advertise good performance numbers that don’t align with what the applications are seeing.

Interpreting performance metrics is a big topic, so the goal of this post is to provide some tips to help you interpret PernixData FVP performance metrics more accurately.

Starting at the top
In order to quickly look for the busiest VMs, one can start at the top of the FVP cluster.  Click on the “Performance Map” which is similar to a heat map. Rather than projecting VM I/O activity by color, the view will project each VM on their respective hosts at different sizes proportional to how much I/O they are generating for that given time period.  More active VMs will show up larger than less active VMs.


[image: PerformanceMap]

From here, you can click on the targets of the VMs to get a feel for what activity is going on – serving as a convenient way to drill into the common I/O metrics of each VM; Latency, IOPS, and Throughput.

[image: FromFVPcluster]

As shown below, these same metrics are available if the VM on the left hand side of the vSphere client is highlighted, and will give a larger view of each one of the graphs.  I tend to like this approach because it is a little easier on the eyes.

[image: fromVMlevel]

VM based Metrics – IOPS and Throughput
When drilling down into the VM’s FVP performance statistics, it will default to the Latency tab.  This makes sense considering how important latency is, but I find it most helpful to first click on the IOPS tab to get a feel for how many I/Os this VM is generating or requesting.  The primary reason why I don’t initially look at the Latency tab is that latency is a metric that requires context.  Often times VM workloads are bursty, and there may be times where there is little to no IOPS.  The VMkernel can sometimes report latency a bit inaccurately against little or no I/O activity, so looking at the IOPS and Throughput tabs first brings context to the Latency tab.

The default “Storage Type” breakdown view is a good view to start with when looking at IOPS and Throughput. To simplify the view even more, tick the boxes so that only the “VM Observed” and the “Datastore” lines show, as displayed below.

[image: IOPSgettingstarted]

The predefined “read/write” breakdown is also helpful for the IOPS and Throughput tabs as it gives a feel of the proportion of reads versus writes.  More on this in a minute.

What to look for
When viewing the IOPS and Throughput in an FVP accelerated environment, there may be times when you see large amounts of separation between the “VM Observed” line (blue) and the “Datastore” (magenta). Similar to what is shown below, having this separation where the “VM Observed” line is much higher than the “Datastore” line is a clear indication that FVP is accelerating those I/Os and driving down the latency.  It doesn’t take long to begin looking for this visual cue.

[image: IOPSread]

But there are times when there may be little to no separation between these lines, such as what you see below.

[image: IOPSwrite]

So what is going on?  Does this mean FVP is no longer accelerating?  No, it is still working.  It is about interpreting the graphs correctly.  Since FVP is an acceleration tier only, cached reads come from the acceleration tier on the hosts – creating the large separation between the “Datastore” and the “VM Observed” lines.  When FVP accelerates writes, they are synchronously buffered to the acceleration tier, followed by destaging to the backing datastore as soon as possible – often within milliseconds.  The rate at which data is sampled and rendered onto the graph will report the “VM Observed” and “Datastore” statistics at very similar times.

By toggling the “Breakdown” to “read/write” we can confirm in this case that the change in appearance in the IOPS graph above came from the workload transitioning from mostly reads to mostly writes.  Note how the magenta “Datastore” line above matches up with the cyan “Write” line below.

[image: IOPSreadwrite]

The graph above still might imply that the performance went down as the workload transitioned from reads to writes. Is that really the case?  Well, let’s take a look at the “Throughput” tab.  As you can see below, the graph shows that in fact the same amount of data was being transmitted in both phases of the workload, yet the IOPS tab shows far fewer I/Os at the time the writes were occurring.

[image: IOPSvsThroughput]

The most common reason for this sort of behavior is OS file system buffer caching inside the guest VM, which will assemble writes into larger I/O sizes.  The amount of data read in this example was the same as the amount of data that was written, but measuring that by only IOPS (aka I/O commands per second) can be misleading. I/O sizes are not only underappreciated for their impact on storage performance, but this is a good example of how often the I/O sizes can change, and how IOPS can be a misleading measurement if left on its own.
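A quick bit of arithmetic makes this concrete. Dividing throughput by IOPS yields the average I/O size, and shows how the "drop" in IOPS above was really just a change in I/O size (a sketch with hypothetical numbers that mirror the graphs):

```python
# Hypothetical before/after samples from the read and write phases:
# same throughput, very different IOPS.
read_phase  = {"iops": 8000, "throughput_mbps": 62.5}
write_phase = {"iops": 1000, "throughput_mbps": 62.5}

for name, s in (("reads", read_phase), ("writes", write_phase)):
    avg_io_kb = s["throughput_mbps"] * 1024 / s["iops"]   # KB per I/O
    print(f"{name}: {avg_io_kb:.0f}KB average I/O size")
# reads:  8KB average I/O size
# writes: 64KB average I/O size
```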

If the example above doesn’t make you question conventional wisdom on industry standard read/write ratios, or common methods for testing storage systems, it should.

We can also see from the Write Back Destaging tab that FVP destages the writes as aggressively as the array will allow.  As you can see below, all of the writes were delivered to the backing datastore in under 1 second.  This ties back to the previous graphs that showed the “VM Observed” and the “Datastore” lines following each other very closely during the period with several writes.

[image: WBdestaging]

The key to understanding the performance improvement is to look at the Latency tab.  Notice on the image below how that latency for the VM dropped way down to a low, predictable level throughout the entire workload.  Again, this is the metric that matters.

[image: LatencybytheVM]

Another way to think of this is that the IOPS and Throughput performance charts can typically show the visual results for read caching better than write buffering.  This is because:

  • Cached reads never come from the backing datastore, whereas buffered writes always hit the backing datastore.
  • Reads may be smaller I/O sizes than writes, which visually skews the impact if only looking at the IOPS tab.

Therefore, the ultimate measurement for both reads and writes is the latency metric.

VM based Metrics – Latency
Latency is arguably one of the most important metrics to look at.  This is what matters most to an active VM and the applications that live on it.  Now that you’ve looked at the IOPS and Throughput, take a look at the Latency tab. The “Storage type” breakdown is a good place to start, as it gives an overall sense of the effective VM latency against the backing datastore.  Much like the other metrics, it is good to look for separation between the “VM Observed” and “Datastore” where “VM Observed” latency should be lower than the “Datastore” line.

In the image above, the latency is dramatically improved, which again is the real measurement of impact.  A more detailed view of this same data can be seen by selecting a “custom” breakdown.  Tick the following checkboxes as shown below:

[image: customlatencysetting]

Now take a look at the latency for the VM again. Hover anywhere on the chart that you might find interesting. The pop-up dialog will show you the details that really tell the story:

  • Where the latency would have come from if it had originated from the datastore (datastore read or write).
  • What has contributed to the effective “VM Observed” latency.

[image: custombreakdown-latency-result]

What to look for
The desired result for the Latency tab is to have the “VM Observed” line as low and as consistent as possible.  There may be times where the VM observed latency is not quite as low as you might expect.  The causes for this are numerous, and subject for another post, but FVP will provide some good indications as to some of the sources of that latency.  Switching over to the “Custom Breakdown” described earlier, you can see this more clearly.  This view can be used as an effective tool to help better understand any causes related to an occasional latency spike.

Hit & Eviction rate
Hit rate is the percentage of reads that are serviced by the acceleration tier, and not by the datastore.  It is great to see this measurement high, but it is not the exclusive indicator of how well the environment is operating.  It is a metric that is complementary to the other metrics, and shouldn’t be looked at in isolation.  It is only focused on displaying read caching hit rates, and conveys that as a percentage – whether there are 2,000 IOPS coming from the VM, or 2 IOPS coming from the VM.

There are times where this isn’t as high as you’d think.  Some of the causes of a lower than expected hit rate include:

  • Large amounts of sequential writes.  The graph is measuring read “hits” and will see a write as a “read miss.”
  • Little or no I/O activity on the VM monitored.
  • In-guest activity that you are unaware of.  For instance, an in-guest SQL backup job might flush out the otherwise good cache related to that particular VM.  This is a leading indicator of such activity.  Thanks to the new Intelligent I/O profiling feature in FVP 2.5, one has the ability to optimize the cache for these types of scenarios.  See Frank Denneman’s post for more information about this feature.

Let’s look at the Hit Rate for the period we are interested in.

[image: hitandeviction]

You can see from above that the period of activity is the only part we should pay attention to.  Notice on the previous graphs that outside of the active period we were interested in, there was very little to no I/O activity.

A low hit rate does not necessarily mean that a workload hasn’t been accelerated. It simply provides an additional data point for understanding.  In addition to looking at the hit rate, a good strategy is to look at the amount of reads from the IOPS or Throughput tab by creating the custom view settings of:

[image: custombreakdown-read]

Now we can better see how many reads are actually occurring, and how many are coming from cache versus the backing datastore.  It puts much better context around the situation than relying entirely on Hit Rate.

[image: custombreakdown-read-result]

Eviction Rate will tell us the percentage of blocks that are being evicted at any point in time.  A very low eviction rate indicates that FVP is lazily evicting data on an as-needed basis to make room for new incoming hot data, and is a good sign that the acceleration tier is sized large enough to handle the general working set of data.  If this ramps upward, then that tells you that otherwise hot data will no longer be in the acceleration tier.  Eviction rates are a good indicator to help you determine if your acceleration tier is large enough.

The importance of context and the correlation to CPU cycles
When viewing performance metrics, context is everything.  Performance metrics are guilty of standing on their own far too often.  Or perhaps, it is human nature to want to look at these in isolation.  In the previous graphs, notice the relationship between the IOPS, Throughput, and Latency tabs.  They all play a part in delivering storage payload.

Viewing a VM’s ability to generate high IOPS and Throughput is good, but it can also be misleading.  A common but incorrect assumption is that once a VM is on fast storage, it will start doing some startling number of IOPS.  That is simply untrue. It is the application (and the OS it is living on) that dictates how many I/Os it will be pushing at any given time. I know of many single threaded applications that are very I/O intensive, and several multithreaded applications that aren’t.  Thus, it’s not about chasing IOPS, but rather the ability to deliver low latency in a consistent way.  It is that low latency that lets the CPU breathe freely, and not wait for the next I/O to be executed.

What do I mean by “breathe freely?”  With real world workloads, the difference between fast and slow storage I/O is whether CPU cycles can satisfy the I/O request without waiting.  A given workload may be performing some defined activity.  It may take a certain number of CPU cycles, and a certain number of storage I/Os, to accomplish this.  An infrastructure that allows those I/Os to complete more quickly lets the same CPU cycles complete the request in a shorter amount of time.
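A simplified model illustrates the point. The sketch below (plain Python, hypothetical numbers) completes the same job – the same CPU work and the same number of dependent I/Os – on slow and fast storage:

```python
# Hypothetical job: 10,000 dependent I/Os plus 20 seconds of pure CPU work.
io_count, cpu_seconds = 10_000, 20.0

for label, io_latency_s in (("slow storage", 0.010), ("fast storage", 0.001)):
    # The CPU work is unchanged; only the time spent waiting on I/O shrinks.
    total = cpu_seconds + io_count * io_latency_s
    busy = 100.0 * cpu_seconds / total
    print(f"{label}: {total:.0f}s to complete, CPU busy {busy:.0f}% of the time")
# slow storage: 120s to complete, CPU busy 17% of the time
# fast storage: 30s to complete, CPU busy 67% of the time
```

The fast storage case burns the same CPU cycles, but finishes in a quarter of the time, with CPU utilization climbing from 17% to 67% – which is why a VM peaking at high CPU is often a sign that it is not storage constrained.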

[image: CPU]

Looking at CPU utilization can also be a helpful indicator of your storage infrastructure’s ability to deliver the I/O. A VM’s ability to peak at 100% CPU is often a good thing from a storage I/O perspective.  It means that VM is less likely to be storage I/O constrained.

Summary
The only thing better than a really fast infrastructure for your workloads is understanding how well it is performing.  Hopefully this post offers up a few good tips when you look at your own workloads leveraging PernixData FVP.

Your Intel NUC Home Lab questions answered

With my recent post on what’s currently running in my vSphere Home Lab, I received a number of questions about one particular part of the lab; that being my Management Cluster built with Intel NUCs. So here is a quick compilation of those questions (with answers) I’ve had around this topic.

Why did you go with a NUC?  There are cheaper options.
My approach for a Home Lab Management Cluster was a bit different than my regular Lab Cluster. I wanted to take a minimalist approach, and provide just enough resources to get my primary VMs off of my other two hosts that I do a majority of my testing against. In other words, less is more. There is a bit of a price premium with a NUC, but there also is a distinct payoff with them that often gets overlooked. If they do not keep up with your needs in the Home Lab, they can be easily repurposed as a workstation or a media PC. The same can’t be said for most Home Lab gear.

Is there anything special you have to do to run ESXi on a NUC?
Nothing terribly difficult. The buildup of ESXi on the NUC is relatively straightforward, and there is a growing number of posts that walk through this nicely. The primary steps are:

  1. Build a customized ISO by packing up an Intel NIC driver, and a SATA Controller driver, and place it on a bootable USB.
  2. Temporarily disable AHCI in the BIOS for just the installation process
  3. Install ESXi
  4. Re-enable AHCI in the BIOS after the installation of ESXi is complete.

How many cores?
The 3rd generation NUC is built around the Intel Core i5-4250U (Haswell) processor. It has two physical cores, and will present 4 CPUs with Hyper-Threading. After managing and watching real workloads for years, my position on Hyper-Threading is a bit more conservative than many others. It is certainly better than nothing, but many times the effective performance gain is limited, and varies with workload characteristics. Its primary benefit with the NUC is that you can run a 4vCPU VM if you need to. Utilization of the CPU from a cluster perspective is often hovering below 10%.

Is working with 16GB of RAM painful?
Having just 16GB of RAM might be a more visible pain if it were serving something other than Management VMs. The biggest issue with a "skinny" two node Management Cluster usually comes in when you have to throw one into maintenance mode. But much like having a single switch in a Home Lab, you just deal with it. Below is what the Memory usage on these NUCs look like.

[image]

There are a few options to improve this situation.

1. Trimming up some of your VMs might be a good start. Virtual Appliances like the VCSA are built with a healthy chunk of RAM configured by default (supporting all of that Java goodness). Here is a way to trim up memory resources on the VCSA, although I have not done this yet because I haven’t needed to. Just don’t use the Active Memory metric as your sole data point to trim up a VM Memory configuration. See Observations with the Active Memory metric in vSphere on how easily that metric can be misinterpreted.

2. Look into a new option for increasing RAM density on the NUC. Yeah, this might blow your budget, but if you really want 32GB of RAM in a NUC, you can do it. At this time, the cost delta for making a modification like this is pretty steep, and it may make more sense to purchase a 3rd NUC for a Management Cluster.

3. Adjust expectations and accept the fact that you might have a little memory ballooning or swapping going on. This is by far the easiest, and most affordable way to go.

How is ESXi on a single NIC?
Well, there is no getting around the fact that the NUC comes with a single 1GbE NIC. This means no redundancy, and limited bandwidth. The good news is that with just one NIC, you can monitor this quite easily in vCenter!   Since you are running all services and data across a single uplink, it may be in your best interest to run a Virtual Distributed Switch (VDS) to properly control ingress and egress traffic, and make sure something like a vMotion isn’t going to wreak havoc on your environment. However, transitioning a vCenter VM to a VDS with a single uplink can sometimes be a little adventurous, so you might want to plan ahead.

If you must have a 2nd NIC to the host, take a look here. Nicholas Farmer showed quite a bit of ingenuity in coming up with a second uplink. Also, don’t forget to look at his great mini-rack he made for the NUCs out of Legos. Great stuff.

How do they perform?
Exactly the way I want them to perform. Out of sight, and out of mind.  Again, my primary lab work is performed on my Micro-ATX style hosts, so as long as the NUCs can keep the various Management and infrastructure VMs running, then that is good with me. Some VMs are easy to trim up to provide minimal resources (Linux based syslog servers, DNS, etc.) while others are more difficult or not worth the hassle.

Why did you put two SSDs in them?
This was for flexibility. I wanted one (mSATA) drive for the possibility of local storage if I decided to place any VMs locally, as well as another device (2.5" SSD) for other uses. In this case, I decided to apply a little PernixData FVP magic and use one of them in each host to accelerate the VMs. The image below shows the latency of the VCSA, which has about 99% writes. Note how the latency dropped to a consistently low level after transitioning the VM from Write-Through (read caching) to Write-Back (read and write caching). Not bad considering all traffic is riding across a single link, and the flash device is an old SSD.


[image: VCSA]

 

Would you recommend them?
I think the Intel NUCs serve as a great little host for a Management Cluster in a Home Lab. They won’t be replacing my Micro-ATX style boxes any time soon, nor should they ever be part of a real environment, but they allow me the freedom to experiment and test on the primary part of the Lab, which is what a Home Lab is all about.

Thanks for reading.

– Pete