Sessions at VMworld 2017

About a year ago, I had the good fortune of joining VMware. The timing of it all meant that for me, my contributions to VMworld 2016 were limited to getting a sense of what the demands were like for my teammates in Technical Marketing. While it was clearly different than anything I had experienced in past VMworld events on the customer and partner side, I was looking forward to the eventual opportunity of doing what I enjoy most: creating interesting content.

As expected, my schedule is noticeably busier than last year. I’m quite fortunate to go into VMworld 2017 with the opportunity to present a handful of sessions alongside my colleagues within, and across, business units. Here’s the rundown.

Interpreting Performance Metrics in Your vSAN Environment (STO1206BU)
A topic near and dear to me, this session will show how to interpret the performance metrics provided by the vSAN performance service. Beyond diving into its offerings and practical advice for using it, we will look at common mistakes made when analyzing storage performance metrics in vSAN, and elsewhere. Bradley Mott from GSS will be joining me for what I believe will be a session full of learning possibilities.

vSAN Operations and Management Recommendations and Best Practices (STO1178BU) **
You just deployed vSAN. Now what? Introducing a new storage architecture to your data center means thinking about which day-to-day operations will remain the same, and which activities might change. Jeff Hunter and I will discuss some of the considerations that come with operationalizing vSAN in your data center, along with tips on how you can bring ease of management to your own vSAN environment.

vSAN 6.6: A Day in the Life of an I/O (STO1926BU) **
With well over 900 people in attendance at this breakout session last year, this one is going to fill up fast. The session is back, but with quite a twist. One year, and three versions later, a lot has changed with vSAN. This year’s session is going to take a look at how vSAN incorporates those features when looking at the I/O path. I have the pleasure of joining John Nicholson in this deep dive of how vSAN operates. Get your geek hat on for this one.

Optimizing vSAN with vRealize Operations (STO1895BU)
vSAN users were rewarded with another VMware software update that wasn’t a part of vSAN itself. vRealize Operations 6.6 was overhauled with the new HTML5-based Clarity UI, and now has four built-in dashboards specifically built for vSAN using the enhanced APIs that came with vSAN 6.6. John Dias and I will go over what these analytics can mean to vSAN users, and show what kind of unique insight vR Ops can offer when running in a vSAN environment.

**Sessions also at VMworld Europe

Choosing the right VMworld sessions for your specific interests and needs is never easy. It’s difficult to go wrong no matter what you choose. But if you have the opportunity to attend one of the sessions above, please stop by afterward and say hello!

– Pete

Rethinking “storage efficiency” in HCI architectures–Part 1

Hyper-converged infrastructure (HCI) can bring several design and operational benefits to the table, adding to the long list of reasons behind its popularity. Yet HCI also introduces new considerations in understanding and measuring the technical costs associated with the architecture. These technical costs could be thought of as a usage "tax" or "overhead" on host resources. The amount attributed to this technical cost can vary quite drastically, and depends heavily on the architecture used. For an administrator, it can be a bit challenging to measure and understand. The architecture used by HCI solutions should not be overlooked, as these technical costs can not only influence the performance and consistency of the VMs, but also dramatically impact the density of VMs per host, and ultimately the total cost of ownership.

With HCI, host resources (CPU, memory, and network) are now responsible for an entirely new set of duties typically provided by a storage array in a traditional three-tier architecture. These responsibilities not only include handling VM storage I/O from end to end, but, due to the distributed nature of HCI, hosts will also take part in storage activity for VMs not local to the host, such as replicated writes of a VM, as well as data-at-rest operations and other storage-related services. These responsibilities consume host resources. The question is, how much?
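To put a rough shape on that question, here is a deliberately simplified, back-of-the-envelope sketch of how replicated writes translate into extra work for the hosts. The function name and the per-I/O CPU costs below are placeholders of my own, purely for illustration, and are not measured figures for vSAN or any other HCI product. The only assumption carried over from the discussion above is that a mirrored policy commits every guest write twice, with at least one copy normally landing on another host.

```python
# Illustrative model only: the per-I/O CPU costs below are made-up placeholders,
# not measured values for any particular HCI product.

def hci_write_cost(writes_per_sec, replicas=2,
                   cpu_us_per_local_io=20, cpu_us_per_remote_io=35):
    """Estimate CPU time (CPU-seconds per second) a cluster spends handling
    replicated writes. replicas=2 models a mirrored, failures-to-tolerate=1
    policy, where each guest write is committed to two copies and at least
    one copy normally crosses the network to a peer host."""
    local_ios = writes_per_sec                    # copies written on the hosts issuing the I/O
    remote_ios = writes_per_sec * (replicas - 1)  # replica copies received and written by peers
    cpu_us = local_ios * cpu_us_per_local_io + remote_ios * cpu_us_per_remote_io
    return cpu_us / 1_000_000                     # microseconds -> CPU-seconds per second

# Example: 20,000 writes/sec across the cluster, mirrored
print(f"~{hci_write_cost(20_000):.2f} CPU-seconds of work per second (toy numbers)")
```

The point of the toy model is not the numbers themselves, but that the cost scales with both the write rate and the replication factor, and that where those cycles are spent depends on the architecture described below.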

This multi-part series will look at the basics of HCI architectures, and how they behave differently with respect to their demands on CPU, memory, and network resources. Operational comparisons are not covered, simply to keep the focus on those resource demands.

"Storage efficiency" is more than what you think
The term "storage efficiency" is commonly associated with just data deduplication and compression. With hyper-converged infrastructures, this term takes on additional meaning. Storage efficiency in HCI relates to the efficiency of how I/Os are delivered to and from the VM. Efficiency of I/O delivery to and from VMs matter not only from performance and consistency as seen by the VM, but how much resource usage is introduced to the hosts in the cluster. The latter is often never considered, yet extremely important.

HCI Architectures
HCI solutions available in today’s market not only offer different data services, but are built differently, which is just one of the many reasons why it is difficult to generalize a typical amount of overhead needed to process storage I/O. All HCI solutions will vary (some more than others) in how they provide storage services to the VMs while maintaining resources for guest VM activity. The two basic categories, as illustrated in Figure 1, are:

  • Virtual appliance approach. A VM lives on each host in the cluster, delivering a distributed shared storage plane, processing I/O and the other related activities. Depending on the particular HCI solution, this virtual appliance on each host may also be responsible for a number of other duties.
  • Integrated/in-kernel approach. The distributed shared storage system is a part of the hypervisor, where key aspects of the storage system are part of the kernel. This allows for virtual machine I/O to traverse through the native kernel I/O path for the hosts participating in that I/O activity.

[Image: Figure1-HCIComparison]

Figure 1. Comparing an I/O write between HCI architectures (simplified for clarity)

HCI solutions that use a VM to process storage I/O on each host reside in a context (user space) that is no different from the application VMs running on the host. In other words, the resources allocated to this virtual appliance to perform system-level storage duties contend with the very VMs it is trying to serve. HCI solutions built into the hypervisor maintain end-to-end control and awareness of the I/O. Since an in-kernel, integrated solution allows I/O to traverse the native kernel I/O path, it uses the least "costly" way to consume host resources. HCI solutions built into the kernel minimize the amplification of I/Os, and the CPU and memory resources it takes to process those I/Os from end to end. Virtual appliance based HCI solutions will sometimes use devices on hosts configured in the hypervisor for direct pass-through (aka "VMDirectPath") in an attempt to reduce overhead, but many of the fundamental penalties (especially as they relate to CPU cycles) of I/O amplification through this indirect path, and of context switching, remain.

Addressing a problem in different ways
Why are there multiple approaches? Manufacturers may state many reasons why they chose a specific approach, and why their approach is superior. Most of the time, the decision comes down to technical limitations and go-to-market pressures. An HCI vendor may not have the access, or the ability, to provide this functionality natively in the kernel of a hypervisor. A virtual appliance approach is easier to bring to market, and naturally adaptable to different hypervisors, since it is little more than a virtual machine that processes storage I/O.

By way of comparison, those who have full ownership of the hypervisor can integrate this functionality directly into the hypervisor, and when appropriate, build some aspects of it right into the kernel, just as other core functionality is built into the kernel. Resource efficiency, hypervisor feature integration, as well as the contextual awareness and control of I/O types are typically the top reasons why it is beneficial to have a distributed storage mechanism built into the hypervisor.

Do both approaches work? Yes. Do both approaches produce the same result in VM behavior and host resource usage? No. Running the same workloads using HCI solutions with these two different architectures may produce very different results on the VMs, and the hosts that serve them. The degree of impact will depend on the technical cost (in resource usage) of the I/O processing, and other data services provided by a given solution.

This difference often does not show up until numerous, real workloads are put on these solutions. Just as with a traditional storage array, every solution is fast when there is little to no load on it. What counts is the behavior under real load with contending resources. This is something not always visible with synthetic testing. For HCI environments, the overall “storage efficiency” of the particular HCI solution can be better compared (assuming identical hardware and workloads) by looking at the following in a real HCI environment running production workloads:

  • The average number of active VMs per host when running your real workloads.
  • The performance characteristics of the VMs and hosts when running your real workloads while hosts are busy serving other workloads.

These measurements take this topic from an occasionally tiresome academic debate, and demonstrate the differences in real-world circumstances. Ironically, faster hardware can increase, not reduce, the differences between these architectural approaches to HCI. This is not unlike what often occurs at the application level, where faster hardware exposes actual bottlenecks in software/application design that went unnoticed with older, slower hardware.
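If you want to put numbers behind the first of those two measurements, a quick inventory query is enough to trend powered-on VMs per host over time. Below is a minimal sketch using pyVmomi (the vSphere Python SDK); the vCenter address and credentials are placeholders, and it skips certificate verification for brevity, so treat it as a lab-only starting point rather than a finished tool.

```python
# Minimal sketch: count powered-on VMs per host via pyVmomi.
# Placeholder vCenter/credentials; unverified SSL is for lab use only.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def active_vms_per_host(vc, user, pwd):
    ctx = ssl._create_unverified_context()          # lab use only
    si = SmartConnect(host=vc, user=user, pwd=pwd, sslContext=ctx)
    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.HostSystem], True)
        counts = {}
        for host in view.view:
            powered_on = [v for v in host.vm
                          if v.runtime.powerState == vim.VirtualMachinePowerState.poweredOn]
            counts[host.name] = len(powered_on)
        view.Destroy()
        return counts
    finally:
        Disconnect(si)

if __name__ == "__main__":
    counts = active_vms_per_host("vcenter.example.local",
                                 "administrator@vsphere.local", "secret")
    for host, n in sorted(counts.items()):
        print(f"{host}: {n} powered-on VMs")
    print("average per host:", sum(counts.values()) / max(len(counts), 1))
```

Sampling this regularly while the cluster runs your real workloads gives a simple, like-for-like density figure to compare across solutions.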

Now that we have covered why "storage efficiency" really means so much more than data services like deduplication and compression, the next post in this series will focus on CPU resources in HCI environments, and what to look out for when observing CPU usage behaviors.

Does the concept of host resource usage interest you? If so, stay tuned for the book, vSphere 6.5 Host Resources Deep Dive by Frank Denneman and Niels Hagoort. It is sure to be a must-have for those interested in the design and optimization of virtualized environments. You can also follow updates from them at @hostdeepdive on Twitter.

 

Juggling priorities, and my unplanned, but temporary break from blogging

If you have had a blog for long enough, sometimes readers measure your contributions only by the number of recent posts published. If that were an accurate way to measure productivity, then apparently I haven’t been doing anything in the past few months. Quite the contrary, really. Since joining VMware in August of 2016, I’ve had the opportunity to work with great teams on exciting projects. It has been fast paced, educational, and fun.

So what have I been doing anyway…
It is fair to say that VMware vSAN 6.6 is the most significant launch in the history of vSAN. The number of features packed into 6.6 is a testament to the massive focus by R&D on delivering an unprecedented set of new capabilities and enhancements. Part of the effort of any release is rolling out technical content. A significant amount of that load falls on Technical Marketing, and by virtue of being a part of the team, I’ve been right in the thick of it. The list of deliverables is long, but it has been fun to be a part of that process.

How vSAN is integrated into the hypervisor gives it unique abilities in integration and interoperability. An area of focus for me has been demonstrating practical examples of this integration – in particular, the integration that vSAN has with vRealize Operations and vRealize Log Insight. Connecting the dots between what really happens in data centers and product capabilities is one way to show how, and why, the right type of integration is so important. There are a lot of exciting things coming in this space, so stay tuned.

I’ve also had the chance to join my colleagues, Pete Flecha and John Nicholson on the Virtually Speaking Podcast. In episode 38, we talked a little about storage performance, and in episode 41, we discussed some of the new features of vSAN 6.6. What John and Pete have been able to accomplish with that podcast over the past year is impressive, and the popularity of it speaks to the great content they produce.

Since joining VMware, I also stepped down from my role as a Seattle VMUG leader. It was a fun and educational experience that helped me connect with the community, and appreciate every single volunteer out there. VMUG communities everywhere are run by enthusiasts of technology, and their passion is what keeps it all going. I appreciated the opportunity, and they are in good hands with Damian Kirby, who has taken over leadership duties.

All of these activities, while gratifying, left little time for my normal cadence of posts. I’ve always enjoyed creating no-nonsense, interesting, unique content with a lot of detail. Testing, capturing observations, and investigating issues is fun and rewarding for me, but it is also extremely time consuming. I spent the past 8 years churning out this type of content at a clip of about one post per month. That doesn’t sound like much, but with the level of detail and testing involved, it has been difficult to keep up the pace recently. This short reprieve has allowed me to rethink what I want my site to focus on. While much of the content I’m producing these days shows up in other forms, and in other locations, I’ll now have the chance to mix up the content out here a bit. Some new posts are in the works, and I hope to pick up the pace soon, if for nothing else, to let everyone know I’m actually doing something.

– Pete


vSAN in cost effective independent environments

Old habits in data center design can be hard to break.  New technologies are introduced that process data faster and move data more quickly.  Yet all too often, the thought process for data center design remains the same – inevitably constructed and managed in ways that reflect conventional wisdom and familiar practices.  Unfortunately, these common practices often stem from constraints of the technologies that preceded them, rather than aligning current business objectives with new technologies and capabilities.

Historically, no component of an infrastructure dictated design and operation more than storage.  The architecture of traditional shared storage often meant that the storage infrastructure was the oddball of the modern data center.  Given enough capacity, performance, and physical ports on a fabric, a monolithic array could serve up several vSphere clusters, and therein lies the problem.  The storage was not seen or treated as a clustered resource by the hypervisor the way compute was.  This centralized way of storing data invited connectivity by as many hosts as possible in order to justify the associated costs.  Unfortunately, it also invited several problems.  It placed limits on data center design because, in part, it was far too impractical to purchase separate shared storage for every use case that would benefit from an independent environment isolated from the rest of the data center.  As my colleague John Nicholson (blog/twitter) has often said, "you can’t cut your array in half."  It’s a humorous, but cogent, way to describe this highly common problem.

While VMware vSAN has proven to be extremely well suited for converging all applications into the same environment, business requirements may dictate a need for self-contained, independent environments isolated in some manner from the rest of the data center.  In "Cost Effective Independent Environments using vSAN," found on VMware’s StorageHub, I walk through four examples that show how business requirements may warrant a cluster of compute and storage dedicated to a specific purpose, and why vSAN is an ideal solution.  The examples provided are:

  • Independent cluster management
  • Development/Test environments
  • Application driven requirements
  • Multi-purpose Disaster Recovery

Each example listed above details how traditional storage can fall short in delivering results efficiently, then compares how vSAN addresses and solves those specific design and operational challenges. Furthermore, learn how storage related controls are moved into the hypervisor using Storage Policy Based Management (SPBM), VMware’s framework that delivers storage performance and protection policies to VMs, and even individual VMDKs, all within vCenter.  SPBM is the common management framework used in vSAN and Virtual Volumes (VVols), and is clearly becoming the way to manage software defined storage.  Each example wraps up with a number of practical design tips for that specific scenario in order to get you started in building a better data center using vSAN.

Clustering is an incredibly powerful concept, and vSphere clusters in particular bring capabilities to your virtualized environment that are simply beyond comparison.  With VMware vSAN, the power of clustering resources is taken to the next level, forming the next logical step in the journey of modernizing your environment in preparation for a fully software defined data center.

This published use case is the first of many more to come, focused on practical scenarios reflecting common needs of organizations large and small, and how vSAN can help deliver results quickly and effectively.  Stay tuned!

– Pete

Accommodating for change with Virtual SAN

One of the many challenges to proper data center design is trying to accommodate future changes, and doing so in a practical way. Growth is often the reason behind change, and while that is inherently a good thing, IT budgets often don’t see that same rate of increase. CFOs expect economies of scale to make your environment more cost efficient, and so should you.

Unfortunately, applications are always demanding more resources. The combination of commodity x86 servers and virtualization provided a flexible way to accommodate growth when it came to compute and memory resources, but addressing storage capacity and storage performance was far more difficult. Hyper-converged architectures helped break down this barrier somewhat, but some solutions lacked the flexibility to cope with increasing storage capacity or performance beyond the initial prescribed configurations defined by a vendor. Users need a way to easily increase their HCI storage resources in the middle of a lifecycle without always requesting yet another capital expenditure.

“A customer can have a car painted any color he wants as long as it’s black” — Henry Ford

But wait… it doesn’t always have to be that way. Take a look at my post on Virtual Blocks, Options in scalability with Virtual SAN. See how VSAN allows for a smarter way to approach your evolving resource needs, giving the power of choice in how you scale your environment back to you. Whether you choose to build your own servers using the VMware compatibility guide, go with VSAN Ready Nodes, or select from one of the VxRAIL options available, the principles described in the post remain the same. I hope it sparks a few ideas on how you can apply this flexibility in a strategic way to your own environment.

Thanks for reading…

The success of VSAN, and my move to VMware

For the past few years, I’ve had the opportunity to share with others how to better understand their Data Center workloads, and how to use this knowledge to better serve the needs of their organizations.  As a Technical Marketing Engineer for PernixData, the role allowed me to maintain a pulse on the needs of customers and partners, as well as analyze what others were doing from a competitive standpoint. It was a great way to distinguish industry hyperbole from the solutions people were really interested in, and implementing.

One observation simply couldn’t be ignored. It was clear that many were adopting VMware VSAN – and doing it in a big way. The rate of adoption even seemed to outpace the exceptionally rapid rate at which the product has been maturing. Thinking back to my days on the customer side, it was easy to see why. With the unique traits that come from being built into the hypervisor, it appeals to the sensibilities of the Data Center Administrator, and the CFO. VSAN was resonating with the needs of customers, and doing so in a much more tangible way than official market research numbers could describe.

I wanted to be a part of it.

With that, I’m thrilled to be joining VMware’s Storage and Availability business unit, as a member of their Technical Marketing Team. One of my areas of focus will be VSAN, as well as many other related topics. I’m joining the likes of GS Khalsa, Jase McCarty, Jeff Hunter, John Nicholson, Ken Werneburg, and Pete Flecha. To say it’s an honor to be joining this team is a bit of an understatement.  I’m truly grateful for the opportunity.

A special thanks to all of the great people I worked with at PernixData. An incredibly talented group of people striving to make a difference. The best of luck to each and every one of them. It’s been a truly rewarding experience indeed.

You’ll be able to find my official contributions out on VMware’s Virtual Blocks, as well as other locations. I’ll continue to post here at vmPete.com for unofficial content, and other things that continue to interest me.

Onward…

How CPU related metrics in vSphere may be misinterpreted

Most Data Center Administrators are accustomed to looking for high CPU utilization rates on VMs, and the hosts in which they reside. This shouldn’t be a big surprise. After all, vCenter, and other monitoring tools have default alarms to alert against high CPU usage statistics. Features like DRS, or products that claim DRS-like functionality factor in CPU related metrics as a part of their ability to redistribute VMs under periods of contention. All of these alerts and activities suggest that high CPU values are bad, and low values are good. But what if conventional wisdom on the consumption of CPU resources is wrong?

Why should you care
Infrastructure metrics can certainly be a good leading indicator of a problem. Over the years, high CPU usage alarms have helped correctly identify many rogue processes on VMs ("Hey, who enabled the screen saver via GPO?…"). But a CPU alarm trigger assumes that high CPU usage is always bad. It also implies that the absence of an alarm condition means that there is not an issue. Both assumptions can be incorrect, which may lead to bad decision making in the Data Center.

The subtleties of performance metrics can reveal problems somewhere else in the stack – if you know how and where to look. Unfortunately, when metrics are looked at in isolation, the problems remain hidden in plain sight. This post will demonstrate how a few common metrics related to CPU utilization can be misinterpreted. Take a look at the post Observations with the Active Memory metric in vSphere to see how this can happen with other metrics as well.

The testing
There are a number of CPU related metrics to monitor in the hypervisor, and at least a couple of different ways to look at them (vCenter, and esxtop). For brevity, let’s focus on two metrics that are readily visible in vCenter: CPU Usage and CPU Ready. This doesn’t dismiss the importance of other CPU related metrics, or the various ways to gather them, but it is a good start to understanding the relationship between metrics. As a quick refresher, CPU Usage as it relates to vCenter has two definitions. From the host, usage is the percentage of CPU cycles in use against the total CPU cycles available on the host. On the VM, usage shows the percent of CPU resources in use against the total available CPU cycles of the vCPUs visible to the VM. CPU Ready in vCenter measures, in summation form, the amount of time that the virtual machine was ready to run, but could not get scheduled onto a physical CPU.
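Because CPU Ready is reported as a summation (milliseconds of ready time accumulated during a sampling interval), it helps to normalize it before comparing it against CPU Usage percentages. The small helper below follows the commonly used conversion of summation milliseconds against the sampling interval; the function name and the example values are mine, purely for illustration.

```python
def cpu_ready_percent(ready_ms, interval_s=20, num_vcpus=1):
    """Convert vCenter's cpu.ready summation (milliseconds) into a percentage.

    interval_s is the stats interval the value was sampled at: 20 seconds for
    real-time charts, larger for historical rollups. Dividing by num_vcpus
    gives an approximate per-vCPU figure, since the VM-level summation covers
    all of the VM's vCPUs.
    """
    return (ready_ms / (interval_s * 1000.0 * num_vcpus)) * 100.0

# Example: a 4 vCPU VM reporting 1,600 ms of ready time in a 20 second real-time sample
print(f"{cpu_ready_percent(1600, interval_s=20, num_vcpus=4):.1f}% ready per vCPU")
```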

A few notes about the test conditions and results:

  • The tests here consist of activities that are scheduled inside each guest, and are repeated 5 times over a 1 hour period.
  • There are no synthetic tools used here to generate storage I/O load or consume CPU cycles. (iometer, StressLinux, etc.)
  • The activities performed are using processes that are only partially multithreaded. This approach is most reflective of real world environments.
  • The "slower" storage depicted in the testing were actually SSDs, while the "faster" storage was by leveraging PernixData FVP and distributed fault tolerant memory (DFTM) as a storage acceleration tier.
  • The absolute numbers are not necessarily important for this testing. The focus is more about comparing values when a variable like storage performance changes.
  • No shares, reservations, or limits were used on the test VMs.

The complex demands of real world environments may exhibit a much greater impact than what the testing below reveals. I reference a few actual cases of production workloads later on in the post. Synthetic load generators were not used here because they cannot properly simulate a pattern of activity that is reflective of a real environment. Synthetic load generators are good at stressing resources – not simulating real world workloads, or the time it takes for those workloads to complete their tasks.

Interpreting impacts on CPU usage and CPU Ready with changing storage performance
Looking at CPU utilization can be challenging because not all applications, nor the workloads they generate are the same. Most applications are a complex mix of some processes being multithreaded, while others are not. Some processes initiate storage I/O, while others do not. It is for this reason that we will look at CPU Usage and CPU Ready over a task that is repeated on the same sets of VMs, but using storage that performs differently.

For all practical purposes, CPU Ready doesn’t become meaningful until a host is running a large number of single vCPU VMs concurrently, or a number of multiple vCPU VMs concurrently. CPU Ready can sometimes be terribly tricky to decipher because it can be influenced in so many ways. Sometimes it may align with CPU utilization, while other times it may not. It may be affected by other resources, or it may not. It really depends on the environmental conditions. I find it a good supporting metric, but definitely not one that should stand on its own merit, without proper context of other metrics. We are measuring it here because it is generally regarded as important, and one that may contribute to load distribution activities.

Test 1: Single vCPU VM on a Host with no other activity
First let’s look at one of the simplest of comparisons: a single vCPU VM with no other activity occurring on the host, where one test is using slower storage (blue), and the other is using faster storage (orange). A task was completed 5 times over the course of one hour. The image below shows that, from the host perspective, peak CPU utilization increased by 79% when using the faster storage. CPU Ready demonstrated very little change, which was expected due to the nature of this test (no other VMs running on the host).

[Image: hoststats-1vcpu]

When we look at the individual VMs, the results are similar. The images below show that CPU usage maximums for the VM increased by 24% when using the faster storage. CPU Ready demonstrated very little change here because there were no other VMs to contend with on that host. The "Storage Latency" column shows the average storage latency the VM was seeing during this time period.

[Image: vmstats-1vcpu]

You might think that higher latency may not be realistic of today’s storage technologies. The "slower" storage in this case did in fact come from SSD based storage. But remember that flash of any kind can suffer in performance when committing larger block I/O, which is quite common with real workloads. Take a look at "Understanding block sizes in a virtualized environment" for more information.

But wait… how long did the task, set to run 5 times over the period of one hour, take to complete? The task took just half the time to run with the faster storage. The same number of cycles processed the same amount of I/O, just over a shorter period of time. This faster completion of a task frees up those CPU cycles for other VMs. It is also the primary reason why the averages for CPU Usage and CPU Ready changed very little. Looking at this data in timeline form in vCenter illustrates it quite clearly: there is a clear distinction in the characteristics of the task on the fast storage, while it is much more difficult to decipher on the run with slower storage.

[Image: combined-singlevCPU]
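To see why the hourly averages barely move even though the peaks do, consider a toy calculation. The numbers below are made up purely for illustration: the same amount of CPU work is performed in both cases, it is just compressed into a shorter window on the faster storage.

```python
# Toy numbers only: the task needs the same ~900 CPU-seconds of work either way.
work_cpu_seconds = 900
window_seconds = 3600            # the one-hour observation window

for label, runtime_seconds in [("slow storage", 3600), ("fast storage", 1800)]:
    busy_while_running = work_cpu_seconds / runtime_seconds * 100   # what the peaks reflect
    average_over_hour = work_cpu_seconds / window_seconds * 100     # what the averages reflect
    print(f"{label}: ~{busy_while_running:.0f}% utilization while running, "
          f"~{average_over_hour:.0f}% averaged across the hour")
```

The averaged figure is identical in both cases; only the shape of the timeline reveals that the faster storage let the work finish sooner.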

Test 2: Multiple vCPU VM on a host with other activity
Now let’s run the same workload on VMs assigned multiple (4) vCPUs, along with other multi-vCPU VMs running in the background. This is to simulate a bit of "chatter," or activity, that one might experience in a production environment.

As we can see from the images below, at the host level, both CPU Usage and CPU Ready values increased as storage performance increased. CPU Usage maximums increased by 39% on the host. CPU Ready maximums increased by 34% on the host, which was a noticeable difference from the testing without any other systems running.

[Image: hoststats-4vcpu]

When we look at the individual VMs, the results are similar. The images below show that CPU usage maximums increased by 39% with the faster storage. CPU Ready maximums increased by 51% while running on the faster storage. Considering the typical VM to host consolidation ratio, the effects can be profound.

[Image: vmstats-4vcpu]

Now let’s take a look at the timeline in vCenter to get an appreciation of how those CPU cycles were used. In the image below, you can see that, as in the single vCPU VM testing, the VM running on faster storage reached much higher CPU usage than when running on slower storage, but for a much shorter period of time (about half). You will also notice that in this test, the CPU Ready measurements generally increased as CPU usage increased.

[Image: combined-multivCPU]

Real world examples
This all brings me back to what I witnessed years ago while administering a vSphere environment consisting of extremely CPU and storage I/O intensive workloads: dozens of resource intensive VMs built for the purpose of compiling code. These were systems that could multithread to near perfection – assuming storage performance was sufficient.

[Image: fasterprocess]

Now let’s look at what CPU utilization rates looked like on that same VM, running the same code compiling job where the storage environment wasn’t able to satisfy reads and writes fast enough. The same job took 46% longer to complete, all because the available CPU cycles couldn’t be used.

[Image: longprocess]

Still not a believer? Take a look at a presentation at the OpenStack Summit by Charter Communications in April 2016, where they demonstrate exactly the effect I describe: their Cassandra cluster deployed with VMware Integrated OpenStack, and the effects on CPU utilization when providing lower latency, higher performing storage (key information beginning at 17:10). Their more freely breathing storage allowed CPU cycles related to storage I/O to be committed more quickly, thereby finishing the tasks much more quickly. High CPU usage was a desired result of theirs.

You might be thinking to yourself, "Won’t I have more CPU contention with faster storage?" Well, yes and no. Faster storage gives power back to the Administrator to control the usage of resources as needed, and deliver the SLAs required. And moving the point of contention to the CPU lets the CPU do what it does best: time slicing processes to complete their tasks as quickly as possible.

Sample what?
The rate at which telemetry data is sampled is a factor that can dramatically change your impression of the behavior of these resources in the Data Center. It’s a big topic, and one that will be touched on in an upcoming post, but there is one thing to note here. When leveraging faster, lower latency storage, there are many times where CPU utilization and CPU Ready will appear to stay the same. Why? In a real workload that involves CPU cycles executing to commit storage I/O, a workflow may consist of a given amount of those I/Os, regardless of how long it takes. If that process took 18 seconds on slow storage, but 5 seconds on faster storage, the 20 second sampling rate within vCenter may render both in the same way. One often has to employ other tools to see these figures at a higher sampling rate; vscsiStats and esxtop are good examples.
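Here is a tiny illustration of that masking effect, using a hypothetical per-second utilization trace: a burst that pins a CPU for 5 seconds and then goes idle disappears into a flat 25% once it is rolled into a single 20-second sample.

```python
# Hypothetical trace: 100% CPU for 5 seconds, idle for the remaining 15 seconds.
per_second = [100] * 5 + [0] * 15

def resample(trace, interval_s):
    """Average a per-second utilization trace into buckets of interval_s seconds."""
    return [sum(trace[i:i + interval_s]) / interval_s
            for i in range(0, len(trace), interval_s)]

print("2-second samples :", resample(per_second, 2))    # the burst is still visible
print("20-second sample :", resample(per_second, 20))   # renders as a flat 25.0
```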

Takeaways
The testing and examples above should make it easy to imagine a scenario in which a storage system is upgraded, and CPU related alarms are tripped more frequently, even though the processes that support a workflow complete much more quickly. So with that, it’s good to keep the following in mind.

  • Slow storage will suppress CPU utilization rates – giving you the impression that, from a host or VM perspective, everything is fine.
  • Conversely, fast storage will allow those CPU cycles related to storage I/O to execute, thereby increasing utilization rates – albeit for a shorter period of time.  High CPU statistics are not necessarily a bad thing.
  • Averages and peaks can be misleading, because increased utilization rates may not be recognizable in the vCenter CPU charts if the activity completes within the smallest sampling interval (20 seconds).
  • Traditional methods of monitoring and balancing host resources can be misleading.
  • Higher CPU utilization rates may not be a leading indicator of an issue. They are often a trailing indicator of well-designed processes, or free breathing storage. Again, high CPU can be a good thing!!!
  • Application behavior, and the results, are what count. If a batch job in SQL takes 30 minutes, success should be defined around the desired duration of that batch job. Infrastructure related metrics should help you diagnose issues and assist with achieving a desired result, but they should not be the one and only KPI.
  • Storage performance will generally impact every VM and host accessing the cluster, whereas host based resource contention will only impact the VMs living on that same host.

Thanks for reading

– Pete