What do your infrastructure analytics really tell you?

There is no mistaking the value of data visualization combined with analytics. Data visualization helps make sense of the abstract, or of information not easily conveyed by numbers alone. Data analytics excels at turning discrete data points that mean little on their own into findings that have context and relevance. Together, the two can present findings in a meaningful, insightful, and easy-to-understand way. But what are your analytics really telling you?

The problem for modern IT is that there can be an overabundance of data, with little regard for the quality of the data gathered, how the pieces relate to one another, or how to make them meaningful. All too often, this "more is better" approach obscures what is important to such a degree that it provides less value, not more. It’s easy to collect data. The difficulty is doing something meaningful with the right data. Many tools collect metrics based not on what is most important, but on what can be easily provided.

Various solutions with the same problem
Modern storage solutions have grown more sophisticated in their analytics offerings. In principle this is a good thing, as storage capacity and performance are such common problems in today’s environments. Storage vendors have joined the "we do that too" race of analytics features. However, feature-list checkboxes can easily mask the reality – that the quality of insight is not what you might think it is. Creative license gets a little, well, creative.

Some storage solutions showcase their storage I/O analytics as a complete solution for understanding the storage usage and performance of an environment. They advertise an extraordinary number of data points collected, and collection methods sophisticated enough to impress by anyone’s standards. But these metrics are often taken at face value. Tough questions need to be asked before important decisions are made based on them. Is the right data being measured? Is the data being measured from the right location? Is the data being measured in the right way? And is the information conveyed of real value?

Accurate analytics requires data sources of the right quality and completeness. No amount of shiny presentation can make up for using the wrong data, or using it in the wrong way.

What is the right data?
The right data directly supports the questions you are trying to answer. Why did my application slow down after 1:18pm? How did a recent application modification impact other workloads? When it comes to infrastructure performance, I’ve demonstrated how block sizes were historically ignored in storage design because they could not be easily seen or measured. Metrics around the fan speed of a storage array might help you evaluate the cooling in your data center, but they do little to help you understand your workloads. The right data must also be collected at a rate that accurately reflects real behavior. If your analytics offering samples data only once every 5 or 10 minutes, how can it ever show the spikes of resource contention that your systems actually experience? The short answer is, it can’t.
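
To illustrate the sampling problem, here is a small sketch with made-up numbers showing how a 5-minute average can completely hide a latency spike that 20-second samples would catch. The values are synthetic, purely for illustration.

```python
# Synthetic latency series: one value per 20-second sample over 5 minutes.
samples = [2, 2, 3, 2, 45, 2, 3, 2, 2, 3, 2, 2, 2, 3, 2]  # milliseconds

five_min_avg = sum(samples) / len(samples)   # what a coarse 5-minute poll reports
peak = max(samples)                          # what the workload actually felt

print(f"5-minute average:        {five_min_avg:.1f} ms")  # ~5.1 ms, looks healthy
print(f"Worst 20-second sample:  {peak} ms")               # 45 ms spike, invisible above
```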

The importance of location
Measuring the data at the right location is critical to accurately interpreting the conditions of your VMs and the infrastructure in which they live. We perceive much more than we see. This is most often demonstrated with a playful optical illusion, but it can be a serious problem when trying to understand your environment. The data gathered is often incomplete, and treating it as if it were all the data you need leads to the wrong conclusions. Consider a common scenario in which a storage system’s analytics show great performance from the array, yet a VM is performing poorly. This is the result of measuring from the wrong location. The array may have shown the latency of the components inside the device, but it cannot account for latency introduced throughout the rest of the storage stack. The array metric might have been technically accurate for what it could see, but it was neither the correct nor the complete metric. Since storage I/O always originates in the VMs and the infrastructure in which they live, it simply does not make sense to measure it from a supporting component like a storage array.
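
To make the point concrete, here is a simplified, hypothetical breakdown of where latency can accumulate between a guest issuing an I/O and the array completing it. The layer names and numbers are illustrative only, not measurements from any particular product.

```python
# Hypothetical latency contributions (ms) for a single I/O, viewed from the guest.
# Only the last entry is what an array-side dashboard would typically report.
latency_ms = {
    "guest OS queueing":        0.6,
    "virtualization layer":     0.4,
    "host HBA / kernel queue":  1.5,
    "fabric / network":         0.3,
    "array internals":          0.7,   # the number the array shows you
}

total = sum(latency_ms.values())
print(f"Array-reported latency:      {latency_ms['array internals']} ms")
print(f"Latency the VM experiences:  {total} ms")   # 3.5 ms -- five times higher
```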

Measuring data inside the VM can be equally challenging. An operating system’s methods of data collection assume it is the sole proprietor of its resources, and may not accurately account for the fact that it is time-slicing CPU clock cycles with other VMs. While the VM is the end "consumer" of resources, it does not understand that it is virtualized, and cannot see the influence of performance bottlenecks throughout the virtualization layer, or in any of the physical components in the stack that support it.

VM metrics pulled from inside the guest OS may also measure things in different ways depending on the operating system. Consider the differences in how disk latency is measured by Windows "Perfmon" versus Linux "top." This is the problem with data-collector-based solutions that aggregate metrics from different sources: a lot of data is collected, but none of it means the same thing.

This disparate data leaves users attempting to reconcile what the metrics mean and how they impact one another. It is even worse when supposedly similar metrics from two different sources show different data. This can occur with storage array solutions that hook into vCenter to augment the array-based statistics. Which one is to be believed? One over the other, or neither?

Statistics pulled solely from the hypervisor kernel avoid this nonsense. They provide a consistent method for gathering meaningful data about your VMs and the infrastructure as a whole. The hypervisor kernel is also capable of measuring this data in a way that accounts for all elements of the virtualization stack. However, determining the location for collection is not the end-game. We must also consider how the data is analyzed.
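
As one example of collecting at the hypervisor layer, here is a minimal sketch that queries vCenter’s real-time (20-second) performance counters through pyVmomi. The hostname, credentials, and choice of counter are assumptions for illustration, not a prescription for any particular tool.

```python
# Minimal sketch: assumes pyVmomi and a reachable vCenter; host, credentials,
# and counter choice are illustrative.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()            # lab use only
si = SmartConnect(host="vcenter.example.com",
                  user="readonly@vsphere.local", pwd="***", sslContext=ctx)
content = si.RetrieveContent()
perf = content.perfManager

# Build a name -> id map so we can ask for a specific kernel-side counter.
counters = {f"{c.groupInfo.key}.{c.nameInfo.key}.{c.rollupType}": c.key
            for c in perf.perfCounter}

# Grab the first VM found; real code would target specific workloads.
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], True)
vm = view.view[0]

# CPU Ready: time the VM wanted to run but waited on the scheduler --
# contention the guest OS itself cannot see. Real-time stats use 20-second samples.
metric = vim.PerformanceManager.MetricId(
    counterId=counters["cpu.ready.summation"], instance="")
spec = vim.PerformanceManager.QuerySpec(
    entity=vm, metricId=[metric], intervalId=20, maxSample=15)

for result in perf.QueryPerf(querySpec=[spec]):
    for series in result.value:
        # Summation is milliseconds per 20 s sample; dividing by 200 gives percent.
        print([round(v / 200.0, 1) for v in series.value])

Disconnect(si)
```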

Seeing the trees AND the forest
Metrics are just numbers. More than numbers is needed to provide a holistic understanding of an environment. Data that stands on its own is important, but how it contributes to the broader understanding of the environment is critical. One needs to be able to get a broad overview of an environment and drill down to identify the root cause of an issue, or start at the level of an underperforming VM and see how or why it may be impacted by others.

Many attempt to distill this large collection of metrics down to just a few that might provide insight into performance or potential issues. Examples of these individual metrics include CPU utilization, queue depths, storage latency, and storage IOPS. However, it is quite common to misinterpret these metrics when they are looked at in isolation.

Holistic understanding provides its greatest value when attempting to determine the impact of one workload on a group of other workloads. A VM’s transition to a new type of storage I/O pattern can often result in lower CPU activity; the exact opposite of what most would look for. The weight of impact between metrics will also vary. Think about a VM consuming large amounts of CPU: it will generally only impact other VMs on that host. In contrast, a storage-based noisy neighbor can impact all VMs running on that storage system, not just the other VMs that live on the same host.

Conclusion
Whether your systems are physical, virtualized, or live in the cloud, analytics exist to help answer questions and solve problems. But analytics are far more than raw numbers. The value comes from properly digesting and correlating the numbers into a story that provides real intelligence. All of this is contingent on using the right data in the first place. Keep this in mind as you think about the ways you currently look at your environment.