
Rethinking “storage efficiency” in HCI architectures–Part 1

Hyper-converged infrastructures (HCI) can bring several design and operational benefits to the table, adding to the long list of reasons behind their popularity. Yet, HCI also introduces new considerations in understanding and measuring technical costs associated with the architecture. These technical costs could be thought of as a usage “tax” or “overhead” on host resources. The amount attributed to this technical cost can vary quite drastically, and depends heavily on the architecture used. For an administrator, it can be a bit challenging to measure and understand. The architecture used by HCI solutions should not be overlooked, as these technical costs can not only influence the performance and consistency of the VMs, but also dramatically impact the density of VMs per host, and ultimately the total cost of ownership.

With HCI, host resources (CPU, memory, and network) are now responsible for an entirely new set of duties typically provided by a storage array found in a traditional three-tier architecture. These responsibilities not only include handling VM storage I/O from end to end, but due to the distributed nature of HCI, hosts will take part in storage activity of VMs not local to the host, such as replicated writes of a VM, as well as data at rest operations and other services related to storage. These responsibilities consume host resources. The question is, how much?

This multi-part series is going to look at the basics of HCI architectures, and how they behave differently with respect to their demands on CPU, memory, and network resources. Operational comparisons are not covered in this series simply to maintain focus on the intent of this series.

"Storage efficiency" is more than what you think
The term "storage efficiency" is commonly associated with just data deduplication and compression. With hyper-converged infrastructures, this term takes on additional meaning. Storage efficiency in HCI relates to the efficiency of how I/Os are delivered to and from the VM. The efficiency of I/O delivery matters not only for the performance and consistency seen by the VM, but also for how much resource usage is introduced to the hosts in the cluster. The latter is rarely considered, yet extremely important.

HCI Architectures
HCI solutions available in today’s market not only offer different data services, but are built differently, which is just one of the many reasons why it is difficult to generalize a typical amount of overhead that is needed to process storage I/O. All HCI solutions will vary (some more than others) on how they provide storage services to the VMs while maintaining resources for guest VM activity. The two basic categories, as illustrated in Figure 1 are:

Virtual appliance approach. A VM lives on each host in the cluster, delivering a distributed shared storage plane, processing I/O and the other related activities. Depending on the particular HCI solution, this virtual appliance on each host may also be responsible for a number of other duties.
Integrated/in-kernel approach. The distributed shared storage system is a part of the hypervisor, where key aspects of the storage system are part of the kernel. This allows for virtual machine I/O to traverse through the native kernel I/O path for the hosts participating in that I/O activity.

Figure 1. Comparing an I/O write between HCI architectures (simplified for clarity)

HCI solutions that use a VM to process storage I/O on each host reside in a context (user space) that is no different than that of the application VMs running on the host. In other words, this virtual appliance, while performing system level storage duties, contends for the same resources as the VMs that it is trying to serve. HCI solutions built into the hypervisor maintain end-to-end control and awareness of the I/O. Since an in-kernel, integrated solution allows I/O to traverse through the native kernel I/O path, it takes the least "costly" path in terms of host resources. HCI solutions built into the kernel minimize the amplification of I/Os and the CPU and memory resources it takes to process those I/Os from end to end. Sometimes virtual appliance based HCI solutions will use devices on hosts configured in the hypervisor for direct pass-through (aka “VMDirectPath”) in an attempt to reduce overhead, but many of the fundamental penalties (especially as they relate to CPU cycles) of I/O amplification through this indirect path and context switching remain.

Addressing a problem in different ways
Why are there multiple approaches? Manufacturers may state many reasons why they chose a specific approach, and why their approach is superior. Most of the decision comes from technical limitations and go-to-market pressures. An HCI vendor may not have the access, or the ability, to provide this functionality natively in the kernel of a hypervisor. A virtual appliance approach is easier to bring to market, and naturally adaptable to different hypervisors since it is little more than a virtual machine to process storage I/O.

By way of comparison, those who have full ownership of the hypervisor can integrate this functionality directly into the hypervisor, and when appropriate, build some aspects of it right into the kernel, just as other core functionality is built into the kernel. Resource efficiency, hypervisor feature integration, as well as the contextual awareness and control of I/O types are typically the top reasons why it is beneficial to have a distributed storage mechanism built into the hypervisor.

Do both approaches work? Yes. Do both approaches produce the same result in VM behavior and host resource usage? No. Running the same workloads using HCI solutions with these two different architectures may produce very different results on the VMs, and the hosts that serve them. The degree of impact will depend on the technical cost (in resource usage) of the I/O processing, and other data services provided by a given solution.

This difference often does not show up until numerous, real workloads are put on these solutions. Just as with a traditional storage array, every solution is fast when there is little to no load on it. What counts is the behavior under real load with contending resources. This is something not always visible with synthetic testing. For HCI environments, the overall “storage efficiency” of the particular HCI solution can be better compared (assuming identical hardware and workloads) by looking at the following in a real HCI environment running production workloads:

The average number of active VMs per host when running your real workloads.
The performance characteristics of the VMs and hosts when running your real workloads while hosts are busy serving other workloads.
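The first of these measurements can be reasoned about with simple arithmetic. The sketch below is illustrative only: the function name and every number in it are hypothetical planning figures, not measurements from any HCI product. It shows how a larger per-host storage overhead reduces the number of identical VMs a host can carry.

```python
import math

def vms_per_host(host_cores, host_mem_gb, overhead_cores, overhead_mem_gb,
                 vm_cores, vm_mem_gb):
    """Estimate how many identical VMs fit on a host after storage overhead.

    All inputs are hypothetical planning figures, not vendor data.
    """
    usable_cores = host_cores - overhead_cores
    usable_mem = host_mem_gb - overhead_mem_gb
    if usable_cores <= 0 or usable_mem <= 0:
        return 0
    # The scarcer of the two resources decides the density.
    return math.floor(min(usable_cores / vm_cores, usable_mem / vm_mem_gb))

# Same hypothetical host and VM profile, two different overhead assumptions.
low_overhead = vms_per_host(32, 512, 2, 16, 1.0, 16)
high_overhead = vms_per_host(32, 512, 8, 64, 1.0, 16)
print(low_overhead, high_overhead)
```

A model like this only frames the question. The actual overhead has to come from observing real workloads, which is the point of the two measurements above.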

The measurements above take this topic from an occasionally tiresome academic debate, and demonstrate the differences in real world circumstances. Ironically, faster hardware can increase, not reduce, the differences between these architectural approaches to HCI. This is not unlike what occurs quite often now at the application level, where faster hardware exposes actual bottlenecks in software/application design previously unnoticeable with older, slower hardware.

Now that an explanation has been given as to why "storage efficiency" really means so much more than data services like deduplication and compression, the next post in this series will focus on CPU resources in HCI environments, and what to look out for when observing CPU usage behaviors in HCI environments.

Does the concept of host resource usage interest you? If so, stay tuned for the book, vSphere 6.5 Host Resources Deep Dive by Frank Denneman and Niels Hagoort. It is sure to be a must-have for those interested in the design and optimization of virtualized environments. You can also follow updates from them at @hostdeepdive on Twitter.

vSAN in cost effective independent environments

Old habits in data center design can be hard to break. New technologies are introduced that process data faster, and move data more quickly. Yet all too often, the thought process for data center design remains the same – inevitably constructed and managed in ways that reflect conventional wisdom and familiar practices. Unfortunately these common practices are often due to constraints of the technologies that preceded them, rather than aligning the current business objectives with new technologies and capabilities.

Historically, no component of an infrastructure dictated design and operation more than storage. The architecture of traditional shared storage often meant that the storage infrastructure was the oddball of the modern data center. Given enough capacity, performance, and physical ports on a fabric, a monolithic array could serve up several vSphere clusters, and therein lies the problem. The storage was not seen or treated as a clustered resource by the hypervisor like compute. This centralized way of storing data invited connectivity by as many hosts as possible in order to justify the associated costs. Unfortunately it also invited several problems. It placed limits on data center design because in part, it was far too impractical to purchase separate shared storage for every use case that would benefit from an independent environment isolated from the rest of the data center. As my colleague John Nicholson (blog/twitter) has often said, "you can’t cut your array in half." It’s a humorous, but cogent way to describe this highly common problem.

While VMware vSAN has proven to be extremely well suited for converging all applications into the same environment, business requirements may dictate a need for self contained, independent environments isolated in some manner from the rest of the data center. In "Cost Effective Independent Environments using vSAN" found on VMware’s StorageHub, I walk through four examples that show how business requirements may warrant a cluster of compute and storage dedicated for a specific purpose, and why vSAN is an ideal solution. The examples provided are:

Independent cluster management
Development/Test environments
Application driven requirements
Multi-purpose Disaster Recovery

Each example listed above details how traditional storage can fall short in delivering results efficiently, then compares how vSAN addresses and solves those specific design and operational challenges. Furthermore, learn how storage related controls are moved into the hypervisor using Storage Policy Based Management (SPBM), VMware’s framework that delivers storage performance and protection policies to VMs, and even individual VMDKs, all within vCenter. SPBM is the common management framework used in vSAN and Virtual Volumes (VVols), and is clearly becoming the way to manage software defined storage. Each example wraps up with a number of practical design tips for that specific scenario in order to get you started in building a better data center using vSAN.

Clustering is an incredibly powerful concept, and vSphere clusters in particular bring capabilities to your virtualized environment that are simply beyond comparison. With VMware vSAN, the power of clustering resources is taken to the next level, forming the next logical step in the journey of modernizing your environment in preparation for a fully software defined data center.

This published use case is the first of many more to come that are focused on practical scenarios reflecting common needs of organizations large and small, and how vSAN can help deliver results, quickly and effectively. Stay tuned!

– Pete

Accommodating for change with Virtual SAN

One of the many challenges to proper data center design is trying to accommodate for future changes, and do so in a practical way. Growth is often the reason behind change, and while that is inherently a good thing, IT budgets often don’t see that same rate of increase. CFOs expect economies of scale to make your environment more cost efficient, and so should you.

Unfortunately, applications are always demanding more resources. The combination of commodity x86 servers and virtualization provided a flexible way to accommodate growth when it came to compute and memory resources, but addressing storage capacity and storage performance was far more difficult. Hyper-converged architectures helped break down this barrier somewhat, but some solutions lacked flexibility to cope with increasing storage capacity or performance beyond the initial prescribed configurations defined by a vendor. Users need a way to easily increase their HCI storage resources in the middle of a lifecycle without always requesting yet another capital expenditure.

“A customer can have a car painted any color he wants as long as it’s black” — Henry Ford

But wait… it doesn’t always have to be that way. Take a look at my post on Virtual Blocks on Options in scalability with Virtual SAN. See how VSAN allows for a smarter way to approach your evolving resource needs, giving the power of choice in how you scale your environment back to you. Whether you choose to build your own servers using the VMware compatibility guide, go with VSAN Ready Nodes, or select from one of the VxRAIL options available, the principles described in the post remain the same. I hope it sparks a few ideas on how you can apply this flexibility in a strategic way to your own environment.

Thanks for reading…

How CPU related metrics in vSphere may be misinterpreted

Most Data Center Administrators are accustomed to looking for high CPU utilization rates on VMs, and the hosts in which they reside. This shouldn’t be a big surprise. After all, vCenter, and other monitoring tools have default alarms to alert against high CPU usage statistics. Features like DRS, or products that claim DRS-like functionality factor in CPU related metrics as a part of their ability to redistribute VMs under periods of contention. All of these alerts and activities suggest that high CPU values are bad, and low values are good. But what if conventional wisdom on the consumption of CPU resources is wrong?

Why should you care
Infrastructure metrics can certainly be a good leading indicator of a problem. Over the years, high CPU usage alarms have helped correctly identify many rogue processes on VMs ("Hey, who enabled the screen saver via GPO?…"). But a CPU alarm trigger assumes that high CPU usage is always bad. It also implies that the absence of an alarm condition means that there is not an issue. Both assumptions can be incorrect, which may lead to bad decision making in the Data Center.

The subtleties of performance metrics can reveal problems somewhere else in the stack – if you know how and where to look. Unfortunately, when metrics are looked at in isolation, the problems remain hidden in plain sight. This post will demonstrate how a few common metrics related to CPU utilization can be misinterpreted. Take a look at the post Observations with the Active Memory metric in vSphere to see how this can happen with other metrics as well.

The testing
There are a number of CPU related metrics to monitor in the hypervisor, and at least a couple of different ways to look at them (vCenter, and esxtop). For brevity, let’s focus on two metrics that are readily visible in vCenter: CPU Usage and CPU Ready. This doesn’t dismiss the importance of other CPU related metrics, or the various ways to gather them, but it is a good start to understanding the relationship between metrics. As a quick refresher, CPU Usage as it relates to vCenter has two definitions. From the host, the usage is the percentage of CPU cycles in use against the total CPU cycles available on the host. On the VM, usage shows the percent of CPU resources in use against the total available CPU cycles of the vCPUs visible to the VM. CPU Ready in vCenter measures, in summation form, the amount of time that the virtual machine was ready, but could not get scheduled to run on the CPU.
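Because CPU Ready is reported as a summation (milliseconds accumulated during the chart interval), it is often easier to reason about as a percentage of that interval. The following is a minimal sketch, not part of the original testing. The function name is hypothetical, the 20 second default reflects the real-time chart interval in vCenter, and dividing by the vCPU count to get a per-vCPU figure is a common convention that this sketch assumes.

```python
def cpu_ready_percent(ready_ms, interval_s=20, vcpus=1):
    """Convert a CPU Ready summation (ms) into a percentage of the interval.

    interval_s defaults to 20, the real-time chart interval in vCenter.
    Dividing by vcpus gives a per-vCPU average; that normalization is an
    assumption of this sketch, not something taken from the test results.
    """
    if interval_s <= 0 or vcpus <= 0:
        raise ValueError("interval_s and vcpus must be positive")
    return ready_ms / (interval_s * 1000.0) / vcpus * 100.0

# 1000 ms of ready time accumulated in one 20 second sample.
single_vcpu = cpu_ready_percent(1000)
four_vcpu = cpu_ready_percent(1000, vcpus=4)
print(single_vcpu, four_vcpu)
```

The same summation value means something quite different on a 1 vCPU VM than on a 4 vCPU VM, which is one reason the metric needs context.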

A few notes about the test conditions and results:

The tests here consist of activities that are scheduled inside each guest, and are repeated 5 times over a 1 hour period.
There are no synthetic tools used here to generate storage I/O load or consume CPU cycles. (iometer, StressLinux, etc.)
The activities performed are using processes that are only partially multithreaded. This approach is most reflective of real world environments.
The "slower" storage depicted in the testing was actually SSD based, while the "faster" storage leveraged PernixData FVP with distributed fault tolerant memory (DFTM) as a storage acceleration tier.
The absolute numbers are not necessarily important for this testing. The focus is more about comparing values when a variable like storage performance changes.
No shares, reservations, or limits were used on the test VMs.

The complex demands of real world environments may exhibit a much greater impact than what the testing below reveals. I reference a few actual cases of production workloads later on in the post. Synthetic load generators were not used here because they cannot properly simulate a pattern of activity that is reflective of a real environment. Synthetic load generators are good at stressing resources – not simulating real world workloads, or the time it takes for those workloads to complete their tasks.

Interpreting impacts on CPU usage and CPU Ready with changing storage performance
Looking at CPU utilization can be challenging because not all applications, nor the workloads they generate are the same. Most applications are a complex mix of some processes being multithreaded, while others are not. Some processes initiate storage I/O, while others do not. It is for this reason that we will look at CPU Usage and CPU Ready over a task that is repeated on the same sets of VMs, but using storage that performs differently.

For all practical purposes, CPU Ready doesn’t become meaningful until a host is running a large number of single vCPU VMs concurrently, or a number of multiple vCPU VMs concurrently. CPU Ready can sometimes be terribly tricky to decipher because it can be influenced in so many ways. Sometimes it may align with CPU utilization, while other times it may not. It may be affected by other resources, or it may not. It really depends on the environmental conditions. I find it a good supporting metric, but definitely not one that should stand on its own merit, without proper context of other metrics. We are measuring it here because it is generally regarded as important, and one that may contribute to load distribution activities.

Test 1: Single vCPU VM on a Host with no other activity
First let’s look at one of the very simplest of comparisons. A single vCPU VM with no other activity occurring on the host, where one test is using slower storage (blue), and the other test is using faster storage (orange). A task was completed 5 times over the course of one hour. The image below shows that from the host perspective, peak CPU utilization increased by 79% when using the faster storage. CPU Ready demonstrated very little change, which was as expected due to the nature of this test (no other VMs running on the host).

When we look at the individual VMs, the results are similar. The images below show that CPU usage maximums for the VM increased by 24% when using the faster storage. CPU Ready demonstrated very little change here because there were no other VMs to contend with on that host. The "Storage Latency" column shows the average storage latency the VM was seeing during this time period.

You might think that higher latency may not be realistic of today’s storage technologies. The "slower" storage in this case did in fact come from SSD based storage. But remember that Flash of any kind can suffer in performance when committing larger block I/O which is quite common with real workloads. Take a look at "Understanding block sizes in a virtualized environment" for more information.

But wait… how long did the task, set to run 5 times over the period of one hour, take? Well, the task took just half the time to run with the faster storage. The same number of cycles were processing the same number of I/Os, but just for a shorter period of time. This faster completion of a task will free up those CPU cycles for other VMs. This is the primary reason why the averages for CPU Usage and CPU Ready changed very little. Looking at this data in a timeline form in vCenter illustrates it quite clearly. The characteristics of the task are clearly distinguishable on the fast storage, and much more difficult to decipher on the run with slower storage.

Test 2: Multiple vCPU VM on a host with other activity
Now let’s let the same workload run on VMs assigned multiple (4) vCPUs, along with other multi-vCPU VMs running in the background. This is to simulate a bit of "chatter" or activity that one might experience in a production environment.

As we can see from the images below, on the host level, both CPU usage and CPU Ready values increased as storage performance increased. CPU usage maximums increased by 39% on the host. CPU Ready maximums increased by 34% on the host, which was a noticeable difference from testing without any other systems running.

When we look at the individual VMs, the results are similar. The images below show that CPU usage maximums increased by 39% with the faster storage. CPU Ready maximums increased by 51% while running on the faster storage. Considering the typical VM to host consolidation ratio, the effects can be profound.

Now let’s take a look at the timeline in vCenter to get an appreciation of how those CPU cycles were used. On the image below, you can see that like the single vCPU VM testing, the VM running on faster storage allowed for much higher CPU usage than when running on slower storage, but that it was for a much shorter period of time (about half). You will notice that in this test, the CPU Ready measurements generally increase as the CPU usage increases.

Real world examples
This all brings me back to what I witnessed years ago while administering a vSphere environment consisting of extremely CPU and storage I/O intensive workloads: dozens of resource intensive VMs built for the purpose of compiling code. These were systems that could multithread to near perfection – assuming storage performance was sufficient.

Now let’s look at what CPU utilization rates looked like on that same VM, running the same code compiling job where the storage environment wasn’t able to satisfy reads and writes fast enough. The same job took 46% longer to complete, all because the available CPU cycles couldn’t be used.

Still not a believer? Take a look at a presentation at the OpenStack summit by Charter Communications in April 2016, where they demonstrate exactly the effect I describe. They show their Cassandra cluster deployed with VMware Integrated OpenStack, and the effects on CPU utilization when providing lower latency, higher performing storage (key information beginning at 17:10). Their more freely breathing storage allowed CPU cycles related to storage I/O to be committed more quickly, thereby finishing the tasks much more quickly. High CPU usage was a desired result of theirs.

You might be thinking to yourself, "Won’t I have more CPU contention with faster storage?" Well, yes and no. Faster storage will give power back to the Administrator to control the usage of resources as needed, and deliver the SLAs required. And moving the point of contention to the CPU allows for what it does best; time slicing processes to complete the tasks as quickly as possible.

Sample what?
The rate at which telemetry data is sampled is a factor that can dramatically change your impression of the behavior of these resources used in the Data Center. It’s a big topic, and one that will be touched on in an upcoming post, but there is one thing to note here. When leveraging faster, lower latency storage, there are many times where CPU utilization and CPU Ready will stay the same. Why? In a real workload that involves CPU cycles executing to commit storage I/O, a workflow may consist of a given number of those I/Os, regardless of how long it takes. If that process took 18 seconds on slow storage, but 5 seconds on faster storage, the 20 second sampling rate within vCenter may render it in the same way. One often has to employ other tools to see these figures at a higher sampling rate. Tools such as vscsiStats and esxtop are good examples of this.
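The 18 second versus 5 second case can be sketched numerically. This is a simplified model rather than data from the testing: it assumes a fixed, hypothetical amount of CPU work (`WORK`) spread evenly over the task duration, and a single 20 second sample that reports the average over that window.

```python
def per_second_usage(duration_s, cpu_seconds, window_s=20):
    """Spread a fixed amount of CPU work evenly over duration_s seconds.

    Returns one utilization value (0.0 to 1.0) per second of the window.
    """
    busy = cpu_seconds / duration_s
    return [busy if t < duration_s else 0.0 for t in range(window_s)]

def sample_average(series):
    """Average of the window, standing in for one 20 second sample."""
    return sum(series) / len(series)

WORK = 4.5  # hypothetical CPU-seconds needed to commit the task's I/Os

slow = per_second_usage(18, WORK)  # task stretched out by slower storage
fast = per_second_usage(5, WORK)   # same work, finished sooner

print("slow: peak", max(slow), "sample", sample_average(slow))
print("fast: peak", max(fast), "sample", sample_average(fast))
```

In this model the two runs produce the same 20 second sample even though the per-second peak differs considerably, which is why a finer sampling rate is needed to tell them apart.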

Takeaways
The testing, and examples above should make it easy to imagine a scenario in which a storage system is upgraded, and CPU related alarms are tripped more frequently, even though the processes that support a workflow have completed much more quickly. So with that, it’s good to keep the following in mind.

Slow storage will suppress CPU utilization rates – giving you the impression that from a host, or VM perspective, everything is fine.
Conversely, fast storage will allow those CPU cycles related to storage I/O to execute, thereby increasing utilization rates – albeit for a shorter period of time. High CPU statistics are not necessarily a bad thing.
Averages and peaks can be misleading because increased utilization rates may not be recognizable in the vCenter CPU charts if the activity completes within the smallest sampling size (20 seconds).
Traditional methods of monitoring and balancing host resources can be misleading.
Higher CPU utilization rates may not be a leading indicator of an issue. They are often a trailing indicator of well-designed processes, or free breathing storage. Again, high CPU can be a good thing!!!
Application behavior, and the results are what counts. If a batch job in SQL takes 30 minutes, defining success should be around the desired time of that batch job. Infrastructure related metrics should help you diagnose issues and assist with achieving a desired result, but not be the one and only KPI.
Storage performance will generally impact every VM and host accessing the cluster, whereas host based resource contention will only impact other VMs living on that same host.

Thanks for reading

– Pete

What does your infrastructure analytics really tell you?

There is no mistaking the value of data visualization combined with analytics. Data visualization can help make sense of the abstract or information not easily conveyed by numbers. Data analytics excels at turning discrete data points that make no sense on their own into findings that have context, and relevance. The two together can present findings in a meaningful, insightful, and easy to understand way. But what are your analytics really telling you?

The problem for modern IT is that there can be an overabundance of data, with little regard to the quality of data gathered, how it relates to each other, and how to make it meaningful. All too often, this "more is better" approach obfuscates the important to such a degree that it provides less value, not more. It’s easy to collect data. The difficulty is to do something meaningful with the right data. Many tools prioritize metrics not by what is most important, but by what can be easily provided.

Various solutions with the same problem
Modern storage solutions have increased their sophistication in their analytics offerings for storage. In principle this can be a good thing, as storage capacity and performance are such common problems in today’s environments. Storage vendors have joined the "we do that too" race of analytics features. However, feature list checkboxes can easily mask the reality – that the quality of insight is not what you might think it is. Creative license gets a little, well, creative.

Some storage solutions showcase their storage I/O analytics as a complete solution for understanding storage usage and performance of an environment, advertising an extraordinary number of data points collected, and sophisticated methods for collecting that data that are impressive by anyone’s standards. But these metrics are often taken at face value. Tough questions need to be asked before important decisions are made off of them. Is the right data being measured? Is the data being measured from the right location? Is the data being measured in the right way? And is the information conveyed of real value?

Accurate analytics requires that the sources of data are of the right quality and completeness. No amount of shiny presentation can override the result of using the wrong data, or using it in the wrong way.

What is the right data?
The right data has a supporting influence on the questions that you are trying to answer. Why did my application slow down after 1:18pm? How did a recent application modification impact other workloads? In Infrastructure performance, I’ve demonstrated how block sizes have historically been ignored when it came to storage design, because they could not have been easily seen or measured. Having metrics around fan speed of a storage array might be helpful for evaluating your cooling system in your Data Center, but does little to help you understand your workloads. The right data must also be collected at a rate that accurately reflects the real behavior. If your analytics offering samples data once every 5 or 10 minutes, how can it ever show spikes of contention in resources that impact what your systems experience? The short answer is, it can’t.
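A small numerical sketch makes the sampling point concrete. The latency series here is entirely made up for illustration (a 1 ms baseline with a 10 second spike to 50 ms), and `rollup` is a hypothetical helper that averages fine-grained samples into coarser ones, roughly the way a slow collector would report them.

```python
def rollup(series, bucket):
    """Average consecutive samples into buckets of the given size."""
    out = []
    for i in range(0, len(series), bucket):
        chunk = series[i:i + bucket]
        out.append(sum(chunk) / len(chunk))
    return out

# Hypothetical 1-second latency samples (ms) covering 5 minutes:
# a steady 1 ms baseline with a 10 second spike to 50 ms.
latency = [1.0] * 300
for t in range(120, 130):
    latency[t] = 50.0

worst_seen_by_workload = max(latency)
five_minute_view = rollup(latency, 300)
twenty_second_view = rollup(latency, 20)

print(worst_seen_by_workload)
print(five_minute_view)
print(max(twenty_second_view))
```

With these made-up numbers the 5 minute average stays within a few milliseconds of the baseline, while the 20 second view at least shows that something happened. Neither shows the full spike the workload actually experienced.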

The importance of location
Measuring the data at the right location is critical to accurately interpreting the conditions of your VMs, and the infrastructure in which they live. We perceive much more than we see. This is demonstrated most often with a playful optical illusion, but can be a serious problem with understanding your environment. The data gathered is often incomplete, and assuming it is all the data you need can lead to the wrong conclusion. Let’s consider a common scenario where the analytics of a storage system show great performance of a storage array, yet the VM may be performing poorly. This is the result of measuring from the wrong location. The array may have shown the latency of the components inside the device, but cannot account for latency introduced throughout the storage stack. The array metric might have been technically accurate for what it was seeing, but it was not providing you the correct, and complete metric. Since storage I/O always originates on the VMs and the infrastructure in which they live, it simply does not make sense to measure it from a supporting component like a storage array.
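The scenario above can be sketched as a simple sum over the layers an I/O passes through. The layer names and latency values below are hypothetical and chosen only to illustrate the idea; the function name `observed_at` is likewise an assumption of this sketch.

```python
# Hypothetical latency contributions (ms) for one I/O, listed from the VM down.
stack = [
    ("guest and hypervisor queuing", 3.0),
    ("host adapter and fabric", 2.5),
    ("array controller and media", 0.5),
]

def observed_at(layer_name, layers):
    """Latency visible from a layer: its own plus everything beneath it."""
    names = [name for name, _ in layers]
    start = names.index(layer_name)
    return sum(ms for _, ms in layers[start:])

array_view = observed_at("array controller and media", stack)
vm_view = observed_at("guest and hypervisor queuing", stack)
print("array reports:", array_view, "ms")
print("VM experiences:", vm_view, "ms")
```

In this model the array's figure is accurate for what the array can see, yet it is only a fraction of what the VM experiences, because everything above the array is invisible to it.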

Measuring data inside the VM can be equally challenging. An operating system’s method of data collection assumes it is the sole proprietor of resources, and may not always accurately account for the fact that it is time slicing CPU clock cycles with other VMs. While the VM is the end "consumer" of resources, it also does not understand it is virtualized, and cannot see the influence of performance bottlenecks throughout the virtualization layer, or any of the physical components in the stack that support it.

VM metrics pulled from inside the guest OS may measure things in different ways depending on the operating system. Consider the differences in how disk latency in Windows "Perfmon" is measured versus Linux "top." This is the problem with data collector based solutions that aggregate metrics from different sources. A lot of data is collected, but none of it means the same thing.

This disparate data leaves users attempting to reconcile what these metrics mean, and how they impact each other. It is even worse when supposedly similar metrics from two different sources show different data. This can occur with storage array solutions that hook into vCenter to augment the array based statistics. Which one is to be believed? One over the other, or neither?

Statistics pulled solely from the hypervisor kernel avoid this nonsense. They provide a consistent method for gathering meaningful data about your VMs and the infrastructure as a whole. The hypervisor kernel is also capable of measuring this data in such a way that it accounts for all elements of the virtualization stack. However, determining the location for collection is not the end-game. We must also consider how it is analyzed.

Seeing the trees AND the forest
Metrics are just numbers. More is needed than numbers to provide a holistic understanding for an environment. Data collected that stands on its own is important, but how it contributes to the broader understanding of the environment is critical. One needs to be able to get a broad overview of an environment to drill down and identify a root cause of an issue, or be able to start out at the level of an underperforming VM and see how or why it may be impacted by others.

Many attempt to distill this large collection of metrics down to just a few that might help provide insight into performance, or potential issues. Examples of these individual metrics might include CPU utilization, queue depths, storage latency, or storage IOPS. However, it is quite common to misinterpret these metrics when looked at in isolation.

Holistic understanding provides its greatest value when attempting to determine the impact of one workload over a group of other workloads. A VM’s transition to a new type of storage I/O pattern can often result in lower CPU activity; the exact opposite of what most would look for. The weight of impact between metrics will also vary. Think about a VM consuming large amounts of CPU. This will generally only impact other VMs on that host. In contrast, a storage based noisy neighbor can impact all VMs running on that storage system, not just the other VMs that live on that host.

Conclusion
Whether your systems are physical, virtualized, or live in the cloud, analytics exist to help answer questions, and solve problems. But analytics are far more than raw numbers. The value comes from properly digesting and correlating numbers into a story providing real intelligence. All of this is contingent on using the right data in the first place. Keep this in mind as you think about ways that you currently look at your environment.

Viewing the impact of block sizes with PernixData Architect

In the post Understanding block sizes in a virtualized environment, I describe what block sizes are as they relate to storage I/O, and how they became one of the most overlooked metrics of virtualized environments. The first step in recognizing their importance is providing visibility to them across the entire Data Center, but with enough granularity to view individual workloads. However, visibility into a metric like block sizes isn’t enough. The data itself has little value if it cannot be interpreted correctly. The data must be:

Easy to access
Easy to interpret
Accurate
Easy to understand how it relates to other relevant metrics

Future posts will cover specific scenarios detailing how this information can be used to better understand, and tune your environment for better application performance. Let’s first learn how PernixData Architect presents block size when looking at very basic, but common read/write activity.

Block size frequencies at the Summary View
When looking at a particular VM, the "Overview" view in Architect can be used to show the frequency of I/O sizes across the spectrum of small blocks to large blocks for any time period you wish. This I/O frequency will show up on the main Summary page when viewing across the entire cluster. The image below focuses on just a single VM.


What is most interesting about the image above is that the frequencies of block sizes for reads and for writes are different. This is a common characteristic that has largely gone unnoticed because there has been no way to easily view that data in the first place.

Block sizes using a "Workload" View
The "Workload" view in Architect presents a distribution of block sizes in a percentage form as they occur on a single workload, a group of workloads, or across an entire vSphere cluster. The time frame can be any period that you wish. This view shows more clearly and quickly than any other how complex, unique, and dynamic the distribution of block sizes is for any given VM. The example below represents a single workload, across a 15 minute period of time. Much like read/write ratios, or other metrics like CPU utilization, it’s important to understand these changes as they occur, and not just as a single summation or percentage over a long period of time.

When viewing the "Workload" view in any real world environment, it will instantly provide new perspective on the limitations of synthetic I/O generators. Their general inability to emulate the very complex distribution of block sizes in your own environment limits their value for that purpose. The "Workload" view also shows how dramatic and continuous the changes in workloads can be. This speaks volumes as to why one-time storage assessments are not enough. Nobody treats their CPU or memory resources in that way. Why would we limit ourselves that way for storage?

Keep in mind that the "Workload" view illustrates this distribution in a percentage form. Views that are percentage based aim to illustrate proportion relative to a whole. They do not show absolute values behind those percentages. The distribution could represent 50 IOPS, or 5,000 IOPS. However, this type of view can be incredibly effective in identifying subtle changes in a workload or across an environment for short term analysis, or long term trending.
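To make the point about percentages concrete, here is a minimal Python sketch. The two samples are made-up numbers, not data from Architect, and the function name is my own. It shows how a light workload and a heavy one can produce exactly the same percentage distribution.

```python
from collections import Counter

def percent_distribution(io_sizes_kb):
    """Return {block_size_kb: percent_of_ios} for a list of I/O sizes."""
    counts = Counter(io_sizes_kb)
    total = sum(counts.values())
    return {size: 100.0 * n / total for size, n in sorted(counts.items())}

# Two hypothetical one-second samples: same mix of sizes, very different volume.
quiet = [4] * 30 + [64] * 15 + [256] * 5        # 50 I/Os
busy = [4] * 3000 + [64] * 1500 + [256] * 500   # 5,000 I/Os

quiet_pct = percent_distribution(quiet)
busy_pct = percent_distribution(busy)

print(len(quiet), quiet_pct)
print(len(busy), busy_pct)
```

Both samples report 60% 4K, 30% 64K, and 10% 256K, which is why a percentage view needs to be read alongside an absolute view such as IOPS or Throughput.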

Block sizes in an IOPS and Throughput view
First let’s take a look at this VM by looking at IOPS, based on a default "Read/Write" breakdown. The image below shows a series of reads before a series of mostly writes. When you look back at the "Workload" view above, you can see how these I/Os were represented by block size distribution.

Staying on this IOPS view and selecting the predefined "Block Size" breakdown, we can see the absolute numbers that are occurring based on block size. Unlike the "Workload" view above, the image below shows the actual number of I/Os issued for each block size.

But that doesn’t tell the whole story. A block size is an attribute for a single I/O. So in an IOPS view 10 IOPS of 4K blocks looks the same as 10 IOPS of 256K blocks. In reality, the latter is 64 times the amount of data. The way to view this from a "payload amount transmitted" perspective is using the Throughput view with the "Block Size" breakdown, as shown below.
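The arithmetic behind that statement is simple enough to sketch in a few lines of Python. The function name is my own, and this only illustrates the relationship between IOPS, block size, and throughput. It is not how Architect computes anything.

```python
def throughput_kb_per_sec(iops, block_size_kb):
    """Payload moved per second: number of I/Os times the size of each I/O."""
    return iops * block_size_kb

small = throughput_kb_per_sec(10, 4)     # 10 IOPS of 4K blocks
large = throughput_kb_per_sec(10, 256)   # 10 IOPS of 256K blocks
ratio = large / small

print(small, "KB/s versus", large, "KB/s, a factor of", ratio)
```

Same IOPS, but 40 KB/s in one case and 2,560 KB/s in the other.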

Viewing block size by its payload size (Throughput), as shown above, provides a much better representation of the dominance of large block sizes, and the relatively small payload of the smaller block sizes.

Here is another way Architect can help you visualize this data. We can click on the "Performance Grid" view and change the view so that we have IOPS and Throughput but for specific block sizes. As the image below illustrates, the top row shows IOPS and Throughput for 4K to <8K blocks, while the bottom row shows IOPS and Throughput for blocks over 256K in size.

What the image above shows is that while the number of IOPS for block sizes in the 4K to <8K range at its peak was similar to the number of IOPS for block sizes of 256K and above, the larger blocks delivered an enormously larger payload.

Why does it matter?
Let’s let PernixData Architect tell us why all of this matters. We will look at the effective latency of the VM over that same time period. We can see from the image below that the effective latency of the VM definitely increased as it transitioned to predominately writes. (Read/Write boxes unticked for clarity).

Now, let’s look at the image below, which shows latency by block size using the "Block Size" breakdown.

There you see it. Latency was by and large a result of the larger block sizes. The flexibility of these views can take an otherwise innocent looking latency metric and tell you what was contributing most to that latency.

Now let’s take it a step further. With Architect, the "Block Size" breakdown is a predefined view that shows block size characteristics of reads and writes combined – whether you are looking at Latency, IOPS, or Throughput. However, you can use a custom breakdown to not only show block sizes, but show them specifically for reads or writes, as shown in the image below.

The "Custom" Breakdown for the Latency view shown above had all of the reads and writes of individual block sizes enabled, but some of them were simply "unticked" for clarity. This view confirms that the majority of latency was the result of writes that were 64K and above. In this case, we can clearly demonstrate that latency seen by the VM was the result of larger block sizes issued by writes. Its impact, however, is not limited to just the higher latency of those larger blocks, as those large block latencies can impact the smaller block I/Os as well. Stay tuned for more information on that subject.
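A rough way to reason about this kind of breakdown is to weight each bucket's latency by how many I/Os it carried. The Python sketch below uses made-up numbers and assumes that the VM's overall latency is an I/O-count-weighted average of the per-bucket latencies. That is my simplification for illustration, not a description of how Architect derives its figures.

```python
# Hypothetical interval: (operation, block size bucket, I/O count, avg latency in ms)
samples = [
    ("read", "4K", 800, 0.5),
    ("read", "64K", 100, 1.0),
    ("write", "4K", 600, 1.0),
    ("write", "64K", 300, 6.0),
    ("write", "256K", 200, 12.0),
]

def weighted_latency_ms(rows):
    """Average latency across all I/Os, weighting each bucket by its I/O count."""
    total_ios = sum(count for _, _, count, _ in rows)
    return sum(count * latency for _, _, count, latency in rows) / total_ios

def latency_share_percent(rows):
    """Percent of the total (count x latency) contributed by each bucket."""
    total = sum(count * latency for _, _, count, latency in rows)
    return {(op, size): 100.0 * count * latency / total
            for op, size, count, latency in rows}

overall = weighted_latency_ms(samples)
shares = latency_share_percent(samples)
large_write_share = shares[("write", "64K")] + shares[("write", "256K")]

print(f"overall latency: {overall:.2f} ms")
print(f"share from writes of 64K and above: {large_write_share:.1f}%")
```

In this invented sample, large writes are only a quarter of the I/Os but account for roughly four fifths of the latency, which is the kind of pattern a per-block-size, per-operation breakdown makes visible.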

As shown in the image below, Architect also allows you to simply click on a single point, and drill in for more insight. This can be done on a per VM basis, or across the entire cluster. Hovering over each vertical bar representing a block size will tell you how many I/Os were issued at that time, and the corresponding latency.

Flash to the rescue?
It’s pretty clear that block size can have significant impact on the latency your applications see. Flash to the rescue, right? Well, not exactly. All of the examples above come from VMs running on Flash. Flash, and how it is implemented in a storage solution, is part of what makes this so interesting, and so impactful to the performance of your VMs. We also know that the storage media is just one component of your storage infrastructure. These components, and their abilities to hinder performance, exist regardless of whether one is using a traditional three-tier architecture, or a distributed storage architecture like a Hyper Converged environment.

Block sizes in the Performance Matrix
One unique view in Architect is the Performance Matrix. It is unique in what it presents, and how it can be used. Your storage solution might have been optimized by the manufacturer based on certain assumptions that may not align with your workloads. Typically there is no way of knowing that. As shown below, Architect can help you understand the workload characteristics at which the array begins to suffer.

The Performance Matrix can be viewed on a per VM basis (as shown above) or in an aggregate form. It’s a great view to see at what block size thresholds your storage infrastructure may be suffering, as the VMs see it. This is very different from statistics provided by an array, as Architect offers a complete, end-to-end understanding of these metrics with extraordinary granularity. Arrays are not in the correct place to accurately understand, or measure this type of data.

Summary
Block sizes have a profound impact on the performance of your VMs, and are a metric that should be treated as a first class citizen just like compute and other storage metrics. The stakes are far too high to leave this up to speculation, or words from a vendor that say little more than "Our solution is fast. Trust us." Architect leverages its visibility of block sizes in ways that have never been possible. It takes advantage of this visibility to help you translate what it is, to what it means for your environment.

Understanding block sizes in a virtualized environment

Cracking the mysteries of the Data Center is a bit like space exploration. You think you understand what everything is, and how it all works together, but struggle to understand where fact and speculation intersect. The topic of block sizes, as they relate to storage infrastructures, is one such mystery. The term is familiar to some, but elusive enough that many remain uncertain as to what it is, or why it matters.

This inconspicuous, but all too important characteristic of storage I/O has often been misunderstood (if not completely overlooked) by well-intentioned Administrators attempting to design, optimize, or troubleshoot storage performance. Much like the topic of Working Set Sizes, block sizes are not of great concern to an Administrator or Architect because of this lack of visibility and understanding. Sadly, myth turns into conventional wisdom – in not only what is typical in an environment, but how applications and storage systems behave, and how to design, optimize, and troubleshoot for such conditions.

Let’s step through this process to better understand what a block is, and why it is so important to understand its impact on the Data Center.

What is it?
Without diving deeper than necessary, a block is simply a chunk of data. In the context of storage I/O, it would be a unit in a data stream; a read or a write from a single I/O operation. Block size refers to the payload size of a single unit. We can blame a bit of the confusion about what a block is on overlap in industry nomenclature. Commonly used terms like block sizes, cluster sizes, pages, latency, etc. may be used in disparate conversations, but what is being referred to, how it is measured, and by whom may often vary. Within the context of discussing file systems, storage media characteristics, hypervisors, or Operating Systems, these terms are used interchangeably, but do not have universal meaning.

Most who are responsible for Data Center design and operation know the term as an asterisk on a performance specification sheet of a storage system, or a configuration setting in a synthetic I/O generator. Performance specifications on a storage system are often the result of a synthetic test using the most favorable block size (often 4K or smaller) for an array to maximize the number of IOPS that an array can service. Synthetic I/O generators typically allow one to set this, but users often have no idea what the distribution of block sizes is across their workloads, or if it is even possible to simulate that with synthetic I/O. The reality is that many applications draw a unique mix of block sizes at any given time, depending on the activity.

I first wrote about the impact of block sizes back in 2013 when introducing FVP into my production environment at the time. (See section "The IOPS, Throughput & Latency relationship") FVP provided a tertiary glimpse of the impact of block sizes in my environment. Countless hours with the performance graphs, and using vscsiStats, provided new insight about those workloads, and the environment in which they ran. However, neither tool was necessarily built for real time analysis or long term trending of block sizes for a single VM, or across the Data Center. I had always wished for an easier way.

Why does it matter?
The best way to think of block sizes is as how much storage payload is contained in a single unit. The physics of it becomes obvious when you think about the size of a 4KB payload, versus a 256KB payload, or even a 512KB payload. Since we refer to them as a block, let’s use a square to represent their relative capacities.

Throughput is the result of IOPS, and the block size for each I/O being sent or received. It’s not just the fact that a 256KB block has 64 times the amount of data that a 4KB block has, it is the amount of additional effort throughout the storage stack it takes to handle that, whether it be bandwidth on the fabric, the protocol, or processing overhead on the HBAs, switches, or storage controllers. And let’s not forget the burden it has on the persistent media.

This variability in performance is more prominent with Flash than traditional spinning disk. Reads are relatively easy for Flash, but the methods used for writing to NAND Flash can keep writes from matching the performance of reads, especially with writes using large blocks. (For more detail on the basic anatomy and behavior of Flash, take a look at Frank Denneman’s post on Flash wear leveling, garbage collection, and write amplification. Here is another primer on the basics of Flash.) A very small number of writes using large blocks can trigger all sorts of activity on the Flash devices that keeps the effective performance from behaving as it does with smaller block I/O. This volatility in performance is a surprise to just about everyone when they first see it.

Block size can impact storage performance regardless of the type of storage architecture used. Whether it is a traditional SAN infrastructure, or a distributed storage solution used in a Hyper Converged environment, the factors, and the challenges remain. Storage systems may be optimized for block sizes that do not necessarily align with your workloads. This could be the result of design assumptions of the storage system, or limits of their architecture. The abilities of storage solutions to cope with certain workload patterns vary greatly as well. The difference between a good storage system and a poor one often comes down to its ability to handle large block I/O. Insight into this information should be a part of the design and operation of any environment.

The applications that generate them
What makes the topic of block sizes so interesting are the Operating Systems, the applications, and the workloads that generate them. The block sizes are often dictated by the processes of the OS and the applications that are running in them.

Unlike what many might think, there is often a wide mix of block sizes that are being used at any given time on a single VM, and it can change dramatically by the second. These changes have profound impact on the ability for the VM and the infrastructure it lives on to deliver the I/O in a timely manner. It’s not enough to know that perhaps 30% of the blocks are 64KB in size. One must understand how they are distributed over time, and how latencies or other attributes of those blocks of various sizes relate to each other. Stay tuned for future posts that dive deeper into this topic.
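Here is a small Python sketch of why a single percentage is not enough. The two workloads are invented for illustration, and the function name is my own. Both issue 30% of their I/Os at 64KB over the whole period, yet they look nothing alike when viewed interval by interval.

```python
def percent_at_size(intervals, size_kb=64):
    """Percent of I/Os at size_kb for each interval, and for the whole period."""
    per_interval = [100.0 * iv.get(size_kb, 0) / sum(iv.values()) for iv in intervals]
    total_ios = sum(sum(iv.values()) for iv in intervals)
    at_size = sum(iv.get(size_kb, 0) for iv in intervals)
    return per_interval, 100.0 * at_size / total_ios

# Each dict is one hypothetical interval: {block size in KB: number of I/Os}.
steady = [{4: 70, 64: 30}, {4: 70, 64: 30}, {4: 70, 64: 30}, {4: 70, 64: 30}]
bursty = [{4: 100}, {4: 100}, {4: 80, 64: 20}, {64: 100}]

steady_series, steady_overall = percent_at_size(steady)
bursty_series, bursty_overall = percent_at_size(bursty)

print(steady_overall, steady_series)
print(bursty_overall, bursty_series)
```

The summary figure is 30% for both, but the second workload ends with an interval made up entirely of 64KB I/Os. That burst is exactly what a long-period average hides.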

Traditional methods capable of visibility
The traditional methods for viewing block sizes have been limited. They provide an incomplete picture of their impact – whether it be across the Data Center, or against a single workload.

1. Kernel statistics courtesy of vscsiStats. This utility is a part of ESXi, and can be executed via the command line of an ESXi host. The utility provides a summary of block sizes for a given period of time, but suffers from a few significant problems.

Not ideal for anything but a very short snippet of time, against a specific vmdk.
Cannot present data in real-time. It is essentially a post-processing tool.
Not intended to show data over time. vscsiStats will show a sum total of I/O metrics for a given period of time, but only for a single sample period. It has no way to track this over time. One must script this to create results for more than a single period of time.
No context. It treats that workload (actually, just the VMDK) in isolation. It is missing the context necessary to properly interpret the data.
No way to visually understand the data. This requires the use of other tools to help visualize the data.
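To give a sense of what "scripting this" involves, here is a minimal Python sketch of the bookkeeping. It assumes you already have some way to fetch a block size histogram as cumulative counts per bucket, which is my working assumption about how the counters behave between resets. The `collect` callable is a hypothetical stand-in for that step, and nothing here invokes the real utility.

```python
import time

def histogram_delta(previous, current):
    """Subtract two cumulative {bucket: count} histograms to get one interval."""
    return {bucket: count - previous.get(bucket, 0)
            for bucket, count in current.items()}

def sample_over_time(collect, samples, interval_seconds, sleep=time.sleep):
    """Call collect() repeatedly and return one per-interval histogram per sample."""
    results = []
    previous = collect()
    for _ in range(samples):
        sleep(interval_seconds)
        current = collect()
        results.append(histogram_delta(previous, current))
        previous = current
    return results

# Stand-in collector with canned cumulative counts (bucket = I/O length in bytes).
canned = iter([
    {4096: 100, 65536: 10},
    {4096: 150, 65536: 40},
    {4096: 160, 65536: 140},
])
series = sample_over_time(lambda: next(canned), samples=2,
                          interval_seconds=0, sleep=lambda seconds: None)
print(series)
```

Even this much only covers a single VMDK, and still leaves the problems of context and visualization untouched.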

The result, especially at scale, is a very labor intensive exercise that is an incomplete solution. It is extremely rare that an Administrator runs through this exercise on even a single VM to understand its I/O characteristics.

2. Storage array. This would be a vendor specific "value add" feature that might present some simplified summary of data with regards to block sizes, but this too is an incomplete solution:

Not VM aware. Since most intelligence is lost the moment storage I/O leaves a host HBA, a storage array would have no idea what block sizes were associated with a VM, or what order they were delivered in.
Measuring at the wrong place. The array is simply the wrong place to measure the impact of block sizes in the first place. Think about all of the queues storage traffic must go through before the writes are committed to the storage, and reads are fetched. (It also assumes no caching tiers outside of the storage system exist). The desire would be to measure at a location that takes all of this into consideration; the hypervisor. Incidentally, this is often why an array can show great performance on the array, but suffer in the observed latency of the VM. This speaks to the importance of measuring data at the correct location.
Unknown and possibly inconsistent method of measurement. Showing any block size information is not a storage array’s primary mission, and doesn’t necessarily provide the same method of measurement as where the I/O originates (the VM, and the host it lives on). Therefore, how it is measured, and how often it is measured is generally of low importance, and not disclosed.
Dependent on the storage array. If different types of storage are used in an environment, this doesn’t provide adequate coverage for all of the workloads.

The Hypervisor is an ideal control plane to analyze the data. It focuses on the results of the VMs without being dependent on nuances of in-guest metrics or a feature of a storage solution. It is inherently the ideal position in the Data Center for proper, holistic understanding of your environment.

Eyes wide shut – Storage design mistakes from the start
The flaw with many design exercises is we assume we know what our assumptions are. Let’s consider typical inputs when it comes to storage design. This includes factors such as:

Peak IOPS and Throughput.
Read/Write ratios
RAID penalties
Perhaps some physical latencies of components, if we wanted to get fancy.

Most who have designed or managed environments have gone through some variation of this exercise, followed by a little math to come up with the correct blend of disks, RAID levels, and fabric to support the desired performance. Known figures are used when they are available, and the others might be filled in with assumptions. Yet block sizes, and everything they impact, are nowhere to be found. Why? Lack of visibility, and understanding.
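For reference, the "little math" usually looks something like the Python sketch below. The write penalties are the commonly cited rule-of-thumb values, and the workload and per-disk figures are invented for illustration. Notice that block size does not appear anywhere as an input.

```python
import math

# Commonly cited back-end I/Os generated per front-end write, by RAID level.
RAID_WRITE_PENALTY = {"RAID 0": 1, "RAID 1/10": 2, "RAID 5": 4, "RAID 6": 6}

def backend_iops(frontend_iops, read_percent, raid_level):
    """Back-end IOPS the disks must service for a given front-end load."""
    reads = frontend_iops * read_percent / 100
    writes = frontend_iops - reads
    return reads + writes * RAID_WRITE_PENALTY[raid_level]

def disks_needed(frontend_iops, read_percent, raid_level, iops_per_disk):
    """Smallest number of disks that covers the back-end IOPS."""
    return math.ceil(backend_iops(frontend_iops, read_percent, raid_level) / iops_per_disk)

# Hypothetical design input: 5,000 peak IOPS, 70% reads, RAID 5, 180 IOPS per disk.
required = backend_iops(5000, 70, "RAID 5")
disks = disks_needed(5000, 70, "RAID 5", 180)
print(required, disks)
```

The result is tidy, but it treats a 4KB I/O and a 256KB I/O as the same thing.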

If we know that block sizes can dramatically impact the performance of a storage system (as will be shown in future posts), shouldn’t they be a part of any design, optimization, or troubleshooting exercise? Of course they should. Just as with working set sizes, lack of visibility doesn’t excuse lack of consideration. An infrastructure only exists because of the need to run services and applications on it. Let those applications and workloads help tell you what type of storage fits your environment best. Not the other way around.

Is there a better way?
The ideal approach for measuring the impact of block sizes will always include measuring from the location of the hypervisor, as this will provide these measurements in the right way, and from the right location. vscsiStats and vCenter related metrics are an incredible resource to tap into, and will provide the best understanding on impacts of block sizes in a storage system. There may be some time investment to decipher block size characteristics of a workload, but the payoff is generally worth the effort.

My vSphere Home Lab. 2016 edition

Here we go again. I had no intention of writing a follow-up to my "Home Lab 2015 edition" post last year, as I didn’t foresee any changes to the lab in the coming year that would be interesting enough to write about.

So much for predicting the future.

Sometimes Home Lab environments tend to border on vanity projects. I would like to think the recent changes in my lab were done out of need, but rationalizing wants into needs is common enough to be considered a national pastime. Nevertheless, my profession now has me testing workloads and new technologies on a daily basis, and this was a driving force behind these upgrades. Honest.

Demand often drives change. This is where the evolution of my Home Lab continues to mimic a production environment – just at a smaller scale. Budget, performance, capacity, space, and heat are all elements of a Home Lab design that are almost laughably similar to a production environment. Workloads evolve, and needs grow – quickly making previously used design inputs inadequate. That is exactly what happened to me, and I knew I had to invest in a few upgrades.

Compute – Performance/Testing Cluster
It was finally time to replace a few of the oldest components of the lab. My primary hosts that were built off of Intel Sandy Bridge processors used motherboards limited to just 32GB of RAM, and PCIe 2.0. I didn’t have any 10Gb connectivity without my old InfiniBand gear, and I was consistently pushing the CPUs to their limit.

I decided to go with a pair of Supermicro 5018D-FN4T rack mounted units. These are an incredibly small 1U form factor that feature built-in dual 10GbE and dual 1GbE interfaces, a dedicated IPMI port, a PCIe 3.0 slot, 4 drive bays, and can pack in up to 128GB of DDR4 memory. The motherboard uses the soldered on 8 core Xeon D-1540 chip and the power supply is built into the chassis. Both items reduce flexibility, but improve the no-brainer simplicity of the unit. What is most surprising when you get your hands on them is that they are incredibly small, yet still half empty when the case is cracked open. A third host will probably be in the works at some point, but it’s not necessary at this time.

It probably will come as no surprise that multiple PernixData FVP based acceleration tiers are an integral component of my infrastructure, so a few changes occurred in that realm.

1. Adding NVMe cards to use as a Flash based acceleration tier for FVP. For this lab arrangement, I used the Intel 750 NVMe based PCIe 3.0 card. While they are not officially on the VMware HCL, they are fine for the Home Lab, as they borrow heavily from the Intel DC P3xxxx line of NVMe cards that are on the VMware HCL. Intel NVMe cards are outstanding performers. Enjoy the benefits of completely bypassing all of the legacy elements of the traditional storage stack on a host, such as storage controllers and SCSI commands. NVMe based Flash devices are still limited by the physics of NAND Flash, but they are incredible performers that can make any SSD based Flash drive look quite feeble in comparison. Just make sure to use Intel’s driver for vSphere.

2. More RAM to use as a DFTM acceleration tier in FVP. I installed 64GB of Micron memory, which allows me to allocate a nice chunk of RAM for FVP acceleration. The beauty of using memory as an acceleration tier is that it avoids all characteristics of NAND Flash, and can leverage compression techniques. This typically increases the effective tier size by between 30% and 70% depending on workload. The larger the tier size, the more content that can live in the tier, and the less eviction that occurs against the working set of data.
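As a quick back-of-the-envelope illustration in Python, here is what that range means for a tier of a given size. The 32GB allocation is a hypothetical figure chosen for the example, not the amount I actually assigned, and I am reading the 30% to 70% range as a simple multiplier on the allocated size.

```python
def effective_tier_gb(allocated_gb, gain_percent):
    """Effective tier size when compression adds gain_percent to the allocated size."""
    return allocated_gb * (100 + gain_percent) / 100

allocated = 32  # hypothetical RAM allocated to the acceleration tier, in GB
low = effective_tier_gb(allocated, 30)
high = effective_tier_gb(allocated, 70)
print(f"{allocated}GB allocated behaves like {low:.1f}GB to {high:.1f}GB")
```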

Compute – Management Cluster
A management cluster in a Home Lab is great. It has allowed me to really experiment with testing workloads and new technologies without any impact to the components that run the infrastructure. My Management Cluster now comprises three Intel NUCs. I would have been perfectly happy with just a couple of NUCs as a Management cluster, but unfortunately the 16GB RAM limitation makes that a bit tough. Eventually, the NUCs will outlive their usefulness in the lab, but the great part about them is that they can easily be used as a desktop workstation, or media server. For now, they will continue to serve their purpose as a Management Cluster.

Switching
Upgrading my network meant adding 10GbE connectivity. For this, I chose a Netgear XS708E, 8 port, 10GbE switch. This would serve as a fast interconnect for east-west traffic between hosts. My adventures with InfiniBand were always interesting and educational. It’s an amazing technology, but there was just too much administrative overhead to the gear I was using. Unfortunately, there are not too many small, affordable 10GbE switches out there. The Dell 12 port X4012 10GbE switch looked really appealing based on the specs, but the ports are SFP+, so that would have meant rethinking a number of things. As for the Netgear, what do I think of it? After configuring the product, I’m convinced the folks at Netgear wanted to punish anyone who buys the unit. All of the configuration items that should be so basic in a CLI or web based UI are obfuscated in a proprietary interface that seems to be missing half of the options you’d expect. Dear Netgear, please let me configure LAGs, trunks, MTU size, and VLANs with something remotely resembling common sense. It does work, but if I could do it over, I’d choose something else.

My network core still consists of a Cisco SG300-20 Layer 3 switch. Moving away from hosts that had 6, 1GbE ports down to hosts that had just two 1GbE ports and two 10GbE ports meant that I was able to free up space on this switch. That switch still has a bit of a premium price for a 20 port L3 switch, but it has been a rock solid component of my lab for over 4 years now.

Ancillary Components
One thing I was tired of dealing with was my wireless gateway. I’ve grown sour on any consumer based WiFi/Router solutions available. Most aren’t stable, and lack features unless one cracks them with a DD-WRT build. Memory leaks and other reboot inducing behaviors are not what you want to deal with when attempting to access the lab remotely, so it was time to take a new approach. I went with the following for my gateway and wireless needs.

Motorola SB6121 DOCSIS 3.0 Cable Modem. This was purchased to replace the oversized cable modem provided by the service provider. It’s small, affordable, and prevents the cable company from changing settings on me, as they often would with their own unit.

Ubiquiti EdgeRouter PoE. This 5 port unit serves as my gateway, where one leg feeds downstream to my core switch, and another leg is used as a DMZ for my WiFi. This is a great switch that offers everything that I was looking for: trunking, static routes, NAT and firewalling. The multiple PoE ports make it easy to add new wireless access points.

Ubiquiti UniFi AP Wireless Access Point. These access points pair nicely with the PoE based router above.

It’s been a rock solid, winning combination. Always on, with no random need to reboot. Total control over configuration, and no silliness from the cable provider. Mission accomplished.

Storage
This was one of the few components that didn’t change. Storage is served up by two, 5-bay Synology units with a mix of SSDs and spinning disk. I had plenty of capacity, with enough options to test various media if needed.

Mounting
Until this latest refresh, a $25 utility rack had housed the assortment of oddly shaped lab gear pretty well. With the changeover to small 1U rackmount servers and additional switchgear, it was time for an official enclosure. I went with a Tripp Lite 9U Wall Mount Cabinet. It will eventually be wall mounted, but for the time being, sits perfectly on a $12 moving dolly from Harbor Freight. The cabinet has some nice mounting ports for supplementary exhaust fans should the need arise.

Relocation
Within the first few minutes of powering up the new hosts, I realized the arrangement was going to need a new home. Server room loud? No. But moving from 38dB to 50+dB is loud enough that you wouldn’t want to be working by it all day. There is no way 1U fans spinning at 8,000 RPM will ever be soothing. I had been quite proud of how quiet my lab gear had been up until this point. I stayed away from 1U anything, and went with quiet fans wherever I could. I tried desperately to suppress the noise, replacing all of the fans with ultra-quiet Noctua fans. Unfortunately, ultra-quiet can also mean they don’t move much air. It’s not good to disregard any delta in CFM between fans. The heat alarms made it very clear this wasn’t going to work, and I didn’t want to burn up perfectly good gear. I chose to place all of the factory fans back in the 1U servers, and the 10GbE switch, and used the Noctua fans as supplementary fans in each device. They do help the primary fans to spin at a lower rate, so the effort wasn’t a total waste. The 9U cabinet will be relocated to a more permanent location than it is now, but for the time being, it’s making a coat closet nice and warm.

What it looks like
The entire lab, including the UPS, is now self-contained, which should make its final relocation straightforward. The entire arrangement (5 hosts, 2 switches, 2 Synology NAS units, etc.) draws between 250 – 300 watts depending upon the load. Considering the old, much less capable arrangement ran at about 200 watts, I was pretty happy with the result.

In the spirit of full disclosure, the cabinet door does cover up some rather careless cable management practices. Regardless, I am thrilled with the end result and how it performs. A space efficient arrangement that is extremely powerful.

No matter how little, or how much you decide to invest in a Home Lab, I’ve learned that the satisfaction seems to be directly proportional to how much value it brings to you. Whether it be a hobby, used for professional growth, or a part of your day-to-day job duties, any sense of buyer’s remorse only seems to creep in when it’s not used. For my circumstances, that doesn’t seem to be a problem.

Working set sizes in the Data Center

There is no shortage of mysteries in the data center. These stealthy influencers can undermine performance and consistency of your environment, while remaining elusive to identify, quantify, and control. Virtualization helped expose some of this information, as it provided an ideal control plane for visibility. But it does not, and cannot properly expose all data necessary to account for these influencers. The hypervisor also has a habit of presenting the data in ways that can be misinterpreted.

One such mystery as it relates to modern day virtualized data centers is known as the "working set." This term certainly has historical meaning in the realm of computer science, but the practical definition has evolved to include other components of the Data Center; storage in particular. Many find it hard to define, let alone understand how it impacts their data center, and how to even begin measuring it.

We often focus on what we know, and what we can control. However, lack of visibility of influencing factors in the data center does not make them unimportant. Unfortunately, this is how working sets are usually treated. They are often not a part of a data center design exercise because they are completely unknown. They are rarely written about for the very same reason. Ironic considering that every modern architecture deals with some concept of localization of data in order to improve performance. Cached content versus its persistent home. How much of it is there? How often is it accessed? All of these types of questions are critically important to know.

What is it?
For all practical purposes, a working set refers to the amount of data that a process or workflow uses in a given time period. Think of it as the hot, commonly accessed data of your overall persistent storage capacity. But that simple explanation leaves a handful of terms that are difficult to qualify, and quantify. What is recent? Does "amount" mean reads, writes, or both? And does it define if it is the same data written over and over again, or is it new data? Let’s explore this more.

There are several traits of working sets that are worth reviewing.

Working sets are driven by the workload, the applications driving the workload, and the VMs that they run on. Whether the persistent storage is local, shared, or distributed, it really doesn’t matter from the perspective of how the VMs see it. The size will be largely the same.
Working sets always relate to a time period. However, it’s a continuum. And there will be cycles in the data activity over time.
Working sets comprise both reads and writes. The amount of each is important to know because reads and writes have different characteristics, and demand different things from your storage system.
Working set size refers to an amount, or capacity, but what and how many I/Os it took to make up that capacity will vary due to ever changing block sizes.
Data access type may be different. Is one block read a thousand times, or are a thousand blocks read one time? Are the writes mostly overwriting existing data, or is it new data? This is part of what makes workloads so unique.
Working set sizes evolve and change as your workloads and data center change. Like everything else, they are not static.
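
To make these traits concrete, here is a minimal Python sketch of how a working set could be counted from an I/O trace. The trace format, the `IO` record, the `working_set` function, and the 4 KiB tracking granularity are all assumptions made for illustration, not something a hypervisor or array exposes in this form. It counts each unique block once per window, no matter how often it is accessed, and keeps reads and writes separate.

```python
from dataclasses import dataclass

BLOCK = 4096  # assumed tracking granularity in bytes (hypothetical)


@dataclass(frozen=True)
class IO:
    t: float      # seconds since the start of the trace
    op: str       # "R" for read, "W" for write
    offset: int   # byte offset on the virtual disk
    size: int     # I/O size in bytes (block sizes vary per I/O)


def blocks_touched(io):
    """Return the block numbers covered by a single I/O."""
    first = io.offset // BLOCK
    last = (io.offset + io.size - 1) // BLOCK
    return range(first, last + 1)


def working_set(trace, start, end):
    """Unique data touched in the window [start, end), in bytes."""
    reads, writes = set(), set()
    for io in trace:
        if start <= io.t < end:
            target = reads if io.op == "R" else writes
            target.update(blocks_touched(io))
    return {
        "read_bytes": len(reads) * BLOCK,
        "write_bytes": len(writes) * BLOCK,
        "total_bytes": len(reads | writes) * BLOCK,
    }


trace = [
    IO(0.0, "R", 0, 4096),
    IO(1.0, "R", 0, 4096),       # same block read again: counted once
    IO(2.0, "W", 8192, 8192),    # one larger write covering two blocks
    IO(3.0, "W", 8192, 4096),    # overwrite of existing data: no growth
    IO(70.0, "R", 65536, 4096),  # falls outside the 60 second window
]
ws = working_set(trace, 0, 60)
print(ws)
```

Five I/Os were issued, but only three unique blocks make up the working set for the first 60 seconds, which is why I/O counts alone say little about working set size.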

A simplified, visual interpretation of data activity that would define a working set, might look like below.

If a working set is always related to a period of time, then how can we ever define it? Well in fact, you can. A workload often has a period of activity followed by a period of rest. This is sometimes referred to as the "duty cycle." A duty cycle might be the pattern that shows up after a day of activity on a mailbox server, an hour of batch processing on a SQL server, or 30 minutes compiling code. Taking a look over a larger period of time, duty cycles of a VM might look something like below.

Working sets can be defined at whatever time increment desired, but the goal in calculating a working set will be to capture at minimum, one or more duty cycles of each individual workload.
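
As a rough sketch of that idea, the Python below slices a trace into windows the length of one duty cycle and reports the unique data touched in each. The tuple-based trace format, the 4 KiB granularity, and the one hour cycle length are assumptions chosen for illustration. Taking the median across cycles is just one simple way to arrive at a "typical" value.

```python
from collections import defaultdict
from statistics import median

BLOCK = 4096  # assumed tracking granularity in bytes (hypothetical)


def per_cycle_working_sets(trace, cycle_seconds):
    """Unique bytes touched in each duty cycle.

    trace: iterable of (timestamp_seconds, byte_offset, size_bytes).
    Returns {cycle_index: bytes}, ordered by cycle.
    """
    cycles = defaultdict(set)
    for t, offset, size in trace:
        cycle = int(t // cycle_seconds)
        first = offset // BLOCK
        last = (offset + size - 1) // BLOCK
        cycles[cycle].update(range(first, last + 1))
    return {c: len(b) * BLOCK for c, b in sorted(cycles.items())}


trace = [
    (10, 0, 8192),          # cycle 0: blocks 0 and 1
    (20, 4096, 4096),       # cycle 0: block 1 again
    (3700, 0, 4096),        # cycle 1: block 0
    (3800, 40960, 16384),   # cycle 1: blocks 10 through 13
    (7300, 0, 4096),        # cycle 2: block 0
]
sizes = per_cycle_working_sets(trace, cycle_seconds=3600)
typical = median(sizes.values())
print(sizes, typical)
```

Picking a window shorter than the real duty cycle would split one burst of activity across several windows and understate the working set, so the cycle length is the parameter to get right first.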

Why it matters
Determining working set sizes helps you understand the behaviors of your workloads in order to better design, operate, and optimize your environment. For the same reason you pay attention to compute and memory demands, it is also important to understand storage characteristics, which include working sets. Understanding and accurately calculating working sets can have a profound effect on the consistency of a data center. Have you ever heard about a real workload performing poorly, or inconsistently on a tiered storage array, hybrid array, or hyper-converged environment? This is because all of them are extremely sensitive to right sizing the caching layer. Not accurately accounting for working set sizes of the production workloads is a common reason for such issues.

Classic methods for calculation
Over the years, this mystery around working set sizes has resulted in all sorts of sad attempts at trying to calculate them. Those attempts have included:

Calculate using known (but not very helpful) factors. This generally consists of looking at some measurement of IOPS over the course of a given time period. Maybe dress it up with a few other factors to make it look neat. This is terribly flawed, as it assumes one knows all of the various block sizes for that given workload, and that block sizes for a workload are consistent over time. It also assumes all reads and writes use the same block size, which is also false.
Measure working sets defined on a storage array, as a feature of the array’s caching layer. This attempt often fails because it sits at the wrong location. It may know what blocks of data are commonly accessed, but there is no context to the VM or workload imparting the demand. Most of that intelligence about the data is lost the moment the data exits the HBA of the vSphere host. Lack of VM awareness can even make an accurately sized cache on an array insufficient at times due to cache pollution from noisy neighbor VMs.
Take an incremental backup, and look at the amount of changed data. This sounds logical, but this can be misleading because it will not account for data that is written over and over, nor does it account for reads. The incremental time period of the backup may also not be representative of the duty cycle of the workload.
Guess work. You might see "recommendations" that say a certain percentage of your total storage capacity used is hot data, but this is a more formal way to admit that it’s nearly impossible to determine. Guess large enough, and the impact of being wrong will be less, but this introduces a number of technical and financial implications on data center design.
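
A small Python sketch can show how far apart these methods land on the same workload. Everything here is made up for illustration: the synthetic trace, the 4 KiB block size assumed by the IOPS method, and the variable names. The workload overwrites one block 1,000 times and reads 100 distinct 64 KiB regions once each.

```python
BLOCK = 4096  # block size assumed by the IOPS method (hypothetical)


def blocks(offset, size):
    """Block numbers covered by one I/O."""
    return range(offset // BLOCK, (offset + size - 1) // BLOCK + 1)


# Synthetic trace of (op, byte_offset, size_bytes) tuples.
writes = [("W", 0, 4096)] * 1000                      # same block, rewritten
reads = [("R", 1048576 + i * 65536, 65536) for i in range(100)]
trace = writes + reads

# Method 1: I/O count multiplied by one assumed block size.
iops_estimate = len(trace) * BLOCK

# Method 2: changed data, as an incremental backup would see it.
changed = set()
# Reference: every unique block actually touched, reads and writes.
touched = set()
for op, offset, size in trace:
    touched.update(blocks(offset, size))
    if op == "W":
        changed.update(blocks(offset, size))

backup_estimate = len(changed) * BLOCK
actual = len(touched) * BLOCK

print(iops_estimate, backup_estimate, actual)
```

The IOPS method counts the repeated overwrite 1,000 times and treats every 64 KiB read as 4 KiB, while the backup method sees a single changed block and none of the reads. Neither lands near the unique data actually touched.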

Since working sets are collected against activity that occurs on a continuum, calculating a typical working set with a high level of precision is not only impossible, but largely unnecessary. When attempting to determine working set size of a workload, the goal is to come to a number that reflects the most typical behavior of a single workload, group of workloads, or a total sum of workloads across a cluster or data center.

A future post will detail approaches that should give a sufficient level of understanding on active working set sizes, and help reduce the potential of negative impacts on data center operation due to poor guesswork.

Thanks for reading

A closer look at the new UI for PernixData FVP, and beyond

In many ways, making a good User Interface (UI) seems like a simple task. As evidenced by so many software makers over the years, it is anything but simple. A good UI looks elegant to the eye, and will become a part of muscle memory without even realizing it. A bad UI can feel like a cruel joke; designed to tease the brain, and frustrate the user. It’s never done intentionally of course. In fact, bad visual and functional designs happen in any industry all the time. Just think of your favorite ugly car. At some point there was an entire committee that gave it a thumbs up. User Experience (UX) design is also an imperfect science, and the impressions are subject to the eyes of the beholder.

A good UI should present function effortlessly. Make the complex simple. However, it is more than just buttons and menus that factor into a user experience. That larger encompassing UX design is what incorporates among other things, functional requirements with a visual interface that is productive and intuitive. PernixData FVP has always received high marks for that user experience. The product not only accelerated storage I/O, but presented itself in such a way that made it informative and desirable to use.

Why the change?
PernixData products (FVP, and the up and coming Architect) now have a standalone, HTML5 interface using your favorite browser. Moving away from the vSphere Web client was a deliberate move that at first impression might be a bit surprising. With changes in needs and expectations comes the challenge of understanding what is the best way to achieve a desired result. Standalone, traditionally compiled clients are not as appealing as they once were for numerous reasons, so adopting a modern web based framework was important.

Moving to a standalone, pure HTML5 UI built from the ground up allowed for these interactions to be built just the way they should be. It removes limits explicitly or implicitly imposed by someone else’s standards. PernixData gets to step away from the shadows of VMware’s current implementation of FLEX. Removing limitations allows for more flexibility now, and in the future.

UI characteristics
One of the first impressions that you will get will be the performance of the UI. It is quick and snappy. UX pain often begins with performance – whether it is the technical speed, or the ability for a user to find what they want quickly. The new UI continues where the older UI left off; telling more with less, and doing so very quickly.

Looking at the image below, you will also see that the UI was designed for use with multiple products. The framework is used not only for FVP, but for the upcoming release of PernixData Architect. This allows for transitions between products to be fluid, and intuitive.

New search capabilities
In larger environments, isolating and filtering VMs for deeper review is a valuable feature. Not that big of a deal with a few dozen VMs, but get a few hundred or more VMs, and it becomes difficult to keep track. The quick search abilities allow for real time filtering down of VMs based on search criteria. Highlighting those VMs then allows for easy comparison.

More granularity with the hero numbers
Hero numbers have been a great way to see how much offload has occurred in an infrastructure. How many I/Os offloaded from the Datastore, how much bandwidth never touched your storage infrastructure due to this offload, and how many writes were accelerated. In previous versions, that number started counting from the moment the FVP cluster was created. In FVP 3.0, you get to choose to see how much offload has occurred over a more granular period of time.

New graphs to show cache handling
Previously, the "Hit Rate and Eviction Rate" metrics helped express cache usage, and were combined in a single graph. Hit Rate indicated the percentage of reads that were serviced by the acceleration tier. It didn’t measure writes in any way. Eviction Rate indicated the percentage of data that was being evicted from the acceleration tier to make room for new incoming hot data. Each of them now has its own graph that is more expansive in the information it provides.

As shown below, "Acceleration Rate" is in place of "Hit Rate." This new metric now accounts for both reads and writes. One thing to note is that writes will only show as "accelerated" here when in Write Back mode. Even though Write Back and Write Through populate the cache with the same approach, the green "write" line will only indicate acceleration when the VM or VMs are using a Write Back policy.

"Population and Eviction" (as shown below) replaces the latter half of the "Hit Rate and Eviction Rate" metric. Note that Eviction Rate is no longer measured as a percentage, but by actual amount in GB. This is a better way to view it, as the sizes of acceleration tiers vary, and thus the percentage value varied. Now you can tell more accurately how much data is being evicted at any given time. Population rate is exactly as it sounds. This is going to account for write data being placed into the cache regardless of its Write Policy (Write Back or Write Through), as well as data read for the first time from the backing storage, and placed into the cache (known as a "false write"). This graph provides much more detail about how the cache is being utilized in your environment.

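A quick bit of arithmetic shows why an absolute amount is easier to compare than a percentage. The Python sketch below assumes the older percentage was simply evicted data divided by the size of the acceleration tier. That formula, the `eviction_percent` helper, and the tier sizes are assumptions for illustration, not a description of how FVP computed the metric.

```python
def eviction_percent(evicted_gb, tier_gb):
    """Evicted data as a percentage of the acceleration tier size.

    Assumed formula for illustration only.
    """
    return 100.0 * evicted_gb / tier_gb


evicted_gb = 20  # the same amount of data evicted in both cases

for tier_gb in (200, 800):
    pct = eviction_percent(evicted_gb, tier_gb)
    print(f"{evicted_gb} GB evicted from a {tier_gb} GB tier = {pct}%")
```

The same 20 GB of eviction reads as 10% on the smaller tier and 2.5% on the larger one, so the percentages cannot be compared across clusters the way the GB values can.
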
Now, if you really want to see some magical charts, and the insights that can be gleaned from them, go take a look at PernixData Architect. I’ll be covering those graphs in more detail in upcoming posts.

Summary
A lot of new goodies have been packed into the latest version of FVP, but this covers a bit about why the UI was changed, and how PernixData products are in a great position to evolve and meet the demands of the user and the environment.

Inside PernixData Engineering – UI and Web Technologies