There is no shortage of mysteries in the data center. These stealthy influencers can undermine the performance and consistency of your environment while remaining elusive to identify, quantify, and control. Virtualization helped expose some of this information, as it provided an ideal control plane for visibility. But it does not, and cannot, properly expose all of the data necessary to account for these influencers. The hypervisor also has a habit of presenting the data in ways that can be misinterpreted.
One such mystery, as it relates to the modern virtualized data center, is the "working set." The term certainly has historical meaning in the realm of computer science, but its practical definition has evolved to include other components of the data center, storage in particular. Many find it hard to define, let alone understand how it impacts their data center or how to even begin measuring it.
We often focus on what we know and what we can control. However, a lack of visibility into an influencing factor in the data center does not make it unimportant. Unfortunately, this is how working sets are usually treated. They are often left out of data center design exercises because they are completely unknown, and they are rarely written about for the very same reason. That is ironic, considering that every modern architecture relies on some concept of data localization to improve performance: cached content versus its persistent home. How much of it is there? How often is it accessed? These are the questions that are critically important to answer.
What is it?
For all practical purposes, a working set refers to the amount of data that a process or workflow uses in a given time period. Think of it as the hot, commonly accessed portion of your overall persistent storage capacity. But that simple explanation leaves a handful of terms that are difficult to qualify and quantify. How long is that time period? Does "amount" mean reads, writes, or both? And is it the same data accessed over and over again, or new data each time? Let’s explore this more.
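To make those questions concrete, here is a minimal sketch of the basic idea: count the capacity of distinct blocks touched within a chosen time window. The trace format (timestamp, read/write flag, starting block, size in bytes) is an assumption for illustration, not the output of any particular tool.

```python
# Minimal sketch of a working set estimate: the capacity of distinct blocks
# touched within a time window. The IO record layout is hypothetical.
from collections import namedtuple

IO = namedtuple("IO", "timestamp op block size")  # op is "R" or "W"

def working_set_bytes(trace, start, end, block_size=4096):
    """Sum the capacity of distinct blocks touched between start and end."""
    touched = set()
    for io in trace:
        if start <= io.timestamp < end:
            # Record every block the I/O spans, so repeated access to the
            # same block counts only once toward the working set.
            first = io.block
            last = io.block + max(io.size - 1, 0) // block_size
            touched.update(range(first, last + 1))
    return len(touched) * block_size

# Three I/Os, two of which hit the same 4 KiB block: the working set is
# three distinct blocks (12 KiB), even though 16 KiB of I/O was issued.
trace = [IO(0.0, "R", 100, 4096), IO(1.0, "W", 100, 4096), IO(2.0, "R", 200, 8192)]
print(working_set_bytes(trace, start=0, end=10))  # 12288
```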
There are several traits of working sets that are worth reviewing.
- Working sets are driven by the workload, the applications generating it, and the VMs they run on. Whether the persistent storage is local, shared, or distributed doesn’t really matter from the perspective of the VMs; the size will be largely the same.
- Working sets always relate to a time period. However, data activity is a continuum, and there will be cycles in that activity over time.
- A working set comprises both reads and writes. The amount of each is important to know, because reads and writes have different characteristics and demand different things from your storage system.
- Working set size refers to an amount, or capacity, but what data made up that capacity, and how many I/Os it took to touch it, will vary due to ever-changing block sizes.
- Data access patterns can differ. Is one block read a thousand times, or are a thousand blocks read one time? Are the writes mostly overwriting existing data, or writing new data? This is part of what makes workloads so unique (see the sketch after this list).
- Working set sizes evolve as your workloads and data center change. Like everything else, they are not static.
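The read/write and re-access questions in the list above can also be made concrete. The sketch below uses the same hypothetical record layout as the earlier example and simply separates read from write traffic while flagging how much of the activity re-hits data that was already touched.

```python
# Sketch of a simple access profile: read vs. write traffic, and unique vs.
# repeat block touches. The IO record layout is hypothetical.
from collections import namedtuple

IO = namedtuple("IO", "timestamp op block size")  # op is "R" or "W"

def access_profile(trace, block_size=4096):
    seen = set()
    profile = {"read_bytes": 0, "write_bytes": 0,
               "unique_bytes": 0, "repeat_bytes": 0}
    for io in sorted(trace, key=lambda i: i.timestamp):
        profile["read_bytes" if io.op == "R" else "write_bytes"] += io.size
        first = io.block
        last = io.block + max(io.size - 1, 0) // block_size
        for block in range(first, last + 1):
            if block in seen:
                profile["repeat_bytes"] += block_size
            else:
                seen.add(block)
                profile["unique_bytes"] += block_size
    return profile

# One block read a thousand times looks nothing like a thousand blocks read
# once, even though the I/O count and byte count are identical.
hot_block = [IO(t, "R", 100, 4096) for t in range(1000)]
cold_scan = [IO(t, "R", 1000 + t, 4096) for t in range(1000)]
print(access_profile(hot_block))  # unique_bytes: 4 KiB, repeat_bytes: ~4 MiB
print(access_profile(cold_scan))  # unique_bytes: ~4 MiB, repeat_bytes: 0
```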
A simplified, visual interpretation of the data activity that would define a working set might look like the image below.
If a working set is always related to a period of time, then how can we ever define it? In fact, you can. A workload often has a period of activity followed by a period of rest. This is sometimes referred to as its "duty cycle." A duty cycle might be the pattern that shows up after a day of activity on a mailbox server, an hour of batch processing on a SQL server, or 30 minutes of compiling code. Looking across a larger period of time, the duty cycles of a VM might look something like the image below.
Working sets can be defined at whatever time increment you desire, but the goal in calculating a working set is to capture, at a minimum, one full duty cycle of each individual workload.
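One way to honor that goal is to size the measurement window to the duty cycle and roll the calculation across it. The sketch below assumes a simple list of (timestamp, block) access pairs, which is a stand-in for whatever trace or monitoring export is actually available.

```python
# Sketch: compute the working set per fixed window, where the window length
# is chosen to span at least one duty cycle. Input data is hypothetical.
from collections import defaultdict

def working_set_per_window(accesses, window_seconds, block_size=4096):
    windows = defaultdict(set)
    for ts, block in accesses:
        windows[int(ts // window_seconds)].add(block)
    return {w: len(blocks) * block_size for w, blocks in sorted(windows.items())}

# One hour of made-up activity: a busy half hour followed by a quieter one,
# measured with a 30-minute window to cover a 30-minute duty cycle.
accesses = ([(t, t % 500) for t in range(0, 1800, 2)]
            + [(t, t % 200) for t in range(1800, 3600, 2)])
per_window = working_set_per_window(accesses, window_seconds=1800)
print(per_window)                # capacity touched in each 30-minute window
print(max(per_window.values()))  # a single number to design against
```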
Why it matters
Determining working set sizes helps you understand the behaviors of your workloads in order to better design, operate, and optimize your environment. For the same reason you pay attention to compute and memory demands, it is also important to understand storage characteristics, which include working sets. Understanding and accurately calculating working sets can have a profound effect on the consistency of a data center. Have you ever heard about a real workload performing poorly, or inconsistently, on a tiered storage array, hybrid array, or hyper-converged environment? That happens because all of these are extremely sensitive to right-sizing the caching layer, and not accurately accounting for the working set sizes of the production workloads is a common reason for such issues.
Classic methods for calculation
Over the years, this mystery around working set sizes has resulted in all sorts of sad attempts at calculating them. Those attempts have included:
- Calculate using known (but not very helpful) factors. These generally consist of taking some measurement of IOPS over a given time period, maybe dressed up with a few other factors to make it look neat (the arithmetic is sketched after this list). This is terribly flawed, as it assumes one knows all of the various block sizes for that given workload and that those block sizes are consistent over time. It also assumes all reads and writes use the same block size, which is also false.
- Measure working sets as defined on a storage array, as a feature of the array’s caching layer. This attempt often fails because it measures at the wrong location. The array may know which blocks of data are commonly accessed, but it has no context about the VM or workload generating the demand. Most of that intelligence about the data is lost the moment the data exits the HBA of the vSphere host. The lack of VM awareness can even render an accurately guessed cache size insufficient at times, due to cache pollution from noisy-neighbor VMs.
- Take an incremental backup and look at the amount of changed data. This sounds logical, but it can be misleading because it does not account for data that is written over and over, nor does it account for reads. The incremental time period of the backup may also not be representative of the duty cycle of the workload.
- Guesswork. You might see "recommendations" stating that a certain percentage of your total used storage capacity is hot data, but this is just a more formal way of admitting that it’s nearly impossible to determine. Guess large enough and the impact of being wrong will be smaller, but that introduces a number of technical and financial implications for data center design.
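To show why the first approach falls apart, here is the naive arithmetic written out. The IOPS figure and block sizes are purely illustrative; the point is how wildly the estimate swings on an assumption you cannot actually verify.

```python
# The naive "IOPS x block size x time" estimate described in the first bullet,
# shown only to illustrate its sensitivity to the assumed block size.
def naive_working_set_gib(avg_iops, assumed_block_size_bytes, duration_seconds):
    return avg_iops * assumed_block_size_bytes * duration_seconds / 1024**3

duration = 8 * 3600  # one business day of activity
for block_size in (4096, 8192, 32768, 65536):
    print(block_size, round(naive_working_set_gib(500, block_size, duration), 1))
# The same 500 IOPS yields anywhere from ~55 GiB to ~880 GiB depending on the
# assumed block size, and none of these figures distinguish re-reads or
# overwrites from unique data.
```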
Since working sets are collected against activity that occurs on a continuum, calculating a typical working set with a high level of precision is not only impossible, but also largely unnecessary. When attempting to determine the working set size of a workload, the goal is to arrive at a number that reflects the most typical behavior of a single workload, a group of workloads, or the total sum of workloads across a cluster or data center.
A future post will detail approaches that should give a sufficient level of understanding of active working set sizes, and help reduce the potential for negative impacts on data center operations caused by poor guesswork.
Thanks for reading