April 9, 2014
One of the reasons I find the latest trends in datacenter architectures so interesting is the innovative approaches used to address deficiencies associated with more traditional arrangements. These innovations have been able to drive more of what almost everyone needs: better storage performance and better scalability.
The caveat to some of these newer arrangements is that they can put heavy stress on the plumbing that connects these servers. Distributed storage technologies like VMware VSAN, or clustered write buffering techniques used by PernixData FVP and Atlantis Computing’s USX, leverage these interconnects to accelerate storage traffic. Turn-key hyperconverged solutions do too, but they enjoy the luxury of having full control over the hardware used. Some of these software-based solutions might need some retrofitting of an environment to run optimally or meet their requirements (read: 10GbE or better). The desire for the fastest interconnect possible between hosts doesn’t always align with budget or technical constraints, so it makes the most sense to first see what impact there really is.
I wanted to test the impact better bandwidth would have between servers a bit more, but due to constraints in my production environment, I needed to rely on my home lab. As much as I wanted to throw 10GbE NICs in my home lab, the price points were too high. I had to do it another way. Enter InfiniBand. I’m certainly not the only one to try InfiniBand in a home lab, but I wanted to focus on two elements that are critical to the effectiveness of replica traffic: the overall bandwidth of the pipe and, equally important, the latency. While I couldn’t simulate the exact workload that I see in my production environment, I could certainly take smaller snippets of the I/O patterns that I see, and model them as best I could.
InfiniBand is really interesting. As Joab Jackson put it in a NetworkWorld.com article, "InfiniBand is architecturally sacrilegious" as it combines many layers of the OSI model. The results can be stunning: transport latencies in the 2 microsecond neighborhood, and a healthy roadmap to 200Gbps and beyond. It’s sort of like the ’66 AC Shelby Cobra of data transports. Simple, and perhaps a little rough around the edges, but brutally fast. While it doesn’t have the ubiquity of RJ45/Ethernet, it also doesn’t have the latencies that are still a part of those faster forms of Ethernet.
At the time of this writing, the InfiniBand drivers for ESXi 5.5 weren’t quite ready for VSAN testing yet, so the focus of this testing is to see how InfiniBand behaves when used in a PernixData FVP deployment. I hope to publish a VSAN edition in the future. I simply wanted to better understand if (and how much) a faster connection would improve the transmission of replica traffic when using FVP in WB+1 mode (local flash, and 1 peer). My production environment is very write intensive, and uses 1GbE for the interconnects. Any insight gained here will help in my design and purchasing roadmap for my production environment.
Testing occurred on a two host cluster backed by a Synology DS1512+. Local flash leveraged SATA III based eMLC SSD drives using an onboard controller. 1GbE interconnects traversed a Cisco SG300-20 using a 1500 byte MTU size. For InfiniBand, each host used a Mellanox MT25418 DDR 2 port HCA that offered 10Gb per connection. The hosts were directly connected to each other, and used a 2044 byte MTU size. InfiniBand can be set to 4092 bytes, but for compatibility reasons under ESXi 5.5, 2044 is the desired size.
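To get a rough feel for what those two MTU sizes mean for the I/O sizes tested later, here is a minimal sketch that counts the fewest packets needed to carry a single write. It assumes the whole MTU is available as payload (protocol headers are ignored) and that 1KB is 1024 bytes. How FVP actually frames its replica traffic is not something I know, so treat the numbers as a lower bound and not as a measurement. The function name `packets_per_io` is just for illustration.

```python
def packets_per_io(io_size_bytes: int, mtu_bytes: int) -> int:
    """Fewest packets needed to carry one I/O, ignoring all protocol headers."""
    # Integer ceiling division: a partial final packet still counts as a packet.
    return (io_size_bytes + mtu_bytes - 1) // mtu_bytes


KIB = 1024  # assumption: "KB" in this post means 1024 bytes

for io_kib in (256, 32, 4):
    io_bytes = io_kib * KIB
    eth = packets_per_io(io_bytes, 1500)  # 1GbE MTU used in the lab
    ib = packets_per_io(io_bytes, 2044)   # InfiniBand MTU used in the lab
    print(f"{io_kib:>3}KB I/O: at least {eth} packets at MTU 1500, {ib} at MTU 2044")
```

The larger MTU trims the packet count for big writes, but it makes no difference for a 4KB write, which needs at least three packets either way.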
I tend to prefer testing that relies on observational patterns versus one final, empirical number. These tests were no different, and while they attempt to simulate a very brief snippet of a workload in my production environment, I find that I still gain a much better understanding from a time based performance graph than from an isolated final number.
The test case was a simple one, but would be enough to illustrate the differences I was hoping to see. The test consisted of a 2 vCPU VM using 2 workers on a 100% write, 100% random workload lasting for 1 minute. The test was run three times: first with WB+0 (no peer/replica traffic), then WB+1 (one peer) using a 1GbE connection, and finally WB+1 over a single 10Gb InfiniBand connection. Each screen capture I provide will show them in that order. That test case was repeated 3 times: first with 256KB I/O sizes, followed by 32KB, then 4KB. I ran the tests several times in different order to ensure I wasn’t introducing inflated or deflated performance due to previous tests or caching. All were repeated several times to flush out any anomalies.
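For anyone wanting to repeat this, the combinations described above can be written out as a small test matrix. This is only a sketch of the plan, not the tool I used to generate the load. The dictionary keys and the `test_matrix` function are names made up for illustration.

```python
from itertools import product

# Write-back modes, in the order the screen captures show them.
MODES = ["WB+0 (local flash only)", "WB+1 over 1GbE", "WB+1 over 10Gb InfiniBand"]

# I/O sizes in KB, in the order the test cases were run.
IO_SIZES_KB = [256, 32, 4]


def test_matrix():
    """Return one entry per (I/O size, mode) combination described in the post."""
    return [
        {
            "io_size_kb": size,
            "mode": mode,
            "write_pct": 100,   # 100% write
            "random_pct": 100,  # 100% random
            "workers": 2,
            "duration_s": 60,   # 1 minute per run
        }
        for size, mode in product(IO_SIZES_KB, MODES)
    ]


for run in test_matrix():
    print(f"{run['io_size_kb']:>3}KB  {run['mode']}")
```

That is nine runs per pass, before counting the repeats in different order.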
256KB I/O size test
Test results using this I/O size are rarely published anywhere because they never look good in comparison to a smaller I/O size like 4KB. But my production workloads (compiling) often deal with these I/O sizes, so it is important for me to understand their behavior.
Observations from 256KB I/O test
Note that the IOPS and effective throughput on the WB+1 using InfiniBand were nearly identical to those of a WB+0 (local flash only) scenario. You can also see how much the 1GbE interface throttled down the performance, driving just half of the IOPS and throughput compared to InfiniBand. But also take a look at the terrible native latency (70ms) of large I/O sizes even when using WB+0 (no peer traffic, just local flash). Also note that when peer traffic performance is improved, a larger backlog of data builds up in the destager.
32KB I/O size test
Just 1/8th the size of a 256KB I/O, this is still larger than most storage vendors like to advertise in their testing. My production workload often oscillates between 32KB and 256KB I/Os.
Observations from 32KB I/O test
Once again, the IOPS and effective throughput on the WB+1 using InfiniBand were nearly identical to those of a WB+0 (local flash only) scenario. You can also see how much the 1GbE interface throttled down the throughput. Latency saw only a minor improvement moving away from 1GbE, as the latency of the flash was about 6ms.
4KB I/O size test
This is the most common I/O size you might see, although it is more common on reads than writes. At 1/64th the size of a 256KB I/O, it is tiny compared to the others, but it is important to test in order to learn if, and how much, a fatter, lower latency pipe helps across various I/O sizes.
Observations from 4KB I/O test
IOPS and effective throughput on the WB+1 using InfiniBand were nearly identical to those of a WB+0 (local flash only) scenario. But as the I/O sizes shrink, so does the effective total/concurrent payload size, so the differences between InfiniBand and 1GbE were smaller than on the tests with larger I/O. Latencies at this I/O size were around 2ms.
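Some back-of-the-envelope arithmetic helps explain why the pipe matters less as the I/O size shrinks. The sketch below divides the raw line rate of a link by the I/O size to get a theoretical ceiling on replicated write IOPS. It assumes 1KB is 1024 bytes, ignores all protocol overhead, and uses the 10Gb per connection figure from the hardware description above, so real ceilings will be lower. The function name `iops_ceiling` is for illustration only.

```python
def iops_ceiling(link_gbps: float, io_size_kb: int) -> float:
    """Theoretical max IOPS a link can carry at a given I/O size (no overhead)."""
    bytes_per_second = link_gbps * 1e9 / 8          # raw line rate in bytes/s
    return bytes_per_second / (io_size_kb * 1024)   # assumption: 1KB = 1024 bytes


for size_kb in (256, 32, 4):
    gbe = iops_ceiling(1, size_kb)    # 1GbE
    ib = iops_ceiling(10, size_kb)    # 10Gb InfiniBand connection
    print(f"{size_kb:>3}KB: 1GbE tops out near {gbe:,.0f} IOPS, 10Gb near {ib:,.0f} IOPS")
```

At 256KB, a 1GbE link cannot carry even 500 writes per second, while at 4KB the same link has room for roughly 30,000. A single small-block workload has to push a lot of IOPS before the link itself becomes the limit.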
Other observations that stood out
One of the first things that stood out is illustrated below, with two 5 minute test runs. Look at where the two arrows point. The arrow on the left points to the number of packets sent while using 1GbE. The arrow on the right shows the number of packets sent while using 10Gb InfiniBand. Quite a difference. Also notice that the effective throughput started out higher, but had to throttle back.
The key takeaways from these tests:
- A high bandwidth, low latency interconnect like InfiniBand can virtually eliminate any write redundancy penalty incurred in WB+1 mode.
- From a single workload, I/O sizes of 32KB and 256KB saw between 65% and 90% improvement on IOPS and throughput. I/O sizes of 4KB saw essentially no improvement (many concurrent 4KB workloads likely would see a benefit, however).
- Writes using larger I/O sizes were the clear beneficiary of a fatter pipe between servers. However, the native latencies of the flash devices under larger I/O sizes could not take advantage of the low latencies of InfiniBand. In other words, with large I/O sizes, the flash devices themselves, or the bus they were using, were by far the major impediment to lower latency and faster I/O delivery.
- The smaller pipe of 1GbE kept the flash devices from ingesting data as fast as they could over InfiniBand. With 1GbE there were always fewer outstanding writes once the test was complete, but that came at the cost of poorer performance.
- A few other matters can come up when attempting to accurately interpret latencies. As VMware KB 2036863 points out, reporting of latencies accurately can sometimes be a challenge. Just something to be aware of.
InfiniBand was my affordable way to test how a faster interconnect would improve the abilities of FVP to accelerate replica storage I/O. It lived up to the promise of high bandwidth with low latency. However, effective latencies were ultimately crippled by the SSDs, the controller, or the bus they were using. I did not have the opportunity to test other flash technologies such as PCIe based solutions from Fusion-io or Virident, or the memory channel based solution from Diablo Technologies. But based on the above, it seems clear that how well the flash is able to ingest the data is crucial to the overall performance of whatever solution is using it.
Erik Bussink’s great post on using InfiniBand with vSphere 5.5
Vladan Seget’s post on incorporating InfiniBand into his backing storage
Mellanox, OFED and OpenSM bundles