Quick post here. I am working on updating some documentation and I wanted to add a bit more color to a section on changing the IO Operations limit for ESXi NMP Round Robin devices. The Pure Storage recommendation is to change this value from the default of 1,000 down to 1, so that ESXi switches logical paths after every I/O instead of after every 1,000 I/Os. There are some performance benefits to this, and some evidence for improved failover time (in the case of a path failure) with this setting. I am not going to get into the veracity of those benefits right now. What I wanted to share here is that there is no doubt changing this to 1 makes a big difference to I/O balance on the array itself.
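If you want to see what a host is currently set to before touching anything, a quick PowerCLI check like the one below works. The naa.624a9370 prefix I am filtering on is the usual NAA prefix for Pure Storage volumes in my environment, so treat it as an assumption and adjust it for yours:

# Show the multipathing policy and IO Operations limit for each Pure device on every host
Get-VMHost | Get-ScsiLun -LunType disk |
    Where-Object {$_.CanonicalName -like "naa.624a9370*"} |
    Select-Object VMHost, CanonicalName, MultipathPolicy, CommandsToSwitchPath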
I wanted to include some screenshots showing the I/O balance changing when this parameter was changed, so I set up a simple test:
- 8 ESXi hosts
- 320 virtual machines running a decently realistic workload with VDBench (varied block sizes, write banding, a mix of reads and writes, etc.)
- 8 datastores
I started all of these workloads up and ran them with the PSP set to Fixed (for a baseline), then Round Robin with the default IO Operations limit of 1,000, and finally Round Robin with the IO Operations limit set to 1. I used PowerCLI, of course, to change these settings. I ran each of these tests a few times with remarkably consistent results.
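For reference, the Round Robin changes look something like this in PowerCLI (a minimal sketch using the same Pure device filter as above; the Fixed runs are covered next):

# Gather the Pure Storage devices across all hosts
$devices = Get-VMHost | Get-ScsiLun -LunType disk |
    Where-Object {$_.CanonicalName -like "naa.624a9370*"}

# Round Robin with the default IO Operations limit of 1,000
$devices | Set-ScsiLun -MultipathPolicy RoundRobin -CommandsToSwitchPath 1000

# Round Robin switching logical paths after every single I/O
$devices | Set-ScsiLun -MultipathPolicy RoundRobin -CommandsToSwitchPath 1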
One note on the Fixed test: I actually ran two versions of it. When setting a device to Fixed you need to specify a preferred path, so I tried two methods in PowerCLI. The first was to just select the first path for that device:
$path = $device | Get-ScsiLunPath | Select-Object -First 1
For the other run I selected a random path, with the idea that randomizing the paths might provide a little benefit (better balance, possibly):
$path = $device | Get-ScsiLunPath | Get-Random
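In both cases $device is one of the Pure devices gathered earlier, and once a path is picked it gets applied as the preferred path. A minimal sketch of the loop (not the exact script I ran) looks like this:

foreach ($device in $devices) {
    # Pick the preferred path (first path shown here; swap in Get-Random for the second run)
    $path = $device | Get-ScsiLunPath | Select-Object -First 1

    # Set the device to Fixed with that path preferred
    $device | Set-ScsiLun -MultipathPolicy Fixed -PreferredPath $path
}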
In order to judge the balance on the array I used a tool we have for the FlashArray that is used to verify proper multipathing prior to a code upgrade, among other reasons. During an upgrade the controllers are upgraded separately and are rebooted, so we want to make sure end hosts can send data to both controllers to maintain non-disruptivity (not a word, but whatever).
Before I move on to the results, I will say that I didn't do a deep performance analysis for this test. But I will note that performance did not vary significantly as these settings were changed around. I only glanced at latency, IOPS, and throughput numbers during the test, so make of that what you will.
So some results.
The default output from our I/O balance tool shows three things:
- Host: the host sending the I/O.
- The IOPS being sent to controller 0 and controller 1. Note that if the IOPS cross the 1,000 mark the value is abbreviated as 1K.
- Whether the host is balanced across the two controllers. The threshold is one controller doing more than 1.5x the I/O of the other; a quick sketch of that check follows this list.
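Just to make that threshold concrete, here is how the check could be expressed in PowerShell. This is purely illustrative with made-up numbers, not the tool's actual code:

# Example IOPS for one host, one number per controller (made-up values)
$ct0Iops = 400
$ct1Iops = 900

# Flag the host as unbalanced if the busier controller is doing more than 1.5x the quieter one
# (the inner Max guards against dividing by zero when a controller sees no I/O at all)
$ratio = [math]::Max($ct0Iops, $ct1Iops) / [math]::Max([math]::Min($ct0Iops, $ct1Iops), 1)
if ($ratio -gt 1.5) { "Unbalanced (ratio {0:N2})" -f $ratio } else { "Balanced" }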
The first test was with Fixed multipathing with the preferred path being the first path that PowerCLI reported:
Pretty ugly. All hosts were extremely unbalanced and half of them showed zero I/O to an entire controller. A controller 0 failure here would certainly cause some hiccups as ESXi switched to non-failed paths.
How about with Fixed when a random path was selected?
Looks a bit better, but still some decent imbalance across the controllers. Let’s move to Round Robin with the default IO Operations Limit of 1,000.
Much better! Only one host is reported as unbalanced here. Now for the final test, with Round Robin and an I/O Operations Limit of 1:
Beautiful! The I/O is almost perfectly balanced here.
So make of this information what you will. But I think the point is that it is important to verify your multipathing, not just for performance but also for balance across the system. In the end, I think this is somewhat of a “no duh!” post, but I thought I’d share my results anyway.