I posted a week or so ago about the ESXCLI UNMAP process with vSphere 5.5 on the Pure Storage FlashArray here and came to the conclusion that larger block counts are highly beneficial to the UNMAP process. So the recommendation was simply to use a larger block count than the default to speed up the UNMAP operation, something sufficiently higher than the default of 200 MB. I received a few questions about a more specific recommendation (and had some myself), so I decided to dive into this a little deeper to see if I could provide guidance that was a little more concrete. In the end a large block count is perfectly fine–if you want to know more details, read on!
Remember, this post is for Pure Storage FlashArrays only–mileage may vary with other storage platforms.
***UPDATE*** THERE ARE CHANGES IN UNMAP BLOCK COUNT RECOMMENDATIONS DUE TO AN ESXI BUG. SEE THIS POST FOR DETAILS: UNMAP Block Count Behavior Change in ESXi 5.5 P3+
As I mentioned in the previous post, it seems to be okay to just enter an extremely high block count and you will get the quickest operation–say a block count of 60,000. I have yet to experience any downside to doing so, but out of curiosity I wanted to know if there was a “tipping point” where choosing a higher number brought negligible improvement and diminishing returns set in. Logically I would expect this number to be where increases in block count become sufficiently small in comparison to the overall block count. In other words, increasing from 200 blocks to 400 blocks is a 100% change, while increasing from 10,000 blocks to 11,000 blocks is only a 10% increase–so while the raw increase is much bigger, the relative effect is much smaller. Of course theory is fun (and probably correct)–but why not actually test it?
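As a quick refresher, the operation in question is the vmfs unmap command in esxcli, where the block count is passed as the reclaim unit (-n); when it is omitted, ESXi 5.5 uses the default of 200 blocks. The datastore label below is just a placeholder.

```
# UNMAP with the default reclaim unit (200 blocks of 1 MB each on VMFS-5)
esxcli storage vmfs unmap -l MyDatastore

# UNMAP with a much larger reclaim unit, e.g. 60,000 blocks per iteration
esxcli storage vmfs unmap -l MyDatastore -n 60000
```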
So I ran three sets of tests:
- Unmapping a 1 TB datastore
- Unmapping a 4 TB datastore
- Unmapping an 8 TB datastore
Each set consisted of 70+ individual tests (each test repeated five times to get an average) across different UNMAP block counts, starting with the default of 200, then 400, then 1,000, 2,000, 3,000 and so on up to 69,000. To try some ultra-high block counts I also ran tests with block counts of 100,000 and 256,000 to see what would happen.
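If you want to reproduce a sweep like this, here is a rough sketch of how it could be scripted from the ESXi shell. The datastore label and the list of reclaim units are placeholders, and a real run would repeat each value several times and average the results, as described above.

```
# Hypothetical timing sweep of UNMAP reclaim units against one datastore
DS="MyDatastore"     # placeholder datastore label
for N in 200 400 1000 2000 4000 8000 16000 32000 60000; do
    START=$(date +%s)
    esxcli storage vmfs unmap -l "$DS" -n "$N"
    END=$(date +%s)
    echo "reclaim unit $N: $((END - START)) seconds"
done
```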
Below is a graph showing essentially what I found before–very small block counts cause the UNMAP duration to be quite long, the duration drops dramatically as the block count increases, but at a certain point the improvement essentially disappears.
Notice I cut off the results of some of the tests with very low and very high block counts–they were so far away that they made the meat of the results hard to see. Just imagine almost entirely straight lines shooting out of the x and y axes. For the three sets, the duration barely dropped between block counts of 69,000 and 256,000 (the 8 TB datastore saw the biggest difference, which was only 3 seconds). As you can see, for any of the datastores the improvement in UNMAP duration becomes almost imperceptible at around a 60,000 block count. So purely on duration, larger block counts are definitely the way to go–here I am going to recommend 60,000.
The next question is the most important one (probably)–is there a performance impact when using large block counts? This comes down to two things:
- Do large UNMAP block counts interfere with other workloads more than smaller ones?
- Is there a relevant impact on CPU resources?
Let’s look at number one first. I ran a variety of workloads against a virtual disk sitting on a 4 TB datastore, kicked off an UNMAP process, and here is what I found.
There was no performance impact to the workload when:
- UNMAP was issued from a different host than the workload to the same datastore as the workload
- UNMAP was issued from the same or a different host as the workload to a different datastore than the workload (on the same array)
- UNMAP was issued from the same or a different host as the workload, to the same or a different datastore, and the workload was run on an eagerzeroedthick virtual disk
- The workload was run against regions of a thin or zeroedthick virtual disk that had already been zeroed/initialized
These four probably cover 98% of situations.
There was a performance impact to the workload when:
- UNMAP was issued from the same host to the same datastore as the workload when that workload was running to regions of a thin or zeroedthick virtual disk that had not yet been initialized.
Let me note a few things about the situation where performance was impacted:
- If ANY of the above caveats were not true there was no impact (for instance, if UNMAP was run from a different host but to the same datastore, there was no impact)
- Once the regions of the virtual disks had been initialized by the workload the performance impact went away
- Using a larger or smaller block count with UNMAP did not make a notable difference on whether there was an impact to the workload
- Impacts were only present with WRITES (and more pronounced the larger the I/O size). If the workload was all READS, no impact was noted.
So what does this mean? Well, since UNMAP to the same datastore from the same host had an impact but UNMAP from a different host to the same datastore did not, this indicates to me the issue is probably not at the array level–it is most likely at the host level. As to exactly why, I am not yet sure (once I know, I will let you know). The UNMAP process is not slowed when there is an interfering workload on that datastore but the workload sure is–so the UNMAP seems to take a much higher priority on the ESXi host.
The CPU impact is a similar story. For my tests, host CPU utilization was pretty low–around 10%. When I issued UNMAP using the default block count the CPU usage went up around .5%–so not much. But it stayed up for a lot longer due to the smaller block count. When I used a high block count of 60,000 the CPU usage went up about 2.5%, but for a much, much shorter time. So I would say unless you are really starved for CPU power this shouldn’t really be a factor.
The bottom line…
Use a high block count (~60,000) for Pure FlashArrays when performing an ESXi 5.5 UNMAP operation–it goes much faster, there is no performance impact on the array, and host CPU impact is very limited. There is the situation I mentioned above where it can be impactful to workloads on the same host and datastore on newly created thin or zeroedthick virtual disks, but unless you are kicking off a lot of new virtual disks and/or have a lot of continuous new writes to new sectors of a virtual disk, this situation is unlikely to happen or be noticed. To be absolutely safe, run UNMAP on the ESXi host that has the least guest I/O going to the datastore–this way you can be assured there will be no impact. The main question of this post was whether a large block count is safe, and the answer is a resounding yes. The only situation where you may not want to use a large count is if the datastore is very full–in that case you might want to drop it down to some divisor of the free space (a rough sketch of that is below). Even the situation where a performance impact was noted had nothing to do with the block count–in fact a large block count was better there because the interference lasted for a far shorter time! So go big or go home!
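If you are in that very-full-datastore situation, here is a minimal sketch of what I mean by scaling the count to free space, run from the ESXi shell. It assumes a VMFS-5 datastore (1 MB blocks) with a volume label that contains no spaces, and the “half the free blocks, capped at 60,000” rule is just one reasonable choice rather than a hard requirement.

```
DS="MyDatastore"     # placeholder datastore label

# Free space (bytes) is the last column of 'esxcli storage filesystem list';
# no error handling here, this is only a sketch
FREE_B=$(esxcli storage filesystem list | awk -v ds="$DS" '$0 ~ ds {print $NF}')
FREE_MB=$((FREE_B / 1048576))     # 1 MB per VMFS-5 block

# Use roughly half of the free blocks, but never more than 60,000
BLOCKS=$((FREE_MB / 2))
[ "$BLOCKS" -gt 60000 ] && BLOCKS=60000

esxcli storage vmfs unmap -l "$DS" -n "$BLOCKS"
```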
Hi Cody, this is an excellent post. I’m curious if you filled up the datastores with data between the unmap loops (for example, create VMs to fill the VMFS, then delete them, then unmap).
Thanks!! Kind of–while I filled the VMFS volumes with VMs between each run (using XCOPY/PowerCLI), the FlashArray didn’t actually store too much data because it was essentially all deduplicated on the array. I could fill it with completely non-compressible/non-dedupable data but it would take ages 🙂 What I didn’t get to was running UNMAP under load–at least extensive testing. A high load on the ESXi host does slow it down.
It would be interesting to see if the speed gain would be so big if this was actually deallocating (and in fact really freeing space) on the backend, since in this case you don’t gain back anything by doing UNMAP. I’m running similar tests now on VMAX, and see a huge difference when LUNs are allocated but never written to vs. fully allocated LUNs.
Yeah, you would see very different behavior on the VMAX as compared to the FlashArray because the VMAX has to do a lot more work to deallocate space–in smaller tests on the FlashArray I haven’t seen much of a difference when there was a lot of data written–it is a very inexpensive operation since we are simply changing metadata. Since the XCOPY clones still “allocate” metadata on the array but not necessarily physical data (whether they actually put data down doesn’t matter on the FlashArray when it comes to this operation), it should be the same speed regardless. I will run a few more in-depth tests though to verify what I’ve seen/suspect. But these test results are very unlikely to be similar to what the VMAX sees–very different architectures.