Dealing with VMware Datastore space management on VSP Storage - part 2

By Paul Morrissey posted 05-01-2020 07:03

Following on from Part 1 in this datastore space management series, let's first address VMFS6 and automatic unmap. In Part 3, we'll continue with VMFS5 environments, some automation that can be applied there, and an update on vVols space management.

For VMFS6 datastores, many customers revisit automatic unmap when dealing with space management issues. Although it was re-introduced in vSphere 6.5, here is a refresher and an example of how to verify that automatic unmap is working as expected in your Hitachi Storage environment. There are many decent articles on this, but it is time we gave a Hitachi perspective on it. As we know, there is reclaiming deleted space from the datastore (e.g. a VM Storage vMotion or a VMDK delete) and there is in-guest delete reclamation (e.g. a Linux OS deleting log files). My general advice is to focus on the GBs when doing these types of tests; the MBs will sort themselves out.

Although I was able to showcase fast reclamation in my setup with small in-guest file deletions, automatic unmap can generally take anywhere from 30 minutes to 24 hours depending on datastore factors (how active it is, etc.), with an average of 3 hours based on prior testing. So extend the wait to 24 hours if you don't see immediate gratification. I have also given a tip below that normally gets an immediate automatic unmap response.

First, before testing that in-guest file deletion works as expected with automatic unmap, I would generally recommend that a customer test basic automatic unmap by migrating a VM away from the VMFS6 datastore to another datastore. This should quickly verify automatic unmap behavior, as you would see the Unmap I/Os counter increase pretty rapidly (or at least within the first 24 hours). Read on for info on the Unmap I/Os counter. Try that first. I had done that before taking screenshots for the blog.

Onto in-guest deletion and automatic unmap:
I used an Oracle Linux VM (VM_xyz) with 2 x 50GB thin provisioned disks. Reminder: for UNMAP, the VMDKs must be thin provisioned. The two disks held ~16GB of data; vSphere Client displays 24GB, which includes the 8GB memory swap.

First, I ensured the filesystem is mounted with the discard option so that it issues trim requests. I created a /dev/sdb1 partition and mounted it on the /cm directory. The sdb partition is on the 50GB "hard disk 2" VMDK that was thin provisioned.

# mount -t ext4 -o rw,discard /dev/sdb1 /cm
and the following to verify the discard flag:
# mount | column -t | grep /dev
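If you want to script that check rather than eyeball the mount output, here is a minimal sketch. The function name has_discard and its optional second argument are my own illustration, not part of any tool; it simply parses the standard /proc/mounts layout.

```shell
#!/bin/sh
# has_discard MOUNTPOINT [MOUNTS_FILE]
# Succeeds (exit 0) if MOUNTPOINT is listed with the "discard" option.
# MOUNTS_FILE defaults to /proc/mounts; the parameter is hypothetical and
# exists only so the check can be tried against a saved copy.
has_discard() {
    mountpoint=$1
    mounts_file=${2:-/proc/mounts}
    # /proc/mounts fields: device mountpoint fstype options dump pass
    awk -v mp="$mountpoint" '
        $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == "discard") found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "$mounts_file"
}
```

In my setup this would be called as: has_discard /cm && echo "discard is set". If it fails, the filesystem is not trimming automatically and a manual fstrim is the fallback.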

Then I verified that filesystem trim (in my case on an ext4 filesystem) would be automatic by viewing the discard granularity of the device with the "lsblk --discard" command.
As long as you see a non-zero value for DISC-GRAN, you are good to continue (it was 1M in my example).

Reminder: Automatic UNMAP requires a granularity of 1MB or less. Hitachi storage with HMO 114 enabled advertises a granularity of less than 1MB (256KB to be accurate). Again, if you have issues, verify that host mode option 114 is enabled in the host group that has the LUN path for this LUN/datastore. This is in addition to the standard HMOs of 54 and 63 for VMware.
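The granularity rule above can be written as a small check. This is only a sketch: the function name is made up for illustration, and I am assuming "1MB" means 1 MiB (1048576 bytes), which is what lsblk's "1M" stands for.

```shell
#!/bin/sh
# unmap_granularity_ok BYTES
# Succeeds if BYTES is non-zero and no larger than 1 MiB (1048576 bytes).
# Assumption: the "1MB or less" limit is taken as 1 MiB.
unmap_granularity_ok() {
    bytes=$1
    [ "$bytes" -gt 0 ] && [ "$bytes" -le 1048576 ]
}
```

Inside the guest, the kernel exposes the same value in bytes under sysfs, so for the sdb disk used here the call would be: unmap_granularity_ok "$(cat /sys/block/sdb/queue/discard_granularity)".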

Reminder: If the Linux filesystem is not mounted with discard to trim automatically, then users would have to issue "fstrim -a" or "fstrim --verbose --all" to trim and to see how much could be trimmed.

So the VM was sitting on the "VSP5500-Gold-NVMe-1" datastore. We can see from the datastore/configure/device backing screen in vSphere Client that this is backed by a storage LUN/device with an NAA ending in x0113, and from the vCenter perspective it is a 2TB volume with 929GB used capacity.

Ok, but which storage array is this LUN coming from? Our labs have tens of arrays. Luckily, I had been managing the tech preview version of our Ansible integration for Hitachi Storage as we develop it in agile fashion towards a beta release. It includes a "find_lun_playbook" that will search all arrays for a certain NAA#.

So, running "#ansible-playbook find_lun_playbook.yml | grep Serial" told me array serial #30081 was hosting this LUN/datastore.

Ok, to see the capacity of this volume, I could have gone to our Hitachi PowerShell cmdlets, the vSphere Client Storage Plugin or indeed the VASA Provider UI. But as I was already in Ansible mode and needed this datapoint frequently, I killed two birds with one stone and used a get_lun_df playbook to get the current used capacity as seen by the array. You can just grep the output to find the used capacity.

Ok, so we have our starting point for the used capacity of the datastore (696,654MB).

Next, in order to monitor UNMAP I/Os on this datastore, you can use the VMware support tool "vsish". Again, please use it in read-only mode unless directed otherwise by VMware Support; here it is only used to monitor UNMAP I/Os. SSH to the ESXi host where the datastore is mounted, enter the command "vsish" and then enter the command

"get /vmkModules/vmfs3/auto_unmap/volumes/VSP5500-Gold-NVMe-1/properties"
(Inserting your datastore name rather than VSP...)

So current Unmap IOs are 10291
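To pull just that counter out of the properties output, a small filter like the one below can help. It is a sketch under assumptions: the function name is mine, and I am assuming the counter is printed on a line containing "Unmap IOs" followed by a colon and the value, so check it against the output on your own host.

```shell
#!/bin/sh
# unmap_ios
# Reads a vsish "properties" dump on stdin and prints the Unmap IOs counter.
# Assumption: the counter sits on a line such as "Volume: Unmap IOs:17".
unmap_ios() {
    # Split on ":" and strip anything that is not a digit from the last field
    awk -F: '/Unmap IOs/ { gsub(/[^0-9]/, "", $NF); print $NF; exit }'
}
```

On the ESXi host this could be fed non-interactively, for example: vsish -e get /vmkModules/vmfs3/auto_unmap/volumes/VSP5500-Gold-NVMe-1/properties | unmap_ios (again inserting your own datastore name). Run it before and after a deletion and compare the two values.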

As we are dealing in capacity savings of TBs, I would recommend testing by modifying/deleting large files. But in this case I deleted 3 files totaling ~48MB! I used "sync" after the file deletion (force of habit), but it is not necessary.

Running vsish again, I see Unmap I/O has increased to 292

Checking the used capacity of my LUN, I see that it has indeed decreased (coincidentally by 42MB, our page size), verifying that Hitachi Storage received the unmap request(s) and reclaimed the space that I had deleted inside the guest OS.
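If you are sampling used capacity before and after a test, the comparison is simple arithmetic. A minimal sketch follows; the function name and the optional page-size argument are hypothetical, and the 42MB default is just the page size mentioned above.

```shell
#!/bin/sh
# reclaimed_pages BEFORE_MB AFTER_MB [PAGE_MB]
# Prints the drop in used capacity in MB and in whole pages (default 42 MB).
reclaimed_pages() {
    before=$1
    after=$2
    page=${3:-42}
    delta=$((before - after))
    echo "$delta MB, $((delta / page)) page(s)"
}
```

Feed it the two used-capacity values returned by the playbook, e.g. reclaimed_pages "$before_mb" "$after_mb". A drop of exactly one page after deleting ~48MB is presumably why the saving here matched the page size rather than the size of the deleted files.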

Looking at the SN UI, I can see 680.28GB, which multiplied by 1024 is 696,606MB, close to the 696,612MB that the playbook returned. Also, I should have mentioned that I had previously disabled capacity savings on this datastore LUN to avoid head-spinning calculations, so it would be clearer how much of the space reduction is due to UNMAP.
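The GB to MB cross-check above is easy to repeat for your own numbers. A small sketch, with a function name of my own choosing, using the same factor of 1024 and truncating to a whole MB:

```shell
#!/bin/sh
# gb_to_mb GB
# Converts GB to MB with a factor of 1024, truncated to a whole MB.
gb_to_mb() {
    awk -v gb="$1" 'BEGIN { printf "%d\n", gb * 1024 }'
}
```

For the figure shown in the UI, gb_to_mb 680.28 gives 696606, which is within a few MB of the playbook value; the UI rounds to two decimal places of a GB, so a small gap like that is expected.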

Again, if repeating this test, use large files; small files only worked for me because I had a controlled environment.
TIP: if you want immediate automatic unmap, here is what I have seen work. Copy a 1.4GB ova to the ext4 filesystem, make a copy of that ova, then delete a few lines from the start of the 2nd ova via vi and save it. That typically gets the Unmap I/O counters to fire. You don't even have to delete the ova files.
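For anyone who wants to repeat that tip without opening vi, here is a rough scripted stand-in. It is an assumption on my part that rewriting the copy this way nudges the counter the same way the manual vi edit does; the function name and the ".copy" suffix are purely illustrative.

```shell
#!/bin/sh
# rewrite_copy SRC N
# Copies SRC to SRC.copy, then rewrites the copy without its first N lines.
# Stand-in for the manual "open the copy in vi, delete a few lines, save" step.
rewrite_copy() {
    src=$1
    n=$2
    copy="$src.copy"
    cp "$src" "$copy" || return 1
    # tail -n +K starts output at line K, so K = N + 1 skips the first N lines
    tail -n "+$((n + 1))" "$copy" > "$copy.tmp" && mv "$copy.tmp" "$copy"
}
```

With the ova sitting on the discard-mounted filesystem, this would be run as rewrite_copy /cm/some.ova 3 (the file name is a placeholder), followed by another look at the Unmap I/O counter.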

To summarize, don't try to correlate the # of unmap I/Os with MB saved. It's not a precise science given how blocks are stored, so again, focus on the GBs to be reclaimed. The important factor is that if you see the unmap I/O counter increasing, then rest assured that automatic unmap is working and Hitachi Storage is reclaiming space without impacting host I/O. You could try the tip above every so often to get that positive feedback.

You will have noticed a difference between the datastore free space as reported by vSphere or vmkfstools (930GB) and the actual free space in this particular LUN (1.3TB). My other VMs had a mixture of thick provision lazy zero VMDKs (a holdover from deployed OVAs) and thin VMDKs. vSphere calculates free space based on the fully provisioned size of lazy zero VMDKs. Use the used capacity as seen from storage (via Hitachi Storage Plugin for vSphere, the Hitachi PowerCLI cmdlets or the Hitachi VASA UI) to avoid over-zealous lazy zero disks skewing the real picture, and favor thin VMDKs going forward.

I'll continue with Part 3, the next blog in this space management series, to cover the old and the next gen (VMFS5 and vVols).

Appendix:
Here are all the commands that were used and not harmed during this blog creation