I am looking at an HNAS 4060 / AMS2500 configuration that is being used for a VMware 5.x environment through NFS.
The entire VMware production environment (all VMs and datastores) is running on a single 60 TB NFS export (one underlying file system).
The 60 TB file system is replicated every two hours to another HNAS 4060 over 1 GbE. On average, 200-250 GB has to be transferred per replication run.
We're observing the following:
- When the snapshot is triggered, the HNAS 4060 CPU instantly climbs to 80% and stays there until it has figured out which blocks to send. On average this takes 8 minutes. During that time latency seems to increase on the VM side, because the monitoring hits disk queue thresholds.
- When the replication runs, it starts out nicely at 80 MB/s for the first 10 minutes, but then gradually declines to 70, then 60, then 50, then 40 MB/s, and so on (a smoothly curved graph; this happens every time, with no contention on the line). The gradual decrease in speed causes the replication run to exceed the replication window (2 hours), so the next snapshot is skipped as a result.
My questions:
- Is the use of a single NFS export an HDS best practice, or at least a commonly used configuration? (I went through all the Best Practice papers, but have not been able to find anything about this.)
- Does anyone recognize either of the observations above, and if so, what can we do about it?
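To put rough numbers on the replication window described above, here is a minimal back-of-the-envelope sketch. It uses only the figures from this post (200-250 GB per run, a 2-hour window, about 8 minutes spent working out which blocks to send, 80 MB/s initial rate). The function names are made up for illustration, and decimal units (1 GB = 1000 MB) are an assumption.

```python
# Back-of-the-envelope check of the 2-hour replication window.
# All figures are taken from the post; decimal units are an assumption.

WINDOW_S = 2 * 60 * 60    # replication interval: 2 hours
DIFF_PHASE_S = 8 * 60     # ~8 minutes spent working out which blocks to send
MB_PER_GB = 1000          # assumption: decimal gigabytes


def required_rate_mb_s(data_gb, window_s=WINDOW_S, overhead_s=DIFF_PHASE_S):
    """Average MB/s needed to move data_gb in the time left after the diff phase."""
    return data_gb * MB_PER_GB / (window_s - overhead_s)


def transfer_minutes(data_gb, rate_mb_s):
    """Minutes needed to move data_gb at a constant rate_mb_s."""
    return data_gb * MB_PER_GB / rate_mb_s / 60


if __name__ == "__main__":
    for data_gb in (200, 250):
        print(
            f"{data_gb} GB: needs >= {required_rate_mb_s(data_gb):.1f} MB/s on average, "
            f"or {transfer_minutes(data_gb, 80):.0f} min at a steady 80 MB/s"
        )
```

Under these assumptions a run would finish in well under an hour if the initial 80 MB/s were sustained, and the window is only missed when the average rate over the whole run falls below roughly 30 to 37 MB/s.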
Thanks in advance for any feedback you may be able to provide!