We are running an HCI system with 4 nodes, each with 64 GB RAM and a 500 GB hard disk. The file system is a share from HNAS containing around 170 million files. The workflow is running very slowly: the 4-node HCI cluster cannot scan even 100K files per hour, yet server utilisation does not go beyond 20-30% and plenty of resources are free on the servers. Initially the workflow was taking batches of 10,000+ documents and scanning 1M files per hour. Now the batch size has dropped to fewer than 1,000 files, and it takes a minimum of 12 hours to scan 1M files. We faced this issue before and logged a support case, but it was not resolved.
Following the recommendations from the support team, we tried making some changes to the pipelines. For example, we removed most of the regex fields in the date conversion stage and added many content_type fields in the MIME type detection stage, but none of these changes gave us consistent performance, and the rate is still very slow.
At this rate, it would take months for HCI to scan the customer's 170 million file FS. Can anyone suggest workarounds or share recommendations if you have faced a similar situation before?
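For reference, here is a rough sketch of how the "months" estimate works out from the numbers in this post. It assumes the scan rate stays constant for the whole run, and the function and variable names are only illustrative, not anything from HCI itself.

```python
# Back-of-the-envelope scan time estimate using the figures quoted above.
# Assumption: throughput stays constant for the entire scan.

def hours_to_scan(total_files: int, files_per_hour: float) -> float:
    """Return the hours needed to scan total_files at a steady rate."""
    return total_files / files_per_hour

TOTAL_FILES = 170_000_000        # files on the HNAS share
CURRENT_RATE = 1_000_000 / 12    # now: 1M files takes at least 12 hours
INITIAL_RATE = 1_000_000         # initially: 1M files per hour

current_days = hours_to_scan(TOTAL_FILES, CURRENT_RATE) / 24
initial_days = hours_to_scan(TOTAL_FILES, INITIAL_RATE) / 24

print(f"At the current rate: {current_days:.1f} days")
print(f"At the initial rate: {initial_days:.1f} days")
```

So the current rate means roughly 85 days of continuous scanning, compared with about 7 days at the rate we saw initially.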