
 Estimate Solr index size

Eduardo Javier Huerta Yero posted 07-12-2019 13:04

Hi,

We have a customer with 1.5 billion objects to index. We know how many fields need to be indexed for each object, as well as the field size for each of them. We expect index replication to be set to at least 2. Is there a way to guesstimate the Solr index size? Actually, what we need to tell the customer is how many nodes they need in their cluster, which at this point we believe is going to be driven by the index size.

Any guidance is appreciated.


#HitachiContentIntelligenceHCI
Data Conversion

There is no accurate way to estimate index size from the information you have. There are plenty of resources online that offer some guidance; see https://www.google.com/search?q=solr+estimate+index+size

The main issue is that it really depends on the data being indexed: not the number of objects, and not their size, but the actual indexable content. In search index terminology, the most important parameter is the number of unique terms being indexed.

For example, if you indexed a million objects and they all contain a date or a person's name, the index size will be considerably different depending on whether those dates and names are unique or duplicated.
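
To see why unique terms dominate, here is a minimal, hypothetical Python sketch that counts distinct terms across a document sample. Real Solr analyzers (lowercasing, stemming, tokenizer chains) will produce different term dictionaries, so treat it as directional only:

    from collections import defaultdict
    import re

    def unique_terms(docs, fields):
        """Count distinct terms per field across a sample of documents."""
        terms = defaultdict(set)
        for doc in docs:
            for f in fields:
                # Naive \w+ tokenization; Solr's analysis chain will differ.
                terms[f].update(re.findall(r"\w+", str(doc.get(f, "")).lower()))
        return {f: len(s) for f, s in terms.items()}

    # Two extremes: every document repeats the same values vs. mostly unique values.
    dup  = [{"name": "Alice Smith", "date": "2019-07-12"}] * 100_000
    uniq = [{"name": f"user{i}", "date": f"2019-{i % 12 + 1:02d}-01"}
            for i in range(100_000)]

    print(unique_terms(dup,  ["name", "date"]))   # tiny term dictionary: {'name': 2, 'date': 3}
    print(unique_terms(uniq, ["name", "date"]))   # 'name' dictionary grows with the corpus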

The only practical way to estimate and size a search solution is to run a prototype on a subset of the data, say 10M documents, measure, and then extrapolate to the desired object count.
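
For the measure-and-extrapolate step, the arithmetic is simple enough to sketch. Every number below is a placeholder, not a measurement:

    # Extrapolate full index size from a measured prototype run.
    # All inputs are hypothetical -- substitute your own measurements.
    sample_docs     = 10_000_000       # documents indexed in the prototype
    sample_index_gb = 120.0            # measured on-disk index size (placeholder)
    target_docs     = 1_500_000_000    # the customer's object count
    ipl             = 2                # index protection level (full index copies)

    raw_gb   = sample_index_gb / sample_docs * target_docs
    total_gb = raw_gb * ipl            # each replica is a full copy of the index

    # Linear scaling is optimistic if the share of unique terms keeps growing
    # with corpus size, so treat the result as a floor, not a promise.
    print(f"~{raw_gb:,.0f} GB per copy, ~{total_gb:,.0f} GB at IPL {ipl}")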

Eduardo Javier Huerta Yero

Thanks, Yury, for the quick reply. I was afraid that was the answer. I had run a similar query on Google and came across an Excel spreadsheet that tries to estimate index and memory requirements; one of its required inputs was "unique terms," as you mention. I was hoping for some kind of ballpark estimate to give the customer, since they want to know the size of the cluster they need to buy, but I realize it's not possible.

Once again, thanks for your answer.

Troy Myers

You have several key pieces of information, and you asked an important question about the index protection level (IPL). I will assume IPL 1 to keep this simple; we can double everything later. One of your main drivers will be shard count, and a good rule of thumb is 50-100M objects per shard. Be aware that shard count currently cannot be changed after configuration (changing it is on the roadmap; PM can speak to timing and version), so we also want to account for future growth when figuring out the total number of shards. I would size at 50M per shard, since that leaves room to grow toward 100M: 1.5B / 50M = 30 shards.

Then figure out how to distribute those shards. For example, do you want an 8-node system with 5 workers and 3 masters, or will the masters also hold index? With 5 workers that is 6 shards per node; with a 4-node system it is 8 shards per node (30 rounded up to 32). I would feel comfortable with the 8-node, 5-worker layout at IPL 1; for IPL 2 we will have to double it.
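
Troy's arithmetic, spelled out as a quick sketch (the 50M-per-shard rule of thumb, the IPL doubling, and the worker counts come from his post; the script itself is just illustrative Python):

    import math

    objects           = 1_500_000_000
    objects_per_shard = 50_000_000     # conservative end of the 50-100M rule of thumb

    shards = math.ceil(objects / objects_per_shard)   # 30 shards at IPL 1
    print(f"{shards} shards at IPL 1, {shards * 2} at IPL 2")

    for workers in (5, 4):
        per_node = math.ceil(shards / workers)        # shards hosted per worker node
        print(f"{workers} workers: {per_node} shards/node "
              f"({per_node * workers} shards total at IPL 1)")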

Finding out the exact data size can be a challenge without testing and extrapolating. I would also want to know the change rate, as that may factor into the 50-100M goal. The other piece is how many users, connections, and queries will be hitting the index, since performance can also be impacted there.

  Troy

Eduardo Javier Huerta Yero

Great stuff Troy, thanks!!