
 Estimate Solr index size

Eduardo Javier Huerta Yero posted 07-12-2019 13:04

Hi,

We have a customer with 1.5 billion objects to index. We know how many fields need to be indexed for each object, as well as the field size for each of them. We expect index replication to be set to at least 2. Is there a way to guesstimate the Solr index size? Actually, what we need to tell the customer is how many nodes they need in their cluster, which at this point we believe is going to be driven by the index size.

Any guidance is appreciated.


#HitachiContentIntelligenceHCI
Data Conversion

There is no accurate way to estimate index size from the information you have. There are plenty of resources online that offer some guidance; see https://www.google.com/search?q=solr+estimate+index+size

The main issue is that it really depends on the data being indexed: not the number of objects, and not their size, but the actual indexable content. In search index terminology, the most important parameter is the number of unique terms being indexed.

For example, if you indexed a million objects and they all contain a date or a person's name, the index size will be considerably different depending on whether those dates and names are unique or duplicated.
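
To see why unique terms dominate, here is a minimal, hypothetical Python sketch that counts distinct terms across a document sample. Real Solr analyzers (lowercasing, stemming, tokenizer chains) will produce different term dictionaries, so treat it as directional only:

    from collections import defaultdict
    import re

    def unique_terms(docs, fields):
        """Count distinct terms per field across a sample of documents."""
        terms = defaultdict(set)
        for doc in docs:
            for f in fields:
                # Naive \w+ tokenization; Solr's analysis chain will differ.
                terms[f].update(re.findall(r"\w+", str(doc.get(f, "")).lower()))
        return {f: len(s) for f, s in terms.items()}

    # Two extremes: every document repeats the same values vs. mostly unique values.
    dup  = [{"name": "Alice Smith", "date": "2019-07-12"}] * 100_000
    uniq = [{"name": f"user{i}", "date": f"2019-{i % 12 + 1:02d}-01"}
            for i in range(100_000)]

    print(unique_terms(dup,  ["name", "date"]))   # tiny term dictionary: {'name': 2, 'date': 3}
    print(unique_terms(uniq, ["name", "date"]))   # 'name' dictionary grows with the corpus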

The only practical way to estimate and size a search solution is to run a prototype on a subset of the data, say 10M documents, measure, and then extrapolate to the desired object count.
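
For the measure-and-extrapolate step, the arithmetic is simple enough to sketch. Every number below is a placeholder, not a measurement:

    # Extrapolate full index size from a measured prototype run.
    # All inputs are hypothetical -- substitute your own measurements.
    sample_docs     = 10_000_000       # documents indexed in the prototype
    sample_index_gb = 120.0            # measured on-disk index size (placeholder)
    target_docs     = 1_500_000_000    # the customer's object count
    ipl             = 2                # index protection level (full index copies)

    raw_gb   = sample_index_gb / sample_docs * target_docs
    total_gb = raw_gb * ipl            # each replica is a full copy of the index

    # Linear scaling is optimistic if the share of unique terms keeps growing
    # with corpus size, so treat the result as a floor, not a promise.
    print(f"~{raw_gb:,.0f} GB per copy, ~{total_gb:,.0f} GB at IPL {ipl}")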

Eduardo Javier Huerta Yero

Thanks, Yury, for the quick reply. I was afraid that was the answer. I had run a similar query on Google and came across an Excel spreadsheet that tries to estimate index and memory requirements; one of its required inputs was "unique terms," as you mention. I was hoping for some kind of ballpark estimate to give the customer, since they want to know the size of the cluster they need to buy, but I realize it's not possible.

Once again, thanks for your answer.

Troy Myers

You have several key pieces of information, and you asked an important question about the index protection level (IPL). I will assume IPL 1 to keep this simple; we can double everything later. One of your main drivers will be shard count, and a good rule of thumb is 50-100M objects per shard. Be aware that shard count currently cannot be changed after configuration (changing it is on the roadmap; PM can speak to timing and version), so we also want to account for future growth when figuring out the total number of shards. I would size at 50M per shard, since that leaves room to grow toward 100M: 1.5B / 50M = 30 shards.

Then figure out how to distribute those shards. For example, do you want an 8-node system with 5 workers and 3 masters, or will the masters also hold index? With 5 workers that is 6 shards per node; with a 4-node system it is 8 shards per node (30 rounded up to 32). I would feel comfortable with the 8-node, 5-worker layout at IPL 1; for IPL 2 we will have to double it.
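
Troy's arithmetic, spelled out as a quick sketch (the 50M-per-shard rule of thumb, the IPL doubling, and the worker counts come from his post; the script itself is just illustrative Python):

    import math

    objects           = 1_500_000_000
    objects_per_shard = 50_000_000     # conservative end of the 50-100M rule of thumb

    shards = math.ceil(objects / objects_per_shard)   # 30 shards at IPL 1
    print(f"{shards} shards at IPL 1, {shards * 2} at IPL 2")

    for workers in (5, 4):
        per_node = math.ceil(shards / workers)        # shards hosted per worker node
        print(f"{workers} workers: {per_node} shards/node "
              f"({per_node * workers} shards total at IPL 1)")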

Finding out the exact data size can be a challenge without testing and extrapolating. I would also want to know the change rate, as that may factor into the 50-100M goal. The other piece is how many users, connections, and queries will be hitting the index, since performance can also be impacted there.

  Troy

Eduardo Javier Huerta Yero

Great stuff Troy, thanks!!