On 8/23/2018 5:19 AM, zhenyuan wei wrote:
Thanks for your detailed answer, @Shawn.

Yes, I ran the query in SolrCloud mode, and my collection has 20 shards;
each shard is 30~50GB.
There are 4 Solr servers, and each Solr JVM uses 6GB. There are 4 HDFS datanodes too, and each
datanode JVM uses 2.5GB.
There are also 4 Linux server hosts; each node has 16 cores/32GB RAM/1600GB SSD.

So, in order to search 2 billion docs fast (HDFS shows 787GB), I should
turn on autowarming. And how
much Solr RAM and how many Solr nodes should there be?
Is there a rough formula for budgeting?

There are no generic answers, no rough formulas.  Every install is different and minimum requirements are dependent on the specifics of the install.

How many replicas do you have of each of those 20 shards? Is the 787GB of data the size of *one* replica, or the size of *all* replicas?  Based on the info you shared, I suspect that it's the size of one replica.

Here's a guide I've written:

https://wiki.apache.org/solr/SolrPerformanceProblems

That guide doesn't consider HDFS, so the info about the OS disk cache on that page is probably not helpful.  I really have no idea what requirements HDFS has.  I *think* that the HDFS client block cache would replace the OS disk cache, and that the Solr heap must be increased to accommodate that block cache.  This might lead to GC issues, though, because ideally the cache would be large enough to cache all of the index data that the Solr instance is accessing.  In your case, that's a LOT of data, far more than you can fit into the 32GB total system memory.  Solr performance will suffer if you're not able to have the system cache Solr's index data.  But I will tell you that achieving a QTime of 125 on a wildcard query against a 2 billion document index is impressive, not something I would expect to happen with the low hardware resources you're using.
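
To put rough numbers on the memory gap, here is a back-of-the-envelope sketch in Python using only the figures from this thread. It assumes the 787GB is the size of one replica, that the index is spread evenly across the 4 nodes, and that nothing besides the Solr and datanode JVMs uses memory. None of those assumptions is confirmed, and this is not a sizing formula.

```python
# Figures quoted in this thread; the even spread across nodes is an assumption.
index_gb = 787            # size reported by HDFS (assumed to be ONE replica)
nodes = 4                 # Linux hosts, each running Solr and an HDFS datanode
ram_per_node_gb = 32
solr_heap_gb = 6
datanode_heap_gb = 2.5

# Index data each node would have to serve if shards are spread evenly.
index_per_node_gb = index_gb / nodes

# Memory left on each node after the two JVM heaps (ignores the OS itself).
free_for_cache_gb = ram_per_node_gb - solr_heap_gb - datanode_heap_gb

# Fraction of the per-node index data that could be cached at best.
cacheable_fraction = free_for_cache_gb / index_per_node_gb

print(f"index per node: {index_per_node_gb:.2f} GB")
print(f"memory left for caching: {free_for_cache_gb:.1f} GB")
print(f"cacheable at best: {cacheable_fraction:.0%}")
```

Under those assumptions each node holds roughly 197GB of index data and has about 23.5GB left for any kind of caching, so only around an eighth of the index could be cached at once.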

You have 20 shards.  If your replicationFactor is 3, then ideally you would have 60 servers - one for each shard replica. Each server would have enough memory installed that it could cache the 30-50GB of data in that shard, or at least MOST of it.
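
The same arithmetic for that ideal layout, as a small Python sketch. The replicationFactor of 3 is only the hypothetical value from the paragraph above, and the 30-50GB range is the shard size you reported; the variable names are just for illustration.

```python
shards = 20
replication_factor = 3               # hypothetical, not your actual setting
shard_gb_low, shard_gb_high = 30, 50 # shard size range reported in this thread

# One server per shard replica.
ideal_servers = shards * replication_factor

# Memory needed across the cluster to cache every replica in full,
# not counting the Solr heap or the OS on each server.
total_cache_gb = (ideal_servers * shard_gb_low, ideal_servers * shard_gb_high)

print(f"ideal servers: {ideal_servers}")
print(f"cache memory across the cluster: {total_cache_gb[0]}-{total_cache_gb[1]} GB")
```

That is the ideal, not a minimum. Many installs get acceptable performance caching only part of the index, and how much is enough depends on the queries.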

IMHO, Solr should be using local storage, not a network filesystem like HDFS.  Things are a lot more straightforward that way.

Thanks,
Shawn
