Thanks again~ @Shawn @Jan Høydahl What is the recommended size of one shard? Or how many docs per shard are recommended? My collection has 20 shards, and each shard is 30~50GB.
@Shawn according to your wiki and e-mail reply, to achieve better query performance I estimate the RAM requirement as follows: 787GB, 2 billion docs, each document with 13 fields. The document fields are:

*document { userId, phone, username, age, contry, city, province, job, gender, height, weight, lastupdatetime, createTime }*

All fields will be searched randomly, and may be combined with each other randomly too. So, in order to query fast, I should use the filterCache, queryResultCache and documentCache, and use the HDFS directMemorySize. The details:

1. "userId" is the document "id" field, no need to cache it.
2. "phone" is usually an exact-match query, no need to cache it either.
3. "username" may be a wildcard query, no need to cache it, because the cache hit rate would be low.
4. age, contry, city, province and gender are enumerated types, so we can cache each value of them:
   - age is 0~100, so the filterCache may hold 100 entries; total docIds are 2 billion, maybe 2 billion docIds * 4 bytes (int) ≈ 7.5GB
   - contry has 190+ values, so the filterCache may hold 190+ entries; total docIds 2 billion, 7.5GB
   - city: 2 billion docIds, 7.5GB
   - province: 2 billion docIds, 7.5GB
   - gender: 2 billion docIds, 7.5GB
   These conditions may also be combined with each other at query time, but we cannot enumerate the combinations, so it is impossible to cache them all in the queryResultCache.
5. height and weight are doubles, cannot be cached at all, and there is no rule to estimate them.
6. lastupdatetime and createTime are longs, always used in range queries, and there is no rule to estimate them either.

All I can do is resize the filterCache: with the 5 enumerated types that is maybe 7.5GB x 5 = 37.5GB (see the rough sketch below), plus enabling the HDFS directMemorySize at 20GB per node, using SSD for storage.

As a result, if the query cache budget is 2 times 37.5GB, i.e. 75GB, I should use 11 Solr nodes, each Solr JVM using 6GB RAM, and each node with 10GB of HDFS directMemorySize enabled.

*Is there anything else to make it better?*
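Here is the back-of-the-envelope arithmetic behind the numbers above, as a small Java sketch. It assumes every cached filter entry stores all matching docIds as 4-byte ints; Solr's real filterCache may use bitsets or other structures, so treat this only as my rough upper-bound estimate:

    // Rough estimate for the filterCache sizing described above.
    // Assumption: each cached filter entry keeps all matching docIds as 4-byte ints.
    public class FilterCacheEstimate {
        public static void main(String[] args) {
            long numDocs = 2_000_000_000L;          // documents in the collection
            long bytesPerDocId = Integer.BYTES;     // 4 bytes per docId (assumed)
            double gib = 1024.0 * 1024 * 1024;

            double perFieldGB = numDocs * bytesPerDocId / gib;      // ~7.5 GB
            int enumeratedFields = 5;               // age, contry, city, province, gender
            double filterCacheGB = perFieldGB * enumeratedFields;   // ~37.3 GB
            double cacheBudgetGB = filterCacheGB * 2;               // doubled for headroom, ~75 GB

            System.out.printf("one enumerated field  : %.1f GB%n", perFieldGB);
            System.out.printf("five enumerated fields: %.1f GB%n", filterCacheGB);
            System.out.printf("doubled cache budget  : %.1f GB%n", cacheBudgetGB);
        }
    }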
Jan Høydahl <jan....@cominvent.com> wrote on Thu, Aug 23, 2018 at 9:38 PM:

> Shawn, the block cache seems to be off-heap according to
> https://lucene.apache.org/solr/guide/7_4/running-solr-on-hdfs.html
>
> So you have 800G across 4 nodes, which gives 500M docs and 200G index data
> per Solr node and 40G per shard.
> Initially I'd say this is way too much data and too little RAM per node,
> but it obviously works due to the very small docs you have.
> So the first thing I'd try (after doing some analysis of various metrics
> for your running cluster) would be to adjust the size of the HDFS block
> cache following the instructions from the link above. You'll have 20-25GB
> available for this, which is only 1/10 of the index size.
>
> The next step would be to replace the EC2 images with ones with more RAM,
> increase the block cache further, and see if that helps.
>
> Next I'd enable autowarmCount on the filterCache, find alternatives to the
> wildcard query, and more.
>
> But all in all, I'd be very, very satisfied with those low response times
> given the size of your data.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
>
> On 23 Aug 2018, at 15:05, Shawn Heisey <apa...@elyograg.org> wrote:
> >
> > On 8/23/2018 5:19 AM, zhenyuan wei wrote:
> >> Thanks for your detailed answer @Shawn
> >>
> >> Yes, I run the query in SolrCloud mode, and my collection has 20 shards;
> >> each shard size is 30~50GB.
> >> There are 4 Solr servers, each Solr JVM uses 6GB; there are 4 HDFS
> >> datanodes too, each datanode JVM uses 2.5GB.
> >> There are also 4 Linux server hosts; each node is 16 cores / 32GB RAM /
> >> 1600GB SSD.
> >>
> >> So, in order to search 2 billion docs fast (HDFS shows 787GB), should I
> >> turn on autowarm, and how much Solr RAM / how many Solr nodes would it
> >> take? Is there a rough formula for the budget?
> >
> > There are no generic answers, no rough formulas. Every install is
> > different, and minimum requirements are dependent on the specifics of
> > the install.
> >
> > How many replicas do you have of each of those 20 shards? Is the 787GB
> > of data the size of *one* replica, or the size of *all* replicas? Based
> > on the info you shared, I suspect that it's the size of one replica.
> >
> > Here's a guide I've written:
> >
> > https://wiki.apache.org/solr/SolrPerformanceProblems
> >
> > That guide doesn't consider HDFS, so the info about the OS disk cache on
> > that page is probably not helpful. I really have no idea what
> > requirements HDFS has. I *think* that the HDFS client block cache would
> > replace the OS disk cache, and that the Solr heap must be increased to
> > accommodate that block cache. This might lead to GC issues, though,
> > because ideally the cache would be large enough to cache all of the
> > index data that the Solr instance is accessing. In your case, that's a
> > LOT of data, far more than you can fit into the 32GB total system
> > memory. Solr performance will suffer if you're not able to have the
> > system cache Solr's index data. But I will tell you that achieving a
> > QTime of 125 on a wildcard query against a 2 billion document index is
> > impressive, not something I would expect to happen with the low hardware
> > resources you're using.
> >
> > You have 20 shards. If your replicationFactor is 3, then ideally you
> > would have 60 servers - one for each shard replica. Each server would
> > have enough memory installed that it could cache the 30-50GB of data in
> > that shard, or at least MOST of it.
> >
> > IMHO, Solr should be using local storage, not a network filesystem like
> > HDFS. Things are a lot more straightforward that way.
> >
> > Thanks,
> > Shawn
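P.S. To make the block-cache tuning Jan suggested concrete for myself, here is a rough sketch. It assumes the defaults I read in the "Running Solr on HDFS" guide (8KB cache blocks and 16384 blocks per slab, i.e. 128MB per slab); please correct me if the defaults or property names differ in your version:

    // Sketch of sizing the off-heap HDFS block cache per Solr node,
    // based on my reading of the "Running Solr on HDFS" guide.
    public class HdfsBlockCacheSizing {
        public static void main(String[] args) {
            long blockSizeBytes = 8192;          // cache block size (assumed default)
            long blocksPerSlab = 16384;          // solr.hdfs.blockcache.blocksperbank (assumed default)
            long slabBytes = blockSizeBytes * blocksPerSlab;   // 128MB per slab

            double targetCacheGB = 20.0;         // desired off-heap cache per node (Jan's 20-25GB range)
            long slabCount = (long) Math.ceil(targetCacheGB * 1024 * 1024 * 1024 / slabBytes);

            System.out.printf("slab size        : %d MB%n", slabBytes / (1024 * 1024));
            System.out.printf("slabs for %.0f GB: %d (solr.hdfs.blockcache.slab.count)%n",
                    targetCacheGB, slabCount);
            // The JVM's MaxDirectMemorySize would then need to be at least
            // slabCount * 128MB, plus some headroom.
        }
    }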