Thanks again~ @Shawn @Jan Høydahl
What is the recommended size of one shard, or how many docs per shard are
recommended?
My collection has 20 shards, each shard is 30~50GB.

@Shawn, according to your wiki and e-mail reply, to achieve better query
performance I estimate the RAM requirement as follows:

The index is 787GB with 2 billion docs, and each document has 13 fields,
like this:


*document { userId, phone, username, age, country, city, province, job,
gender, height, weight, lastUpdateTime, createTime }*
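
In managed-schema terms I picture the fields roughly like this (the field
types here are only my guesses, not the real schema):

  <field name="userId"         type="string"  indexed="true" stored="true" required="true"/>
  <field name="phone"          type="string"  indexed="true" stored="true"/>
  <field name="username"       type="string"  indexed="true" stored="true"/>
  <field name="age"            type="pint"    indexed="true" stored="true"/>
  <field name="country"        type="string"  indexed="true" stored="true"/>
  <field name="city"           type="string"  indexed="true" stored="true"/>
  <field name="province"       type="string"  indexed="true" stored="true"/>
  <field name="job"            type="string"  indexed="true" stored="true"/>
  <field name="gender"         type="string"  indexed="true" stored="true"/>
  <field name="height"         type="pdouble" indexed="true" stored="true"/>
  <field name="weight"         type="pdouble" indexed="true" stored="true"/>
  <field name="lastUpdateTime" type="plong"   indexed="true" stored="true"/>
  <field name="createTime"     type="plong"   indexed="true" stored="true"/>
  <uniqueKey>userId</uniqueKey>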

All fields will be searched randomly, and may be combined with each other
randomly too.
So, in order to query fast, I should use the filterCache, queryResultCache,
and documentCache, and use the HDFS directMemorySize (block cache).
The details follow:

1. "userId" is document "id" field, no nead to cache.
2. "phone" is offen precise query,no nead to cache too
3. username maybe wildcard query,no nead to cache,due to cache is low hit
rate
4. age、contry、city、province、gender,these are enumerate type,we can cache
each type of them
age 0~100,filterCache maybe holds 100 object, total docId is 2billion,maybe
2billion * integer.size = 7.5GB
contry 190+, filterCache size mybe holds 190+ object,total docId is
2billion, 7.5GB
city,2billion docid , 7.5GB
province,2billion docid ,7.5GB
gender , 2billion docid ,7.5GB

And those conditions maybe combine with each other while query happen,but
we can not enumerate them, and it is impossiable
to cache in all queryResultCache。
5. height、weight are double,and can not cache at all,and no rule to
estimate
6. lastupdatetime、createtime are long,always use range query,and no rule to
estimate
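
Putting that together, the cache section of solrconfig.xml I have in mind
looks roughly like this (the classes are the Solr 7.x defaults; the sizes
and autowarm counts are only my guesses):

  <!-- room for the ~600 enumerated filter entries (100 ages, 190+
       countries, cities, provinces, 2 genders) plus some combinations -->
  <filterCache class="solr.FastLRUCache"
               size="1024"
               initialSize="512"
               autowarmCount="128"/>

  <!-- combined queries have an uncertain hit rate, so keep this modest -->
  <queryResultCache class="solr.LRUCache"
                    size="512"
                    initialSize="512"
                    autowarmCount="32"/>

  <!-- stored fields of the rows actually returned -->
  <documentCache class="solr.LRUCache"
                 size="512"
                 initialSize="512"/>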

All I can do is resize the filterCache: with 5 enumerated types, maybe
7.5GB x 5 = 37.5GB, and enable the HDFS directMemorySize as 20GB per node,
using SSD for storage.
As a result, if the query cache is 2 times 37.5GB, i.e. 75GB, I should use
11 Solr nodes, each Solr JVM using 6GB RAM, and each node enabling 10GB of
HDFS directMemorySize.
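
If I read the "Running Solr on HDFS" guide correctly, each block cache slab
is 16384 blocks x 8KB = 128MB, so 10GB per node is about 80 slabs, and
-XX:MaxDirectMemorySize in solr.in.sh needs to be a bit higher (say 12g).
The directoryFactory in solrconfig.xml would then look roughly like this
(the host name and paths are just placeholders):

  <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
    <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
    <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
    <bool name="solr.hdfs.blockcache.enabled">true</bool>
    <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
    <!-- 16384 blocks x 8KB = 128MB per slab; 80 slabs ~= 10GB off-heap
         (160 slabs ~= 20GB if we stay with 4 bigger nodes) -->
    <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
    <int name="solr.hdfs.blockcache.slab.count">80</int>
    <bool name="solr.hdfs.blockcache.global">true</bool>
  </directoryFactory>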

*Is there anything else I can do to make it better?*





Jan Høydahl <jan....@cominvent.com> wrote on Thu, Aug 23, 2018, at 9:38 PM:

> Shawn, the block cache seems to be off-heap according to
> https://lucene.apache.org/solr/guide/7_4/running-solr-on-hdfs.html <
> https://lucene.apache.org/solr/guide/7_4/running-solr-on-hdfs.html>
>
> So you have 800G across 4 nodes, that gives 500M docs and 200G index data
> per solr node and 40G per shard.
> Initially I'd say this is way too much data and too little RAM per node
> but it obviously works due to the very small docs you have.
> So the first thing I'd try (after doing some analysis of various metrics
> for your running cluster) would be to adjust the size of the hdfs
> block-cache following the instructions from the link above. You'll have
> 20-25GB available for this, which is only 1/10 of the index size.
>
> So next step would be to replace the EC2 images with ones with more RAM
> and increase the block cache further and see if that helps.
>
> Next I'd enable autoWarmCount on the filterCache, find alternatives to
> the wildcard query, and more.
>
> But all in all, I'd be very very satisfied with those low response times
> given the size of your data.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > On 23 Aug 2018, at 15:05, Shawn Heisey <apa...@elyograg.org> wrote:
> >
> > On 8/23/2018 5:19 AM, zhenyuan wei wrote:
> >> Thanks for your detail answer @Shawn
> >>
> >> Yes, I run the query in SolrCloud mode, and my collection has 20
> >> shards, each shard is 30~50GB.
> >> There are 4 Solr servers, each Solr JVM uses 6GB; there are 4 HDFS
> >> datanodes too, each datanode JVM uses 2.5GB.
> >> There are also 4 Linux server hosts, each with 16 cores/32GB RAM/1600GB
> >> SSD.
> >>
> >> So, in order to search 2 billion docs fast (HDFS shows 787GB), should
> >> I turn on autowarm, and how much Solr RAM / how many Solr nodes should
> >> it be?
> >> Is there a rough formula to budget with?
> >
> > There are no generic answers, no rough formulas.  Every install is
> different and minimum requirements are dependent on the specifics of the
> install.
> >
> > How many replicas do you have of each of those 20 shards? Is the 787GB
> of data the size of *one* replica, or the size of *all* replicas?  Based on
> the info you shared, I suspect that it's the size of one replica.
> >
> > Here's a guide I've written:
> >
> > https://wiki.apache.org/solr/SolrPerformanceProblems
> >
> > That guide doesn't consider HDFS, so the info about the OS disk cache on
> that page is probably not helpful.  I really have no idea what requirements
> HDFS has.  I *think* that the HDFS client block cache would replace the OS
> disk cache, and that the Solr heap must be increased to accommodate that
> block cache.  This might lead to GC issues, though, because ideally the
> cache would be large enough to cache all of the index data that the Solr
> instance is accessing.  In your case, that's a LOT of data, far more than
> you can fit into the 32GB total system memory.  Solr performance will suffer
> if you're not able to have the system cache Solr's index data.  But I will
> tell you that achieving a QTime of 125 on a wildcard query against a 2
> billion document index is impressive, not something I would expect to
> happen with the low hardware resources you're using.
> >
> > You have 20 shards.  If your replicationFactor is 3, then ideally you
> would have 60 servers - one for each shard replica. Each server would have
> enough memory installed that it could cache the 30-50GB of data in that
> shard, or at least MOST of it.
> >
> > IMHO, Solr should be using local storage, not a network filesystem like
> HDFS.  Things are a lot more straightforward that way.
> >
> > Thanks,
> > Shawn
> >
>
>
