Hi Dmitry,

>>The parameters you have mentioned -- termInfosIndexDivisor and
>>termIndexInterval -- are not found in the solr 1.4.1 config|schema. Are you 
>>using SOLR 3.1?

I'm pretty sure that termIndexInterval (the ratio of the tii file to the tis 
file) is in the 1.4.1 example solrconfig.xml file, although I don't have a copy 
to check at the moment. We are using a 3.1 dev version. As for 
termInfosIndexDivisor, I'm also pretty sure it works with 1.4.1, but you 
might have to ask the list to be sure. As you can see from the blog posts, 
those settings really reduced our memory requirements. We haven't been doing 
faceting, so we expect memory use to go up again once we add faceting, but at 
least we are starting at a 4GB baseline instead of a 20-32GB baseline.

>>Did you do logical sharding or document hash based?

On the indexing side we just assign documents to a particular shard on a 
round-robin basis and use a database to keep track of which document is in 
which shard, so if we need to update a document we update the right shard (see 
the "Forty days" article on the blog for a more detailed description and some 
diagrams).
We hope that this distributes the documents evenly enough to avoid problems 
with Solr's lack of global idf.
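
As an illustration only, here is a minimal Python sketch of round-robin 
assignment with a lookup table. It is not the HathiTrust indexing code: the 
ShardAssigner class, the doc_shard table, and the use of sqlite3 as a stand-in 
for "a database" are all assumptions made for the example.

```python
import itertools
import sqlite3


class ShardAssigner:
    """Assign new documents to shards round-robin and remember the mapping.

    Hypothetical helper for illustration; names and schema are assumptions.
    """

    def __init__(self, shard_urls, conn):
        self.shard_urls = list(shard_urls)
        # Endless 0, 1, ..., n-1, 0, 1, ... sequence of shard indexes.
        self._cycle = itertools.cycle(range(len(self.shard_urls)))
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS doc_shard ("
            "doc_id TEXT PRIMARY KEY, shard INTEGER NOT NULL)"
        )

    def shard_for(self, doc_id):
        """Return the shard for doc_id, assigning one if it is new."""
        row = self.conn.execute(
            "SELECT shard FROM doc_shard WHERE doc_id = ?", (doc_id,)
        ).fetchone()
        if row is not None:
            # Known document: an update must go to the shard that holds it.
            return self.shard_urls[row[0]]
        # New document: take the next shard in the rotation and record it.
        shard = next(self._cycle)
        self.conn.execute(
            "INSERT INTO doc_shard (doc_id, shard) VALUES (?, ?)",
            (doc_id, shard),
        )
        self.conn.commit()
        return self.shard_urls[shard]
```

The lookup before assignment is what makes updates safe: a document that was 
already indexed always goes back to its original shard, and only new documents 
advance the rotation.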

>>Do you have load balancer between the front SOLR (or front entity) and shards,

As for load balancing and which shard acts as the head (front) shard: again, 
our app layer just randomly picks one of the shards to be the head shard. We 
originally were going to run tests to determine whether it was better to have 
one dedicated machine configured as the head shard, but never got around to 
that. We have a very low query request rate, so we haven't had to look 
seriously at load balancing.
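
A rough Python sketch of what "randomly pick a head shard" could look like in 
an app layer follows. The host names and the build_distributed_query function 
are hypothetical; the only Solr-specific part is the standard "shards" request 
parameter used by distributed search, which lists every shard as host:port/path 
without a scheme.

```python
import random
from urllib.parse import urlencode

# Hypothetical shard addresses, in the host:port/path form that the
# Solr "shards" parameter expects.
SHARDS = [
    "solr1.example.org:8983/solr",
    "solr2.example.org:8983/solr",
    "solr3.example.org:8983/solr",
]


def build_distributed_query(query, shards=SHARDS, rng=random):
    """Pick a random head shard and build a distributed select URL."""
    # Any shard can act as the head: it fans the query out to every
    # shard listed in "shards" and merges the results.
    head = rng.choice(shards)
    params = urlencode({"q": query, "shards": ",".join(shards)})
    return "http://%s/select?%s" % (head, params)
```

With a low query rate this spreads the merging work across machines without 
any separate load balancer; a dedicated head machine would simply replace the 
random choice with a fixed address.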

>>do you do merging? 

I'm not sure what you mean by "do you do merging". We are just using the 
default Solr distributed search. In theory our documents should be randomly 
distributed among the shards, so the lack of global idf should not hurt the 
merging process.  Andrzej Bialecki gave a recent presentation on Solr 
distributed search that talks about less than optimal results merging and some 
ideas for dealing with it:
http://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/AndrzejBialecki-Buzzwords-2011_0.pdf

>>Each shard currently is allocated max 12GB memory. 
I'm curious about how much memory you leave to the OS for disk caching. Can 
you give any details about the number of shards per machine and the total 
memory on the machine?


Tom Burton-West
 http://www.hathitrust.org/blogs/large-scale-search



________________________________________
From: Dmitry Kan [dmitry....@gmail.com]
Sent: Tuesday, June 14, 2011 2:15 PM
To: solr-user@lucene.apache.org
Subject: Re: huge shards (300GB each) and load balancing

Hi Tom,

Thanks a lot for sharing this. We have about half a terabyte total index
size, and we have split our index over 10 shards (horizontal scaling, no
replication). Each shard currently is allocated max 12GB memory. We use
facet search a lot and non-facet search with parameter values generated by
facet search (hence more focused search that hits a small portion of solr
documents).

The parameters you have mentioned -- termInfosIndexDivisor and
termIndexInterval -- are not found in the solr 1.4.1 config|schema. Are you
using SOLR 3.1? Did you do logical sharding or document hash based? Do
you have load balancer between the front SOLR (or front entity) and shards,
do you do merging?


