Hi,
I have a question about the optimal way to distribute Solr indexes across a 
cloud.  I have a small number of collections (fewer than 10) and a small 
cluster (6 nodes), but each node has several disks, 5 of which I am using for 
my Solr indexes.  The cluster is also a Hadoop cluster, so the disks are not 
RAIDed; they are JBOD.  So, on my 5 slave nodes, each with 5 disks, I was 
thinking of putting one shard of each collection on each disk.  This means I 
end up with 25 shards per collection, and if I had 10 collections, that would 
make 250 shards total.  Given that Solr 4 supports multiple cores per instance, 
my first thought was to try one JVM per node: with 10 collections and 5 disks 
per node, each JVM would host 50 shards.
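
For reference, I was planning to create each collection with the Collections 
API, something like this (hostname and names are just placeholders, and I 
realize pinning each core's data directory to a particular disk is a separate 
step):

  curl 'http://node1:8983/solr/admin/collections?action=CREATE&name=collection2&numShards=25&replicationFactor=1&maxShardsPerNode=5&collection.configName=myconf'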

So, I set up my first collection, with a modest 20M documents, and everything 
seems to work fine.  But the subsequent collections I have added are having 
issues.  The first is that every time I query for the document count 
(*:* with rows=0), a different number of documents is returned; the number can 
differ by as much as 10%.  However, if I query each shard individually 
(setting distrib=false), the number returned is always consistent.
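
Concretely, these are the two queries I am comparing (hostname and core name 
are placeholders):

  # distributed count: the total differs between runs
  curl 'http://node1:8983/solr/collection2/select?q=*:*&rows=0&wt=json'

  # single-shard count: always consistent
  curl 'http://node1:8983/solr/collection2_shard1_replica1/select?q=*:*&rows=0&distrib=false&wt=json'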

I am not entirely sure this is related, as I may have missed a step in my 
setup of the subsequent collections (bootstrapping the config).
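
The step I mean is uploading the config to ZooKeeper for each additional 
collection; I believe it should look roughly like this (zkhost, paths, and 
config name are placeholders):

  ./example/cloud-scripts/zkcli.sh -zkhost zk1:2181 -cmd upconfig -confdir ./collection2/conf -confname collection2conf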

But, more related to the architecture question: is it better to have one JVM 
per disk, one JVM per shard, or one JVM per node?  Given that the indexes are 
memory-mapped (MMapDirectory), how does memory play into the question?  There 
is a blog post 
(http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html) that 
recommends minimizing the JVM heap and maximizing the amount of OS-level file 
cache, but how does that impact sorting / boosting?
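
In other words, if I follow that advice I would start each node with a 
relatively small heap and leave the rest of the RAM to the OS page cache, 
e.g. (heap size, paths, and zkHost are just illustrative):

  java -Xms4g -Xmx4g -Dsolr.solr.home=/data1/solr -DzkHost=zk1:2181,zk2:2181,zk3:2181 -jar start.jar

But my understanding is that sorting and boosting rely on the Lucene 
FieldCache, which lives on the heap, which is why I am asking how small a 
heap is actually safe.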

Sorry if I have missed some documentation: I have been through the cloud 
tutorial a couple of times, and I didn't see any discussion of these issues.

Thanks,
Dave
