Hi, I think here you want a single JVM per server - no need for multiple JVMs, a JVM per collection, and such. If you can spread data over more than one disk on each of your servers, great, that will help. Re data loss - yes, you really should just be using replication. Sharding a ton will limit how much data a single failure takes out if you have no replication, but it could actually decrease performance. Also, if you have lots of small shards, say 250, and only 5 servers, doesn't one server dying mean you lose 50 shards - 20% of your data?
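For what it's worth, here is a rough SolrJ sketch of creating a collection with replicas instead of piling on shards. It's just an illustration - it assumes a Solr 4 build whose Collections API CREATE action accepts numShards and replicationFactor, and the host, port, and collection name below are placeholders:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class CreateReplicatedCollection {
    public static void main(String[] args) throws Exception {
        // Any node in the cluster can take Collections API calls (placeholder host/port).
        HttpSolrServer solr = new HttpSolrServer("http://host1:8983/solr");

        // CREATE a collection with 5 shards and 2 copies of each shard.
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("action", "CREATE");
        params.set("name", "collection2");       // placeholder collection name
        params.set("numShards", "5");            // e.g. one shard per server
        params.set("replicationFactor", "2");    // two copies of every shard

        QueryRequest create = new QueryRequest(params);
        create.setPath("/admin/collections");    // Collections API endpoint
        solr.request(create);

        solr.shutdown();
    }
}

With replicationFactor=2, and the two copies of each shard landing on different nodes (which Solr tries to arrange), losing one server leaves the other copy of each of its shards serving queries, instead of 20% of the data going away.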
Otis
--
Solr Performance Monitoring - http://sematext.com/spm/index.html
Search Analytics - http://sematext.com/search-analytics/index.html

On Thu, Nov 15, 2012 at 8:18 PM, Buttler, David <buttl...@llnl.gov> wrote:
> The main reason to split a collection into 25 shards is to reduce the impact of the loss of a disk. I was running an older version of Solr, a disk went down, and my entire collection was offline. Solr 4 offers shards.tolerant to reduce the impact of the loss of a disk: fewer documents will be returned. Obviously, I could replicate the data so that I wouldn't lose any documents while I replace my disk, but since I am already storing the original data in HDFS (with 3x replication), adding additional replication for Solr eats into my disk budget a bit too much.
>
> Also, my other collections have larger amounts of data / numbers of documents. For every TB of raw data, how much disk space do I want to be using? As little as possible. Drives are cheap, but not free. And nodes only hold so many drives.
>
> Dave
>
> -----Original Message-----
> From: Upayavira [mailto:u...@odoko.co.uk]
> Sent: Thursday, November 15, 2012 4:37 PM
> To: solr-user@lucene.apache.org
> Subject: Re: cores shards and disks in SolrCloud
>
> Personally I see no benefit to having more than one JVM per node; cores can handle it. I would say that splitting a 20M index into 25 shards strikes me as serious overkill, unless you expect to expand significantly. 20M would likely be okay with two or three shards. You can store the indexes for each core on different disks, which can give some performance benefit.
>
> Just some thoughts.
>
> Upayavira
>
> On Thu, Nov 15, 2012, at 11:04 PM, Buttler, David wrote:
> > Hi,
> > I have a question about the optimal way to distribute Solr indexes across a cloud. I have a small number of collections (fewer than 10) and a small cluster (6 nodes), but each node has several disks - 5 of which I am using for my Solr indexes. The cluster is also a Hadoop cluster, so the disks are not RAIDed; they are JBOD. So, on my 5 slave nodes, each with 5 disks, I was thinking of putting one shard per collection on each disk. This means I end up with 25 shards per collection. If I had 10 collections, that would make 250 shards total. Given that Solr 4 supports multi-core, my first thought was to try one JVM per node: with 10 collections, that means each JVM would contain 50 shards.
> >
> > So, I set up my first collection, with a modest 20M documents, and everything seems to work fine. But now the subsequent collections I have added are having issues. The first is that every time I query for the document count (*:* with rows=0), a different number of documents is returned. The number can differ by as much as 10%. Now if I query each shard individually (setting distrib=false), the number returned is always consistent.
> >
> > I am not entirely sure this is related, as I may have missed a step in my setup of the subsequent collections (bootstrapping the config).
> >
> > But, more related to the architecture question: is it better to have one JVM per disk, one JVM per shard, or one JVM per node? Given that the indexes are MMapped, how does memory play into the question? There is a blog post (http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html) that recommends minimizing the amount of JVM memory and maximizing the amount of OS-level file cache, but how does that impact sorting / boosting?
> >
> > Sorry if I have missed some documentation: I have been through the cloud tutorial a couple of times, and I didn't see any discussion of these issues.
> >
> > Thanks,
> > Dave
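P.S. David, re the *:* count changing from query to query: one sanity check is to compare the distributed numFound against the sum of per-core counts taken with distrib=false. Here's a rough SolrJ sketch - the ZooKeeper hosts, collection name, and core URLs are placeholders you'd swap for your own, with one entry per shard:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class CountCheck {
    public static void main(String[] args) throws Exception {
        // Distributed count through the ZooKeeper-aware client (zkHost is a placeholder).
        CloudSolrServer cloud = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        cloud.setDefaultCollection("collection2");

        SolrQuery matchAll = new SolrQuery("*:*");
        matchAll.setRows(0);
        long distributed = cloud.query(matchAll).getResults().getNumFound();

        // Per-core counts with distrib=false; core URLs are placeholders.
        String[] coreUrls = {
            "http://host1:8983/solr/collection2_shard1_replica1",
            "http://host2:8983/solr/collection2_shard2_replica1"
            // ... one entry per shard
        };
        long sumOfCores = 0;
        for (String url : coreUrls) {
            HttpSolrServer core = new HttpSolrServer(url);
            SolrQuery local = new SolrQuery("*:*");
            local.setRows(0);
            local.set("distrib", "false");       // count only this core
            sumOfCores += core.query(local).getResults().getNumFound();
            core.shutdown();
        }

        System.out.println("distributed=" + distributed + ", sum of cores=" + sumOfCores);
        cloud.shutdown();
    }
}

If the per-core sum stays stable while the distributed total bounces around, that at least tells you the problem is in how the distributed query is hitting the shards (e.g. the cluster state for those later collections) rather than in the shards themselves.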