Hi, I think here you want a single JVM per server - no need for multiple JVMs, a JVM per collection, and such. If you can spread data over more than one disk on each of your servers, great, that will help. Re data loss - yes, you really should just be using replication. Sharding a ton will limit how much data a single failure takes out if you have no replication, but it could actually decrease performance. Also, if you have lots of small shards, say 250, and only 5 servers, doesn't one server dying mean you lose 50 shards - 20% of your data?
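For what it's worth, here is a rough SolrJ sketch of creating a collection with replicas instead of piling on shards. It's just an illustration - it assumes a Solr 4 build whose Collections API CREATE action accepts numShards and replicationFactor, and the host, port, and collection name below are placeholders:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class CreateReplicatedCollection {
    public static void main(String[] args) throws Exception {
        // Any node in the cluster can take Collections API calls (placeholder host/port).
        HttpSolrServer solr = new HttpSolrServer("http://host1:8983/solr");

        // CREATE a collection with 5 shards and 2 copies of each shard.
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("action", "CREATE");
        params.set("name", "collection2");       // placeholder collection name
        params.set("numShards", "5");            // e.g. one shard per server
        params.set("replicationFactor", "2");    // two copies of every shard

        QueryRequest create = new QueryRequest(params);
        create.setPath("/admin/collections");    // Collections API endpoint
        solr.request(create);

        solr.shutdown();
    }
}

With replicationFactor=2, and the two copies of each shard landing on different nodes (which Solr tries to arrange), losing one server leaves the other copy of each of its shards serving queries, instead of 20% of the data going away.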
Otis
--
Solr Performance Monitoring - http://sematext.com/spm/index.html
Search Analytics - http://sematext.com/search-analytics/index.html

On Thu, Nov 15, 2012 at 8:18 PM, Buttler, David <buttl...@llnl.gov> wrote:
> The main reason to split a collection into 25 shards is to reduce the impact of the loss of a disk. I was running an older version of Solr, a disk went down, and my entire collection was offline. Solr 4 offers shards.tolerant to reduce the impact of the loss of a disk: fewer documents will be returned. Obviously, I could replicate the data so that I wouldn't lose any documents while I replace my disk, but since I am already storing the original data in HDFS (with 3x replication), adding additional replication for Solr eats into my disk budget a bit too much.
>
> Also, my other collections have larger amounts of data / numbers of documents. For every TB of raw data, how much disk space do I want to be using? As little as possible. Drives are cheap, but not free. And nodes only hold so many drives.
>
> Dave
>
> -----Original Message-----
> From: Upayavira [mailto:u...@odoko.co.uk]
> Sent: Thursday, November 15, 2012 4:37 PM
> To: solr-user@lucene.apache.org
> Subject: Re: cores shards and disks in SolrCloud
>
> Personally I see no benefit to having more than one JVM per node; cores can handle it. I would say that splitting a 20M index into 25 shards strikes me as serious overkill, unless you expect to expand significantly. 20M would likely be okay with two or three shards. You can store the indexes for each core on different disks, which can give some performance benefit.
>
> Just some thoughts.
>
> Upayavira
>
> On Thu, Nov 15, 2012, at 11:04 PM, Buttler, David wrote:
> > Hi,
> > I have a question about the optimal way to distribute Solr indexes across a cloud. I have a small number of collections (fewer than 10) and a small cluster (6 nodes), but each node has several disks - 5 of which I am using for my Solr indexes. The cluster is also a Hadoop cluster, so the disks are not RAIDed; they are JBOD. So, on my 5 slave nodes, each with 5 disks, I was thinking of putting one shard per collection on each disk. This means I end up with 25 shards per collection. If I had 10 collections, that would make 250 shards total. Given that Solr 4 supports multi-core, my first thought was to try one JVM per node: with 10 collections, that means each JVM would contain 50 shards.
> >
> > So, I set up my first collection, with a modest 20M documents, and everything seems to work fine. But now the subsequent collections I have added are having issues. The first is that every time I query for the document count (*:* with rows=0), a different number of documents is returned. The number can differ by as much as 10%. Now if I query each shard individually (setting distrib=false), the number returned is always consistent.
> >
> > I am not entirely sure this is related, as I may have missed a step in my setup of the subsequent collections (bootstrapping the config).
> >
> > But, more related to the architecture question: is it better to have one JVM per disk, one JVM per shard, or one JVM per node? Given that the indexes are MMapped, how does memory play into the question? There is a blog post (http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html) that recommends minimizing the amount of JVM memory and maximizing the amount of OS-level file cache, but how does that impact sorting / boosting?
> >
> > Sorry if I have missed some documentation: I have been through the cloud tutorial a couple of times, and I didn't see any discussion of these issues.
> >
> > Thanks,
> > Dave
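P.S. David, re the *:* count changing from query to query: one sanity check is to compare the distributed numFound against the sum of per-core counts taken with distrib=false. Here's a rough SolrJ sketch - the ZooKeeper hosts, collection name, and core URLs are placeholders you'd swap for your own, with one entry per shard:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class CountCheck {
    public static void main(String[] args) throws Exception {
        // Distributed count through the ZooKeeper-aware client (zkHost is a placeholder).
        CloudSolrServer cloud = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        cloud.setDefaultCollection("collection2");

        SolrQuery matchAll = new SolrQuery("*:*");
        matchAll.setRows(0);
        long distributed = cloud.query(matchAll).getResults().getNumFound();

        // Per-core counts with distrib=false; core URLs are placeholders.
        String[] coreUrls = {
            "http://host1:8983/solr/collection2_shard1_replica1",
            "http://host2:8983/solr/collection2_shard2_replica1"
            // ... one entry per shard
        };
        long sumOfCores = 0;
        for (String url : coreUrls) {
            HttpSolrServer core = new HttpSolrServer(url);
            SolrQuery local = new SolrQuery("*:*");
            local.setRows(0);
            local.set("distrib", "false");       // count only this core
            sumOfCores += core.query(local).getResults().getNumFound();
            core.shutdown();
        }

        System.out.println("distributed=" + distributed + ", sum of cores=" + sumOfCores);
        cloud.shutdown();
    }
}

If the per-core sum stays stable while the distributed total bounces around, that at least tells you the problem is in how the distributed query is hitting the shards (e.g. the cluster state for those later collections) rather than in the shards themselves.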