bq: I thought SolrCloud replicas were replication, and you imply parallel indexing
Absolutely! You couldn't get near-real-time indexing if you relied on replication a la Solr 3.x. And you also couldn't guarantee consistency.

Say you have 1 shard, a leader and a follower (i.e. 2 replicas). Now you throw a doc to be indexed. The sequence is:

leader gets the doc
leader forwards the doc to the follower
leader and follower both add the doc to their local index (and tlog)
follower acks back to leader
leader acks back to client

So yes, the raw document is forwarded to all replicas before the leader responds to the client, the docs all get written to the tlogs, etc. That's the only way to guarantee that if the leader goes down, the follower can take over without losing documents.

Best,
Erick

On Sun, Jan 25, 2015 at 6:15 PM, Dan Davis <dansm...@gmail.com> wrote:

> @Erick,
>
> Problem space is not constant indexing. I thought SolrCloud replicas were
> replication, and you imply parallel indexing. Good to know.
>
> On Sunday, January 25, 2015, Erick Erickson <erickerick...@gmail.com> wrote:
>
> > @Shawn: Cool table, thanks!
> >
> > @Dan:
> > Just to throw a different spin on it, if you migrate to SolrCloud, then
> > this question becomes moot as the raw documents are sent to each of the
> > replicas so you very rarely have to copy the full index. Kind of a
> > tradeoff between constant load because you're sending the raw documents
> > around whenever you index and peak usage when the index replicates.
> >
> > There are a bunch of other reasons to go to SolrCloud, but you know your
> > problem space best.
> >
> > FWIW,
> > Erick
> >
> > On Sun, Jan 25, 2015 at 9:26 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> >
> > > On 1/24/2015 10:56 PM, Dan Davis wrote:
> > > > When I polled the various projects already using Solr at my
> > > > organization, I was greatly surprised that none of them were using
> > > > Solr replication, because they had talked about "replicating" the data.
> > > >
> > > > But we are not Pinterest, and do not expect to be taking in changes one
> > > > post at a time (at least the engineers don't - just wait until it's used
> > > > for a CRUD app that wants full-text search on a description field!).
> > > > Still, rsync can be very, very fast with the right options (-W for
> > > > gigabit ethernet, and maybe -S for sparse files). I've clocked it at
> > > > 48 MB/s over GigE previously.
> > > >
> > > > Does anyone have any numbers for how fast Solr replication goes, and
> > > > what to do to tune it?
> > > >
> > > > I'm not enthusiastic to give up recently tested cluster stability for a
> > > > home-grown mess, but I am interested in numbers that are out there.
> > >
> > > Numbers are included on the Solr replication wiki page, both in graph
> > > and numeric form. Gathering these numbers must have been pretty easy --
> > > before the HTTP replication made it into Solr, Solr used to contain an
> > > rsync-based implementation.
> > >
> > > http://wiki.apache.org/solr/SolrReplication#Performance_numbers
> > >
> > > Other data on that wiki page discusses the replication config. There's
> > > not a lot to tune.
> > >
> > > I run a redundant non-SolrCloud index myself through a different method
> > > -- my indexing program indexes each index copy completely independently.
> > > There is no replication. This separation allows me to upgrade any
> > > component, or change any part of solrconfig or schema, on either copy of
> > > the index without affecting the other copy at all. With replication, if
> > > something is changed on the master or the slave, you might find that the
> > > slave no longer works, because it will be handling an index created by
> > > different software or a different config.
> > >
> > > Thanks,
> > > Shawn
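
For anyone who wants to see the indexing sequence from the top of this message (leader gets the doc, forwards it, both replicas write index and tlog, follower acks, leader acks the client) in runnable form, here is a minimal Python sketch. It is a toy model only, not Solr code: the names Replica, Leader, add_locally and index_doc are made up for illustration, everything runs in one process, and failure handling and concurrency are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Toy stand-in for one replica of a shard (hypothetical, not Solr code)."""
    name: str
    index: dict = field(default_factory=dict)  # doc id -> raw document
    tlog: list = field(default_factory=list)   # append-only transaction log

    def add_locally(self, doc: dict) -> bool:
        # Record the raw document in the tlog and the local index, then ack.
        self.tlog.append(doc)
        self.index[doc["id"]] = doc
        return True


@dataclass
class Leader(Replica):
    followers: list = field(default_factory=list)

    def index_doc(self, doc: dict) -> bool:
        # The leader forwards the raw document to every follower; each
        # follower writes it locally and acks back.
        acks = [follower.add_locally(doc) for follower in self.followers]
        # The leader also adds the document to its own index and tlog.
        self.add_locally(doc)
        # The client is acked only if every follower acked.
        return all(acks)


follower = Replica("follower")
leader = Leader("leader", followers=[follower])
acked = leader.index_doc({"id": "1", "title": "hello"})
```

Once index_doc has returned True, the follower holds the same document in its own index and tlog, which is why it could take over from a lost leader without losing anything the client was told had been indexed.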