Re: SolrCloud replication question

avenka Mon, 09 Jul 2012 10:23:14 -0700

Erick, thanks. I now do see segment files in an index.<timestamp> directory at 
the replicas. Not sure why they were not getting populated earlier.

I have a couple more questions. The second is more elaborate, so let me know if 
I should move it to a separate thread.

(1) The speed of adding documents in SolrCloud is excruciatingly slow. It takes 
about 30-50 seconds to add a batch of 100 documents (and about twice that to 
add 200, etc.) to the primary but just ~10 seconds to add 5K documents in 
batches of 200 on a standalone solr 4 server. The log files indicate that the 
primary is timing out with messages like below and Cloud->Graph in the UI shows 
the other two replicas in orange after starting green.
 org.apache.solr.client.solrj.SolrServerException: Timeout occured while 
waiting response from server at: http://localhost:7574/solr

Any idea why?
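
For scale, the figures in (1) can be turned into documents per second. This is 
only back-of-the-envelope arithmetic on the approximate timings quoted above 
(30-50 seconds per 100-document batch on SolrCloud, ~10 seconds for 5K 
documents standalone); the function name is just for illustration.

```python
def docs_per_second(docs: int, seconds: float) -> float:
    """Throughput for a given number of documents and elapsed time."""
    return docs / seconds

# SolrCloud primary: a batch of 100 documents takes about 30-50 seconds.
cloud_slow = docs_per_second(100, 50)
cloud_fast = docs_per_second(100, 30)

# Standalone server: about 10 seconds for 5K documents in batches of 200.
standalone = docs_per_second(5000, 10)

print(f"SolrCloud:  {cloud_slow:.1f}-{cloud_fast:.1f} docs/s")
print(f"Standalone: {standalone:.0f} docs/s")
print(f"Gap:        {standalone / cloud_fast:.0f}x-{standalone / cloud_slow:.0f}x")
```

On these numbers the gap is roughly two orders of magnitude, which is far more 
than the cost of indexing each document on two extra replicas would explain.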

(3) I am seriously considering using symbolic links for a replicated solr setup 
with completely independent instances on a *single machine*. Tell me if I am 
thinking about this incorrectly. Here is my reasoning: 

(a) Master/slave replication in 3.6 simply seems old school as it doesn't have 
the nice consistency properties of SolrCloud. Polling, say, every 20 seconds 
means I don't know exactly how up to date each replica is, which will 
complicate my request re-distribution.

(b) SolrCloud seems like a great alternative to master/slave replication. But 
it seems slow (see 1) and having played with it, I don't feel comfortable with 
the maturity of ZK integration (or my comprehension of it) in solr 4 alpha. 

(c) Symbolic links seem like the fastest and most space-efficient solution 
*provided* there is only a single writer, which is just fine for me. I plan to 
run completely separate solr instances with one designated as the primary and 
do the following operations in sequence:
  (i)   Add a batch to the primary and commit.
  (ii)  From each replica's index directory, remove all symlinks and re-create 
        symlinks to segment files in the primary (but not the write.lock file).
  (iii) Call update?commit=true to force replicas to re-load their in-memory 
        index.
  (iv)  Do whatever read-only processing is required on the batch using the 
        primary and all replicas by manually (randomly) distributing read 
        requests.
  (v)   Repeat the sequence.
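
The sequence in 3(c) could be scripted roughly as follows. This is a minimal 
sketch, not a tested tool: the function names and directory layout are 
hypothetical, it assumes every instance runs on the same machine so that 
symlinks resolve, and it assumes that a plain GET on update?commit=true is 
enough to make a replica re-open its index.

```python
import random
import urllib.request
from pathlib import Path
from typing import List

LOCK_FILE = "write.lock"  # the primary's lock must never be linked into a replica


def refresh_replica_links(primary_index: Path, replica_index: Path) -> List[str]:
    """Remove the replica's symlinks and re-link every primary file except the lock."""
    replica_index.mkdir(parents=True, exist_ok=True)
    for entry in replica_index.iterdir():
        if entry.is_symlink():
            entry.unlink()  # removes only the link, never the primary's file
    linked = []
    for src in sorted(primary_index.iterdir()):
        if src.name == LOCK_FILE or not src.is_file():
            continue
        (replica_index / src.name).symlink_to(src.resolve())
        linked.append(src.name)
    return linked


def commit_url(base_url: str) -> str:
    """Build the update?commit=true URL for one instance."""
    return base_url.rstrip("/") + "/update?commit=true"


def force_reload(base_url: str, timeout: float = 30.0) -> int:
    """Ask an instance to commit (assumed to re-open its searcher); returns the HTTP status."""
    with urllib.request.urlopen(commit_url(base_url), timeout=timeout) as resp:
        return resp.status


def pick_instance(base_urls: List[str]) -> str:
    """Choose the primary or a replica uniformly at random for a read request."""
    return random.choice(base_urls)
```

After each commit on the primary, one would call refresh_replica_links() and 
then force_reload() once per replica, and route reads through pick_instance(). 
Note that the sketch does nothing about the window between removing and 
re-creating the links, during which a replica's index directory is incomplete.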

Is there any downside to 3(c) (other than maintaining a trivial script to 
manage symlinks and call commit)? I tested it on small index sizes and it seems 
to work fine. The throughput improves with more replicas (for 2-4 replicas) as 
a single replica is not enough to saturate the machine (due to high query 
latency). Am I overlooking something in this setup?

Overall, I need high throughput and minimal latency from the time a document is 
added to the time it is available at a replica. SolrCloud's automated request 
redirection, consistency, and fault-tolerance are awesome for a physically 
distributed setup, but I don't see how it beats 3(c) in a single-writer, 
single-machine, replicated setup.

AV

On Jul 9, 2012, at 9:43 AM, Erick Erickson [via Lucene] wrote:

> No, you're misunderstanding the setup. Each replica has a complete 
> index. Updates get automatically forwarded to _both_ nodes for a 
> particular shard. So, when a doc comes in to be indexed, it gets 
> sent to the leader for, say, shard1. From there: 
> 1> it gets indexed on the leader 
> 2> it gets forwarded to the replica(s) where it gets indexed locally. 
> 
> Each replica has a complete index (for that shard). 
> 
> There is no master/slave setup any more. And you do 
> _not_ have to configure replication. 
> 
> Best 
> Erick 
> 
> On Sun, Jul 8, 2012 at 1:03 PM, avenka <[hidden email]> wrote:
> 
> > I am trying to wrap my head around replication in SolrCloud. I tried the 
> > setup at http://wiki.apache.org/solr/SolrCloud/. I mainly need replication 
> > for high query throughput. The setup at the URL above appears to maintain 
> > just one copy of the index at the primary node (instead of a replicated 
> > index as in a master/slave configuration). Will I still get roughly an 
> > n-fold increase in query throughput with n replicas? And if so, why would 
> > one do master/slave replication with multiple copies of the index at all? 
> > 
> > -- 
> > View this message in context: 
> > http://lucene.472066.n3.nabble.com/SolrCloud-replication-question-tp3993761.html
> > Sent from the Solr - User mailing list archive at Nabble.com. 
> 
> 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-replication-question-tp3993761p3993960.html
Sent from the Solr - User mailing list archive at Nabble.com.
