Hm, are you sure this isn't a network/switch/disk problem rather than a Solr 
one?  Also, precisely because you have such a large index I'd avoid 
optimizing it and then replicating it: an optimize rewrites the index into a 
few huge segments, so the next rsync has to move close to the full ~60GB 
instead of a small incremental delta.  My wild guess is that simply rsyncing 
this much data over the network kills your machines.  Have you tried 
manually doing the rsync and watching the machine/switches/NICs/disks to see 
what's going on?  That's what I'd do.
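
As a starting point, something along these lines (the rsync module name and 
port here assume the stock Solr rsyncd-* scripts; the snapshot name and 
paths are just examples, adjust to your setup):

  # on the slave: pull one snapshot by hand, throttled, to see whether
  # --bwlimit alone makes the freeze go away
  rsync -av --delete --bwlimit=10000 \
      rsync://master:18983/solr/snapshot.20090521233922/ \
      /mnt/solr/data/snapshot.20090521233922/

  # meanwhile, in other terminals, watch disk and network:
  iostat -x 5      # per-device utilization and wait times
  vmstat 5         # memory pressure / swapping
  sar -n DEV 5     # NIC throughput

If a throttled pull survives but an unthrottled one wedges the box, that 
points at I/O or network saturation rather than Solr itself.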


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Kyle Lau <k...@biz360.com>
> To: solr-user@lucene.apache.org
> Sent: Friday, May 22, 2009 7:54:53 PM
> Subject: solr machine freeze up during first replication after optimization
> 
> Hi all,
> 
> We recently started running into a problem where our Solr slave servers
> freeze up.  Judging from the logs and the timing of the occurrences, the
> problem always follows the first replication after an optimization.  Once a
> server freezes up we can no longer ssh into it, though it still answers
> ping.  The only way to recover is to reboot the machine.
> 
> In our replication setup, the masters are optimized nightly because we have
> a fairly large index (~60GB per master) and are adding millions of documents
> every day.  After the optimization, a snapshot is taken automatically.  When
> replication kicks in, the corresponding slave retrieves the snapshot using
> rsync.
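> 
> For context, the cycle boils down to roughly this (simplified; the port and
> script paths here are illustrative, following the stock Solr bin/ layout):
> 
>   # master, nightly cron: optimize the index; a postOptimize hook in
>   # solrconfig.xml then runs snapshooter to take the snapshot
>   curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' \
>       --data-binary '<optimize/>'
> 
>   # slave, hourly cron
>   /mnt/solr/bin/snappuller       # rsync the newest snapshot from the master
>   /mnt/solr/bin/snapinstaller    # swap it in and trigger a commit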
> 
> Here is the snappuller.log capturing one of the failed pulls, with a
> successful pull before and after it:
> 
> 2009/05/21 22:55:01 started by biz360
> 2009/05/21 22:55:01 command: /mnt/solr/bin/snappuller ...
> 2009/05/21 22:55:04 pulling snapshot snapshot.20090521221402
> 2009/05/21 22:55:11 ended (elapsed time: 10 sec)
> 
> ##### optimization completes sometime during this gap, and a new snapshot is
> created
> 
> 2009/05/21 23:55:01 started by biz360
> 2009/05/21 23:55:01 command: /mnt/solr/bin/snappuller ...
> 2009/05/21 23:55:02 pulling snapshot snapshot.20090521233922
> 
> ##### slave freezes up, and machine has to be rebooted
> 
> 2009/05/22 01:55:02 started by biz360
> 2009/05/22 01:55:02 command: /mnt/solr/bin/snappuller ...
> 2009/05/22 01:55:03 pulling snapshot snapshot.20090522014528
> 2009/05/22 02:56:12 ended (elapsed time: 3670 sec)
> 
> 
> A more detailed debug log shows snappuller simply stopped at some point:
> 
> started by biz360
> command: /mnt/solr/bin/snappuller ...
> pulling snapshot snapshot.20090521233922
> receiving file list ... done
> deleting segments_16a
> deleting _cwu.tis
> deleting _cwu.tii
> deleting _cwu.prx
> deleting _cwu.nrm
> deleting _cwu.frq
> deleting _cwu.fnm
> deleting _cwt.tis
> deleting _cwt.tii
> deleting _cwt.prx
> deleting _cwt.nrm
> deleting _cwt.frq
> deleting _cwt.fnm
> deleting _cws.tis
> deleting _cws.tii
> deleting _cws.prx
> deleting _cws.nrm
> deleting _cws.frq
> deleting _cws.fnm
> deleting _cwr_1.del
> deleting _cwr.tis
> deleting _cwr.tii
> deleting _cwr.prx
> deleting _cwr.nrm
> deleting _cwr.frq
> deleting _cwr.fnm
> deleting _cwq.tis
> deleting _cwq.tii
> deleting _cwq.prx
> deleting _cwq.nrm
> deleting _cwq.frq
> deleting _cwq.fnm
> deleting _cwq.fdx
> deleting _cwq.fdt
> deleting _cwp.tis
> deleting _cwp.tii
> deleting _cwp.prx
> deleting _cwp.nrm
> deleting _cwp.frq
> deleting _cwp.fnm
> deleting _cwp.fdx
> deleting _cwp.fdt
> deleting _cwo_1.del
> deleting _cwo.tis
> deleting _cwo.tii
> deleting _cwo.prx
> deleting _cwo.nrm
> deleting _cwo.frq
> deleting _cwo.fnm
> deleting _cwo.fdx
> deleting _cwo.fdt
> deleting _cwe_1.del
> deleting _cwe.tis
> deleting _cwe.tii
> deleting _cwe.prx
> deleting _cwe.nrm
> deleting _cwe.frq
> deleting _cwe.fnm
> deleting _cwe.fdx
> deleting _cwe.fdt
> deleting _cw2_3.del
> deleting _cw2.tis
> deleting _cw2.tii
> deleting _cw2.prx
> deleting _cw2.nrm
> deleting _cw2.frq
> deleting _cw2.fnm
> deleting _cw2.fdx
> deleting _cw2.fdt
> deleting _cvs_4.del
> deleting _cvs.tis
> deleting _cvs.tii
> deleting _cvs.prx
> deleting _cvs.nrm
> deleting _cvs.frq
> deleting _cvs.fnm
> deleting _cvs.fdx
> deleting _cvs.fdt
> deleting _csp_h.del
> deleting _csp.tis
> deleting _csp.tii
> deleting _csp.prx
> deleting _csp.nrm
> deleting _csp.frq
> deleting _csp.fnm
> deleting _csp.fdx
> deleting _csp.fdt
> deleting _cpn_q.del
> deleting _cpn.tis
> deleting _cpn.tii
> deleting _cpn.prx
> deleting _cpn.nrm
> deleting _cpn.frq
> deleting _cpn.fnm
> deleting _cpn.fdx
> deleting _cpn.fdt
> deleting _cmk_x.del
> deleting _cmk.tis
> deleting _cmk.tii
> deleting _cmk.prx
> deleting _cmk.nrm
> deleting _cmk.frq
> deleting _cmk.fnm
> deleting _cmk.fdx
> deleting _cmk.fdt
> deleting _cjg_14.del
> deleting _cjg.tis
> deleting _cjg.tii
> deleting _cjg.prx
> deleting _cjg.nrm
> deleting _cjg.frq
> deleting _cjg.fnm
> deleting _cjg.fdx
> deleting _cjg.fdt
> deleting _cge_19.del
> deleting _cge.tis
> deleting _cge.tii
> deleting _cge.prx
> deleting _cge.nrm
> deleting _cge.frq
> deleting _cge.fnm
> deleting _cge.fdx
> deleting _cge.fdt
> deleting _cd9_1m.del
> deleting _cd9.tis
> deleting _cd9.tii
> deleting _cd9.prx
> deleting _cd9.nrm
> deleting _cd9.frq
> deleting _cd9.fnm
> deleting _cd9.fdx
> deleting _cd9.fdt
> ./
> _cww.fdt
> 
> We have random Solr slaves failing in the exact same manner almost daily.
> Any help is appreciated!
