I'm getting intermittent issues with replication in my current arrangement: one master, 3 slaves; all the same SOLR version/war file deployment.
I update the master, which kicks off replication across the other three; however, they never seem to "finish". In the data/ folders I get an empty index.timestamp folder, and the admin page for replication shows it "stuck" pulling a file (no progress shown, just constant refresh). The index never changes over (always claiming to be out-of-date). Abort messages are ignored, both from the admin console and through a curl "abortfetch" request to the slaves. The searcher is still responsive and I can query, but the master's changes are of course not there.

If I kill my container (tomcat 6) and start it back up, magically the replication has "finished" and the slave is up-to-date. This leads me to believe something isn't finalizing the change over and opening a new searcher on the index (looks like the "main" index is actually being updated, but a close/open, or reopen, is not happening?)

The relevant thread dump when in that state seems to be:

snapPuller-7-thread-1 (12)
    java.util.concurrent.FutureTask$Sync@55ccfb48
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
    java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
    java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
    java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
    java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
    java.util.concurrent.FutureTask.get(FutureTask.java:83)
    org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:655)
    org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:466)
    org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:281)
    org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:223)
    java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
    java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
    java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
    java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
    java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
    java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
    java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    java.lang.Thread.run(Thread.java:662)

Tomcat containers are all the same; each of the slaves is running entirely alone on its own container, on separate machines.

SOLR reported versions:

    solr-spec    4.2.0.2013.03.01.10.10.50
    solr-impl    4.2-SNAPSHOT 1451604 - ensorn - 2013-03-01 10:10:50
    lucene-spec  4.2-SNAPSHOT
    lucene-impl  4.2-SNAPSHOT 1451604 - ensorn - 2013-03-01 10:02:53

Any help would be appreciated. This is getting very frustrating. To make things worse, I have set up a new "slave" on my work PC (Mac), and it replicates FLAWLESSLY with the same setup; the only difference is that the slaves on the servers are on a SAN array (not sure if locking is causing the heartburn?)

Any pointers would be great. This is obviously becoming a pain to work with, even with fairly infrequent replications.

Thanks!

Neal Ensor
nen...@gmail.com