1) Hmm, maybe, I didn't notice that... but then I'd be very confused as to why it works occasionally, and why manual replication (through Solr Admin) always works OK in that case.

2) This was my initial thought. It was happening on one core (multiple commits while replication was in progress), but I then noticed it happening on another core (the one mentioned below), which only had 1 commit and a single generation change (11 to 12) to replicate.
I too hoped and presumed that the Master is locked while replication is copying files... can anyone confirm this? We are using the native lock type on a Windows/Tomcat server.

Is anyone aware of any reason why replication would skip files, or fail to copy/find files, other than (presumably) a commit or optimize re-chunking the segments and deleting them on the Master?

-----Original Message-----
From: Jaeger, Jay - DOT [mailto:jay.jae...@dot.wi.gov]
Sent: 25 October 2011 20:48
To: solr-user@lucene.apache.org
Subject: RE: Replication issues with multiple Slaves

I noted that in these messages the left hand side is lower case collection, but the right hand side is upper case Collection. Assuming you did a cut/paste, could you have a core name mismatch between a master and a slave somehow?

Otherwise (shudder): could you be doing a commit while the replication is in progress, causing files to shift about on it? I'd have expected (perhaps naively) Solr to have some sort of lock to prevent such a problem. But if there is no internal lock, that would be a serious matter (and could happen to us, too, down the road).

JRJ

-----Original Message-----
From: Rob Nicholls [mailto:robst...@hotmail.com]
Sent: Tuesday, October 25, 2011 10:32 AM
To: solr-user@lucene.apache.org
Subject: Replication issues with multiple Slaves

Hey guys,

We have a Master (1 server) and 2 Slaves (2 servers) set up and running replication across multiple cores. However, the replication appears to behave sporadically and often fails when left to replicate automatically via poll.
More often than not a replicate will fail after the slave has finished pulling down the segment files, because it cannot find a particular file, giving errors such as:

Oct 25, 2011 10:00:17 AM org.apache.solr.handler.SnapPuller copyAFile
SEVERE: Unable to move index file from: D:\web\solr\collection\data\index.20111025100000\_3u.tii to: D:\web\solr\Collection\data\index\_3u.tiiTrying to do a copy
SEVERE: Unable to copy index file from: D:\web\solr\collection\data\index.20111025100000\_3s.fdt to: D:\web\solr\Collection\data\index\_3s.fdt
java.io.FileNotFoundException: D:\web\solr\collection\data\index.20111025100000\_3s.fdt (The system cannot find the file specified)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(Unknown Source)
    at org.apache.solr.common.util.FileUtils.copyFile(FileUtils.java:47)
    at org.apache.solr.handler.SnapPuller.copyAFile(SnapPuller.java:585)
    at org.apache.solr.handler.SnapPuller.copyIndexFiles(SnapPuller.java:621)
    at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:317)
    at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:267)
    at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask$Sync.innerRunAndReset(Unknown Source)
    at java.util.concurrent.FutureTask.runAndReset(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

For these files, I checked the master, and they did indeed exist.
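Regarding the collection/Collection difference visible in these messages, a minimal Python sketch along the following lines could pull the "from" and "to" paths out of such log lines and report path components that differ only by case. The function names parse_paths and case_only_differences are made up for illustration, and the pattern assumes the paths contain no spaces. Windows file systems are normally case-insensitive, so a hit here may well be cosmetic rather than the cause.

```python
import re

# Matches the "from: <path> to: <path>" part of the SnapPuller messages.
# In the log, "Trying to do a copy" is glued onto the destination path,
# so it is stripped if present. Assumes paths contain no spaces.
PATH_PAIR = re.compile(
    r"from:\s*(?P<src>\S+)\s+to:\s*(?P<dst>\S+?)(?:Trying to do a copy)?$"
)


def parse_paths(line):
    """Return (source, destination) from one log line, or None if no match."""
    match = PATH_PAIR.search(line.strip())
    if match is None:
        return None
    return match.group("src"), match.group("dst")


def case_only_differences(src, dst):
    """List pairs of path components that differ only in letter case."""
    pairs = zip(src.split("\\"), dst.split("\\"))
    return [(a, b) for a, b in pairs if a != b and a.lower() == b.lower()]


if __name__ == "__main__":
    line = (
        r"SEVERE: Unable to copy index file from: "
        r"D:\web\solr\collection\data\index.20111025100000\_3s.fdt to: "
        r"D:\web\solr\Collection\data\index\_3s.fdt"
    )
    src, dst = parse_paths(line)
    print(case_only_differences(src, dst))
```

Run over a whole log, this would show whether the case difference appears on every failure or only on some cores.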
Both slave machines are configured the same, with the same replication settings and a 60-minute poll interval. Is it perhaps because both slave machines are trying to pull down files at the same time (and the other has a lock on the file, so it gets skipped, maybe)?

Note: If I manually force replication on each slave, one at a time, the replication always seems to work fine.

Are there any obvious explanations or oddities I should be aware of that may cause this?

Thanks,
Rob
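One way to test the "both slaves pulling at once" theory would be to script the manual, one-at-a-time replication described above. The following is only a rough Python sketch: the host names and the core name in the usage comment are placeholders, and the helper names are invented for illustration. It relies on the ReplicationHandler's HTTP command=fetchindex call, which as far as I know returns before the copy has finished, so the wait hook is where a real script would need to block (for example by polling command=details).

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def replication_url(base_url, core, command):
    """Build a URL like <base>/<core>/replication?command=<command>."""
    query = urlencode({"command": command})
    return "%s/%s/replication?%s" % (base_url.rstrip("/"), core, query)


def http_get(url):
    """Plain HTTP GET returning the response body as text."""
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", "replace")


def replicate_one_at_a_time(slave_base_urls, core, get=http_get,
                            wait=lambda base_url: None):
    """Ask each slave in turn to fetch the index from its master."""
    responses = []
    for base_url in slave_base_urls:
        responses.append(get(replication_url(base_url, core, "fetchindex")))
        # Hook for waiting until this slave is done before moving on;
        # the default does nothing.
        wait(base_url)
    return responses


# Hypothetical usage (placeholder hosts and core name):
# replicate_one_at_a_time(
#     ["http://slave1:8080/solr", "http://slave2:8080/solr"], "collection")
```

If the failures disappear when the slaves are driven this way with automatic polling switched off (command=disablepoll), that would point towards the concurrent pulls; if they persist, the cause is more likely elsewhere.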