We are running "5.4.0 1718046 - upayavira - 2015-12-04 23:16:46" on a series of index replication slaves of a single master.
The master is behind a VPN connection to a slower network. There are times when that network might have timouts, and we need our applications to be robust against that type of temporary problem. Yesterday we had a timeout that lead to a Solr slave having 0 documents in its core on a production machine, which is pretty serious for us. Some relevant logs: root@havarti:/data/solr/solr5-var/logs# grep " ERROR " solr.log.* solr.log.2:2016-02-24 16:52:52.832 ERROR (indexFetcher-12-thread-1) [ x:cap2015] o.a.s.h.IndexFetcher Master at: http://localname-removed:8983/solr/cap2015 is not available. Index fetch failed. Exception: IOException occured when talking to server at: http://localname-removed:8983/solr/cap2015 root@havarti:/data/solr/solr5-var/logs# grep -vE "(\/select|\/replication)" solr.log.2 2016-02-24 16:52:52.832 ERROR (indexFetcher-12-thread-1) [ x:cap2015] o.a.s.h.IndexFetcher Master at: http://localname-removed:8983/solr/cap2015 is not available. Index fetch failed. Exception: IOException occured when talking to server at: http://localname-removed:8983/solr/cap2015 2016-02-24 16:53:52.093 INFO (indexFetcher-12-thread-1) [ x:cap2015] o.a.s.u.DirectUpdateHandler2 start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false} 2016-02-24 16:53:52.203 INFO (indexFetcher-12-thread-1) [ x:cap2015] o.a.s.c.SolrDeletionPolicy SolrDeletionPolicy.onCommit: commits: num=2 commit{dir=NRTCachingDirectory(MMapDirectory@/data/solr/solr5-var/data/cap2015/data/index.20160222200052555 lockFactory=org.apache.lucene.store.NativeFSLockFactory@4b8e51d2; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_ufj,generation=39439} commit{dir=NRTCachingDirectory(MMapDirectory@/data/solr/solr5-var/data/cap2015/data/index.20160222200052555 lockFactory=org.apache.lucene.store.NativeFSLockFactory@4b8e51d2; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_ufk,generation=39440} 2016-02-24 16:53:52.203 INFO (indexFetcher-12-thread-1) [ x:cap2015] o.a.s.c.SolrDeletionPolicy newest commit generation = 39440 2016-02-24 16:53:52.215 INFO (indexFetcher-12-thread-1) [ x:cap2015] o.a.s.s.SolrIndexSearcher Opening Searcher@4c16fd6[cap2015] main 2016-02-24 16:53:52.216 INFO (indexFetcher-12-thread-1) [ x:cap2015] o.a.s.u.DirectUpdateHandler2 end_commit_flush 2016-02-24 16:53:52.216 INFO (searcherExecutor-7-thread-1-processing-x:cap2015) [ x:cap2015] o.a.s.c.QuerySenderListener QuerySenderListener sending requests to Searcher@4c16fd6[cap2015] main{ExitableDirectoryReader(UninvertingDirectoryReader())} 2016-02-24 16:53:52.216 INFO (searcherExecutor-7-thread-1-processing-x:cap2015) [ x:cap2015] o.a.s.c.QuerySenderListener QuerySenderListener done. 2016-02-24 16:53:52.216 INFO (searcherExecutor-7-thread-1-processing-x:cap2015) [ x:cap2015] o.a.s.c.SolrCore [cap2015] Registered new searcher Searcher@4c16fd6[cap2015] main{ExitableDirectoryReader(UninvertingDirectoryReader())} root@havarti:/data/solr/solr5-var/logs# Is there some setting I am missing to prevent this problem? This doesn't appear to be normal behaviour, and is something I want to ensure that is prevented. After spending over 24hours replicating the index was loaded, only to have the index attempt to do a complete copy again a few minutes later after getting a similar temporary timout error. oot@havarti:/data/solr/solr5-var/logs# grep -vE "(\/select|\/replication|Slave in sync with master)" solr.log 2016-02-25 15:47:56.562 INFO (indexFetcher-12-thread-1) [ x:cap2015] o.a.s.h.IndexFetcher Total time taken for download (fullCopy=true,bytesDownloaded=98312405389) : 80512 secs (1221090 bytes/sec) to NRTCachingDirectory(MMapDirectory@/data/solr/solr5-var/data/cap2015/data/index.20160224172604265 lockFactory=org.apache.lucene.store.NativeFSLockFactory@44b0c913; maxCacheMB=48.0 maxMergeSizeMB=4.0) 2016-02-25 15:47:56.667 INFO (indexFetcher-12-thread-1) [ x:cap2015] o.a.s.h.IndexFetcher New index installed. Updating index properties... index=index.20160224172604265 2016-02-25 15:47:56.685 INFO (indexFetcher-12-thread-1) [ x:cap2015] o.a.s.h.IndexFetcher removing old index directory NRTCachingDirectory(MMapDirectory@/data/solr/solr5-var/data/cap2015/data/index.20160222200052555 lockFactory=org.apache.lucene.store.NativeFSLockFactory@44b0c913; maxCacheMB=48.0 maxMergeSizeMB=4.0) 2016-02-25 15:47:56.685 INFO (indexFetcher-12-thread-1) [ x:cap2015] o.a.s.u.DefaultSolrCoreState Rollback old IndexWriter... core=cap2015 2016-02-25 15:47:56.690 INFO (indexFetcher-12-thread-1) [ x:cap2015] o.a.s.c.SolrCore New index directory detected: old=/data/solr/solr5-var/data/cap2015/data/index.20160222200052555 new=/data/solr/solr5-var/data/cap2015/data/index.20160224172604265 2016-02-25 15:47:56.775 INFO (indexFetcher-12-thread-1) [ x:cap2015] o.a.s.c.SolrDeletionPolicy SolrDeletionPolicy.onInit: commits: num=1 commit{dir=NRTCachingDirectory(MMapDirectory@/data/solr/solr5-var/data/cap2015/data/index.20160224172604265 lockFactory=org.apache.lucene.store.NativeFSLockFactory@44b0c913; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_ufj,generation=39439} 2016-02-25 15:47:56.775 INFO (indexFetcher-12-thread-1) [ x:cap2015] o.a.s.c.SolrDeletionPolicy newest commit generation = 39439 2016-02-25 15:47:56.776 INFO (indexFetcher-12-thread-1) [ x:cap2015] o.a.s.u.DefaultSolrCoreState New IndexWriter is ready to be used. 2016-02-25 15:48:00.166 INFO (indexFetcher-12-thread-1) [ x:cap2015] o.a.s.s.SolrIndexSearcher Opening Searcher@1dda02b3[cap2015] main 2016-02-25 15:48:00.173 INFO (searcherExecutor-7-thread-1-processing-x:cap2015) [ x:cap2015] o.a.s.c.QuerySenderListener QuerySenderListener sending requests to Searcher@1dda02b3[cap2015] main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_4914(5.4.0):C48254424/3683:delGen=8) Uninverting(_491i(5.4.0):c326) Uninverting(_491j(5.4.0):C921) Uninverting(_491k(5.4.0):C1411) Uninverting(_491l(5.4.0):C734) Uninverting(_491m(5.4.0):C454)))} 2016-02-25 15:48:00.173 INFO (searcherExecutor-7-thread-1-processing-x:cap2015) [ x:cap2015] o.a.s.c.QuerySenderListener QuerySenderListener done. 2016-02-25 15:48:00.174 INFO (searcherExecutor-7-thread-1-processing-x:cap2015) [ x:cap2015] o.a.s.c.SolrCore [cap2015] Registered new searcher Searcher@1dda02b3[cap2015] main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_4914(5.4.0):C48254424/3683:delGen=8) Uninverting(_491i(5.4.0):c326) Uninverting(_491j(5.4.0):C921) Uninverting(_491k(5.4.0):C1411) Uninverting(_491l(5.4.0):C734) Uninverting(_491m(5.4.0):C454)))} 2016-02-25 15:48:00.175 INFO (indexFetcher-12-thread-1) [ x:cap2015] o.a.s.c.CachingDirectoryFactory looking to close /data/solr/solr5-var/data/cap2015/data/index.20160222200052555 [CachedDir<<refCount=0;path=/data/solr/solr5-var/data/cap2015/data/index.20160222200052555;done=true>>] 2016-02-25 15:48:00.175 INFO (indexFetcher-12-thread-1) [ x:cap2015] o.a.s.c.CachingDirectoryFactory Closing directory: /data/solr/solr5-var/data/cap2015/data/index.20160222200052555 2016-02-25 15:48:00.175 INFO (indexFetcher-12-thread-1) [ x:cap2015] o.a.s.c.CachingDirectoryFactory Removing directory before core close: /data/solr/solr5-var/data/cap2015/data/index.20160222200052555 2016-02-25 15:55:04.130 ERROR (indexFetcher-12-thread-1) [ x:cap2015] o.a.s.h.IndexFetcher Master at: http://localname-removed:8983/solr/cap2015 is not available. Index fetch failed. Exception: Timeout occured while waiting response from server at: http://localname-removed:8983/solr/cap2015 2016-02-25 15:55:07.188 INFO (indexFetcher-12-thread-1) [ x:cap2015] o.a.s.h.IndexFetcher Master's generation: 39439 2016-02-25 15:55:07.188 INFO (indexFetcher-12-thread-1) [ x:cap2015] o.a.s.h.IndexFetcher Slave's generation: 39439 2016-02-25 15:55:07.188 INFO (indexFetcher-12-thread-1) [ x:cap2015] o.a.s.h.IndexFetcher Starting replication process 2016-02-25 15:55:07.221 INFO (indexFetcher-12-thread-1) [ x:cap2015] o.a.s.h.IndexFetcher Number of files in latest index in master: 55 2016-02-25 15:55:07.223 INFO (indexFetcher-12-thread-1) [ x:cap2015] o.a.s.c.CachingDirectoryFactory return new directory for /data/solr/solr5-var/data/cap2015/data/index.20160225155507222 2016-02-25 15:55:07.231 INFO (indexFetcher-12-thread-1) [ x:cap2015] o.a.s.h.IndexFetcher Starting download (fullCopy=true) to NRTCachingDirectory(MMapDirectory@/data/solr/solr5-var/data/cap2015/data/index.20160225155507222 lockFactory=org.apache.lucene.store.NativeFSLockFactory@44b0c913; maxCacheMB=48.0 maxMergeSizeMB=4.0) root@havarti:/data/solr/solr5-var/logs# This again seems like odd behaviour that it will do a fullCopy=true on a slave that had only a timeout error (shouldn't even be an error, but a warning) and where the master and slave were on the same commit generation. In this case restarting Solr caused the redundant replication to stop, and for the correct core to be loaded. I've disabled Solr replication polling for the time being to avoid a repeat of the problem. We only started to use index replication recently, and I'm starting to wonder if we need to move away from Solr replication. We have an alternative which is to feed our updates to Solr from a CouchDB database where the replication is reliably handled by that database. Thoughts? -- System Administration and software developer, Canadiana.org http://www.canadiana.ca