Timeout connecting to index replication master causing slave core failure (0 documents).

Russell McOrmond Thu, 25 Feb 2016 08:52:34 -0800

We are running "5.4.0 1718046 - upayavira - 2015-12-04 23:16:46" on a
series of index replication slaves of a single master.


The master is behind a VPN connection to a slower network.  There are times
when that network might have timouts, and we need our applications to be
robust against that type of temporary problem.

Yesterday we had a timeout that lead to a Solr slave having 0 documents in
its core on a production machine, which is pretty serious for us.


Some relevant logs:

root@havarti:/data/solr/solr5-var/logs# grep " ERROR " solr.log.*
solr.log.2:2016-02-24 16:52:52.832 ERROR (indexFetcher-12-thread-1) [
x:cap2015] o.a.s.h.IndexFetcher Master at:
http://localname-removed:8983/solr/cap2015 is not available. Index fetch
failed. Exception: IOException occured when talking to server at:
http://localname-removed:8983/solr/cap2015
root@havarti:/data/solr/solr5-var/logs# grep -vE "(\/select|\/replication)"
solr.log.2
2016-02-24 16:52:52.832 ERROR (indexFetcher-12-thread-1) [   x:cap2015]
o.a.s.h.IndexFetcher Master at: http://localname-removed:8983/solr/cap2015
is not available. Index fetch failed. Exception: IOException occured when
talking to server at: http://localname-removed:8983/solr/cap2015
2016-02-24 16:53:52.093 INFO  (indexFetcher-12-thread-1) [   x:cap2015]
o.a.s.u.DirectUpdateHandler2 start
commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
2016-02-24 16:53:52.203 INFO  (indexFetcher-12-thread-1) [   x:cap2015]
o.a.s.c.SolrDeletionPolicy SolrDeletionPolicy.onCommit: commits: num=2
    
commit{dir=NRTCachingDirectory(MMapDirectory@/data/solr/solr5-var/data/cap2015/data/index.20160222200052555
lockFactory=org.apache.lucene.store.NativeFSLockFactory@4b8e51d2;
maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_ufj,generation=39439}
    
commit{dir=NRTCachingDirectory(MMapDirectory@/data/solr/solr5-var/data/cap2015/data/index.20160222200052555
lockFactory=org.apache.lucene.store.NativeFSLockFactory@4b8e51d2;
maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_ufk,generation=39440}
2016-02-24 16:53:52.203 INFO  (indexFetcher-12-thread-1) [   x:cap2015]
o.a.s.c.SolrDeletionPolicy newest commit generation = 39440
2016-02-24 16:53:52.215 INFO  (indexFetcher-12-thread-1) [   x:cap2015]
o.a.s.s.SolrIndexSearcher Opening Searcher@4c16fd6[cap2015] main
2016-02-24 16:53:52.216 INFO  (indexFetcher-12-thread-1) [   x:cap2015]
o.a.s.u.DirectUpdateHandler2 end_commit_flush
2016-02-24 16:53:52.216 INFO
 (searcherExecutor-7-thread-1-processing-x:cap2015) [   x:cap2015]
o.a.s.c.QuerySenderListener QuerySenderListener sending requests to
Searcher@4c16fd6[cap2015]
main{ExitableDirectoryReader(UninvertingDirectoryReader())}
2016-02-24 16:53:52.216 INFO
 (searcherExecutor-7-thread-1-processing-x:cap2015) [   x:cap2015]
o.a.s.c.QuerySenderListener QuerySenderListener done.
2016-02-24 16:53:52.216 INFO
 (searcherExecutor-7-thread-1-processing-x:cap2015) [   x:cap2015]
o.a.s.c.SolrCore [cap2015] Registered new searcher Searcher@4c16fd6[cap2015]
main{ExitableDirectoryReader(UninvertingDirectoryReader())}
root@havarti:/data/solr/solr5-var/logs#


Is there some setting I am missing to prevent this problem?  This doesn't
appear to be normal behaviour, and is something I want to ensure that is
prevented.


After spending over 24hours replicating the index was loaded, only to have
the index attempt to do a complete copy again a few minutes later after
getting a similar temporary timout error.


oot@havarti:/data/solr/solr5-var/logs# grep -vE
"(\/select|\/replication|Slave in sync with master)" solr.log
2016-02-25 15:47:56.562 INFO  (indexFetcher-12-thread-1) [   x:cap2015]
o.a.s.h.IndexFetcher Total time taken for download
(fullCopy=true,bytesDownloaded=98312405389) : 80512 secs (1221090
bytes/sec) to 
NRTCachingDirectory(MMapDirectory@/data/solr/solr5-var/data/cap2015/data/index.20160224172604265
lockFactory=org.apache.lucene.store.NativeFSLockFactory@44b0c913;
maxCacheMB=48.0 maxMergeSizeMB=4.0)
2016-02-25 15:47:56.667 INFO  (indexFetcher-12-thread-1) [   x:cap2015]
o.a.s.h.IndexFetcher New index installed. Updating index properties...
index=index.20160224172604265
2016-02-25 15:47:56.685 INFO  (indexFetcher-12-thread-1) [   x:cap2015]
o.a.s.h.IndexFetcher removing old index directory
NRTCachingDirectory(MMapDirectory@/data/solr/solr5-var/data/cap2015/data/index.20160222200052555
lockFactory=org.apache.lucene.store.NativeFSLockFactory@44b0c913;
maxCacheMB=48.0 maxMergeSizeMB=4.0)
2016-02-25 15:47:56.685 INFO  (indexFetcher-12-thread-1) [   x:cap2015]
o.a.s.u.DefaultSolrCoreState Rollback old IndexWriter... core=cap2015
2016-02-25 15:47:56.690 INFO  (indexFetcher-12-thread-1) [   x:cap2015]
o.a.s.c.SolrCore New index directory detected:
old=/data/solr/solr5-var/data/cap2015/data/index.20160222200052555
new=/data/solr/solr5-var/data/cap2015/data/index.20160224172604265
2016-02-25 15:47:56.775 INFO  (indexFetcher-12-thread-1) [   x:cap2015]
o.a.s.c.SolrDeletionPolicy SolrDeletionPolicy.onInit: commits: num=1
    
commit{dir=NRTCachingDirectory(MMapDirectory@/data/solr/solr5-var/data/cap2015/data/index.20160224172604265
lockFactory=org.apache.lucene.store.NativeFSLockFactory@44b0c913;
maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_ufj,generation=39439}
2016-02-25 15:47:56.775 INFO  (indexFetcher-12-thread-1) [   x:cap2015]
o.a.s.c.SolrDeletionPolicy newest commit generation = 39439
2016-02-25 15:47:56.776 INFO  (indexFetcher-12-thread-1) [   x:cap2015]
o.a.s.u.DefaultSolrCoreState New IndexWriter is ready to be used.
2016-02-25 15:48:00.166 INFO  (indexFetcher-12-thread-1) [   x:cap2015]
o.a.s.s.SolrIndexSearcher Opening Searcher@1dda02b3[cap2015] main
2016-02-25 15:48:00.173 INFO
 (searcherExecutor-7-thread-1-processing-x:cap2015) [   x:cap2015]
o.a.s.c.QuerySenderListener QuerySenderListener sending requests to
Searcher@1dda02b3[cap2015]
main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_4914(5.4.0):C48254424/3683:delGen=8)
Uninverting(_491i(5.4.0):c326) Uninverting(_491j(5.4.0):C921)
Uninverting(_491k(5.4.0):C1411) Uninverting(_491l(5.4.0):C734)
Uninverting(_491m(5.4.0):C454)))}
2016-02-25 15:48:00.173 INFO
 (searcherExecutor-7-thread-1-processing-x:cap2015) [   x:cap2015]
o.a.s.c.QuerySenderListener QuerySenderListener done.
2016-02-25 15:48:00.174 INFO
 (searcherExecutor-7-thread-1-processing-x:cap2015) [   x:cap2015]
o.a.s.c.SolrCore [cap2015] Registered new searcher Searcher@1dda02b3[cap2015]
main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_4914(5.4.0):C48254424/3683:delGen=8)
Uninverting(_491i(5.4.0):c326) Uninverting(_491j(5.4.0):C921)
Uninverting(_491k(5.4.0):C1411) Uninverting(_491l(5.4.0):C734)
Uninverting(_491m(5.4.0):C454)))}
2016-02-25 15:48:00.175 INFO  (indexFetcher-12-thread-1) [   x:cap2015]
o.a.s.c.CachingDirectoryFactory looking to close
/data/solr/solr5-var/data/cap2015/data/index.20160222200052555
[CachedDir<<refCount=0;path=/data/solr/solr5-var/data/cap2015/data/index.20160222200052555;done=true>>]
2016-02-25 15:48:00.175 INFO  (indexFetcher-12-thread-1) [   x:cap2015]
o.a.s.c.CachingDirectoryFactory Closing directory:
/data/solr/solr5-var/data/cap2015/data/index.20160222200052555
2016-02-25 15:48:00.175 INFO  (indexFetcher-12-thread-1) [   x:cap2015]
o.a.s.c.CachingDirectoryFactory Removing directory before core close:
/data/solr/solr5-var/data/cap2015/data/index.20160222200052555
2016-02-25 15:55:04.130 ERROR (indexFetcher-12-thread-1) [   x:cap2015]
o.a.s.h.IndexFetcher Master at: http://localname-removed:8983/solr/cap2015
is not available. Index fetch failed. Exception: Timeout occured while
waiting response from server at: http://localname-removed:8983/solr/cap2015
2016-02-25 15:55:07.188 INFO  (indexFetcher-12-thread-1) [   x:cap2015]
o.a.s.h.IndexFetcher Master's generation: 39439
2016-02-25 15:55:07.188 INFO  (indexFetcher-12-thread-1) [   x:cap2015]
o.a.s.h.IndexFetcher Slave's generation: 39439
2016-02-25 15:55:07.188 INFO  (indexFetcher-12-thread-1) [   x:cap2015]
o.a.s.h.IndexFetcher Starting replication process
2016-02-25 15:55:07.221 INFO  (indexFetcher-12-thread-1) [   x:cap2015]
o.a.s.h.IndexFetcher Number of files in latest index in master: 55
2016-02-25 15:55:07.223 INFO  (indexFetcher-12-thread-1) [   x:cap2015]
o.a.s.c.CachingDirectoryFactory return new directory for
/data/solr/solr5-var/data/cap2015/data/index.20160225155507222
2016-02-25 15:55:07.231 INFO  (indexFetcher-12-thread-1) [   x:cap2015]
o.a.s.h.IndexFetcher Starting download (fullCopy=true) to
NRTCachingDirectory(MMapDirectory@/data/solr/solr5-var/data/cap2015/data/index.20160225155507222
lockFactory=org.apache.lucene.store.NativeFSLockFactory@44b0c913;
maxCacheMB=48.0 maxMergeSizeMB=4.0)
root@havarti:/data/solr/solr5-var/logs#


This again seems like odd behaviour that it will do a fullCopy=true on a
slave that had only a timeout error (shouldn't even be an error, but a
warning) and where the master and slave were on the same commit generation.

In this case restarting Solr caused the redundant replication to stop, and
for the correct core to be loaded.  I've disabled Solr replication polling
for the time being to avoid a repeat of the problem.

We only started to use index replication recently, and I'm starting to
wonder if we need to move away from Solr replication.  We have an
alternative which is to feed our updates to Solr from a CouchDB database
where the replication is reliably handled by that database.

Thoughts?

-- 
System Administration and software developer,
Canadiana.org   http://www.canadiana.ca

Timeout connecting to index replication master causing slave core failure (0 documents).

Reply via email to