Hello,

Reloading and restarting don't seem to help here. Occasionally the replicas do finally replicate some files, but the next few commits are then ignored.

I did finally find some errors. On the leader:

2019-08-23 01:11:10.989 ERROR (qtp367746789-4669) [c:nutch s:shard1 r:core_node40 x:collection_shard1_replica_t39] o.a.s.h.ReplicationHandler Unable to get file names for indexCommit generation: 1205 => java.nio.file.NoSuchFileException: /..../data/collection_shard1_replica_t39/data/index/_27m_8b.liv
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
java.nio.file.NoSuchFileException: /app/data/nutch_shard1_replica_t39/data/index/_27m_8b.liv
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) ~[?:1.8.0_222]
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:1.8.0_222]
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:1.8.0_222]
    at sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55) ~[?:1.8.0_222]
    at sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144) ~[?:1.8.0_222]
    at sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99) ~[?:1.8.0_222]

On the slave:

No files to download for index generation: 1205

So it seems obvious now: the replica won't replicate because of the error on the leader. Is this a known error? Should I open a new Jira?

Regards,
Markus

-----Original message-----
> From:Ere Maijala <ere.maij...@helsinki.fi>
> Sent: Friday 23rd August 2019 11:24
> To: solr-user@lucene.apache.org
> Subject: Re: 8.2.0 After changing replica types, state.json is wrong and
> replication no longer takes place
>
> Hi,
>
> We've had PULL replicas stop replicating a couple of times in Solr 7.x.
> Restarting Solr has got it going again. No errors in logs, and I've been
> unable to reproduce the issue at will. At least once it happened when I
> reloaded a collection, but other times that hasn't caused any issues.
>
> I'll make a note to check state.json next time we encounter the
> situation to see if I can see what you reported.
>
> Regards,
> Ere
>
> Markus Jelsma wrote on 22.8.2019 at 16.36:
> > Hello,
> >
> > There is a newly created 8.2.0 all NRT type cluster for which i replaced
> > each NRT replica with a TLOG type replica. Now, the replicas no longer
> > replicate when the leader receives data. The situation is odd, because some
> > shard replicas kept replicating up until eight hours ago, another one (same
> > collection, same node) seven hours, and even another one four hours!
> >
> > I inspected state.json to see what might be wrong, and compare it with
> > another fully working, but much older, 8.2.0 all TLOG collection.
> >
> > The faulty one still lists, probably from when it was created:
> > "nrtReplicas":"2",
> > "tlogReplicas":"0"
> > "pullReplicas":"0",
> > "replicationFactor":"2",
> >
> > The working collection only has:
> > "replicationFactor":"1",
> >
> > What actually could cause this new collection to start replicating when i
> > delete the data directory, but later on stop replicating at some random
> > time, which is different for each shard.
> >
> > Is there something i should change in state.json, and can it just be
> > reuploaded to ZK?
> >
> > Thanks,
> > Markus
> >
>
> --
> Ere Maijala
> Kansalliskirjasto / The National Library of Finland
>
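
For anyone wanting to check their own collections for the state.json discrepancy described in the quoted message, here is a minimal Python sketch. It only compares the declared nrtReplicas / tlogReplicas / pullReplicas counts with the replica types actually listed per shard. The layout it assumes (a top-level collection name, a "shards" map, a "replicas" map with a "type" field per replica) and the sample data are assumptions for illustration, not copied from the cluster in this thread. Whether such a mismatch has any effect on replication is not established here; the leader-side NoSuchFileException above looks like the more direct cause.

```python
from collections import Counter


def replica_type_mismatches(state, collection):
    """Return, per shard, the replica types whose declared count differs
    from the number of replicas of that type actually listed.

    `state` is a parsed state.json (a dict). The key layout used here is
    an assumption about how Solr stores collection state.
    """
    coll = state[collection]
    # Declared counts are stored as strings in the excerpt quoted above.
    declared = {
        "NRT": int(coll.get("nrtReplicas", 0)),
        "TLOG": int(coll.get("tlogReplicas", 0)),
        "PULL": int(coll.get("pullReplicas", 0)),
    }
    mismatches = {}
    for shard_name, shard in coll.get("shards", {}).items():
        # Assumption: a replica without an explicit type is treated as NRT.
        actual = Counter(
            replica.get("type", "NRT")
            for replica in shard.get("replicas", {}).values()
        )
        diff = {
            rtype: (declared[rtype], actual.get(rtype, 0))
            for rtype in declared
            if declared[rtype] != actual.get(rtype, 0)
        }
        if diff:
            mismatches[shard_name] = diff
    return mismatches


# Hypothetical sample: declared as all NRT, but both replicas are TLOG.
sample_state = {
    "collection": {
        "nrtReplicas": "2",
        "tlogReplicas": "0",
        "pullReplicas": "0",
        "replicationFactor": "2",
        "shards": {
            "shard1": {
                "replicas": {
                    "core_node40": {"type": "TLOG", "leader": "true"},
                    "core_node42": {"type": "TLOG"},
                }
            }
        },
    }
}

if __name__ == "__main__":
    # Each value is a (declared, actual) pair for the mismatching type.
    print(replica_type_mismatches(sample_state, "collection"))
```

To run it against a real cluster, load the collection's state.json with json.load() and pass the resulting dict in place of sample_state.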