I may be misunderstanding something in your setup, and/or I may be misremembering things about Solr, but I think the behavior you are seeing is because *search* in Solr is "eventually consistent" -- while RTG (ie: using the "/get" handler) is (IIRC) "strongly consistent".
ie: there's a reason it's called "Near Real Time Searching" and "NRT Replica" ... not "RT Replica".

When you kill a node hosting a replica, then send an update which a leader accepts but can't send to that replica, that replica is now "out of sync" -- and it will continue to be out of sync when it comes back online and starts responding to search requests while it recovers from the leader/tlog. Eventually search will have consistent results across all replicas, but during the recovery period this isn't guaranteed.

If however you use the /get request handler, then it (again, IIRC) consults the tlog for the latest version of the doc, even if it's mid-recovery and the index itself isn't yet up to date. So for the purposes of testing Solr as a "strongly consistent" document store, using /get?id=foo to check the "current" data in the document is more appropriate than /select?q=id:foo

Some more info here...

https://lucene.apache.org/solr/guide/8_4/solrcloud-resilience.html
https://lucene.apache.org/solr/guide/8_4/realtime-get.html

A few other things that jumped out at me in your email that seemed weird or worthy of comment...

: According to Solr's documentation, a commit with openSearcher=true and
: waitSearcher=true and waitFlush=true only returns once everything is
: persisted AND the new searcher is visible.
:
: To me this sounds like any subsequent request after a successful
: commit MUST hit the new searcher and is guaranteed to see the commit
: changes, regardless of node failures or restarts.

That is true for *single* node Solr, or a "healthy" cluster -- but as I mentioned, if a node is down when the commit happens it won't have the document yet, nor is it alive to process the commit. The document update -- and the commit -- are in the tlog that still needs to replay when the replica comes back online.

: - A test-collection with 1 Shard and 2 NRT Replicas.
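To make the distinction concrete, here's a minimal sketch in plain Python (stdlib only) of the two kinds of check -- the host and the collection name "testcoll" are just placeholders for whatever your cluster uses:

```python
from urllib.parse import urlencode

# Hypothetical node/collection -- substitute your own.
SOLR = "http://localhost:8983/solr/testcoll"

def rtg_url(doc_id):
    # /get consults the tlog, so it reflects accepted updates
    # even while a replica is mid-recovery.
    return f"{SOLR}/get?{urlencode({'id': doc_id})}"

def search_url(doc_id):
    # /select only sees what the most recently opened searcher
    # has -- eventually consistent during recovery.
    return f"{SOLR}/select?{urlencode({'q': 'id:' + doc_id})}"

print(rtg_url("foo"))
print(search_url("foo"))
```

For a strong-consistency test, assert on the response from the first URL, not the second.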
I'm guessing, since you said you were using 3 nodes, that what you mean here is a single shard with a total of 3 replicas which are all NRT -- remember the "leader" is still itself an NRT replica. (I know, I know ... I hate the terminology)

This is a really important point to clarify in your testing because of how you are using 'rf' ... seeing exactly how you create your collection is important to make sure we're talking about the same thing.

: Each "transaction" adds, modifies and deletes documents and we ensure that
: each response has a "rf=2" (achieved replication factor=2) attribute.

So to be clear: 'rf=2' means a total of 2 replicas confirmed the update -- and that includes the leader replica. 'rf=1' means the leader accepted the doc, but all other replicas are down. If you want to be 100% certain that every replica received the update, then you should be confirming rf=3.

: After a "transaction" was performed without errors we send first a
: hardCommit and then a softCommit, both with waitFlush=true,
: waitSearcher=true and ensure they both return without errors.

FYI: there is no need to send a softCommit after a hardCommit -- a hard commit with openSearcher=true (the default) is a super-set of a soft commit.

-Hoss
http://www.lucidworks.com/
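P.S. Since you're checking 'rf' by hand anyway, here's a minimal sketch of that check in plain Python. The response shape is from my memory -- the achieved replication factor comes back as "rf" in the responseHeader of the json update response -- so verify it against what your Solr version actually returns:

```python
import json

EXPECTED_RF = 3  # 1 shard x 3 NRT replicas, leader included

def achieved_rf(response_body):
    # "rf" lives in the responseHeader of an update response
    # (assumed shape -- check your Solr version's actual output).
    return json.loads(response_body)["responseHeader"].get("rf", 0)

# Example response body, for illustration only:
raw = '{"responseHeader": {"rf": 2, "status": 0, "QTime": 5}}'

rf = achieved_rf(raw)
if rf < EXPECTED_RF:
    print("only %d of %d replicas confirmed the update" % (rf, EXPECTED_RF))
```

If you see rf=2 while a node is down, you know exactly which writes the recovering replica will be missing until it finishes replaying the tlog.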
