I may be misunderstanding something in your setup, and/or I may be misremembering things about Solr, but I think the behavior you are seeing is because *search* in Solr is "eventually consistent" -- while RTG (ie: using the "/get" handler) is (IIRC) "strongly consistent".
ie: there's a reason it's called "Near Real Time Searching" and "NRT Replica" ... not "RT Replica".

When you kill a node hosting a replica, then send an update which a leader accepts but can't send to that replica, that replica is now "out of sync" -- and it will continue to be out of sync when it comes back online and starts responding to search requests while it recovers from the leader/tlog. Eventually search will have consistent results across all replicas, but during the recovery period this isn't guaranteed.

If however you use the /get request handler, then it (again, IIRC) consults the tlog for the latest version of the doc, even if it's mid-recovery and the index itself isn't yet up to date. So for the purposes of testing Solr as a "strongly consistent" document store, using /get?id=foo to check the "current" data in the document is more appropriate than /select?q=id:foo

Some more info here...

https://lucene.apache.org/solr/guide/8_4/solrcloud-resilience.html
https://lucene.apache.org/solr/guide/8_4/realtime-get.html

A few other things that jumped out at me in your email that seemed weird or worthy of comment...

: According to Solr's documentation, a commit with openSearcher=true and
: waitSearcher=true and waitFlush=true only returns once everything is
: persisted AND the new searcher is visible.
:
: To me this sounds like any subsequent request after a successful
: commit MUST hit the new searcher and is guaranteed to see the commit
: changes, regardless of node failures or restarts.

That is true for *single* node Solr, or a "healthy" cluster -- but as I mentioned, if a node is down when the commit happens it won't have the document yet, nor is it alive to process the commit. The document update -- and the commit -- are in the tlog that still needs to replay when the replica comes back online.

: - A test-collection with 1 Shard and 2 NRT Replicas.
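To make the distinction concrete, here's a minimal sketch in plain Python (stdlib only) of the two kinds of check -- the host and the collection name "testcoll" are just placeholders for whatever your cluster uses:

```python
from urllib.parse import urlencode

# Hypothetical node/collection -- substitute your own.
SOLR = "http://localhost:8983/solr/testcoll"

def rtg_url(doc_id):
    # /get consults the tlog, so it reflects accepted updates
    # even while a replica is mid-recovery.
    return f"{SOLR}/get?{urlencode({'id': doc_id})}"

def search_url(doc_id):
    # /select only sees what the most recently opened searcher
    # has -- eventually consistent during recovery.
    return f"{SOLR}/select?{urlencode({'q': 'id:' + doc_id})}"

print(rtg_url("foo"))
print(search_url("foo"))
```

For a strong-consistency test, assert on the response from the first URL, not the second.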
I'm guessing, since you said you were using 3 nodes, that what you mean here is a single shard with a total of 3 replicas which are all NRT -- remember the "leader" is still itself an NRT replica. (I know, I know ... I hate the terminology)

This is a really important point to clarify in your testing because of how you are using 'rf' ... seeing exactly how you create your collection is important to make sure we're talking about the same thing.

: Each "transaction" adds, modifies and deletes documents and we ensure that
: each response has a "rf=2" (achieved replication factor=2) attribute.

So to be clear: 'rf=2' means a total of 2 replicas confirmed the update -- and that includes the leader replica. 'rf=1' means the leader accepted the doc, but all other replicas are down. If you want to be 100% certain that every replica received the update, then you should be confirming rf=3.

: After a "transaction" was performed without errors we send first a
: hardCommit and then a softCommit, both with waitFlush=true,
: waitSearcher=true and ensure they both return without errors.

FYI: there is no need to send a softCommit after a hardCommit -- a hard commit with openSearcher=true (the default) is a super-set of a soft commit.

-Hoss
http://www.lucidworks.com/
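P.S. Since you're checking 'rf' by hand anyway, here's a minimal sketch of that check in plain Python. The response shape is from my memory -- the achieved replication factor comes back as "rf" in the responseHeader of the json update response -- so verify it against what your Solr version actually returns:

```python
import json

EXPECTED_RF = 3  # 1 shard x 3 NRT replicas, leader included

def achieved_rf(response_body):
    # "rf" lives in the responseHeader of an update response
    # (assumed shape -- check your Solr version's actual output).
    return json.loads(response_body)["responseHeader"].get("rf", 0)

# Example response body, for illustration only:
raw = '{"responseHeader": {"rf": 2, "status": 0, "QTime": 5}}'

rf = achieved_rf(raw)
if rf < EXPECTED_RF:
    print("only %d of %d replicas confirmed the update" % (rf, EXPECTED_RF))
```

If you see rf=2 while a node is down, you know exactly which writes the recovering replica will be missing until it finishes replaying the tlog.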
