[ https://issues.apache.org/jira/browse/SOLR-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris M. Hostetter updated SOLR-13811:
--------------------------------------
    Attachment: hoss_local_failure_after_refactoring.log.txt
                apache_Lucene-Solr-NightlyTests-8.x_221.log.txt
        Status: Open  (was: Open)

As noted by gitbot, I've committed some refactoring to help clean this up and isolate the problematic test logic.

----

I'm attaching two files:
* {{apache_Lucene-Solr-NightlyTests-8.x_221.log.txt}} - showing an example of how the problem has manifested in jenkins builds _prior_ to the refactoring I've just committed.
* {{hoss_local_failure_after_refactoring.log.txt}} - showing how the newly refactored {{testRapidStopStartStopWithPropChange()}} can fail, demonstrating the same problem in isolation.

Note that {{testRapidStopStartStopWithPropChange()}} does not fail deterministically: the behavior depends on exactly when {{NodeLostTrigger}} fires _after_ the node is restarted, but before it is stopped again. Perhaps there is a way to "pause" the triggers to increase the odds of this happening? ... not sure.

(It also seems to fail much more often in the Hdfs version of the test ... I'm not sure if that's because the MOVEREPLICA logic works faster/slower than in the non-hdfs situation? ... I actually haven't been able to trigger the failure w/the refactoring in place)

[~ab] : can you please take a look at this and chime in with whether you think the current code in {{testRapidStopStartStopWithPropChange()}} is something that should pass reliably given the way the code is designed to work? ... If so, please update the jira summary/description to make it clear what the underlying bug is. If not, we should go ahead and: delete this test method, reclassify this issue as a "Test" task, and resolve as "DONE".

> possible autoAddReplicas bug and/or (Hdfs)AutoAddReplicasIntegrationTest refactoring / fixes
> --------------------------------------------------------------------------------------------
>
>                 Key: SOLR-13811
>                 URL: https://issues.apache.org/jira/browse/SOLR-13811
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>            Reporter: Chris M. Hostetter
>            Priority: Major
>         Attachments: apache_Lucene-Solr-NightlyTests-8.x_221.log.txt, hoss_local_failure_after_refactoring.log.txt
>
>
> I've noticed a pattern of failure behavior in jenkins runs of {{AutoAddReplicasIntegrationTest}} (which mostly manifests in the subclass {{HdfsAutoAddReplicasIntegrationTest}}, probably due to timing) which indicates either:
> # the test is too contrived, and expects {{autoAddReplicas}} to kick in in a situation where the current impl of {{NodeLostTrigger}} isn't smart enough to handle
> # {{NodeLostTrigger}} _should_ be smart enough to handle this, but isn't.
>
> The test failure is currently somewhat finicky to reproduce, and depends on a node being stopped, restarted, and stopped again, while an affected collection is changed from {{autoAddReplicas=false}} to {{autoAddReplicas=true}} before the second "stop".
>
> Regardless of which of the 2 above is true: the test itself is somewhat convoluted. It creates a sequence of events (some randomized, some static) and asserts specific outcomes after each, but the timing of scheduled triggers like {{NodeLostTrigger}}, and the interplay of things like "pick a random node to shutdown" with a subsequent "explicitly shut down node2" (even if it was the node randomly shut down earlier), is confusing.
>
> I'm creating this issue to track two tightly dependent objectives:
> # refactoring this test to:
> ** better isolate the specific things it's trying to test in individual test methods.
> ** have a singular test method that triggers the specific sequence of events that is currently problematic (ideally in such a way that it reliably fails).
> # AwaitsFix this new test method until someone with a better understanding of the {{autoAddReplicas}} / {{NodeLostTrigger}} code can assess if the test is faulty or the code being tested is faulty.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)