[ 
https://issues.apache.org/jira/browse/SOLR-13943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris M. Hostetter updated SOLR-13943:
--------------------------------------
    Attachment: apache_Lucene-Solr-repro-Java11_618.log.txt
                apache_Lucene-Solr-BadApples-Tests-master_533.log.txt
                apache_Lucene-Solr-BadApples-Tests-master_531.log.txt
        Status: Open  (was: Open)


The root cause of the problem appears to be that the test assumes it can set a 
"watcher" on ZK to monitor for changes to {{/aliases.json}} and then 
{{await()}} on a latch that will be updated when that watcher fires.  Once the 
{{await()}} returns, it tries to parse the {{ROUTER_START}} property of the 
(updated) alias.

The problem with this approach is that when ZK updates result in notifying 
watchers, there is no guarantee which order the watchers are called in or how 
quickly they will be called.  The watcher registered (and {{await()}}ed) by the 
test thread can fire before the {{AliasesManager}} is updated in the 
{{ZkStateReader}} used by the {{ZkClientClusterStateProvider}} -- _which is 
what the test consults when asserting the value of the {{ROUTER_START}} 
property_.
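
To make the ordering problem concrete, here is a minimal single-threaded sketch of the unlucky case. Nothing in it is Solr or ZooKeeper code: {{FakeNode}}, {{cachedStart}} and the resolved timestamp are hypothetical stand-ins, and the notification order is fixed by hand to mimic the test's watcher being called before the one that refreshes the cached alias view.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Illustrative stand-in for the race; none of these types are Solr or ZooKeeper classes. */
class AliasWatcherRaceSketch {

  /** Hypothetical node that notifies its watchers one at a time, in registration order. */
  static final class FakeNode {
    private final List<Consumer<String>> watchers = new ArrayList<>();

    void watch(Consumer<String> watcher) {
      watchers.add(watcher);
    }

    void set(String data) {
      for (Consumer<String> watcher : watchers) {
        watcher.accept(data);
      }
    }
  }

  /** Returns what the "test" sees in the cached view once its own watcher has fired. */
  static String readCachedStartAfterLatch() throws InterruptedException {
    FakeNode aliasesJson = new FakeNode();
    // Stand-in for the cached alias view consulted by the assertion.
    String[] cachedStart = {"2019-09-14T03:00:00Z/DAY"};
    String[] observed = new String[1];
    CountDownLatch aliasUpdate = new CountDownLatch(1);

    // Unlucky ordering: the test's watcher is notified first. Reading the cache
    // inside the callback models the test thread waking up and running its
    // assertion before the second watcher has been called.
    aliasesJson.watch(data -> {
      aliasUpdate.countDown();
      observed[0] = cachedStart[0];
    });
    // The watcher that refreshes the cached view is notified second.
    aliasesJson.watch(data -> cachedStart[0] = data);

    aliasesJson.set("2019-09-14T00:00:00Z"); // assumed resolved value, for illustration only
    aliasUpdate.await(1, TimeUnit.SECONDS);
    return observed[0];
  }

  public static void main(String[] args) throws InterruptedException {
    // prints 2019-09-14T03:00:00Z/DAY (the stale value, date math still present)
    System.out.println(readCachedStartAfterLatch());
  }
}
```

The latch is released correctly, yet the value read afterwards is stale, which matches the shape of the assertion failure in the logs.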

The test either needs to ignore the {{clusterStateProvider}} and use the data 
provided to its own watcher to verify the property was updated as expected, 
*OR* it needs to "hook in" to the {{ZkStateReader}} / {{AliasesManager}} and 
only proceed once they are aware of the latest {{aliases.json}} information.
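
A rough sketch of the first option, assuming the test's watcher can hand the {{ROUTER_START}} value it read to the test thread. The {{CompletableFuture}} hand-off and the names {{WatcherPayloadSketch}} / {{awaitRouterStart}} are illustrative assumptions, not existing test code:

```java
import java.time.Instant;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

/** Illustrative only: assert on the value delivered to the test's own watcher. */
class WatcherPayloadSketch {

  /**
   * Waits for the watcher to deliver the property value, then parses it.
   * Instant.parse throws DateTimeParseException if date math such as "/DAY" remains.
   */
  static Instant awaitRouterStart(CompletableFuture<String> watchedStart) throws Exception {
    String start = watchedStart.get(5, TimeUnit.SECONDS);
    return Instant.parse(start);
  }

  public static void main(String[] args) throws Exception {
    CompletableFuture<String> watchedStart = new CompletableFuture<>();
    // In a real test, the watcher callback would complete the future with the
    // value it parsed out of the updated /aliases.json (assumption).
    new Thread(() -> watchedStart.complete("2019-09-14T00:00:00Z")).start();
    System.out.println(awaitRouterStart(watchedStart));
  }
}
```

Because the assertion only touches data the watcher itself received, it no longer depends on when any other watcher gets called.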

----

FWIW: I notice that the {{AliasesManager}} is public on {{ZkStateReader}} and 
has a public {{update()}} method that forces a sync with ZK.

While it's almost certainly not a best practice to force syncs with ZK in 
production code, doing so here *using the ZkStateReader of the underlying 
{{ClusterStateProvider}}* after the {{aliasUpdate.await()}} may be suitable for 
test purposes?
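
The shape of that workaround, modeled with plain Java rather than the real classes: {{CachedAliases}} below is a hypothetical stand-in whose {{update()}} re-reads the source of truth, in the way the {{AliasesManager}} method is described above. How the test actually reaches the right {{ZkStateReader}} is left out, since that depends on the client in use.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Illustrative only: force the cached view to re-sync before asserting on it. */
class ForcedSyncSketch {

  /** Hypothetical cached view; in this sketch only update() refreshes it. */
  static final class CachedAliases {
    private final AtomicReference<String> source; // stand-in for the data held in ZK
    private volatile String cached;

    CachedAliases(AtomicReference<String> source) {
      this.source = source;
      this.cached = source.get();
    }

    /** Re-reads the source of truth, discarding whatever was cached. */
    void update() {
      cached = source.get();
    }

    String routerStart() {
      return cached;
    }
  }

  /** What the test would do once its latch has been released. */
  static String readAfterForcedSync(CachedAliases aliases) {
    aliases.update();
    return aliases.routerStart();
  }

  public static void main(String[] args) {
    AtomicReference<String> zk = new AtomicReference<>("2019-09-14T03:00:00Z/DAY");
    CachedAliases aliases = new CachedAliases(zk);
    zk.set("2019-09-14T00:00:00Z");                   // the alias is rewritten in "ZK"
    System.out.println(aliases.routerStart());        // still the stale, cached value
    System.out.println(readAfterForcedSync(aliases)); // refreshed value
  }
}
```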

I should also point out: if monitoring/waiting on {{/aliases.json}} updates is a 
common occurrence (even if just for tests), there should probably be public APIs 
for doing so similar to the {{DocCollectionWatcher}}, 
{{CollectionPropsWatcher}}, and {{LiveNodesWatcher}} APIs.
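
As a strawman for what such an API might look like, here is a self-contained sketch. {{AliasesWatcher}}, {{AliasesView}} and {{waitFor}} are hypothetical names that do not exist in Solr, and the "return true to be unregistered" convention is an assumption of this sketch. The key property is that the cached view is updated _before_ any watcher is notified, so a caller that wakes up can trust what it then reads.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Predicate;

/** Illustrative only: a hypothetical watcher API for alias updates. */
class AliasesWatcherSketch {

  /** Hypothetical callback; returning true asks to be unregistered. */
  interface AliasesWatcher {
    boolean onStateChanged(String aliasesJson);
  }

  /** Hypothetical holder of the cached aliases plus its registered watchers. */
  static final class AliasesView {
    private final List<AliasesWatcher> watchers = new CopyOnWriteArrayList<>();
    private volatile String current;

    AliasesView(String initial) {
      this.current = initial;
    }

    String current() {
      return current;
    }

    void register(AliasesWatcher watcher) {
      watchers.add(watcher);
      // Check the current state right away so an update that already landed is not missed.
      if (watcher.onStateChanged(current)) {
        watchers.remove(watcher);
      }
    }

    /** Updates the cached view first, then notifies watchers. */
    void publish(String aliasesJson) {
      current = aliasesJson;
      watchers.removeIf(watcher -> watcher.onStateChanged(aliasesJson));
    }

    /** Blocks until the predicate matches the cached view, or the timeout expires. */
    void waitFor(Predicate<String> predicate, long timeout, TimeUnit unit)
        throws InterruptedException, TimeoutException {
      CountDownLatch latch = new CountDownLatch(1);
      register(json -> {
        boolean matched = predicate.test(json);
        if (matched) {
          latch.countDown();
        }
        return matched;
      });
      if (!latch.await(timeout, unit)) {
        throw new TimeoutException("aliases never reached the expected state");
      }
    }
  }

  public static void main(String[] args) throws Exception {
    AliasesView view = new AliasesView("router.start=2019-09-14T03:00:00Z/DAY");
    new Thread(() -> view.publish("router.start=2019-09-14T00:00:00Z")).start();
    view.waitFor(json -> !json.contains("/DAY"), 5, TimeUnit.SECONDS);
    System.out.println(view.current());
  }
}
```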


> TimeRoutedAliasUpdateProcessorTest.testDateMathInStart: multi-threaded race 
> condition due to ZK assumptions
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-13943
>                 URL: https://issues.apache.org/jira/browse/SOLR-13943
>             Project: Solr
>          Issue Type: Test
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Chris M. Hostetter
>            Priority: Major
>         Attachments: apache_Lucene-Solr-BadApples-Tests-master_531.log.txt, 
> apache_Lucene-Solr-BadApples-Tests-master_533.log.txt, 
> apache_Lucene-Solr-repro-Java11_618.log.txt
>
>
> TimeRoutedAliasUpdateProcessorTest does not currently run in many Jenkins 
> builds due to being marked BadApple (SOLR-13059) -- however when it does run, 
> the method {{testDateMathInStart}} frequently fails due to what appears to be 
> a multi-threaded race condition in the test logic...
> {noformat}
>    [junit4]   2> NOTE: reproduce with: ant test  
> -Dtestcase=TimeRoutedAliasUpdateProcessorTest 
> -Dtests.method=testDateMathInStart -Dtests.seed=8879E35521A4B9EA 
> -Dtests.multiplier=2 -Dtests.slow=true -Dtests.badapples=true 
> -Dtests.locale=nl-BQ 
> -Dtests.timezone=America/Porto_Acre -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8
>    [junit4] FAILURE 6.96s J0 | 
> TimeRoutedAliasUpdateProcessorTest.testDateMathInStart <<<
>    [junit4]    > Throwable #1: java.lang.AssertionError: router.start should 
> not have any date math by this point and parse as an instant. Using class 
> org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider 
> Found:2019-09-14T03:00:00Z/DAY
>    [junit4]    >        at 
> __randomizedtesting.SeedInfo.seed([8879E35521A4B9EA:64FE3DD88112B802]:0)
>    [junit4]    >        at 
> org.apache.solr.update.processor.TimeRoutedAliasUpdateProcessorTest.testDateMathInStart(TimeRoutedAliasUpdateProcessorTest.java:765)
>    [junit4]    >        at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>    [junit4]    >        at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>    [junit4]    >        at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>    [junit4]    >        at 
> java.base/java.lang.reflect.Method.invoke(Method.java:566)
>    [junit4]    >        at java.base/java.lang.Thread.run(Thread.java:834)
> {noformat}
> I'll attach some logs from recent failures and my own quick analysis of the 
> problems with how the test appears to be asserting ZK updates.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org