[ 
https://issues.apache.org/jira/browse/SOLR-14524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127696#comment-17127696
 ] 

Ilan Ginzburg commented on SOLR-14524:
--------------------------------------

I am actively looking at this (week-end permitting). Here are my notes so far, 
to be considered WIP.

 

I looked at the logs from the test run 
[https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/26896/consoleFull] 
(run 26913 is similar).

The long running task is added at timestamp 281488 (log "{{MOCK task added 
2}}").

The first task (id 0, fast one) got added 6ms earlier at 281482 and we see it 
getting processed at 281498. Total time from enqueue to finished processing is 
16ms.

Interestingly, the long-running task 2 gets executed BEFORE task 1 gets 
executed, even though task 1 was enqueued before it. Task 2 completed execution 
at 291532, which is 10 seconds and 44ms after it got enqueued. This makes sense: 
its processing duration was set to 10 seconds and task 0 shows that dequeuing 
by the {{OverseerCollectionMessageHandler}} is quick.

+The surprising bit now+: at 291603 (i.e. after task 2 completed), task 1 
completes. Logs from {{OverseerCollectionMessageHandler}} show it got processed 
by the same thread that first processed task 2.
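
As a quick sanity check of the arithmetic above, here is a throwaway Python 
sketch. The constant and function names are mine; the values are copied from 
the log timestamps quoted in this comment.

```python
# Timestamps in ms, copied from the Jenkins log discussed in this comment.
TASK0_ENQUEUED = 281482
TASK0_DONE = 281498
TASK2_ENQUEUED = 281488
TASK2_DONE = 291532
TASK1_DONE = 291603

# Processing duration configured for the slow task (10 seconds).
TASK2_DURATION_MS = 10_000


def elapsed_ms(start, end):
    """Return the time in ms between two log timestamps."""
    return end - start


# Enqueue-to-done latency of the fast task 0.
task0_latency = elapsed_ms(TASK0_ENQUEUED, TASK0_DONE)

# Enqueue-to-done latency of the slow task 2, and what is left once its
# configured processing duration is subtracted.
task2_latency = elapsed_ms(TASK2_ENQUEUED, TASK2_DONE)
task2_overhead = task2_latency - TASK2_DURATION_MS

# How long after task 2 finished did task 1 finish?
task1_after_task2 = elapsed_ms(TASK2_DONE, TASK1_DONE)

print(task0_latency, task2_latency, task2_overhead, task1_after_task2)
```

Task 1 finishing 71ms after task 2 is consistent with it having been picked up 
only once task 2 was done.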

The test fails for a simple reason: it waits for task 1 to complete before 
proceeding, then verifies that task 2 has not completed yet. That check exists 
so the test can observe how task 2 is scheduled relative to a task enqueued 
afterwards for another collection. In this run task 1 only completed after 
task 2, so the check fails.

The real question therefore is: why does {{OverseerCollectionMessageHandler}} 
process (or appear to process) messages on the same collection out of order? 
This is not expected.
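
To make the failure mode concrete, here is a toy model in Python. It is not 
Solr code: it assumes tasks on the same collection run strictly one after 
another, and the 10ms duration for the fast tasks is made up for illustration. 
Only the 10 second duration of task 2 comes from the test. All names are 
hypothetical.

```python
def completion_times(order, durations_ms):
    """Run tasks one after another in `order`; return {task_id: finish_ms}."""
    now = 0
    finished = {}
    for task_id in order:
        now += durations_ms[task_id]
        finished[task_id] = now
    return finished


def precondition_holds(finished):
    """The test waits for task 1, then expects task 2 to still be running."""
    return finished[1] < finished[2]


# Tasks 0 and 1 are fast (assumed 10ms), task 2 takes 10 seconds.
durations_ms = {0: 10, 1: 10, 2: 10_000}

expected = completion_times([0, 1, 2], durations_ms)  # enqueue order
observed = completion_times([0, 2, 1], durations_ms)  # order seen in the log

print("enqueue order :", precondition_holds(expected))
print("observed order:", precondition_holds(observed))
```

In this model the test's assumption only holds when tasks are processed in 
enqueue order. With the order seen in the log, task 2 is already done by the 
time task 1 completes, whatever the exact durations are.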

> Harden MultiThreadedOCPTest
> ---------------------------
>
>                 Key: SOLR-14524
>                 URL: https://issues.apache.org/jira/browse/SOLR-14524
>             Project: Solr
>          Issue Type: Test
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: master (9.0)
>            Reporter: Ilan Ginzburg
>            Assignee: Mike Drob
>            Priority: Minor
>              Labels: test
>             Fix For: master (9.0)
>
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> {{MultiThreadedOCPTest.test()}} fails occasionally in Jenkins because of 
> the timing of task enqueues to the Collection API queue.
> This test in {{testFillWorkQueue()}} enqueues a large number of tasks (115, 
> more than the 100 Collection API parallel executors) to the Collection API 
> queue for a collection COLL_A, then observes a short delay and enqueues a 
> task for another collection COLL_B.
>  It verifies that the COLL_B task (that does not require the same lock as the 
> COLL_A tasks) completes before the third COLL_A task.
> Test failures happen because when enqueues are slowed down enough, the first 
> 3 tasks on COLL_A complete even before the COLL_B task gets enqueued!
> In one sample failed Jenkins test execution, the COLL_B task enqueue happened 
> 1275ms after the enqueue of the first COLL_A, leaving plenty of time for a 
> few (and possibly all) COLL_A tasks to complete.
> Fix will be along the lines of:
>  * Make the “blocking” COLL_A task longer to execute (currently 1 second) to 
> compensate for slow enqueues.
>  * Verify the COLL_B task (a 1ms task) finishes before the long running 
> COLL_A task does. This would be a good indication that even though the 
> collection queue was filled with tasks waiting for a busy lock, a non 
> competing task was picked and executed right away.
>  * Delay the enqueue of the COLL_B task to the end of processing of the first 
> COLL_A task. This would guarantee that COLL_B is enqueued once at least some 
> COLL_A tasks started processing at the Overseer. Possibly also verify that 
> the long running task of COLL_A didn't finish execution yet when the COLL_B 
> task is enqueued...
>  * It might be possible to set a (very) long duration for the slow task of 
> COLL_A (to be less vulnerable to execution delays) without requiring the test 
> to wait for that task to complete, but only wait for the COLL_B task to 
> complete (so the test doesn't run for too long).



