[ 
https://issues.apache.org/jira/browse/SOLR-14524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124330#comment-17124330
 ] 

Erick Erickson commented on SOLR-14524:
---------------------------------------

[~mdrob] Do you want to commit this or shall I? FWIW, I could not get this test 
to fail in 1,000 iterations on my machine (without the patch), so we'll have to 
check it in and see if it stops failing on the various Jenkins machines.

> Harden MultiThreadedOCPTest
> ---------------------------
>
>                 Key: SOLR-14524
>                 URL: https://issues.apache.org/jira/browse/SOLR-14524
>             Project: Solr
>          Issue Type: Test
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: master (9.0)
>            Reporter: Ilan Ginzburg
>            Priority: Minor
>              Labels: test
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> {{MultiThreadedOCPTest.test()}} fails occasionally in Jenkins because of the
> timing of task enqueues to the Collection API queue.
> In {{testFillWorkQueue()}}, the test enqueues a large number of tasks (115,
> more than the 100 Collection API parallel executors) to the Collection API
> queue for a collection COLL_A, then waits for a short delay and enqueues a
> task for another collection COLL_B.
> It verifies that the COLL_B task (which does not require the same lock as the
> COLL_A tasks) completes before the third COLL_A task.
> Test failures happen because when enqueues are slowed down enough, the first 
> 3 tasks on COLL_A complete even before the COLL_B task gets enqueued!
> In one sample failed Jenkins test execution, the COLL_B task enqueue happened
> 1275ms after the enqueue of the first COLL_A task, leaving plenty of time for
> a few (and possibly all) COLL_A tasks to complete.
> Fix will be along the lines of:
>  * Make the “blocking” COLL_A task longer to execute (currently 1 second) to 
> compensate for slow enqueues.
>  * Verify the COLL_B task (a 1ms task) finishes before the long-running
> COLL_A task does. This would be a good indication that even though the
> collection queue was filled with tasks waiting for a busy lock, a
> non-competing task was picked and executed right away.
>  * Delay the enqueue of the COLL_B task until the first COLL_A task has
> finished processing. This would guarantee that COLL_B is enqueued only after
> at least some COLL_A tasks have started processing at the Overseer. Possibly
> also verify that the long-running COLL_A task has not yet finished executing
> when the COLL_B task is enqueued...
>  * It might be possible to set a (very) long duration for the slow task of 
> COLL_A (to be less vulnerable to execution delays) without requiring the test 
> to wait for that task to complete, but only wait for the COLL_B task to 
> complete (so the test doesn't run for too long).
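
The ordering idea in the list above can be sketched outside Solr with plain
java.util.concurrent primitives. This is only a toy model under stated
assumptions, not the actual patch: the class NonCompetingTaskSketch, the
enqueue() helper and the Body interface are hypothetical names, a per-collection
ReentrantLock stands in for the Collection API lock, and a cached thread pool
stands in for the Overseer executors. As a simplification, COLL_B is enqueued
once the blocking COLL_A task has started (rather than after a first COLL_A
task has finished), and the blocking task waits on a latch instead of sleeping,
so the check does not depend on how slow the enqueues are.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Toy model (not Solr code): tasks for the same collection share one lock. */
final class NonCompetingTaskSketch {

    /** Hypothetical task body that is allowed to throw checked exceptions. */
    interface Body {
        void run() throws Exception;
    }

    /** Submits a task that runs its body while holding the lock of its collection. */
    static Future<?> enqueue(ExecutorService pool, Map<String, ReentrantLock> locks,
                             String collection, Body body) {
        return pool.submit(() -> {
            ReentrantLock lock = locks.computeIfAbsent(collection, k -> new ReentrantLock());
            lock.lock();
            try {
                body.run();
            } finally {
                lock.unlock();
            }
            return null; // returning a value selects the Callable overload of submit()
        });
    }

    /** Returns true if COLL_B completed while the blocking COLL_A task was still running. */
    static boolean collBFinishesBeforeBlockingCollA() throws Exception {
        Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newCachedThreadPool();
        CountDownLatch collAStarted = new CountDownLatch(1);
        CountDownLatch releaseCollA = new CountDownLatch(1);
        try {
            // "Blocking" COLL_A task: holds the COLL_A lock until the test releases it.
            Future<?> blockingA = enqueue(pool, locks, "COLL_A", () -> {
                collAStarted.countDown();
                releaseCollA.await();
            });

            // Enqueue COLL_B only once the blocking COLL_A task is known to be running.
            if (!collAStarted.await(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("blocking COLL_A task never started");
            }

            // More COLL_A tasks pile up behind the busy COLL_A lock.
            for (int i = 0; i < 5; i++) {
                enqueue(pool, locks, "COLL_A", () -> { });
            }

            // COLL_B needs a different lock, so it should run right away.
            Future<?> collB = enqueue(pool, locks, "COLL_B", () -> Thread.sleep(1));

            // Wait only for COLL_B, never for the long-running COLL_A task.
            collB.get(10, TimeUnit.SECONDS);
            return !blockingA.isDone();
        } finally {
            releaseCollA.countDown(); // let the COLL_A tasks drain
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("COLL_B finished while the blocking COLL_A task was still running: "
                + collBFinishesBeforeBlockingCollA());
    }
}
```

Because the blocking task cannot finish until the latch is released, the check
stays valid however long the enqueues take, and the run time is bounded by the
COLL_B task rather than by the slow COLL_A task.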



