[jira] [Commented] (SOLR-14347) Autoscaling placement wrong when concurrent replica placements are calculated

Andrzej Bialecki (Jira) Tue, 12 May 2020 02:14:02 -0700


    [ https://issues.apache.org/jira/browse/SOLR-14347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17105255#comment-17105255 ]


Andrzej Bialecki commented on SOLR-14347:
-----------------------------------------

Hmm, I think you are right ... I don't know how I missed this :( I focused on 
the fact that per-collection policies caused side effects in 
{{Session.expandedClauses}} and missed that we're dropping the previous state 
of the matrix. That state contains changes from previous ops that are not yet 
persisted.

At this point I'm not sure how to fix this properly: the changes we're making 
here still need to be pushed back to the original Session, so a 
{{Session.copy()}} won't work either.
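
To make the two failure modes concrete, here is a minimal sketch. It does not use Solr's actual classes: {{ToySession}}, its {{matrix}} map and the "least loaded node" rule are hypothetical stand-ins chosen only for illustration. It shows that rebuilding a session from persisted state forgets an earlier, not yet persisted placement, and that a copy keeps the earlier placement but never pushes its own change back to the original.

```java
import java.util.Map;
import java.util.TreeMap;

class SessionStateSketch {
    // Hypothetical stand-in for a policy session: node name -> replica count.
    static class ToySession {
        final Map<String, Integer> matrix;

        ToySession(Map<String, Integer> state) {
            this.matrix = new TreeMap<>(state);
        }

        ToySession copy() {
            return new ToySession(matrix);
        }

        // Pick the least loaded node and record the placement in this session only.
        String place() {
            String target = null;
            for (Map.Entry<String, Integer> e : matrix.entrySet()) {
                if (target == null || e.getValue() < matrix.get(target)) {
                    target = e.getKey();
                }
            }
            matrix.merge(target, 1, Integer::sum);
            return target;
        }
    }

    public static void main(String[] args) {
        // Assumed persisted cluster state: two empty nodes.
        Map<String, Integer> persisted = new TreeMap<>(Map.of("node1", 0, "node2", 0));

        ToySession session = new ToySession(persisted);
        String first = session.place(); // recorded in the session, not persisted

        // Rebuilding from persisted state drops the first placement,
        // so the next op picks the same node again.
        String rebuilt = new ToySession(persisted).place();

        // A copy sees the first placement and picks the other node,
        // but that change stays in the copy.
        ToySession copy = session.copy();
        String viaCopy = copy.place();

        System.out.println("first=" + first + " rebuilt=" + rebuilt + " viaCopy=" + viaCopy);
        System.out.println("original session after copy op: " + session.matrix);
    }
}
```

In this toy model a proper fix would have to both keep the in-memory changes and write the new placement back to the shared session, which is exactly the part that is unclear for the real code.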

> Autoscaling placement wrong when concurrent replica placements are calculated
> -----------------------------------------------------------------------------
>
>                 Key: SOLR-14347
>                 URL: https://issues.apache.org/jira/browse/SOLR-14347
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: AutoScaling
>    Affects Versions: 8.5
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>            Priority: Major
>             Fix For: 8.6
>
>         Attachments: SOLR-14347.patch
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Steps to reproduce:
>  * create a cluster of a few nodes (tested with 7 nodes)
>  * define per-collection policies that distribute replicas exclusively on 
> different nodes per policy
>  * concurrently create a few collections, each using a different policy
>  * resulting replica placement will be seriously wrong, causing many policy 
> violations
> Running the same scenario but instead creating collections sequentially 
> results in no violations.
> I suspect this is caused by an incorrect locking level for all collection 
> operations (as defined in {{CollectionParams.CollectionAction}}) that create 
> new replica placements - i.e. CREATE, ADDREPLICA, MOVEREPLICA, DELETENODE, 
> REPLACENODE, SPLITSHARD, RESTORE, REINDEXCOLLECTION. All of these operations 
> use the policy engine to create new replica placements, and as a result they 
> change the cluster state. However, currently these operations are locked (in 
> {{OverseerCollectionMessageHandler.lockTask}} ) using 
> {{LockLevel.COLLECTION}}. In practice this means that the lock is held only 
> for the particular collection that is being modified.
> A straightforward fix for this issue is to change the locking level to 
> CLUSTER (I have confirmed that this fixes the scenario described above). 
> However, this effectively serializes all collection operations listed above, 
> which will result in a general slow-down of all collection operations.
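
The locking argument in the description can be sketched as follows. This is a simplified model, not the Overseer code: {{LockLevelSketch}}, its lock maps, the "least loaded node" rule and the {{afterRead}} hook are all assumptions made for illustration. Each operation reads the cluster state, computes a placement, and writes it back. With one lock per collection, two operations on different collections can both read the same state before either writes, and so pick the same node. With a single cluster-wide lock the read-compute-write sequence is serialized.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.locks.ReentrantLock;

class LockLevelSketch {
    // Hypothetical shared cluster state: node name -> number of replicas.
    private final Map<String, Integer> nodes = new TreeMap<>();
    private final ReentrantLock clusterLock = new ReentrantLock();
    private final Map<String, ReentrantLock> collectionLocks = new ConcurrentHashMap<>();

    LockLevelSketch(String... nodeNames) {
        for (String n : nodeNames) {
            nodes.put(n, 0);
        }
    }

    private synchronized Map<String, Integer> snapshot() {
        return new TreeMap<>(nodes);
    }

    private synchronized void commit(String node) {
        nodes.merge(node, 1, Integer::sum);
    }

    // Read the state, pick the least loaded node, then write the placement back.
    String place(String collection, boolean clusterLevel, Runnable afterRead) {
        ReentrantLock lock = clusterLevel
                ? clusterLock
                : collectionLocks.computeIfAbsent(collection, c -> new ReentrantLock());
        lock.lock();
        try {
            Map<String, Integer> view = snapshot();
            String target = null;
            for (Map.Entry<String, Integer> e : view.entrySet()) {
                if (target == null || e.getValue() < view.get(target)) {
                    target = e.getKey();
                }
            }
            afterRead.run(); // demo hook: lets the caller force the reads to overlap
            commit(target);
            return target;
        } finally {
            lock.unlock();
        }
    }

    // Run two concurrent placements for different collections; true if both chose the same node.
    static boolean collides(boolean clusterLevel) {
        LockLevelSketch cluster = new LockLevelSketch("node1", "node2");
        CyclicBarrier bothRead = new CyclicBarrier(2);
        // Per-collection locks: wait until both ops have read the same state.
        // Cluster lock: the reads cannot overlap, so the hook does nothing.
        Runnable hook = clusterLevel ? () -> { } : () -> {
            try {
                bothRead.await();
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        };
        Callable<String> opA = () -> cluster.place("collA", clusterLevel, hook);
        Callable<String> opB = () -> cluster.place("collB", clusterLevel, hook);
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<String> a = pool.submit(opA);
            Future<String> b = pool.submit(opB);
            return a.get().equals(b.get());
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("collection-level lock collides: " + collides(false));
        System.out.println("cluster-level lock collides: " + collides(true));
    }
}
```

The barrier only makes the overlap deterministic for the demo. In a real cluster the same interleaving happens by chance, which matches the observation that sequential creation shows no violations.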



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

