[ https://issues.apache.org/jira/browse/SOLR-14347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17105510#comment-17105510 ]
Ilan Ginzburg commented on SOLR-14347:
--------------------------------------

I didn't look in detail at how Sessions are used, but can we use a copy of the cached Session rather than a new one built from ZK and once all computation is done, copy over the new cluster state from the computed Session back to the original one, then return the original one?

If you think it makes sense I can look more in detail and submit a PR.

> Autoscaling placement wrong when concurrent replica placements are calculated
> -----------------------------------------------------------------------------
>
>                 Key: SOLR-14347
>                 URL: https://issues.apache.org/jira/browse/SOLR-14347
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>          Components: AutoScaling
>    Affects Versions: 8.5
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>            Priority: Critical
>             Fix For: 8.6
>
>         Attachments: SOLR-14347.patch
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Steps to reproduce:
> * create a cluster of a few nodes (tested with 7 nodes)
> * define per-collection policies that distribute replicas exclusively on
> different nodes per policy
> * concurrently create a few collections, each using a different policy
> * resulting replica placement will be seriously wrong, causing many policy
> violations
> Running the same scenario but instead creating collections sequentially
> results in no violations.
> I suspect this is caused by incorrect locking level for all collection
> operations (as defined in {{CollectionParams.CollectionAction}}) that create
> new replica placements - i.e. CREATE, ADDREPLICA, MOVEREPLICA, DELETENODE,
> REPLACENODE, SPLITSHARD, RESTORE, REINDEXCOLLECTION. All of these operations
> use the policy engine to create new replica placements, and as a result they
> change the cluster state. However, currently these operations are locked (in
> {{OverseerCollectionMessageHandler.lockTask}}) using
> {{LockLevel.COLLECTION}}.
> In practice this means that the lock is held only
> for the particular collection that is being modified.
> A straightforward fix for this issue is to change the locking level to
> CLUSTER (and I confirm this fixes the scenario described above). However,
> this effectively serializes all collection operations listed above, which
> will result in general slow-down of all collection operations.
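
To make the copy-then-write-back idea from the comment concrete, here is a minimal sketch in Java. It is not Solr code: `PlacementSession` and `SessionHolder` are hypothetical stand-ins for the policy Session and its cache, the "least-loaded node" rule is a toy placement policy, and the simple monitor around the three steps is an assumption made only so the example is safe to call from several threads. How concurrent callers would actually be reconciled in Solr is left open here.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for a policy Session; NOT Solr's actual class. */
final class PlacementSession {
    // Node name -> number of replicas placed on it (sorted so ties are deterministic).
    private final TreeMap<String, Integer> replicasPerNode;

    PlacementSession(Map<String, Integer> replicasPerNode) {
        this.replicasPerNode = new TreeMap<>(replicasPerNode);
    }

    /** Independent copy, so computation never mutates the cached instance directly. */
    PlacementSession copy() {
        return new PlacementSession(replicasPerNode);
    }

    /** Toy placement rule (an assumption): pick the least-loaded node, first by name on ties. */
    String placeReplica() {
        String best = null;
        int bestCount = Integer.MAX_VALUE;
        for (Map.Entry<String, Integer> e : replicasPerNode.entrySet()) {
            if (e.getValue() < bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        if (best == null) {
            throw new IllegalStateException("no nodes available");
        }
        replicasPerNode.merge(best, 1, Integer::sum);
        return best;
    }

    /** Overwrite this session's state with the state computed in another session. */
    void copyStateFrom(PlacementSession other) {
        replicasPerNode.clear();
        replicasPerNode.putAll(other.replicasPerNode);
    }

    /** Defensive snapshot of the current state. */
    Map<String, Integer> state() {
        return new TreeMap<>(replicasPerNode);
    }
}

/** Hypothetical holder of the cached session. */
final class SessionHolder {
    private final PlacementSession cached;

    SessionHolder(PlacementSession initial) {
        this.cached = initial;
    }

    /** Copy the cached session, compute on the copy, then write the result back. */
    synchronized String place() {
        PlacementSession working = cached.copy();  // 1. copy instead of rebuilding from ZK
        String node = working.placeReplica();      // 2. run the computation on the copy
        cached.copyStateFrom(working);             // 3. copy the new state back to the original
        return node;
    }

    synchronized Map<String, Integer> state() {
        return cached.state();
    }
}
```

With two empty nodes `node1` and `node2`, two successive `place()` calls return `node1` and then `node2`, because the second call sees the state written back by the first. In this sketch the monitor is held only for the placement computation itself, not for the whole collection operation.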