[ 
https://issues.apache.org/jira/browse/SOLR-14192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017466#comment-17017466
 ] 

Andrzej Bialecki commented on SOLR-14192:
-----------------------------------------

This patch seems to fix it for me; at least I wasn't able to reproduce the 
problem anymore.

Summary of changes:
 * use an ephemeral ZK lock when upgrading the schema to managed.
 * be more lenient when retrieving the schema: if the local core claims to be 
still using {{schema.xml}} but it cannot be found in ZK, then try to retrieve 
the backup left over after the upgrade, i.e. {{schema.xml.bak}}, and if that 
doesn't exist either, simply use the current in-memory schema.
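
The lenient lookup order described above can be sketched as follows. This is only an illustration, not the code in the patch: the class {{SchemaFallbackSketch}}, its methods, and the in-memory map standing in for the ZK config node are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of the fallback order; this is NOT Solr's API. */
final class SchemaFallbackSketch {
    // In-memory stand-in for the files stored under the ZK config node.
    private final Map<String, String> zkFiles = new ConcurrentHashMap<>();

    void put(String name, String content) { zkFiles.put(name, content); }

    void remove(String name) { zkFiles.remove(name); }

    /**
     * Resolve the schema content for a core whose (possibly stale) schema
     * still points at resourceName, e.g. "schema.xml".
     */
    String resolveSchema(String resourceName, String inMemorySchema) {
        // 1. Normal case: the resource the core refers to still exists.
        String primary = zkFiles.get(resourceName);
        if (primary != null) {
            return primary;
        }
        // 2. The resource was removed by an upgrade: try the backup copy.
        String backup = zkFiles.get(resourceName + ".bak");
        if (backup != null) {
            return backup;
        }
        // 3. Neither exists: keep using the schema already held in memory.
        return inMemorySchema;
    }
}
```

The point of the ordering is that a core with a stale resource name never fails outright; at worst it keeps serving the schema it already has until the ZK watcher fires and it reloads.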

> Race condition between SchemaManager and ZkIndexSchemaReader
> ------------------------------------------------------------
>
>                 Key: SOLR-14192
>                 URL: https://issues.apache.org/jira/browse/SOLR-14192
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 8.4
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>            Priority: Major
>             Fix For: 8.5
>
>         Attachments: SOLR-14192.patch
>
>
> Spin-off from SOLR-14128 and SOLR-13368.
> In SolrCloud when a SolrCore is created and it uses managed schema then its 
> {{ManagedIndexSchemaFactory}} performs an automatic upgrade of the initial 
> {{schema.xml}} to {{managed-schema}}. This includes removing the original 
> {{schema.xml}} file.
> SOLR-13368 added some locking to make sure the changed resource name (i.e. 
> {{managed-schema}}) becomes visible only when this process is complete, and 
> that in-flight requests to /admin/schema block until this process is 
> complete, to avoid returning inconsistent data. This locking mechanism uses 
> simple Object monitors.
> However, if there's more than one node in the cluster, the subsequent request to 
> retrieve schema may execute on a core that still hasn't reloaded its schema 
> ({{ZkIndexSchemaReader}} uses a ZK watcher, which may take some time to 
> trigger), and the resource name in that stale schema still points to 
> {{schema.xml}}, which by this time no longer exists because it was removed by 
> {{ManagedIndexSchemaFactory}} in the first core.
> As I see it there are two bugs here:
>  # there's no distributed locking when this upgrade is performed, so it's 
> natural that there are multiple cores racing against each other to perform 
> this upgrade.
>  # the upgrade process removes {{schema.xml}} too early. It triggers all 
> other cores by creating the {{managed-schema}} file, and then the other cores 
> reload from the new managed schema, but it should wait until this reload is 
> complete on all cores, because only then is it safe to delete the non-managed 
> resource, as it is no longer in use by any core.
> Issue 1. can be solved by adding an ephemeral znode lock so that only one 
> core can perform the upgrade. Issue 2. can be solved by using 
> {{ManagedIndexSchema.waitForSchemaZkVersionAgreement}} after upgrade, and 
> deleting {{schema.xml}} only after it's done.
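
The ordering proposed in the description (take a lock, create {{managed-schema}}, wait for all cores to reload, and only then delete {{schema.xml}}) can be sketched as below. This is a simplified single-JVM illustration under stated assumptions, not the patch itself: the class {{SchemaUpgradeSketch}}, the lock name, and the {{CountDownLatch}} standing in for {{waitForSchemaZkVersionAgreement}} are hypothetical, and an atomic {{putIfAbsent}} on a map only imitates creating an ephemeral znode. With a real ZooKeeper client the lock would presumably be a node created with {{CreateMode.EPHEMERAL}}, which ZooKeeper removes automatically when the owning session ends.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of the upgrade ordering; this is NOT Solr's API. */
final class SchemaUpgradeSketch {
    // Hypothetical lock name, chosen for illustration only.
    static final String LOCK = "schema.xml.upgrade.lock";

    // In-memory stand-in for the files stored under the ZK config node.
    final Map<String, String> zk = new ConcurrentHashMap<>();

    /**
     * Try to upgrade schema.xml to managed-schema on behalf of coreName.
     * allCoresReloaded stands in for waiting until every core has reloaded.
     * Returns true only if this caller performed the upgrade.
     */
    boolean upgrade(String coreName, CountDownLatch allCoresReloaded, long timeoutMs) {
        // Atomic "create if absent", imitating creation of an ephemeral lock node.
        if (zk.putIfAbsent(LOCK, coreName) != null) {
            return false; // another core is already upgrading
        }
        try {
            String legacy = zk.get("schema.xml");
            if (legacy == null) {
                return false; // nothing to upgrade, or already upgraded
            }
            // Publishing managed-schema is what triggers the other cores to reload.
            zk.put("managed-schema", legacy);

            boolean agreed;
            try {
                agreed = allCoresReloaded.await(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                agreed = false;
            }
            // Delete the old resource only once no core can still be using it.
            if (agreed) {
                zk.remove("schema.xml");
            }
            return true;
        } finally {
            // Release the lock only if we still own it; an ephemeral znode
            // would additionally vanish if the owner's session died.
            zk.remove(LOCK, coreName);
        }
    }
}
```

If the wait times out, this sketch leaves {{schema.xml}} in place rather than deleting it, so a core with a stale resource name can still read it; whether the actual fix behaves this way on timeout is not stated in the issue.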


