Andrzej Bialecki created SOLR-14192:
---------------------------------------
             Summary: Race condition between SchemaManager and ZkIndexSchemaReader
                 Key: SOLR-14192
                 URL: https://issues.apache.org/jira/browse/SOLR-14192
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
    Affects Versions: 8.4
            Reporter: Andrzej Bialecki
            Assignee: Andrzej Bialecki
             Fix For: 8.5

Spin-off from SOLR-14128 and SOLR-13368.

In SolrCloud, when a SolrCore is created and it uses a managed schema, its {{ManagedIndexSchemaFactory}} performs an automatic upgrade of the initial {{schema.xml}} to {{managed-schema}}. This includes removing the original {{schema.xml}} file.

SOLR-13368 added some locking to make sure that the changed resource name (i.e. {{managed-schema}}) becomes visible only when this process is complete, and that in-flight requests to /admin/schema block until then, to avoid returning inconsistent data. This locking mechanism uses simple Object monitors.

However, if there is more than one node in the cluster, the subsequent request to retrieve the schema may execute on a core that still hasn't reloaded its schema ({{ZkIndexSchemaReader}} uses a ZK watcher, which may take some time to trigger). The resource name in that stale schema still points to {{schema.xml}}, which by this time no longer exists because it was removed by {{ManagedIndexSchemaFactory}} in the first core.

As I see it there are two bugs here:
# there's no distributed locking when this upgrade is performed, so it's natural that multiple cores race against each other to perform it.
# the upgrade process removes {{schema.xml}} too early: it triggers all other cores by creating the {{managed-schema}} file, and the other cores then reload from the new managed schema, but it should wait until this reload is complete on all cores, because only then is it safe to delete the non-managed resource, as it is no longer in use by any core.

Issue 1 can be solved by adding an ephemeral znode lock so that only one core can perform the upgrade.

Issue 2 can be solved by using {{ManagedIndexSchema.waitForSchemaZkVersionAgreement}} after the upgrade, and deleting {{schema.xml}} only after it's done.
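
To make the intended ordering of the two fixes concrete, here is a minimal sketch. It is only an in-memory simulation and not the actual patch: it uses neither the ZooKeeper client nor Solr classes. The class {{SchemaUpgradeSketch}}, the lock name {{schema-upgrade-lock}}, and the methods {{tryUpgrade}}, {{waitForAgreement}} and {{reload}} are hypothetical names made up for illustration. An atomic {{Set.add}} stands in for creating an ephemeral znode, and a polling loop stands in for {{ManagedIndexSchema.waitForSchemaZkVersionAgreement}}.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** In-memory stand-in for the shared config store; NOT the ZooKeeper or Solr API. */
class SchemaUpgradeSketch {
    static final String LEGACY = "schema.xml";
    static final String MANAGED = "managed-schema";
    static final String LOCK = "schema-upgrade-lock"; // hypothetical lock node name

    final Set<String> store = ConcurrentHashMap.newKeySet();            // simulated znodes
    final Map<String, String> resourceByCore = new ConcurrentHashMap<>(); // schema resource each core uses
    final AtomicInteger upgrades = new AtomicInteger();
    final AtomicInteger missingReads = new AtomicInteger();

    SchemaUpgradeSketch(List<String> cores) {
        store.add(LEGACY);
        for (String core : cores) resourceByCore.put(core, LEGACY);
    }

    /** Issue 1: only the core that manages to create the lock node performs the upgrade. */
    boolean tryUpgrade(String core, long timeoutMs) throws InterruptedException {
        if (!store.add(LOCK)) return false;            // atomic create; another core holds the lock
        try {
            if (!store.contains(LEGACY)) return false; // upgrade already done
            store.add(MANAGED);                        // publishing this lets other cores reload
            resourceByCore.put(core, MANAGED);
            upgrades.incrementAndGet();
            waitForAgreement(timeoutMs);               // Issue 2: wait BEFORE deleting
            store.remove(LEGACY);
            return true;
        } finally {
            store.remove(LOCK); // a real ephemeral znode would also vanish if the session died
        }
    }

    /** Polls until every core uses the managed schema, or fails after the timeout. */
    void waitForAgreement(long timeoutMs) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (!resourceByCore.values().stream().allMatch(MANAGED::equals)) {
            if (System.nanoTime() - deadline > 0) {
                throw new IllegalStateException("cores did not agree in time");
            }
            Thread.sleep(2);
        }
    }

    /** Simulated watcher callback: read the current resource, then switch if possible. */
    void reload(String core) {
        if (!store.contains(resourceByCore.get(core))) missingReads.incrementAndGet();
        if (store.contains(MANAGED)) resourceByCore.put(core, MANAGED);
    }

    /** Starts one thread per core; each tries the upgrade, the losers reload instead. */
    static SchemaUpgradeSketch runDemo(int coreCount) {
        List<String> cores = new ArrayList<>();
        for (int i = 0; i < coreCount; i++) cores.add("core" + i);
        SchemaUpgradeSketch s = new SchemaUpgradeSketch(cores);
        List<Thread> threads = new ArrayList<>();
        for (String core : cores) {
            Thread t = new Thread(() -> {
                try {
                    if (!s.tryUpgrade(core, 5000)) {
                        while (!MANAGED.equals(s.resourceByCore.get(core))) {
                            s.reload(core);
                            Thread.sleep(2);
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads.add(t);
            t.start();
        }
        try {
            for (Thread t : threads) t.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return s;
    }

    public static void main(String[] args) {
        SchemaUpgradeSketch s = runDemo(4);
        System.out.println("upgrades=" + s.upgrades.get()
            + " missingReads=" + s.missingReads.get() + " store=" + s.store);
    }
}
```

In this simulation a core never reads a resource that has already been deleted, because the lock holder removes {{schema.xml}} only once every core reports {{managed-schema}}. Moving the {{store.remove(LEGACY)}} call above {{waitForAgreement}} reproduces the ordering described as the second bug.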