Andrzej Bialecki created SOLR-14192:
---------------------------------------

             Summary: Race condition between SchemaManager and ZkIndexSchemaReader
                 Key: SOLR-14192
                 URL: https://issues.apache.org/jira/browse/SOLR-14192
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
    Affects Versions: 8.4
            Reporter: Andrzej Bialecki
            Assignee: Andrzej Bialecki
             Fix For: 8.5


Spin-off from SOLR-14128 and SOLR-13368.

In SolrCloud, when a SolrCore that uses a managed schema is created, its 
{{ManagedIndexSchemaFactory}} automatically upgrades the initial 
{{schema.xml}} to {{managed-schema}}. This includes removing the original 
{{schema.xml}} file.

SOLR-13368 added some locking to make sure the changed resource name (i.e. 
{{managed-schema}}) becomes visible only when this process is complete, and 
that in-flight requests to /admin/schema block until this process is complete, 
to avoid returning inconsistent data. This locking mechanism uses simple Object 
monitors.
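
To illustrate the kind of monitor-based blocking described above, here is a minimal sketch. It is not the code added in SOLR-13368; the class {{SchemaResourceGate}} and all of its method names are hypothetical. Readers wait on a plain Object monitor until the upgrade finishes, and only then see the new resource name.

```java
// Hypothetical sketch; class and method names are illustrative, not Solr's.
class SchemaResourceGate {
    private final Object lock = new Object();
    private String resourceName = "schema.xml";
    private boolean upgradeInProgress = false;

    // Called by the upgrading thread before it starts rewriting the schema.
    void beginUpgrade() {
        synchronized (lock) {
            upgradeInProgress = true;
        }
    }

    // Publishes the new resource name and wakes up all blocked readers.
    void finishUpgrade(String newResourceName) {
        synchronized (lock) {
            resourceName = newResourceName;
            upgradeInProgress = false;
            lock.notifyAll();
        }
    }

    // Blocks while an upgrade is in flight, so callers never see a half-upgraded state.
    String getResourceName() {
        synchronized (lock) {
            while (upgradeInProgress) {
                try {
                    lock.wait();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for schema upgrade", e);
                }
            }
            return resourceName;
        }
    }
}
```

Note that a monitor like this only coordinates threads inside a single JVM, so it cannot protect cores running on other nodes.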

However, if the cluster has more than one node, a subsequent request to 
retrieve the schema may execute on a core that has not yet reloaded its schema 
({{ZkIndexSchemaReader}} uses a ZK watcher, which may take some time to 
trigger). The resource name in that stale schema still points to 
{{schema.xml}}, which by then no longer exists because 
{{ManagedIndexSchemaFactory}} on the first core has removed it.
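
The failure sequence above can be reproduced deterministically with a small in-memory model. This is only a sketch: the {{Map}} stands in for the config set in ZooKeeper, and {{StaleSchemaDemo}}, {{Core}}, {{upgrade}} and {{onWatcherFired}} are hypothetical names, not Solr or ZooKeeper APIs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory model; none of these types are Solr or ZooKeeper APIs.
class StaleSchemaDemo {

    // Stand-in for a core that caches the name of its schema resource.
    static class Core {
        private final Map<String, String> configSet;
        String resourceName = "schema.xml";

        Core(Map<String, String> configSet) {
            this.configSet = configSet;
        }

        // Models the ZK watcher firing at some later, unspecified time.
        void onWatcherFired() {
            resourceName = "managed-schema";
        }

        // Empty when the cached resource name points at a file that no longer exists.
        Optional<String> readSchema() {
            return Optional.ofNullable(configSet.get(resourceName));
        }
    }

    // Models the upgrade as described in this issue: schema.xml is removed immediately.
    static void upgrade(Core core, Map<String, String> configSet) {
        configSet.put("managed-schema", configSet.get("schema.xml"));
        configSet.remove("schema.xml"); // removed before other cores have reloaded
        core.resourceName = "managed-schema";
    }

    public static void main(String[] args) {
        Map<String, String> configSet = new HashMap<>();
        configSet.put("schema.xml", "<schema/>");
        Core first = new Core(configSet);
        Core second = new Core(configSet);

        upgrade(first, configSet);
        System.out.println("first core can read schema:  " + first.readSchema().isPresent());
        System.out.println("second core (stale) can read: " + second.readSchema().isPresent());
        second.onWatcherFired();
        System.out.println("second core (reloaded) can read: " + second.readSchema().isPresent());
    }
}
```

In this model the second core fails to read its schema in the window between the first core's upgrade and its own watcher firing.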

As I see it, there are two bugs here:
 # There is no distributed locking when this upgrade is performed, so multiple 
cores naturally race against each other to perform it.
 # The upgrade process removes {{schema.xml}} too early. Creating the 
{{managed-schema}} file triggers all other cores to reload from the new managed 
schema, but the upgrade should wait until that reload is complete on all cores, 
because only then is it safe to delete the non-managed resource, which is no 
longer in use by any core.

Issue 1 can be solved by adding an ephemeral znode lock so that only one core 
can perform the upgrade. Issue 2 can be solved by calling 
{{ManagedIndexSchema.waitForSchemaZkVersionAgreement}} after the upgrade and 
deleting {{schema.xml}} only once it completes.
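
A rough sketch of how the two fixes could fit together, again using an in-memory model rather than real ZooKeeper or Solr calls. Everything here is an assumption made for illustration: {{GuardedUpgradeSketch}}, the lock key, the version field and the polling loop are hypothetical. {{putIfAbsent}} only approximates an ephemeral znode (it does not disappear if the holder dies), and the wait loop only approximates what {{waitForSchemaZkVersionAgreement}} would do.

```java
import java.util.List;
import java.util.Map;

// Hypothetical in-memory sketch; not the actual Solr or ZooKeeper API.
class GuardedUpgradeSketch {
    static final String LOCK = "schema-upgrade-lock"; // hypothetical lock node name
    static final int MANAGED_VERSION = 1;             // hypothetical version of the upgraded schema

    static class Core {
        final String name;
        volatile int schemaVersion = 0; // version of the schema this core has loaded

        Core(String name) {
            this.name = name;
        }
    }

    // Returns true only if this call removed schema.xml after all cores agreed on the new version.
    static boolean upgrade(Core self, List<Core> allCores, Map<String, String> store, long timeoutMs) {
        // Fix 1: only one core may upgrade at a time (models creating an ephemeral lock znode).
        if (store.putIfAbsent(LOCK, self.name) != null) {
            return false;
        }
        try {
            String legacy = store.get("schema.xml");
            if (legacy == null) {
                return false; // nothing left to upgrade
            }
            store.put("managed-schema", legacy);
            self.schemaVersion = MANAGED_VERSION;

            // Fix 2: wait until every core has reloaded before deleting schema.xml.
            long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
            while (!allCores.stream().allMatch(c -> c.schemaVersion >= MANAGED_VERSION)) {
                if (System.nanoTime() - deadline >= 0) {
                    return false; // no agreement yet: keep schema.xml in place
                }
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            store.remove("schema.xml");
            return true;
        } finally {
            store.remove(LOCK, self.name); // release the lock only if we still hold it
        }
    }
}
```

The design choice shown here is that a timeout leaves {{schema.xml}} in place: a stale core can still read it, and a later attempt can finish the cleanup once every core has caught up.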


