Andrzej Bialecki created SOLR-14192:
---------------------------------------
Summary: Race condition between SchemaManager and ZkIndexSchemaReader
Key: SOLR-14192
URL: https://issues.apache.org/jira/browse/SOLR-14192
Project: Solr
Issue Type: Bug
Security Level: Public (Default Security Level. Issues are Public)
Affects Versions: 8.4
Reporter: Andrzej Bialecki
Assignee: Andrzej Bialecki
Fix For: 8.5
Spin-off from SOLR-14128 and SOLR-13368.
In SolrCloud, when a SolrCore is created and uses a managed schema, its
{{ManagedIndexSchemaFactory}} automatically upgrades the initial
{{schema.xml}} to {{managed-schema}}. This includes removing the original
{{schema.xml}} file.
SOLR-13368 added some locking to make sure that the changed resource name (i.e.
{{managed-schema}}) becomes visible only when this process is complete, and
that in-flight requests to /admin/schema block until then, to avoid returning
inconsistent data. This locking mechanism uses plain Java object monitors,
which are local to a single JVM.
However, if there is more than one node in the cluster, a subsequent request to
retrieve the schema may execute on a core that has not yet reloaded its schema
({{ZkIndexSchemaReader}} uses a ZK watcher, which may take some time to
trigger). The resource name in that stale schema still points to
{{schema.xml}}, which by this time no longer exists because it was removed by
{{ManagedIndexSchemaFactory}} in the first core.
As I see it, there are two bugs here:
# There is no distributed locking when this upgrade is performed, so multiple
cores can race against each other to perform it.
# The upgrade process removes {{schema.xml}} too early. Creating the
{{managed-schema}} file triggers all other cores to reload from the new managed
schema, but the upgrade should wait until this reload is complete on all cores.
Only then is it safe to delete the non-managed resource, because it is no
longer in use by any core.
Issue 1 can be solved by adding an ephemeral znode lock so that only one core
can perform the upgrade. Issue 2 can be solved by calling
{{ManagedIndexSchema.waitForSchemaZkVersionAgreement}} after the upgrade, and
deleting {{schema.xml}} only after it returns.
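
To make the intended ordering concrete, here is a minimal in-memory sketch of both fixes. It does not use the ZooKeeper or Solr APIs: the class {{SchemaUpgradeSketch}}, the lock path and all method names are hypothetical, a concurrent set stands in for the znodes, and {{allCoresAgree}} is only a stand-in for what {{waitForSchemaZkVersionAgreement}} would check. A real ephemeral znode would also disappear if its owner's session died, which this model does not cover.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** In-memory model of the proposed fix; NOT the ZooKeeper or Solr API. */
final class SchemaUpgradeSketch {
    // Hypothetical lock path, modelling an ephemeral znode only one core can create.
    static final String LOCK = "/schema-upgrade-lock";

    // Stand-in for the znodes of the config set (resource files and the lock node).
    final Set<String> znodes = ConcurrentHashMap.newKeySet();
    // Core name -> resource name its currently loaded schema points to.
    final Map<String, String> coreResource = new ConcurrentHashMap<>();
    // Counts how many times the upgrade actually ran.
    final AtomicInteger upgrades = new AtomicInteger();

    SchemaUpgradeSketch(String... cores) {
        znodes.add("schema.xml");
        for (String core : cores) {
            coreResource.put(core, "schema.xml");
        }
    }

    /** Models the ZK watcher firing on a core: it switches to the managed resource. */
    void reload(String core) {
        if (znodes.contains("managed-schema")) {
            coreResource.put(core, "managed-schema");
        }
    }

    /** True once no core still points at schema.xml. */
    boolean allCoresAgree() {
        return coreResource.values().stream().allMatch("managed-schema"::equals);
    }

    /** Fix 1: only the core that creates the lock node may upgrade. Returns true if it did. */
    boolean tryUpgrade(String core) {
        if (!znodes.add(LOCK)) {
            return false; // another core holds the lock
        }
        try {
            if (znodes.contains("managed-schema")) {
                return false; // upgrade already done by another core
            }
            znodes.add("managed-schema"); // note: schema.xml is NOT deleted here
            upgrades.incrementAndGet();
            reload(core);
            return true;
        } finally {
            znodes.remove(LOCK);
        }
    }

    /** Fix 2: delete schema.xml only after every core has reloaded. */
    boolean deleteOldSchemaIfSafe() {
        if (!allCoresAgree()) {
            return false; // some core would still try to read schema.xml
        }
        return znodes.remove("schema.xml");
    }
}
```

In this model a second core calling {{tryUpgrade}} after the first gets {{false}}, and {{deleteOldSchemaIfSafe}} refuses to remove {{schema.xml}} until {{reload}} has run on every core, so a stale core never points at a missing resource.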