[ https://issues.apache.org/jira/browse/SOLR-14192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017466#comment-17017466 ]
Andrzej Bialecki commented on SOLR-14192:
-----------------------------------------

This patch seems to fix it for me; at least I wasn't able to reproduce the problem anymore.

Summary of changes:
* use an ephemeral ZK lock when upgrading the schema to managed.
* be more lenient when retrieving the schema - if the local core claims to still be using {{schema.xml}} but it cannot be found in ZK, then try to retrieve the backup left over after the upgrade, i.e. {{schema.xml.bak}}, and if that doesn't exist either then simply use the current in-memory schema.

> Race condition between SchemaManager and ZkIndexSchemaReader
> ------------------------------------------------------------
>
>                 Key: SOLR-14192
>                 URL: https://issues.apache.org/jira/browse/SOLR-14192
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>    Affects Versions: 8.4
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>            Priority: Major
>             Fix For: 8.5
>
>         Attachments: SOLR-14192.patch
>
>
> Spin-off from SOLR-14128 and SOLR-13368.
> In SolrCloud, when a SolrCore is created and it uses managed schema, its
> {{ManagedIndexSchemaFactory}} performs an automatic upgrade of the initial
> {{schema.xml}} to {{managed-schema}}. This includes removing the original
> {{schema.xml}} file.
> SOLR-13368 added some locking to make sure the changed resource name (i.e.
> {{managed-schema}}) becomes visible only when this process is complete, and
> that in-flight requests to /admin/schema block until this process is
> complete, to avoid returning inconsistent data. This locking mechanism uses
> simple Object monitors.
> However, if there's more than one node in the cluster, the subsequent request to
> retrieve the schema may execute on a core that still hasn't reloaded its schema
> ({{ZkIndexSchemaReader}} uses a ZK watcher, which may take some time to
> trigger), and the resource name in that stale schema still points to
> {{schema.xml}}, which by this time no longer exists because it was removed by
> {{ManagedIndexSchemaFactory}} in the first core.
> As I see it there are two bugs here:
> # there's no distributed locking when this upgrade is performed, so it's
> natural that there are multiple cores racing against each other to perform
> this upgrade.
> # the upgrade process removes {{schema.xml}} too early - it triggers all
> other cores by creating the {{managed-schema}} file, and then other cores
> reload from the new managed schema - but it should wait until this reload is
> complete on all cores, because only then is it safe to delete the non-managed
> resource, as it's no longer in use by any core.
> Issue 1 can be solved by adding an ephemeral znode lock so that only one
> core can perform the upgrade. Issue 2 can be solved by using
> {{ManagedIndexSchema.waitForSchemaZkVersionAgreement}} after the upgrade, and
> deleting {{schema.xml}} only after it's done.
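
For readers skimming the thread, the lenient lookup order from the summary of changes ({{schema.xml}}, then {{schema.xml.bak}}, then the current in-memory schema) can be sketched roughly as below. This is only an illustration, not the code from SOLR-14192.patch: the class name, the {{resolveSchema}} method and the use of a plain {{Map}} as a stand-in for the config files stored in ZK are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of the fallback order described in the comment.
 * Not Solr code: a Map stands in for the files kept in ZooKeeper.
 */
final class SchemaFallbackSketch {

    /**
     * Returns the schema content to use for a core.
     *
     * @param zkFiles        stand-in for the ZK config set (resource name to content)
     * @param resourceName   the resource the local core believes is current, e.g. "schema.xml"
     * @param inMemorySchema the schema the core currently holds in memory
     */
    static String resolveSchema(Map<String, String> zkFiles,
                                String resourceName,
                                String inMemorySchema) {
        // 1. Try the resource the (possibly stale) local core points to.
        String content = zkFiles.get(resourceName);
        if (content != null) {
            return content;
        }
        // 2. It may have been removed by the upgrade on another core;
        //    try the backup left behind, e.g. "schema.xml.bak".
        content = zkFiles.get(resourceName + ".bak");
        if (content != null) {
            return content;
        }
        // 3. Neither exists: keep using the current in-memory schema.
        return inMemorySchema;
    }

    public static void main(String[] args) {
        Map<String, String> zk = new HashMap<>();
        // State after another core finished the upgrade: schema.xml is gone.
        zk.put("managed-schema", "<schema name=\"managed\"/>");
        zk.put("schema.xml.bak", "<schema name=\"backup\"/>");

        // A stale core still asks for schema.xml and is served the backup.
        System.out.println(resolveSchema(zk, "schema.xml", "<schema name=\"in-memory\"/>"));
    }
}
```

The point of the ordering is that a stale core never fails outright: it degrades to progressively older but still usable schema sources until its ZK watcher fires and it reloads from {{managed-schema}}.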