[ https://issues.apache.org/jira/browse/SOLR-14192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022705#comment-17022705 ]
David Smiley commented on SOLR-14192:
-------------------------------------

IMO we should remove the underlying complexity and not have the schema file change on the fly. Certainly out of scope here but this bug reveals it's tricky. Also I encountered this as a limitation when doing schema sharing in SOLR-14040

> Race condition between SchemaManager and ZkIndexSchemaReader
> ------------------------------------------------------------
>
>                 Key: SOLR-14192
>                 URL: https://issues.apache.org/jira/browse/SOLR-14192
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 8.4
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>            Priority: Major
>             Fix For: 8.5
>
>         Attachments: SOLR-14192.patch
>
>
> Spin-off from SOLR-14128 and SOLR-13368.
> In SolrCloud when a SolrCore is created and it uses managed schema then its
> {{ManagedIndexSchemaFactory}} performs an automatic upgrade of the initial
> {{schema.xml}} to {{managed-schema}}. This includes removing the original
> {{schema.xml}} file.
> SOLR-13368 added some locking to make sure the changed resource name (i.e.
> {{managed-schema}}) becomes visible only when this process is complete, and
> that in-flight requests to /admin/schema block until this process is
> complete, to avoid returning inconsistent data. This locking mechanism uses
> simple Object monitors.
> However, if there's more than 1 node in the cluster the subsequent request to
> retrieve schema may execute on a core that still hasn't reloaded its schema
> ({{ZkIndexSchemaReader}} uses a ZK watcher, which may take some time to
> trigger), and the resource name in that stale schema still points to
> {{schema.xml}}, which by this time no longer exists because it was removed by
> {{ManagedIndexSchemaFactory}} in the first core.
> As I see it there are two bugs here:
> # there's no distributed locking when this upgrade is performed, so it's
> natural that there are multiple cores racing against each other to perform
> this upgrade.
> # the upgrade process removes {{schema.xml}} too early - it triggers all
> other cores by creating the {{managed-schema}} file, and then other cores
> reload from the new managed schema - but it should wait until this reload is
> complete on all cores because only then it's safe to delete the non-managed
> resource as it's no longer in use by any core.
> Issue 1. can be solved by adding an ephemeral znode lock so that only one
> core can perform the upgrade. Issue 2. can be solved by using
> {{ManagedIndexSchema.waitForSchemaZkVersionAgreement}} after upgrade, and
> deleting {{schema.xml}} only after it's done.
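
The two fixes proposed in the description (a lock so that only one core performs the upgrade, and deleting {{schema.xml}} only after every core has reloaded) might be sketched roughly as below. This is an in-memory illustration under stated assumptions, not the attached patch: `FakeConfigStore`, `SchemaUpgradeSketch`, the lock path, and the `CountDownLatch` standing in for `waitForSchemaZkVersionAgreement` are all hypothetical names and are not Solr or ZooKeeper APIs. The map-based lock also does not model an ephemeral znode disappearing when its session ends.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** In-memory stand-in for the shared config store (hypothetical, not a real API). */
final class FakeConfigStore {
    // Names of the config files currently present in the shared store.
    final Set<String> files = ConcurrentHashMap.newKeySet();
    // Lock nodes; putIfAbsent mimics "create fails if the node already exists".
    private final Map<String, String> locks = new ConcurrentHashMap<>();

    boolean tryLock(String path, String owner) {
        return locks.putIfAbsent(path, owner) == null;
    }

    void unlock(String path, String owner) {
        locks.remove(path, owner); // removes the entry only if this owner holds it
    }
}

final class SchemaUpgradeSketch {
    static final String LOCK_PATH = "/schema-upgrade-lock"; // hypothetical path

    /**
     * Returns true only if this core performed the upgrade and deleted schema.xml.
     * allCoresReloaded is assumed to be counted down once by each core after it
     * has reloaded from managed-schema.
     */
    static boolean upgrade(FakeConfigStore store, String coreName,
                           CountDownLatch allCoresReloaded, long timeoutMs) {
        // Issue 1: only the core that acquires the lock may upgrade.
        if (!store.tryLock(LOCK_PATH, coreName)) {
            return false;
        }
        try {
            if (!store.files.contains("schema.xml")) {
                return false; // nothing left to upgrade
            }
            // Publishing managed-schema is what prompts the other cores to reload.
            store.files.add("managed-schema");
            // Issue 2: wait for agreement before removing the old resource.
            if (!allCoresReloaded.await(timeoutMs, TimeUnit.MILLISECONDS)) {
                return false; // keep schema.xml, a stale core may still reference it
            }
            store.files.remove("schema.xml");
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false; // interrupted while waiting: leave schema.xml in place
        } finally {
            store.unlock(LOCK_PATH, coreName);
        }
    }
}
```

The point of the ordering is that {{schema.xml}} survives any failed or timed-out wait, so a core that has not yet seen the watcher event can still resolve the resource name held in its stale schema.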