[ https://issues.apache.org/jira/browse/SOLR-14969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17222947#comment-17222947 ]
Erick Erickson commented on SOLR-14969:
---------------------------------------

First, thanks for raising this. While we might be able to get around this in the CoreAdmin STATUS command (I haven't looked), the fact that Solr needs to be reloaded is scary enough that the root cause should be addressed.

This'll be somewhat tricky to fix. Much of the complexity here is because of "transient" and "lazy" cores. On a quick look at the change, I don't see anything obvious that would have changed the behavior of the create code, but perhaps it exposed an underlying issue.

*background*

There are two use-cases:

1> An installation has, say, 1,000 cores but only 100 of them are in use at any given time. Given enough disk space, hardware costs can be reduced by a factor of 10 if cores can load and unload dynamically on an LRU basis. This is the "transient" case.

2> An installation has 1,000 cores that can all be loaded at once, but startup time is prohibitive if Solr waits for all 1,000 cores to be loaded. The "lazy" case applies when they can afford to take the hit of loading a core when the first request is made to it.

So at startup, all the CoreDescriptors are read through "core discovery", but the cores may or may not be loaded. Which means that there's lots of synchronization code and a bewildering variety of lists (pendingCoreOps, currentlyLoadingCores, similar lists in the TransientSolrCoreCache). currentlyLoadingCores is tempting, but it's for async core loading.

All the above is to emphasize that this code is gnarly for some non-obvious reasons, tread with care ;).

*On to this problem*

CoreContainer.create checks at the top for pre-existing cores and throws an error if it finds one, which is fine and good for cores that exist already 'cause all the CoreDescriptors are read at CoreContainer initialization.
However, when creating a new core, the new core descriptor is invisible to any other thread that comes in here and does the above check until sometime during core creation, which is where this problem arises. The check at the top won't "see" the core being created until it's eventually added to one of the lists in SolrCores.

I suppose we could add in something like the following:

Add a new member variable.
{code:java}
// Names of cores that are currently being created.
List<String> inFlightCores = new ArrayList<String>();
{code}
then guard the entirety of CoreContainer.create with a check followed by a try/finally block
{code:java}
synchronized (inFlightCores) {
  // Claim the name before entering the try block, so that a rejected
  // caller never reaches the finally clause and cannot remove the
  // entry that belongs to the thread already creating the core.
  if (inFlightCores.contains(newCoreName)) {
    throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
        "Core '" + newCoreName + "' is already being created");
  }
  inFlightCores.add(newCoreName);
}
try {
  // rest of CoreContainer.create code
} finally {
  synchronized (inFlightCores) {
    inFlightCores.remove(newCoreName);
  }
}
{code}
A significant amount of the complexity there is a result of lessons learned when a client made extensive use of transient cores that were created and destroyed, but obviously not two at once with the same name.

I'm not excited about adding an ad-hoc fix like above, but I can say with certainty that a more intrusive change would be difficult to get right. Pending a more extensive revisiting of all the core admin operations, maybe something like this would be the best way to go...

I'd intended just to write this up, but now that I've thought about it I'll see if I can work up a fix like above. Could you test a fix if I come up with one? I should be able to write a test though...

> Race condition when creating cores leads to NPE in CoreAdmin STATUS
> -------------------------------------------------------------------
>
> Key: SOLR-14969
> URL: https://issues.apache.org/jira/browse/SOLR-14969
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level.
> Issues are Public)
> Components: multicore
> Affects Versions: 8.6, 8.6.3
> Reporter: Andreas Hubold
> Priority: Major
>
> CoreContainer#create does not correctly handle concurrent requests to create the same core. There's a race condition (see also existing TODO comment in the code), and CoreContainer#createFromDescriptor may be called subsequently for the same core name.
> The _second call_ then fails to create an IndexWriter, and exception handling causes an inconsistent CoreContainer state.
> {noformat}
> 2020-10-27 00:29:25.350 ERROR (qtp2029754983-24) [ ] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Error CREATEing SolrCore 'blueprint_acgqqafsogyc_comments': Unable to create core [blueprint_acgqqafsogyc_comments] Caused by: Lock held by this virtual machine: /var/solr/data/blueprint_acgqqafsogyc_comments/data/index/write.lock
>     at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1312)
>     at org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$0(CoreAdminOperation.java:95)
>     at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:367)
>     ...
> Caused by: org.apache.solr.common.SolrException: Unable to create core [blueprint_acgqqafsogyc_comments]
>     at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1408)
>     at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1273)
>     ... 47 more
> Caused by: org.apache.solr.common.SolrException: Error opening new searcher
>     at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1071)
>     at org.apache.solr.core.SolrCore.<init>(SolrCore.java:906)
>     at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1387)
>     ... 48 more
> Caused by: org.apache.solr.common.SolrException: Error opening new searcher
>     at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2184)
>     at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2308)
>     at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1130)
>     at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1012)
>     ... 50 more
> Caused by: org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /var/solr/data/blueprint_acgqqafsogyc_comments/data/index/write.lock
>     at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:139)
>     at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41)
>     at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45)
>     at org.apache.lucene.store.FilterDirectory.obtainLock(FilterDirectory.java:105)
>     at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:785)
>     at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:126)
>     at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:100)
>     at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:261)
>     at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:135)
>     at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2145)
> {noformat}
> CoreContainer#createFromDescriptor removes the CoreDescriptor when handling this exception. The SolrCore created for the first successful call is still registered in SolrCores.cores, but now there's no corresponding CoreDescriptor for that name anymore.
> This inconsistency leads to subsequent NullPointerExceptions, for example when using CoreAdmin STATUS with the core name: CoreAdminOperation#getCoreStatus first gets the non-null SolrCore (cores.getCore(cname)) but core.getInstancePath() throws an NPE, because the CoreDescriptor is not registered anymore:
> {noformat}
> 2020-10-27 00:29:25.353 INFO (qtp2029754983-19) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={core=blueprint_acgqqafsogyc_comments&action=STATUS&indexInfo=false&wt=javabin&version=2} status=500 QTime=0
> 2020-10-27 00:29:25.353 ERROR (qtp2029754983-19) [ ] o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error handling 'STATUS' action
>     at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:372)
>     at org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:397)
>     at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:181)
>     ...
> Caused by: java.lang.NullPointerException
>     at org.apache.solr.core.SolrCore.getInstancePath(SolrCore.java:333)
>     at org.apache.solr.handler.admin.CoreAdminOperation.getCoreStatus(CoreAdminOperation.java:329)
>     at org.apache.solr.handler.admin.StatusOp.execute(StatusOp.java:54)
>     at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:367)
> {noformat}
> STATUS keeps failing until Solr is restarted.
> The NPE for CoreAdmin STATUS is a regression in 8.6. It seems to be caused by
> https://github.com/apache/lucene-solr/commit/17ae79b0905b2bf8635c1b260b30807cae2f5463#diff-9652fe8353b7eff59cd6f128bb2699d88361e670b840ee5ca1018b1bc45584d1R324
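
For anyone who wants to experiment with the in-flight guard idea from the comment outside of Solr, here is a minimal, self-contained sketch. It is not Solr code: InFlightCoreGuard, create and isInFlight are hypothetical names, a Supplier stands in for the rest of CoreContainer.create, and IllegalStateException replaces SolrException so the example has no dependencies. It also swaps the synchronized ArrayList for a concurrent set from ConcurrentHashMap.newKeySet(), whose add() does the check and the claim in one atomic step.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical helper, not part of Solr: rejects overlapping creates of the same core name. */
final class InFlightCoreGuard {
    // Names of cores whose creation is currently in progress.
    private final Set<String> inFlightCores = ConcurrentHashMap.newKeySet();

    /**
     * Runs createAction unless another caller is already creating coreName.
     * The name is claimed before the try block, so a rejected caller never
     * reaches the finally clause and cannot release the winner's claim.
     */
    <T> T create(String coreName, Supplier<T> createAction) {
        if (!inFlightCores.add(coreName)) { // add() returns false if already present
            throw new IllegalStateException("Core '" + coreName + "' is already being created");
        }
        try {
            return createAction.get(); // stand-in for the rest of CoreContainer.create
        } finally {
            inFlightCores.remove(coreName);
        }
    }

    /** True while a create for coreName is running. */
    boolean isInFlight(String coreName) {
        return inFlightCores.contains(coreName);
    }

    public static void main(String[] args) {
        InFlightCoreGuard guard = new InFlightCoreGuard();
        String result = guard.create("demo_core", () -> {
            // A nested attempt stands in for a second, overlapping CREATE request.
            try {
                guard.create("demo_core", () -> "second");
                return "unexpected";
            } catch (IllegalStateException rejected) {
                return "first";
            }
        });
        // prints "first, still in flight: false"
        System.out.println(result + ", still in flight: " + guard.isInFlight("demo_core"));
    }
}
```

The nested call in main is only a deterministic way to show an overlapping request being rejected; with real threads the same add()/remove() pair applies. Whether a guard like this fits with pendingCoreOps and the other lists in SolrCores is a separate question that this sketch does not address.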