Mike Drob created SOLR-15093:
--------------------------------

             Summary: Heavy lock contention during collection creation
                 Key: SOLR-15093
                 URL: https://issues.apache.org/jira/browse/SOLR-15093
             Project: Solr
          Issue Type: Task
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Mike Drob

I was doing some lock analysis and found that we have quite a bit of contention on {{ZkStateReader$LazyCollectionRef.get(boolean)}} during heavy collection creation. I ran a sample workload creating as many collections as I could in 10 minutes, and this method was blocked for about 1:30 of that, which is a pretty significant portion.

A few representative stack traces:

{noformat}
org.apache.solr.common.cloud.ZkStateReader$LazyCollectionRef.get(boolean)
org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String, boolean)
org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String)
org.apache.solr.cloud.ZkController.checkIfCoreNodeNameAlreadyExists(CoreDescriptor)
org.apache.solr.core.CoreContainer.create(String, Path, Map, boolean)
{noformat}

And another:

{noformat}
org.apache.solr.common.cloud.ZkStateReader$LazyCollectionRef.get(boolean)
org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String, boolean)
org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String)
org.apache.solr.common.cloud.ZkStateReader.getCollection(String)
org.apache.solr.cloud.ZkController.publish(CoreDescriptor, Replica$State, boolean, boolean)
org.apache.solr.cloud.ZkController.preRegister(CoreDescriptor, boolean)
org.apache.solr.core.CoreContainer.createFromDescriptor(CoreDescriptor, boolean, boolean)
org.apache.solr.core.CoreContainer.create(String, Path, Map, boolean)
{noformat}

And one more:

{noformat}
org.apache.solr.common.cloud.ZkStateReader$LazyCollectionRef.get(boolean)
org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String, boolean)
org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String)
org.apache.solr.common.cloud.ZkStateReader.registerDocCollectionWatcher(String, DocCollectionWatcher)
org.apache.solr.common.cloud.ZkStateReader.waitForState(String, long, TimeUnit, Predicate)
org.apache.solr.cloud.ZkController.checkStateInZk(CoreDescriptor)
org.apache.solr.cloud.ZkController.preRegister(CoreDescriptor, boolean)
org.apache.solr.core.CoreContainer.createFromDescriptor(CoreDescriptor, boolean, boolean)
org.apache.solr.core.CoreContainer.create(String, Path, Map, boolean)
{noformat}

It looks like part of the problem is that we never allow ourselves to use the cache, so each call ends up being a full fetch out to ZK. We have the optimizations there to compare the stat and the version, but it still appears to be relatively heavyweight.

cc: [~noble.paul], you might find this interesting.
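
To illustrate the pattern described above, here is a minimal standalone sketch. It is not the actual Solr code. {{LazyRef}}, {{remoteFetch}} and {{ttlNanos}} are hypothetical names, and the supplier merely stands in for a round trip to ZK. It assumes a synchronized getter that only skips the fetch when the caller permits a cached value, which is roughly the shape the stack traces suggest.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class LazyRefSketch {

    /** Hypothetical stand-in for a lazily loaded collection reference. */
    public static final class LazyRef<T> {
        private final Supplier<T> remoteFetch; // stands in for a fetch from ZK
        private final long ttlNanos;           // how long a cached value counts as fresh
        private T cached;
        private long lastFetchNanos;
        private boolean loaded;

        public LazyRef(Supplier<T> remoteFetch, long ttlNanos) {
            this.remoteFetch = remoteFetch;
            this.ttlNanos = ttlNanos;
        }

        // synchronized: all callers serialize on this monitor, so while one
        // thread is inside remoteFetch every other caller is blocked.
        public synchronized T get(boolean allowCached) {
            long now = System.nanoTime();
            boolean fresh = loaded && (now - lastFetchNanos) < ttlNanos;
            if (!(allowCached && fresh)) {
                cached = remoteFetch.get(); // full fetch while holding the lock
                lastFetchNanos = now;
                loaded = true;
            }
            return cached;
        }
    }

    public static void main(String[] args) {
        AtomicInteger fetches = new AtomicInteger();
        LazyRef<String> ref = new LazyRef<>(() -> {
            fetches.incrementAndGet();
            return "collection-state";
        }, 60_000_000_000L); // 60 s freshness window, chosen arbitrarily

        // Callers that never allow the cache trigger a fetch on every call.
        for (int i = 0; i < 5; i++) {
            ref.get(false);
        }
        System.out.println("fetches with allowCached=false: " + fetches.get());

        // Callers that allow the cache reuse the value while it is fresh.
        fetches.set(0);
        for (int i = 0; i < 5; i++) {
            ref.get(true);
        }
        System.out.println("fetches with allowCached=true: " + fetches.get());
    }
}
```

In this toy version, the contention comes from two things combining: every call takes the same lock, and every call with {{allowCached=false}} does the expensive fetch while holding it. Whether the real call sites can safely tolerate a cached value is a separate question that the sketch does not answer.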