[ https://issues.apache.org/jira/browse/SOLR-13781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hoss Man updated SOLR-13781: ---------------------------- Attachment: egrep-out.local.log.txt egrep-out.apache_Lucene-Solr-Tests-8.x_587.testPackageAPI.txt apache_Lucene-Solr-Tests-8.x_587.testPackageAPI.txt local.log.txt Status: Open (was: Open) I'm attaching 4 files... * apache_Lucene-Solr-Tests-8.x_587.testPackageAPI.txt ** the section of logs from the apache/Lucene-Solr-Tests-8.x/587 jenkins build that includes the FAIL of testPackageAPI * local.log.txt ** logs from my local PASS run of testPackageAPI using the same git SHA and reproduce line * egrep-out.apache_Lucene-Solr-Tests-8.x_587.testPackageAPI.txt * egrep-out.local.log.txt ...the last 2 being the output of running the following grep command on the first 2... {noformat} egrep 'Loaded cluster properties|MemClassLoader|PackageManager RequestHandlers being reloaded|PackageManager Registering request handlers|webapp=null path=/node/ext|webapp=null path=/cluster' {noformat} ...in order to extract the lines that seem the most important to notice for comparison. The key bit i'd like to point out is this section of egrep-out.local.log.txt, which begins with the last request client request of {{/node/ext/bar}} to verify it exists, and shows the initial log messages resuting from it's subsequent deletion... {noformat} 6354 INFO (qtp822543449-83) [n:127.0.0.1:36249_solr ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/node/ext/bar params={wt=javabin&version=2} status=0 QTime=1 6373 DEBUG (zkCallback-49-thread-1) [ ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:40725/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3 6373 DEBUG (zkCallback-64-thread-1) [ ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:40725/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3 6373 DEBUG (zkCallback-79-thread-1) [ ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:40725/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3 6373 DEBUG (zkCallback-70-thread-1) [ ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:40725/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3 6374 DEBUG (zkCallback-63-thread-1) [ ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:40725/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3 6374 DEBUG (zkCallback-72-thread-1) [ ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:40725/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3 6375 DEBUG (zkCallback-88-thread-1) [ ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:40725/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3 6375 DEBUG (zkCallback-53-thread-1) [ ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:40725/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3 6378 DEBUG (zkCallback-45-thread-1) [ ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:40725/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3 6380 DEBUG (zkCallback-72-thread-1) [ ] o.a.s.c.PackageManager RequestHandlers being reloaded : [] 6381 DEBUG (zkCallback-72-thread-1) [ ] o.a.s.c.PackageManager Registering request handlers [] 6381 DEBUG (zkCallback-49-thread-1) [ ] o.a.s.c.PackageManager RequestHandlers being reloaded : [] 6381 DEBUG (zkCallback-49-thread-1) [ ] o.a.s.c.PackageManager Registering request handlers [] 6386 DEBUG (zkCallback-53-thread-1) [ ] o.a.s.c.PackageManager RequestHandlers being reloaded : [] 6387 DEBUG (zkCallback-53-thread-1) [ ] o.a.s.c.PackageManager Registering request handlers [] 6388 DEBUG (zkCallback-45-thread-1) [ ] o.a.s.c.PackageManager RequestHandlers being reloaded : [] 6388 DEBUG (zkCallback-45-thread-1) [ ] o.a.s.c.PackageManager Registering request handlers [] ... {noformat} * There are 9 zkCallback threads that detect the new cluster properties indicating an empty list of requestHandler's * 4 of these zkCallback threads (corrisponding to the 4 nodes in the MiniSolrCloudCluster) invoke the PackageManager (the others are evidently unrelated cluster property watchers) * These 4 zkCallback threads all agree that no request handlers need reloaded, and no request handlers need registered. Compare that with the same section of logs from egrep-out.apache_Lucene-Solr-Tests-8.x_587.testPackageAPI.txt ... {noformat} 3532248 INFO (qtp1010395414-41961) [n:127.0.0.1:42699_solr ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/node/ext/bar params={wt=javabin&version=2} status=0 QTime=0 3532251 DEBUG (zkCallback-14907-thread-1) [ ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:43314/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3 3532251 DEBUG (zkCallback-14919-thread-1) [ ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:43314/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3 3532251 DEBUG (zkCallback-14941-thread-1) [ ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:43314/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3 3532251 DEBUG (zkCallback-14935-thread-1) [ ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:43314/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3 3532263 DEBUG (zkCallback-14892-thread-2) [ ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:43314/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3 3532263 DEBUG (zkCallback-14896-thread-2) [ ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:43314/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3 3532264 DEBUG (zkCallback-14896-thread-2) [ ] o.a.s.c.PackageManager RequestHandlers being reloaded : [] 3532264 DEBUG (zkCallback-14896-thread-2) [ ] o.a.s.c.PackageManager Registering request handlers [] 3532265 DEBUG (zkCallback-14898-thread-2) [ ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:43314/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3 3532265 DEBUG (zkCallback-14912-thread-1) [ ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:43314/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3 3532266 DEBUG (qtp1010395414-41962) [n:127.0.0.1:42699_solr ] o.a.s.c.PackageManager RequestHandlers being reloaded : [] 3532266 DEBUG (qtp1010395414-41962) [n:127.0.0.1:42699_solr ] o.a.s.c.PackageManager Registering request handlers [] 3532269 DEBUG (zkCallback-14927-thread-2) [ ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:43314/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3 3532321 DEBUG (zkCallback-14892-thread-2) [ ] o.a.s.c.PackageManager RequestHandlers being reloaded : [] 3532321 DEBUG (zkCallback-14892-thread-2) [ ] o.a.s.c.PackageManager Registering request handlers [] 3532340 DEBUG (zkCallback-14898-thread-2) [ ] o.a.s.c.PackageManager RequestHandlers being reloaded : [] 3532340 DEBUG (zkCallback-14898-thread-2) [ ] o.a.s.c.PackageManager Registering request handlers [] 3532371 INFO (zkCallback-14898-thread-1) [ ] o.a.s.c.MemClassLoader Defining_class org.apache.solr.core.RuntimeLibReqHandler from runtime jar global 3532372 INFO (zkCallback-14896-thread-1) [ ] o.a.s.c.MemClassLoader Defining_class org.apache.solr.core.RuntimeLibReqHandler from runtime jar global 3532372 INFO (zkCallback-14927-thread-1) [ ] o.a.s.c.MemClassLoader Defining_class org.apache.solr.core.RuntimeLibReqHandler from runtime jar global 3532372 DEBUG (zkCallback-14927-thread-1) [ ] o.a.s.c.PackageManager Registering request handlers [bar] 3532373 DEBUG (zkCallback-14896-thread-1) [ ] o.a.s.c.PackageManager Registering request handlers [bar] 3532389 DEBUG (zkCallback-14898-thread-1) [ ] o.a.s.c.PackageManager Registering request handlers [bar] 3532390 INFO (zkCallback-14892-thread-1) [ ] o.a.s.c.MemClassLoader Defining_class org.apache.solr.core.RuntimeLibReqHandler from runtime jar global 3532390 DEBUG (zkCallback-14892-thread-1) [ ] o.a.s.c.PackageManager Registering request handlers [bar] {noformat} * Again, we see 9 zkCallback threads that detect the new cluster properties indicating an empty list of requestHandler's * Again, 4 of these zkCallback threads invoke the PackageManager * Again, these 4 zkCallback threads all agree that no request handlers need reloaded, and no request handlers need registered. ... BUT! ... * the naming of these 4 zkCallback threads indicate they are the *2nd* thread created from their respective threadpools * subsequent to them finishing, a similarly named *1st* thread from each of those 4 zkCallback pools calls MemClassLoader to Defining_class & PackageManager to register the 'bar' request handler * if we search backwards in the log for these threads, it seems that these are still left over (and still processing) the cluster property changed events that occured as a result of the {{update-package}} command ... they didn't finish their "Defining_class" & "Registering request handlers" execution by the time the test executed the {{delete-requesthandler}} (FWIW: i'm also a little perplexed why the logs in the failure case show the jetty request handling thread {{qtp1010395414-41962}} currently processing the {{/cluster}} request *also* calling PackageManager directly in addition to whichever zkCallback thread corrisponds to that node) > TestContainerReqHandler.testPackageAPI failures imply race condition between > update-package and delete-requesthandler > --------------------------------------------------------------------------------------------------------------------- > > Key: SOLR-13781 > URL: https://issues.apache.org/jira/browse/SOLR-13781 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Reporter: Hoss Man > Priority: Major > Attachments: apache_Lucene-Solr-Tests-8.x_587.testPackageAPI.txt, > egrep-out.apache_Lucene-Solr-Tests-8.x_587.testPackageAPI.txt, > egrep-out.local.log.txt, local.log.txt > > > We're seeing roughly an 8% failure rate from > {{TestContainerReqHandler.testPackageAPI}} with failures occuring on both > master and branch_8x, and on various jenkins servers and various OSes. > All of the failures occur at the same place: A V2 request to {{/node/ext}} to > verify that that the {{requestHandler}} List is empty after issuing a > {{delete-requesthandler: 'bar'}} payload to the {{/cluster}} API. The logs > and failure message indicate that the {{'bar'}} request handler still exists > even the assertion does a "sleep/retry" of the verification query 10 times. > While i don't fully understand this test, or the underlying code being > tested, i spent a little time digging into the logs from some of these > jenkins failures, and comparing them to the logs i see generated when i get a > successful test run locally, and I think what's happening here - and the > reason that {{delete-requesthandler}} seems to "fail" frequently in this test > method, but not in {{testSetClusterReqHandler}} - is because the prior > {{update-package}} command is still in process. > After the test code runs an {{update-package}} command, the test executes > requests against {{/node/ext/bar}} to verify that the {{version}} has changed > as a result of updating the package, but i suspect this is only looking at > the _metadata_ that has changed as a result of the {{update-package}} command > and not actaully ensuring that the request handler has fully loaded - because > the logs when this test fails seem to show that the zkCallback threads kicked > off by {{update-package}} command are still running when the zkCallback > threads kicked off by the subsequent {{delete-requesthandler}} command are > running, and finish *after* them, "re-registering" the handler that was just > deleted. > ---- > It's not 100% clear to me if this is _just_ a test bug - and it should be > monitoring something else to know when the request handler's a finished > loading - or if this indicates a broader flaw in the design of how commands > like {{add-package}} / {{update-package}} / {{add-requesthandler}} / > {{delete-requesthandler}} should interact if/when they occur in close > temporal proximity. > (ie: if there are zkCallback watchers loading classes and initializing > objects as a result of cluster property changes, shouldn't there be some sort > of lineraization/synchronization logic to ensure that they get executed in > the same order on all the nodes in the cluster?) > ---- > More detail and log file attachments to follow... -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org