[ https://issues.apache.org/jira/browse/SOLR-14159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008057#comment-17008057 ]
Erick Erickson commented on SOLR-14159:
---------------------------------------

[~hossman] Many thanks for your effort here, I'm totally going out of my gourd trying to track this.

The full output is in the stdout file that I attached yesterday. All the debug output is a total mess, I posted it there for masochists ;). Basically, it's a scattergun approach: "put all this logging in everywhere you can and see what sticks to the wall" ;). Most of it is dumping a bunch of data trying to figure out whether there was something weird with proxies, whether the server where the connection was refused was somehow NOT in live_nodes, anything I could think of that would shed light on what I was seeing. None of the things I was looking at as possible causes were borne out.

AFAICT:
- The proxies are fine, at least to my untrained eye.
- Recovery was successful.
- The host is in live_nodes.
- The node where the connection is refused is up and has updated state.json marking itself as active.
- The query happens after the node recovers and posts itself as active and the host is in live_nodes.
- The request that's failing is actually going to the host, not using the proxy (I think).
- Just to see what would happen, I tried re-opening the HttpSolrClient upon failure, but the test still fails with "connection refused".
- There are a bunch of messages about how things are closing down, but these are all after the failure and part of the normal test termination. They're in close proximity in the log file though, which misled me for a bit.
- I can generate this by beasting only the single failing test in this suite.

I'll apply the patch for SOLR-13486 and post that if I can get it to fail again. That'll be later today, probably much later, as it may take several hundred runs beasting only the failing test.

I'll leave in exactly one dump of all the information in clusterstate and the local member variables just before the test that fails, delimited with "EOE START" and "EOE END". That may be useful and should be much easier to ignore.

FWIW, I've seen at least two different flavors of failure apart from what is in SOLR-13486:
1> the Connection Refused problem
2> unable to create the cluster

<2> is much rarer than <1>. If I can get failures for both I'll attach.

> Fix errors in TestCloudConsistency
> ----------------------------------
>
> Key: SOLR-14159
> URL: https://issues.apache.org/jira/browse/SOLR-14159
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Erick Erickson
> Assignee: Erick Erickson
> Priority: Major
> Attachments: SOLR-14159_debug.patch, stdout
>
>
> Moving over here from SOLR-13486 as per Hoss.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)