[ 
https://issues.apache.org/jira/browse/SOLR-14159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008057#comment-17008057
 ] 

Erick Erickson commented on SOLR-14159:
---------------------------------------

[~hossman] Many thanks for your effort here, I'm totally going out of my gourd 
trying to track this down.

The full output is in the stdout file that I attached yesterday. All the debug 
output is a total mess; I posted it there for masochists ;).

Basically, it's a scattergun approach: "put all this logging in everywhere you 
can and see what sticks to the wall" ;). Most of it dumps a bunch of data to 
figure out whether there was something weird with proxies, whether the server 
where the connection was refused was somehow NOT in live_nodes, or anything 
else I could think of that would shed light on what I was seeing. None of the 
things I was looking at as possible causes were borne out.

AFAICT:
- The proxies are fine, at least to my untrained eye.
- Recovery was successful.
- The host is in live_nodes. 
- The node where the connection is refused is up and has updated state.json 
marking itself as active. 
- The query happens after the node recovers and posts itself as active and the 
host is in live_nodes.
- The request that's failing is actually going to the host, not using the proxy 
(I think).
- Just to see what would happen, I tried re-opening the HttpSolrClient upon 
failure, but the test still fails with "connection refused".
- There are a bunch of messages about how things are closing down, but these 
are all after the failure and part of the normal test termination. They're in 
close proximity in the log file though, which misled me for a bit.
- I can generate this by beasting only the single failing test in this suite.

I'll apply the patch for SOLR-13486 and post the results if I can get it to 
fail again. That'll be later today, probably much later, as it may take several 
hundred runs beasting only the failing test. I'll leave in exactly one dump of 
all the information in clusterstate and the local member variables just before 
the test that fails, delimited with "EOE START" and "EOE END"; that may be 
useful and should be much easier to ignore.
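
For anyone unfamiliar with the idea, here is a minimal sketch of what a dump 
delimited that way could look like. It is not taken from the attached patch: 
the class name, the method name and the placeholder values are all 
hypothetical, and it uses only the JDK rather than any Solr API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Solr or the attached patch. */
public class DelimitedDump {
  public static final String START = "EOE START";
  public static final String END = "EOE END";

  /** Renders the entries as one "key=value" line each, wrapped in the two marker lines. */
  public static String format(Map<String, ?> entries) {
    StringBuilder sb = new StringBuilder();
    sb.append(START).append('\n');
    for (Map.Entry<String, ?> e : entries.entrySet()) {
      sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
    }
    sb.append(END).append('\n');
    return sb.toString();
  }

  public static void main(String[] args) {
    // Placeholder values; a real test would fill these from cluster state
    // and from its own member variables.
    Map<String, Object> state = new LinkedHashMap<>();
    state.put("liveNodes", "<placeholder>");
    state.put("replicaState", "<placeholder>");
    System.out.print(format(state));
  }
}
```

With a single marked block like this, the relevant section can be found by 
searching the log for "EOE START" and everything outside the markers skipped.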

FWIW, I've seen at least two different flavors of failure apart from what is in 
13486:
1> the Connection Refused problem
2> unable to create the cluster

Flavor 2 is much rarer than flavor 1. If I can get failures for both, I'll 
attach them.

> Fix errors in TestCloudConsistency
> ----------------------------------
>
>                 Key: SOLR-14159
>                 URL: https://issues.apache.org/jira/browse/SOLR-14159
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Major
>         Attachments: SOLR-14159_debug.patch, stdout
>
>
> Moving over here from SOLR-13486 as per Hoss.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
