[ https://issues.apache.org/jira/browse/SOLR-14159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris M. Hostetter updated SOLR-14159:
--------------------------------------
    Attachment: SOLR-14159_waitFor_testfix.patch
                SOLR-14159_proxy_logging.patch
        Status: Open  (was: Open)

So far I've been unable to reproduce either of the failures Erick mentioned in my local beasting (close to a thousand iterations).

While beasting, I've been looking over the "FailWithHossFix" log, which shows the "Connection Refused" problem (I have not looked into the other log file/problem). Nothing in the log messages gives me an immediate "Ah-HA!" moment, but I do want to reply directly to a few comments Erick has made, which have led to a "best guess" as to what the problem may be (below).

----

Note first that the specific failure condition in this log is...

{noformat}
 [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestCloudConsistency -Dtests.method=testOutOfSyncReplicasCannotBecomeLeader -Dtests.seed=66748AAA9CB5BE46 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=nyn-UG -Dtests.timezone=Pacific/Apia -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
 [junit4] ERROR 50.2s | TestCloudConsistency.testOutOfSyncReplicasCannotBecomeLeader <<<
 [junit4] > Throwable #1: org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://127.0.0.1:49158/solr/outOfSyncReplicasCannotBecomeLeader-false
 ...
{noformat}

...so the problem relates to port #49158 (which hosts a single replica of the single-shard collection).

{quote}Recovery was successful.
{quote}
That does not appear to be true? ... when the leader is shut down (after doc4 gets added) there is an attempt at an election in which both (live) replicas (including 49158) realize they are not eligible to be the leader (due to a stale ShardTerm) and need to wait for the leader to come back and then recover...

{noformat}
 [junit4] 2> 42358 INFO (zkCallback-125-thread-1) [n:127.0.0.1:49158_solr c:outOfSyncReplicasCannotBecomeLeader-false s:shard1 r:core_node6 x:outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n5 ] o.a.s.c.ShardLeaderElectionContext Can't become leader, other replicas with higher term participated in leader election
 [junit4] 2> 42358 INFO (zkCallback-125-thread-1) [n:127.0.0.1:49158_solr c:outOfSyncReplicasCannotBecomeLeader-false s:shard1 r:core_node6 x:outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n5 ] o.a.s.c.ShardLeaderElectionContext There may be a better leader candidate than us - going back into recovery
{noformat}

...but AFAICT, even though '127.0.0.1:49158_solr' has a recovery thread (recoveryExecutor-122-thread-1-processing-n:127.0.0.1:49158_solr) that is running all the way to the end of the test, that replica never seems to log that it has recovered.

When the leader replica does come back online, AFAICT it successfully PeerSyncs to the _other_ replica, but it logs that it can *NOT* PeerSync to 49158 – it encounters a "Connection Refused" (just like the test thread does later)...

{noformat}
 [junit4] 2> 44867 INFO (zkCallback-168-thread-2) [n:127.0.0.1:65519_solr c:outOfSyncReplicasCannotBecomeLeader-false s:shard1 r:core_node2 x:outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n1 ] o.a.s.c.SyncStrategy Sync replicas to http://127.0.0.1:65519/solr/outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n1/
 [junit4] 2> 44877 INFO (zkCallback-168-thread-2) [n:127.0.0.1:65519_solr c:outOfSyncReplicasCannotBecomeLeader-false s:shard1 r:core_node2 x:outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n1 ] o.a.s.u.PeerSync PeerSync: core=outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n1 url=http://127.0.0.1:65519/solr START replicas=[http://127.0.0.1:65529/solr/outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n3/, http://127.0.0.1:49158/solr/outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n5/] nUpdates=100
 [junit4] 2> 44880 INFO (SocketProxy-Acceptor-65529) [ ] o.a.s.c.s.c.SocketProxy accepted Socket[addr=/127.0.0.1,port=49422,localport=65529], receiveBufferSize:408300
 [junit4] 2> 44890 INFO (SocketProxy-Acceptor-65529) [ ] o.a.s.c.s.c.SocketProxy proxy connection Socket[addr=/127.0.0.1,port=65534,localport=49424], receiveBufferSize=408300
 [junit4] 2> 44892 WARN (zkCallback-168-thread-2) [n:127.0.0.1:65519_solr c:outOfSyncReplicasCannotBecomeLeader-false s:shard1 r:core_node2 x:outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n1 ] o.a.s.u.PeerSync PeerSync: core=outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n1 url=http://127.0.0.1:65519/solr couldn't connect to http://127.0.0.1:49158/solr/outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n5/, counting as success
 [junit4] 2> => org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://127.0.0.1:49158/solr/outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n5
{noformat}

Note that while the leader is trying to PeerSync to the 2 replicas, the SocketProxy for 65529 (the _other_ replica) records the connection it got, but the proxy for 49158 doesn't log anything. *Perhaps because the proxy for 49158 hasn't successfully restarted yet?*

_(Seems unlikely given that it's had over 10 full seconds to do so since the proxies were re-opened ... but then again, now that I look at the logic in TestCloudConsistency that "waits" to see if a replica erroneously becomes the leader (even though its ShardTerm is too low), I realize it's a spin loop w/o any sleep or waiting ... it could be causing thread starvation?)_

{code:java}
TimeOut timeOut = new TimeOut(10, TimeUnit.SECONDS, TimeSource.NANO_TIME);
while (!timeOut.hasTimedOut()) {
  Replica newLeader = getCollectionState(collection).getLeader("shard1");
  if (newLeader != null && !newLeader.getName().equals(leader.getName()) && newLeader.getState() == Replica.State.ACTIVE) {
    fail("Out of sync replica became leader " + newLeader);
  }
}
{code}
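As an aside, here is a minimal sketch of how that wait loop could yield between polls instead of busy-spinning. This is only an illustration of the idea and is not necessarily what {{SOLR-14159_waitFor_testfix.patch}} actually does; the 250ms interval is an arbitrary choice, and it assumes the enclosing test method declares {{throws Exception}} so the checked exception from {{Thread.sleep}} can propagate:

{code:java}
// Sketch only: reuses the names/helpers from the loop quoted above (TimeOut,
// TimeSource, getCollectionState, fail, etc.) -- the only change is sleeping
// between polls so the polling thread doesn't monopolize a CPU while the
// cluster (and the externally-managed proxies) are coming back.
TimeOut timeOut = new TimeOut(10, TimeUnit.SECONDS, TimeSource.NANO_TIME);
while (!timeOut.hasTimedOut()) {
  Replica newLeader = getCollectionState(collection).getLeader("shard1");
  if (newLeader != null && !newLeader.getName().equals(leader.getName())
      && newLeader.getState() == Replica.State.ACTIVE) {
    fail("Out of sync replica became leader " + newLeader);
  }
  Thread.sleep(250); // hypothetical back-off; any small pause avoids the hot spin
}
{code}

Whether the right fix is a plain sleep, a wait on the collection state, or something else, the point is just that each iteration should give up the CPU.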
{quote}The request that's failing is actually going to the host, not using the proxy (I think).
{quote}
No, it's definitely going to the proxy – what might be confusing you is that the test class spins up the SocketProxy instances externally from the JettySolrRunners (because MiniSolrCloudCluster doesn't provide an easy way to do that), so things like {{JettySolrRunner.getProxy()}} return null – but port 49158 is still a proxy port (and jetty knows it ... note when/where {{JettySolrRunner.setProxyPort(int)}} is called in TestCloudConsistency); see the sketch below.
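To make that concrete, the wiring looks roughly like the following. This is paraphrased from memory rather than copied from TestCloudConsistency, so treat the exact SocketProxy calls, the restart step, and the {{proxies}}/{{jettys}} bookkeeping maps as assumptions; the point is only that the proxies are created outside of MiniSolrCloudCluster and each jetty is merely told its proxy port via {{setProxyPort}}, which is why {{JettySolrRunner.getProxy()}} returns null even though clients end up talking to the proxy port:

{code:java}
// Rough sketch (not verbatim from the test): external SocketProxy wiring per node.
Map<URL, SocketProxy> proxies = new HashMap<>();   // kept so the test can close()/reopen() proxies later
Map<URL, JettySolrRunner> jettys = new HashMap<>();
for (JettySolrRunner jetty : cluster.getJettySolrRunners()) {
  SocketProxy proxy = new SocketProxy();           // listens on a free local port
  jetty.setProxyPort(proxy.getListenPort());       // jetty advertises the proxy's port, not its own
  cluster.stopJettySolrRunner(jetty);              // restart so the node registers the proxied address
  cluster.startJettySolrRunner(jetty);
  proxy.open(jetty.getBaseUrl().toURI());          // forward accepted connections to the real jetty port
  proxies.put(proxy.getUrl(), proxy);
  jettys.put(proxy.getUrl(), jetty);
}
{code}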
{quote}NOTE: it has two other changes.
One for an NPE in: SolrClientNodeStateProvider that I ran into occasionally, I think a race condition when shutting down hit it. And one cleanup in MiniSolrCloudCluster where we pass a timeout value to waitForAllNodes then ignore it and use a hard-coded 30 seconds.
{quote}
These both seem like very legit bugs, unrelated to this specific test, that should be tracked & fixed in their own jiras ... have you filed new jiras for these?

----

I'm attaching 2 independent patches:
* {{SOLR-14159_proxy_logging.patch}}
* {{SOLR-14159_waitFor_testfix.patch}}

[~erickerickson]: I would appreciate it if you could try the following and report back with logs:
# Try beasting with {{SOLR-14159_proxy_logging.patch}} applied and see if you can trigger a "Connection refused" failure
** the added logging will be very noisy, and it's possible it will shift the thread timing so the error no longer happens, or some other error happens
** if you can reproduce with this patch applied, please capture & upload the logs before proceeding
** if you can't reproduce with this patch applied, go ahead and revert it
# Then apply {{SOLR-14159_waitFor_testfix.patch}} and see if that can still trigger a "Connection refused" failure
** If so, please report back with those logs as well (ie: if it still fails, then ideally it would be good to have the logs with both patches applied at the same time)

> Fix errors in TestCloudConsistency
> ----------------------------------
>
>                 Key: SOLR-14159
>                 URL: https://issues.apache.org/jira/browse/SOLR-14159
>             Project: Solr
>          Issue Type: Bug
>   Security Level: Public (Default Security Level. Issues are Public)
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>          Priority: Major
>       Attachments: FailWaitingForCollection_WithHossFix, FailWithHossFix, SOLR-14159_debug.patch, SOLR-14159_proxy_logging.patch, SOLR-14159_waitFor_testfix.patch, WithHossFix.patch, stdout
>
>
> Moving over here from SOLR-13486 as per Hoss.