[ https://issues.apache.org/jira/browse/SOLR-14159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris M. Hostetter updated SOLR-14159:
--------------------------------------
    Attachment: SOLR-14159_waitFor_testfix.patch
                SOLR-14159_proxy_logging.patch
        Status: Open  (was: Open)

So far I've been unable to reproduce either of the failures Erick mentioned in 
my local beasting (close to a thousand iterations).

While beasting, I've been looking over the "FailWithHossFix" log, which shows 
the "Connection Refused" problem. (I have not looked into the other log 
file/problem.)

Nothing in the log messages gives me an immediate "Ah-HA!" moment, but I do 
want to reply directly to a few comments Erick has made, which have led me to a 
"best guess" as to what the problem may be (below).
----
Note first that the specific failure condition in this log is...
{noformat}
   [junit4]   2> NOTE: reproduce with: ant test  
-Dtestcase=TestCloudConsistency 
-Dtests.method=testOutOfSyncReplicasCannotBecomeLeader 
-Dtests.seed=66748AAA9CB5BE46 -Dtests.slow=true -Dtests.badapples=true 
-Dtests.locale=nyn-UG -Dtests.timezone=Pacific/Apia -Dtests.asserts=true 
-Dtests.file.encoding=US-ASCII
   [junit4] ERROR   50.2s | 
TestCloudConsistency.testOutOfSyncReplicasCannotBecomeLeader <<<
   [junit4]    > Throwable #1: 
org.apache.solr.client.solrj.SolrServerException: Server refused connection at: 
http://127.0.0.1:49158/solr/outOfSyncReplicasCannotBecomeLeader-false
...
{noformat}
... So the problem relates to port #49158 (which hosts a single replica of the 
single-shard collection) ...
{quote}Recovery was successful.
{quote}
That does not appear to be true? ... when the leader is shut down (after doc4 
gets added) there is an attempt at an election in which both (live) replicas 
(including 49158) realize they are not eligible to be the leader (due to a stale 
ShardTerm) and need to wait for the leader to come back and then recover...
{noformat}
   [junit4]   2> 42358 INFO  (zkCallback-125-thread-1) [n:127.0.0.1:49158_solr 
c:outOfSyncReplicasCannotBecomeLeader-false s:shard1 r:core_node6 
x:outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n5 ] 
o.a.s.c.ShardLeaderElectionContext Can't become leader, other replicas with 
higher term participated in leader election
   [junit4]   2> 42358 INFO  (zkCallback-125-thread-1) [n:127.0.0.1:49158_solr 
c:outOfSyncReplicasCannotBecomeLeader-false s:shard1 r:core_node6 
x:outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n5 ] 
o.a.s.c.ShardLeaderElectionContext There may be a better leader candidate than 
us - going back into recovery
{noformat}
...but AFAICT, even though '127.0.0.1:49158_solr' has a recovery thread 
(recoveryExecutor-122-thread-1-processing-n:127.0.0.1:49158_solr) that is 
running all the way to the end of the test, that replica never seems to log 
that it has recovered.

When the leader replica does come back online, AFAICT it successfully PeerSyncs 
to the _other_ replica, but it logs that it can *NOT* PeerSync to 49158; it 
encounters a "Connection Refused" (just like the test thread does later)...
{noformat}
   [junit4]   2> 44867 INFO  (zkCallback-168-thread-2) [n:127.0.0.1:65519_solr 
c:outOfSyncReplicasCannotBecomeLeader-false s:shard1 r:core_node2 
x:outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n1 ] 
o.a.s.c.SyncStrategy Sync replicas to 
http://127.0.0.1:65519/solr/outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n1/
   [junit4]   2> 44877 INFO  (zkCallback-168-thread-2) [n:127.0.0.1:65519_solr 
c:outOfSyncReplicasCannotBecomeLeader-false s:shard1 r:core_node2 
x:outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n1 ] 
o.a.s.u.PeerSync PeerSync: 
core=outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n1 
url=http://127.0.0.1:65519/solr START 
replicas=[http://127.0.0.1:65529/solr/outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n3/,
 
http://127.0.0.1:49158/solr/outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n5/]
 nUpdates=100
   [junit4]   2> 44880 INFO  (SocketProxy-Acceptor-65529) [     ] 
o.a.s.c.s.c.SocketProxy accepted 
Socket[addr=/127.0.0.1,port=49422,localport=65529], receiveBufferSize:408300
   [junit4]   2> 44890 INFO  (SocketProxy-Acceptor-65529) [     ] 
o.a.s.c.s.c.SocketProxy proxy connection 
Socket[addr=/127.0.0.1,port=65534,localport=49424], receiveBufferSize=408300
   [junit4]   2> 44892 WARN  (zkCallback-168-thread-2) [n:127.0.0.1:65519_solr 
c:outOfSyncReplicasCannotBecomeLeader-false s:shard1 r:core_node2 
x:outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n1 ] 
o.a.s.u.PeerSync PeerSync: 
core=outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n1 
url=http://127.0.0.1:65519/solr  couldn't connect to 
http://127.0.0.1:49158/solr/outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n5/,
 counting as success
   [junit4]   2>           => org.apache.solr.client.solrj.SolrServerException: 
Server refused connection at: 
http://127.0.0.1:49158/solr/outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n5
{noformat}
Note that while the leader is trying to PeerSync to the 2 replicas, the 
SocketProxy for 65529 (the _other_ replica) records the connection it got, but 
the proxy for 49158 doesn't log anything.

*Perhaps because the proxy for 49158 hasn't successfully restarted yet?*

_(Seems unlikely given that it's had over 10 full seconds to do so since the 
proxies were re-opened ... but then again, now that I look at the logic in 
TestCloudConsistency that "waits" to see if a replica erroneously becomes the 
leader (even though its ShardTerm is too low), I realize it's a spin loop w/o 
any sleep or waiting ... it could be causing thread starvation?)_
{code:java}
    TimeOut timeOut = new TimeOut(10, TimeUnit.SECONDS, TimeSource.NANO_TIME);
    while (!timeOut.hasTimedOut()) {
      Replica newLeader = getCollectionState(collection).getLeader("shard1");
      if (newLeader != null && !newLeader.getName().equals(leader.getName())
          && newLeader.getState() == Replica.State.ACTIVE) {
        fail("Out of sync replica became leader " + newLeader);
      }
    }
{code}
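Roughly speaking, the kind of change I'm talking about looks like this (a 
simplified, hand-written sketch of the general shape; see the actual 
{{SOLR-14159_waitFor_testfix.patch}} for the real change)...
{code:java}
    // Hypothetical variant of the wait loop above (assumes the enclosing test
    // method declares "throws Exception"): same leader check, but back off
    // between polls instead of spinning, so jetty/proxy startup threads can
    // actually get CPU time during the 10 second window.
    TimeOut timeOut = new TimeOut(10, TimeUnit.SECONDS, TimeSource.NANO_TIME);
    while (!timeOut.hasTimedOut()) {
      Replica newLeader = getCollectionState(collection).getLeader("shard1");
      if (newLeader != null && !newLeader.getName().equals(leader.getName())
          && newLeader.getState() == Replica.State.ACTIVE) {
        fail("Out of sync replica became leader " + newLeader);
      }
      Thread.sleep(250); // poll a few times a second instead of busy-waiting
    }
{code}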
{quote}The request that's failing is actually going to the host, not using the 
proxy (I think).
{quote}
No, it's definitely going to the proxy. What might be confusing you is that 
the test class spins up the SocketProxy instances externally from the 
JettySolrRunners (because MiniSolrCloudCluster doesn't provide an easy way to do 
that), so things like {{JettySolrRunner.getProxy()}} return null, but port 
49158 is still a proxy port (and jetty knows it ... note when/where 
{{JettySolrRunner.setProxyPort(int)}} is called in TestCloudConsistency).
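For anyone following along, the proxy wiring in the test's setup looks roughly 
like this (paraphrasing from memory, and assuming the usual SocketProxy helpers 
like {{getListenPort()}} / {{open(URI)}} and a setup method that declares 
{{throws Exception}}; the actual code in TestCloudConsistency is the authority)...
{code:java}
    // Rough paraphrase of how the test fronts each jetty with a proxy (not a
    // verbatim copy): one SocketProxy per node, with the jetty told to report
    // the proxy's listen port as its own via setProxyPort().
    Map<JettySolrRunner, SocketProxy> proxies = new HashMap<>();
    for (JettySolrRunner jetty : cluster.getJettySolrRunners()) {
      SocketProxy proxy = new SocketProxy();
      jetty.setProxyPort(proxy.getListenPort()); // so ports like 49158 are proxy ports, and jetty knows it
      cluster.stopJettySolrRunner(jetty);        // restart so the proxied port takes effect
      cluster.startJettySolrRunner(jetty);
      proxy.open(jetty.getBaseUrl().toURI());    // forward proxy port -> real jetty port
      proxies.put(jetty, proxy);                 // tracked in the test class itself, which is
                                                 // why JettySolrRunner.getProxy() returns null
    }
{code}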
{quote}NOTE: it has two other changes. One for an NPE in: 
SolrClientNodeStateProvider that I ran into occasionally, I think a race 
condition when shutting down hit it. And one cleanup in MiniSolrCloudCluster 
where we pass a timeout value to waitForAllNodes then ignore it and use a 
hard-coded 30 seconds.
{quote}
These both seem like very legit bugs unrelated to this specific test that 
should be tracked & fixed in their own jiras ... have you filed new jiras for 
these?
----
I'm attaching 2 independent patches:
 * {{SOLR-14159_proxy_logging.patch}}
 * {{SOLR-14159_waitFor_testfix.patch}}

[~erickerickson]: I would appreciate it if you could try the following and 
report back with logs
 # Try beasting with {{SOLR-14159_proxy_logging.patch}} applied and see if you 
can trigger a "Connection refused" failure (a sample beast invocation is 
sketched after this list)
 ** the added logging will be very noisy, and it's possible it will shift the 
thread timing so the error no longer happens, or some other error happens
 ** if you can reproduce with this patch applied, please capture & upload the 
logs before proceeding
 ** if you can't reproduce with this patch applied, go ahead and revert it
 # Then apply {{SOLR-14159_waitFor_testfix.patch}} and see if that can still 
trigger a "Connection refused" failure
 ** If so, please report back with those logs as well (i.e. if it still fails, 
then ideally it would be good to have the logs with both patches applied at the 
same time)
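
(For anyone who wants to try the beasting themselves, it's roughly something 
like the following, run from {{solr/core}}, with {{beast.iters}} adjusted to 
taste and no seed fixed so each iteration gets a fresh one.)
{noformat}
ant beast -Dbeast.iters=100 -Dtestcase=TestCloudConsistency -Dtests.method=testOutOfSyncReplicasCannotBecomeLeader
{noformat}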

> Fix errors in TestCloudConsistency
> ----------------------------------
>
>                 Key: SOLR-14159
>                 URL: https://issues.apache.org/jira/browse/SOLR-14159
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Major
>         Attachments: FailWaitingForCollection_WithHossFix, FailWithHossFix, 
> SOLR-14159_debug.patch, SOLR-14159_proxy_logging.patch, 
> SOLR-14159_waitFor_testfix.patch, WithHossFix.patch, stdout
>
>
> Moving over here from SOLR-13486 as per Hoss.



