Here is my closer analysis:
A SolrClient request goes to the "request" method (shown below) in the class
LBHttpSolrClient.java.
There is a for loop that tries the different live servers, but when the
doRequest call inside that loop throws an exception there is no catch, so the
next server is never tried. To solve this issue, there should be a catch
around doRequest, so that the loop moves on and retries the request on the
correct server. Note that when there are multiple live servers, the request
might also time out before the right one is reached. This needs to be fixed to
make CDCR bootstrap work reliably; otherwise it will sometimes work and
sometimes not. I can work on a patch if this is agreed.
public Rsp request(Req req) throws SolrServerException, IOException {
  Rsp rsp = new Rsp();
  Exception ex = null;
  boolean isNonRetryable = req.request instanceof IsUpdateRequest ||
      ADMIN_PATHS.contains(req.request.getPath());
  List<ServerWrapper> skipped = null;
  final Integer numServersToTry = req.getNumServersToTry();
  int numServersTried = 0;
  boolean timeAllowedExceeded = false;
  long timeAllowedNano = getTimeAllowedInNanos(req.getRequest());
  long timeOutTime = System.nanoTime() + timeAllowedNano;
  for (String serverStr : req.getServers()) {
    if (timeAllowedExceeded = isTimeExceeded(timeAllowedNano, timeOutTime)) {
      break;
    }
    serverStr = normalize(serverStr);
    // if the server is currently a zombie, just skip to the next one
    ServerWrapper wrapper = zombieServers.get(serverStr);
    if (wrapper != null) {
      // System.out.println("ZOMBIE SERVER QUERIED: " + serverStr);
      final int numDeadServersToTry = req.getNumDeadServersToTry();
      if (numDeadServersToTry > 0) {
        if (skipped == null) {
          skipped = new ArrayList<>(numDeadServersToTry);
          skipped.add(wrapper);
        }
        else if (skipped.size() < numDeadServersToTry) {
          skipped.add(wrapper);
        }
      }
      continue;
    }
    try {
      MDC.put("LBHttpSolrClient.url", serverStr);
      if (numServersToTry != null && numServersTried > numServersToTry.intValue()) {
        break;
      }
      HttpSolrClient client = makeSolrClient(serverStr);
      ++numServersTried;
      ex = doRequest(client, req, rsp, isNonRetryable, false, null);
      if (ex == null) {
        return rsp; // SUCCESS
      }
      // NO CATCH HERE, SO IT FAILS
    } finally {
      MDC.remove("LBHttpSolrClient.url");
    }
  }
  // try the servers we previously skipped
  if (skipped != null) {
    for (ServerWrapper wrapper : skipped) {
      if (timeAllowedExceeded = isTimeExceeded(timeAllowedNano, timeOutTime)) {
        break;
      }
      if (numServersToTry != null && numServersTried > numServersToTry.intValue()) {
        break;
      }
      try {
        MDC.put("LBHttpSolrClient.url", wrapper.client.getBaseURL());
        ++numServersTried;
        ex = doRequest(wrapper.client, req, rsp, isNonRetryable, true, wrapper.getKey());
        if (ex == null) {
          return rsp; // SUCCESS
        }
      } finally {
        MDC.remove("LBHttpSolrClient.url");
      }
    }
  }
  final String solrServerExceptionMessage;
  if (timeAllowedExceeded) {
    solrServerExceptionMessage = "Time allowed to handle this request exceeded";
  } else {
    if (numServersToTry != null && numServersTried > numServersToTry.intValue()) {
      solrServerExceptionMessage = "No live SolrServers available to handle this request:"
          + " numServersTried="+numServersTried
          + " numServersToTry="+numServersToTry.intValue();
    } else {
      solrServerExceptionMessage = "No live SolrServers available to handle this request";
    }
  }
  if (ex == null) {
    throw new SolrServerException(solrServerExceptionMessage);
  } else {
    throw new SolrServerException(solrServerExceptionMessage+":" + zombieServers.keySet(), ex);
  }
}
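To make the proposed change concrete, here is a minimal, self-contained sketch of a "catch and move on to the next server" loop. It is not the Solr code and not a patch: FailoverSketch, requestWithFailover and AllServersFailedException are hypothetical names, and the send function stands in for doRequest. Since the method above treats admin requests as non-retryable, whether it is safe to retry them on another node would still need to be agreed separately.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical illustration only; none of these names exist in SolrJ.
class FailoverSketch {

  /** Thrown when every candidate failed or the time budget ran out. */
  static class AllServersFailedException extends RuntimeException {
    AllServersFailedException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /**
   * Tries each server in order. A failure on one server is caught and
   * remembered, and the loop continues with the next server.
   */
  static <R> R requestWithFailover(List<String> servers,
                                   Function<String, R> sendRequest,
                                   long timeAllowedNanos) {
    long deadline = System.nanoTime() + timeAllowedNanos;
    RuntimeException last = null;
    List<String> tried = new ArrayList<>();
    for (String server : servers) {
      // Overflow-safe deadline check: stop once the time budget is used up.
      if (System.nanoTime() - deadline >= 0) {
        break;
      }
      tried.add(server);
      try {
        return sendRequest.apply(server); // success: stop here
      } catch (RuntimeException e) {
        last = e; // this is the catch the loop above does not have
      }
    }
    throw new AllServersFailedException(
        "No server could handle the request; tried " + tried, last);
  }

  public static void main(String[] args) {
    // node1 does not host the core, node2 does (made-up node names).
    String result = requestWithFailover(
        Arrays.asList("node1", "node2"),
        server -> {
          if (server.equals("node1")) {
            throw new IllegalStateException("Unable to locate core on " + server);
          }
          return "recovery accepted by " + server;
        },
        TimeUnit.SECONDS.toNanos(5));
    System.out.println(result);
  }
}
```

The deadline check mirrors the time-out concern mentioned above: with many live servers, the budget can run out before the right node is reached, in which case the last failure is reported as the cause.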
Thanks,
Rajeswari
On 5/19/19, 9:39 AM, "Natarajan, Rajeswari" <[email protected]>
wrote:
Hi,
We are using Solr 7.6 and trying out bidirectional CDCR, and I also hit this
issue.
Stack trace:
INFO (cdcr-bootstrap-status-17-thread-1) [ ] o.a.s.h.CdcrReplicatorManager CDCR bootstrap successful in 3 seconds
INFO (cdcr-bootstrap-status-17-thread-1) [ ] o.a.s.h.CdcrReplicatorManager Create new update log reader for target abcd_ta with checkpoint -1 @ abcd_ta:shard1
ERROR (cdcr-bootstrap-status-17-thread-1) [ ] o.a.s.h.CdcrReplicatorManager Unable to bootstrap the target collection abcd_ta shard: shard1
olrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://10.169.50.182:8983/solr: Unable to locate core kanna_ta_shard1_replica_n1
lr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:643) ~[solr-solrj-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:47:53]
lr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255) ~[solr-solrj-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:47:53]
lr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244) ~[solr-solrj-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:47:53]
lr.client.solrj.impl.LBHttpSolrClient.doRequest(LBHttpSolrClient.java:483) ~[solr-solrj-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:47:53]
lr.client.solrj.impl.LBHttpSolrClient.request(LBHttpSolrClient.java:413) ~[solr-solrj-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:47:53]
lr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1107) ~[solr-solrj-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:47:53]
lr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:884) ~[solr-solrj-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:47:53]
I stepped through the code:
private NamedList sendRequestRecoveryToFollower(SolrClient client, String coreName)
    throws SolrServerException, IOException {
  CoreAdminRequest.RequestRecovery recoverRequestCmd = new CoreAdminRequest.RequestRecovery();
  recoverRequestCmd.setAction(CoreAdminParams.CoreAdminAction.REQUESTRECOVERY);
  recoverRequestCmd.setCoreName(coreName);
  return client.request(recoverRequestCmd);
}
In the above method, the recovery request is an admin command and it is
specific to a core. The SolrClient.request logic gets the live servers and
executes the command in a loop, but since this is an admin command it is
non-retryable. Depending on which live server the code picks and where the
core actually lives, the recovery request may succeed or fail. So I think the
problem is that this code sends the core command to whichever live server is
available; I guess the code should instead find the server on which the core
lives and send the request there.
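As a rough illustration of "find the server that hosts the core first", here is a small self-contained sketch. It does not use the Solr API: ReplicaInfo and nodeForCore are hypothetical stand-ins for whatever the cluster state provides, and the core names and node URLs are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration only; not part of Solr or SolrJ.
class CoreLocatorSketch {

  /** Stand-in for one replica entry as it might appear in the cluster state. */
  static final class ReplicaInfo {
    final String coreName;
    final String nodeBaseUrl;
    final boolean leader;

    ReplicaInfo(String coreName, String nodeBaseUrl, boolean leader) {
      this.coreName = coreName;
      this.nodeBaseUrl = nodeBaseUrl;
      this.leader = leader;
    }
  }

  /** Returns the base URL of the node hosting the given core, if any. */
  static Optional<String> nodeForCore(List<ReplicaInfo> replicas, String coreName) {
    return replicas.stream()
        .filter(r -> r.coreName.equals(coreName))
        .map(r -> r.nodeBaseUrl)
        .findFirst();
  }

  public static void main(String[] args) {
    // A leader on one node and its follower on another.
    List<ReplicaInfo> shard = Arrays.asList(
        new ReplicaInfo("leader_core", "http://node1:8983/solr", true),
        new ReplicaInfo("follower_core", "http://node2:8983/solr", false));

    // The recovery command for the follower should go to the follower's node,
    // not to whichever live server happens to come first.
    String target = nodeForCore(shard, "follower_core")
        .orElseThrow(() -> new IllegalStateException("no node hosts follower_core"));
    System.out.println("send REQUESTRECOVERY for follower_core to " + target);
  }
}
```

With the target resolved up front, the core admin request could be sent to that one node directly rather than through the load-balanced loop.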
Regards,
Rajeswari
On 5/15/19, 10:59 AM, "Natarajan, Rajeswari" <[email protected]>
wrote:
I am also facing this issue. If anyone has found a resolution, please post an
update. Thanks.
On 2/7/19, 10:42 AM, "Tim" <[email protected]> wrote:
So it looks like I'm having an issue with this fix:
https://issues.apache.org/jira/browse/SOLR-11724
So I've messed around with this for a while and every time the leader to
leader replica portion works fine. But the Recovery portion (implemented as
part of the fix above) fails.
I've run a few tests and every time the recovery portion kicks off, it sends
the recovery command to the node which has the leader for a given replica
instead of the follower.
I've recreated the collection several times so that replicas are on different
nodes with the same results each time. It seems to be assumed that the
follower is on the same solr node as the leader.
For example, if s3r10 (shard 3, replica 10) is the leader and is on node1,
while the follower s3r8 is on node2, then the core recovery command meant for
s3r8 is being sent to node1 instead of node2.