Re: [CDCR]Unable to locate core

Natarajan, Rajeswari Sun, 19 May 2019 14:49:10 -0700

Here is my closer analysis:

A SolrClient request goes to the "request" method (below) in the class
LBHttpSolrClient.java. It has a for loop that tries the different live servers,
but when the doRequest method (called from request) throws an exception there
is no catch, so the next server is never tried. To solve this, there should be
a catch around doRequest, so that the loop moves on and retries the request
against the correct server. With multiple live servers, though, the request
might also time out. This needs to be fixed to make CDCR bootstrap work
reliably; otherwise it will sometimes work and sometimes not. I can work on
this patch if this is agreed.


public Rsp request(Req req) throws SolrServerException, IOException {
    Rsp rsp = new Rsp();
    Exception ex = null;
    boolean isNonRetryable = req.request instanceof IsUpdateRequest
        || ADMIN_PATHS.contains(req.request.getPath());
    List<ServerWrapper> skipped = null;

    final Integer numServersToTry = req.getNumServersToTry();
    int numServersTried = 0;

    boolean timeAllowedExceeded = false;
    long timeAllowedNano = getTimeAllowedInNanos(req.getRequest());
    long timeOutTime = System.nanoTime() + timeAllowedNano;
    for (String serverStr : req.getServers()) {
      if (timeAllowedExceeded = isTimeExceeded(timeAllowedNano, timeOutTime)) {
        break;
      }
      
      serverStr = normalize(serverStr);
      // if the server is currently a zombie, just skip to the next one
      ServerWrapper wrapper = zombieServers.get(serverStr);
      if (wrapper != null) {
        // System.out.println("ZOMBIE SERVER QUERIED: " + serverStr);
        final int numDeadServersToTry = req.getNumDeadServersToTry();
        if (numDeadServersToTry > 0) {
          if (skipped == null) {
            skipped = new ArrayList<>(numDeadServersToTry);
            skipped.add(wrapper);
          }
          else if (skipped.size() < numDeadServersToTry) {
            skipped.add(wrapper);
          }
        }
        continue;
      }
      try {
        MDC.put("LBHttpSolrClient.url", serverStr);

        if (numServersToTry != null && numServersTried > numServersToTry.intValue()) {
          break;
        } 

        HttpSolrClient client = makeSolrClient(serverStr);

        ++numServersTried;
        ex = doRequest(client, req, rsp, isNonRetryable, false, null);
        if (ex == null) {
          return rsp; // SUCCESS
        }
        // NO CATCH HERE, SO IT FAILS
      } finally {
        MDC.remove("LBHttpSolrClient.url");
      }
    }

    // try the servers we previously skipped
    if (skipped != null) {
      for (ServerWrapper wrapper : skipped) {
        if (timeAllowedExceeded = isTimeExceeded(timeAllowedNano, timeOutTime)) {
          break;
        }

        if (numServersToTry != null && numServersTried > numServersToTry.intValue()) {
          break;
        }

        try {
          MDC.put("LBHttpSolrClient.url", wrapper.client.getBaseURL());
          ++numServersTried;
          ex = doRequest(wrapper.client, req, rsp, isNonRetryable, true, wrapper.getKey());
          if (ex == null) {
            return rsp; // SUCCESS
          }
        } finally {
          MDC.remove("LBHttpSolrClient.url");
        }
      }
    }


    final String solrServerExceptionMessage;
    if (timeAllowedExceeded) {
      solrServerExceptionMessage = "Time allowed to handle this request exceeded";
    } else {
      if (numServersToTry != null && numServersTried > numServersToTry.intValue()) {
        solrServerExceptionMessage = "No live SolrServers available to handle this request:"
            + " numServersTried="+numServersTried
            + " numServersToTry="+numServersToTry.intValue();
      } else {
        solrServerExceptionMessage = "No live SolrServers available to handle this request";
      }
    }
    if (ex == null) {
      throw new SolrServerException(solrServerExceptionMessage);
    } else {
      throw new SolrServerException(solrServerExceptionMessage+":" + zombieServers.keySet(), ex);
    }

  }
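
The catch I am proposing could look roughly like the sketch below. This is a
self-contained illustration and not the actual LBHttpSolrClient code: the class
RetryAcrossServers, the method requestWithFailover and the node URLs are
hypothetical names I made up, and the real patch would still have to decide
whether retrying is acceptable for requests that are flagged non-retryable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class RetryAcrossServers {

    /**
     * Tries each server in order and returns the first successful response.
     * An exception from one server is remembered and the loop moves on to the
     * next server instead of propagating immediately.
     */
    public static <R> R requestWithFailover(List<String> servers,
                                            Function<String, R> send) {
        RuntimeException last = null;
        for (String server : servers) {
            try {
                return send.apply(server); // SUCCESS
            } catch (RuntimeException e) {
                last = e; // remember the failure and try the next server
            }
        }
        // Every server failed: report the last failure as the cause.
        throw new IllegalStateException(
            "No live servers available to handle this request: " + servers, last);
    }

    public static void main(String[] args) {
        List<String> tried = new ArrayList<>();
        // Hypothetical nodes: node1 does not host the core, node2 does.
        String result = requestWithFailover(
            Arrays.asList("http://node1:8983/solr", "http://node2:8983/solr"),
            url -> {
                tried.add(url);
                if (url.contains("node1")) {
                    throw new IllegalStateException("Unable to locate core");
                }
                return "ok from " + url;
            });
        System.out.println(result + " after trying " + tried);
    }
}
```

Here the failure on the first node is swallowed and the second node answers,
which is the behaviour the current loop does not give us for this request.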


Thanks,
Rajeswari


On 5/19/19, 9:39 AM, "Natarajan, Rajeswari" <[email protected]> 
wrote:

    Hi
    
    We are using Solr 7.6, trying out bidirectional CDCR, and I also hit this
    issue.
    
    Stacktrace
    
    INFO  (cdcr-bootstrap-status-17-thread-1) [   ] o.a.s.h.CdcrReplicatorManager CDCR bootstrap successful in 3 seconds
    INFO  (cdcr-bootstrap-status-17-thread-1) [   ] o.a.s.h.CdcrReplicatorManager Create new update log reader for target abcd_ta with checkpoint -1 @ abcd_ta:shard1
    ERROR (cdcr-bootstrap-status-17-thread-1) [   ] o.a.s.h.CdcrReplicatorManager Unable to bootstrap the target collection abcd_ta shard: shard1
    olrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://10.169.50.182:8983/solr: Unable to locate core kanna_ta_shard1_replica_n1
    lr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:643) ~[solr-solrj-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:47:53]
    lr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255) ~[solr-solrj-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:47:53]
    lr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244) ~[solr-solrj-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:47:53]
    lr.client.solrj.impl.LBHttpSolrClient.doRequest(LBHttpSolrClient.java:483) ~[solr-solrj-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:47:53]
    lr.client.solrj.impl.LBHttpSolrClient.request(LBHttpSolrClient.java:413) ~[solr-solrj-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:47:53]
    lr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1107) ~[solr-solrj-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:47:53]
    lr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:884) ~[solr-solrj-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:47:53]
    
    
    I stepped through the code
    
    private NamedList sendRequestRecoveryToFollower(SolrClient client, String coreName)
        throws SolrServerException, IOException {
      CoreAdminRequest.RequestRecovery recoverRequestCmd = new CoreAdminRequest.RequestRecovery();
      recoverRequestCmd.setAction(CoreAdminParams.CoreAdminAction.REQUESTRECOVERY);
      recoverRequestCmd.setCoreName(coreName);
      return client.request(recoverRequestCmd);
    }
    
    In the above method, the recovery request is an admin command and is
    specific to a core. The SolrClient.request logic gets the live servers and
    executes the command in a loop, but since this is an admin command it is
    non-retryable. Depending on which live server the code picks and where the
    core lives, the recovery request might succeed or fail. So I think the
    problem is that this code tries to send the core command to any available
    live server; I guess it should instead find the server on which the core
    lives and send the request there.
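
To illustrate what I mean by finding the correct server, here is a minimal
sketch. It does not use the Solr classes: ReplicaInfo is a hypothetical
stand-in for the replica metadata in the cluster state, and the core names and
URLs are made up. The idea is only that the core name is resolved to one node
before the request is sent, instead of trying live servers in turn.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class CoreLocator {

    /** Hypothetical stand-in for the replica metadata held in cluster state. */
    public static final class ReplicaInfo {
        final String coreName;
        final String baseUrl;

        public ReplicaInfo(String coreName, String baseUrl) {
            this.coreName = coreName;
            this.baseUrl = baseUrl;
        }
    }

    /** Returns the base URL of the node hosting the named core, if any. */
    public static Optional<String> baseUrlForCore(List<ReplicaInfo> replicas,
                                                  String coreName) {
        return replicas.stream()
            .filter(r -> r.coreName.equals(coreName))
            .map(r -> r.baseUrl)
            .findFirst();
    }

    public static void main(String[] args) {
        // Made-up layout: the leader core is on node1, the follower on node2.
        List<ReplicaInfo> replicas = Arrays.asList(
            new ReplicaInfo("example_shard1_replica_n1", "http://node1:8983/solr"),
            new ReplicaInfo("example_shard1_replica_n2", "http://node2:8983/solr"));

        // The recovery request for the follower core should go to node2 only.
        Optional<String> target = baseUrlForCore(replicas, "example_shard1_replica_n2");
        System.out.println(target.orElse("core not found"));
    }
}
```

With a lookup like this, a core that is not hosted anywhere yields an empty
result, which the caller can report directly rather than getting "Unable to
locate core" back from whichever node happened to be tried.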
    
    Regards,
    Rajeswari
    
    On 5/15/19, 10:59 AM, "Natarajan, Rajeswari" <[email protected]> 
wrote:
    
        I am also facing this issue. Has any resolution been found? Please
        update. Thanks
        
        On 2/7/19, 10:42 AM, "Tim" <[email protected]> wrote:
        
            So it looks like I'm having an issue with this fix:
            https://issues.apache.org/jira/browse/SOLR-11724
            
            So I've messed around with this for a while and every time the
            leader to leader replica portion works fine. But the Recovery
            portion (implemented as part of the fix above) fails.

            I've run a few tests and every time the recovery portion kicks off,
            it sends the recovery command to the node which has the leader for
            a given replica instead of the follower.
            I've recreated the collection several times so that replicas are on
            different nodes with the same results each time. It seems to be
            assumed that the follower is on the same solr node as the leader.

            For example, if s3r10 (shard 3, replica 10) is the leader and is on
            node1, while the follower s3r8 is on node2, then the core recovery
            command meant for s3r8 is being sent to node1 instead of node2.
            
            
            
            
            
            --
            Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
