[ https://issues.apache.org/jira/browse/GEODE-7861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17250659#comment-17250659 ]
ASF GitHub Bot commented on GEODE-7861:
---------------------------------------

echobravopapa commented on a change in pull request #5839:
URL: https://github.com/apache/geode/pull/5839#discussion_r544649359

##########
File path: geode-membership/src/main/java/org/apache/geode/distributed/internal/membership/gms/membership/GMSJoinLeave.java
##########
@@ -428,6 +432,24 @@ public boolean join() throws MemberStartupException {
     }
   }
+  boolean hasCoordinatorJoinedCluster(int viewId, long retrySleep)

Review comment:
looks like this extraction was test driven

##########
File path: geode-membership/src/main/java/org/apache/geode/distributed/internal/membership/gms/membership/GMSJoinLeave.java
##########
@@ -383,40 +383,44 @@ public boolean join() throws MemberStartupException {
           break;
         }
       }
-      try {
-        if (found && !state.hasContactedAJoinedLocator) {
-          // if locators are restarting they may be handing out IDs from a stale view that
-          // we should go through quickly. Otherwise we should sleep a bit to let failure
-          // detection select a new coordinator
-          if (state.possibleCoordinator.getVmViewId() < 0) {
-            logger.debug("sleeping for {} before making another attempt to find the coordinator",
-                retrySleep);
-            Thread.sleep(retrySleep);
-          } else {
+      if (found && !state.hasContactedAJoinedLocator) {
+        try {
+          if (hasCoordinatorJoinedCluster(state.possibleCoordinator.getVmViewId(), retrySleep)) {
             // since we were given a coordinator that couldn't be used we should keep trying
             tries = 0;
             giveupTime = System.currentTimeMillis() + timeout;
           }
+        } catch (InterruptedException e) {
+          Thread.currentThread().interrupt();
+          throw new MembershipConfigurationException(
+              "Retry sleep interrupted. Giving up on joining the distributed system.");
         }
-      } catch (InterruptedException e) {
-        logger.debug("retry sleep interrupted - giving up on joining the distributed system");
-        return false;
       }
     } // for
     if (!this.isJoined) {
       logger.debug("giving up attempting to join the distributed system after "
           + (System.currentTimeMillis() - startTime) + "ms");
-    }
-    // to preserve old behavior we need to throw a MemberStartupException if
-    // unable to contact any of the locators
-    if (!this.isJoined && state.hasContactedAJoinedLocator) {
-      throw new MemberStartupException("Unable to join the distributed system in "
-          + (System.currentTimeMillis() - startTime) + "ms");
-    }
+      // to preserve old behavior we need to throw a MemberStartupException if
+      // unable to contact any of the locators
+      if (state.hasContactedAJoinedLocator) {
+        throw new MemberStartupException("Unable to join the distributed system in "
+            + (System.currentTimeMillis() - startTime) + "ms");
+      }
-    return this.isJoined;
+      if (state.locatorsContacted == 0) {
+        throw new MembershipConfigurationException(
+            "Unable to join the distributed system. Could not contact any of the locators: "

Review comment:
as I'm understanding this change, this section is the critical change to improve upon error reporting... I don't see unit testing for the improvement; there is a new test that looks for the above Exception string. Only it seems that a test could validate the expanded error information has been sent as well.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Improve error reporting when a member is unable to contact a locator
> --------------------------------------------------------------------
>
>                 Key: GEODE-7861
>                 URL: https://issues.apache.org/jira/browse/GEODE-7861
>             Project: Geode
>          Issue Type: Improvement
>          Components: membership
>            Reporter: Dan Smith
>            Assignee: Kamilla Aslami
>            Priority: Major
>              Labels: pull-request-available
>
> When a member is unable to join the system due to a failure to
> contact a locator, we could be a lot more specific, which would help users
> debug issues in their environment. We currently print out
> {noformat}
> Unable to join the distributed system. Operation either timed out, was
> stopped or Locator does not exist.
> {noformat}
> We should include the underlying exception from the last locator we tried to
> contact as a cause, so that users can see why it failed (DNS error, ???).
> Perhaps also the list of locators we tried to contact.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
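
For illustration only, here is a minimal sketch of what the issue asks for: an exception whose message lists the locators that were tried and whose cause is the last underlying failure. It is not the code from the pull request. `LocatorJoinErrorSketch`, `JoinFailureException` (a hypothetical stand-in for Geode's `MembershipConfigurationException`), `joinFailure`, and the locator strings are all assumptions made for this example; only the message prefix is taken from the diff quoted above.

```java
import java.net.UnknownHostException;
import java.util.List;

public class LocatorJoinErrorSketch {

  // Hypothetical stand-in for Geode's MembershipConfigurationException,
  // defined here so the sketch compiles without Geode on the classpath.
  public static class JoinFailureException extends RuntimeException {
    public JoinFailureException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  // Builds an exception that names every locator that was tried and keeps
  // the last underlying failure as the cause (hypothetical helper).
  public static JoinFailureException joinFailure(List<String> locators, Throwable lastFailure) {
    String message = "Unable to join the distributed system. "
        + "Could not contact any of the locators: " + locators;
    return new JoinFailureException(message, lastFailure);
  }

  public static void main(String[] args) {
    // Locator names and the DNS failure below are made-up example data.
    JoinFailureException e = joinFailure(
        List.of("locator1[10334]", "locator2[10334]"),
        new UnknownHostException("locator2"));
    System.out.println(e.getMessage());
    System.out.println("Caused by: " + e.getCause());
  }
}
```

A unit test in the spirit of the review comment could then assert on both parts of the expanded error information, the locator list in the message and the type of the cause, rather than on the exception string alone.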