[jira] [Commented] (GEODE-7861) Improve error reporting when a member is unable to contact a locator

ASF GitHub Bot (Jira) Wed, 16 Dec 2020 13:54:05 -0800


    [ 
https://issues.apache.org/jira/browse/GEODE-7861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17250659#comment-17250659
 ]


ASF GitHub Bot commented on GEODE-7861:
---------------------------------------

echobravopapa commented on a change in pull request #5839:
URL: https://github.com/apache/geode/pull/5839#discussion_r544649359



##########
File path: geode-membership/src/main/java/org/apache/geode/distributed/internal/membership/gms/membership/GMSJoinLeave.java
##########
@@ -428,6 +432,24 @@ public boolean join() throws MemberStartupException {
     }
   }
 
+  boolean hasCoordinatorJoinedCluster(int viewId, long retrySleep)

Review comment:
       looks like this extraction was test driven
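
The body of the extracted method is not shown in this hunk. A minimal sketch of what it might look like, inferred only from the lines removed from join() in the other hunk (sleep when the view id is negative, otherwise let the caller reset its retry counters), follows. The class name and the reading of a negative view id are assumptions, not the PR's actual code.

```java
// Hypothetical sketch; not the implementation from pull request #5839.
class CoordinatorCheckSketch {

  /**
   * Returns true when the candidate coordinator already has a non-negative view id,
   * which is read here as "it has joined the cluster" (an assumption).
   * Otherwise sleeps for retrySleep milliseconds and returns false.
   */
  static boolean hasCoordinatorJoinedCluster(int viewId, long retrySleep)
      throws InterruptedException {
    if (viewId < 0) {
      // Pause before the caller makes another attempt to find the coordinator.
      Thread.sleep(retrySleep);
      return false;
    }
    return true;
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println(hasCoordinatorJoinedCluster(-1, 10L));
    System.out.println(hasCoordinatorJoinedCluster(3, 10L));
  }
}
```

Taking the view id and sleep time as plain parameters is what lets a unit test call the method without building the surrounding join state.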

##########
File path: geode-membership/src/main/java/org/apache/geode/distributed/internal/membership/gms/membership/GMSJoinLeave.java
##########
@@ -383,40 +383,44 @@ public boolean join() throws MemberStartupException {
             break;
           }
         }
-        try {
-          if (found && !state.hasContactedAJoinedLocator) {
-            // if locators are restarting they may be handing out IDs from a stale view that
-            // we should go through quickly. Otherwise we should sleep a bit to let failure
-            // detection select a new coordinator
-            if (state.possibleCoordinator.getVmViewId() < 0) {
-              logger.debug("sleeping for {} before making another attempt to find the coordinator",
-                  retrySleep);
-              Thread.sleep(retrySleep);
-            } else {
+        if (found && !state.hasContactedAJoinedLocator) {
+          try {
+            if (hasCoordinatorJoinedCluster(state.possibleCoordinator.getVmViewId(), retrySleep)) {
               // since we were given a coordinator that couldn't be used we should keep trying
               tries = 0;
               giveupTime = System.currentTimeMillis() + timeout;
             }
+          } catch (InterruptedException e) {
+            Thread.currentThread().interrupt();
+            throw new MembershipConfigurationException(
+                "Retry sleep interrupted. Giving up on joining the distributed system.");
           }
-        } catch (InterruptedException e) {
-          logger.debug("retry sleep interrupted - giving up on joining the distributed system");
-          return false;
         }
       } // for
 
       if (!this.isJoined) {
         logger.debug("giving up attempting to join the distributed system after "
             + (System.currentTimeMillis() - startTime) + "ms");
-      }
 
-      // to preserve old behavior we need to throw a MemberStartupException if
-      // unable to contact any of the locators
-      if (!this.isJoined && state.hasContactedAJoinedLocator) {
-        throw new MemberStartupException("Unable to join the distributed system in "
-            + (System.currentTimeMillis() - startTime) + "ms");
-      }
+        // to preserve old behavior we need to throw a MemberStartupException if
+        // unable to contact any of the locators
+        if (state.hasContactedAJoinedLocator) {
+          throw new MemberStartupException("Unable to join the distributed system in "
+              + (System.currentTimeMillis() - startTime) + "ms");
+        }
 
-      return this.isJoined;
+        if (state.locatorsContacted == 0) {
+          throw new MembershipConfigurationException(
+              "Unable to join the distributed system. Could not contact any of the locators: "

Review comment:
       As I understand this change, this section is the critical part for 
improving the error reporting, but I don't see unit testing for the improvement. 
There is a new test that looks for the above exception string, but it seems 
that a test could also validate that the expanded error information is reported.
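
One way such a check could look, as a plain-Java sketch rather than a test from the PR: the helper below inspects a failure for both the locator list in the message and the type of the underlying cause. The helper name, the locator addresses, and the use of IllegalStateException in place of MembershipConfigurationException are all assumptions made so the example is self-contained.

```java
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; names and sample data are illustrative only.
class JoinFailureAssertionsSketch {

  /** Returns null when the failure carries the expanded information, else a reason. */
  static String describeProblem(Throwable failure, List<String> expectedLocators,
      Class<? extends Throwable> expectedCauseType) {
    String message = String.valueOf(failure.getMessage());
    for (String locator : expectedLocators) {
      if (!message.contains(locator)) {
        return "message does not mention locator " + locator;
      }
    }
    // isInstance(null) is false, so a missing cause is reported as well.
    if (!expectedCauseType.isInstance(failure.getCause())) {
      return "cause is not a " + expectedCauseType.getSimpleName();
    }
    return null;
  }

  public static void main(String[] args) {
    // IllegalStateException stands in for the exception thrown by join().
    Throwable failure = new IllegalStateException(
        "Unable to join the distributed system. Could not contact any of the locators: "
            + "[host1:10334, host2:10334]",
        new UnknownHostException("host2"));
    System.out.println(describeProblem(failure,
        Arrays.asList("host1:10334", "host2:10334"), UnknownHostException.class));
  }
}
```

In a real unit test the same two checks would more likely be written with the project's assertion library against the exception actually thrown by join().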
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Improve error reporting when a member is unable to contact a locator
> --------------------------------------------------------------------
>
>                 Key: GEODE-7861
>                 URL: https://issues.apache.org/jira/browse/GEODE-7861
>             Project: Geode
>          Issue Type: Improvement
>          Components: membership
>            Reporter: Dan Smith
>            Assignee: Kamilla Aslami
>            Priority: Major
>              Labels: pull-request-available
>
> When a member is unable to join the system due to a failure to 
> contact a locator, we could be a lot more specific, which would help users 
> debug issues in their environment. We currently print out:
> {noformat}
> Unable to join the distributed system.  Operation either timed out, was 
> stopped or Locator does not exist.
> {noformat}
> We should include the underlying exception from the last locator we tried to 
> contact as a cause, so that users can see why it failed (DNS error, ???). 
> Perhaps also the list of locators we tried to contact.
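
The request above can be sketched in plain Java as follows. This is an illustration under stated assumptions, not Geode code: JoinFailedException is a hypothetical stand-in for the real exception type, and the locator addresses and the UnknownHostException cause are made-up sample data.

```java
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the change made in the pull request.
class LocatorFailureSketch {

  /** Stand-in for the exception a member would throw when it cannot join. */
  static class JoinFailedException extends Exception {
    JoinFailedException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /** Builds a failure that names every locator tried and keeps the last error as cause. */
  static JoinFailedException joinFailure(List<String> locators, Throwable lastFailure) {
    return new JoinFailedException(
        "Unable to join the distributed system. Could not contact any of the locators: "
            + locators,
        lastFailure);
  }

  public static void main(String[] args) {
    JoinFailedException e = joinFailure(
        Arrays.asList("host1:10334", "host2:10334"),
        new UnknownHostException("host2"));
    System.out.println(e.getMessage());
    System.out.println("Caused by: " + e.getCause());
  }
}
```

Passing the last failure as the cause keeps its stack trace in the log, so a DNS problem shows up as the nested exception instead of being collapsed into a generic message.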



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
