gortiz commented on code in PR #10528:
URL: https://github.com/apache/pinot/pull/10528#discussion_r1204059055


##########
pinot-integration-tests/src/test/java/org/apache/pinot/integration/tests/MultiNodesOfflineClusterIntegrationTest.java:
##########
@@ -138,28 +140,33 @@ public void testServerHardFailure()
 
     // Take a server and shut down its query server to mimic a hard failure
     BaseServerStarter serverStarter = _serverStarters.get(NUM_SERVERS - 1);
-    serverStarter.getServerInstance().shutDown();
-
-    // First query should hit all servers and get connection refused exception
-    testCountStarQuery(NUM_SERVERS, true);
-
-    // Second query should not hit the failed server, and should return the correct result
-    testCountStarQuery(NUM_SERVERS - 1, false);
-
-    // Restart the failed server, and it should be included in the routing again
-    serverStarter.stop();
-    serverStarter = startOneServer(NUM_SERVERS - 1);
-    _serverStarters.set(NUM_SERVERS - 1, serverStarter);
-    TestUtils.waitForCondition(aVoid -> {
-      try {
-        JsonNode queryResult = postQuery("SELECT COUNT(*) FROM mytable");
-        // Result should always be correct
-        assertEquals(queryResult.get("resultTable").get("rows").get(0).get(0).longValue(), getCountStarResult());
-        return queryResult.get("numServersQueried").intValue() == NUM_SERVERS;
-      } catch (Exception e) {
-        throw new RuntimeException(e);
-      }
-    }, 10_000L, "Failed to include the restarted server into the routing");
+    try {
+      serverStarter.getServerInstance().shutDown();
+
+      // First query should hit all servers and get connection refused exception
+      TestUtils.waitForCondition(() -> {
+        testCountStarQuery(NUM_SERVERS, true);

Review Comment:
   `serverStarter.getServerInstance().shutDown()` does not wait until the 
server has stopped, while `testCountStarQuery` asserts that the query fails 
and that the error is `Connection refused`. That creates a race condition 
between the shutdown and the query. It is very rare (though perhaps not 
impossible) for the query to succeed, but it is not uncommon for the query to 
start before the server begins rejecting connections. In that case the query 
fails with `Connection reset` instead of `Connection refused`.
   
   The problem is very difficult to reproduce on ordinary computers with 
several CPUs, but since this PR runs the test 3 times and GA doesn't seem to 
use very powerful machines, it wasn't rare for one of the executions to fail.
   
   The correct solution would be to make `ServerInstance.shutDown` blocking (or 
at least to block in the tests), but since that could make this PR too complex, 
I think retrying here for 5 seconds is good enough.
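
A minimal, self-contained sketch of the retry idea described above. It does not use Pinot's `TestUtils`. The class `RetrySketch`, its `waitForCondition` helper, and the simulated query are hypothetical names made up for illustration. The simulated query reports `Connection reset` on its first two calls (the shutdown still in progress) and `Connection refused` afterwards, so the check only passes once the expected error shows up within the 5 second budget.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

public class RetrySketch {

  /**
   * Hypothetical helper: polls the condition until it returns true or the
   * timeout elapses. Returns true on success and false on timeout.
   */
  public static boolean waitForCondition(BooleanSupplier condition, long timeoutMs, long intervalMs)
      throws InterruptedException {
    long start = System.nanoTime();
    long timeoutNanos = timeoutMs * 1_000_000L;
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      if (System.nanoTime() - start >= timeoutNanos) {
        return false;
      }
      Thread.sleep(intervalMs);
    }
  }

  public static void main(String[] args) throws InterruptedException {
    // Stand-in for a query against a server that is still shutting down:
    // the first two attempts see "Connection reset", later ones "Connection refused".
    AtomicInteger attempts = new AtomicInteger();
    Supplier<String> simulatedQueryError =
        () -> attempts.incrementAndGet() < 3 ? "Connection reset" : "Connection refused";

    // Retry for up to 5 seconds until the expected error is observed.
    boolean sawRefused = waitForCondition(
        () -> simulatedQueryError.get().contains("Connection refused"), 5_000L, 10L);

    System.out.println("saw expected error: " + sawRefused + ", attempts: " + attempts.get());
  }
}
```

Retrying hides the race rather than removing it: the test still passes only because the server eventually finishes shutting down within the timeout, which is why a blocking shutdown would be the cleaner fix.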



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

