mcvsubbu commented on a change in pull request #6890:
URL: https://github.com/apache/incubator-pinot/pull/6890#discussion_r633914256
##########
File path: pinot-controller/src/main/java/org/apache/pinot/controller/util/ConsumingSegmentInfoReader.java
##########
@@ -131,6 +134,51 @@ private String generateServerURL(String tableNameWithType, String endpoint) {
     return String.format("%s/tables/%s/consumingSegmentsInfo", endpoint, tableNameWithType);
   }
 
+  /**
+   * Utility method to derive ingestion status from consuming segment Info. Status is HEALTHY if
+   * consuming segment info specifies CONSUMING state for all active segments across all servers
+   * including replicas.
+   */
+  public TableStatus.IngestionStatus getIngestionStatus(String tableNameWithType, int timeoutMs) {
+    try {
+      ConsumingSegmentsInfoMap consumingSegmentsInfoMap = getConsumingSegmentsInfo(tableNameWithType, timeoutMs);
+      for (Map.Entry<String, List<ConsumingSegmentInfo>> consumingSegmentInfoEntry : consumingSegmentsInfoMap._segmentToConsumingInfoMap
+          .entrySet()) {
+        String segmentName = consumingSegmentInfoEntry.getKey();
+        List<ConsumingSegmentInfo> consumingSegmentInfoList = consumingSegmentInfoEntry.getValue();
+        if (consumingSegmentInfoList == null || consumingSegmentInfoList.isEmpty()) {
+          String errorMessage = "Did not get any response from servers for segment: " + segmentName;
+          return TableStatus.IngestionStatus.newIngestionStatus(TableStatus.IngestionState.UNHEALTHY, errorMessage);
+        }
+
+        // Check if any responses are missing
+        Set<String> serversForSegment = _pinotHelixResourceManager.getServersForSegment(tableNameWithType, segmentName);
+        if (serversForSegment.size() != consumingSegmentInfoList.size()) {
+          Set<String> serversResponded =
+              consumingSegmentInfoList.stream().map(c -> c._serverName).collect(Collectors.toSet());
+          serversForSegment.removeAll(serversResponded);
+          String errorMessage =
+              "Not all servers responded for segment: " + segmentName + " Missing servers : " + serversForSegment;
+          return TableStatus.IngestionStatus.newIngestionStatus(TableStatus.IngestionState.UNHEALTHY, errorMessage);
+        }
+
+        for (ConsumingSegmentInfo consumingSegmentInfo : consumingSegmentInfoList) {
+          if (consumingSegmentInfo._consumerState
+              .equals(ConsumerState.NOT_CONSUMING.toString())) {

Review comment:
This means that if a transient failure with Kafka causes CONSUMING segments to go OFFLINE, we could be displaying HEALTHY for a long time.

We have had multiple combinations of such issues at LinkedIn and, over time, have concluded that monitoring the consumption metric (we emit 1 when consuming and 0 when not) is really the best way to decide whether manual intervention is needed. A persistent value of 0 (where the required persistence time depends on the application's tolerance for non-fresh data) is a call to action.

We can also check how Uber handles this, since they are another large installation of real-time Pinot.
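
To make the "persistent value of 0" rule above concrete, here is a minimal sketch in Java. It assumes the consumption gauge is sampled periodically as 1 (consuming) or 0 (not consuming). The class `ConsumptionStallDetector` and its methods are hypothetical names for illustration only and are not part of Pinot; the tolerance value is something each application would have to choose for itself.

```java
import java.time.Duration;
import java.time.Instant;

/**
 * Hypothetical helper (not a Pinot class): tracks a 0/1 consumption gauge and
 * reports a stall only after the gauge has stayed at 0 for at least a tolerance.
 * Not thread-safe; a single sampling thread is assumed.
 */
final class ConsumptionStallDetector {
  // How long an unbroken run of zeros is acceptable (application-specific assumption).
  private final Duration tolerance;
  // Time of the first 0 sample in the current unbroken run of zeros; null if the last sample was non-zero.
  private Instant zeroSince;

  ConsumptionStallDetector(Duration tolerance) {
    this.tolerance = tolerance;
  }

  /** Records one gauge sample: 1 means consuming, 0 means not consuming. */
  void record(int gaugeValue, Instant sampledAt) {
    if (gaugeValue != 0) {
      // Consumption resumed, so any earlier run of zeros no longer counts.
      zeroSince = null;
    } else if (zeroSince == null) {
      // First zero of a new run; remember when the run started.
      zeroSince = sampledAt;
    }
  }

  /** Returns true when the gauge has been 0 continuously for at least the tolerance, as of {@code now}. */
  boolean needsIntervention(Instant now) {
    return zeroSince != null && Duration.between(zeroSince, now).compareTo(tolerance) >= 0;
  }
}
```

With a tolerance of, say, ten minutes, a brief Kafka hiccup that resolves itself resets the run of zeros and never raises an alert, while a segment that stays non-consuming past the tolerance does.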