tibrewalpratik17 opened a new issue, #13683: URL: https://github.com/apache/pinot/issues/13683
Recently, we experienced an incident in our production environment where numerous query failures occurred due to the absence of a replica group to route the queries to. We are using the `strictReplicaGroup` policy. Upon debugging, we identified the following issue:

- A new consuming segment was added to the Ideal state at `19:32`.
- Both replica servers received the Helix transition message `SegmentOnlineOfflineStateModel.onBecomeConsumingFromOffline()` at `19:32`.
- Both replicas actually began consuming (`Created new consumer thread Thread:`) at `19:39`, and the External view was updated at `19:39`.

Here's how the broker-segment-state updates occurred:

- The segment was added to `_newSegmentStateMap` at approximately `19:33`: [Ref](https://github.com/apache/pinot/blob/00871f242fa7342e252d956b54d246d289887408/pinot-broker/src/main/java/org/apache/pinot/broker/routing/instanceselector/BaseInstanceSelector.java#L398-L417)
- By `19:38`, it was no longer considered a new segment due to this clause ([Ref](https://github.com/apache/pinot/blob/00871f242fa7342e252d956b54d246d289887408/pinot-broker/src/main/java/org/apache/pinot/broker/routing/instanceselector/InstanceSelector.java#L38-L40)) and was moved from `_newSegmentStateMap` to `_oldSegmentCandidatesMap`.
- At `19:38`, we started seeing query failures as both instances were tagged unavailable, thus making the replicas unavailable. Both instances were tagged unavailable because the External view was NULL for this "old-tagged" segment.
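The timing above can be reproduced with a small sketch. It assumes the linked clause is a pure age test against a five-minute window (which is consistent with the `19:33` to `19:38` gap). The class name, the `atTime` helper, and the exact seconds used in the timestamps are illustrative and are not taken from Pinot.

```java
import java.util.concurrent.TimeUnit;

public class NewSegmentWindowSketch {
  // Assumption: a five-minute window, matching the 19:33 -> 19:38 gap in the timeline.
  public static final long NEW_SEGMENT_EXPIRATION_MILLIS = TimeUnit.MINUTES.toMillis(5);

  // Age-only test: the segment stops being "new" once the window has elapsed,
  // whether or not any server has reported it in the ExternalView yet.
  public static boolean isNewSegment(long creationTimeMs, long currentTimeMs) {
    return creationTimeMs > 0 && currentTimeMs - creationTimeMs <= NEW_SEGMENT_EXPIRATION_MILLIS;
  }

  // Hypothetical helper: milliseconds since midnight for a wall-clock time.
  public static long atTime(int hour, int minute, int second) {
    return TimeUnit.HOURS.toMillis(hour) + TimeUnit.MINUTES.toMillis(minute)
        + TimeUnit.SECONDS.toMillis(second);
  }

  public static void main(String[] args) {
    long seenByBroker = atTime(19, 33, 0);        // added to _newSegmentStateMap
    long externalViewUpdated = atTime(19, 39, 0); // servers actually start consuming

    long[] checkpoints = {atTime(19, 37, 0), atTime(19, 38, 30), atTime(19, 39, 0)};
    for (long now : checkpoints) {
      boolean isNew = isNewSegment(seenByBroker, now);
      boolean inExternalView = now >= externalViewUpdated;
      // An "old" segment with no ExternalView entry makes its instances look unavailable.
      boolean unavailable = !isNew && !inExternalView;
      System.out.println("msOfDay=" + now + " new=" + isNew + " inEV=" + inExternalView
          + " unavailable=" + unavailable);
    }
  }
}
```

Under these assumptions, the only checkpoint flagged as unavailable is the one that falls after the window expires but before the External view update, which is the gap we hit.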
LOG:

```
Found unavailable instance: <SampleInstance1> in instance group: [SampleInstance1, SampleInstance2] for segment: SegmentName1, table: TableName1 (IS: {SampleInstance1=CONSUMING, SampleInstance2=CONSUMING}, EV: null)
```

Possible solutions to explore:

- Make the `NEW_SEGMENT_EXPIRATION_MILLIS` parameter configurable here: https://github.com/apache/pinot/blob/00871f242fa7342e252d956b54d246d289887408/pinot-broker/src/main/java/org/apache/pinot/broker/routing/instanceselector/InstanceSelector.java#L38-L40
- Enhance the logic that moves a new segment from `_newSegmentStateMap` to `_oldSegmentCandidatesMap` so that it checks whether the segment has actually started consuming (EV update) rather than relying on a default time period.

Note: The strategy discussed in #13284 would also not have helped in this scenario, as neither replica was present in the ExternalView.
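The two options above could be combined roughly as follows. This is only a sketch under stated assumptions: `NewSegmentPolicySketch`, its constructor parameter, and `shouldStayNew` are hypothetical names rather than existing Pinot APIs, and the broker's knowledge of the ExternalView is reduced to a single boolean. Keeping the configurable window as an upper bound is also an assumption, added so that a segment that never shows up in the ExternalView is not treated as new forever.

```java
import java.util.concurrent.TimeUnit;

public class NewSegmentPolicySketch {
  // Hypothetical configurable replacement for the fixed expiration constant.
  private final long _expirationMs;

  public NewSegmentPolicySketch(long expirationMs) {
    _expirationMs = expirationMs;
  }

  // Returns true while the segment should remain in the "new" bucket.
  public boolean shouldStayNew(long creationTimeMs, long nowMs, boolean inExternalView) {
    if (inExternalView) {
      // Servers have reported the segment, so it can be promoted to "old" right away.
      return false;
    }
    // Not in the ExternalView yet: keep it "new" until the configured upper bound.
    return creationTimeMs > 0 && nowMs - creationTimeMs <= _expirationMs;
  }

  public static void main(String[] args) {
    // Illustrative value only: a ten-minute upper bound instead of five.
    NewSegmentPolicySketch policy = new NewSegmentPolicySketch(TimeUnit.MINUTES.toMillis(10));
    long created = 0L + TimeUnit.MINUTES.toMillis(1); // arbitrary positive creation time

    long fiveAndAHalfMinutesLater = created + TimeUnit.SECONDS.toMillis(330);
    // Servers have not started consuming yet: the segment stays "new".
    System.out.println(policy.shouldStayNew(created, fiveAndAHalfMinutesLater, false));

    long sixMinutesLater = created + TimeUnit.MINUTES.toMillis(6);
    // ExternalView now contains the segment: it is promoted to "old".
    System.out.println(policy.shouldStayNew(created, sixMinutesLater, true));
  }
}
```

With a policy like this, the segment in our incident would have stayed in the new bucket during the interval in which the External view was still NULL, and would have been promoted only once the servers reported it.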