tibrewalpratik17 opened a new issue, #13683:
URL: https://github.com/apache/pinot/issues/13683

   Recently, we had an incident in our production environment in which 
numerous queries failed because no replica group was available to route them 
to. We are using the `strictReplicaGroup` policy.
   
   Upon debugging, we identified the following issue:
   
   - A new consuming segment was added to the IdealState at `19:32`.
   - Both replica servers received the Helix transition message 
`SegmentOnlineOfflineStateModel.onBecomeConsumingFromOffline()` at `19:32`.
   - Both replicas only began consuming (`Created new consumer thread 
Thread:`) at `19:39`, and the ExternalView was updated at `19:39`.
   
   Here's how the broker-segment-state updates occurred:
   
   - The segment was added to `_newSegmentStateMap` at approximately `19:33` 
([Ref](https://github.com/apache/pinot/blob/00871f242fa7342e252d956b54d246d289887408/pinot-broker/src/main/java/org/apache/pinot/broker/routing/instanceselector/BaseInstanceSelector.java#L398-L417)).
   - By `19:38`, it was no longer considered a new segment due to this clause 
([Ref](https://github.com/apache/pinot/blob/00871f242fa7342e252d956b54d246d289887408/pinot-broker/src/main/java/org/apache/pinot/broker/routing/instanceselector/InstanceSelector.java#L38-L40))
 and was moved from `_newSegmentStateMap` to `_oldSegmentCandidatesMap`.
   - At `19:38`, queries started failing because both instances were marked 
unavailable, which left no available replica. Both instances were marked 
unavailable because the ExternalView was null for this segment, which by then 
was treated as an old segment.
   Log:
   ```
   Found unavailable instance: <SampleInstance1> in instance group: 
[SampleInstance1, SampleInstance2] for segment: 
   SegmentName1, table: TableName1 (IS: {SampleInstance1=CONSUMING, 
SampleInstance2=CONSUMING}, EV: null)
   ```
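
To make the timeline above concrete, the time-based check can be modelled roughly as follows. This is a minimal sketch, not the actual Pinot code: the class and method names are hypothetical, and the 5-minute window is an assumption inferred from the `19:33` to `19:38` gap in the timeline.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical, simplified model of the broker's time-based "new segment" check.
final class NewSegmentWindowSketch {
  // Assumed window of 5 minutes, inferred from the 19:33 -> 19:38 gap in the timeline.
  static final long NEW_SEGMENT_EXPIRATION_MILLIS = TimeUnit.MINUTES.toMillis(5);

  // A segment counts as "new" only while its age is within the expiration window.
  static boolean isNewSegment(long creationTimeMs, long currentTimeMs) {
    return creationTimeMs > 0 && currentTimeMs - creationTimeMs <= NEW_SEGMENT_EXPIRATION_MILLIS;
  }

  // An "old" segment with no ExternalView entry causes its instances to be marked unavailable.
  static boolean marksReplicasUnavailable(long creationTimeMs, long currentTimeMs, boolean inExternalView) {
    return !isNewSegment(creationTimeMs, currentTimeMs) && !inExternalView;
  }

  static long minutes(long m) {
    return TimeUnit.MINUTES.toMillis(m);
  }

  public static void main(String[] args) {
    long created = minutes(33); // segment first seen at 19:33, as minutes past 19:00

    // Four minutes old, ExternalView still empty: tolerated because the segment is new.
    System.out.println(marksReplicasUnavailable(created, created + minutes(4), false));

    // Six minutes old, ExternalView still empty: the failure mode from the incident.
    System.out.println(marksReplicasUnavailable(created, created + minutes(6), false));

    // Six minutes old, ExternalView populated: healthy.
    System.out.println(marksReplicasUnavailable(created, created + minutes(6), true));
  }
}
```

The gap between the window expiring and the ExternalView being populated is where the failures occurred: the segment stopped being treated as new before any server had started consuming it.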
   
   Possible solutions to explore:
   
   - Make the `NEW_SEGMENT_EXPIRATION_MILLIS` parameter configurable 
([Ref](https://github.com/apache/pinot/blob/00871f242fa7342e252d956b54d246d289887408/pinot-broker/src/main/java/org/apache/pinot/broker/routing/instanceselector/InstanceSelector.java#L38-L40)).
   
   - Change the logic that moves a new segment from `_newSegmentStateMap` to 
`_oldSegmentCandidatesMap` so that it checks whether the segment has actually 
started consuming (EV update) rather than relying on a fixed time period.
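
The two options could be combined roughly as sketched below. Everything here is hypothetical and for illustration only: the class, the constructor parameter standing in for a configurable `NEW_SEGMENT_EXPIRATION_MILLIS`, and the choice to treat `ONLINE` or `CONSUMING` in the ExternalView as evidence that the segment has started are all assumptions, not Pinot's existing API.

```java
import java.util.Map;

// Hypothetical sketch of an ExternalView-aware promotion rule; not Pinot's actual code.
final class EvAwarePromotionSketch {
  // Hypothetical configurable replacement for the hard-coded expiration constant.
  private final long newSegmentExpirationMillis;

  EvAwarePromotionSketch(long newSegmentExpirationMillis) {
    this.newSegmentExpirationMillis = newSegmentExpirationMillis;
  }

  // True if at least one instance reports the segment as ONLINE or CONSUMING in the ExternalView.
  static boolean hasStartedServing(Map<String, String> externalViewStates) {
    if (externalViewStates == null) {
      return false;
    }
    for (String state : externalViewStates.values()) {
      if ("ONLINE".equals(state) || "CONSUMING".equals(state)) {
        return true;
      }
    }
    return false;
  }

  // Promote a segment to the old-segment candidates only when the window has elapsed
  // AND the ExternalView confirms that some replica is serving it.
  boolean shouldMoveToOldCandidates(long creationTimeMs, long currentTimeMs,
      Map<String, String> externalViewStates) {
    boolean windowElapsed = currentTimeMs - creationTimeMs > newSegmentExpirationMillis;
    return windowElapsed && hasStartedServing(externalViewStates);
  }

  public static void main(String[] args) {
    EvAwarePromotionSketch sketch = new EvAwarePromotionSketch(5 * 60_000L);
    long sixMinutesMs = 6 * 60_000L;

    // ExternalView is null, as in the incident log: the segment stays "new".
    System.out.println(sketch.shouldMoveToOldCandidates(0L, sixMinutesMs, null));

    // Both replicas have reported CONSUMING: the segment can be promoted.
    System.out.println(sketch.shouldMoveToOldCandidates(0L, sixMinutesMs,
        Map.of("SampleInstance1", "CONSUMING", "SampleInstance2", "CONSUMING")));
  }
}
```

One trade-off to consider with a rule like this: a segment that never appears in the ExternalView would stay in the new-segment state indefinitely, so some upper bound may still be needed.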
   
   Note: The strategy discussed in #13284 would not have helped in this 
scenario either, as neither replica was present in the ExternalView.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

