cypherean opened a new issue, #13990:
URL: https://github.com/apache/pinot/issues/13990

   We hit an issue in production where the controller leader node went down 
but re-election was not triggered. This led to segment upload errors for 
tables trying to commit a segment, instances being marked as unavailable for 
those segments, and finally queries failing with a "segments unavailable" 
error for the affected tables. 
   Pinot version - 1.0.x
   
   Timeline for this was as follows:
   1. A GET call for a large table's segment metadata, 
`tables/<tablename>/segments/<segmentName>/metadata?columns=*`, spawned ~75k 
threads. This caused a huge memory spike and the heap ran out of memory (heap 
size being 128GB here), possibly crashing the node. We suspect the call came 
from the reload status button, which triggers the segment metadata call.
   ![Screenshot 2024-09-12 at 7 11 38 
PM](https://github.com/user-attachments/assets/e0f27b0b-8934-416e-b608-0a139aba7f96)
   
   2. The node's 2 ZK sessions timed out at 17:19:56. The node tried to 
re-establish the connection, but it kept emitting metrics as leader until 17:29:x
   ![Screenshot 2024-09-12 at 7 11 05 
PM](https://github.com/user-attachments/assets/e50bac7f-0d27-4c3b-9592-958ccb06ddc4)
   
   3. The health check for the node started failing around 17:20:x, but the 
standby controller nodes kept polling and seeing the failed leader's session ID 
as leader until 18:10, when we forced a replacement of the failed node
   ![Screenshot 2024-09-12 at 7 19 24 
PM](https://github.com/user-attachments/assets/4ec05d8a-f92b-4650-afb5-91f95130602c)
   
   ```
   Instance 123 is not leader of cluster production-cluster due to current 
session 702147796290178 does not match leader session 702147796290171
   ```
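
   To make the log line above concrete, here is a minimal sketch of the kind of comparison it implies: a node treats itself as leader only if its own ZK session ID equals the session ID recorded for the leader. The class and method names are hypothetical and the logic is inferred from the log message alone, not taken from the Pinot or Helix source.

   ```java
   // Hypothetical illustration only; inferred from the log message, not from Pinot/Helix code.
   class LeaderSessionCheck {

       // A node is leader only if its own session ID matches the recorded leader session ID.
       static boolean isLeader(String currentSessionId, String leaderSessionId) {
           return currentSessionId != null && currentSessionId.equals(leaderSessionId);
       }

       // Builds a message in the same shape as the log line quoted in this issue.
       static String describe(String instanceId, String cluster,
                              String currentSessionId, String leaderSessionId) {
           if (isLeader(currentSessionId, leaderSessionId)) {
               return "Instance " + instanceId + " is leader of cluster " + cluster;
           }
           return "Instance " + instanceId + " is not leader of cluster " + cluster
                   + " due to current session " + currentSessionId
                   + " does not match leader session " + leaderSessionId;
       }

       public static void main(String[] args) {
           // Values copied from the log line in this issue.
           System.out.println(describe("123", "production-cluster",
                   "702147796290178", "702147796290171"));
       }
   }
   ```

   Under this reading, a standby keeps logging the mismatch for as long as the recorded leader session is still the failed node's session, which matches what we saw until 18:10.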
   
   Ideally the re-election should've triggered around 17:20.
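
   Separately, on step 1 of the timeline: the sketch below shows one generic way to cap the number of threads used by a per-segment fan-out, using a fixed-size pool instead of a thread per request. It is not how the controller currently implements the metadata call; `BoundedMetadataFanOut`, `fetchAll`, `fetchOne` and `maxThreads` are hypothetical names used only for illustration.

   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.concurrent.Callable;
   import java.util.concurrent.ExecutionException;
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   import java.util.concurrent.Future;
   import java.util.function.Function;

   // Hypothetical illustration only; not the Pinot controller's implementation.
   class BoundedMetadataFanOut {

       // Runs fetchOne for every segment name on at most maxThreads worker threads
       // and returns the results in the same order as the input list.
       static <T> List<T> fetchAll(List<String> segmentNames,
                                   Function<String, T> fetchOne,
                                   int maxThreads) {
           ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
           try {
               List<Future<T>> futures = new ArrayList<>();
               for (String name : segmentNames) {
                   Callable<T> task = () -> fetchOne.apply(name);
                   futures.add(pool.submit(task)); // extra tasks queue up; no new threads
               }
               List<T> results = new ArrayList<>();
               for (Future<T> future : futures) {
                   results.add(future.get()); // blocks until that task completes
               }
               return results;
           } catch (InterruptedException e) {
               Thread.currentThread().interrupt();
               throw new IllegalStateException("Interrupted while fetching metadata", e);
           } catch (ExecutionException e) {
               throw new IllegalStateException("Metadata fetch failed", e.getCause());
           } finally {
               pool.shutdown();
           }
       }
   }
   ```

   With a cap like this, a table with tens of thousands of segments would queue work rather than create tens of thousands of threads. The unbounded task queue still uses memory in proportion to the number of segments.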
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
