[PR] Optimize segment commit to not read partition group metadata [pinot]

via GitHub Thu, 02 Nov 2023 17:41:59 -0700


Jackie-Jiang opened a new pull request, #11943:
URL: https://github.com/apache/pinot/pull/11943


   Currently, when committing a real-time segment, the controller needs to read 
partition group metadata for all partitions from upstream, which can be very 
slow for streams with many partitions.
   The partition group metadata is used only to extract the partition ids, 
which can simply be derived from the partition count, except for streams that 
close partitions, such as Kinesis.
   
   In this PR, we made the following changes:
   1. Read only the partition count from upstream when it is available
   2. If the partition count is not available, fall back to the current approach
   3. Log the time spent in each step for debugging
   4. In `SegmentFlushThresholdComputer`, remove the logic that counts only the 
segments with the smallest partition id toward the size ratio. It complicates 
the handling a lot (it is quite an anti-pattern, since any segment commit then 
requires info from all partitions), and I don't see much value in it. Quickly 
converging to the size ratio of the recent data trend should be a pro rather 
than a con, because this ratio is used to decide the segment size needed to 
consume data for the same period of time.
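
For illustration only, changes 1 to 3 can be sketched roughly as below. This is not the code in the PR: `PartitionIdResolver`, `resolvePartitionIds`, and the `metadataFallback` supplier are hypothetical names, and the sketch assumes that a stream which never closes partitions numbers them `0..count-1`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalInt;
import java.util.function.Supplier;

// Hypothetical helper, not part of Pinot: resolves the partition ids needed
// during segment commit without reading per-partition metadata when possible.
final class PartitionIdResolver {

    static List<Integer> resolvePartitionIds(OptionalInt partitionCount,
                                             Supplier<List<Integer>> metadataFallback) {
        long startNs = System.nanoTime();
        List<Integer> ids;
        String path;
        if (partitionCount.isPresent()) {
            // Fast path: assume ids are 0..count-1, so only the count is needed.
            int count = partitionCount.getAsInt();
            ids = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                ids.add(i);
            }
            path = "partition count";
        } else {
            // Slow path: the stream may close partitions (for example Kinesis),
            // so read the ids from the full partition group metadata.
            ids = metadataFallback.get();
            path = "partition group metadata";
        }
        // Log the time spent in this step for debugging.
        long elapsedMs = (System.nanoTime() - startNs) / 1_000_000;
        System.out.println("Resolved " + ids.size() + " partition ids via " + path
            + " in " + elapsedMs + "ms");
        return ids;
    }
}
```

Passing the slow path as a `Supplier` keeps the expensive upstream read lazy, so it is only triggered when the count is unavailable.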


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


