itschrispeck commented on PR #14765:
URL: https://github.com/apache/pinot/pull/14765#issuecomment-2574381377

   > Can you elaborate more on why do we want to separate size info for each 
topic? If the data is ingested into the same table, they should have the same 
schema, and most of the case similar data distribution. I guess in most cases 
user still want to track per table threshold.
   
   For our multi-stream ingestion use case, data rarely has the same schema or a 
similar data distribution. For us, topics are logically separate datasets. The 
current implementation led to many index build failures, e.g. forward index 
size, too many MV values, etc. Beyond build failures, we also saw wild swings 
in segment sizes as many segments from one topic flushed, and then many 
segments from another topic flushed. 
   
   A concrete example is observability: we have service A and service B 
emitting to topics A and B, both ingesting into a single table. Generally they 
do not emit the same shape or volume of logs, which makes segment size 
estimates inaccurate. Even if they emit the same shape/size, deployments that 
change the log fingerprints cannot always happen in sync, so if the threshold 
is computed at the table level and service A has been upgraded but B has not, 
there is a period where segment build failures are likely. Another case is when 
we temporarily turn on debug-level logging for A, and the same issues occur. 
   
   > If this is indeed needed, can we make it configurable, and add a config 
flag to enable it?
   
   What's the downside of leaving this per topic by default? If the above 
assumption holds, i.e. that data has the same schema and a similar data 
distribution, then computing it per topic should not be measurably different 
from computing it per table. If the assumption doesn't hold, then we get 
better behavior by default. I feel that an extra 
`SegmentSizeBasedFlushThresholdUpdater` instance per topic should not be too 
expensive to hold. 
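   
   To make the service A / service B example concrete, here is a minimal 
sketch of the difference between one shared estimator and one estimator per 
topic. Everything in it is an assumption for illustration: 
`PerTopicFlushSketch`, `RowThresholdEstimator`, the smoothing weight `alpha`, 
and the numbers are hypothetical and are not Pinot's actual classes or values. 

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: these classes are hypothetical and do not mirror Pinot's code.
final class PerTopicFlushSketch {

    /** Learns bytes-per-row from completed segments and derives a row threshold. */
    static final class RowThresholdEstimator {
        private final long targetSegmentBytes;
        private final double alpha;      // assumed weight given to the newest observation
        private double bytesPerRow = 0;  // 0 means "no segment observed yet"

        RowThresholdEstimator(long targetSegmentBytes, double alpha) {
            this.targetSegmentBytes = targetSegmentBytes;
            this.alpha = alpha;
        }

        void onSegmentCompleted(long rows, long segmentBytes) {
            if (rows <= 0) {
                return; // nothing to learn from an empty segment
            }
            double observed = (double) segmentBytes / rows;
            bytesPerRow = bytesPerRow == 0
                ? observed
                : alpha * observed + (1 - alpha) * bytesPerRow;
        }

        /** Rows to consume before flushing, or -1 if nothing has been observed yet. */
        long nextRowThreshold() {
            return bytesPerRow == 0 ? -1 : Math.round(targetSegmentBytes / bytesPerRow);
        }
    }

    private final Map<String, RowThresholdEstimator> estimators = new HashMap<>();
    private final long targetSegmentBytes;
    private final double alpha;

    PerTopicFlushSketch(long targetSegmentBytes, double alpha) {
        this.targetSegmentBytes = targetSegmentBytes;
        this.alpha = alpha;
    }

    /** One estimator per topic, created lazily on first use. */
    RowThresholdEstimator forTopic(String topic) {
        return estimators.computeIfAbsent(
            topic, t -> new RowThresholdEstimator(targetSegmentBytes, alpha));
    }

    public static void main(String[] args) {
        long target = 200_000_000L; // made-up target segment size in bytes
        RowThresholdEstimator shared = new RowThresholdEstimator(target, 0.5);
        PerTopicFlushSketch perTopic = new PerTopicFlushSketch(target, 0.5);

        // Made-up inputs: topic A averages 100 bytes/row, topic B 400 bytes/row.
        shared.onSegmentCompleted(1_000_000, 100_000_000L);
        shared.onSegmentCompleted(1_000_000, 400_000_000L);
        perTopic.forTopic("A").onSegmentCompleted(1_000_000, 100_000_000L);
        perTopic.forTopic("B").onSegmentCompleted(1_000_000, 400_000_000L);

        System.out.println("shared:  " + shared.nextRowThreshold());
        System.out.println("topic A: " + perTopic.forTopic("A").nextRowThreshold());
        System.out.println("topic B: " + perTopic.forTopic("B").nextRowThreshold());
    }
}
```

   With these made-up inputs the shared estimator settles on 800,000 rows, 
which would yield roughly 80 MB segments for topic A and 320 MB segments for 
topic B against a 200 MB target. The per-topic estimators settle on 2,000,000 
and 500,000 rows respectively, so both topics land on the target. The extra 
cost is one small object per topic. 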


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

