itschrispeck commented on PR #13215:
URL: https://github.com/apache/pinot/pull/13215#issuecomment-2130517286

   > Having 2GB as start size doesn't look correct. Can you check the high 
level logic and see if this is expected? Seems like we are trying to use one 
single buffer to hold everything?
   
   Looks like we're hitting an edge case. The contributing factors are: 
   1. MV columns [will always use a mutable 
dictionary](https://github.com/apache/pinot/blob/fed2d5f1b613371237b5a29348f0c043200671ad/pinot-segment-local/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java#L450)
   2. We have an extremely large MV raw column generated by 
SchemaConformingTransformerV2
   3. The column is text indexed, so we use the `noRawDataForTextIndex` config 
and the final segment is not nearly as large
   
   Together they can result in the estimated size based on 
`RealtimeSegmentStatsHistory` being extremely large even though our target 
segment size is ~1.2G. 
   
   I think the solution is to allow MV columns to be raw encoded even in the 
mutable segment - but I'm not sure that should be in the scope of this PR. What 
do you think? 
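   
   For context on factor 3, here is a rough sketch of the kind of field 
config involved. The column name `mv_text_col` is hypothetical, and the exact 
keys (`indexTypes` vs. the older `indexType`) may differ by Pinot version, so 
treat this as illustrative rather than our actual table config:
   
   ```json
   {
     "fieldConfigList": [
       {
         "name": "mv_text_col",
         "encodingType": "RAW",
         "indexTypes": ["TEXT"],
         "properties": {
           "noRawDataForTextIndex": "true"
         }
       }
     ]
   }
   ```
   
   With this set, the committed segment does not keep the raw values for the 
column, while the mutable segment still holds them (dictionary encoded, per 
factor 1), which is why the two sizes can diverge so much.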


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

