itschrispeck commented on PR #13215: URL: https://github.com/apache/pinot/pull/13215#issuecomment-2130517286
> Having 2GB as start size doesn't look correct. Can you check the high level logic and see if this is expected? Seems like we are trying to use one single buffer to hold everything?

Looks like we're hitting an edge case. The contributing factors are:

1. MV columns [will always use a mutable dictionary](https://github.com/apache/pinot/blob/fed2d5f1b613371237b5a29348f0c043200671ad/pinot-segment-local/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java#L450)
2. We have an extremely large MV raw column generated by `SchemaConformingTransformerV2`
3. The column is text indexed, so we use the `noRawDataForTextIndex` config and the final segment is not nearly as large

Together, these can make the size estimated from `RealtimeSegmentStatsHistory` extremely large even though our target segment size is ~1.2G.

I think the solution is to allow MV columns to be raw encoded even in the mutable segment, but I'm not sure that should be in the scope of this PR. What do you think?
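
For context, here is a rough sketch of the kind of field config that factor 3 refers to. The column name `unindexableExtras` is a hypothetical placeholder, and the exact properties used in our table are an assumption here; only the `noRawDataForTextIndex` property itself comes from the discussion above.

```json
{
  "fieldConfigList": [
    {
      "name": "unindexableExtras",
      "encodingType": "RAW",
      "indexTypes": ["TEXT"],
      "properties": {
        "noRawDataForTextIndex": "true"
      }
    }
  ]
}
```

With a config along these lines, the committed segment skips the raw forward-index data for the column, which would explain why the final segment is much smaller than what the consuming segment holds in memory.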