vvivekiyer opened a new issue, #10127: URL: https://github.com/apache/pinot/issues/10127
### Issue Description

This issue applies only when a **REALTIME** segment contains a column with all of the following properties:

1. Multi-value (MV) column
2. VarByte data type: String, Bytes, or BigDecimal
3. Raw, aka noDictionary

When a consuming segment has a column with the above properties, segment building fails with the following error:

```
[pinot-server] [] Could not build segment
java.lang.IllegalArgumentException: integer overflow detected
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:145) ~[guava-31.1-jre.jar:?]
	at org.apache.pinot.segment.local.segment.creator.impl.fwd.MultiValueVarByteRawIndexCreator.getTotalRowStorageBytes(MultiValueVarByteRawIndexCreator.java:154) ~[pinot-segment-local-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
	at org.apache.pinot.segment.local.segment.creator.impl.fwd.MultiValueVarByteRawIndexCreator.<init>(MultiValueVarByteRawIndexCreator.java:78) ~[pinot-segment-local-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
	at org.apache.pinot.segment.local.segment.creator.impl.DefaultIndexCreatorProvider.getRawIndexCreatorForMVColumn(DefaultIndexCreatorProvider.java:263) ~[pinot-segment-local-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
	at org.apache.pinot.segment.local.segment.creator.impl.DefaultIndexCreatorProvider.newForwardIndexCreator(DefaultIndexCreatorProvider.java:87) ~[pinot-segment-local-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
	at org.apache.pinot.segment.spi.index.IndexingOverrides$Default.newForwardIndexCreator(IndexingOverrides.java:156) ~[pinot-segment-spi-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
	at org.apache.pinot.segment.local.segment.creator.impl.SegmentColumnarIndexCreator.init(SegmentColumnarIndexCreator.java:228) ~[pinot-segment-local-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
	at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.build(SegmentIndexCreationDriverImpl.java:211) ~[pinot-segment-local-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
	at org.apache.pinot.segment.local.realtime.converter.RealtimeSegmentConverter.build(RealtimeSegmentConverter.java:110) ~[pinot-segment-local-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
	at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.buildSegmentInternal(LLRealtimeSegmentDataManager.java:895) [pinot-core-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
	at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.buildSegmentForCommit(LLRealtimeSegmentDataManager.java:806) [pinot-core-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
	at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager$PartitionConsumer.run(LLRealtimeSegmentDataManager.java:705) [pinot-core-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
	at java.lang.Thread.run(Thread.java:834) [?:?]
```

### Root Cause Analysis

The mutable segment creates the column as dictEnabled, but the offline segment creation attempts to create the column as noDict. The writer doesn't have the `maxRowLengthInBytes` metadata it needs to construct the forward index.

A longer version of the above RCA is below:

1. In a realtime table, if an MV column of data type String, Bytes, or BigDecimal is configured as noDictionary, we still end up creating a dictionary for the MutableSegment. This limitation exists because we don't have an implementation of `MutableForwardIndex` that handles noDict VarByte columns.
   https://github.com/apache/pinot/blob/ca86efca006453d407475ba074af1d4d492b920f/pinot-segment-local/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java#L427
2. When `RealtimeSegmentConverter` builds a completed segment in this case, the segment build is done in two phases: (i) collect column statistics for each column, and (ii) read each mutable segment row and index it for offline segment creation. Note that gathering column stats for mutable segments doesn't need to read each record; it is done through `MutableColumnStatistics`.
3. When `SegmentColumnarIndexCreator` creates an index creator for this column (mentioned in 1), it honors the table config and tries to create a noDict column, so it uses `MultiValueVarByteRawIndexCreator`. This creator requires `maxRowLengthInBytes`, which is not available through `MutableColumnStatistics`, and there is no way to compute it on the fly without reading all the records in the mutable segment.

### Potential Solutions

1. (Ideal Solution) Implement a `MutableForwardIndex` version that supports noDict VarByte columns. This will automatically create a noDict column for the mutable segment. Converting a mutable segment to a completed segment is then handled automatically, because the column is noDict in both. Until this solution is implemented, we can address the assertion failure by automatically creating a dictEnabled column in the offline segment as well during conversion.
2. (Hacky Solution) Detect that a column needs to change from dict to noDict during realtime segment conversion. If this is the case, perform an additional read of all the records in the mutable segment to construct ColumnStatistics with `maxRowLengthInBytes` for this column, and use it to create the `MultiValueVarByteRawIndexCreator`.

Opening this issue to get feedback from the community about which way to proceed. Also wanted to check if there are other solutions to address this problem.
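
For discussion, the extra pass described in solution 2 could look roughly like the standalone sketch below. It does not use any Pinot APIs: `MaxRowLengthSketch`, `MvVarByteStats`, and `scan` are hypothetical names, rows are modeled as plain `String[]` values, and the row length is assumed to be the sum of the UTF-8 byte lengths of the values in a row. What `MultiValueVarByteRawIndexCreator` actually expects for `maxRowLengthInBytes` may differ and would need to be checked against the creator.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Pinot. */
final class MaxRowLengthSketch {

  /** The two per-column values gathered by the extra pass (names are illustrative). */
  static final class MvVarByteStats {
    final int maxNumValuesPerRow;
    final int maxRowLengthInBytes;

    MvVarByteStats(int maxNumValuesPerRow, int maxRowLengthInBytes) {
      this.maxNumValuesPerRow = maxNumValuesPerRow;
      this.maxRowLengthInBytes = maxRowLengthInBytes;
    }
  }

  /**
   * Reads every row of one MV String column once and tracks the largest row,
   * both by number of values and by total UTF-8 payload bytes (assumed definition).
   */
  static MvVarByteStats scan(List<String[]> rows) {
    int maxNumValues = 0;
    int maxRowLength = 0;
    for (String[] row : rows) {
      int rowLength = 0;
      for (String value : row) {
        // addExact throws ArithmeticException instead of silently wrapping around.
        rowLength = Math.addExact(rowLength, value.getBytes(StandardCharsets.UTF_8).length);
      }
      maxNumValues = Math.max(maxNumValues, row.length);
      maxRowLength = Math.max(maxRowLength, rowLength);
    }
    return new MvVarByteStats(maxNumValues, maxRowLength);
  }

  public static void main(String[] args) {
    List<String[]> rows = Arrays.asList(
        new String[] {"a", "bcd"},
        new String[] {"hello"},
        new String[] {});
    MvVarByteStats stats = scan(rows);
    System.out.println("maxNumValuesPerRow=" + stats.maxNumValuesPerRow);
    System.out.println("maxRowLengthInBytes=" + stats.maxRowLengthInBytes);
  }
}
```

The cost is one additional full read of the mutable segment for each affected column, which is the trade-off that makes this option the hacky one.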