vvivekiyer opened a new issue, #10127:
URL: https://github.com/apache/pinot/issues/10127

   ### Issue Description 
   This issue applies only when a **REALTIME** segment contains a column 
with all of the following properties:
   1. Multi-value (MV) column
   2. VarByte data type: String, Bytes, or BigDecimal
   3. Raw encoding (noDictionary)
   
   
   When a consuming segment has a column with the above properties, segment 
building fails with the following error:
   ```
   [pinot-server] [] Could not build segment
   java.lang.IllegalArgumentException: integer overflow detected
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:145) ~[guava-31.1-jre.jar:?]
        at org.apache.pinot.segment.local.segment.creator.impl.fwd.MultiValueVarByteRawIndexCreator.getTotalRowStorageBytes(MultiValueVarByteRawIndexCreator.java:154) ~[pinot-segment-local-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
        at org.apache.pinot.segment.local.segment.creator.impl.fwd.MultiValueVarByteRawIndexCreator.<init>(MultiValueVarByteRawIndexCreator.java:78) ~[pinot-segment-local-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
        at org.apache.pinot.segment.local.segment.creator.impl.DefaultIndexCreatorProvider.getRawIndexCreatorForMVColumn(DefaultIndexCreatorProvider.java:263) ~[pinot-segment-local-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
        at org.apache.pinot.segment.local.segment.creator.impl.DefaultIndexCreatorProvider.newForwardIndexCreator(DefaultIndexCreatorProvider.java:87) ~[pinot-segment-local-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
        at org.apache.pinot.segment.spi.index.IndexingOverrides$Default.newForwardIndexCreator(IndexingOverrides.java:156) ~[pinot-segment-spi-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
        at org.apache.pinot.segment.local.segment.creator.impl.SegmentColumnarIndexCreator.init(SegmentColumnarIndexCreator.java:228) ~[pinot-segment-local-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
        at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.build(SegmentIndexCreationDriverImpl.java:211) ~[pinot-segment-local-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
        at org.apache.pinot.segment.local.realtime.converter.RealtimeSegmentConverter.build(RealtimeSegmentConverter.java:110) ~[pinot-segment-local-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
        at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.buildSegmentInternal(LLRealtimeSegmentDataManager.java:895) [pinot-core-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
        at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.buildSegmentForCommit(LLRealtimeSegmentDataManager.java:806) [pinot-core-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
        at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager$PartitionConsumer.run(LLRealtimeSegmentDataManager.java:705) [pinot-core-0.12.0-dev-755.jar:0.12.0-dev-755-d458668690df6366ebab7a162720d081f9647a2b]
        at java.lang.Thread.run(Thread.java:834) [?:?]
   ```
   
   
   ### Root Cause Analysis
   The mutable segment creates the column as dictEnabled, but the offline 
segment creation attempts to create the same column as noDict. The raw index 
writer then lacks the `maxRowLengthInBytes` metadata it needs to construct the 
forward index. 
   A longer version of this RCA follows: 
   1. In a realtime table, if an MV column of data type String, Bytes, or 
BigDecimal is configured as noDictionary, we still end up creating a dictionary 
for the MutableSegment. This limitation exists because there is no 
`MutableForwardIndex` implementation that handles noDict VarByte columns. 
https://github.com/apache/pinot/blob/ca86efca006453d407475ba074af1d4d492b920f/pinot-segment-local/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java#L427
   2. When `RealtimeSegmentConverter` builds a completed segment in this 
case, the build runs in two phases: (i) collect column statistics for each 
column, and (ii) read each mutable segment row and index it for offline segment 
creation. Note that gathering column stats for a mutable segment does not read 
each record; the stats are served by `MutableColumnStatistics`.
   3. When `SegmentColumnarIndexCreator` creates an index creator for the 
column described in 1, it honors the table config and tries to create a noDict 
column, so it uses `MultiValueVarByteRawIndexCreator`. This creator requires 
`maxRowLengthInBytes`, which is not available through 
`MutableColumnStatistics`, and there is no way to compute it on the fly without 
reading all the records in the mutable segment.
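
To make the last step concrete, here is a minimal, self-contained sketch of how a guarded size computation can reject a missing statistic. It is not Pinot's code: the class `RowStorageSketch`, the method `totalRowStorageBytes`, the `UNKNOWN = -1` sentinel, and the buffer layout are all assumptions made for illustration. Only the exception type and message are taken from the stack trace above.

```java
public class RowStorageSketch {
    // Hypothetical sentinel meaning "statistic was never collected".
    // This is an assumption for illustration, not a Pinot constant.
    static final int UNKNOWN = -1;

    // Bytes needed to buffer one MV row, assuming (hypothetically) a layout of
    // one element count, one length per element, and then the value bytes.
    static int totalRowStorageBytes(int maxNumElements, int maxRowLengthInBytes) {
        // Compute in long so a real overflow is detected instead of wrapping.
        long total = (long) Integer.BYTES
            + (long) maxNumElements * Integer.BYTES
            + maxRowLengthInBytes;
        if (maxNumElements < 0 || maxRowLengthInBytes < 0 || total > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("integer overflow detected");
        }
        return (int) total;
    }

    public static void main(String[] args) {
        // Known statistics: 4 + 3 * 4 + 12 = 28 bytes.
        System.out.println(totalRowStorageBytes(3, 12));
        try {
            // Missing statistic: the guard rejects it with the same message.
            totalRowStorageBytes(3, UNKNOWN);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Under these assumptions, a statistic that was never collected is indistinguishable from an overflowed one, which would explain why the message says "integer overflow" even though nothing actually overflowed.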
   
   
   
   ### Potential Solutions
   1. (Ideal Solution) Implement a `MutableForwardIndex` version that supports 
noDict VarByte columns. The mutable segment would then create the column as 
noDict, and converting it to a completed segment would work without special 
handling because the column is noDict in both. Until this solution is 
implemented, we can avoid the failing check by automatically creating a 
dictEnabled column in the offline segment as well during conversion.
   2. (Hacky Solution) Detect that a column needs to change from Dict -> 
noDict during realtime segment conversion. If this is the case, perform an 
additional read of all the records in the mutable segment to construct 
ColumnStatistics with `maxRowLengthInBytes` for this column. Use this to create 
the `MultiValueVarByteRawIndexCreator`.
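
The extra pass in option 2 could look roughly like the sketch below. It is self-contained and does not use Pinot's APIs: `MaxRowLengthScan`, `Stats`, and `scan` are hypothetical names, rows are modeled as plain `String[]` values, and UTF-8 is assumed as the string encoding. The real change would read rows from the mutable segment instead.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

public class MaxRowLengthScan {
    // Hypothetical holder for the two values a raw MV var-byte creator needs.
    static final class Stats {
        final int maxNumElements;
        final int maxRowLengthInBytes;

        Stats(int maxNumElements, int maxRowLengthInBytes) {
            this.maxNumElements = maxNumElements;
            this.maxRowLengthInBytes = maxRowLengthInBytes;
        }
    }

    // One full pass over the column: for each row, sum the encoded byte
    // lengths of its values, and track the maximum across rows.
    static Stats scan(List<String[]> rows) {
        int maxElements = 0;
        int maxRowBytes = 0;
        for (String[] row : rows) {
            int rowBytes = 0;
            for (String value : row) {
                // addExact throws ArithmeticException instead of wrapping silently.
                rowBytes = Math.addExact(rowBytes, value.getBytes(StandardCharsets.UTF_8).length);
            }
            maxElements = Math.max(maxElements, row.length);
            maxRowBytes = Math.max(maxRowBytes, rowBytes);
        }
        return new Stats(maxElements, maxRowBytes);
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
            new String[] {"a", "bc"},        // 2 elements, 3 bytes
            new String[] {"h\u00e9llo"},     // 1 element, 6 bytes in UTF-8
            new String[] {"x", "y", "z"});   // 3 elements, 3 bytes
        Stats stats = scan(rows);
        System.out.println(stats.maxNumElements + " " + stats.maxRowLengthInBytes); // prints "3 6"
    }
}
```

The cost is one additional read of every record for each affected column, which is the trade-off that makes this option less attractive than a native noDict `MutableForwardIndex`.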
   
   
   Opening this issue to get feedback from the community about which way to 
proceed. Also wanted to check if there are other solutions to address this 
problem. 
   
   
   
   
   
   
   



