Copilot commented on code in PR #17032: URL: https://github.com/apache/pinot/pull/17032#discussion_r2450036106
########## pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/memory/EmptyIndexBuffer.java: ########## @@ -0,0 +1,304 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.pinot.segment.spi.memory; + +import java.io.IOException; +import java.nio.ByteBuffer; +import java.nio.ByteOrder; +import java.util.Properties; +import org.roaringbitmap.buffer.ImmutableRoaringBitmap; + + +/** + * A specialized PinotDataBuffer implementation for zero-size index entries that contains S3 segment path information. + * This buffer is useful for debugging and tracking purposes when dealing with empty index entries. + */ +public class EmptyIndexBuffer extends PinotDataBuffer { + private final Properties _properties; + private final String _segmentName; + private final String _tableNameWithType; + private final String _segmentPath; + + /** + * Creates a new S3EmptyIndexBuffer for a zero-size index entry Review Comment: Corrected class name reference in Javadoc from 'S3EmptyIndexBuffer' to 'EmptyIndexBuffer'. ```suggestion * Creates a new EmptyIndexBuffer for a zero-size index entry ``` ########## pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/SingleFileIndexDirectory.java: ########## @@ -251,11 +263,29 @@ private void loadMap() private void mapBufferEntries() throws IOException { + // Phase 1: Prepare data structures - sort entries by start offset + // Use list to handle multiple entries with same start offset + List<IndexEntry> sortedEntries = _columnEntries.values().stream() + .sorted((e1, e2) -> Long.compare(e1._startOffset, e2._startOffset)) + .collect(java.util.stream.Collectors.toList()); Review Comment: The comment mentions handling "multiple entries with same start offset", but the sorting comparator doesn't handle this case - entries with the same offset will have arbitrary ordering. If this is intentional, the comment should be clarified. Additionally, use the statically imported `Collectors.toList()` instead of the fully qualified name for consistency with project imports. ########## pinot-tools/src/main/resources/examples/minions/batch/baseballStats/baseballStats_offline_table_config_raw_inverted_index.json: ########## @@ -0,0 +1,49 @@ +{ + "tableName": "baseballStats", + "tableType": "OFFLINE", + "segmentsConfig": { + "segmentPushType": "APPEND", + "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy", + "replication": "1" + }, + "tenants": { + }, + "tableIndexConfig": { + "loadMode": "HEAP", + "invertedIndexColumns": [ + "teamID" Review Comment: Missing comma between array elements "teamID" and "playerID" will cause JSON parsing to fail. ```suggestion "teamID", ``` ########## pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/SingleFileIndexDirectory.java: ########## @@ -265,15 +295,35 @@ private void mapBufferEntries() runningSize += entry._size; if (runningSize >= MAX_ALLOCATION_SIZE && !offsetAccum.isEmpty()) { - mapAndSliceFile(indexStartMap, offsetAccum, offsetEntry.getKey()); + // Calculate the correct end offset for the previous entries + long lastOffset = offsetAccum.get(offsetAccum.size() - 1); + IndexEntry lastEntry = indexStartMap.get(lastOffset); + long endOffset = lastOffset + lastEntry._size; + mapAndSliceFile(indexStartMap, offsetAccum, endOffset); runningSize = entry._size; offsetAccum.clear(); } offsetAccum.add(offsetEntry.getKey()); } if (!offsetAccum.isEmpty()) { - mapAndSliceFile(indexStartMap, offsetAccum, offsetAccum.get(0) + runningSize); + // Calculate the correct end offset: start of last entry + size of last entry + long lastOffset = offsetAccum.get(offsetAccum.size() - 1); + IndexEntry lastEntry = indexStartMap.get(lastOffset); + long endOffset = lastOffset + lastEntry._size; + mapAndSliceFile(indexStartMap, offsetAccum, endOffset); + } + // Handle zero-size entries with empty buffer + for (IndexEntry entry : sortedEntries) { + if (entry._size == 0) { + Properties properties = new Properties(); + if (_segmentDirectoryLoaderContext != null + && _segmentDirectoryLoaderContext.getSegmentCustomConfigs() != null) { + properties.putAll(_segmentDirectoryLoaderContext.getSegmentCustomConfigs()); + } + entry._buffer = new EmptyIndexBuffer(properties, _segmentMetadata.getName(), + _segmentMetadata.getTableName()); Review Comment: EmptyIndexBuffer constructor expects (properties, segmentName, tableNameWithType) but receives (properties, name, tableName). The third parameter should be tableNameWithType (which includes the type suffix like "_OFFLINE" or "_REALTIME"), not just tableName. Use `_segmentMetadata.getTableName()` only if it already includes the type, otherwise this should be constructed or retrieved from table config. ```suggestion _segmentMetadata.getTableNameWithType()); ``` ########## pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/invertedindex/InvertedIndexHandler.java: ########## @@ -109,8 +110,8 @@ public void postUpdateIndicesCleanup(SegmentDirectory.Writer segmentWriter) } private boolean shouldCreateInvertedIndex(ColumnMetadata columnMetadata) { - // Only create inverted index on dictionary-encoded unsorted columns. - return columnMetadata != null && !columnMetadata.isSorted() && columnMetadata.hasDictionary(); + // Only create inverted index on unsorted columns Review Comment: This comment is incomplete and potentially misleading. The original comment clarified that inverted indexes are only created on "dictionary-encoded unsorted columns". The new comment removes the dictionary requirement but doesn't explain the change or the new behavior. It should explain that inverted indexes can now be created on both dictionary-encoded and raw-encoded unsorted columns. ########## pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/memory/EmptyIndexBuffer.java: ########## @@ -0,0 +1,304 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.pinot.segment.spi.memory; + +import java.io.IOException; +import java.nio.ByteBuffer; +import java.nio.ByteOrder; +import java.util.Properties; +import org.roaringbitmap.buffer.ImmutableRoaringBitmap; + + +/** + * A specialized PinotDataBuffer implementation for zero-size index entries that contains S3 segment path information. + * This buffer is useful for debugging and tracking purposes when dealing with empty index entries. + */ +public class EmptyIndexBuffer extends PinotDataBuffer { + private final Properties _properties; + private final String _segmentName; + private final String _tableNameWithType; + private final String _segmentPath; + + /** + * Creates a new S3EmptyIndexBuffer for a zero-size index entry + * + * @param properties Properties containing S3 configuration (bucket, key, etc.) + * @param segmentName The name of the segment + * @param tableNameWithType The table name with type + */ + public EmptyIndexBuffer(Properties properties, String segmentName, String tableNameWithType) { + super(false); // Not closeable since it's just metadata Review Comment: The comment states "Not closeable since it's just metadata", but the class extends PinotDataBuffer which has release() and close operations. This comment is misleading as the buffer can be released/closed (even if they're no-ops), and the 'closeable' parameter refers to whether cleanup is needed, not whether methods can be called. ```suggestion super(false); // 'closeable' is false because no cleanup is needed; buffer can still be closed/released (no-op) ``` ########## pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/creator/RawValueBasedInvertedIndexCreator.java: ########## @@ -59,4 +59,13 @@ public interface RawValueBasedInvertedIndexCreator extends InvertedIndexCreator * For multi-value column, adds the double values for the next document. */ void add(double[] values, int length); + + default void add(String value) { + } + default void add(String[] value, int length) { + } + default void add(byte[] value) { + } Review Comment: These default methods should include Javadoc comments explaining their purpose, parameters, and when they should be overridden. The empty implementations suggest they're optional, but documentation would clarify the expected behavior. ```suggestion /** * For single-value column, adds the String value for the next document. * <p> * Default implementation is a no-op. Override this method if the index supports String values. * * @param value The String value to add for the next document. */ default void add(String value) { } /** * For multi-value column, adds the String values for the next document. * <p> * Default implementation is a no-op. Override this method if the index supports String values. * * @param value The array of String values to add for the next document. * @param length The number of values in the array. */ default void add(String[] value, int length) { } /** * For single-value column, adds the byte[] value for the next document. * <p> * Default implementation is a no-op. Override this method if the index supports byte[] values. * * @param value The byte[] value to add for the next document. */ default void add(byte[] value) { } /** * For multi-value column, adds the byte[][] values for the next document. * <p> * Default implementation is a no-op. Override this method if the index supports byte[][] values. * * @param value The array of byte[] values to add for the next document. * @param length The number of values in the array. */ ``` ########## pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/invertedindex/InvertedIndexHandler.java: ########## @@ -135,35 +136,86 @@ private void createInvertedIndexForColumn(SegmentDirectory.Writer segmentWriter, // Create new inverted index for the column. LOGGER.info("Creating new inverted index for segment: {}, column: {}", segmentName, columnName); int numDocs = columnMetadata.getTotalDocs(); - IndexCreationContext.Common context = IndexCreationContext.builder() .withIndexDir(indexDir) .withColumnMetadata(columnMetadata) .withTableNameWithType(_tableConfig.getTableName()) .withContinueOnError(_tableConfig.getIngestionConfig() != null && _tableConfig.getIngestionConfig().isContinueOnError()) .build(); - - try (DictionaryBasedInvertedIndexCreator creator = StandardIndexes.inverted() - .createIndexCreator(context, IndexConfig.ENABLED)) { - IndexReaderFactory<ForwardIndexReader> readerFactory = StandardIndexes.forward().getReaderFactory(); - try (ForwardIndexReader forwardIndexReader = readerFactory.createIndexReader(segmentWriter, - _fieldIndexConfigs.get(columnMetadata.getColumnName()), columnMetadata); + if (columnMetadata.hasDictionary()) { + // Dictionary-based inverted index + + + try (DictionaryBasedInvertedIndexCreator creator = StandardIndexes.inverted() + .createIndexCreator(context, IndexConfig.ENABLED)) { + + try ( + ForwardIndexReader forwardIndexReader = StandardIndexes.forward() + .getReaderFactory() + .createIndexReader(segmentWriter, _fieldIndexConfigs.get(columnName), columnMetadata); + ForwardIndexReaderContext readerContext = forwardIndexReader.createContext()) { + if (columnMetadata.isSingleValue()) { + // Single-value column. + for (int i = 0; i < numDocs; i++) { + creator.add(forwardIndexReader.getDictId(i, readerContext)); + } + } else { + // Multi-value column. + int[] dictIds = new int[columnMetadata.getMaxNumberOfMultiValues()]; + for (int i = 0; i < numDocs; i++) { + int length = forwardIndexReader.getDictIdMV(i, dictIds, readerContext); + creator.add(dictIds, length); + } + } + creator.seal(); + } + } + } else { + // Raw value based inverted index + IndexReaderFactory<ForwardIndexReader> readerFactory = StandardIndexes.forward() + .getReaderFactory(); + try (RawValueBitmapInvertedIndexCreator creator = new RawValueBitmapInvertedIndexCreator(context); + ForwardIndexReader forwardIndexReader = readerFactory + .createIndexReader(segmentWriter, _fieldIndexConfigs.get(columnName), columnMetadata); ForwardIndexReaderContext readerContext = forwardIndexReader.createContext()) { if (columnMetadata.isSingleValue()) { - // Single-value column. - for (int i = 0; i < numDocs; i++) { - creator.add(forwardIndexReader.getDictId(i, readerContext)); + // Single-value column + switch (columnMetadata.getDataType()) { + case INT: + for (int i = 0; i < numDocs; i++) { + creator.add(forwardIndexReader.getInt(i, readerContext)); + } + break; + case LONG: + for (int i = 0; i < numDocs; i++) { + creator.add(forwardIndexReader.getLong(i, readerContext)); + } + break; + case FLOAT: + for (int i = 0; i < numDocs; i++) { + creator.add(forwardIndexReader.getFloat(i, readerContext)); + } + break; + case DOUBLE: + for (int i = 0; i < numDocs; i++) { + creator.add(forwardIndexReader.getDouble(i, readerContext)); + } + break; + case STRING: + for (int i = 0; i < numDocs; i++) { + creator.add(forwardIndexReader.getString(i, readerContext)); + } + break; + default: + throw new IllegalStateException( + "Unsupported data type for raw inverted index: " + columnMetadata.getDataType()); } } else { - // Multi-value column. - int[] dictIds = new int[columnMetadata.getMaxNumberOfMultiValues()]; - for (int i = 0; i < numDocs; i++) { - int length = forwardIndexReader.getDictIdMV(i, dictIds, readerContext); - creator.add(dictIds, length); - } + // Multi-value columns not supported for raw inverted index + throw new IllegalStateException("Raw inverted index not supported for multi-value columns: " + columnName); Review Comment: The error message should be more specific about why multi-value columns aren't supported for raw inverted indexes and potentially suggest using dictionary encoding as an alternative. ```suggestion throw new IllegalStateException("Raw inverted index is not supported for multi-value columns: " + columnName + ". This is because raw inverted indexes require a single value per document and do not use a dictionary. " + "Please enable dictionary encoding for this column to use an inverted index."); ``` ########## pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/inv/RawValueBitmapInvertedIndexCreator.java: ########## @@ -0,0 +1,334 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.pinot.segment.local.segment.creator.impl.inv; + +import java.io.File; +import java.io.FileOutputStream; +import java.io.IOException; +import java.io.RandomAccessFile; +import java.nio.ByteBuffer; +import java.nio.ByteOrder; +import java.nio.channels.FileChannel; +import java.util.TreeMap; +import javax.annotation.Nullable; +import org.apache.commons.io.FileUtils; +import org.apache.pinot.segment.local.segment.creator.impl.SegmentDictionaryCreator; +import org.apache.pinot.segment.spi.V1Constants; +import org.apache.pinot.segment.spi.creator.IndexCreationContext; +import org.apache.pinot.segment.spi.index.creator.RawValueBasedInvertedIndexCreator; +import org.apache.pinot.segment.spi.memory.PinotDataBuffer; +import org.apache.pinot.spi.data.DimensionFieldSpec; +import org.apache.pinot.spi.data.FieldSpec; +import org.apache.pinot.spi.data.FieldSpec.DataType; +import org.roaringbitmap.buffer.MutableRoaringBitmap; + +/** + * Creator for raw value bitmap inverted index. + * File format: + * <ul> + * <li>Header (44 bytes):</li> + * <ul> + * <li>version (4 bytes)</li> + * <li>cardinality (4 bytes)</li> + * <li>maxLength (4 bytes) - for string/bytes data types</li> + * <li>dictionaryOffset (8 bytes)</li> + * <li>dictionaryLength (8 bytes)</li> + * <li>invertedIndexOffset (8 bytes)</li> + * <li>invertedIndexLength (8 bytes)</li> + * </ul> + * <li>Dictionary data</li> + * <li>Inverted index data</li> + * </ul> + */ +public class RawValueBitmapInvertedIndexCreator implements RawValueBasedInvertedIndexCreator { + private static final int VERSION = 1; + private static final int HEADER_LENGTH = 44; Review Comment: HEADER_LENGTH is documented as 36 bytes in RawValueBitmapInvertedIndexReader (line 49) but defined as 44 bytes here. The actual header structure shown in both files consists of: version(4) + cardinality(4) + maxLength(4) + dictionaryOffset(8) + dictionaryLength(8) + invertedIndexOffset(8) + invertedIndexLength(8) = 44 bytes. The documentation in RawValueBitmapInvertedIndexReader should be corrected to 44 bytes. ########## pinot-common/src/main/java/org/apache/pinot/common/tier/TimeBasedTierSegmentSelector.java: ########## @@ -47,6 +47,9 @@ public boolean selectSegment(String tableNameWithType, SegmentZKMetadata segment // get segment end time to decide if segment gets selected long endTimeMs = segmentZKMetadata.getEndTimeMs(); + if (endTimeMs < 0) { + return System.currentTimeMillis() - segmentZKMetadata.getCreationTime() > _segmentAgeMillis; + } Preconditions.checkState(endTimeMs > 0, "Invalid endTimeMs: %s for segment: %s of table: %s", endTimeMs, segmentZKMetadata.getSegmentName(), tableNameWithType); Review Comment: The logic at line 50 checks if `endTimeMs < 0` and handles it, but then line 53 still requires `endTimeMs > 0` with a Preconditions check. This means `endTimeMs == 0` would bypass the fallback logic and fail the precondition check. The precondition should be removed since the fallback already handles invalid endTimeMs values. ```suggestion ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
