Re: [PR] column major segment build for columnar datasource [pinot]

via GitHub Sat, 18 Oct 2025 04:44:06 -0700


rohityadav1993 commented on code in PR #16727:
URL: https://github.com/apache/pinot/pull/16727#discussion_r2430151427



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentColumnarIndexCreator.java:
##########
@@ -396,27 +372,58 @@ public void indexColumn(String columnName, @Nullable 
int[] sortedDocIds, IndexSe
       if (sortedDocIds != null) {
         int onDiskDocId = 0;
         for (int docId : sortedDocIds) {
-          // If validDodIds are provided, only index column if it's a valid doc
-          if (validDocIds == null || validDocIds.contains(docId)) {
-            indexColumnValue(colReader, creatorsByIndex, columnName, 
fieldSpec, dictionaryCreator, docId, onDiskDocId,
-                nullVec);
-            onDiskDocId++;
-          }
+          indexColumnValue(colReader, creatorsByIndex, columnName, fieldSpec, 
dictionaryCreator, docId, onDiskDocId,
+              nullVec);
+          onDiskDocId++;
         }
       } else {
-        int onDiskDocId = 0;
         for (int docId = 0; docId < numDocs; docId++) {
-          // If validDodIds are provided, only index column if it's a valid doc
-          if (validDocIds == null || validDocIds.contains(docId)) {
-            indexColumnValue(colReader, creatorsByIndex, columnName, 
fieldSpec, dictionaryCreator, docId, onDiskDocId,
-                nullVec);
-            onDiskDocId++;
-          }
+          indexColumnValue(colReader, creatorsByIndex, columnName, fieldSpec, 
dictionaryCreator, docId, docId, nullVec);
         }
       }
     }
   }
 
+  /**
+   * Index a column using a ColumnReader (column-major approach).
+   * This method processes the column values using the iterator pattern from 
ColumnReader.
+   *
+   * @param columnName Name of the column to index
+   * @param columnReader ColumnReader for the column data
+   * @throws IOException if indexing fails
+   */
+  @Override
+  public void indexColumn(String columnName, ColumnReader columnReader) throws 
IOException {
+    Map<IndexType<?, ?, ?>, IndexCreator> creatorsByIndex = 
_creatorsByColAndIndex.get(columnName);
+    NullValueVectorCreator nullVec = 
_nullValueVectorCreatorMap.get(columnName);
+    FieldSpec fieldSpec = _schema.getFieldSpecFor(columnName);
+    SegmentDictionaryCreator dictionaryCreator = 
_dictionaryCreatorMap.get(columnName);
+    Object defaultNullValue = fieldSpec.getDefaultNullValue();
+    Object reuseColumnValueToIndex;
+
+    // Reset column reader to start from beginning
+    columnReader.rewind();
+
+    int docId = 0;
+    while (columnReader.hasNext()) {
+      reuseColumnValueToIndex = columnReader.next();
+
+      // Handle null values
+      if (nullVec != null && reuseColumnValueToIndex == null) {
+        nullVec.setNull(docId);
+        reuseColumnValueToIndex = defaultNullValue;

Review Comment:
   Good catch, so in current implementation we are doing segment to segment 
transformation and with above we will never have a null value to index.
   
   For non Pinot segment datasource, we can have a null value, I previously 
removed a util method to do null check on the read value basedon PR comment.
   
   Let me handle case for a read null value with default value when we don't 
have null handling enabled. cc: @Jackie-Jiang , @noob-se7en to review as well.



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentIndexCreationDriverImpl.java:
##########
@@ -171,31 +170,72 @@ public void init(SegmentGeneratorConfig config, 
RecordReader recordReader)
         new TransformPipeline(config.getTableConfig(), config.getSchema()));
   }
 
+  /**
+   * Initialize the driver for columnar segment building using a 
ColumnReaderFactory.
+   * This method sets up the driver to use column-wise input data access 
instead of row-wise.
+   *
+   * @param config Segment generator configuration
+   * @param columnReaderFactory Factory for creating column readers
+   * @throws Exception if initialization fails
+   */
+  public void init(SegmentGeneratorConfig config, ColumnReaderFactory 
columnReaderFactory)
+      throws Exception {
+    // Initialize the column reader factory with target schema
+    columnReaderFactory.init(config.getSchema());
+
+    // Create all column readers for the target schema
+    Map<String, ColumnReader> columnReaders = 
columnReaderFactory.getAllColumnReaders();
+
+    // Create columnar data source
+    ColumnarSegmentCreationDataSource columnarDataSource =
+        new ColumnarSegmentCreationDataSource(columnReaderFactory, 
columnReaders);
+
+    // Use the existing init method with columnar data source and no transform 
pipeline
+    init(config, columnarDataSource, null);

Review Comment:
   Yes, we will add transformation support later.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] column major segment build for columnar datasource [pinot]

Reply via email to