Re: [PR] Generalize RAW + dictionary column fix to all aggregation/distinct sites (follow-up to #18500) [pinot]

via GitHub Thu, 14 May 2026 22:33:36 -0700


xiangfu0 commented on code in PR #18504:
URL: https://github.com/apache/pinot/pull/18504#discussion_r3246103166



##########
pinot-core/src/main/java/org/apache/pinot/core/operator/docvalsets/ProjectionBlockValSet.java:
##########
@@ -98,6 +99,19 @@ public Dictionary getDictionary() {
     return _dataSource.getDictionary();
   }
 
+  /// A column with {@code EncodingType.RAW} + an explicit {@code 
dictionaryIndex} has {@link #getDictionary()}
+  /// non-null but a RAW forward index that throws on {@link 
ForwardIndexReader#readDictIds}. Callers selecting
+  /// between dict-id and value paths must gate on this method, not {@code 
getDictionary() != null}.
+  @Override
+  public boolean isDictionaryEncoded() {
+    Dictionary dictionary = _dataSource.getDictionary();
+    if (dictionary == null) {
+      return false;
+    }
+    ForwardIndexReader<?> forwardIndex = _dataSource.getForwardIndex();
+    return forwardIndex == null || forwardIndex.isDictionaryEncoded();

Review Comment:
   Fixed in bf6327d21e — `ProjectionBlockValSet.isDictionaryEncoded()` now 
returns `false` when the forward index is null. Added explicit Javadoc 
enumerating the two cases (RAW + dictionaryIndex, and disabled forward index) 
where a non-null dictionary still means dict-id reads are not callable.



##########
pinot-integration-tests/src/test/java/org/apache/pinot/integration/tests/custom/RawForwardIndexWithDictionaryTest.java:
##########
@@ -542,6 +541,146 @@ RAW_DICT_DIMENSION, RAW_DICT_INT_DIMENSION, 
getTableName(),
         "Multi-column GROUP BY rows must match between dictionary-only and 
raw+dictionary columns");
   }
 
+  /// Multi-column DISTINCT exercises {@link 
org.apache.pinot.core.query.distinct.DistinctExecutorFactory}'s
+  /// multi-column path. The factory routes to {@code 
DictionaryBasedMultiColumnDistinctExecutor} whenever every
+  /// column has a non-null dictionary, then that executor calls {@code 
BlockValSet#getDictionaryIdsSV()}
+  /// — which throws on a RAW+dictionary column.
+  @Test(dataProvider = "useBothQueryEngines")
+  public void 
testMultiColumnDistinctWithRawDictColumnReturnsSameResults(boolean 
useMultiStageQueryEngine)
+      throws Exception {
+    setUseMultiStageQueryEngine(useMultiStageQueryEngine);
+    JsonNode dictRows = postQuery(
+        String.format("SELECT DISTINCT %s, %s FROM %s ORDER BY %s, %s",
+            DICT_DIMENSION, DICT_INT_DIMENSION, getTableName(), 
DICT_DIMENSION, DICT_INT_DIMENSION))
+        .get("resultTable").get("rows");
+    JsonNode rawRows = postQuery(
+        String.format("SELECT DISTINCT %s, %s FROM %s ORDER BY %s, %s",
+            RAW_DICT_DIMENSION, RAW_DICT_INT_DIMENSION, getTableName(),
+            RAW_DICT_DIMENSION, RAW_DICT_INT_DIMENSION))
+        .get("resultTable").get("rows");
+    assertEquals(rawRows, dictRows,
+        "Multi-column DISTINCT rows must match between dictionary-only and 
raw+dictionary columns");
+  }
+
+  /// {@code DISTINCTCOUNT} on a RAW+dictionary column drives
+  /// {@link 
org.apache.pinot.core.query.aggregation.function.BaseDistinctAggregateAggregationFunction#svAggregate}
+  /// down the dictionary-id path because {@code blockValSet.getDictionary() 
!= null}. That path then calls
+  /// {@code blockValSet.getDictionaryIdsSV()} on the RAW forward index. A 
{@code WHERE} predicate is included to
+  /// bypass {@link 
org.apache.pinot.core.operator.query.NonScanBasedAggregationOperator}, which 
would otherwise
+  /// serve the aggregation directly from the dictionary and hide the bug.

Review Comment:
   Reworded in bf6327d21e — each affected docstring now describes the pre-fix 
behavior in past tense ("previously crashed inside …") and references 
`BlockValSet#isDictionaryEncoded()` / `ColumnContext#isDictionaryEncoded()` as 
the correct gate.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Generalize RAW + dictionary column fix to all aggregation/distinct sites (follow-up to #18500) [pinot]

Reply via email to