Re: [PR] Generalize RAW + dictionary column fix to all aggregation/distinct sites (follow-up to #18500) [pinot]

via GitHub Thu, 14 May 2026 19:18:59 -0700


Copilot commented on code in PR #18504:
URL: https://github.com/apache/pinot/pull/18504#discussion_r3245403955



##########
pinot-core/src/main/java/org/apache/pinot/core/operator/ColumnContext.java:
##########
@@ -49,25 +52,44 @@ public boolean isSingleValue() {
     return _isSingleValue;
   }
 
+  /// Returns the column's dictionary file if one exists, regardless of 
whether the forward index can answer
+  /// dictionary-id reads. Callers that need to select between a dict-id read 
path and a value read path MUST gate
+  /// on {@link #isDictionaryEncoded()} rather than {@code getDictionary() != 
null} — a column declared as
+  /// {@code EncodingType.RAW} with an explicit {@code dictionaryIndex} 
returns a non-null dictionary here but its
+  /// forward index throws on {@link ForwardIndexReader#readDictIds}.
   @Nullable
   public Dictionary getDictionary() {
     return _dictionary;
   }
 
+  /// Returns {@code true} if the column's forward index is dictionary-encoded 
and the dict-id read path
+  /// ({@link org.apache.pinot.core.common.BlockValSet#getDictionaryIdsSV()}) 
is callable. A column with
+  /// {@code EncodingType.RAW} + an explicit {@code dictionaryIndex} returns 
{@code false} here even though
+  /// {@link #getDictionary()} is non-null.
+  public boolean isDictionaryEncoded() {
+    return _dictionaryEncoded;
+  }
+
   @Nullable
   public DataSource getDataSource() {
     return _dataSource;
   }
 
   public static ColumnContext fromDataSource(DataSource dataSource) {
     DataSourceMetadata dataSourceMetadata = dataSource.getDataSourceMetadata();
-    return new ColumnContext(dataSourceMetadata.getDataType(), 
dataSourceMetadata.isSingleValue(),
-        dataSource.getDictionary(), dataSource);
+    Dictionary dictionary = dataSource.getDictionary();
+    ForwardIndexReader<?> forwardIndex = dataSource.getForwardIndex();
+    boolean dictEncoded = dictionary != null && (forwardIndex == null || 
forwardIndex.isDictionaryEncoded());

Review Comment:
   ColumnContext#fromDataSource computes dictEncoded as `dictionary != null && 
(forwardIndex == null || forwardIndex.isDictionaryEncoded())`, which returns 
true when the forward index is disabled (null). In that scenario dict-id reads 
from the forward index are not callable, so isDictionaryEncoded() can 
incorrectly steer callers into dict-id code paths. Consider treating 
`forwardIndex == null` as not dictionary-encoded (or updating the contract/docs 
if null forward index is intended to be supported).
   



##########
pinot-core/src/main/java/org/apache/pinot/core/common/BlockValSet.java:
##########
@@ -48,11 +48,27 @@ public interface BlockValSet {
   boolean isSingleValue();
 
   /**
-   * Returns the dictionary for the column, or {@code null} if the column is 
not dictionary-encoded.
+   * Returns the dictionary file for the column if one exists, or {@code null} 
otherwise. The dictionary may be
+   * present even when {@link #isDictionaryEncoded()} returns {@code false} — 
a column declared as
+   * {@code EncodingType.RAW} with an explicit {@code dictionaryIndex} carries 
a dictionary file on disk but a RAW
+   * forward index. Callers that select between a dictionary-id read path
+   * ({@link #getDictionaryIdsSV()} / {@link #getDictionaryIdsMV()}) and a 
value read path MUST gate on
+   * {@link #isDictionaryEncoded()}, not {@code getDictionary() != null}.

Review Comment:
   The updated BlockValSet#getDictionary() Javadoc refers to a "dictionary 
file". BlockValSet is used for transform-layer value sets as well (e.g., 
TransformBlockValSet returns TransformFunction#getDictionary(), which can be an 
in-memory/on-the-fly dictionary), so this wording is inaccurate/misleading. 
Consider rephrasing to simply "dictionary" (may be on-disk or computed) and 
keep the distinction focused on whether dict-id reads are supported via 
isDictionaryEncoded().
   



##########
pinot-integration-tests/src/test/java/org/apache/pinot/integration/tests/custom/RawForwardIndexWithDictionaryTest.java:
##########
@@ -542,6 +541,146 @@ RAW_DICT_DIMENSION, RAW_DICT_INT_DIMENSION, 
getTableName(),
         "Multi-column GROUP BY rows must match between dictionary-only and 
raw+dictionary columns");
   }
 
+  /// Multi-column DISTINCT exercises {@link 
org.apache.pinot.core.query.distinct.DistinctExecutorFactory}'s
+  /// multi-column path. The factory routes to {@code 
DictionaryBasedMultiColumnDistinctExecutor} whenever every
+  /// column has a non-null dictionary, then that executor calls {@code 
BlockValSet#getDictionaryIdsSV()}
+  /// — which throws on a RAW+dictionary column.
+  @Test(dataProvider = "useBothQueryEngines")
+  public void 
testMultiColumnDistinctWithRawDictColumnReturnsSameResults(boolean 
useMultiStageQueryEngine)
+      throws Exception {
+    setUseMultiStageQueryEngine(useMultiStageQueryEngine);
+    JsonNode dictRows = postQuery(
+        String.format("SELECT DISTINCT %s, %s FROM %s ORDER BY %s, %s",
+            DICT_DIMENSION, DICT_INT_DIMENSION, getTableName(), 
DICT_DIMENSION, DICT_INT_DIMENSION))
+        .get("resultTable").get("rows");
+    JsonNode rawRows = postQuery(
+        String.format("SELECT DISTINCT %s, %s FROM %s ORDER BY %s, %s",
+            RAW_DICT_DIMENSION, RAW_DICT_INT_DIMENSION, getTableName(),
+            RAW_DICT_DIMENSION, RAW_DICT_INT_DIMENSION))
+        .get("resultTable").get("rows");
+    assertEquals(rawRows, dictRows,
+        "Multi-column DISTINCT rows must match between dictionary-only and 
raw+dictionary columns");
+  }
+
+  /// {@code DISTINCTCOUNT} on a RAW+dictionary column drives
+  /// {@link 
org.apache.pinot.core.query.aggregation.function.BaseDistinctAggregateAggregationFunction#svAggregate}
+  /// down the dictionary-id path because {@code blockValSet.getDictionary() 
!= null}. That path then calls
+  /// {@code blockValSet.getDictionaryIdsSV()} on the RAW forward index. A 
{@code WHERE} predicate is included to
+  /// bypass {@link 
org.apache.pinot.core.operator.query.NonScanBasedAggregationOperator}, which 
would otherwise
+  /// serve the aggregation directly from the dictionary and hide the bug.

Review Comment:
   Several newly added test comments describe the pre-fix behavior as if it is 
current (e.g., DistinctExecutorFactory routing based on `getDictionary() != 
null`, BaseDistinctAggregateAggregationFunction choosing the dict-id path for 
the same reason). Since the production code now gates on isDictionaryEncoded(), 
these comments are now stale and can confuse future debugging; consider 
updating them to describe the historical bug explicitly ("previously...") or 
reference isDictionaryEncoded() as the correct gate.



##########
pinot-core/src/main/java/org/apache/pinot/core/operator/docvalsets/ProjectionBlockValSet.java:
##########
@@ -98,6 +99,19 @@ public Dictionary getDictionary() {
     return _dataSource.getDictionary();
   }
 
+  /// A column with {@code EncodingType.RAW} + an explicit {@code 
dictionaryIndex} has {@link #getDictionary()}
+  /// non-null but a RAW forward index that throws on {@link 
ForwardIndexReader#readDictIds}. Callers selecting
+  /// between dict-id and value paths must gate on this method, not {@code 
getDictionary() != null}.
+  @Override
+  public boolean isDictionaryEncoded() {
+    Dictionary dictionary = _dataSource.getDictionary();
+    if (dictionary == null) {
+      return false;
+    }
+    ForwardIndexReader<?> forwardIndex = _dataSource.getForwardIndex();
+    return forwardIndex == null || forwardIndex.isDictionaryEncoded();

Review Comment:
   ProjectionBlockValSet#isDictionaryEncoded() currently returns true when 
`ForwardIndexReader` is null. Elsewhere in the codebase (e.g., DataFetcher) a 
null forward index indicates the forward index is disabled, and dict-id reads 
are not possible. Returning true here can misroute callers into 
getDictionaryIdsSV()/MV() and lead to failures; consider returning false when 
the forward index is null.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Generalize RAW + dictionary column fix to all aggregation/distinct sites (follow-up to #18500) [pinot]

Reply via email to