xiangfu0 opened a new pull request, #18504:
URL: https://github.com/apache/pinot/pull/18504

   ## Summary
   
   A column declared with `EncodingType.RAW` + an explicit `dictionaryIndex` 
has a `Dictionary` file on disk but a RAW forward index that throws 
`UnsupportedOperationException` on `ForwardIndexReader#readDictIds`. Many 
aggregation, group-by, and distinct executors gated their dict-id read path on 
`blockValSet.getDictionary() != null` alone, so a single such column in a query 
crashes the operator.
   
   [apache#18500](https://github.com/apache/pinot/pull/18500) fixed one site 
(`NoDictionaryMultiColumnGroupKeyGenerator`). A wider scan surfaced ~30 more 
call sites with the same pattern. This PR makes the gate explicit at every site 
so the rule is single-sourced and consistently applied.
   
   ## Fix
   
   Introduce `boolean isDictionaryEncoded()` on `BlockValSet` and 
`ColumnContext` that returns `true` only when the forward index is 
dictionary-encoded.
   
   - `BlockValSet#isDictionaryEncoded()` — default implementation returns 
`getDictionary() != null`, so non-projection value sets (transform / row / 
data-block) keep working unchanged.
   - `ProjectionBlockValSet#isDictionaryEncoded()` — overrides to consult 
`ForwardIndexReader#isDictionaryEncoded()` directly, so `RAW + dictionaryIndex` 
columns correctly report `false`.
   - `ColumnContext#isDictionaryEncoded()` — computed at construction from the 
underlying `DataSource`.
   - `getDictionary()` keeps its straightforward "is there a dictionary file?" 
meaning. Filter operators that obtain the dictionary via 
`DataSource#getDictionary()` for value↔id lookup are unaffected.
   
   Every aggregation/distinct/group-by call site now gates on 
`isDictionaryEncoded()` rather than `getDictionary() != null`:
   
   | Site | Notes |
   |---|---|
   | `DefaultGroupByExecutor` | `!columnContext.isDictionaryEncoded()` routes 
to no-dict generator |
   | `NoDictionaryMultiColumnGroupKeyGenerator` | reuses the flag — same fix 
family as apache#18500 |
   | `DistinctExecutorFactory` | single- and multi-column paths |
   | `BaseDistinctAggregateAggregationFunction` | parent of 
`DISTINCTCOUNT`/`DISTINCTSUM`/`DISTINCTAVG`/MV variants |
   | `DistinctCountBitmap` / `HLL` / `HLLPlus` / `ULL` / `CPCSketch` / 
`OffHeap` | dict-id fast path |
   | `BaseDistinctCountSmartSketch` (`SmartHLL` / `SmartHLLPlus` / `SmartULL`) 
| dict-id fast path |
   | `SegmentPartitionedDistinctCount` | dict-id fast path |
   | `ModeAggregationFunction` | aggregate + SV/MV group-by |
   | `AnyValueAggregationFunction` | dict-id fast path |
   | `FUNNELCOUNT/AggregationStrategy` | precondition now correctly rejects RAW 
+ dict columns with a friendly message instead of 
`UnsupportedOperationException` |
   
   ## Test plan
   
   - New tests in 
[`RawForwardIndexWithDictionaryTest`](pinot-integration-tests/src/test/java/org/apache/pinot/integration/tests/custom/RawForwardIndexWithDictionaryTest.java)
 reproduce the crash for the representative paths:
     - `testMultiColumnGroupByWithRawDictColumnReturnsSameResults`
     - `testMultiColumnDistinctWithRawDictColumnReturnsSameResults`
     - `testDistinctWithFilterOnRawDictColumnReturnsSameResults`
     - `testDistinctCountWithFilterOnRawDictColumnReturnsSameResults`
     - `testDistinctCountHLLWithFilterOnRawDictColumnReturnsSameResults`
     - `testDistinctCountBitmapOnRawDictColumnReturnsSameResults`
     - 
`testSegmentPartitionedDistinctCountWithFilterOnRawDictColumnReturnsSameResults`
     - `testModeOnRawDictColumnReturnsSameResults`
     - All 16 runs (V1 + V2 engines) fail on master and pass with this change.
     - Filtered variants use a partial `WHERE` to bypass 
`NonScanBasedAggregationOperator`, which would otherwise serve the aggregation 
from the dictionary directly and hide the bug.
   - Existing regression test suite stays green:
     - 1566 unit tests across `*Distinct*` / `*Aggregation*` / `*GroupBy*` 
patterns
     - `DictionaryBasedGroupKeyGeneratorTest`, 
`NoDictionaryGroupKeyGeneratorTest`, `DistinctQueriesTest`, 
`DefaultAggregationExecutorTest`
   - Spotless, checkstyle, license — all clean on `pinot-core` and 
`pinot-integration-tests`.
   
   ## Backward compatibility
   
   Server-internal query-execution change only. No SPI surface change 
(`BlockValSet` lives in `pinot-core/common`, not in `pinot-spi`; the new method 
is a `default`). No config keys, wire formats, segment versions, REST 
endpoints, or DataTable formats are touched. Queries that previously threw 
`UnsupportedOperationException` now succeed via the value path; queries that 
previously succeeded keep their behavior.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to