[PR] Optimise index stats collector for no dict [pinot]

via GitHub Wed, 17 Sep 2025 22:09:26 -0700


krishan1390 opened a new pull request, #16845:
URL: https://github.com/apache/pinot/pull/16845


   **Summary**
   Avoid storing unique values for columns with dictionary disabled, 
drastically reducing heap usage during segment creation. Track min/max and 
row-length stats without relying on sorted unique sets. Maintain existing 
behavior for dictionary-enabled columns. The sorted unique sets were only 
needed to build dictionaries, which are not created for no-dictionary columns.
   
   **Key Changes**
   1. Added dictionary enablement detection to 
AbstractColumnStatisticsCollector.
   2. Behavior when _dictionaryEnabled == false:
         a. getUniqueValuesSet() returns null.
         b. getCardinality() returns the total number of entries.
         c. Cardinality from ColumnIndexCreationInfo.getDistinctValueCount() 
becomes UNKNOWN_CARDINALITY (via null unique values).
   3. Updated collectors to skip unique-value storage for no-dictionary 
columns, lazily allocating sets/arrays only when needed.
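
   The behavior listed above can be illustrated with a minimal, 
self-contained sketch. This is not the code from the PR: the class 
IntColumnStatsSketch, its fields, and the sentinel value chosen for 
UNKNOWN_CARDINALITY are hypothetical stand-ins, and only the method names 
getUniqueValuesSet(), getCardinality() and getDistinctValueCount() mirror the 
ones mentioned in the description.

```java
import java.util.TreeSet;

/**
 * Hypothetical sketch of a stats collector for an int column.
 * It is NOT Pinot's AbstractColumnStatisticsCollector; it only mimics the
 * behavior described in the PR summary.
 */
class IntColumnStatsSketch {
    // Assumed sentinel for illustration; the real constant may differ.
    static final int UNKNOWN_CARDINALITY = Integer.MIN_VALUE;

    private final boolean dictionaryEnabled;
    // Allocated lazily, and only when a dictionary will actually be built.
    private TreeSet<Integer> uniqueValues;
    private int min = Integer.MAX_VALUE;
    private int max = Integer.MIN_VALUE;
    private int totalEntries;

    IntColumnStatsSketch(boolean dictionaryEnabled) {
        this.dictionaryEnabled = dictionaryEnabled;
    }

    void collect(int value) {
        totalEntries++;
        // Min/max are tracked directly, without a sorted unique set.
        if (value < min) {
            min = value;
        }
        if (value > max) {
            max = value;
        }
        if (dictionaryEnabled) {
            if (uniqueValues == null) {
                uniqueValues = new TreeSet<>();
            }
            uniqueValues.add(value);
        }
    }

    /** Returns null when the dictionary is disabled (or nothing was collected). */
    TreeSet<Integer> getUniqueValuesSet() {
        return uniqueValues;
    }

    /** Exact when a dictionary is built; otherwise an upper bound (total entries). */
    int getCardinality() {
        if (!dictionaryEnabled) {
            return totalEntries;
        }
        return uniqueValues == null ? 0 : uniqueValues.size();
    }

    /** Mirrors the "unknown via null unique values" rule from the description. */
    int getDistinctValueCount() {
        return uniqueValues == null ? UNKNOWN_CARDINALITY : uniqueValues.size();
    }

    int getMin() {
        return min;
    }

    int getMax() {
        return max;
    }

    public static void main(String[] args) {
        IntColumnStatsSketch noDict = new IntColumnStatsSketch(false);
        for (int v : new int[] {5, 3, 5}) {
            noDict.collect(v);
        }
        System.out.println("cardinality (approximate): " + noDict.getCardinality());
        System.out.println("unique values stored: " + (noDict.getUniqueValuesSet() != null));
    }
}
```

   In this sketch the no-dictionary path keeps only a few primitive fields 
per column, so memory use no longer grows with the number of distinct values. 
The trade-off is that getCardinality() becomes an upper bound rather than an 
exact count.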
   
   Labels: performance
   
   Release Notes - 
   - Approximate cardinality for no-dictionary columns


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


