[I] [performance] Dictionary Scan Performance [pinot]

via GitHub Fri, 15 Aug 2025 15:49:24 -0700


ankitsultana opened a new issue, #16618:
URL: https://github.com/apache/pinot/issues/16618


   Scans on dictionary-encoded String columns show up often in our profiles 
across several use cases. Often those scans are on a UUID column, where nearly 
every value has exactly the same length of 36 bytes.
   
   The hot spot in those profiles is specifically the 
`FixedByteValueReaderWriter#getUnpaddedBytes` method. Note that the last 
optimization to this method was made in 2021 by Richard Startin, who used SWAR 
with a ZeroInWord bit trick (#7708).
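
   For readers unfamiliar with the trick, here is a minimal sketch of a SWAR 
zero-in-word scan. It is not the Pinot implementation: the class and method 
names are hypothetical, and it assumes values are padded with zero bytes. It 
reads 8 bytes at a time and uses the classic "has zero byte" bit trick to 
locate the first padding byte.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only; not the FixedByteValueReaderWriter code.
final class SwarUnpaddedLength {
  private static final long ONES = 0x0101010101010101L;
  private static final long HIGH_BITS = 0x8080808080808080L;

  /** Returns the index of the first zero byte, or padded.length if there is none. */
  static int unpaddedLength(byte[] padded) {
    // Little-endian so that the lowest-order byte of each word is the earliest byte in the array.
    ByteBuffer buf = ByteBuffer.wrap(padded).order(ByteOrder.LITTLE_ENDIAN);
    int i = 0;
    for (; i + Long.BYTES <= padded.length; i += Long.BYTES) {
      long word = buf.getLong(i);
      // Non-zero iff the word contains a zero byte; the lowest set bit marks the first one.
      long zeroMask = (word - ONES) & ~word & HIGH_BITS;
      if (zeroMask != 0) {
        return i + (Long.numberOfTrailingZeros(zeroMask) >>> 3);
      }
    }
    // Tail shorter than one word: fall back to a byte-by-byte scan.
    for (; i < padded.length; i++) {
      if (padded[i] == 0) {
        return i;
      }
    }
    return padded.length;
  }
}
```

   Even with this trick, every lookup still has to inspect the whole value, 
which is the per-entry cost that shows up in the profiles.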
   
   Scans on dict-encoded columns are expected to be slow, but I think we 
should still aim to optimize the case where we know that all elements in the 
dictionary have the same unpadded length.
   
   Unfortunately, I don't see any quick way to do this. Most approaches I can 
think of would require storing something in the file format, bumping the 
version number, and introducing a migration or a config opt-in.
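
   To make the idea concrete, here is a rough sketch of what such a fast path 
could look like. Everything in it is hypothetical: the class, the 
`knownUnpaddedLength` field (standing in for whatever would be persisted in 
the file format), and the zero-byte padding assumption. When the length is 
known, the reader skips the padding scan entirely.

```java
import java.util.Arrays;

// Hypothetical sketch; names and layout are assumptions, not Pinot's reader.
final class FixedLengthDictionarySketch {
  private final byte[] data;             // all entries, each padded to numBytesPerValue
  private final int numBytesPerValue;    // padded width of every entry
  private final int knownUnpaddedLength; // -1 when entries may differ in length

  FixedLengthDictionarySketch(byte[] data, int numBytesPerValue, int knownUnpaddedLength) {
    this.data = data;
    this.numBytesPerValue = numBytesPerValue;
    this.knownUnpaddedLength = knownUnpaddedLength;
  }

  byte[] getUnpaddedBytes(int index) {
    int start = index * numBytesPerValue;
    // Fast path: the length comes from metadata, so no scan for padding is needed.
    int length = knownUnpaddedLength >= 0 ? knownUnpaddedLength : scanForPadding(start);
    return Arrays.copyOfRange(data, start, start + length);
  }

  // Slow path: look for the first zero (padding) byte within the entry.
  private int scanForPadding(int start) {
    for (int i = 0; i < numBytesPerValue; i++) {
      if (data[start + i] == 0) {
        return i;
      }
    }
    return numBytesPerValue;
  }
}
```

   The hard part is not this branch but getting a trustworthy 
`knownUnpaddedLength` for existing segments, which is where the file-format 
change comes in.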
   
   Here's a makeshift benchmark I wrote for this: #16617.
   
   Separately, we could also consider adding a native UUID type to Pinot, 
which I think would help optimize not just scan performance but also the 
in-memory representation (a UUID can be represented in 16 bytes, after all, 
versus 36 bytes for the String representation).
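
   As a quick illustration of the size difference, the sketch below converts 
between the 36-character text form and a 16-byte binary form using only 
`java.util.UUID` and `java.nio.ByteBuffer`. The class name and the big-endian 
byte layout are assumptions for illustration, not a proposed Pinot encoding.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical helper; the byte layout here is an assumption.
final class UuidBytesSketch {
  /** Packs the canonical 36-character UUID string into 16 bytes. */
  static byte[] toBytes(String text) {
    UUID uuid = UUID.fromString(text);
    return ByteBuffer.allocate(16)
        .putLong(uuid.getMostSignificantBits())
        .putLong(uuid.getLeastSignificantBits())
        .array();
  }

  /** Restores the canonical lowercase string form from 16 bytes. */
  static String toText(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes);
    long mostSignificant = buf.getLong();
    long leastSignificant = buf.getLong();
    return new UUID(mostSignificant, leastSignificant).toString();
  }
}
```

   With a fixed 16-byte width there is no padding at all, so the unpadded 
length never has to be computed.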

