ankitsultana opened a new issue, #16618: URL: https://github.com/apache/pinot/issues/16618
We often see scans on dictionary-encoded String columns show up in our profiles across a number of use cases. Many of those scans are on a UUID column, where nearly every element has the exact same length of 36 bytes. The hotspot in those profiles is the `FixedByteValueReaderWriter#getUnpaddedBytes` method. Note that the last optimization to this method was made back in 2021 by Richard Startin, who used SWAR with a ZeroInWord bit-trick (#7708).

Scans on dict-encoded columns are expected to be slow, but I think we should still aim to optimize the case where we know that all elements in the dictionary have the same unpadded length, since the padding scan is then unnecessary. Unfortunately, I don't see any quick way to do this. Most approaches I can think of would require storing something in the file format, bumping the version number, and introducing a migration or a config opt-in.

Here's a makeshift benchmark I wrote for this: #16617.

Separately, we can also consider adding a native UUID type to Pinot, which I think could improve not just scan performance but also the in-memory representation (a UUID can be represented with 16 bytes, versus the 36 bytes of its String representation).
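To make the two ideas above concrete, here is a minimal Java sketch. It is not Pinot's actual code: the class name, method names, and the layout (each entry occupies `width` bytes, right-padded with `0x00`, with no NUL bytes inside a value) are assumptions made for illustration. `unpaddedLength` is a generic version of the zero-in-word SWAR trick, `readFixed` shows the fast path that becomes possible if the file format recorded that every entry has the same unpadded length, and `uuidToBytes` shows the 16-byte packing of a 36-character UUID string.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical helper, not part of Pinot.
final class FixedLengthDictionarySketch {
  private static final long ONES = 0x0101010101010101L;
  private static final long HIGH_BITS = 0x8080808080808080L;

  // Finds the index of the first 0x00 padding byte in an entry, 8 bytes at a time.
  // Assumes entries are right-padded with 0x00 and contain no NUL bytes themselves.
  static int unpaddedLength(byte[] store, int offset, int width) {
    ByteBuffer buf = ByteBuffer.wrap(store).order(ByteOrder.LITTLE_ENDIAN);
    int i = 0;
    for (; i + 8 <= width; i += 8) {
      long word = buf.getLong(offset + i);
      // Zero-in-word: the lowest set bit marks the first zero byte of the word.
      long zeros = (word - ONES) & ~word & HIGH_BITS;
      if (zeros != 0) {
        return i + (Long.numberOfTrailingZeros(zeros) >>> 3);
      }
    }
    // Remaining tail bytes (fewer than 8) are checked one by one.
    for (; i < width; i++) {
      if (store[offset + i] == 0) {
        return i;
      }
    }
    return width;
  }

  // General path: every read pays for the padding scan.
  static String readPadded(byte[] store, int index, int width) {
    int offset = index * width;
    int length = unpaddedLength(store, offset, width);
    return new String(store, offset, length, StandardCharsets.UTF_8);
  }

  // Fast path: valid only if it is known up front (for example from metadata
  // stored with the dictionary) that every entry is exactly `width` bytes long.
  static String readFixed(byte[] store, int index, int width) {
    return new String(store, index * width, width, StandardCharsets.UTF_8);
  }

  // Packs the canonical 36-character UUID string into 16 bytes.
  static byte[] uuidToBytes(String uuid) {
    UUID parsed = UUID.fromString(uuid);
    return ByteBuffer.allocate(16)
        .putLong(parsed.getMostSignificantBits())
        .putLong(parsed.getLeastSignificantBits())
        .array();
  }

  // Restores the canonical (lowercase) string form from the 16-byte encoding.
  static String bytesToUuid(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes);
    return new UUID(buf.getLong(), buf.getLong()).toString();
  }
}
```

For a dictionary whose entries are all 36-byte UUID strings stored at width 36, `readFixed` and `readPadded` return the same value, but only `readPadded` scans for padding. How the "all entries have the same length" fact would be persisted is exactly the open question raised above, and the sketch does not address it.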
