[GitHub] [pinot] jasperjiaguo opened a new issue, #10499: Partitioned Distinct/DistinctCount

via GitHub Tue, 28 Mar 2023 20:34:58 -0700


jasperjiaguo opened a new issue, #10499:
URL: https://github.com/apache/pinot/issues/10499


   For high-cardinality columns, the local/intermediate/global merge phases of 
distinct(count) can be memory- and CPU-heavy, because the merger has to 
serialize/deserialize (ser/de) and merge multiple large sets from the 
responses. If the distinct(count) column is partitioned into disjoint sets, 
the merger can instead simply concatenate (for distinct) or add (for 
distinctcount) the intermediate results. This can significantly reduce set 
ser/de, transmission, and merge time, as well as the memory footprint. It can 
also be applied at different levels of the processing, depending on the 
partition granularity.
   
   <img width="757" alt="Screenshot 2023-03-28 at 8 34 39 PM" 
src="https://user-images.githubusercontent.com/10736840/228420057-f4957793-1820-4a6b-9974-45ec0fc80190.png">
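
   To make the idea concrete, here is a minimal Python sketch (not Pinot code) 
of the two merge strategies. Every name in it is hypothetical: `partition_of` 
is a stand-in CRC32-modulo partition function, `route_rows` assumes each 
server only holds rows from the partitions it owns, and the `merge_*` helpers 
are illustrative rather than anything in Pinot's API.

```python
import zlib
from itertools import chain
from typing import Iterable, List, Set

NUM_PARTITIONS = 4  # hypothetical partition count, chosen for illustration


def partition_of(value: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stand-in partition function (CRC32 modulo); an assumption, not Pinot's."""
    return zlib.crc32(value.encode("utf-8")) % num_partitions


def route_rows(values: Iterable[str], num_servers: int) -> List[List[str]]:
    """Place each row on the server owning its partition, so that the sets of
    values held by different servers are disjoint."""
    servers: List[List[str]] = [[] for _ in range(num_servers)]
    for value in values:
        servers[partition_of(value) % num_servers].append(value)
    return servers


def local_distinct(rows: Iterable[str]) -> Set[str]:
    """Per-server intermediate result: the distinct values seen locally."""
    return set(rows)


def merge_union(partials: Iterable[Set[str]]) -> Set[str]:
    """General merge: partial sets may overlap, so they must be unioned."""
    merged: Set[str] = set()
    for partial in partials:
        merged |= partial
    return merged


def merge_concat(partials: Iterable[Set[str]]) -> List[str]:
    """Partitioned DISTINCT merge: disjoint partials can simply be concatenated."""
    return list(chain.from_iterable(partials))


def merge_count(partials: Iterable[Set[str]]) -> int:
    """Partitioned DISTINCTCOUNT merge: only the per-server counts are added,
    so the sets themselves never need to reach the merger."""
    return sum(len(partial) for partial in partials)


values = ["u1", "u2", "u3", "u1", "u4", "u2", "u5"]
partials = [local_distinct(rows) for rows in route_rows(values, num_servers=2)]
distinct_values = merge_concat(partials)
distinct_count = merge_count(partials)
```

   The shortcut is only valid when the partials really are disjoint. If the 
same value can appear on two servers, `merge_concat` would return duplicates 
and `merge_count` would overcount, so the union-based merge is still needed in 
that case.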
   


