DanielTing commented on issue #6439:
URL: https://github.com/apache/pinot/issues/6439#issuecomment-895424222


   Just commenting a bit on how data sketches could be used for this problem 
and some of the limitations. The short is that existing libraries won't provide 
this exact functionality but can help with this. There is extra work involved. 
You'd get an approximate result that covers a good set of cases (but not all). 
   
   So for continuous data with a smooth density, the distribution is pretty 
easy to approximate using a quantile sketch. Both the KLL sketch in Apache 
Datasketches and the t-digest should work well. (See technical note below on 
some details)
   
   For discrete data (the more likely case when you compute the mode in 
database settings), the problem is exactly the same as top-k with k=1, so 
frequent item sketches can help you find it. The main problems are 1) that you 
won't know how to size the sketch without knowing how frequent the mode is in 
advance, and 2) that if the distribution is too "flat", it takes too much 
space. You can likely still get around problem (1) in practice, but not (2). If 
you cover both the continuous/discrete cases, you can consider using a KLL 
quantile sketch to act as a frequent item sketch. It's not quite optimized for 
the frequent item problem, but it could make engineering easier since you don't 
need to know if the data is continuous/discrete in advance. Even though it's a 
quantile sketch, you can still apply it to string data since there's a 
comparison operator. Distinct counting sketches will have too high of a chance 
of discarding the mode because they don't favor keeping around items with lots
  of duplicates.
   
   On the technical side, the KLL sketch has some proofs on the correctness for 
quantiles but doesn't explicitly cover finding the mode. To make sure it works 
with some guarantees, you'd likely have to assume the density is "smooth 
enough" like the derivative of the density is not too big. The simple procedure 
to find the mode would then be to apply a kernel density estimate on top of the 
points in the KLL sketch, and find the mode of the smoothed density.
   
   For the T-digest, you'd pretty much do the same, but the there wouldn't be 
any theoretical guarantees and the (perhaps main) advantage that it has 
accurate tails is irrelevant. The relevant error when computing the mode would 
be the maximum error (which should occur around the median). 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

Reply via email to