DanielTing commented on issue #6439: URL: https://github.com/apache/pinot/issues/6439#issuecomment-895424222
Just commenting a bit on how data sketches could be used for this problem and some of the limitations. The short is that existing libraries won't provide this exact functionality but can help with this. There is extra work involved. You'd get an approximate result that covers a good set of cases (but not all). So for continuous data with a smooth density, the distribution is pretty easy to approximate using a quantile sketch. Both the KLL sketch in Apache Datasketches and the t-digest should work well. (See technical note below on some details) For discrete data (the more likely case when you compute the mode in database settings), the problem is exactly the same as top-k with k=1, so frequent item sketches can help you find it. The main problems are 1) that you won't know how to size the sketch without knowing how frequent the mode is in advance, and 2) that if the distribution is too "flat", it takes too much space. You can likely still get around problem (1) in practice, but not (2). If you cover both the continuous/discrete cases, you can consider using a KLL quantile sketch to act as a frequent item sketch. It's not quite optimized for the frequent item problem, but it could make engineering easier since you don't need to know if the data is continuous/discrete in advance. Even though it's a quantile sketch, you can still apply it to string data since there's a comparison operator. Distinct counting sketches will have too high of a chance of discarding the mode because they don't favor keeping around items with lots of duplicates. On the technical side, the KLL sketch has some proofs on the correctness for quantiles but doesn't explicitly cover finding the mode. To make sure it works with some guarantees, you'd likely have to assume the density is "smooth enough" like the derivative of the density is not too big. The simple procedure to find the mode would then be to apply a kernel density estimate on top of the points in the KLL sketch, and find the mode of the smoothed density. For the T-digest, you'd pretty much do the same, but the there wouldn't be any theoretical guarantees and the (perhaps main) advantage that it has accurate tails is irrelevant. The relevant error when computing the mode would be the maximum error (which should occur around the median). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org