gaborkaszab commented on PR #14234: URL: https://github.com/apache/iceberg/pull/14234#issuecomment-4489041586
> > > > Hi @nastra, could you please clarify where we would store NDVs? Apologies if this is already covered in a design doc I may have missed. Also, would it be possible to extend the design to optionally support partition-level statistics structures such as bitmap-based sketches for NDV estimation and histograms (e.g., KLL sketches)? Thank you!
> > >
> > > @deniskuzZ those type of stats are not handled by this design as those are stored separately in Puffin files (you might want to take a look at `NDVSketchUtil`). Those are then e.g. later used by Spark in `SparkScan#estimateStatistics`.
> >
> > Thanks for confirming that! In that case, I’m not entirely sure why Gabor’s proposal on standardizing and extending column stats was rejected. We were under the impression that the new design would also cover that aspect: https://docs.google.com/document/d/1H9uYt53Q1_CcOXOfLcr0hXRxvqflg_k_xeVorMLrWbM/edit?tab=t.0 Should we give it another try and start a new thread? cc @gaborkaszab, @pvary
>
> I haven't seen this proposal yet, so let me take a look first before responding to it. I also meant to reply to your NDV question earlier and updated my comment to say "NDVs are not handled..." instead of "those type of stats are not handled...". In any case, let me first read that proposal that you linked.

Hey @nastra and @deniskuzZ,

Thanks for bringing up that proposal doc I had earlier!

As I see it, `ContentStats` is currently meant to be used on a per-file basis (and aggregated at each level of the metadata tree structure, if I'm not mistaken). I think the `ContentStats` data structure is something we can re-use for extending `PartitionStatistics` to introduce column-level stats on a per-partition level too. Just for the record, Trino and Hive people have been asking for this for a while; I have received a number of requests around this since the above-mentioned proposal doc went public.

Now about sketches (of any kind, e.g. NDV or histograms):
- I don't think these should live within `ContentStats`, but we can examine other opportunities to have them on a per-partition basis, since per-table is all we currently have for sketches (well, for NDV, but histograms could be a meaningful addition IMO). One option is to have Puffin files wired into partition stats too.
- I'm somewhat hesitant to add sketches on a per-partition level TBH, but it would be nice to hear other opinions here. Just a back-of-the-envelope calculation: a Theta sketch with default size/precision is 50 KB, so with 100k partitions that is 5 GB of Theta sketches per column. We could reduce this by sacrificing precision, e.g. k=512 => sketch size is 6 KB * 100k partitions => 600 MB per column. These are in-memory sizes, and we would surely load only a small subset of them; however, the on-disk size, although smaller, could still be about half of the in-memory size. That still looks like a lot, and I recall complaints about this when working on Impala, so it would be nice to hear other experiences.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
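For reference, the back-of-the-envelope estimate in the comment above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the per-sketch sizes (50 KB and 6 KB), the 100k partition count, and the "about half on disk" ratio are the rough figures quoted in the comment, not measurements, and the helper name `total_sketch_bytes` is made up for illustration. Decimal units (1 KB = 1000 bytes) are assumed.

```python
# Decimal units, assumed here to match the rounded figures in the comment.
KB = 1_000
MB = 1_000_000
GB = 1_000_000_000


def total_sketch_bytes(sketch_bytes: int, partitions: int, columns: int = 1) -> int:
    """Hypothetical helper: total size of one sketch per partition per column."""
    return sketch_bytes * partitions * columns


partitions = 100_000  # "100k partitions" from the comment

# Theta sketch at the quoted default size/precision: ~50 KB each.
default_total = total_sketch_bytes(50 * KB, partitions)

# Reduced precision (k=512 in the comment): ~6 KB each.
reduced_total = total_sketch_bytes(6 * KB, partitions)

# The comment guesses the on-disk size could be about half of the in-memory size.
on_disk_ratio = 0.5
reduced_on_disk = int(reduced_total * on_disk_ratio)

print(f"default sketches: {default_total / GB} GB per column")    # 5.0 GB
print(f"k=512 sketches:   {reduced_total / MB} MB per column")    # 600.0 MB
print(f"k=512 on disk:    {reduced_on_disk / MB} MB per column")  # 300.0 MB
```

Passing a `columns` value greater than 1 scales the totals linearly, which shows how quickly the footprint grows when several columns carry per-partition sketches.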
