nastra commented on PR #14234:
URL: https://github.com/apache/iceberg/pull/14234#issuecomment-4479765732

   > > > Hi @nastra, could you please clarify where we would store NDVs?
   > > > Apologies if this is already covered in a design doc I may have missed.
   > > > Also, would it be possible to extend the design to optionally support
   > > > partition-level statistics structures such as bitmap-based sketches for
   > > > NDV estimation and histograms (e.g., KLL sketches)? Thank you!
   > > 
   > > 
   > > @deniskuzZ those type of stats are not handled by this design as those
   > > are stored separately in Puffin files (you might want to take a look at
   > > `NDVSketchUtil`). Those are then e.g. later used by Spark in
   > > `SparkScan#estimateStatistics`.
   > 
   > Thanks for confirming that! In that case, I’m not entirely sure why
   > Gabor’s proposal on standardizing and extending column stats was rejected.
   > We were under the impression that the new design would also cover that
   > aspect:
   > https://docs.google.com/document/d/1H9uYt53Q1_CcOXOfLcr0hXRxvqflg_k_xeVorMLrWbM/edit?tab=t.0
   > Should we give it another try and start a new thread? cc @gaborkaszab, @pvary
   
   I haven't seen this proposal yet, so let me take a look before
responding to it. I also meant to reply to your NDV question earlier and have
updated my comment to say "NDVs are not handled..." instead of "those type of
stats are not handled...". In any case, let me first read the proposal
you linked.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

