Fokko commented on issue #1223:
URL: https://github.com/apache/iceberg-python/issues/1223#issuecomment-2445969277

   Thanks @Visorgood for reaching out here, and that's an excellent idea. We 
already do something like this in DataHub, see: 
https://github.com/datahub-project/datahub/blob/0e62c699fc2e4cf2d3525e899037b8277541cfd6/metadata-ingestion/src/datahub/ingestion/source/iceberg/iceberg_profiler.py#L141-L162
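   
   To make that approach concrete, here is a minimal sketch of profiling from metadata only, using PyIceberg's `files` metadata table. It is an illustration, not the DataHub implementation: the catalog name and table identifier are placeholders, and the column names (`content`, `record_count`) follow the Iceberg spec's files table, so double-check them against your PyIceberg version:

```python
import pyarrow.compute as pc
from pyiceberg.catalog import load_catalog

# Placeholder catalog/table names; point these at your own environment.
catalog = load_catalog("default")
table = catalog.load_table("db.events")

# The `files` metadata table is a pyarrow.Table with one row per file tracked
# in the current snapshot, carrying the metrics already stored in the manifests.
files = table.inspect.files()

# Keep only data files (content == 0) and sum their manifest-level record counts.
# No data is scanned; this is metadata only.
data_files = files.filter(pc.equal(files["content"], 0))
approx_rows = pc.sum(data_files["record_count"]).as_py()
print(f"~{approx_rows} rows across {data_files.num_rows} data files")
```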
   
   There are some limitations that @sungwy already pointed out, such as not 
being able to apply a filter. There are a couple more: when you have positional 
deletes, the row counts are no longer accurate. You would need to apply the 
deletes and then count, but that requires computation. Also, the upper and 
lower bounds are truncated by default when the column is a string. For DataHub 
this is fine, but you need to be aware of these limitations. A quick way to see 
whether the positional-delete caveat applies is sketched below.
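   
   Reusing the `table` handle from the sketch above, one way to detect the caveat is to look for delete files in the same `files` metadata table; the `content` values (0 = data, 1 = position deletes, 2 = equality deletes) are from the Iceberg spec, but treat the snippet as an assumption-laden illustration:

```python
import pyarrow.compute as pc

files = table.inspect.files()

# content: 0 = data file, 1 = position deletes, 2 = equality deletes.
delete_files = files.filter(pc.not_equal(files["content"], 0))

if delete_files.num_rows > 0:
    # Manifest record counts over-count here: the deletes would have to be
    # applied (i.e. the data actually read) to get the exact live row count.
    print(f"{delete_files.num_rows} delete file(s) present; counts are an upper bound")
```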
   
   That said, I do think there is value in a dedicated API to quickly get 
table/column statistics. I think the 
[metadata tables](https://py.iceberg.apache.org/api/#partitions) are the right 
place to add this. WDYT?
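   
   For example, the partitions metadata table linked above already surfaces per-partition counts derived from manifests, so a statistics API could sit right next to it. A quick look at what is exposed today (column names as in the current docs; verify against your PyIceberg version):

```python
# Per-partition statistics that PyIceberg already derives from the manifests.
partitions = table.inspect.partitions()

# record_count / file_count come straight from manifest metadata; the delete
# counters make the approximate nature of the row counts visible to the caller.
print(partitions.select([
    "partition",
    "record_count",
    "file_count",
    "position_delete_record_count",
]))
```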

