Fokko commented on issue #1223: URL: https://github.com/apache/iceberg-python/issues/1223#issuecomment-2445969277

Thanks @Visorgood for reaching out here, and that's an excellent idea. We actually already do this in projects like DataHub, see: https://github.com/datahub-project/datahub/blob/0e62c699fc2e4cf2d3525e899037b8277541cfd6/metadata-ingestion/src/datahub/ingestion/source/iceberg/iceberg_profiler.py#L141-L162

There are some limitations that @sungwy already pointed out, such as applying a filter. There are a couple more. When a table has positional deletes, the row counts from the manifests are no longer accurate: you would need to apply the deletes and then count, but that requires computation. Also, the upper and lower bounds are truncated by default when the column is a string. For DataHub this is fine, but you need to be aware of these limitations.

That said, I do think there is value in a dedicated API to quickly get table/column statistics. I think the [metadata tables are the right place](https://py.iceberg.apache.org/api/#partitions) to add this. WDYT?
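The positional-delete caveat above can be sketched with a toy model. Note the names (`DataFile`, `PositionalDeleteFile`, `deleted_positions`) are illustrative only, not the PyIceberg API; the point is that the metadata-only count is a cheap sum, while the accurate count requires reading the delete files:

```python
# Simplified, hypothetical model of Iceberg file-level metadata: each data
# file carries a record_count, and positional delete files mark individual
# rows as (data_file_path, row_position) pairs.
from dataclasses import dataclass, field


@dataclass
class DataFile:
    path: str
    record_count: int


@dataclass
class PositionalDeleteFile:
    # (data_file_path, row_position) pairs marking deleted rows
    deleted_positions: set = field(default_factory=set)


def metadata_only_count(data_files):
    # Fast: just sum the counts stored in the manifests.
    # This ignores positional deletes, so it can over-count.
    return sum(f.record_count for f in data_files)


def accurate_count(data_files, delete_files):
    # Accurate: subtract rows removed by positional deletes.
    # A real engine would have to read the delete files to do this,
    # which is the extra computation mentioned above.
    deleted = set()
    for d in delete_files:
        deleted |= d.deleted_positions
    return metadata_only_count(data_files) - len(deleted)


files = [DataFile("a.parquet", 100), DataFile("b.parquet", 50)]
deletes = [PositionalDeleteFile({("a.parquet", 3), ("a.parquet", 7)})]
print(metadata_only_count(files))      # 150 (metadata only, over-counts)
print(accurate_count(files, deletes))  # 148 (after applying the deletes)
```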
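The string-bound truncation mentioned above works roughly like this. This is a simplified sketch (Iceberg's default metrics mode truncates to 16 bytes; this sketch works on characters and glosses over the overflow case where the last code point cannot be incremented): lower bounds can simply be cut, but upper bounds must remain greater than or equal to every real value, so the last retained character is bumped up.

```python
def truncate_lower_bound(value: str, width: int = 16) -> str:
    # A lower bound can simply be cut at the truncation width:
    # any prefix of the minimum is still <= every value.
    return value[:width]


def truncate_upper_bound(value: str, width: int = 16) -> str:
    # An upper bound must stay >= the real maximum, so after cutting,
    # the last character is incremented. Simplified: ignores the
    # overflow case where that character is already the max code point.
    if len(value) <= width:
        return value
    prefix = value[:width]
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)


value = "iceberg" * 3  # 21 characters, longer than the default width
print(truncate_lower_bound(value))  # "icebergicebergic"
print(truncate_upper_bound(value))  # "icebergicebergid"
```

The consequence for a statistics API is that the stored bounds of long string columns are not the exact min/max values, only conservative prefixes.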
Thanks @Visorgood for reaching out here, and that's an excellent idea. We actually already do this in a project like Datahub, see: https://github.com/datahub-project/datahub/blob/0e62c699fc2e4cf2d3525e899037b8277541cfd6/metadata-ingestion/src/datahub/ingestion/source/iceberg/iceberg_profiler.py#L141-L162 There are some limitations that @sungwy already pointed out, such as applying a filter. There are a couple more, such as when you have positional deletes, the row-counts are not accurate anymore. You would need to apply the deletes and then count, but this requires computation. Also, the upper- and lower bounds are truncated by default when the column is a string. For DataHub this is fine, but you need to be aware of the limitations. That said, I do think there is value in a special API to quickly get table/column statistics. I think adding this to the [metadata tables is the right place](https://py.iceberg.apache.org/api/#partitions). WDYT? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org