Fokko commented on issue #1223: URL: https://github.com/apache/iceberg-python/issues/1223#issuecomment-2445969277

Thanks @Visorgood for reaching out here, and that's an excellent idea. We actually already do this in projects like DataHub, see: https://github.com/datahub-project/datahub/blob/0e62c699fc2e4cf2d3525e899037b8277541cfd6/metadata-ingestion/src/datahub/ingestion/source/iceberg/iceberg_profiler.py#L141-L162

There are some limitations that @sungwy already pointed out, such as applying a filter. There are a couple more. When a table has positional deletes, the row counts from the manifests are no longer accurate: you would need to apply the deletes and then count, but that requires computation. Also, the upper and lower bounds are truncated by default when the column is a string. For DataHub this is fine, but you need to be aware of these limitations.

That said, I do think there is value in a dedicated API to quickly get table/column statistics. I think the [metadata tables are the right place](https://py.iceberg.apache.org/api/#partitions) to add this. WDYT?
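The positional-delete caveat above can be sketched with a toy model. Note the names (`DataFile`, `PositionalDeleteFile`, `deleted_positions`) are illustrative only, not the PyIceberg API; the point is that the metadata-only count is a cheap sum, while the accurate count requires reading the delete files:

```python
# Simplified, hypothetical model of Iceberg file-level metadata: each data
# file carries a record_count, and positional delete files mark individual
# rows as (data_file_path, row_position) pairs.
from dataclasses import dataclass, field


@dataclass
class DataFile:
    path: str
    record_count: int


@dataclass
class PositionalDeleteFile:
    # (data_file_path, row_position) pairs marking deleted rows
    deleted_positions: set = field(default_factory=set)


def metadata_only_count(data_files):
    # Fast: just sum the counts stored in the manifests.
    # This ignores positional deletes, so it can over-count.
    return sum(f.record_count for f in data_files)


def accurate_count(data_files, delete_files):
    # Accurate: subtract rows removed by positional deletes.
    # A real engine would have to read the delete files to do this,
    # which is the extra computation mentioned above.
    deleted = set()
    for d in delete_files:
        deleted |= d.deleted_positions
    return metadata_only_count(data_files) - len(deleted)


files = [DataFile("a.parquet", 100), DataFile("b.parquet", 50)]
deletes = [PositionalDeleteFile({("a.parquet", 3), ("a.parquet", 7)})]
print(metadata_only_count(files))      # 150 (metadata only, over-counts)
print(accurate_count(files, deletes))  # 148 (after applying the deletes)
```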
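The string-bound truncation mentioned above works roughly like this. This is a simplified sketch (Iceberg's default metrics mode truncates to 16 bytes; this sketch works on characters and glosses over the overflow case where the last code point cannot be incremented): lower bounds can simply be cut, but upper bounds must remain greater than or equal to every real value, so the last retained character is bumped up.

```python
def truncate_lower_bound(value: str, width: int = 16) -> str:
    # A lower bound can simply be cut at the truncation width:
    # any prefix of the minimum is still <= every value.
    return value[:width]


def truncate_upper_bound(value: str, width: int = 16) -> str:
    # An upper bound must stay >= the real maximum, so after cutting,
    # the last character is incremented. Simplified: ignores the
    # overflow case where that character is already the max code point.
    if len(value) <= width:
        return value
    prefix = value[:width]
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)


value = "iceberg" * 3  # 21 characters, longer than the default width
print(truncate_lower_bound(value))  # "icebergicebergic"
print(truncate_upper_bound(value))  # "icebergicebergid"
```

The consequence for a statistics API is that the stored bounds of long string columns are not the exact min/max values, only conservative prefixes.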
Thanks @Visorgood for reaching out here, and that's an excellent idea. We actually already do this in a project like Datahub, see: https://github.com/datahub-project/datahub/blob/0e62c699fc2e4cf2d3525e899037b8277541cfd6/metadata-ingestion/src/datahub/ingestion/source/iceberg/iceberg_profiler.py#L141-L162 There are some limitations that @sungwy already pointed out, such as applying a filter. There are a couple more, such as when you have positional deletes, the row-counts are not accurate anymore. You would need to apply the deletes and then count, but this requires computation. Also, the upper- and lower bounds are truncated by default when the column is a string. For DataHub this is fine, but you need to be aware of the limitations. That said, I do think there is value in a special API to quickly get table/column statistics. I think adding this to the [metadata tables is the right place](https://py.iceberg.apache.org/api/#partitions). WDYT? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org