emorfam commented on issue #1202: URL: https://github.com/apache/iceberg-python/issues/1202#issuecomment-2472881903
We currently use PyIceberg in a custom application to monitor metadata statistics of Iceberg tables (e.g. file count, record count, data distribution across partitions). We periodically compute these statistics, write them to Postgres, and hook that up to Grafana. This gives us a better idea of how to optimize our Iceberg tables further (e.g. partition layout).

In the long run we would like to use PyIceberg as a low-cost alternative to Glue streaming (possibly with AWS Lambda, or Quix-Streams inside of Fargate). This is especially interesting for applications that are low in data volume but have stricter timeliness requirements than batch jobs. Here are some example use cases:

- Processing assembly trees in manufacturing that change over time.
- Ingesting sensor data from production plants that can contain duplicate messages.

`MERGE` support would be really helpful here. I guess the biggest obstacle will be limiting the amount of data that is loaded from the target table during the `MERGE` operation (e.g. with push-down predicates).

Thanks for the great work that the Iceberg community is doing.
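
To illustrate the `MERGE` concern above, here is a minimal pure-Python sketch of an upsert that first derives a narrow predicate from the incoming batch and only "loads" the target rows that match it. It makes no PyIceberg calls and is not how PyIceberg implements anything; the row layout, the `(sensor_id, ts)` merge key, and the names `row_key` and `merge_batch` are all assumptions made up for this example.

```python
from typing import Iterable

# Hypothetical row layout: {"sensor_id": str, "ts": int, "value": float}
Row = dict


def row_key(row: Row) -> tuple:
    """Assumed merge key: one reading per sensor and timestamp."""
    return (row["sensor_id"], row["ts"])


def merge_batch(target: list[Row], batch: Iterable[Row]) -> tuple[list[Row], int]:
    """Upsert `batch` into `target`.

    Returns the new target rows and the number of target rows that had
    to be loaded to perform the merge.
    """
    # 1. Drop duplicate messages inside the batch; the last one per key wins.
    incoming = {row_key(r): r for r in batch}
    if not incoming:
        return list(target), 0

    # 2. Derive a narrow predicate from the batch itself
    #    (stand-in for a filter that would be pushed down to the scan).
    sensors = {key[0] for key in incoming}
    lo = min(key[1] for key in incoming)
    hi = max(key[1] for key in incoming)

    def predicate(r: Row) -> bool:
        return r["sensor_id"] in sensors and lo <= r["ts"] <= hi

    # 3. "Load" only the target rows the predicate selects.
    loaded = [r for r in target if predicate(r)]
    untouched = [r for r in target if not predicate(r)]

    # 4. Upsert: matching keys are overwritten, new keys are added.
    merged = {row_key(r): r for r in loaded}
    merged.update(incoming)

    return untouched + list(merged.values()), len(loaded)
```

The point of the sketch is step 2: the tighter the predicate derived from the batch, the fewer target rows need to be read before the merge can decide between update and insert.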