Re: [I] PyIceberg Production Use case survey [iceberg-python]

via GitHub Wed, 13 Nov 2024 00:56:46 -0800


emorfam commented on issue #1202:
URL: 
https://github.com/apache/iceberg-python/issues/1202#issuecomment-2472881903


   We currently use PyIceberg in a custom application to monitor metadata 
statistics of Iceberg tables (e.g. file count, record count, data 
distribution across partitions). We periodically compute these statistics and 
write them to Postgres, which is hooked up to Grafana. This gives us a better 
idea of how to optimize Iceberg tables further (e.g. partition layout).
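
A minimal sketch of the kind of aggregation described above, not the actual application code. It assumes per-file metadata rows are already available as dictionaries with a `partition` value and a `record_count`; in PyIceberg such rows could presumably be derived from the table's metadata tables (e.g. `table.inspect.files()`), but that mapping, the function names, and the sample data are all hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Mapping


def summarize_partitions(files: Iterable[Mapping]) -> dict:
    """Aggregate file count and record count per partition value."""
    stats = defaultdict(lambda: {"file_count": 0, "record_count": 0})
    for f in files:
        entry = stats[f["partition"]]
        entry["file_count"] += 1
        entry["record_count"] += f["record_count"]
    return dict(stats)


def record_share(stats: Mapping) -> dict:
    """Fraction of all records held by each partition (data distribution)."""
    total = sum(s["record_count"] for s in stats.values())
    return {
        p: (s["record_count"] / total if total else 0.0)
        for p, s in stats.items()
    }


# Hypothetical sample rows standing in for per-file metadata.
sample_files = [
    {"partition": "2024-11-01", "record_count": 100},
    {"partition": "2024-11-01", "record_count": 300},
    {"partition": "2024-11-02", "record_count": 600},
]
stats = summarize_partitions(sample_files)
shares = record_share(stats)
```

The resulting per-partition rows are flat enough to insert into a Postgres table with a timestamp column, so they can be charted over time.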
   
   In the long run we would like to use PyIceberg as a low-cost alternative to 
Glue streaming (possibly with AWS Lambda or Quix-Streams inside of Fargate). 
This is especially interesting for applications that have low data volume but 
stricter timeliness requirements than batch jobs. Here are some example use 
cases:
   
   - Processing assembly-trees in manufacturing that change over time.
   - Ingesting sensor data from production plants that can contain duplicate 
messages.
   
   `MERGE` support would be really helpful here. I guess the biggest obstacle 
will be limiting the amount of data that is loaded from the target table 
during the `MERGE` operation (e.g. with push-down predicates).
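
To illustrate the concern, here is a rough sketch of a merge that only touches target rows inside the incoming batch's time range and drops duplicate messages along the way. It is plain Python over lists of dictionaries, not a PyIceberg API: the column names `sensor_id` and `ts`, the helper names, and the sample data are assumptions. With PyIceberg the range restriction would presumably be passed as a `row_filter` to `Table.scan` so that files outside the range are never read.

```python
def key_bounds(batch):
    """Smallest and largest event timestamps in the incoming batch."""
    ts = [row["ts"] for row in batch]
    return min(ts), max(ts)


def merge_batch(target, batch, key=("sensor_id", "ts")):
    """Upsert `batch` into `target`, matching only rows in the batch's time range."""
    if not batch:
        return list(target)
    lo, hi = key_bounds(batch)
    # Stand-in for a pushed-down predicate: only rows within [lo, hi]
    # are considered "loaded" from the target for matching.
    candidates = [r for r in target if lo <= r["ts"] <= hi]
    untouched = [r for r in target if not (lo <= r["ts"] <= hi)]
    merged = {tuple(r[k] for k in key): r for r in candidates}
    for row in batch:
        # Duplicate messages share a key, so later ones overwrite earlier ones.
        merged[tuple(row[k] for k in key)] = row
    return untouched + list(merged.values())


# Hypothetical sensor data; the batch contains one duplicate message.
target_rows = [
    {"sensor_id": "a", "ts": 1, "value": 1.0},
    {"sensor_id": "a", "ts": 5, "value": 5.0},
]
incoming = [
    {"sensor_id": "a", "ts": 5, "value": 5.5},
    {"sensor_id": "a", "ts": 5, "value": 5.5},
    {"sensor_id": "b", "ts": 6, "value": 6.0},
]
result = merge_batch(target_rows, incoming)
```

For low-volume, near-real-time batches the time range is usually narrow, which is why predicate push-down should keep the amount of target data read small.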
   
   Thanks for the great work that the Iceberg community is doing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


