brianfromoregon commented on issue #368: URL: https://github.com/apache/iceberg-python/issues/368#issuecomment-2020990035
Hi @syun64, thanks for chiming in!

My batch app stores historical data, and there is always a date column. It runs once per date and inserts the data for that date. Sometimes there is legitimately no data available for a particular date: no matter how many times the app runs, there will never be data. Other times the app hits an error or fails to run and needs to be re-run for a date. I'm trying to let my app differentiate between missing dates and present-but-empty dates so that it does not keep re-running for dates that will never produce data.

When I was using raw parquet files, I would simply write an empty file for a date to represent present-but-empty. Asking in Slack, I learned that Iceberg does not support this concept (for example, no empty partitions are allowed), so instead I am aiming to use metadata (snapshot properties) to store the date ranges that are reflected in the stored data.

To implement this with snapshot properties, I want my writer to do the following transactionally:

1. Fetch the current snapshot's dateranges property.
2. Modify that dateranges value to include the dates which are about to be written.
3. Merge the new data and update the dateranges snapshot property, in the same new snapshot.

If another concurrent writer were to write its own new snapshot between steps 1 and 3, I would want my writer to throw an exception, and then I would retry from step 1 starting from the latest snapshot.

Today I use PySpark Iceberg for writing because PyIceberg does not yet support partitioned writes. PyIceberg is getting partitioned writes soon, and I am excited to try it! Until then I'm using PySpark for writing and want some way to accomplish steps 1-3 from a Python client.

I hope this explains my goal and motivation. Another approach I had in mind was to be able to read and write snapshot properties from a PySpark SQL query.
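To make steps 1-3 concrete, here is a minimal sketch of the read-modify-commit loop I have in mind. It runs against a toy in-memory table, not against PyIceberg or Spark: `ToyTable`, `Snapshot`, `CommitConflict`, `add_dates`, `write_with_dateranges`, and the `before_commit` hook are all hypothetical names made up for illustration. I am also assuming the dateranges property is stored as a comma-separated list of ISO dates, which is just one possible encoding.

```python
from dataclasses import dataclass, field


class CommitConflict(Exception):
    """Raised when the table changed after the writer read its base snapshot."""


@dataclass
class Snapshot:
    snapshot_id: int
    properties: dict


@dataclass
class ToyTable:
    """In-memory stand-in for a table. Hypothetical; NOT the PyIceberg API."""
    snapshots: list = field(default_factory=lambda: [Snapshot(0, {})])
    rows: list = field(default_factory=list)

    def current_snapshot(self):
        return self.snapshots[-1]

    def commit(self, base_snapshot_id, new_rows, properties):
        # Compare-and-swap: reject the commit if another writer got in first.
        if self.current_snapshot().snapshot_id != base_snapshot_id:
            raise CommitConflict(f"base snapshot {base_snapshot_id} is stale")
        self.rows.extend(new_rows)
        self.snapshots.append(Snapshot(base_snapshot_id + 1, dict(properties)))


def add_dates(dateranges, new_dates):
    """Union ISO dates into a sorted, comma-separated property value."""
    known = set(filter(None, dateranges.split(",")))
    return ",".join(sorted(known | set(new_dates)))


def write_with_dateranges(table, rows, dates, max_attempts=5, before_commit=None):
    """Write rows and record their dates in the same new snapshot."""
    for _ in range(max_attempts):
        base = table.current_snapshot()                                   # step 1
        merged = add_dates(base.properties.get("dateranges", ""), dates)  # step 2
        if before_commit:
            before_commit()  # hook that lets a test simulate a concurrent writer
        try:
            table.commit(base.snapshot_id, rows, {"dateranges": merged})  # step 3
            return merged
        except CommitConflict:
            continue  # another writer won the race; start again from step 1
    raise CommitConflict("gave up after repeated conflicts")
```

With this shape, a present-but-empty date is just `write_with_dateranges(table, [], ["2024-03-02"])`: no rows are added, but the new snapshot's property still records the date, so the app knows not to re-run for it.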
That SQL-based approach is appealing because it would be a single-client solution that would also allow my non-Python clients to perform writes that honor this dateranges property.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org