Re: [I] Support setting a snapshot property in same commit as spark.sql [iceberg-python]

via GitHub Tue, 26 Mar 2024 10:00:37 -0700


brianfromoregon commented on issue #368:
URL: https://github.com/apache/iceberg-python/issues/368#issuecomment-2020990035


   Hi @syun64, thanks for chiming in! 
   
   My batch app stores historical data, and there is always a date column. It 
runs once per date and inserts data for that date. Sometimes there is 
legitimately no data available for a particular date: no matter how many times 
the app runs, there will never be data. Other times the app has an error or 
fails to run and needs to be re-run for a date. I'm trying to allow my app to 
differentiate between missing dates and present-but-empty dates so it does not 
constantly try re-running for dates that will never produce data. When I was 
using raw parquet files I would simply write an empty file for a date to 
represent present-but-empty. Asking in Slack I learned that Iceberg does not 
support this concept (for example no empty partitions allowed) so instead I am 
aiming to use metadata (snapshot properties) to store the date ranges that are 
reflected in the stored data.
   
   In order to implement this with snapshot properties I want my writer to do 
the following transactionally:
   1. Fetch the current snapshot's dateranges property.
   2. Modify that dateranges value to include the dates which are about to be 
written.
   3. Merge the new data and update the dateranges snapshot property, in the 
same new snapshot.
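
   To make step 2 concrete, here is a minimal sketch of updating a dateranges value. Snapshot summary properties are string-to-string, so the ranges have to be encoded as text. The encoding used here (comma-separated `start/end` ISO dates, inclusive) and the helper names `parse_ranges`, `format_ranges`, and `add_dates` are assumptions for illustration, not anything defined by Iceberg.

```python
from datetime import date, timedelta


def parse_ranges(value: str) -> list[tuple[date, date]]:
    """Parse 'YYYY-MM-DD/YYYY-MM-DD,...' into (start, end) pairs (assumed encoding)."""
    ranges = []
    for part in filter(None, value.split(",")):
        start, end = part.split("/")
        ranges.append((date.fromisoformat(start), date.fromisoformat(end)))
    return ranges


def format_ranges(ranges: list[tuple[date, date]]) -> str:
    """Inverse of parse_ranges."""
    return ",".join(f"{s.isoformat()}/{e.isoformat()}" for s, e in ranges)


def add_dates(value: str, new_dates: list[date]) -> str:
    """Return the property value with new_dates included.

    Overlapping or adjacent ranges are merged, so a date that was processed
    but produced no rows is still recorded as covered.
    """
    ranges = sorted(parse_ranges(value) + [(d, d) for d in new_dates])
    merged: list[tuple[date, date]] = []
    for start, end in ranges:
        # Extend the previous range if this one overlaps or touches it.
        if merged and start <= merged[-1][1] + timedelta(days=1):
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return format_ranges(merged)
```

   With this encoding, a present-but-empty date is simply a date that falls inside a recorded range but has no rows in the table, and a missing date is one that falls outside every range.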
   
   If another concurrent writer were to write its own new snapshot between step 
1 and 3, I would want my writer to throw an exception and then I'll try again 
at step 1 starting from the latest snapshot.
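
   The retry behaviour I have in mind can be sketched as an optimistic read-modify-write loop. `FakeTable`, `CommitConflict`, and `write_with_dateranges` below are hypothetical stand-ins written only for this illustration; they are not PyIceberg or Spark APIs. A real implementation would need the catalog to reject a commit whose base snapshot is stale.

```python
import threading


class CommitConflict(Exception):
    """Raised when the table moved past the snapshot a writer started from."""


class FakeTable:
    """In-memory stand-in for a table (hypothetical, not a real Iceberg API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.snapshot_id = 0
        self.properties = {}
        self.rows = []

    def current_snapshot(self):
        # Return the snapshot id together with a copy of its properties.
        with self._lock:
            return self.snapshot_id, dict(self.properties)

    def commit(self, base_snapshot_id, rows, properties):
        # Data and properties land in one new snapshot, or not at all.
        with self._lock:
            if base_snapshot_id != self.snapshot_id:
                raise CommitConflict("table changed since the base snapshot")
            self.rows.extend(rows)
            self.properties = dict(properties)
            self.snapshot_id += 1
            return self.snapshot_id


def write_with_dateranges(table, rows, update_dateranges, max_attempts=5):
    """Run steps 1-3, restarting from step 1 when another writer got in first."""
    for _ in range(max_attempts):
        base_id, props = table.current_snapshot()                   # step 1
        new_value = update_dateranges(props.get("dateranges", ""))  # step 2
        try:
            new_props = {**props, "dateranges": new_value}
            return table.commit(base_id, rows, new_props)           # step 3
        except CommitConflict:
            continue  # re-read the latest snapshot and try again
    raise CommitConflict(f"gave up after {max_attempts} attempts")
```

   The important property is that the dateranges value is always recomputed from the snapshot that the commit is validated against, so a concurrent writer's dates are never overwritten.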
   
   Today I use PySpark Iceberg for writing because PyIceberg does not yet 
support partitioned writes. PyIceberg is getting partitioned writes soon, I am 
excited to try it! But until then I'm using PySpark for writing and want some 
way to accomplish steps 1-3 from a Python client. I hope this explains my goal 
and motivation.
   
   Another approach I had in mind was to be able to read and write snapshot 
properties from a PySpark SQL query. That is appealing because it would be a 
single-client solution which would also allow my non-Python clients to perform 
writes that honor this dateranges property.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


