Fokko commented on PR #1742:
URL: https://github.com/apache/iceberg-python/pull/1742#issuecomment-2693590912

   I understand that this is a streaming workload? In that case, writing the 
manifest doesn't help a lot. I understand the problem. Let me think out loud:
   
   The problem with the current approach (in this PR) is that when the schema 
or partition spec changes, `add_files` will compare the files against the 
latest schema. While it is perfectly fine to commit files with an older schema 
to the table, I think `add_files` will not allow this. Another concern is 
around precedence: for append operations this works perfectly fine, but when 
we want to do distributed rewrites or deletes, the order of operations is very 
important.
   
   > We needed to build it dynamically, so we have to deal with changes to the 
`DataFile` class. If we could add methods to `DataFile` for serialization and 
deserialization, the logic could be much simpler.
   
   The most obvious way of serializing it is by using Avro. This is efficient 
over the wire as well (I expect it to be much smaller than jsonpickle or 
regular pickle). I would be in favor of having this in combination with 
`append_data_file` because it is much more robust. This would also play very 
well with the logic suggested in 
https://github.com/apache/iceberg-python/issues/1678, resulting in far fewer 
conflicts.
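   To illustrate the idea, here is a minimal sketch of Avro round-tripping for 
a data-file record, using the third-party `fastavro` library. The schema and 
field names below are illustrative only and are not PyIceberg's actual 
`DataFile` layout; the point is that a fixed Avro schema gives a compact, 
self-describing payload for shipping file metadata between workers, without 
the fragility of pickling a Python class.

   ```python
   import io

   import fastavro  # third-party: pip install fastavro

   # Hypothetical, simplified schema -- NOT the real Iceberg DataFile schema.
   DATA_FILE_SCHEMA = {
       "type": "record",
       "name": "DataFile",
       "fields": [
           {"name": "file_path", "type": "string"},
           {"name": "file_format", "type": "string"},
           {"name": "record_count", "type": "long"},
           {"name": "file_size_in_bytes", "type": "long"},
       ],
   }


   def serialize(files: list[dict]) -> bytes:
       """Encode data-file records as an Avro container in memory."""
       buf = io.BytesIO()
       fastavro.writer(buf, DATA_FILE_SCHEMA, files)
       return buf.getvalue()


   def deserialize(payload: bytes) -> list[dict]:
       """Decode the Avro payload back into plain dict records."""
       return list(fastavro.reader(io.BytesIO(payload)))


   files = [
       {
           "file_path": "s3://bucket/data/00000.parquet",
           "file_format": "PARQUET",
           "record_count": 100,
           "file_size_in_bytes": 2048,
       }
   ]
   payload = serialize(files)
   assert deserialize(payload) == files
   ```

   Because the worker-to-coordinator payload is just bytes, the same approach 
works over any transport, and the schema evolves explicitly rather than 
implicitly through the Python class definition.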


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

