Fokko commented on PR #1742: URL: https://github.com/apache/iceberg-python/pull/1742#issuecomment-2693590912
I understand that this is a streaming workload? In that case, writing the manifest doesn't help much. I understand the problem. Let me think out loud: the problem with the current approach (in this PR) is that when the schema or partition spec changes, `add_files` will compare the files against the latest schema. While it is perfectly fine to commit files with an older schema to the table, I think `add_files` will not allow this.

Another concern is around precedence. For append operations this works perfectly fine, but when we want to do distributed rewrites or deletes, the order of operations is very important.

> We needed to build it dynamically, so we deal with changes to the `DataFile` class. If we could add a method to `DataFile` for serialize/deserialize, the logic could be much simpler.

The most obvious way of serializing it is by using Avro. This is also efficient over the wire (I expect it to be much smaller than jsonpickle or regular pickle). I would be in favor of having this in combination with `append_data_file` because it is much more robust. This would also play very well with the logic suggested in https://github.com/apache/iceberg-python/issues/1678, resulting in far fewer conflicts.
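To make the idea concrete, here is a minimal sketch (not pyiceberg's actual API) of round-tripping a `DataFile`-like record through Avro using `fastavro`. The wire schema, field subset, and helper names are hypothetical; a real implementation would derive the schema from the `DataFile` class itself:

```python
# Sketch: workers serialize DataFile-like records to compact Avro bytes,
# the committing driver deserializes them and appends each file in one commit
# (e.g. via the proposed `append_data_file`). Uses fastavro for illustration.
import io

import fastavro

# Hypothetical wire schema covering a few DataFile fields.
DATA_FILE_SCHEMA = fastavro.parse_schema({
    "type": "record",
    "name": "DataFile",
    "fields": [
        {"name": "file_path", "type": "string"},
        {"name": "file_format", "type": "string"},
        {"name": "record_count", "type": "long"},
        {"name": "file_size_in_bytes", "type": "long"},
    ],
})


def serialize(data_file: dict) -> bytes:
    """Encode one record to schemaless Avro bytes (no container header)."""
    buf = io.BytesIO()
    fastavro.schemaless_writer(buf, DATA_FILE_SCHEMA, data_file)
    return buf.getvalue()


def deserialize(payload: bytes) -> dict:
    """Decode Avro bytes back into a record on the committing driver."""
    return fastavro.schemaless_reader(io.BytesIO(payload), DATA_FILE_SCHEMA)


payload = serialize({
    "file_path": "s3://bucket/data/00000-0.parquet",
    "file_format": "PARQUET",
    "record_count": 1000,
    "file_size_in_bytes": 4096,
})
assert deserialize(payload)["file_path"] == "s3://bucket/data/00000-0.parquet"
```

Because the schema travels with (or is pinned alongside) the payload, the bytes stay small and the format is stable across `DataFile` class changes, unlike pickle-based approaches that are tied to the Python class layout.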