Fokko commented on code in PR #473: URL: https://github.com/apache/iceberg-python/pull/473#discussion_r1508081821
########## mkdocs/docs/configuration.md: ########## @@ -249,3 +249,7 @@ catalog: # Concurrency PyIceberg uses multiple threads to parallelize operations. The number of workers can be configured by supplying a `max-workers` entry in the configuration file, or by setting the `PYICEBERG_MAX_WORKERS` environment variable. The default value depends on the system hardware and Python version. See [the Python documentation](https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor) for more details. + +# Backward Compatibility + +Previous versions of Java implementations incorrectly assumes the optional attribute `current-snapshot-id` to be a required attribute in the TableMetadata. Which means that if `current-snapshot-id` is missing in the metadata file (e.g. on table creation), the application will throw an exception without being able to load the table. This assumption has been corrected in more recent Iceberg versions. However, it is possible to force PyIceberg to create table with a metadata file that will be compatible with previous versions. This can be configured by setting the `legacy-current-snapshot-id` entry as "True" in the configuration file, or by setting the `LEGACY_CURRENT_SNAPSHOT_ID` environment variable. Refer to the [PR discussion](https://github.com/apache/iceberg-python/pull/473) for more details on the issue Review Comment: ```suggestion Previous versions of Java (`<1.4.0`) implementations incorrectly assume the optional attribute `current-snapshot-id` to be a required attribute in TableMetadata. This means that if `current-snapshot-id` is missing in the metadata file (e.g. on table creation), the application will throw an exception without being able to load the table. This assumption has been corrected in more recent Iceberg versions. However, it is possible to force PyIceberg to create a table with a metadata file that will be compatible with previous versions. This can be configured by setting the `legacy-current-snapshot-id` entry as "True" in the configuration file, or by setting the `LEGACY_CURRENT_SNAPSHOT_ID` environment variable. Refer to the [PR discussion](https://github.com/apache/iceberg-python/pull/473) for more details on the issue ``` ########## pyiceberg/serializers.py: ########## @@ -127,6 +129,11 @@ def table_metadata(metadata: TableMetadata, output_file: OutputFile, overwrite: overwrite (bool): Where to overwrite the file if it already exists. Defaults to `False`. """ with output_file.create(overwrite=overwrite) as output_stream: - json_bytes = metadata.model_dump_json().encode(UTF8) + model_dump = metadata.model_dump_json() + if Config().get_bool("legacy-current-snapshot-id") and metadata.current_snapshot_id is None: + model_dict = json.loads(model_dump) + model_dict[CURRENT_SNAPSHOT_ID] = -1 + model_dump = json.dumps(model_dict) Review Comment: I don't think this is the best place to fix this. Mostly because we have to deserialize and serialize the metadata, and the rest of the deserialization logic is part of the Pydantic model. I think option 1 is much cleaner. We can set the ignore-None to `False`: ```python @staticmethod def table_metadata(metadata: TableMetadata, output_file: OutputFile, overwrite: bool = False) -> None: """Write a TableMetadata instance to an output file. Args: output_file (OutputFile): A custom implementation of the iceberg.io.file.OutputFile abstract base class. overwrite (bool): Where to overwrite the file if it already exists. Defaults to `False`. """ with output_file.create(overwrite=overwrite) as output_stream: json_bytes = metadata.model_dump_json(exclude_none=False).encode(UTF8) json_bytes = Compressor.get_compressor(output_file.location).bytes_compressor()(json_bytes) output_stream.write(json_bytes) ``` ########## pyiceberg/utils/config.py: ########## @@ -154,3 +155,19 @@ def get_catalog_config(self, catalog_name: str) -> Optional[RecursiveDict]: assert isinstance(catalog_conf, dict), f"Configuration path catalogs.{catalog_name_lower} needs to be an object" return catalog_conf return None + Review Comment: Nice, thanks for cleaning this up 👍 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org