Re: [PR] [Bug Fix] cast None `current-snapshot-id` as -1 for Backwards Compatibility [iceberg-python]

via GitHub Thu, 29 Feb 2024 12:48:57 -0800


Fokko commented on code in PR #473:
URL: https://github.com/apache/iceberg-python/pull/473#discussion_r1508081821



##########
mkdocs/docs/configuration.md:
##########
@@ -249,3 +249,7 @@ catalog:
 # Concurrency
 
 PyIceberg uses multiple threads to parallelize operations. The number of 
workers can be configured by supplying a `max-workers` entry in the 
configuration file, or by setting the `PYICEBERG_MAX_WORKERS` environment 
variable. The default value depends on the system hardware and Python version. 
See [the Python 
documentation](https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor)
 for more details.
+
+# Backward Compatibility
+
+Previous versions of Java implementations incorrectly assumes the optional 
attribute `current-snapshot-id` to be a required attribute in the 
TableMetadata. Which means that if `current-snapshot-id` is missing in the 
metadata file (e.g. on table creation), the application will throw an exception 
without being able to load the table. This assumption has been corrected in 
more recent Iceberg versions. However, it is possible to force PyIceberg to 
create table with a metadata file that will be compatible with previous 
versions. This can be configured by setting the `legacy-current-snapshot-id` 
entry as "True" in the configuration file, or by setting the 
`LEGACY_CURRENT_SNAPSHOT_ID` environment variable. Refer to the [PR 
discussion](https://github.com/apache/iceberg-python/pull/473) for more details 
on the issue

Review Comment:
   ```suggestion
   Previous versions of Java (`<1.4.0`) implementations incorrectly assume the 
optional attribute `current-snapshot-id` to be a required attribute in 
TableMetadata. This means that if `current-snapshot-id` is missing in the 
metadata file (e.g. on table creation), the application will throw an exception 
without being able to load the table. This assumption has been corrected in 
more recent Iceberg versions. However, it is possible to force PyIceberg to 
create a table with a metadata file that will be compatible with previous 
versions. This can be configured by setting the `legacy-current-snapshot-id` 
entry as "True" in the configuration file, or by setting the 
`LEGACY_CURRENT_SNAPSHOT_ID` environment variable. Refer to the [PR 
discussion](https://github.com/apache/iceberg-python/pull/473) for more details 
on the issue
   ```



##########
pyiceberg/serializers.py:
##########
@@ -127,6 +129,11 @@ def table_metadata(metadata: TableMetadata, output_file: 
OutputFile, overwrite:
             overwrite (bool): Where to overwrite the file if it already 
exists. Defaults to `False`.
         """
         with output_file.create(overwrite=overwrite) as output_stream:
-            json_bytes = metadata.model_dump_json().encode(UTF8)
+            model_dump = metadata.model_dump_json()
+            if Config().get_bool("legacy-current-snapshot-id") and 
metadata.current_snapshot_id is None:
+                model_dict = json.loads(model_dump)
+                model_dict[CURRENT_SNAPSHOT_ID] = -1
+                model_dump = json.dumps(model_dict)

Review Comment:
   I don't think this is the best place to fix this. Mostly because we have to 
deserialize and serialize the metadata, and the rest of the deserialization 
logic is part of the Pydantic model. I think option 1 is much cleaner. We can 
set the ignore-None to `False`:
   
   ```python
       @staticmethod
       def table_metadata(metadata: TableMetadata, output_file: OutputFile, 
overwrite: bool = False) -> None:
           """Write a TableMetadata instance to an output file.
   
           Args:
               output_file (OutputFile): A custom implementation of the 
iceberg.io.file.OutputFile abstract base class.
               overwrite (bool): Where to overwrite the file if it already 
exists. Defaults to `False`.
           """
           with output_file.create(overwrite=overwrite) as output_stream:
               json_bytes = 
metadata.model_dump_json(exclude_none=False).encode(UTF8)
               json_bytes = 
Compressor.get_compressor(output_file.location).bytes_compressor()(json_bytes)
               output_stream.write(json_bytes)
   ```



##########
pyiceberg/utils/config.py:
##########
@@ -154,3 +155,19 @@ def get_catalog_config(self, catalog_name: str) -> 
Optional[RecursiveDict]:
                 assert isinstance(catalog_conf, dict), f"Configuration path 
catalogs.{catalog_name_lower} needs to be an object"
                 return catalog_conf
         return None
+

Review Comment:
   Nice, thanks for cleaning this up 👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] [Bug Fix] cast None `current-snapshot-id` as -1 for Backwards Compatibility [iceberg-python]

Reply via email to