kevinjqliu commented on issue #1589: URL: https://github.com/apache/iceberg-python/issues/1589#issuecomment-2646078141
Thanks for chiming in here! I took some time to look into this. Here's what I found: ### On Catalog Configuration The configuration precedence is described in the REST spec for the `/config` endpoint. Based on the REST spec, here is the expected behavior for catalog configurations. https://github.com/apache/iceberg/blob/d935460bbed5da0ac205a67912b70fa682ee84df/open-api/rest-catalog-open-api.yaml#L81-L98 Mainly, ``` Catalog configuration is constructed by setting the defaults, then client- provided configuration, and finally overrides. The final property set is then used to configure the catalog. ``` The REST catalog implementation does indeed follow this order. https://github.com/apache/iceberg-python/blob/826a00628216993de50f7f0c2111b6c824b02939/pyiceberg/catalog/rest.py#L397-L400 So when Nessie [added the feature](https://github.com/projectnessie/nessie/pull/9868/files#diff-35b0e3dabf006d8b221a85dbed21058b2a79856d533015afa740d565514a6eb3R267) to send `py-io-impl=pyiceberg.io.fsspec.FsspecFileIO` using the configurations endpoint (`/config`), it is sent as part of the configurations endpoint's `default` properties. This means the properties specified in `load_catalog` should override this value if it is provided. And the server can use `override` properties to trump over the `load_catalog` properties. To summarize: * Client can override `/config`'s `default` parameters * `config`'s `override` parameters can override client parameters ### On Table Configuration Here's the REST spec for the `loadTable` endpoint https://github.com/apache/iceberg/blob/d935460bbed5da0ac205a67912b70fa682ee84df/open-api/rest-catalog-open-api.yaml#L920-L925 ``` Load a table from the catalog. The response contains both configuration and table metadata. The configuration, if non-empty is used as additional configuration for the table that overrides catalog configuration. For example, this configuration may change the FileIO implementation to be used for the table. ``` This corresponds with https://github.com/apache/iceberg-python/blob/826a00628216993de50f7f0c2111b6c824b02939/pyiceberg/catalog/rest.py#L527-L531 `self._load_file_io` here is defined in the base catalog class and will include the catalog's properties https://github.com/apache/iceberg-python/blob/826a00628216993de50f7f0c2111b6c824b02939/pyiceberg/catalog/__init__.py#L728 Expanding the parameters ``` {**self.properties , {**table_response.metadata.properties, **table_response.config}} -> {**{**config_response.defaults, **self.properties, **config_response.overrides}, {**table_response.metadata.properties, **table_response.config}} -> **{config_response.defaults, self.properties, config_response.overrides, table_response.metadata.properties, table_response.config} ``` where `config_response` is from `/config` endpoint, `self.properties` is the properties given to `load_catalog`, and `table_response` is from `loadTable`. Note, `table_response.metadata.properties` should be the same as the table's own property. ### Conclusion I think the real culprit is from [Nessie `icebergConfigDefaults` setting the `py-io-impl` config](https://github.com/projectnessie/nessie/pull/9868/files#diff-35b0e3dabf006d8b221a85dbed21058b2a79856d533015afa740d565514a6eb3R267). `icebergConfigDefaults` is called by [`configureIcebergTable`](https://github.com/projectnessie/nessie/blob/9ae0f670a8a67aa573612cde4f547aac0b61f112/catalog/files/impl/src/main/java/org/projectnessie/catalog/files/s3/S3ObjectIO.java#L174-L190) which is called by [`icebergConfigPerTable`](https://github.com/projectnessie/nessie/blob/main/catalog/service/rest/src/main/java/org/projectnessie/catalog/service/rest/IcebergConfigurer.java#L216-L255) and then finally called by [loadTable](https://github.com/projectnessie/nessie/blob/main/catalog/service/rest/src/main/java/org/projectnessie/catalog/service/rest/IcebergApiV1TableResource.java#L276-L298) This means every `loadTable` response from Nessie will include the config to override `py-io-impl`. And based on the hierarchy above, this will always override any parameters set by the config endpoint, the catalog itself, and the table's property. In conclusion, pyiceberg REST client is working as intended. Nessie should not send `py-io-impl` config override as part of the loadTable response. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org