bigluck commented on issue #1589: URL: https://github.com/apache/iceberg-python/issues/1589#issuecomment-2622544404
Thanks @kevinjqliu , and happy new year too :) I see your point, and I see how the server should always control the clients, but from my PoV the client should always have a way to overwrite these defaults, otherwise you can't mix different server implementation together by using a single data catalog. The current situation is kind of confusing. ``` catalog = load_catalog( "docs", **{ "uri": "http://127.0.0.1:8181", "s3.endpoint": "http://127.0.0.1:9000", "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO", "s3.access-key-id": "admin", "s3.secret-access-key": "password", } ) ``` In this example `py-io-impl` seems not to be a "client config" but more a "catalog config", but `uri` is technically a "client config" and in any case both of them are stored on the same property: `self.properties`. Once started the Catalog REST API class send a request to the `/config` endpoint by invoking the `_fetch_config()` function, and it does: ```python config = config_response.defaults config.update(self.properties) config.update(config_response.overrides) self.properties = config ``` It takes the `defaults` from the remote server, it extend them with `self.properties` (my `uri, s3.endpoint, py-io-impl, s3.access-key-id, s3.secret-access-key`) and it overrides the final config with `overrides`. That's fine :) In this case if the server wants to override the `py-io-impl` it can, by setting the corresponding overrides key, otherwise the user wins against the remote config. What's weird is that these properties are not used when we to a `load_table`: ```python identifier_tuple = self._identifier_to_tuple_without_catalog(identifier) response = self._session.get( self.url(Endpoints.load_table, prefixed=True, **self._split_identifier_for_path(identifier_tuple)) ) try: response.raise_for_status() except HTTPError as exc: self._handle_non_200_response(exc, {404: NoSuchTableError}) table_response = TableResponse(**response.json()) return self._response_to_table(identifier_tuple, table_response) ``` we get the table data from the remote server, we ignore the local configuration and we create a new table object: ```python def _response_to_table(self, identifier_tuple: Tuple[str, ...], table_response: TableResponse) -> Table: return Table( identifier=identifier_tuple, metadata_location=table_response.metadata_location, # type: ignore metadata=table_response.metadata, io=self._load_file_io( {**table_response.metadata.properties, **table_response.config}, table_response.metadata_location ), catalog=self, config=table_response.config, ) ``` This is what I think it's not ok: ``` {**table_response.metadata.properties, **table_response.config} ``` We should have a way to overwrite both the table properties and the table config, because the client knows what it can use and what it can't. I think there are ~3 options: 1. `{**table_response.metadata.properties, **table_response.config, **self.properties}` I don't like it because we propagate everything 2. When the user invoke `load_table()` she can optionally set list of properties that will overwrite the remote properties: ```python load_table('my.table', **{"py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO"}) ``` 3. Similarly to 1, but with a limited set of properties. In other words we set a configurable list of porperties that needs to be propagated to the Table object: ```python {**table_response.metadata.properties, **table_response.config, **{k: v for k, v in self.properties.items() where k in self.white_list_of_keys}} -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org