bigluck commented on issue #1589:
URL: https://github.com/apache/iceberg-python/issues/1589#issuecomment-2622544404
Thanks @kevinjqliu , and happy new year too :)
I see your point, and I understand why the server should always control the
clients, but from my PoV the client should always have a way to override these
defaults, otherwise you can't mix different server implementations through a
single data catalog.
The current situation is kind of confusing.
```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "docs",
    **{
        "uri": "http://127.0.0.1:8181",
        "s3.endpoint": "http://127.0.0.1:9000",
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
    }
)
```
In this example `py-io-impl` looks less like a "client config" and more like a
"catalog config", while `uri` is technically a "client config"; in any case,
both of them end up in the same attribute: `self.properties`.
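For illustration (reusing the `load_catalog()` example above; the exact attribute layout may differ between releases), both kinds of settings sit side by side:

```python
# Rough illustration: transport-level and IO-level settings live in the same dict.
print(catalog.properties["uri"])         # 'http://127.0.0.1:8181'              -> "client" config
print(catalog.properties["py-io-impl"])  # 'pyiceberg.io.pyarrow.PyArrowFileIO' -> "catalog"/IO config
```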
Once started, the REST catalog class sends a request to the `/config` endpoint
by invoking `_fetch_config()`, which does:
```python
config = config_response.defaults
config.update(self.properties)
config.update(config_response.overrides)
self.properties = config
```
It takes the `defaults` from the remote server, extends them with
`self.properties` (my `uri`, `s3.endpoint`, `py-io-impl`, `s3.access-key-id`,
`s3.secret-access-key`) and then applies the server's `overrides` on top of the final config.
That's fine :)
In this case, if the server wants to override `py-io-impl` it can, by setting
the corresponding key in `overrides`; otherwise the user's value wins over the
remote config.
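To make the precedence concrete, here is a minimal standalone sketch of that merge (the dict values are made up for illustration):

```python
# Hypothetical values, only to illustrate the precedence implemented by _fetch_config()
defaults = {"py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO", "warehouse": "s3://server-default"}  # from the server
client = {"py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO", "uri": "http://127.0.0.1:8181"}      # my load_catalog() kwargs
overrides = {"warehouse": "s3://server-forced"}                                                    # from the server

config = defaults.copy()
config.update(client)     # client settings beat the server defaults
config.update(overrides)  # server overrides beat everything

assert config["py-io-impl"] == "pyiceberg.io.pyarrow.PyArrowFileIO"  # user wins: no override set
assert config["warehouse"] == "s3://server-forced"                   # server wins via overrides
```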
What's weird is that these properties are not used when we do a `load_table`:
```python
identifier_tuple = self._identifier_to_tuple_without_catalog(identifier)
response = self._session.get(
    self.url(Endpoints.load_table, prefixed=True, **self._split_identifier_for_path(identifier_tuple))
)
try:
    response.raise_for_status()
except HTTPError as exc:
    self._handle_non_200_response(exc, {404: NoSuchTableError})
table_response = TableResponse(**response.json())
return self._response_to_table(identifier_tuple, table_response)
```
We get the table data from the remote server, ignore the local configuration,
and create a new Table object:
```python
def _response_to_table(self, identifier_tuple: Tuple[str, ...], table_response: TableResponse) -> Table:
    return Table(
        identifier=identifier_tuple,
        metadata_location=table_response.metadata_location,  # type: ignore
        metadata=table_response.metadata,
        io=self._load_file_io(
            {**table_response.metadata.properties, **table_response.config}, table_response.metadata_location
        ),
        catalog=self,
        config=table_response.config,
    )
```
This is the part I think is not OK:
```python
{**table_response.metadata.properties, **table_response.config}
```
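To make the consequence explicit, here is a small sketch with hypothetical values: whatever the client passed to `load_catalog()` never reaches this merge, so the server's table config silently decides the FileIO:

```python
# Hypothetical values, for illustration only
self_properties = {"py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO"}  # what the client asked for
table_properties = {"write.parquet.compression-codec": "zstd"}          # table_response.metadata.properties
table_config = {"py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO"}       # table_response.config

io_config = {**table_properties, **table_config}  # what _load_file_io() actually receives
assert io_config["py-io-impl"] == "pyiceberg.io.fsspec.FsspecFileIO"    # the client's choice was ignored
```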
We should have a way to override both the table properties and the table
config, because the client knows what it can use and what it can't.
I think there are ~3 options:
1. `{**table_response.metadata.properties, **table_response.config, **self.properties}`
   I don't like this because we propagate everything.
2. When the user invokes `load_table()`, she can optionally pass a set of
properties that will override the remote properties:
```python
load_table('my.table', **{"py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO"})
```
3. Similar to 1, but with a limited set of properties. In other words, we
set a configurable list of properties that need to be propagated to the Table
object:
```python
{
    **table_response.metadata.properties,
    **table_response.config,
    **{k: v for k, v in self.properties.items() if k in self.white_list_of_keys},
}
```
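As a rough sketch of how option 3 could look inside `_response_to_table()` (note: `white_list_of_keys` is a hypothetical new setting, not something that exists today):

```python
# Sketch only: propagate a limited, configurable allow-list of client properties
# into the FileIO config, after the table properties and the server-side config.
allowed = self.white_list_of_keys  # hypothetical, e.g. {"py-io-impl", "s3.endpoint", "s3.access-key-id", "s3.secret-access-key"}

io = self._load_file_io(
    {
        **table_response.metadata.properties,
        **table_response.config,
        **{k: v for k, v in self.properties.items() if k in allowed},
    },
    table_response.metadata_location,
)
```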