bigluck commented on issue #1589:
URL: https://github.com/apache/iceberg-python/issues/1589#issuecomment-2622544404
Thanks @kevinjqliu , and happy new year too :)
I see your point, and I understand why the server should always control the
clients, but from my PoV the client should always have a way to override these
defaults, otherwise you can't mix different server implementations through a
single data catalog.
The current situation is kind of confusing.
```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "docs",
    **{
        "uri": "http://127.0.0.1:8181",
        "s3.endpoint": "http://127.0.0.1:9000",
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
    }
)
```
In this example `py-io-impl` looks less like a "client config" and more like a
"catalog config", while `uri` is technically a "client config"; in any case,
both of them end up in the same attribute: `self.properties`.
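For illustration (reusing the `load_catalog()` example above; the exact attribute layout may differ between releases), both kinds of settings sit side by side:

```python
# Rough illustration: transport-level and IO-level settings live in the same dict.
print(catalog.properties["uri"])         # 'http://127.0.0.1:8181'              -> "client" config
print(catalog.properties["py-io-impl"])  # 'pyiceberg.io.pyarrow.PyArrowFileIO' -> "catalog"/IO config
```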
Once started, the REST catalog class sends a request to the `/config` endpoint
by invoking `_fetch_config()`, which does:
```python
config = config_response.defaults
config.update(self.properties)
config.update(config_response.overrides)
self.properties = config
```
It takes the `defaults` from the remote server, extends them with
`self.properties` (my `uri`, `s3.endpoint`, `py-io-impl`, `s3.access-key-id`,
`s3.secret-access-key`) and then applies the server's `overrides` on top of the final config.
That's fine :)
In this case, if the server wants to override `py-io-impl` it can, by setting
the corresponding key in `overrides`; otherwise the user's value wins over the
remote config.
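To make the precedence concrete, here is a minimal standalone sketch of that merge (the dict values are made up for illustration):

```python
# Hypothetical values, only to illustrate the precedence implemented by _fetch_config()
defaults = {"py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO", "warehouse": "s3://server-default"}  # from the server
client = {"py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO", "uri": "http://127.0.0.1:8181"}      # my load_catalog() kwargs
overrides = {"warehouse": "s3://server-forced"}                                                    # from the server

config = defaults.copy()
config.update(client)     # client settings beat the server defaults
config.update(overrides)  # server overrides beat everything

assert config["py-io-impl"] == "pyiceberg.io.pyarrow.PyArrowFileIO"  # user wins: no override set
assert config["warehouse"] == "s3://server-forced"                   # server wins via overrides
```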
What's weird is that these properties are not used when we do a `load_table`:
```python
identifier_tuple = self._identifier_to_tuple_without_catalog(identifier)
response = self._session.get(
    self.url(Endpoints.load_table, prefixed=True, **self._split_identifier_for_path(identifier_tuple))
)
try:
    response.raise_for_status()
except HTTPError as exc:
    self._handle_non_200_response(exc, {404: NoSuchTableError})
table_response = TableResponse(**response.json())
return self._response_to_table(identifier_tuple, table_response)
```
We get the table data from the remote server, ignore the local configuration,
and create a new Table object:
```python
def _response_to_table(self, identifier_tuple: Tuple[str, ...], table_response: TableResponse) -> Table:
    return Table(
        identifier=identifier_tuple,
        metadata_location=table_response.metadata_location,  # type: ignore
        metadata=table_response.metadata,
        io=self._load_file_io(
            {**table_response.metadata.properties, **table_response.config}, table_response.metadata_location
        ),
        catalog=self,
        config=table_response.config,
    )
```
This is the part I think is not OK:
```python
{**table_response.metadata.properties, **table_response.config}
```
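To make the consequence explicit, here is a small sketch with hypothetical values: whatever the client passed to `load_catalog()` never reaches this merge, so the server's table config silently decides the FileIO:

```python
# Hypothetical values, for illustration only
self_properties = {"py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO"}  # what the client asked for
table_properties = {"write.parquet.compression-codec": "zstd"}          # table_response.metadata.properties
table_config = {"py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO"}       # table_response.config

io_config = {**table_properties, **table_config}  # what _load_file_io() actually receives
assert io_config["py-io-impl"] == "pyiceberg.io.fsspec.FsspecFileIO"    # the client's choice was ignored
```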
We should have a way to override both the table properties and the table
config, because the client knows what it can use and what it can't.
I think there are ~3 options:
1. `{**table_response.metadata.properties, **table_response.config, **self.properties}`
   I don't like this because we propagate everything.
2. When the user invokes `load_table()`, she can optionally pass a set of
properties that will override the remote properties:
```python
load_table('my.table', **{"py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO"})
```
3. Similar to 1, but with a limited set of properties. In other words, we
set a configurable list of properties that need to be propagated to the Table
object:
```python
{
    **table_response.metadata.properties,
    **table_response.config,
    **{k: v for k, v in self.properties.items() if k in self.white_list_of_keys},
}
```
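As a rough sketch of how option 3 could look inside `_response_to_table()` (note: `white_list_of_keys` is a hypothetical new setting, not something that exists today):

```python
# Sketch only: propagate a limited, configurable allow-list of client properties
# into the FileIO config, after the table properties and the server-side config.
allowed = self.white_list_of_keys  # hypothetical, e.g. {"py-io-impl", "s3.endpoint", "s3.access-key-id", "s3.secret-access-key"}

io = self._load_file_io(
    {
        **table_response.metadata.properties,
        **table_response.config,
        **{k: v for k, v in self.properties.items() if k in allowed},
    },
    table_response.metadata_location,
)
```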