bigluck opened a new issue, #1589: URL: https://github.com/apache/iceberg-python/issues/1589
### Apache Iceberg version

None

### Please describe the bug 🐞

Hey team! 👋

Hope you're doing well! I've been working with PyIceberg and ran into an interesting situation regarding FileIO implementation configurations.

TL;DR: it seems that `pyiceberg` does not allow locally overriding some of the configuration returned by the remote Iceberg REST API. I would like to understand whether this is intended and how these settings can be overridden.

### Context

- The REST API server (`/config` endpoint) is configured to use `PyArrow` as the default `FileIO` implementation
- The table endpoint returns a configuration specifying a different `FileIO` implementation (`fsspec`)
- I want to force `PyArrow` locally since I don't have `fsspec/s3fs` installed (and prefer using `PyArrow`)

Starting from [PR 9868](https://github.com/projectnessie/nessie/pull/9868), the `Nessie` Iceberg endpoint always returns `config.py-io-impl=pyiceberg.io.fsspec.FsspecFileIO` when querying a table endpoint.

While there might be a separate issue in Nessie's implementation (as my server is configured to override the `py-io-impl` with `pyarrow.PyArrowFileIO`), I believe there's also a concern in PyIceberg's handling of configuration priorities.

For more info:

- [My chat with the Nessie team on zulip](https://project-nessie.zulipchat.com/#narrow/channel/371187-general/topic/How.20to.20use.20PyarrowFileIO.20on.20pyiceberg.3F/near/494545947)
- their original [PR 9868](https://github.com/projectnessie/nessie/pull/9868)
- the temporary workaround the Nessie team is now implementing: [PR 10292](https://github.com/projectnessie/nessie/pull/10292)

### Current Behavior

I've created a test case that explores different configuration scenarios.
Here's what I'm observing:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("docs", **{"uri": "https://a.b.c.d/iceberg", "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO"})
table = catalog.load_table("my_namespace.my_table")
```

This is similar to what is described in the official documentation: https://github.com/apache/iceberg-python/blob/1adbb87627bfdfe80622d78c57b6214957520be0/mkdocs/docs/api.md?plain=1#L56-L70

The table configuration from the REST API seems to take precedence over both:

- The server's default configuration
- Local overrides passed during catalog initialization

This leads to failures when the table endpoint specifies `fsspec.FsspecFileIO` but `s3fs` isn't available locally:

```
ValueError: Could not initialize FileIO: pyiceberg.io.fsspec.FsspecFileIO
```

### Real-world Impact

This configuration priority issue creates practical problems in multi-system setups. Consider this scenario:

- System A uses `fsspec` for writing tables into Nessie/Iceberg
- System B needs to read the same tables using `PyArrow`

With the current implementation, System B can never successfully read the tables because:

- The server forces the client to use `fsspec`
- This happens even when the client explicitly requests `PyArrow`
- There's no way to override this behavior at the client level

### Question

Is there a way for PyIceberg to use a specific FileIO implementation regardless of what the table endpoint or the server returns? This would be particularly useful in scenarios where:

- The client environment is set up for a specific implementation
- Different `FileIO` implementations might be more efficient in certain environments
- Required dependencies for the server-specified implementation aren't available locally

I've attached a test file that demonstrates the behavior.

Would love to hear your thoughts on this! Is this the intended behavior? If so, could we perhaps consider adding a way to override the table-level FileIO implementations?
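To make the precedence question concrete, here is a minimal sketch in plain Python. It is a simplified model of the behavior described above, not PyIceberg's actual code: the helper names `merge_server_wins` and `merge_client_wins` and the `pinned_keys` parameter are hypothetical and exist only for illustration. It assumes the effective properties come from a dict merge in which later sources override earlier ones.

```python
from typing import Dict, Iterable

PY_IO_IMPL = "py-io-impl"


def merge_server_wins(catalog_props: Dict[str, str], table_config: Dict[str, str]) -> Dict[str, str]:
    """Model of the observed behavior: the table response is merged last, so it wins."""
    return {**catalog_props, **table_config}


def merge_client_wins(
    catalog_props: Dict[str, str],
    table_config: Dict[str, str],
    pinned_keys: Iterable[str] = (PY_IO_IMPL,),
) -> Dict[str, str]:
    """Hypothetical alternative: keys set explicitly by the client are kept as-is."""
    merged = {**catalog_props, **table_config}
    for key in pinned_keys:
        if key in catalog_props:
            # Restore the client's explicit choice for this key.
            merged[key] = catalog_props[key]
    return merged


# Local properties passed to load_catalog (values taken from the example above).
catalog_props = {
    "uri": "https://a.b.c.d/iceberg",
    PY_IO_IMPL: "pyiceberg.io.pyarrow.PyArrowFileIO",
}
# Config returned by the table endpoint, as described in this issue.
table_config = {PY_IO_IMPL: "pyiceberg.io.fsspec.FsspecFileIO"}

observed = merge_server_wins(catalog_props, table_config)
desired = merge_client_wins(catalog_props, table_config)
```

In this model `observed` ends up with the `fsspec` implementation while `desired` keeps the locally requested `PyArrow` one. If the client does not set `py-io-impl` at all, both helpers fall back to whatever the table endpoint returns.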
Thanks

[test.txt](https://github.com/user-attachments/files/18584986/test.txt)

### Willingness to contribute

- [ ] I can contribute a fix for this bug independently
- [ ] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org