[I] FileIO Implementation Configuration Priority Question [iceberg-python]

via GitHub Wed, 29 Jan 2025 01:22:29 -0800


bigluck opened a new issue, #1589:
URL: https://github.com/apache/iceberg-python/issues/1589


   ### Apache Iceberg version
   
   None
   
   ### Please describe the bug 🐞
   
   Hey team! 👋 Hope you're doing well!
   I've been working with PyIceberg and ran into an interesting situation 
regarding FileIO implementation configurations.
   
   TL;DR: it seems that `pyiceberg` does not let me override locally some of the 
configuration values returned by the remote Iceberg REST API. I would like to 
understand whether this is intended and, if so, how these values can be 
overridden.
   
   ### Context
   
   - The REST API server (`/config` endpoint) is configured to use `PyArrow` as 
the default `FileIO` implementation
   - The table endpoint returns a configuration specifying a different `FileIO` 
implementation (`fsspec`)
   - I want to force `PyArrow` locally since I don't have `fsspec/s3fs` 
installed (and prefer using `PyArrow`)
   
   Starting from [PR 9868](https://github.com/projectnessie/nessie/pull/9868), 
the `Nessie` Iceberg endpoint always returns 
`config.py-io-impl=pyiceberg.io.fsspec.FsspecFileIO` when querying a table 
endpoint. While there might be a separate issue in Nessie's implementation (as 
my server is configured to override the `py-io-impl` with 
`pyarrow.PyArrowFileIO`), I believe there's also a concern in PyIceberg's 
handling of configuration priorities.
   
   For more info:
   - [My chat with the Nessie team on 
zulip](https://project-nessie.zulipchat.com/#narrow/channel/371187-general/topic/How.20to.20use.20PyarrowFileIO.20on.20pyiceberg.3F/near/494545947)
   - their original [PR 9868](https://github.com/projectnessie/nessie/pull/9868)
   - the temporary workaround the Nessie team is implementing now: [PR 
10292](https://github.com/projectnessie/nessie/pull/10292)
   
   ### Current Behavior
   
   I've created a test case that explores different configuration scenarios. 
Here's what I'm observing:
   
   ```python
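   from pyiceberg.catalog import load_catalog

   # Local override: ask for PyArrowFileIO when creating the catalog.
   props = {"uri": "https://a.b.c.d/iceberg", "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO"}
   catalog = load_catalog("docs", **props)
   table = catalog.load_table("my_namespace.my_table")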
   ```
   
   This is similar to what is described in the official documentation:
   
   
https://github.com/apache/iceberg-python/blob/1adbb87627bfdfe80622d78c57b6214957520be0/mkdocs/docs/api.md?plain=1#L56-L70
   
   The table configuration from the REST API seems to take precedence over both:
   - The server's default configuration
   - Local overrides passed during catalog initialization
   
   This leads to failures when the table endpoint specifies 
`fsspec.FsspecFileIO` but `s3fs` isn't available locally:
   
   ```
   ValueError: Could not initialize FileIO: pyiceberg.io.fsspec.FsspecFileIO
   ```
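
To make the precedence concrete, here is a minimal plain-Python sketch of the behavior I am describing. It is only a model built from dictionaries: the `merge` helper and the three layer names are hypothetical and are not PyIceberg's actual code or API. It shows the order I observe (the table config applied last) next to the order I expected (the local overrides applied last).

```python
# Hypothetical model of the configuration layers; not PyIceberg's real code.
server_defaults = {"py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO"}  # from /config
local_overrides = {"py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO"}  # passed to load_catalog
table_config = {"py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO"}       # from the table endpoint


def merge(*layers):
    """Merge property layers; later layers override earlier ones."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged


# Observed: the table config is applied last, so it wins.
observed = merge(server_defaults, local_overrides, table_config)

# Expected: the local overrides are applied last, so the client wins.
expected = merge(server_defaults, table_config, local_overrides)

print("observed:", observed["py-io-impl"])
print("expected:", expected["py-io-impl"])
```

In this model the only difference between the two outcomes is the position of `local_overrides` in the merge order.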
   
   ### Real-world Impact
   
   This configuration priority issue creates practical problems in multi-system 
setups. Consider this scenario:
   
   - System A uses `fsspec` for writing tables into Nessie/Iceberg
   - System B needs to read the same tables using `PyArrow`
   
   With the current implementation, System B can never successfully read the 
tables because:
   
   - The server forces the client to use `fsspec`
   - This happens even when the client explicitly requests `PyArrow`
   - There's no way to override this behavior at the client level
   
   
   ### Question
   
   Is there a way for PyIceberg to use a specific FileIO implementation 
regardless of what the table endpoint or the server returns?
   This would be particularly useful in scenarios where:
   
   - The client environment is set up for a specific implementation
   - Different `FileIO` implementations might be more efficient in certain 
environments
   - Required dependencies for the server-specified implementation aren't 
available locally.
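
As a rough illustration of the client-side selection I have in mind, the sketch below prefers an explicit local choice and falls back to the server's choice only if its dependencies can be imported. Everything here is an assumption made for illustration: `REQUIRED_MODULES`, `is_usable`, and `choose_io_impl` are hypothetical names, the module lists are my guess at what each implementation needs for S3, and none of this exists in PyIceberg.

```python
from importlib.util import find_spec

# Assumed mapping from FileIO class path to the top-level modules it needs.
REQUIRED_MODULES = {
    "pyiceberg.io.fsspec.FsspecFileIO": ("fsspec", "s3fs"),
    "pyiceberg.io.pyarrow.PyArrowFileIO": ("pyarrow",),
}


def _installed(name):
    """Return True if a top-level module can be found without importing it."""
    return find_spec(name) is not None


def is_usable(io_impl, module_exists=_installed):
    """Check whether every module assumed to be required is available."""
    modules = REQUIRED_MODULES.get(io_impl)
    if modules is None:
        return False  # unknown implementation: treated as unusable in this sketch
    return all(module_exists(m) for m in modules)


def choose_io_impl(server_choice, local_preference, module_exists=_installed):
    """Prefer the local choice; fall back to the server's choice if usable."""
    for candidate in (local_preference, server_choice):
        if candidate and is_usable(candidate, module_exists):
            return candidate
    raise ValueError("No usable FileIO implementation found")
```

With only `pyarrow` installed, `choose_io_impl("pyiceberg.io.fsspec.FsspecFileIO", "pyiceberg.io.pyarrow.PyArrowFileIO")` would return the PyArrow implementation instead of failing on the missing `s3fs`.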
   
   I've attached a test file that demonstrates the behavior.
   
   Would love to hear your thoughts on this! Is this the intended behavior? If 
so, could we perhaps consider adding a way to override the table-level FileIO 
implementations?
   
   
   Thanks
   
   [test.txt](https://github.com/user-attachments/files/18584986/test.txt)
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [ ] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


