kevinjqliu commented on PR #1453: URL: https://github.com/apache/iceberg-python/pull/1453#issuecomment-2557749844
> Is the change I made in accordance with this option? What I've done essentially is using the netloc to determine the bucket region. Only in case when, for some reason, the region cannot be determined then we fall back to the properties configuration.

I don't think `netloc` can be used to determine the region; the S3 URI scheme doesn't encode a region, only an S3 URL does. For example, here's how `fs_by_scheme` is typically used: https://github.com/apache/iceberg-python/blob/dbcf65b4892779efca7362e069edecff7f2bf69f/pyiceberg/io/pyarrow.py#L434-L436

And running an example S3 URI:

```python
from pyiceberg.io.pyarrow import PyArrowFileIO

scheme, netloc, path = PyArrowFileIO.parse_location("s3://a/b/c/1.txt")
# returns ('s3', 'a', 'a/b/c/1.txt')
```

To support multiple regions, we might need to call `resolve_s3_region` first and pass the resulting `region` to `fs_by_scheme`. Looking at it from `S3FileSystem`'s perspective, we need a new `S3FileSystem` object per region, which ties into how the `FileSystem` is cached.

BTW, a good test scenario is a table whose metadata files are stored in one bucket while its data files are stored in another. We might be able to construct this test case by modifying the `minio` settings to create buckets in different regions; I haven't tested this yet.
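A minimal sketch of that per-region caching idea. All names here (`make_fs_cache`, `resolve_region`, `fs_factory`) are hypothetical and not part of this PR; in practice `resolve_region` could be `pyarrow.fs.resolve_s3_region` and `fs_factory` would construct a `pyarrow.fs.S3FileSystem(region=...)`:

```python
from functools import lru_cache
from typing import Any, Callable

def make_fs_cache(
    resolve_region: Callable[[str], str],  # e.g. pyarrow.fs.resolve_s3_region
    fs_factory: Callable[[str], Any],      # e.g. lambda r: S3FileSystem(region=r)
) -> Callable[[str], Any]:
    """Return a bucket -> filesystem lookup that keeps one filesystem per region."""

    @lru_cache(maxsize=None)
    def fs_for_region(region: str) -> Any:
        # Cached per region, so buckets in the same region share one object.
        return fs_factory(region)

    def fs_for_bucket(bucket: str) -> Any:
        return fs_for_region(resolve_region(bucket))

    return fs_for_bucket
```

The key point is that the cache key is the resolved region, not the scheme, so metadata and data buckets in different regions each get their own `S3FileSystem` while same-region buckets reuse one.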