kevinjqliu commented on code in PR #1453: URL: https://github.com/apache/iceberg-python/pull/1453#discussion_r1902219038
########## mkdocs/docs/configuration.md: ########## @@ -102,21 +102,21 @@ For the FileIO there are several configuration options available: <!-- markdown-link-check-disable --> -| Key | Example | Description | -|----------------------|----------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| s3.endpoint | <https://10.0.19.25/> | Configure an alternative endpoint of the S3 service for the FileIO to access. This could be used to use S3FileIO with any s3-compatible object storage service that has a different endpoint, or access a private S3 endpoint in a virtual private cloud. | -| s3.access-key-id | admin | Configure the static access key id used to access the FileIO. | -| s3.secret-access-key | password | Configure the static secret access key used to access the FileIO. | -| s3.session-token | AQoDYXdzEJr... | Configure the static session token used to access the FileIO. | -| s3.role-session-name | session | An optional identifier for the assumed role session. | -| s3.role-arn | arn:aws:... | AWS Role ARN. If provided instead of access_key and secret_key, temporary credentials will be fetched by assuming this role. | -| s3.signer | bearer | Configure the signature version of the FileIO. | -| s3.signer.uri | <http://my.signer:8080/s3> | Configure the remote signing uri if it differs from the catalog uri. Remote signing is only implemented for `FsspecFileIO`. The final request is sent to `<s3.signer.uri>/<s3.signer.endpoint>`. | -| s3.signer.endpoint | v1/main/s3-sign | Configure the remote signing endpoint. Remote signing is only implemented for `FsspecFileIO`. The final request is sent to `<s3.signer.uri>/<s3.signer.endpoint>`. (default : v1/aws/s3/sign). | -| s3.region | us-west-2 | Sets the region of the bucket | -| s3.proxy-uri | <http://my.proxy.com:8080> | Configure the proxy server to be used by the FileIO. | -| s3.connect-timeout | 60.0 | Configure socket connection timeout, in seconds. | -| s3.force-virtual-addressing | False | Whether to use virtual addressing of buckets. If true, then virtual addressing is always enabled. If false, then virtual addressing is only enabled if endpoint_override is empty. This can be used for non-AWS backends that only support virtual hosted-style access. | +| Key | Example | Description | +|----------------------|----------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| s3.endpoint | <https://10.0.19.25/> | Configure an alternative endpoint of the S3 service for the FileIO to access. This could be used to use S3FileIO with any s3-compatible object storage service that has a different endpoint, or access a private S3 endpoint in a virtual private cloud. | +| s3.access-key-id | admin | Configure the static access key id used to access the FileIO. | +| s3.secret-access-key | password | Configure the static secret access key used to access the FileIO. | +| s3.session-token | AQoDYXdzEJr... | Configure the static session token used to access the FileIO. | +| s3.role-session-name | session | An optional identifier for the assumed role session. | +| s3.role-arn | arn:aws:... | AWS Role ARN. If provided instead of access_key and secret_key, temporary credentials will be fetched by assuming this role. | +| s3.signer | bearer | Configure the signature version of the FileIO. | +| s3.signer.uri | <http://my.signer:8080/s3> | Configure the remote signing uri if it differs from the catalog uri. Remote signing is only implemented for `FsspecFileIO`. The final request is sent to `<s3.signer.uri>/<s3.signer.endpoint>`. | +| s3.signer.endpoint | v1/main/s3-sign | Configure the remote signing endpoint. Remote signing is only implemented for `FsspecFileIO`. The final request is sent to `<s3.signer.uri>/<s3.signer.endpoint>`. (default : v1/aws/s3/sign). | +| s3.region | us-west-2 | Configure the default region used to initialize an S3FileSystem. This setting will be overwritten if the bucket actually used resolves to a different region. | Review Comment: ```suggestion | s3.region | us-west-2 | Configure the default region used to initialize an `S3FileSystem`. `PyArrowFileIO` attempts to automatically resolve the region for each S3 bucket, falling back to this value if resolution fails. | ``` ########## pyiceberg/io/pyarrow.py: ########## @@ -362,6 +362,12 @@ def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSyste "region": get_first_property_value(self.properties, S3_REGION, AWS_REGION), } + # Override the default s3.region if netloc(bucket) resolves to a different region + try: + client_kwargs["region"] = resolve_s3_region(netloc) Review Comment: I think there are these 3 cases we're worried about: ``` # region match region=us-east-1 s3://foo-us-east-1/ s3://bar-us-east-1/ # region mismatch region=us-west-2 s3://foo-us-east-1/ s3://bar-us-west-2/ # region not provided region=None s3://foo-us-east-1/ s3://bar-us-west-2/ ``` We have 2 options here 1. use `region` when provided, fallback to `resolve_s3_region` 2. always use `resolve_s3_region` 3. `resolve_s3_region`, fall back to `region` Option 1 is difficult since we dont know that the provided `region` is wrong until we try to use the FileIO. The code above uses option 2 which will always make an extra call to S3 to get the correct bucket region. This extra call to S3 is cached though, so `resolve_s3_region` is only called once per bucket. This is similar to the `cache_regions` option for [s3fs.core.S3FileSystem](https://s3fs.readthedocs.io/en/latest/api.html#s3fs.core.S3FileSystem) I like option 3, we can resolve the bucket region and fallback to the provided `region`. It might be confusing to the enduser when a `region` is specified but the FileIO uses a different region, so lets add a warning for that. Something like this ``` # Attempt to resolve the S3 region for the bucket, falling back to configured region if resolution fails # Note, bucket resolution is cached and only called once per bucket provided_region = get_first_property_value(self.properties, S3_REGION, AWS_REGION) try: bucket_region = resolve_s3_region(bucket=netloc) except (OSError, TypeError): bucket_region = None logger.warning(f"Unable to resolve region for bucket {netloc}, using default region {provided_region}") if bucket_region and bucket_region != provided_region: logger.warning( f"PyArrow FileIO overriding S3 bucket region for bucket {netloc}: " f"provided region {provided_region}, actual region {bucket_region}" ) region = bucket_region or provided_region client_kwargs: Dict[str, Any] = { "endpoint_override": self.properties.get(S3_ENDPOINT), "access_key": get_first_property_value(self.properties, S3_ACCESS_KEY_ID, AWS_ACCESS_KEY_ID), "secret_key": get_first_property_value(self.properties, S3_SECRET_ACCESS_KEY, AWS_SECRET_ACCESS_KEY), "session_token": get_first_property_value(self.properties, S3_SESSION_TOKEN, AWS_SESSION_TOKEN), "region": region, } ``` ########## tests/io/test_pyarrow.py: ########## @@ -2074,3 +2077,65 @@ def test__to_requested_schema_timestamps_without_downcast_raises_exception( _to_requested_schema(requested_schema, file_schema, batch, downcast_ns_timestamp_to_us=False, include_field_ids=False) assert "Unsupported schema projection from timestamp[ns] to timestamp[us]" in str(exc_info.value) + Review Comment: nit: add a test using both LocalFilesystem with S3Filesystem for #1041 ########## pyiceberg/io/pyarrow.py: ########## @@ -352,7 +352,7 @@ def parse_location(location: str) -> Tuple[str, str, str]: def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSystem: if scheme in {"s3", "s3a", "s3n", "oss"}: - from pyarrow.fs import S3FileSystem + from pyarrow.fs import S3FileSystem, resolve_s3_region Review Comment: nit: since `oss` scheme uses this path, does it also support `resolve_s3_region`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org