kevinjqliu commented on code in PR #1453:
URL: https://github.com/apache/iceberg-python/pull/1453#discussion_r1902219038


##########
mkdocs/docs/configuration.md:
##########
@@ -102,21 +102,21 @@ For the FileIO there are several configuration options 
available:
 
 <!-- markdown-link-check-disable -->
 
-| Key                  | Example                    | Description              
                                                                                
                                                                                
                                                                 |
-|----------------------|----------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| s3.endpoint          | <https://10.0.19.25/>      | Configure an alternative 
endpoint of the S3 service for the FileIO to access. This could be used to use 
S3FileIO with any s3-compatible object storage service that has a different 
endpoint, or access a private S3 endpoint in a virtual private cloud. |
-| s3.access-key-id     | admin                      | Configure the static 
access key id used to access the FileIO.                                        
                                                                                
                                                                     |
-| s3.secret-access-key | password                   | Configure the static 
secret access key used to access the FileIO.                                    
                                                                                
                                                                     |
-| s3.session-token     | AQoDYXdzEJr...             | Configure the static 
session token used to access the FileIO.                                        
                                                                                
                                                                     |
-| s3.role-session-name      | session                    | An optional 
identifier for the assumed role session.                                        
                                                                                
                                                                              |
-| s3.role-arn          | arn:aws:...                | AWS Role ARN. If 
provided instead of access_key and secret_key, temporary credentials will be 
fetched by assuming this role.                                                  
                                                                            |
-| s3.signer            | bearer                     | Configure the signature 
version of the FileIO.                                                          
                                                                                
                                                                  |
-| s3.signer.uri        | <http://my.signer:8080/s3> | Configure the remote 
signing uri if it differs from the catalog uri. Remote signing is only 
implemented for `FsspecFileIO`. The final request is sent to 
`<s3.signer.uri>/<s3.signer.endpoint>`.                                         
                 |
-| s3.signer.endpoint   | v1/main/s3-sign            | Configure the remote 
signing endpoint. Remote signing is only implemented for `FsspecFileIO`. The 
final request is sent to `<s3.signer.uri>/<s3.signer.endpoint>`. (default : 
v1/aws/s3/sign).                                                            |
-| s3.region            | us-west-2                  | Sets the region of the 
bucket                                                                          
                                                                                
                                                                   |
-| s3.proxy-uri         | <http://my.proxy.com:8080> | Configure the proxy 
server to be used by the FileIO.                                                
                                                                                
                                                                      |
-| s3.connect-timeout   | 60.0                       | Configure socket 
connection timeout, in seconds.                                                 
                                                                                
                                                                         |
-| s3.force-virtual-addressing   | False                       | Whether to use 
virtual addressing of buckets. If true, then virtual addressing is always 
enabled. If false, then virtual addressing is only enabled if endpoint_override 
is empty. This can be used for non-AWS backends that only support virtual 
hosted-style access.                                                            
                                                                                
                                                           |
+| Key                  | Example                    | Description              
                                                                                
                                                                                
                                                                               |
+|----------------------|----------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| s3.endpoint          | <https://10.0.19.25/>      | Configure an alternative 
endpoint of the S3 service for the FileIO to access. This could be used to use 
S3FileIO with any s3-compatible object storage service that has a different 
endpoint, or access a private S3 endpoint in a virtual private cloud.           
    |
+| s3.access-key-id     | admin                      | Configure the static 
access key id used to access the FileIO.                                        
                                                                                
                                                                                
   |
+| s3.secret-access-key | password                   | Configure the static 
secret access key used to access the FileIO.                                    
                                                                                
                                                                                
   |
+| s3.session-token     | AQoDYXdzEJr...             | Configure the static 
session token used to access the FileIO.                                        
                                                                                
                                                                                
   |
+| s3.role-session-name      | session                    | An optional 
identifier for the assumed role session.                                        
                                                                                
                                                                                
            |
+| s3.role-arn          | arn:aws:...                | AWS Role ARN. If 
provided instead of access_key and secret_key, temporary credentials will be 
fetched by assuming this role.                                                  
                                                                                
          |
+| s3.signer            | bearer                     | Configure the signature 
version of the FileIO.                                                          
                                                                                
                                                                                
|
+| s3.signer.uri        | <http://my.signer:8080/s3> | Configure the remote 
signing uri if it differs from the catalog uri. Remote signing is only 
implemented for `FsspecFileIO`. The final request is sent to 
`<s3.signer.uri>/<s3.signer.endpoint>`.                                         
                               |
+| s3.signer.endpoint   | v1/main/s3-sign            | Configure the remote 
signing endpoint. Remote signing is only implemented for `FsspecFileIO`. The 
final request is sent to `<s3.signer.uri>/<s3.signer.endpoint>`. (default : 
v1/aws/s3/sign).                                                                
          |
+| s3.region            | us-west-2                  | Configure the default 
region used to initialize an S3FileSystem. This setting will be overwritten if 
the bucket actually used resolves to a different region.                        
                                                                                
   |

Review Comment:
   ```suggestion
   | s3.region            | us-west-2                  | Configure the default 
region used to initialize an `S3FileSystem`. `PyArrowFileIO` attempts to 
automatically resolve the region for each S3 bucket, falling back to this value 
if resolution fails.                                                            
                                               |
   ```



##########
pyiceberg/io/pyarrow.py:
##########
@@ -362,6 +362,12 @@ def _initialize_fs(self, scheme: str, netloc: 
Optional[str] = None) -> FileSyste
                 "region": get_first_property_value(self.properties, S3_REGION, 
AWS_REGION),
             }
 
+            # Override the default s3.region if netloc(bucket) resolves to a 
different region
+            try:
+                client_kwargs["region"] = resolve_s3_region(netloc)

Review Comment:
   I think there are these 3 cases we're worried about:
   ```
   # region match
   region=us-east-1
   s3://foo-us-east-1/
   s3://bar-us-east-1/
   
   # region mismatch
   region=us-west-2
   s3://foo-us-east-1/
   s3://bar-us-west-2/
   
   # region not provided
   region=None
   s3://foo-us-east-1/
   s3://bar-us-west-2/
   ```
   
   We have 2 options here
   1. use `region` when provided, fallback to `resolve_s3_region`
   2. always use `resolve_s3_region`
   3. `resolve_s3_region`, fall back to `region` 
   
   Option 1 is difficult since we dont know that the provided `region` is wrong 
until we try to use the FileIO. 
   
   The code above uses option 2 which will always make an extra call to S3 to 
get the correct bucket region. This extra call to S3 is cached though, so 
`resolve_s3_region` is only called once per bucket. 
   This is similar to the `cache_regions` option for 
[s3fs.core.S3FileSystem](https://s3fs.readthedocs.io/en/latest/api.html#s3fs.core.S3FileSystem)
   
   I like option 3, we can resolve the bucket region and fallback to the 
provided `region`. It might be confusing to the enduser when a `region` is 
specified but the FileIO uses a different region, so lets add a warning for 
that.  
   
   Something like this
   ```
   # Attempt to resolve the S3 region for the bucket, falling back to 
configured region if resolution fails
   # Note, bucket resolution is cached and only called once per bucket
   provided_region = get_first_property_value(self.properties, S3_REGION, 
AWS_REGION)
   try:
       bucket_region = resolve_s3_region(bucket=netloc)
   except (OSError, TypeError):
       bucket_region = None
       logger.warning(f"Unable to resolve region for bucket {netloc}, using 
default region {provided_region}")
   
   if bucket_region and bucket_region != provided_region:
       logger.warning(
           f"PyArrow FileIO overriding S3 bucket region for bucket {netloc}: "
           f"provided region {provided_region}, actual region {bucket_region}"
       )
   region = bucket_region or provided_region
   
   client_kwargs: Dict[str, Any] = {
       "endpoint_override": self.properties.get(S3_ENDPOINT),
       "access_key": get_first_property_value(self.properties, 
S3_ACCESS_KEY_ID, AWS_ACCESS_KEY_ID),
       "secret_key": get_first_property_value(self.properties, 
S3_SECRET_ACCESS_KEY, AWS_SECRET_ACCESS_KEY),
       "session_token": get_first_property_value(self.properties, 
S3_SESSION_TOKEN, AWS_SESSION_TOKEN),
       "region": region,
   }
   ```
   
   
   



##########
tests/io/test_pyarrow.py:
##########
@@ -2074,3 +2077,65 @@ def 
test__to_requested_schema_timestamps_without_downcast_raises_exception(
         _to_requested_schema(requested_schema, file_schema, batch, 
downcast_ns_timestamp_to_us=False, include_field_ids=False)
 
     assert "Unsupported schema projection from timestamp[ns] to timestamp[us]" 
in str(exc_info.value)
+

Review Comment:
   nit: add a test using both LocalFilesystem with S3Filesystem for #1041



##########
pyiceberg/io/pyarrow.py:
##########
@@ -352,7 +352,7 @@ def parse_location(location: str) -> Tuple[str, str, str]:
 
     def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> 
FileSystem:
         if scheme in {"s3", "s3a", "s3n", "oss"}:
-            from pyarrow.fs import S3FileSystem
+            from pyarrow.fs import S3FileSystem, resolve_s3_region

Review Comment:
   nit: since `oss` scheme uses this path, does it also support 
`resolve_s3_region`? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to