geruh opened a new issue, #570:
URL: https://github.com/apache/iceberg-python/issues/570

   ### Apache Iceberg version
   
   main (development)
   
   ### Please describe the bug 🐞
   
   When initializing the GlueCatalog with a specific AWS profile, catalog
   operations work as expected. However, we've hit an issue when working with
   S3 via the PyArrow S3FileSystem. Users can specify a profile for initiating
   the boto connection, but this preference doesn't carry over to the
   S3FileSystem. Instead of using the specified AWS profile, it checks the
   catalog configs for S3 settings such as `s3.access-key-id`, `s3.region`,
   and so on. If those aren't passed in, PyArrow's S3FileSystem has its own
   strategy for inferring credentials, such as:
   1. the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN 
environment variables. 
   2. the default profile credentials in your ~/.aws/credentials and 
~/.aws/config.
   
   This workflow leads to inconsistencies. For example, while Glue operations
   use the user-specified profile, S3 operations could end up using a
   different set of credentials, or even a different region, taken from the
   environment variables or the AWS config files. This is seen in issue #515,
   where one region (like us-west-2) unexpectedly switches to another (like
   us-east-1), causing a 301 exception.
   
   For example:
   
   1. Set up ~/.aws/config so that the default profile has a different region from the one you want to use:
   
   ```
   [default]
   region = us-east-1
   
   [profile test]
   region = us-west-2
   ```
   
   2. Initialize the GlueCatalog with the correct region you want to use:
   ```
   from pyiceberg.catalog import load_catalog

   catalog_name = "glue"  # placeholder; any catalog name works here
   catalog = load_catalog(
       catalog_name, **{"type": "glue", "profile_name": "test", "region_name": "us-west-2"}
   )
   ```
   
   3. Load a table:
   ```
   catalog.load_table("default.test")
   File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
   OSError: When reading information for key 'test/metadata/00000-c0fc4e45-d79d-41a1-ba92-a4122c09171c.metadata.json' in bucket 'test_bucket': AWS Error UNKNOWN (HTTP status 301) during HeadObject operation: No response body.
   ```
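
Under the current behavior, passing the region to the filesystem explicitly should avoid the 301 above. The following is a minimal sketch, not part of the original report: the helper names `glue_catalog_properties` and `load_glue_catalog` are hypothetical, and it assumes the bucket lives in the same region as the Glue catalog.

```python
def glue_catalog_properties(profile_name: str, region_name: str) -> dict:
    """Build catalog properties that pin both Glue and S3 to one region."""
    return {
        "type": "glue",
        # Used for the boto-based Glue client (same keys as in step 2).
        "profile_name": profile_name,
        "region_name": region_name,
        # Read by the S3 file IO; without it PyArrow infers the region itself.
        "s3.region": region_name,
    }


def load_glue_catalog(catalog_name: str, profile_name: str, region_name: str):
    """Hypothetical convenience wrapper around pyiceberg's load_catalog."""
    # Imported lazily so the helper above works without pyiceberg installed.
    from pyiceberg.catalog import load_catalog

    return load_catalog(
        catalog_name, **glue_catalog_properties(profile_name, region_name)
    )
```

Calling `load_glue_catalog(catalog_name, "test", "us-west-2")` would then send both Glue and S3 requests to us-west-2. Note that this only fixes the region; the S3 credentials are still resolved by PyArrow's own strategy rather than from the `test` profile.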
   
   On one hand, we could argue that the profile configuration should only
   apply at the catalog level, and that for filesystems the user must specify
   the aforementioned configs like `s3.region`. On the other hand, it seems
   reasonable for the AWS profile config to work uniformly across both the
   catalog and filesystem levels. This unified approach would simplify
   configuration management for users, and I'm leaning towards it. However,
   we currently use PyArrow's S3FileSystem, which doesn't inherently support
   AWS profiles, so we'd need to bridge that gap manually.
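
One possible way to bridge the gap, sketched here only as an illustration and not as the actual fix: resolve the profile with boto3 and hand the result to the file IO as `s3.*` properties. The function names `s3_properties_from_session` and `s3_properties_for_profile` are hypothetical; the sketch assumes boto3 is installed and that the `s3.region`, `s3.access-key-id`, `s3.secret-access-key`, and `s3.session-token` properties are what the S3 file IO reads.

```python
def s3_properties_from_session(session) -> dict:
    """Translate a boto3-style session into PyIceberg `s3.*` properties.

    `session` is expected to behave like `boto3.Session`: it exposes
    `region_name` and `get_credentials()`.
    """
    props = {}
    if session.region_name:
        props["s3.region"] = session.region_name

    credentials = session.get_credentials()  # None when nothing is configured
    if credentials is not None:
        # Snapshot of the current values; it is not refreshed afterwards.
        frozen = credentials.get_frozen_credentials()
        props["s3.access-key-id"] = frozen.access_key
        props["s3.secret-access-key"] = frozen.secret_key
        if frozen.token:  # only set for temporary credentials
            props["s3.session-token"] = frozen.token
    return props


def s3_properties_for_profile(profile_name: str, region_name=None) -> dict:
    """Hypothetical helper: resolve a named AWS profile via boto3."""
    import boto3  # imported lazily; requires boto3 to be installed

    session = boto3.Session(profile_name=profile_name, region_name=region_name)
    return s3_properties_from_session(session)
```

The returned dictionary could be merged into the catalog properties before the filesystem is created, with explicitly passed `s3.*` values taking precedence. Because the credentials are a snapshot, temporary credentials would eventually expire, so a real implementation would likely need a refresh strategy.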
   
   cc: @HonahX @Fokko @kevinjqliu 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


