jabbera opened a new issue, #46085:
URL: https://github.com/apache/arrow/issues/46085

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   TLDR: Trying to write a dataset using a user delegation SAS token always fails with a 403 error.
   
   User delegation SAS tokens in Azure carry no account-level permissions, which prevents them from reading container-level properties. As a result, the check here: https://github.com/apache/arrow/blob/968721b0898457b03d4eebd18a6fdb3156c53423/cpp/src/arrow/filesystem/azurefs.cc#L2223-L2224 always fails with a 403 error.
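
   To show this is not specific to Arrow, the container-properties request can be reproduced directly with the Azure SDK. The sketch below is illustrative only (not part of the original report); it assumes `azure-storage-blob` is installed and that `sas_token` is the container-scoped user delegation SAS generated by the repro script further down.
   
   ```python
   from azure.core.exceptions import HttpResponseError
   from azure.storage.blob import ContainerClient
   
   # Container-scoped user delegation SAS, e.g. as produced by the repro below.
   sas_token = "<user delegation SAS token>"
   
   container = ContainerClient(
       account_url="https://abd1234.blob.core.windows.net",
       container_name="lab",
       credential=sas_token,
   )
   
   try:
       # The same kind of "get container properties" request Arrow issues when it
       # checks whether the container exists before writing.
       container.get_container_properties()
   except HttpResponseError as e:
       print(e.status_code)  # expected: 403 (AuthorizationFailure)
   ```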
   
   I think the appropriate fix is to assume the container exists, though I'm happy to be pointed at an alternate approach. This is the approach the adlfs fsspec implementation has [taken](https://github.com/fsspec/adlfs/blob/adb9c53b74a0d420625b86dd00fbe615b43201d2/adlfs/spec.py#L178-L183); a sketch of the same pattern is shown below. I'm motivated to resolve this and have contributed in the past (https://github.com/apache/arrow/pull/45706 and https://github.com/apache/arrow/pull/45759); I'm just looking for some direction before I create the PR.
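
   For illustration only, here is a Python sketch of the behaviour I'm proposing; the real change would live in the C++ check linked above, and the helper name `container_assumed_to_exist` is purely hypothetical. The idea: if the properties probe is denied with 403, assume the container exists rather than failing the write.
   
   ```python
   from azure.core.exceptions import HttpResponseError
   from azure.storage.blob import ContainerClient
   
   def container_assumed_to_exist(container: ContainerClient) -> bool:
       """Return True if the container exists, or if we lack permission to check."""
       try:
           container.get_container_properties()
           return True
       except HttpResponseError as e:
           if e.status_code == 403:
               # A user delegation SAS cannot read container properties;
               # assume the container exists, as adlfs does.
               return True
           if e.status_code == 404:
               return False
           raise
   ```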
   
   Raised Error:
   ```
   An error occurred using pyarrow AzureFileSystem: GetProperties for 'https://abd1234.blob.core.windows.net/lab?se=2025-04-17T03%3A38%3A20Z&sig=2agAW7B47PRDp%2B%2B1qXmdn7TPQ%2BC9Sj5LYf/FbMMgxxs%3D&ske=2025-04-17T03%3A38%3A20Z&skoid=4d2c13a8-9cee-4848-b4bc-d1c6c2ced41b&sks=b&skt=2025-04-10T03%3A38%3A20Z&sktid=337b9f7b-9e69-4689-9b0d-3417bd3d8566&skv=2025-05-05&sp=racwdlmeop&sr=c&sv=2025-05-05' failed. Azure Error: [AuthorizationFailure] 403 This request is not authorized to perform this operation.
   This request is not authorized to perform this operation.
   ```
   
   Reproduction below:
   
   Create an HNS-enabled storage account and set its name in STORAGE_ACCOUNT_NAME in the script. (The same issue occurs with a plain blob account; this repro is just written against HNS.)
   
   Assign the Storage Blob Data Owner RBAC role on the storage account to whatever Entra ID account you are using.
   
   Wait 10-15 minutes for good measure.
   
   The script below can be run with `uv run`, or you can pip install the dependencies listed at the top into a venv.
   
   ```python
   # /// script
   # dependencies = [
   #   "adlfs>=2024.12.0",
   #   "pandas>=2.1.0",
   #   "pyarrow>=19",
   #   "azure-storage-blob>=12.25.1",
   #   "azure-storage-file-datalake>=12.20.0",
   #   "azure-identity>=1.21.0",
   #   "numpy>=1.24.0",
   # ]
   # ///
   
   from adlfs import AzureBlobFileSystem
   import pyarrow as pa
   import pyarrow.fs as fs
   import pyarrow.dataset as ds
   import numpy as np
   
   from datetime import datetime, timedelta, timezone
   from azure.identity import DefaultAzureCredential
   from azure.storage.filedatalake import (
       generate_file_system_sas,
       DataLakeServiceClient,
       FileSystemSasPermissions,
   )
   
   
   # Generate Random table to write:
   
   # Define the size of the dataset
   rows = 1  # Adjust the number of rows
   cols = 10  # Number of columns
   
   # Generate random data
   data = {f"col_{i}": np.random.rand(rows) for i in range(cols)}
   
   # Create a PyArrow Table
   table = pa.Table.from_pydict(data)
   
   # Generate User Delegated Sas Token
   
   STORAGE_ACCOUNT_NAME = "abd1234"
   
   CONTAINER_NAME = "lab"
   FILE_LOCATION = f"{CONTAINER_NAME}/personal/michael.barry/temp/random_dataset/"
   
   dl_client = DataLakeServiceClient(
       account_url=f"https://{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net",
       credential=DefaultAzureCredential(),
   )
   
   TOKEN_TIME_TO_LIVE = 7 * 24 * 60 * 60
   
   start_time = datetime.now(timezone.utc) - timedelta(
       hours=1
   )  # start in the past to avoid any clock skew issues
   end_time = start_time + timedelta(seconds=TOKEN_TIME_TO_LIVE)
   
   user_delegation_key = dl_client.get_user_delegation_key(start_time, end_time)
   
   all_permissions = FileSystemSasPermissions(
       read=True,
       write=True,
       delete=True,
       list=True,
       add=True,
       create=True,
       move=True,
       execute=True,
       manage_ownership=True,
       manage_access_control=True,
   )
   
   sas_token = generate_file_system_sas(
       STORAGE_ACCOUNT_NAME,
       CONTAINER_NAME,
       credential=user_delegation_key,
       expiry=end_time,
       permission=all_permissions,
   )
   
   
   # write to pyarrow
   
   azure_fs = fs.FileSystem.from_uri(f"abfs://{CONTAINER_NAME}@{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net/?{sas_token}")[0]
   
   
   try:
       ds.write_dataset(
           table,
           FILE_LOCATION,
           format="parquet",
           filesystem=azure_fs,
           existing_data_behavior="overwrite_or_ignore",
       )
   except Exception as e:
       print(f"An error occurred using pyarrow AzureFileSystem: {e}")
   
   
   print("Writing with adlfs instead to show sas token works fine")
    
   write_fs = AzureBlobFileSystem(
       account_name=STORAGE_ACCOUNT_NAME,
       sas_token=sas_token,
   )
   
   ds.write_dataset(
       table,
       FILE_LOCATION,
       format="parquet",
       filesystem=write_fs,
       existing_data_behavior="overwrite_or_ignore",
   )

   # If the dataset exists, the read works fine so we know the sas_token is okay
   print(ds.dataset(
       FILE_LOCATION,
       format="parquet",
       filesystem=azure_fs,
   ).to_table().to_pandas().shape)
   
   print("Read Successful")
   ```
   
   ### Component(s)
   
   C++

