martin-traverse opened a new issue, #47353:
URL: https://github.com/apache/arrow/issues/47353

   ### Describe the enhancement requested
   
   Hi - great to see the new Azure FileSystem implementation which completes 
the set with AWS and GCP. We're looking at switching our platform to use the 
Arrow native implementation for Azure, instead of going through fsspec. 
Running locally everything works great.
   
   What doesn't work is workload identity credentials. We use that in CI 
(GitHub Actions with OIDC) and it works perfectly in fsspec / 
AzureBlobFileSystem with no special configuration. After some digging, the 
issue appears to be that the fsspec implementation uses the Azure Python SDK to 
handle credentials, while Arrow goes through the Azure C++ SDK with a lower 
level of control over the credentials mechanism. I can also see that fsspec / 
the Python SDK uses a credentials chain; I'm not sure whether that has been set 
up in the Arrow implementation.
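   
   For anyone trying to reproduce this, a rough way to see which 
environment-driven credentials could even be attempted is to check the 
variables the Azure Identity libraries document for them. This is only a 
sketch: the helper name `azure_env_report` is made up for illustration, it 
covers only the client-secret form of the environment credential, and it 
inspects the environment without calling Arrow or the Azure SDK. It says 
nothing about the Azure CLI or IMDS steps in the chain.

```python
import os

# Variables documented by the Azure Identity libraries for each mechanism.
# Only the client-secret variant of the environment credential is listed.
ENV_CREDENTIAL_VARS = ("AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET")
WORKLOAD_IDENTITY_VARS = (
    "AZURE_TENANT_ID",
    "AZURE_CLIENT_ID",
    "AZURE_FEDERATED_TOKEN_FILE",
)


def azure_env_report(environ=None):
    """Return, per mechanism, the variables that are unset or empty.

    `azure_env_report` is a hypothetical helper for diagnosis only.
    """
    env = os.environ if environ is None else environ

    def missing(names):
        return [name for name in names if not env.get(name)]

    return {
        "environment_credential_missing": missing(ENV_CREDENTIAL_VARS),
        "workload_identity_missing": missing(WORKLOAD_IDENTITY_VARS),
    }


if __name__ == "__main__":
    for mechanism, names in azure_env_report().items():
        status = "all set" if not names else "missing " + ", ".join(names)
        print(f"{mechanism}: {status}")
```

   An empty list for a mechanism only means its variables are present, not 
that a token request would succeed.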
   
   Is it possible to get workload identity credentials added to the 
AzureFileSystem implementation? As best I can tell, all the pieces exist in the 
Azure C++ SDK, and there seem to be references to workload identity in the 
Arrow C++ code as well. Is there a way to plug it together and expose it in the 
Python API?
   
   There was a similar thing a while back with GCP and it got resolved quite 
quickly once support was available in the GCP libraries: 
https://github.com/apache/arrow/issues/34595
   
   In case it is helpful, here are some logs from our CI running fsspec / 
AzureBlobFileSystem with OIDC from GitHub Actions:
   
       tracdap.rt._plugins.storage_azure.AzureBlobStorageProvider - Using 
[default] credentials mechanism
       azure.identity.aio._credentials.environment - No environment 
configuration found.
       azure.identity.aio._credentials.managed_identity - 
ManagedIdentityCredential will use IMDS
       azure.identity._credentials.environment - No environment configuration 
found.
       azure.identity._credentials.managed_identity - ManagedIdentityCredential 
will use IMDS
       tracdap.rt._impl.core.storage.CommonFileStorage - INIT 
[tracdap_ci_storage_setup]: Common file storage, fs = [abfs], impl = [fsspec], 
root = [tracdap-ci-storage/]
       azure.identity.aio._credentials.chained - DefaultAzureCredential 
acquired a token from AzureCliCredential
   
   And this is what happens with the same setup using Arrow's own 
AzureFileSystem:
   
       tracdap.rt._plugins.storage_azure.AzureBlobStorageProvider - Using 
[default] credentials mechanism
       tracdap.rt._impl.core.storage.CommonFileStorage - INIT 
[tracdap_ci_storage_setup]: Common file storage, fs = [abfs], impl = [arrow], 
root = [tracdap-ci-storage/]
       -- snip --
       prior_stat: pa_fs.FileInfo = self._fs.get_file_info(resolved_path)
                                      ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
       File "pyarrow/_fs.pyx", line 615, in pyarrow._fs.FileSystem.get_file_info
       File "pyarrow/error.pxi", line 155, in 
pyarrow.lib.pyarrow_internal_check_status
       File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
       pyarrow.lib.ArrowException: Unknown error: 
           Check for Hierarchical Namespace support on 
'https://******.blob.core.windows.net/tracdap-ci-storage' failed:
           N5Azure4Core11Credentials23AuthenticationExceptionE: Failed to get 
token from DefaultAzureCredential.
   
   This is how the account looks in CI:
   
       {
         "environmentName": "AzureCloud",
         "homeTenantId": "***",
         "id": "***",
         "isDefault": true,
         "managedByTenants": [],
         "name": "****",
         "state": "Enabled",
         "tenantId": "***",
         "user": {
           "name": "***",
           "type": "servicePrincipal"
         }
       }
   
   And this is what I have locally, which works fine as you would expect:
   
       {
         "environmentName": "AzureCloud",
         "homeTenantId": "****",
         "id": "****",
         "isDefault": true,
         "managedByTenants": [],
         "name": "****",
         "state": "Enabled",
         "tenantDefaultDomain": "*****.onmicrosoft.com",
         "tenantDisplayName": "Default Directory",
         "tenantId": "****",
         "user": {
           "name": "****",
           "type": "user"
         }
       }
   
   It looks like managed identities are a bit better supported than workload 
identities, so we're going to test that in our sandbox environment. We can 
manage with static secrets in CI if need be, but it would be good to get 
workload identity working. If there's anything I can do to help, please let me 
know!
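   
   As a stopgap, the static-secret route might look roughly like this. It 
assumes a pyarrow release whose `AzureFileSystem` accepts `tenant_id`, 
`client_id` and `client_secret` keyword arguments (recent releases document 
them, older ones may not). The `AZURE_STORAGE_ACCOUNT` variable name and both 
helper names are my own placeholders, not anything Arrow defines.

```python
import os

# Maps AzureFileSystem keyword arguments to the environment variables that
# would hold the static service principal secret in CI.
SECRET_ARGS = {
    "tenant_id": "AZURE_TENANT_ID",
    "client_id": "AZURE_CLIENT_ID",
    "client_secret": "AZURE_CLIENT_SECRET",
}


def azure_fs_kwargs(environ=None):
    """Build keyword arguments for pyarrow.fs.AzureFileSystem (hypothetical helper)."""
    env = os.environ if environ is None else environ
    # AZURE_STORAGE_ACCOUNT is a placeholder name chosen for this sketch.
    kwargs = {"account_name": env["AZURE_STORAGE_ACCOUNT"]}
    values = {arg: env.get(var) for arg, var in SECRET_ARGS.items()}
    if all(values.values()):
        # All three parts are present, so pass the client secret explicitly.
        kwargs.update(values)
    # Otherwise only account_name is passed and Arrow uses its default credential.
    return kwargs


def open_azure_fs(environ=None):
    # Imported lazily so azure_fs_kwargs can be exercised without pyarrow installed.
    from pyarrow import fs as pa_fs

    return pa_fs.AzureFileSystem(**azure_fs_kwargs(environ))
```

   With only some of the three variables set, the sketch deliberately ignores 
them rather than passing a partial set of credentials.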
   
   ### Component(s)
   
   C++, Python


