NikitaMatskevich opened a new pull request, #2111:
URL: https://github.com/apache/iceberg-python/pull/2111

   # Rationale for this change
   
   Starting with version 20, PyArrow supports the ADLS filesystem. This PR adds
   PyArrow Azure support to PyIceberg.
   
   PyArrow is the [default IO](https://github.com/apache/iceberg-python/blob/main/pyiceberg/io/__init__.py#L366-L369)
   for PyIceberg catalogs. In an Azure environment it handles a wider range of auth
   strategies than Fsspec, including, for instance, [Managed Identities](https://learn.microsoft.com/en-us/entra/identity/managed-identities-azure-resources/overview).
   Also, prior to this PR (which is not merged yet), there was no support for
   wasb(s) with Fsspec.
   
   # Are these changes tested?
   
   Tests are added in `tests/io/test_pyarrow.py`.
   
   # Are there any user-facing changes?
   
   There are no breaking API changes. Direct impact of the PR: the PyArrow FileIO
   in PyIceberg supports Azure cloud environments. Example of the impact for end
   users: PyIceberg becomes usable in services that authenticate with Managed
   Identities.
   
   Right now the only way to use Managed Identities with PyIceberg is extremely
   hacky:
   ```
   import os
   from fsspec import AbstractFileSystem
   from pyiceberg.io.fsspec import FsspecFileIO
   from pyiceberg.catalog.rest import RestCatalog
   from typing import Any
   
   ADLS_ANON = "adls.anon"
   ADLS_CONNECTION_STRING = "adls.connection-string"
   ADLS_ACCOUNT_NAME = "adls.account-name"
   ADLS_ACCOUNT_KEY = "adls.account-key"
   ADLS_SAS_TOKEN = "adls.sas-token"
   ADLS_TENANT_ID = "adls.tenant-id"
   ADLS_CLIENT_ID = "adls.client-id"
   ADLS_CLIENT_SECRET = "adls.client-secret"
   ADLS_ACCOUNT_HOST = "adls.account-host"
   
   Properties = dict[str, Any]
   
   def my_adls(properties: Properties) -> AbstractFileSystem:
       from adlfs import AzureBlobFileSystem
   
       for key, sas_token in {
           key.replace(f"{ADLS_SAS_TOKEN}.", ""): value
           for key, value in properties.items()
           if key.startswith(ADLS_SAS_TOKEN)
       }.items():
           if ADLS_ACCOUNT_NAME not in properties:
               properties[ADLS_ACCOUNT_NAME] = key.split(".")[0]
           if ADLS_SAS_TOKEN not in properties:
               properties[ADLS_SAS_TOKEN] = sas_token
   
       return AzureBlobFileSystem(
           connection_string=properties.get(ADLS_CONNECTION_STRING),
           anon=properties.get(ADLS_ANON),
           account_name=properties.get(ADLS_ACCOUNT_NAME),
           account_key=properties.get(ADLS_ACCOUNT_KEY),
           sas_token=properties.get(ADLS_SAS_TOKEN),
           tenant_id=properties.get(ADLS_TENANT_ID),
           client_id=properties.get(ADLS_CLIENT_ID),
           client_secret=properties.get(ADLS_CLIENT_SECRET),
           account_host=properties.get(ADLS_ACCOUNT_HOST),
       )
   
   injected_file_io = FsspecFileIO(properties={ADLS_ANON: False, ADLS_ACCOUNT_NAME: "my-account"})
   injected_file_io.get_fs = lambda scheme: my_adls(injected_file_io.properties)
   
   catalog = RestCatalog(
       name="test_catalog",
       uri="https://my-url/internal/catalog",
       properties={
           "io-impl": "pyiceberg.io.fsspec.FsspecFileIO",
       }
   )
   catalog.file_io = injected_file_io
   
   table = catalog.load_table("test.my_test_table")
   table.io = injected_file_io
   table.scan(limit=100).to_pandas()
   ```
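
The SAS-token loop inside `my_adls` can be read in isolation as a small pure function. The sketch below is only an illustration of that step, not code from this PR: the helper name `resolve_sas_token` is hypothetical, it returns the resolved values instead of mutating `properties`, and it matches only keys of the form `adls.sas-token.<host>` (an assumption that is slightly stricter than the `startswith` check in the workaround).

```python
from typing import Any, Dict, Optional, Tuple

ADLS_ACCOUNT_NAME = "adls.account-name"
ADLS_SAS_TOKEN = "adls.sas-token"


def resolve_sas_token(properties: Dict[str, Any]) -> Tuple[Optional[str], Optional[str]]:
    """Return (account_name, sas_token) derived from the given properties.

    Hypothetical helper: explicit `adls.account-name` / `adls.sas-token`
    values win; otherwise the first account-scoped key
    `adls.sas-token.<account>.<rest of host>` supplies the missing value.
    """
    account_name = properties.get(ADLS_ACCOUNT_NAME)
    sas_token = properties.get(ADLS_SAS_TOKEN)
    prefix = f"{ADLS_SAS_TOKEN}."

    for key, value in properties.items():
        if not key.startswith(prefix):
            continue
        host = key[len(prefix):]
        if account_name is None:
            # The account name is the first label of the host name.
            account_name = host.split(".")[0]
        if sas_token is None:
            sas_token = value

    return account_name, sas_token


if __name__ == "__main__":
    print(resolve_sas_token({"adls.sas-token.myaccount.dfs.core.windows.net": "token"}))
```

Keeping the lookup free of side effects makes it easy to check separately from any filesystem construction.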
   
   As the workaround shows, at least the "anon" flag must be passed to
   AzureBlobFileSystem, which PyIceberg does not currently do. The IO must also be
   injected by hand. With this PR, this reduces to the normal workflow:
   ```
   from pyiceberg.catalog import load_catalog
   
   catalog = load_catalog(uri="https://my-url/internal/catalog")
   table = catalog.load_table("test.my_test_table")
   table.scan(limit=100).to_pandas()
   ```
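
To make the idea of "properties in, filesystem arguments out" concrete, here is a minimal sketch of how `adls.*` catalog properties might be translated into keyword arguments for an Azure filesystem constructor. It is not the code from this PR: the function name `azure_fs_kwargs` is hypothetical, the property keys are the ones listed in the workaround above, and the keyword names on the right are an assumption about what the target constructor accepts.

```python
from typing import Any, Dict

# Property keys taken from the workaround above. The keyword names are an
# assumption; check them against the filesystem class you actually use.
ADLS_PROPERTY_TO_KWARG = {
    "adls.account-name": "account_name",
    "adls.account-key": "account_key",
    "adls.sas-token": "sas_token",
    "adls.tenant-id": "tenant_id",
    "adls.client-id": "client_id",
    "adls.client-secret": "client_secret",
}


def azure_fs_kwargs(properties: Dict[str, Any]) -> Dict[str, Any]:
    """Build constructor kwargs from catalog properties (hypothetical helper).

    Unset or None-valued properties are left out entirely, so no explicit
    credential is forwarded unless the user configured one.
    """
    return {
        kwarg: properties[prop]
        for prop, kwarg in ADLS_PROPERTY_TO_KWARG.items()
        if properties.get(prop) is not None
    }


if __name__ == "__main__":
    # Only the account name is configured; unrelated keys are ignored.
    print(azure_fs_kwargs({"adls.account-name": "my-account", "uri": "https://my-url"}))
```

With only an account name configured, nothing credential-related is passed on. That is the situation in which the filesystem has to rely on ambient credentials such as a Managed Identity; whether a given filesystem class does so should be confirmed in its own documentation.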
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

