NikitaMatskevich opened a new pull request, #2111: URL: https://github.com/apache/iceberg-python/pull/2111
# Rationale for this change

Starting from version 20, PyArrow supports the ADLS filesystem. This PR adds PyArrow Azure support to PyIceberg.

PyArrow is the [default IO](https://github.com/apache/iceberg-python/blob/main/pyiceberg/io/__init__.py#L366-L369) for PyIceberg catalogs. In an Azure environment it handles a wider spectrum of auth strategies than Fsspec, including, for instance, [Managed Identities](https://learn.microsoft.com/en-us/entra/identity/managed-identities-azure-resources/overview). Also, prior to this PR (that is not merged yet) there was no support for wasb(s) with Fsspec.

# Are these changes tested?

Tests are added under tests/io/test_pyarrow.py.

# Are there any user-facing changes?

There are no API breaking changes.

Direct impact of the PR: the PyArrow FileIO in PyIceberg supports the Azure cloud environment.

Examples of impact for final users: PyIceberg is usable in services that use the Managed Identities auth strategy.
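For readers unfamiliar with the `abfs(s)` and `wasb(s)` schemes mentioned above, the following is a minimal sketch of how such a location can be split into account, container, and path. It assumes the common Azure URI layout `scheme://<container>@<account>.<endpoint>/<path>`. The `parse_azure_uri` helper and the `AzureLocation` type are hypothetical names used only for illustration; they are not taken from this PR, PyIceberg, or PyArrow.

```python
from typing import NamedTuple
from urllib.parse import urlparse

# Schemes discussed in this PR description.
AZURE_SCHEMES = {"abfs", "abfss", "wasb", "wasbs"}


class AzureLocation(NamedTuple):
    """Hypothetical container for the parts of an Azure storage URI."""

    scheme: str
    account: str
    container: str
    path: str


def parse_azure_uri(uri: str) -> AzureLocation:
    """Split e.g. abfss://container@account.dfs.core.windows.net/path (assumed layout)."""
    parsed = urlparse(uri)
    if parsed.scheme not in AZURE_SCHEMES:
        raise ValueError(f"Unsupported scheme: {parsed.scheme!r}")

    # The authority part is assumed to be "<container>@<account>.<endpoint>".
    container, sep, host = parsed.netloc.partition("@")
    if not sep or not container or not host:
        raise ValueError(f"Expected <container>@<account>.<endpoint> in {uri!r}")

    # The account name is the first label of the host name.
    account = host.split(".", 1)[0]
    return AzureLocation(parsed.scheme, account, container, parsed.path.lstrip("/"))


location = parse_azure_uri("abfss://data@myaccount.dfs.core.windows.net/warehouse/t/f.parquet")
```

Here `location.account` is `"myaccount"` and `location.container` is `"data"`. A real FileIO implementation may treat other URI forms differently, so this should be read as an illustration of the scheme layout only.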
Right now the only way to use Managed Identities with PyIceberg is extremely hacky:

```
import os
from fsspec import AbstractFileSystem
from pyiceberg.io.fsspec import FsspecFileIO
from pyiceberg.catalog.rest import RestCatalog
from typing import Any

ADLS_ANON = "adls.anon"
ADLS_CONNECTION_STRING = "adls.connection-string"
ADLS_ACCOUNT_NAME = "adls.account-name"
ADLS_ACCOUNT_KEY = "adls.account-key"
ADLS_SAS_TOKEN = "adls.sas-token"
ADLS_TENANT_ID = "adls.tenant-id"
ADLS_CLIENT_ID = "adls.client-id"
ADLS_CLIENT_SECRET = "adls.client-secret"
ADLS_ACCOUNT_HOST = "adls.account-host"

Properties = dict[str, Any]


def my_adls(properties: Properties) -> AbstractFileSystem:
    from adlfs import AzureBlobFileSystem

    for key, sas_token in {
        key.replace(f"{ADLS_SAS_TOKEN}.", ""): value
        for key, value in properties.items()
        if key.startswith(ADLS_SAS_TOKEN)
    }.items():
        if ADLS_ACCOUNT_NAME not in properties:
            properties[ADLS_ACCOUNT_NAME] = key.split(".")[0]
        if ADLS_SAS_TOKEN not in properties:
            properties[ADLS_SAS_TOKEN] = sas_token

    return AzureBlobFileSystem(
        connection_string=properties.get(ADLS_CONNECTION_STRING),
        anon=properties.get(ADLS_ANON),
        account_name=properties.get(ADLS_ACCOUNT_NAME),
        account_key=properties.get(ADLS_ACCOUNT_KEY),
        sas_token=properties.get(ADLS_SAS_TOKEN),
        tenant_id=properties.get(ADLS_TENANT_ID),
        client_id=properties.get(ADLS_CLIENT_ID),
        client_secret=properties.get(ADLS_CLIENT_SECRET),
        account_host=properties.get(ADLS_ACCOUNT_HOST),
    )


injected_file_io = FsspecFileIO(properties={ADLS_ANON: False, ADLS_ACCOUNT_NAME: "my-account"})
injected_file_io.get_fs = lambda scheme: my_adls(injected_file_io.properties)

catalog = RestCatalog(
    name="test_catalog",
    uri="https://my-url/internal/catalog",
    properties={
        "io-impl": "pyiceberg.io.fsspec.FsspecFileIO",
    }
)
catalog.file_io = injected_file_io

table = catalog.load_table("test.my_test_table")
table.io = injected_file_io
table.scan(limit=100).to_pandas()
```

As you can see, at least the "anon" flag must be passed to
AzureBlobFileSystem, which is not currently done by PyIceberg. Also, the IO must be injected manually. With this PR, this can be reduced to the normal workflow:

```
from pyiceberg.catalog import load_catalog

catalog = load_catalog(uri="https://my-url/internal/catalog")
table = catalog.load_table("test.my_test_table")
table.scan(limit=100).to_pandas()
```

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org