dennishuo commented on issue #10127: URL: https://github.com/apache/iceberg/issues/10127#issuecomment-2080302481
@ms1111 raises a good point; there are some known incompatibilities between the low-level Blob and ADLS APIs: https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-known-issues

However, whether those discrepancies matter for Iceberg/Hadoop use cases depends on which subset of each API the client-side implementations actually use. While the legacy Hadoop `wasbs` implementation is being deprecated, some of the underlying Blob APIs are still used both by the legacy `BlobClient` and by the newer `DataLake*Client` classes. For example, `DataLakeFileSystemClientBuilder` appears to always create both an internal "datalake client" and a "blob client" simultaneously, pointing at `dfs.core.windows.net` and `blob.core.windows.net` respectively:

https://github.com/Azure/azure-sdk-for-java/blob/0aa45226a625aa19da7183800bb90531eb1f1ee2/sdk/storage/azure-storage-file-datalake/src/main/java/com/azure/storage/file/datalake/DataLakeFileSystemClientBuilder.java#L175

```java
public DataLakeFileSystemClientBuilder endpoint(String endpoint) {
    // Ensure endpoint provided is dfs endpoint
    endpoint = DataLakeImplUtils.endpointToDesiredEndpoint(endpoint, "dfs", "blob");
    blobContainerClientBuilder.endpoint(DataLakeImplUtils.endpointToDesiredEndpoint(endpoint, "blob", "dfs"));
```

And the helper function doesn't care whether the input is already `dfs` or `blob`; it simply rewrites it to the desired one:

https://github.com/Azure/azure-sdk-for-java/blob/21eb8bc8b4cce3e365bbcb9d139b07a3a554a2d9/sdk/storage/azure-storage-file-datalake/src/main/java/com/azure/storage/file/datalake/implementation/util/DataLakeImplUtils.java#L16

```java
public static String endpointToDesiredEndpoint(String endpoint, String desiredEndpoint, String currentEndpoint) {
    // Add the . on either side to prevent overwriting an account name.
    String desiredStringToMatch = "." + desiredEndpoint + ".";
    String currentRegexToMatch = "\\." + currentEndpoint + "\\.";

    if (endpoint.contains(desiredStringToMatch)) {
        return endpoint;
    } else {
        return endpoint.replaceFirst(currentRegexToMatch, desiredStringToMatch);
    }
}
```

And some of the methods just delegate through to the "blob client" instead of the "datalake client":

https://github.com/Azure/azure-sdk-for-java/blob/0aa45226a625aa19da7183800bb90531eb1f1ee2/sdk/storage/azure-storage-file-datalake/src/main/java/com/azure/storage/file/datalake/DataLakeFileClient.java#L1079

```java
public FileReadResponse readWithResponse(OutputStream stream, FileRange range, DownloadRetryOptions options,
    DataLakeRequestConditions requestConditions, boolean getRangeContentMd5, Duration timeout, Context context) {
    return DataLakeImplUtils.returnOrConvertException(() -> {
        BlobDownloadResponse response = blockBlobClient.downloadWithResponse(stream, Transforms.toBlobRange(range),
            Transforms.toBlobDownloadRetryOptions(options), Transforms.toBlobRequestConditions(requestConditions),
            getRangeContentMd5, timeout, context);
        return Transforms.toFileReadResponse(response);
    }, LOGGER);
}
```

While it will be important for all service providers to stop producing `wasbs://` paths by default and to ensure users have ADLSv2 enabled, the Iceberg metadata/manifest-list/manifest files will be fairly sticky for many large Iceberg deployments with exabytes of data involved. So even if new data is migrated quickly, the libraries really need to support ingesting the `wasbs://` scheme for the foreseeable future.

Are there any Azure experts who can confirm that `ADLSFileIO` strictly adheres to the subset of `DataLakeFileSystemClient` that only depends on the `Blob` semantics? If so, could that mean it's possible to use `DataLakeFileSystemClient` even for blob storage accounts that don't have ADLSv2 enabled?
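To make the endpoint-rewriting behavior above concrete, here is a small standalone re-implementation of the `endpointToDesiredEndpoint` logic (this is a sketch mirroring the SDK helper for illustration, not the actual SDK class):

```java
// Standalone sketch of the endpoint-swapping logic used by the Azure SDK's
// DataLakeImplUtils.endpointToDesiredEndpoint (illustrative, not the SDK code).
public class EndpointSwapSketch {
    static String endpointToDesiredEndpoint(String endpoint, String desired, String current) {
        // The surrounding dots prevent accidentally rewriting an account name
        // that happens to contain "blob" or "dfs".
        String desiredStringToMatch = "." + desired + ".";
        String currentRegexToMatch = "\\." + current + "\\.";
        if (endpoint.contains(desiredStringToMatch)) {
            return endpoint; // already the desired endpoint; pass through unchanged
        }
        return endpoint.replaceFirst(currentRegexToMatch, desiredStringToMatch);
    }

    public static void main(String[] args) {
        // A blob endpoint is rewritten to the dfs endpoint...
        System.out.println(endpointToDesiredEndpoint(
            "https://myaccount.blob.core.windows.net", "dfs", "blob"));
        // ...and a dfs endpoint passes through unchanged.
        System.out.println(endpointToDesiredEndpoint(
            "https://myaccount.dfs.core.windows.net", "dfs", "blob"));
    }
}
```

Both calls produce `https://myaccount.dfs.core.windows.net`, which is why the builder can be handed either endpoint form and still end up with a matched dfs/blob client pair.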
If that remains guaranteed to be drop-in compatible, one approach could be to map `wasbs://` to `ADLSFileIO` in `ResolvingFileIO` as well, and make `ADLSLocation` permissive enough to accept `wasbs://` URIs. It may not technically even need explicit path translation to `abfss`, since the `abfss` scheme prefix is discarded anyway, and the Azure client code I pasted above doesn't seem to care much whether you configure the client with `dfs.core.windows.net` or `blob.core.windows.net`.
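As a rough illustration of what a more permissive location parser could look like, here is a hypothetical sketch that accepts both `abfss://` and `wasbs://` URIs (the class and field names are illustrative, not the actual `ADLSLocation` API; this assumes both schemes share the `container@account.<blob|dfs>.core.windows.net/path` URI shape):

```java
import java.net.URI;

// Hypothetical sketch of a scheme-permissive ADLS-style location parser.
// Not actual Iceberg code; names and behavior are assumptions for illustration.
public class PermissiveAzureLocation {
    final String container;       // from the userinfo portion of the URI
    final String storageEndpoint; // e.g. myaccount.blob.core.windows.net
    final String path;            // path with the leading slash stripped

    PermissiveAzureLocation(String location) {
        URI uri = URI.create(location);
        String scheme = uri.getScheme();
        if (!scheme.equals("abfss") && !scheme.equals("abfs")
                && !scheme.equals("wasbs") && !scheme.equals("wasb")) {
            throw new IllegalArgumentException("Unsupported scheme: " + scheme);
        }
        // Both abfss:// and wasbs:// URIs place the container before the '@'
        // and the storage account host after it, so parsing is identical; the
        // scheme itself can be discarded once validated.
        this.container = uri.getUserInfo();
        this.storageEndpoint = uri.getHost();
        String rawPath = uri.getPath();
        this.path = rawPath.startsWith("/") ? rawPath.substring(1) : rawPath;
    }
}
```

For example, `new PermissiveAzureLocation("wasbs://mycontainer@myaccount.blob.core.windows.net/warehouse/t/metadata.json")` would yield container `mycontainer`, endpoint `myaccount.blob.core.windows.net`, and path `warehouse/t/metadata.json`; combined with the endpoint-rewriting behavior in the SDK builder shown earlier, that may be all the translation needed.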