dennishuo commented on issue #10127:
URL: https://github.com/apache/iceberg/issues/10127#issuecomment-2080302481

   @ms1111 raises a good point, as there are some known incompatibilities between the low-level Blob and ADLS APIs: 
https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-known-issues
   
   However, whether those discrepancies matter for Iceberg/Hadoop use cases may depend on which subset of each API the different client-side implementations actually use.
   
   It looks like while the legacy Hadoop `wasbs` impl is being deprecated, some of the underlying Blob APIs are still used by both the legacy `BlobClient` and the newer `DataLake*Client`. For example, `DataLakeFileSystemClientBuilder` appears to always create both an internal "datalake client" and a "blob client" simultaneously, pointing at `dfs.core.windows.net` and `blob.core.windows.net` respectively: 
https://github.com/Azure/azure-sdk-for-java/blob/0aa45226a625aa19da7183800bb90531eb1f1ee2/sdk/storage/azure-storage-file-datalake/src/main/java/com/azure/storage/file/datalake/DataLakeFileSystemClientBuilder.java#L175
   
        public DataLakeFileSystemClientBuilder endpoint(String endpoint) {
            // Ensure endpoint provided is dfs endpoint
            endpoint = DataLakeImplUtils.endpointToDesiredEndpoint(endpoint, "dfs", "blob");
            blobContainerClientBuilder.endpoint(DataLakeImplUtils.endpointToDesiredEndpoint(endpoint, "blob", "dfs"));
   
   
   And the helper function doesn't really care whether the input is already `dfs` or `blob`; it just rewrites it to the desired one: 
https://github.com/Azure/azure-sdk-for-java/blob/21eb8bc8b4cce3e365bbcb9d139b07a3a554a2d9/sdk/storage/azure-storage-file-datalake/src/main/java/com/azure/storage/file/datalake/implementation/util/DataLakeImplUtils.java#L16
   
        public static String endpointToDesiredEndpoint(String endpoint, String desiredEndpoint, String currentEndpoint) {
            // Add the . on either side to prevent overwriting an account name.
            String desiredStringToMatch = "." + desiredEndpoint + ".";
            String currentRegexToMatch = "\\." + currentEndpoint + "\\.";
            if (endpoint.contains(desiredStringToMatch)) {
                return endpoint;
            } else {
                return endpoint.replaceFirst(currentRegexToMatch, desiredStringToMatch);
            }
        }
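   To make the swap behavior concrete, here is a standalone sketch that re-implements the helper's logic outside the Azure SDK (class and method names here are just for illustration), showing that a blob endpoint gets rewritten to dfs while a dfs endpoint passes through unchanged:

```java
// Standalone sketch re-implementing the logic of the Azure SDK helper
// DataLakeImplUtils.endpointToDesiredEndpoint quoted above, so the swap
// behavior can be demonstrated without pulling in the SDK itself.
public class EndpointSwapDemo {

    static String endpointToDesiredEndpoint(String endpoint, String desiredEndpoint, String currentEndpoint) {
        // The dots on either side prevent rewriting an account name that
        // happens to contain "blob" or "dfs".
        String desiredStringToMatch = "." + desiredEndpoint + ".";
        String currentRegexToMatch = "\\." + currentEndpoint + "\\.";
        if (endpoint.contains(desiredStringToMatch)) {
            return endpoint; // already pointing at the desired endpoint
        }
        return endpoint.replaceFirst(currentRegexToMatch, desiredStringToMatch);
    }

    public static void main(String[] args) {
        // A blob endpoint gets rewritten to dfs...
        System.out.println(endpointToDesiredEndpoint("https://myaccount.blob.core.windows.net", "dfs", "blob"));
        // ...while a dfs endpoint passes through unchanged.
        System.out.println(endpointToDesiredEndpoint("https://myaccount.dfs.core.windows.net", "dfs", "blob"));
    }
}
```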
           
   And then some of the methods just delegate through to the "blob client" 
instead of the "datalake client": 
https://github.com/Azure/azure-sdk-for-java/blob/0aa45226a625aa19da7183800bb90531eb1f1ee2/sdk/storage/azure-storage-file-datalake/src/main/java/com/azure/storage/file/datalake/DataLakeFileClient.java#L1079
   
        public FileReadResponse readWithResponse(OutputStream stream, FileRange range, DownloadRetryOptions options,
            DataLakeRequestConditions requestConditions, boolean getRangeContentMd5, Duration timeout, Context context) {
            return DataLakeImplUtils.returnOrConvertException(() -> {
                BlobDownloadResponse response = blockBlobClient.downloadWithResponse(stream, Transforms.toBlobRange(range),
                    Transforms.toBlobDownloadRetryOptions(options), Transforms.toBlobRequestConditions(requestConditions),
                    getRangeContentMd5, timeout, context);
                return Transforms.toFileReadResponse(response);
            }, LOGGER);
        }
   
   While it'll be important for all service providers to also migrate away from producing `wasbs://` paths by default and to ensure users all have ADLSv2 enabled, the Iceberg metadata/manifest-list/manifest files will likely be fairly sticky for large Iceberg deployments with exabytes of data involved. So even if new data is migrated quickly, the libraries really need to support ingesting the `wasbs://` scheme for the foreseeable future.
   
   Are there any Azure experts who can confirm that `ADLSFileIO` strictly 
adheres to the subset of `DataLakeFileSystemClient` that only depends on the 
`Blob` semantics? If so, could that mean it's possible to use 
`DataLakeFileSystemClient` even for blob storage accounts that don't enable 
ADLSv2?
   
   If it'll be guaranteed to remain drop-in compatible, one approach could be to map `wasbs://` to `ADLSFileIO` in `ResolvingFileIO` as well, and make `ADLSLocation` permissive enough to accept `wasbs://` URIs. It may not technically even need any explicit path translation to `abfss`, since the scheme prefix is discarded anyway, and the Azure client code quoted above doesn't seem to care much whether you configure the client with `dfs.core.windows.net` or `blob.core.windows.net`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

