[
https://issues.apache.org/jira/browse/HADOOP-19604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18012602#comment-18012602
]
ASF GitHub Bot commented on HADOOP-19604:
-----------------------------------------
anujmodi2021 commented on code in PR #7853:
URL: https://github.com/apache/hadoop/pull/7853#discussion_r2260558410
##########
hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azurebfs/services/ITestAbfsOutputStream.java:
##########
@@ -481,6 +481,8 @@ public void testResetCalledOnExceptionInRemoteFlush()
throws Exception {
//expected exception
}
// Verify that reset was called on the message digest
Review Comment:
Can we add a similar test where with config disabled we assert that methods
to compute md5 hash were not called at all?
##########
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/AbfsConfiguration.java:
##########
@@ -438,6 +438,10 @@ public class AbfsConfiguration{
FS_AZURE_ABFS_ENABLE_CHECKSUM_VALIDATION, DefaultValue =
DEFAULT_ENABLE_ABFS_CHECKSUM_VALIDATION)
private boolean isChecksumValidationEnabled;
+ @BooleanConfigurationValidatorAnnotation(ConfigurationKey =
+ FS_AZURE_ABFS_ENABLE_FULL_BLOB_CHECKSUM_VALIDATION, DefaultValue =
DEFAULT_ENABLE_FULL_BLOB_ABFS_CHECKSUM_VALIDATION)
Review Comment:
Do we need "ABFS" in Configuration Key variable name? We always have AZURE
alone.
##########
hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azurebfs/services/AbfsClientTestUtil.java:
##########
@@ -166,7 +166,12 @@ public static void
setMockAbfsRestOperationForFlushOperation(
requestHeaders.add(new AbfsHttpHeader(CONTENT_LENGTH,
String.valueOf(buffer.length)));
requestHeaders.add(new AbfsHttpHeader(CONTENT_TYPE, APPLICATION_XML));
requestHeaders.add(new AbfsHttpHeader(IF_MATCH, eTag));
- requestHeaders.add(new AbfsHttpHeader(X_MS_BLOB_CONTENT_MD5, blobMd5));
+ if (spiedClient.isFullBlobChecksumValidationEnabled()) {
Review Comment:
Can be simplified by having a single statement for adding blobMd5 to request
header.
blobMd5 value can be computed using a trilean operator.
Similar can be used in production code to make sure one of the way is used
to compute Md5 and variable is never null.
##########
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsBlobClient.java:
##########
@@ -1103,8 +1107,21 @@ public AbfsRestOperation flush(byte[] buffer,
AbfsRestOperation op1 = getPathStatus(path, true, tracingContext,
contextEncryptionAdapter);
String metadataMd5 = op1.getResult().getResponseHeader(CONTENT_MD5);
- if (blobMd5 != null && !blobMd5.equals(metadataMd5)) {
- throw ex;
+ /*
+ * Validate the response by comparing the server's MD5 metadata
against either:
+ * 1. The full blob content MD5 (if full blob checksum validation is
enabled), or
+ * 2. The full block ID buffer MD5 (fallback if blob checksum
validation is disabled)
+ */
+ if (getAbfsConfiguration().isFullBlobChecksumValidationEnabled() &&
blobMd5 != null) {
Review Comment:
We can simply call the new method introduced in client here as well
`isFullBlobChecksumValidationEnabled`
##########
hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azurebfs/services/ITestAbfsOutputStream.java:
##########
@@ -481,6 +481,8 @@ public void testResetCalledOnExceptionInRemoteFlush()
throws Exception {
//expected exception
}
// Verify that reset was called on the message digest
- Mockito.verify(mockMessageDigest, Mockito.times(1)).reset();
+ if (spiedClient.isChecksumValidationEnabled()) {
+ Mockito.verify(mockMessageDigest, Mockito.times(1)).reset();
Review Comment:
Nit: assert message.
##########
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsBlobClient.java:
##########
@@ -1073,11 +1073,15 @@ public AbfsRestOperation flush(byte[] buffer,
requestHeaders.add(new AbfsHttpHeader(CONTENT_LENGTH,
String.valueOf(buffer.length)));
requestHeaders.add(new AbfsHttpHeader(CONTENT_TYPE, APPLICATION_XML));
requestHeaders.add(new AbfsHttpHeader(IF_MATCH, eTag));
+ String md5Hash = null;
if (leaseId != null) {
requestHeaders.add(new AbfsHttpHeader(X_MS_LEASE_ID, leaseId));
}
- if (blobMd5 != null) {
+ if (isFullBlobChecksumValidationEnabled() && blobMd5 != null) {
requestHeaders.add(new AbfsHttpHeader(X_MS_BLOB_CONTENT_MD5, blobMd5));
+ } else {
Review Comment:
Is there a config even to avoid this checksum computation?
##########
hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azurebfs/ITestWasbAbfsCompatibility.java:
##########
@@ -113,7 +114,10 @@ public void testReadFile() throws Exception {
boolean[] createFileWithAbfs = new boolean[]{false, true, false, true};
boolean[] readFileWithAbfs = new boolean[]{false, true, true, false};
- AzureBlobFileSystem abfs = getFileSystem();
+ Configuration conf = getRawConfiguration();
+ conf.setBoolean(FS_AZURE_ABFS_ENABLE_FULL_BLOB_CHECKSUM_VALIDATION, true);
+ FileSystem fileSystem = FileSystem.newInstance(conf);
Review Comment:
When creating new instance, we need to close it. Can you check for all the
tests in this class and make sure any new instance sis getting closed within
test itself?
##########
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsOutputStream.java:
##########
@@ -544,7 +547,12 @@ private void uploadBlockAsync(AbfsBlock blockToUpload,
outputStreamStatistics.bytesToUpload(bytesLength);
outputStreamStatistics.writeCurrentBuffer();
DataBlocks.BlockUploadData blockUploadData = blockToUpload.startUpload();
- String md5Hash = getMd5();
+ String md5Hash;
+ if (getClient().getAbfsConfiguration().getIsChecksumValidationEnabled()) {
Review Comment:
`isChecksumValidationEnabled()` can be used here?
##########
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsBlobClient.java:
##########
@@ -1103,8 +1107,21 @@ public AbfsRestOperation flush(byte[] buffer,
AbfsRestOperation op1 = getPathStatus(path, true, tracingContext,
contextEncryptionAdapter);
String metadataMd5 = op1.getResult().getResponseHeader(CONTENT_MD5);
- if (blobMd5 != null && !blobMd5.equals(metadataMd5)) {
- throw ex;
+ /*
+ * Validate the response by comparing the server's MD5 metadata
against either:
+ * 1. The full blob content MD5 (if full blob checksum validation is
enabled), or
+ * 2. The full block ID buffer MD5 (fallback if blob checksum
validation is disabled)
+ */
+ if (getAbfsConfiguration().isFullBlobChecksumValidationEnabled() &&
blobMd5 != null) {
+ // Full blob content MD5 mismatch — integrity check failed
+ if (!blobMd5.equals(metadataMd5)) {
Review Comment:
`blobMd5` and `md5Hash` are the 2 variables used here for holding the md5
hash of data. For a single flush case there will either be a valid blobMd5 or a
valid md5Hash. Can we not merge them into a single variable and use that
everywher? Can there be a case where both are null??
If not we can simply use same variable for them and avoid null checks and
any chances of mixmathching variables as they 2 sounds very similar.
> ABFS: Fix WASB ABFS compatibility issues
> ----------------------------------------
>
> Key: HADOOP-19604
> URL: https://issues.apache.org/jira/browse/HADOOP-19604
> Project: Hadoop Common
> Issue Type: Sub-task
> Affects Versions: 3.4.1
> Reporter: Anmol Asrani
> Assignee: Anmol Asrani
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.1
>
>
> Fix WASB ABFS compatibility issues. Fix issues such as:-
> # BlockId computation to be consistent across clients for PutBlock and
> PutBlockList
> # Restrict url encoding of certain json metadata during setXAttr calls.
> # Maintain the md5 hash of whole block to validate data integrity during
> flush.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]