[
https://issues.apache.org/jira/browse/HADOOP-19620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Anuj Modi resolved HADOOP-19620.
--------------------------------
Fix Version/s: 3.5.0
3.4.2
3.4.1
Assignee: Anuj Modi
Resolution: Not A Bug
> [ABFS] AzureADAuthenticator should be able to retry on UnknownHostException
> ---------------------------------------------------------------------------
>
> Key: HADOOP-19620
> URL: https://issues.apache.org/jira/browse/HADOOP-19620
> Project: Hadoop Common
> Issue Type: Improvement
> Components: fs/azure
> Affects Versions: 3.4.1
> Reporter: Serhii Nesterov
> Assignee: Anuj Modi
> Priority: Minor
> Fix For: 3.5.0, 3.4.2, 3.4.1
>
>
> When Hadoop performs operations against ADLS Gen2 storage,
> *AbfsRestOperation* attempts to obtain an access token from Microsoft.
> Under the hood, it uses a plain *java.net.HttpURLConnection* HTTP client.
> Occasionally, environments run into intermittent network issues, including
> DNS-related {*}UnknownHostException{*}. Technically, the HTTP client throws
> an *IOException* whose cause is an {*}UnknownHostException{*}.
> *AzureADAuthenticator* in turn catches the {*}IOException{*}, sets
> *httperror = -1*, and then checks whether the error is recoverable and can
> be retried. However, the exception is neither an instance of
> {*}MalformedURLException{*} nor of {*}FileNotFoundException{*}, nor does it
> carry a recoverable status code ({*}< 100 || == 408 || >= 500 && != 501 &&
> != 505{*}), so a retry never occurs. This is a problem for our project, as
> it breaks state recovery.
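To make the gap concrete, here is a standalone sketch of the type checks described above, applied to a DNS failure. This is a simplified stand-in, not the actual AzureADAuthenticator code, and `isRetriableByType` is a hypothetical name:

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.UnknownHostException;

public class TokenRetryGap {
    // Simplified stand-in for the recoverability test described above:
    // only the exception's own type is inspected, never its cause.
    public static boolean isRetriableByType(IOException e) {
        return e instanceof MalformedURLException
                || e instanceof FileNotFoundException;
    }

    public static void main(String[] args) {
        // The DNS failure surfaces as an IOException caused by an
        // UnknownHostException, so it matches neither subtype above.
        IOException dnsFailure = new IOException("token request failed",
                new UnknownHostException("login.microsoftonline.com"));
        System.out.println(isRetriableByType(dnsFailure)); // false: no retry
    }
}
```

Because the wrapping `IOException` itself is inspected rather than its cause chain, the DNS error falls through to the non-recoverable path.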
> The final exception stack trace on the client side looks as follows (Apache
> Spark application, tenant ID is redacted):
> {code:java}
> Job aborted due to stage failure: Task 14 in stage 384.0 failed 4 times, most
> recent failure: Lost task 14.3 in stage 384.0 (TID 3087) (10.244.91.7 executor
> 29): Status code: -1 error code: null error message: Auth failure: HTTP Error
> -1; url='https://login.microsoftonline.com/$TENANT_ID/oauth2/v2.0/token'
> AzureADAuthenticator.getTokenCall threw java.net.UnknownHostException:
> login.microsoftonline.com
>     at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation(AbfsRestOperation.java:321)
>     at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.completeExecute(AbfsRestOperation.java:263)
>     at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.lambda$execute$0(AbfsRestOperation.java:235)
>     at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.measureDurationOfInvocation(IOStatisticsBinding.java:494)
>     at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDurationOfInvocation(IOStatisticsBinding.java:465)
>     at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:233)
>     at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getPathStatus(AbfsClient.java:1099)
>     at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getFileStatus(AzureBlobFileSystemStore.java:1164)
>     at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus(AzureBlobFileSystem.java:766)
>     at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus(AzureBlobFileSystem.java:756)
>     at org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:39)
>     at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:39)
>     at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.footerFileMetaData$lzycompute$1(ParquetFileFormat.scala:211)
>     at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.footerFileMetaData$1(ParquetFileFormat.scala:210)
>     at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:213)
> ...{code}
> I can see this exception is already retried in other parts of the Hadoop
> project (e.g., {*}DefaultAMSProcessor{*}).
> We would like a similar retry mechanism for fetching tokens. Moreover,
> *AbfsRestOperation* already handles and retries {*}UnknownHostException{*},
> but that logic appears to apply only to storage communication, not to token
> retrieval. I suppose the solution is simple: check whether the cause of the
> *IOException* is an instance of {*}UnknownHostException{*} and apply the
> same retry policies as for other types of recoverable errors.
> Here is the code where I believe *UnknownHostException* should be checked:
> [https://github.com/apache/hadoop/blob/61096793f6368d16a21cde8b1c8f8dce41a4c102/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/oauth2/AzureADAuthenticator.java#L354]
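A minimal sketch of the suggested check, under the assumption that the fix amounts to walking the `IOException`'s cause chain; `causedByUnknownHost` is a hypothetical helper name, and the real change would live inside AzureADAuthenticator's error handling:

```java
import java.io.IOException;
import java.net.UnknownHostException;

public class UnknownHostRetryCheck {
    /**
     * Hypothetical helper mirroring the proposal: an IOException is
     * retryable if it is, or is caused by, an UnknownHostException.
     * Walks the whole cause chain to cope with wrapped exceptions.
     */
    public static boolean causedByUnknownHost(IOException e) {
        for (Throwable t = e; t != null; t = t.getCause()) {
            if (t instanceof UnknownHostException) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The shape of the failure from the stack trace above: an
        // IOException wrapping a DNS resolution error.
        IOException dnsFailure = new IOException("token request failed",
                new UnknownHostException("login.microsoftonline.com"));
        System.out.println(causedByUnknownHost(dnsFailure));           // true
        System.out.println(causedByUnknownHost(new IOException("x"))); // false
    }
}
```

A positive result would then route the failure into the same retry policy that already covers the recoverable status codes.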
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]