Serhii Nesterov created HADOOP-19620:
----------------------------------------
Summary: AzureADAuthenticator should be able to retry on
UnknownHostException
Key: HADOOP-19620
URL: https://issues.apache.org/jira/browse/HADOOP-19620
Project: Hadoop Common
Issue Type: Improvement
Components: auth
Affects Versions: 3.4.1
Reporter: Serhii Nesterov
When Hadoop performs operations against ADLS Gen2 storage,
`AbfsRestOperation` attempts to obtain an access token from Microsoft.
Under the hood, it uses a plain `java.net.HttpURLConnection` HTTP client.
Occasionally, environments run into intermittent network issues, including
DNS-related `UnknownHostException`. Technically, the HTTP client throws an
`IOException` whose cause is `UnknownHostException`. `AzureADAuthenticator` in
turn catches the `IOException`, sets `httperror = -1`, and then checks whether the
error is recoverable and can be retried. It is neither an instance of
`MalformedURLException`, nor an instance of `FileNotFoundException`, nor a
recoverable status code (`< 100 || == 408 || >= 500 && != 501 && != 505`), hence
a retry never occurs. This matters for our project because it causes problems
with state recovery.
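To illustrate the kind of check that could detect this case, here is a minimal sketch that walks an exception's cause chain looking for `UnknownHostException`. The class `TokenRetryCheck`, the method `causedByUnknownHost`, and the depth limit are hypothetical names and values chosen for illustration; they are not part of `AzureADAuthenticator` or any other Hadoop class.

```java
import java.io.IOException;
import java.net.UnknownHostException;

// Hypothetical helper for illustration only; not an existing Hadoop API.
final class TokenRetryCheck {
  // Assumed upper bound on how many causes to inspect, to guard against cyclic chains.
  private static final int MAX_CAUSE_DEPTH = 16;

  private TokenRetryCheck() {
  }

  /**
   * Returns true if the throwable itself, or any throwable in its cause
   * chain (up to MAX_CAUSE_DEPTH links), is an UnknownHostException.
   */
  static boolean causedByUnknownHost(Throwable t) {
    Throwable current = t;
    for (int depth = 0; current != null && depth < MAX_CAUSE_DEPTH; depth++) {
      if (current instanceof UnknownHostException) {
        return true;
      }
      current = current.getCause();
    }
    return false;
  }

  public static void main(String[] args) {
    // Same shape as described above: an IOException whose cause is UnknownHostException.
    IOException wrapped =
        new IOException("token call failed", new UnknownHostException("login.microsoftonline.com"));
    System.out.println(causedByUnknownHost(wrapped));
    System.out.println(causedByUnknownHost(new IOException("other failure")));
  }
}
```

A check like this could be consulted alongside the existing recoverability test, so that a DNS failure wrapped in an `IOException` is treated as a candidate for retry.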
The final exception stack trace on the client side looks as follows (Apache
Spark application):
{code:java}
Job aborted due to stage failure: Task 14 in stage 384.0 failed 4 times, most
recent failure: Lost task 14.3 in stage 384.0 TID 3087 10.244.91.7 executor 29
: Status code: -1 error code: null error message: Auth failure: HTTP Error -1;
url='https://login.microsoftonline.com/$TENANT_ID/oauth2/v2.0/token'
AzureADAuthenticator.getTokenCall threw java.net.UnknownHostException:
login.microsoftonline.com
at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation AbfsRestOperation.java:321
at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.completeExecute AbfsRestOperation.java:263
at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.lambda$execute$0 AbfsRestOperation.java:235
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.measureDurationOfInvocation IOStatisticsBinding.java:494
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDurationOfInvocation IOStatisticsBinding.java:465
at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute AbfsRestOperation.java:233
at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getPathStatus AbfsClient.java:1099
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getFileStatus AzureBlobFileSystemStore.java:1164
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus AzureBlobFileSystem.java:766
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus AzureBlobFileSystem.java:756
at org.apache.parquet.hadoop.util.HadoopInputFile.fromPath HadoopInputFile.java:39
at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter ParquetFooterReader.java:39
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.footerFileMetaData$lzycompute$1 ParquetFileFormat.scala:211
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.footerFileMetaData$1 ParquetFileFormat.scala:210
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2 ParquetFileFormat.scala:213
...{code}
I can see that other parts of the Hadoop project recover from this exception
(e.g., `DefaultAMSProcessor`).
We would like to have a similar retry mechanism for fetching tokens. Moreover,
`AbfsRestOperation` already handles and retries `UnknownHostException`, but that
part seems to apply only to storage communication, not token retrieval.
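As a rough sketch of what a retry around token retrieval might look like, the following wraps a token call in a bounded loop with exponential backoff. Everything here is an assumption for illustration: `TokenFetchRetry`, `TokenCall`, `fetchWithRetry`, and the backoff parameters are hypothetical and do not correspond to the existing retry policy classes in Hadoop.

```java
import java.io.IOException;
import java.net.UnknownHostException;
import java.util.function.Predicate;

// Hypothetical sketch; the names below are illustrative and are not Hadoop APIs.
final class TokenFetchRetry {
  private TokenFetchRetry() {
  }

  /** Stand-in for the call that requests an access token over HTTP. */
  @FunctionalInterface
  interface TokenCall {
    String fetch() throws IOException;
  }

  /**
   * Invokes the call up to maxAttempts times. A failure is retried only if
   * the supplied predicate accepts it; the wait doubles after each failed attempt.
   */
  static String fetchWithRetry(TokenCall call, Predicate<IOException> retryable,
      int maxAttempts, long initialBackoffMs) throws IOException, InterruptedException {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    long backoffMs = initialBackoffMs;
    for (int attempt = 1; ; attempt++) {
      try {
        return call.fetch();
      } catch (IOException e) {
        // Give up on the last attempt or when the failure is not considered transient.
        if (attempt >= maxAttempts || !retryable.test(e)) {
          throw e;
        }
        Thread.sleep(backoffMs);
        backoffMs *= 2;
      }
    }
  }

  public static void main(String[] args) throws Exception {
    int[] calls = {0};
    // Simulated token call: fails twice with a DNS error, then succeeds.
    String token = fetchWithRetry(() -> {
      if (calls[0]++ < 2) {
        throw new IOException("dns", new UnknownHostException("login.microsoftonline.com"));
      }
      return "dummy-token";
    }, e -> e.getCause() instanceof UnknownHostException, 5, 10L);
    System.out.println(token + " after " + calls[0] + " calls");
  }
}
```

Passing the retry decision in as a predicate keeps the loop independent of which failures count as transient, so the `UnknownHostException` case could be added without touching the handling of other errors.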