[
https://issues.apache.org/jira/browse/HADOOP-19620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Serhii Nesterov updated HADOOP-19620:
-------------------------------------
Description:
When Hadoop is requested to perform operations against ADLS Gen2 storage,
AbfsRestOperation attempts to obtain an access token from Microsoft. Under
the hood, it uses a plain java.net.HttpURLConnection HTTP client.
Occasionally, environments run into intermittent network issues, including
DNS-related UnknownHostException. Technically, the HTTP client throws an
IOException whose cause is UnknownHostException. AzureADAuthenticator in turn
catches the IOException, sets httperror = -1 and then checks whether the error
is recoverable and can be retried. However, it is neither an instance of
MalformedURLException, nor an instance of FileNotFoundException, nor a
recoverable status code (< 100 || == 408 || >= 500 && != 501 && != 505), hence
a retry never occurs. Our project is sensitive to this because the failed
token fetch causes problems with state recovery.
The final exception stack trace on the client side looks as follows (Apache
Spark application, tenant ID is redacted):
{code:java}
Job aborted due to stage failure: Task 14 in stage 384.0 failed 4 times, most
recent failure: Lost task 14.3 in stage 384.0 TID 3087 10.244.91.7 executor 29
: Status code: -1 error code: null error message: Auth failure: HTTP Error -1;
url='https://login.microsoftonline.com/$TENANT_ID/oauth2/v2.0/token'
AzureADAuthenticator.getTokenCall threw java.net.UnknownHostException:
login.microsoftonline.com
at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation AbfsRestOperation.java:321
at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.completeExecute AbfsRestOperation.java:263
at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.lambda$execute$0 AbfsRestOperation.java:235
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.measureDurationOfInvocation IOStatisticsBinding.java:494
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDurationOfInvocation IOStatisticsBinding.java:465
at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute AbfsRestOperation.java:233
at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getPathStatus AbfsClient.java:1099
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getFileStatus AzureBlobFileSystemStore.java:1164
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus AzureBlobFileSystem.java:766
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus AzureBlobFileSystem.java:756
at org.apache.parquet.hadoop.util.HadoopInputFile.fromPath HadoopInputFile.java:39
at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter ParquetFooterReader.java:39
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.footerFileMetaData$lzycompute$1 ParquetFileFormat.scala:211
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.footerFileMetaData$1 ParquetFileFormat.scala:210
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2 ParquetFileFormat.scala:213
...{code}
I can see that other parts of the Hadoop project recover from this exception
(e.g., DefaultAMSProcessor).
We would like to have similar retry mechanisms for fetching tokens. Moreover,
AbfsRestOperation already handles and retries UnknownHostException but that
part seems to be applicable only to storage communication, not token retrieval.
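To illustrate the requested behaviour, here is a minimal sketch of retrying a token fetch when the failure is DNS-related. It is not the AzureADAuthenticator code and not a proposed patch: the class TokenFetchRetry, the methods isUnknownHost and fetchWithRetry, and the parameters maxAttempts and initialBackoffMs are all hypothetical names chosen for this example. The sketch assumes, as described above, that the UnknownHostException may arrive either directly or as the cause of an IOException.

```java
import java.io.IOException;
import java.net.UnknownHostException;
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration only; not part of Hadoop. */
final class TokenFetchRetry {

  private TokenFetchRetry() {
  }

  /** Returns true if t, or any exception in its cause chain, is an UnknownHostException. */
  static boolean isUnknownHost(Throwable t) {
    // Bound the walk so that a cyclic cause chain cannot loop forever.
    for (int depth = 0; t != null && depth < 16; depth++, t = t.getCause()) {
      if (t instanceof UnknownHostException) {
        return true;
      }
    }
    return false;
  }

  /**
   * Calls fetch at least once and at most maxAttempts times. A retry happens
   * only after an IOException that is DNS-related; the wait between attempts
   * doubles each time, starting from initialBackoffMs.
   */
  static <T> T fetchWithRetry(Callable<T> fetch, int maxAttempts, long initialBackoffMs)
      throws Exception {
    long backoffMs = initialBackoffMs;
    for (int attempt = 1; ; attempt++) {
      try {
        return fetch.call();
      } catch (IOException e) {
        if (attempt >= maxAttempts || !isUnknownHost(e)) {
          throw e; // out of attempts, or not a DNS failure: propagate unchanged
        }
        Thread.sleep(backoffMs);
        backoffMs *= 2;
      }
    }
  }
}
```

In a real change, the check would more likely be folded into the existing recoverability test and retry policy of the token fetch path rather than added as a separate wrapper; the wrapper form is used here only to keep the example self-contained.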
> AzureADAuthenticator should be able to retry on UnknownHostException
> --------------------------------------------------------------------
>
> Key: HADOOP-19620
> URL: https://issues.apache.org/jira/browse/HADOOP-19620
> Project: Hadoop Common
> Issue Type: Improvement
> Components: auth
> Affects Versions: 3.4.1
> Reporter: Serhii Nesterov
> Priority: Minor