[jira] [Updated] (HADOOP-18847) mapreduce job encounters java.io.IOException when dfs.client.rbf.observer.read.enable is true

Chunyi Yang (Jira) Wed, 09 Aug 2023 23:34:05 -0700


     [ https://issues.apache.org/jira/browse/HADOOP-18847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Chunyi Yang updated HADOOP-18847:
---------------------------------
    Description: 
While executing a mapreduce job in an environment utilizing Router-Based 
Federation with Observer read enabled, there is an estimated 1% chance of 
encountering the following error.
{code:java}
"java.io.IOException: Resource 
hdfs://XXXX/user/XXXX/.staging/job_XXXXXX/.tez/application_XXXXXX/tez-conf.pb 
changed on src filesystem - expected: \"2023-07-07T12:41:16.801+0900\", was: 
\"2023-07-07T12:41:16.822+0900\", current time: 
\"2023-07-07T12:41:22.386+0900\"",
{code}
This error is thrown by verifyAndCopy in FSDownload.java when the NodeManager 
tries to download a file right after the file has been written to HDFS. The 
write operation runs on the active namenode and the read operation runs on the 
observer namenode, as expected.
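
The check that fails can be pictured with the minimal sketch below. It is a simplified stand-in, not the actual FSDownload code: the class name, method signature, and timestamp values are assumptions made for illustration, and only the idea (refuse to copy when the modification time read from the source differs from the one recorded for the resource) is taken from the error message above.

```java
import java.io.IOException;

public class TimestampCheckSketch {

    /**
     * Hypothetical, simplified version of the localizer's check: the expected
     * modification time was recorded when the resource was registered, and the
     * actual one is what the filesystem reports at download time.
     */
    public static void verifyTimestamp(String path, long expectedMtime, long actualMtime)
            throws IOException {
        if (actualMtime != expectedMtime) {
            throw new IOException("Resource " + path
                    + " changed on src filesystem - expected: " + expectedMtime
                    + ", was: " + actualMtime);
        }
    }

    public static void main(String[] args) {
        // Illustrative values only; they are not taken from the reported logs.
        long expected = 1_000L;
        long staleRead = 1_021L;
        try {
            verifyTimestamp("/user/example/tez-conf.pb", expected, staleRead);
            System.out.println("timestamps match, copy proceeds");
        } catch (IOException e) {
            System.out.println("download rejected: " + e.getMessage());
        }
    }
}
```

Under this reading, any read that returns an older modification time than the one recorded for the resource is enough to make the download fail, even though the file itself is intact.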

The edits file and hdfs-audit files indicate that the expected timestamp 
mentioned in the error message aligns with the OP_CLOSE MTIME of the 
'tez-conf.pb' file (which is correct). However, the actual timestamp retrieved 
from the read operation corresponds to the OP_ADD MTIME of the target 
'tez-conf.pb' file (which is incorrect). This inconsistency suggests that the 
observer namenode responds to the client before its edits file is updated with 
the latest stateId.

Further troubleshooting has revealed that during write operations, the router 
responds to the client before receiving the latest stateId from the active 
namenode. Consequently, the outdated stateId is then used in the subsequent 
read operation on the observer namenode, leading to inaccuracies in the 
information provided by the observer namenode.

To resolve this issue, it is essential to ensure that the router sends a 
response to the client only after receiving the latest stateId from the active 
namenode.
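
The race described above, and the effect of the proposed fix, can be illustrated with a toy model. Everything in it is an assumption made for illustration: the Observer class, the simulate method, the transaction ids, and the mtime values are hypothetical, and the "catch up to the client's stateId" step is a simplification of how an observer namenode delays a read until it has applied enough edits. It is a sketch of the reasoning, not of the router or namenode implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class ObserverReadSketch {

    /** Toy observer: a journal of per-transaction mtimes plus the last applied txid. */
    public static final class Observer {
        private final List<Long> journal = new ArrayList<>();
        private int appliedTxId = 0;

        /** An edit becomes available to the observer (not yet applied). */
        public void journalEdit(long mtime) {
            journal.add(mtime);
        }

        /** Apply edits up to txId, bounded by what is available in the journal. */
        public void applyUpTo(int txId) {
            appliedTxId = Math.max(appliedTxId, Math.min(txId, journal.size()));
        }

        /** Serve a read once the observer has caught up to the client's stateId. */
        public long readMtime(int clientStateId) {
            applyUpTo(clientStateId); // stand-in for waiting until caught up
            if (appliedTxId == 0) {
                throw new IllegalStateException("no edits applied");
            }
            return journal.get(appliedTxId - 1);
        }
    }

    /** Returns the mtime the client reads from the lagging observer. */
    public static long simulate(boolean routerWaitsForLatestStateId) {
        final long addMtime = 100L;   // hypothetical mtime recorded by OP_ADD (txid 1)
        final long closeMtime = 200L; // hypothetical mtime recorded by OP_CLOSE (txid 2)

        Observer observer = new Observer();
        observer.journalEdit(addMtime);
        observer.journalEdit(closeMtime);
        observer.applyUpTo(1); // the observer lags: only OP_ADD is applied

        int activeStateId = 2; // the active namenode has processed both edits
        // If the router answers the write too early, the client keeps the older id.
        int stateIdSeenByClient =
                routerWaitsForLatestStateId ? activeStateId : activeStateId - 1;
        return observer.readMtime(stateIdSeenByClient);
    }

    public static void main(String[] args) {
        System.out.println("stale stateId  -> mtime " + simulate(false));
        System.out.println("latest stateId -> mtime " + simulate(true));
    }
}
```

In this model a client holding the older stateId is served the OP_ADD mtime because the observer has no reason to catch up further, while a client holding the latest stateId forces the observer to apply OP_CLOSE first and gets the expected mtime.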

  was:
While executing a mapreduce job in an environment utilizing Router-Based 
Federation with Observer read enabled, there is an estimated 1% chance of 
encountering the following error.
{code:java}
"java.io.IOException: Resource 
hdfs://XXXX/user/XXXX/.staging/job_XXXXXX/.tez/application_XXXXXX/tez-conf.pb 
changed on src filesystem - expected: \"2023-07-07T12:41:16.801+0900\", was: 
\"2023-07-07T12:41:16.822+0900\", current time: 
\"2023-07-07T12:41:22.386+0900\"",
{code}
This error happens in function verifyAndCopy inside FSDownload.java when 
nodemanager tries to download a file right after the file has been written to 
the HDFS. The write operation runs on active namenode and read operation runs 
on observer namenode as expected.

The edits file and hdfs-audit files indicate that the expected timestamp 
mentioned in the error message aligns with the OP_CLOSE MTIME of the 
'tez-conf.pb' file (which is accurate). However, the actual timestamp retrieved 
from the read operation corresponds to the OP_ADD MTIME of the target 
'tez-conf.pb' file (which is incorrect). This inconsistency suggests that the 
observer namenode responds to the client before its edits file is updated with 
the latest stateId.

Further troubleshooting has revealed that during write operations, the router 
responds to the client before receiving the latest stateId from the active 
namenode. Consequently, the outdated stateId is then used in the subsequent 
read operation on the observer namenode, leading to inaccuracies in the 
information provided by the observer namenode.

To resolve this issue, it is essential to ensure that the router sends a 
response to the client only after receiving the latest stateId from the active 
namenode.


> mapreduce job encounters java.io.IOException when 
> dfs.client.rbf.observer.read.enable is true
> ---------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-18847
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18847
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: common
>            Reporter: Chunyi Yang
>            Priority: Minor



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
