Hi Rohith,

Thanks for your suggestion. I traced the issue and found that it is caused by an incompatibility introduced by these two changes, which altered the token wire format:
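To make the failure mode concrete, here is a minimal, self-contained sketch (not the real Hadoop classes; the class name, field layout, and byte values are hypothetical) of why a reader that expects the old Writable-style field-by-field layout breaks when handed an opaque protobuf-style payload — it misreads the first bytes as a field length and runs off the end of the stream, surfacing as a java.io.EOFException like the one in the stack trace below:

```java
import java.io.*;

// Hypothetical illustration only: before YARN-668/YARN-2615, token
// identifiers were serialized field by field (Writable style); afterwards
// the payload is an opaque protobuf blob. An old NM applying the old
// reader to the new blob typically underruns the stream.
public class TokenFormatMismatch {
    // Old-style reader: expects writeUTF(appId) followed by writeLong(keyId).
    static void readOldFormat(byte[] payload) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload));
        String appId = in.readUTF(); // misreads the first 2 bytes as a UTF length
        long keyId = in.readLong();
        System.out.println("parsed appId=" + appId + " keyId=" + keyId);
    }

    public static void main(String[] args) throws IOException {
        // Simulated "new" payload: a short, protobuf-like opaque byte blob.
        byte[] newPayload = new byte[] {0x0a, 0x03, 'f', 'o', 'o'};
        try {
            readOldFormat(newPayload);
        } catch (EOFException e) {
            // readUTF interprets 0x0a,0x03 as length 2563 and underruns.
            System.out.println("EOFException: old reader cannot parse new payload");
        }
    }
}
```

This is only a sketch of the class of failure, not the exact code path; in the real cluster the mismatch shows up during SASL/DIGEST-MD5 token validation rather than in a direct readFields() call.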
YARN-668. Changed NMTokenIdentifier/AMRMTokenIdentifier/ContainerTokenIdentifier to use protobuf object as the payload. Contributed by Junping Du.
YARN-2615. Changed ClientToAMTokenIdentifier/RM(Timeline)DelegationTokenIdentifier to use protobuf as payload. Contributed by Junping Du.

I was testing the new RM with an old NM.

Following up on the order of the YARN upgrade: I checked the HWX blog about rolling upgrades
<https://hortonworks.com/blog/introducing-rolling-upgrades-downgrades-apache-hadoop-yarn-cluster/>
and it suggests upgrading the RM first. But you are saying we should upgrade the NMs first and the RM second? Can you confirm?

Thanks,
Aihua

On Wed, Feb 6, 2019 at 8:26 PM Rohith Sharma K S <[email protected]> wrote:

> Hi Aihua,
>
> Could you give more clarity on when the job is submitted, i.e. a) before
> starting the upgrade, b) after the RM upgrade and before the NM upgrade, or
> c) after the YARN upgrade has fully completed?
> Typically, the suggested order of upgrade is NMs first and RM second.
>
> Regarding the NM warn messages, you might be hitting
> https://issues.apache.org/jira/browse/HADOOP-11692.
>
> Did any subsequent jobs succeed post upgrade?
> -Rohith Sharma K S
>
> On Thu, 7 Feb 2019 at 03:20, Aihua Xu <[email protected]> wrote:
>
>> Hi all,
>>
>> I'm investigating the rolling upgrade process from Hadoop 2.6 to Hadoop
>> 2.9.1. I'm trying to upgrade the ResourceManager first and then the
>> NodeManagers. When I submit a YARN job, the RM fails with the following
>> exception:
>>
>> Application application_1549408943468_0001 failed 2 times due to Error
>> launching appattempt_1549408943468_0001_000002.
>> Got exception: java.io.IOException: Failed on local exception:
>> java.io.IOException: java.io.EOFException; Host Details : local host is:
>> "hadoopbenchaqjm01-sjc1/10.67.2.171"; destination host is:
>> "hadoopbencha22-sjc1.prod.uber.internal":8041;
>> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:805)
>> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1497)
>> at org.apache.hadoop.ipc.Client.call(Client.java:1439)
>> at org.apache.hadoop.ipc.Client.call(Client.java:1349)
>> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
>> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
>> at com.sun.proxy.$Proxy87.startContainers(Unknown Source)
>> at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:498)
>> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>> at com.sun.proxy.$Proxy88.startContainers(Unknown Source)
>> at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:122)
>> at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:307)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:748)
>> Caused by: java.io.IOException: java.io.EOFException
>> at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:757)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:422)
>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1889)
>> at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:720)
>> at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:813)
>> at org.apache.hadoop.ipc.Client$Connection.access$3500(Client.java:411)
>> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1554)
>> at org.apache.hadoop.ipc.Client.call(Client.java:1385)
>> ... 20 more
>> Caused by: java.io.EOFException
>> at java.io.DataInputStream.readInt(DataInputStream.java:392)
>> at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1798)
>> at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:365)
>> at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:615)
>> at org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:411)
>> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:800)
>> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:796)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:422)
>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1889)
>> at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:795)
>> ... 23 more
>>
>> and the NM fails with:
>>
>> 2019-02-06 00:29:20,214 WARN SecurityLogger.org.apache.hadoop.ipc.Server:
>> Auth failed for 10.67.2.171:54588:null (DIGEST-MD5: IO error acquiring
>> password) with true cause: (null)
>>
>> I'm wondering whether this is a known issue and whether anybody has
>> insight into it.
>>
>> Thanks,
>> Aihua
