Hi,
I am facing a problem where the AM goes done once it is finished but, the API
keeps on polling to AM for JobStatus/Report and faces SocketTimeout
By default ipc.client.connect.max.retries.on.timeouts is set to 45 which is
very high in this case. JobClient can very well update this in its
configuration but, it effects at whole IPC level.
By going through the code same sort of stuff is handled in
ClientServiceDelegate class by providing a Yarn Level configuration to
over-ride IPC level configuration. It would be better if the same is done for
the mentioned property.
The code snippet where the problem is handled in ClientServiceDelegate
constructor:
this.conf.setInt(
CommonConfigurationKeysPublic.IPC_CLIENT_CONNECT_MAX_RETRIES_KEY,
this.conf.getInt(MRJobConfig.MR_CLIENT_TO_AM_IPC_MAX_RETRIES,
MRJobConfig.DEFAULT_MR_CLIENT_TO_AM_IPC_MAX_RETRIES));
The stack trace for Client keep on polling on SocketTimeout:
Daemon Thread [MrPlanRunner] (Suspended (breakpoint at line 682 in
Client$Connection))
Client$Connection.handleConnectionFailure(int, int, IOException) line:
682
Client$Connection.setupConnection() line: 484
Client$Connection.setupIOstreams() line: 565
Client$Connection.access$2000(Client$Connection) line: 213
Client.getConnection(Client$ConnectionId, Client$Call) line: 1269
Client.call(RpcPayloadHeader$RpcKind, Writable, Client$ConnectionId)
line: 1139
Client.call(Writable, Client$ConnectionId) line: 1122
ProtoOverHadoopRpcEngine$Invoker.invoke(Object, Method, Object[]) line:
148
$Proxy42.getJobReport(RpcController,
MRServiceProtos$GetJobReportRequestProto) line: not available
MRClientProtocolPBClientImpl.getJobReport(GetJobReportRequest) line:
111
GeneratedMethodAccessor11.invoke(Object, Object[]) line: not available
DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25
Method.invoke(Object, Object...) line: 597
ClientServiceDelegate.invoke(String, Class, Object) line: 296
ClientServiceDelegate.getJobStatus(JobID) line: 373
YARNRunner.getJobStatus(JobID) line: 483
Job$1.run() line: 322
Job$1.run() line: 319
AccessController.doPrivileged(PrivilegedExceptionAction<T>,
AccessControlContext) line: not available [native method]
Subject.doAs(Subject, PrivilegedExceptionAction<T>) line: 396
UserGroupInformation.doAs(PrivilegedExceptionAction<T>) line: 1177
Job.updateStatus() line: 319
Job.isComplete() line: 598
Job.monitorAndPrintJob() line: 1280
JobClient$NetworkedJob.monitorAndPrintJob() line: 432
JobClient.monitorAndPrintJob(JobConf, RunningJob) line: 902
Any thoughts!!!!!!
Cheers,
Subroto Sanyal