[
https://issues.apache.org/jira/browse/HADOOP-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Todd Lipcon updated HADOOP-6762:
--------------------------------
Attachment: hadoop-6762.txt
Here's an updated patch against trunk.
I ran all of the unit tests in the ipc package locally and they passed. I also
tried the new unit tests _without_ the patch, and they failed as expected.
Given that there was a deadlock found in an early rev of this patch, I also ran
all of the IPC unit tests under jcarder to look for lock inversions and it
found none.
I ran the RPCCallBenchmark for 30 seconds with and without the patch, with the
following results:
With patch:
====== Results ======
Options:
rpcEngine=class org.apache.hadoop.ipc.ProtobufRpcEngine
serverThreads=30
serverReaderThreads=4
clientThreads=30
host=0.0.0.0
port=12345
secondsToRun=30
msgSize=1024
Total calls per second: 24668.0
CPU time per call on client: 58639 ns
CPU time per call on server: 64893 ns
Without patch:
====== Results ======
Options:
rpcEngine=class org.apache.hadoop.ipc.ProtobufRpcEngine
serverThreads=30
serverReaderThreads=4
clientThreads=30
host=0.0.0.0
port=12345
secondsToRun=30
msgSize=1024
Total calls per second: 27881.0
CPU time per call on client: 68079 ns
CPU time per call on server: 62582 ns
As expected, throughput went down, by about 12% (27881 to 24668 calls per
second), since the RPC calls are now being shuttled between threads on the
client side. Measured CPU time per call rose slightly on the server; the
client-side figure reported by the benchmark was lower with the patch. The
throughput loss is unfortunate, but given that this fixes an important bug, and
given that _client_ side RPC throughput is rarely a bottleneck in common usage
scenarios, I think it is acceptable.
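
To illustrate the idea of shuttling calls between threads, here is a minimal
sketch, not the patch itself. It assumes a hypothetical single-threaded
"sender" executor that performs the channel write, and uses a java.nio Pipe in
place of the real RPC socket. The class and method names are made up for the
example. The calling thread is interrupted, but because the write happens on
the sender thread, the channel stays open.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: names are illustrative and do not come from the patch.
class SenderThreadHandoffSketch {

    /** Returns true if the channel is still open after an interrupted caller hands off a write. */
    static boolean channelSurvivesInterruptedCaller() {
        // Assumption: one dedicated thread owns all writes to the channel.
        ExecutorService sender = Executors.newSingleThreadExecutor();
        try {
            Pipe pipe = Pipe.open();
            Pipe.SinkChannel sink = pipe.sink();
            Callable<Integer> write = () -> sink.write(ByteBuffer.wrap(new byte[] {1}));

            Thread caller = new Thread(() -> {
                // Simulate another component interrupting the calling thread.
                Thread.currentThread().interrupt();
                Future<Integer> pending = sender.submit(write);
                try {
                    pending.get();
                } catch (InterruptedException | ExecutionException e) {
                    // The caller may observe its own interrupt here, but the
                    // channel I/O itself runs on the sender thread.
                }
            });
            caller.start();
            caller.join();

            // Let the queued write finish before inspecting the channel.
            sender.shutdown();
            sender.awaitTermination(5, TimeUnit.SECONDS);

            boolean open = sink.isOpen();
            sink.close();
            pipe.source().close();
            return open;
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        } finally {
            sender.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("channel open after interrupted caller: "
                + channelSurvivesInterruptedCaller());
    }
}
```

The extra hand-off and wait on each call is the kind of per-call overhead that
would show up as reduced client throughput in a benchmark like the one above.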
This patch is also nearly identical to a patch that we've shipped in CDH since
June 2010, so I'm fairly confident that the approach is correct.
> exception while doing RPC I/O closes channel
> --------------------------------------------
>
> Key: HADOOP-6762
> URL: https://issues.apache.org/jira/browse/HADOOP-6762
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 0.20.2
> Reporter: sam rash
> Assignee: Todd Lipcon
> Priority: Critical
> Attachments: hadoop-6762-10.txt, hadoop-6762-1.txt,
> hadoop-6762-2.txt, hadoop-6762-3.txt, hadoop-6762-4.txt, hadoop-6762-6.txt,
> hadoop-6762-7.txt, hadoop-6762-8.txt, hadoop-6762-9.txt, HADOOP-6762.patch,
> hadoop-6762.txt, hadoop-6762.txt, hadoop-6762.txt
>
>
> If a single process creates two distinct FileSystem instances for the same NN
> using FileSystem.newInstance(), and one of them issues a close(), the
> leasechecker thread is interrupted. This interrupt races with the
> namenode.renew() RPC and can cause a ClosedByInterruptException. That closes
> the underlying channel, so the other FileSystem, which shares the connection,
> gets errors.
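
The failure mode in the description above can be reproduced in isolation with
plain java.nio, independent of Hadoop. The sketch below is only an
illustration: it uses a Pipe rather than an RPC socket, and the class and
method names are made up. A thread whose interrupt status is set attempts a
blocking write on an interruptible channel. The JDK closes the channel as a
side effect, which is why every user of a shared connection is affected.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.Pipe;

// Hypothetical demo: names are illustrative and unrelated to Hadoop's code.
class InterruptClosesChannelDemo {

    /** Returns true if the channel is still open after an interrupted thread writes to it. */
    static boolean channelOpenAfterInterruptedWrite() {
        try {
            Pipe pipe = Pipe.open();
            Pipe.SinkChannel sink = pipe.sink();

            Thread writer = new Thread(() -> {
                // Simulate the interrupt arriving just before the I/O call.
                Thread.currentThread().interrupt();
                try {
                    sink.write(ByteBuffer.wrap(new byte[] {1}));
                } catch (ClosedChannelException e) {
                    // ClosedByInterruptException is a subclass of
                    // ClosedChannelException; the channel is closed here.
                } catch (IOException e) {
                    throw new IllegalStateException(e);
                }
            });
            writer.start();
            writer.join();

            // Observed from a different, non-interrupted thread.
            boolean open = sink.isOpen();
            sink.close();
            pipe.source().close();
            return open;
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("channel open after interrupted write: "
                + channelOpenAfterInterruptedWrite());
    }
}
```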