Thank you all for your help.
The solution that worked for me is as follows:
I opened an SSH tunnel to the namenode, which ensures that hadoop fs -ls works.
In order for hadoop fs -put to work (it was timing out because the namenode
was returning the datanodes' private IP addresses, which cannot be resolved by
the edge machine), I routed each datanode's private IP address to <some_port> on
localhost of the edge machine by adding an iptables entry for each datanode.
In addition, I opened an SSH tunnel that forwards all traffic
from localhost:<some_port> to the datanode private IPs via the gateway machine.
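
Concretely, the setup looks roughly like the following (all hostnames, ports,
and IPs are placeholders; the REDIRECT rule is one way to implement the
"route datanode private IP to a local port" step I described, not necessarily
the exact rule everyone will want):

    # Tunnel for the namenode: makes hadoop fs -ls work against
    # hdfs://localhost:<local-port>
    ssh -N -L <local-port>:<namenode-private-ip>:9000 <gateway-user>@<gatewayhost> -i <ssh-keys>

    # For each datanode: redirect locally generated traffic destined for the
    # datanode's private IP (50010 is the datanode transfer port in Hadoop 2.x)
    # to <some_port> on localhost of the edge machine
    sudo iptables -t nat -A OUTPUT -d <datanode-private-ip> -p tcp --dport 50010 \
        -j REDIRECT --to-ports <some_port>

    # Tunnel that forwards localhost:<some_port> to that datanode's private IP
    # via the gateway machine
    ssh -N -L <some_port>:<datanode-private-ip>:50010 <gateway-user>@<gatewayhost> -i <ssh-keys>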

I was hoping that there would be some Hadoop configuration I could set
so that I don't have to do all this setup on my own. I found out about
the hadoop.socks.server config, but it didn't work for me. I tried setting
hadoop.socks.server to localhost:<port> (whose traffic is tunneled via the
gateway node) and setting the socket factory config in core-site.xml, first
on the client and then on the server side.
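
For completeness, this is roughly what I tried in core-site.xml on the client
(the property names are the standard Hadoop ones; 1080 stands in for whatever
local port the SOCKS tunnel listens on; as said above, this did not work in my
setup):

    <property>
      <name>hadoop.rpc.socket.factory.class.default</name>
      <value>org.apache.hadoop.net.SocksSocketFactory</value>
    </property>
    <property>
      <name>hadoop.socks.server</name>
      <value>localhost:1080</value>
    </property>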




On Fri, Sep 13, 2019 at 7:04 PM Hariharan Iyer <[email protected]> wrote:

> You will have to use a socks proxy (-D option in ssh tunnel). In addition,
> when invoking hadoop fs command, you will have to add -Dsocks.proxyHost and
> -Dsocks.proxyPort.
>
> Thanks,
> Hariharan
>
> On Thu, 12 Sep 2019, 23:26 saurabh pratap singh, <[email protected]>
> wrote:
>
>> Thank you so much for your reply.
>> I have a further question: there are some blogs that talk about a
>> similar setup, like this one
>>
>> https://github.com/vkovalchuk/hadoop-2.6.0-windows/wiki/How-to-access-HDFS-behind-firewall-using-SOCKS-proxy
>>
>>
>> I am just curious how that works.
>>
>> On Thu, Sep 12, 2019 at 11:05 PM Tony S. Wu <[email protected]>
>> wrote:
>>
>>> You need connectivity from edge node to the entire cluster, not just
>>> namenode. Your topology, unfortunately, probably won’t work too well. A
>>> proper VPN / IPSec tunnel might be a better idea.
>>>
>>> On Thu, Sep 12, 2019 at 12:04 AM saurabh pratap singh <
>>> [email protected]> wrote:
>>>
>>>> Hadoop version : 2.8.5
>>>> I have an HDFS setup in a private data center (which is not exposed to
>>>> the internet). In the same data center I have another node (a gateway
>>>> node). The purpose of this gateway node is to provide access to HDFS from an edge
>>>> machine (which is outside the data center) over the public internet.
>>>> To enable this kind of setup I have set up an SSH tunnel from the edge
>>>> machine to the namenode host and port (9000) through the gateway node,
>>>> something like
>>>>
>>>> ssh -N -L <local-port>:<namenode-private-ip>:<namenodeport>
>>>> <gateway-user>@<gatewayhost> -i <ssh-keys> -vvvv
>>>>
>>>> When I ran hadoop fs -ls hdfs://localhost:<local-port> it worked fine
>>>> from the edge machine, but
>>>> when I executed hadoop fs -put
>>>> <some-file> hdfs://localhost:<local-port> it failed with the following
>>>> error message.
>>>>
>>>> org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout
>>>> while waiting for channel to be ready for connect. ch :
>>>> java.nio.channels.SocketChannel[connection-pending
>>>> remote=/<private-ip-of-datanode>:50010]
>>>> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
>>>> at
>>>> org.apache.hadoop.hdfs.DataStreamer.createSocketForPipeline(DataStreamer.java:253)
>>>> at
>>>> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1725)
>>>> at
>>>> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1679)
>>>> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:716)
>>>>
>>>>
>>>> It looks like it is trying to write directly to the private IP address
>>>> of the datanode. How do I resolve this?
>>>>
>>>> Do let me know if any other information is needed.
>>>>
>>>> Thanks
>>>>
>>>
