Thank you all for your help. The solution that worked for me is as follows:

1. I opened an SSH tunnel to the namenode, which ensures that hadoop fs -ls works.
2. hadoop fs -put was timing out because the namenode returns the private IP addresses of the datanodes, which cannot be resolved from the edge machine. To fix this, I routed each datanode's private IP address to <some_port> on localhost of the edge machine by adding an iptables entry for each datanode.
3. In addition, I opened an SSH tunnel which forwards all traffic from localhost:<some_port> to the datanode private IPs via the gateway machine.
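For anyone trying to reproduce this, the steps above can be sketched roughly as below. This is a minimal sketch, not my exact commands: the datanode IP (10.0.0.11), the local port (51001), and the key/host placeholders are made up, 50010 is the default HDFS data-transfer port, and the iptables OUTPUT-chain DNAT to loopback may additionally require route_localnet on some systems.

```shell
# 1. Tunnel the namenode RPC port (9000) through the gateway,
#    so that "hadoop fs -ls hdfs://localhost:9000" works from the edge machine.
ssh -N -L 9000:<namenode-private-ip>:9000 <gateway-user>@<gateway-host> -i <ssh-key> &

# 2. For each datanode, redirect its private-IP:50010 to a local port,
#    so the unreachable addresses the namenode hands back resolve locally.
#    Example for one datanode with private IP 10.0.0.11:
iptables -t nat -A OUTPUT -d 10.0.0.11 -p tcp --dport 50010 \
  -j DNAT --to-destination 127.0.0.1:51001

# 3. Tunnel that local port to the datanode's private IP via the gateway.
ssh -N -L 51001:10.0.0.11:50010 <gateway-user>@<gateway-host> -i <ssh-key> &
```

Steps 2 and 3 have to be repeated once per datanode, each with its own local port, which is why a single built-in Hadoop setting would have been much nicer.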
I was hoping there would be some Hadoop configuration I could set so that I don't have to do all this setup on my own. I found the hadoop.socks.server config, but it didn't work for me. I tried setting hadoop.socks.server to localhost:<port> (whose traffic is tunneled via the gateway node) and setting the socks factory config in the client's core-site.xml, and then on the server side.

On Fri, Sep 13, 2019 at 7:04 PM Hariharan Iyer <[email protected]> wrote:

> You will have to use a socks proxy (-D option in ssh tunnel). In addition,
> when invoking the hadoop fs command, you will have to add -Dsocks.proxyHost
> and -Dsocks.proxyPort.
>
> Thanks,
> Hariharan
>
> On Thu, 12 Sep 2019, 23:26 saurabh pratap singh, <[email protected]> wrote:
>
>> Thank you so much for your reply.
>> I have a further question: there are some blogs which talk about a
>> similar setup, like this one:
>>
>> https://github.com/vkovalchuk/hadoop-2.6.0-windows/wiki/How-to-access-HDFS-behind-firewall-using-SOCKS-proxy
>>
>> I am just curious how that works.
>>
>> On Thu, Sep 12, 2019 at 11:05 PM Tony S. Wu <[email protected]> wrote:
>>
>>> You need connectivity from the edge node to the entire cluster, not just
>>> the namenode. Your topology, unfortunately, probably won't work too well.
>>> A proper VPN / IPSec tunnel might be a better idea.
>>>
>>> On Thu, Sep 12, 2019 at 12:04 AM saurabh pratap singh <
>>> [email protected]> wrote:
>>>
>>>> Hadoop version: 2.8.5
>>>> I have HDFS set up in a private data center (which is not exposed to
>>>> the internet). In the same data center I have another node (the gateway
>>>> node). The purpose of this gateway node is to provide access to HDFS
>>>> from an edge machine (which is outside the data center) over the public
>>>> internet. To enable this kind of setup I opened an SSH tunnel from the
>>>> edge machine to the namenode host and port (9000) through the gateway
>>>> node.
>>>> Something like:
>>>>
>>>> ssh -N -L <local-port>:<namenode-private-ip>:<namenodeport> <gateway-user>@<gatewayhost> -i <ssh-keys> -vvvv
>>>>
>>>> When I ran hadoop fs -ls hdfs://localhost:<local-port> from the edge
>>>> machine it worked fine, but when I executed
>>>> hadoop fs -put <some-file> hdfs://localhost:<local-port> it failed with
>>>> the following error message:
>>>>
>>>> org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout
>>>> while waiting for channel to be ready for connect. ch :
>>>> java.nio.channels.SocketChannel[connection-pending
>>>> remote=/<private-ip-of-datanode>:50010]
>>>>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
>>>>     at org.apache.hadoop.hdfs.DataStreamer.createSocketForPipeline(DataStreamer.java:253)
>>>>     at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1725)
>>>>     at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1679)
>>>>     at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:716)
>>>>
>>>> It looks like the client is trying to write directly to the private IP
>>>> address of a datanode. How do I resolve this?
>>>>
>>>> Do let me know if any other information is needed.
>>>>
>>>> Thanks
>>>
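P.S. For reference, the SOCKS-based alternative I mentioned at the top (which did not work in my case) would look roughly like the fragment below in the client's core-site.xml. This is a sketch, not a verified working config: it assumes a dynamic tunnel was opened with something like ssh -D 1080 <gateway-user>@<gateway-host>, and localhost:1080 is a placeholder for that tunnel endpoint.

```xml
<!-- Route Hadoop RPC sockets through a SOCKS proxy.
     localhost:1080 is a placeholder for an "ssh -D 1080" tunnel endpoint. -->
<property>
  <name>hadoop.rpc.socket.factory.class.default</name>
  <value>org.apache.hadoop.net.SocksSocketFactory</value>
</property>
<property>
  <name>hadoop.socks.server</name>
  <value>localhost:1080</value>
</property>
```

As Hariharan noted in the thread, the hadoop fs command may also need -Dsocks.proxyHost and -Dsocks.proxyPort on top of this, since the datanode data-transfer sockets do not necessarily go through the RPC socket factory.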
