Hi, I am trying to use Hadoop as a distributed file storage system.
I did a POC with a small cluster of 1 namenode and 4 datanodes, and I was able to get/put files using the HDFS client and monitor the datanodes' status on: http://master-machine:50070/dfshealth.html

However, I have a few open questions that I would like to discuss with you guys before taking the solution to the next level.

*Questions are as follows:*

1) Is HDFS good at handling binary data such as executables, zips, VDIs, etc.?

2) How many datanodes can a namenode handle, assuming it's running on 24 cores with 90GB RAM and handling files between 200MB and 1GB in size (with the default block size of 128MB)?

3) Is there a way to tune the cluster setup, i.e. determine the best values for block size, replication factor, heap, etc.?

4) I am also curious: how much time does the namenode service take to acknowledge that a datanode has gone down?

5) What happens next? That is, does the namenode start replicating the blocks of the downed datanode to other available datanodes to meet the replication factor?

6) What happens when the datanode comes back up? Won't there be more block replicas in the system than expected, since the namenode replicated them while it was down?

7) Also, after coming back up, does the datanode perform cleanup for the files (blocks) that were pruned while it was down? That is, does it reclaim disk space by deleting blocks that were deleted while it was down?

8) During copying/replication, does a datanode with more available space get priority over a datanode with comparatively less space?

9) What are your recommendations for a cluster of around 2500 machines, each with 24 cores, 90GB RAM, and 500MB to 1TB of disk space to spare for HDFS? Are there any good tools to manage such a huge cluster and track its health and other status?

10) For a non-networking guy like me who doesn't own the network topology of the machines, what is your best recommendation for making the cluster rack-aware? I mean, what should I do to benefit from rack awareness in the cluster?

Thanks,
Sachin
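P.S. For question 3, just to be clear which knobs I mean: these are the hdfs-site.xml properties I've been experimenting with (the values below are only my current trial settings from the POC, not recommendations):

```xml
<!-- hdfs-site.xml: the tunables referred to in question 3 -->
<!-- (values are trial settings, not recommendations) -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value> <!-- 128 MB, the default block size -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- default replication factor -->
  </property>
</configuration>
```

The namenode heap, as far as I understand, is set separately via HADOOP_NAMENODE_OPTS in hadoop-env.sh.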

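P.P.S. For question 10, my current understanding is that rack awareness is enabled by pointing net.topology.script.file.name (in core-site.xml) at a script that maps each datanode address to a rack path. A minimal sketch of what I had in mind is below; the 10.1.*/10.2.* subnets are placeholders, since I don't know the actual layout of our network yet:

```shell
#!/bin/sh
# Hypothetical topology script: Hadoop invokes it with one or more
# datanode IPs/hostnames as arguments and expects one rack path
# printed per argument, in order.
map_rack() {
  for node in "$@"; do
    case "$node" in
      10.1.*) echo /rack1 ;;          # placeholder subnet -> rack 1
      10.2.*) echo /rack2 ;;          # placeholder subnet -> rack 2
      *)      echo /default-rack ;;   # anything unrecognized
    esac
  done
}

map_rack "$@"
```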