Hi, I am trying to use Hadoop as a distributed file storage system.
I did a POC with a small cluster of 1 namenode and 4 datanodes, and I was able to get/put files using the HDFS client and monitor the datanodes' status on: http://master-machine:50070/dfshealth.html

However, I have a few open questions that I would like to discuss with you guys before taking the solution to the next level.

*Questions are as follows:*

1) Is HDFS good at handling binary data such as executables, zips, VDIs, etc.?

2) How many datanodes can a namenode handle, assuming it's running on 24 cores with 90GB RAM and handling files between 200MB and 1GB in size (with the default block size of 128MB)?

3) Is there a way to tune the cluster setup, i.e. determine the best values for block size, replication factor, heap, etc.?

4) I am also curious: how much time does the namenode service take to acknowledge that a datanode has gone down?

5) What happens next? That is, does the namenode start replicating the blocks of the downed datanode to other available datanodes to meet the replication factor?

6) What happens when the datanode comes back up? Won't there be more block replicas in the system than expected, since the namenode replicated them while it was down?

7) Also, after coming back up, does the datanode perform cleanup for the files (blocks) that were pruned while it was down? That is, does it reclaim disk space by deleting blocks that were deleted while it was down?

8) During copying/replication, does a datanode with more available space get priority over a datanode with comparatively less space?

9) What are your recommendations for a cluster of around 2500 machines, each with 24 cores, 90GB RAM, and 500MB to 1TB of disk space to spare for HDFS? Are there any good tools to manage such a huge cluster and track its health and other status?

10) For a non-networking guy like me who doesn't own the network topology of the machines, what is your best recommendation for making the cluster rack-aware? I mean, what should I do to benefit from rack awareness in the cluster?

Thanks,
Sachin
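P.S. For question 3, just to be clear which knobs I mean: these are the hdfs-site.xml properties I've been experimenting with (the values below are only my current trial settings from the POC, not recommendations):

```xml
<!-- hdfs-site.xml: the tunables referred to in question 3 -->
<!-- (values are trial settings, not recommendations) -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value> <!-- 128 MB, the default block size -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- default replication factor -->
  </property>
</configuration>
```

The namenode heap, as far as I understand, is set separately via HADOOP_NAMENODE_OPTS in hadoop-env.sh.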

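P.P.S. For question 10, my current understanding is that rack awareness is enabled by pointing net.topology.script.file.name (in core-site.xml) at a script that maps each datanode address to a rack path. A minimal sketch of what I had in mind is below; the 10.1.*/10.2.* subnets are placeholders, since I don't know the actual layout of our network yet:

```shell
#!/bin/sh
# Hypothetical topology script: Hadoop invokes it with one or more
# datanode IPs/hostnames as arguments and expects one rack path
# printed per argument, in order.
map_rack() {
  for node in "$@"; do
    case "$node" in
      10.1.*) echo /rack1 ;;          # placeholder subnet -> rack 1
      10.2.*) echo /rack2 ;;          # placeholder subnet -> rack 2
      *)      echo /default-rack ;;   # anything unrecognized
    esac
  done
}

map_rack "$@"
```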