I’m writing a program using the C API for Hadoop. I have a 4-node cluster, set up according to https://www.tutorialspoint.com/hadoop/hadoop_multi_node_cluster.htm. Of the 4 nodes, one is both the namenode and a datanode; the other three are datanodes (one of which also acts as the secondary namenode).
I’ve already managed to write about 1.5 TB of data to the cluster. My issue is reading data back: it’s too fast. *Way* too fast, and I don’t understand how or why.

The 1.5 TB is stored as roughly 20,000 files of 60–80 MB each. When I read the files back (7 files in parallel), I get read speeds in excess of 75 GB/s. That is obviously DRAM speed, and here’s the problem: each of the 4 nodes has only 32 GB of RAM, and I’m asking Hadoop to re-read over 400 GB of data. I actually use the data after reading it, so it isn’t the compiler optimizing something out; even with optimization flags turned off, it still runs 10x faster than this box’s network and disks can deliver. Specifically:

- 2x 10 Gb network ports, bonded: maximum network input 2.5 GB/s (test-verified).
- 16x 4 TB hard drives: 2 GB/s maximum throughput (test-verified, outside of Hadoop).

As for how I’m reading the data: hdfsOpenFile(…, O_RDONLY) and hdfsRead(). So, at best, I should get 4.5 GB/s, and that’s in a perfect world. But during my tests I see no network traffic and very little (~30–70 MB/s) disk I/O, yet Hadoop returns 300 GB of unique data to me (the data is real, not a pattern, and not particularly compressible or dedupable). I’m at a complete loss as to how 300 GB is getting sent to me so quickly.

I feel like I’m overlooking something trivial. I’m deliberately requesting more than 10x a single node’s memory (and over 2x the cluster’s total memory!) precisely to *prevent* caching from polluting my numbers, yet it’s doing something that should be impossible. I fully expect to facepalm at the end of this.

Oh, and here’s the really weird part (to me): if I request all 20,000 files, it zooms past the ~5,000 files cached from my earlier 400 GB read test and then slows down to a more realistic 2 GB/s for the rest of the files. Until I re-run the program a second time, that is; then it returns a result in something like 35 seconds instead of 5 minutes. !!!

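For reference, here’s roughly what my read path looks like (a simplified sketch, not my exact code: the connection arguments, buffer size, file path, and error handling are placeholders, and in the real program 7 of these loops run in parallel):

```c
#include <fcntl.h>   /* O_RDONLY */
#include <stdio.h>
#include "hdfs.h"    /* libhdfs C API */

/* Read one HDFS file end-to-end; returns bytes read, or -1 on error. */
static long read_whole_file(hdfsFS fs, const char *path) {
    hdfsFile f = hdfsOpenFile(fs, path, O_RDONLY, 0, 0, 0);
    if (!f) return -1;

    static char buf[1 << 20];  /* 1 MiB read buffer (placeholder size) */
    long total = 0;
    tSize n;
    while ((n = hdfsRead(fs, f, buf, sizeof buf)) > 0)
        total += n;            /* the data is actually consumed downstream */

    hdfsCloseFile(fs, f);
    return (n < 0) ? -1 : total;
}

int main(void) {
    /* "default" picks the namenode up from the client configuration;
     * a hard-coded host/port would work too. */
    hdfsFS fs = hdfsConnect("default", 0);
    if (!fs) return 1;

    long got = read_whole_file(fs, "/data/part-00000");  /* hypothetical path */
    printf("read %ld bytes\n", got);

    hdfsDisconnect(fs);
    return 0;
}
```

Nothing exotic: open with O_RDONLY, read sequentially in fixed-size chunks until EOF, close.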