I’m writing a program using the C API for Hadoop. I have a 4-node cluster, set up according to https://www.tutorialspoint.com/hadoop/hadoop_multi_node_cluster.htm. Of the 4 nodes, one is both the namenode and a datanode; the other three are datanodes (one of which also acts as the secondary namenode).
I’ve already managed to write about 1.5 TB of data to the cluster. My issue is reading data back: it’s too fast. *Way* too fast, and I don’t understand how or why.

The 1.5 TB is stored as roughly 20,000 files of 60–80 MB each. When I read the files back (7 files in parallel), I get read speeds in excess of 75 GB/s. That is obviously DRAM speed, and here’s the problem: each of the 4 nodes has only 32 GB of RAM, and I’m asking Hadoop to re-read over 400 GB of data. I actually use the data after reading it, so it isn’t the compiler optimizing something out; even with optimization flags turned off, it still runs 10x faster than this box’s network and disks can deliver. Specifically:

- 2x 10 Gb network ports, bonded: maximum network input 2.5 GB/s (test-verified).
- 16x 4 TB hard drives: 2 GB/s maximum throughput (test-verified, outside of Hadoop).

As for how I’m reading the data: hdfsOpenFile(…, O_RDONLY) and hdfsRead(). So, at best, I should get 4.5 GB/s, and that’s in a perfect world. But during my tests I see no network traffic and very little (~30–70 MB/s) disk I/O, yet Hadoop returns 300 GB of unique data to me (the data is real, not a pattern, and not particularly compressible or dedupable). I’m at a complete loss as to how 300 GB is getting sent to me so quickly.

I feel like I’m overlooking something trivial. I’m deliberately requesting more than 10x a single node’s memory (and over 2x the cluster’s total memory!) precisely to *prevent* caching from polluting my numbers, yet it’s doing something that should be impossible. I fully expect to facepalm at the end of this.

Oh, and here’s the really weird part (to me): if I request all 20,000 files, it zooms past the ~5,000 files cached from my earlier 400 GB read test and then slows down to a more realistic 2 GB/s for the rest of the files. Until I re-run the program a second time, that is; then it returns a result in something like 35 seconds instead of 5 minutes. !!!

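For reference, here’s roughly what my read path looks like (a simplified sketch, not my exact code: the connection arguments, buffer size, file path, and error handling are placeholders, and in the real program 7 of these loops run in parallel):

```c
#include <fcntl.h>   /* O_RDONLY */
#include <stdio.h>
#include "hdfs.h"    /* libhdfs C API */

/* Read one HDFS file end-to-end; returns bytes read, or -1 on error. */
static long read_whole_file(hdfsFS fs, const char *path) {
    hdfsFile f = hdfsOpenFile(fs, path, O_RDONLY, 0, 0, 0);
    if (!f) return -1;

    static char buf[1 << 20];  /* 1 MiB read buffer (placeholder size) */
    long total = 0;
    tSize n;
    while ((n = hdfsRead(fs, f, buf, sizeof buf)) > 0)
        total += n;            /* the data is actually consumed downstream */

    hdfsCloseFile(fs, f);
    return (n < 0) ? -1 : total;
}

int main(void) {
    /* "default" picks the namenode up from the client configuration;
     * a hard-coded host/port would work too. */
    hdfsFS fs = hdfsConnect("default", 0);
    if (!fs) return 1;

    long got = read_whole_file(fs, "/data/part-00000");  /* hypothetical path */
    printf("read %ld bytes\n", got);

    hdfsDisconnect(fs);
    return 0;
}
```

Nothing exotic: open with O_RDONLY, read sequentially in fixed-size chunks until EOF, close.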