Hi,

I measured the time to load files from a single-node HDFS cluster using two different techniques (in C):

1. Using the libHDFS API (example implementation: https://pastebin.com/EBQVBrGx).
2. Querying the fsck output of HDFS to retrieve block locations and loading the individual blocks that make up a file directly from the native file system, bypassing the HDFS API (example implementation: https://pastebin.com/FUvchwnS).
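For reference, a minimal sketch of the local-read half of (2) might look like the following. This only shows reading one block file off the native file system; the fsck/webHDFS query that produces the block file paths is not shown, and the example path naming is an assumption, not the actual DataNode layout:

```c
#include <stdio.h>
#include <stdlib.h>

/* Read one HDFS block file straight from the native file system.
 * The path would come from parsing fsck output, e.g. something under the
 * DataNode data directory like .../current/finalized/.../blk_NNNNNNNNN
 * (the exact layout is an assumption here).
 * Returns a malloc'd buffer (caller frees) and stores the size in *out_len,
 * or returns NULL on failure. */
static char *read_block(const char *path, long *out_len)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return NULL;

    /* Determine the block size by seeking to the end. */
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long len = ftell(f);
    if (len < 0) { fclose(f); return NULL; }
    rewind(f);

    char *buf = malloc(len > 0 ? (size_t)len : 1);
    if (!buf) { fclose(f); return NULL; }

    /* Read the whole block in one call; a real loader would likely
     * read in chunks and concatenate the blocks in order. */
    if (fread(buf, 1, (size_t)len, f) != (size_t)len) {
        free(buf);
        fclose(f);
        return NULL;
    }

    fclose(f);
    *out_len = len;
    return buf;
}
```

To reconstruct a full file, one would call `read_block` on each block path in block-order and concatenate the buffers.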
I have attached a plot that demonstrates the difference in speed. (The "Local File System" line shows the time to read the same file directly from the native file system; as expected, it is only slightly faster than (2) for larger files.) Across all file sizes, (2) was considerably faster than (1). I initially suspected that this was because (2) skips checksum verification, but benchmarking showed that the checksum cost is negligible. Also, querying the fsck output via the webHDFS API takes only around 10 ms in Python, so that cannot be a major bottleneck either. What overheads does the libHDFS API impose that make bypassing it so much faster?

Thanks,
-- Pratyush Das
[Attachment: snodemchunk.pdf]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
