Hi,

I measured the time to load files from a single-node HDFS cluster using two different techniques (in C):

1. Using the libHDFS API (example implementation: https://pastebin.com/EBQVBrGx).
2. Querying the fsck output of HDFS to retrieve block locations and loading the individual blocks that make up a file directly from the native file system, bypassing the HDFS API (example implementation: https://pastebin.com/FUvchwnS).
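For reference, a minimal sketch of the local-read half of (2) might look like the following. This only shows reading one block file off the native file system; the fsck/webHDFS query that produces the block file paths is not shown, and the example path naming is an assumption, not the actual DataNode layout:

```c
#include <stdio.h>
#include <stdlib.h>

/* Read one HDFS block file straight from the native file system.
 * The path would come from parsing fsck output, e.g. something under the
 * DataNode data directory like .../current/finalized/.../blk_NNNNNNNNN
 * (the exact layout is an assumption here).
 * Returns a malloc'd buffer (caller frees) and stores the size in *out_len,
 * or returns NULL on failure. */
static char *read_block(const char *path, long *out_len)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return NULL;

    /* Determine the block size by seeking to the end. */
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long len = ftell(f);
    if (len < 0) { fclose(f); return NULL; }
    rewind(f);

    char *buf = malloc(len > 0 ? (size_t)len : 1);
    if (!buf) { fclose(f); return NULL; }

    /* Read the whole block in one call; a real loader would likely
     * read in chunks and concatenate the blocks in order. */
    if (fread(buf, 1, (size_t)len, f) != (size_t)len) {
        free(buf);
        fclose(f);
        return NULL;
    }

    fclose(f);
    *out_len = len;
    return buf;
}
```

To reconstruct a full file, one would call `read_block` on each block path in block-order and concatenate the buffers.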
I have attached a plot that demonstrates the difference in speed. (The "Local File System" line shows the time to read the same file directly from the native file system; as expected, it is only slightly faster than (2) for larger files.) Across all file sizes, (2) was considerably faster than (1). I initially suspected that this was because (2) skips checksum verification, but benchmarking showed that the checksum cost is negligible. Also, querying the fsck output via the webHDFS API takes only around 10 ms in Python, so that cannot be a major bottleneck either. What overheads does the libHDFS API impose that make bypassing it so much faster?

Thanks,
-- Pratyush Das
[Attachment: snodemchunk.pdf]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
