I can list all of the files from regular HDFS in a few hours, not a day. Listing the files in a single directory in the har takes ~50 min. Honestly, I'd be happy with only a 10x performance hit; I'm seeing closer to 100-150x.
-Aaron

> On Aug 15, 2016, at 12:33 PM, Tsz Wo Sze <[email protected]> wrote:
>
> ls over files in har:// may be 10 times slower than ls over regular files. It
> does not sound normal unless it would take ~1 day to list out all the 250TB
> files when they are stored as regular files.
> Tsz-Wo
>
>
> On Monday, August 15, 2016 10:01 AM, Aaron Turner <[email protected]>
> wrote:
>
>
> Basically I want to list all the files in a .har file and compare the
> file list/sizes to an existing directory in HDFS. The problem is that
> running commands like: hdfs dfs -ls -R <path to har file> is orders of
> magnitude slower than running the same command against a live HDFS
> file system.
>
> How much slower? I've calculated it will take ~19 days to list all
> the files in 250TB worth of content spread between 2 .har files.
>
> Is this normal? Can I do this faster (write a map/reduce job/etc?)
>
> --
> Aaron Turner
> https://synfin.net/ Twitter: @synfinatic
> Those who would give up essential Liberty, to purchase a little temporary
> Safety, deserve neither Liberty nor Safety.
> -- Benjamin Franklin
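
On the "can I do this faster" question, one possible route is to skip har:// altogether and read the archive's `_index` file directly, since a .har is an ordinary HDFS directory holding `_index`, `_masterindex` and `part-*` files. The sketch below is not a verified fix for this thread. It assumes the usual index layout, where each line is space-separated as `<URL-encoded path> <file|dir> <part name or properties> <offset> <length> ...`, and that paths were encoded Java `URLEncoder`-style (spaces as `+`). The script name in the usage line is hypothetical.

```python
import sys
from urllib.parse import unquote_plus


def parse_index_line(line):
    """Return (path, size) for a file entry; None for dirs or malformed lines.

    Assumed layout (not confirmed in this thread):
    <encoded path> <file|dir> <part or props> <offset> <length> [...]
    """
    fields = line.split(" ")
    if len(fields) < 5 or fields[1] != "file":
        return None
    try:
        size = int(fields[4])
    except ValueError:
        return None
    # Assumption: names are URLEncoder-encoded, so '+' decodes to a space.
    return unquote_plus(fields[0]), size


def list_files(lines):
    """Yield (path, size) for every file entry in an iterable of index lines."""
    for line in lines:
        entry = parse_index_line(line.rstrip("\n"))
        if entry is not None:
            yield entry


if __name__ == "__main__":
    # Emit "size<TAB>path" so the output can be sorted and diffed.
    for path, size in list_files(sys.stdin):
        print(f"{size}\t{path}")
```

Usage would be along the lines of `hdfs dfs -cat <path to har file>/_index | python3 list_har_index.py`, with the output then compared against a listing of the live directory. This reads the index as one sequential stream instead of resolving each directory through the har filesystem. Whether that closes the 100-150x gap is untested here.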
