Steve Loughran created HADOOP-13829:
---------------------------------------
Summary: S3A getContentSummary to use flat listFiles instead of treewalk
Key: HADOOP-13829
URL: https://issues.apache.org/jira/browse/HADOOP-13829
Project: Hadoop Common
Issue Type: Sub-task
Components: fs/s3
Affects Versions: 2.8.0
Reporter: Steve Loughran
Priority: Minor
FS shell {{-count}} uses {{getContentSummary}} to summarise the contents; this
slows significantly as directory tree depth increases. On wide directories, the
FileStatus[] array is built up before recursing down, so with many millions of
files memory use becomes an issue.
Moving to a flat listFiles listing with iterator-based scanning would make
directory depth a near-non-issue and avoid the memory problems. We'd need to
reverse-construct the directory tree for its count summary; a hash map of
parent paths could build that up while iterating through the files and adding
up their sizes.
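A rough sketch of the parent-path idea, in plain Java so it runs without Hadoop on the classpath. Everything here is hypothetical: the class name {{FlatContentSummary}}, the {{Summary}} holder (standing in for {{ContentSummary}}) and the (path, size) iterator (standing in for what a recursive {{listFiles(path, true)}} would return) are illustrative assumptions, not the actual S3A patch. It uses a hash set rather than a map, since only the number of distinct directories is needed.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: summarise a tree from a flat, recursive file listing. */
public class FlatContentSummary {

  /** Aggregated counts; a stand-in for Hadoop's ContentSummary. */
  public static final class Summary {
    public long length;
    public long fileCount;
    public long directoryCount;
  }

  /**
   * @param root  path being summarised, without a trailing slash, e.g. "/data"
   * @param files iterator over (absolute file path, size in bytes) pairs
   */
  public static Summary summarise(String root,
      Iterator<Map.Entry<String, Long>> files) {
    Summary summary = new Summary();
    // Only directories are retained in memory, never the files themselves.
    Set<String> dirs = new HashSet<>();
    dirs.add(root); // assumption: the root directory counts towards the total
    while (files.hasNext()) {
      Map.Entry<String, Long> file = files.next();
      summary.fileCount++;
      summary.length += file.getValue();
      // Walk up from the file's parent towards the root. Stop at the first
      // directory already recorded: its ancestors were added with it.
      String parent = parentOf(file.getKey());
      while (parent.length() > root.length() && dirs.add(parent)) {
        parent = parentOf(parent);
      }
    }
    summary.directoryCount = dirs.size();
    return summary;
  }

  /** Parent of an absolute path; "/" for top-level entries. */
  private static String parentOf(String path) {
    int slash = path.lastIndexOf('/');
    return slash <= 0 ? "/" : path.substring(0, slash);
  }

  public static void main(String[] args) {
    List<Map.Entry<String, Long>> listing = List.of(
        Map.entry("/data/a/b/f1", 10L),
        Map.entry("/data/a/f2", 20L),
        Map.entry("/data/f3", 5L));
    Summary s = summarise("/data", listing.iterator());
    System.out.println(s.fileCount + " files, " + s.directoryCount
        + " dirs, " + s.length + " bytes");
  }
}
```

One caveat with this approach: a directory that contains no files never appears in a flat file listing, so empty directories would be missed unless they are handled separately.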