[jira] [Created] (HADOOP-15492) increase performance of s3guard import command

Steve Loughran (JIRA) Thu, 24 May 2018 07:09:19 -0700

Steve Loughran created HADOOP-15492:
---------------------------------------


             Summary: increase performance of s3guard import command
                 Key: HADOOP-15492
                 URL: https://issues.apache.org/jira/browse/HADOOP-15492
             Project: Hadoop Common
          Issue Type: Sub-task
          Components: fs/s3
            Reporter: Steve Loughran


Some performance improvements that came to mind after looking at the s3guard 
import command.

Key point: it could handle the import of a tree with existing data better.

# If the bucket is already under s3guard, the listing returns every file, 
including those already in the store, and each one is then put() again.
# import calls {{putParentsIfNotPresent()}}, but DDBMetaStore.put() does the 
parent creation anyway.
# For each entry in the store (i.e. a file), the full parent listing is 
created, then a batch write is created to put all the parents and the actual 
file.

As a result, it's at risk of doing many more put calls than needed, especially 
for wide/deep directory trees.
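
To get a feel for the overhead, here is a rough model of the per-file pattern described above. It is only a sketch: {{NaiveImportCost}}, {{ancestors()}}, {{entriesWritten()}} and {{distinctEntries()}} are hypothetical names made up for illustration, not S3Guard code, and it assumes every file is written together with its complete parent chain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical cost model; not part of S3Guard. */
class NaiveImportCost {

  /** Ancestors of an absolute path, nearest first, excluding the root "/". */
  static List<String> ancestors(String path) {
    List<String> result = new ArrayList<>();
    int slash = path.lastIndexOf('/');
    while (slash > 0) {
      path = path.substring(0, slash);
      result.add(path);
      slash = path.lastIndexOf('/');
    }
    return result;
  }

  /** Entries written when each file is put along with its full parent chain. */
  static int entriesWritten(List<String> files) {
    int total = 0;
    for (String file : files) {
      total += 1 + ancestors(file).size();
    }
    return total;
  }

  /** Entries that actually need to exist: each file and each directory once. */
  static int distinctEntries(List<String> files) {
    Set<String> all = new HashSet<>(files);
    for (String file : files) {
      all.addAll(ancestors(file));
    }
    return all.size();
  }

  public static void main(String[] args) {
    List<String> files = Arrays.asList("/a/b/c/f1", "/a/b/c/f2", "/a/b/c/f3");
    System.out.println(entriesWritten(files));   // 12
    System.out.println(distinctEntries(files));  // 6
  }
}
```

In this model, three files three levels deep produce 12 written entries where 6 would do, and the gap grows with both the depth of the tree and the number of files per directory.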

It would be much more efficient to put all the files of a single directory in 
one or more batch requests, with the parent tree written only once. Better 
yet: a get() of that parent could show that it already exists, in which case 
the put of the parent entries could be skipped.
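
A minimal sketch of that idea, using a hypothetical in-memory stand-in rather than the real metadata store: {{GroupedImportSketch}}, {{InMemoryStore}}, {{putBatch()}} and {{importFiles()}} are invented names. It assumes that if a directory is found by get(), all of its ancestors are present too, and it ignores any limit on batch size, so a real store might need several batch requests for one large directory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch; not the S3Guard implementation. */
class GroupedImportSketch {

  /** In-memory stand-in for a metadata store that counts its writes. */
  static final class InMemoryStore {
    final Set<String> entries = new HashSet<>();
    int batchWrites = 0;
    int entriesWritten = 0;

    boolean get(String path) {
      return entries.contains(path);
    }

    void putBatch(Collection<String> paths) {
      batchWrites++;
      entriesWritten += paths.size();
      entries.addAll(paths);
    }
  }

  /** Parent of an absolute path, or null when the parent is the root. */
  static String parentOf(String path) {
    int slash = path.lastIndexOf('/');
    return slash > 0 ? path.substring(0, slash) : null;
  }

  static void importFiles(List<String> files, InMemoryStore store) {
    // Group the listed files by their parent directory.
    Map<String, List<String>> byParent = new TreeMap<>();
    for (String file : files) {
      String parent = parentOf(file);
      byParent.computeIfAbsent(parent == null ? "/" : parent,
          k -> new ArrayList<>()).add(file);
    }
    for (Map.Entry<String, List<String>> e : byParent.entrySet()) {
      List<String> batch = new ArrayList<>();
      // Walk up from the directory; stop at the first ancestor already stored.
      for (String dir = e.getKey();
           dir != null && !dir.equals("/") && !store.get(dir);
           dir = parentOf(dir)) {
        batch.add(dir);
      }
      batch.addAll(e.getValue());
      store.putBatch(batch);  // one batch per directory in this sketch
    }
  }

  public static void main(String[] args) {
    InMemoryStore store = new InMemoryStore();
    importFiles(Arrays.asList(
        "/a/b/c/f1", "/a/b/c/f2", "/a/b/c/f3", "/a/b/d/g1"), store);
    System.out.println(store.entriesWritten + " entries in "
        + store.batchWrites + " batch writes");
  }
}
```

For these four files the sketch writes 8 entries in 2 batches, against 16 entries if every file were put with its full parent chain. It still re-puts files that are already in the store; skipping those would need a further check per file.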



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

