[ https://issues.apache.org/jira/browse/HADOOP-9454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13913543#comment-13913543 ]

Jordan Mendelson commented on HADOOP-9454:
------------------------------------------

The S3A implementation I wrote is Apache licensed. It has a few benefits over 
the patch, including parallel copy (and rename) support, not requiring those 
_$folder$ files (it uses filename/, which is what the Amazon web interface 
uses as well), and allowing multiple buffer output directories (these are 
needed because the file must be written out locally before upload so we can 
compute its MD5, which AWS requires). I've been using it in production for 
several months.

The only reason I switched to the AWS SDK was to get abortable HTTP calls. 
Hadoop likes to seek around files such as sequence files when searching for 
split points, which causes all sorts of performance problems if you can't 
abort your HTTP call (the current s3n implementation, IIRC, will read out the 
entire request before closing the connection and reopening it).

> Support multipart uploads for s3native
> --------------------------------------
>
>                 Key: HADOOP-9454
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9454
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>            Reporter: Jordan Mendelson
>            Assignee: Akira AJISAKA
>             Fix For: 2.4.0
>
>         Attachments: HADOOP-9454-10.patch, HADOOP-9454-11.patch, 
> HADOOP-9454-12.patch
>
>
> The s3native filesystem is limited to 5 GB file uploads to S3; however, the 
> newest version of jets3t supports multipart uploads, which allow storing 
> multi-TB files. While the s3 filesystem lets you bypass this restriction by 
> uploading blocks, we need to output our data into Amazon's publicdatasets 
> bucket, which is shared with others.
> Amazon has added a similar feature to their distribution of Hadoop, as has 
> MapR.
> Please note that while this supports large copies, it does not yet support 
> parallel copies, because jets3t does not yet expose an API that allows them 
> without Hadoop controlling the threads itself, unlike with uploads.
> By default, this patch does not enable multipart uploads. To enable them 
> (and parallel uploads), add the following keys to your Hadoop config:
> <property>
>   <name>fs.s3n.multipart.uploads.enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>fs.s3n.multipart.uploads.block.size</name>
>   <value>67108864</value>
> </property>
> <property>
>   <name>fs.s3n.multipart.copy.block.size</name>
>   <value>5368709120</value>
> </property>
> and create an /etc/hadoop/conf/jets3t.properties file containing something 
> like:
> storage-service.internal-error-retry-max=5
> storage-service.disable-live-md5=false
> threaded-service.max-thread-count=20
> threaded-service.admin-max-thread-count=20
> s3service.max-thread-count=20
> s3service.admin-max-thread-count=20
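For a sense of what the block sizes in the quoted config imply, here is a small arithmetic sketch. It assumes S3's documented multipart limits (at most 10,000 parts per upload, 5 GiB maximum part size); the function name is invented for illustration:

```python
import math

# Block sizes from the hadoop config quoted above
UPLOAD_BLOCK = 67108864    # fs.s3n.multipart.uploads.block.size: 64 MiB
COPY_BLOCK = 5368709120    # fs.s3n.multipart.copy.block.size: 5 GiB
MAX_PARTS = 10000          # S3 allows at most 10,000 parts per multipart upload

def parts_needed(file_size, block_size):
    """Number of multipart parts required for a file of the given size."""
    return math.ceil(file_size / block_size)

# A 100 GiB upload takes 1600 parts of 64 MiB each
print(parts_needed(100 * 2**30, UPLOAD_BLOCK))   # 1600

# Largest object coverable with 64 MiB parts: 10,000 * 64 MiB = 625 GiB
print(MAX_PARTS * UPLOAD_BLOCK / 2**30)          # 625.0

# A 1 TiB server-side copy takes 205 parts of 5 GiB each
print(parts_needed(2**40, COPY_BLOCK))           # 205
```

So the default 64 MiB upload block caps a single multipart upload at 625 GiB; raising fs.s3n.multipart.uploads.block.size is what pushes the ceiling toward the multi-TB range the description mentions.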



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
