Hello, 

You could try this jar, which I found a link to on one of the Amazon pages: 

s3cmd get s3://eu-west-1.elasticmapreduce/libs/s3distcp/1.0.1/s3distcp.jar
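
If s3cmd isn't already set up with your AWS credentials, you can configure it interactively first (standard s3cmd usage, not specific to this jar):

s3cmd --configure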

s3distcp.jar copies data between S3 and HDFS, in either direction, via MapReduce.

If your cluster has N reducers available, you can run: 

hadoop jar s3distcp.jar -D mapred.reduce.tasks=N --src s3://lame/foo --dest hdfs:///user/hadoop/lamefoo/
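
Since it copies in both directions, the same jar should handle getting data back out to S3 by swapping the arguments (same example paths as above):

hadoop jar s3distcp.jar -D mapred.reduce.tasks=N --src hdfs:///user/hadoop/lamefoo/ --dest s3://lame/foo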

I would run it in a screen session. 
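Something like this, so the job survives your SSH session dropping:

screen -S s3distcp
hadoop jar s3distcp.jar -D mapred.reduce.tasks=N --src s3://lame/foo --dest hdfs:///user/hadoop/lamefoo/

Detach with Ctrl-a d and reattach later with screen -r s3distcp while the job keeps running.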
On 4 Sep 2012, at 21:07, Soulghost wrote:

> 
> Hello guys 
> 
> I have a problem using DistCp to transfer a large file from S3 to an HDFS
> cluster. Whenever I try to make the copy, I only see CPU and memory usage
> on one of the nodes, not on all of them. I don't know whether this is the
> expected behaviour or a configuration problem. If I transfer multiple
> files, each node handles a single file at the same time. I understood
> the transfer would be parallel, but it doesn't seem like it is. 
> 
> I am using the 0.20.2 Hadoop distribution on a two-EC2-instance cluster. I
> was hoping one of you would have an idea of how DistCp works and which
> properties I could tweak to improve the transfer rate, which is currently
> 0.7 GB per minute. 
> 
> Regards.

_____________________________
Mischa Tuffield PhD
http://mmt.me.uk/
http://mmt.me.uk/foaf.rdf#mischa
