On 2 May 2012, at 18:29, Himanshu Vijay wrote:

> Hi,
>
> I have 100 files, each of ~3 GB. I need to distcp them to S3, but copying
> fails because of the large size of the files. The files are not gzipped,
> so they are splittable. Is there a way or property to tell distcp to first
> split the input files into, let's say, 200 MB or N lines each before
> copying to the destination?
Assuming you're using EMR, use s3distcp:

http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html

In any case, that's strange, because S3's limit is 5 GB per PUT request. Again, if you're running on EMR, try starting your cluster with

  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-c,fs.s3n.multipart.uploads.enabled=true,-c,fs.s3n.multipart.uploads.split.size=524288000"

(or add those to whatever parameters you currently use).

Going back to plain distcp, I'm not sure what the -sizelimit option does, as I've never used it. If push comes to shove, seeing as you have a Hadoop cluster, running a job to write the files to S3 with compression enabled is always an option :)

Cheers,

Pedro Figueiredo
Skype: pfig.89clouds
http://89clouds.com/ - Big Data Consulting
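For the original question about controlling output file size, s3distcp exposes a --targetSize option (in MiB), which works together with --groupBy: files matched into a group that exceed the target size are broken into parts. A rough sketch, assuming an EMR cluster with the s3distcp jar in the usual location; the bucket, paths, and --groupBy pattern below are placeholders, not values from the thread:

```shell
# Copy from HDFS to S3 with s3distcp, asking for ~200 MiB output parts.
# --groupBy is required for --targetSize to take effect; the regex here
# is only an illustrative pattern matching "part-*" file names.
hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
  --src hdfs:///data/input/ \
  --dest s3n://my-bucket/output/ \
  --groupBy '.*(part-).*' \
  --targetSize 200
```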
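The "write the files to S3 with compression enabled" suggestion can be done without writing any Java, using an identity Hadoop Streaming job. A sketch under the assumptions of that era's setup (mapred.* property names, the s3n filesystem, the streaming jar under $HADOOP_HOME/contrib); the bucket and paths are placeholders:

```shell
# Map-only identity job: reads the uncompressed input splits and writes
# them back out to S3 as gzip-compressed part files. With zero reducers,
# each map task emits one output file, so large splittable inputs are
# naturally broken into smaller compressed pieces.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
  -D mapred.output.compress=true \
  -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  -D mapred.reduce.tasks=0 \
  -input hdfs:///data/input \
  -output s3n://my-bucket/compressed-output \
  -mapper /bin/cat
```

Note that gzipped outputs are no longer splittable, which matters if the data will be read back by later MapReduce jobs.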
