org.apache.hadoop.tools.DistCp$DuplicationException: Invalid input, there are duplicated files in the sources: hftp://ub13:50070/tmp/Rtmp1BU9Kb/file6abc6ccb6551/_logs/history, hftp://ub13:50070/tmp/Rtmp3yCJhu/file1ca96d9331/_logs/history

Any idea what the problem is here? They are different files, so how are they conflicting?

Thanks & Regards
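A likely reading of that error: distcp copies all of its sources into a single destination and aborts up front when two inputs would map to the same destination path. Both temp directories here contain a job-history file at the same relative location (_logs/history), which is presumably the collision being reported. One possible workaround, with an invented destination layout, is to run a separate distcp per directory so the relative names cannot clash; another is to pass -f with an explicit source-list file that leaves the _logs directories out.

# new-nn:8020 below is a placeholder for the CDH3 namenode's RPC address
$ hadoop distcp hftp://ub13:50070/tmp/Rtmp1BU9Kb/file6abc6ccb6551 \
    hdfs://new-nn:8020/staging/file6abc6ccb6551
$ hadoop distcp hftp://ub13:50070/tmp/Rtmp3yCJhu/file1ca96d9331 \
    hdfs://new-nn:8020/staging/file1ca96d9331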
On Tue, May 8, 2012 at 11:52 PM, Adam Faris <[email protected]> wrote:
> Hi Austin,
>
> I'm glad that helped out. Regarding the -p flag for distcp, here's the
> online documentation:
>
> http://hadoop.apache.org/common/docs/current/distcp.html#Option+Index
>
> You can also get this info from running 'hadoop distcp' without any flags.
> --------
> -p[rbugp]  Preserve
>            r: replication number
>            b: block size
>            u: user
>            g: group
>            p: permission
> --------
>
> -- Adam
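To make those letters concrete, a small illustration (hosts and paths below are placeholders): -ppgu keeps permission, group, and user on everything copied, and per that option index, -p with no letters behaves like -prbugp.

# old-nn / new-nn are placeholder hostnames
$ hadoop distcp -ppgu hftp://old-nn:50070/data hdfs://new-nn:8020/data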
> On May 7, 2012, at 10:55 PM, Austin Chungath wrote:
>
>> Thanks Adam,
>>
>> That was very helpful. Your second point solved my problems :-)
>> The hdfs port number was wrong.
>> I didn't use the option -ppgu. What does it do?
>>
>> On Mon, May 7, 2012 at 8:07 PM, Adam Faris <[email protected]> wrote:
>>
>>> Hi Austin,
>>>
>>> I don't know about using CDH3, but we use distcp for moving data
>>> between different versions of Apache grids, and several things come
>>> to mind.
>>>
>>> 1) You should use the -i flag to ignore checksum differences on the
>>> blocks. I'm not 100% sure, but I want to say hftp doesn't support
>>> checksums on the blocks as they go across the wire.
>>>
>>> 2) You should read from hftp but write to hdfs. Also make sure to
>>> check your port numbers. For example, I can read from hftp on port
>>> 50070 and write to hdfs on port 9000. You'll find the hftp port in
>>> hdfs-site.xml and the hdfs port in core-site.xml on Apache releases.
>>>
>>> 3) Do you have security (Kerberos) enabled on 0.20.205? Does CDH3
>>> support security? If security is enabled on 0.20.205 and CDH3 does
>>> not support it, you will need to disable security on 0.20.205. This
>>> is because you are unable to write from a secure to an unsecured grid.
>>>
>>> 4) Use the -m flag to limit your mappers so you don't DDoS your
>>> network backbone.
>>>
>>> 5) Why isn't your vendor helping you with the data migration? :)
>>>
>>> Otherwise something like this should get you going:
>>>
>>> hadoop distcp -i -ppgu -log /tmp/mylog -m 20 \
>>>     hftp://mynamenode.grid.one:50070/path/to/my/src/data \
>>>     hdfs://mynamenode.grid.two:9000/path/to/my/dst
>>>
>>> -- Adam
>>>
>>> On May 7, 2012, at 4:29 AM, Nitin Pawar wrote:
>>>
>>>> Things to check:
>>>>
>>>> 1) When you launch the distcp job, all the datanodes of the older
>>>> hdfs are live and connected.
>>>> 2) When you launch distcp, no data is being written/moved/deleted
>>>> in hdfs.
>>>> 3) You can use the option -log to log errors into a directory, and
>>>> use -i to ignore errors.
>>>>
>>>> Also, you can try using distcp with the hdfs protocol instead of
>>>> hftp. For more you can refer to:
>>>> https://groups.google.com/a/cloudera.org/group/cdh-user/browse_thread/thread/d0d99ad9f1554edd
>>>>
>>>> If it failed, there should be some error.
>>>>
>>>> -- Nitin Pawar
>>>>
>>>> On Mon, May 7, 2012 at 4:44 PM, Austin Chungath <[email protected]> wrote:
>>>>
>>>>> OK, that was a lame mistake.
>>>>> $ hadoop distcp hftp://localhost:50070/tmp hftp://localhost:60070/tmp_copy
>>>>> I had written hdfs instead of "hftp".
>>>>>
>>>>> $ hadoop distcp hftp://localhost:50070/docs/index.html hftp://localhost:60070/user/hadoop
>>>>> 12/05/07 16:38:09 INFO tools.DistCp: srcPaths=[hftp://localhost:50070/docs/index.html]
>>>>> 12/05/07 16:38:09 INFO tools.DistCp: destPath=hftp://localhost:60070/user/hadoop
>>>>> With failures, global counters are inaccurate; consider running with -i
>>>>> Copy failed: java.io.IOException: Not supported
>>>>>     at org.apache.hadoop.hdfs.HftpFileSystem.delete(HftpFileSystem.java:457)
>>>>>     at org.apache.hadoop.tools.DistCp.fullyDelete(DistCp.java:963)
>>>>>     at org.apache.hadoop.tools.DistCp.copy(DistCp.java:672)
>>>>>     at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>>>     at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)
>>>>>
>>>>> Any idea why this error is coming?
>>>>> I am copying one file from 0.20.205 (/docs/index.html) to cdh3u3
>>>>> (/user/hadoop).
>>>>>
>>>>> Thanks & Regards,
>>>>> Austin
>>>>>
>>>>> On Mon, May 7, 2012 at 3:57 PM, Austin Chungath <[email protected]> wrote:
>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> So I decided to try and move the data using distcp.
>>>>>>
>>>>>> $ hadoop distcp hdfs://localhost:54310/tmp hdfs://localhost:8021/tmp_copy
>>>>>> 12/05/07 14:57:38 INFO tools.DistCp: srcPaths=[hdfs://localhost:54310/tmp]
>>>>>> 12/05/07 14:57:38 INFO tools.DistCp: destPath=hdfs://localhost:8021/tmp_copy
>>>>>> With failures, global counters are inaccurate; consider running with -i
>>>>>> Copy failed: org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol
>>>>>> org.apache.hadoop.hdfs.protocol.ClientProtocol version mismatch.
>>>>>> (client = 63, server = 61)
>>>>>>
>>>>>> I found that we can do distcp like the above only if both clusters
>>>>>> are of the same hadoop version, so I tried:
>>>>>>
>>>>>> $ hadoop distcp hftp://localhost:50070/tmp hdfs://localhost:60070/tmp_copy
>>>>>> 12/05/07 15:02:44 INFO tools.DistCp: srcPaths=[hftp://localhost:50070/tmp]
>>>>>> 12/05/07 15:02:44 INFO tools.DistCp: destPath=hdfs://localhost:60070/tmp_copy
>>>>>>
>>>>>> But this process seems to hang at this stage. What might I be
>>>>>> doing wrong?
>>>>>>
>>>>>> hftp://<dfs.http.address>/<path>
>>>>>> hftp://localhost:50070 is the dfs.http.address of 0.20.205
>>>>>> hdfs://localhost:60070 is the dfs.http.address of cdh3u3
>>>>>>
>>>>>> Thanks and regards,
>>>>>> Austin
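Two patterns in those failures are worth calling out, as a likely reading rather than a confirmed diagnosis. First, hftp is a read-only filesystem: it works as a distcp source but never as a destination, which is why the copy dies in HftpFileSystem.delete with "Not supported". Second, an hdfs:// URI has to point at the namenode's RPC port (fs.default.name), not at dfs.http.address; writing to hdfs://localhost:60070 aims RPC traffic at an HTTP port, which is consistent with the hang. The shape that usually works across versions is to run distcp on the destination (CDH3u3) cluster, reading over hftp and writing over hdfs. A sketch, assuming 8021 really is the CDH3 namenode's RPC port, as the earlier version-mismatch attempt suggests:

# run this from the CDH3u3 cluster, so the writer speaks the destination's RPC version
$ hadoop distcp -i hftp://localhost:50070/tmp hdfs://localhost:8021/tmp_copy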
>>>>>> On Fri, May 4, 2012 at 4:30 AM, Michel Segel <[email protected]> wrote:
>>>>>>
>>>>>>> Ok... so riddle me this...
>>>>>>> I currently have a replication factor of 3. I reset it to two.
>>>>>>> What do you have to do to get the replication factor from 3 down
>>>>>>> to 2? Do I just try to rebalance the nodes?
>>>>>>>
>>>>>>> The point is that you are looking at a very small cluster.
>>>>>>> You may want to start the new cluster with a replication factor
>>>>>>> of 2 and then, when the data is moved over, increase it to a
>>>>>>> factor of 3. Or maybe not.
>>>>>>>
>>>>>>> I do a distcp to copy the data, and after each distcp I do an
>>>>>>> fsck for a sanity check and then remove the files I copied. As I
>>>>>>> gain more room, I can then slowly drop nodes, do an fsck,
>>>>>>> rebalance, and then repeat.
>>>>>>>
>>>>>>> Even though this is a dev cluster, the OP wants to retain the data.
>>>>>>>
>>>>>>> There are other options depending on the amount and size of the
>>>>>>> new hardware. I mean, make one machine a RAID 5 machine and copy
>>>>>>> data to it, clearing off the cluster.
>>>>>>>
>>>>>>> If 8 TB was the amount of disk used, that would be about 2.67 TB
>>>>>>> of actual data. Let's say 3 TB. Going RAID 5, how much disk is
>>>>>>> that? So you could fit it on one machine, depending on hardware,
>>>>>>> or maybe 2 machines... Now you can rebuild the initial cluster
>>>>>>> and then move the data back. Then rebuild those machines. Lots of
>>>>>>> options... ;-)
>>>>>>>
>>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>>
>>>>>>> Mike Segel
>>>>>>>
>>>>>>> On May 3, 2012, at 11:26 AM, Suresh Srinivas <[email protected]> wrote:
>>>>>>>
>>>>>>>> This is probably a more relevant question for the CDH mailing
>>>>>>>> lists. That said, what Edward is suggesting seems reasonable:
>>>>>>>> reduce the replication factor, decommission some of the nodes,
>>>>>>>> create a new cluster with those nodes, and do a distcp.
>>>>>>>>
>>>>>>>> Could you share with us the reasons you want to migrate from
>>>>>>>> Apache 205?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Suresh
>>>>>>>>
>>>>>>>> On Thu, May 3, 2012 at 8:25 AM, Edward Capriolo <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Honestly, that is a hassle; going from 205 to cdh3u3 is
>>>>>>>>> probably more of a cross-grade than an upgrade or downgrade. I
>>>>>>>>> would just stick it out. But yes, like Michael said, two
>>>>>>>>> clusters on the same gear and distcp. If you are using RF=3 you
>>>>>>>>> could also lower your replication to RF=2 ('hadoop dfs -setrep
>>>>>>>>> 2') to clear headroom as you are moving stuff.
>>>>>>>>>
>>>>>>>>> On Thu, May 3, 2012 at 7:25 AM, Michel Segel <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Ok... when you get your new hardware...
>>>>>>>>>>
>>>>>>>>>> Set up one server as your new NN, JT, SN.
>>>>>>>>>> Set up the others as DNs.
>>>>>>>>>> (Cloudera CDH3u3)
>>>>>>>>>>
>>>>>>>>>> On your existing cluster...
>>>>>>>>>> Remove your old log files and temp files on HDFS, anything you
>>>>>>>>>> don't need. This should give you some more space.
>>>>>>>>>> Start copying some of the directories/files to the new cluster.
>>>>>>>>>> As you gain space, decommission a node, rebalance, add the
>>>>>>>>>> node to the new cluster...
>>>>>>>>>>
>>>>>>>>>> It's a slow process.
>>>>>>>>>>
>>>>>>>>>> Should I remind you to make sure you up your bandwidth
>>>>>>>>>> setting, and to clean up the hdfs directories when you
>>>>>>>>>> repurpose the nodes?
>>>>>>>>>>
>>>>>>>>>> Does this make sense?
>>>>>>>>>>
>>>>>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>>>>>
>>>>>>>>>> Mike Segel
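That shrink-and-grow loop maps onto a handful of stock commands from this era. A sketch, not a recipe: it assumes a decommission excludes file is already wired up via dfs.hosts.exclude in the namenode's hdfs-site.xml, and that the bandwidth knob Michel mentions is dfs.balance.bandwidthPerSec.

# drop replication on existing data to free raw space
$ hadoop dfs -setrep -R 2 /
# after adding a datanode to the excludes file, start decommissioning
$ hadoop dfsadmin -refreshNodes
# watch decommission progress and overall health
$ hadoop dfsadmin -report
$ hadoop fsck /
# spread blocks out after each change
$ hadoop balancer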
>>>>>>>>>> On May 3, 2012, at 5:46 AM, Austin Chungath <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yeah, I know :-)
>>>>>>>>>>> and this is not a production cluster ;-) and yes, there is
>>>>>>>>>>> more hardware coming :-)
>>>>>>>>>>>
>>>>>>>>>>> On Thu, May 3, 2012 at 4:10 PM, Michel Segel <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Well, you've kind of painted yourself into a corner...
>>>>>>>>>>>> Not sure why you didn't get a response from the Cloudera
>>>>>>>>>>>> lists, but it's a generic question...
>>>>>>>>>>>>
>>>>>>>>>>>> 8 out of 10 TB. Are you talking effective storage or actual
>>>>>>>>>>>> disks? And please tell me you've already ordered more
>>>>>>>>>>>> hardware... right? And please tell me this isn't your
>>>>>>>>>>>> production cluster...
>>>>>>>>>>>>
>>>>>>>>>>>> (Strong hint to Strata and Cloudera... You really want to
>>>>>>>>>>>> accept my upcoming proposal talk... ;-)
>>>>>>>>>>>>
>>>>>>>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>>>>>>>
>>>>>>>>>>>> Mike Segel
>>>>>>>>>>>>
>>>>>>>>>>>> On May 3, 2012, at 5:25 AM, Austin Chungath <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Yes. This was first posted on the Cloudera mailing list.
>>>>>>>>>>>>> There were no responses.
>>>>>>>>>>>>>
>>>>>>>>>>>>> But this is not related to Cloudera as such.
>>>>>>>>>>>>>
>>>>>>>>>>>>> cdh3 uses apache hadoop 0.20 as the base. My data is in
>>>>>>>>>>>>> apache hadoop 0.20.205.
>>>>>>>>>>>>>
>>>>>>>>>>>>> There is an upgrade namenode option when we are migrating
>>>>>>>>>>>>> to a higher version, say from 0.20 to 0.20.205, but here I
>>>>>>>>>>>>> am downgrading from 0.20.205 to 0.20 (cdh3). Is this
>>>>>>>>>>>>> possible?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, May 3, 2012 at 3:25 PM, Prashant Kommireddi <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Seems like a matter of upgrade. I am not a Cloudera user,
>>>>>>>>>>>>>> so I would not know much, but you might find some help
>>>>>>>>>>>>>> moving this to the Cloudera mailing list.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, May 3, 2012 at 2:51 AM, Austin Chungath <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There is only one cluster. I am not copying between
>>>>>>>>>>>>>>> clusters.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Say I have a cluster running apache 0.20.205 with 10 TB
>>>>>>>>>>>>>>> of storage capacity and about 8 TB of data.
>>>>>>>>>>>>>>> Now how can I migrate the same cluster to use cdh3 and
>>>>>>>>>>>>>>> keep that same 8 TB of data?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I can't copy the 8 TB of data using distcp because I
>>>>>>>>>>>>>>> have only 2 TB of free space.
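Some back-of-the-envelope arithmetic on those numbers, assuming the 8 TB is raw disk usage at the default replication factor of 3 (the same assumption Michel makes above):

  8 TB raw / 3 replicas ≈ 2.67 TB of logical data
  at RF=2: 2.67 TB x 2 ≈ 5.3 TB raw, freeing roughly 2.7 TB
  free space after setrep 2: ~2 TB + ~2.7 TB ≈ 4.7 TB out of 10 TB

So dropping to RF=2 roughly doubles the headroom available for staging the copy, at the cost of tolerating only a single disk failure while the migration runs.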
>>>>>>>>>>>>>>> On Thu, May 3, 2012 at 3:12 PM, Nitin Pawar <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You can actually look at distcp:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> http://hadoop.apache.org/common/docs/r0.20.0/distcp.html
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> but this means that you have two different sets of
>>>>>>>>>>>>>>>> clusters available to do the migration.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -- Nitin Pawar
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, May 3, 2012 at 12:51 PM, Austin Chungath <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for the suggestions.
>>>>>>>>>>>>>>>>> My concern is that I can't actually copyToLocal from
>>>>>>>>>>>>>>>>> the dfs, because the data is huge.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Say if my hadoop was 0.20 and I am upgrading to
>>>>>>>>>>>>>>>>> 0.20.205, I can do a namenode upgrade. I don't have to
>>>>>>>>>>>>>>>>> copy data out of dfs.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> But here I have Apache hadoop 0.20.205 and I want to
>>>>>>>>>>>>>>>>> use CDH3 now, which is based on 0.20. Now it is
>>>>>>>>>>>>>>>>> actually a downgrade, as 0.20.205's namenode info has
>>>>>>>>>>>>>>>>> to be used by 0.20's namenode.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Any idea how I can achieve what I am trying to do?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, May 3, 2012 at 12:23 PM, Nitin Pawar <[email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I can think of the following options:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 1) Write a simple get-and-put tool that reads the
>>>>>>>>>>>>>>>>>> data from the old DFS and loads it into the new dfs.
>>>>>>>>>>>>>>>>>> 2) See if distcp between the two versions is
>>>>>>>>>>>>>>>>>> compatible.
>>>>>>>>>>>>>>>>>> 3) This is what I had done (and my data was hardly a
>>>>>>>>>>>>>>>>>> few hundred GB): a dfs -copyToLocal, and then, on the
>>>>>>>>>>>>>>>>>> new grid, a copyFromLocal.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, May 3, 2012 at 11:41 AM, Austin Chungath <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>> I am migrating from Apache hadoop 0.20.205 to CDH3u3.
>>>>>>>>>>>>>>>>>>> I don't want to lose the data that is in the HDFS of
>>>>>>>>>>>>>>>>>>> Apache hadoop 0.20.205.
>>>>>>>>>>>>>>>>>>> How do I migrate to CDH3u3 but keep the data that I
>>>>>>>>>>>>>>>>>>> have on 0.20.205?
>>>>>>>>>>>>>>>>>>> What are the best practices/techniques to do this?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>>>>>>>> Austin
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Nitin Pawar
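Pulling the thread together, the plan that emerged looks roughly like the sketch below. Host names are made up (old-nn runs 0.20.205, new-nn runs CDH3u3), the ports are the usual defaults for each role, and the distcp step has to run from the CDH3 side:

# 1. free headroom on the old cluster
$ hadoop dfs -setrep -R 2 /
# 2. decommission the freed nodes (via dfs.hosts.exclude) and reinstall them as CDH3 datanodes
$ hadoop dfsadmin -refreshNodes
# 3. from the CDH3 cluster, pull data over hftp and write over hdfs
$ hadoop distcp -i -ppgu -log /tmp/distcp.log -m 20 \
    hftp://old-nn:50070/path/to/data \
    hdfs://new-nn:8020/path/to/data
# 4. sanity-check each batch, then delete the copied data from the old cluster
$ hadoop fsck /path/to/data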
