We have moved all the documents to HDFS. I understand that our 1st option needs more I/O than what you suggest, but let's say that's not a problem for now.
Could you please point me to option 2)? How could we do that? Any tutorial or example? Thanks.

2012/8/13 Bertrand Dechoux <[email protected]>

> 1) A standard way of doing it would be to have all your files' content
> inside HDFS. You could then process <key,value> pairs where the key would
> be the name of the file and the value its contents. It would improve
> performance: data locality, less network traffic... But you may have
> constraints...
>
> 2) Maven is a simple way of doing it.
>
> Regards
>
> Bertrand
>
> On Mon, Aug 13, 2012 at 7:59 PM, Pierre Antoine DuBoDeNa
> <[email protected]> wrote:
>
> > Hello,
> >
> > We use Hadoop to distribute a task over our machines.
> >
> > This task requires only the mapper class to be defined. We want to do
> > some text processing on thousands of documents, so we create key-value
> > pairs where the key is just an increasing number and the value is the
> > path of the file to be processed.
> >
> > We face a problem including an external jar file/class while running a
> > jar file:
> >
> > $ mkdir Rdg_classes
> > $ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d Rdg_classes Rdg.java
> > $ jar -cvf Rdg.jar -C Rdg_classes/ .
> >
> > We have tried the following options:
> >
> > *1. Set HADOOP_CLASSPATH with the location of the external jar files or
> > external classes.*
> > It doesn't help.
> > Instead, it stops recognizing the Reducer, failing with the error below:
> >
> > java.lang.RuntimeException: java.lang.RuntimeException:
> > java.lang.ClassNotFoundException: hadoop.Rdg$Reduce
> >         at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:899)
> >         at org.apache.hadoop.mapred.JobConf.getCombinerClass(JobConf.java:1028)
> >         at org.apache.hadoop.mapred.Task$CombinerRunner.create(Task.java:1380)
> >         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:981)
> >         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
> >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
> >         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> >         at java.security.AccessController.doPrivileged(Native Method)
> >         at javax.security.auth.Subject.doAs(Subject.java:396)
> >         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> >         at org.apache.hadoop.mapred.Child.main(Child.java:249)
> > Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException:
> > hadoop.Rdg$Reduce
> >         at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:867)
> >         at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:891)
> >         ... 10 more
> > Caused by: java.lang.ClassNotFoundException: hadoop.Rdg$Reduce
> >         at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
> >         at java.security.AccessController.doPrivileged(Native Method)
> >         at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
> >         at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> >         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
> >         at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
> >         at java.lang.Class.forName0(Native Method)
> >         at java.lang.Class.forName(Class.java:247)
> >         at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:820)
> >         at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:865)
> >         ... 11 more
> >
> > *2. Use the -libjars option as below:*
> >
> > hadoop jar Rdg.jar my.hadoop.Rdg -libjars Rdg_lib/* tester rdg_output
> >
> > where Rdg_lib is a folder containing all the required classes/jars,
> > stored on HDFS.
> > But it starts reading -libjars as an input path and gives this error:
> >
> > 12/08/10 08:16:24 ERROR security.UserGroupInformation:
> > PriviledgedActionException as:hduser
> > cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not
> > exist: hdfs://nameofserver:54310/user/hduser/-libjars
> > Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
> > Input path does not exist: hdfs://nameofserver:54310/user/hduser/-libjars
> >
> > Is there any other way to do it, or are we doing something wrong?
> >
> > Best,
>
>
> --
> Bertrand Dechoux
>
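For reference, the -libjars failure quoted above is the usual symptom of generic options not being parsed at all: -libjars is consumed by Hadoop's GenericOptionsParser, which only runs when the main class goes through ToolRunner, and it must appear before the application's own arguments. A minimal driver sketch of that pattern, using the old mapred API seen in the stack trace, might look like the following (the class name Rdg and the mapper/combiner wiring are assumptions based on the thread, and are left out):

```java
// Sketch, not the original Rdg driver: implementing Tool makes
// ToolRunner run GenericOptionsParser first, so -libjars (and -D,
// -files, ...) are consumed before run() sees the remaining args.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Rdg extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already carries whatever -libjars added
        JobConf conf = new JobConf(getConf(), Rdg.class);
        conf.setJobName("rdg");
        // mapper/combiner/reducer setup would go here (omitted)

        // args[0] and args[1] really are the input/output paths now,
        // because ToolRunner stripped the generic options first
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new Rdg(), args));
    }
}
```

With such a driver, the generic options must still come before the job's own arguments, and -libjars expects a comma-separated list of local jar paths rather than a shell glob, e.g. `hadoop jar Rdg.jar my.hadoop.Rdg -libjars Rdg_lib/dep1.jar,Rdg_lib/dep2.jar tester rdg_output` (dep1.jar and dep2.jar are placeholder names).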
