This is interesting. I changed my command to:

    -mapper "cat $1 | $GHU_HOME/test2.py $2" \

and it is producing output to HDFS. But the output is not what I expected, and it is not the same as when I do "cat | map" on Linux. It is producing part-00000, part-00001 and part-00002. I expected only one output file with just 2 records. I think I have to understand what exactly "-file" does and what exactly "-input" does. I am experimenting with what happens if I give my input files on the command line (like: test2.py arg1 arg2) as against specifying the input files via the "-file" and "-input" options... The problem is I have 2 input files and have no idea how to pass them. Should I keep one in HDFS and stream in the other?

More digging,
PD

On Thu, Aug 30, 2012 at 11:52 AM, Periya.Data <[email protected]> wrote:

> Hi Bertrand,
>     No, I do not observe the same when I run using cat | map. I can see
> the output in STDOUT when I run my program.
>
> I do not have any reducer. In my command, I provide
> "-D mapred.reduce.tasks=0". So, I expect the output of the mapper to be
> written directly to HDFS.
>
> Your suspicion may be right about the output. In my counters, "map
> input records" = 40 and "map output records" = 0. I am trying to see if I
> am messing up in my command (see below).
>
> Initially, I had my mapper, "test2.py", take in 2 arguments. Now, I am
> streaming one file in, and test2.py takes in only one argument. How should I
> frame my command below? I think that is where I am messing up...
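[Editor's note: a minimal sketch of what a one-argument test2.py could look like; the CSV-style input, the lookup-set matching, and the function names are illustrative assumptions, not the actual script. Two points it illustrates: files shipped with -file land in the task's working directory, so a side file is opened by its basename rather than a client-side path like $GHU_HOME/<file>; and only lines printed to stdout count as "map output records".]

```python
#!/usr/bin/env python
# Hypothetical one-argument streaming mapper (sketch of test2.py).
# Assumes: input records are comma-separated with the key in column 0,
# and the side file (argv[1]) holds one lookup key per line.
import os.path
import sys


def load_lookup(path):
    """Read one lookup key per line into a set."""
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())


def map_line(line, lookup):
    """Return a tab-separated <key, value> string, or None to emit nothing."""
    key = line.strip().split(",")[0]
    if key in lookup:
        return "%s\t1" % key
    return None


def main(argv):
    # argv[1] is the side file named on the -mapper command line.
    # -file ships it into the task's working directory on each node,
    # so open it by basename, not by an absolute client-side path.
    lookup = load_lookup(os.path.basename(argv[1]))
    for line in sys.stdin:
        out = map_line(line, lookup)
        if out is not None:
            # Only lines printed to stdout become map output records;
            # a mapper that prints nothing yields a counter of 0.
            print(out)


if __name__ == "__main__":
    main(sys.argv)
```

By the same logic, `-mapper "python $GHU_HOME/test2.py $1"` is fragile: $GHU_HOME exists on the submitting machine, not necessarily on the task nodes, so the mapper command itself is usually written against basenames too (e.g. `-mapper "python test2.py <side-file-basename>"`).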
>
> run.sh: (run as: cat <arg2> | ./run.sh <arg1>)
> -----------
>
> hadoop jar \
>     /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
>     -D mapred.reduce.tasks=0 \
>     -verbose \
>     -input "$HDFS_INPUT" \
>     -input "$HDFS_INPUT_2" \
>     -output "$HDFS_OUTPUT" \
>     -file "$GHU_HOME/test2.py" \
>     -mapper "python $GHU_HOME/test2.py $1" \
>     -file "$GHU_HOME/$1"
>
> If I modify my mapper to take in 2 arguments, then I would run it as:
>
> run.sh: (run as: ./run.sh <arg1> <arg2>)
> -----------
>
> hadoop jar \
>     /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
>     -D mapred.reduce.tasks=0 \
>     -verbose \
>     -input "$HDFS_INPUT" \
>     -input "$HDFS_INPUT_2" \
>     -output "$HDFS_OUTPUT" \
>     -file "$GHU_HOME/test2.py" \
>     -mapper "python $GHU_HOME/test2.py $1 $2" \
>     -file "$GHU_HOME/$1" \
>     -file "GHU_HOME/$2"
>
> Please let me know if I am making a mistake here.
>
> Thanks.
> PD
>
> On Wed, Aug 29, 2012 at 10:45 PM, Bertrand Dechoux <[email protected]> wrote:
>
>> Do you observe the same thing when running without Hadoop? (cat, map, sort
>> and then reduce)
>>
>> Could you provide the counters of your job? You should be able to get them
>> using the job tracker interface.
>>
>> The most probable answer, without more information, would be that your
>> reducer does not output any <key,value>s.
>>
>> Regards
>>
>> Bertrand
>>
>> On Thu, Aug 30, 2012 at 5:52 AM, Periya.Data <[email protected]> wrote:
>>
>> > Hi All,
>> >     My Hadoop streaming job (in Python) runs to "completion" (both map
>> > and reduce say 100% complete). But, when I look at the output directory
>> > in HDFS, the part files are empty. I do not know what might be causing
>> > this behavior. I understand that the percentages represent the records
>> > that have been read in (not processed).
>> >
>> > The following are some of the logs. The detailed logs from Cloudera
>> > Manager say that there were no Map Outputs... which is interesting.
>> > Any suggestions?
>> >
>> > 12/08/30 03:27:14 INFO streaming.StreamJob: To kill this job, run:
>> > 12/08/30 03:27:14 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop
>> > job -Dmapred.job.tracker=xxxxx.yyy.com:8021 -kill job_201208232245_3182
>> > 12/08/30 03:27:14 INFO streaming.StreamJob: Tracking URL:
>> > http://xxxxxx.yyyy.com:60030/jobdetails.jsp?jobid=job_201208232245_3182
>> > 12/08/30 03:27:15 INFO streaming.StreamJob:  map 0%   reduce 0%
>> > 12/08/30 03:27:20 INFO streaming.StreamJob:  map 33%  reduce 0%
>> > 12/08/30 03:27:23 INFO streaming.StreamJob:  map 67%  reduce 0%
>> > 12/08/30 03:27:29 INFO streaming.StreamJob:  map 100% reduce 0%
>> > 12/08/30 03:27:33 INFO streaming.StreamJob:  map 100% reduce 100%
>> > 12/08/30 03:27:35 INFO streaming.StreamJob: Job complete: job_201208232245_3182
>> > 12/08/30 03:27:35 INFO streaming.StreamJob: Output: /user/GHU
>> > Thu Aug 30 03:27:24 GMT 2012
>> > *** END
>> >
>> > bash-3.2$ hadoop fs -ls /user/ghu/
>> > Found 5 items
>> > -rw-r--r--   3 ghu hadoop          0 2012-08-30 03:27 /user/GHU/_SUCCESS
>> > drwxrwxrwx   - ghu hadoop          0 2012-08-30 03:27 /user/GHU/_logs
>> > -rw-r--r--   3 ghu hadoop          0 2012-08-30 03:27 /user/GHU/part-00000
>> > -rw-r--r--   3 ghu hadoop          0 2012-08-30 03:27 /user/GHU/part-00001
>> > -rw-r--r--   3 ghu hadoop          0 2012-08-30 03:27 /user/GHU/part-00002
>> > bash-3.2$
>> >
>> > --------------------------------------------------------------------------------
>> >
>> > Metadata:
>> >   Status: Succeeded    Type: MapReduce    Id: job_201208232245_3182
>> >   Name: CaidMatch    User: srisrini
>> >   Mapper class: PipeMapper    Reducer class:
>> >   Scheduler pool name: default
>> >   Job input directory: hdfs://xxxxx.yyy.txt,hdfs://xxxx.yyyy.com/user/GHUcaidlist.txt
>> >   Job output directory: hdfs://xxxx.yyyy.com/user/GHU/
>> >
>> > Timing:
>> >   Duration: 20.977s    Submit time: Wed, 29 Aug 2012 08:27 PM
>> >   Start time: Wed, 29 Aug 2012 08:27 PM    Finish time: Wed, 29 Aug 2012 08:27 PM
>> >
>> > Progress and Scheduling:
>> >   Map Progress: 100.0%    Reduce Progress: 100.0%
>> >   Launched maps: 4    Data-local maps: 3    Rack-local maps: 1    Other local maps:
>> >   Desired maps: 3    Launched reducers:    Desired reducers: 0
>> >   Fairscheduler running tasks:    Fairscheduler minimum share:    Fairscheduler demand:
>> >
>> > Current Resource Usage:
>> >   Current User CPUs: 0    Current System CPUs: 0    Resident memory: 0 B
>> >   Running maps: 0    Running reducers: 0
>> >
>> > Aggregate Resource Usage and Counters:
>> >   User CPU: 0s    System CPU: 0s    Map Slot Time: 12.135s    Reduce slot time: 0s
>> >   Cumulative disk reads:    Cumulative disk writes: 155.0 KiB
>> >   Cumulative HDFS reads: 3.6 KiB    Cumulative HDFS writes:
>> >   Map input bytes: 2.5 KiB    Map input records: 45    Map output records: 0
>> >   Reducer input groups:    Reducer input records:    Reducer output records:
>> >   Reducer shuffle bytes:    Spilled records:
>>
>> --
>> Bertrand Dechoux
>
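[Editor's note: Bertrand's first question, whether the same thing happens without Hadoop, can be scripted. The small harness below pipes sample lines through a mapper command and reports counts analogous to the "Map input records" / "Map output records" counters above. It assumes a POSIX shell; `mapper_cmd` is whatever command line would be given to `-mapper`, and the sample data is a placeholder.]

```python
# Local stand-in for "cat input | mapper": run the mapper command with the
# given lines on stdin and count input vs. output records, mirroring the
# job counters. mapper_cmd is a placeholder for the real -mapper string.
import subprocess


def run_mapper(mapper_cmd, input_lines):
    """Pipe input_lines through mapper_cmd; return (n_in, n_out, out_lines)."""
    proc = subprocess.run(
        mapper_cmd,
        shell=True,
        input="\n".join(input_lines) + "\n",
        capture_output=True,
        text=True,
    )
    out_lines = [l for l in proc.stdout.splitlines() if l]
    return len(input_lines), len(out_lines), out_lines
```

If `run_mapper("python test2.py lookup.txt", sample_lines)` returns 0 output records locally too, the problem is in the mapper itself rather than in the streaming options.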
