Hi Bertrand,
No, I do not observe the same behavior when I run locally using cat | map. I
can see the output on STDOUT when I run my program that way.
I do not have any reducer. In my command, I pass
"-D mapred.reduce.tasks=0", so I expect the output of the mapper to be
written directly to HDFS.
Your suspicion may be right about the output. In my counters, "map
input records" = 40 and "map output records" = 0. I am trying to see if I
am messing up my command (see below).
Initially, my mapper, test2.py, took two arguments. Now I am
streaming one file in, and test2.py takes only one argument. How should I
frame my command below? I think that is where I am messing up.
run.sh: (run as: cat <arg2> | ./run.sh <arg1> )
-----------
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
-D mapred.reduce.tasks=0 \
-verbose \
-input "$HDFS_INPUT" \
-input "$HDFS_INPUT_2" \
-output "$HDFS_OUTPUT" \
-file "$GHU_HOME/test2.py" \
-mapper "python $GHU_HOME/test2.py $1" \
-file "$GHU_HOME/$1"
If I modify my mapper to take two arguments, I would run it as:
run.sh: (run as: ./run.sh <arg1> <arg2>)
-----------
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
-D mapred.reduce.tasks=0 \
-verbose \
-input "$HDFS_INPUT" \
-input "$HDFS_INPUT_2" \
-output "$HDFS_OUTPUT" \
-file "$GHU_HOME/test2.py" \
-mapper "python $GHU_HOME/test2.py $1 $2" \
-file "$GHU_HOME/$1" \
-file "GHU_HOME/$2"
Please let me know if I am making a mistake here.
Thanks.
PD
On Wed, Aug 29, 2012 at 10:45 PM, Bertrand Dechoux <[email protected]> wrote:
> Do you observe the same thing when running without Hadoop? (cat, map, sort
> and then reduce)
>
> Could you provide the counters of your job? You should be able to get them
> using the job tracker interface.
>
> The most probable answer, without more information, would be that your
> reducer does not output any <key,value>s.
>
> Regards
>
> Bertrand
>
>
>
> On Thu, Aug 30, 2012 at 5:52 AM, Periya.Data <[email protected]>
> wrote:
>
> > Hi All,
> > My Hadoop streaming job (in Python) runs to "completion" (both map and
> > reduce says 100% complete). But, when I look at the output directory in
> > HDFS, the part files are empty. I do not know what might be causing this
> > behavior. I understand that the percentages represent the records that
> > have been read in (not processed).
> >
> > The following are some of the logs. The detailed logs from Cloudera
> > Manager say that there were no Map Outputs... which is interesting. Any
> > suggestions?
> >
> >
> > 12/08/30 03:27:14 INFO streaming.StreamJob: To kill this job, run:
> > 12/08/30 03:27:14 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop
> > job -Dmapred.job.tracker=xxxxx.yyy.com:8021 -kill job_201208232245_3182
> > 12/08/30 03:27:14 INFO streaming.StreamJob: Tracking URL:
> > http://xxxxxx.yyyy.com:60030/jobdetails.jsp?jobid=job_201208232245_3182
> > 12/08/30 03:27:15 INFO streaming.StreamJob: map 0% reduce 0%
> > 12/08/30 03:27:20 INFO streaming.StreamJob: map 33% reduce 0%
> > 12/08/30 03:27:23 INFO streaming.StreamJob: map 67% reduce 0%
> > 12/08/30 03:27:29 INFO streaming.StreamJob: map 100% reduce 0%
> > 12/08/30 03:27:33 INFO streaming.StreamJob: map 100% reduce 100%
> > 12/08/30 03:27:35 INFO streaming.StreamJob: Job complete:
> > job_201208232245_3182
> > 12/08/30 03:27:35 INFO streaming.StreamJob: Output: /user/GHU
> > Thu Aug 30 03:27:24 GMT 2012
> > *** END
> > bash-3.2$
> > bash-3.2$ hadoop fs -ls /user/ghu/
> > Found 5 items
> > -rw-r--r-- 3 ghu hadoop 0 2012-08-30 03:27 /user/GHU/_SUCCESS
> > drwxrwxrwx - ghu hadoop 0 2012-08-30 03:27 /user/GHU/_logs
> > -rw-r--r-- 3 ghu hadoop 0 2012-08-30 03:27 /user/GHU/part-00000
> > -rw-r--r-- 3 ghu hadoop 0 2012-08-30 03:27 /user/GHU/part-00001
> > -rw-r--r-- 3 ghu hadoop 0 2012-08-30 03:27 /user/GHU/part-00002
> > bash-3.2$
> >
> >
> --------------------------------------------------------------------------------------------------------------------
> >
> >
> > Metadata: Status Succeeded, Type MapReduce, Id job_201208232245_3182,
> > Name CaidMatch, User srisrini, Mapper class PipeMapper, Reducer class -,
> > Scheduler pool name default
> > Job input directory: hdfs://xxxxx.yyy.txt,hdfs://xxxx.yyyy.com/user/GHUcaidlist.txt
> > Job output directory: hdfs://xxxx.yyyy.com/user/GHU/
> > Timing: Duration 20.977s, Submit time Wed, 29 Aug 2012 08:27 PM,
> > Start time Wed, 29 Aug 2012 08:27 PM, Finish time Wed, 29 Aug 2012 08:27 PM
> >
> >
> > Progress and Scheduling: Map Progress 100.0%, Reduce Progress 100.0%
> > Launched maps 4, Data-local maps 3, Rack-local maps 1, Other local maps -,
> > Desired maps 3, Launched reducers -, Desired reducers 0
> > Fairscheduler: running tasks -, minimum share -, demand -
> > Current Resource Usage: Current User CPUs 0, Current System CPUs 0,
> > Resident memory 0 B, Running maps 0, Running reducers 0
> > Aggregate Resource Usage and Counters: User CPU 0s, System CPU 0s,
> > Map Slot Time 12.135s, Reduce slot time 0s
> > Cumulative disk reads -, Cumulative disk writes 155.0 KiB,
> > Cumulative HDFS reads 3.6 KiB, Cumulative HDFS writes -
> > Map input bytes 2.5 KiB, Map input records 45, Map output records 0
> > Reducer input groups -, Reducer input records -, Reducer output records -,
> > Reducer shuffle bytes -, Spilled records -
> >
>
>
>
> --
> Bertrand Dechoux
>