streaming command [Re: no output written to HDFS]

Periya.Data Fri, 31 Aug 2012 10:07:56 -0700

Yes, both input files need to be processed by the mapper, but not in the
same fashion. Essentially, this is what my Python script does:
- read two text files - A and B. File A has a list of account-IDs (all
numeric). File B has about 10 records - some of which have the same
account_ID as those listed in file A.
- mapper: read both the files, compare them, and print out those records
that have matching account_IDs.
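
To make the matching step above concrete, here is a minimal sketch of that
kind of mapper. It is not the actual test2.py: the comma separator, the
account ID being the first field of each record, and the names load_ids and
matching_records are all assumptions made for illustration.

```python
#!/usr/bin/env python
"""Hypothetical sketch of the matching step; not the actual test2.py."""
import sys


def load_ids(path):
    """Read one account ID per line from file A, skipping blank lines."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def matching_records(records, ids, sep=","):
    """Yield records whose first field is a known account ID.

    Assumption: records are delimited by `sep` and the account ID is
    the first field. Adjust both to the real layout of file B.
    """
    for record in records:
        record = record.rstrip("\n")
        if record and record.split(sep, 1)[0].strip() in ids:
            yield record


if __name__ == "__main__" and len(sys.argv) > 1:
    # argv[1] is the ID list (file A); the records (file B) arrive on stdin.
    for match in matching_records(sys.stdin, load_ids(sys.argv[1])):
        print(match)
```

Locally this would be run as "cat B.txt | python sketch.py A.txt", where the
file names are placeholders. Under streaming, the records would arrive on
stdin from -input, and the ID list would have to be readable on every task
node (for example, shipped with -file).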


I have tried placing both the input files under a single input directory.
Same behavior.

And, from what I have read so far, "-mapper" or "-reducer" should have
"ONLY" the name of the executable (like...in my case, "test2.py".). But, if
I do that, nothing happens. I have to explicitly mention:
-mapper "cat $1 | python $GHU_HOME/test2.py $2"...something like
that...which looks unconventional...but, it produces "some" output...not
the correct one though.

Again, if I run my script on a plain Linux machine, using the basic
command:
cat $1 | python test2.py $2,
it produces the expected output.


*Observation*: If I do not specify the two files under the "-file" option,
then I see no output written to HDFS, even though the output directory has
empty part-files and a _SUCCESS file. The 3 part-files are reasonable, as
3 mappers are configured for each job.


My current command:

hadoop jar ...streaming.jar \
         -input /user/ghu/input/* \
         -output /user/ghu/out \
         -file /home/ghu/test2.py \
         -mapper "cat $1 | python test2.py $2" \
         -file /home/ghu/$1 \
         -file /home/ghu/$2


Learning,
/PD

On Thu, Aug 30, 2012 at 9:46 PM, Hemanth Yamijala <[email protected]> wrote:

> Hi,
>
> Do both input files contain data that needs to be processed by the
> mapper in the same fashion ? In which case, you could just put the
> input files under a directory in HDFS and provide that as input. The
> -input option does accept a directory as argument.
>
> Otherwise, can you please explain a little more what you're trying to
> do with the two inputs.
>
> Thanks
> Hemanth
>
> On Fri, Aug 31, 2012 at 3:00 AM, Periya.Data <[email protected]>
> wrote:
> > This is interesting. I changed my command to:
> >
> > -mapper "cat $1 |  $GHU_HOME/test2.py $2" \
> >
> > It is producing output to HDFS. But, the output is not what I expected and
> > is not the same as when I do "cat | map " on Linux. It is producing
> > part-00000, part-00001 and part-00002. I expected only one output file with
> > just 2 records.
> >
> > I think I have to understand what exactly "-file" does and what exactly
> > "-input" does. I am experimenting with what happens if I give my input files
> > on the command line (like: test2.py arg1 arg2) as against specifying the
> > input files via "-file" and "-input" options...
> >
> > The problem is I have 2 input files...and have no idea how to pass them.
> > Should I keep one in HDFS and stream in the other?
> >
> > More digging,
> > PD/
> >
> >
> >
> > On Thu, Aug 30, 2012 at 11:52 AM, Periya.Data <[email protected]>
> wrote:
> >
> >> Hi Bertrand,
> >>     No, I do not observe the same when I run using cat | map. I can see
> >> the output in STDOUT when I run my program.
> >>
> >> I do not have any reducer. In my command, I provide
> >> "-D mapred.reduce.tasks=0". So, I expect the output of the mapper to be
> >> written directly to HDFS.
> >>
> >> Your suspicion may be right..about the output. In my counters, the "map
> >> input records" = 40 and "map output records" = 0. I am trying to see if I
> >> am messing up in my command...(see below)
> >>
> >> Initially, I had my mapper - "test2.py" - take in 2 arguments. Now, I am
> >> streaming one file in and test2.py takes in only one argument. How should I
> >> frame my command below? I think that is where I am messing up..
> >>
> >>
> >> run.sh:        (run as:   cat <arg2> | ./run.sh <arg1> )
> >> -----------
> >>
> >> hadoop jar \
> >> /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
> >>         -D mapred.reduce.tasks=0 \
> >>         -verbose \
> >>         -input "$HDFS_INPUT" \
> >>         -input "$HDFS_INPUT_2" \
> >>         -output "$HDFS_OUTPUT" \
> >>         -file   "$GHU_HOME/test2.py" \
> >>         -mapper "python $GHU_HOME/test2.py $1" \
> >>         -file   "$GHU_HOME/$1"
> >>
> >>
> >>
> >> If I modify my mapper to take in 2 arguments, then, I would run it as:
> >>
> >> run.sh:        (run as:   ./run.sh <arg1>  <arg2>)
> >> -----------
> >>
> >> hadoop jar \
> >> /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
> >>         -D mapred.reduce.tasks=0 \
> >>         -verbose \
> >>         -input "$HDFS_INPUT" \
> >>         -input "$HDFS_INPUT_2" \
> >>         -output "$HDFS_OUTPUT" \
> >>         -file   "$GHU_HOME/test2.py" \
> >>         -mapper "python $GHU_HOME/test2.py $1 $2" \
> >>         -file   "$GHU_HOME/$1" \
> >>         -file   "$GHU_HOME/$2"
> >>
> >>
> >> Please let me know if I am making a mistake here.
> >>
> >>
> >> Thanks.
> >> PD
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Wed, Aug 29, 2012 at 10:45 PM, Bertrand Dechoux <[email protected]> wrote:
> >>
> >>> Do you observe the same thing when running without Hadoop? (cat, map, sort
> >>> and then reduce)
> >>>
> >>> Could you provide the counters of your job? You should be able to get them
> >>> using the job tracker interface.
> >>>
> >>> The most probable answer without more information would be that your
> >>> reducer does not output any <key,value>s.
> >>>
> >>> Regards
> >>>
> >>> Bertrand
> >>>
> >>>
> >>>
> >>> On Thu, Aug 30, 2012 at 5:52 AM, Periya.Data <[email protected]>
> >>> wrote:
> >>>
> >>> > Hi All,
> >>> >    My Hadoop streaming job (in Python) runs to "completion" (both map and
> >>> > reduce say 100% complete). But, when I look at the output directory in
> >>> > HDFS, the part files are empty. I do not know what might be causing this
> >>> > behavior. I understand that the percentages represent the records that
> >>> > have been read in (not processed).
> >>> >
> >>> > The following are some of the logs. The detailed logs from Cloudera
> >>> > Manager say that there were no Map Outputs...which is interesting. Any
> >>> > suggestions?
> >>> >
> >>> >
> >>> > 12/08/30 03:27:14 INFO streaming.StreamJob: To kill this job, run:
> >>> > 12/08/30 03:27:14 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job  -Dmapred.job.tracker=xxxxx.yyy.com:8021 -kill job_201208232245_3182
> >>> > 12/08/30 03:27:14 INFO streaming.StreamJob: Tracking URL: http://xxxxxx.yyyy.com:60030/jobdetails.jsp?jobid=job_201208232245_3182
> >>> > 12/08/30 03:27:15 INFO streaming.StreamJob:  map 0%  reduce 0%
> >>> > 12/08/30 03:27:20 INFO streaming.StreamJob:  map 33%  reduce 0%
> >>> > 12/08/30 03:27:23 INFO streaming.StreamJob:  map 67%  reduce 0%
> >>> > 12/08/30 03:27:29 INFO streaming.StreamJob:  map 100%  reduce 0%
> >>> > 12/08/30 03:27:33 INFO streaming.StreamJob:  map 100%  reduce 100%
> >>> > 12/08/30 03:27:35 INFO streaming.StreamJob: Job complete: job_201208232245_3182
> >>> > 12/08/30 03:27:35 INFO streaming.StreamJob: Output: /user/GHU
> >>> > Thu Aug 30 03:27:24 GMT 2012
> >>> > *** END
> >>> > bash-3.2$
> >>> > bash-3.2$ hadoop fs -ls /user/ghu/
> >>> > Found 5 items
> >>> > -rw-r--r--   3 ghu hadoop          0 2012-08-30 03:27 /user/GHU/_SUCCESS
> >>> > drwxrwxrwx   - ghu hadoop          0 2012-08-30 03:27 /user/GHU/_logs
> >>> > -rw-r--r--   3 ghu hadoop          0 2012-08-30 03:27 /user/GHU/part-00000
> >>> > -rw-r--r--   3 ghu hadoop          0 2012-08-30 03:27 /user/GHU/part-00001
> >>> > -rw-r--r--   3 ghu hadoop          0 2012-08-30 03:27 /user/GHU/part-00002
> >>> > bash-3.2$
> >>> >
> >>> >
> >>> > --------------------------------------------------------------------
> >>> >
> >>> >
> >>> > Metadata
> >>> >   Status: Succeeded
> >>> >   Type: MapReduce
> >>> >   Id: job_201208232245_3182
> >>> >   Name: CaidMatch
> >>> >   User: srisrini
> >>> >   Mapper class: PipeMapper
> >>> >   Reducer class:
> >>> >   Scheduler pool name: default
> >>> >   Job input directory: hdfs://xxxxx.yyy.txt,hdfs://xxxx.yyyy.com/user/GHUcaidlist.txt
> >>> >   Job output directory: hdfs://xxxx.yyyy.com/user/GHU/
> >>> >
> >>> > Timing
> >>> >   Duration: 20.977s
> >>> >   Submit time: Wed, 29 Aug 2012 08:27 PM
> >>> >   Start time: Wed, 29 Aug 2012 08:27 PM
> >>> >   Finish time: Wed, 29 Aug 2012 08:27 PM
> >>> >
> >>> > Progress and Scheduling
> >>> >   Map Progress: 100.0%
> >>> >   Reduce Progress: 100.0%
> >>> >   Launched maps: 4
> >>> >   Data-local maps: 3
> >>> >   Rack-local maps: 1
> >>> >   Other local maps:
> >>> >   Desired maps: 3
> >>> >   Launched reducers:
> >>> >   Desired reducers: 0
> >>> >   Fairscheduler running tasks:
> >>> >   Fairscheduler minimum share:
> >>> >   Fairscheduler demand:
> >>> >
> >>> > Current Resource Usage
> >>> >   Current User CPUs: 0
> >>> >   Current System CPUs: 0
> >>> >   Resident memory: 0 B
> >>> >   Running maps: 0
> >>> >   Running reducers: 0
> >>> >
> >>> > Aggregate Resource Usage and Counters
> >>> >   User CPU: 0s
> >>> >   System CPU: 0s
> >>> >   Map Slot Time: 12.135s
> >>> >   Reduce slot time: 0s
> >>> >   Cumulative disk reads:
> >>> >   Cumulative disk writes: 155.0 KiB
> >>> >   Cumulative HDFS reads: 3.6 KiB
> >>> >   Cumulative HDFS writes:
> >>> >   Map input bytes: 2.5 KiB
> >>> >   Map input records: 45
> >>> >   Map output records: 0
> >>> >   Reducer input groups:
> >>> >   Reducer input records:
> >>> >   Reducer output records:
> >>> >   Reducer shuffle bytes:
> >>> >   Spilled records:
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> Bertrand Dechoux
> >>>
> >>
> >>
>
