I am trying to run Hadoop streaming with a Perl script as the mapper and no reducer. My requirement is for the mapper to run on one file at a time, since I have to do pattern processing on the entire contents of each file, and the files are small.
The Hadoop streaming manual suggests the following solution:
* Generate a file containing the full HDFS paths of the input files. Each map task would get one file name as input.
* Create a mapper script which, given a filename, will get the file to local disk, gzip the file and put it back in the desired output directory.
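As a rough illustration of the kind of mapper the manual describes, here is a minimal sketch. The function name and the per-file processing (a line count standing in for my actual crash_parser.pl logic, which is not shown here) are hypothetical; it also assumes the paths on stdin are readable locally, whereas on the cluster the script would first fetch each HDFS path with `hadoop fs -get`:

```shell
#!/bin/sh
# Hypothetical per-file mapper sketch (process_files is an illustrative
# name, not part of Hadoop). Each stdin line is treated as one file
# path, and the whole file is processed in one go. On a real cluster,
# `hadoop fs -get "$path" local_copy` would run before processing and
# `hadoop fs -put` would write results back to the output directory.
process_files() {
  while IFS= read -r path; do
    # Stand-in for the real pattern processing: emit one
    # tab-separated summary record (path, line count) per file.
    printf '%s\t%d\n' "$path" "$(wc -l < "$path")"
  done
}
```

The point of this shape is that the mapper receives filenames, not file contents, so each map task can open and process a whole file at once.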
I am running the following command:

hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u3.jar \
    -input /user/devi/file.txt \
    -output /user/devi/s_output \
    -mapper "/usr/bin/perl /home/devi/Perl/crash_parser.pl"
/user/devi/file.txt contains the following two lines.
/user/devi/s_input/a.txt
/user/devi/s_input/b.txt
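To see what a single streaming map task receives on stdin when both of these lines land in the same input split, the setup can be simulated locally (a sketch only; the stand-in loop just echoes each line instead of fetching and processing the file):

```shell
# Simulate one map task's stdin: both filename lines arrive as
# separate records to the SAME process when they share one split.
printf '/user/devi/s_input/a.txt\n/user/devi/s_input/b.txt\n' |
while IFS= read -r line; do
  echo "mapper received: $line"
done
```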
When this runs, instead of spawning two mappers (one for a.txt and one for b.txt) as the document suggests, only one mapper is spawned, and the Perl script receives both /user/devi/s_input/a.txt and /user/devi/s_input/b.txt as input.
How can I make the mapper Perl script run on only one file at a time?
Appreciate your help, Thanks, Devi