After specifying the NLineInputFormat option, the streaming job fails with:
Error from attempt_201205171448_0092_m_000000_0: java.lang.RuntimeException:
PipeMapRed.waitOutputThreads(): subprocess failed with code 2
It spawns two mappers, but I am not sure whether each mapper runs with the file
names specified in the input option. I was expecting one mapper to run with
/user/devi/s_input/a.txt and one mapper to run with /user/devi/s_input/b.txt. I
dug into the task files, but could not find anything.
Here is the simple mapper Perl script. All it does is read the file and print
it. (It needs to do much more, but I could not get the basic job itself to
run.)
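use strict;
use warnings;

my $i = 0;
# Read the file name that streaming passes on standard input.
my $userinput = <STDIN>;
chomp $userinput;
open(my $infile, '<', $userinput) || die "could not open the file $userinput: $!\n";
# Print every line of the file, prefixed with a running counter.
while (my $line = <$infile>) {
    print "$i$line";
    $i++;
}
close($infile);
exit;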
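One possible explanation for exit code 2, offered as an assumption rather than a confirmed diagnosis: the script leaves through die because open() looks for the path on the task node's local filesystem, while the file lives in HDFS. In addition, with an input format other than the default TextInputFormat, streaming may hand the mapper each record as key<TAB>value, so the path can arrive prefixed with a byte offset. A minimal sketch along those lines follows. path_from_record is a hypothetical helper, and the sketch assumes the hadoop command is on the PATH of the task nodes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: extract the HDFS path from one streaming record.
# Assumes the record is either "path" or "key<TAB>path".
sub path_from_record {
    my ($record) = @_;
    chomp $record;
    $record =~ s/^[^\t]*\t//;     # drop a leading key and tab, if present
    $record =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    return $record;
}

# Handle every record on STDIN, so the script works whether the task
# receives one path or several.
while (my $record = <STDIN>) {
    my $path = path_from_record($record);
    next if $path eq '';

    # List-form pipe open avoids the shell; "hadoop fs -cat" streams the
    # HDFS file to this process (assumes hadoop is on the PATH).
    open(my $in, '-|', 'hadoop', 'fs', '-cat', $path)
        or die "could not run hadoop fs -cat $path: $!\n";
    my $i = 0;
    while (my $line = <$in>) {
        print "$i$line";
        $i++;
    }
    close($in) or die "hadoop fs -cat $path failed (status $?)\n";
}
```

The parsing step can be checked locally without a cluster by piping a line such as "0<TAB>/user/devi/s_input/a.txt" into the script and confirming which path it tries to read.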
My command is:
hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u3.jar \
  -input /user/devi/file.txt \
  -output /user/devi/s_output \
  -mapper "/usr/bin/perl /home/devi/Perl/crash_parser.pl" \
  -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat
Really appreciate your help.
Devi
________________________________
From: Robert Evans <[email protected]>
To: "[email protected]" <[email protected]>;
"[email protected]" <[email protected]>
Sent: Thu, August 2, 2012 1:16:54 PM
Subject: Re: Issue with Hadoop Streaming
http://www.mail-archive.com/[email protected]/msg07382.html
From: Devi Kumarappan <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, August 2, 2012 3:03 PM
To: "[email protected]" <[email protected]>,
"[email protected]" <[email protected]>
Subject: Re: Issue with Hadoop Streaming
My mapper is a Perl script, not Java. So how do I specify the
NLineInputFormat?
________________________________
From: Robert Evans <[email protected]>
To: "[email protected]" <[email protected]>;
"[email protected]" <[email protected]>
Sent: Thu, August 2, 2012 12:59:50 PM
Subject: Re: Issue with Hadoop Streaming
It depends on the input format you use. You probably want to look at using
NLineInputFormat.
From: Devi Kumarappan <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, August 1, 2012 8:09 PM
To: "[email protected]" <[email protected]>,
"[email protected]" <[email protected]>
Subject: Issue with Hadoop Streaming
I am trying to run Hadoop streaming using a Perl script as the mapper and with
no reducer. My requirement is for the mapper to run on one file at a time, since
I have to do pattern processing on the entire contents of one file at a time and
the file size is small.
The Hadoop streaming manual suggests the following solution:
* Generate a file containing the full HDFS path of the input files. Each map
task would get one file name as input.
* Create a mapper script which, given a filename, will get the file to local
disk, gzip the file and put it back in the desired output directory.
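The two steps from the manual could be sketched in Perl roughly as follows. This is only an illustration under stated assumptions: commands_for is a hypothetical helper, /user/devi/s_gzipped is a made-up output directory, each input record is assumed to be just the HDFS path (the default TextInputFormat case), and the hadoop and gzip commands are assumed to be on the PATH of the task nodes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Temp qw(tempdir);

# Hypothetical HDFS output directory; adjust to your own layout.
my $out_dir = '/user/devi/s_gzipped';

# Hypothetical helper: build the three commands for one HDFS path:
# fetch to local disk, gzip, and put the result back into HDFS.
sub commands_for {
    my ($hdfs_path, $work_dir, $dest_dir) = @_;
    my $name  = basename($hdfs_path);
    my $local = "$work_dir/$name";
    return (
        [ 'hadoop', 'fs', '-get', $hdfs_path, $local ],
        [ 'gzip', $local ],
        [ 'hadoop', 'fs', '-put', "$local.gz", "$dest_dir/$name.gz" ],
    );
}

# Scratch directory on the task node, removed when the script exits.
my $work_dir = tempdir(CLEANUP => 1);

while (my $path = <STDIN>) {
    chomp $path;
    next if $path eq '';
    for my $cmd (commands_for($path, $work_dir, $out_dir)) {
        # List-form system() bypasses the shell, so odd characters in
        # the path are passed through unchanged.
        system(@$cmd) == 0 or die "command failed: @$cmd (status $?)\n";
    }
    print "$path\tdone\n";    # one output record per processed file
}
```

Because the loop handles every line it receives, the script behaves the same whether a map task is given one file name or several.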
I am running the following command:
hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u3.jar \
  -input /user/devi/file.txt \
  -output /user/devi/s_output \
  -mapper "/usr/bin/perl /home/devi/Perl/crash_parser.pl"
/user/devi/file.txt contains the following two lines.
/user/devi/s_input/a.txt
/user/devi/s_input/b.txt
When this runs, instead of spawning two mappers for a.txt and b.txt as per the
document, only one mapper is spawned and the Perl script gets both
/user/devi/s_input/a.txt and /user/devi/s_input/b.txt as its input.
How can I make the mapper Perl script run on only one file at a time?
Appreciate your help, Thanks, Devi