Anil,
Thanks for your suggestion. The NLineInputFormat code actually helped.
In case anybody has the same problem, here's a custom OneLineInputFormat
(it splits the file so that each split contains exactly one line) that you
can use:
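import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.util.LineReader;

public class OneLineInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException {
        // Taken from:
        // https://issues.apache.org/jira/secure/attachment/12413533/patch-375.txt
        context.setStatus(split.toString());
        return new LineRecordReader();
    }

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (FileStatus status : listStatus(job)) {
            Path fileName = status.getPath();
            if (status.isDir()) {
                throw new IOException("Not a file: " + fileName);
            }
            FileSystem fs = fileName.getFileSystem(job.getConfiguration());
            LineReader lr = null;
            try {
                FSDataInputStream in = fs.open(fileName);
                lr = new LineReader(in, job.getConfiguration());
                Text line = new Text();
                long begin = 0;   // byte offset where the current line starts
                long length = 0;  // bytes consumed by the current line
                int num = -1;
                // readLine() returns the number of bytes consumed,
                // including the line terminator; 0 means end of file.
                while ((num = lr.readLine(line)) > 0) {
                    length += num;
                    // LineRecordReader always reads one line past the end of
                    // its split, and skips the first (partial) line when the
                    // split does not start at offset 0. Moving each split
                    // boundary back by one byte makes every split yield
                    // exactly one line.
                    if (begin == 0) {
                        splits.add(new FileSplit(fileName, begin, length - 1,
                                new String[] {}));
                    } else {
                        splits.add(new FileSplit(fileName, begin - 1, length,
                                new String[] {}));
                    }
                    begin += length;
                    length = 0;
                }
            } finally {
                if (lr != null) {
                    lr.close();  // also closes the underlying stream
                }
            }
        }
        return splits;
    }
}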
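
For anyone puzzled by the begin - 1 / length - 1 adjustments in getSplits():
below is a rough plain-Java sketch (no Hadoop required) of the offsets that
loop produces. The class name SplitOffsets and the helper oneLineSplits are
made up for illustration, and the sketch assumes single-byte characters and
"\n" line endings. It only mirrors the arithmetic, it is not the real
InputFormat.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitOffsets {

    // Hypothetical helper: returns one {start, length} pair per line,
    // mirroring the arithmetic in OneLineInputFormat.getSplits().
    public static List<long[]> oneLineSplits(String content) {
        List<long[]> splits = new ArrayList<>();
        long begin = 0; // offset where the current line starts
        int pos = 0;
        while (pos < content.length()) {
            int nl = content.indexOf('\n', pos);
            // Bytes consumed by this line, including the newline if present
            // (this plays the role of LineReader.readLine()'s return value).
            int num = (nl >= 0 ? nl + 1 : content.length()) - pos;
            if (begin == 0) {
                // First line: start at 0, end one byte early.
                splits.add(new long[] {begin, num - 1});
            } else {
                // Later lines: start one byte early (on the previous
                // line's terminator), keep the full length.
                splits.add(new long[] {begin - 1, num});
            }
            begin += num;
            pos += num;
        }
        return splits;
    }

    public static void main(String[] args) {
        // Three lines of 2, 3 and 4 bytes (terminators included).
        for (long[] s : oneLineSplits("a\nbb\nccc\n")) {
            System.out.println(s[0] + " " + s[1]);
        }
    }
}
```

For the input "a\nbb\nccc\n" this prints "0 1", "1 3" and "4 4". Every split
after the first starts on the previous line's newline, so the record reader
skips that one-byte remainder and then reads exactly one full line.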
On Thu, Mar 15, 2012 at 9:38 PM, anil gupta <[email protected]> wrote:
> Have a look at the NLineInputFormat class in Hadoop. It is built to split
> the input based on the number of lines.
>
> On Thu, Mar 15, 2012 at 6:13 PM, Deepak Nettem <[email protected]> wrote:
>
> > Hi,
> >
> > I have this use case - I need to spawn as many mappers as the number of
> > lines in a file in HDFS. This file isn't big (only 10-50 lines). Actually
> > each line represents the path of another data source that the Mappers will
> > work on. So each mapper will read 1 line, (the map() method will need to be
> > called only once), and work on the data source.
> >
> > What's the best way to construct InputSplit, InputFormat and RecordReader
> > to achieve this? I would appreciate any example code :)
> >
> > Best,
> > Deepak
> >
>
>
>
> --
> Thanks & Regards,
> Anil Gupta
>