Hi list,
I'm currently evaluating different scenarios to use Hadoop. I have
access to a Linux cluster running LSF as batch system. I have the idea
to write a small wrapper in Python which
+ generates a Hadoop configuration on a per-job basis
+ formats a per-job HDFS
+ brings up the NameNode and the JobTracker
+ copies all necessary files to HDFS
+ launches the actual Map/Reduce instances
+ when the job is finished, copies the produced files from HDFS
+ shuts down the daemons
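The steps above could be sketched roughly like this in Python. This is only a sketch, assuming the Hadoop 0.x/1.x command-line tools (hadoop namenode -format, hadoop-daemon.sh, hadoop fs, hadoop jar) are on PATH and HADOOP_CONF_DIR already points at the generated per-job configuration; all file names and class names below are made up for illustration:

```python
# Sketch of a per-job Hadoop wrapper: format HDFS, start daemons,
# stage data, run the job, fetch results, stop daemons.
# Assumes Hadoop 1.x-era commands (NameNode/JobTracker) on PATH.
import subprocess


def build_job_commands(jar, main_class, local_input, hdfs_input,
                       hdfs_output, local_output):
    """Return the full command sequence for one job, in order."""
    return [
        # format a fresh per-job HDFS
        ["hadoop", "namenode", "-format"],
        # bring up the daemons
        ["hadoop-daemon.sh", "start", "namenode"],
        ["hadoop-daemon.sh", "start", "datanode"],
        ["hadoop-daemon.sh", "start", "jobtracker"],
        ["hadoop-daemon.sh", "start", "tasktracker"],
        # copy all necessary files to HDFS
        ["hadoop", "fs", "-put", local_input, hdfs_input],
        # launch the actual Map/Reduce job
        ["hadoop", "jar", jar, main_class, hdfs_input, hdfs_output],
        # copy the produced files back from HDFS
        ["hadoop", "fs", "-get", hdfs_output, local_output],
        # shut down the daemons again
        ["hadoop-daemon.sh", "stop", "tasktracker"],
        ["hadoop-daemon.sh", "stop", "jobtracker"],
        ["hadoop-daemon.sh", "stop", "datanode"],
        ["hadoop-daemon.sh", "stop", "namenode"],
    ]


def run_job(*args):
    """Execute the whole sequence, aborting on the first failure."""
    for cmd in build_job_commands(*args):
        subprocess.check_call(cmd)
```

In practice the wrapper would also have to pick free ports and scratch directories per job so that several jobs can coexist on the same LSF nodes, but that bookkeeping is omitted here.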
My questions are:
1) Has someone already put some effort in a project similar to this?
2) Do you estimate the overhead of the Hadoop set-up to be too big to
get an actual performance gain?
I assume the answer to (2) depends on the job's running time and the
size of the input data. Thus,
3) What do you think are the characteristics of a job that would see a
performance improvement?
Regards,
Thomas.