I'm not sure whether there's been any work done on this type of use with LSF, but I recall some work on a similar strategy with SGE/OGE.

From my experience with a shared Linux cluster managed by OGE, this approach doesn't usually work well. The main problem is that if your data is big and you only run one MR job per invocation, you'll probably spend more time copying the data into and out of HDFS than processing it. How big is too big depends on your problem and on the advantages you gain by running a Hadoop application instead of regular LSF jobs.

What has worked better for us is allocating machines by running a "fake" job through the queue system to occupy them, and then simply ssh'ing into those machines and configuring a temporary Hadoop installation on them. With this method we can keep a temporary cluster up for as long as we need it, or until we reach the queue's time limit.
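To give a rough idea, the allocate-then-ssh trick boils down to two kinds of commands. This is only a sketch: the bsub options, the hadoop-daemon.sh path and the host names are assumptions you'd adapt to your site.

```python
# Sketch of the allocate-then-ssh approach. The bsub options and the
# hadoop-daemon.sh location are assumptions -- adjust for your cluster.

def placeholder_job_cmd(n_slots, hours):
    # A sleeper job whose only purpose is to hold n_slots for `hours`.
    return ["bsub", "-n", str(n_slots), "-W", "%d:00" % hours,
            "sleep", str(hours * 3600)]

def start_daemon_cmd(host, hadoop_home, daemon):
    # ssh into a reserved host and start a Hadoop daemon there.
    return ["ssh", host,
            "%s/bin/hadoop-daemon.sh" % hadoop_home, "start", daemon]

if __name__ == "__main__":
    print(placeholder_job_cmd(8, 12))
    print(start_daemon_cmd("node001", "/tmp/hadoop-tmp", "tasktracker"))
```

Once the sleeper job is dispatched, you can query the queue system for the hosts it landed on and run the second command against each of them.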

Another approach we're starting to adopt is to use Hadoop MapReduce without HDFS. We keep a JobTracker always running on a master node and have a daemon that monitors the number of queued tasks and starts TaskTracker nodes on demand through OGE. It has been working relatively well since we have a parallel file system that is quite fast and we don't have a very large number of nodes. Even without the automatic "elastic" feature, this technique may be applicable to your use case.
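The heart of that daemon is just a scaling decision. Here's a toy version of the logic; the actual polling of Hadoop's queued-task count and the qsub submission are left out, and the slot/limit numbers are made-up defaults, not what we actually run with.

```python
# Toy version of the "elastic" decision our monitoring daemon makes.
# Queue polling and job submission are omitted; defaults are illustrative.

def trackers_to_start(queued_tasks, running_trackers,
                      slots_per_tracker=2, max_trackers=50):
    """How many extra TaskTrackers to submit through the queue system."""
    wanted = -(-queued_tasks // slots_per_tracker)  # ceiling division
    wanted = min(wanted, max_trackers)              # respect a hard cap
    return max(0, wanted - running_trackers)       # never scale below 0
```

A daemon would call this every polling interval and submit that many TaskTracker-launching jobs to OGE.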

HTH,

Luca


On 08/03/2012 12:43 PM, Thomas Bach wrote:
Hi list,

I'm currently evaluating different scenarios to use Hadoop. I have
access to a Linux cluster running LSF as batch system. I have the idea
to write a small wrapper in Python which

+ generates a Hadoop configuration on a per Job basis
+ formats a per job HDFS
+ brings up the NameNode and the JobTracker
+ copies all necessary files to HDFS
+ launches the actual Map/Reduce instances
+ when the job is finished, copies the produced files from HDFS
+ shuts down the daemons
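In rough Python, the steps above might reduce to running a fixed list of commands in order. This assumes the 0.20-era Hadoop CLI; the /input and /output paths are placeholders.

```python
# Command lines for one self-contained Hadoop run, mirroring the steps
# above. Paths and HDFS locations are placeholders.

def pipeline_steps(local_in, local_out, jar, job_args):
    return [
        ["hadoop", "namenode", "-format"],                  # per-job HDFS
        ["hadoop-daemon.sh", "start", "namenode"],
        ["hadoop-daemon.sh", "start", "jobtracker"],
        ["hadoop", "fs", "-put", local_in, "/input"],       # stage data in
        ["hadoop", "jar", jar] + job_args + ["/input", "/output"],
        ["hadoop", "fs", "-get", "/output", local_out],     # stage data out
        ["hadoop-daemon.sh", "stop", "jobtracker"],
        ["hadoop-daemon.sh", "stop", "namenode"],
    ]
```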

My questions are:
1) Has someone already put some effort in a project similar to this?
2) Do you estimate the overhead of the Hadoop set-up to be too big to get
an actual performance gain?

I assume (2) to depend on job running time and how big the input data
is. Thus,
3) What do you think are the characteristics of a job to gain
performance improvements?

Regards,
        Thomas.



--
Luca Pireddu
CRS4 - Distributed Computing Group
Loc. Pixina Manna Edificio 1
09010 Pula (CA), Italy
Tel: +39 0709250452
