I'm not sure whether there's been any work done on this type of use with LSF, but I recall some work on a similar strategy with SGE/OGE.

From my experience with a shared Linux cluster managed by OGE, this approach doesn't usually work well. The main problem is that if your data is big and you only run one MR job per invocation, you'll probably spend more time copying the data into and out of HDFS than processing it. How big is too big depends on your problem and on the advantages you gain by running a Hadoop application instead of regular LSF jobs.

What has worked better for us is allocating machines by running a "fake" job through the queue system to occupy them, and then simply ssh'ing into those machines and configuring a temporary Hadoop installation on them. With this method we can keep a temporary cluster up for as long as we need it, or until we reach the queue's time limit.
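To give a rough idea, the allocate-then-ssh trick boils down to two kinds of commands. This is only a sketch: the bsub options, the hadoop-daemon.sh path and the host names are assumptions you'd adapt to your site.

```python
# Sketch of the allocate-then-ssh approach. The bsub options and the
# hadoop-daemon.sh location are assumptions -- adjust for your cluster.

def placeholder_job_cmd(n_slots, hours):
    # A sleeper job whose only purpose is to hold n_slots for `hours`.
    return ["bsub", "-n", str(n_slots), "-W", "%d:00" % hours,
            "sleep", str(hours * 3600)]

def start_daemon_cmd(host, hadoop_home, daemon):
    # ssh into a reserved host and start a Hadoop daemon there.
    return ["ssh", host,
            "%s/bin/hadoop-daemon.sh" % hadoop_home, "start", daemon]

if __name__ == "__main__":
    print(placeholder_job_cmd(8, 12))
    print(start_daemon_cmd("node001", "/tmp/hadoop-tmp", "tasktracker"))
```

Once the sleeper job is dispatched, you can query the queue system for the hosts it landed on and run the second command against each of them.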

Another approach we're starting to adopt is to use Hadoop MapReduce without HDFS. We keep a JobTracker always running on a master node and have a daemon that monitors the number of queued tasks and starts TaskTracker nodes on demand through OGE. It has been working relatively well since we have a parallel file system that is quite fast and we don't have a very large number of nodes. Even without the automatic "elastic" feature, this technique may be applicable to your use case.
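The heart of that daemon is just a scaling decision. Here's a toy version of the logic; the actual polling of Hadoop's queued-task count and the qsub submission are left out, and the slot/limit numbers are made-up defaults, not what we actually run with.

```python
# Toy version of the "elastic" decision our monitoring daemon makes.
# Queue polling and job submission are omitted; defaults are illustrative.

def trackers_to_start(queued_tasks, running_trackers,
                      slots_per_tracker=2, max_trackers=50):
    """How many extra TaskTrackers to submit through the queue system."""
    wanted = -(-queued_tasks // slots_per_tracker)  # ceiling division
    wanted = min(wanted, max_trackers)              # respect a hard cap
    return max(0, wanted - running_trackers)       # never scale below 0
```

A daemon would call this every polling interval and submit that many TaskTracker-launching jobs to OGE.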

HTH,

Luca


On 08/03/2012 12:43 PM, Thomas Bach wrote:
Hi list,

I'm currently evaluating different scenarios to use Hadoop. I have
access to a Linux cluster running LSF as batch system. I have the idea
to write a small wrapper in Python which

+ generates a Hadoop configuration on a per Job basis
+ formats a per job HDFS
+ brings up the NameNode and the JobTracker
+ copies all necessary files to HDFS
+ launches the actual Map/Reduce instances
+ when the job is finished, copies the produced files from HDFS
+ shuts down the daemons
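In rough Python, the steps above might reduce to running a fixed list of commands in order. This assumes the 0.20-era Hadoop CLI; the /input and /output paths are placeholders.

```python
# Command lines for one self-contained Hadoop run, mirroring the steps
# above. Paths and HDFS locations are placeholders.

def pipeline_steps(local_in, local_out, jar, job_args):
    return [
        ["hadoop", "namenode", "-format"],                  # per-job HDFS
        ["hadoop-daemon.sh", "start", "namenode"],
        ["hadoop-daemon.sh", "start", "jobtracker"],
        ["hadoop", "fs", "-put", local_in, "/input"],       # stage data in
        ["hadoop", "jar", jar] + job_args + ["/input", "/output"],
        ["hadoop", "fs", "-get", "/output", local_out],     # stage data out
        ["hadoop-daemon.sh", "stop", "jobtracker"],
        ["hadoop-daemon.sh", "stop", "namenode"],
    ]
```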

My questions are:
1) Has someone already put some effort in a project similar to this?
2) Do you estimate the overhead of the Hadoop set-up to be too big to get
an actual performance gain?

I assume (2) to depend on job running time and how big the input data
is. Thus,
3) What do you think are the characteristics of a job to gain
performance improvements?

Regards,
        Thomas.



--
Luca Pireddu
CRS4 - Distributed Computing Group
Loc. Pixina Manna Edificio 1
09010 Pula (CA), Italy
Tel: +39 0709250452
