I'm not sure whether there's been any work done on this type of use
with LSF, but I recall some work on a similar strategy with SGE/OGE.
From my experience, having access to a shared Linux cluster managed by
OGE, this approach doesn't usually work. The main problem is that if
your data is big and you only run one MR job per invocation, then
you'll probably spend more time copying the data into and out of HDFS
than processing it. How big is too big depends on your problem and the
advantages you're getting by running a Hadoop application instead
of regular LSF jobs.
What has worked better for us is allocating machines by running a "fake"
job through the queue system to occupy them, and then simply ssh'ing
into the machines and configuring a temporary Hadoop installation on
them. With this method we can keep a temporary cluster up for as long
as we need it, or until we reach the queue's time limit.
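To make the trick concrete, here is a minimal Python sketch of how such a wrapper can build the two commands involved: a "hold" job that parks a sleep on the allocated slots, and an ssh command that starts a Hadoop daemon on one of the held hosts. The queue name, parallel-environment name, and paths are illustrative assumptions, not our actual scripts:

```python
import shlex

def hold_nodes_cmd(n_slots, hours, job_name="hadoop-hold"):
    """Build a qsub command that occupies n_slots with a sleep job,
    keeping the machines allocated so we can ssh in afterwards.
    The 'smp' parallel environment is an illustrative assumption."""
    walltime = hours * 3600
    return (
        f"qsub -N {job_name} -pe smp {n_slots} "
        f"-l h_rt={walltime} -b y sleep {walltime}"
    )

def start_tasktracker_cmd(host, hadoop_home="/tmp/hadoop"):
    """Build the ssh command that brings up a TaskTracker on a held
    host, assuming a temporary Hadoop install under hadoop_home."""
    remote = f"{hadoop_home}/bin/hadoop-daemon.sh start tasktracker"
    return f"ssh {host} {shlex.quote(remote)}"
```

The same pattern works with LSF by swapping the qsub line for the equivalent bsub invocation.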
Another approach we're starting to adopt is to use Hadoop MapReduce
without HDFS. We keep a JobTracker always running on a master node and
have a daemon that monitors the number of queued tasks and starts
TaskTracker nodes on demand through OGE. It has been working relatively
well since we have a parallel file system that is quite fast and we
don't have a very large number of nodes. Even without the automatic
"elastic" feature, this technique may be applicable to your use case.
On 08/03/2012 12:43 PM, Thomas Bach wrote:
> Hi list,
> I'm currently evaluating different scenarios to use Hadoop. I have
> access to a Linux cluster running LSF as batch system. I have the idea
> to write a small wrapper in Python which
> + generates a Hadoop configuration on a per Job basis
> + formats a per job HDFS
> + brings up the NameNode and the JobTracker
> + copies all necessary files to HDFS
> + launches the actual Map/Reduce instances
> + when the job is finished, copies the produced files from HDFS
> + shuts down the daemons
> My questions are:
> 1) Has someone already put some effort in a project similar to this?
> 2) Do you estimate the over-head of Hadoop set-up to be too big to get
> an actual performance gain?
> I assume (2) to depend on job running time and how big the input data
> is. Thus,
> 3) What do you think are the characteristics of a job to gain
> performance improvements?
CRS4 - Distributed Computing Group
Loc. Pixina Manna Edificio 1
09010 Pula (CA), Italy
Tel: +39 0709250452