Hadoop, mail # user - Hadoop and Hibernate


RE: Hadoop and Hibernate
Leo Leung 2012-03-02, 18:30
Geoffry,

 Hadoop's DistributedCache (as of now) is used to "cache" M/R application-specific files.
 These files are used only by the M/R app and not by the framework (normally as side-lookup data).

 You can certainly try to use Hibernate to query your SQL-based back-end from within the M/R code.
 But think of what happens when a few hundred or a few thousand M/R tasks do that concurrently.
 Your back-end is going to cry. (if it can - before it dies)

 So IMO, prepping your M/R job with DistributedCache files (pulling the data down first) is the better approach.
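
 A minimal sketch of the side-lookup idea, in plain Java with no Hadoop classes: it assumes the back-end data was exported beforehand to a tab-separated "key<TAB>value" file and that a local copy of that file is visible to the task, which is what the distributed cache would provide. The class name SideLookup and the file layout are hypothetical, chosen only for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of Hadoop): holds a pre-exported lookup table
// in memory so that tasks never have to query the SQL back-end themselves.
class SideLookup {
    private final Map<String, String> table = new HashMap<String, String>();

    // Reads the local copy of the cached file. In a real job this would be
    // called once per task during task setup, not once per record.
    static SideLookup load(Path localFile) throws IOException {
        SideLookup lookup = new SideLookup();
        try (BufferedReader in = Files.newBufferedReader(localFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    continue; // skip lines that are not "key<TAB>value"
                }
                lookup.table.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }
        return lookup;
    }

    // Returns the value stored for key, or null if the key is absent.
    String get(String key) {
        return table.get(key);
    }

    // Number of entries that were loaded.
    int size() {
        return table.size();
    }

    // Small demonstration using a temporary file in place of the cached file.
    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("customers", ".tsv");
        Files.write(tmp, Arrays.asList("1\tAlice", "2\tBob"), StandardCharsets.UTF_8);
        SideLookup lookup = SideLookup.load(tmp);
        System.out.println(lookup.get("2"));
        Files.delete(tmp);
    }
}
```

 The point of the design is that the export from the database happens once, before the job starts, so the number of concurrent tasks no longer translates into concurrent database connections.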

 Also, MPI is pretty much out of the question (it is not baked into the framework).
 You'll likely have to roll your own.  (And try to trick the JobTracker into not starting the same task)

 Does anyone have a better solution for Geoffry?

-----Original Message-----
From: Geoffry Roberts [mailto:[EMAIL PROTECTED]]
Sent: Friday, March 02, 2012 9:42 AM
To: [EMAIL PROTECTED]
Subject: Re: Hadoop and Hibernate

This is a tardy response.  I'm spread pretty thinly right now.

DistributedCache <http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache> is apparently deprecated.  Is there a replacement?  I didn't see anything about this in the documentation, but then I am still using 0.21.0. I have to for performance reasons.  1.0.1 is too slow and the client won't have it.

Also, the DistributedCache <http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache> approach seems to work only from within a Hadoop job, i.e. from within a Mapper or a Reducer, but not from within a Driver.  I have libraries that I must access from both places.  I take it that I am stuck keeping two copies of these libraries in sync, correct?  It's either that, or copy them into HDFS, replacing them all at the beginning of each job run.

Looking for best practices.

Thanks

On 28 February 2012 10:17, Owen O'Malley <[EMAIL PROTECTED]> wrote:

> On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts
> <[EMAIL PROTECTED]> wrote:
>
> > If I create an executable jar file that contains all dependencies
> > required by the MR job do all said dependencies get distributed to
> > all nodes?
>
> You can make a single jar and that will be distributed to all of the
> machines that run the task, but it is better in most cases to use the
> distributed cache.
>
> See
> http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
>
> > If I specify but one reducer, which node in the cluster will the
> > reducer run on?
>
> The scheduling is done by the JobTracker and it isn't possible to
> control the location of the reducers.
>
> -- Owen
>

--
Geoffry Roberts