|
Geoffry Roberts
2012-02-28, 17:15
Owen O'Malley
2012-02-28, 18:17
Geoffry Roberts
2012-03-02, 17:42
Kunaal
2012-03-02, 17:46
Leo Leung
2012-03-02, 18:30
Geoffry Roberts
2012-03-02, 18:31
Tarjei Huse
2012-03-02, 18:50
Geoffry Roberts
2012-03-02, 18:51
Geoffry Roberts
2012-03-02, 18:59
Tarjei Huse
2012-03-03, 07:24
|
-
Hadoop and Hibernate
Geoffry Roberts 2012-02-28, 17:15
All,

I am trying to use Hibernate within my reducer and it goeth not well. Has anybody ever successfully done this?

I have a Java package that contains my Hadoop driver, mapper, and reducer along with a persistence class. I call Hibernate from the cleanup() method in my reducer class. It complains that it cannot find the persistence class. The class is in the same package as the reducer, and this all would work outside of Hadoop. The error is thrown when I attempt to begin a transaction.

The error:

    org.hibernate.MappingException: Unknown entity: qq.mob.depart.EpiState

The code:

    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        ...
        org.hibernate.cfg.Configuration cfg = new org.hibernate.cfg.Configuration();
        SessionFactory sessionFactory = cfg.configure("hibernate.cfg.xml").buildSessionFactory();
        cfg.addAnnotatedClass(EpiState.class); // This class is in the same package as the reducer.
        Session session = sessionFactory.openSession();
        Transaction tx = session.getTransaction();
        tx.begin(); // Error is thrown here.
        ...
    }

If I create an executable jar file that contains all dependencies required by the MR job, do all said dependencies get distributed to all nodes?

If I specify but one reducer, which node in the cluster will the reducer run on?

Thanks
--
Geoffry Roberts
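A possible explanation for the Unknown entity error, offered as a sketch rather than a confirmed diagnosis: in the snippet above, addAnnotatedClass(EpiState.class) is called after buildSessionFactory(), so the factory is built before the mapping is registered. The version below registers the class first. It assumes Hibernate 3.6 or later with javax.persistence annotations on the task classpath; the EpiState fields and the EpiStateWriter helper are hypothetical, since the original class is not shown.

```java
import javax.persistence.Entity;
import javax.persistence.Id;

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;
import org.hibernate.cfg.Configuration;

// Hypothetical stand-in for qq.mob.depart.EpiState; the real fields are not shown in the thread.
@Entity
class EpiState {
    @Id
    private long id;
    private String label;

    protected EpiState() { }   // Hibernate needs a no-arg constructor

    EpiState(long id, String label) {
        this.id = id;
        this.label = label;
    }
}

// Hypothetical helper that a reducer's cleanup() could call.
class EpiStateWriter {
    static void persist(EpiState state) {
        Configuration cfg = new Configuration();
        cfg.configure("hibernate.cfg.xml");
        // Register the mapping BEFORE building the factory.
        cfg.addAnnotatedClass(EpiState.class);
        SessionFactory sessionFactory = cfg.buildSessionFactory();

        Session session = sessionFactory.openSession();
        Transaction tx = session.beginTransaction();
        try {
            session.save(state);
            tx.commit();
        } catch (RuntimeException e) {
            tx.rollback();
            throw e;
        } finally {
            session.close();
            sessionFactory.close();
        }
    }
}
```

If the reducer then fails with ClassNotFoundException instead, the mapping order is probably not the whole story and the classpath of the task is the next thing to check.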
-
Re: Hadoop and Hibernate
Owen O'Malley 2012-02-28, 18:17
On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts <[EMAIL PROTECTED]> wrote:
> If I create an executable jar file that contains all dependencies required
> by the MR job do all said dependencies get distributed to all nodes?

You can make a single jar and that will be distributed to all of the machines that run the task, but it is better in most cases to use the distributed cache.

See http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache

> If I specify but one reducer, which node in the cluster will the reducer
> run on?

The scheduling is done by the JobTracker and it isn't possible to control the location of the reducers.

-- Owen
-
Re: Hadoop and Hibernate
Geoffry Roberts 2012-03-02, 17:42
This is a tardy response. I'm spread pretty thinly right now.

DistributedCache <http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache> is apparently deprecated. Is there a replacement? I didn't see anything about this in the documentation, but then I am still using 0.21.0. I have to for performance reasons. 1.0.1 is too slow and the client won't have it.

Also, the DistributedCache approach seems only to work from within a Hadoop job, i.e., from within a Mapper or a Reducer, but not from within a Driver. I have libraries that I must access from both places. I take it that I am stuck keeping two copies of these libraries in sync. Correct? It's either that, or copy them into HDFS, replacing them all at the beginning of each job run.

Looking for best practices.

Thanks
--
Geoffry Roberts
-
Re: Hadoop and Hibernate
Kunaal 2012-03-02, 17:46
Are you looking to use DistributedCache for better performance?

--
"What we are is the universe's gift to us.
What we become is our gift to the universe."
-
RE: Hadoop and Hibernate
Leo Leung 2012-03-02, 18:30
Geoffry,

Hadoop DistributedCache (as of now) is used to "cache" M/R application-specific files. These files are used by the M/R app only and not by the framework. (Normally as a side lookup.)

You can certainly try to use Hibernate to query your SQL-based back-end within the M/R code. But think of what happens when a few hundred or thousands of M/R tasks do that concurrently. Your back-end is going to cry. (If it can, before it dies.)

So IMO, prepping your M/R job with DistributedCache files (pull the data down first) is a better approach.

Also, MPI is pretty much out of the question (not baked into the framework). You'll likely have to roll your own. (And try to trick the JobTracker into not starting the same task.)

Does anyone have a better solution for Geoffry?
-
Re: Hadoop and Hibernate
Geoffry Roberts 2012-03-02, 18:31
No, I am using 0.21.0 for better performance. I am interested in DistributedCache so certain libraries can be found during MR processing. As it is now, I'm getting ClassNotFoundException thrown by the Reducers. The Driver throws no error; the Reducer(s) does. It would seem something is not being distributed across the cluster as I assumed it would. After all, the whole business is in a single, executable jar file.

--
Geoffry Roberts
-
Re: Hadoop and Hibernate
Tarjei Huse 2012-03-02, 18:50
On 03/02/2012 07:31 PM, Geoffry Roberts wrote:
> No, I am using 0.21.0 for better performance. I am interested in
> DistributedCache so certain libraries can be found during MR processing.
> As it is now, I'm getting ClassNotFoundException being thrown by the
> Reducers. The Driver throws no error, the Reducer(s) does. It would seem
> something is not being distributed across the cluster as I assumed it
> would. After all, the whole business is in a single, executable jar file.

How complex are the queries you are doing?

Have you considered one of the following:

1) Use plain JDBC instead of integrating Hibernate into Hadoop.
2) Create a local version of the db that can be in the Distributed Cache.

I tried using Hibernate with Hadoop (the queries were not an important part of the size of the jobs) but I ran up against so many issues trying to get Hibernate to start up within the MR job that I ended up just exporting the tables, loading them into memory and doing queries against them with basic HashMap lookups.

My best advice is that, if you can, you should consider a way to abstract Hibernate away from the job and use something closer to the metal, like either JDBC or just dumping the data to files. Getting Hibernate to run outside of Spring and friends can quickly grow tiresome.

T

--
Regards,
Tarjei Huse
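The export-and-lookup approach described above can be sketched with the JDK alone. This is an illustration under assumptions, not the code from that project: the tab-separated key/value export format and the SideTable class are hypothetical, and in a real job the reader would be opened in setup() on a file shipped to the task, for example through the distributed cache.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory replacement for a database lookup table.
class SideTable {
    private final Map<String, String> rows = new HashMap<String, String>();

    // Loads "key<TAB>value" lines; blank or malformed lines are skipped.
    static SideTable load(Reader source) throws IOException {
        SideTable table = new SideTable();
        BufferedReader in = new BufferedReader(source);
        String line;
        while ((line = in.readLine()) != null) {
            int tab = line.indexOf('\t');
            if (tab <= 0) {
                continue;
            }
            table.rows.put(line.substring(0, tab), line.substring(tab + 1));
        }
        return table;
    }

    // Returns the value for key, or fallback when the key was not exported.
    String lookup(String key, String fallback) {
        String value = rows.get(key);
        return value != null ? value : fallback;
    }

    int size() {
        return rows.size();
    }
}
```

The trade-off is memory: this only suits tables small enough to fit in each task's heap.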
-
Re: Hadoop and Hibernate
Geoffry Roberts 2012-03-02, 18:51
Thanks Leo. I appreciate your response.

Let me explain my situation more precisely. I am running a series of MR sub-jobs all harnessed together so they run as a single job. The last MR sub-job does nothing more than aggregate the output of the previous sub-job into a single file(s). It does this by having but a single reducer. I could eliminate this aggregation sub-job if I could have the aforementioned previous sub-job insert its output into a database instead of HDFS. Doing this would also eliminate my current dependence on MultipleOutputs.

The trouble comes when the Reducer(s) cannot find the persistent objects, hence the dreaded CNFE. I find this odd because they are in the same package as the Reducer.

Your comment about the back end crying is duly noted.

btw, MPI = Message Passing Interface?

--
Geoffry Roberts
-
Re: Hadoop and Hibernate
Geoffry Roberts 2012-03-02, 18:59
Queries are nothing but inserts. Create an object, populate it, persist it. If it worked, life would be good right now.

I've considered JDBC and may yet take that approach.

re: Hibernate outside of Spring -- I'm getting tired already.

Interesting thing: I use EMF (Eclipse Modeling Framework). The supporting jar files for EMF and Ecore are built into the job. They are being found by the Driver(s) and the MR(s) no problemo. If these work, why not the Hibernate stuff? Mystery!

--
Geoffry Roberts
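Since the workload is insert-only, plain JDBC batching is one way it might look. This is a minimal sketch under assumptions: the epi_state table, its columns, the EpiRow holder, and the batch size are all hypothetical, and the caller supplies an open java.sql.Connection (the JDBC driver jar still has to reach the task classpath).

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Arrays;
import java.util.List;

// Hypothetical row holder; the real EpiState fields are not shown in the thread.
class EpiRow {
    final long id;
    final String label;

    EpiRow(long id, String label) {
        this.id = id;
        this.label = label;
    }
}

class JdbcBatchWriter {
    // Builds e.g. "INSERT INTO t (a, b) VALUES (?, ?)" for the given columns.
    static String insertSql(String table, List<String> columns) {
        StringBuilder marks = new StringBuilder();
        for (int i = 0; i < columns.size(); i++) {
            marks.append(i == 0 ? "?" : ", ?");
        }
        return "INSERT INTO " + table + " (" + String.join(", ", columns)
                + ") VALUES (" + marks + ")";
    }

    // Writes all rows in one transaction, sending them to the database in batches.
    static void write(Connection conn, List<EpiRow> rows, int batchSize) throws SQLException {
        String sql = insertSql("epi_state", Arrays.asList("id", "label"));
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int pending = 0;
            for (EpiRow row : rows) {
                ps.setLong(1, row.id);
                ps.setString(2, row.label);
                ps.addBatch();
                if (++pending == batchSize) {
                    ps.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch();   // flush the final partial batch
            }
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}
```

Called once from a single reducer's cleanup(), a writer like this would hold one connection open, which speaks to the earlier concern about many tasks hitting the back end at once.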
-
Re: Hadoop and Hibernate
Tarjei Huse 2012-03-03, 07:24
On 03/02/2012 07:59 PM, Geoffry Roberts wrote:
> Queries are nothing but inserts. Create an object, populated it, persist
> it. If it worked, life would be good right now.
>
> I've considered JDBC and may yet take that approach.

I have used MyBatis on a project now; it is also worth considering if you want a more ORM-like feel to the job.

> re: Hibernate outside of Spring -- I'm getting tired already.
>
> Interesting thing: I use EMF (Eclipse Modelling Framework). The
> supporting jar files for emf and ecore are built into the job. They are
> being found by the Driver(s) and the MR(s) no problemo. If these work, why
> not the hibernate stuff? Mystery!

I wish I knew. :)

T

--
Regards,
Tarjei Huse
|