
Pig, mail # user - Pig optimization getting in the way?


Re: Pig optimization getting in the way?
Dexin Wang 2011-02-22, 18:18
So I can create a db connection for each (jdbc_url, table) pair and map each
pair to its own connection in the record writer. Is that what you are
suggesting? Sounds like a good plan. Thanks.
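
Something like this is what I have in mind, as a rough sketch (the class and
method names are made up for illustration, not our actual store function
code): each (jdbc_url, table) pair gets its own connection, so closing the
connection for one table can't break the writer for the other.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry: one Connection per (jdbc_url, table) pair, so two
// record writers running in the same task never share a connection and
// never close each other's.
public class ConnectionRegistry {
    private static final Map<String, Connection> CONNECTIONS =
            new HashMap<String, Connection>();

    public static synchronized Connection get(String jdbcUrl, String table)
            throws SQLException {
        String key = jdbcUrl + "#" + table;
        Connection conn = CONNECTIONS.get(key);
        if (conn == null || conn.isClosed()) {
            conn = DriverManager.getConnection(jdbcUrl);
            CONNECTIONS.put(key, conn);
        }
        return conn;
    }

    public static synchronized void close(String jdbcUrl, String table)
            throws SQLException {
        Connection conn = CONNECTIONS.remove(jdbcUrl + "#" + table);
        if (conn != null && !conn.isClosed()) {
            conn.close();   // affects only this (url, table) pair
        }
    }
}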

On Fri, Feb 18, 2011 at 5:31 PM, Thejas M Nair <[EMAIL PROTECTED]> wrote:

>  As you are suspecting, both store functions are probably running in the
> same map or reduce task. This is a result of multi-query optimization.
> Try pig -e 'explain -script yourscript.pig' to see the query plan, and you
> will be able to verify whether the stores happen in the same map/reduce task.
>
> Can you make the db connection a member of the store function / record
> writer?
> You can also use "-no_multiquery" to prevent multi-query optimization from
> happening, but that will also result in the shared MR jobs being executed
> again for the other output.
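>
> A rough sketch of what I mean (names are illustrative, assuming your store
> function creates one record writer per output; this is not an actual Pig
> API):
>
> import java.io.IOException;
> import java.sql.Connection;
> import java.sql.DriverManager;
> import java.sql.PreparedStatement;
> import java.sql.SQLException;
>
> // Sketch: the connection is an instance member of the record writer, so
> // each STORE gets its own connection even when multi-query optimization
> // puts both stores in the same map/reduce task.
> public class JdbcRecordWriter {
>     private final Connection conn;          // owned by this writer only
>     private final PreparedStatement insert;
>
>     public JdbcRecordWriter(String jdbcUrl, String insertSql)
>             throws SQLException {
>         this.conn = DriverManager.getConnection(jdbcUrl);
>         this.insert = conn.prepareStatement(insertSql);
>     }
>
>     public void write(Object[] row) throws SQLException {
>         for (int i = 0; i < row.length; i++) {
>             insert.setObject(i + 1, row[i]);
>         }
>         insert.executeUpdate();
>     }
>
>     public void close() throws IOException {
>         try {
>             insert.close();
>             conn.close();   // closes only this writer's own connection
>         } catch (SQLException e) {
>             throw new IOException("error closing connection: " + e);
>         }
>     }
> }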
>
> Thanks,
> Thejas
>
>
>
>
> On 2/18/11 4:48 PM, "Dexin Wang" <[EMAIL PROTECTED]> wrote:
>
> I hope that's the case. But mapred.job.reuse.jvm.num.tasks is set to 1, so
> JVM reuse is off. However, it does seem to be doing the writes to the two
> DB tables in the same job, so although it's not re-using the JVM, both
> stores are already in one JVM since it's the same task!
>
> And since the DB connection is static/singleton as you mentioned, and the
> table name (which is the only thing that's different) is not part of the
> connection URL, they share the same DB connection, and one of them closes
> the connection when it's done.
>
> Hmm, any suggestions how we can handle this? Thanks.
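>
> To illustrate (a contrived sketch, not our actual code), the sharing
> problem looks like this:
>
> import java.sql.Connection;
> import java.sql.DriverManager;
> import java.sql.SQLException;
>
> // Contrived sketch of the problem: since the table name isn't part of the
> // URL, the writers for table1 and table2 get the same static Connection,
> // and whichever finishes first closes it for both.
> public class SharedConnection {
>     private static Connection conn;
>
>     public static synchronized Connection get(String jdbcUrl)
>             throws SQLException {
>         if (conn == null || conn.isClosed()) {
>             conn = DriverManager.getConnection(jdbcUrl);
>         }
>         return conn;   // same object for both store functions in the task
>     }
>
>     public static synchronized void release() throws SQLException {
>         if (conn != null) {
>             conn.close();   // first writer to finish kills it for the other
>             conn = null;
>         }
>     }
> }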
>
> On Fri, Feb 18, 2011 at 3:38 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]>
> wrote:
>
> > Let me guess -- you have a static JDBC connection that you open in
> > myJDBC, and you have JVM reuse turned on.
> >
> > On Fri, Feb 18, 2011 at 1:41 PM, Dexin Wang <[EMAIL PROTECTED]> wrote:
> >
> > > I ran into a problem that I have spent quite some time on, and I'm
> > > starting to think it's probably Pig doing some optimization that makes
> > > this hard.
> > >
> > > This is my pseudo code:
> > >
> > > raw = LOAD ...
> > >
> > > then some crazy stuff like
> > > filter
> > > join
> > > group
> > > UDF
> > > etc
> > >
> > > A = the result from above operation
> > > STORE A INTO 'dummy' USING myJDBC(write to table1);
> > >
> > > This works fine and I have 4 map-red jobs.
> > >
> > > Then I add this after that:
> > >
> > > B = FILTER A BY col1="xyz";
> > > STORE B INTO 'dummy2' USING myJDBC(write to table2);
> > >
> > > basically I do some filtering of A and write it to another table thru
> > > JDBC.
> > >
> > > Then I had the problem of jobs failing and saying "PSQLException: This
> > > statement has been closed".
> > >
> > > My workaround now is to add "EXEC;" before the B line to make the two
> > > stores write to the DB in sequence. This works, but now it runs the
> > > same map-red jobs twice - I ended up with 8 jobs.
> > >
> > > I think the reason for the failure without the EXEC line is that Pig
> > > tries to do the two STOREs in the same reducer (or maybe mapper), since
> > > B only involves a FILTER, which doesn't require a separate map-red job,
> > > and then it gets confused.
> > >
> > > Is there a way for this to work without having to duplicate the jobs?
> > > Thanks a lot!
> > >