Pig user mailing list: Pig optimization getting in the way?


Re: Pig optimization getting in the way?
As you suspect, both store functions are probably running in the same map or reduce task. This is a result of multi-query optimization.
Try pig -e 'explain -script yourscript.pig' to see the query plan; from that you can verify whether both stores are happening in the same map/reduce task.

Can you make the db connection a member of the store function/record writer?
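
[Editor's note: for illustration, here is a minimal sketch of that idea, assuming a custom JDBC storer along the lines of myJDBC. The class name, constructor parameters, and two-column schema are assumptions, not the poster's actual code. Each record writer owns its own connection and closes only its own handle, so the two stores cannot invalidate each other's statements.]

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.pig.data.Tuple;

// Hypothetical record writer for a JDBC store function. Each instance opens
// its own connection instead of sharing a static singleton.
public class JdbcRecordWriter extends RecordWriter<Object, Tuple> {

    private final Connection conn;            // per-writer, not static
    private final PreparedStatement insert;

    public JdbcRecordWriter(String url, String table) throws SQLException {
        conn = DriverManager.getConnection(url);
        // Two-column schema assumed purely for the example.
        insert = conn.prepareStatement("INSERT INTO " + table + " VALUES (?, ?)");
    }

    @Override
    public void write(Object key, Tuple t) throws IOException {
        try {
            insert.setObject(1, t.get(0));
            insert.setObject(2, t.get(1));
            insert.executeUpdate();
        } catch (Exception e) {
            throw new IOException(e);
        }
    }

    @Override
    public void close(TaskAttemptContext context) throws IOException {
        try {
            insert.close();
            conn.close();  // closes only this writer's handle
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }
}

[With one connection per writer, whichever store finishes first closes its own connection without affecting the other.]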
You can also use "-no_multiquery" to prevent multi-query optimization from happening, but then the shared MR jobs will be executed again for the other output.

Thanks,
Thejas

On 2/18/11 4:48 PM, "Dexin Wang" <[EMAIL PROTECTED]> wrote:

I hope that's the case. But JVM reuse is off here:

mapred.job.reuse.jvm.num.tasks = 1

However, it does seem to be doing the writes to the two DB tables in the same job, so although it's not re-using a JVM, both stores are already in one JVM since they run in the same task!

And since the DB connection is static/singleton as you mentioned, and the table
name (the only thing that differs) is not part of the connection URL, the two
stores share the same DB connection, and whichever finishes first closes the
connection.

Hmm, any suggestions how we can handle this? Thanks.
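
[Editor's note: one possible way to handle it, sketched here as an illustration rather than anything proposed in the thread. If the connection must stay static, key the cache by table name, the only thing that differs between the two stores, so each STORE acquires and closes its own handle. The class and method names are hypothetical.]

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-table connection cache: replaces a single static
// singleton with one connection per (url, table) pair.
public final class PerTableConnections {

    private static final Map<String, Connection> CACHE = new HashMap<>();

    public static synchronized Connection get(String url, String table)
            throws SQLException {
        String key = url + "#" + table;
        Connection c = CACHE.get(key);
        if (c == null || c.isClosed()) {
            c = DriverManager.getConnection(url);
            CACHE.put(key, c);
        }
        return c;
    }

    // Each storer closes only its own table's connection when it finishes.
    public static synchronized void close(String url, String table)
            throws SQLException {
        Connection c = CACHE.remove(url + "#" + table);
        if (c != null) {
            c.close();
        }
    }
}

[The storer would call PerTableConnections.get(url, table) in its setup and PerTableConnections.close(url, table) when it finishes, instead of touching a shared singleton; closing table1's connection then cannot close the statement that table2's store is still using.]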

On Fri, Feb 18, 2011 at 3:38 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:

> Let me guess -- you have a static JDBC connection that you open in myJDBC,
> and you have jvm reuse turned on.
>
> On Fri, Feb 18, 2011 at 1:41 PM, Dexin Wang <[EMAIL PROTECTED]> wrote:
>
> > I ran into a problem that I have spent quite some time on, and I am starting
> > to think it's probably Pig doing some optimization that makes this hard.
> >
> > This is my pseudo code:
> >
> > raw = LOAD ...
> >
> > then some crazy stuff like
> > filter
> > join
> > group
> > UDF
> > etc
> >
> > A = the result from above operation
> > STORE A INTO 'dummy' USING myJDBC(write to table1);
> >
> > This works fine and I have 4 map-red jobs.
> >
> > Then I add this after that:
> >
> > B = FILTER A BY col1 == 'xyz';
> > STORE B INTO 'dummy2' USING myJDBC(write to table2);
> >
> > basically I do some filtering of A and write it to another table through JDBC.
> >
> > Then I had the problem of jobs failing and saying "PSQLException: This
> > statement has been closed".
> >
> > My workaround now is to add "EXEC;" before the B line and make the stores
> > write to the DB in sequence. This works, but now it runs the same map-red
> > jobs twice - I ended up with 8 jobs.
> >
> > I think the reason for the failure without the EXEC line is that Pig tries
> > to do the two STOREs in the same reducer (or maybe mapper), since B only
> > involves a FILTER, which doesn't require a separate map-red job, and then
> > gets confused.
> >
> > Is there a way for this to work without having to duplicate the jobs?
> > Thanks a lot!
> >
>
Other messages in this thread:
Dexin Wang 2011-02-22, 18:18
Dmitriy Ryaboy 2011-02-22, 18:59
Dmitriy Ryaboy 2011-02-19, 01:53