Pig >> mail # user >> Pig optimization getting in the way?


Re: Pig optimization getting in the way?
So I can create a separate db connection for each (jdbc_url, table) pair and
map each pair to its own connection in the record writer. Is that what you are
suggesting? Sounds like a good plan. Thanks.
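For illustration, here is a minimal sketch of that idea, assuming a custom JDBC store function whose record writer looks up its connection by (jdbc_url, table) pair; the class and method names below are hypothetical and not taken from the actual myJDBC code:

// Hypothetical sketch: one JDBC connection per (jdbc_url, table) pair instead of
// a single shared static connection, so one store closing its connection cannot
// invalidate the other store's statements. Names are illustrative only.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;

public final class JdbcConnectionPerTable {
    private static final Map<String, Connection> CONNECTIONS = new HashMap<>();

    /** Returns the connection dedicated to this (url, table) pair, opening it if needed. */
    public static synchronized Connection forTable(String jdbcUrl, String table) throws SQLException {
        String key = jdbcUrl + "|" + table;
        Connection conn = CONNECTIONS.get(key);
        if (conn == null || conn.isClosed()) {
            conn = DriverManager.getConnection(jdbcUrl);
            CONNECTIONS.put(key, conn);
        }
        return conn;
    }

    /** Closes only the connection belonging to this (url, table) pair. */
    public static synchronized void closeTable(String jdbcUrl, String table) throws SQLException {
        Connection conn = CONNECTIONS.remove(jdbcUrl + "|" + table);
        if (conn != null && !conn.isClosed()) {
            conn.close();
        }
    }
}

Each record writer would then call forTable(...) when it opens and closeTable(...) in its own close(), so the two STOREs running in the same task no longer tear down each other's connection.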

On Fri, Feb 18, 2011 at 5:31 PM, Thejas M Nair <[EMAIL PROTECTED]> wrote:

>  As you are suspecting, both store functions are probably running in the
> same map or reduce task. This is a result of multi-query optimization.
> Try pig -e 'explain -script yourscript.pig' to see the query plan, and you
> will be able to verify if the store is happening in the same map/reduce task.
>
> Can you make the db connection a member of the store function / record
> writer?
> You can also use "-no_multiquery" to prevent multi-query optimization from
> happening, but that will also result in the MR jobs being executed again for
> the other output.
>
> Thanks,
> Thejas
>
>
>
>
> On 2/18/11 4:48 PM, "Dexin Wang" <[EMAIL PROTECTED]> wrote:
>
> I hope that's the case. But *mapred.job.reuse.jvm.num.tasks* is set to 1.
>
> However, it does seem to be doing the writes to the two DB tables in the same
> job, so although it's not re-using the JVM, both stores are already in one
> JVM since it's the same task!
>
> And since the DB connection is a static/singleton, as you mentioned, and the
> table name (which is the only thing that's different) is not part of the
> connection URL, they share the same DB connection, and one of them will close
> the connection when it's done.
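A rough illustration of that failure mode, with hypothetical names (the real myJDBC internals may differ): the connection is a static singleton keyed only by the JDBC URL, so both record writers in the task share it, and whichever finishes first closes it out from under the other, which then fails with "This statement has been closed".

// Hypothetical illustration of the failure mode described above; not the real myJDBC code.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public final class SharedJdbcConnection {
    private static Connection connection;  // one per JVM, shared by every writer in the task

    public static synchronized Connection get(String jdbcUrl) throws SQLException {
        if (connection == null) {
            connection = DriverManager.getConnection(jdbcUrl);  // table name is not part of the key
        }
        return connection;
    }

    public static synchronized void close() throws SQLException {
        if (connection != null) {
            connection.close();  // also closes it for the other store's writer
            connection = null;
        }
    }
}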
>
> Hmm, any suggestions how we can handle this? Thanks.
>
> On Fri, Feb 18, 2011 at 3:38 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]>
> wrote:
>
> > Let me guess -- you have a static JDBC connection that you open in myJDBC,
> > and you have jvm reuse turned on.
> >
> > On Fri, Feb 18, 2011 at 1:41 PM, Dexin Wang <[EMAIL PROTECTED]> wrote:
> >
> > > I ran into a problem that I have spent quite some time on, and I'm
> > > starting to think it's probably Pig doing some optimization that makes
> > > this hard.
> > >
> > > This is my pseudo code:
> > >
> > > raw = LOAD ...
> > >
> > > then some crazy stuff like
> > > filter
> > > join
> > > group
> > > UDF
> > > etc
> > >
> > > A = the result from above operation
> > > STORE A INTO 'dummy' USING myJDBC(write to table1);
> > >
> > > This works fine and I have 4 map-red jobs.
> > >
> > > Then I add this after that:
> > >
> > > B = FILTER A BY col1 == 'xyz';
> > > STORE B INTO 'dummy2' USING myJDBC(write to table2);
> > >
> > > basically I do some filtering of A and write it to another table thru
> > > JDBC.
> > >
> > > Then I had the problem of jobs failing and saying "PSQLException: This
> > > statement has been closed".
> > >
> > > My workaround now is to add "EXEC;" before the B line and make them
> > > write to the DB in sequence. This works, but now it runs the same
> > > map-red jobs twice - I ended up with 8 jobs.
> > >
> > > I think the reason for the failure without the EXEC line is that Pig
> > > tries to do the two STOREs in the same reducer (or maybe mapper), since
> > > B only involves a FILTER, which doesn't require a separate map-red job,
> > > and then it gets confused.
> > >
> > > Is there a way for this to work without having to duplicate the jobs?
> > > Thanks a lot!
> > >
> >
>
>
>