|
|
-
Pig optimization getting in the way?
Dexin Wang 2011-02-18, 21:41
I ran into a problem that I have spent quite some time on and start to think it's probably pig's doing something optimization that makes this thing hard.
This is my pseudo code:
raw = LOAD ...
then some crazy stuff like filter join group UDF etc
A = the result from above operation STORE A INTO 'dummy' USING myJDBC(write to table1);
This works fine and I have 4 map-red jobs.
Then I add this after that:
B = FILTER A BY col1="xyz"; STORE B INTO 'dummy2' USING myJDBC(write to table2);
basically I do some filtering of A and write it to another table thru JDBC.
Then I had the problem of jobs failing and saying "PSQLException: This statement has been closed".
My workaround now is to add "EXEC;" before B line and make them write to DB in sequence. This works but now it would run the same map-red jobs twice - I ended up with 8 jobs.
I think the reason for the failure without EXEC line is because pig tries to do the two STORE in the same reducer (or mapper maybe) since B only involves FILTER which doesn't require a separate map-red job and then got confused.
Is there a way for this to work without having to duplicate the jobs? Thanks a lot!
+
Dexin Wang 2011-02-18, 21:41
-
Re: Pig optimization getting in the way?
Dmitriy Ryaboy 2011-02-18, 23:38
Let me guess -- you have a static JDBC connection that you open in myJDBC, and you have jvm reuse turned on.
On Fri, Feb 18, 2011 at 1:41 PM, Dexin Wang <[EMAIL PROTECTED]> wrote:
> I ran into a problem that I have spent quite some time on and start to > think > it's probably pig's doing something optimization that makes this thing > hard. > > This is my pseudo code: > > raw = LOAD ... > > then some crazy stuff like > filter > join > group > UDF > etc > > A = the result from above operation > STORE A INTO 'dummy' USING myJDBC(write to table1); > > This works fine and I have 4 map-red jobs. > > Then I add this after that: > > B = FILTER A BY col1="xyz"; > STORE B INTO 'dummy2' USING myJDBC(write to table2); > > basically I do some filtering of A and write it to another table thru JDBC. > > Then I had the problem of jobs failing and saying "PSQLException: This > statement has been closed". > > My workaround now is to add "EXEC;" before B line and make them write to DB > in sequence. This works but now it would run the same map-red jobs twice - > I > ended up with 8 jobs. > > I think the reason for the failure without EXEC line is because pig tries > to > do the two STORE in the same reducer (or mapper maybe) since B only > involves > FILTER which doesn't require a separate map-red job and then got confused. > > Is there a way for this to work without having to duplicate the jobs? > Thanks > a lot! >
+
Dmitriy Ryaboy 2011-02-18, 23:38
-
Re: Pig optimization getting in the way?
Dexin Wang 2011-02-19, 00:48
I hope that's the case. But
*mapred.job.reuse.jvm.num.tasks* 1 However it does seem to be doing the write to two DB tables in the same job so although it's not re-using jvm, it is already in one jvm since it's the same task!
And since the DB connection is static/singleton as you mentioned, and table name (which is the only thing that's different) is not part of connection URL, they share the same DB connection, and one of them will close the connection when it's done.
Hmm, any suggestions how we can handle this? Thanks.
On Fri, Feb 18, 2011 at 3:38 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
> Let me guess -- you have a static JDBC connection that you open in myJDBC, > and you have jvm reuse turned on. > > On Fri, Feb 18, 2011 at 1:41 PM, Dexin Wang <[EMAIL PROTECTED]> wrote: > > > I ran into a problem that I have spent quite some time on and start to > > think > > it's probably pig's doing something optimization that makes this thing > > hard. > > > > This is my pseudo code: > > > > raw = LOAD ... > > > > then some crazy stuff like > > filter > > join > > group > > UDF > > etc > > > > A = the result from above operation > > STORE A INTO 'dummy' USING myJDBC(write to table1); > > > > This works fine and I have 4 map-red jobs. > > > > Then I add this after that: > > > > B = FILTER A BY col1="xyz"; > > STORE B INTO 'dummy2' USING myJDBC(write to table2); > > > > basically I do some filtering of A and write it to another table thru > JDBC. > > > > Then I had the problem of jobs failing and saying "PSQLException: This > > statement has been closed". > > > > My workaround now is to add "EXEC;" before B line and make them write to > DB > > in sequence. This works but now it would run the same map-red jobs twice > - > > I > > ended up with 8 jobs. > > > > I think the reason for the failure without EXEC line is because pig tries > > to > > do the two STORE in the same reducer (or mapper maybe) since B only > > involves > > FILTER which doesn't require a separate map-red job and then got > confused. > > > > Is there a way for this to work without having to duplicate the jobs? > > Thanks > > a lot! > > >
+
Dexin Wang 2011-02-19, 00:48
-
Re: Pig optimization getting in the way?
Thejas M Nair 2011-02-19, 01:31
As you are suspecting, both store functions are probably running in the same map or reduce task. This is a result of multi-query optimization. Try pig -e 'explain -script yourscript.pig' to see the query plan, and you will be able to verify if the store is happening the same map/reduce task.
Can you can make the db connection a member of the store function/ record writer? You can also use "-no_multiquery" to prevent multi-query optimization from happening, but that will also result in the MR job being executed again for other output.
Thanks, Thejas
On 2/18/11 4:48 PM, "Dexin Wang" <[EMAIL PROTECTED]> wrote:
I hope that's the case. But
*mapred.job.reuse.jvm.num.tasks* 1 However it does seem to be doing the write to two DB tables in the same job so although it's not re-using jvm, it is already in one jvm since it's the same task!
And since the DB connection is static/singleton as you mentioned, and table name (which is the only thing that's different) is not part of connection URL, they share the same DB connection, and one of them will close the connection when it's done.
Hmm, any suggestions how we can handle this? Thanks.
On Fri, Feb 18, 2011 at 3:38 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
> Let me guess -- you have a static JDBC connection that you open in myJDBC, > and you have jvm reuse turned on. > > On Fri, Feb 18, 2011 at 1:41 PM, Dexin Wang <[EMAIL PROTECTED]> wrote: > > > I ran into a problem that I have spent quite some time on and start to > > think > > it's probably pig's doing something optimization that makes this thing > > hard. > > > > This is my pseudo code: > > > > raw = LOAD ... > > > > then some crazy stuff like > > filter > > join > > group > > UDF > > etc > > > > A = the result from above operation > > STORE A INTO 'dummy' USING myJDBC(write to table1); > > > > This works fine and I have 4 map-red jobs. > > > > Then I add this after that: > > > > B = FILTER A BY col1="xyz"; > > STORE B INTO 'dummy2' USING myJDBC(write to table2); > > > > basically I do some filtering of A and write it to another table thru > JDBC. > > > > Then I had the problem of jobs failing and saying "PSQLException: This > > statement has been closed". > > > > My workaround now is to add "EXEC;" before B line and make them write to > DB > > in sequence. This works but now it would run the same map-red jobs twice > - > > I > > ended up with 8 jobs. > > > > I think the reason for the failure without EXEC line is because pig tries > > to > > do the two STORE in the same reducer (or mapper maybe) since B only > > involves > > FILTER which doesn't require a separate map-red job and then got > confused. > > > > Is there a way for this to work without having to duplicate the jobs? > > Thanks > > a lot! > > >
+
Thejas M Nair 2011-02-19, 01:31
-
Re: Pig optimization getting in the way?
Dexin Wang 2011-02-22, 18:18
So I can create multiple db connections for each (jdbc_url, table) pairs and map each pair to its own connection for record writer. Is that what you are suggesting? Sounds like a good plan. Thanks.
On Fri, Feb 18, 2011 at 5:31 PM, Thejas M Nair <[EMAIL PROTECTED]> wrote:
> As you are suspecting, both store functions are probably running in the > same map or reduce task. This is a result of multi-query optimization. > Try pig –e ‘explain –script yourscript.pig’ to see the query plan, and you > will be able to verify if the store is happening the same map/reduce task. > > Can you can make the db connection a member of the store function/ record > writer? > You can also use "-no_multiquery" to prevent multi-query optimization from > happening, but that will also result in the MR job being executed again for > other output. > > Thanks, > Thejas > > > > > On 2/18/11 4:48 PM, "Dexin Wang" <[EMAIL PROTECTED]> wrote: > > I hope that's the case. But > > *mapred.job.reuse.jvm.num.tasks* 1 > However it does seem to be doing the write to two DB tables in the same job > so although it's not re-using jvm, it is already in one jvm since it's the > same task! > > And since the DB connection is static/singleton as you mentioned, and table > name (which is the only thing that's different) is not part of connection > URL, they share the same DB connection, and one of them will close the > connection when it's done. > > Hmm, any suggestions how we can handle this? Thanks. > > On Fri, Feb 18, 2011 at 3:38 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> > wrote: > > > Let me guess -- you have a static JDBC connection that you open in > myJDBC, > > and you have jvm reuse turned on. > > > > On Fri, Feb 18, 2011 at 1:41 PM, Dexin Wang <[EMAIL PROTECTED]> wrote: > > > > > I ran into a problem that I have spent quite some time on and start to > > > think > > > it's probably pig's doing something optimization that makes this thing > > > hard. > > > > > > This is my pseudo code: > > > > > > raw = LOAD ... > > > > > > then some crazy stuff like > > > filter > > > join > > > group > > > UDF > > > etc > > > > > > A = the result from above operation > > > STORE A INTO 'dummy' USING myJDBC(write to table1); > > > > > > This works fine and I have 4 map-red jobs. > > > > > > Then I add this after that: > > > > > > B = FILTER A BY col1="xyz"; > > > STORE B INTO 'dummy2' USING myJDBC(write to table2); > > > > > > basically I do some filtering of A and write it to another table thru > > JDBC. > > > > > > Then I had the problem of jobs failing and saying "PSQLException: This > > > statement has been closed". > > > > > > My workaround now is to add "EXEC;" before B line and make them write > to > > DB > > > in sequence. This works but now it would run the same map-red jobs > twice > > - > > > I > > > ended up with 8 jobs. > > > > > > I think the reason for the failure without EXEC line is because pig > tries > > > to > > > do the two STORE in the same reducer (or mapper maybe) since B only > > > involves > > > FILTER which doesn't require a separate map-red job and then got > > confused. > > > > > > Is there a way for this to work without having to duplicate the jobs? > > > Thanks > > > a lot! > > > > > > > >
+
Dexin Wang 2011-02-22, 18:18
-
Re: Pig optimization getting in the way?
Dmitriy Ryaboy 2011-02-22, 18:59
I was suggesting just a connection per record writer, not a connection per (jdbc, table). That way you are safe even if you are writing into the same table from two streams in the same jvm.
D
On Tue, Feb 22, 2011 at 10:18 AM, Dexin Wang <[EMAIL PROTECTED]> wrote:
> So I can create multiple db connections for each (jdbc_url, table) pairs > and > map each pair to its own connection for record writer. Is that what you are > suggesting? Sounds like a good plan. Thanks. > > On Fri, Feb 18, 2011 at 5:31 PM, Thejas M Nair <[EMAIL PROTECTED]> > wrote: > > > As you are suspecting, both store functions are probably running in the > > same map or reduce task. This is a result of multi-query optimization. > > Try pig –e ‘explain –script yourscript.pig’ to see the query plan, and > you > > will be able to verify if the store is happening the same map/reduce > task. > > > > Can you can make the db connection a member of the store function/ record > > writer? > > You can also use "-no_multiquery" to prevent multi-query optimization > from > > happening, but that will also result in the MR job being executed again > for > > other output. > > > > Thanks, > > Thejas > > > > > > > > > > On 2/18/11 4:48 PM, "Dexin Wang" <[EMAIL PROTECTED]> wrote: > > > > I hope that's the case. But > > > > *mapred.job.reuse.jvm.num.tasks* 1 > > However it does seem to be doing the write to two DB tables in the same > job > > so although it's not re-using jvm, it is already in one jvm since it's > the > > same task! > > > > And since the DB connection is static/singleton as you mentioned, and > table > > name (which is the only thing that's different) is not part of connection > > URL, they share the same DB connection, and one of them will close the > > connection when it's done. > > > > Hmm, any suggestions how we can handle this? Thanks. > > > > On Fri, Feb 18, 2011 at 3:38 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> > > wrote: > > > > > Let me guess -- you have a static JDBC connection that you open in > > myJDBC, > > > and you have jvm reuse turned on. > > > > > > On Fri, Feb 18, 2011 at 1:41 PM, Dexin Wang <[EMAIL PROTECTED]> > wrote: > > > > > > > I ran into a problem that I have spent quite some time on and start > to > > > > think > > > > it's probably pig's doing something optimization that makes this > thing > > > > hard. > > > > > > > > This is my pseudo code: > > > > > > > > raw = LOAD ... > > > > > > > > then some crazy stuff like > > > > filter > > > > join > > > > group > > > > UDF > > > > etc > > > > > > > > A = the result from above operation > > > > STORE A INTO 'dummy' USING myJDBC(write to table1); > > > > > > > > This works fine and I have 4 map-red jobs. > > > > > > > > Then I add this after that: > > > > > > > > B = FILTER A BY col1="xyz"; > > > > STORE B INTO 'dummy2' USING myJDBC(write to table2); > > > > > > > > basically I do some filtering of A and write it to another table thru > > > JDBC. > > > > > > > > Then I had the problem of jobs failing and saying "PSQLException: > This > > > > statement has been closed". > > > > > > > > My workaround now is to add "EXEC;" before B line and make them write > > to > > > DB > > > > in sequence. This works but now it would run the same map-red jobs > > twice > > > - > > > > I > > > > ended up with 8 jobs. > > > > > > > > I think the reason for the failure without EXEC line is because pig > > tries > > > > to > > > > do the two STORE in the same reducer (or mapper maybe) since B only > > > > involves > > > > FILTER which doesn't require a separate map-red job and then got > > > confused. > > > > > > > > Is there a way for this to work without having to duplicate the jobs? > > > > Thanks > > > > a lot! > > > > > > > > > > > > > >
+
Dmitriy Ryaboy 2011-02-22, 18:59
-
Re: Pig optimization getting in the way?
Dmitriy Ryaboy 2011-02-19, 01:53
Open a new connection per Storage instance? Better yet, use a connection pool?
On Fri, Feb 18, 2011 at 4:48 PM, Dexin Wang <[EMAIL PROTECTED]> wrote:
> I hope that's the case. But > > *mapred.job.reuse.jvm.num.tasks* 1 > However it does seem to be doing the write to two DB tables in the same job > so although it's not re-using jvm, it is already in one jvm since it's the > same task! > > And since the DB connection is static/singleton as you mentioned, and table > name (which is the only thing that's different) is not part of connection > URL, they share the same DB connection, and one of them will close the > connection when it's done. > > Hmm, any suggestions how we can handle this? Thanks. > > On Fri, Feb 18, 2011 at 3:38 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]>wrote: > >> Let me guess -- you have a static JDBC connection that you open in myJDBC, >> and you have jvm reuse turned on. >> >> On Fri, Feb 18, 2011 at 1:41 PM, Dexin Wang <[EMAIL PROTECTED]> wrote: >> >> > I ran into a problem that I have spent quite some time on and start to >> > think >> > it's probably pig's doing something optimization that makes this thing >> > hard. >> > >> > This is my pseudo code: >> > >> > raw = LOAD ... >> > >> > then some crazy stuff like >> > filter >> > join >> > group >> > UDF >> > etc >> > >> > A = the result from above operation >> > STORE A INTO 'dummy' USING myJDBC(write to table1); >> > >> > This works fine and I have 4 map-red jobs. >> > >> > Then I add this after that: >> > >> > B = FILTER A BY col1="xyz"; >> > STORE B INTO 'dummy2' USING myJDBC(write to table2); >> > >> > basically I do some filtering of A and write it to another table thru >> JDBC. >> > >> > Then I had the problem of jobs failing and saying "PSQLException: This >> > statement has been closed". >> > >> > My workaround now is to add "EXEC;" before B line and make them write to >> DB >> > in sequence. This works but now it would run the same map-red jobs twice >> - >> > I >> > ended up with 8 jobs. >> > >> > I think the reason for the failure without EXEC line is because pig >> tries >> > to >> > do the two STORE in the same reducer (or mapper maybe) since B only >> > involves >> > FILTER which doesn't require a separate map-red job and then got >> confused. >> > >> > Is there a way for this to work without having to duplicate the jobs? >> > Thanks >> > a lot! >> > >> > >
+
Dmitriy Ryaboy 2011-02-19, 01:53
|
|