Pig >> mail # user >> Pig optimization getting in the way?

Pig optimization getting in the way?
I ran into a problem that I have spent quite some time on, and I'm starting to
think it's Pig doing some optimization that makes this thing hard.

This is my pseudo code:

raw = LOAD ...

then some fairly involved transformations, ending with:

A = the result of the operations above
STORE A INTO 'dummy' USING myJDBC(write to table1);

This works fine and I have 4 map-red jobs.

Then I add this after that:

B = FILTER A BY col1 == 'xyz';
STORE B INTO 'dummy2' USING myJDBC(write to table2);

Basically, I filter A and write the result to another table through JDBC.

Then I had the problem of jobs failing and saying "PSQLException: This
statement has been closed".

My workaround is to add "EXEC;" before the B line, which makes the two writes
to the DB happen in sequence. This works, but now it runs the same map-reduce
jobs twice - I ended up with 8 jobs.
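
Putting it together, the script with the workaround looks roughly like this
(the load and transformations are elided as above, and myJDBC stands in for my
JDBC storage function):

raw = LOAD ...
-- transformations here, ending with A
A = the result of the operations above

STORE A INTO 'dummy' USING myJDBC(write to table1);

EXEC;  -- forces the first STORE to finish before the script continues

B = FILTER A BY col1 == 'xyz';
STORE B INTO 'dummy2' USING myJDBC(write to table2);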

I think the failure without the EXEC line happens because Pig tries to do the
two STOREs in the same reducer (or maybe mapper): B only involves a FILTER,
which doesn't require a separate map-reduce job, and the JDBC storer then gets
confused.

Is there a way for this to work without having to duplicate the jobs? Thanks
a lot!