Pig optimization getting in the way?
I ran into a problem that I have spent quite some time on, and I'm starting to think it's Pig doing some optimization that makes this hard.

This is my pseudocode:

raw = LOAD ...

-- then some crazy stuff: FILTER, JOIN, GROUP, a UDF, etc.

A = ...  -- the result of the operations above
STORE A INTO 'dummy' USING myJDBC(write to table1);

This works fine and runs as 4 map-reduce jobs.

Then I add this after that:

B = FILTER A BY col1 == 'xyz';
STORE B INTO 'dummy2' USING myJDBC(write to table2);

Basically, I filter A and write the result to another table through JDBC.

Then jobs started failing with "PSQLException: This
statement has been closed".

My workaround now is to add "EXEC;" before the B line, so the two writes to the
DB happen in sequence. This works, but it reruns the same map-reduce jobs: I
ended up with 8 jobs instead of 4.
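
For reference, here is a sketch of the script with the workaround (same placeholder table/column names as above):

A = ...  -- the same pipeline as above
STORE A INTO 'dummy' USING myJDBC(write to table1);

EXEC;  -- force everything above to run before the B part

B = FILTER A BY col1 == 'xyz';
STORE B INTO 'dummy2' USING myJDBC(write to table2);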

I think the reason for the failure without the EXEC line is that Pig (what I
assume is its multi-query optimization) tries to do the two STOREs in the same
reducer (or mapper, maybe), since B only involves a FILTER, which doesn't
require a separate map-reduce job, and then it gets confused.
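
For what it's worth, my understanding is that this multi-query behavior can also be switched off for a whole run from the command line (assuming Pig 0.3 or later, where the feature was introduced; "myscript.pig" is just a placeholder):

pig -no_multiquery myscript.pig

(or the short form "pig -M myscript.pig"). Though I'd guess that splits the plan the same way EXEC does, so it would still duplicate the jobs.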

Is there a way for this to work without having to duplicate the jobs? Thanks
a lot!