Pig >> mail # user >> FOREACH GENERATE Conditional?


Re: FOREACH GENERATE Conditional?
Are you sure Pig is spawning extra map jobs for this?  The multi-query optimizer should be pushing these back together into one job.
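One way to check this is to ask Pig for the plan rather than guess. The sketch below is a trimmed version of the script from the question with an `EXPLAIN` at the end. The `LOAD` path and the schema are assumptions made up for illustration, since the original script elides them.

```pig
-- Hypothetical path and schema; substitute the real LOAD statement.
main_set = LOAD 'events.tsv' AS (date:chararray, blah:chararray, meh:chararray);

likes = FILTER main_set BY blah == 'a' AND meh == 'b';
likes_time = FOREACH likes GENERATE date, 'likes' AS type;

dislikes = FILTER main_set BY blah == 'b' AND meh == 'c';
dislikes_time = FOREACH dislikes GENERATE date, 'dislikes' AS type;

all_time = UNION likes_time, dislikes_time;

-- Prints the logical, physical, and MapReduce plans for all_time
-- without running the job; the MapReduce plan shows how many jobs result.
EXPLAIN all_time;
```

If the MapReduce plan lists a single job reading `main_set` once, the filters have already been merged and no rewrite is needed.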

If it isn't, you should be able to accomplish the same thing with nested ternary (bincond) expressions and a single filter:

labeled = foreach main_set generate date, ((blah == 'a' and meh == 'b') ? 'likes' : ((blah == 'b' and meh == 'c') ? 'dislikes' : ((blah == 'c' and meh == 'd') ? 'newuserregs' : ''))) as type;
all_time = filter labeled by type != '';

(Not sure about all the parentheses placement, as I didn't run it.)
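
The same single-pass idea may be easier to read as a `CASE` expression, which avoids the nested parentheses. This is a sketch only: it assumes Pig 0.12 or later (where `CASE` is available), and the `LOAD` path and schema are hypothetical stand-ins for the elided part of the original script.

```pig
-- Hypothetical path and schema; substitute the real LOAD statement.
main_set = LOAD 'events.tsv' AS (date:chararray, blah:chararray, meh:chararray);

-- One pass over main_set: each row gets at most one label.
labeled = FOREACH main_set GENERATE date,
    (CASE
        WHEN blah == 'a' AND meh == 'b' THEN 'likes'
        WHEN blah == 'b' AND meh == 'c' THEN 'dislikes'
        WHEN blah == 'c' AND meh == 'd' THEN 'newuserregs'
        ELSE ''
     END) AS type;

-- Drop rows that matched none of the conditions.
all_time = FILTER labeled BY type != '';
```

Because the three conditions are mutually exclusive, this should produce the same rows as the FILTER/UNION version. If the conditions could overlap, the UNION version would emit a row per match, while this one keeps only the first matching label.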

Alan.

On Oct 24, 2012, at 2:51 AM, Eli Finkelshteyn wrote:

> Hi folks,
> I have a Pig script that right now looks like this:
>
> …
> likes = FILTER main_set BY blah == 'a' AND meh == 'b';
> likes_time = FOREACH likes GENERATE date, 'likes' AS type;
>
> dislikes = FILTER main_set BY blah == 'b' AND meh == 'c';
> dislikes_time = FOREACH dislikes GENERATE date, 'dislikes' AS type;
>
> newuserregs = FILTER main_set BY blah == 'c' AND meh == 'd';
> newuserregs_time = FOREACH newuserregs GENERATE date, 'newuserregs' AS type;
> ...
>
> all_time = UNION likes_time, dislikes_time, newuserregs_time;
> …
>
> As you can see, I'm filtering main_set repeatedly, generating from each filtered set, and then unioning everything back together. This means a lot of extra map jobs, which is a lot of extra work. Thinking about it in terms of mapping, I should be able to do all of this in one pass. Any idea what the Pig syntax would be for that? Is there something like a conditional GENERATE, where I could do something like:
>
> all_time = FOREACH main_set GENERATE date, 'likes' IF (blah == 'a' AND meh == 'b')
>  'dislikes' IF (blah == 'b' AND meh == 'c')
>  'newuserregs' IF (blah == 'c' AND meh == 'd') AS type;
>
> Running this in just one map job would be very awesome and would speed this script up a ton, I'm thinking. Ideas? Advice?
>
> Eli