Pig, mail # user - STREAM in foreach block


STREAM in foreach block
Kannan Shah 2012-09-17, 19:55
I'm trying to group tuples by one key, sort them by another key within each
group, and then pass each group's sorted list of tuples to a Perl script. I
need the Perl script because the aggregate I want to compute depends on the
sort order, and I'm not much of a Java programmer, so I don't know how to
write a user-defined aggregate function.

Doing this requires me to use STREAM in a foreach block, after the GROUP
statement. Basically something like:

r2 = group r1 by key1;
r3 = foreach r2 {
   s1 = r1;
   s2 = order s1 by key2;
   -- the STREAM below is the statement in question
   s3 = stream s2 through myperlscript as (x, y, z);
   generate group, flatten(s3.x), flatten(s3.y), flatten(s3.z);
}
store r3 into 'r3.out' using PigStorage(';');

NOTE: The FLATTENs are there only for syntactic reasons; myperlscript will
only output one tuple for each group.
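
To make the intended semantics concrete, here is a minimal sketch in plain Python (not Pig, and not a workaround) of what the block above is meant to compute: group by key1, sort each group by key2, and emit one (x, y, z) tuple per group. The row layout (key1, key2, value) and the aggregate itself (first value, last value, largest increase between consecutive values) are hypothetical, chosen only because the result depends on the sort order.

```python
from itertools import groupby
from operator import itemgetter


def per_group_aggregate(rows):
    """Return one (key1, x, y, z) tuple per key1 group.

    rows: iterable of (key1, key2, value) tuples (hypothetical layout).
    x, y, z are an illustrative order-dependent aggregate, not the
    computation done by the original Perl script.
    """
    results = []
    # Sorting by (key1, key2) places each group's rows together,
    # already ordered by key2 within the group.
    ordered = sorted(rows, key=itemgetter(0, 1))
    for key1, group in groupby(ordered, key=itemgetter(0)):
        values = [value for _, _, value in group]
        x = values[0]   # value at the smallest key2
        y = values[-1]  # value at the largest key2
        # Largest increase between consecutive values; 0 for a single row.
        z = max((b - a for a, b in zip(values, values[1:])), default=0)
        results.append((key1, x, y, z))
    return results


if __name__ == "__main__":
    sample = [("a", 2, 10), ("a", 1, 4), ("b", 5, 7), ("a", 3, 11)]
    for row in per_group_aggregate(sample):
        print(row)
```

Shuffling the input rows leaves the result unchanged, because the sort by key2 happens inside the function; that sort is the step the nested ORDER is meant to perform in the Pig script.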

I'm getting errors that make me think that you can't use the STREAM operator
within a foreach block, but I'm not sure. Can someone confirm? Is there a
workaround for this sort of situation?

Any help appreciated,
Kannan