UDF for processing a tuple containing a group in Pig 0.12
Hi all,

I am trying to use a UDF to process a tuple containing some chararray fields
and a bag, and generate a single flat tuple from it.

The input tuple consists of 2 chararray fields followed by a bag. The bag
contains tuples whose 3rd field is an index and whose 4th field is a count.
These tuples may appear in any order.

The desired output is the same initial 2 chararrays, followed by the 4th field
of every tuple, sorted by index.

For example, given the input
(14,1369,{(14,1369,1,100),(14,1369,2,90),(14,1369,3,80),(14,1369,5,60),(14,1369,4,70),(14,1369,6,50),(14,1369,8,30),(14,1369,7,40),(14,1369,9,20)})
the output should be
(14,1369,100,90,80,70,60,50,40,30,20)
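The transformation described above can be sketched in plain Java, with no Pig classes involved. This is only a model of the idea: ordinary lists stand in for Pig tuples and bags, and the class name MergeCountsSketch and the method merge are hypothetical names chosen for illustration. It assumes each index appears at most once per group and, unlike the UDF further down, it does not pad missing indices with 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of the desired transformation (hypothetical names).
// Lists stand in for Pig tuples and bags; this is not the Pig API.
class MergeCountsSketch {

    // Each row is (y_id, x_id, index, count); rows may arrive in any order.
    static List<Object> merge(Object yId, Object xId, List<List<Object>> rows) {
        // A TreeMap iterates its entries in ascending key (index) order.
        Map<Integer, Object> countByIndex = new TreeMap<>();
        for (List<Object> row : rows) {
            // String.valueOf accepts the index whether it is "3" or 3.
            int index = Integer.parseInt(String.valueOf(row.get(2)));
            countByIndex.put(index, row.get(3));
        }
        List<Object> out = new ArrayList<>();
        out.add(yId);
        out.add(xId);
        out.addAll(countByIndex.values());
        return out;
    }

    public static void main(String[] args) {
        // The example bag from the post, in the same jumbled order.
        List<List<Object>> rows = Arrays.asList(
            Arrays.<Object>asList("14", "1369", "1", 100L),
            Arrays.<Object>asList("14", "1369", "2", 90L),
            Arrays.<Object>asList("14", "1369", "3", 80L),
            Arrays.<Object>asList("14", "1369", "5", 60L),
            Arrays.<Object>asList("14", "1369", "4", 70L),
            Arrays.<Object>asList("14", "1369", "6", 50L),
            Arrays.<Object>asList("14", "1369", "8", 30L),
            Arrays.<Object>asList("14", "1369", "7", 40L),
            Arrays.<Object>asList("14", "1369", "9", 20L));
        System.out.println(merge("14", "1369", rows));
    }
}
```

Keying the map by the parsed integer rather than by the string matters once indices go past 9, since "10" sorts before "2" as a string.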

=======================================================

Here is what I tried.

y_x_group = GROUP A BY (y_id, x_id);

temp_counts = FOREACH y_x_group GENERATE FLATTEN(group) AS (y_id, x_id), A AS rows;

counts = FOREACH temp_counts GENERATE myudfs.MergeCounts(y_id, x_id, rows);

======================================================

y_x_group has the following schema, followed by some sample rows:
{group: (y_id: chararray,x_id: chararray),A: {(y_id: chararray,x_id: chararray,level: chararray,count: long)}}
((14,1366),{(14,1366,9,3),(14,1366,1,3),(14,1366,2,3),(14,1366,3,3),(14,1366,4,3),(14,1366,5,3),(14,1366,6,3),(14,1366,7,3),(14,1366,8,3)})
((14,1368),{(14,1368,2,3),(14,1368,3,3),(14,1368,4,3),(14,1368,5,3),(14,1368,6,3),(14,1368,1,3),(14,1368,7,3),(14,1368,8,3),(14,1368,9,3)})
((14,1369),{(14,1369,1,1),(14,1369,2,1),(14,1369,3,1),(14,1369,4,1),(14,1369,5,1),(14,1369,6,1),(14,1369,7,1),(14,1369,8,1),(14,1369,9,1)})
((14,1376),{(14,1376,6,35),(14,1376,1,35),(14,1376,2,35),(14,1376,3,35),(14,1376,4,35),(14,1376,5,35),(14,1376,7,35),(14,1376,8,35),(14,1376,9,35)})

======================================================

I am trying to write the following UDF:

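package myudfs;

import java.io.IOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class MergeCounts extends EvalFunc<Tuple> {
    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();

    @Override
    public Tuple exec(Tuple input) throws IOException {
        // Fields 0 and 1 are the group keys; field 2 is the bag of rows.
        Object o = input.get(0);
        String y_id = (String) o;
        o = input.get(1);
        String x_id = (String) o;
        o = input.get(2);
        DataBag bag = (DataBag) o;
        Iterator<Tuple> itr = bag.iterator();
        // Default every level from 1 to 9 to a count of 0.
        HashMap<String, Object> map = new HashMap<String, Object>();
        for (int i = 1; i <= 9; i++) {
            map.put(i + "", (Object) 0);
        }
        // Overwrite the default with the count found for each level.
        while (itr.hasNext()) {
            Tuple t = (Tuple) itr.next();
            map.put((String) t.get(2), t.get(3));
        }
        // Emit the keys in input order, then the counts for levels 1..9.
        return mTupleFactory.newTuple(Arrays.asList(y_id, x_id,
                map.get("1"), map.get("2"), map.get("3"), map.get("4"),
                map.get("5"), map.get("6"), map.get("7"), map.get("8"),
                map.get("9")));
    }
}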
I have tried many versions of the Map, but I get the following error when
running the Pig script.

org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception
while executing [POUserFunc (Name: POUserFunc(myudfs.MergeCounts)[tuple] -
scope-774 Operator Key: scope-774) children: null at []]:
java.lang.ClassCastException: java.lang.Integer cannot be cast to
java.lang.String

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open
iterator for alias counts
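A ClassCastException of this kind is what Java raises when an Object reference that actually holds an Integer is cast with (String). One possible cause, which is an assumption and not something confirmed in this thread, is that one of the fields arrives at runtime as an Integer even though the schema declares it as chararray. The plain-Java sketch below uses no Pig classes and only hypothetical helper names; it contrasts the direct cast with a conversion that tolerates either type.

```java
// Plain-Java illustration with hypothetical names; no Pig classes are used.
class FieldConversionSketch {

    // Direct cast: throws ClassCastException when field is not a String.
    static String castToString(Object field) {
        return (String) field;
    }

    // Type-tolerant conversion: works for String, Integer, Long and so on.
    // A null field stays null instead of becoming the text "null".
    static String asString(Object field) {
        return field == null ? null : String.valueOf(field);
    }

    // Reports whether the direct cast would fail for the given value.
    static boolean castFails(Object field) {
        try {
            castToString(field);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }
}
```

If this is indeed the cause, replacing each (String) cast in the UDF with a conversion along these lines would avoid the exception, although finding out why the runtime type differs from the declared schema is probably the better long-term fix.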

Please help.

Regards,
Ajay Dubey