Pig >> mail # user >> UDF for processing containing group in pig 0.12


UDF for processing containing group in pig 0.12
Hi all,

I am trying to process a tuple containing some chararray fields and a bag,
using a UDF to generate a single tuple.

The input tuple consists of 2 chararray fields followed by a bag. The bag
contains tuples whose 3rd field is an index and whose 4th field is a count.
These tuples may appear in any order.

The desired output is the same initial 2 chararrays, followed by the 4th
field of every tuple in the bag, sorted by index.

For example, given the input
(14,1369,{(14,1369,1,100),(14,1369,2,90),(14,1369,3,80),(14,1369,5,60),(14,1369,4,70),(14,1369,6,50),(14,1369,8,30),(14,1369,7,40),(14,1369,9,20)})
the output should be
(14,1369,100,90,80,70,60,50,40,30,20)
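
The transformation described above can be sketched in plain Java, independent of Pig. This is only an illustration under stated assumptions: rows are modeled as `Object[]` arrays of the form {y_id, x_id, index, count} rather than Pig tuples, and the names `SortByIndexSketch` and `flatten` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the UDF logic. Rows are plain Object[] arrays
// {y_id, x_id, index, count}, not Pig tuples.
class SortByIndexSketch {

    static List<Object> flatten(String yId, String xId, List<Object[]> bag) {
        // A TreeMap iterates its entries in ascending key order,
        // so the counts come out sorted by numeric index.
        Map<Integer, Object> countsByIndex = new TreeMap<>();
        for (Object[] row : bag) {
            // String.valueOf works whether the index arrives as a String or a number.
            int index = Integer.parseInt(String.valueOf(row[2]));
            countsByIndex.put(index, row[3]);
        }
        List<Object> out = new ArrayList<>();
        out.add(yId);
        out.add(xId);
        out.addAll(countsByIndex.values());
        return out;
    }

    public static void main(String[] args) {
        // Same data as the example input, with indexes 4/5 and 7/8 out of order.
        List<Object[]> bag = Arrays.asList(
            new Object[] {"14", "1369", "1", 100L},
            new Object[] {"14", "1369", "2", 90L},
            new Object[] {"14", "1369", "3", 80L},
            new Object[] {"14", "1369", "5", 60L},
            new Object[] {"14", "1369", "4", 70L},
            new Object[] {"14", "1369", "6", 50L},
            new Object[] {"14", "1369", "8", 30L},
            new Object[] {"14", "1369", "7", 40L},
            new Object[] {"14", "1369", "9", 20L});
        // prints [14, 1369, 100, 90, 80, 70, 60, 50, 40, 30, 20]
        System.out.println(flatten("14", "1369", bag));
    }
}
```

Keying the map by the parsed integer rather than by the string avoids lexicographic ordering problems (for example "10" sorting before "2") if more than 9 indexes ever appear.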

=======================================================

Here is what I tried.

y_x_group = GROUP A BY (y_id, x_id);

temp_counts = FOREACH y_x_group GENERATE FLATTEN(group) AS (y_id, x_id), A
AS rows;

counts = FOREACH temp_counts GENERATE myudfs.MergeCounts(y_id, x_id, rows);

======================================================

y_x_group has the following schema, followed by some sample rows:
{group: (y_id: chararray,x_id: chararray),A: {(y_id: chararray,x_id: chararray,level: chararray,count: long)}}
((14,1366),{(14,1366,9,3),(14,1366,1,3),(14,1366,2,3),(14,1366,3,3),(14,1366,4,3),(14,1366,5,3),(14,1366,6,3),(14,1366,7,3),(14,1366,8,3)})
((14,1368),{(14,1368,2,3),(14,1368,3,3),(14,1368,4,3),(14,1368,5,3),(14,1368,6,3),(14,1368,1,3),(14,1368,7,3),(14,1368,8,3),(14,1368,9,3)})
((14,1369),{(14,1369,1,1),(14,1369,2,1),(14,1369,3,1),(14,1369,4,1),(14,1369,5,1),(14,1369,6,1),(14,1369,7,1),(14,1369,8,1),(14,1369,9,1)})
((14,1376),{(14,1376,6,35),(14,1376,1,35),(14,1376,2,35),(14,1376,3,35),(14,1376,4,35),(14,1376,5,35),(14,1376,7,35),(14,1376,8,35),(14,1376,9,35)})

======================================================

I am trying to write the following UDF:

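package myudfs;

import java.io.IOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class MergeCounts extends EvalFunc<Tuple> {
    TupleFactory mTupleFactory = TupleFactory.getInstance();

    @Override
    public Tuple exec(Tuple input) throws IOException {
        // Expected input: (y_id, x_id, bag of (y_id, x_id, level, count)).
        String y_id = (String) input.get(0);
        String x_id = (String) input.get(1);
        DataBag bag = (DataBag) input.get(2);

        // Default every level 1..9 to a count of 0 in case it is missing from the bag.
        Map<String, Object> map = new HashMap<String, Object>();
        for (int i = 1; i <= 9; i++) {
            map.put(String.valueOf(i), 0L);
        }

        // Record each count (4th field) under its level (3rd field).
        Iterator<Tuple> itr = bag.iterator();
        while (itr.hasNext()) {
            Tuple t = itr.next();
            map.put((String) t.get(2), t.get(3));
        }

        // Emit y_id and x_id first, then the counts in level order.
        return mTupleFactory.newTuple(Arrays.asList(
                y_id, x_id,
                map.get("1"), map.get("2"), map.get("3"), map.get("4"),
                map.get("5"), map.get("6"), map.get("7"), map.get("8"),
                map.get("9")));
    }
}
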
I have tried several variants of the Map, but I get the following error when
running the Pig script:

org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception
while executing [POUserFunc (Name: POUserFunc(myudfs.MergeCounts)[tuple] -
scope-774 Operator Key: scope-774) children: null at []]:
java.lang.ClassCastException: java.lang.Integer cannot be cast to
java.lang.String

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open
iterator for alias counts
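
The exception message narrows things down: somewhere a value whose runtime type is `Integer` is being cast with `(String)`. The UDF contains three such casts (on `input.get(0)`, `input.get(1)` and `t.get(2)`), and the trace quoted here does not show which one fails. The plain-Java sketch below does not use the Pig API, and its class and method names are hypothetical. It reproduces the same kind of failure and shows one defensive alternative, assuming a field may arrive as either a `String` or a number.

```java
// Hypothetical illustration; not part of the Pig API.
class CastSketch {

    // Same style of cast as in the UDF: throws ClassCastException
    // if the value is not actually a String at runtime.
    static String strictCast(Object field) {
        return (String) field;
    }

    // Defensive alternative: converts any non-null value to its string form.
    static String lenient(Object field) {
        return field == null ? null : String.valueOf(field);
    }

    public static void main(String[] args) {
        // Assumption: a field declared as chararray holds an Integer at runtime.
        Object level = Integer.valueOf(3);

        System.out.println("lenient: " + lenient(level));
        try {
            strictCast(level);
        } catch (ClassCastException e) {
            System.out.println("strict cast failed: " + e.getMessage());
        }
    }
}
```

Checking the runtime class of each field before casting (for example by logging `field.getClass()`) would show which of the three casts is the one that fails.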

Please help.

Regards,
Ajay Dubey