Pig >> mail # user >> Cannot flatten a bag with tuple wrapping an int successfully.

Cannot flatten a bag with tuple wrapping an int successfully.
Dear all,

We are using Pig-0.8.1 with patch issue-1866, and we also use elephant-bird
with thrift/protobuf serialized objects.

We are trying to use a Pig script to flatten a repeated int field of a
protobuf structure, as follows:

DEFINE PROTO_TO_TUPLE com.mediav.proto.pig.ProtobufBytesToTuple('some class');

raw_data = load 'data' using com.twitter.elephantbird.pig.load.LzoThriftB64LinePigLoader('class');
A = FILTER raw_data BY request.requestType == 'REQUEST';
B = FOREACH A GENERATE PROTO_TO_TUPLE(request.data) as bid_request;
C = FOREACH B GENERATE FLATTEN(bid_request.detected_content_label_bag) as labels, bid_request.google_user_id as gid;
D = GROUP C BY labels;
X = FOREACH D GENERATE group, COUNT(C.gid);
DUMP X;

Running this script fails on the map side with the following error:

ERROR 1066: Unable to open iterator for alias X. Backend error : java.lang.Integer cannot be cast to org.apache.pig.data.Tuple

If we DESCRIBE C, we see:
C: {labels: (detected_content_label: int),gid: chararray}

If we STORE C and then reload it like the following, everything works fine.

STORE C INTO 'temppath';
new_data = load 'temppath' as (labels:chararray, gid:chararray);
D = GROUP new_data BY labels;
X = FOREACH D GENERATE group, COUNT(new_data.gid);
DUMP X;

Is there a way to avoid storing C and reloading it in Pig? Is there an
important patch that I missed? We would like to avoid the extra step because
the data is really huge.

Best wishes,
Stanley Xu