Cannot flatten a bag with tuple wrapping an int successfully.
Dear all,

We are using Pig 0.8.1 with the patch from issue 1866, and we also use
elephant-bird with Thrift/protobuf serialized objects.

We are trying to use a Pig script to flatten a repeated int field in a
protobuf structure, as follows:

DEFINE PROTO_TO_TUPLE com.mediav.proto.pig.ProtobufBytesToTuple('some class');

raw_data = load 'data' using
    com.twitter.elephantbird.pig.load.LzoThriftB64LinePigLoader('class');
A = FILTER raw_data BY request.requestType == 'REQUEST';
B = FOREACH A GENERATE PROTO_TO_TUPLE(request.data) as bid_request;
C = FOREACH B GENERATE
    FLATTEN(bid_request.detected_content_label_bag) as labels,
    bid_request.google_user_id as gid;
D = GROUP C BY labels;
X = FOREACH D GENERATE group, COUNT(C.gid);
DUMP X;

Running this script fails on the map side with the following error:

ERROR 1066: Unable to open iterator for alias X. Backend error :
java.lang.Integer cannot be cast to org.apache.pig.data.Tuple

If we DESCRIBE C, we see:
C: {labels: (detected_content_label: int),gid: chararray}
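
One possible reading of this output together with the error, offered as an assumption rather than a confirmed diagnosis: the schema declares labels as a tuple wrapping an int, while the value that actually arrives at run time may be a bare Integer. The plain-Java sketch below uses no Pig classes at all; FakeTuple, readAsTuple and failsAsTuple are hypothetical stand-ins that only illustrate how such a mismatch between declared and actual type produces this kind of ClassCastException.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Pig tuple; this is NOT org.apache.pig.data.Tuple.
final class FakeTuple {
    final List<Object> fields = new ArrayList<>();

    FakeTuple(Object... values) {
        for (Object v : values) {
            fields.add(v);
        }
    }
}

final class SchemaMismatchSketch {
    // Mimics a consumer that trusts a declared schema saying "this value is a tuple".
    static FakeTuple readAsTuple(Object value) {
        return (FakeTuple) value; // throws ClassCastException for a bare Integer
    }

    // Returns true when treating the value as a tuple fails with a ClassCastException.
    static boolean failsAsTuple(Object value) {
        try {
            readAsTuple(value);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Object bareInt = Integer.valueOf(42); // assumed run-time value after FLATTEN
        Object wrapped = new FakeTuple(42);   // what the declared schema claims
        System.out.println("bare Integer fails: " + failsAsTuple(bareInt));
        System.out.println("wrapped tuple fails: " + failsAsTuple(wrapped));
    }
}
```

If this reading is right, it would also be consistent with the observation that reloading the data with an explicit flat schema avoids the error, since the declared and actual types then agree.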

If we STORE C and then reload it as follows, everything works fine:

STORE C INTO 'temppath';
new_data = load 'temppath' as (labels:chararray, gid:chararray);
D = GROUP new_data BY labels;
X = FOREACH D GENERATE group, COUNT(new_data.gid);
DUMP X;

How can I avoid storing C and reloading it in Pig? Is there an important
patch that I missed? The data set is really huge.

Best wishes,
Stanley Xu