Cannot flatten a bag with tuple wrapping an int successfully.
Stanley Xu 2012-09-05, 12:28
Dear all,
We are using Pig-0.8.1 with patch issue-1866, together with elephant-bird for Thrift/protobuf serialized objects. We are trying to use a Pig script to flatten a repeated int field in a protobuf structure, like the following:

    DEFINE PROTO_TO_TUPLE com.mediav.proto.pig.ProtobufBytesToTuple('some class');
    raw_data = load 'data' using com.twitter.elephantbird.pig.load.LzoThriftB64LinePigLoader('class');
    A = FILTER raw_data BY request.requestType == 'REQUEST';
    B = FOREACH A GENERATE PROTO_TO_TUPLE(request.data) as bid_request;
    C = FOREACH B GENERATE FLATTEN(bid_request.detected_content_label_bag) as labels, bid_request.google_user_id as gid;
    D = GROUP C BY labels;
    X = FOREACH D GENERATE group, COUNT(C.gid);
    DUMP X;

Running this script leads to the following error on the map side:

    ERROR 1066: Unable to open iterator for alias X. Backend error : java.lang.Integer cannot be cast to org.apache.pig.data.Tuple

If we DESCRIBE C, we see:

    C: {labels: (detected_content_label: int),gid: chararray}

If we STORE C and then reload it as follows, everything works fine:

    STORE C INTO 'temppath';
    new_data = load 'temppath' as (labels:chararray, gid:chararray);
    D = GROUP new_data BY labels;
    X = FOREACH D GENERATE group, COUNT(new_data.gid);
    DUMP X;

How can I avoid storing C and reloading it in Pig? Is there an important patch that I missed? The data is really huge, so I would like to avoid the extra store and reload.

Best wishes,
Stanley Xu