Pig >> mail # user >> dereferencing bag of map


+
Jerry Lam 2013-06-18, 00:18
+
Pradeep Gollakota 2013-06-18, 00:22
+
Suresh Saggar 2013-06-21, 09:43
+
Shahab Yunus 2013-06-21, 12:35
+
Pradeep Gollakota 2013-06-21, 12:42
+
Suresh Saggar 2013-06-24, 09:25
Re: dereferencing bag of map
Use REGEX_EXTRACT_ALL.
Something like this should work (untested, please verify):

-- key names and order follow the rows shown by `dump X` (sId first, then cId)
rel2 = foreach rel1 generate
FLATTEN(REGEX_EXTRACT_ALL(attributes#'md','\\{"sId":"(\\w+)","cId":"(\\w+)"\\}'))
AS (sId: chararray, cId: chararray);

Tighten up the regex appropriately.
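
One way to tighten the regex before running it on the cluster is to try it
against a few dumped rows locally. Below is a minimal Python sketch of that
check. It assumes the keys appear as sId then cId with double quotes (as in the
`dump X` output quoted further down), and it assumes REGEX_EXTRACT_ALL only
yields groups when the whole string matches, which re.fullmatch mirrors here.
extract_ids is a hypothetical helper name for illustration, not part of Pig.

```python
import re

# The pattern as the regex engine sees it; in the Pig string literal each
# backslash is doubled. Key order and the \w+ value class are assumptions
# taken from the dumped rows, not from any schema.
PATTERN = re.compile(r'\{"sId":"(\w+)","cId":"(\w+)"\}')


def extract_ids(metadata):
    """Return (sId, cId) if the whole string matches the pattern, else None."""
    m = PATTERN.fullmatch(metadata)
    return m.groups() if m else None


# Sample rows copied from the `dump X` output quoted in this thread.
rows = [
    '{"sId":"003_w","cId":"k"}',
    '{"sId":"001_rf","cId":"r"}',
    '{"sId":"004_rf","cId":"r"}',
]

for row in rows:
    print(row, "->", extract_ids(row))
```

If a row comes back as None (for example because the keys are in the other
order, or the values contain characters outside \w), the pattern needs to be
loosened accordingly before it goes into the Pig script.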
On 24 June 2013 14:55, Suresh Saggar <[EMAIL PROTECTED]> wrote:

> *Thanks a lot* for your reply, but the problem still exists. To clarify
> further, the exact sequence of Pig statements is shown below:
>
> -- our custom jar containing the Loader() code
> REGISTER 'hdfs://hadoop-prod-master.vpc:8020/user/hdfs/libs/prod.jar';
> records_log = LOAD
> 'hdfs://hadoop-prod-master.vpc:8020/data/{prod}/{2013-06-20-11}/*' USING
> com.example.Loader() AS (date:chararray, type:chararray, attributes:[]);
> http = FILTER records_log BY type == 'm' AND attributes#'st' == 'http';
> X = FOREACH http GENERATE attributes#'md' AS metadata;
> Y = FOREACH X GENERATE FLATTEN(metadata);
>
> grunt> describe Y
> Y: {metadata: bytearray}
> grunt> describe X
> X: {metadata: bytearray}
>
> When I dump X or Y, both produce the same output. I also tried FLATTEN
> directly on records_log, but that did not help either, i.e.:
> Z = FOREACH records_log GENERATE FLATTEN(attributes);
>
> Similarly, JsonStorage() can't be used directly, as my raw data (the one
> stored in HDFS) is not JSON but a custom format, as shown below:
> 2013-06-20-11|m|{'st':'http','md':{'cId':'a','sId':'b'}}
>
> Here our Loader() takes the above raw data as input and returns output in
> the format (date:chararray, type:chararray, attributes:[]). Since
> attributes#'md' is JSON here, I'm having problems getting the 'cId' and
> 'sId' values. Hope this clarifies the context. I assume the FLATTEN
> operator can't un-nest attributes#'md' because it is represented as
> {'cId':'a','sId':'b'} rather than as ['cId'#'a','sId'#'b'] (a map in Pig) or
> {('cId'#'a'),('sId'#'b')} (a bag in Pig).
>
> TIA
> Ss
>
> On Fri, Jun 21, 2013 at 6:12 PM, Pradeep Gollakota <[EMAIL PROTECTED]> wrote:
>
> > Suresh,
> >
> > Look into using JsonStorage(). This seems to be what you're looking for.
> > http://pig.apache.org/docs/r0.10.0/func.html#jsonloadstore
> >
> >
> > On Fri, Jun 21, 2013 at 8:35 AM, Shahab Yunus <[EMAIL PROTECTED]> wrote:
> >
> > > Have you tried flattening the bag first?
> > >
> > >
> > > On Fri, Jun 21, 2013 at 5:43 AM, Suresh Saggar <[EMAIL PROTECTED]> wrote:
> > >
> > > > Facing a similar challenge. Here X contains one column named 'metadata'
> > > > of type bytearray, but the actual content is JSON, i.e. the value of the
> > > > metadata field is a JSON object (with keys sId & cId), as shown below:
> > > >
> > > > grunt> describe X
> > > > X: {metadata: bytearray}
> > > >
> > > > grunt> dump X
> > > > ({"sId":"003_w","cId":"k"})
> > > > ({"sId":"001_rf","cId":"r"})
> > > > ({"sId":"001_rf","cId":"r"})
> > > > ({"sId":"004_rf","cId":"r"})
> > > >
> > > > Any idea how I can generate cId & sId as separate chararray columns? TIA
> > > >
> > > > Ss
> > > >
> > > > On Tue, Jun 18, 2013 at 5:52 AM, Pradeep Gollakota <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > What's the error you are seeing? What does your bag of maps look like?
> > > > > What exactly is a userId? Is it a field or is it a key in the map?
> > > > >
> > > > >
> > > > > > On Mon, Jun 17, 2013 at 8:18 PM, Jerry Lam <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > Hi Pig users,
> > > > > >
> > > > > > Does anyone have experience dereferencing a bag of maps? For instance
> > > > > > (in the example below), doc in B contains maps of userId and time. I
> > > > > > want to keep only userId in C. Pig throws an exception on C. Any help
> > > > > > is appreciated.
> > > > > >
> > > > > > A = LOAD 'data' AS doc:bytearray;
> > > > > >
> > > > > > B = FOREACH A GENERATE (bag{})doc;
> > > > > >
> > > > > > -- C = FOREACH B GENERATE doc.userId; // this doesn't work.
> > > > > >
> > > > > > Best Regards,
> > > > > >
> > > >
+
Suresh Saggar 2013-07-03, 08:58