Pig >> mail # user >> dereferencing bag of map


Jerry Lam 2013-06-18, 00:18
Pradeep Gollakota 2013-06-18, 00:22
Suresh Saggar 2013-06-21, 09:43
Shahab Yunus 2013-06-21, 12:35
Pradeep Gollakota 2013-06-21, 12:42

Re: dereferencing bag of map
*Thanks a lot* for your reply, but the problem still exists. To clarify
further, the exact sequence of Pig statements is shown below:

-- Our custom jar containing the Loader() code.
REGISTER 'hdfs://hadoop-prod-master.vpc:8020/user/hdfs/libs/prod.jar';
records_log = LOAD 'hdfs://hadoop-prod-master.vpc:8020/data/{prod}/{2013-06-20-11}/*'
    USING com.example.Loader() AS (date:chararray, type:chararray, attributes:[]);
http = FILTER records_log BY type == 'm' AND attributes#'st' == 'http';
X = FOREACH http GENERATE attributes#'md' AS metadata;
Y = FOREACH X GENERATE FLATTEN(metadata);

grunt> describe Y
Y: {metadata: bytearray}
grunt> describe X
X: {metadata: bytearray}

When I dump either X or Y, both give the same output. I also tried FLATTEN
directly on records_log, but that did not help either, i.e.:
Z = FOREACH records_log GENERATE FLATTEN(attributes);

Similarly, JsonStorage() can't be used directly because my raw data (the data
stored in HDFS) is not JSON but a custom format, as shown below:
2013-06-20-11|m|{'st':'http','md':{'cId':'a','sId':'b'}}

Here our Loader() takes the raw data above as input and returns output in
the format (date:chararray, type:chararray, attributes:[]). Since
attributes#'md' is a JSON string here, I'm having trouble getting the 'cId'
and 'sId' values. I hope this clarifies the context. I assume the FLATTEN
operator cannot un-nest attributes#'md' because it is represented as
{'cId':'a','sId':'b'} rather than as ['cId'#'a','sId'#'b'] (a map in Pig) or
{('cId'#'a'),('sId'#'b')} (a bag in Pig).
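
For illustration only, here is a minimal Python sketch of the parsing step
that FLATTEN does not do: turning one metadata value (a JSON string, as in
the dump of X) into separate cId and sId values. The function name
extract_ids is hypothetical, and hooking it into Pig (for example as a UDF)
is an assumption that is not shown here.

```python
import json


def extract_ids(metadata):
    """Return (cId, sId) from one metadata value.

    Hypothetical helper, not part of Pig. `metadata` is assumed to be a
    JSON object string such as '{"sId":"003_w","cId":"k"}'. Anything that
    cannot be parsed as a JSON object yields (None, None).
    """
    if metadata is None:
        return (None, None)
    try:
        if isinstance(metadata, bytes):
            # Accept raw bytes as well as text; decode before parsing.
            metadata = metadata.decode("utf-8")
        obj = json.loads(metadata)
    except ValueError:
        # Covers malformed JSON and undecodable bytes.
        return (None, None)
    if not isinstance(obj, dict):
        return (None, None)
    # Missing keys come back as None instead of raising.
    return (obj.get("cId"), obj.get("sId"))


# Sample values copied from the dump of X quoted in this thread.
rows = ['{"sId":"003_w","cId":"k"}', '{"sId":"001_rf","cId":"r"}']
for row in rows:
    print(extract_ids(row))
```

The sketch returns None values for rows it cannot parse instead of raising,
so that a single malformed record would not stop the whole job.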

TIA
Ss

On Fri, Jun 21, 2013 at 6:12 PM, Pradeep Gollakota <[EMAIL PROTECTED]> wrote:

> Suresh,
>
> Look into using JsonStorage(). This seems to be what you're looking for.
> http://pig.apache.org/docs/r0.10.0/func.html#jsonloadstore
>
>
> On Fri, Jun 21, 2013 at 8:35 AM, Shahab Yunus <[EMAIL PROTECTED]> wrote:
>
> > Have you tried flattening the bag first?
> >
> >
> > On Fri, Jun 21, 2013 at 5:43 AM, Suresh Saggar <[EMAIL PROTECTED]> wrote:
> >
> > > Facing a similar challenge. Here X contains one column named 'metadata'
> > > of type bytearray, but the actual content is JSON, i.e. the value of the
> > > metadata field is a JSON object (keys sId and cId), as shown below:
> > >
> > > grunt> describe X
> > > X: {metadata: bytearray}
> > >
> > > grunt> dump X
> > > ({"sId":"003_w","cId":"k"})
> > > ({"sId":"001_rf","cId":"r"})
> > > ({"sId":"001_rf","cId":"r"})
> > > ({"sId":"004_rf","cId":"r"})
> > >
> > > Any idea how I can generate cId & sId as separate chararray columns? TIA
> > >
> > > Ss
> > >
> > > On Tue, Jun 18, 2013 at 5:52 AM, Pradeep Gollakota <[EMAIL PROTECTED]> wrote:
> > >
> > > > What's the error you are seeing? What does your bag of maps look like?
> > > > What exactly is a userId? Is it a field or is it a key in the map?
> > > >
> > > >
> > > > On Mon, Jun 17, 2013 at 8:18 PM, Jerry Lam <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Hi Pig users,
> > > > >
> > > > > Does anyone have experience dereferencing a bag of maps? For instance
> > > > > (in the example below), doc in B contains maps of userId and time. I
> > > > > want to keep only userId in C. Pig throws an exception on C. Any help
> > > > > is appreciated.
> > > > >
> > > > > A = LOAD 'data' AS doc:bytearray;
> > > > >
> > > > > B = FOREACH A GENERATE (bag{})doc;
> > > > >
> > > > > -- C = FOREACH B GENERATE doc.userId; // this doesn't work.
> > > > >
> > > > > Best Regards,
> > > > >
> > > > > Jerry
> > > > >
> > > >
> > >
> >
>
Abhinav Neelam 2013-06-25, 13:05
Suresh Saggar 2013-07-03, 08:58