Pig >> mail # user >> newbie just not getting structure


Re: newbie just not getting structure
I don't know what the data file on disk looks like, since it is compressed or
encoded. It should be in whatever format PigStorage('|') uses to store a map.

The original relation was data that had been loaded into a map, and
could be accessed as a map, so something like:

original = load 'origFile' using customLoader('params') as (a:map[]);
at this point I can work with original as map correctly, accessing fields
using a#'fieldname';
then I did a filter down to one row:
small = filter original by a#'id' == 'rowofinterest';
then I stored it
store small into '/outputfilename' using PigStorage('|');

so whatever format PigStorage('|') put in the output file is what it is. I
haven't manipulated it, just copied down to a different machine.

lauren
On Wed, Aug 15, 2012 at 7:06 PM, Cheolsoo Park <[EMAIL PROTECTED]> wrote:

> Hi,
>
> What's the content of data/file like? Given your description, I guess that
> it looks as follows:
>
>
> [id#ID1,documentDate#1344461328851,source#93931,indexed#false,lastModifiedDate#1344461328851,contexts#{([id#CID1])}]
>
> But this is not in map literal format. If you change it to:
>
> [id#ID1]
> [documentDate#1344461328851]
> [source#93931]
> [indexed#false]
> [lastModifiedDate#1344461328851]
> [contexts#{([id#CID1])}]
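[Editor's note: that reshaping of the single-line map into one entry per line can be sketched outside Pig; a minimal Python sketch, assuming (as holds for this sample) that no commas appear inside nested {} or ():]

```python
# Turn "[k1#v1,k2#v2,...]" into one "[entry]" per line by closing and
# reopening the brackets at each comma. This assumes no commas are
# nested inside {} or (), which is true for the sample above.
line = "[id#ID1,documentDate#1344461328851,source#93931,indexed#false]"
print(line.replace(",", "]\n["))
```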
>
> then you can load it as map:
>
> >> a = load 'data/file' using PigStorage(',') as (m:map[]);
> >> dump a;
>
> ([id#ID1])
> ([documentDate#1344461328851])
> ([source#93931])
> ([indexed#false])
> ([lastModifiedDate#1344461328851])
> ([contexts#{([id#CID1])}])
>
> Furthermore, you can do:
>
> >> b = foreach a generate $0#'id';
> >> dump b;
>
> (ID1)
> ()
> ()
> ()
> ()
> ()
>
> This is what you expect, no?
>
> Thanks,
> Cheolsoo
>
>
> On Wed, Aug 15, 2012 at 4:44 AM, Lauren Blau <[EMAIL PROTECTED]> wrote:
>
> > I'm having problems with understanding storage structures. Here's what I
> > did:
> >
> > on the cluster I loaded some data and created a relation with one row.
> > I output the row using store relation into '/file' using PigStorage('|');
> > then I copied it my local workspace, copyToLocal /file ./file
> > then I tarred up the local file and scp'd it to my laptop.
> >
> > on my laptop I untarred the file into data/file
> > then I ran these pig commands:
> >
> > b = load 'data/file' using PigStorage('|') as (a:map[]); --because I'm expecting a map
> > dump b;
> >
> > return is successful but result is ().
> >
> > then I ran
> > c = foreach b generate *;
> > dump c;
> >
> > return is successful but result is ().
> >
> > then I tried
> >
> > d = load 'data/file' using PigStorage('|');
> > dump d;
> >
> > return
> > is
> >
> ([id#ID1,documentDate#1344461328851,source#93931,indexed#false,lastModifiedDate#1344461328851,contexts#{([id#CID1])}])
> >
> > since that is a map, I'm not sure why dump b didn't return values. So then
> > I tried
> > e = foreach d generate $0#'id';
> > dump e;
> >
> > and the return was ();
> >
> > Does anyone see where I'm missing the point? And how do I grab those map
> > values?
> >
> > Thanks
> >
>
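[Editor's note: for anyone who wants to inspect such a stored row outside Pig, here is a minimal, hypothetical parser for the bracketed map literal shown in the dumps above. It is not Pig's actual parser; the depth-based splitting rule is an assumption inferred from the sample data (top-level entries separated by commas at bracket depth 1, nested bags/tuples kept as raw strings):]

```python
def parse_pig_map(text):
    """Parse a top-level Pig map literal like "[k1#v1,k2#v2]" into a dict.

    Sketch only: splits on commas at bracket depth 1, so nested bags or
    tuples (e.g. contexts#{([id#CID1])}) survive as raw strings. This is
    an assumption based on the sample, not Pig's real deserializer.
    """
    assert text.startswith("[") and text.endswith("]")
    entries, depth, start = [], 0, 1
    for i, ch in enumerate(text):
        if ch in "[{(":
            depth += 1
        elif ch in "]})":
            depth -= 1
        elif ch == "," and depth == 1:  # top-level entry boundary
            entries.append(text[start:i])
            start = i + 1
    entries.append(text[start:-1])  # last entry, minus the closing "]"
    return dict(e.split("#", 1) for e in entries)

row = ("[id#ID1,documentDate#1344461328851,source#93931,indexed#false,"
       "lastModifiedDate#1344461328851,contexts#{([id#CID1])}]")
print(parse_pig_map(row)["id"])
```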