Pig, mail # user - newbie just not getting structure


Re: newbie just not getting structure
Lauren Blau 2012-08-16, 09:48
I don't know what the data file on disk looks like, because it is compressed or
encoded. It should be in whatever format PigStorage('|') uses to store a map.

The original relation was data that had been loaded into a map and
could be accessed as a map, so something like:

original = load 'origFile' using customeLoader('params') as (a:map[]);
At this point I can work with original as a map correctly, accessing fields
using a#'fieldname'.
Then I filtered it down to one row:
small = filter original by a#'id' == 'rowofinterest';
Then I stored it:
store small into '/outputfilename' using PigStorage('|');

So whatever format PigStorage('|') put in the output file is what it is. I
haven't manipulated it, just copied it down to a different machine.
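
One way to see what the stored file actually contains, sketched from the Grunt shell. This assumes the store wrote part files under /outputfilename (the part-* pattern is the usual Hadoop output naming, not something confirmed in this thread) and that any compression uses a codec Hadoop recognizes:

```
-- List what the store actually wrote (assumed to be a directory of part files).
fs -ls /outputfilename
-- fs -text decodes output compressed with a codec Hadoop knows and prints it as text.
fs -text /outputfilename/part-*
```

If the printed line matches what dump shows on the laptop, the copy itself did not change the data.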

lauren
On Wed, Aug 15, 2012 at 7:06 PM, Cheolsoo Park <[EMAIL PROTECTED]> wrote:

> Hi,
>
> What's the content of data/file like? Given your description, I guess that
> it looks as follows:
>
>
> [id#ID1,documentDate#1344461328851,source#93931,indexed#false,lastModifiedDate#1344461328851,contexts#{([id#CID1])}]
>
> But this is not map literal format. If you change it to:
>
> [id#ID1]
> [documentDate#1344461328851]
> [source#93931]
> [indexed#false]
> [lastModifiedDate#1344461328851]
> [contexts#{([id#CID1])}]
>
> then you can load it as map:
>
> >> a = load 'data/file'  using PigStorage(',') as (m:map[]);
> >> dump a;
>
> ([id#ID1])
> ([documentDate#1344461328851])
> ([source#93931])
> ([indexed#false])
> ([lastModifiedDate#1344461328851])
> ([contexts#{([id#CID1])}])
>
> Furthermore, you can do:
>
> >> b = foreach a generate $0#'id';
> >> dump b;
>
> (ID1)
> ()
> ()
> ()
> ()
> ()
>
> This is what you expect, no?
>
> Thanks,
> Cheolsoo
>
>
> On Wed, Aug 15, 2012 at 4:44 AM, Lauren Blau <[EMAIL PROTECTED]> wrote:
>
> > I'm having problems with understanding storage structures. Here's what I
> > did:
> >
> > on the cluster I loaded some data and created a relation with one row.
> > I output the row using store relation into '/file' using PigStorage('|');
> > then I copied it to my local workspace: copyToLocal /file ./file
> > then I tarred up the local file and scp'd it to my laptop.
> >
> > on my laptop I untarred the file into data/file
> > then I ran these pig commands:
> >
> > b = load 'data/file' using PigStorage('|') as (a:map[]); -- because I'm expecting a map
> > dump b;
> >
> > return is successful but result is ().
> >
> > then I ran
> > c = foreach b generate *;
> > dump c;
> >
> > return is successful but result is ().
> >
> > then I tried
> >
> > d = load 'data/file' using PigStorage('|');
> > dump d;
> >
> > return is
> >
> > ([id#ID1,documentDate#1344461328851,source#93931,indexed#false,lastModifiedDate#1344461328851,contexts#{([id#CID1])}])
> >
> > since that is a map, I'm not sure why dump b didn't return values. So then
> > I tried
> > e = foreach d generate $0#'id';
> > dump e;
> >
> > and the return was ();
> >
> > Does anyone see where I'm missing the point? And how do I grab those map
> > values?
> >
> > Thanks
> >
>
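
A minimal sketch for checking the laptop copy described in the quoted message, assuming the same 'data/file' path. TextLoader is Pig's built-in loader that reads each line as a single chararray, so it shows the raw text without any delimiter splitting or map conversion, and describe shows the schema Pig has attached to a relation:

```
-- Read each line as one chararray field; no delimiter handling or casting is applied.
raw = load 'data/file' using TextLoader() as (line:chararray);
dump raw;

-- Print the declared schema of the map-typed load from the quoted message.
b = load 'data/file' using PigStorage('|') as (a:map[]);
describe b;
```

Comparing the raw line with the result of dump b may help tell apart a problem in the file from a problem in converting the text to a map.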