-
newbie just not getting structure
Lauren Blau 2012-08-15, 11:44
I'm having problems with understanding storage structures. Here's what I did:
On the cluster I loaded some data and created a relation with one row. I output the row using
store relation into '/file' using PigStorage('|');
then I copied it to my local workspace,
copyToLocal /file ./file
then I tarred up the local file and scp'd it to my laptop.
On my laptop I untarred the file into data/file, then I ran these pig commands:
b = load 'data/file' using PigStorage('|') as (a:map[]); --because I'm expecting a map
dump b;
return is successful but result is ().
then I ran
c = foreach b generate *;
dump c;
return is successful but result is ().
then I tried
d = load 'data/file' using PigStorage('|');
dump d;
return is ([id#ID1,documentDate#1344461328851,source#93931,indexed#false,lastModifiedDate#1344461328851,contexts#{([id#CID1])}])
Since that is a map, I'm not sure why dump b didn't return values. So then I tried
e = foreach d generate $0#'id';
dump e;
and the return was ();
Does anyone see where I'm missing the point? And how do I grab those map values?
Thanks
-
Re: newbie just not getting structure
Cheolsoo Park 2012-08-15, 23:06
Hi,
What's the content of data/file like? Given your description, I guess that it looks as follows:
[id#ID1,documentDate#1344461328851,source#93931,indexed#false,lastModifiedDate#1344461328851,contexts#{([id#CID1])}]
But this is not map literal format. If you change it to:
[id#ID1]
[documentDate#1344461328851]
[source#93931]
[indexed#false]
[lastModifiedDate#1344461328851]
[contexts#{([id#CID1])}]
then you can load it as map:
>> a = load 'data/file' using PigStorage(',') as (m:map[]);
>> dump a;
([id#ID1])
([documentDate#1344461328851])
([source#93931])
([indexed#false])
([lastModifiedDate#1344461328851])
([contexts#{([id#CID1])}])
Furthermore, you can do:
>> b = foreach a generate $0#'id';
>> dump b;
(ID1)
()
()
()
()
()
This is what you expect, no?
Thanks,
Cheolsoo
-
Re: newbie just not getting structure
Lauren Blau 2012-08-16, 09:48
I don't know what the data file on disk looks like, as it is compressed or encoded. It should be in whatever format PigStorage('|') would store a map.
The original relation was data that had been loaded into a map and could be accessed as a map, so something like:
original = load 'origFile' using customeLoader('params') as (a:map[]);
At this point I can work with original as a map correctly, accessing fields using a#'fieldname'. Then I did a filter down to one row:
small = filter original by a#'id' == 'rowofinterest';
Then I stored it:
store small into '/outputfilename' using PigStorage('|');
so whatever format PigStorage('|') put in the output file is what it is. I haven't manipulated it, just copied down to a different machine.
lauren
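One way to see what PigStorage('|') actually wrote is sketched below. It assumes the commands are typed in the Grunt shell and that data/file is the copied store output (a file, or a directory of part files). PigStorage normally writes plain text unless output compression was configured, so the raw lines should be readable. The aliases d and m are only illustrative.

```pig
-- Print the raw stored text; for a directory, Grunt's cat prints the files inside it.
cat data/file

-- Load without a schema: every field arrives as an untyped bytearray.
d = load 'data/file' using PigStorage('|');
describe d;

-- Ask for the map conversion explicitly. If Pig cannot parse the text as a
-- map, the field becomes null and dump prints ().
m = foreach d generate (map[])$0 as a;
describe m;
dump m;
```

Comparing the output of cat with the result of dump m should show whether the stored text is something Pig's map conversion accepts.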
-
Re: newbie just not getting structure
Lauren Blau 2012-08-21, 12:14
Still not getting it. A similar problem is occurring. I have a file which I believe contains structures like
("string1","string2",{[]})
and if I load it as
(messageId:chararray, documentName:chararray, annot:map[])
I can dump it, and I can define:
foo = foreach row generate messageId as messageId:chararray,documentName as documentName:chararray,annot#'prefix' as apre:chararray, annot#'label' as alabel:chararray ..)
and can dump foo and see my results as expected.
If I try
x = filter foo by apre == 'VALUE';
I get 0 rows back and I see a warning about FIELD_DISCARDED_CONVERSION_FAILED.
But if I store foo into a file using
store foo into '/filefoo';
and then define
foo2 = load '/filefoo' as (messageId:chararray,documentName:chararray,apre:chararray,alabel:chararray ..)
then
y = filter foo2 by apre == 'VALUE'
I get back the rows I expect.
Would someone please explain what the difference between the two is? Why should storing and re-reading the data make a difference? What am I missing? Thanks.
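A possible line of investigation, offered as a guess and not confirmed anywhere in this thread: values in an untyped map[] are bytearrays, so "annot#'prefix' as apre:chararray" implies a conversion, and the warning suggests that conversion is failing before the filter runs. Storing and reloading would turn the values into plain text, which may be why the second version works. The sketch below assumes a Pig version that accepts typed maps (map[chararray]) and uses a hypothetical path 'inputfile'; the field names follow the message above.

```pig
-- 'inputfile' is a hypothetical path, not taken from the thread.
-- map[chararray] declares the value type, so lookups need no later cast.
row = load 'inputfile' as (messageId:chararray, documentName:chararray, annot:map[chararray]);
describe row;

foo = foreach row generate messageId, documentName, annot#'prefix' as apre, annot#'label' as alabel;
describe foo;   -- check that apre and alabel are reported as chararray

x = filter foo by apre == 'VALUE';
dump x;
```

If describe foo reports bytearray for apre under the original schema, that would point at the conversion step as the difference between the two pipelines.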