Re: Having troubles with PigStorage
>> This is a dumb question, but PigStorage escapes the delimiter, right?

No it doesn't.
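
A minimal sketch of the consequence, for anyone reading this in the archive: the file /tmp/demo.txt and its contents below are hypothetical, made up purely to show what happens when a stored value contains the delimiter.

-- assume /tmp/demo.txt contains one line whose second field has a raw tab in it:
--   k1<TAB>{"a":"x<TAB>y"}
raw = LOAD '/tmp/demo.txt' USING PigStorage() AS (key:chararray, value:chararray);
DUMP raw;
-- PigStorage splits on every tab, so value comes back as {"a":"x and the
-- trailing y"} becomes a third field that the two-column schema drops.
-- The delimiter has to be something that can never occur in the data (or the
-- data has to be encoded first); PigStorage will not escape it for you.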

On Tue, Nov 6, 2012 at 1:29 PM, William Oberman <[EMAIL PROTECTED]> wrote:

> This is a dumb question, but PigStorage escapes the delimiter, right?  I
> was assuming I didn't have to pick a delimiter that never appears in the
> data, since it would get escaped by the export process and unescaped by
> the import process....
>
>
> On Tue, Nov 6, 2012 at 4:01 PM, Cheolsoo Park <[EMAIL PROTECTED]>
> wrote:
>
> > Hi Will,
> >
> > >> data = LOAD 'hdfs://ZZZ/tmp/test' USING PigStorage() AS
> > (key:chararray,columns:bag {column:tuple (name, value)});
> >
> > Can you please provide a sample of the data from this file
> > (hdfs://ZZZ/tmp/test) so that we can reproduce your problem? 1-2 rows
> > would be sufficient.
> >
> > Thanks,
> > Cheolsoo
> >
> > On Tue, Nov 6, 2012 at 12:20 PM, William Oberman
> > <[EMAIL PROTECTED]>wrote:
> >
> > > I'm trying to play around with Amazon EMR, and I currently have
> > > self-hosted Cassandra as the source of data.  I was going to try to do:
> > > Cassandra -> S3 -> EMR.  I've traced my problems to PigStorage.  At this
> > > point I can recreate my problem "locally" without involving S3 or Amazon.
> > >
> > > In my local test environment I have this script:
> > >
> > > data = LOAD 'cassandra://XXX/YYY' USING CassandraStorage() AS
> > > (key:chararray, columns:bag {column:tuple (name, value)});
> > >
> > > STORE data INTO 'hdfs://ZZZ/tmp/test' USING PigStorage();
> > >
> > >
> > > I can verify that the HDFS file looks vaguely correct (\t-separated
> > > fields, return-separated lines, my data in the right spots).
> > >
> > >
> > > Then if I do:
> > >
> > > data = LOAD 'hdfs://ZZZ/tmp/test' USING PigStorage() AS (key:chararray,
> > > columns:bag {column:tuple (name, value)});
> > >
> > > keys = FOREACH data GENERATE key;
> > >
> > > DUMP keys;
> > >
> > >
> > > I can see that the data is wrong.  In the dump sometimes I see keys,
> > > sometimes I see columns, and sometimes I see a mismatch of keys/columns
> > > lumped together.
> > >
> > >
> > > As far as I can tell PigStorage is unable to parse the data it just
> > > persisted.  I've tried pig 0.8, 0.9 and 0.10 with the same results.
> > >
> > >
> > > In terms of my data:
> > >
> > > key = URI (ASCII)
> > >
> > > columns = binary UUID -> JSON (ASCII)
> > >
> > >
> > > Any ideas?  Next I guess I'll see what kind of debugging is in pig in
> > > the STORE/LOAD processes.
> > >
> > >
> > > Thanks!
> > >
> > >
> > > will
> > >
> >
>
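
One possible workaround, sketched under the thread's own placeholders (XXX/YYY for the Cassandra keyspace/column family, ZZZ for the HDFS namenode): because the column values here are binary UUIDs and JSON, any text delimiter can collide with the data, so the intermediate files could be staged with Pig's built-in BinStorage instead, which writes Pig's internal binary format and has no delimiter to collide with. The trade-off is that the files are only readable back through Pig.

-- export step (untested sketch)
data = LOAD 'cassandra://XXX/YYY' USING CassandraStorage() AS
       (key:chararray, columns:bag {column:tuple (name, value)});
STORE data INTO 'hdfs://ZZZ/tmp/test' USING BinStorage();

-- import step, e.g. on EMR, with the same schema
data = LOAD 'hdfs://ZZZ/tmp/test' USING BinStorage() AS
       (key:chararray, columns:bag {column:tuple (name, value)});
keys = FOREACH data GENERATE key;
DUMP keys;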