Pig, mail # user - Having troubles with PigStorage


Re: Having troubles with PigStorage
William Oberman 2012-11-06, 21:29
This may be a dumb question, but PigStorage escapes the delimiter, right?  I
was assuming I didn't have to pick a delimiter that never appears in the
data, since it would be escaped on STORE and unescaped again on LOAD.
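
The question can be made concrete with a minimal Python sketch of a tab-delimited writer and reader that does no escaping. This is a simplified model, not PigStorage's actual code, and whether PigStorage behaves this way is exactly what is being asked. The function names `store` and `load` and the sample rows are made up for illustration. If no escaping happens, a tab or newline byte inside a field (binary UUID column names could contain either) would shift fields or split a row on reload:

```python
# Simplified model of a delimiter-separated writer/reader with NO escaping.
# Assumption: this only mimics "join on tab, split on tab"; it is not Pig code.

def store(rows, delim="\t"):
    """Join each row's fields with the delimiter, one row per line."""
    return "\n".join(delim.join(fields) for fields in rows)


def load(text, delim="\t"):
    """Split the text into lines, then split each line on the raw delimiter."""
    return [line.split(delim) for line in text.split("\n")]


# Hypothetical sample data: key = URI, value = JSON-like text.
rows = [
    ["http://example.com/a", '{"note": "plain"}'],
    ["http://example.com/b", '{"note": "has\ta tab"}'],      # tab inside the value
    ["http://example.com/c", '{"note": "has\na newline"}'],  # newline inside the value
]

reloaded = load(store(rows))

print("rows written:", len(rows))
print("rows read back:", len(reloaded))
for fields in reloaded:
    # The first field is what a "GENERATE key" style projection would return.
    print(len(fields), "fields, first field:", repr(fields[0]))
```

In this model the second row comes back with an extra field and the third row comes back as two rows, the second of which starts with value data where a key should be. That resembles the symptom described in the original message, though it does not prove that this is the cause.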
On Tue, Nov 6, 2012 at 4:01 PM, Cheolsoo Park <[EMAIL PROTECTED]> wrote:

> Hi Will,
>
> >> data = LOAD 'hdfs://ZZZ/tmp/test' USING PigStorage() AS
> >> (key:chararray, columns:bag {column:tuple (name, value)});
>
> Can you please provide some of the data from this file
> (hdfs://ZZZ/tmp/test) so that we can reproduce your problem? One or two
> rows would be sufficient.
>
> Thanks,
> Cheolsoo
>
> On Tue, Nov 6, 2012 at 12:20 PM, William Oberman <[EMAIL PROTECTED]> wrote:
>
> > I'm trying to play around with Amazon EMR, and I currently have
> > self-hosted Cassandra as the source of data.  I was going to try to do:
> > Cassandra -> S3 -> EMR.  I've traced my problems to PigStorage.  At this
> > point I can recreate my problem "locally" without involving S3 or Amazon.
> >
> > In my local test environment I have this script:
> >
> > data = LOAD 'cassandra://XXX/YYY' USING CassandraStorage() AS
> > (key:chararray, columns:bag {column:tuple (name, value)});
> >
> > STORE data INTO 'hdfs://ZZZ/tmp/test' USING PigStorage();
> >
> >
> > I can verify that the HDFS file looks vaguely correct (tab-separated
> > fields, newline-separated lines, my data in the right spots).
> >
> >
> > Then if I do:
> >
> > data = LOAD 'hdfs://ZZZ/tmp/test' USING PigStorage() AS (key:chararray,
> > columns:bag {column:tuple (name, value)});
> >
> > keys = FOREACH data GENERATE key;
> >
> > DUMP keys;
> >
> >
> > I can see that the data is wrong.  In the dump I sometimes see keys,
> > sometimes columns, and sometimes a mix of keys and columns lumped
> > together.
> >
> >
> > As far as I can tell, PigStorage is unable to parse the data it just
> > persisted.  I've tried Pig 0.8, 0.9, and 0.10 with the same results.
> >
> >
> > In terms of my data:
> >
> > key = URI (ASCII)
> >
> > columns = binary UUID -> JSON (ASCII)
> >
> >
> > Any ideas?  Next I guess I'll see what kind of debugging is available in
> > Pig for the STORE/LOAD steps.
> >
> >
> > Thanks!
> >
> >
> > will
> >
>