Pig, mail # user - Having troubles with PigStorage


Having troubles with PigStorage
William Oberman 2012-11-06, 20:20
I'm trying out Amazon EMR, and my data currently lives in a self-hosted
Cassandra cluster.  My plan was: Cassandra -> S3 -> EMR.  I've traced my
problems to PigStorage, and at this point I can reproduce them "locally"
without involving S3 or Amazon.

In my local test environment I have this script:

data = LOAD 'cassandra://XXX/YYY' USING CassandraStorage() AS
(key:chararray, columns:bag {column:tuple (name, value)});

STORE data INTO 'hdfs://ZZZ/tmp/test' USING PigStorage();

I can verify that the HDFS file looks roughly correct (tab-separated fields,
newline-separated lines, my data in the right spots).

Then if I do:

data = LOAD 'hdfs://ZZZ/tmp/test' USING PigStorage() AS (key:chararray,
columns:bag {column:tuple (name, value)});

keys = FOREACH data GENERATE key;

DUMP keys;

I can see that the data is wrong.  In the dump I sometimes see keys,
sometimes columns, and sometimes fragments of keys and columns lumped
together.

As far as I can tell, PigStorage is unable to parse the data it just
persisted.  I've tried Pig 0.8, 0.9 and 0.10 with the same results.

In terms of my data:

key = URI (ASCII)

columns = binary UUID -> JSON (ASCII)
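
One thing that might be worth ruling out, given that the column names are raw binary UUIDs: a 16-byte UUID can contain the byte 0x09 (tab) or 0x0A (newline). The sketch below is plain Python, not Pig, and it assumes the stored text is written with no escaping of delimiter bytes; that is an assumption about PigStorage, not something verified here. The helper names store_row and load_keys and the example URIs are hypothetical.

```python
import uuid


def store_row(key, columns):
    """Imitate unescaped delimited text output: key, tab, bag, newline.

    Assumption: binary column names are written as raw bytes, so any
    0x09 or 0x0A inside them lands in the file unchanged.
    """
    bag = b",".join(
        b"(" + name + b"," + value.encode("ascii") + b")"
        for name, value in columns
    )
    return key.encode("ascii") + b"\t{" + bag + b"}\n"


def load_keys(blob):
    """Imitate a naive reader: split on newline, take the first tab field."""
    return [line.split(b"\t")[0] for line in blob.split(b"\n") if line]


# A UUID whose raw bytes contain neither tab nor newline.
safe_uuid = uuid.UUID("11111111-1111-1111-1111-111111111111").bytes
# A UUID whose raw bytes contain 0x0A (newline) and 0x09 (tab).
bad_uuid = uuid.UUID("00000000-0000-0a00-0900-000000000001").bytes

good = store_row("http://example.com/a", [(safe_uuid, '{"x":1}')])
bad = store_row("http://example.com/b", [(bad_uuid, '{"y":2}')])

print(load_keys(good))  # one row stored, one key read back
print(load_keys(bad))   # one row stored, but more than one "key" read back
```

With these inputs, the row whose UUID contains 0x0A reads back as two lines, and the second "key" is a fragment of the column name, which resembles the symptom in the DUMP output. If this turns out to be the cause, encoding the binary column names (hex or Base64, for example) before the STORE would be one way around it.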
Any ideas?  Next I guess I'll look at what debugging Pig offers for the
STORE and LOAD steps.
Thanks!
will