Having troubles with PigStorage
William Oberman 2012-11-06, 20:20
I'm trying to play around with Amazon EMR, and I currently have self-hosted Cassandra as the source of data. I was going to try to do: Cassandra -> S3 -> EMR. I've traced my problems to PigStorage. At this point I can recreate my problem "locally" without involving S3 or Amazon.

In my local test environment I have this script:

    data = LOAD 'cassandra://XXX/YYY' USING CassandraStorage() AS (key:chararray, columns:bag {column:tuple (name, value)});
    STORE data INTO 'hdfs://ZZZ/tmp/test' USING PigStorage();

I can verify that the HDFS file looks vaguely correct (\t separated fields, return separated lines, my data is in the right spots).

Then if I do:

    data = LOAD 'hdfs://ZZZ/tmp/test' USING PigStorage() AS (key:chararray, columns:bag {column:tuple (name, value)});
    keys = FOREACH data GENERATE key;
    DUMP keys;

I can see that the data is wrong. In the dump sometimes I see keys, sometimes I see columns, and sometimes I see a mismatch of keys/columns lumped together.

As far as I can tell, PigStorage is unable to parse the data it just persisted. I've tried Pig 0.8, 0.9 and 0.10 with the same results.

In terms of my data:
key = URI (ASCII)
columns = binary UUID -> JSON (ASCII)

Any ideas? Next I guess I'll see what kind of debugging is available in Pig for the STORE/LOAD processes.

Thanks!

will
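
One possible explanation, which the post above does not confirm: a raw 16-byte UUID can contain the bytes 0x09 (tab) or 0x0A (newline), and a tab-delimited, newline-terminated text format that writes fields without escaping cannot tell those bytes apart from its own separators. The Python sketch below only simulates that effect. The `store_row` and `load_rows` helpers are hypothetical stand-ins, not Pig's implementation, and the row layout is simplified to three flat fields instead of a bag of tuples.

```python
import uuid

def store_row(key: bytes, name: bytes, value: bytes) -> bytes:
    # Hypothetical writer: tab between fields, newline after the row, no escaping.
    return key + b"\t" + name + b"\t" + value + b"\n"

def load_rows(blob: bytes) -> list:
    # Hypothetical reader: split on newline, then on tab, as a delimiter-based loader would.
    return [line.split(b"\t") for line in blob.split(b"\n") if line]

# A UUID whose raw bytes contain neither 0x09 nor 0x0A.
safe_name = uuid.UUID("00000000-0000-0000-0000-000000000001").bytes
# A UUID whose raw bytes contain 0x0A (newline) and 0x09 (tab).
risky_name = uuid.UUID("0a000000-0000-0900-0000-000000000001").bytes

safe_rows = load_rows(store_row(b"http://example.com/a", safe_name, b'{"x":1}'))
risky_rows = load_rows(store_row(b"http://example.com/b", risky_name, b'{"x":1}'))

print(len(safe_rows), [len(r) for r in safe_rows])
print(len(risky_rows), [len(r) for r in risky_rows])
```

In this simulation the first row survives the round trip as one record with three fields, while the second is read back as two records, and the "key" of the second record is a fragment of the column name. That resembles the symptom described above, where keys and columns show up in each other's places. If this is the cause, encoding the binary column names as text (for example hex) before storing, or using a binary storage format, would be things to try.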
Cheolsoo Park 2012-11-06, 21:01
William Oberman 2012-11-06, 21:29
Cheolsoo Park 2012-11-06, 21:35
William Oberman 2012-11-06, 21:50
William Oberman 2012-11-06, 22:01