Weird problem in Pig 0.10 with STOR'ing JSON and then LOADing it as PigStorage chararray
This script has worked in the past:

/* Load Avro emails */
emails = load '/me/tmp/emails_big' using AvroStorage();
emails = filter emails by message_id IS NOT NULL;

/* JSONify the emails for ElasticSearch */
store emails into '/tmp/emails.json' using JsonStorage();

/* LOAD JSON as a single field for storage in ElasticSearch with Wonderdog */
json_emails = load '/tmp/emails.json' using PigStorage() AS (json_record:chararray);
store json_emails into 'es://email/email?id=message_id&json=true&size=1000' using ElasticSearch();

Now I get this error:

grunt> json_emails = load '/tmp/emails.json' AS (json_record:chararray);

2012-06-22 15:45:34,136 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1031: Incompatable schema: left is "json_record:chararray", right is
"message_id:chararray,thread_id:chararray,in_reply_to:chararray,subject:chararray,body:chararray,date:chararray,froms:bag{ARRAY_ELEM:tuple(real_name:chararray,address:chararray)},tos:bag{ARRAY_ELEM:tuple(real_name:chararray,address:chararray)},ccs:bag{ARRAY_ELEM:tuple(real_name:chararray,address:chararray)},bccs:bag{ARRAY_ELEM:tuple(real_name:chararray,address:chararray)},reply_tos:bag{ARRAY_ELEM:tuple(real_name:chararray,address:chararray)}"
2012-06-22 15:45:34,136 [main] ERROR org.apache.pig.tools.grunt.Grunt -
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1031:
Incompatable schema: left is "json_record:chararray", right is
"message_id:chararray,thread_id:chararray,in_reply_to:chararray,subject:chararray,body:chararray,date:chararray,froms:bag{ARRAY_ELEM:tuple(real_name:chararray,address:chararray)},tos:bag{ARRAY_ELEM:tuple(real_name:chararray,address:chararray)},ccs:bag{ARRAY_ELEM:tuple(real_name:chararray,address:chararray)},bccs:bag{ARRAY_ELEM:tuple(real_name:chararray,address:chararray)},reply_tos:bag{ARRAY_ELEM:tuple(real_name:chararray,address:chararray)}"
    at org.apache.pig.newplan.logical.relational.LogicalSchema.merge(LogicalSchema.java:760)
    at org.apache.pig.newplan.logical.relational.LOLoad.getSchema(LOLoad.java:114)
    at org.apache.pig.newplan.logical.visitor.LineageFindRelVisitor.visit(LineageFindRelVisitor.java:100)
    at org.apache.pig.newplan.logical.relational.LOLoad.accept(LOLoad.java:219)
    at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
    at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
    at org.apache.pig.newplan.logical.visitor.CastLineageSetter.<init>(CastLineageSetter.java:57)
    at org.apache.pig.PigServer$Graph.compile(PigServer.java:1635)
    at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1566)
    at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1538)
    at org.apache.pig.PigServer.registerQuery(PigServer.java:540)
    at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:970)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
    at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
    at org.apache.pig.Main.run(Main.java:490)
    at org.apache.pig.Main.main(Main.java:111)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

I tried copying the file from /tmp/emails.json to /tmp/json_emails and
loading it from there, but that didn't work.  I tried calling PigStorage()
explicitly, and that didn't work either.

How am I supposed to pull this off?

I figured it out:

grunt> rm /tmp/emails.json/.pig_header
grunt> rm /tmp/emails.json/.pig_schema

Then I can load my JSON as chararray.  Interesting problem.
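
A possible alternative to deleting the hidden files, offered as a sketch rather than something tested in this thread: JsonStorage writes .pig_schema and .pig_header next to the data, and PigStorage appears to pick up the stored schema on load, which would explain why the one-field AS clause clashes with the multi-field schema in the error. Assuming the installed PigStorage accepts the '-noschema' option (check this against your Pig version), the load can be told to ignore the stored schema:

```pig
/* Assumption: this PigStorage build supports the '-noschema' option,
   which tells the loader to ignore any .pig_schema file in the directory. */
json_emails = load '/tmp/emails.json'
              using PigStorage('\t', '-noschema')
              AS (json_record:chararray);

/* Should now show the single chararray field rather than the JsonStorage schema */
describe json_emails;
```

The '\t' delimiter is the PigStorage default, passed explicitly only because the options string is the second constructor argument. This leaves .pig_schema in place, so the same directory can still be read back with its full schema by another loader.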

--
Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com