Our input is a line of text that may be parsed into, e.g., an A or a B object. We want all A objects written to "A.avro" files and all B objects written to "B.avro" files.
I looked into the AvroMultipleOutputs class: http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapreduce/AvroMultipleOutputs.html
There is an example, but it is not quite clear. For job submission, it uses AvroMultipleOutputs.addNamedOutput to add the schemas for A and B. In my program it looks like this:
    AvroMultipleOutputs.addNamedOutput(job, "A", AvroKeyOutputFormat.class, aSchema, null);
    AvroMultipleOutputs.addNamedOutput(job, "B", AvroKeyOutputFormat.class, bSchema, null);
I believe this is for the Reducer output files.
*My question is* what the Mapper output should be, specifically what "job.setMapOutputValueClass" should be set to, since the Mapper output could be an A or a B object, with schema aSchema or bSchema.
In my program, I simply set it to GenericData, but I get the error below:
14/03/06 15:55:34 INFO mapreduce.Job: Task Id : attempt_1393817780522_0012_m_000010_2, Status : FAILED
Error: java.lang.NullPointerException
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:989)
    at org.apache.hadoop.mapred.MapTask.createSortingCollector(MapTask.java:390)
    at org.apache.hadoop.mapred.MapTask.access$100(MapTask.java:79)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:674)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:746)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:165)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:160)
You may consider the "SpecificRecord" or "GenericRecord" of Avro.
Yong
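One way the two suggestions in this thread might fit together is sketched below: the map output value is declared as an Avro union of the two schemas (the "common map output data type"), and the reducer routes each record to a named output by its schema name. This is only a sketch under assumptions, not a tested job: the class name MultiAvroSketch, the Text map output key, and the idea that the record names in aSchema and bSchema are literally "A" and "B" are all hypothetical. IndexedRecord is used because it is the common supertype of GenericRecord and SpecificRecord. The NullPointerException in MapOutputBuffer.init is typically what appears when Hadoop finds no serializer for the map output class, which AvroJob.setMapOutputValueSchema is meant to register.

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.avro.Schema;
import org.apache.avro.generic.IndexedRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.avro.mapreduce.AvroMultipleOutputs;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical class name; requires avro, avro-mapred and hadoop on the classpath.
public class MultiAvroSketch {

    /** Builds a union schema so that one map output value can hold either an A or a B record. */
    static Schema unionOf(Schema aSchema, Schema bSchema) {
        return Schema.createUnion(Arrays.asList(aSchema, bSchema));
    }

    /** Job setup: union schema for map output values, one named output per record type. */
    static void configure(Job job, Schema aSchema, Schema bSchema) {
        // Assumption: the map output key is plain Text; any Writable key would do.
        job.setMapOutputKeyClass(Text.class);
        // Sets the map output value class to AvroValue and registers Avro serialization.
        AvroJob.setMapOutputValueSchema(job, unionOf(aSchema, bSchema));
        // Same calls as in the original post.
        AvroMultipleOutputs.addNamedOutput(job, "A", AvroKeyOutputFormat.class, aSchema, null);
        AvroMultipleOutputs.addNamedOutput(job, "B", AvroKeyOutputFormat.class, bSchema, null);
    }

    /** Routes each incoming record to the named output matching its schema name. */
    public static class RoutingReducer
            extends Reducer<Text, AvroValue<IndexedRecord>, AvroKey<IndexedRecord>, NullWritable> {

        private AvroMultipleOutputs outputs;

        @Override
        protected void setup(Context context) {
            outputs = new AvroMultipleOutputs(context);
        }

        @Override
        protected void reduce(Text key, Iterable<AvroValue<IndexedRecord>> values, Context context)
                throws IOException, InterruptedException {
            for (AvroValue<IndexedRecord> value : values) {
                IndexedRecord record = value.datum();
                // Assumption: the record names are exactly "A" and "B",
                // matching the named outputs registered in configure().
                String namedOutput = record.getSchema().getName();
                outputs.write(namedOutput, new AvroKey<IndexedRecord>(record), NullWritable.get());
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Closing flushes the Avro files behind the named outputs.
            outputs.close();
        }
    }
}
```

With this setup the mapper would emit each parsed object wrapped in an AvroValue, and the driver would call MultiAvroSketch.configure(job, aSchema, bSchema) and job.setReducerClass(MultiAvroSketch.RoutingReducer.class) before submitting.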
Date: Fri, 7 Mar 2014 10:29:49 +0800
Subject: Re: MapReduce: How to output multiplt Avro files?
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Thanks, Harsh. Any idea how to build a common map output data type? The only way I can think of is "toString()", which would be very inefficient, since A and B are big objects and may change over time, which is also why we want to use Avro serialization.
2014-03-07 9:55 GMT+08:00 Harsh J <[EMAIL PROTECTED]>:
If you have a reducer involved, you'll likely need a common map output
data type that both A and B can fit into.
On Thu, Mar 6, 2014 at 12:09 AM, Fengyun RAO <[EMAIL PROTECTED]> wrote: