Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Avro >> mail # user >> Re: MapReduce: How to output multiplt Avro files?

Copy link to this message
RE: MapReduce: How to output multiplt Avro files?
Hi Fengyun,


Here's what I've done in the past when facing a similar issue:


1)      Set the map output schema to a UNION of both of your target schemas,
A and B.

2)      Serialize the data in the mappers, using the avro datum as the

3)      Figure out what the avro schema is for each datum and write out the
data in the reducer.  






From: Fengyun RAO [mailto:[EMAIL PROTECTED]]
Sent: Thursday, March 06, 2014 2:14 AM
Subject: Re: MapReduce: How to output multiplt Avro files?


add avro user mail-list


2014-03-06 16:09 GMT+08:00 Fengyun RAO <[EMAIL PROTECTED]>:

our input is a line of text which may be parsed to e.g. A or B object.

We want all A objects written to "A.avro" files, while all B objects written
to "B.avro".


I looked into AvroMultipleOutputs class:

There is an example, however, it's not quite clear.

For job submission, it uses AvroMultipleOutputs.addNamedOutput to add
schemas for A and B.

In my program looks like:

        AvroMultipleOutputs.addNamedOutput(job, "A",
AvroKeyOutputFormat.class, aSchema, null);  

        AvroMultipleOutputs.addNamedOutput(job, "B",
AvroKeyOutputFormat.class, bSchema, null);

I believe this is for Reducer output files.


My question is what the Mapper output should be, in specific what
"job.setMapOutputValueClass" should be,

since the Mapper output could be A or B object, with schema aSchema or


In my progam, I simply set it to GenericData, but get error as below:


14/03/06 15:55:34 INFO mapreduce.Job: Task Id :
attempt_1393817780522_0012_m_000010_2, Status : FAILED

Error: java.lang.NullPointerException



        at org.apache.hadoop.mapred.MapTask.access$100(MapTask.java:79)


        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:746)

        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)

        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:165)

        at java.security.AccessController.doPrivileged(Native Method)

        at javax.security.auth.Subject.doAs(Subject.java:415)


        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:160)


I have no idea what this means.