Avro, mail # user - org.apache.avro.mapred.AvroMultipleOutputs (avro-1.7.3) does not allow different output schemas

Luke Liu 2013-01-09, 00:01

The issue is that it does not allow additional outputs with a different schema.

I am using "org.apache.avro.mapred.AvroMultipleOutputs" as described in the
Javadoc, passing in a "newSchema" that is different from the default
output Avro schema.

// Job configuration
JobConf job = new JobConf();
AvroMultipleOutputs.addNamedOutput(job, "avro1",
    AvroOutputFormat.class, newSchema);

// In the reducer
class MyReducer {

  private AvroMultipleOutputs amos;

  public void configure(JobConf conf) {
    amos = new AvroMultipleOutputs(conf);
  }

  public void reduce(...) {
    amos.getCollector("avro1", reporter).collect(datum);
  }

  public void close() throws IOException {
    amos.close();
  }
}


Then "amos.getCollector("avro1", reporter).collect(datum);" always
uses the default avro schema in the JobConf. It should use

I found that "org.apache.avro.mapred.AvroMultipleOutputs.addNamedOutput(JobConf
conf, String namedOutput, Class<? extends OutputFormat>
outputFormatClass, Schema schema)" stores the additional schemas in a
static HashMap called "schemaList" at job configuration time.

But reducer tasks can run on different hosts, in separate JVMs where
"schemaList" was never initialized. So the reducer won't get the schema
from the list. Therefore, it will use the default schema in the

I think "org.apache.avro.mapred.AvroMultipleOutputs.addNamedOutput(...)"
should store the passed in schema in the JobConf, not in static member
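
The suggestion above can be sketched without Hadoop or Avro on the classpath. In this minimal sketch, java.util.Properties stands in for JobConf (both are string key/value settings that get shipped to the tasks), and the schema is held as its JSON string. The class name, the method names and the key prefix are all hypothetical and are not the actual Avro API or the key Avro uses. It only illustrates why a value written into the configuration reaches a task while a value held in a static map does not.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

class SchemaInConfSketch {
    // Hypothetical key prefix, chosen for this sketch only.
    static final String SCHEMA_KEY_PREFIX = "sketch.namedoutput.schema.";

    // Mimics the static "schemaList": it exists only in the JVM that configured the job.
    static final Map<String, String> staticSchemas = new HashMap<>();

    // Client side: register a named output and its schema (as a JSON string).
    static void addNamedOutput(Properties conf, String name, String schemaJson) {
        staticSchemas.put(name, schemaJson);                     // stays in this JVM
        conf.setProperty(SCHEMA_KEY_PREFIX + name, schemaJson);  // travels with the job
    }

    // Task side: look the schema up in the configuration; null if it was never set.
    static String schemaFor(Properties conf, String name) {
        return conf.getProperty(SCHEMA_KEY_PREFIX + name);
    }

    // Simulates shipping the configuration to a task by serializing and reloading it.
    static Properties shipToTask(Properties conf) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            conf.store(out, null);
            Properties received = new Properties();
            received.load(new ByteArrayInputStream(out.toByteArray()));
            return received;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Properties jobConf = new Properties();
        addNamedOutput(jobConf, "avro1", "{\"type\":\"string\"}");

        Properties taskConf = shipToTask(jobConf);
        staticSchemas.clear(); // a fresh task JVM starts with an empty static map

        System.out.println("static map: " + staticSchemas.get("avro1"));
        System.out.println("from conf:  " + schemaFor(taskConf, "avro1"));
    }
}
```

With real Avro and Hadoop classes, the same idea would presumably mean writing schema.toString() into the JobConf and parsing it back on the task side, but that is an assumption about a possible fix, not a description of how the library behaves.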