org.apache.avro.mapred.AvroMultipleOutputs (avro-1.7.3) does not allow different output schemas
Hi,

The issue is that AvroMultipleOutputs does not allow additional named
outputs whose schema differs from the default output schema.

I am using "org.apache.avro.mapred.AvroMultipleOutputs" as described in
the Javadoc, and I pass in a "newSchema" that is different from the
default Avro output schema.

// Job configuration
JobConf job = new JobConf();
....
AvroMultipleOutputs.addNamedOutput(job, "avro1",
    AvroOutputFormat.class, newSchema);

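// In Reducer
MyReducer {

  private AvroMultipleOutputs amos;
  ...
  public void configure(JobConf conf) {
    ...
    // Created in the task JVM from the JobConf that was shipped to the task
    amos = new AvroMultipleOutputs(conf);
  }

  public void reduce(...) {
    // Expected to write with "newSchema", not the default output schema
    amos.getCollector("avro1", reporter).collect(datum);
  }

  public void close() throws IOException {
    amos.close();
  }

}
....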

With this setup, "amos.getCollector("avro1", reporter).collect(datum);"
always uses the default Avro schema from the JobConf. It should use
"newSchema".

Looking at "org.apache.avro.mapred.AvroMultipleOutputs.addNamedOutput(JobConf
conf, String namedOutput,  Class<? extends OutputFormat>
outputFormatClass, Schema schema)", I found that it stores the
additional schemas in a static HashMap called "schemaList" at job
configuration time.

But reducer tasks can run in separate JVMs on other hosts, where
"schemaList" was never populated. The reducer therefore cannot find the
schema in the list and falls back to the default schema in the
JobConf.
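
To make the failure mode concrete, here is a minimal sketch in plain Java
with no Hadoop or Avro dependency. It is not the Avro source: the class
name, the method names other than "addNamedOutput", and the use of JSON
strings in place of Schema objects are all assumptions for illustration.
Resetting the static map stands in for a reducer starting in a fresh JVM.

```java
import java.util.HashMap;
import java.util.Map;

public class StaticSchemaListSketch {
    // Stand-in for the static "schemaList" map; values are schema JSON strings.
    static Map<String, String> schemaList = new HashMap<>();

    // Client side: the schema is recorded only in this JVM's static memory.
    static void addNamedOutput(String namedOutput, String schemaJson) {
        schemaList.put(namedOutput, schemaJson);
    }

    // Task side: fall back to the job's default schema if the name is unknown.
    static String schemaFor(String namedOutput, String defaultSchemaJson) {
        String schema = schemaList.get(namedOutput);
        return schema != null ? schema : defaultSchemaJson;
    }

    // Hypothetical helper: a new task JVM starts with empty static state.
    static void simulateFreshTaskJvm() {
        schemaList = new HashMap<>();
    }

    public static void main(String[] args) {
        String defaultSchema = "\"string\"";
        String newSchema = "\"long\"";

        addNamedOutput("avro1", newSchema);
        // Same JVM as the job setup: the named output's schema is found.
        System.out.println("client JVM: " + schemaFor("avro1", defaultSchema));

        simulateFreshTaskJvm();
        // Task JVM: the static map is empty, so the default schema is used.
        System.out.println("task JVM:   " + schemaFor("avro1", defaultSchema));
    }
}
```

In the sketch the lookup succeeds only while the code runs in the JVM
that called "addNamedOutput", which matches the behaviour I see when the
job runs on a cluster.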

I think "org.apache.avro.mapred.AvroMultipleOutputs.addNamedOutput(...)"
should store the passed-in schema in the JobConf, which is shipped to
every task, rather than in the static member "schemaList".
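
A rough sketch of that idea, again in plain Java so it runs without
Hadoop or Avro. "java.util.Properties" stands in for the JobConf, the
store/load round trip stands in for shipping the configuration to a task,
and the key prefix is a made-up name, not one used by Avro. In a real
patch the schema would presumably be written with Schema.toString() and
parsed back on the task side.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

public class ConfSchemaSketch {
    // Hypothetical configuration key prefix, chosen only for this sketch.
    static final String KEY_PREFIX = "sketch.avro.namedOutput.";

    // Client side: serialize the schema into the configuration itself.
    static void addNamedOutput(Properties conf, String namedOutput, String schemaJson) {
        conf.setProperty(KEY_PREFIX + namedOutput + ".schema", schemaJson);
    }

    // Task side: read the schema back, falling back to the default if absent.
    static String schemaFor(Properties conf, String namedOutput, String defaultSchemaJson) {
        return conf.getProperty(KEY_PREFIX + namedOutput + ".schema", defaultSchemaJson);
    }

    // Simulates shipping the configuration to a task: write it out as text
    // and load it into a brand-new object that shares no state with the original.
    static Properties shipToTask(Properties conf) {
        try {
            StringWriter out = new StringWriter();
            conf.store(out, null);
            Properties taskConf = new Properties();
            taskConf.load(new StringReader(out.toString()));
            return taskConf;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Properties job = new Properties();
        addNamedOutput(job, "avro1", "\"long\"");

        Properties taskConf = shipToTask(job);
        // The named output's schema survives the trip to the task.
        System.out.println("avro1: " + schemaFor(taskConf, "avro1", "\"string\""));
        // An unregistered name still falls back to the default schema.
        System.out.println("other: " + schemaFor(taskConf, "other", "\"string\""));
    }
}
```

Because the schema travels inside the configuration, the task no longer
depends on any state left behind in the submitting JVM.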

Thanks,
Luke