org.apache.avro.mapred.AvroMultipleOutputs (avro-1.7.3) does not allow different output schemas
Luke Liu 2013-01-09, 00:01
Hi,
The issue is that org.apache.avro.mapred.AvroMultipleOutputs does not allow additional outputs with a schema different from the default one. I am using it as described in the Javadoc and passing in a "newSchema" that differs from the default Avro output schema.

    // Job configuration
    JobConf job = new JobConf();
    ....
    AvroMultipleOutputs.addNamedOutput(job, "avro1", AvroOutputFormat.class, newSchema);

    // In Reducer
    MyReducer {
        private AvroMultipleOutputs amos;
        ...
        public void configure(JobConf conf) {
            ...
            amos = new AvroMultipleOutputs(conf);
        }

        public void reduce (...) {
            amos.getCollector("avro1", reporter).collect(datum);
        }

        public void close() {
            amos.close();
        }
    }
    ....

With this setup, amos.getCollector("avro1", reporter).collect(datum) always uses the default Avro schema from the JobConf. It should use "newSchema".

I found that org.apache.avro.mapred.AvroMultipleOutputs.addNamedOutput(JobConf conf, String namedOutput, Class<? extends OutputFormat> outputFormatClass, Schema schema) stores the additional schemas in a static HashMap called "schemaList" at job configuration time. But reducer tasks may run on different hosts, in JVMs where "schemaList" was never initialized. The reducer therefore cannot get the schema from the list and falls back to the default schema in the JobConf.

I think org.apache.avro.mapred.AvroMultipleOutputs.addNamedOutput(...) should store the passed-in schema in the JobConf, not in the static member "schemaList".

Thanks,
Luke
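
P.S. To make the suggested fix concrete, here is a minimal, self-contained sketch of the two approaches. It is not the Avro source: java.util.Properties stands in for JobConf, schemas are plain JSON strings, and the key prefix and all method names are hypothetical. Clearing the static map is only a stand-in for a task starting in a fresh JVM.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class NamedOutputSchemaSketch {
    // Hypothetical property prefix, chosen for this sketch only.
    static final String SCHEMA_KEY_PREFIX = "sketch.namedoutput.schema.";

    // Mimics the static "schemaList": it exists only in the JVM that configured the job.
    static final Map<String, String> schemaList = new HashMap<>();

    // Behavior as described in the report: schema kept in static state.
    static void addNamedOutputStatic(String namedOutput, String schemaJson) {
        schemaList.put(namedOutput, schemaJson);
    }

    // Suggested behavior: schema serialized into the job configuration.
    static void addNamedOutputInConf(Properties conf, String namedOutput, String schemaJson) {
        conf.setProperty(SCHEMA_KEY_PREFIX + namedOutput, schemaJson);
    }

    // Task-side lookup against static state; falls back to the default schema.
    static String lookupStatic(String namedOutput, String defaultSchema) {
        String schema = schemaList.get(namedOutput);
        return schema != null ? schema : defaultSchema;
    }

    // Task-side lookup against the configuration that was shipped with the job.
    static String lookupInConf(Properties conf, String namedOutput, String defaultSchema) {
        return conf.getProperty(SCHEMA_KEY_PREFIX + namedOutput, defaultSchema);
    }

    public static void main(String[] args) {
        String defaultSchema = "\"int\"";
        String newSchema = "\"string\"";

        // Client side: configure the job both ways.
        Properties jobConf = new Properties();
        addNamedOutputStatic("avro1", newSchema);
        addNamedOutputInConf(jobConf, "avro1", newSchema);

        // Task side: the configuration is copied to the task, static state is not.
        Properties taskConf = new Properties();
        taskConf.putAll(jobConf);
        schemaList.clear(); // stand-in for a fresh JVM on another host

        System.out.println("static map lookup: " + lookupStatic("avro1", defaultSchema));
        System.out.println("conf lookup:       " + lookupInConf(taskConf, "avro1", defaultSchema));
    }
}
```

In the sketch, only the configuration-based lookup still finds the schema for "avro1" on the task side; the static lookup falls back to the default, which matches the behavior I am seeing.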