Avro >> mail # user >> Avro mapred: How to avoid schema specification in job.xml?


Re: Avro mapred: How to avoid schema specification in job.xml?
Hello,

I followed your advice and filed a JIRA issue:
https://issues.apache.org/jira/browse/AVRO-923
My first idea was to implement a custom HadoopMapper, but the actual code
for this is in static methods of AvroJob. The impact is that I would have to
add many custom classes (Mapper, Reducer, RecordReader, AvroJob ...).
Implementing a fallback in AvroJob is actually a very good idea, since it
would take only about a dozen lines of code.
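
To make the fallback idea concrete, here is a rough sketch of the lookup
order: use avro.input.schema if it is set, otherwise load a class by name and
read its static SCHEMA$ field through reflection (which, as far as I know, is
how Avro-generated specific classes expose their schema). This is not the
AvroJob code. The avro.input.class property name, the SchemaFallback class and
the User stand-in are assumptions for illustration only, and a plain Map and
String replace JobConf and Schema so the example runs without Hadoop or Avro.

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

public class SchemaFallback {

    /** Hypothetical stand-in for an Avro-generated specific record class. */
    public static class User {
        // Generated classes carry a static SCHEMA$ field; a String is used
        // here instead of org.apache.avro.Schema to stay dependency-free.
        public static final String SCHEMA$ =
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":[]}";
    }

    // Existing property discussed in this thread.
    public static final String INPUT_SCHEMA = "avro.input.schema";
    // Hypothetical property holding only a class name (an assumption).
    public static final String INPUT_CLASS = "avro.input.class";

    /** Returns the explicit schema if set, else the schema found by reflection. */
    public static String resolveInputSchema(Map<String, String> conf) {
        String schema = conf.get(INPUT_SCHEMA);
        if (schema != null) {
            return schema; // an explicit schema in the job config always wins
        }
        String className = conf.get(INPUT_CLASS);
        if (className == null) {
            throw new IllegalStateException(
                "Neither " + INPUT_SCHEMA + " nor " + INPUT_CLASS + " is set");
        }
        try {
            // Fallback: read the static SCHEMA$ field of the named class.
            Field field = Class.forName(className).getField("SCHEMA$");
            return String.valueOf(field.get(null));
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(
                "Cannot read SCHEMA$ from " + className, e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(INPUT_CLASS, User.class.getName());
        System.out.println(resolveInputSchema(conf));
    }
}
```

With something along these lines, job.xml would only need to name the record
class, and a schema change would be picked up from the regenerated class
rather than copied into every job file.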

The route I might take is to build a custom version of Avro MapRed based on
1.5.4, but this has several drawbacks, including problems during version
updates and job modifications once the feature is actually
implemented.

--
Julien Muller

2011/10/10 Scott Carey <[EMAIL PROTECTED]>

> On 10/10/11 11:41 AM, "Julien Muller" <[EMAIL PROTECTED]> wrote:
>
> Hello,
>
> Thanks for your answer, let me try to clarify my context a bit:
>
> I'm not all that familiar with how Oozie interacts with Avro.
>>
> Let's get oozie out of the picture. I use job.xml files to configure Jobs.
> This means I do not have any JobConf object and I cannot use AvroJob.
> Therefore I directly write the job properties (as what AvroJob outputs).
>
> The Job must set its avro.input.schema and avro.output.schema properties —
>> this can be done in code (see the unit tests in the Avro mapred project for
>> examples),
>>
> The solution I have now is basically based on the Avro mapred unit tests.
> But in my context, it is not an option to code (using the $SCHEMA property)
> at the job configuration level.
> where you code:
>     AvroJob.setInputSchema(job, Schema.create(Schema.Type.STRING));
> I have to copy the entire schema into the job.xml file, and I have to update it
> every time my schema gets updated.
> I hope I can find a better solution.
>
>
> I suppose that in AvroJob we could transmit only the class name in a
> property, and use that to look up the schema for generated classes using
> reflection.  Could you do something similar?  I don't think it is possible
> to avoid configuring at least some sort of pointer to where the schema is.
>  This could be via a property, or if you already have the job class, an
> annotation on that class.
>
>
> and if you are using SpecificRecords and DataFiles the schema is available
>> to the code where necessary.
>>
> I am not sure what you mean here. I am using SpecificRecords and would like
> to avoid specifying avro.input.schema, since this info is already here in
> the specific record.
>
>
> Potentially the AvroMapper / AvroReducer could have a fall-back for
> obtaining the schema if the property is not set — reflection on a class name
> or an annotation.  If this looks like it is an enhancement request for Avro
> (or a bug) please file a JIRA ticket.  Thanks!
>
>
> Thanks,
>
> Julien Muller
>
> 2011/10/10 Scott Carey <[EMAIL PROTECTED]>
>
>> I'm not all that familiar with how Oozie interacts with Avro.
>>
>> The Job must set its avro.input.schema and avro.output.schema properties —
>> this can be done in code (see the unit tests in the Avro mapred project for
>> examples), and if you are using SpecificRecords and DataFiles the schema is
>> available to the code where necessary.
>>
>>
>>
>> On 10/10/11 5:41 AM, "Julien Muller" <[EMAIL PROTECTED]> wrote:
>>
>> Hello,
>>
>> I have been using avro with hadoop and oozie for months now and I am very
>> happy with the results.
>>
>> The only point I see as a limitation now is that we specify Avro schemas
>> in workflow.xml (job.xml):
>> - avro.input.schema
>> - avro.output.schema
>> Since this info is already provided in Mapper/Reducer signatures, I see
>> this as redundant. The schema is also present in all my serialized files,
>> which means that the schema is specified in 3 different places.
>>
>> From an operations point of view, this is a pain, since any schema modification
>> (let's say a simple optional field added) forces me to update many job
>> files. This task is very error prone and since we have a large amount of