Re: Avro mapred: How to avoid schema specification in job.xml?
On 10/10/11 11:41 AM, "Julien Muller" <[EMAIL PROTECTED]> wrote:

> Hello,
>
> Thanks for your answer, let me try to clarify my context a bit:
>
>> I'm not all that familiar with how Oozie interacts with Avro.
> Let's get oozie out of the picture. I use job.xml files to configure Jobs.
> This means I do not have any JobConf object and I cannot use AvroJob.
> Therefore I directly write the job properties (as what AvroJob outputs).
>
>> The Job must set its avro.input.schema and avro.output.schema properties;
>> this can be done in code (see the unit tests in the Avro mapred project for
>> examples),
> The solution I have now is basically based on the Avro mapred unit tests. But
> in my context, it is not an option to code (using the SCHEMA$ field) at the
> job configuration level.
> Where you would write:
>     AvroJob.setInputSchema(job, Schema.create(Schema.Type.STRING));
> I have to copy the entire schema into the job.xml file, and I have to update it
> every time my schema gets updated.
> I hope I can find a better solution.

I suppose that in AvroJob we could transmit only the class name in a
property, and use that to look up the schema for generated classes using
reflection.  Could you do something similar?  I don't think it is possible
to avoid configuring at least some sort of pointer to where the schema is.
This could be via a property, or if you already have the job class, an
annotation on that class.
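
A minimal sketch of the class-name idea, assuming a hypothetical property
(example.avro.input.class, which Avro's mapred API does not define) that names
the generated record class. Avro-generated SpecificRecord classes expose their
schema in a public static SCHEMA$ field; the stand-in class below uses a String
for that field only so the sketch compiles without Avro on the classpath.

```java
import java.lang.reflect.Field;
import java.util.Properties;

// Stand-in for an Avro-generated record class (hypothetical). In a real
// generated class, SCHEMA$ has type org.apache.avro.Schema, not String.
class ExampleRecord {
    public static final String SCHEMA$ =
        "{\"type\":\"record\",\"name\":\"ExampleRecord\",\"fields\":[]}";
}

class SchemaByClassName {
    // Hypothetical property name, chosen for this sketch only.
    static final String INPUT_CLASS_KEY = "example.avro.input.class";

    // Reads the record class name from the job properties and returns the
    // JSON schema found in that class's static SCHEMA$ field.
    static String resolveInputSchema(Properties jobProps)
            throws ReflectiveOperationException {
        String className = jobProps.getProperty(INPUT_CLASS_KEY);
        if (className == null) {
            throw new IllegalStateException(INPUT_CLASS_KEY + " is not set");
        }
        Class<?> recordClass = Class.forName(className);
        Field schemaField = recordClass.getField("SCHEMA$");
        // The field is static, so no instance is needed to read it.
        return String.valueOf(schemaField.get(null));
    }
}
```

With this approach job.xml would carry only the class name, so a schema change
requires regenerating the record class but not editing each job file. With Avro
on the classpath, SpecificData.get().getSchema(recordClass) is another way to
obtain the schema from the class.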

>
>> and if you are using SpecificRecords and DataFiles the schema is available to
>> the code where necessary.
> I am not sure what you mean here. I am using SpecificRecords and would like to
> avoid specifying avro.input.schema, since this info is already here in the
> specific record.

Potentially the AvroMapper / AvroReducer could have a fall-back for
obtaining the schema if the property is not set: reflection on a class name
or an annotation.  If this looks like an enhancement request for Avro
(or a bug), please file a JIRA ticket.  Thanks!
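
To illustrate the fall-back order, here is a rough sketch, not existing Avro
behaviour. It assumes a hypothetical @InputRecord annotation on the mapper
class and uses stand-in classes (PageView, PageViewMapper) so that it compiles
without Avro or Hadoop. An explicit avro.input.schema property wins; otherwise
the schema is read from the annotated record class.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Map;

// Hypothetical annotation naming the record class a mapper consumes.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface InputRecord {
    Class<?> value();
}

// Stand-in for an Avro-generated record class; the real SCHEMA$ field
// has type org.apache.avro.Schema rather than String.
class PageView {
    public static final String SCHEMA$ =
        "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":[]}";
}

// Stand-in for a mapper implementation.
@InputRecord(PageView.class)
class PageViewMapper { }

class SchemaFallback {
    // Returns the configured schema if present, else the schema of the
    // record class named by the mapper's @InputRecord annotation.
    static String inputSchema(Map<String, String> conf, Class<?> mapperClass)
            throws ReflectiveOperationException {
        String explicit = conf.get("avro.input.schema");
        if (explicit != null) {
            return explicit;
        }
        InputRecord annotation = mapperClass.getAnnotation(InputRecord.class);
        if (annotation == null) {
            throw new IllegalStateException(
                "no avro.input.schema property and no @InputRecord on "
                + mapperClass.getName());
        }
        return String.valueOf(annotation.value().getField("SCHEMA$").get(null));
    }
}
```

Keeping the property as the first choice means existing job.xml files that
already spell out the schema would continue to work unchanged.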

>
> Thanks,
>
> Julien Muller
>
> 2011/10/10 Scott Carey <[EMAIL PROTECTED]>
>> I'm not all that familiar with how Oozie interacts with Avro.
>>
>> The Job must set its avro.input.schema and avro.output.schema properties;
>> this can be done in code (see the unit tests in the Avro mapred project for
>> examples), and if you are using SpecificRecords and DataFiles the schema is
>> available to the code where necessary.
>>
>>
>>
>> On 10/10/11 5:41 AM, "Julien Muller" <[EMAIL PROTECTED]> wrote:
>>
>>> Hello,
>>>
>>> I have been using avro with hadoop and oozie for months now and I am very
>>> happy with the results.
>>>
>>> The only point I see as a limitation now is that we specify Avro schemas in
>>> workflow.xml (job.xml):
>>> - avro.input.schema
>>> - avro.output.schema
>>> Since this info is already provided in Mapper/Reducer signatures, I see this
>>> as redundant. The schema is also present in all my serialized files, which
>>> means that the schema is specified in 3 different places.
>>>
>>> From an operations point of view, this is a pain, since any schema modification
>>> (let's say a simple optional field added) forces me to update many job
>>> files. This task is very error prone and since we have a large amount of
>>> jobs, it generates a lot of work.
>>>
>>> The only solution I see now would be to find/replace in the build script,
>>> but I hope I could find a better solution by providing some generic schemas
>>> to the job file, or find a way to deactivate schema validation in the job.
>>> Any help will be appreciated!
>>>
>>> --
>>> Julien Muller
>