Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Avro >> mail # user >> Avro mapred: How to avoid schema specification in job.xml?


Copy link to this message
-
Re: Avro mapred: How to avoid schema specification in job.xml?
On 10/10/11 11:41 AM, "Julien Muller" <[EMAIL PROTECTED]> wrote:

> Hello,
>
> Thanks for your answer, let me try to clarify my context a bit:
>
>> I'm not all that familiar with how Oozie interacts with Avro.
> Let's get oozie out of the picture. I use job.xml files to configure Jobs.
> This means I do not have any JobConf object and I cannot use AvroJob.
> Therefore I directly write the job properties (as what AvroJob outputs).
>
>> The Job must set its avro.input.schema and avro.output.schema properties ‹
>> this can be done in code (see the unit tests in the Avro mapred project for
>> examples),
> The solution I have now is basically based on the Avro mapred unit tests. But
> in my context, it is not an option to code (using the $SCHEMA property) at the
> job configuration level.
> where you code:
>     AvroJob.setInputSchema(job, Schema.create(Schema.Type.STRING));
> I have to copy the entire schema in job.xml file. And I have to update it
> every time my schema get updated.
> I hope I can find a better solution.

I suppose that in AvroJob we could transmit only the class name in a
property, and use that to look up the schema for generated classes using
reflection.  Could you do something similar?  I don't think it is possible
to avoid configuring at least some sort of pointer to where the schema is.
This could be via a property, or if you already have the job class, an
annotation on that class.

>
>> and if you are using SpecificRecords and DataFiles the schema is available to
>> the code where necessary.
> I am not sure what you mean here. I am using SpecificRecords and would like to
> avoid specifying avro.input.schema, since this info is already here in the
> specific record.

Potentially the AvroMapper / AvroReducer could have a fall-back for
obtaining the schema if the property is not set ‹ reflection on a class name
or an annotation .  If this looks like it is an enhancement request for Avro
(or a bug) please file a JIRA ticket.  Thanks!

>
> Thanks,
>
> Julien Muller
>
> 2011/10/10 Scott Carey <[EMAIL PROTECTED]>
>> I'm not all that familiar with how Oozie interacts with Avro.
>>
>> The Job must set its avro.input.schema and avro.output.schema properties ‹
>> this can be done in code (see the unit tests in the Avro mapred project for
>> examples), and if you are using SpecificRecords and DataFiles the schema is
>> available to the code where necessary.
>>
>>
>>
>> On 10/10/11 5:41 AM, "Julien Muller" <[EMAIL PROTECTED]> wrote:
>>
>>> Hello,
>>>
>>> I have been using avro with hadoop and oozie for months now and I am very
>>> happy with the results.
>>>
>>> The only point I see as a limitation now is that we specify avro schemes in
>>> workflow.xml (job.xml):
>>> - avro.input.schema
>>> - avro.output.schema
>>> Since this info is already provided in Mapper/Reducer signatures, I see this
>>> as redundant. The schema is also present in all my serialized files, which
>>> means that the schema is specified in 3 different places.
>>>
>>> From a run point of view, this is a pain, since any schema modification
>>> (let's say a simple optional field added) forces me to update many job
>>> files. This task is very error prone and since we have a large amount of
>>> jobs, it generates a lot of work.
>>>
>>> The only solution I see now would be to find/replace in the build script,
>>> but I hope I could find a better solution by providing some generic schemes
>>> to the job file, or find a way to deactivate schema validation in the job.
>>> Any help will be appreciated!
>>>
>>> --
>>> Julien Muller
>
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB