Avro, mail # user - Avro mapred: How to avoid schema specification in job.xml?


Julien Muller 2011-10-10, 12:41
Scott Carey 2011-10-10, 18:09
Re: Avro mapred: How to avoid schema specification in job.xml?
Julien Muller 2011-10-10, 18:41
Hello,

Thanks for your answer, let me try to clarify my context a bit:

> I'm not all that familiar with how Oozie interacts with Avro.
>
Let's take Oozie out of the picture. I use job.xml files to configure jobs,
which means I have no JobConf object and cannot use AvroJob.
I therefore write the job properties directly (the same ones AvroJob would set).

> The Job must set its avro.input.schema and avro.output.schema properties -
> this can be done in code (see the unit tests in the Avro mapred project for
> examples),
>
The solution I have now is essentially the one from the Avro mapred unit tests.
But in my context, writing code at the job configuration level (using the
SCHEMA$ field) is not an option.
Where you would write
    AvroJob.setInputSchema(job, Schema.create(Schema.Type.STRING));
in code, I have to copy the entire schema into the job.xml file, and I have to
update it every time my schema changes.
I hope to find a better solution.
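
To make the build-script route from the original message concrete, here is a minimal Java sketch that generates the property element from a schema file at build time, so the schema is never pasted into job.xml by hand. The class name SchemaPropertyWriter, its command-line arguments, and the demo schema are assumptions made for this illustration. None of them are part of Avro or Hadoop. Only the property names and the property/name/value layout of Hadoop configuration files are taken as given.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical build-time helper; not part of Avro or Hadoop.
public class SchemaPropertyWriter {

    // Escape the characters that are significant in XML element content.
    public static String escapeXml(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    // Render one <property> element in the Hadoop configuration file layout.
    public static String toProperty(String name, String schemaJson) {
        return "<property>\n"
             + "  <name>" + escapeXml(name) + "</name>\n"
             + "  <value>" + escapeXml(schemaJson.trim()) + "</value>\n"
             + "</property>\n";
    }

    // Assumed usage: java SchemaPropertyWriter avro.input.schema path/to/record.avsc
    public static void main(String[] args) throws IOException {
        String name = "avro.input.schema";
        String schemaJson = "\"string\""; // demo schema used when no arguments are given
        if (args.length == 2) {
            name = args[0];
            byte[] bytes = Files.readAllBytes(Paths.get(args[1]));
            schemaJson = new String(bytes, StandardCharsets.UTF_8);
        }
        System.out.print(toProperty(name, schemaJson));
    }
}
```

The build would run this once per schema property and splice the output into the job.xml template. The schema is still duplicated in the generated file, but it is no longer maintained by hand.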

> and if you are using SpecificRecords and DataFiles the schema is available
> to the code where necessary.
>
I am not sure what you mean here. I am using SpecificRecords and would like
to avoid specifying avro.input.schema, since this information is already
available in the specific record class.
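
To illustrate what "already in the specific record" means: Avro-generated classes expose their schema as a public static SCHEMA$ field, so any code that runs before submission could copy it into the job properties instead of reading it from job.xml. The sketch below makes several assumptions. ExampleRecord is a hypothetical stand-in whose SCHEMA$ is a plain String, so the example runs without Avro on the classpath. java.util.Properties stands in for the job configuration. SpecificSchemaLookup is an invented name.

```java
import java.lang.reflect.Field;
import java.util.Properties;

// Hypothetical helper; not part of Avro or Hadoop.
public class SpecificSchemaLookup {

    // Stand-in for an Avro-generated class. Real generated classes declare
    // SCHEMA$ as an org.apache.avro.Schema; a String is used here only so
    // that the sketch is self-contained.
    public static class ExampleRecord {
        public static final String SCHEMA$ = "\"string\"";
    }

    // Read the static SCHEMA$ field reflectively and return its string form.
    public static String schemaJsonOf(Class<?> recordClass) {
        try {
            Field field = recordClass.getField("SCHEMA$");
            return String.valueOf(field.get(null)); // null target: static field
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(
                "No accessible SCHEMA$ field on " + recordClass.getName(), e);
        }
    }

    // Fill the two schema properties from the record classes.
    public static Properties withSchemas(Class<?> input, Class<?> output) {
        Properties props = new Properties();
        props.setProperty("avro.input.schema", schemaJsonOf(input));
        props.setProperty("avro.output.schema", schemaJsonOf(output));
        return props;
    }

    public static void main(String[] args) {
        Properties props = withSchemas(ExampleRecord.class, ExampleRecord.class);
        System.out.println(props.getProperty("avro.input.schema"));
    }
}
```

This only helps where some code can run before the job is submitted, which is exactly the hook that a pure job.xml setup lacks.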

Thanks,

Julien Muller

2011/10/10 Scott Carey <[EMAIL PROTECTED]>

> I'm not all that familiar with how Oozie interacts with Avro.
>
> The Job must set its avro.input.schema and avro.output.schema properties —
> this can be done in code (see the unit tests in the Avro mapred project for
> examples), and if you are using SpecificRecords and DataFiles the schema is
> available to the code where necessary.
>
>
>
> On 10/10/11 5:41 AM, "Julien Muller" <[EMAIL PROTECTED]> wrote:
>
> Hello,
>
> I have been using Avro with Hadoop and Oozie for months now and I am very
> happy with the results.
>
> The only limitation I see now is that we specify Avro schemas in
> workflow.xml (job.xml):
> - avro.input.schema
> - avro.output.schema
> Since this info is already provided in Mapper/Reducer signatures, I see
> this as redundant. The schema is also present in all my serialized files,
> which means that the schema is specified in 3 different places.
>
> From an operational point of view, this is a pain, since any schema
> modification (say, adding a simple optional field) forces me to update many
> job files. This task is very error-prone, and since we have a large number
> of jobs, it generates a lot of work.
>
> The only solution I see now would be a find/replace in the build script,
> but I hope to find a better one, either by providing some generic schemas
> to the job file or by finding a way to deactivate schema validation in the
> job. Any help will be appreciated!
>
> --
> Julien Muller
>
>
Scott Carey 2011-10-10, 21:54
Julien Muller 2011-10-11, 10:07