[Avro user] Avro mapred: How to avoid schema specification in job.xml?


Earlier messages in this thread (quoted in full below):
  Julien Muller   2011-10-10, 12:41
  Scott Carey     2011-10-10, 18:09

Re: Avro mapred: How to avoid schema specification in job.xml?
Hello,

Thanks for your answer; let me try to clarify my context a bit:

> I'm not all that familiar with how Oozie interacts with Avro.
Let's get Oozie out of the picture: I use job.xml files to configure jobs,
which means I do not have a JobConf object and cannot use AvroJob. So I
write the job properties directly (the same properties AvroJob would set).
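
Concretely, the job.xml ends up with entries roughly like this (the
property names are the ones AvroJob uses, but the schema below is a
simplified placeholder, not my real record):

    <!-- sketch of the relevant job.xml entries; the schema is a
         simplified placeholder -->
    <property>
      <name>avro.input.schema</name>
      <value>{"type":"record","name":"MyRecord","fields":[
        {"name":"id","type":"long"},
        {"name":"value","type":["null","string"]}]}</value>
    </property>
    <property>
      <name>avro.output.schema</name>
      <value>{"type":"record","name":"MyRecord","fields":[
        {"name":"id","type":"long"},
        {"name":"value","type":["null","string"]}]}</value>
    </property>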

> The Job must set its avro.input.schema and avro.output.schema properties —
> this can be done in code (see the unit tests in the Avro mapred project for
> examples),
The solution I have now is basically based on the Avro mapred unit tests,
where the code does:

    AvroJob.setInputSchema(job, Schema.create(Schema.Type.STRING));

But in my context, setting the schema in code (using the $SCHEMA property)
at the job configuration level is not an option. I have to copy the entire
schema into the job.xml file and update it every time my schema gets
updated. I hope I can find a better solution.
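
For completeness, this is roughly what the code-level configuration from
the unit tests looks like; the class names are placeholders, and it is this
style of setup I cannot use since I only have job.xml:

    import org.apache.avro.Schema;
    import org.apache.avro.mapred.AvroJob;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class SchemaInCodeDriver {
      public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(SchemaInCodeDriver.class);
        job.setJobName("avro-schema-configured-in-code");

        // These two calls are what end up as the avro.input.schema and
        // avro.output.schema job properties.
        AvroJob.setInputSchema(job, Schema.create(Schema.Type.STRING));
        AvroJob.setOutputSchema(job, Schema.create(Schema.Type.STRING));

        // Placeholders for the job's real AvroMapper / AvroReducer:
        // AvroJob.setMapperClass(job, MyAvroMapper.class);
        // AvroJob.setReducerClass(job, MyAvroReducer.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        JobClient.runJob(job);
      }
    }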

> and if you are using SpecificRecords and DataFiles the schema is available
> to the code where necessary.
I am not sure what you mean here. I am using SpecificRecords and would like
to avoid specifying avro.input.schema, since this information is already
available in the specific record.
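
A sketch of the kind of thing I mean, assuming MyRecord is one of my
generated SpecificRecord classes (this is just the idea, not something I
have working today):

    // The Avro compiler puts the schema on the generated class as a
    // static SCHEMA$ field, so it could be fed to AvroJob directly
    // instead of being duplicated in job.xml. MyRecord is a placeholder.
    Schema schema = MyRecord.SCHEMA$;
    AvroJob.setInputSchema(job, schema);
    AvroJob.setOutputSchema(job, schema);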

Thanks,

Julien Muller

2011/10/10 Scott Carey <[EMAIL PROTECTED]>

> I'm not all that familiar with how Oozie interacts with Avro.
>
> The Job must set its avro.input.schema and avro.output.schema properties —
> this can be done in code (see the unit tests in the Avro mapred project for
> examples), and if you are using SpecificRecords and DataFiles the schema is
> available to the code where necessary.
>
>
>
> On 10/10/11 5:41 AM, "Julien Muller" <[EMAIL PROTECTED]> wrote:
>
> Hello,
>
> I have been using avro with hadoop and oozie for months now and I am very
> happy with the results.
>
> The only point I see as a limitation now is that we specify Avro schemas in
> workflow.xml (job.xml):
> - avro.input.schema
> - avro.output.schema
> Since this info is already provided in Mapper/Reducer signatures, I see
> this as redundant. The schema is also present in all my serialized files,
> which means that the schema is specified in 3 different places.
>
> From an operational point of view, this is a pain, since any schema
> modification (say, a simple optional field added) forces me to update many
> job files. This task is very error prone and, since we have a large number
> of jobs, it generates a lot of work.
>
> The only solution I see now would be to find/replace in the build script,
> but I hope to find a better solution, either by providing some generic
> schemas to the job file or by finding a way to deactivate schema validation
> in the job.
> Any help will be appreciated!
>
> --
> Julien Muller
>
>
Later messages in this thread:
  Scott Carey     2011-10-10, 21:54
  Julien Muller   2011-10-11, 10:07