Avro >> mail # user >> Hi,all. How can I involve two avro files with different schema into one M/R job?


幻 2011-03-18, 03:13
Doug Cutting 2011-03-18, 16:51
Re: Hi,all. How can I involve two avro files with different schema into one M/R job?
Doug,

Would it help if the provided JSON schemas were added to the JobConf
with the given path(s) as a prefix to the key used to retrieve them?
This would help when using MultipleInputs and such (though it may get
complicated if globs are involved).
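
To make the idea concrete, here is a rough Python sketch. It assumes a plain dict standing in for the JobConf. The key suffix and both helper names are made up for illustration and are not existing Avro or Hadoop APIs. Globs are not handled at all; lookup just walks up the parent directories of a file until it finds a registered input path.

```python
import json
import posixpath

# Hypothetical key suffix; Avro/Hadoop define no such configuration key.
KEY_SUFFIX = "#avro.input.schema"


def add_input_schema(conf, input_path, schema):
    """Store a schema (as JSON text) under a key prefixed by the input path."""
    normalized = input_path.rstrip("/") or "/"
    conf[normalized + KEY_SUFFIX] = json.dumps(schema)


def schema_for_file(conf, file_path):
    """Return the schema registered for the nearest enclosing input path.

    Walks from the file path up through its parent directories and
    returns None when no registered path encloses the file.
    """
    current = file_path
    while True:
        raw = conf.get(current + KEY_SUFFIX)
        if raw is not None:
            return json.loads(raw)
        parent = posixpath.dirname(current)
        if parent == current:  # reached "/" or "" with no match
            return None
        current = parent


# Example data: the two writer schemas from the quoted message below.
SCHEMA_A = {"type": "record", "name": "a.A",
            "fields": [{"name": "foo", "type": "int"}]}
SCHEMA_B = {"type": "record", "name": "b.B",
            "fields": [{"name": "bar", "type": "int"}]}
```

A mapper could then call something like schema_for_file(conf, path_of_current_split) to pick the schema for its input, assuming it can find out which file its split came from.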

On Fri, Mar 18, 2011 at 10:21 PM, Doug Cutting <[EMAIL PROTECTED]> wrote:
> On 03/17/2011 08:13 PM, 幻 wrote:
>>      Currently, I have two Avro files with different schemas. I found that
>> I have to set the schema before running an M/R job if the files are in
>> Avro format. But the schemas of the files are probably not the same. How
>> can I do that without setting the schema before running a job? Thanks.
>
> The schema you set for the job is the reader's schema.  The schema in
> the input files is the writer's schema and need not match this exactly.  It
> will be projected to the reader's schema, as described in the
> specification, particularly in the "Schema Resolution" section.
>
> http://avro.apache.org/docs/current/spec.html#Schema+Resolution
>
> The aliases section is also relevant:
>
> http://avro.apache.org/docs/current/spec.html#Aliases
>
> This can be used to extract fields from different schemas into a common
> data structure.  For example, if your input files use the following two
> schemas:
>
> {"type":"record", "name":"a.A", "fields":[{"name":"foo", "type":"int"}]}
> {"type":"record", "name":"b.B", "fields":[{"name":"bar", "type":"int"}]}
>
> then the following record can read both:
>
> {"type":"record", "name":"my.MapInput",
>  "aliases":["a.A","b.B"],
>  "fields":[{"name":"x", "type":"int", "aliases":["foo","bar"]}]
> }
>
> The reader's schema can thus include a common subset of fields in
> inputs.  It can map fields of compatible types that are named
> differently to a common field.  It can include fields that are not in
> all inputs, so long as they have a default value in the reader's schema.
>  It could include all data from all inputs, e.g., in the above case:
>
> {"type":"record", "name":"my.MapInput",
>  "aliases":["a.A","b.B"],
>  "fields":[
>   {"name":"foo", "type":"int", "default": -1},
>   {"name":"bar", "type":"int", "default": -1}
>  ]
> }
>
> So there's a fair amount of flexibility available.
>
> Doug
>
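
For anyone who wants to see the alias and default rules from the quoted message in action, here is a small Python sketch. It is a deliberately simplified stand-in that works on plain dicts and does not use the Avro library. The function name `resolve` is hypothetical, and it ignores namespaces, type promotion, and nested types, all of which the real schema resolution handles.

```python
def resolve(record, writer_schema, reader_schema):
    """Project a record written with writer_schema onto reader_schema.

    Simplified illustration only: record names and field names are
    matched by name or alias, and missing fields fall back to the
    reader's default.
    """
    accepted = {reader_schema["name"], *reader_schema.get("aliases", [])}
    if writer_schema["name"] not in accepted:
        raise ValueError("reader cannot read " + writer_schema["name"])

    writer_fields = {f["name"] for f in writer_schema["fields"]}
    result = {}
    for field in reader_schema["fields"]:
        candidates = [field["name"], *field.get("aliases", [])]
        match = next((n for n in candidates if n in writer_fields), None)
        if match is not None:
            result[field["name"]] = record[match]
        elif "default" in field:
            result[field["name"]] = field["default"]
        else:
            raise ValueError("no value or default for " + field["name"])
    return result


# Writer schemas and reader schemas copied from the quoted message.
A = {"type": "record", "name": "a.A",
     "fields": [{"name": "foo", "type": "int"}]}
B = {"type": "record", "name": "b.B",
     "fields": [{"name": "bar", "type": "int"}]}
READER_COMMON = {"type": "record", "name": "my.MapInput",
                 "aliases": ["a.A", "b.B"],
                 "fields": [{"name": "x", "type": "int",
                             "aliases": ["foo", "bar"]}]}
READER_ALL = {"type": "record", "name": "my.MapInput",
              "aliases": ["a.A", "b.B"],
              "fields": [{"name": "foo", "type": "int", "default": -1},
                         {"name": "bar", "type": "int", "default": -1}]}
```

With READER_COMMON, records from either input end up with a single field x. With READER_ALL, each record keeps its own field and the other one is filled with the default of -1.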

--
Harsh J
http://harshj.com
Doug Cutting 2011-03-18, 18:08
Harsh J 2011-03-18, 18:31
Doug Cutting 2011-03-18, 19:59
幻 2011-03-21, 02:15
幻 2011-03-21, 02:22