-record schema names... a nuisance?
Koert Kuipers 2012-10-20, 16:52
We are on a fairly old Avro (1.5.4), so I'm not sure my observations apply to newer versions. I noticed that when I read from Avro files in Hadoop, it does not require the reader's schema (fully qualified) name to equal the writer's schema (fully qualified) name. This lets me read from files without knowing what name the schema had when it was written. According to Doug Cutting this is a bug: the read should not succeed if the reader's and writer's schemas do not have the same name. Also, when the schema names are not the same, field aliases do not work.
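For illustration, record-level aliases are how Avro is meant to bridge a renamed schema: the reader declares the writer's full name as an alias on the record itself. A sketch of such a reader schema (all names here are made up):

```json
{
  "type": "record",
  "name": "Event",
  "namespace": "com.mycompany.canonical",
  "aliases": ["com.teamx.ClickEvent", "legacy.Event"],
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "ts", "type": "long", "aliases": ["timestamp"]}
  ]
}
```

Note that the field-level alias on ts is only applied once the record names (or a record-level alias) match, which is exactly the problem described above.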
OK, with that out of the way, here is my situation: we create lots of Avro files that we add to large partitioned tables (a structure with subdirectories on HDFS). The people who write the files understand the importance of canonical column names (field names), but not everyone gets the idea of schema names, so I generally end up with Avro files that have different (writer's) schema names. I do not expect I can correct this. It is also not unusual to run a Hadoop map-reduce job that reads from many different data sources at once, using Avro's fantastic projection ability to extract just a few columns. In that case, again, the (writer's) schema names cannot be expected to be the same across the Avro files I am reading from.
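For context, a projection reader schema is just the canonical record with fewer fields; Avro's schema resolution silently drops writer fields that the reader does not mention. A hypothetical projection of the record above:

```json
{
  "type": "record",
  "name": "Event",
  "namespace": "com.mycompany.canonical",
  "fields": [
    {"name": "user_id", "type": "string"}
  ]
}
```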
So today all of this works, meaning I can run map-reduce jobs across all these files with different/inconsistent schema names, but only thanks to a bug, which makes me nervous that one day it will stop working. Field aliases also do not work, which is a real limitation. So I am trying to come up with a better solution. Of course I could find out, every time, what all the schema names in the Avro files are, and add them all as aliases to my reader's schema. But that is a real pain, in particular since the set is not constant. I guess I could automate this by scanning all the Avro files first and extracting their schemas, but that sounds very inelegant, so I would rather not do it.
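For what it's worth, the scan-and-alias workaround can be sketched. Assuming the writer schemas have already been extracted as JSON (e.g. by reading the headers of the Avro files), the aliasing step itself is plain JSON manipulation; all the schema names below are hypothetical:

```python
import json

def full_name(schema):
    """Return the fully qualified name of an Avro record schema dict."""
    name = schema["name"]
    if "." in name:  # name may already be fully qualified
        return name
    ns = schema.get("namespace")
    return "%s.%s" % (ns, name) if ns else name

def add_writer_aliases(reader_schema, writer_schemas):
    """Add every writer's full name as a record-level alias on the reader schema."""
    aliases = set(reader_schema.get("aliases", []))
    reader_full = full_name(reader_schema)
    for ws in writer_schemas:
        wn = full_name(ws)
        if wn != reader_full:  # no alias needed when names already match
            aliases.add(wn)
    reader_schema["aliases"] = sorted(aliases)
    return reader_schema

# hypothetical reader and writer schemas for demonstration
reader = {"type": "record", "name": "Event",
          "namespace": "com.mycompany.canonical",
          "fields": [{"name": "user_id", "type": "string"}]}
writers = [
    {"type": "record", "name": "ClickEvent", "namespace": "com.teamx",
     "fields": [{"name": "user_id", "type": "string"}]},
    {"type": "record", "name": "Event", "namespace": "legacy",
     "fields": [{"name": "user_id", "type": "string"}]},
]
print(json.dumps(add_writer_aliases(reader, writers)["aliases"]))
```

This does not remove the ugliness of scanning the files first, but it keeps the reader schema itself canonical and regenerates the alias list whenever the set of sources changes.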
So I have two questions:
1) Can I reasonably assume that processing in Hadoop will continue to work even if the reader's and writer's schema names are not the same (i.e., rely on this bug)? The fact that field aliases do not work in this case is too bad, but at least I have something working...
2) Is there a better solution? For example, something where I could say in my reader's schema that the schema has an alias of * (wildcard), so that I could read from all these files with different (writer's) schema names without relying on a bug, and field aliases would also work? That would be fantastic...