
Generic Avro Classification and Deserialization

I am involved in a project to extract data from a large number of files (to
be provided at some point) in numerous formats, among which are some Avro
files (both binary and JSON-encoded), and I am looking for the best way to
tackle this.

One of the things we would (ideally) like to do is auto-classify the data
generically, i.e. read a few lines or bytes off a file and be able to tell
what kind of format it is.

This is fairly easy to do with, say, (non-Avro) JSON files, but I am not
sure how this would be done for Avro.

For one thing, there is the necessity of a Schema, about which the
documentation says that

   - "Avro data is always serialized with its schema. Files that store Avro
   data should always also include the schema for that data in the same file."

However, the Java code examples posted on the project website imply that
the schema is supplied as a separate file, and I am not sure whether this
applies only to RPC.

Are there any code examples for detecting the encoding format (binary/json)
of the data file, assessing whether there is a schema embedded in it and
extracting that schema?