Avro >> mail # user >> Generic Avro Classification and Deserialization


Re: Generic Avro Classification and Deserialization
Avro data files start with a 4-byte "magic" prefix, "Obj" followed by the byte 0x01; checking for it might help.
The schema is always embedded in the Avro file, in the header's "meta" field (under the "avro.schema" key).
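
A rough sketch of that check, in Python and using only the standard library, might look like the following. The function names (read_long, read_header_meta, embedded_schema) are made up for illustration and are not part of any Avro library; the sketch assumes the usual container layout of magic bytes, then a metadata map of string keys to byte values, then a 16-byte sync marker. It only recognizes Avro container files; anything without the magic prefix is simply reported as "not an Avro data file".

```python
import io
import json

AVRO_MAGIC = b"Obj\x01"  # 'O', 'b', 'j', then the single byte 0x01


def read_long(stream):
    """Decode one Avro long (zigzag-encoded, variable-length integer)."""
    shift = 0
    accum = 0
    while True:
        raw = stream.read(1)
        if not raw:
            raise EOFError("stream ended inside a variable-length integer")
        byte = raw[0]
        accum |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of this integer
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)  # undo zigzag encoding


def read_bytes(stream):
    """Read a length-prefixed byte string (Avro 'bytes' and 'string')."""
    length = read_long(stream)
    if length < 0:
        raise ValueError("negative length in header")
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("stream ended inside a byte string")
    return data


def read_header_meta(stream):
    """Return the header metadata as a dict, or None if the magic is absent."""
    if stream.read(4) != AVRO_MAGIC:
        return None
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count terminates the map
            break
        if count < 0:  # negative count: a block byte size follows
            count = -count
            read_long(stream)  # the byte size is not needed here
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta


def embedded_schema(stream):
    """Return the embedded schema as parsed JSON, or None if there is none."""
    meta = read_header_meta(stream)
    if meta is None or "avro.schema" not in meta:
        return None
    return json.loads(meta["avro.schema"].decode("utf-8"))
```

To try it, open a file in binary mode and pass the file object to embedded_schema, e.g. open("data.avro", "rb") where the path is a placeholder. For real work, the data-file reader that ships with the Avro library is probably the better choice, since it also handles the data blocks that follow the header.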
On Thu, Jan 17, 2013 at 2:11 PM, Public Network Services <
[EMAIL PROTECTED]> wrote:

> Folks,
>
> I am involved in a project to extract data from a large number of files
> (to be provided at some point), in numerous formats, among which are some
> Avro files (both binary and JSON-encoded), and thus I am looking for the
> best way to tackle this.
>
> One of the things we would (ideally) like to do is auto-classify the data
> generically, i.e. read a few lines or bytes off a file and be able to tell
> what kind of format it is.
>
> This is fairly easy to do with, say, (non-Avro) JSON files, but I am not
> sure how this would be done for Avro.
>
> For one thing, there is the necessity of a Schema, about which the
> documentation says that
>
>    - "Avro data is always serialized with its schema. Files that store
>    Avro data should always also include the schema for that data in the same
>    file."
>
> However, the Java code examples posted on the project website imply that
> the Schema is supplied as a separate file and I am not sure whether this is
> only the case with RPC.
>
> Are there any code examples for detecting the encoding format
> (binary/json) of the data file, assessing whether there is a schema
> embedded in it and extracting that schema?
>
> Thanks!
>