Avro, mail # user - Generic Avro Classification and Deserialization


Public Network Services 2013-01-17, 22:11
Re: Generic Avro Classification and Deserialization
Miki Tebeka 2013-01-18, 18:49
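Avro container files start with a 4-byte "magic" prefix, "Obj\x01" (the ASCII
characters "Obj" followed by the byte 0x01); this might help.
The schema is always embedded in the Avro file: it is stored in the header's
"meta" map, under the "avro.schema" key.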
On Thu, Jan 17, 2013 at 2:11 PM, Public Network Services <
[EMAIL PROTECTED]> wrote:

> Folks,
>
> I am involved in a project to extract data from a large number of files
> (to be provided at some point), in numerous formats, among which are some
> Avro files (both binary and JSON-encoded), and thus I am looking for the
> best way to tackle this.
>
> One of the things we would (ideally) like to do is auto-classify the data
> generically, i.e. read a few lines or bytes off a file and be able to tell
> what kind of format it is.
>
> This is fairly easy to do with, say, (non-Avro) JSON files, but I am not
> sure how this would be done for Avro.
>
> For one thing, there is the necessity of a Schema, about which the
> documentation says that
>
>    - "Avro data is always serialized with its schema. Files that store
>    Avro data should always also include the schema for that data in the same
>    file."
>
> However, the Java code examples posted on the project website imply that
> the Schema is supplied as a separate file and I am not sure whether this is
> only the case with RPC.
>
> Are there any code examples for detecting the encoding format
> (binary/json) of the data file, assessing whether there is a schema
> embedded in it and extracting that schema?
>
> Thanks!
>
Public Network Services 2013-01-18, 23:48
Terry Healy 2013-01-18, 14:53
Public Network Services 2013-01-19, 00:46
Miki Tebeka 2013-01-19, 04:36