Re: Generic Avro Classification and Deserialization
You mean "Avro binary files", yes?

What about Avro JSON files? Would there be a trick to assess whether such a
file is Avro and not generic JSON?
On Fri, Jan 18, 2013 at 10:49 AM, Miki Tebeka <[EMAIL PROTECTED]> wrote:

> Avro files have a 4-byte "magic" prefix: the characters "Obj" followed by
> the byte 0x01. This might help.
> The schema is always embedded in the Avro file, in the header's "meta"
> field under the "avro.schema" key.
>
>
> On Thu, Jan 17, 2013 at 2:11 PM, Public Network Services <
> [EMAIL PROTECTED]> wrote:
>
>> Folks,
>>
>> I am involved in a project to extract data from a large number of files
>> (to be provided at some point) in numerous formats, including some Avro
>> files (both binary and JSON-encoded), so I am looking for the best way to
>> tackle this.
>>
>> One of the things we would (ideally) like to do is auto-classify the data
>> generically, i.e. read a few lines or bytes off a file and be able to tell
>> what kind of format it is.
>>
>> This is fairly easy to do with, say, (non-Avro) JSON files, but I am not
>> sure how this would be done for Avro.
>>
>> For one thing, there is the necessity of a Schema, about which the
>> documentation says that
>>
>>    - "Avro data is always serialized with its schema. Files that store
>>    Avro data should always also include the schema for that data in the same
>>    file."
>>
>> However, the Java code examples posted on the project website imply that
>> the Schema is supplied as a separate file and I am not sure whether this is
>> only the case with RPC.
>>
>> Are there any code examples for detecting the encoding format
>> (binary/json) of the data file, assessing whether there is a schema
>> embedded in it and extracting that schema?
>>
>> Thanks!
>>
>
>
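
The magic-prefix check and schema extraction discussed in this thread can be sketched roughly as follows. This is a minimal, standard-library-only sketch, not the Avro library's own API. It assumes the object container file layout described in the Avro specification: the 4 bytes "Obj" plus 0x01, then a metadata map of string keys to byte values that holds the schema under "avro.schema". The function names (read_long, read_bytes, read_header_metadata, sniff_avro_container) are made up for illustration. It applies to binary container files only; it does not detect JSON-encoded Avro data.

```python
import io
import json

AVRO_MAGIC = b"Obj\x01"  # 'O', 'b', 'j', then the single byte 0x01


def read_long(stream):
    """Decode one Avro long (variable-length, zig-zag encoded)."""
    result = 0
    shift = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("truncated Avro long")
        value = byte[0]
        result |= (value & 0x7F) << shift
        if not value & 0x80:  # high bit clear: last byte of this number
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)  # undo zig-zag


def read_bytes(stream):
    """Decode a length-prefixed byte string (also used for map keys)."""
    length = read_long(stream)
    if length < 0:
        raise ValueError("negative length in Avro header")
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("truncated Avro header")
    return data


def read_header_metadata(stream):
    """Decode the header metadata map (string keys, bytes values)."""
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero-count block terminates the map
            return meta
        if count < 0:  # negative count: a byte size follows; not needed here
            count = -count
            read_long(stream)
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)


def sniff_avro_container(stream):
    """Return the embedded schema (parsed JSON) if the stream starts with
    the Avro magic prefix, or None if the prefix is absent.

    May raise EOFError or ValueError if the prefix matches but the
    header that follows is damaged.
    """
    if stream.read(4) != AVRO_MAGIC:
        return None
    schema = read_header_metadata(stream).get("avro.schema")
    return json.loads(schema) if schema is not None else None
```

To try it on a real file, open the file in binary mode and pass the file object, for example open(path, "rb") with a path of your own. Only the first few bytes are read when the prefix does not match, so the check is cheap to run over a large number of files.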