Avro >> mail # user >> Generic Avro Classification and Deserialization


Public Network Services 2013-01-17, 22:11
Miki Tebeka 2013-01-18, 18:49
Public Network Services 2013-01-18, 23:48
Terry Healy 2013-01-18, 14:53
Re: Generic Avro Classification and Deserialization
Thanks for the help!

I am trying to find sample Avro files and it turns out to be surprisingly
difficult (at least via the Google searches I tried).

Do you know of any such files (preferably large-ish) that are publicly available?
On Fri, Jan 18, 2013 at 6:53 AM, Terry Healy <[EMAIL PROTECTED]> wrote:

> Check out avro-tools. With this you can dump the schema for a file,
> extract the metadata, or export it in several formats:
>
> ----------------
> Available tools:
>       compile  Generates Java code for the given schema.
>    fragtojson  Renders a binary-encoded Avro datum as JSON.
>      fromjson  Reads JSON records and writes an Avro data file.
>      fromtext  Imports a text file into an avro data file.
>       getmeta  Prints out the metadata of an Avro data file.
>     getschema  Prints out schema of an Avro data file.
>           idl  Generates a JSON schema from an Avro IDL file.
>        induce  Induce schema/protocol from Java class/interface via reflection.
>    jsontofrag  Renders a JSON-encoded Avro datum as binary.
>       recodec  Alters the codec of a data file.
>    rpcreceive  Opens an RPC Server and listens for one message.
>       rpcsend  Sends a single RPC message.
>        tether  Run a tethered mapreduce job.
>        tojson  Dumps an Avro data file as JSON, one record per line.
>        totext  Converts an Avro data file to a text file.
>   trevni_meta  Dumps a Trevni file's metadata as JSON.
> trevni_random  Create a Trevni file filled with random instances of a schema.
> trevni_tojson  Dumps a Trevni file as JSON.
>
> -Terry
>
> On 01/17/2013 05:11 PM, Public Network Services wrote:
> > Folks,
> >
> > I am involved in a project to extract data from a large number of files
> > (to be provided at some point) in numerous formats, among which are some
> > Avro files (both binary and JSON-encoded), and I am looking for the
> > best way to tackle this.
> >
> > One of the things we would (ideally) like to do is auto-classify the
> > data generically, i.e. read a few lines or bytes off a file and be able
> > to tell what kind of format it is.
> >
> > This is fairly easy to do with, say, (non-Avro) JSON files, but I am not
> > sure how this would be done for Avro.
> >
> > For one thing, there is the necessity of a Schema, about which the
> > documentation says that
> >
> >   * "Avro data is always serialized with its schema. Files that store
> >     Avro data should always also include the schema for that data in the
> >     same file."
> >
> > However, the Java code examples posted on the project website imply that
> > the Schema is supplied as a separate file and I am not sure whether this
> > is only the case with RPC.
> >
> > Are there any code examples for detecting the encoding format
> > (binary/json) of the data file, assessing whether there is a schema
> > embedded in it and extracting that schema?
> >
> > Thanks!
>
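
For the detection question in the original post, here is a minimal Python sketch, not taken from the thread and not using the Avro library. It relies on the object container file layout from the Avro specification: the 4 magic bytes "Obj" followed by 0x01, then a metadata map that holds the schema under the key "avro.schema". The function names (read_long, read_container_metadata, classify) are made up for illustration. It only recognizes container files; a JSON-encoded datum is ordinary JSON with no embedded schema, so it would have to be classified as JSON and matched against a schema supplied separately.

```python
import io
import json

AVRO_MAGIC = b"Obj\x01"  # first 4 bytes of an Avro object container file


def read_long(stream):
    """Decode one Avro long (variable-length, zigzag-encoded)."""
    shift = 0
    accum = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("truncated Avro long")
        accum |= (byte[0] & 0x7F) << shift
        if not (byte[0] & 0x80):  # high bit clear: last byte of the long
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)  # undo zigzag encoding


def read_bytes(stream):
    """Decode a length-prefixed byte string."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("truncated Avro bytes value")
    return data


def read_container_metadata(stream):
    """Return the header metadata as a dict, or None if the magic is absent."""
    if stream.read(4) != AVRO_MAGIC:
        return None
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count terminates the map
            break
        if count < 0:  # negative count: a block byte size follows
            count = -count
            read_long(stream)  # byte size is not needed here
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta


def classify(stream):
    """Return (kind, schema); schema is the parsed JSON schema or None."""
    meta = read_container_metadata(stream)
    if meta is None:
        return "not-avro-container", None
    schema = meta.get("avro.schema")
    if schema is None:
        return "avro-container", None
    return "avro-container", json.loads(schema.decode("utf-8"))


# Hand-built header for demonstration: magic, a one-entry metadata map
# holding the schema "string", the map terminator, and a 16-byte sync marker.
sample = (
    AVRO_MAGIC
    + b"\x02"                    # map block with 1 entry (zigzag of 1)
    + b"\x16" + b"avro.schema"   # key, length 11 (zigzag of 11 is 22)
    + b"\x10" + b'"string"'      # value, length 8 (zigzag of 8 is 16)
    + b"\x00"                    # end of map
    + bytes(16)                  # sync marker (all zeros here)
)
print(classify(io.BytesIO(sample)))
print(classify(io.BytesIO(b'{"name": "plain json"}')))
```

On a real file, open it in binary mode and pass the file object to classify; only the header is read, so the size of the file does not matter. The codec, if any, is stored in the same metadata map under "avro.codec".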
Miki Tebeka 2013-01-19, 04:36