Re: Generic Avro Classification and Deserialization
Terry Healy 2013-01-18, 14:53
Check out avro-tools. With this you can dump the schema for a file,
extract the metadata, or export it in several formats:
  compile        Generates Java code for the given schema.
  fragtojson     Renders a binary-encoded Avro datum as JSON.
  fromjson       Reads JSON records and writes an Avro data file.
  fromtext       Imports a text file into an avro data file.
  getmeta        Prints out the metadata of an Avro data file.
  getschema      Prints out schema of an Avro data file.
  idl            Generates a JSON schema from an Avro IDL file
  induce         Induce schema/protocol from Java class/interface via
  jsontofrag     Renders a JSON-encoded Avro datum as binary.
  recodec        Alters the codec of a data file.
  rpcreceive     Opens an RPC Server and listens for one message.
  rpcsend        Sends a single RPC message.
  tether         Run a tethered mapreduce job.
  tojson         Dumps an Avro data file as JSON, one record per line.
  totext         Converts an Avro data file to a text file.
  trevni_meta    Dumps a Trevni file's metadata as JSON.
  trevni_random  Create a Trevni file filled with random instances of a
  trevni_tojson  Dumps a Trevni file as JSON.
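
If you would rather do the check in your own code than shell out to the tools, here is a rough sketch of one way to go about it. It relies on the container file layout in the Avro spec: a data file starts with the four bytes 'O', 'b', 'j', 1, followed by a metadata map whose "avro.schema" entry holds the schema as JSON. The class name AvroSniffer, its method names, and the 16 MB length cap are my own placeholders, not part of Avro, and the sketch uses only the JDK so it does not validate the schema or read any records. With the Avro Java library on the classpath, DataFileReader.getSchema() should give you the same schema with far less code.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper (not part of Avro): sniffs a stream for an Avro
// object container header and pulls out the embedded schema JSON.
class AvroSniffer {
    // Container files begin with the bytes 'O', 'b', 'j', 1.
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};
    // Assumed sanity cap on a single key/value length, to reject garbage input.
    private static final long MAX_LEN = 16L * 1024 * 1024;

    // Reads one Avro long: variable-length, zig-zag encoded.
    static long readLong(InputStream in) throws IOException {
        long raw = 0;
        int shift = 0;
        while (true) {
            int b = in.read();
            if (b < 0) throw new EOFException("truncated varint");
            raw |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
            if (shift > 63) throw new IOException("varint too long");
        }
        return (raw >>> 1) ^ -(raw & 1); // undo zig-zag
    }

    // Reads a length-prefixed byte sequence (Avro "bytes" and "string").
    static byte[] readBytes(InputStream in) throws IOException {
        long len = readLong(in);
        if (len < 0 || len > MAX_LEN) throw new IOException("implausible length " + len);
        byte[] buf = new byte[(int) len];
        new DataInputStream(in).readFully(buf);
        return buf;
    }

    // Returns the schema JSON from the header, or null if the stream does
    // not start with the container magic or has no "avro.schema" entry.
    static String embeddedSchema(InputStream in) throws IOException {
        byte[] head = new byte[MAGIC.length];
        try {
            new DataInputStream(in).readFully(head);
        } catch (EOFException e) {
            return null; // shorter than the magic itself
        }
        if (!Arrays.equals(head, MAGIC)) return null;

        String schema = null;
        // Metadata is a map<string, bytes> written in blocks; a count of 0 ends it.
        for (long count = readLong(in); count != 0; count = readLong(in)) {
            if (count < 0) {      // negative count: a block byte size follows
                count = -count;
                readLong(in);     // skip the byte size
            }
            for (long i = 0; i < count; i++) {
                String key = new String(readBytes(in), StandardCharsets.UTF_8);
                byte[] value = readBytes(in);
                if (key.equals("avro.schema")) {
                    schema = new String(value, StandardCharsets.UTF_8);
                }
            }
        }
        return schema;
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            String schema = embeddedSchema(in);
            System.out.println(schema == null ? "no embedded Avro schema found" : schema);
        }
    }
}
```

A null result only tells you the file is not an Avro container file. JSON-encoded data and bare binary datums carry no schema of their own, so as far as I know you cannot identify those from the bytes alone; you would need the schema from somewhere else.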
On 01/17/2013 05:11 PM, Public Network Services wrote:
> I am involved in a project to extract data from a large number of files
> (to be provided at some point), in numerous formats, among which are some
> Avro files (both binary and JSON-encoded), and thus I am looking for the
> best way to tackle this.
> One of the things we would (ideally) like to do is auto-classify the
> data generically, i.e. read a few lines or bytes off a file and be able
> to tell what kind of format it is.
> This is fairly easy to do with, say, (non-Avro) JSON files, but I am not
> sure how this would be done for Avro.
> For one thing, there is the necessity of a Schema, about which the
> documentation says that
> * "Avro data is always serialized with its schema. Files that store
> Avro data should always also include the schema for that data in the
> same file."
> However, the Java code examples posted on the project website imply that
> the Schema is supplied as a separate file and I am not sure whether this
> is only the case with RPC.
> Are there any code examples for detecting the encoding format
> (binary/json) of the data file, assessing whether there is a schema
> embedded in it and extracting that schema?