-
Generic Avro Classification and Deserialization
Public Network Services 2013-01-17, 22:11
Folks,

I am involved in a project to extract data from a large number of files (to be provided at some point), in numerous formats, among which are some Avro files (both binary and JSON-encoded), and thus I am looking for the best way to tackle this.

One of the things we would (ideally) like to do is auto-classify the data generically, i.e. read a few lines or bytes off a file and be able to tell what kind of format it is.

This is fairly easy to do with, say, (non-Avro) JSON files, but I am not sure how this would be done for Avro.

For one thing, there is the necessity of a Schema, about which the documentation says that

  - "Avro data is always serialized with its schema. Files that store Avro data should always also include the schema for that data in the same file."

However, the Java code examples posted on the project website imply that the Schema is supplied as a separate file, and I am not sure whether this is only the case with RPC.

Are there any code examples for detecting the encoding format (binary/JSON) of the data file, assessing whether there is a schema embedded in it, and extracting that schema?

Thanks!
-
Re: Generic Avro Classification and Deserialization
Terry Healy 2013-01-18, 14:53
Check out avro-tools. With this you can dump the schema for a file, extract the metadata, or export it in several formats:

----------------
Available tools:
        compile  Generates Java code for the given schema.
     fragtojson  Renders a binary-encoded Avro datum as JSON.
       fromjson  Reads JSON records and writes an Avro data file.
       fromtext  Imports a text file into an avro data file.
        getmeta  Prints out the metadata of an Avro data file.
      getschema  Prints out schema of an Avro data file.
            idl  Generates a JSON schema from an Avro IDL file
         induce  Induce schema/protocol from Java class/interface via reflection.
     jsontofrag  Renders a JSON-encoded Avro datum as binary.
        recodec  Alters the codec of a data file.
     rpcreceive  Opens an RPC Server and listens for one message.
        rpcsend  Sends a single RPC message.
         tether  Run a tethered mapreduce job.
         tojson  Dumps an Avro data file as JSON, one record per line.
         totext  Converts an Avro data file to a text file.
    trevni_meta  Dumps a Trevni file's metadata as JSON.
  trevni_random  Create a Trevni file filled with random instances of a schema.
  trevni_tojson  Dumps a Trevni file as JSON.

-Terry
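
For a classification pipeline, the `getschema` tool listed above could be driven from a script. The following is a minimal sketch in Python, assuming that `java` is on the PATH and that the avro-tools jar has been downloaded locally; the jar path, the helper names, and the choice to treat any failure as "no schema" are illustrative assumptions, not part of avro-tools itself.

```python
import subprocess


def avro_tools_command(jar_path, tool, data_file):
    """Build the command line for one avro-tools invocation.

    jar_path is a hypothetical local path to the avro-tools jar; the real
    file name depends on the Avro version that was downloaded.
    """
    return ["java", "-jar", jar_path, tool, data_file]


def get_schema(jar_path, data_file):
    """Return the schema JSON printed by 'getschema', or None on failure.

    A failure here (missing java, missing jar, non-zero exit status) is
    simply treated as "could not extract a schema"; a real pipeline may
    want to tell these cases apart.
    """
    cmd = avro_tools_command(jar_path, "getschema", data_file)
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout
```

Starting a JVM per file is slow, so for a large number of files this is probably better suited to spot checks than to bulk classification.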
-
Re: Generic Avro Classification and Deserialization
Miki Tebeka 2013-01-18, 18:49
Avro files have a "magic" prefix of "Obj\x01" (the ASCII bytes "Obj" followed by the byte 0x01); this might help.
The schema is always embedded in the Avro file, in the "meta" field of the file header.
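
The magic-prefix check and the schema lookup described above can be sketched in plain Python without the Avro library. This is a hedged sketch, assuming the object container file layout from the Avro specification: the 4-byte magic, then a metadata map of string keys to byte values (with the schema stored under the `avro.schema` key), then a 16-byte sync marker. All function names are illustrative.

```python
import json

AVRO_MAGIC = b"Obj\x01"  # ASCII "Obj" followed by the byte 0x01


def is_avro_container(prefix):
    """True if the first bytes of a file carry the Avro container magic."""
    return prefix[:4] == AVRO_MAGIC


def read_long(stream):
    """Decode one Avro long: a variable-length, zigzag-encoded integer."""
    shift = 0
    raw = 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("stream ended inside a varint")
        raw |= (chunk[0] & 0x7F) << shift
        if not chunk[0] & 0x80:  # high bit clear: last byte of the varint
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1)  # undo zigzag encoding


def read_bytes(stream):
    """Decode a length-prefixed byte string (Avro 'bytes' or 'string')."""
    length = read_long(stream)
    if length < 0:
        raise ValueError("negative length")
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("stream ended inside a byte string")
    return data


def read_header_metadata(stream):
    """Return the header metadata as a dict, or None if the magic is absent."""
    if stream.read(4) != AVRO_MAGIC:
        return None
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero-count block terminates the map
            break
        if count < 0:  # negative count: a block byte size follows
            count = -count
            read_long(stream)
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta


def extract_schema(stream):
    """Return the embedded schema as parsed JSON, or None if not found."""
    meta = read_header_metadata(stream)
    if meta is None or "avro.schema" not in meta:
        return None
    return json.loads(meta["avro.schema"].decode("utf-8"))
```

A classifier would typically call `is_avro_container` on the first few bytes it has already read, and only open the file in binary mode and pass it to `extract_schema` when that check succeeds.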
-
Re: Generic Avro Classification and Deserialization
Public Network Services 2013-01-18, 23:48
You mean "Avro binary files", yes?
What about Avro JSON files? Would there be a trick to assess whether such a file is Avro and not generic JSON?
-
Re: Generic Avro Classification and Deserialization
Public Network Services 2013-01-19, 00:46
Thanks for the help!
I am trying to find sample Avro files, and it turns out to be surprisingly difficult (at least via the Google searches I tried). Would you know of any such files (preferably large-ish) that are available as open source?
-
Re: Generic Avro Classification and Deserialization
Miki Tebeka 2013-01-19, 04:36
On Fri, Jan 18, 2013 at 4:46 PM, Public Network Services <[EMAIL PROTECTED]> wrote:
> I am trying to find sample Avro files

There are some in the Avro source tree test directory.