|
Doug Cutting
2009-04-02, 22:05
Owen O'Malley
2009-04-02, 22:20
Abhishek Verma
2009-04-03, 00:11
Owen O'Malley
2009-04-03, 04:28
Doug Cutting
2009-04-03, 16:06
Bryan Duxbury
2009-04-03, 16:24
Doug Cutting
2009-04-03, 17:28
Bryan Duxbury
2009-04-03, 17:50
Doug Cutting
2009-04-03, 18:37
George Porter
2009-04-03, 19:03
Scott Carey
2009-04-03, 19:49
Doug Cutting
2009-04-03, 20:02
George Porter
2009-04-03, 20:24
Bryan Duxbury
2009-04-03, 19:59
Doug Cutting
2009-04-03, 17:39
Alan Gates
2009-04-03, 17:43
Sameer Paranjpye
2009-04-03, 20:27
Nigel Daley
2009-04-03, 20:47
Jim Kellerman
2009-04-03, 22:10
Doug Cutting
2009-04-03, 22:44
Jerome Boulon
2009-04-03, 20:59
Sanjay Radia
2009-04-06, 05:33
Dhruba Borthakur
2009-04-06, 05:42
Brian Forney
2009-04-06, 20:38
Tom White
2009-04-07, 17:26
Doug Cutting
2009-04-08, 03:33
Chad Walters
2009-04-08, 07:03
Chad Walters
2009-04-06, 07:23
Doug Cutting
2009-04-06, 19:12
Kevin Clark
2009-04-07, 00:17
Doug Cutting
2009-04-07, 04:15
Chad Walters
2009-04-07, 08:56
Doug Cutting
2009-04-07, 16:16
Doug Cutting
2009-04-06, 20:08
George Porter
2009-04-06, 20:25
Doug Cutting
2009-04-06, 20:48
George Porter
2009-04-06, 21:02
Chad Walters
2009-04-06, 21:05
Doug Cutting
2009-04-06, 22:17
Chad Walters
2009-04-06, 06:51
Ankur Goel
2009-04-13, 11:27
Doug Cutting
2009-04-13, 17:54
|
-
[PROPOSAL] new subproject: Avro
Doug Cutting 2009-04-02, 22:05
I propose we add a new Hadoop subproject for Avro, a serialization system. My ambition is for Avro both to replace Hadoop's RPC and to be used for most Hadoop data files, e.g., by Pig, Hive, etc.

Initial committers would be Sharad Agarwal and me, both existing Hadoop committers. We are the sole authors of this software to date.

The code is currently at:

http://people.apache.org/~cutting/avro.git/

To learn more:

git clone http://people.apache.org/~cutting/avro.git/ avro
cat avro/README.txt

Comments? Questions?

Doug
-
Re: [PROPOSAL] new subproject: Avro
Owen O'Malley 2009-04-02, 22:20
On Apr 2, 2009, at 3:05 PM, Doug Cutting wrote:
> I propose we add a new Hadoop subproject for Avro, a serialization
> system.

+1. Even if Justin comes after me for allowing another Jakarta. *smile*

-- Owen
-
Re: [PROPOSAL] new subproject: Avro
Abhishek Verma 2009-04-03, 00:11
I am a newbie here. Why not use something existing like protocol buffers :
http://code.google.com/p/protobuf/ which is open source and works amazingly well. On Thu, Apr 2, 2009 at 5:20 PM, Owen O'Malley <[EMAIL PROTECTED]> wrote: > > On Apr 2, 2009, at 3:05 PM, Doug Cutting wrote: > > I propose we add a new Hadoop subproject for Avro, a serialization system. >> > > +1. Even if Justin comes after me for allowing another Jakarta. *smile* > > -- Owen > +
-
Re: [PROPOSAL] new subproject: Avro
Owen O'Malley 2009-04-03, 04:28
On Apr 2, 2009, at 5:11 PM, Abhishek Verma wrote:
> I am a newbie here. Why not use something existing like protocol
> buffers :
> http://code.google.com/p/protobuf/ which is open source and works
> amazingly
> well.

There are two blockers for protocol buffers that make them suboptimal for Hadoop. They are:

1. Protocol buffers are open source, but the community isn't open. Google doesn't seem interested in getting patches from outside of itself. If we needed something changed in protocol buffers, we'd end up needing to fork the project to make any progress.

2. Protocol buffers (and thrift) encode the field names as id numbers. That means that if you read them into a dynamic language like Python, it has to use the field numbers instead of the field names. In Avro, the field names are saved and there are no field ids.

A final point is that since the schema isn't inlined in Avro, the binary representation is much tighter than protocol buffers.

-- Owen
-
Re: [PROPOSAL] new subproject: Avro
Doug Cutting 2009-04-03, 16:06
Owen O'Malley wrote:
> 2. Protocol buffers (and thrift) encode the field names as id numbers.
> That means that if you read them into dynamic language like Python that
> it has to use the field numbers instead of the field names. In Avro, the
> field names are saved and there are no field ids.

This hints at a related problem with Thrift and Protocol Buffers, which is that they require one to generate code for each datatype one processes. This is awkward in dynamic environments, where one would like to write a script (Pig, Python, Perl, Hive, whatever) to process input data and generate output data, without having to locate the IDL for each input file, run an IDL compiler, load the generated code, generate an IDL file for the output, run the compiler again, load the output code and finally write your output. Avro rather lets you simply open your inputs, examine their datatypes, specify output types and write them.

Avro's Java implementation currently includes three different data representations:

- a "generic" representation uses a standard set of datastructures for all datatypes: records are represented as Map<String,Object>, arrays as List<Object>, longs as Long, etc.

- a "reflect" representation uses Java reflection to permit one to read and write existing Java classes with Avro.

- a "specific" representation generates Java classes that are compiled and loaded, much like Thrift and Protocol Buffers.

We don't expect most scripting languages to use more than a single representation. Implementing Avro is quite simple, by design. We have a Python implementation, and hope to add more soon.

Doug
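As a rough illustration of the "generic" style described above, here is a minimal Python sketch. It is not Avro's API: the schema text, the record contents and the helper names `field_types` and `conforms` are all hypothetical, and the JSON layout is modelled on the published Avro specification, which may differ from the prototype discussed in this thread. The point is only that the schema is ordinary JSON loaded at run time and a record is an ordinary map, so no IDL compiler or generated class is involved.

```python
import json

# Hypothetical schema, written as plain JSON (an assumption for this sketch).
SCHEMA_JSON = """
{"type": "record", "name": "Visit",
 "fields": [{"name": "url", "type": "string"},
            {"name": "hits", "type": "long"}]}
"""


def field_types(schema):
    """Return {field name: type name} for a record schema held as a dict."""
    return {f["name"]: f["type"] for f in schema["fields"]}


def conforms(record, schema):
    """Check that a generic record has exactly the fields the schema names."""
    return set(record) == set(field_types(schema))


# The schema is examined at run time; nothing is generated or compiled.
schema = json.loads(SCHEMA_JSON)

# A "generic" record is just a map from field name to value.
record = {"url": "http://example.org/", "hits": 3}
```

A script could call `field_types(schema)` to discover what an input file contains before deciding what to read or write.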
-
Re: [PROPOSAL] new subproject: Avro
Bryan Duxbury 2009-04-03, 16:24
It sounds like what you want is the option to avoid pre-generated
classes. If that's the only thing you need, it seems like we could bolt that on to Thrift with almost no work. I assume you'd have the schema stored in metadata or file header or something, right? (You wouldn't want to store the field names in the binary encoding as strings, since that would probably very quickly dwarf the size of the actual data in a lot of cases.) If my assumptions are correct, it seems like it'd be a lot smarter to leverage existing Thrift infrastructure and encoding work rather than duplicating it for this lone feature. -Bryan On Apr 3, 2009, at 9:06 AM, Doug Cutting wrote: > Owen O'Malley wrote: >> 2. Protocol buffers (and thrift) encode the field names as id >> numbers. That means that if you read them into dynamic language >> like Python that it has to use the field numbers instead of the >> field names. In Avro, the field names are saved and there are no >> field ids. > > This hints at a related problem with Thrift and Protocol Buffers, > which is that they require one to generate code for each datatype > one processes. This is awkward in dynamic environments, where one > would like to write a script (Pig, Python, Perl, Hive, whatever) to > process input data and generate output data, without having to > locate the IDL for each input file, run an IDL compiler, load the > generated code, generate an IDL file for the output, run the > compiler again, load the output code and finally write your > output. Avro rather lets you simply open your inputs, examine > their datatypes, specify output types and write them. > > Avro's Java implementation currently includes three different data > representations: > > - a "generic" representation uses a standard set of datastructures > for all datatypes: records are represented as Map<String,Object>, > arrays as List<Object>, longs as Long, etc. > > - a "reflect" representation uses Java reflection to permit one to > read and write existing Java classes with Avro. 
> > - a "specific" representation generates Java classes that are > compiled and loaded, much like Thrift and Protocol Buffers. > > We don't expect most scripting languages to use more than a single > representation. Implementing Avro is quite simple, by design. We > have a Python implementation, and hope to add more soon. > > Doug +
-
Re: [PROPOSAL] new subproject: Avro
Doug Cutting 2009-04-03, 17:28
Bryan Duxbury wrote:
> It sounds like what you want is the option avoid pre-generated classes.

That's part of it. But, once you have the schema, you might as well take advantage of it. With the schema in hand, you don't need to tag data with field numbers or types, since that's all there in the schema. So, having the schema, you can use a simpler data format.

Also, with the schema, resolving version differences is simplified. Developers don't need to assign field numbers, but can just use names. For performance, one can internally use field numbers while reading, to avoid string comparisons, but developers need no longer specify these, but can use names, as in most software. Here having the schema means we can simplify the IDL and its versioning semantics.

> If that's the only thing you need, it seems like we could bolt that on
> to Thrift with almost no work.

Would you write parsers for Thrift's IDL in every language? Or would you use JSON, as Avro does, to avoid that? Once you're using a different IDL and a different data format, what's shared with Thrift? Fundamentally, those two things define a serialization system, no?

> I assume you'd have the schema stored in
> metadata or file header or something, right? (You wouldn't want to store
> the field names in the binary encoding as strings, since that would
> probably very quickly dwarf the size of the actual data in a lot of cases.)

Yes, in data files the schema is typically stored in the metadata.

> If my assumptions are correct, it seems like it'd be a lot smarter to
> leverage existing Thrift infrastructure and encoding work rather than
> duplicating it for this lone feature.

What specific shared infrastructure would be leveraged? For Hadoop's RPC, I hope to adapt Hadoop's client and server implementations as a transport, as these have been highly tuned for Hadoop's performance requirements.

Doug
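The remark about using field numbers internally while developers work only with names can be sketched as follows. This is an illustrative Python sketch, not code from Avro; `build_plan` and `apply_plan` are hypothetical names. Names are compared once, when the two schemas are paired up, and every record after that is handled by integer position alone.

```python
def build_plan(writer_names, reader_names):
    """Resolve names to positions once.

    For each writer position, give the reader slot it maps to,
    or None if the reader does not want that field.
    """
    slot = {name: i for i, name in enumerate(reader_names)}
    return [slot.get(name) for name in writer_names]


def apply_plan(plan, values, width):
    """Place decoded writer values into reader order using positions only."""
    out = [None] * width
    for target, value in zip(plan, values):
        if target is not None:
            out[target] = value
    return out


# String comparisons happen here, once per pair of schemas...
plan = build_plan(["id", "name", "email"], ["email", "id"])

# ...and not here, once per record.
row = apply_plan(plan, [7, "ann", "a@example.org"], 2)
```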
-
Re: [PROPOSAL] new subproject: Avro
Bryan Duxbury 2009-04-03, 17:50
> With the schema in hand, you don't need to tag data with field
> numbers or types, since that's all there in the schema. So, having > the schema, you can use a simpler data format. To a degree, we already have that in Thrift - we call it the DenseProtocol. > Would you write parsers for Thrift's IDL in every language? Or > would you use JSON, as Avro does, to avoid that? When it comes to having a code-usable IDL for the schema, I'm totally pro-JSON. > Once you're using a different IDL and a different data format, > what's shared with Thrift? Fundamentally, those two things define > a serialization system, no? It's not actually a different data format, is it? You're saying that the user wouldn't specify the field IDs, but you'd fundamentally still use field ids for compactness and the like. You may not use actual Thrift generated objects, but you could certainly use Binary or Compact protocol from Thrift to do all the writing to the wire. You might also be able to use (or contribute to) Thrift's RPC-level stuff like server implementations. We have some respectable Java servers written, and if those aren't enough for your uses, I'd actually be really interested in seeing if we could generalize some of the Hadoop stuff to be useful within Thrift. The bottom line is that I would love to see greater cooperation between Hadoop and Thrift. Unless it's impossible or impractical for Thrift to be useful here, I think we'd be willing to work towards Hadoop's needs. -Bryan +
-
Re: [PROPOSAL] new subproject: Avro
Doug Cutting 2009-04-03, 18:37
Bryan Duxbury wrote:
> It's not actually a different data format, is it? You're saying that the
> user wouldn't specify the field IDs, but you'd fundamentally still use
> field ids for compactness and the like.

Field ids are not present in Avro data except in the schema. A record's fields are serialized in the order that the fields occur in the record's schema, with no per-field annotations whatsoever. For example, a record that contains a string and an int is serialized simply as a string followed by an int, nothing before, nothing between and nothing after. So, yes, it is a different data format.

> The bottom line is that I would love to see greater cooperation between
> Hadoop and Thrift. Unless it's impossible or impractical for Thrift to
> be useful here, I think we'd be willing to work towards Hadoop's needs.

Perhaps Thrift could be augmented to support Avro's JSON schemas and serialization. Then it could interoperate with other Avro-based systems. But then Thrift would have yet another serialization format, that every language would need to implement for it to be useful...

Avro will only ever have one serialization format. Thrift fundamentally standardizes an API, not a data format. Avro fundamentally is a data format specification, like XML. Thrift could implement this specification. The Avro project includes reference implementations, but the format is intended to be simple enough and the specification stable enough that others might reasonably develop alternate, independent implementations.

Doug
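To make the "string followed by an int" example concrete, here is a minimal Python sketch of such an encoder. It assumes the primitive encodings of the published Avro specification (zig-zag variable-length integers, and strings as a length followed by UTF-8 bytes); the prototype under discussion may have differed, and the function names are hypothetical.

```python
def zigzag_varint(n):
    """Encode a signed 64-bit integer as zig-zag variable-length bytes."""
    z = (n << 1) ^ (n >> 63)          # zig-zag: small magnitudes get small codes
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(s):
    """Encode a string as its byte length followed by its UTF-8 bytes."""
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data


def encode_record(name, count):
    """Encode a (string, int) record: the two fields back to back.

    No field ids, type tags or separators are written; the schema
    alone says which bytes belong to which field.
    """
    return encode_string(name) + zigzag_varint(count)


encoded = encode_record("foo", 27)
```

Under these assumptions the record ("foo", 27) occupies five bytes: one length byte, three bytes of text and one byte for the int.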
-
Re: [PROPOSAL] new subproject: Avro
George Porter 2009-04-03, 19:03
On Apr 3, 2009, at 11:37 AM, Doug Cutting wrote: >> > > Field ids are not present in Avro data except in the schema. A > record's fields are serialized in the order that the fields occur in > the records schema, with no per-field annotations whatsoever. For > example, a record that contains a string and an int is serialized > simply as a string followed by an int, nothing before, nothing > between and nothing after. So, yes, it is a different data format. While this representation would certainly be as compact as possible, wouldn't it prevent evolving the data structure over time? One of the nice features of Google Protocol Buffers and Thrift is that you can evolve the set of fields over time, and older/newer clients can talk to older/newer services. If the proposed Avro is evolvable, then perhaps I'm misunderstanding your statement about the lack of IDs in the serialized data. I also agree with Bryan, in that it would be unfortunate to have two different Apache projects with overlapping goals. Regardless of features, both protocol buffers and thrift have the advantage of being debugged in mission-critical production environments. -George +
-
Re: [PROPOSAL] new subproject: Avro
Scott Carey 2009-04-03, 19:49
On 4/3/09 12:03 PM, "George Porter" <[EMAIL PROTECTED]> wrote:
> On Apr 3, 2009, at 11:37 AM, Doug Cutting wrote:
>> Field ids are not present in Avro data except in the schema. A
>> record's fields are serialized in the order that the fields occur in
>> the records schema, with no per-field annotations whatsoever. For
>> example, a record that contains a string and an int is serialized
>> simply as a string followed by an int, nothing before, nothing
>> between and nothing after. So, yes, it is a different data format.
>
> While this representation would certainly be as compact as possible,
> wouldn't it prevent evolving the data structure over time? One of the
> nice features of Google Protocol Buffers and Thrift is that you can
> evolve the set of fields over time, and older/newer clients can talk
> to older/newer services. If the proposed Avro is evolvable, then
> perhaps I'm misunderstanding your statement about the lack of IDs in
> the serialized data.

From a quick perusal of the serialization format -- it contains headers with type/schema information, and other metadata blocks. The types can be inferred from this, and if this is done right then older/newer clients will be able to read things just fine. What can't be done is mixing two different formats in the same stream if headers define the format of the whole stream.

I have not looked much deeper than that, but it looks like schema evolution is feasible.

> I also agree with Bryan, in that it would be unfortunate to have two
> different Apache projects with overlapping goals. Regardless of
> features, both protocol buffers and thrift have the advantage of being
> debugged in mission-critical production environments.
>
> -George
-
Re: [PROPOSAL] new subproject: Avro
Doug Cutting 2009-04-03, 20:02
George Porter wrote:
> While this representation would certainly be as compact as possible,
> wouldn't it prevent evolving the data structure over time? One of the
> nice features of Google Protocol Buffers and Thrift is that you can
> evolve the set of fields over time, and older/newer clients can talk to
> older/newer services. If the proposed Avro is evolvable, then perhaps
> I'm misunderstanding your statement about the lack of IDs in the
> serialized data.

Avro supports schema evolution. In Avro, the schema used to write the data must be available when the data is read. (In files, it is typically stored in the file metadata.)

If you have the schema that was used to write the data, and you're expecting a slightly different schema, then you simply keep those fields that are in both schemas and skip those not. This is equivalent to Thrift and Protocol Buffer's support for schema evolution, but does not require manually assigning numeric field ids.

This feature can also be used to support projection. If you have records with many large fields, but only need a single field in a particular computation, then you can specify an expected schema with only that field, and the runtime will efficiently skip all of the other fields, returning a record with just the single, expected field.

> I also agree with Bryan, in that it would be unfortunate to have two
> different Apache projects with overlapping goals.

We already have both Thrift and Etch in the incubator, which have similar goals. Apache does not attempt to mandate that projects have disjoint goals. There are many ways to slice things, and Apache prefers to rely on survival of the fittest rather than forcing things together.

> Regardless of
> features, both protocol buffers and thrift have the advantage of being
> debugged in mission-critical production environments.

Yes, but, as I've argued in other messages in this thread, they do not support the dynamic features we need. Adding those features would add new code that would share little with existing code in those projects. So, while the projects are conceptually similar, the implementations are necessarily different, and, without significant code overlap, separate projects seem more natural.

Doug
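The "keep the fields that are in both schemas and skip the rest" rule, and its use for projection, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Avro's implementation: the wire encoding follows the published Avro specification (zig-zag varints, length-prefixed UTF-8 strings), only "string" and "int" fields are handled, and `read_varint`, `resolve`, `WRITER` and `DATA` are hypothetical names.

```python
def read_varint(buf, pos):
    """Decode one zig-zag varint at buf[pos]; return (value, next position)."""
    shift = 0
    z = 0
    while True:
        byte = buf[pos]
        pos += 1
        z |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:                    # high bit clear: last byte
            return (z >> 1) ^ -(z & 1), pos


def resolve(buf, writer_fields, wanted):
    """Walk the data in writer order, keeping only fields named in `wanted`."""
    record, pos = {}, 0
    for name, ftype in writer_fields:
        n, pos = read_varint(buf, pos)         # an int value or a string length
        if ftype == "string":
            start, pos = pos, pos + n          # step over the bytes either way
            if name in wanted:
                record[name] = buf[start:pos].decode("utf-8")
        elif ftype == "int":
            if name in wanted:
                record[name] = n
        else:
            raise ValueError("unsupported type in this sketch: " + ftype)
    return record


# Writer's schema, in field order, and one record written with it.
WRITER = [("name", "string"), ("age", "int"), ("bio", "string")]
DATA = bytes([6]) + b"foo" + bytes([0x36, 8]) + b"long"

# Projection: the reader expects a single field.
projected = resolve(DATA, WRITER, {"age"})
```

A skipped string costs only the read of its length; its bytes are never decoded, which is the property that makes projection over large fields cheap.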
-
Re: [PROPOSAL] new subproject: Avro
George Porter 2009-04-03, 20:24
On Apr 3, 2009, at 1:02 PM, Doug Cutting wrote: > George Porter wrote: >> While this representation would certainly be as compact as >> possible, wouldn't it prevent evolving the data structure over >> time? One of the nice features of Google Protocol Buffers and >> Thrift is that you can evolve the set of fields over time, and >> older/newer clients can talk to older/newer services. If the >> proposed Avro is evolvable, then perhaps I'm misunderstanding your >> statement about the lack of IDs in the serialized data. > > Avro supports schema evolution. In Avro, the schema used to write > the data must be available when the data is read. (In files, it is > typically stored in the file metadata.) > > If you have the schema that was used to write the data, and you're > expecting a slightly different schema, then you simply keep those > fields that are in both schemas and skip those not. This is > equivalent to Thrift and Protocol Buffer's support for schema > evolution, but does not require manually assigning numeric field ids. > > This feature can also be used to support projection. If you have > records with many large fields, but only need a single field in a > particular computation, then you can specify an expected schema with > only that field, and the runtime will efficiently skip all of the > other fields, returning a record with just the single, expected field. Thanks for the clarification--I better understand the schema relationship. The projection feature is a nice feature, especially since it seems like it would be able to support "sparse files" where you want to just peek at large structs without invoking a lot of disk- io (for data serialized on-disk). > > >> I also agree with Bryan, in that it would be unfortunate to have >> two different Apache projects with overlapping goals. > > We already have both Thrift and Etch in the incubator, which have > similar goals. Apache does not attempt to mandate that projects > have disjoint goals. 
There are many ways to slice things, and > Apache prefers to rely on survival of the fittest rather than > forcing things together. > >> Regardless of features, both protocol buffers and thrift have the >> advantage of being debugged in mission-critical production >> environments. > > Yes, but, as I've argued in other messages in this thread, they do > not support the dynamic features we need. Adding those features > would add new code that would share little with existing code in > those projects. So, while the projects are conceptually similar, the > implementations are necessarily different, and, without significant > code overlap, separate projects seem more natural. > > Doug Makes sense. Thanks, George +
-
Re: [PROPOSAL] new subproject: Avro
Bryan Duxbury 2009-04-03, 19:59
> Field ids are not present in Avro data except in the schema. A
> record's fields are serialized in the order that the fields occur > in the records schema, with no per-field annotations whatsoever. > For example, a record that contains a string and an int is > serialized simply as a string followed by an int, nothing before, > nothing between and nothing after. So, yes, it is a different data > format. So you can't serialize nulls? It also seems like this would make forward/backward compatibility a little more complex. Thrift solves this problem by using tags to indicate what kind of field you're working with. -Bryan +
-
Re: [PROPOSAL] new subproject: Avro
Doug Cutting 2009-04-03, 17:39
Doug Cutting wrote:
> git clone http://people.apache.org/~cutting/avro.git/ avro > cat avro/README.txt I've posted the generated documentation. The specification is at: http://people.apache.org/~cutting/avro/spec.html and the javadoc is at: http://people.apache.org/~cutting/avro/api/index.html Doug +
-
Re: [PROPOSAL] new subproject: Avro
Alan Gates 2009-04-03, 17:43
+1. Pig would be happy to use a cross language serialization package
that did not require pre-compilation to read and write. Alan. On Apr 2, 2009, at 3:05 PM, Doug Cutting wrote: > I propose we add a new Hadoop subproject for Avro, a serialization > system. My ambition is for Avro to replace both Hadoop's RPC and to > be used for most Hadoop data files, e.g., by Pig, Hive, etc. > > Initial committers would be Sharad Agarwal and me, both existing > Hadoop committers. We are the sole authors of this software to date. > > The code is currently at: > > http://people.apache.org/~cutting/avro.git/ > > To learn more: > > git clone http://people.apache.org/~cutting/avro.git/ avro > cat avro/README.txt > > Comments? Questions? > > Doug +
-
Re: [PROPOSAL] new subproject: Avro
Sameer Paranjpye 2009-04-03, 20:27
+1

While protocol buffers and thrift have similar goals, Avro takes a different approach to schema evolution and reconciliation. I feel that Avro's tighter layout of data and schema management is better suited for many of Hadoop's and Pig's use cases for large data sets/tables on HDFS. Field ids start to matter when millions of objects have hundreds of fields each. There is, of course, the storage overhead. Schema management becomes hard especially if there are cases where field ids need to be assigned manually.

----- Original Message ----
From: Doug Cutting <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Thursday, April 2, 2009 3:05:08 PM
Subject: [PROPOSAL] new subproject: Avro

I propose we add a new Hadoop subproject for Avro, a serialization system. My ambition is for Avro to replace both Hadoop's RPC and to be used for most Hadoop data files, e.g., by Pig, Hive, etc.

Initial committers would be Sharad Agarwal and me, both existing Hadoop committers. We are the sole authors of this software to date.

The code is currently at:

http://people.apache.org/~cutting/avro.git/

To learn more:

git clone http://people.apache.org/~cutting/avro.git/ avro
cat avro/README.txt

Comments? Questions?

Doug
-
Re: [PROPOSAL] new subproject: Avro
Nigel Daley 2009-04-03, 20:47
+1.
On Apr 2, 2009, at 3:05 PM, Doug Cutting wrote: > I propose we add a new Hadoop subproject for Avro, a serialization > system. My ambition is for Avro to replace both Hadoop's RPC and to > be used for most Hadoop data files, e.g., by Pig, Hive, etc. > > Initial committers would be Sharad Agarwal and me, both existing > Hadoop committers. We are the sole authors of this software to date. > > The code is currently at: > > http://people.apache.org/~cutting/avro.git/ > > To learn more: > > git clone http://people.apache.org/~cutting/avro.git/ avro > cat avro/README.txt > > Comments? Questions? > > Doug +
-
RE: [PROPOSAL] new subproject: Avro
Jim Kellerman 2009-04-03, 22:10
> -----Original Message-----
> On Apr 2, 2009, at 3:05 PM, Doug Cutting wrote: > > I propose we add a new Hadoop subproject for Avro, a serialization > system. My ambition is for Avro to replace both Hadoop's RPC and to > be used for most Hadoop data files, e.g., by Pig, Hive, etc. > > Initial committers would be Sharad Agarwal and me, both existing > Hadoop committers. We are the sole authors of this software to date. > > The code is currently at: > > http://people.apache.org/~cutting/avro.git/ > > To learn more: > > git clone http://people.apache.org/~cutting/avro.git/ avro > cat avro/README.txt > > Comments? Questions? > > Doug After reading all the messages about Avro, I'm still not sure I understand why we should invent "yet another wheel". There are a number of people in the community who have significant investments in Thrift, and I have yet to see a compelling argument for Avro over Thrift. My understanding is that Thrift already supports multi-language bindings, something the HBase community has been asking for, for some time. It is also my understanding (based on the email thread) that Avro only supports Java and python. That is a step backwards from Thrift. It appears that Avro uses introspection heavily, which is expensive in applications that require a high message rate. So I guess my question is why Avro? I may be thick, but it seems to me as if it is just another wheel of a different color. If I could see a point by point comparison between Avro and Thrift I could be convinced that Avro is the way to go. So far, I have not seen any compelling reason to re-invent the wheel. +-0 --- Jim Kellerman, Powerset (Live Search, Microsoft Corporation) +
-
Re: [PROPOSAL] new subproject: Avro
Doug Cutting 2009-04-03, 22:44
Jim Kellerman (POWERSET) wrote:
> It is also my understanding (based on the email thread) that Avro only
> supports Java and python. That is a step backwards from Thrift.

We intend to add support for more languages. Avro is not complete.

> It appears that Avro uses introspection heavily, which is expensive in
> applications that require a high message rate.

It only uses introspection if you wish to use your existing Java classes to represent Avro data. There are three representations in Java: generic (uses Map<String,Object> for records, List<Object> for arrays), specific (generates a Java class for each Avro record, like Thrift) and reflect (uses reflection to access existing classes). So introspection is optional. And, while introspection is indeed slow for processing file-based data, it would probably not be a bottleneck for most RPC protocols and might be a useful tool to migrate existing code to Avro.

> So I guess my question is why Avro?

The compelling case is dynamic data types. Pig, Hive, Python, Perl etc. scripts should not have to generate a Thrift IDL file each time they wish to write a data file with a new schema, nor should they need to run the Thrift compiler for each data file they wish to read. For production applications, code-generation is not an imposition and may offer increased opportunities for optimization and error checking, but for exploration and experimentation, a very common use case for Hadoop, one would like to be able to browse datasets and build mapreduce programs more interactively.

Doug
-
Re: [PROPOSAL] new subproject: Avro
Jerome Boulon 2009-04-03, 20:59
Chukwa needs a cross language serialization/RPC framework.
During demux, being able to dynamically create the schema based on the recordType, instead of using a hardcoded one, will be a big plus. The projection feature will also be useful.

+1

/Jerome.

On 4/2/09 3:05 PM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:

I propose we add a new Hadoop subproject for Avro, a serialization system. My ambition is for Avro to replace both Hadoop's RPC and to be used for most Hadoop data files, e.g., by Pig, Hive, etc.

Initial committers would be Sharad Agarwal and me, both existing Hadoop committers. We are the sole authors of this software to date.

The code is currently at:

http://people.apache.org/~cutting/avro.git/

To learn more:

git clone http://people.apache.org/~cutting/avro.git/ avro
cat avro/README.txt

Comments? Questions?

Doug
-
Re: [PROPOSAL] new subproject: Avro
Sanjay Radia 2009-04-06, 05:33
On Apr 2, 2009, at 3:05 PM, Doug Cutting wrote: > I propose we add a new Hadoop subproject for Avro, a serialization > system. My ambition is for Avro to replace both Hadoop's RPC and to > be > used for most Hadoop data files, e.g., by Pig, Hive, etc. > > Initial committers would be Sharad Agarwal and me, both existing > Hadoop > committers. We are the sole authors of this software to date. > > The code is currently at: > > http://people.apache.org/~cutting/avro.git/ > > To learn more: > > git clone http://people.apache.org/~cutting/avro.git/ avro > cat avro/README.txt > > Comments? Questions? > > Doug > +1 sanjay +
-
Re: [PROPOSAL] new subproject: Avro
Dhruba Borthakur 2009-04-06, 05:42
+1. Awesome!
-dhruba On Sun, Apr 5, 2009 at 10:33 PM, Sanjay Radia <[EMAIL PROTECTED]> wrote: > > On Apr 2, 2009, at 3:05 PM, Doug Cutting wrote: > > I propose we add a new Hadoop subproject for Avro, a serialization >> system. My ambition is for Avro to replace both Hadoop's RPC and to be >> used for most Hadoop data files, e.g., by Pig, Hive, etc. >> >> Initial committers would be Sharad Agarwal and me, both existing Hadoop >> committers. We are the sole authors of this software to date. >> >> The code is currently at: >> >> http://people.apache.org/~cutting/avro.git/<http://people.apache.org/%7Ecutting/avro.git/> >> >> To learn more: >> >> git clone http://people.apache.org/~cutting/avro.git/<http://people.apache.org/%7Ecutting/avro.git/>avro >> cat avro/README.txt >> >> Comments? Questions? >> >> Doug >> >> +1 > > sanjay +
-
Re: [PROPOSAL] new subproject: Avro
Brian Forney 2009-04-06, 20:38
On 4/2/09 5:05 PM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:
> I propose we add a new Hadoop subproject for Avro, a serialization > system. My ambition is for Avro to replace both Hadoop's RPC and to be > used for most Hadoop data files, e.g., by Pig, Hive, etc. > > Initial committers would be Sharad Agarwal and me, both existing Hadoop > committers. We are the sole authors of this software to date. > > The code is currently at: > > http://people.apache.org/~cutting/avro.git/ > > To learn more: > > git clone http://people.apache.org/~cutting/avro.git/ avro > cat avro/README.txt > > Comments? Questions? > > Doug > +1 +

Re: [PROPOSAL] new subproject: Avro
Tom White 2009-04-07, 17:26
On Thu, Apr 2, 2009 at 11:05 PM, Doug Cutting <[EMAIL PROTECTED]> wrote:
> I propose we add a new Hadoop subproject for Avro, a serialization system. +1 Tom > My ambition is for Avro to replace both Hadoop's RPC and to be used for > most Hadoop data files, e.g., by Pig, Hive, etc. > > Initial committers would be Sharad Agarwal and me, both existing Hadoop > committers. We are the sole authors of this software to date. > > The code is currently at: > > http://people.apache.org/~cutting/avro.git/ > > To learn more: > > git clone http://people.apache.org/~cutting/avro.git/ avro > cat avro/README.txt > > Comments? Questions? > > Doug > +

Re: [PROPOSAL] new subproject: Avro
Doug Cutting 2009-04-08, 03:33
To be clear, since a few folks have missed this point: Avro is not
complete. At some point in the future, before people start using it as a format for persistent data, we'll need to stop altering its specification, or at least do so much more cautiously. But before then, my immediate goal is to move development from private to open so that we have a chance to incorporate feedback before we lock down the specification. For example, several folks have raised the issue of compatibility with Thrift. We certainly want to avoid gratuitous incompatibilities. There are also features clearly missing from Avro that we expect to add before we make a release, like default values, a more efficient RPC handshake, etc. And some features that we might consider removing, if they're not broadly useful and inhibit interoperability, like single-float, which isn't in Thrift, Python, etc. And I expect there will be more such issues raised in the coming weeks and months. But before we can discuss and resolve such issues we need a forum in which to do so. That's all I am after at this point: mailing lists, a bug database, a public source code repository, etc., so that we can start accepting patches, adding committers, etc. Three days have now passed since I initially proposed this, the nominal time for an Apache vote. Is there anyone who strongly opposes taking the development of Avro public as a Hadoop subproject? Only PMC votes are binding, but I would vastly prefer that the broader community also supports this step in the process. Thanks, Doug Doug Cutting wrote: > I propose we add a new Hadoop subproject for Avro, a serialization > system. My ambition is for Avro to replace both Hadoop's RPC and to be > used for most Hadoop data files, e.g., by Pig, Hive, etc. > > Initial committers would be Sharad Agarwal and me, both existing Hadoop > committers. We are the sole authors of this software to date. 
> > The code is currently at: > > http://people.apache.org/~cutting/avro.git/ > > To learn more: > > git clone http://people.apache.org/~cutting/avro.git/ avro > cat avro/README.txt > > Comments? Questions? > > Doug +

Re: [PROPOSAL] new subproject: Avro
Chad Walters 2009-04-08, 07:03
Doug, After our off-list chat, and given that you have indicated that the design is still in flux and that you are open to discussing changes that would permit interoperability, I am not as concerned as I was. My urgency came from concern that once the design was put in place as part of an Apache subproject, rather than open sourced in some other less prominent forum, it would increase the barrier to interoperability; in particular, I was concerned that people would assume the design of the data format was fully-baked and start persisting large amounts of data in some early version of the format, potentially prematurely ossifying the design in a state unsuited for compatibility with Thrift. Given your clarifications around this, my fears clearly were not well-founded. Please accept my apology if I came across as obstructionist. I was honestly advocating on behalf of what I believe is in the best interest of our shared user base. Clearly we have some disagreements about the value of some of Thrift's design choices and what those mean for various use cases. I think we also have some differences of opinion about the relative difficulty of implementation versus the value of interoperability. Hopefully, the next few months will afford an opportunity to examine the sources of those disagreements and see if they can be resolved. Sincerely, Chad ----- Original Message ---- From: Doug Cutting <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Sent: Tuesday, April 7, 2009 8:33:32 PM Subject: Re: [PROPOSAL] new subproject: Avro To be clear, since a few folks have missed this point: Avro is not complete. At some point in the future, before people start using it as a format for persistent data, we'll need to stop altering its specification, or at least do so much more cautiously. But before then, my immediate goal is to move development from private to open so that we have a chance to incorporate feedback before we lock down the specification. 
For example, several folks have raised the issue of compatibility with Thrift. We certainly want to avoid gratuitous incompatibilities. There are also features clearly missing from Avro that we expect to add before we make a release, like default values, a more efficient RPC handshake, etc. And some features that we might consider removing, if they're not broadly useful and inhibit interoperability, like single-float, which isn't in Thrift, Python, etc. And I expect there will be more such issues raised in the coming weeks and months. But before we can discuss and resolve such issues we need a forum in which to do so. That's all I am after at this point: mailing lists, a bug database, a public source code repository, etc., so that we can start accepting patches, adding committers, etc. Three days have now passed since I initially proposed this, the nominal time for an Apache vote. Is there anyone who strongly opposes taking the development of Avro public as a Hadoop subproject? Only PMC votes are binding, but I would vastly prefer that the broader community also supports this step in the process. Thanks, Doug Doug Cutting wrote: > I propose we add a new Hadoop subproject for Avro, a serialization system. My ambition is for Avro to replace both Hadoop's RPC and to be used for most Hadoop data files, e.g., by Pig, Hive, etc. > > Initial committers would be Sharad Agarwal and me, both existing Hadoop committers. We are the sole authors of this software to date. > > The code is currently at: > > http://people.apache.org/~cutting/avro.git/ > > To learn more: > > git clone http://people.apache.org/~cutting/avro.git/ avro > cat avro/README.txt > > Comments? Questions? > > Doug +

RE: [PROPOSAL] new subproject: Avro
Chad Walters 2009-04-06, 07:23
Cross-posting to the Thrift dev and user lists since folks there may be interested in this. It appears that my attempts to subscribe to [EMAIL PROTECTED] from my work email were silently failing somewhere along the line -- I'll try not to take it personally. ;) Some others have experienced this too -- so if you didn't get a subscription confirmation message, then it failed. Try from a different address, I guess. You can view the thread here without being subscribed: http://mail-archives.apache.org/mod_mbox/hadoop-general/200904.mbox/browser Doug, First, let me say that I think Avro has a lot of useful features -- features that I would like to see fully supported in Thrift. At a minimum, I would like for us to be able to hash out the details to guarantee that there can really be full interoperability between Avro and Thrift. I am really interested in working cooperatively and collaboratively on this and I am willing to put in significant time on design and communication to help make full interoperability possible (I am unfortunately not able to contribute code directly at this time). Second, I think all of this decision about where Avro should live requires more thought and more discussion. I'd love to hear from more folks outside of Yahoo on this topic: so far all of the +1 votes have come from Yahoo employees. I'd also love to hear from other folks who have significant investments in both Thrift and Hadoop. Some points to think about: -- You suggest that there is not a lot in Thrift that Avro can leverage. I think you may be overlooking the fact that Thrift has a user base and a community of developers who are very interested in issues of cross-language data serialization and interoperability. Thrift has committers with expertise in a pretty big set of languages and leveraging this could get Avro's functionality onto more languages faster than the current path. 
Also, there is in fact significant overlap between Hadoop users and Thrift users at this point, as well as significant use of Thrift in more than one Hadoop sub-project. At the code level, Thrift contains a transport abstraction and multiple different transport and server implementations in many different target languages. If there were closer collaboration, Avro could certainly benefit from leveraging the existing ones and any additional contributions in this area would benefit both projects. -- You also suggest that the two are largely disjoint from a technical perspective: "Thrift fundamentally standardizes an API, not a data format. Avro fundamentally is a data format specification, like XML." I agree with the fundamental part but I think that doesn't bring to light enough of what is in common and what is different for purposes of this discussion. Thrift specifies a type system, an API for data formats and transport mechanisms, a schema resolution algorithm, and provides implementations of several distinct data formats and transports. Avro specifies a single data format but it also brings along several other things as well, including a type system, specific RPC mechanism and a schema resolution algorithm. The most significant issue is that both of them specify a type system. At a very minimum I would like to see Avro and Thrift make agreements on that type system. The fact that there is significant existing investment in the Thrift type system by the Thrift community should weigh somewhere in this discussion. Obviously, the technical needs of Avro will also have weight there, but where there is room for choice, the Thrift choices should be respected. Arbitrary changes here will make it unnecessarily painful, perhaps impossible, for Thrift to directly adopt Avro and instead Thrift will be forced to make an "Avro-like" data specification, hampering interoperability for everyone. 
There may be pitfalls in the other areas of overlap as well that would prevent real interoperability -- let's elucidate them in further discussions. -- Avro appears to have 3 primary features that Thrift does not currently support sufficiently: 1. Schema serialization allowing for compact representation of files containing large numbers of records of identical types 2. Dynamic interpretation of schemas, which improves ease-of-use in dynamic languages (like the Python Hadoop Streaming use case) 3. Lazy partial deserialization to support "projection" Note that features 1 and 3 are independent of whether schemas are dynamically interpreted or compiled into static bindings. WRT #1: Thrift's DenseProtocol goes some distance towards this although it doesn't go the whole way. Thrift can easily be extended to further compact the DenseProtocol's wire format for special cases where all fields are required. We have had significant discussions on the Thrift list about doing more in this area previously but we couldn't get folks from Hadoop who cared most about this use case to participate with us on capturing a complete set of requirements and so there was no strong driver for it. WRT #2: I totally understand the case you make for dynamic interpretation in ad hoc data processing. I would love to see Thrift enhanced to do this kind of thing. WRT #3: Partial deserialization seems like a really useful feature for several use cases, not just for "projection". I think Thrift could and should be extended to support this functionality, and it should be available for both static bindings and dynamic schema interpretation via field names and field IDs where possible. "Perhaps Thrift could be augmented to support Avro's JSON schemas and serialization. Then it could interoperate with other Avro-based systems. But then Thrift would have yet another serialization format, that every language would need to implement for it to be useful..." 
First, that "Perhaps" hides a lot of complexity and unless that is hashed out ahead of time I am pretty sure the real answer will be "Thrift cannot be augmented to support Avro directly but instead could be augmen +
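As an illustration of the first feature listed in the message above (storing the schema once so that many records of the same type need no per-field tags), here is a minimal Python sketch. It does not reproduce Avro's or Thrift's actual wire format: the `SCHEMA` layout, the use of `struct` format codes as field types, and the function names are all assumptions made for this example.

```python
import io
import json
import struct

# Hypothetical schema: field names paired with struct format codes.
# This is NOT Avro's real schema language, just a stand-in for it.
SCHEMA = {"name": "Point", "fields": [["x", "i"], ["y", "i"]]}


def write_records(schema, records):
    """Write the schema once as a header, then each record as bare values."""
    out = io.BytesIO()
    header = json.dumps(schema).encode("utf-8")
    out.write(struct.pack(">I", len(header)))  # 4-byte header length
    out.write(header)
    record_format = ">" + "".join(code for _, code in schema["fields"])
    for record in records:
        values = [record[name] for name, _ in schema["fields"]]
        out.write(struct.pack(record_format, *values))  # no field tags or names
    return out.getvalue()


def read_records(data):
    """Read the header schema, then use it to decode every record."""
    (header_len,) = struct.unpack_from(">I", data, 0)
    schema = json.loads(data[4:4 + header_len].decode("utf-8"))
    names = [name for name, _ in schema["fields"]]
    record_format = ">" + "".join(code for _, code in schema["fields"])
    size = struct.calcsize(record_format)
    offset = 4 + header_len
    records = []
    while offset < len(data):
        values = struct.unpack_from(record_format, data, offset)
        records.append(dict(zip(names, values)))
        offset += size
    return records
```

With this layout each record costs only the bytes of its values (8 bytes for two 32-bit integers here), and the cost of the schema is paid once per file, which is the compactness argument made in point 1.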

Re: [PROPOSAL] new subproject: Avro
Doug Cutting 2009-04-06, 19:12
Chad Walters wrote:
> -- You suggest that there is not a lot in Thrift that Avro can > leverage. I think you may be overlooking the fact that Thrift has a > user base and a community of developers who are very interested in > issues of cross-language data serialization and interoperability. I meant that in terms of common code, not coders. Coders can belong to more than one community but code should generally not. Hadoop Core has become a sprawling community that we're trying to split. It's more productive to have more, small communities than few large ones. A project needs a handful of active developers, but too many and it becomes ungainly. So, if it's technically possible for a codebase to be distinct, and it can attract enough active developers to sustain itself, that is a preferable structure. > At the code level, Thrift contains a transport abstraction and > multiple different transport and server implementations in many > different target languages. If there were closer collaboration, Avro > could certainly benefit from leveraging the existing ones and any > additional contributions in this area would benefit both projects. The transport and server implementations are indeed an area where code could potentially be shared between Avro and Thrift. Perhaps someone could start a separate project with reusable transport and server implementations to support RPC? In any case, Avro primarily specifies a binary message format, not a full transport. We hope to piggyback off other transport implementations, like HTTP servers, etc. Full transports involve authentication, authorization, encryption, etc., which are outside of the scope of Avro. > The most significant issue is that both of them specify a type > system. At a very minimum I would like to see Avro and Thrift make > agreements on that type system. This makes good sense. It would be good if these were interoperable. Thrift has byte and i16, which Avro does not currently. 
I'd like to add a fixed<n> primitive type to Avro, where n is the number of bytes and is specified in the schema, so that one could, e.g., define a byte as fixed<1>, i16 as a fixed<2> and md5 as a fixed<16>. Thrift has both lists and sets, Avro has just arrays, which are equivalent to lists (they're ordered). Perhaps Avro could add sets. Are they leveraged heavily in Thrift? I've not heard much call for them in Avro yet. Avro has single-float, Thrift does not. Avro could perhaps lose this. Avro distinguishes UTF-8 text strings from byte strings, while Thrift does not. I am reluctant to lose this distinction. Avro has unions and a null type, while Thrift does not. Does Thrift support recursive data structures? > Furthermore, you say that last part ("Thrift would have yet another > serialization format...") like it is a bad thing... When faced with multiple programming and scripting languages, multiple serialization formats should be discouraged, or one ends up with multiplicative compatibility problems. A single, primary data format would vastly simplify the Hadoop ecosystem. Yes, folks need to be able to easily import and export data, but expecting scripts in arbitrary languages to be able to process data in arbitrary formats seems unwise. > Note that it is > an explicit design goal of Thrift to allow for multiple different > serialization formats so that lots of different use cases can be > supported by the same fundamental framework. That's not a design goal of Avro, which intends to provide a single, well-specified, easy to implement serialization format. This is not in conflict with Thrift, it's just a different goal. > Also, doesn't Avro essentially contain "another serialization format > that every language would need to implement for it to be useful"? > Seems like the same basic set of work to me, whether it is in Avro or > Thrift. None of Thrift's existing formats solve the problems Avro seeks to. 
Thrift may be able to incorporate Avro's format, if it has good format generalizations, ideally using Avro's code. So there should be little duplication of effort in such an approach. I didn't say it was onerous, I said that, like in most data structure languages (e.g., programming languages), Avro permits folks to name fields with symbolic names alone. In human-authored software, symbolic naming is generally preferable to numeric naming. Is that really a matter of dispute? Optional features increase compatibility complexity and are harder to maintain and test. A Thrift IDL without numbers would not provide versioning features to non-dynamic languages. They are formally equivalent. For machines, matching numbers is easier, but people usually prefer to operate on names, and names can be automatically mapped to numbers. I looked into changing Thrift to support Avro's features, and it was very messy. Perhaps someone else could do this more easily. Building Avro as a part of Thrift would take considerably more effort for me and I think offer little more than it does separately. If you feel differently, you are free to fork Avro, start a competitor, provide patches that integrate it into Thrift, or whatever. It could be a floor wax and a dessert topping! Doug +
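The fixed<n> primitive proposed in the message above (a byte as fixed<1>, an i16 as fixed<2>, an md5 as fixed<16>) can be sketched roughly as follows. This is only an illustration of the idea: the `Fixed` class and its methods are hypothetical, not an Avro API, and the message does not say how a fixed<2> would be interpreted as an integer, so the big-endian reading below is an assumption.

```python
import hashlib


class Fixed:
    """Hypothetical fixed<n> type: every value is exactly n raw bytes."""

    def __init__(self, size):
        if size <= 0:
            raise ValueError("size must be positive")
        self.size = size

    def encode(self, value):
        if len(value) != self.size:
            raise ValueError(f"expected {self.size} bytes, got {len(value)}")
        # No length prefix is written; the schema already carries n.
        return bytes(value)

    def decode(self, data, offset=0):
        end = offset + self.size
        if end > len(data):
            raise ValueError("not enough bytes")
        return data[offset:end], end  # value and the offset just past it


# The three definitions mentioned in the message.
BYTE = Fixed(1)
I16 = Fixed(2)
MD5 = Fixed(16)

# Assumption: a Thrift-style i16 stored big-endian in a fixed<2>.
encoded_i16 = I16.encode((258).to_bytes(2, "big"))
digest, next_offset = MD5.decode(hashlib.md5(b"avro").digest() + b"trailing")
```

Because the size lives in the schema, a reader can skip or slice such a value without parsing it, which is what makes it usable for byte, i16 and md5 alike.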

Re: [PROPOSAL] new subproject: Avro
Kevin Clark 2009-04-07, 00:17
Hi Doug,
On Mon, Apr 6, 2009 at 12:12 PM, Doug Cutting <[EMAIL PROTECTED]> wrote: > Chad Walters wrote: >> >> -- You suggest that there is not a lot in Thrift that Avro can >> leverage. I think you may be overlooking the fact that Thrift has a >> user base and a community of developers who are very interested in >> issues of cross-language data serialization and interoperability. > > I meant that in terms of common code, not coders. Coders can belong to more > than one community but code should generally not. Hadoop Core has become a > sprawling community that we're trying to split. It's more productive to > have more, small communities than few large ones. A project needs a > handful of active developers, but too many and it becomes ungainly. So, if > it's technically possible for a codebase to be distinct, and it can attract > enough active developers to sustain itself, that is a preferable structure. I agree with you in general, but cross-language libraries require larger communities than other projects. It's non-trivial to gather groups of coders to support each language the project chooses to include. Right now Thrift has some level of support for a dozen languages. We've been really very active in the last several months, and devs have come out of the woodwork to extend their favorite language(s) binding(s). The overhead for those people (or some equivalent group) to pay attention to another mailing list, another bug tracker, another IRC channel, and another community isn't trivial. I understand that developing the code itself may be more convenient for some, but I think that the community that supports the code is what really counts. If we can share that, and still achieve our goals, I think we'll be better off. Of course, this assumes that one of the primary goals of Avro is to be cross-language. Is that the case, or have I misunderstood? > Avro has unions and a null type, while Thrift does not. Does Thrift support > recursive data structures? 
We don't support recursive data structures. We do, however, have a ticket open where we're discussing union support (THRIFT-409). In your post you talk about the problems associated with supporting multiple serialization formats. One of the things I like about Thrift is that even though Thrift supports many different things, application developers aren't at all obligated to. In fact, I don't expect anyone does. It would be perfectly reasonable for Hadoop to specify that they use the Avro data format for transmissions, and the cross language library to provide the API could be Thrift. I think you said something similar in your post, but if not please do clarify. On the "names vs field ids" issue: I know that the Ruby and Java Thrift libraries provide name-based access to this information, and know of no restriction that would keep the others from doing the same. It's just a matter of a little code. >> Consider an alternative: making Avro more like a sub-project of >> Thrift or just implementing it directly in Thrift. > > I looked into changing Thrift to support Avro's features, and it was very > messy. Perhaps someone else could do this more easily. > > Building Avro as a part of Thrift would take considerably more effort for me > and I think offer little more than it does separately. If you feel > differently, you are free to fork Avro, start a competitor, provide patches > that integrate it into Thrift, or whatever. I'd again like to appeal to you that it's the community that's harder to develop than the code, and we've got one already. I also don't see the implementation being especially difficult, but maybe we're looking at different information. I'd be happy to talk with you about it if you're open to the idea. The goals of Avro seem to be consistent with the goals of each of Thrift's contributors who have developed a new protocol. 
We can already offer the things you've stated you don't want to develop, and I think we've got a lot more to gain working together than we do working separately. That being said, I'm fairly confident we'll be providing an Avro protocol on our own at some point if you're not interested in working together. But I think if we go down that path we're doing a disservice to users of both Thrift and Avro. Kevin Clark http://glu.ttono.us +

Re: [PROPOSAL] new subproject: Avro
Doug Cutting 2009-04-07, 04:15
Kevin Clark wrote:
> The overhead for those people (or some > equivalent group) to pay attention to another mailing list, another > bug tracker, another irc channel, and another community isn't trivial. Communities form around code, and, if Avro's code is largely disjoint from Thrift's, we should not assume that everyone in the Thrift community cares about Avro or vice versa. > Of course, this assumes that one of the primary goals of Avro is to be > cross language. Is that the case, or have I misunderstood? Yes, that is a goal. > It would be perfectly reasonable for Hadoop to specify that they > use the Avro data format for transmissions, and the cross language > library to provide the API could be Thrift. I think you said something > similar in your post, but if not please do clarify. Yes, perhaps this could be done. I am not convinced that TProtocol is an ideal API for reading and writing Avro data, but it could perhaps be made to work reasonably well. > That being said, I'm fairly confident we'll be providing an Avro > protocol on our own at some point if you're not interested in working > together. But I think if we go down that path we're doing a disservice > to users of both Thrift and Avro. I have never said I was not interested in working together. I've said that I think Avro is fundamentally different from Thrift. Avro is a specific format, Thrift is a generic API for various formats, none like Avro. They might be made to work together. But at this point I see no point in forcing them together. If TProtocol's API is a good match for Avro's format and features, then it should be easy for folks to implement TProtocol using Avro's code and include Avro in Thrift. If the match is not good then perhaps we can adjust Thrift and/or Avro to improve it. Doug +

Re: [PROPOSAL] new subproject: Avro
Chad Walters 2009-04-07, 08:56
Doug, > I have never said I was not interested in working together. That's great -- glad to hear that you are open to collaboration. My concern is that by making a separate (sub)project, however, it may be difficult for us to work together in practice, and in particular it may be difficult for Thrift to leverage Avro's source code. > I've said that I think Avro is fundamentally different from Thrift. > Avro is a specific format, Thrift is a generic API for various formats, none like Avro > They might be made to work together. But at this point I see no point in forcing > them together. I don't think that they are as far apart as you are making it sound with this statement. I do think, however, that it will be very difficult for them to work together properly if the goal of code reuse by Thrift is not an explicit goal of Avro. The easiest way I can come up with to guarantee this is simply to incorporate Avro's feature set into Thrift. If you have other mechanisms for doing this, I'd love to hear them. > If TProtocol's API is a good match for Avro's format and features, > then it should be easy for folks to implement TProtocol using Avro's code and > include Avro in Thrift. If the match is not good then perhaps we can adjust Thrift > and/or Avro to improve it. Absolutely. And right now, there are sufficient differences in the type system and other areas that do require some adjustments, likely some on both sides (although, as I said in my previous email, we need to account for the fact that Thrift has current users to support so backwards-compatibility will need to be a consideration). > Communities form around code, and, if Avro's code is largely disjoint > from Thrift's, we should not assume that everyone in the Thrift > community cares about Avro or vice versa. IMO communities form around shared goals and purposes. Code and designs are created to achieve those purposes; they are also malleable and can be bent to achieve new goals and purposes. 
If we can find common cause, then we form a common community. You have some features that you want to satisfy for Hadoop's purposes: compact serialization of large files containing many records of identical structure; partial deserialization in support of projection; dynamic interpretation of object schemas; better/more efficient RPC -- all delivered across multiple languages. The first three are also use cases that are of interest to some portion of the Thrift community and the fourth is something that Thrift already provides. Avro at this point is fairly nascent -- you have a design, some code, a couple of developers, and a target group of future users who seem very receptive to what you are working on. You do not have current users, however, and that should mean that you have some degree of flexibility to your design where it doesn't make a material difference to the use cases you are trying to solve. If you are willing to make some modifications to that design and code, the work on Avro could also work directly towards extending Thrift's functionality. I am pretty certain that the Thrift community would be willing to make some reasonable modifications and extensions to Thrift to smooth the way for this as well. I think that by working closely with the Thrift community directly in the Thrift code base, you will get several significant benefits. You will be able to directly leverage the transport and server implementations in Thrift today and any future work in this area is also beneficial. You will have a built-in set of developers and committers across many languages who are already familiar with issues in cross-language serialization (and I agree with Kevin that this is not as portable as you seem to think it is). You will be able to avoid writing lots of parts of an RPC framework in multiple languages that you would need to write to make Avro a stand-alone solution for Hadoop. 
You would have a significant role in shaping the direction of Thrift to make sure that it remains a strong solution for Hadoop. It is clear to me that a slightly modified version of Avro's data format should fit just fine as a Thrift TProtocol implementation. Out of the box this would, of course, only provide for statically generated bindings, but this is enough to satisfy the first of the desired features I described above. The second feature, partial deserialization, is a feature that I would like to see in Thrift for a variety of use cases, not just your projection use case -- for example, message routing where only a message header is deserialized to determine where to pass along an otherwise uninterpreted block of data. This feature is not tightly coupled to the Avro data format in any way. As you have stated, this is possible to do when you have the schema in hand. Note that the static bindings in Thrift are another way that the schema can be transmitted -- in fact, the whole schema could just be retrievable from the bindings directly and fed into whatever mechanism is available for dynamic interpretation. But we wouldn't have to go so far as that for field look up by name -- as Kevin pointed out, the Java and Ruby Thrift libraries already have mechanisms for sufficient introspection to accomplish the right kind of lookups, I believe, and the other libraries could be extended to do the same quite easily. So partial deserialization can be supported via either dynamic interpretation and/or via introspection features of the static bindings. To support the second use case, dynamic schema interpretation, there is definitely significant new code to be written. Note that this code is essentially the same code wherever you are writing it. Whatever work you are doing in Avro to be able to dynamically interpret JSON IDL could just be directly implemented in Thrift -- we would just define a JSON version of the Thrift IDL which would look a lot like Avro's IDL. 
To help further with interoperability we could make the Thrift compiler generate the JSON IDL fro +
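The message-routing case described in the message above (deserialize only a header, pass the rest along as an uninterpreted block) can be sketched in a few lines of Python. The framing used here, two length-prefixed sections, and all function names are assumptions for illustration; neither Thrift nor Avro is claimed to frame messages this way.

```python
import struct


def encode_message(destination, body):
    """Frame = length-prefixed UTF-8 header + length-prefixed opaque body."""
    header = destination.encode("utf-8")
    return (struct.pack(">I", len(header)) + header
            + struct.pack(">I", len(body)) + body)


def read_header(frame):
    """Decode only the header; the body is returned as undecoded bytes."""
    (header_len,) = struct.unpack_from(">I", frame, 0)
    header_end = 4 + header_len
    destination = frame[4:header_end].decode("utf-8")
    (body_len,) = struct.unpack_from(">I", frame, header_end)
    body_start = header_end + 4
    return destination, frame[body_start:body_start + body_len]


def route(frame, queues):
    """Append the raw body to the queue named in the header."""
    destination, raw_body = read_header(frame)
    queues.setdefault(destination, []).append(raw_body)
```

The router never needs the schema of the body, only enough structure to find where the header ends, which is why the message treats this feature as independent of any particular data format.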

Re: [PROPOSAL] new subproject: Avro
Doug Cutting 2009-04-07, 16:16
Chad Walters wrote:
> I do think, however, that it will be very > difficult for them to work together properly if the goal of code > reuse by Thrift is not an explicit goal of Avro. Code reuse is an explicit goal of Avro. It's an open source project with public APIs intended to expose all of its functionality. > I think that by working closely with the Thrift community directly in > the Thrift code base, you will get several significant benefits. It's not like I did not consider this approach, evolving Thrift to better support my needs. In fact, I considered it for months before abandoning it. I am very familiar with these arguments. I am starting a new serialization project fully aware of the hazards. I feel that, on balance, it is considerably simpler for Avro to be developed separately and that this will not adversely affect its users or its developer community. You may disagree. As volunteers here, we are both free to do as we choose. > To support the second use case, dynamic schema interpretation, there > is definitely significant new code to be written. Note that this code > is essentially the same code wherever you are writing it. This is a primary case for Avro. Without it, Avro's a non-starter. And, as you note, this is new code that must be written for each platform. That's primarily what Avro is. Fitting this code into Thrift would only make it more complicated. > Whatever > work you are doing in Avro to be able to dynamically interpret JSON > IDL could just be directly implemented in Thrift -- we would just > define a JSON version of the Thrift IDL which would look a lot like > Avro's IDL. To help further with interoperability we could make the > Thrift compiler generate the JSON IDL from the Thrift IDL as another > output target. Sure, we could bolt Avro's features onto the side of Thrift, but that doesn't make it easier for me to deliver Avro's features nor any easier for folks to use them. And Thrift doesn't need a second IDL format. 
It already suffers from too many formats. I seek a single format, not a multitude. > The basic upshot of the above is that it is not that hard to see how > Avro could be directly integrated into Thrift if you were willing to > entertain that option and I believe that you would see significant > benefits that would more than offset the impact to your own ease of > development about which you expressed concerns. I am unlikely to implement it myself, as it does not address my needs. > I am proposing that the IDL would > only allow for field IDs to be omitted in the case where the schema > was being interpreted dynamically -- no static bindings could be > generated from IDL without fully specified field IDs. So if you are > only interested in dynamic interpretation, you never have to look at > or even think about field IDs. Does that in any way alter your stance > here? Not really. It adds an "except on Tuesday" clause in the specification, which is not ideal. In Avro we can generate static bindings without using field ids. >> It could be a floor wax and a dessert topping! > > Love the SNL reference, but I don't think it is really appropos. My > vision for Thrft with Avro's features folded in as a unified > framework for cross-language serialization, covering a variety of use > cases, is not jamming two completely heterogeneous things together. I > can easily see wanting to take structures represented in one > serialization format from disk and send them out over RPC. Thrift > provides the means to do this kind of thing seemlessly, with formats > appropriate to both use cases, rather than selecting a format that is > good for one use case and so-so for the other. I believe that the cost of supporting multiple formats is too high. We differ on that point. I don't think one-stop-shopping is appropriate here, but prefer to provide an ala-carte format. Doug +
Doug Cutting 2009-04-07, 16:16
-
Re: [PROPOSAL] new subproject: Avro
Doug Cutting 2009-04-06, 20:08

Chad Walters wrote:
> so far all of the +1 votes have come from Yahoo employees.

Not quite. Jeff Hammerbacher is with Cloudera. Sameer Paranjpye is with Mechanical Zoo. Dhruba Borthakur is with Facebook.

Doug
-
Re: [PROPOSAL] new subproject: Avro
George Porter 2009-04-06, 20:25

One thing that you might consider is adding the ability to add path state to the serialization to better support tracing and instrumentation of the RPC layer with Avro. I did something like this with Thrift, where I simply added a "hidden" parameter with a well-known but unused parameter. If the receiver understood the tracing format, it could pull metadata from that parameter, and if not, then it would just ignore it. Based on your description of the "projection" feature, it seems like that would apply here as well.

Thanks,
-George

On Apr 6, 2009, at 1:08 PM, Doug Cutting wrote:

> Chad Walters wrote:
>> so far all of the +1 votes have come from Yahoo employees.
>
> Not quite. Jeff Hammerbacher is with Cloudera. Sameer Paranjpye is
> with Mechanical Zoo. Dhruba Borthakur is with Facebook.
>
> Doug
-
Re: [PROPOSAL] new subproject: Avro
Doug Cutting 2009-04-06, 20:48

George Porter wrote:
> One thing that you might consider is adding the ability to add path
> state to the serialization to better support tracing and instrumentation
> of the RPC layer with Avro.

Such cross-system tracing would be great, but I think this might best be done as part of request and response metadata, rather than made part of the user's request and response data. You and I worked this nearly through for Hadoop's existing RPC:

http://issues.apache.org/jira/browse/HADOOP-4049

Unfortunately, that patch stalled awaiting benchmarks, and now is stale.

Avro currently specifies a request and response payload format, but does not say much about metadata. In an HTTP-based transport, tracing might be done through headers. As we adapt Hadoop's optimized RPC transport to support Avro messages we should probably insert a generic metadata layer, analogous to HTTP's headers, to support features such as this.

Doug
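
The generic metadata layer suggested above can be sketched roughly as follows. This is only an illustration of the idea, not Avro's or Hadoop's wire format: the framing (a 4-byte length, a JSON header, then the payload), the function names, and the "trace-id" key are all assumptions made for the example. The point it shows is that tracing data travels next to the payload rather than inside it, and that a receiver can ignore metadata keys it does not understand.

```python
import json
import struct


def encode_message(metadata, payload):
    """Hypothetical frame: 4-byte big-endian header length, JSON header, raw payload."""
    header = json.dumps(metadata, sort_keys=True).encode("utf-8")
    return struct.pack(">I", len(header)) + header + payload


def decode_message(frame):
    """Split a frame back into its metadata dict and its untouched payload bytes."""
    (header_len,) = struct.unpack(">I", frame[:4])
    metadata = json.loads(frame[4:4 + header_len].decode("utf-8"))
    payload = frame[4 + header_len:]
    return metadata, payload


def handle(frame, known_keys=("trace-id",)):
    """Keep only the metadata keys this receiver understands; ignore the rest."""
    metadata, payload = decode_message(frame)
    understood = {k: v for k, v in metadata.items() if k in known_keys}
    return understood, payload
```

Because the payload is never parsed by the metadata layer, the same tracing header works for every payload type, which is the benefit of putting this support in the transport rather than extending each payload.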
-
Re: [PROPOSAL] new subproject: Avro
George Porter 2009-04-06, 21:02

>
> Unfortunately, that patch stalled awaiting benchmarks, and now is
> stale. Avro currently specifies a request and response payload
> format, but does not say much about metadata. In an HTTP-based
> transport, tracing might be done through headers. As we adapt
> Hadoop's optimized RPC transport to support Avro messages we should
> probably insert a generic metadata layer, analogous to HTTP's
> headers, to support features such as this.
>
> Doug

Doug,

This would be great. While it is always possible to add extension data to the payload itself, having such support in the transport itself can be quite useful, in my opinion. Especially if there are a variety of different payloads, you don't have to extend each one.

-George
-
Re: [PROPOSAL] new subproject: Avro
Chad Walters 2009-04-06, 21:05

Not to get too technical on you, but:

-- I don't see any message from Jeff on this topic
-- Sameer has either just left or is still in the process of leaving Yahoo
-- Dhruba's message was posted after my message was composed and while I was still having trouble getting my email to be recognized by the subscription system

Chad

----- Original Message ----
From: Doug Cutting <[EMAIL PROTECTED]>
Cc: [EMAIL PROTECTED]
Sent: Monday, April 6, 2009 1:08:52 PM
Subject: Re: [PROPOSAL] new subproject: Avro

Chad Walters wrote:
> so far all of the +1 votes have come from Yahoo employees.

Not quite. Jeff Hammerbacher is with Cloudera. Sameer Paranjpye is with Mechanical Zoo. Dhruba Borthakur is with Facebook.

Doug
Re: [PROPOSAL] new subproject: Avro
Doug Cutting 2009-04-06, 22:17

Chad Walters wrote:
> Not to get too technical on you, but: -- I don't see any message from
> Jeff on this topic

Sorry, you're right. I inadvertently counted a private response. I'll let Jeff speak in public for himself from now on!

Doug
-
RE: [PROPOSAL] new subproject: Avro
Chad Walters 2009-04-06, 06:51

Doug,

First, let me say that I think Avro has a lot of useful features -- features that I would like to see fully supported in Thrift. At a minimum, I would like for us to be able to hash out the details to guarantee that there can really be full interoperability between Avro and Thrift. I am really interested in working cooperatively and collaboratively on this and I am willing to put in significant time on design and communication to help make full interoperability possible (I am unfortunately not able to contribute code directly at this time).

Second, I think all of this decision about where Avro should live requires more thought and more discussion. I'd love to hear from more folks outside of Yahoo on this topic: so far all of the +1 votes have come from Yahoo employees. I'd also love to hear from other folks who have significant investments in both Thrift and Hadoop.

Some points to think about:

-- You suggest that there is not a lot in Thrift that Avro can leverage. I think you may be overlooking the fact that Thrift has a user base and a community of developers who are very interested in issues of cross-language data serialization and interoperability. Thrift has committers with expertise in a pretty big set of languages and leveraging this could get Avro's functionality onto more languages faster than the current path. Also, there is in fact significant overlap between Hadoop users and Thrift users at this point, as well as significant use of Thrift in more than one Hadoop sub-project. At the code level, Thrift contains a transport abstraction and multiple different transport and server implementations in many different target languages. If there were closer collaboration, Avro could certainly benefit from leveraging the existing ones and any additional contributions in this area would benefit both projects.

-- You also suggest that the two are largely disjoint from a technical perspective: "Thrift fundamentally standardizes an API, not a data format. Avro fundamentally is a data format specification, like XML." I agree with the fundamental part but I think that doesn't bring to light enough of what is in common and what is different for purposes of this discussion. Thrift specifies a type system, an API for data formats and transport mechanisms, a schema resolution algorithm, and provides implementations of several distinct data formats and transports. Avro specifies a single data format but it also brings along several other things as well, including a type system, specific RPC mechanism and a schema resolution algorithm.

The most significant issue is that both of them specify a type system. At a very minimum I would like to see Avro and Thrift make agreements on that type system. The fact that there is significant existing investment in the Thrift type system by the Thrift community should weigh somewhere in this discussion. Obviously, the technical needs of Avro will also have weight there, but where there is room for choice, the Thrift choices should be respected. Arbitrary changes here will make it unnecessarily painful, perhaps impossible, for Thrift to directly adopt Avro and instead Thrift will be forced to make an "Avro-like" data specification, hampering interoperability for everyone. There may be pitfalls in the other areas of overlap as well that would prevent real interoperability -- let's elucidate them in further discussions.

-- Avro appears to have 3 primary features that Thrift does not currently support sufficiently:

1. Schema serialization allowing for compact representation of files containing large numbers of records of identical types
2. Dynamic interpretation of schemas, which improves ease-of-use in dynamic languages (like the Python Hadoop Streaming use case)
3. Lazy partial deserialization to support "projection"

Note that features 1 and 3 are independent of whether schemas are dynamically interpreted or compiled into static bindings.

WRT #1: Thrift's DenseProtocol goes some distance towards this although it doesn't go the whole way. Thrift can easily be extended to further compact the DenseProtocol's wire format for special cases where all fields are required. We have had significant discussions on the Thrift list about doing more in this area previously but we couldn't get folks from Hadoop who cared most about this use case to participate with us on capturing a complete set of requirements and so there was no strong driver for it.

WRT #2: I totally understand the case you make for dynamic interpretation in ad hoc data processing. I would love to see Thrift enhanced to do this kind of thing.

WRT #3: Partial deserialization seems like a really useful feature for several use cases, not just for "projection". I think Thrift could and should be extended to support this functionality, and it should be available for both static bindings and dynamic schema interpretation via field names and field IDs where possible.

"Perhaps Thrift could be augmented to support Avro's JSON schemas and serialization. Then it could interoperate with other Avro-based systems. But then Thrift would have yet another serialization format, that every language would need to implement for it to be useful..."

First, that "Perhaps" hides a lot of complexity and unless that is hashed out ahead of time I am pretty sure the real answer will be "Thrift cannot be augmented to support Avro directly but instead could be augmented to support something that looks quite a bit like Avro but differs in mostly unimportant ways." To me that seems like a shame.

Furthermore, you say that last part ("Thrift would have yet another serialization format...") like it is a bad thing... Note that it is an explicit design goal of Thrift to allow for multiple different serialization formats so that lots of different use cases can be supported by the same fundamental framework. This is a clear recognition that there is no one-size-fits-all answer for data serialization (fast RPC vs compact arc
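
Features 1 and 3 in the list above can be illustrated with a small sketch. It is a toy under stated assumptions, not Avro's or Thrift's actual file format or API: the line-oriented JSON layout and the names `write_container` and `read_container` are invented for the example. It shows the field names being stored once per file, with each record written as a bare row of values, and a reader that returns only the requested fields. Note that this reader still parses each whole row before selecting fields, so it illustrates the projection interface only, not lazy partial deserialization.

```python
import json


def write_container(schema, records):
    """Store the field names once, then each record as a bare list of values."""
    lines = [json.dumps({"fields": schema})]
    for record in records:
        lines.append(json.dumps([record[name] for name in schema]))
    return "\n".join(lines)


def read_container(text, wanted=None):
    """Read records back; if `wanted` is given, return only those fields."""
    lines = text.split("\n")
    schema = json.loads(lines[0])["fields"]
    wanted = schema if wanted is None else wanted
    indexes = [schema.index(name) for name in wanted]
    records = []
    for line in lines[1:]:
        values = json.loads(line)
        records.append({schema[i]: values[i] for i in indexes})
    return records
```

With many records of the same type, the per-record cost is just the values, since field names and types are not repeated in every row.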
-
Re: [PROPOSAL] new subproject: Avro
Ankur Goel 2009-04-13, 11:27

How fast do we expect the new serialization system to be when it replaces existing serialization mechanism in Hadoop RPC?

A clear description of the existing bottlenecks and the performance goals for this system would help developers interested in contributing.

-Ankur

-------- Original Message --------
Subject: [PROPOSAL] new subproject: Avro
Date: Thu, 02 Apr 2009 15:05:08 -0700
From: Doug Cutting <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]

I propose we add a new Hadoop subproject for Avro, a serialization system. My ambition is for Avro to replace both Hadoop's RPC and to be used for most Hadoop data files, e.g., by Pig, Hive, etc.

Initial committers would be Sharad Agarwal and me, both existing Hadoop committers. We are the sole authors of this software to date.

The code is currently at:

http://people.apache.org/~cutting/avro.git/

To learn more:

git clone http://people.apache.org/~cutting/avro.git/ avro
cat avro/README.txt

Comments? Questions?

Doug
-
Re: [PROPOSAL] new subproject: Avro
Doug Cutting 2009-04-13, 17:54

Ankur Goel wrote:
> How fast do we expect the new serialization system to be when it
> replaces existing serialization mechanism in Hadoop RPC?

I hope that Avro will make its first release this summer. Sometime soon after, I hope that we can start moving Hadoop Core's trunk RPC onto Avro. We may start developing an experimental version of Hadoop Core that uses Avro in a branch before Avro is released. This is all speculative, of course. Any detailed discussion of Hadoop Core's future belongs on the core-dev@ and of Avro's future on avro-dev@.

> A clear description of the existing bottlenecks and the performance
> goals for this system would help developers interested in
> contributing.

Adding Avro to Hadoop Core is not primarily about performance but rather about compatibility and security. Hadoop's existing RPC is not a performance bottleneck, nor is HDFS's data transfer protocol.

However, currently, Hadoop requires that clients and servers must run the exact same version of code, since the existing RPC is not tolerant of protocol changes. We'd like to change that, so that one can run older clients against newer servers and vice versa. Longer term, we'd also like to permit clients in languages other than Java. We intend Avro to provide a change-tolerant, cross-platform RPC solution.

We'd also like Hadoop to become more secure. Currently Hadoop uses three different communications mechanisms: RPC, HTTP (for shuffle) and a raw socket-based protocol for HDFS data transfers. It would be best not to have to re-implement security features for each of these. So we hope that we can make Avro perform well enough to replace not only Hadoop's RPC, but also HTTP in the shuffle and the HDFS data transfer protocol.

If you're interested in discussing Avro further, I encourage you to join the Avro mailing lists.

http://hadoop.apache.org/avro/mailing_lists.html

Doug
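
The change tolerance described above, older clients against newer servers and vice versa, can be sketched as follows. This is a simplified illustration of the general idea of resolving a sender's record against the receiver's expected fields by name. It is not Avro's actual API or resolution algorithm, and the `resolve` function, the `REQUIRED` sentinel, and the field names are hypothetical.

```python
REQUIRED = object()  # sentinel meaning "this field has no default"


def resolve(writer_record, reader_fields):
    """Adapt a record from the sender's version to the receiver's expected fields.

    reader_fields maps each expected field name to a default value, or to
    REQUIRED when the field must be present in the incoming record.
    """
    resolved = {}
    for name, default in reader_fields.items():
        if name in writer_record:
            resolved[name] = writer_record[name]   # both sides know this field
        elif default is not REQUIRED:
            resolved[name] = default               # sender predates this field
        else:
            raise ValueError(f"missing required field: {name}")
    # Fields the receiver does not know about are silently dropped.
    return resolved
```

For example, an older client that sends only `{"path": "/a"}` to a newer server expecting `{"path": REQUIRED, "replication": 3}` is resolved to `{"path": "/a", "replication": 3}`, while a newer client's extra fields are dropped by an older server.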
|