Douglas Creager 2010-10-14, 01:45
Quick question about the C library. It seems like a lot of the code implements some scaffolding that we could get for free from a library like glib. Looking through the svn history, it looks like you've already taken out some dependencies on apr and apr-util, so I'm guessing you're trying to limit external dependencies, right? Is that just a licensing question? Or is it to simplify the build process?
cheers –doug
Bruce Mitchener 2010-10-14, 02:01
Hi Doug,
I didn't write the existing C library, but I've used it and done some work on it. I'm currently writing my own more minimal and more streamlined implementation of Avro in C ...
The issues with glib specifically would be:
- The license is not acceptable for use here. (LGPL)
- It is much bigger than what is needed here.
- Many of the things that make it more general would also make it slower than necessary. The existing C code isn't a speed demon either, but the C implementation should aim for solid performance.
- Bruce
Douglas Creager 2010-10-14, 02:43
> I didn't write the existing C library, but I've used it and done some work on it. I'm currently writing my own more minimal and more streamlined implementation of Avro in C ...
>
> The issues with glib specifically would be:
>
> - The license is not acceptable for use here. (LGPL)
> - It is much bigger than what is needed here.
> - Many of the things that make it more general would also make it slower than necessary. The existing C code isn't a speed demon either, but the C implementation should aim for solid performance.
Ha! Well, you're certainly right that glib's not small. Are you sure about the speed claims, though? Would it be worth banging out an LGPL, glib-based prototype to do some initial tests?
Along those lines, you mention a new C implementation you're working on. Is that something that you plan to fold back into the main libavro? Or will it be separate? The spec provides a good basis for defining how well different implementations interoperate, but so far it seems like everything has been folded into the single, Apache-sponsored project. Is there interest in having independent implementations?
–doug
Bruce Mitchener 2010-10-14, 02:56
On Thu, Oct 14, 2010 at 9:43 AM, Douglas Creager <[EMAIL PROTECTED]>wrote:
> > I didn't write the existing C library, but I've used it and done some work on it. I'm currently writing my own more minimal and more streamlined implementation of Avro in C ...
> >
> > The issues with glib specifically would be:
> >
> > - The license is not acceptable for use here. (LGPL)
> > - It is much bigger than what is needed here.
> > - Many of the things that make it more general would also make it slower than necessary. The existing C code isn't a speed demon either, but the C implementation should aim for solid performance.
>
> Ha! Well you're certainly right that glib's not small. Are you sure about the speed claims, though? Would it be worth banging out a LGPL, glib-based prototype to do some initial tests?
Not to me. :) I'm assuming that you mean something that uses GValue and so on?
I don't want the overhead of that sort of thing at all in my C code. I'm supporting resource constrained platforms, so I just want to go from my C struct straight to a buffer without building an intermediate data structure.
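To make "from my C struct straight to a buffer" concrete, here is a minimal sketch. It is not Bruce's code and not the libavro API: `struct sensor_reading`, `put_long`, `put_string` and `encode_sensor_reading` are hypothetical names made up for illustration. The only thing taken from the Avro specification is the binary encoding itself: ints and longs are zig-zag varints, strings are a long length followed by the bytes, and a record is its fields concatenated in schema order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical application struct; it knows nothing about Avro. */
struct sensor_reading {
    int64_t timestamp;
    int32_t port;
    const char *host;
};

/* Avro int/long: zig-zag encode, then emit 7 bits per byte, low bits first.
 * Writes at most 10 bytes and returns the number written. */
static size_t put_long(uint8_t *buf, int64_t v)
{
    uint64_t n = ((uint64_t)v << 1) ^ (v < 0 ? ~(uint64_t)0 : 0);
    size_t i = 0;
    while (n & ~(uint64_t)0x7F) {
        buf[i++] = (uint8_t)((n & 0x7F) | 0x80);
        n >>= 7;
    }
    buf[i++] = (uint8_t)n;
    return i;
}

/* Avro string: byte length as a long, then the raw bytes. */
static size_t put_string(uint8_t *buf, const char *s)
{
    size_t len = strlen(s);
    size_t n = put_long(buf, (int64_t)len);
    memcpy(buf + n, s, len);
    return n + len;
}

/* Encode one record directly from the struct, with no intermediate tree.
 * Returns the number of bytes written, or 0 if buf might be too small. */
size_t encode_sensor_reading(uint8_t *buf, size_t cap,
                             const struct sensor_reading *r)
{
    assert(r != NULL && r->host != NULL);
    /* Worst case: 10 bytes for each of the three varints plus the string. */
    size_t need = 3 * 10 + strlen(r->host);
    if (cap < need)
        return 0;
    size_t n = 0;
    n += put_long(buf + n, r->timestamp);
    n += put_long(buf + n, r->port);
    n += put_string(buf + n, r->host);
    return n;
}
```

The capacity check is deliberately conservative (worst-case varint sizes) so the encoder never has to test bounds per byte; a tighter implementation could compute the exact size first.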
> Along those lines, you mention a new C implementation you're working on. Is that something that you plan to fold back into the main libavro? Or will it be separate? The spec provides a good basis for defining how well different implementations interoperate, but so far it seems like everything has been folded into the single, Apache-sponsored project. Is there interest in having independent implementations?
I'm not sure what will happen with my implementation yet. I'm inclined to say that it'll be opened up (Apache 2 licensed) but it will depend on the quality of the code with respect to use outside of my product and other factors. More likely is that what I'm doing is going to serve as a test bed for ideas and an implementation approach that can be merged into the Apache Avro C implementation in the future.
As part of my own implementation of Avro in C, I'm also working on a binary RPC protocol for talking with Cloudera Flume, so I have a bit more motivation to get it opened up ...
- Bruce
Douglas Creager 2010-10-14, 03:49
> Not to me. :) I'm assuming that you mean something that uses GValue and so on?
Ah, whoops. No, I'm not suggesting GValue. *shudder*
I was thinking more like using:
• GObject for the schema/datum subclassing
• GHashTable or GTree to store a record schema's fields, etc.
• GIO for the generic I/O interfaces
• GQuark instead of the atom implementation that was checked in and then reverted
> I don't want the overhead of that sort of thing at all in my C code. I'm supporting resource constrained platforms, so I just want to go from my C struct straight to a buffer without building an intermediate data structure.
We're in violent agreement. One thing I've started experimenting with is a “streaming” API, so that instead of creating a tree of avro_datum_t instances, the file reader calls a series of callback functions as each bit of data is encountered. We're generating Avro files from an existing C network sensor application, and it's a bit of overhead (in both code and speed) to have to move between our actual data types and the avro_datum_t instances.
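A rough sketch of what such a callback-driven reader could look like follows. Everything here is an assumption for illustration, not the API Doug was experimenting with and not anything in libavro: `stream_callbacks`, `stream_record`, `field_type` and the `reading_*` consumer are hypothetical names, and the "schema" is reduced to a flat list of long and string fields. The point is only that the decoder hands each value to the application as it is read, so no tree of datum objects is ever built.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, heavily simplified schema: a flat list of field types. */
enum field_type { FIELD_LONG, FIELD_STRING };

/* Callbacks invoked as each value is decoded. */
struct stream_callbacks {
    void (*on_long)(void *user, size_t field, int64_t value);
    void (*on_string)(void *user, size_t field,
                      const uint8_t *bytes, size_t len);
};

/* Decode one zig-zag varint (Avro int/long). Returns 0 on success. */
static int get_long(const uint8_t **p, const uint8_t *end, int64_t *out)
{
    uint64_t n = 0;
    int shift = 0;
    while (*p < end && shift < 64) {
        uint8_t b = *(*p)++;
        n |= (uint64_t)(b & 0x7F) << shift;
        if (!(b & 0x80)) {
            *out = (int64_t)(n >> 1) ^ -(int64_t)(n & 1);
            return 0;
        }
        shift += 7;
    }
    return -1; /* truncated or over-long varint */
}

/* Walk one binary-encoded record and fire a callback per field. */
int stream_record(const uint8_t *buf, size_t len,
                  const enum field_type *fields, size_t nfields,
                  const struct stream_callbacks *cb, void *user)
{
    assert(cb != NULL);
    const uint8_t *p = buf, *end = buf + len;
    for (size_t i = 0; i < nfields; i++) {
        int64_t v;
        if (get_long(&p, end, &v) != 0)
            return -1;
        if (fields[i] == FIELD_LONG) {
            cb->on_long(user, i, v);
        } else {
            /* For strings, v is the byte length that follows. */
            if (v < 0 || (uint64_t)v > (uint64_t)(end - p))
                return -1;
            cb->on_string(user, i, p, (size_t)v);
            p += v;
        }
    }
    return 0;
}

/* Example consumer: fills the application's own struct directly. */
struct reading {
    int64_t timestamp;
    char host[16];
};

static void reading_on_long(void *user, size_t field, int64_t value)
{
    struct reading *r = user;
    if (field == 0)
        r->timestamp = value;
}

static void reading_on_string(void *user, size_t field,
                              const uint8_t *bytes, size_t len)
{
    struct reading *r = user;
    if (field != 1)
        return;
    if (len >= sizeof r->host)
        len = sizeof r->host - 1; /* truncate rather than overflow */
    memcpy(r->host, bytes, len);
    r->host[len] = '\0';
}

const struct stream_callbacks reading_callbacks = {
    reading_on_long, reading_on_string
};
```

A caller would pass `&reading_callbacks` and a pointer to its own `struct reading` as `user`; the string callback receives a pointer into the input buffer, so the only copy is the one the application chooses to make.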
–doug
Bruce Mitchener 2010-10-14, 03:58
On Thu, Oct 14, 2010 at 10:49 AM, Douglas Creager <[EMAIL PROTECTED]>wrote:
> > Not to me. :) I'm assuming that you mean something that uses GValue and so on?
>
> Ah, whoops. No, I'm not suggesting GValue. *shudder*
*whew*
> I was thinking more like using:
>
> • GObject for the schema/datum subclassing
> • GHashTable or GTree to store a record schema's fields, etc.
> • GIO for the generic I/O interfaces
> • GQuark instead of the atom implementation that was checked in and then reverted
Okay, I see ... but that can't happen within the Apache implementation due to licensing issues. (It also doesn't work for my usages because it isn't clear that LGPL code can be shipped at all legally on some of my target platforms.)
> > I don't want the overhead of that sort of thing at all in my C code. I'm supporting resource constrained platforms, so I just want to go from my C struct straight to a buffer without building an intermediate data structure.
>
> We're in violent agreement. One thing I've started experimenting with is a “streaming” API, so that instead of creating a tree of avro_datum_t instances, the file reader calls a series of callback functions as each bit of data is encountered. We're generating Avro files from an existing C network sensor application, and it's a bit of overhead (in both code and speed) to have to move between our actual data types and the avro_datum_t instances.
Okay, then we're talking about similar things. But you can also just generate code and then you don't need schemas or anything else at runtime, no?
What I'm doing is just a low level API that I can use from generated code. I don't need (or want) schemas or anything else in the way.
Maybe we should talk more off-list.
- Bruce
Matt Massie 2010-10-14, 15:59
Please continue this discussion on the list since that's what it's for. I think it would be great if we could add support for generated code to avro-c. I've been itching lately to do some C programming. Cloudera is having a Hackathon in about a week so maybe I could dedicate some cycles then to help.
-- Matt
Douglas Creager 2010-10-15, 22:02
> Please continue this discussion on the list since that's what it's for.
Will do.
> I think it would be great if we could add support for generated code to avro-c. I've been itching lately to do some C programming. Cloudera is having a Hackathon in about a week so maybe I could dedicate some cycles then to help.
Generated code certainly sounds useful, but I don't know if it will help my particular problem. In my case, I'm adding Avro support to an existing application, which already has quite a few custom C structs that it's aggregating data into. With the current implementation, I have to copy this data into a tree of avro_datum_t instances before writing the data out to an Avro file. Codegen would probably make that a bit easier, but there would still be a set of (now automatically generated) Avro-specific structs that I'd have to copy into. What I'm looking for / working on is a different approach, where I provide a set of callbacks that tell the Avro file writer how to extract the correct values directly out of my pre-existing, non-Avro-specific struct. My hope is that this will be (a) just as easy to code, and (b) faster, especially when multiplied by tens of millions of rows.
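As a rough illustration of the writer-side idea, the sketch below describes each Avro field by a callback that pulls the value out of an existing struct, and the writer encodes whatever the callbacks return. This is an assumed design, not Doug's implementation: `struct flow_stats`, `long_getter`, `write_row` and the getters are hypothetical names, and only long-typed fields are handled to keep it short. The zig-zag varint encoding is the one the Avro specification defines for ints and longs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pre-existing application struct (hypothetical); not Avro-specific. */
struct flow_stats {
    uint32_t packets;
    uint64_t bytes;
};

/* One extractor per Avro field: reads a long out of the caller's row. */
typedef int64_t (*long_getter)(const void *row);

static int64_t get_packets(const void *row)
{
    return ((const struct flow_stats *)row)->packets;
}

static int64_t get_bytes(const void *row)
{
    return (int64_t)((const struct flow_stats *)row)->bytes;
}

/* Field order here must match the field order in the Avro record schema. */
const long_getter flow_stats_getters[] = { get_packets, get_bytes };

/* Avro int/long: zig-zag varint, at most 10 bytes. */
static size_t put_long(uint8_t *buf, int64_t v)
{
    uint64_t n = ((uint64_t)v << 1) ^ (v < 0 ? ~(uint64_t)0 : 0);
    size_t i = 0;
    while (n & ~(uint64_t)0x7F) {
        buf[i++] = (uint8_t)((n & 0x7F) | 0x80);
        n >>= 7;
    }
    buf[i++] = (uint8_t)n;
    return i;
}

/* Encode one row by asking each getter for its value in turn.
 * Returns bytes written, or 0 if the buffer might be too small. */
size_t write_row(uint8_t *buf, size_t cap,
                 const long_getter *getters, size_t ngetters,
                 const void *row)
{
    assert(getters != NULL && row != NULL);
    size_t used = 0;
    for (size_t i = 0; i < ngetters; i++) {
        if (cap - used < 10)
            return 0;
        used += put_long(buf + used, getters[i](row));
    }
    return used;
}
```

The cost per field is one indirect call and no copy into an Avro-specific struct; whether that actually beats generated code over tens of millions of rows is something only measurement would show.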
–doug