The Flume project is actually working on what you might call "first-class"
Avro support right now, but you can already use Avro with Flume today, and
there are people doing so in production with success.
First of all, I assume that you want to store binary-encoded Avro in each
event. As mentioned previously in this thread, this implies that the schema
needs to come from somewhere. Right now, with the released version of Flume
(1.3.1), you would want to write your own EventSerializer <
http://flume.apache.org/FlumeUserGuide.html#event-serializers> for each
schema you need to write to HDFS. There is a base class you can subclass
that makes it easier to serialize Avro at that level.
There is a bunch of new development underway to make this a lot easier to
do:
1. Something to parse Avro container files and send them to Flume.
2. A generic event serializer that keys off a hash in the event header to
determine the schema: https://issues.apache.org/jira/browse/FLUME-2010
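The idea behind FLUME-2010 can be sketched in a few lines. This is only an
illustration, not Flume's actual API: the header key "avro.schema.hash", the
in-memory registry dict, and the use of SHA-256 (rather than one of Avro's
own fingerprint algorithms) are all assumptions made for the sketch.

```python
import hashlib
import json

def schema_fingerprint(schema_json: str) -> str:
    """Fingerprint a schema by hashing its JSON text. SHA-256 is used here
    as a stand-in; Avro defines its own fingerprint algorithms."""
    # Parse and re-dump so whitespace differences don't change the hash.
    canonical = json.dumps(json.loads(schema_json), sort_keys=True,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# A toy schema registry: fingerprint -> schema JSON. In a real deployment
# this lookup table would be shared between producers and the serializer.
USER_SCHEMA = ('{"type": "record", "name": "User", '
               '"fields": [{"name": "name", "type": "string"}]}')
REGISTRY = {schema_fingerprint(USER_SCHEMA): USER_SCHEMA}

def make_event(body: bytes, schema_json: str) -> dict:
    """Producer side: attach the schema hash as a Flume event header.
    The header key "avro.schema.hash" is made up for this sketch."""
    return {"headers": {"avro.schema.hash": schema_fingerprint(schema_json)},
            "body": body}

def resolve_schema(event: dict) -> str:
    """Serializer side: recover the writer schema from the event header."""
    return REGISTRY[event["headers"]["avro.schema.hash"]]
```

The serializer only ever receives a short hash per event and resolves the
full schema once, from a shared table, before writing to HDFS.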
Regarding Ruby support, we recently added support for Thrift RPC, so you
can now send messages to Flume via Ruby and other non-JVM languages. We
don't have out-of-the-box client APIs for those yet, but we would be happy
to accept patches for them :)
Feel free to reach out to [EMAIL PROTECTED] or [EMAIL PROTECTED] if
you'd like more information or want to help get these features finalized.
On Tue, May 28, 2013 at 3:38 PM, Mark <[EMAIL PROTECTED]> wrote:
> Thanks for all of the information.
> I actually looked into Kafka quite some time ago and I think we passed on
> it because it didn't have much ruby support (That may have changed by now).
> On May 27, 2013, at 12:34 PM, Martin Kleppmann <[EMAIL PROTECTED]>
> wrote:
> On 27 May 2013 20:00, Stefan Krawczyk <[EMAIL PROTECTED]> wrote:
>> So it's up to you what you stick into the body of that Avro event. It
>> could just be json, or it could be your own serialized Avro event - and as
>> far as I understand serialized Avro always has the schema with it (right?).
> In an Avro data file, yes, because you just need to specify the schema
> once, followed by (say) a million records that all use the same schema. And
> in an RPC context, you can negotiate the schema once per connection. But
> when using a message broker, you're serializing individual records and
> don't have an end-to-end connection with the consumer, so you'd need to
> include the schema with every single message.
> It probably doesn't make sense to include the full schema with every one,
> as a typical schema might be 2 kB whereas a serialized record might be less
> than 100 bytes (numbers obviously vary wildly by application), so the
> schema size would dominate. Hence my suggestion of including a schema
> version number or hash with every message.
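The suggestion above (ship a compact schema identifier with each message
instead of the full schema) can be sketched like this. The 8-byte SHA-256
prefix stands in for a real fingerprint scheme such as Avro's 64-bit Rabin
fingerprint, and the framing format is made up for illustration:

```python
import hashlib

def fingerprint(schema_json: str) -> bytes:
    # First 8 bytes of SHA-256, standing in for a 64-bit schema fingerprint.
    return hashlib.sha256(schema_json.encode("utf-8")).digest()[:8]

def frame(schema_json: str, payload: bytes) -> bytes:
    """Producer side: 8-byte schema fingerprint, then the encoded record.
    This adds ~8 bytes per message instead of a ~2 kB schema."""
    return fingerprint(schema_json) + payload

def unframe(message: bytes) -> tuple[bytes, bytes]:
    """Consumer side: split off the fingerprint, then look the writer
    schema up in a shared registry before decoding the payload."""
    return message[:8], message[8:]
```

With ~100-byte records, the fingerprint costs under 10% overhead, versus
the 20x inflation of embedding a 2 kB schema in every message.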
>> Be aware that Flume doesn't have great support for languages outside of
>> the JVM.
> The same caveat unfortunately applies to Kafka too. There are clients
> for non-JVM languages, but they lack important features, so I would
> recommend using the official JVM client (if your application is non-JVM you
> could simply pipe your application's stdout into the Kafka producer, or
> vice versa on the consumer side).
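The stdout-piping approach described above can be wrapped in a few lines of
Python. This is a sketch only: the producer script name and the
--broker-list/--topic flags come from Kafka's console tools of that era and
may differ by version.

```python
import subprocess

# Command for Kafka's console producer (ships with the Kafka distribution).
# Flag names vary across Kafka versions; treat these as illustrative.
PRODUCER_CMD = ["kafka-console-producer.sh",
                "--broker-list", "localhost:9092",
                "--topic", "events"]

def send_lines(lines, cmd=PRODUCER_CMD):
    """Pipe one message per line into the producer's stdin, just as
    `myapp | kafka-console-producer.sh ...` would in a shell."""
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)
    for line in lines:
        proc.stdin.write((line + "\n").encode("utf-8"))
    proc.stdin.close()
    return proc.wait()
```

The Ruby application never links against a Kafka client; it just writes
newline-delimited messages, and the JVM producer handles the protocol.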