FWIW: I would go with Kafka if you can; it's far more flexible. We aren't
using it ourselves until it can authenticate producers and consumers and
encrypt transport, since we run in the cloud...
Anyway, so we're using Flume. With the current out-of-the-box
implementation, Flume wraps your data in its own Avro event, so it's up to
you what you stick into the body of that event. It could just be JSON, or
it could be your own serialized Avro record - and as far as I understand
serialized Avro always has the schema with it (right?).
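On the "(right?)": as far as I know, only Avro's object container file
format embeds the schema; a single binary-encoded datum does not carry it.
That is why tagging each message with a schema version or hash, as Martin
suggests in the quoted message below, is useful. Here is a minimal Python
sketch of the JSON-body option with a schema hash in the event headers. The
schema, the header name "schema.fingerprint", and the helper names are all
made up for illustration, and the hashing is a simplification (sorted-key
JSON), not Avro's own canonical form. It only builds the headers/body pair;
it does not talk to Flume.

```python
import hashlib
import json

# Hypothetical log schema; field names are illustrative only.
LOG_SCHEMA = {
    "type": "record",
    "name": "LogEvent",
    "fields": [
        {"name": "level", "type": "string", "default": "INFO"},
        {"name": "message", "type": "string", "default": ""},
    ],
}


def schema_fingerprint(schema):
    """Hash a schema so producers and consumers can refer to it by ID.

    Assumption: sorted-key, whitespace-free JSON is used as a stand-in for
    a real canonical form, so equal schemas always hash the same.
    """
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_event(record, schema):
    """Return (headers, body): string headers plus a bytes body."""
    headers = {"schema.fingerprint": schema_fingerprint(schema)}
    body = json.dumps(record).encode("utf-8")
    return headers, body


# A tiny in-memory "repository" of schemas, keyed by fingerprint.
SCHEMAS = {schema_fingerprint(LOG_SCHEMA): LOG_SCHEMA}


def lookup_schema(headers):
    """Resolve the schema a consumer should use for this event."""
    return SCHEMAS[headers["schema.fingerprint"]]


headers, body = make_event(
    {"level": "WARN", "message": "disk almost full"}, LOG_SCHEMA
)
```

A consumer would call lookup_schema(headers) before decoding the body, so
the full schema never has to travel with each message.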
Be aware that Flume doesn't have great support for languages outside of the
JVM. Flume's Avro source, which you talk to via Avro RPC, uses
NettyServer/NettyTransceiver underneath, and as far as I know there have
been no updates to the other Avro RPC libraries (e.g. Python, Ruby) that
would let them talk to such an endpoint. So you either have to build a
client that speaks that protocol, or create your own source.
On Mon, May 27, 2013 at 11:08 AM, Russell Jurney
> What's more, there are examples and support for Kafka, but not so much for
> On Mon, May 27, 2013 at 6:25 AM, Martin Kleppmann <[EMAIL PROTECTED]> wrote:
>> I don't have experience with Flume, so I can't comment on that. At
>> LinkedIn we ship logs around by sending Avro-encoded messages to Kafka (
>> http://kafka.apache.org/). Kafka is nice, it scales very well and gives
>> a great deal of flexibility — logs can be consumed by any number of
>> independent consumers, consumers can catch up on a backlog if they're
>> disconnected for a while, and it comes with Hadoop import out of the box.
>> (RabbitMQ is designed more for use cases where each message corresponds to
>> a task that needs to be performed by a worker. IMHO Kafka is a better fit
>> for logs, which are more stream-like.)
>> With any message broker, you'll need to somehow tag each message with the
>> schema that was used to encode it. You could include the full schema with
>> every message, but unless you have very large messages, that would be a
>> huge overhead. Better to give each version of your schema a sequential
>> version number, or hash the schema, and include the version number/hash in
>> each message. You can then keep a repository of schemas for resolving those
>> version numbers or hashes – simply in files that you distribute to all
>> producers/consumers, or in a simple REST service like
>> Hope that helps,
>> On 26 May 2013 17:39, Mark <[EMAIL PROTECTED]> wrote:
>>> Yes our central server would be Hadoop.
>>> Exactly how would this work with flume? Would I write Avro to a file
>>> source which flume would then ship over to one of our collectors or is
>>> there a better/native way? Would I have to include the schema in each
>>> event? FYI we would be doing this primarily from a rails application.
>>> Does anyone ever use Avro with a message bus like RabbitMQ?
>>> On May 23, 2013, at 9:16 PM, Sean Busbey <[EMAIL PROTECTED]> wrote:
>>> Yep. Avro would be great at that (provided your central consumer is Avro
>>> friendly, like a Hadoop system). Make sure that all of your schemas have
>>> default values defined for fields so that schema evolution will be easier
>>> in the future.
>>> On Thu, May 23, 2013 at 4:29 PM, Mark <[EMAIL PROTECTED]> wrote:
>>>> We're thinking about generating logs and events with Avro and shipping
>>>> them to a central collector service via Flume. Is this a valid use case?