Flume, mail # dev - Transforming 1 event to n events

Re: Transforming 1 event to n events
Mike Percy 2012-08-13, 18:55
Hi Jeremy,

On Mon, Aug 13, 2012 at 9:55 AM, Jeremy Custenborder <
> > I believe you are just
> > trying to work around a limitation of the exec source, since it appears
> > you're describing a serialization issue.
> > Alternatively, one could use an HBase serializer to generate multiple
> > increment / decrement operations, and just log the original line in HDFS
> > (or use an EventSerializer).
> This is what I'm working towards. I want a 1 for 1 entry in hdfs but
> increment counters in hbase.

The HBase serializer can generate multiple operations per Event, and the
HDFS serializer could generate whatever output Hive expects as well.
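
As a rough illustration of the one-Event-to-many-operations idea, here is a
minimal stdlib-only sketch. It deliberately does not use the Flume or HBase
APIs: CounterIncrement, OneToManySketch, and the assumed body layout
("<row key> <counter>=<amount> ...") are all hypothetical names and formats
chosen for the example, not anything defined in this thread.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one HBase increment; not a Flume or HBase class.
final class CounterIncrement {
    final String rowKey;
    final String column;
    final long amount;

    CounterIncrement(String rowKey, String column, long amount) {
        this.rowKey = rowKey;
        this.column = column;
        this.amount = amount;
    }
}

final class OneToManySketch {
    // Turns one event body into zero or more increments.
    // Assumed (hypothetical) body layout: "<row key> <counter>=<amount> ..."
    static List<CounterIncrement> toIncrements(byte[] body) {
        List<CounterIncrement> out = new ArrayList<>();
        String line = new String(body, StandardCharsets.UTF_8).trim();
        String[] fields = line.split("\\s+");
        if (fields[0].isEmpty()) {
            return out; // empty body: nothing to increment
        }
        String rowKey = fields[0];
        for (int i = 1; i < fields.length; i++) {
            int eq = fields[i].indexOf('=');
            if (eq <= 0) {
                continue; // skip fields that are not name=value
            }
            try {
                long amount = Long.parseLong(fields[i].substring(eq + 1));
                out.add(new CounterIncrement(rowKey, fields[i].substring(0, eq), amount));
            } catch (NumberFormatException e) {
                // skip fields whose value is not a whole number
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] body = "/home hits=1 bytes=512".getBytes(StandardCharsets.UTF_8);
        for (CounterIncrement inc : toIncrements(body)) {
            System.out.println(inc.rowKey + " " + inc.column + " += " + inc.amount);
        }
    }
}
```

The original line stays untouched in the event body, so the same event can
still be written 1 for 1 on the HDFS side while the HBase side fans it out.
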
> Given this I was just planning on emitting an event with the body I
> was going to use in hive early in the pipeline. Send the same data to
> hdfs and hbase. Then use a serializer on the hbase side to increment
> the counters. This would allow me to add data to hdfs in the format
> I'm planning on consuming it in, without managing two serializers. My
> plan for the hbase serializer was literally: generate a key, then
> increment per record based on the input. So only a couple lines of code.

Yeah, if you are doing much parsing in your serializers it's going to be a
bit more complex.

> > I pondered this a bit over the last day or so and I'm kind of lukewarm on
> > adding preconditions checks at this time. The reason I didn't do it
> > initially is that while I wanted a particular contract for that component,
> > in order to make Interceptors viable to maintain and understand with the
> > current design of the Flume core, I wasn't sure if it would be sufficient
> > for all future use cases. So if someone wants to do something that breaks
> > that contract, then they are "on their own", doing stuff that may break in
> > future implementations. If they're willing to accept that risk then they
> > have the freedom to maybe do something novel and awesome, which might
> > prompt us to add a different kind of extension mechanism in the future to
> > support whatever that use case is.
> I think there should be an approved method for this case. A different
> extension that could perform processing like this could be helpful. To
> me when I looked at an interceptor I thought of using it as a
> replacement for a decorator in the old version of flume. We have a lot
> of code that will take a log entry and replace the body with a
> protocol buffer representation. I prefer to run this code on an
> upstream tier from the web server. Interceptors would work fine for
> the one-in, one-out case.

Have you considered using an Interceptor or a custom source to generate a
single event that has a series of timestamps within it? You could use
protobufs for serialization of that data structure.
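
To make the "one event, several timestamps" idea concrete, here is a small
sketch of packing named timings into a single event body and reading them
back. It is only an illustration: it uses java.io.DataOutputStream as a
stand-in for protobufs so that it runs with the JDK alone, and the class name
MultiTimingBody and the timing names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: one event body that carries several named timestamps.
final class MultiTimingBody {
    // Layout (an assumption for this sketch): int count, then (UTF name, long millis) pairs.
    static byte[] encode(Map<String, Long> timings) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(timings.size());
            for (Map.Entry<String, Long> e : timings.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeLong(e.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the timings back in the order they were written.
    static Map<String, Long> decode(byte[] body) {
        Map<String, Long> timings = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(body))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String name = in.readUTF();
                timings.put(name, in.readLong());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return timings;
    }

    public static void main(String[] args) {
        Map<String, Long> timings = new LinkedHashMap<>();
        timings.put("request_start", 1344880500000L);
        timings.put("response_end", 1344880500123L);
        System.out.println(decode(encode(timings)));
    }
}
```

With a body like this, a downstream serializer can unpack the timings and emit
as many rows or increments as it needs, while the pipeline itself still moves
exactly one event per log line.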

Since you have multiple timestamps / timings on the same log line, I wonder
whether it is really a single "event" with multiple facets, and whether this
is mostly a question of semantics.