Re: Transforming 1 event to n events
I put that comment there for a few reasons that I can recall off the top of
my head (I should have done a better job documenting this when I was
writing the code):

1. The max transaction size on the channel must currently be manually
balanced with (or made to exceed) the batchSize setting on batching sources
and sinks. If the number of events added or taken in a single transaction
exceeds this maximum size, an exception will be thrown. However, when
generating multiple events from a single event, it is no longer sufficient
to make the batchSize less than or equal to this value, so it becomes easy
to blow out your transaction size in an unpredictable way, causing
confusing errors.

2. An Event is what you might call the basic unit of "flow" in Flume. From
the perspective of management and monitoring, having the same number of
events enter and exit the system helps you know that your cluster is
healthy. OTOH, when you generate a variable number of events from a single
event in an Interceptor, it is really quite difficult to know how the data
is flowing.

3. Since the interceptor typically runs in an I/O worker thread or in the
only thread in a Source, doing any significant computation there will
likely affect the overall throughput of the system.
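
To make point 1 concrete, here is a toy model in plain Java. It is not Flume code: the class, its methods, and the capacity value of 4 are all hypothetical, chosen only to show how a batch that fits the channel's transaction limit before splitting can exceed it afterwards.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of point 1. Not Flume code; every name here is hypothetical. */
final class TransactionSizeSketch {

    /** Splits one comma-separated body into one entry per value. */
    static List<String> split(String body) {
        List<String> out = new ArrayList<>();
        for (String part : body.split(",")) {
            out.add(part.trim());
        }
        return out;
    }

    /** Counts the events one batch would put into a transaction after splitting. */
    static int eventsAfterSplit(List<String> batch) {
        int total = 0;
        for (String body : batch) {
            total += split(body).size();
        }
        return total;
    }

    /** True if that many events fit within the per-transaction limit. */
    static boolean fits(int eventCount, int transactionCapacity) {
        return eventCount <= transactionCapacity;
    }

    public static void main(String[] args) {
        int transactionCapacity = 4; // assumed channel limit, for illustration only
        List<String> batch = Arrays.asList("5,4,3,2,1", "9,7,5,5,6"); // batchSize = 2

        // Before splitting: 2 events against a limit of 4.
        System.out.println(fits(batch.size(), transactionCapacity)); // prints true
        // After splitting: 10 events against the same limit.
        System.out.println(fits(eventsAfterSplit(batch), transactionCapacity)); // prints false
    }
}
```

The batchSize of 2 looks safe against a limit of 4, yet the same batch yields 10 events once each body is split, and the resulting count depends entirely on the data.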

In my view, Interceptors as a generally applicable component are well
suited to do header "tagging", simple transformations, and filtering, but
they're not a good place to put batching/un-batching logic. Maybe the Exec
Source should have a line-parsing plugin interface to allow people to take
text lines and generate Events from them. I know this seems similar to the
Interceptor in the context of the data flow, but I believe you are just
trying to work around a limitation of the exec source, since it appears
you're describing a serialization issue.
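
As a rough sketch of what such a line-parsing plugin could look like: the LineParser interface and BatchedLineParser class below are hypothetical and are not part of Flume. They only illustrate turning one text line into several records before any Event is created, using the "timestamp - 5,4,3,2,1" format from the original question.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical plugin interface, invented for this sketch; not a Flume API. */
interface LineParser {
    /** Returns one record per logical entry found in the line. */
    List<String> parse(String line);
}

/** Parses lines shaped like "timestamp - 5,4,3,2,1" into one record per value. */
final class BatchedLineParser implements LineParser {
    private static final String SEPARATOR = " - ";

    @Override
    public List<String> parse(String line) {
        List<String> records = new ArrayList<>();
        int sep = line.indexOf(SEPARATOR);
        if (sep < 0) {
            // Not a batched line: pass it through as a single record.
            records.add(line);
            return records;
        }
        String timestamp = line.substring(0, sep);
        String values = line.substring(sep + SEPARATOR.length());
        for (String value : values.split(",")) {
            // Each value keeps the timestamp of the line it came from.
            records.add(timestamp + SEPARATOR + value.trim());
        }
        return records;
    }
}
```

Under these assumptions, a source would call parse() once per line and create one Event per returned record, so the fan-out happens before the events reach the channel.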

Alternatively, one could use an HBase serializer to generate multiple
increment / decrement operations, and just log the original line in HDFS
(or use an EventSerializer).


On Fri, Aug 10, 2012 at 5:15 PM, Patrick Wendell <[EMAIL PROTECTED]> wrote:

> to clarify - I mean I think it's within the scope of the design
> intentions. I agree that it is currently disallowed (at least in
> documentation).
> On Fri, Aug 10, 2012 at 5:14 PM, Patrick Wendell <[EMAIL PROTECTED]>
> wrote:
> > Hey Jeremy,
> >
> > That comment has been in the code now for some time, but I don't think
> > it is actually enforced anywhere programmatically. I think the idea was
> > just that if you are writing something which is capable of generating
> > new event data it should be in a source - though I'm also curious to
> > hear why this was put in there.
> >
> > IMHO, doing some type of event splitting seems within the scope of how
> > interceptors are used.
> >
> > - Patrick
> >
> > On Fri, Aug 10, 2012 at 11:07 AM, Jeremy Custenborder
> > <[EMAIL PROTECTED]> wrote:
> >> Hello All,
> >>
> >> I'm wondering if you could provide some guidance for me. One of the
> >> inputs I'm working with batches several entries to a single event.
> >> This is a lot simpler than my data but it provides an easy example.
> >> For example:
> >>
> >> timestamp - 5,4,3,2,1
> >> timestamp - 9,7,5,5,6
> >>
> >> If I tail the file this results in 2 events being generated. This
> >> example has the data for 10 events.
> >>
> >> Here is, at a high level, what I want to accomplish.
> >> (web server - agent 1)
> >> exec source tail -f /<some file path>
> >> collector-client to (agent 2)
> >>
> >> (collector - agent 2)
> >> collector-server
> >> Custom Interceptor (input 1 event, output n events)
> >> Multiplex to
> >> hdfs
> >> hbase
> >>
> >> An interceptor looked like the most logical spot for me to add this.
> >> Is there a better place to add this functionality? Has anyone run into
> >> a similar case?
> >>
> >> Looking at the docs for Interceptor.intercept(List<Event> events), it
> >> says "Output list of events. The size of output list MUST NOT BE
> >> GREATER than the size of the input list (i.e. transformation and