Re: Transforming 1 event to n events
Hey Patrick,
I pondered this a bit over the last day or so and I'm kind of lukewarm on
adding precondition checks at this time. The reason I didn't do it
initially is that, while I wanted a particular contract for that component
in order to keep Interceptors maintainable and understandable within the
current design of the Flume core, I wasn't sure that contract would be
sufficient for all future use cases. So if someone wants to do something
that breaks
that contract, then they are "on their own", doing stuff that may break in
future implementations. If they're willing to accept that risk then they
have the freedom to maybe do something novel and awesome, which might
prompt us to add a different kind of extension mechanism in the future to
support whatever that use case is.

Regards,
Mike

On Fri, Aug 10, 2012 at 10:22 PM, Patrick Wendell <[EMAIL PROTECTED]> wrote:

> Hey Mike,
>
> That context is super helpful. If it is a correctness problem to have
> interceptors returning more events than they receive, can I propose
> that we:
>
> a) Add a check in InterceptorChain that verifies the interceptor isn't
> growing the number of events (better to throw an error here than
> somewhere down the line, where it will be harder to debug); a rough
> sketch of such a check follows after this list.
>
> b) Explain in the javadoc briefly why it is a correctness issue.
>
> c) Put a note of caution in the user or dev guide for those who want
> to build custom interceptors, explaining that they are solely for
> transformation and filtering, not event creation (this may exist,
> haven't looked closely).
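>
> For illustration, here is a rough sketch of the check in (a), written as
> a wrapper around a chain of interceptors (the class name and message
> wording are just illustrative, not a final patch):
>
>   import java.util.Collections;
>   import java.util.List;
>
>   import org.apache.flume.Event;
>   import org.apache.flume.interceptor.Interceptor;
>
>   // Illustrative only: enforces the "no new events" contract by
>   // comparing batch sizes before and after each interceptor runs.
>   public class CheckedInterceptorChain implements Interceptor {
>
>     private final List<Interceptor> interceptors;
>
>     public CheckedInterceptorChain(List<Interceptor> interceptors) {
>       this.interceptors = interceptors;
>     }
>
>     @Override
>     public void initialize() {
>       for (Interceptor i : interceptors) {
>         i.initialize();
>       }
>     }
>
>     @Override
>     public Event intercept(Event event) {
>       List<Event> out = intercept(Collections.singletonList(event));
>       return out.isEmpty() ? null : out.get(0);
>     }
>
>     @Override
>     public List<Event> intercept(List<Event> events) {
>       List<Event> out = events;
>       for (Interceptor interceptor : interceptors) {
>         int sizeBefore = out.size();
>         out = interceptor.intercept(out);
>         if (out == null) {
>           return Collections.emptyList();
>         }
>         if (out.size() > sizeBefore) {
>           // Fail fast: growing the batch breaks the contract.
>           throw new IllegalStateException("Interceptor "
>               + interceptor.getClass().getName() + " returned "
>               + out.size() + " events for " + sizeBefore + " inputs");
>         }
>       }
>       return out;
>     }
>
>     @Override
>     public void close() {
>       for (Interceptor i : interceptors) {
>         i.close();
>       }
>     }
>   }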
>
> I am happy to do these myself, but do you think this makes sense?
>
> Two other people have asked me off list whether they can do this, so I
> think we need to be very clear that this is outside the specification
> for interceptors.
>
> - Patrick
>
> On Fri, Aug 10, 2012 at 6:51 PM, Mike Percy <[EMAIL PROTECTED]> wrote:
> > I put that comment there for a few reasons that I can recall off the
> > top of my head (I should have done a better job documenting this when
> > I was writing the code):
> >
> > 1. The max transaction size on the channel must currently be manually
> > balanced with (or made to exceed) the batchSize setting on batching
> > sources and sinks; see the example config after this list. If the
> > number of events added or taken in a single transaction exceeds this
> > maximum size, an exception will be thrown. However, if generating
> > multiple events from a single event, it is no longer sufficient to
> > make the batchSize less than or equal to this value, and it becomes
> > easy to blow out your transaction size in an unpredictable way,
> > causing confusing errors.
> >
> > 2. An Event is what you might call the basic unit of "flow" in Flume.
> > From the perspective of management and monitoring, having the same
> > number of events enter and exit the system helps you know that your
> > cluster is healthy. OTOH, when you generate a variable number of
> > events from a single event in an Interceptor, it is really quite
> > difficult to know how the data is flowing.
> >
> > 3. Since the interceptor typically runs in an I/O worker thread or in the
> > only thread in a Source, doing any significant computation there will
> > likely affect the overall throughput of the system.
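> >
> > To make the balancing in point 1 concrete, something like this (agent
> > and component names are made up; the rule is that batchSize must stay
> > less than or equal to transactionCapacity):
> >
> >   agent.channels.ch1.type = memory
> >   agent.channels.ch1.capacity = 10000
> >   # a put or take of more than 1000 events in one transaction fails
> >   agent.channels.ch1.transactionCapacity = 1000
> >
> >   agent.sinks.k1.type = hdfs
> >   # fine only while nothing upstream multiplies events
> >   agent.sinks.k1.hdfs.batchSize = 1000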
> >
> > In my view, Interceptors as a generally applicable component are well
> > suited to do header "tagging", simple transformations, and filtering,
> > but they're not a good place to put batching/un-batching logic. Maybe
> > the Exec Source should have a line-parsing plugin interface to allow
> > people to take text lines and generate Events from them. I know this
> > seems similar to the Interceptor in the context of the data flow, but
> > I believe you are just trying to work around a limitation of the exec
> > source, since it appears you're describing a serialization issue.
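> >
> > To illustrate the kind of thing I mean by tagging and filtering, here
> > is a minimal sketch (the header name and filter rule are invented, and
> > a real interceptor would also need a nested Interceptor.Builder for
> > configuration):
> >
> >   import java.util.ArrayList;
> >   import java.util.List;
> >
> >   import org.apache.flume.Event;
> >   import org.apache.flume.interceptor.Interceptor;
> >
> >   // Illustrative: tags each event with a header and drops empty
> >   // events. It may return fewer events than it received, never more.
> >   public class TagAndFilterInterceptor implements Interceptor {
> >
> >     @Override
> >     public void initialize() {}
> >
> >     @Override
> >     public Event intercept(Event event) {
> >       byte[] body = event.getBody();
> >       if (body == null || body.length == 0) {
> >         return null; // filter: drop empty events
> >       }
> >       event.getHeaders().put("env", "prod"); // header "tagging"
> >       return event;
> >     }
> >
> >     @Override
> >     public List<Event> intercept(List<Event> events) {
> >       List<Event> out = new ArrayList<Event>(events.size());
> >       for (Event e : events) {
> >         Event intercepted = intercept(e);
> >         if (intercepted != null) {
> >           out.add(intercepted);
> >         }
> >       }
> >       return out;
> >     }
> >
> >     @Override
> >     public void close() {}
> >   }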
> >
> > Alternatively, one could use an HBase serializer to generate multiple
> > increment / decrement operations, and just log the original line in HDFS
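> >
> > As a rough sketch of that serializer idea (assuming the
> > org.apache.flume.sink.hbase.HbaseEventSerializer interface; the row
> > key, qualifier, and parsing are invented for the example):
> >
> >   import java.nio.charset.Charset;
> >   import java.util.ArrayList;
> >   import java.util.List;
> >
> >   import org.apache.flume.Context;
> >   import org.apache.flume.Event;
> >   import org.apache.flume.conf.ComponentConfiguration;
> >   import org.apache.flume.sink.hbase.HbaseEventSerializer;
> >   import org.apache.hadoop.hbase.client.Increment;
> >   import org.apache.hadoop.hbase.client.Row;
> >   import org.apache.hadoop.hbase.util.Bytes;
> >
> >   // Illustrative: one incoming line fans out into several HBase
> >   // increments, one per whitespace-separated token in the body.
> >   public class CountingHbaseSerializer implements HbaseEventSerializer {
> >
> >     private byte[] columnFamily;
> >     private byte[] body;
> >
> >     @Override
> >     public void configure(Context context) {}
> >
> >     @Override
> >     public void configure(ComponentConfiguration conf) {}
> >
> >     @Override
> >     public void initialize(Event event, byte[] columnFamily) {
> >       this.body = event.getBody();
> >       this.columnFamily = columnFamily;
> >     }
> >
> >     @Override
> >     public List<Row> getActions() {
> >       return new ArrayList<Row>(); // no puts; increments only
> >     }
> >
> >     @Override
> >     public List<Increment> getIncrements() {
> >       List<Increment> incs = new ArrayList<Increment>();
> >       String line = new String(body, Charset.forName("UTF-8"));
> >       for (String token : line.split("\\s+")) {
> >         Increment inc = new Increment(Bytes.toBytes(token));
> >         inc.addColumn(columnFamily, Bytes.toBytes("count"), 1L);
> >         incs.add(inc);
> >       }
> >       return incs;
> >     }
> >
> >     @Override
> >     public void close() {}
> >   }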