Flume, mail # dev - Transforming 1 event to n events


Jeremy Custenborder 2012-08-10, 18:07
Patrick Wendell 2012-08-11, 00:14
Patrick Wendell 2012-08-11, 00:15
Mike Percy 2012-08-11, 01:51

Re: Transforming 1 event to n events
Patrick Wendell 2012-08-11, 05:22
Hey Mike,

That context is super helpful. If it is a correctness problem to have
interceptors returning more events than they receive, can I propose
that we:

a) Add a check in InterceptorChain that verifies the interceptor isn't
increasing the number of events. It is better to throw an error here
than somewhere down the line, where it will be harder to debug.

b) Explain in the javadoc briefly why it is a correctness issue.

c) Put a note of caution in the user or dev guide for those who want
to build custom interceptors, explaining that they are solely for
transformation and filtering, not event creation (this may exist,
haven't looked closely).

I am happy to do these myself, but do you think this makes sense?

Two other people have asked me off list whether they can do this, so I
think we need to be very clear that this is outside the specification
for interceptors.
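
For illustration only, the check proposed in (a) might look roughly like
the sketch below. It is not Flume's actual InterceptorChain. To stay
self-contained it makes some assumptions: a String stands in for a Flume
Event, a UnaryOperator<List<String>> stands in for the batch form of an
interceptor, and the class name GuardedInterceptorChain is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.UnaryOperator;

/**
 * Sketch of a chain that rejects interceptors returning more events
 * than they received. String is a stand-in for a Flume Event and
 * UnaryOperator<List<String>> is a stand-in for an interceptor; both
 * are assumptions made to keep the example free of dependencies.
 */
final class GuardedInterceptorChain {
    private final List<UnaryOperator<List<String>>> interceptors;

    GuardedInterceptorChain(List<UnaryOperator<List<String>>> interceptors) {
        this.interceptors = new ArrayList<>(interceptors);
    }

    List<String> intercept(List<String> events) {
        List<String> current = events;
        for (UnaryOperator<List<String>> interceptor : interceptors) {
            // Record the size first, in case the interceptor mutates the list in place.
            int sizeBefore = current.size();
            current = interceptor.apply(current);
            if (current.size() > sizeBefore) {
                // Fail at the point of the violation rather than later in the channel.
                throw new IllegalStateException(
                        "Interceptor returned " + current.size() + " events for a batch of "
                        + sizeBefore + "; interceptors may transform or filter, not create events");
            }
        }
        return current;
    }

    public static void main(String[] args) {
        // Filtering shrinks the batch, so it passes the check.
        UnaryOperator<List<String>> dropEmpty = batch -> {
            List<String> out = new ArrayList<>();
            for (String e : batch) {
                if (!e.isEmpty()) {
                    out.add(e);
                }
            }
            return out;
        };
        // Splitting one event into several grows the batch, so it is rejected.
        UnaryOperator<List<String>> splitOnComma = batch -> {
            List<String> out = new ArrayList<>();
            for (String e : batch) {
                out.addAll(Arrays.asList(e.split(",")));
            }
            return out;
        };

        GuardedInterceptorChain filtering =
                new GuardedInterceptorChain(Collections.singletonList(dropEmpty));
        System.out.println(filtering.intercept(Arrays.asList("a", "", "b")));

        GuardedInterceptorChain splitting =
                new GuardedInterceptorChain(Collections.singletonList(splitOnComma));
        try {
            splitting.intercept(Arrays.asList("a,b"));
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The size is compared per interceptor rather than once for the whole
chain, so the error names the step that grew the batch even if a later
filter would have shrunk it again.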

- Patrick

On Fri, Aug 10, 2012 at 6:51 PM, Mike Percy <[EMAIL PROTECTED]> wrote:
> I put that comment there for a few reasons that I can recall off the top of
> my head (I should have done a better job documenting this when I was
> writing the code):
>
> 1. The max transaction size on the channel must currently be manually
> balanced with (or made to exceed) the batchSize setting on batching sources
> and sinks. If the number of events added or taken in a single transaction
> exceeds this maximum size, an exception will be thrown. However, if
> generating multiple events from a single event, it is no longer sufficient
> to make the batchSize less than or equal to this value, and it would be easier
> to blow out your transaction size in a potentially unpredictable way,
> causing potentially confusing errors.
>
> 2. An Event is what you might call the basic unit of "flow" in Flume. From
> the perspective of management and monitoring, having the same number of
> events enter and exit the system helps you know that your cluster is
> healthy. OTOH, when you generate a variable number of events from a single
> event in an Interceptor, it is really quite difficult to know how the data
> is flowing.
>
> 3. Since the interceptor typically runs in an I/O worker thread or in the
> only thread in a Source, doing any significant computation there will
> likely affect the overall throughput of the system.
>
> In my view, Interceptors as a generally applicable component are well
> suited to do header "tagging", simple transformations, and filtering, but
> they're not a good place to put batching/un-batching logic. Maybe the Exec
> Source should have a line-parsing plugin interface to allow people to take
> text lines and generate Events from them. I know this seems similar to the
> Interceptor in the context of the data flow, but I believe you are just
> trying to work around a limitation of the exec source, since it appears
> you're describing a serialization issue.
>
> Alternatively, one could use an HBase serializer to generate multiple
> increment / decrement operations, and just log the original line in HDFS
> (or use an EventSerializer).
>
> Regards,
> Mike
>
> On Fri, Aug 10, 2012 at 5:15 PM, Patrick Wendell <[EMAIL PROTECTED]> wrote:
>
>> to clarify - I mean I think it's within the scope of the design
>> intentions. I agree that it is currently disallowed (at least in
>> documentation).
>>
>> On Fri, Aug 10, 2012 at 5:14 PM, Patrick Wendell <[EMAIL PROTECTED]>
>> wrote:
>> > Hey Jeremy,
>> >
>> > That comment has been in the code now for some time, but I don't think
>> > it is actually enforced anywhere programmatically. I think the idea was
>> > just that if you are writing something which is capable of generating
>> > new event data it should be in a source - though I'm also curious to
>> > hear why this was put in there.
>> >
>> > IMHO, doing some type of event splitting seems within the scope of how
>> > interceptors are used.
>> >
>> > - Patrick
>> >
>> > On Fri, Aug 10, 2012 at 11:07 AM, Jeremy Custenborder
>> > <[EMAIL PROTECTED]> wrote:
>> >> Hello All,
>> >>
>> >> I'm wondering if you could provide some guidance for me. One of the

Mike Percy 2012-08-12, 22:58
Jeremy Custenborder 2012-08-13, 16:55
Mike Percy 2012-08-13, 18:55
Jeremy Custenborder 2012-08-13, 22:34
Mike Percy 2012-08-14, 01:59
Jeremy Custenborder 2012-08-14, 20:51
Mike Percy 2012-08-15, 18:14