Transforming 1 event to n events
Jeremy Custenborder 2012-08-10, 18:07
Hello All,
I'm wondering if you could provide some guidance for me. One of the inputs I'm working with batches several entries to a single event. This is a lot simpler than my data but it provides an easy example. For example:
timestamp - 5,4,3,2,1
timestamp - 9,7,5,5,6
If I tail the file this results in 2 events being generated. This example has the data for 10 events.
Here is, at a high level, what I want to accomplish.
(web server - agent 1)
exec source tail -f /<some file path>
collector-client to (agent 2)

(collector - agent 2)
collector-server
Custom Interceptor (input 1 event, output n events)
Multiplex to
hdfs
hbase
An interceptor looked like the most logical spot for me to add this. Is there a better place to add this functionality? Has anyone run into a similar case?
Looking at the docs for Interceptor.intercept(List<Event> events), it says "Output list of events. The size of output list MUST NOT BE GREATER than the size of the input list (i.e. transformation and removal ONLY)." which tells me not to emit more events than given. intercept(Event event) only returns a single event, so I can't use it there either. Why is there a requirement to only return 1 for 1?
For now I'm implementing a custom source that will handle generating multiple events from the events coming in on the web server. My preference was to do this transformation on the collector agent before I hand off to hdfs and hbase. I know another alternative would be to implement custom RPC but I would prefer not to do that. I would prefer to rely on what is currently available.
Thanks! j
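For reference, the 1-to-n split being asked about can be sketched in plain Java with no Flume dependency. The class name BatchedLineSplitter and the "timestamp - v1,v2,..." layout are assumptions taken from the simplified example in the question, not from any Flume API, and the sketch says nothing about where the logic should live (source, interceptor, or serializer).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Standalone sketch: no Flume classes are used here.
// BatchedLineSplitter is a hypothetical name; the input layout is the
// simplified "timestamp - v1,v2,..." example from the question.
final class BatchedLineSplitter {

    private static final String SEPARATOR = " - ";

    // Turns "timestamp - 5,4,3,2,1" into one "timestamp - value" string per value.
    static List<String> split(String line) {
        int idx = line.indexOf(SEPARATOR);
        if (idx < 0) {
            // Not a batched line: pass it through unchanged.
            return Collections.singletonList(line);
        }
        String prefix = line.substring(0, idx);
        String[] values = line.substring(idx + SEPARATOR.length()).split(",");
        List<String> out = new ArrayList<>(values.length);
        for (String value : values) {
            out.add(prefix + SEPARATOR + value.trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints one record per line for the first sample input.
        for (String record : split("timestamp - 5,4,3,2,1")) {
            System.out.println(record);
        }
    }
}
```

Applied to the two sample lines, this yields 5 records each, i.e. the 10 records mentioned above.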
Re: Transforming 1 event to n events
Patrick Wendell 2012-08-11, 00:14
Hey Jeremy,
That comment has been in the code now for some time, but I don't think it is actually enforced anywhere programmatically. I think the idea was just that if you are writing something which is capable of generating new event data it should be in a source - though I'm also curious to hear why this was put in there.
IMHO, doing some type of event splitting seems within the scope of how interceptors are used.
- Patrick
Re: Transforming 1 event to n events
Patrick Wendell 2012-08-11, 00:15
To clarify - I mean I think it's within the scope of the design intentions. I agree that it is currently disallowed (at least in documentation).
Re: Transforming 1 event to n events
Mike Percy 2012-08-11, 01:51
I put that comment there for a few reasons that I can recall off the top of my head (I should have done a better job documenting this when I was writing the code):
1. The max transaction size on the channel must currently be manually balanced with (or made to exceed) the batchSize setting on batching sources and sinks. If the number of events added or taken in a single transaction exceeds this maximum size, an exception will be thrown. However, if generating multiple events from a single event, it is no longer sufficient to make the batchSize less than or equal to this value, and it would be easier to blow out your transaction size in a potentially unpredictable way, causing potentially confusing errors.
2. An Event is what you might call the basic unit of "flow" in Flume. From the perspective of management and monitoring, having the same number of events enter and exit the system helps you know that your cluster is healthy. OTOH, when you generate a variable number of events from a single event in an Interceptor, it is really quite difficult to know how the data is flowing.
3. Since the interceptor typically runs in an I/O worker thread or in the only thread in a Source, doing any significant computation there will likely affect the overall throughput of the system.
In my view, Interceptors as a generally applicable component are well suited to do header "tagging", simple transformations, and filtering, but they're not a good place to put batching/un-batching logic. Maybe the Exec Source should have a line-parsing plugin interface to allow people to take text lines and generate Events from them. I know this seems similar to the Interceptor in the context of the data flow, but I believe you are just trying to work around a limitation of the exec source, since it appears you're describing a serialization issue.
Alternatively, one could use an HBase serializer to generate multiple increment / decrement operations, and just log the original line in HDFS (or use an EventSerializer).
Regards, Mike
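To make point 1 concrete, here is a small arithmetic sketch. It does not call any Flume API. The names TransactionSizeCheck, maxFanOut and transactionCapacity, and the numbers used, are illustrative assumptions. It only shows why a batchSize that fits the channel's maximum transaction size stops being safe once each event can be split into several.

```java
// Standalone arithmetic sketch; it does not call any Flume API.
// All names and numbers here are illustrative assumptions.
final class TransactionSizeCheck {

    // Worst-case number of events put into the channel in one transaction
    // when every incoming event is split into at most maxFanOut events.
    static long worstCaseEvents(int batchSize, int maxFanOut) {
        return (long) batchSize * maxFanOut;
    }

    // True when the worst case still fits in one channel transaction.
    static boolean fitsInTransaction(int batchSize, int maxFanOut, int transactionCapacity) {
        return worstCaseEvents(batchSize, maxFanOut) <= transactionCapacity;
    }

    public static void main(String[] args) {
        int batchSize = 100;
        int transactionCapacity = 100;
        // One event in, one event out: the batch fits exactly.
        System.out.println(fitsInTransaction(batchSize, 1, transactionCapacity)); // true
        // Each event may become 5 events: up to 500 events in one transaction.
        System.out.println(fitsInTransaction(batchSize, 5, transactionCapacity)); // false
    }
}
```

With a variable fan-out the worst case depends on the data, which is the unpredictability described above.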
Re: Transforming 1 event to n events
Patrick Wendell 2012-08-11, 05:22
Hey Mike,
That context is super helpful. If it is a correctness problem to have interceptors returning more events than they receive, can I propose that we:
a) Add a check in InterceptorChain that verifies the interceptor isn't growing the number of events (better to throw an error here than somewhere down the line, where it will be harder to debug).
b) Explain in the javadoc briefly why it is a correctness issue.
c) Put a note of caution in the user or dev guide for those who want to build custom interceptors, explaining that they are solely for transformation and filtering, not event creation (this may exist, haven't looked closely).
I am happy to do these myself, but do you think this makes sense?
Two other people have asked me off list whether they can do this, so I think we need to be very clear that this is outside the specification for interceptors.
- Patrick
Re: Transforming 1 event to n events
Mike Percy 2012-08-12, 22:58
Hey Patrick, I pondered this a bit over the last day or so and I'm kind of lukewarm on adding preconditions checks at this time. The reason I didn't do it initially is that while I wanted a particular contract for that component, in order to make Interceptors viable to maintain and understand with the current design of the Flume core, I wasn't sure if it would be sufficient for all future use cases. So if someone wants to do something that breaks that contract, then they are "on their own", doing stuff that may break in future implementations. If they're willing to accept that risk then they have the freedom to maybe do something novel and awesome, which might prompt us to add a different kind of extension mechanism in the future to support whatever that use case is.
Regards, Mike
Re: Transforming 1 event to n events
Jeremy Custenborder 2012-08-13, 16:55
Mike / Patrick
Thanks for the replies. Sorry if this reply seems out of order and for the delay. I was not subscribed to the mailing list when I sent my earlier email. I just happened to stumble on your messages reading the list history this morning. I appreciate the answers.
Patrick - I agree with you. It made sense to put it in an interceptor. To me when I looked at an interceptor I thought of using it as a replacement for a decorator in the old version of flume.
> 1. The max transaction size on the channel must currently be manually balanced with (or made to exceed) the batchSize setting on batching sources and sinks. If the number of events added or taken in a single transaction exceeds this maximum size, an exception will be thrown. However, if generating multiple events from a single event, it is no longer sufficient to make the batchSize less or equal to this value, and it would be easier to blow out your transaction size in a potentially unpredictable way, causing potentially confusing errors.
This makes sense.
> 2. An Event is what you might call the basic unit of "flow" in Flume. From the perspective of management and monitoring, having the same number of events enter and exit the system helps you know that your cluster is healthy. OTOH, when you generate a variable number of events from a single event in an Interceptor, it is really quite difficult to know how the data is flowing.
This seems like a good method for monitoring.
> 3. Since the interceptor typically runs in an I/O worker thread or in the only thread in a Source, doing any significant computation there will likely affect the overall throughput of the system.
> In my view, Interceptors as a generally applicable component are well suited to do header "tagging", simple transformations, and filtering, but they're not a good place to put batching/un-batching logic. Maybe the Exec Source should have a line-parsing plugin interface to allow people to take text lines and generate Events from them. I know this seems similar to the Interceptor in the context of the data flow, but I believe you are just trying to work around a limitation of the exec source, since it appears you're describing a serialization issue.
> Alternatively, one could use an HBase serializer to generate multiple increment / decrement operations, and just log the original line in HDFS (or use an EventSerializer).
This is what I'm working towards. I want a 1 for 1 entry in hdfs but increment counters in hbase, so given the following input:
timestamp - 5,4,3,2,1
hive
timestamp - 5
timestamp - 4
timestamp - 3
timestamp - 2
timestamp - 1
hbase
timestamp - 5 increment
timestamp - 4 increment
timestamp - 3 increment
timestamp - 2 increment
timestamp - 1 increment
Given this I was just planning on emitting an event with the body I was going to use in hive early in the pipeline. Send the same data to hdfs and hbase. Then use a serializer on the hbase side to increment the counters. This would allow me to add data to hdfs in the format I'm planning on consuming it with, without managing two serializers. My plan for the hbase serializer was literally: generate a key, then increment per record based on the input. So only a couple of lines of code.
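The "generate key, increment per record" step might look roughly like the sketch below. It is plain Java with an in-memory map standing in for HBase counters. It does not use the Flume HBase sink or the HBase client, and the class name CounterSketch and the "timestamp:value" key layout are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Standalone sketch: an in-memory map stands in for HBase counters.
// CounterSketch and the "timestamp:value" key layout are hypothetical.
final class CounterSketch {

    private final Map<String, Long> counters = new LinkedHashMap<>();

    // Applies one increment per value found in a "timestamp - v1,v2,..." body.
    void apply(String body) {
        int idx = body.indexOf(" - ");
        if (idx < 0) {
            return; // nothing to count in this body
        }
        String timestamp = body.substring(0, idx);
        for (String value : body.substring(idx + 3).split(",")) {
            String key = timestamp + ":" + value.trim();
            counters.merge(key, 1L, Long::sum); // increment by one per record
        }
    }

    // Current counter values, keyed by the generated key.
    Map<String, Long> snapshot() {
        return counters;
    }

    public static void main(String[] args) {
        CounterSketch sketch = new CounterSketch();
        sketch.apply("timestamp - 9,7,5,5,6");
        System.out.println(sketch.snapshot());
    }
}
```

A value that appears twice on the same line (the two 5s in the second sample line) ends up as a single key incremented twice.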
> I pondered this a bit over the last day or so and I'm kind of lukewarm on adding preconditions checks at this time. The reason I didn't do it initially is that while I wanted a particular contract for that component, in order to make Interceptors viable to maintain and understand with the current design of the Flume core, I wasn't sure if it would be sufficient for all future use cases. So if someone wants to do something that breaks that contract, then they are "on their own", doing stuff that may break in future implementations. If they're willing to accept that risk then they have the freedom to maybe do something novel and awesome, which might prompt us to add a different kind of extension mechanism in the future to support whatever that use case is.
I think there should be an approved method for this case. A different extension that could perform processing like this could be helpful. To me when I looked at an interceptor I thought of using it as a replacement for a decorator in the old version of flume. We have a lot of code that will take a log entry and replace the body with a protocol buffer representation. I prefer to run this code on an upstream tier from the web server. Interceptors would work fine for the one in one out case.
Re: Transforming 1 event to n events
Mike Percy 2012-08-13, 18:55
Hi Jeremy,
On Mon, Aug 13, 2012 at 9:55 AM, Jeremy Custenborder <[EMAIL PROTECTED]> wrote:
> > I believe you are just trying to work around a limitation of the exec source, since it appears you're describing a serialization issue.
> > Alternatively, one could use an HBase serializer to generate multiple increment / decrement operations, and just log the original line in HDFS (or use an EventSerializer).
> The is what I'm working towards. I want a 1 for 1 entry in hdfs but increment counters in hbase
HBase serializer can generate multiple operations per Event, and the HDFS serializer could generate whatever output Hive expects as well.
> Given this I was just planning on emitting an event with the body I was going to use in hive early in the pipeline. Send the same data to hdfs and hbase. Then use a serializer on the hbase side to increment the counters. This would allow me to add data to hdfs in the format I'm planning on consuming it with without managing two serializers. My plans for the hbase serializer was literally generate key, increment per record based on the input. So only a couple lines of code.
Yeah, if you are doing much parsing in your serializers it's going to be a bit more complex.
> > I pondered this a bit over the last day or so and I'm kind of lukewarm on adding preconditions checks at this time. The reason I didn't do it initially is that while I wanted a particular contract for that component, in order to make Interceptors viable to maintain and understand with the current design of the Flume core, I wasn't sure if it would be sufficient for all future use cases. So if someone wants to do something that breaks that contract, then they are "on their own", doing stuff that may break in future implementations. If they're willing to accept that risk then they have the freedom to maybe do something novel and awesome, which might prompt us to add a different kind of extension mechanism in the future to support whatever that use case is.
> I think there should be an approved method for this case. A different extension that could perform processing like this could be helpful. To me when I looked at an interceptor I thought of using it as a replacement for a decorator in the old version of flume. We have a lot of code that will take a log entry and replace the body with a protocol buffer representation. I prefer to run this code on an upstream tier from the web server. Interceptors would work fine for the one in one out case.
Have you considered using an Interceptor or a custom source to generate a single event that has a series of timestamps within it? You could use protobufs for serialization of that data structure.
Since you have multiple timestamps / timings on the same log line, I wonder if it isn't a single "event" with multiple facets and this isn't just a semantics thing.
Regards, Mike
Re: Transforming 1 event to n events
Jeremy Custenborder 2012-08-13, 22:34
On Mon, Aug 13, 2012 at 1:55 PM, Mike Percy <[EMAIL PROTECTED]> wrote:
> Hi Jeremy,
> On Mon, Aug 13, 2012 at 9:55 AM, Jeremy Custenborder <[EMAIL PROTECTED]> wrote:
> > > I believe you are just trying to work around a limitation of the exec source, since it appears you're describing a serialization issue.
> > > Alternatively, one could use an HBase serializer to generate multiple increment / decrement operations, and just log the original line in HDFS (or use an EventSerializer).
> > The is what I'm working towards. I want a 1 for 1 entry in hdfs but increment counters in hbase
> HBase serializer can generate multiple operations per Event, and the HDFS serializer could generate whatever output Hive expects as well.
Yea
> > Given this I was just planning on emitting an event with the body I was going to use in hive early in the pipeline. Send the same data to hdfs and hbase. Then use a serializer on the hbase side to increment the counters. This would allow me to add data to hdfs in the format I'm planning on consuming it with without managing two serializers. My plans for the hbase serializer was literally generate key, increment per record based on the input. So only a couple lines of code.
> Yeah, if you are doing much parsing in your serializers it's going to be a bit more complex.
> > > I pondered this a bit over the last day or so and I'm kind of lukewarm on adding preconditions checks at this time. The reason I didn't do it initially is that while I wanted a particular contract for that component, in order to make Interceptors viable to maintain and understand with the current design of the Flume core, I wasn't sure if it would be sufficient for all future use cases. So if someone wants to do something that breaks that contract, then they are "on their own", doing stuff that may break in future implementations. If they're willing to accept that risk then they have the freedom to maybe do something novel and awesome, which might prompt us to add a different kind of extension mechanism in the future to support whatever that use case is.
> > I think there should be an approved method for this case. A different extension that could perform processing like this could be helpful. To me when I looked at an interceptor I thought of using it as a replacement for a decorator in the old version of flume. We have a lot of code that will take a log entry and replace the body with a protocol buffer representation. I prefer to run this code on an upstream tier from the web server. Interceptors would work fine for the one in one out case.
> Have you considered using an Interceptor or a custom source to generate a single event that has a series of timestamps within it? You could use protobufs for serialization of that data structure.
> Since you have multiple timestamps / timings on the same log line, I wonder if it isn't a single "event" with multiple facets and this isn't just a semantics thing.
I just used the multiple counters on a single line as an example. My use case is much more complex and I thought it wouldn't add much to the conversation. I need to have the multiple objects available to hive. The upstream object is actually a protobuf with hierarchy. I was planning on flattening the object for hive. Here is an example of what I'm collecting. The actual protobuf has many more fields, but this gives you an idea.
requestid
page
timestamp
useragent
impressions =[12345, 43212,12344,12345,43122, etc]
transforming for each impression.
requestid
page
timestamp
useragent
index
objectid
This gives me one row in hive per impression. This might be a little more contextual. I picked the earlier example because I didn't want to get caught up in my use case. I could move this code to serializers, but I need to do similar logic twice since I'm incrementing a counter in hbase per impression and adding a row per impression in hdfs (hive). If I transformed the event to multiple events earlier in the pipe, I would only have to write code to generate keys per event. At this point I'm going to implement two serializers. One to handle hdfs and one for hbase.
Thanks again for your responses! J
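The flattening described in this message (one row per impression) can be sketched as follows. This is plain Java, not the actual protobuf or serializer code. The class name ImpressionFlattener, the tab-separated row format, and the sample values in main are assumptions made only for illustration. The field order follows the list in the message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Standalone sketch: plain Java values stand in for the protobuf message.
// ImpressionFlattener and the tab-separated output are hypothetical choices.
final class ImpressionFlattener {

    // One output row per impression:
    // requestid, page, timestamp, useragent, index, objectid.
    static List<String> flatten(String requestId, String page, long timestamp,
                                String userAgent, List<Long> impressions) {
        List<String> rows = new ArrayList<>(impressions.size());
        for (int index = 0; index < impressions.size(); index++) {
            rows.add(String.join("\t",
                    requestId,
                    page,
                    Long.toString(timestamp),
                    userAgent,
                    Integer.toString(index),
                    Long.toString(impressions.get(index))));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample values are made up; only the impression ids come from the message.
        List<String> rows = flatten("req-1", "/home", 1344897240L, "ua",
                Arrays.asList(12345L, 43212L, 12344L));
        rows.forEach(System.out::println);
    }
}
```

The same per-impression loop is what would have to be repeated in both the hdfs and hbase serializers, which is the duplication mentioned above.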
Re: Transforming 1 event to n events
Mike Percy 2012-08-14, 01:59
Hi Jeremy,
Thanks for the extra color. It's an interesting flow. As more people continue to adopt Flume, I think we'll start to see patterns where the design or implementation of Flume is lacking and we can work towards bridging those gaps, and your use case provides valuable data on that. As for where we are now, I'm happy to hear that you have found a way forward.
If you can keep us apprised as things progress with your Flume deployment I would love to hear about it!
Regards, Mike
Re: Transforming 1 event to n events
Jeremy Custenborder 2012-08-14, 20:51
Hi Mike,
I think I'm still blocked on this, or I'll have to move the splitting of the data up to the source, which I know will work for sure. I've just been trying to avoid that because I didn't want to deploy this to all of the web servers.
I'm looking into the EventSerializer and I don't think it's going to work for me either. All of the examples I've seen so far write data to an output stream that seems to be the raw data file. It looks like append is only called once per event. This prevents me from writing multiple events as separate records in the SequenceFile on HDFS.
https://github.com/apache/flume/blob/trunk/flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSSequenceFile.java#L72
Am I off base here?
J
Re: Transforming 1 event to n events
Mike Percy 2012-08-15, 18:14
Jeremy,
I have not done much with the Sequence File support in the HDFS sink (in terms of usage or modification), although I know it is there. It has its own type of serialization API. I know that the EventSerializer using the DataStream fileType can handle writing arbitrary data, i.e. multiple records, etc., but that may not be possible with the Formatter API included in the Sequence File support.
At the risk of exposing my ignorance on this, and not having lots of extra cycles to investigate immediately, it may be worth taking a look at the patch recently submitted by Chris (see the thread I just replied to) to see if it meets your needs. If the existing Formatter API is not pluggable, then it may not be a backwards-compatibility risk to modify it to support creating multiple keys to handle this use case. Once it's exposed as an extension point and a release is made, of course, we cannot modify it without breaking backward compatibility.
Just a thought, and I don't know if all of those assumptions hold true, but it might be worth investigating.
Regards, Mike
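The idea that a serializer writing to a raw stream can emit several records for one event could look roughly like the sketch below. It deliberately does not implement Flume's EventSerializer interface; MultiRecordWriter is a hypothetical stand-in that only mimics one write call per event. The input format is assumed to be the "timestamp - 5,4,3,2,1" example from the first message, and the tab-separated output is an arbitrary choice.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Flume's EventSerializer: it only mimics the idea of
// one write call per event that emits several records to a raw stream.
final class MultiRecordWriter {
    private final OutputStream out;

    MultiRecordWriter(OutputStream out) {
        this.out = out;
    }

    // The body is assumed to look like "timestamp - 5,4,3,2,1".
    // One "timestamp<TAB>value" line is written per comma-separated value.
    void write(byte[] eventBody) {
        String body = new String(eventBody, StandardCharsets.UTF_8).trim();
        int sep = body.indexOf(" - ");
        try {
            if (sep < 0) {
                // Not a batched line: pass it through as a single record.
                out.write((body + "\n").getBytes(StandardCharsets.UTF_8));
                return;
            }
            String timestamp = body.substring(0, sep);
            for (String value : body.substring(sep + 3).split(",")) {
                String record = timestamp + "\t" + value.trim() + "\n";
                out.write(record.getBytes(StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The timestamp value is made up for illustration only.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        new MultiRecordWriter(sink)
                .write("1344622020 - 5,4,3,2,1".getBytes(StandardCharsets.UTF_8));
        System.out.print(new String(sink.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

This only covers the raw-stream case. Whether the same approach carries over to SequenceFile output depends on the Formatter API, which the thread leaves open.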