Flume, mail # dev - Transforming 1 event to n events

Re: Transforming 1 event to n events
Mike Percy 2012-08-15, 18:14
Jeremy, I have not done much with the SequenceFile support in the HDFS sink
(either using it or modifying it), although I know it is there. It has
its own type of serialization API. I know that the EventSerializer using
the DataStream fileType can handle writing arbitrary data, i.e. multiple
records per event, but that may not be possible with the Formatter API
included in the SequenceFile support.

At the risk of exposing my ignorance on this, and not having lots of extra
cycles to investigate immediately, it may be worth taking a look at the
patch recently submitted by Chris (see the thread I just replied to) to see
if it meets your needs. If the existing Formatter API is not pluggable, then
it may not be a backwards-compatibility risk to modify it to support
creating multiple keys to handle this use case. Once it's exposed as an
extension point and a release is made, of course, we cannot modify it
without breaking backcompat. Just a thought. I don't know if all of
those assumptions hold true, but it might be worth investigating.
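
To make the "multiple records per event" idea concrete, here is a minimal sketch of a stream-based writer. It is only an illustration under assumptions: MultiRecordWriter is a hypothetical class that does not implement Flume's EventSerializer interface, and it assumes the event body is a UTF-8, comma-separated list rather than the real payload. The point is just that a single write call with access to the raw output stream can emit as many records as it likes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a stream-based serializer.
// It does NOT implement Flume's EventSerializer; it only mirrors the idea.
final class MultiRecordWriter {
    private final OutputStream out;

    MultiRecordWriter(OutputStream out) {
        this.out = out;
    }

    // Called once per event, but writes one newline-terminated record
    // per comma-separated item in the body. Returns the record count.
    int write(byte[] eventBody) {
        String body = new String(eventBody, StandardCharsets.UTF_8);
        int written = 0;
        try {
            for (String item : body.split(",")) {
                String record = item.trim();
                if (record.isEmpty()) {
                    continue; // skip blank items
                }
                out.write(record.getBytes(StandardCharsets.UTF_8));
                out.write('\n');
                written++;
            }
        } catch (IOException e) {
            // Wrapped so the sketch stays easy to call; real code may prefer IOException.
            throw new UncheckedIOException(e);
        }
        return written;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        byte[] body = "12345, 43212,12344".getBytes(StandardCharsets.UTF_8);
        int n = new MultiRecordWriter(sink).write(body);
        System.out.print(new String(sink.toByteArray(), StandardCharsets.UTF_8));
        System.out.println(n + " records");
    }
}
```

Whether the SequenceFile path can do the same depends on whether its Formatter can return more than one key/value pair per event, which is the open question above.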


On Tue, Aug 14, 2012 at 1:51 PM, Jeremy Custenborder <[EMAIL PROTECTED]> wrote:

> Hi Mike,
> I think I'm still blocked on this, or I'll have to move the splitting
> of the data up to the source, which I know will work for sure. I've
> just been trying to avoid that because I didn't want to deploy this to
> all of the web servers.
> I'm looking into the EventSerializer and I don't think it's going to
> work for me either. All of the examples I've seen so far write data to
> an output stream that seems to be the raw data file. It looks like
> append is only called once per event, which prevents me from writing
> multiple events as separate records in the SequenceFile on HDFS.
> https://github.com/apache/flume/blob/trunk/flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSSequenceFile.java#L72
> Am I off base here?
> J
> On Mon, Aug 13, 2012 at 8:59 PM, Mike Percy <[EMAIL PROTECTED]> wrote:
> > On Mon, Aug 13, 2012 at 3:34 PM, Jeremy Custenborder <
> > [EMAIL PROTECTED]> wrote:
> >
> >> I need to have the multiple objects available to
> >> hive. The upstream object is actually a protobuf with hierarchy. I was
> >> planning on flattening the object for hive. Here is an example of what
> >> I'm collecting. The actual protobuf has many more fields, but this
> >> gives you an idea.
> >>
> >> requestid
> >> page
> >> timestamp
> >> useragent
> >> impressions = [12345, 43212, 12344, 12345, 43122, etc.]
> >>
> >> Transforming this into one record per impression gives:
> >>
> >> requestid
> >> page
> >> timestamp
> >> useragent
> >> index
> >> objectid
> >>
> >> This gives me one row in Hive per impression. This might be a little
> >> more contextual. I picked the earlier example because I didn't want to
> >> get caught up in my use case. I could move this code to serializers,
> >> but I would need to do similar logic twice, since I'm incrementing a
> >> counter in HBase per impression and adding a row per impression in
> >> HDFS (Hive). If I transformed the event into multiple events earlier in
> >> the pipeline, I would only have to write code to generate keys per
> >> event. At this point I'm going to implement two serializers: one to
> >> handle HDFS and one for HBase.
> >>
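
The flattening described in the quoted message can be sketched as follows. This is an illustration only: ImpressionFlattener is a hypothetical helper, the tab-separated output is an assumed row format, and the real input is a protobuf with many more fields. It shows one request being expanded into one row per impression, with the position in the list used as the index column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Flume: expands one request
// into one tab-separated row per impression.
final class ImpressionFlattener {

    // Column order: requestid, page, timestamp, useragent, index, objectid
    static List<String> flatten(String requestId, String page, long timestamp,
                                String userAgent, List<Long> impressions) {
        List<String> rows = new ArrayList<>();
        for (int index = 0; index < impressions.size(); index++) {
            rows.add(String.join("\t",
                    requestId,
                    page,
                    Long.toString(timestamp),
                    userAgent,
                    Integer.toString(index),
                    Long.toString(impressions.get(index))));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample values are made up for the demonstration.
        List<String> rows = flatten("req-1", "/home", 1344979860L, "Mozilla/5.0",
                Arrays.asList(12345L, 43212L, 12344L));
        for (String row : rows) {
            System.out.println(row);
        }
    }
}
```

Keeping the expansion in one shared method like this would let the HDFS and HBase serializers reuse the same per-impression logic instead of duplicating it.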
> >
> > Hi Jeremy,
> >
> > Thanks for the extra color. It's an interesting flow. As more people
> > continue to adopt Flume, I think we'll start to see patterns where the
> > design or implementation of Flume is lacking and we can work towards
> > bridging those gaps, and your use case provides valuable data on that. As
> > for where we are now, I'm happy to hear that you have found a way
> > forward.
> >
> > If you can keep us apprised as things progress with your Flume
> > deployment I would love to hear about it!
> >
> > Regards,
> > Mike