Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hadoop, mail # user - aggregation by time window


Copy link to this message
-
Re: aggregation by time window
Oleg Ruchovets 2013-01-28, 14:49
Well , much more clear , but still have a questions :-)

Suppose we have 3 map input records

event1 | 10:07
event2 | 10:10
event3 | 10:12

Output from map(event1 | 10:07) will be :

 mapOutput(10:04:event1)
 mapOutput(10:05:event1)
 mapOutput(10:06:event1)
 mapOutput(10:07:event1)
 mapOutput(10:08:event1)
 mapOutput(10:09:event1)
 mapOutput(10:10:event1)

Output  for map(event2 | 10:10) will be:

mapOutput(10:07:event2)
mapOutput(10:08:event2)
mapOutput(10:09:event2)
mapOutput(10:10:event2)
mapOutput(10:11:event2)
mapOutput(10:12:event2)
mapOutput(10:13:event2)

 Output for map (event3 |  10:12) will be:
mapOutput(10:09: event3)
mapOutput(10:10 : event3)
mapOutput(10:11 : event3)
mapOutput(10:12 : event3)
mapOutput(10:13 : event3)
mapOutput(10:14 : event3)
mapOutput(10:15 : event3)

 Is it correct?

If yes ,

in reducer phase I will get such inputs:

reducer(10:04:event1)
reducer(10:05:event1)
reducer(10:06:event1)
reducer(10:07:event1 ,event2)
reducer(10:08:event1 , event2)
reducer(10:09:event1 , event2 , event3)
reducer(10:10:event1 , event2 , event3)
reducer(10:11:event3)
reducer(10:12:event3)
reducer(10:13:event3)
reducer(10:14:event3)
reducer(10:15:event3)

Iterating over each reducer input how can I know at the end of aggregations
which events were during 7 minutes?

Thansk
Oleg.

On Mon, Jan 28, 2013 at 3:48 PM, Kai Voigt <[EMAIL PROTECTED]> wrote:

> Hi again,
>
> the idea is that you emit every event multiple times. So your map input
> record (event1, 10:07) will be emitted seven times during the map() call.
> Like I said, (10:04,event1), (10:05,event1), ..., (10:10,event1) will be
> the seven outputs for processing a single event.
>
> The output key will be the time stamps in which neighbourhood or interval
> each event should be joined with events that happened +/- 3 minutes near
> it. So events which happened within a 7 minutes distance will both be
> emitted with the same time stamp as the map() output, and thus meet in a
> reduce() call.
>
> A reduce() call will look like this: reduce(10:03, list_of_events). And
> those events had time stamps between 10:00 and 10:06 in the original input.
>
> Kai
>
> Am 28.01.2013 um 14:43 schrieb Oleg Ruchovets <[EMAIL PROTECTED]>:
>
> > Hi Kai.
> >    It is very interesting. Can you please explain in more details your
> > Idea?
> > What will be a key in a map phase?
> >
> > Suppose we have event at 10:07. How would you emit this to the multiple
> > buckets?
> >
> > Thanks
> > Oleg.
> >
> >
> > On Mon, Jan 28, 2013 at 3:17 PM, Kai Voigt <[EMAIL PROTECTED]> wrote:
> >
> >> Quick idea:
> >>
> >> since each of your events will go into several buckets, you could use
> >> map() to emit each item multiple times for each bucket.
> >>
> >> Am 28.01.2013 um 13:56 schrieb Oleg Ruchovets <[EMAIL PROTECTED]>:
> >>
> >>> Hi ,
> >>>   I have such row data structure:
> >>>
> >>> event_id  |   time
> >>> =============> >>> event1     |  10:07
> >>> event2     |  10:10
> >>> event3     |  10:12
> >>>
> >>> event4     |   10:20
> >>> event5     |   10:23
> >>> event6     |   10:25
> >>
> >> map(event1,10:07) would emit (10:04,event1), (10:05,event1), ...,
> >> (10:10,event1) and so on.
> >>
> >> In reduce(), all your desired events would meet for the same minute.
> >>
> >> Kai
> >>
> >> --
> >> Kai Voigt
> >> [EMAIL PROTECTED]
> >>
> >>
> >>
> >>
> >>
>
> --
> Kai Voigt
> [EMAIL PROTECTED]
>
>
>
>
>