Hadoop >> mail # user >> aggregation by time window


Oleg Ruchovets 2013-01-28, 12:56
Kai Voigt 2013-01-28, 13:17
Oleg Ruchovets 2013-01-28, 13:43
Kai Voigt 2013-01-28, 13:48
Re: aggregation by time window
Well, that is much clearer, but I still have a question :-)

Suppose we have 3 map input records

event1 | 10:07
event2 | 10:10
event3 | 10:12

Output from map(event1 | 10:07) will be:

mapOutput(10:04:event1)
mapOutput(10:05:event1)
mapOutput(10:06:event1)
mapOutput(10:07:event1)
mapOutput(10:08:event1)
mapOutput(10:09:event1)
mapOutput(10:10:event1)

Output from map(event2 | 10:10) will be:

mapOutput(10:07:event2)
mapOutput(10:08:event2)
mapOutput(10:09:event2)
mapOutput(10:10:event2)
mapOutput(10:11:event2)
mapOutput(10:12:event2)
mapOutput(10:13:event2)

Output from map(event3 | 10:12) will be:

mapOutput(10:09:event3)
mapOutput(10:10:event3)
mapOutput(10:11:event3)
mapOutput(10:12:event3)
mapOutput(10:13:event3)
mapOutput(10:14:event3)
mapOutput(10:15:event3)

Is it correct?
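
One way to check the listings above is to sketch the emit step in plain Python. This is not the Hadoop API, only a simulation of what map() would output; the names `map_event` and `HALF_WINDOW_MINUTES` are made up for illustration, and the +/- 3 minute width is taken from Kai's description.

```python
from datetime import datetime, timedelta

# Assumption: +/- 3 minutes around each event, i.e. seven keys per event.
HALF_WINDOW_MINUTES = 3


def map_event(event_id, time_str):
    """Return one (minute_key, event_id) pair per minute around the event.

    Times are plain "HH:MM" strings; wrap-around at midnight is not handled.
    """
    t = datetime.strptime(time_str, "%H:%M")
    pairs = []
    for offset in range(-HALF_WINDOW_MINUTES, HALF_WINDOW_MINUTES + 1):
        key = (t + timedelta(minutes=offset)).strftime("%H:%M")
        pairs.append((key, event_id))
    return pairs


if __name__ == "__main__":
    for key, event in map_event("event1", "10:07"):
        print(f"mapOutput({key}:{event})")
```

Calling `map_event` for each of the three records gives the keys each event would be emitted under.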

If yes, then in the reduce phase I will get these inputs:

reducer(10:04:event1)
reducer(10:05:event1)
reducer(10:06:event1)
reducer(10:07:event1, event2)
reducer(10:08:event1, event2)
reducer(10:09:event1, event2, event3)
reducer(10:10:event1, event2, event3)
reducer(10:11:event2, event3)
reducer(10:12:event2, event3)
reducer(10:13:event2, event3)
reducer(10:14:event3)
reducer(10:15:event3)
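
The grouping by key can be simulated as well, which makes it easy to check the reducer inputs key by key. Again this is a plain Python sketch under assumptions, not Hadoop code: `shuffle` is a hypothetical stand-in for what the framework does between map and reduce, and `map_event` repeats the +/- 3 minute emit step so the block runs on its own.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def map_event(event_id, time_str, half_window=3):
    """Emit (minute_key, event_id) for every minute within +/- half_window."""
    t = datetime.strptime(time_str, "%H:%M")
    return [((t + timedelta(minutes=d)).strftime("%H:%M"), event_id)
            for d in range(-half_window, half_window + 1)]


def shuffle(records):
    """Group all map outputs by key, as the framework would before reduce()."""
    grouped = defaultdict(list)
    for event_id, time_str in records:
        for key, value in map_event(event_id, time_str):
            grouped[key].append(value)
    # Sort by key so the result reads in time order.
    return dict(sorted(grouped.items()))


records = [("event1", "10:07"), ("event2", "10:10"), ("event3", "10:12")]
reducer_inputs = shuffle(records)
for key, events in reducer_inputs.items():
    print(f"reducer({key}:{', '.join(events)})")
```

Since event2 is emitted for every minute from 10:07 through 10:13, it shows up under each of those keys.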

While iterating over each reducer input, how can I tell at the end of the
aggregation which events occurred within the same 7-minute window?
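
A possible reading of Kai's description, sketched in plain Python: the reduce key itself identifies the window, so reduce(10:09, ...) covers 10:06 through 10:12. The sketch assumes that map() emits the original time stamp together with the event id as the value, which is my assumption and not something stated in the thread; `window_bounds` and `reduce_window` are hypothetical names.

```python
from datetime import datetime, timedelta

# Assumption: +/- 3 minutes around the key, as in the map step.
HALF_WINDOW = timedelta(minutes=3)


def window_bounds(key):
    """Return the (start, end) "HH:MM" strings of the window centred on key."""
    center = datetime.strptime(key, "%H:%M")
    return ((center - HALF_WINDOW).strftime("%H:%M"),
            (center + HALF_WINDOW).strftime("%H:%M"))


def reduce_window(key, values):
    """values: (event_id, original_time) pairs that arrived under this key."""
    start, end = window_bounds(key)
    events = sorted(values, key=lambda v: v[1])  # order by original time
    # Sanity check: map() only emits keys within +/- 3 minutes of an event,
    # so every value must lie inside the window (same-day "HH:MM" strings).
    assert all(start <= t <= end for _, t in events)
    return {"window": (start, end), "events": [e for e, _ in events]}


result = reduce_window(
    "10:09", [("event3", "10:12"), ("event1", "10:07"), ("event2", "10:10")]
)
print(result)
```

With this layout each reduce() call reports one 7-minute window and the events inside it; overlapping windows will list the same event more than once, which follows from emitting each event seven times.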

Thanks
Oleg.

On Mon, Jan 28, 2013 at 3:48 PM, Kai Voigt <[EMAIL PROTECTED]> wrote:

> Hi again,
>
> the idea is that you emit every event multiple times. So your map input
> record (event1, 10:07) will be emitted seven times during the map() call.
> Like I said, (10:04,event1), (10:05,event1), ..., (10:10,event1) will be
> the seven outputs for processing a single event.
>
> The output key is the time stamp of the interval in which each event
> should be joined with the events that happened within +/- 3 minutes of
> it. So events that happened within the same 7-minute interval will both
> be emitted with the same time stamp as the map() output key, and thus
> meet in a reduce() call.
>
> A reduce() call will look like this: reduce(10:03, list_of_events). And
> those events had time stamps between 10:00 and 10:06 in the original input.
>
> Kai
>
> Am 28.01.2013 um 14:43 schrieb Oleg Ruchovets <[EMAIL PROTECTED]>:
>
> > Hi Kai.
> >    It is very interesting. Can you please explain your idea in more
> > detail?
> > What will be the key in the map phase?
> >
> > Suppose we have an event at 10:07. How would you emit it to multiple
> > buckets?
> >
> > Thanks
> > Oleg.
> >
> >
> > On Mon, Jan 28, 2013 at 3:17 PM, Kai Voigt <[EMAIL PROTECTED]> wrote:
> >
> >> Quick idea:
> >>
> >> since each of your events will go into several buckets, you could use
> >> map() to emit each item multiple times for each bucket.
> >>
> >> Am 28.01.2013 um 13:56 schrieb Oleg Ruchovets <[EMAIL PROTECTED]>:
> >>
> >>> Hi ,
> >>>   I have the following row structure:
> >>>
> >>> event_id  |   time
> >>> ===================
> >>> event1     |  10:07
> >>> event2     |  10:10
> >>> event3     |  10:12
> >>>
> >>> event4     |   10:20
> >>> event5     |   10:23
> >>> event6     |   10:25
> >>
> >> map(event1,10:07) would emit (10:04,event1), (10:05,event1), ...,
> >> (10:10,event1) and so on.
> >>
> >> In reduce(), all your desired events would meet for the same minute.
> >>
> >> Kai
> >>
> >> --
> >> Kai Voigt
> >> [EMAIL PROTECTED]
> >>
>
> --
> Kai Voigt
> [EMAIL PROTECTED]
Zhiwei Lin 2013-01-28, 13:09
Oleg Ruchovets 2013-01-28, 13:51