|
Oleg Ruchovets
2013-01-28, 12:56
Zhiwei Lin
2013-01-28, 13:09
Kai Voigt
2013-01-28, 13:17
Oleg Ruchovets
2013-01-28, 13:43
Kai Voigt
2013-01-28, 13:48
Oleg Ruchovets
2013-01-28, 13:51
Oleg Ruchovets
2013-01-28, 14:49
|
-
aggregation by time windowOleg Ruchovets 2013-01-28, 12:56
Hi ,
I have such row data structure: event_id | time =============event1 | 10:07 event2 | 10:10 event3 | 10:12 event4 | 10:20 event5 | 10:23 event6 | 10:25 Numbers of records is 50-100 million. Question: I need to get events that was during time T. For example: if T=7 munutes. event1 , event2 , event3 were detected durint 7 minutes. event4 , event5 , event6 were detected during 7 minutes. How can I implement such aggregation using map/reduce. Thanks Oleg.
-
Re: aggregation by time windowZhiwei Lin 2013-01-28, 13:09
do you mean every 7 mins?
e.g, [10:07, 10:14), [10:14, 10:21) ..... On 28 January 2013 12:56, Oleg Ruchovets <[EMAIL PROTECTED]> wrote: > Hi , > I have such row data structure: > > event_id | time > =============> event1 | 10:07 > event2 | 10:10 > event3 | 10:12 > > event4 | 10:20 > event5 | 10:23 > event6 | 10:25 > > Numbers of records is 50-100 million. > > Question: > I need to get events that was during time T. > > For example: if T=7 munutes. > event1 , event2 , event3 were detected durint 7 minutes. > event4 , event5 , event6 were detected during 7 minutes. > > How can I implement such aggregation using map/reduce. > > Thanks > Oleg. > -- Best wishes. Zhiwei
-
Re: aggregation by time windowKai Voigt 2013-01-28, 13:17
Quick idea:
since each of your events will go into several buckets, you could use map() to emit each item multiple times for each bucket. Am 28.01.2013 um 13:56 schrieb Oleg Ruchovets <[EMAIL PROTECTED]>: > Hi , > I have such row data structure: > > event_id | time > =============> event1 | 10:07 > event2 | 10:10 > event3 | 10:12 > > event4 | 10:20 > event5 | 10:23 > event6 | 10:25 map(event1,10:07) would emit (10:04,event1), (10:05,event1), ..., (10:10,event1) and so on. In reduce(), all your desired events would meet for the same minute. Kai -- Kai Voigt [EMAIL PROTECTED]
-
Re: aggregation by time windowOleg Ruchovets 2013-01-28, 13:43
Hi Kai.
It is very interesting. Can you please explain in more details your Idea? What will be a key in a map phase? Suppose we have event at 10:07. How would you emit this to the multiple buckets? Thanks Oleg. On Mon, Jan 28, 2013 at 3:17 PM, Kai Voigt <[EMAIL PROTECTED]> wrote: > Quick idea: > > since each of your events will go into several buckets, you could use > map() to emit each item multiple times for each bucket. > > Am 28.01.2013 um 13:56 schrieb Oleg Ruchovets <[EMAIL PROTECTED]>: > > > Hi , > > I have such row data structure: > > > > event_id | time > > =============> > event1 | 10:07 > > event2 | 10:10 > > event3 | 10:12 > > > > event4 | 10:20 > > event5 | 10:23 > > event6 | 10:25 > > map(event1,10:07) would emit (10:04,event1), (10:05,event1), ..., > (10:10,event1) and so on. > > In reduce(), all your desired events would meet for the same minute. > > Kai > > -- > Kai Voigt > [EMAIL PROTECTED] > > > > >
-
Re: aggregation by time windowKai Voigt 2013-01-28, 13:48
Hi again,
the idea is that you emit every event multiple times. So your map input record (event1, 10:07) will be emitted seven times during the map() call. Like I said, (10:04,event1), (10:05,event1), ..., (10:10,event1) will be the seven outputs for processing a single event. The output key will be the time stamps in which neighbourhood or interval each event should be joined with events that happened +/- 3 minutes near it. So events which happened within a 7 minutes distance will both be emitted with the same time stamp as the map() output, and thus meet in a reduce() call. A reduce() call will look like this: reduce(10:03, list_of_events). And those events had time stamps between 10:00 and 10:06 in the original input. Kai Am 28.01.2013 um 14:43 schrieb Oleg Ruchovets <[EMAIL PROTECTED]>: > Hi Kai. > It is very interesting. Can you please explain in more details your > Idea? > What will be a key in a map phase? > > Suppose we have event at 10:07. How would you emit this to the multiple > buckets? > > Thanks > Oleg. > > > On Mon, Jan 28, 2013 at 3:17 PM, Kai Voigt <[EMAIL PROTECTED]> wrote: > >> Quick idea: >> >> since each of your events will go into several buckets, you could use >> map() to emit each item multiple times for each bucket. >> >> Am 28.01.2013 um 13:56 schrieb Oleg Ruchovets <[EMAIL PROTECTED]>: >> >>> Hi , >>> I have such row data structure: >>> >>> event_id | time >>> =============>>> event1 | 10:07 >>> event2 | 10:10 >>> event3 | 10:12 >>> >>> event4 | 10:20 >>> event5 | 10:23 >>> event6 | 10:25 >> >> map(event1,10:07) would emit (10:04,event1), (10:05,event1), ..., >> (10:10,event1) and so on. >> >> In reduce(), all your desired events would meet for the same minute. >> >> Kai >> >> -- >> Kai Voigt >> [EMAIL PROTECTED] >> >> >> >> >> -- Kai Voigt [EMAIL PROTECTED]
-
Re: aggregation by time windowOleg Ruchovets 2013-01-28, 13:51
Hi , Zhiwei.
No :-). Every 7 minutes is is easy. just transform time to milisecond/7*60000 will give you a bucket key. I need to do the following: Find the events which was dirung time T related to the event X. In very naive approach I need to take first event and find other events which happend during 7 minutes from first event time. But I think it will be very slow and I am looking for a way to improve this naive approach. Thanks Oleg. On Mon, Jan 28, 2013 at 3:09 PM, Zhiwei Lin <[EMAIL PROTECTED]> wrote: > do you mean every 7 mins? > e.g, [10:07, 10:14), > [10:14, 10:21) ..... > > On 28 January 2013 12:56, Oleg Ruchovets <[EMAIL PROTECTED]> wrote: > > > Hi , > > I have such row data structure: > > > > event_id | time > > =============> > event1 | 10:07 > > event2 | 10:10 > > event3 | 10:12 > > > > event4 | 10:20 > > event5 | 10:23 > > event6 | 10:25 > > > > Numbers of records is 50-100 million. > > > > Question: > > I need to get events that was during time T. > > > > For example: if T=7 munutes. > > event1 , event2 , event3 were detected durint 7 minutes. > > event4 , event5 , event6 were detected during 7 minutes. > > > > How can I implement such aggregation using map/reduce. > > > > Thanks > > Oleg. > > > > > > -- > > Best wishes. > > Zhiwei >
-
Re: aggregation by time windowOleg Ruchovets 2013-01-28, 14:49
Well , much more clear , but still have a questions :-)
Suppose we have 3 map input records event1 | 10:07 event2 | 10:10 event3 | 10:12 Output from map(event1 | 10:07) will be : mapOutput(10:04:event1) mapOutput(10:05:event1) mapOutput(10:06:event1) mapOutput(10:07:event1) mapOutput(10:08:event1) mapOutput(10:09:event1) mapOutput(10:10:event1) Output for map(event2 | 10:10) will be: mapOutput(10:07:event2) mapOutput(10:08:event2) mapOutput(10:09:event2) mapOutput(10:10:event2) mapOutput(10:11:event2) mapOutput(10:12:event2) mapOutput(10:13:event2) Output for map (event3 | 10:12) will be: mapOutput(10:09: event3) mapOutput(10:10 : event3) mapOutput(10:11 : event3) mapOutput(10:12 : event3) mapOutput(10:13 : event3) mapOutput(10:14 : event3) mapOutput(10:15 : event3) Is it correct? If yes , in reducer phase I will get such inputs: reducer(10:04:event1) reducer(10:05:event1) reducer(10:06:event1) reducer(10:07:event1 ,event2) reducer(10:08:event1 , event2) reducer(10:09:event1 , event2 , event3) reducer(10:10:event1 , event2 , event3) reducer(10:11:event3) reducer(10:12:event3) reducer(10:13:event3) reducer(10:14:event3) reducer(10:15:event3) Iterating over each reducer input how can I know at the end of aggregations which events were during 7 minutes? Thansk Oleg. On Mon, Jan 28, 2013 at 3:48 PM, Kai Voigt <[EMAIL PROTECTED]> wrote: > Hi again, > > the idea is that you emit every event multiple times. So your map input > record (event1, 10:07) will be emitted seven times during the map() call. > Like I said, (10:04,event1), (10:05,event1), ..., (10:10,event1) will be > the seven outputs for processing a single event. > > The output key will be the time stamps in which neighbourhood or interval > each event should be joined with events that happened +/- 3 minutes near > it. So events which happened within a 7 minutes distance will both be > emitted with the same time stamp as the map() output, and thus meet in a > reduce() call. > > A reduce() call will look like this: reduce(10:03, list_of_events). And > those events had time stamps between 10:00 and 10:06 in the original input. > > Kai > > Am 28.01.2013 um 14:43 schrieb Oleg Ruchovets <[EMAIL PROTECTED]>: > > > Hi Kai. > > It is very interesting. Can you please explain in more details your > > Idea? > > What will be a key in a map phase? > > > > Suppose we have event at 10:07. How would you emit this to the multiple > > buckets? > > > > Thanks > > Oleg. > > > > > > On Mon, Jan 28, 2013 at 3:17 PM, Kai Voigt <[EMAIL PROTECTED]> wrote: > > > >> Quick idea: > >> > >> since each of your events will go into several buckets, you could use > >> map() to emit each item multiple times for each bucket. > >> > >> Am 28.01.2013 um 13:56 schrieb Oleg Ruchovets <[EMAIL PROTECTED]>: > >> > >>> Hi , > >>> I have such row data structure: > >>> > >>> event_id | time > >>> =============> >>> event1 | 10:07 > >>> event2 | 10:10 > >>> event3 | 10:12 > >>> > >>> event4 | 10:20 > >>> event5 | 10:23 > >>> event6 | 10:25 > >> > >> map(event1,10:07) would emit (10:04,event1), (10:05,event1), ..., > >> (10:10,event1) and so on. > >> > >> In reduce(), all your desired events would meet for the same minute. > >> > >> Kai > >> > >> -- > >> Kai Voigt > >> [EMAIL PROTECTED] > >> > >> > >> > >> > >> > > -- > Kai Voigt > [EMAIL PROTECTED] > > > > > |