HBase, mail # user - how to model data based on "time bucket"


Re: how to model data based on "time bucket"
Rodrigo Ribeiro 2013-01-31, 14:34
Hi,
The Map and Reduce steps you describe are the same as what I had in mind.

> How should I work with this table. Should I have to scan Main table : row by
> row and for every row get event time and based on that time query second
> table?
>
>     In case I will do so , i still need to execute 50 million request?
>
> May be I need to work only with second table. How do I know what to query
> (scan)?
Yes, using that approach you need to query both tables for each eventId you
need to look up.
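To make the cost of that approach concrete, here is a minimal sketch of the two lookups per event. It uses plain Python dicts as stand-ins for the two HBase tables; `main_table`, `bucket_table` and `events_sharing_bucket` are hypothetical names, and the second table holds the time-to-events rows from the reduce output quoted further down.

```python
# Hypothetical in-memory stand-ins for the two HBase tables.
main_table = {"event1": "10:07", "event2": "10:10", "event3": "10:12"}
bucket_table = {
    "10:03": ["event1"],
    "10:04": ["event1"],
    "10:05": ["event1"],
    "10:06": ["event1", "event2"],
    "10:07": ["event1", "event2"],
    "10:08": ["event2", "event3"],
    "10:09": ["event2", "event3"],
    "10:10": ["event2", "event3"],
    "10:11": ["event3"],
    "10:12": ["event3"],
}


def events_sharing_bucket(event_id, main_table, bucket_table):
    """Return the other events stored under this event's own time."""
    event_time = main_table[event_id]           # lookup 1: main table
    bucket = bucket_table.get(event_time, [])   # lookup 2: second table
    return [other for other in bucket if other != event_id]


for event_id in main_table:
    print(event_id, events_sharing_bucket(event_id, main_table, bucket_table))
```

Every event costs one read on each table, so the number of requests still grows with the number of events.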

I just thought of something else that I think will suit your use case better.
When you emit, you could tag each pair to distinguish the time at which the
event begins (*begin*) from the earlier times whose window it falls into
(*after*).
For the example using T=5, the emits would be:

For event1, the map phase emits:
  (10:07, [*begin*, event1]), (10:06, [*after*, event1]), (10:05, [*after*, event1]),
  (10:04, [*after*, event1]), (10:03, [*after*, event1])
For event2, the map phase emits:
  (10:10, [*begin*, event2]), (10:09, [*after*, event2]), (10:08, [*after*, event2]),
  (10:07, [*after*, event2]), (10:06, [*after*, event2])
For event3, the map phase emits:
  (10:12, [*begin*, event3]), (10:11, [*after*, event3]), (10:10, [*after*, event3]),
  (10:09, [*after*, event3]), (10:08, [*after*, event3])
So the reduce step knows exactly which event began at a given time and which
events begin within the window of time after it.
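The map-side emit can be sketched roughly as below. This is only an illustration in plain Python, not Hadoop code: `emit_buckets` is a hypothetical helper, times are assumed to be "HH:MM" strings with one-minute buckets, and a real job would key on full timestamps so that windows crossing midnight stay unambiguous.

```python
from datetime import datetime, timedelta


def emit_buckets(event_id, start, window_minutes):
    """Yield (bucket, (tag, event_id)) pairs for a single event.

    The event's own start minute is tagged "begin"; the preceding
    window_minutes - 1 minutes are tagged "after", because the event
    starts after each of those times.
    """
    start_time = datetime.strptime(start, "%H:%M")
    for offset in range(window_minutes):
        tag = "begin" if offset == 0 else "after"
        bucket = (start_time - timedelta(minutes=offset)).strftime("%H:%M")
        yield bucket, (tag, event_id)


# Emits for event1 with T=5.
pairs = list(emit_buckets("event1", "10:07", 5))
for bucket, value in pairs:
    print(bucket, value)
```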

The reduce step for key "10:07" would receive { [*begin*, event1], [*after*, event2] }.
So you know that event1 began at this time and that event2 is in its window
of time, and you save that to a second table. (event3 does not show up under
this key, because its earliest emit is 10:08.)

The reduce step for key "10:06" would receive { [*after*, event1], [*after*, event2] }.
No event began at this time, so there is nothing to save.

After all this, you get a second table that I believe contains exactly
what you want:
eventid | events_window_time
========|===================
event1  | event2
event2  | event3
event3  |
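A rough sketch of the reduce side, again in plain Python rather than Hadoop code. `build_window_table` is a hypothetical name, the grouping dict stands in for the shuffle, and the input is a handful of pairs copied from the emit lists above. One assumption to note: events that begin in the same minute are not added to each other's lists, since only *after* entries are recorded.

```python
from collections import defaultdict


def build_window_table(pairs):
    """Group (bucket, (tag, event_id)) pairs by bucket, then reduce each bucket.

    For every bucket that contains a "begin" entry, the event that begins
    there is mapped to the events tagged "after" in the same bucket.
    Buckets without a "begin" entry produce no row.
    """
    grouped = defaultdict(list)  # stand-in for the shuffle phase
    for bucket, tagged_event in pairs:
        grouped[bucket].append(tagged_event)

    table = {}
    for values in grouped.values():
        begins = [event for tag, event in values if tag == "begin"]
        afters = sorted(event for tag, event in values if tag == "after")
        for event in begins:
            table[event] = afters
    return table


# A few of the emitted pairs from the T=5 example.
pairs = [
    ("10:07", ("begin", "event1")),
    ("10:07", ("after", "event2")),
    ("10:06", ("after", "event1")),
    ("10:06", ("after", "event2")),
    ("10:10", ("begin", "event2")),
    ("10:10", ("after", "event3")),
    ("10:12", ("begin", "event3")),
]

table = build_window_table(pairs)
for event_id, window_events in table.items():
    print(event_id, "|", ", ".join(window_events))
```

With this layout, finding the events in an event's window is a single read keyed by eventid.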

Let me know if I'm not being clear.

On Thu, Jan 31, 2013 at 10:52 AM, Oleg Ruchovets <[EMAIL PROTECTED]> wrote:

> Hi Rodrigo ,
>   As usual, very interesting! :-)
>
> I am not sure that I understand exactly what you mean, so I will try to
> simulate it:
>      Suppose we have such events in MAIN Table:
>             event1 | 10:07
>             event2 | 10:10
>             event3 | 10:12
>      Time window T=5 minutes.
>
> =================on  map================ :
>
> what should I emit for event1 and event2
>
> For event1 in map phase will be (10:07 ,event1) , (10:06 ,event1) , (10:05
> ,event1), (10:04 ,event1), (10:03 ,event1).
> For event2 in map phase will be (10:10 ,event2) , (10:09 ,event2) , (10:08
> ,event2), (10:07 ,event2), (10:06 ,event2).
> For event3 in map phase will be (10:12 ,event3) , (10:11 ,event3) , (10:10
> ,event3), (10:09 ,event3), (10:08 ,event3).
>
> I calculate T=5 steps back from the event time. Is that correct?
>
> ==================on reduce =========:
>
> 10:03|event1
> 10:04|event1
> 10:05|event1
> 10:06|event1,event2
> 10:07|event1,event2
> 10:08|event2,event3
> 10:09|event2,event3
> 10:10|event2,event3
> 10:11|event3
> 10:12|event3
>
> This output will be written to the second table. Is that correct?
>
> ============================================>
> How should I work with this table.Should I have to scan Main table : row by
> row and for every row get event time and based on that time query second
> table?
>
>     In case I will do so , i still need to execute 50 million request?
>
> May be I need to work only with second table. How do I know what to query
> (scan)?
>
> I am sure I simply don't fully understand your approach to the
> solution.
>
> Please explain.
>
> Thanks
> Oleg.
>
> On Wed, Jan 30, 2013 at 8:34 PM, Rodrigo Ribeiro <
> [EMAIL PROTECTED]> wrote:
>
> > There is another option,
> > You could do a MapReduce job that, for each row from the main table, emits
> > all times that it would be in the window of time,
> > For example, "event1" would emit {"10:06": event1}, {"10:05": event1} ...
*Rodrigo Pereira Ribeiro*
Software Developer
www.jusbrasil.com.br