
HBase, mail # user - how to model data based on "time bucket"


Re: how to model data based on "time bucket"
Oleg Ruchovets 2013-01-31, 13:52
Hi Rodrigo ,
  As usual, very interesting! :-)

I am not sure that I understand exactly what you mean, so let me try to
simulate it:
     Suppose we have such events in MAIN Table:
            event1 | 10:07
            event2 | 10:10
            event3 | 10:12
     Time window T=5 minutes.

================= on map =================

What should I emit for each event? I think:

For event1, the map phase emits (10:07, event1), (10:06, event1), (10:05,
event1), (10:04, event1), (10:03, event1).
For event2, the map phase emits (10:10, event2), (10:09, event2), (10:08,
event2), (10:07, event2), (10:06, event2).
For event3, the map phase emits (10:12, event3), (10:11, event3), (10:10,
event3), (10:09, event3), (10:08, event3).

So I go back T=5 steps from each event's time. Is that correct?
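If that reading is right, the map step can be simulated outside Hadoop (a Python sketch, not real Mapper code; the event times and window T=5 are taken from the example above):

```python
from datetime import datetime, timedelta

def map_event(event_id, time_str, window_minutes=5):
    """Emit (time, event_id) pairs for the event's own minute and the
    window_minutes - 1 minutes before it (5 emissions for T=5)."""
    t = datetime.strptime(time_str, "%H:%M")
    return [((t - timedelta(minutes=i)).strftime("%H:%M"), event_id)
            for i in range(window_minutes)]

# The three events from the example above
events = [("event1", "10:07"), ("event2", "10:10"), ("event3", "10:12")]
emitted = [pair for eid, ts in events for pair in map_event(eid, ts)]
# map_event("event1", "10:07") -> [("10:07", "event1"), ("10:06", "event1"),
#   ("10:05", "event1"), ("10:04", "event1"), ("10:03", "event1")]
```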

================= on reduce =================

10:03|event1
10:04|event1
10:05|event1
10:06|event1,event2
10:07|event1,event2
10:08|event2,event3
10:09|event2,event3
10:10|event2,event3
10:11|event3
10:12|event3

This output will be written to the second table. Is that correct?
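Grouping those emitted pairs by their time key, as the reduce step would, can be sketched the same way (the helper names here are illustrative, not Hadoop API):

```python
from collections import defaultdict
from datetime import datetime, timedelta

def emit(event_id, time_str, window_minutes=5):
    # Same backward emission as above: the event's own minute plus
    # the preceding window_minutes - 1 minutes.
    t = datetime.strptime(time_str, "%H:%M")
    return [((t - timedelta(minutes=i)).strftime("%H:%M"), event_id)
            for i in range(window_minutes)]

def reduce_by_time(events, window_minutes=5):
    """Group all emitted pairs by time key, like the reduce step."""
    buckets = defaultdict(list)
    for eid, ts in events:
        for minute, _ in emit(eid, ts, window_minutes):
            buckets[minute].append(eid)
    return {m: sorted(evs) for m, evs in sorted(buckets.items())}

table = reduce_by_time([("event1", "10:07"), ("event2", "10:10"),
                        ("event3", "10:12")])
# e.g. table["10:06"] == ["event1", "event2"], matching the listing above
```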

============================================
How should I work with this table? Should I scan the main table row by
row, and for every row get the event time and use that time to query the
second table?

    If I do that, I would still need to execute 50 million requests, right?

Maybe I should work only with the second table, but then how do I know
what to query (scan)?
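One possible reading of the two-table idea (an assumption on my part, not something stated above) is that the second table turns each question into a single point lookup keyed by the event's own time, rather than one scan per row:

```python
# Hypothetical second-table contents, taken from the reduce output above.
bucket_table = {
    "10:07": ["event1", "event2"],
    "10:10": ["event2", "event3"],
}

def events_within_window(event_id, event_time):
    """One GET per event: the bucket keyed by the event's own time
    already holds every event that happened within T minutes after it."""
    return [e for e in bucket_table.get(event_time, []) if e != event_id]

# events_within_window("event1", "10:07") -> ["event2"]
```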

I am sure I simply don't understand your proposed approach well.

Please explain.

Thanks
Oleg.
On Wed, Jan 30, 2013 at 8:34 PM, Rodrigo Ribeiro <
[EMAIL PROTECTED]> wrote:

> There is another option,
> You could do a MapReduce job that, for each row of the main table, emits
> every time bucket whose window the event falls into.
> For example, "event1" would emit {"10:06": event1}, {"10:05": event1} ...
> {"10:00": event1}. (also for "10:07" if you want to include those that
> happen in the same minute)
> And in the Reduce step you aggregate and save in another table all events
> that are in the window of a given time.
>
> For:
> event_id | time
> =========
> event1 | 10:07
> event2 | 10:10
> event3 | 10:12
>
> The result table would look like:
> time   | events
> 10:00 | event1
> 10:01 | event1
> 10:02 | event1
> 10:03 | event1,event2
> 10:04 | event1,event2
> 10:05 | event1,event2,event3
> 10:06 | event1,event2,event3
> 10:07 | event2,event3
> 10:08 | event2,event3
> ...
>
> So that, knowing the time when an event happens, you can get the list of
> events after it.
> For event1, we only look in this table for the key "10:07".
>
> Sorry for any typos, I am writing in a bit of a hurry.
>
> On Wed, Jan 30, 2013 at 6:57 AM, Oleg Ruchovets <[EMAIL PROTECTED]
> >wrote:
>
> > Hi Rodrigo.
> >     Using the solution with 2 tables (one main and one as an index):
> > I have ~50 million records; in my case I would need to scan the whole
> > table, resulting in 50 million scans, which would kill performance.
> >
> > Is there any other approach to model my use case using HBase?
> >
> > Thanks
> > Oleg.
> >
> >
> > On Mon, Jan 28, 2013 at 6:27 PM, Rodrigo Ribeiro <
> > [EMAIL PROTECTED]> wrote:
> >
> > > In the approach that I mentioned, you would need a table to retrieve
> > > the time of a certain event (if this information can be retrieved in
> > > another way, you may ignore this table). It would be like you posted:
> > > event_id | time
> > > =========
> > > event1 | 10:07
> > > event2 | 10:10
> > > event3 | 10:12
> > > event4 | 10:20
> > >
> > > And a secondary table would be like:
> > > rowkey
> > > =========
> > > 10:07:event1
> > > 10:10:event2
> > > 10:12:event3
> > > 10:20:event4
> > >
> > > That way, for your first example, you first retrieve the time of
> > > "event1" from the main table, and then scan the secondary table
> > > starting from its position ("10:07:event1") until the end of the
> > > window.
> > > In this case (T=7) the scan will range over ["10:07:event1", "10:15").
> > >
> > > As Michel Segel mentioned, there is a hotspot problem on insertion
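The secondary-table scan described above can be simulated over a sorted list of rowkeys (a sketch only; a real HBase client would use a Scan with start and stop rows, and the stop key here assumes a window of T=7 minutes after 10:07):

```python
import bisect

# Hypothetical secondary-table rowkeys ("time:event_id"), kept sorted
# the way HBase stores rowkeys lexicographically.
rowkeys = sorted(["10:07:event1", "10:10:event2", "10:12:event3",
                  "10:20:event4"])

def scan(start, stop):
    """Return rowkeys in [start, stop), like an HBase scan with
    start and stop rows."""
    lo = bisect.bisect_left(rowkeys, start)
    hi = bisect.bisect_left(rowkeys, stop)
    return rowkeys[lo:hi]

# Events within 7 minutes after event1: start at its own rowkey and
# stop before minute 10:15 (assumed window end).
# scan("10:07:event1", "10:15")
#   -> ["10:07:event1", "10:10:event2", "10:12:event3"]
```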