HBase >> mail # user >> how to model data based on "time bucket"


Re: how to model data based on "time bucket"
Hi Rodrigo,
  As usual you have a very interesting idea! :-)

I am not sure I understand exactly what you mean, so I will try to
simulate it:
     Suppose we have such events in MAIN Table:
            event1 | 10:07
            event2 | 10:10
            event3 | 10:12
     Time window T=5 minutes.

================ on map ================

What should I emit for event1, event2 and event3?

For event1 the map phase emits (10:07, event1), (10:06, event1),
(10:05, event1), (10:04, event1), (10:03, event1).
For event2 the map phase emits (10:10, event2), (10:09, event2),
(10:08, event2), (10:07, event2), (10:06, event2).
For event3 the map phase emits (10:12, event3), (10:11, event3),
(10:10, event3), (10:09, event3), (10:08, event3).

I go T=5 steps back from the event time. Is that correct?
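If my simulation is right, the per-event key fan-out can be sketched in plain Java (the helper name `emitKeys` and the minutes-based time handling are my own, not real Hadoop Mapper code):

```java
import java.util.ArrayList;
import java.util.List;

public class TimeBucketMap {
    // For an event at hour:minute and window T, emit keys for the event's
    // minute and the T-1 minutes before it: t, t-1, ..., t-(T-1).
    static List<String> emitKeys(int hour, int minute, int t) {
        List<String> keys = new ArrayList<>();
        int total = hour * 60 + minute;   // minutes since midnight
        for (int i = 0; i < t; i++) {
            int m = total - i;
            keys.add(String.format("%02d:%02d", m / 60, m % 60));
        }
        return keys;
    }

    public static void main(String[] args) {
        // event1 at 10:07 with T=5
        System.out.println(emitKeys(10, 7, 5));
        // [10:07, 10:06, 10:05, 10:04, 10:03]
    }
}
```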

================ on reduce ================

10:03|event1
10:04|event1
10:05|event1
10:06|event1,event2
10:07|event1,event2
10:08|event2,event3
10:09|event2,event3
10:10|event2,event3
10:11|event3
10:12|event3

This output will be written to the second table. Is that correct?
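The reduce step I listed above could be simulated like this (a plain-Java sketch with hypothetical names, standing in for a real Reducer that writes one row per minute into the second table):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class TimeBucketReduce {
    // Group the (time, event) pairs emitted by the map phase under each
    // time key; a TreeMap keeps the time keys sorted.
    static SortedMap<String, List<String>> reduce(List<String[]> pairs) {
        SortedMap<String, List<String>> buckets = new TreeMap<>();
        for (String[] p : pairs) {
            buckets.computeIfAbsent(p[0], k -> new ArrayList<>()).add(p[1]);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
            new String[]{"10:06", "event1"}, new String[]{"10:07", "event1"},
            new String[]{"10:06", "event2"}, new String[]{"10:07", "event2"},
            new String[]{"10:08", "event2"}, new String[]{"10:08", "event3"});
        // Print each bucket as one "time|event,event" row.
        reduce(pairs).forEach(
            (t, evs) -> System.out.println(t + "|" + String.join(",", evs)));
    }
}
```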

============================================
How should I work with this table? Do I have to scan the main table row by
row, and for every row get the event time and query the second table based
on that time?

    If I do that, won't I still need to execute 50 million requests?

Maybe I need to work only with the second table, but then how do I know
what to query (scan)?

I am sure I simply don't understand your approach to the solution well.

Please explain.

Thanks
Oleg.
On Wed, Jan 30, 2013 at 8:34 PM, Rodrigo Ribeiro <
[EMAIL PROTECTED]> wrote:

> There is another option,
> You could do a MapReduce job that, for each row from the main table, emits
> all the time keys for which it falls within the window.
> For example, "event1" would emit {"10:06": event1}, {"10:05": event1} ...
> {"10:00": event1}. (also for "10:07" if you want to include those that
> happen in the same minute too)
> And in the Reduce step you aggregate and save in another table all events
> that are in the window of a given time.
>
> For:
> event_id | time
> ============
> event1 | 10:07
> event2 | 10:10
> event3 | 10:12
>
> The result table would look like:
> time   | events
> 10:00 | event1
> 10:01 | event1
> 10:02 | event1
> 10:03 | event1,event2
> 10:04 | event1,event2
> 10:05 | event1,event2,event3
> 10:06 | event1,event2,event3
> 10:07 | event2,event3
> 10:08 | event2,event3
> ...
>
> So that, knowing the time when an event happens, you can get the list of
> events after it.
> For event1, we only look in this table for the key "10:07".
>
> Sorry for any typos, writing in a bit of a hurry.
>
> On Wed, Jan 30, 2013 at 6:57 AM, Oleg Ruchovets <[EMAIL PROTECTED]
> >wrote:
>
> > Hi Rodrigo.
> >     Using the solution with 2 tables: one main and one as an index.
> > I have ~50 million records; in my case I would need to scan the whole
> > table, and as a result I will have 50 million scans, which will kill
> > performance.
> >
> > Is there any other approach to model my use case using HBase?
> >
> > Thanks
> > Oleg.
> >
> >
> > On Mon, Jan 28, 2013 at 6:27 PM, Rodrigo Ribeiro <
> > [EMAIL PROTECTED]> wrote:
> >
> > > In the approach that I mentioned, you would need a table to retrieve
> > > the time of a certain event (if this information can be retrieved in
> > > another way, you may ignore this table). It would be like you posted:
> > > event_id | time
> > > ============
> > > event1 | 10:07
> > > event2 | 10:10
> > > event3 | 10:12
> > > event4 | 10:20
> > >
> > > And a secondary table would be like:
> > > rowkey
> > > ==========
> > > 10:07:event1
> > > 10:10:event2
> > > 10:12:event3
> > > 10:20:event4
> > >
> > > That way, for your first example, you first retrieve the time of
> > > "event1" from the main table, and then scan the secondary table
> > > starting from its position ("10:07:event1") until the end of the
> > > window.
> > > In this case (T=7) the scan will range over ["10:07:event1", "10:15").
> > >
> > > As Michel Segel mentioned, there is a hotspot problem on insertion