HBase user mailing list: how to model data based on "time bucket"


Oleg Ruchovets 2013-01-28, 13:06
Rodrigo Ribeiro 2013-01-28, 15:17
Oleg Ruchovets 2013-01-28, 15:49
Rodrigo Ribeiro 2013-01-28, 16:27
Oleg Ruchovets 2013-01-28, 17:07
Rodrigo Ribeiro 2013-01-28, 17:24
Oleg Ruchovets 2013-01-28, 17:45
Oleg Ruchovets 2013-01-30, 09:57
Re: how to model data based on "time bucket"
There is another option:
You could run a MapReduce job that, for each row of the main table, emits the
event id under every minute whose T-minute window would contain it.
For example, "event1" would emit {"10:06": event1}, {"10:05": event1} ...
{"10:00": event1} (and also {"10:07": event1} if you want to include events
that happen in the same minute).
In the reduce step you then aggregate and save, in another table, all events
that fall in the window of a given minute. (A sketch of such a job follows
the example tables below.)

For:
event_id | time
===============
event1   | 10:07
event2   | 10:10
event3   | 10:12

The result table would look like:
time   | events
10:00 | event1
10:01 | event1
10:02 | event1
10:03 | event1,event2
10:04 | event1,event2
10:05 | event1,event2,event3
10:06 | event1,event2,event3
10:07 | event2,event3
10:08 | event2,event3
...
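
A minimal sketch of such a job, not taken from the thread: the table names
"events" and "events_by_minute", the column family "d" and the qualifiers
"time" and "events" are assumptions, and it targets the TableMapper/TableReducer
API that was current around HBase 0.94 / Hadoop 1.x.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class TimeBucketJob {

  static final int WINDOW_MINUTES = 7;           // T
  static final byte[] CF = Bytes.toBytes("d");   // assumed column family

  // For each event row, emit the event id under every minute whose
  // T-minute window would contain it, i.e. [time - T, time - 1].
  static class BucketMapper extends TableMapper<Text, Text> {
    @Override
    protected void map(ImmutableBytesWritable key, Result row, Context ctx)
        throws IOException, InterruptedException {
      String eventId = Bytes.toString(key.get());
      String time = Bytes.toString(row.getValue(CF, Bytes.toBytes("time")));  // e.g. "10:07"
      int minute = toMinutes(time);
      for (int m = Math.max(0, minute - WINDOW_MINUTES); m < minute; m++) {
        ctx.write(new Text(toHHmm(m)), new Text(eventId));
      }
    }
  }

  // Aggregate every event id emitted for a minute and store the list
  // as a single row of the bucket table.
  static class BucketReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {
    @Override
    protected void reduce(Text minute, Iterable<Text> eventIds, Context ctx)
        throws IOException, InterruptedException {
      StringBuilder joined = new StringBuilder();
      for (Text id : eventIds) {
        if (joined.length() > 0) joined.append(',');
        joined.append(id.toString());
      }
      Put put = new Put(Bytes.toBytes(minute.toString()));
      put.add(CF, Bytes.toBytes("events"), Bytes.toBytes(joined.toString()));
      ctx.write(null, put);
    }
  }

  static int toMinutes(String hhmm) {            // "10:07" -> 607
    String[] parts = hhmm.split(":");
    return Integer.parseInt(parts[0]) * 60 + Integer.parseInt(parts[1]);
  }

  static String toHHmm(int minutes) {            // 607 -> "10:07"
    return String.format("%02d:%02d", minutes / 60, minutes % 60);
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "time-bucket-build");
    job.setJarByClass(TimeBucketJob.class);
    TableMapReduceUtil.initTableMapperJob("events", new Scan(),
        BucketMapper.class, Text.class, Text.class, job);
    TableMapReduceUtil.initTableReducerJob("events_by_minute",
        BucketReducer.class, job);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The reducer stores the followers as a comma-separated string; a wide row with
one column qualifier per event id would work just as well.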

So, knowing the time at which an event happens, you can get the list of
events that come after it.
For event1, we only need to look up the key "10:07" in this table.
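
That lookup would then be a single point read, e.g. (reusing the assumed
table and column names from the sketch above and the standard
org.apache.hadoop.hbase.client imports):

// Given that event1 happened at "10:07", one Get on the bucket table
// returns every event that follows it within the 7-minute window.
HTable bucketTable = new HTable(HBaseConfiguration.create(), "events_by_minute");
Result result = bucketTable.get(new Get(Bytes.toBytes("10:07")));
String followers = Bytes.toString(
    result.getValue(Bytes.toBytes("d"), Bytes.toBytes("events")));  // "event2,event3"
bucketTable.close();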

Sorry for any typos, I'm writing in a bit of a hurry.

On Wed, Jan 30, 2013 at 6:57 AM, Oleg Ruchovets <[EMAIL PROTECTED]> wrote:

> Hi Rodrigo.
>     Regarding the solution with 2 tables (one main and one as an index):
> I have ~50 million records; in my case I need to scan the whole table, and
> as a result I will have 50 million scans, which will kill performance.
>
> Is there any other approach to model my use case using HBase?
>
> Thanks
> Oleg.
>
>
> On Mon, Jan 28, 2013 at 6:27 PM, Rodrigo Ribeiro <
> [EMAIL PROTECTED]> wrote:
>
> > In the approach that I mentioned, you would need a table to retrieve the
> > time of a given event (if this information can be retrieved in another
> > way, you may ignore this table). It would be like you posted:
> > event_id | time
> > ===============
> > event1   | 10:07
> > event2   | 10:10
> > event3   | 10:12
> > event4   | 10:20
> >
> > And a secondary table would look like:
> > rowkey
> > =============
> > 10:07:event1
> > 10:10:event2
> > 10:12:event3
> > 10:20:event4
> >
> > That way, for your first example, you first retrieve the time of
> > "event1" from the main table, and then scan the secondary table starting
> > from its position ("10:07:event1") until the end of the window.
> > In this case (T=7) the scan covers the range ["10:07:event1", "10:15").
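
A minimal sketch of this two-step lookup with the Java client; the table
names "events" and "events_index" and the column "d:time" are assumptions,
not from the thread, and the stop row is hard-coded to mirror the example
(normally it would be computed as time + T):

// 1. Look up event1's time in the main table.
HTable main = new HTable(HBaseConfiguration.create(), "events");
Result r = main.get(new Get(Bytes.toBytes("event1")));
String time = Bytes.toString(r.getValue(Bytes.toBytes("d"), Bytes.toBytes("time")));  // "10:07"
main.close();

// 2. Scan the index table from this event's own row to the end of the window.
HTable index = new HTable(HBaseConfiguration.create(), "events_index");
Scan scan = new Scan(Bytes.toBytes(time + ":event1"), Bytes.toBytes("10:15"));
ResultScanner scanner = index.getScanner(scan);
for (Result row : scanner) {
  // Row keys are "{time}:{event_id}"; every row after the first is an
  // event that follows event1 within the window.
  System.out.println(Bytes.toString(row.getRow()));
}
scanner.close();
index.close();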
> >
> > As Michel Segel mentioned, there is a hotspot problem on insertion when
> > using this approach alone.
> > Using multiple buckets (the bucket could be a hash of the event id) would
> > distribute the writes better, but it requires scanning all buckets of the
> > secondary table to get all events in the time window.
> >
> > Assuming you use 3 buckets, it would look like:
> > rowkey
> > ===============
> > 1_10:07:event1
> > 2_10:10:event2
> > 3_10:12:event3
> > 2_10:20:event4
> >
> > The scans would then be ["1_10:07:event1", "1_10:15"), ["2_10:07:event1",
> > "2_10:15"), and ["3_10:07:event1", "3_10:15"); you can then combine the
> > results.
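
A sketch of that fan-out read across the three buckets, again using the
assumed "events_index" table, the standard client imports plus
java.util.ArrayList/List, and scan boundaries that mirror the ranges above:

// Scan each bucket's slice of the window and merge the results client-side.
HTable index = new HTable(HBaseConfiguration.create(), "events_index");
List<String> followers = new ArrayList<String>();
for (int bucket = 1; bucket <= 3; bucket++) {
  Scan scan = new Scan(Bytes.toBytes(bucket + "_10:07:event1"),
                       Bytes.toBytes(bucket + "_10:15"));
  ResultScanner scanner = index.getScanner(scan);
  for (Result row : scanner) {
    // Drop the "{bucket}_" salt; what remains is "{time}:{event_id}".
    followers.add(Bytes.toString(row.getRow()).substring(2));
  }
  scanner.close();
}
index.close();
// "followers" now holds every event in the window, across all buckets.

The three scans are independent of each other, so they could also be issued
in parallel.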
> >
> > Hope it helps.
> >
> > On Mon, Jan 28, 2013 at 12:49 PM, Oleg Ruchovets <[EMAIL PROTECTED]> wrote:
> >
> > > Hi Rodrigo.
> > >   Can you please explain your solution in more detail? You said that I
> > > will have another table. How many tables will I have? Will I have 2
> > > tables? What will be the schema of the tables?
> > >
> > > I'll try to explain what I am trying to achieve:
> > >     I have ~50 million records like {time|event}. I want to put the
> > > data into HBase in such a way that, for an event at time X, I can get
> > > all events that happened after it within T minutes (for example, within
> > > 7 minutes).
> > > So I will be able to scan the whole table and get groups like:
> > >
> > >   {event1:10:02} corresponds to events {event2:10:03}, {event3:10:05},
> > > {event4:10:06}
> > >   {event2:10:30} corresponds to events {event5:10:32}, {event3:10:33},
> > > {event3:10:36}.
> > >
> > > Thanks
> > > Oleg.
> > >
> > >
> > > On Mon, Jan 28, 2013 at 5:17 PM, Rodrigo Ribeiro <
> > > [EMAIL PROTECTED]> wrote:
> > >
> > > > You can use another table as an index, using a rowkey like
> > > > '{time}:{event_id}', and then scan the range ["10:07", "10:15").
Rodrigo Pereira Ribeiro
Software Developer
www.jusbrasil.com.br
Oleg Ruchovets 2013-01-31, 13:52
Rodrigo Ribeiro 2013-01-31, 14:34
Oleg Ruchovets 2013-01-31, 15:39
Rodrigo Ribeiro 2013-01-31, 15:51
Michel Segel 2013-01-28, 15:54
Oleg Ruchovets 2013-01-28, 16:24