HBase >> mail # user >> how to model data based on "time bucket"


Oleg Ruchovets 2013-01-28, 13:06
Rodrigo Ribeiro 2013-01-28, 15:17
Oleg Ruchovets 2013-01-28, 15:49
Re: how to model data based on "time bucket"
In the approach that I mentioned, you would need a table to retrieve the
time of a certain event (if this information can be retrieved in another
way, you may ignore this table). It would be like you posted:

event_id | time
================
event1   | 10:07
event2   | 10:10
event3   | 10:12
event4   | 10:20

And a secondary table would be like:

rowkey
==========
10:07:event1
10:10:event2
10:12:event3
10:20:event4

That way, for your first example, you first retrieve the time of
"event1" from the main table, and then scan the secondary table starting
from its position ("10:07:event1") until the end of the window.
In this case (T=7) the scan will cover the range ["10:07:event1", "10:15").
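The two-table lookup above can be sketched in a few lines. This is a simulation only: an in-memory sorted list stands in for the HBase index table, the key format {time}:{event_id} comes from the thread, and the helper names are illustrative assumptions.

```python
import bisect

# Main table: event_id -> time (stands in for the lookup table above).
main_table = {
    "event1": "10:07",
    "event2": "10:10",
    "event3": "10:12",
    "event4": "10:20",
}

# Secondary index: rowkeys "{time}:{event_id}", kept sorted the way
# HBase keeps rows sorted by rowkey.
index_table = sorted(f"{t}:{e}" for e, t in main_table.items())

def add_minutes(hhmm, minutes):
    # Add `minutes` to an "HH:MM" string (illustrative helper).
    h, m = map(int, hhmm.split(":"))
    total = h * 60 + m + minutes
    return f"{total // 60:02d}:{total % 60:02d}"

def events_in_window(event_id, t_minutes):
    # 1) Look up the event's time in the main table.
    # 2) Range-scan the index from the event's own key to the end of
    #    the window (exclusive stop row, as in a half-open HBase scan).
    start_time = main_table[event_id]
    start_key = f"{start_time}:{event_id}"
    stop_key = add_minutes(start_time, t_minutes)
    lo = bisect.bisect_left(index_table, start_key)
    hi = bisect.bisect_left(index_table, stop_key)
    return [row.split(":", 2)[2] for row in index_table[lo:hi]]

print(events_in_window("event1", 7))  # → ['event1', 'event2', 'event3']
```

Against a real cluster this maps to a single Scan with a start row and an exclusive stop row, rather than the `bisect` calls used here.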

As Michel Segel mentioned, there is a hotspot problem on insertion when
using this approach alone.
Using multiple buckets (for example, derived from a hash of the event id)
would distribute the writes better, but it requires scanning all buckets
of the second table to get all events in the time window.

Assuming you use 3 buckets, it would look like:

rowkey
==========
1_10:07:event1
2_10:10:event2
3_10:12:event3
2_10:20:event4

The scans would be: ["1_10:07:event1", "1_10:15"), ["2_10:07:event1",
"2_10:15"), and ["3_10:07:event1", "3_10:15"); you can then combine the
results.
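The bucketed variant can be simulated the same way. The prefix scheme {bucket}_{time}:{event_id} and the per-bucket scans come from the message above; the hash function, bucket count, and helper names are assumptions for illustration only.

```python
import heapq

NUM_BUCKETS = 3

def bucket_of(event_id):
    # Illustrative stable hash of the event id; any deterministic
    # function that spreads ids evenly across buckets will do.
    return sum(event_id.encode()) % NUM_BUCKETS + 1

def make_key(time, event_id):
    return f"{bucket_of(event_id)}_{time}:{event_id}"

# Bucketed index rows, kept sorted as HBase keeps rowkeys.
rows = sorted(make_key(t, e) for e, t in [
    ("event1", "10:07"), ("event2", "10:10"),
    ("event3", "10:12"), ("event4", "10:20"),
])

def scan(start, stop):
    # Simulates one HBase range scan [start, stop) over the sorted rows.
    return [r for r in rows if start <= r < stop]

def events_in_window(start_time, stop_time):
    # One scan per bucket, then merge the per-bucket results by the
    # time part of the key (everything after the bucket prefix).
    per_bucket = [
        scan(f"{b}_{start_time}", f"{b}_{stop_time}")
        for b in range(1, NUM_BUCKETS + 1)
    ]
    merged = heapq.merge(*per_bucket, key=lambda k: k.split("_", 1)[1])
    return [k.split(":", 2)[2] for k in merged]

print(events_in_window("10:07", "10:15"))  # → ['event1', 'event2', 'event3']
```

Each per-bucket scan returns rows already sorted by time, so `heapq.merge` recombines them in time order without a full re-sort, which is how the combined result of the three scans would be assembled client-side.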

Hope it helps.

On Mon, Jan 28, 2013 at 12:49 PM, Oleg Ruchovets <[EMAIL PROTECTED]> wrote:

> Hi Rodrigo.
>   Can you please explain in more details your solution.You said that I will
> have another table. How many table will I have? Will I have 2 tables? What
> will be the schema of the tables?
>
> I will try to explain what I am trying to achieve:
>     I have ~50 million records like {time|event}. I want to put the data in
> HBase in such a way that I can get, for an event X, all events that occurred
> after event X during a window of T minutes (for example, during 7 minutes).
> So I will be able to scan the whole table and get groups like:
>
>   {event1:10:02} corresponds to events {event2:10:03}, {event3:10:05},
> {event4:10:06}
>   {event2:10:30} corresponds to events {event5:10:32}, {event3:10:33},
> {event3:10:36}.
>
> Thanks
> Oleg.
>
>
> On Mon, Jan 28, 2013 at 5:17 PM, Rodrigo Ribeiro <
> [EMAIL PROTECTED]> wrote:
>
> > You can use another table as an index, with a rowkey like
> > '{time}:{event_id}', and then scan the range ["10:07", "10:15").
> >
> > On Mon, Jan 28, 2013 at 10:06 AM, Oleg Ruchovets <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Hi ,
> > >
> > > I have the following row data structure:
> > >
> > > event_id | time
> > > ================
> > > event1   | 10:07
> > > event2   | 10:10
> > > event3   | 10:12
> > >
> > > event4   | 10:20
> > > event5   | 10:23
> > > event6   | 10:25
> > >
> > >
> > > The number of records is 50-100 million.
> > >
> > >
> > > Question:
> > >
> > > I need to find the group of events starting from event X that fall
> > > into a time window bucket of size T.
> > >
> > > For example, if T=7 minutes:
> > > Starting from event1, {event1, event2, event3} were detected
> > > during 7 minutes.
> > >
> > > Starting from event2, {event2, event3} were detected during 7
> > > minutes.
> > >
> > > Starting from event4, {event4, event5, event6} were detected
> > > during 7 minutes.
> > > Is there a way to model the data in HBase to achieve this?
> > >
> > > Thanks
> > >
> >
> >
> >
> > --
> >
> > *Rodrigo Pereira Ribeiro*
> > Software Developer
> > www.jusbrasil.com.br
> >
>

--

*Rodrigo Pereira Ribeiro*
Software Developer
T (71) 3033-6371
C (71) 8612-5847
[EMAIL PROTECTED]
www.jusbrasil.com.br
Oleg Ruchovets 2013-01-28, 17:07
Rodrigo Ribeiro 2013-01-28, 17:24
Oleg Ruchovets 2013-01-28, 17:45
Oleg Ruchovets 2013-01-30, 09:57
Rodrigo Ribeiro 2013-01-30, 18:34
Oleg Ruchovets 2013-01-31, 13:52
Rodrigo Ribeiro 2013-01-31, 14:34
Oleg Ruchovets 2013-01-31, 15:39
Rodrigo Ribeiro 2013-01-31, 15:51
Michel Segel 2013-01-28, 15:54
Oleg Ruchovets 2013-01-28, 16:24