HBase, mail # user - how to model data based on "time bucket"


Re: how to model data based on "time bucket"
Rodrigo Ribeiro 2013-01-28, 17:24
Yes, it's possible,
Check this solution:
http://stackoverflow.com/questions/11353911/extending-hadoops-tableinputformat-to-scan-with-a-prefix-used-for-distribution
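
To make the bucketed scan from my earlier message (quoted below) concrete, here is a rough sketch in plain Python rather than the HBase API. It only models the row keys and the [start, stop) ranges. The names bucket_for, index_key, scan_ranges and events_in_window are made up for illustration, the CRC32 hash is just one possible bucket function, and a sorted list stands in for the secondary table. In a MapReduce job each range would presumably become its own scan.

```python
from bisect import bisect_left
import zlib

NUM_BUCKETS = 3  # assumption: the 3-bucket example from the thread


def bucket_for(event_id):
    # Hypothetical bucket function: stable hash of the event id, mapped to 1..NUM_BUCKETS.
    return zlib.crc32(event_id.encode("utf-8")) % NUM_BUCKETS + 1


def index_key(time, event_id):
    # Row key of the secondary table: "{bucket}_{time}:{event_id}".
    return "%d_%s:%s" % (bucket_for(event_id), time, event_id)


def add_minutes(time, minutes):
    # "HH:MM" plus minutes; no midnight wrap-around (a real key would carry the date).
    h, m = (int(part) for part in time.split(":"))
    total = h * 60 + m + minutes
    return "%02d:%02d" % (total // 60, total % 60)


def scan_ranges(time, event_id, window_minutes):
    # One half-open [start, stop) range per bucket. The stop key is the first
    # minute after the window, e.g. 10:07 with T=7 gives "10:15".
    stop_time = add_minutes(time, window_minutes + 1)
    return [("%d_%s:%s" % (b, time, event_id), "%d_%s" % (b, stop_time))
            for b in range(1, NUM_BUCKETS + 1)]


def scan(sorted_keys, start, stop):
    # Stand-in for a range scan over lexicographically sorted row keys.
    return sorted_keys[bisect_left(sorted_keys, start):bisect_left(sorted_keys, stop)]


def events_in_window(sorted_keys, time, event_id, window_minutes):
    hits = []
    for start, stop in scan_ranges(time, event_id, window_minutes):
        hits.extend(scan(sorted_keys, start, stop))
    # Drop the bucket prefix and restore time order across buckets.
    return sorted(key.split("_", 1)[1] for key in hits)


events = [("event1", "10:07"), ("event2", "10:10"),
          ("event3", "10:12"), ("event4", "10:20")]
table = sorted(index_key(t, e) for e, t in events)
print(events_in_window(table, "10:07", "event1", 7))
# ['10:07:event1', '10:10:event2', '10:12:event3']
```

The result does not depend on which bucket each event hashes to, because every bucket is scanned and the results are merged afterwards. That merge step is the price paid for spreading the writes.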

On Mon, Jan 28, 2013 at 2:07 PM, Oleg Ruchovets <[EMAIL PROTECTED]> wrote:

> Yes.
> This is a very interesting approach.
>
> Is it possible to read from the main table by key and scan the other
> table using map/reduce? I don't want to read from a single client. I use
> HBase version 0.94.2.21.
>
> Thanks
> Oleg.
>
>
> On Mon, Jan 28, 2013 at 6:27 PM, Rodrigo Ribeiro <
> [EMAIL PROTECTED]> wrote:
>
> > In the approach that I mentioned, you would need a table to retrieve the
> > time of a certain event (if this information can be retrieved in another
> > way, you may ignore this table). It would be like the one you posted:
> > event_id | time
> > ================
> > event1 | 10:07
> > event2 | 10:10
> > event3 | 10:12
> > event4 | 10:20
> >
> > And a secondary table would be like:
> > rowkey
> > ==============
> > 10:07:event1
> > 10:10:event2
> > 10:12:event3
> > 10:20:event4
> >
> > That way, for your first example, you first retrieve the time of
> > "event1" from the main table, and then scan the secondary table starting
> > from its position ("10:07:event1") until the end of the window.
> > In this case (T=7) the scan range will be ["10:07:event1", "10:15").
> >
> > As Michel Segel mentioned, this approach alone has a hotspot problem on
> > insertion.
> > Using multiple buckets (the bucket could be a hash of the eventId) would
> > distribute the writes better, but requires scanning all buckets of the
> > second table to get all events in the time window.
> >
> > Assuming you use 3 buckets, it would look like:
> > rowkey
> > ==============
> > 1_10:07:event1
> > 2_10:10:event2
> > 3_10:12:event3
> > 2_10:20:event4
> >
> > The scans would be: ["1_10:07:event1", "1_10:15"), ["2_10:07:event1",
> > "2_10:15"), and ["3_10:07:event1", "3_10:15"). You can then combine the
> > results.
> >
> > Hope it helps.
> >
> > On Mon, Jan 28, 2013 at 12:49 PM, Oleg Ruchovets <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Hi Rodrigo.
> > > Can you please explain your solution in more detail? You said that I
> > > will have another table. How many tables will I have? Will I have 2
> > > tables? What will be the schema of the tables?
> > >
> > > Let me explain what I am trying to achieve:
> > > I have ~50 million records like {time|event}. I want to put the data
> > > in HBase in such a way that, for an event at time X, I can get all
> > > events that occurred after it within T minutes (for example, within
> > > 7 minutes).
> > > So I will be able to scan the whole table and get groups like:
> > >
> > >   {event1:10:02} corresponds to events {event2:10:03}, {event3:10:05},
> > >   {event4:10:06}
> > >   {event2:10:30} corresponds to events {event5:10:32}, {event3:10:33},
> > >   {event3:10:36}.
> > >
> > > Thanks
> > > Oleg.
> > >
> > >
> > > On Mon, Jan 28, 2013 at 5:17 PM, Rodrigo Ribeiro <
> > > [EMAIL PROTECTED]> wrote:
> > >
> > > > > You can use another table as an index, using a rowkey like
> > > > > '{time}:{event_id}', and then scan in the range ["10:07", "10:15").
> > > >
> > > > > On Mon, Jan 28, 2013 at 10:06 AM, Oleg Ruchovets <[EMAIL PROTECTED]>
> > > > > wrote:
> > > >
> > > > > Hi ,
> > > > >
> > > > > I have such row data structure:
> > > > >
> > > > > > event_id | time
> > > > > > ================
> > > > > > event1 | 10:07
> > > > > event2 | 10:10
> > > > > event3 | 10:12
> > > > >
> > > > > event4 | 10:20
> > > > > event5 | 10:23
> > > > > event6 | 10:25
> > > > >
> > > > >
> > > > > > The number of records is 50-100 million.
> > > > >
> > > > >
> > > > > Question:
> > > > >
> > > > > > I need to find the group of events that starts from eventX and
> > > > > > falls within a time window (bucket) of T minutes.
> > > > > >
> > > > > >
> > > > > > For example, if T = 7 minutes:
> > > > > > starting from event1, {event1, event2, event3} were detected
> > > > > > within 7 minutes.
> > > > >
*Rodrigo Pereira Ribeiro*
Software Developer
www.jusbrasil.com.br