-
how to model data based on "time bucket" Oleg Ruchovets 2013-01-28, 13:06
Hi,
My rows have the following structure:

event_id | time
=================
event1 | 10:07
event2 | 10:10
event3 | 10:12
event4 | 10:20
event5 | 10:23
event6 | 10:25

The number of records is 50-100 million.

Question: I need to find the group of events that starts at eventX and falls within a time window (bucket) of length T.

For example, if T = 7 minutes:
Starting from event1, {event1, event2, event3} were detected within 7 minutes.
Starting from event2, {event2, event3} were detected within 7 minutes.
Starting from event4, {event4, event5, event6} were detected within 7 minutes.

Is there a way to model the data in HBase to get this?

Thanks
-
Re: how to model data based on "time bucket" Rodrigo Ribeiro 2013-01-28, 15:17
You can use another table as an index, using a rowkey like '{time}:{event_id}', and then scan in the range ["10:07", "10:15").

--
Rodrigo Pereira Ribeiro
Software Developer
www.jusbrasil.com.br
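Editor's sketch of the index idea above, under stated assumptions: a sorted Python list stands in for the index table (HBase keeps rows sorted by key), and `range_scan` is a hypothetical helper that mimics a scan with an inclusive start key and an exclusive stop key. It is not the HBase client API, only an illustration of why the `'{time}:{event_id}'` layout turns the 7-minute window into one contiguous range.

```python
import bisect

# In-memory stand-in for the index table: keys follow '{time}:{event_id}'
# and are kept in lexicographic order, as rows in an HBase table are.
index_rows = sorted([
    "10:07:event1", "10:10:event2", "10:12:event3",
    "10:20:event4", "10:23:event5", "10:25:event6",
])


def range_scan(rows, start_key, stop_key):
    """Return every key k with start_key <= k < stop_key (stop is exclusive)."""
    lo = bisect.bisect_left(rows, start_key)
    hi = bisect.bisect_left(rows, stop_key)
    return rows[lo:hi]


# Window for event1 (10:07, T = 7 minutes): scan ["10:07", "10:15").
window = range_scan(index_rows, "10:07", "10:15")
events = [key.split(":", 2)[2] for key in window]
print(events)  # ['event1', 'event2', 'event3']
```

Zero-padded "HH:MM" strings sort the same way as the times they represent, which is what makes the range scan valid; a real table would more likely use a full timestamp.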
-
Re: how to model data based on "time bucket" Oleg Ruchovets 2013-01-28, 15:49
Hi Rodrigo,
Can you please explain your solution in more detail? You said that I will have another table. How many tables will I have? Will I have 2 tables? What will the schema of the tables be?

Let me try to explain what I am trying to achieve: I have ~50 million records like {time|event}. I want to put the data in HBase in such a way that, for the event at time X, I get all events that happened after it within T minutes (for example, within 7 minutes). Then I will be able to scan the whole table and get groups like:

{event1:10:02} corresponds to events {event2:10:03}, {event3:10:05}, {event4:10:06}
{event2:10:30} corresponds to events {event5:10:32}, {event3:10:33}, {event3:10:36}

Thanks
Oleg.
-
Re: how to model data based on "time bucket" Michel Segel 2013-01-28, 15:54
Tough one, in that if your events are keyed on time alone, you will hit a hot spot on write. Reads, not so much...

TSDB would be a good start...

You may not need 'buckets', just a timestamp, and set up start and stop key values.

Mike Segel
-
Re: how to model data based on "time bucket" Oleg Ruchovets 2013-01-28, 16:24
Yes,
I agree that using only the timestamp will cause a hotspot. I can pre-split the regions. I watched the TSDB video and presentation and looked at their data model. I think it is not suitable for my case. I searched Google a lot and, to my surprise, there isn't any post about such a classic problem. It is very strange.

I am trying to group the time series, but not the way most solutions do it (every 1 hour, 1 day, 5 minutes); that is simple. I need to group each element relative to itself by time: I have {event1: 10:05} and I want to group it with the elements that came after 10:05 within time X. With X = 7 minutes, all events between 10:05 and 10:12 will be in the group.

It is like a join of each row with all other rows, but the performance would be very bad. Currently I have 50 million events, so it would be (50 million)^2 comparisons. That is why I don't want to use pure map/reduce. I want to use HBase as the output of map/reduce and model the data in the way I described above.

So is there a way to model data in this type of time bucket? Please advise.

Thanks
Oleg.
-
Re: how to model data based on "time bucket" Rodrigo Ribeiro 2013-01-28, 16:27
In the approach that I mentioned, you would need a table to retrieve the time of a certain event (if this information can be retrieved in another way, you may ignore this table). It would be like you posted:

event_id | time
=================
event1 | 10:07
event2 | 10:10
event3 | 10:12
event4 | 10:20

And a secondary table would be like:

rowkey
==========
10:07:event1
10:10:event2
10:12:event3
10:20:event4

That way, for your first example, you first retrieve the time of "event1" from the main table, and then scan starting from its position in the secondary table ("10:07:event1") until the end of the window. In this case (T=7) the scan range will be ["10:07:event1", "10:15").

As Michel Segel mentioned, there is a hotspot problem on insertion when using this approach alone. Using multiple buckets (the bucket could be a hash of the eventId) would distribute the writes better, but requires scanning all buckets of the second table to get all events in the time window.

Assuming you use 3 buckets, it would look like:

rowkey
==========
1_10:07:event1
2_10:10:event2
3_10:12:event3
2_10:20:event4

The scans would be ["1_10:07:event1", "1_10:15"), ["2_10:07:event1", "2_10:15"), and ["3_10:07:event1", "3_10:15"); you can then combine the results.

Hope it helps.
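Editor's sketch of the bucketed index described in this message, with several assumptions: the table is again a sorted in-memory list rather than HBase, `bucket_for` uses a CRC32 of the event id as the salt (the message only says "a hash from the eventId", so the concrete hash is hypothetical), and `window_scan` is an illustrative helper that issues one range scan per bucket and merges the results.

```python
import bisect
import zlib

NUM_BUCKETS = 3  # assumption: same bucket count as the example in the message


def bucket_for(event_id):
    """Hypothetical salt: stable hash of the event id mapped to 1..NUM_BUCKETS."""
    return zlib.crc32(event_id.encode("utf-8")) % NUM_BUCKETS + 1


def salted_key(time, event_id):
    """Build a rowkey of the form '{bucket}_{time}:{event_id}'."""
    return f"{bucket_for(event_id)}_{time}:{event_id}"


def range_scan(rows, start_key, stop_key):
    """Return every key k with start_key <= k < stop_key (stop is exclusive)."""
    lo = bisect.bisect_left(rows, start_key)
    hi = bisect.bisect_left(rows, stop_key)
    return rows[lo:hi]


def window_scan(rows, start_suffix, stop_suffix):
    """Run one scan per bucket, strip the salt, and merge in time order."""
    hits = []
    for bucket in range(1, NUM_BUCKETS + 1):
        hits.extend(range_scan(rows,
                               f"{bucket}_{start_suffix}",
                               f"{bucket}_{stop_suffix}"))
    return sorted(key.split("_", 1)[1] for key in hits)


events = [("event1", "10:07"), ("event2", "10:10"), ("event3", "10:12"),
          ("event4", "10:20"), ("event5", "10:23"), ("event6", "10:25")]
table = sorted(salted_key(time, event_id) for event_id, time in events)

print(window_scan(table, "10:07:event1", "10:15"))
# ['10:07:event1', '10:10:event2', '10:12:event3']
```

The merged result does not depend on which bucket each event lands in; the cost of spreading the writes is NUM_BUCKETS scans per window instead of one.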
-
Re: how to model data based on "time bucket" Oleg Ruchovets 2013-01-28, 17:07
Yes.
This is a very interesting approach. Is it possible to read a key from the main table and scan the other table using map/reduce? I don't want to read from a single client. I use HBase version 0.94.2.21.

Thanks
Oleg.
-
Re: how to model data based on "time bucket" Rodrigo Ribeiro 2013-01-28, 17:24
Yes, it's possible.
Check this solution: http://stackoverflow.com/questions/11353911/extending-hadoops-tableinputformat-to-scan-with-a-prefix-used-for-distribution
-
Re: how to model data based on "time bucket" Oleg Ruchovets 2013-01-28, 17:45
I think I didn't explain it correctly.
I want to read from 2 tables in the context of 1 MapReduce job. I mean, I want to read one key from the main table and scan a range from the other table in the same MapReduce job. I only found MultiTableOutputFormat, and there is no MultiTableInputFormat. Is there any workaround to read from 2 tables in one MapReduce job?

By the way, I can use bulk loading to prevent hotspots, and it also gives fast scans.

Thanks
Oleg.
-
Re: how to model data based on "time bucket" Oleg Ruchovets 2013-01-30, 09:57
Hi Rodrigo.
About the solution with 2 tables, one main and one as an index: I have ~50 million records, and in my case I need to process the whole table, so I will end up with 50 million scans, and that will kill performance.

Is there any other approach to model my use case using HBase?

Thanks
Oleg.
-
Re: how to model data based on "time bucket" Rodrigo Ribeiro 2013-01-30, 18:34
There is another option.
You could do a MapReduce job that, for each row of the main table, emits every time whose window would contain that row. For example, "event1" would emit {"10:06": event1}, {"10:05": event1} ... {"10:00": event1} (also "10:07" if you want to include events that happen in the same minute). In the Reduce step you aggregate and save in another table all events that are in the window of a given time.

For:

event_id | time
=================
event1 | 10:07
event2 | 10:10
event3 | 10:12

The result table would look like:

time | events
10:00 | event1
10:01 | event1
10:02 | event1
10:03 | event1,event2
10:04 | event1,event2
10:05 | event1,event2,event3
10:06 | event1,event2,event3
10:07 | event2,event3
10:08 | event2,event3
...

That way, knowing the time at which an event happens, you can get the list of events after it. For event1, we only look in this table for the key "10:07".

Sorry for any typos, writing in a bit of a hurry.
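Editor's sketch of the precomputation described in this message, simulated in plain Python rather than as a real MapReduce job. `map_event` and `reduce_all` are hypothetical names for the map and reduce steps, times are assumed to be zero-padded "HH:MM" strings within a single day, and T = 7 as in the example; wrap-around at midnight is not handled.

```python
from collections import defaultdict

T = 7  # window length in minutes, as in the example


def to_minutes(hhmm):
    """Convert 'HH:MM' to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def to_hhmm(minutes):
    """Convert minutes since midnight back to zero-padded 'HH:MM'."""
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


def map_event(event_id, time, include_own_minute=False):
    """Map step: emit (window_start, event_id) for every start time whose
    T-minute window would contain this event."""
    first = 0 if include_own_minute else 1
    event_minute = to_minutes(time)
    for back in range(first, T + 1):
        yield to_hhmm(event_minute - back), event_id


def reduce_all(pairs):
    """Reduce step: collect all event ids emitted under the same time key."""
    grouped = defaultdict(list)
    for key, event_id in pairs:
        grouped[key].append(event_id)
    return {key: sorted(ids) for key, ids in sorted(grouped.items())}


main_table = [("event1", "10:07"), ("event2", "10:10"), ("event3", "10:12")]
pairs = [pair for event_id, time in main_table
         for pair in map_event(event_id, time)]
result = reduce_all(pairs)

print(result["10:07"])  # ['event2', 'event3']
print(result["10:05"])  # ['event1', 'event2', 'event3']
```

Each event is written T times (T + 1 with `include_own_minute=True`), so the result table holds roughly T times as many cell values as the main table has rows; in exchange, the events following a given time come back from a single key lookup.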
-
Re: how to model data based on "time bucket"Oleg Ruchovets 2013-01-31, 13:52
Hi Rodrigo,

As usual, you have a very interesting idea! :-) I am not sure I understand exactly what you mean, so I tried to simulate it.

Suppose we have these events in the MAIN table:

event1 | 10:07
event2 | 10:10
event3 | 10:12

Time window T=5 minutes.

================= on map =================

What should I emit for event1 and event2?

For event1 the map phase emits (10:07, event1), (10:06, event1), (10:05, event1), (10:04, event1), (10:03, event1).
For event2 the map phase emits (10:10, event2), (10:09, event2), (10:08, event2), (10:07, event2), (10:06, event2).
For event3 the map phase emits (10:12, event3), (10:11, event3), (10:10, event3), (10:09, event3), (10:08, event3).

I count T=5 steps back from the event time. Is that correct?

================= on reduce =================

10:03|event1
10:04|event1
10:05|event1
10:06|event1,event2
10:07|event1,event2
10:08|event2,event3
10:09|event2,event3
10:10|event2,event3
10:11|event3
10:12|event3

This output will be written to the second table. Is that correct?

=============================================

How should I work with this table? Do I have to scan the main table row by row, and for every row get the event time and query the second table based on that time?

If so, do I still need to execute 50 million requests?

Maybe I need to work only with the second table. How do I know what to query (scan)?

I am sure I simply don't understand your approach well. Please explain.

Thanks
Oleg.
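The simulation in this message can be reproduced with a small in-memory sketch. It is not an HBase or Hadoop job: the function names `map_event` and `build_index` are hypothetical, times are assumed to be zero-padded "HH:MM" strings within a single day, and a plain dictionary stands in for the second table.

```python
from collections import defaultdict
from datetime import datetime, timedelta

FMT = "%H:%M"  # assumption: zero-padded minutes within one day, no date part


def map_event(event_id, time_str, window_minutes):
    """Emit (minute_key, event_id) for the event's minute and the T-1 minutes before it."""
    t = datetime.strptime(time_str, FMT)
    for back in range(window_minutes):
        yield (t - timedelta(minutes=back)).strftime(FMT), event_id


def build_index(events, window_minutes):
    """Group the emitted pairs by minute key, as the reduce step would."""
    index = defaultdict(list)
    for event_id, time_str in events:
        for key, value in map_event(event_id, time_str, window_minutes):
            index[key].append(value)
    # Sort by key so the output reads like the table in the message.
    return dict(sorted(index.items()))


events = [("event1", "10:07"), ("event2", "10:10"), ("event3", "10:12")]
index = build_index(events, window_minutes=5)
for minute, ids in index.items():
    print(f"{minute}|{','.join(ids)}")
```

For the three sample events and T=5 this prints the same ten rows, from 10:03 to 10:12, as the reduce output listed above. Real row keys would also need the date, which this sketch leaves out.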
-
Re: how to model data based on "time bucket"Rodrigo Ribeiro 2013-01-31, 14:34
Hi,

The map and reduce steps you describe are the same as what I had in mind.

> How should I work with this table? Should I have to scan Main table: row by
> row and for every row get event time and based on that time query second
> table?
>
> In case I will do so, I still need to execute 50 million requests?
>
> Maybe I need to work only with second table. How do I know what to query
> (scan)?

Yes, using that approach you need to query both tables for each event ID you need to look up.

I just thought of something else that I think will fit your use case better.
When you emit, you could distinguish the event that begins at a given time from the events that come after that time.
For the example using T=5, the emits would be:

For event1 the map phase emits (10:07, [*begin*, event1]), (10:06, [*after*, event1]), (10:05, [*after*, event1]), (10:04, [*after*, event1]), (10:03, [*after*, event1]).
For event2 the map phase emits (10:10, [*begin*, event2]), (10:09, [*after*, event2]), (10:08, [*after*, event2]), (10:07, [*after*, event2]), (10:06, [*after*, event2]).
For event3 the map phase emits (10:12, [*begin*, event3]), (10:11, [*after*, event3]), (10:10, [*after*, event3]), (10:09, [*after*, event3]), (10:08, [*after*, event3]).

So the reduce step knows exactly which event began at a given time and which events fall in the window after it.

The reduce step for key "10:07" would receive { [*begin*, event1], [*after*, event2], [*after*, event3] }.
So you know that event1 began at this time and that events 2 and 3 are in its window, and you save that to a second table.

The reduce step for key "10:06" would receive { [*after*, event1], [*after*, event2] }.
No event began at this time, so there is nothing to save.

After all this, you get a second table that I believe contains exactly what you want:

eventid | events_window_time
=======================
event1 | event2, event3
event2 | event3
event3 |

Let me know if I'm not being clear.
-
Re: how to model data based on "time bucket"Oleg Ruchovets 2013-01-31, 15:39
Hi Rodrigo,

That is just a GREAT idea :-) !!!

But how did you get the final result:

=======================
event1 | event2, event3
event2 | event3
event3 |

I tried to simulate it and didn't get event1 | event2, event3:

(10:03, [*after*, event1])
(10:04, [*after*, event1])
(10:05, [*after*, event1])
(10:06, [*after*, event1]), (10:06, [*after*, event2])
(10:07, [*begin*, event1]), (10:07, [*after*, event2])
(10:08, [*after*, event2]), (10:08, [*after*, event3])
(10:09, [*after*, event2]), (10:09, [*after*, event3])
(10:10, [*begin*, event2]), (10:10, [*after*, event3])
(10:11, [*after*, event3])
(10:12, [*begin*, event3])

Thanks
Oleg.
-
Re: how to model data based on "time bucket"Rodrigo Ribeiro 2013-01-31, 15:51
Yes, you are correct, event3 never emits for the time "10:07".
The proper result table is, as you mention:

=======================
event1 | event2
event2 | event3
event3 |

I guess I was thinking about the old example (T=7). :)
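The begin/after scheme discussed in this thread can be checked with a small in-memory sketch. This is only an illustration under stated assumptions, not a Hadoop job: `map_event`, `reduce_minute` and `windows` are hypothetical names, times are zero-padded "HH:MM" strings within one day, and a dictionary stands in for the second table. Events that share the same minute are not paired with each other here, since the thread does not cover that case.

```python
from collections import defaultdict
from datetime import datetime, timedelta

FMT = "%H:%M"  # assumption: zero-padded minutes within one day, no date part


def map_event(event_id, time_str, window_minutes):
    """Emit a 'begin' pair at the event's own minute and 'after' pairs for the T-1 minutes before it."""
    t = datetime.strptime(time_str, FMT)
    yield t.strftime(FMT), ("begin", event_id)
    for back in range(1, window_minutes):
        yield (t - timedelta(minutes=back)).strftime(FMT), ("after", event_id)


def reduce_minute(values):
    """For one minute key, pair each event that began there with the events tagged 'after'."""
    begins = [e for tag, e in values if tag == "begin"]
    afters = [e for tag, e in values if tag == "after"]
    for event_id in begins:
        yield event_id, list(afters)  # keys with no 'begin' yield nothing


def windows(events, window_minutes):
    """Run the map step, group by minute key, then run the reduce step per key."""
    grouped = defaultdict(list)
    for event_id, time_str in events:
        for key, value in map_event(event_id, time_str, window_minutes):
            grouped[key].append(value)
    result = {}
    for key in sorted(grouped):
        for event_id, following in reduce_minute(grouped[key]):
            result[event_id] = following
    return result


events = [("event1", "10:07"), ("event2", "10:10"), ("event3", "10:12")]
print(windows(events, 5))
print(windows(events, 7))
```

With T=5 the sketch gives event1 | event2, matching the corrected table, and with T=7 it gives event1 | event2, event3, which is the table from the earlier example.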