Re: how to model data based on "time bucket"
Hi,
The Map and Reduce steps that you mention are the same as what I thought.

> How should I work with this table? Should I scan the main table row by
> row, and for every row get the event time and, based on that time, query
> the second table?
>
>     In that case, do I still need to execute 50 million requests?
>
> Maybe I need to work only with the second table. How do I know what to
> query (scan)?
Yes, using that approach you need to query both tables for each eventId you
need to look up.

I just thought about something else, and I think it will be better for your
use case.
When you emit, you could distinguish the events that begin at a given time
from those that fall in the window after it.
For the example using T=5, the emits would be:

For event1, the map phase emits (10:07, [*begin*, event1]), (10:06,
[*after*, event1]), (10:05, [*after*, event1]), (10:04, [*after*, event1]),
(10:03, [*after*, event1]), (10:02, [*after*, event1]).
For event2, the map phase emits (10:10, [*begin*, event2]), (10:09,
[*after*, event2]), (10:08, [*after*, event2]), (10:07, [*after*, event2]),
(10:06, [*after*, event2]), (10:05, [*after*, event2]).
For event3, the map phase emits (10:12, [*begin*, event3]), (10:11,
[*after*, event3]), (10:10, [*after*, event3]), (10:09, [*after*, event3]),
(10:08, [*after*, event3]), (10:07, [*after*, event3]).
So the reduce step knows exactly which events began at a given time and
which events are in the window of time after it.

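To make this concrete, here is a rough mapper sketch (the table layout, the
"t:time" column, and "HH:mm" minute keys are assumptions I'm making for
illustration, not something we settled on):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

// Sketch only: assumes row key = eventId and the event time stored as
// "HH:mm" in column family "t", qualifier "time" (hypothetical layout).
public class WindowMapper extends TableMapper<Text, Text> {

  private static final int T = 5; // window size in minutes

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    String eventId = Bytes.toString(row.get());
    String time = Bytes.toString(
        value.getValue(Bytes.toBytes("t"), Bytes.toBytes("time")));
    int minute = toMinutes(time);
    // "begin" marker at the event's own minute...
    context.write(new Text(fromMinutes(minute)), new Text("begin," + eventId));
    // ...and "after" markers for each of the T minutes before it.
    for (int i = 1; i <= T; i++) {
      context.write(new Text(fromMinutes(minute - i)),
          new Text("after," + eventId));
    }
  }

  private static int toMinutes(String hhmm) {
    String[] parts = hhmm.split(":");
    return Integer.parseInt(parts[0]) * 60 + Integer.parseInt(parts[1]);
  }

  private static String fromMinutes(int m) {
    return String.format("%02d:%02d", m / 60, m % 60); // ignores day wrap
  }
}
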
The reduce step for key "10:07" would receive { [*begin*, event1], [*after*,
event2], [*after*, event3] },
so you know that event1 began at this time and events 2 and 3 are in its
window of time, and you save that to a second table.

The reduce step for key "10:06" would receive { [*after*, event1], [*after*,
event2] },
so no event began at this time and nothing needs to be saved.

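A matching reducer sketch (again illustrative; the "w:events" column and
the output table are assumptions):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

// Sketch only: writes one row per "begin" event, listing the "after"
// events of its window in a hypothetical "w:events" column.
public class WindowReducer
    extends TableReducer<Text, Text, ImmutableBytesWritable> {

  @Override
  protected void reduce(Text time, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    List<String> begins = new ArrayList<String>();
    StringBuilder afters = new StringBuilder();
    for (Text v : values) {
      String[] parts = v.toString().split(",", 2);
      if ("begin".equals(parts[0])) {
        begins.add(parts[1]);
      } else {
        if (afters.length() > 0) afters.append(",");
        afters.append(parts[1]);
      }
    }
    if (begins.isEmpty()) {
      return; // no event began at this minute, nothing to save
    }
    for (String eventId : begins) {
      Put put = new Put(Bytes.toBytes(eventId));
      put.add(Bytes.toBytes("w"), Bytes.toBytes("events"),
          Bytes.toBytes(afters.toString()));
      context.write(new ImmutableBytesWritable(Bytes.toBytes(eventId)), put);
    }
  }
}

You would wire the two together with TableMapReduceUtil.initTableMapperJob
over the main table and TableMapReduceUtil.initTableReducerJob targeting
the second table.
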
After all this, you get a second table that I believe contains exactly
what you want:

eventid | events_window_time
========|===================
event1  | event2, event3
event2  | event3
event3  |

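With that table in place, answering "which events are in event1's window?"
becomes a single Get per event instead of a scan plus per-row queries.
Something like this (the table name "event_windows" is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// One point lookup in the derived table; no scan of the main table.
public class WindowLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "event_windows"); // hypothetical name
    try {
      Result result = table.get(new Get(Bytes.toBytes("event1")));
      byte[] events = result.getValue(Bytes.toBytes("w"), Bytes.toBytes("events"));
      System.out.println("event1 window: "
          + (events == null ? "" : Bytes.toString(events)));
    } finally {
      table.close();
    }
  }
}
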
Let me know if I'm not being clear.

On Thu, Jan 31, 2013 at 10:52 AM, Oleg Ruchovets <[EMAIL PROTECTED]> wrote:

> Hi Rodrigo,
>   As usual, your idea is very interesting! :-)
>
> I am not sure that I understand exactly what you mean, so I will try to
> simulate it:
>      Suppose we have these events in the MAIN table:
>             event1 | 10:07
>             event2 | 10:10
>             event3 | 10:12
>      Time window T=5 minutes.
>
> =================on  map================ :
>
> What should I emit for event1, event2, and event3?
>
> For event1, the map phase will emit (10:07, event1), (10:06, event1),
> (10:05, event1), (10:04, event1), (10:03, event1).
> For event2, the map phase will emit (10:10, event2), (10:09, event2),
> (10:08, event2), (10:07, event2), (10:06, event2).
> For event3, the map phase will emit (10:12, event3), (10:11, event3),
> (10:10, event3), (10:09, event3), (10:08, event3).
>
> Starting from the event time, I calculate T=5 steps back. Is that correct?
>
> ==================on reduce =========:
>
> 10:03|event1
> 10:04|event1
> 10:05|event1
> 10:06|event1,event2
> 10:07|event1,event2
> 10:08|event2,event3
> 10:09|event2,event3
> 10:10|event2,event3
> 10:11|event3
> 10:12|event3
>
> This output will be written to the second table. Is that correct?
>
> ============================================>
> How should I work with this table? Should I scan the main table row by
> row, and for every row get the event time and, based on that time, query
> the second table?
>
>     In that case, do I still need to execute 50 million requests?
>
> Maybe I need to work only with the second table. How do I know what to
> query (scan)?
>
> I am sure I simply don't understand your approach to the solution well.
>
> Please explain.
>
> Thanks
> Oleg.
>
> On Wed, Jan 30, 2013 at 8:34 PM, Rodrigo Ribeiro <[EMAIL PROTECTED]> wrote:
>
> > There is another option,
> > You could do a MapReduce job that, for each row from the main table,
> > emits all times that it would be in the window of time,
> > For example, "event1" would emit {"10:06": event1}, {"10:05": event1} ...
*Rodrigo Pereira Ribeiro*
Software Developer
www.jusbrasil.com.br