


Rolling MAU computation
Tom Hubina 20121010, 22:05
I'm trying to compute the number of active users in the previous 30 days for each day over a date range. I can't think of any way to do it directly within Hive so I'm wondering if you guys have any ideas.
Basically the algorithm is something like:
For each day in date range: SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  logins.day < 30;
Thanks for your help!
Tom
+
Tom Hubina 20121010, 22:05

Re: Rolling MAU computation
Roberto Sanabria 20121010, 22:59
I've done this with a map side join using a table that stores days of the week. I use that to drive the day im calculating the count for. Let me know if you need an example.
Cheers, R
On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina <[EMAIL PROTECTED]> wrote:
> I'm trying to compute the number of active users in the previous 30 days > for each day over a date range. I can't think of any way to do it directly > within Hive so I'm wondering if you guys have any ideas. > > Basically the algorithm is something like: > > For each day in date range: > SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  logins.day > < 30; > > Thanks for your help! > > Tom > >
+
Roberto Sanabria 20121010, 22:59

Re: Rolling MAU computation
Tom Hubina 20121010, 23:04
An example would be awesome .. I've never used a map side join (though I'm searching on that now .. )
Tom
On Wed, Oct 10, 2012 at 3:59 PM, Roberto Sanabria <[EMAIL PROTECTED]>wrote:
> I've done this with a map side join using a table that stores days of the > week. I use that to drive the day im calculating the count for. Let me know > if you need an example. > > Cheers, > R > > > On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina <[EMAIL PROTECTED]> wrote: > >> I'm trying to compute the number of active users in the previous 30 days >> for each day over a date range. I can't think of any way to do it directly >> within Hive so I'm wondering if you guys have any ideas. >> >> Basically the algorithm is something like: >> >> For each day in date range: >> SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  logins.day >> < 30; >> >> Thanks for your help! >> >> Tom >> >> >
+
Tom Hubina 20121010, 23:04

Re: Rolling MAU computation
MiaoMiao 20121011, 03:50
How about SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  logins.day < 30 GROUP BY day;
On Thu, Oct 11, 2012 at 6:05 AM, Tom Hubina <[EMAIL PROTECTED]> wrote: > I'm trying to compute the number of active users in the previous 30 days for > each day over a date range. I can't think of any way to do it directly > within Hive so I'm wondering if you guys have any ideas. > > Basically the algorithm is something like: > > For each day in date range: > SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  logins.day < > 30; > > Thanks for your help! > > Tom >
+
MiaoMiao 20121011, 03:50

Re: Rolling MAU computation
Tom Hubina 20121012, 19:59
The problem is that "day" is the value in the for loop.
I've tried doing a join with a table that contains the set of days, but the problem is that you can't do a join on a range ... Hive only support equality in the join. For example:
INSERT OVERWRITE TABLE mausummary SELECT day, COUNT(DISTINCT(userid)) FROM days JOIN logins ON date_add(logins.t, 30) >= days.day AND logins.t <days.day GROUP BY day;
fails because of the range in the join.
Tom On Wed, Oct 10, 2012 at 8:50 PM, MiaoMiao <[EMAIL PROTECTED]> wrote:
> How about > SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  logins.day > < 30 GROUP BY day; > > On Thu, Oct 11, 2012 at 6:05 AM, Tom Hubina <[EMAIL PROTECTED]> wrote: > > I'm trying to compute the number of active users in the previous 30 days > for > > each day over a date range. I can't think of any way to do it directly > > within Hive so I'm wondering if you guys have any ideas. > > > > Basically the algorithm is something like: > > > > For each day in date range: > > SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  > logins.day < > > 30; > > > > Thanks for your help! > > > > Tom > > >
+
Tom Hubina 20121012, 19:59

Re: Rolling MAU computation
Igor Tatarinov 20121011, 06:05
If you have a lot of data, you might have to write a custom reducer (in python) to keep track of the moving date window.
If you don't have that much data, you might want to use a temp table <start_date, end_date> such that datediff(end_date, start_date) < 30. To create such a table, you can selfjoin a table of unique dates using the above condition. Then, you would join your data with that table on start_date and group by end_date counting distinct user_ids. Hope I got that right :)
The latter approach will essentially multiply the number of rows by 30.
igor decide.com On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina <[EMAIL PROTECTED]> wrote:
> I'm trying to compute the number of active users in the previous 30 days > for each day over a date range. I can't think of any way to do it directly > within Hive so I'm wondering if you guys have any ideas. > > Basically the algorithm is something like: > > For each day in date range: > SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  logins.day > < 30; > > Thanks for your help! > > Tom > >
+
Igor Tatarinov 20121011, 06:05

Re: Rolling MAU computation
Tom Hubina 20121012, 20:02
I think I see what you're saying about the temp table with start/end dates (30x expansion makes sense) and it sounds like it should work. I just need to figure out a good way to generate the table. Thanks!
Tom
On Wed, Oct 10, 2012 at 11:05 PM, Igor Tatarinov <[EMAIL PROTECTED]> wrote:
> If you have a lot of data, you might have to write a custom reducer (in > python) to keep track of the moving date window. > > If you don't have that much data, you might want to use a temp table > <start_date, end_date> such that datediff(end_date, start_date) < 30. To > create such a table, you can selfjoin a table of unique dates using the > above condition. Then, you would join your data with that table on > start_date and group by end_date counting distinct user_ids. Hope I got > that right :) > > The latter approach will essentially multiply the number of rows by 30. > > igor > decide.com > > > On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina <[EMAIL PROTECTED]> wrote: > >> I'm trying to compute the number of active users in the previous 30 days >> for each day over a date range. I can't think of any way to do it directly >> within Hive so I'm wondering if you guys have any ideas. >> >> Basically the algorithm is something like: >> >> For each day in date range: >> SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  logins.day >> < 30; >> >> Thanks for your help! >> >> Tom >> >> >
+
Tom Hubina 20121012, 20:02

Re: Rolling MAU computation
Igor Tatarinov 20121012, 20:08
You just need to put the join condition in the WHERE clause. That way Hive will do a cartesian product followed by a filter.
On Fri, Oct 12, 2012 at 1:02 PM, Tom Hubina <[EMAIL PROTECTED]> wrote:
> I think I see what you're saying about the temp table with start/end dates > (30x expansion makes sense) and it sounds like it should work. I just need > to figure out a good way to generate the table. Thanks! > > Tom > > On Wed, Oct 10, 2012 at 11:05 PM, Igor Tatarinov <[EMAIL PROTECTED]> wrote: > >> If you have a lot of data, you might have to write a custom reducer (in >> python) to keep track of the moving date window. >> >> If you don't have that much data, you might want to use a temp table >> <start_date, end_date> such that datediff(end_date, start_date) < 30. To >> create such a table, you can selfjoin a table of unique dates using the >> above condition. Then, you would join your data with that table on >> start_date and group by end_date counting distinct user_ids. Hope I got >> that right :) >> >> The latter approach will essentially multiply the number of rows by 30. >> >> igor >> decide.com >> >> >> On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina <[EMAIL PROTECTED]> wrote: >> >>> I'm trying to compute the number of active users in the previous 30 days >>> for each day over a date range. I can't think of any way to do it directly >>> within Hive so I'm wondering if you guys have any ideas. >>> >>> Basically the algorithm is something like: >>> >>> For each day in date range: >>> SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  >>> logins.day < 30; >>> >>> Thanks for your help! >>> >>> Tom >>> >>> >> >
+
Igor Tatarinov 20121012, 20:08

Re: Rolling MAU computation
Vijay 20121012, 20:42
For both simplicity and efficiency, I'd recommend making the mausummary table partitioned on date and generate the MAU data each day. There is no reason to generate MAU data for a given day more than once (unless you find some problems with the source data or something).
On Fri, Oct 12, 2012 at 1:08 PM, Igor Tatarinov <[EMAIL PROTECTED]> wrote: > You just need to put the join condition in the WHERE clause. That way Hive > will do a cartesian product followed by a filter. > > On Fri, Oct 12, 2012 at 1:02 PM, Tom Hubina <[EMAIL PROTECTED]> wrote: >> >> I think I see what you're saying about the temp table with start/end dates >> (30x expansion makes sense) and it sounds like it should work. I just need >> to figure out a good way to generate the table. Thanks! >> >> Tom >> >> On Wed, Oct 10, 2012 at 11:05 PM, Igor Tatarinov <[EMAIL PROTECTED]> wrote: >>> >>> If you have a lot of data, you might have to write a custom reducer (in >>> python) to keep track of the moving date window. >>> >>> If you don't have that much data, you might want to use a temp table >>> <start_date, end_date> such that datediff(end_date, start_date) < 30. To >>> create such a table, you can selfjoin a table of unique dates using the >>> above condition. Then, you would join your data with that table on >>> start_date and group by end_date counting distinct user_ids. Hope I got that >>> right :) >>> >>> The latter approach will essentially multiply the number of rows by 30. >>> >>> igor >>> decide.com >>> >>> >>> On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina <[EMAIL PROTECTED]> wrote: >>>> >>>> I'm trying to compute the number of active users in the previous 30 days >>>> for each day over a date range. I can't think of any way to do it directly >>>> within Hive so I'm wondering if you guys have any ideas. >>>> >>>> Basically the algorithm is something like: >>>> >>>> For each day in date range: >>>> SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  >>>> logins.day < 30; >>>> >>>> Thanks for your help! >>>> >>>> Tom >>>> >>> >> >
+
Vijay 20121012, 20:42

