


Rolling MAU computation
I'm trying to compute the number of active users in the previous 30 days for each day over a date range. I can't think of any way to do it directly within Hive so I'm wondering if you guys have any ideas.
Basically the algorithm is something like:
For each day in date range: SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  logins.day < 30;
Thanks for your help!
Tom

Re: Rolling MAU computation
I've done this with a map side join using a table that stores days of the week. I use that to drive the day im calculating the count for. Let me know if you need an example.
Cheers, R
On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina <[EMAIL PROTECTED]> wrote:
> I'm trying to compute the number of active users in the previous 30 days > for each day over a date range. I can't think of any way to do it directly > within Hive so I'm wondering if you guys have any ideas. > > Basically the algorithm is something like: > > For each day in date range: > SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  logins.day > < 30; > > Thanks for your help! > > Tom > >

Re: Rolling MAU computation
An example would be awesome .. I've never used a map side join (though I'm searching on that now .. )
Tom
On Wed, Oct 10, 2012 at 3:59 PM, Roberto Sanabria <[EMAIL PROTECTED]>wrote:
> I've done this with a map side join using a table that stores days of the > week. I use that to drive the day im calculating the count for. Let me know > if you need an example. > > Cheers, > R > > > On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina <[EMAIL PROTECTED]> wrote: > >> I'm trying to compute the number of active users in the previous 30 days >> for each day over a date range. I can't think of any way to do it directly >> within Hive so I'm wondering if you guys have any ideas. >> >> Basically the algorithm is something like: >> >> For each day in date range: >> SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  logins.day >> < 30; >> >> Thanks for your help! >> >> Tom >> >> >

Re: Rolling MAU computation
How about SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  logins.day < 30 GROUP BY day;
On Thu, Oct 11, 2012 at 6:05 AM, Tom Hubina <[EMAIL PROTECTED]> wrote: > I'm trying to compute the number of active users in the previous 30 days for > each day over a date range. I can't think of any way to do it directly > within Hive so I'm wondering if you guys have any ideas. > > Basically the algorithm is something like: > > For each day in date range: > SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  logins.day < > 30; > > Thanks for your help! > > Tom >

Re: Rolling MAU computation
If you have a lot of data, you might have to write a custom reducer (in python) to keep track of the moving date window.
If you don't have that much data, you might want to use a temp table <start_date, end_date> such that datediff(end_date, start_date) < 30. To create such a table, you can selfjoin a table of unique dates using the above condition. Then, you would join your data with that table on start_date and group by end_date counting distinct user_ids. Hope I got that right :)
The latter approach will essentially multiply the number of rows by 30.
igor decide.com On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina <[EMAIL PROTECTED]> wrote:
> I'm trying to compute the number of active users in the previous 30 days > for each day over a date range. I can't think of any way to do it directly > within Hive so I'm wondering if you guys have any ideas. > > Basically the algorithm is something like: > > For each day in date range: > SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  logins.day > < 30; > > Thanks for your help! > > Tom > >

Re: Rolling MAU computation
The problem is that "day" is the value in the for loop.
I've tried doing a join with a table that contains the set of days, but the problem is that you can't do a join on a range ... Hive only support equality in the join. For example:
INSERT OVERWRITE TABLE mausummary SELECT day, COUNT(DISTINCT(userid)) FROM days JOIN logins ON date_add(logins.t, 30) >= days.day AND logins.t <days.day GROUP BY day;
fails because of the range in the join.
Tom On Wed, Oct 10, 2012 at 8:50 PM, MiaoMiao <[EMAIL PROTECTED]> wrote:
> How about > SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  logins.day > < 30 GROUP BY day; > > On Thu, Oct 11, 2012 at 6:05 AM, Tom Hubina <[EMAIL PROTECTED]> wrote: > > I'm trying to compute the number of active users in the previous 30 days > for > > each day over a date range. I can't think of any way to do it directly > > within Hive so I'm wondering if you guys have any ideas. > > > > Basically the algorithm is something like: > > > > For each day in date range: > > SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  > logins.day < > > 30; > > > > Thanks for your help! > > > > Tom > > >

Re: Rolling MAU computation
I think I see what you're saying about the temp table with start/end dates (30x expansion makes sense) and it sounds like it should work. I just need to figure out a good way to generate the table. Thanks!
Tom
On Wed, Oct 10, 2012 at 11:05 PM, Igor Tatarinov <[EMAIL PROTECTED]> wrote:
> If you have a lot of data, you might have to write a custom reducer (in > python) to keep track of the moving date window. > > If you don't have that much data, you might want to use a temp table > <start_date, end_date> such that datediff(end_date, start_date) < 30. To > create such a table, you can selfjoin a table of unique dates using the > above condition. Then, you would join your data with that table on > start_date and group by end_date counting distinct user_ids. Hope I got > that right :) > > The latter approach will essentially multiply the number of rows by 30. > > igor > decide.com > > > On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina <[EMAIL PROTECTED]> wrote: > >> I'm trying to compute the number of active users in the previous 30 days >> for each day over a date range. I can't think of any way to do it directly >> within Hive so I'm wondering if you guys have any ideas. >> >> Basically the algorithm is something like: >> >> For each day in date range: >> SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  logins.day >> < 30; >> >> Thanks for your help! >> >> Tom >> >> >

Re: Rolling MAU computation
You just need to put the join condition in the WHERE clause. That way Hive will do a cartesian product followed by a filter.
On Fri, Oct 12, 2012 at 1:02 PM, Tom Hubina <[EMAIL PROTECTED]> wrote:
> I think I see what you're saying about the temp table with start/end dates > (30x expansion makes sense) and it sounds like it should work. I just need > to figure out a good way to generate the table. Thanks! > > Tom > > On Wed, Oct 10, 2012 at 11:05 PM, Igor Tatarinov <[EMAIL PROTECTED]> wrote: > >> If you have a lot of data, you might have to write a custom reducer (in >> python) to keep track of the moving date window. >> >> If you don't have that much data, you might want to use a temp table >> <start_date, end_date> such that datediff(end_date, start_date) < 30. To >> create such a table, you can selfjoin a table of unique dates using the >> above condition. Then, you would join your data with that table on >> start_date and group by end_date counting distinct user_ids. Hope I got >> that right :) >> >> The latter approach will essentially multiply the number of rows by 30. >> >> igor >> decide.com >> >> >> On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina <[EMAIL PROTECTED]> wrote: >> >>> I'm trying to compute the number of active users in the previous 30 days >>> for each day over a date range. I can't think of any way to do it directly >>> within Hive so I'm wondering if you guys have any ideas. >>> >>> Basically the algorithm is something like: >>> >>> For each day in date range: >>> SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  >>> logins.day < 30; >>> >>> Thanks for your help! >>> >>> Tom >>> >>> >> >

Re: Rolling MAU computation
For both simplicity and efficiency, I'd recommend making the mausummary table partitioned on date and generate the MAU data each day. There is no reason to generate MAU data for a given day more than once (unless you find some problems with the source data or something).
On Fri, Oct 12, 2012 at 1:08 PM, Igor Tatarinov <[EMAIL PROTECTED]> wrote: > You just need to put the join condition in the WHERE clause. That way Hive > will do a cartesian product followed by a filter. > > On Fri, Oct 12, 2012 at 1:02 PM, Tom Hubina <[EMAIL PROTECTED]> wrote: >> >> I think I see what you're saying about the temp table with start/end dates >> (30x expansion makes sense) and it sounds like it should work. I just need >> to figure out a good way to generate the table. Thanks! >> >> Tom >> >> On Wed, Oct 10, 2012 at 11:05 PM, Igor Tatarinov <[EMAIL PROTECTED]> wrote: >>> >>> If you have a lot of data, you might have to write a custom reducer (in >>> python) to keep track of the moving date window. >>> >>> If you don't have that much data, you might want to use a temp table >>> <start_date, end_date> such that datediff(end_date, start_date) < 30. To >>> create such a table, you can selfjoin a table of unique dates using the >>> above condition. Then, you would join your data with that table on >>> start_date and group by end_date counting distinct user_ids. Hope I got that >>> right :) >>> >>> The latter approach will essentially multiply the number of rows by 30. >>> >>> igor >>> decide.com >>> >>> >>> On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina <[EMAIL PROTECTED]> wrote: >>>> >>>> I'm trying to compute the number of active users in the previous 30 days >>>> for each day over a date range. I can't think of any way to do it directly >>>> within Hive so I'm wondering if you guys have any ideas. >>>> >>>> Basically the algorithm is something like: >>>> >>>> For each day in date range: >>>> SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  >>>> logins.day < 30; >>>> >>>> Thanks for your help! >>>> >>>> Tom >>>> >>> >> >

