Tom Hubina
20121010, 22:05
Roberto Sanabria
20121010, 22:59
Tom Hubina
20121010, 23:04
MiaoMiao
20121011, 03:50
Igor Tatarinov
20121011, 06:05
Tom Hubina
20121012, 19:59
Tom Hubina
20121012, 20:02
Igor Tatarinov
20121012, 20:08
Vijay
20121012, 20:42


Rolling MAU computationTom Hubina 20121010, 22:05
I'm trying to compute the number of active users in the previous 30 days
for each day over a date range. I can't think of any way to do it directly within Hive so I'm wondering if you guys have any ideas. Basically the algorithm is something like: For each day in date range: SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  logins.day < 30; Thanks for your help! Tom 
Re: Rolling MAU computationRoberto Sanabria 20121010, 22:59
I've done this with a map side join using a table that stores days of the
week. I use that to drive the day im calculating the count for. Let me know if you need an example. Cheers, R On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina <[EMAIL PROTECTED]> wrote: > I'm trying to compute the number of active users in the previous 30 days > for each day over a date range. I can't think of any way to do it directly > within Hive so I'm wondering if you guys have any ideas. > > Basically the algorithm is something like: > > For each day in date range: > SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  logins.day > < 30; > > Thanks for your help! > > Tom > > 
Re: Rolling MAU computationTom Hubina 20121010, 23:04
An example would be awesome .. I've never used a map side join (though I'm
searching on that now .. ) Tom On Wed, Oct 10, 2012 at 3:59 PM, Roberto Sanabria <[EMAIL PROTECTED]>wrote: > I've done this with a map side join using a table that stores days of the > week. I use that to drive the day im calculating the count for. Let me know > if you need an example. > > Cheers, > R > > > On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina <[EMAIL PROTECTED]> wrote: > >> I'm trying to compute the number of active users in the previous 30 days >> for each day over a date range. I can't think of any way to do it directly >> within Hive so I'm wondering if you guys have any ideas. >> >> Basically the algorithm is something like: >> >> For each day in date range: >> SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  logins.day >> < 30; >> >> Thanks for your help! >> >> Tom >> >> > 
Re: Rolling MAU computationMiaoMiao 20121011, 03:50
How about
SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  logins.day < 30 GROUP BY day; On Thu, Oct 11, 2012 at 6:05 AM, Tom Hubina <[EMAIL PROTECTED]> wrote: > I'm trying to compute the number of active users in the previous 30 days for > each day over a date range. I can't think of any way to do it directly > within Hive so I'm wondering if you guys have any ideas. > > Basically the algorithm is something like: > > For each day in date range: > SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  logins.day < > 30; > > Thanks for your help! > > Tom > 
Re: Rolling MAU computationIgor Tatarinov 20121011, 06:05
If you have a lot of data, you might have to write a custom reducer (in
python) to keep track of the moving date window. If you don't have that much data, you might want to use a temp table <start_date, end_date> such that datediff(end_date, start_date) < 30. To create such a table, you can selfjoin a table of unique dates using the above condition. Then, you would join your data with that table on start_date and group by end_date counting distinct user_ids. Hope I got that right :) The latter approach will essentially multiply the number of rows by 30. igor decide.com On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina <[EMAIL PROTECTED]> wrote: > I'm trying to compute the number of active users in the previous 30 days > for each day over a date range. I can't think of any way to do it directly > within Hive so I'm wondering if you guys have any ideas. > > Basically the algorithm is something like: > > For each day in date range: > SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  logins.day > < 30; > > Thanks for your help! > > Tom > > 
Re: Rolling MAU computationTom Hubina 20121012, 19:59
The problem is that "day" is the value in the for loop.
I've tried doing a join with a table that contains the set of days, but the problem is that you can't do a join on a range ... Hive only support equality in the join. For example: INSERT OVERWRITE TABLE mausummary SELECT day, COUNT(DISTINCT(userid)) FROM days JOIN logins ON date_add(logins.t, 30) >= days.day AND logins.t <days.day GROUP BY day; fails because of the range in the join. Tom On Wed, Oct 10, 2012 at 8:50 PM, MiaoMiao <[EMAIL PROTECTED]> wrote: > How about > SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  logins.day > < 30 GROUP BY day; > > On Thu, Oct 11, 2012 at 6:05 AM, Tom Hubina <[EMAIL PROTECTED]> wrote: > > I'm trying to compute the number of active users in the previous 30 days > for > > each day over a date range. I can't think of any way to do it directly > > within Hive so I'm wondering if you guys have any ideas. > > > > Basically the algorithm is something like: > > > > For each day in date range: > > SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  > logins.day < > > 30; > > > > Thanks for your help! > > > > Tom > > > 
Re: Rolling MAU computationTom Hubina 20121012, 20:02
I think I see what you're saying about the temp table with start/end dates
(30x expansion makes sense) and it sounds like it should work. I just need to figure out a good way to generate the table. Thanks! Tom On Wed, Oct 10, 2012 at 11:05 PM, Igor Tatarinov <[EMAIL PROTECTED]> wrote: > If you have a lot of data, you might have to write a custom reducer (in > python) to keep track of the moving date window. > > If you don't have that much data, you might want to use a temp table > <start_date, end_date> such that datediff(end_date, start_date) < 30. To > create such a table, you can selfjoin a table of unique dates using the > above condition. Then, you would join your data with that table on > start_date and group by end_date counting distinct user_ids. Hope I got > that right :) > > The latter approach will essentially multiply the number of rows by 30. > > igor > decide.com > > > On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina <[EMAIL PROTECTED]> wrote: > >> I'm trying to compute the number of active users in the previous 30 days >> for each day over a date range. I can't think of any way to do it directly >> within Hive so I'm wondering if you guys have any ideas. >> >> Basically the algorithm is something like: >> >> For each day in date range: >> SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  logins.day >> < 30; >> >> Thanks for your help! >> >> Tom >> >> > 
Re: Rolling MAU computationIgor Tatarinov 20121012, 20:08
You just need to put the join condition in the WHERE clause. That way Hive
will do a cartesian product followed by a filter. On Fri, Oct 12, 2012 at 1:02 PM, Tom Hubina <[EMAIL PROTECTED]> wrote: > I think I see what you're saying about the temp table with start/end dates > (30x expansion makes sense) and it sounds like it should work. I just need > to figure out a good way to generate the table. Thanks! > > Tom > > On Wed, Oct 10, 2012 at 11:05 PM, Igor Tatarinov <[EMAIL PROTECTED]> wrote: > >> If you have a lot of data, you might have to write a custom reducer (in >> python) to keep track of the moving date window. >> >> If you don't have that much data, you might want to use a temp table >> <start_date, end_date> such that datediff(end_date, start_date) < 30. To >> create such a table, you can selfjoin a table of unique dates using the >> above condition. Then, you would join your data with that table on >> start_date and group by end_date counting distinct user_ids. Hope I got >> that right :) >> >> The latter approach will essentially multiply the number of rows by 30. >> >> igor >> decide.com >> >> >> On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina <[EMAIL PROTECTED]> wrote: >> >>> I'm trying to compute the number of active users in the previous 30 days >>> for each day over a date range. I can't think of any way to do it directly >>> within Hive so I'm wondering if you guys have any ideas. >>> >>> Basically the algorithm is something like: >>> >>> For each day in date range: >>> SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  >>> logins.day < 30; >>> >>> Thanks for your help! >>> >>> Tom >>> >>> >> > 
Re: Rolling MAU computationVijay 20121012, 20:42
For both simplicity and efficiency, I'd recommend making the
mausummary table partitioned on date and generate the MAU data each day. There is no reason to generate MAU data for a given day more than once (unless you find some problems with the source data or something). On Fri, Oct 12, 2012 at 1:08 PM, Igor Tatarinov <[EMAIL PROTECTED]> wrote: > You just need to put the join condition in the WHERE clause. That way Hive > will do a cartesian product followed by a filter. > > On Fri, Oct 12, 2012 at 1:02 PM, Tom Hubina <[EMAIL PROTECTED]> wrote: >> >> I think I see what you're saying about the temp table with start/end dates >> (30x expansion makes sense) and it sounds like it should work. I just need >> to figure out a good way to generate the table. Thanks! >> >> Tom >> >> On Wed, Oct 10, 2012 at 11:05 PM, Igor Tatarinov <[EMAIL PROTECTED]> wrote: >>> >>> If you have a lot of data, you might have to write a custom reducer (in >>> python) to keep track of the moving date window. >>> >>> If you don't have that much data, you might want to use a temp table >>> <start_date, end_date> such that datediff(end_date, start_date) < 30. To >>> create such a table, you can selfjoin a table of unique dates using the >>> above condition. Then, you would join your data with that table on >>> start_date and group by end_date counting distinct user_ids. Hope I got that >>> right :) >>> >>> The latter approach will essentially multiply the number of rows by 30. >>> >>> igor >>> decide.com >>> >>> >>> On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina <[EMAIL PROTECTED]> wrote: >>>> >>>> I'm trying to compute the number of active users in the previous 30 days >>>> for each day over a date range. I can't think of any way to do it directly >>>> within Hive so I'm wondering if you guys have any ideas. >>>> >>>> Basically the algorithm is something like: >>>> >>>> For each day in date range: >>>> SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day  >>>> logins.day < 30; >>>> >>>> Thanks for your help! >>>> >>>> Tom >>>> >>> >> > 