Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
Hive, mail # user - Rolling MAU computation


+
Tom Hubina 2012-10-10, 22:05
+
Roberto Sanabria 2012-10-10, 22:59
+
Tom Hubina 2012-10-10, 23:04
+
MiaoMiao 2012-10-11, 03:50
+
Tom Hubina 2012-10-12, 19:59
Copy link to this message
-
Re: Rolling MAU computation
Igor Tatarinov 2012-10-11, 06:05
If you have a lot of data, you might have to write a custom reducer (in
python) to keep track of the moving date window.

If you don't have that much data, you might want to use a temp table
<start_date, end_date> such that datediff(end_date, start_date) < 30. To
create such a table, you can self-join a table of unique dates using the
above condition. Then, you would join your data with that table on
start_date and group by end_date counting distinct user_ids. Hope I got
that right :)

The latter approach will essentially multiply the number of rows by 30.

igor
decide.com
On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina <[EMAIL PROTECTED]> wrote:

> I'm trying to compute the number of active users in the previous 30 days
> for each day over a date range. I can't think of any way to do it directly
> within Hive so I'm wondering if you guys have any ideas.
>
> Basically the algorithm is something like:
>
> For each day in date range:
>    SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day - logins.day
> < 30;
>
> Thanks for your help!
>
> Tom
>
>
+
Tom Hubina 2012-10-12, 20:02
+
Igor Tatarinov 2012-10-12, 20:08
+
Vijay 2012-10-12, 20:42