Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> Rolling MAU computation


Copy link to this message
-
Re: Rolling MAU computation
If you have a lot of data, you might have to write a custom reducer (in
python) to keep track of the moving date window.

If you don't have that much data, you might want to use a temp table
<start_date, end_date> such that datediff(end_date, start_date) < 30. To
create such a table, you can self-join a table of unique dates using the
above condition. Then, you would join your data with that table on
start_date and group by end_date counting distinct user_ids. Hope I got
that right :)

The latter approach will essentially multiply the number of rows by 30.

igor
decide.com
On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina <[EMAIL PROTECTED]> wrote:

> I'm trying to compute the number of active users in the previous 30 days
> for each day over a date range. I can't think of any way to do it directly
> within Hive so I'm wondering if you guys have any ideas.
>
> Basically the algorithm is something like:
>
> For each day in date range:
>    SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day - logins.day
> < 30;
>
> Thanks for your help!
>
> Tom
>
>
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB