Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive, mail # user - Rolling MAU computation


Copy link to this message
-
Re: Rolling MAU computation
Vijay 2012-10-12, 20:42
For both simplicity and efficiency, I'd recommend making the
mausummary table partitioned on date and generate the MAU data each
day. There is no reason to generate MAU data for a given day more than
once (unless you find some problems with the source data or
something).

On Fri, Oct 12, 2012 at 1:08 PM, Igor Tatarinov <[EMAIL PROTECTED]> wrote:
> You just need to put the join condition in the WHERE clause. That way Hive
> will do a cartesian product followed by a filter.
>
> On Fri, Oct 12, 2012 at 1:02 PM, Tom Hubina <[EMAIL PROTECTED]> wrote:
>>
>> I think I see what you're saying about the temp table with start/end dates
>> (30x expansion makes sense) and it sounds like it should work. I just need
>> to figure out a good way to generate the table. Thanks!
>>
>> Tom
>>
>> On Wed, Oct 10, 2012 at 11:05 PM, Igor Tatarinov <[EMAIL PROTECTED]> wrote:
>>>
>>> If you have a lot of data, you might have to write a custom reducer (in
>>> python) to keep track of the moving date window.
>>>
>>> If you don't have that much data, you might want to use a temp table
>>> <start_date, end_date> such that datediff(end_date, start_date) < 30. To
>>> create such a table, you can self-join a table of unique dates using the
>>> above condition. Then, you would join your data with that table on
>>> start_date and group by end_date counting distinct user_ids. Hope I got that
>>> right :)
>>>
>>> The latter approach will essentially multiply the number of rows by 30.
>>>
>>> igor
>>> decide.com
>>>
>>>
>>> On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina <[EMAIL PROTECTED]> wrote:
>>>>
>>>> I'm trying to compute the number of active users in the previous 30 days
>>>> for each day over a date range. I can't think of any way to do it directly
>>>> within Hive so I'm wondering if you guys have any ideas.
>>>>
>>>> Basically the algorithm is something like:
>>>>
>>>> For each day in date range:
>>>>    SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day -
>>>> logins.day < 30;
>>>>
>>>> Thanks for your help!
>>>>
>>>> Tom
>>>>
>>>
>>
>