Hive user mailing list: Rolling MAU computation


Thread:
Tom Hubina          2012-10-10, 22:05
Roberto Sanabria    2012-10-10, 22:59
Tom Hubina          2012-10-10, 23:04
MiaoMiao            2012-10-11, 03:50
Tom Hubina          2012-10-12, 19:59
Igor Tatarinov      2012-10-11, 06:05
Tom Hubina          2012-10-12, 20:02
Igor Tatarinov      2012-10-12, 20:08
Re: Rolling MAU computation
For both simplicity and efficiency, I'd recommend partitioning the
mausummary table by date and generating the MAU data each day. There
is no reason to generate MAU data for a given day more than once
(unless you find problems with the source data or something like
that).

On Fri, Oct 12, 2012 at 1:08 PM, Igor Tatarinov <[EMAIL PROTECTED]> wrote:
> You just need to put the join condition in the WHERE clause. That way Hive
> will do a cartesian product followed by a filter.
>
> On Fri, Oct 12, 2012 at 1:02 PM, Tom Hubina <[EMAIL PROTECTED]> wrote:
>>
>> I think I see what you're saying about the temp table with start/end dates
>> (30x expansion makes sense) and it sounds like it should work. I just need
>> to figure out a good way to generate the table. Thanks!
>>
>> Tom
>>
>> On Wed, Oct 10, 2012 at 11:05 PM, Igor Tatarinov <[EMAIL PROTECTED]> wrote:
>>>
>>> If you have a lot of data, you might have to write a custom reducer (in
>>> python) to keep track of the moving date window.
>>>
>>> If you don't have that much data, you might want to use a temp table
>>> <start_date, end_date> such that datediff(end_date, start_date) < 30. To
>>> create such a table, you can self-join a table of unique dates using the
>>> above condition. Then, you would join your data with that table on
>>> start_date and group by end_date counting distinct user_ids. Hope I got that
>>> right :)
>>>
>>> The latter approach will essentially multiply the number of rows by 30.
>>>
>>> igor
>>> decide.com
>>>
>>>
>>> On Wed, Oct 10, 2012 at 3:05 PM, Tom Hubina <[EMAIL PROTECTED]> wrote:
>>>>
>>>> I'm trying to compute the number of active users in the previous 30 days
>>>> for each day over a date range. I can't think of any way to do it directly
>>>> within Hive so I'm wondering if you guys have any ideas.
>>>>
>>>> Basically the algorithm is something like:
>>>>
>>>> For each day in date range:
>>>>    SELECT day, COUNT(DISTINCT(userid)) FROM logins WHERE day -
>>>> logins.day < 30;
>>>>
>>>> Thanks for your help!
>>>>
>>>> Tom
>>>>
>>>
>>
>
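
A rough HiveQL sketch of the temp-table approach Igor describes above
(logins, unique_dates, date_windows, user_id, and dt are assumed
names; dt is taken to be a 'yyyy-MM-dd' string):

-- Distinct login dates; the raw material for the window table.
CREATE TABLE unique_dates AS
SELECT DISTINCT dt FROM logins;

-- Self-join with the window condition in the WHERE clause, so Hive
-- does a cartesian product followed by a filter (roughly a 30x
-- expansion of the date list).
CREATE TABLE date_windows AS
SELECT a.dt AS start_date, b.dt AS end_date
FROM unique_dates a JOIN unique_dates b
WHERE datediff(b.dt, a.dt) >= 0
  AND datediff(b.dt, a.dt) < 30;

-- Join the login data on start_date and group by end_date, counting
-- distinct users: one rolling 30-day MAU value per end_date.
SELECT w.end_date, COUNT(DISTINCT l.user_id) AS mau
FROM logins l
JOIN date_windows w ON (l.dt = w.start_date)
GROUP BY w.end_date;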