Mike Sukmanowsky 2013-08-13, 20:32
Re: Distinct IDs from different time periods
There is not much information here to go on yet.
What is the nature of your data? Do you get it daily or monthly?
Is the 30 days a sliding window or a calendar month?

IMHO the approach should be:
When data arrives, find the most recent activity for each user.
Store output to /recent-activity/yyyy/mm/dd/hh

When the next batch arrives, read it together with the previously found
recent activity and write the output to
/recent-activity/yyyy/mm/dd/hh+1
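
A rough sketch of that merge step in Python, just to show the logic. I am
assuming each record boils down to a (user_id, timestamp) pair and that the
previous output can be loaded as a dict; the function name and the record
layout are my assumptions, not something taken from your logs.

```python
from datetime import datetime


def merge_recent_activity(previous, new_events):
    """Return a new snapshot mapping user_id -> most recent timestamp.

    previous:   dict loaded from the last snapshot (hypothetical layout)
    new_events: iterable of (user_id, timestamp) pairs from the new batch
    """
    snapshot = dict(previous)  # copy, so the old snapshot stays untouched
    for user_id, ts in new_events:
        # keep whichever timestamp is later for this user
        if user_id not in snapshot or ts > snapshot[user_id]:
            snapshot[user_id] = ts
    return snapshot


# Made-up sample data: one returning user (u1) and one new user (u3).
previous = {"u1": datetime(2013, 6, 15, 10), "u2": datetime(2013, 6, 20, 9)}
batch = [("u1", datetime(2013, 7, 2, 8)), ("u3", datetime(2013, 7, 3, 12))]
snapshot = merge_recent_activity(previous, batch)
```

In Pig the same idea would probably be a union of the old snapshot with the
new batch, followed by a group on the user id and a MAX over the timestamp.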

So you always track the most recent events of users.
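
If the periods turn out to be fixed (as in your Jun/Jul example), a snapshot
of the last activity before the current period is enough to split the groups.
A sketch under that assumption; `classify_users` and its arguments are
hypothetical names, and the sample data is made up.

```python
from datetime import date


def classify_users(current_users, last_seen_before, prior_start, prior_end):
    """Split current-period users into (group_a, group_b).

    current_users:    set of user ids active in the current period
    last_seen_before: dict user_id -> date of last activity before the
                      current period started (hypothetical snapshot)
    prior_start/end:  inclusive bounds of the prior period
    """
    group_a, group_b = set(), set()
    for user in current_users:
        last = last_seen_before.get(user)
        if last is not None and prior_start <= last <= prior_end:
            group_b.add(user)  # also active in the prior period
        else:
            group_a.add(user)  # never seen, or last seen before the prior period
    return group_a, group_b


# Made-up sample: u1 was active in June, u2 was last seen in May, u3 is new.
current = {"u1", "u2", "u3"}
last_seen = {"u1": date(2013, 6, 15), "u2": date(2013, 5, 3)}
group_a, group_b = classify_users(
    current, last_seen, date(2013, 6, 1), date(2013, 6, 30)
)
```

This works because the snapshot holds the latest activity before the current
period: if that date is older than the prior period, the user cannot have
been active inside it.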

Provide more details and we can think about how to solve your problem.

Right now there are more questions than answers.
On 14.08.2013 0:33, "Mike Sukmanowsky" <[EMAIL PROTECTED]> wrote:

> Hi all,
>
> Trying to use Pig to produce some data from clickstream logs, doing the
> following:
>
>    1. Pull data for the past 30 days (current period)
>    2. Classify Group A as users who had activity in the current period but
>    not 30 days prior to the current period.
>    3. Classify Group B effectively as all {users in current period} -
>    {Group A}
>
> To make the example concrete, let's say end date is July 30, 2013.
>
> So Group A users =  anyone who had activity from Jul 1 - Jul 30, 2013 but
> did not have activity in Jun 1 - Jun 30.
> Group B users = anyone who had activity from Jul 1 - Jul 30, 2013
> and also had activity in Jun 1 - Jun 30.
>
> I've had some initial thoughts for how to approach this but none of them
> seem great.  Any thoughts from the group?
>
> Mike
>
> --
> Mike Sukmanowsky
>
> Product Lead, http://parse.ly
> 989 Avenue of the Americas, 3rd Floor
> New York, NY  10018
> p: +1 (416) 953-4248
> e: [EMAIL PROTECTED]
>