Pig >> mail # user >> Distinct IDs from different time periods


Re: Distinct IDs from different time periods
There is not enough information here to help you yet.
What is the nature of your data? Do you receive it daily or monthly?
Is the 30 days a sliding window or a calendar month?

IMHO the approach should be:
When data arrives, find the most recent activity for each user.
Store the output to /recent-activity/yyyy/mm/dd/hh

When the next batch arrives, read it together with the previously found
recent activities and write the output to
/recent-activity/yyyy/mm/dd/hh+1

So you always track the most recent events of users.
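
A minimal sketch of that bookkeeping, written in Python rather than Pig Latin
just to make the logic concrete. The function names `merge_recent` and
`classify`, the `(user_id, timestamp)` record shape, and the sample data are
all assumptions for illustration; nothing here comes from the original poster's
setup. It assumes the stored state only contains activity from before the
current period.

```python
from datetime import datetime, timedelta

def merge_recent(previous, batch):
    """Return a new dict of user_id -> latest timestamp seen so far.

    previous: state from the last run (the stored recent-activity output)
    batch:    iterable of (user_id, timestamp) events that just arrived
    """
    recent = dict(previous)  # copy so the stored state is not mutated
    for user_id, ts in batch:
        if user_id not in recent or ts > recent[user_id]:
            recent[user_id] = ts
    return recent

def classify(previous, batch, period_start, lookback_days=30):
    """Split the users in the batch into (group_a, group_b).

    group_b: last activity before period_start falls inside the lookback window
    group_a: every other user in the batch
    """
    window_start = period_start - timedelta(days=lookback_days)
    group_a, group_b = set(), set()
    for user_id, _ in batch:
        last = previous.get(user_id)
        if last is not None and window_start <= last < period_start:
            group_b.add(user_id)
        else:
            group_a.add(user_id)
    return group_a, group_b

# Hypothetical state from the previous run and a new batch for July 2013.
previous = {"u1": datetime(2013, 6, 15), "u2": datetime(2013, 4, 2)}
batch = [
    ("u1", datetime(2013, 7, 3)),
    ("u2", datetime(2013, 7, 10)),
    ("u3", datetime(2013, 7, 12)),
]

group_a, group_b = classify(previous, batch, period_start=datetime(2013, 7, 1))
recent = merge_recent(previous, batch)  # becomes the state for the next run
```

Because only the latest timestamp per user is kept, the stored state stays
small, and a user belongs to Group B exactly when that timestamp lies in the
30 days before the current period.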

Provide more details and we can think about how to solve your problem.

Right now there are more questions than answers.
On 14.08.2013 at 0:33, "Mike Sukmanowsky" <[EMAIL PROTECTED]> wrote:

> Hi all,
>
> Trying to produce some data using clickstream logs from Pig that does the
> following:
>
>    1. Pull data for the past 30 days (current period)
>    2. Classify Group A as users who had activity in the current period but
>    not 30 days prior to the current period.
>    3. Classify Group B effectively as all {users in current period} -
>    {Group A}
>
> To make the example concrete, let's say end date is July 30, 2013.
>
> So Group A users =  anyone who had activity from Jul 1 - Jul 30, 2013 but
> did not have activity in Jun 1 - Jun 30.
> Group B users = anyone who had activity from Jul 1 - Jul 30, 2013
> and also had activity in Jun 1 - Jun 30.
>
> I've had some initial thoughts for how to approach this but none of them
> seem great.  Any thoughts from the group?
>
> Mike
>
> --
> Mike Sukmanowsky
>
> Product Lead, http://parse.ly
> 989 Avenue of the Americas, 3rd Floor
> New York, NY  10018
> p: +1 (416) 953-4248
> e: [EMAIL PROTECTED]
>
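
The grouping described in the quoted message reduces to two sets of user IDs,
one per period. A small Python sketch of that definition, assuming events are
`(user_id, date)` pairs and using the July 30, 2013 end date from the example;
the function name `split_groups` and the sample events are hypothetical.

```python
from datetime import date, timedelta

def split_groups(events, end_date, days=30):
    """Return (group_a, group_b) for the period of `days` ending at end_date."""
    current_start = end_date - timedelta(days=days - 1)  # Jul 1 when end is Jul 30
    prior_start = current_start - timedelta(days=days)   # Jun 1 in the same example
    current = {u for u, d in events if current_start <= d <= end_date}
    prior = {u for u, d in events if prior_start <= d < current_start}
    # Group A: active now but not in the prior period; Group B: active in both.
    return current - prior, current & prior

# Hypothetical clickstream events.
events = [
    ("a", date(2013, 7, 5)),
    ("b", date(2013, 6, 20)),
    ("b", date(2013, 7, 29)),
    ("c", date(2013, 6, 2)),   # prior period only, so in neither group
    ("d", date(2013, 5, 31)),  # just before the prior period starts
    ("d", date(2013, 7, 1)),
]

group_a, group_b = split_groups(events, end_date=date(2013, 7, 30))
```

With this data, Group A is `{"a", "d"}` and Group B is `{"b"}`.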