Pig >> mail # user >> Aggregation for chronologically ordered dataset


Re: Aggregation for chronologically ordered dataset
I'd use the rank function to join each row to its previous and next rows,
then filter out the middle rows of each run, then join the first row of each
run to its last and calculate the time.
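
To make the intended result concrete, here is a minimal sketch of the same neighbour-comparison idea in plain Python rather than Pig Latin. It is not the rank-and-join script itself: it sorts the rows, compares each row with the previous one, and starts a new run whenever the user or activity changes. The function name `sum_uninterrupted`, the `ROWS` constant, and the tuple layout are all hypothetical, and it assumes zero-padded `HH:MM` times within a single day so that sorting the time strings gives chronological order. Something along these lines could probably be adapted into a UDF, but I have not tried that.

```python
# Sample rows copied from the question below:
# (user_id, activity, tool, begin_time, end_time, duration_minutes)
ROWS = [
    (1, "development", "tool1", "10:00", "10:15", 15),
    (1, "development", "tool2", "10:15", "10:30", 15),
    (1, "other", "tool3", "10:30", "11:00", 30),
    (1, "development", "tool1", "11:00", "11:20", 20),
    (1, "other", "tool4", "11:20", "12:00", 40),
    (1, "development", "tool1", "12:00", "12:15", 15),
    (2, "other", "tool3", "10:00", "11:00", 60),
    (2, "development", "tool1", "11:00", "11:20", 20),
    (2, "development", "tool2", "11:20", "11:30", 10),
]


def sum_uninterrupted(rows):
    """Sum durations over consecutive rows that share user and activity."""
    # Order by user, then begin time (assumes zero-padded HH:MM strings).
    ordered = sorted(rows, key=lambda r: (r[0], r[3]))
    runs = []
    for rank, row in enumerate(ordered):
        user, activity, minutes = row[0], row[1], row[5]
        prev = ordered[rank - 1] if rank > 0 else None
        if prev is not None and prev[0] == user and prev[1] == activity:
            # Same user and activity as the previous row: extend the current run.
            run_user, run_activity, total = runs[-1]
            runs[-1] = (run_user, run_activity, total + minutes)
        else:
            # User or activity changed: start a new run.
            runs.append((user, activity, minutes))
    return runs
```

Calling `sum_uninterrupted(ROWS)` on the sample data gives one `(user, activity, total_minutes)` tuple per uninterrupted slot, in the same order as the desired output in the question.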
On 15 Mar 2013 at 19:04, "pranjal rajput" <[EMAIL PROTECTED]> wrote:

> Hi,
> I am new to Pig.
> I have a dataset from a time-tracker application.
> It records the time that users spend on various activities.
> For example:
> UserId | Activity    | Tool  | BeginTime | EndTime | DurationMinute
> 1      | development | tool1 | 10:00     | 10:15   | 15
> 1      | development | tool2 | 10:15     | 10:30   | 15
> 1      | other       | tool3 | 10:30     | 11:00   | 30
> 1      | development | tool1 | 11:00     | 11:20   | 20
> 1      | other       | tool4 | 11:20     | 12:00   | 40
> 1      | development | tool1 | 12:00     | 12:15   | 15
> 2      | other       | tool3 | 10:00     | 11:00   | 60
> 2      | development | tool1 | 11:00     | 11:20   | 20
> 2      | development | tool2 | 11:20     | 11:30   | 10
>
> I wish to find the uninterrupted time slots spent on
> Activity=development, like this:
>
> UserId | Activity    | SumDurationMinutes
> 1      | development | 30  /* notice that two slots are summed */
> 1      | other       | 30
> 1      | development | 20
> 1      | other       | 40
> 1      | development | 15
> 2      | other       | 60
> 2      | development | 30  /* again summed */
>
> How can this be done in Pig?
> I am open to writing a UDF for this, or to any other workaround.
> Thanks in anticipation,
>
> --
> Best Regards
> Pranjal Rajput
>