Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig, mail # user - Pig Unique Counts on Multiple Subsets of a Large Input

Copy link to this message
Re: Pig Unique Counts on Multiple Subsets of a Large Input
Jonathan Coveney 2013-05-06, 14:56
Are you familiar with the CUBE keyword that was relatively recently added?
This sounds like a perfect use case for it. Furthermore, how are you
splitting on activity? There is a SPLIT operator which is perfect for this,
as you can have a different relation for each one.

What I would do would be to use split to break it down into activities,
then make a macro that gives you the counts you want and use that on each
of the split relations.
2013/5/6 Thomas Edison <[EMAIL PROTECTED]>

> Hi there,
> I have a huge input on an HDFS and I would like to use Pig to calculate
> several unique metrics. To help explain the problem more easily, I assume
> the input file has the following schema:
> userId:chararray, dimensionA_key:chararray, dimensionB_key:chararray,
> dimensionC_key:chararray, activity:chararray, ...
> Each record represent an activity performed by that userId.
> Based on the value in the activity field, this activity record will be
> mapped to 1 or more categories. There are about 10 categories in total.
> Now I need to count the number of unique users for different dimension
> combinations (i.e. A, B, C, A+B, A+C, B+C, A+B+C) for each activity
> category.
> What would be the best practices to perform such calculation?
> I have tried several ways. Although I can get the results I want, it takes
> a very long time (i.e. days). What I found is most of the time is spent on
> the map phase. It looks like the script tries to load the huge input file
> every time it tries to calculate one unique count. Is there a way to
> improve this behavior?
> I also tried something similar to below, but it looks like it reaches the
> memory cap for a single reducer and just stuck at the last reducer step.
> source = load ... as (userId:chararray, dimensionA_key:chararray,
> dimensionB_key:chararray, dimensionC_key:chararray,
> activity:chararray, ...);
> a = group source by (dimensionA_key, dimensionB_key);
> b = foreach a {
>     userId1 = udf.newUserIdForCategory1(userId, activity);
>     -- this udf returns the original user id if the activity should be
> mapped to Category1 and None otherwise
>     userId2 = udf.newUserIdForCategory2(userId, activity);
>     userId3 = udf.newUserIdForCategory3(userId, activity);
>     ...
>     userId10 = udf.newUserIdForCategory10(userId, activity);
>     generate FLATTEN(group), COUNT(userId1), COUNT(userId2),
> COUNT(userId3), ..., COUNT(userId10);
> }
> store b ...;
> Thanks.
> T.E.