Pig user mailing list: Pig Unique Counts on Multiple Subsets of a Large Input


Thomas Edison 2013-05-06, 03:11
Re: Pig Unique Counts on Multiple Subsets of a Large Input
Are you familiar with the CUBE keyword that was added relatively recently?
This sounds like a perfect use case for it. Also, how are you
splitting on activity? There is a SPLIT operator that is well suited to this,
since it gives you a separate relation for each category.
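
If it helps, here is a minimal sketch of what CUBE could look like against
the schema from your mail (Pig 0.11+; the nested DISTINCT is my assumption
for getting unique users rather than raw record counts):

    source = LOAD 'input' AS (userId:chararray, dimensionA_key:chararray,
        dimensionB_key:chararray, dimensionC_key:chararray, activity:chararray);

    -- one pass over the input produces every dimension combination
    -- (A, B, C, A+B, A+C, B+C, A+B+C, plus the grand total)
    cubed = CUBE source BY CUBE(dimensionA_key, dimensionB_key, dimensionC_key);

    uniques = FOREACH cubed {
        users = DISTINCT cube.userId;  -- 'cube' is the bag of records per combination
        GENERATE FLATTEN(group) AS (dimA, dimB, dimC), COUNT(users) AS unique_users;
    }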

What I would do is use SPLIT to break the input down by activity category,
then write a macro that produces the counts you want and apply it to each
of the split relations.
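
Roughly like this (the activity values below are placeholders for your real
category mapping; note that SPLIT conditions are not exclusive, so one record
can land in several categories, which matches your 1-to-many mapping):

    -- distinct-user counts per dimension pair for one category relation
    DEFINE uniqueUsersByDims(rel) RETURNS counts {
        pairs   = FOREACH $rel GENERATE dimensionA_key, dimensionB_key, userId;
        uniq    = DISTINCT pairs;
        grouped = GROUP uniq BY (dimensionA_key, dimensionB_key);
        $counts = FOREACH grouped GENERATE FLATTEN(group), COUNT(uniq);
    };

    SPLIT source INTO
        category1 IF activity == 'view',
        category2 IF (activity == 'view' OR activity == 'click');

    counts1 = uniqueUsersByDims(category1);
    counts2 = uniqueUsersByDims(category2);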
2013/5/6 Thomas Edison <[EMAIL PROTECTED]>

> Hi there,
>
> I have a huge input on HDFS and I would like to use Pig to calculate
> several unique metrics. To help explain the problem more easily, I assume
> the input file has the following schema:
>
> userId:chararray, dimensionA_key:chararray, dimensionB_key:chararray,
> dimensionC_key:chararray, activity:chararray, ...
>
> Each record represents an activity performed by the given userId.
>
> Based on the value in the activity field, each activity record is mapped
> to one or more categories. There are about 10 categories in total.
>
> Now I need to count the number of unique users for different dimension
> combinations (i.e. A, B, C, A+B, A+C, B+C, A+B+C) for each activity
> category.
>
> What would be the best practice for performing such a calculation?
>
> I have tried several ways. Although I can get the results I want, it takes
> a very long time (i.e. days). What I found is that most of the time is spent
> in the map phase. It looks like the script reloads the huge input file every
> time it calculates one unique count. Is there a way to improve this
> behavior?
>
> I also tried something similar to the script below, but it looks like it
> reaches the memory cap for a single reducer and gets stuck at the last
> reduce step.
>
> source = load ... as (userId:chararray, dimensionA_key:chararray,
> dimensionB_key:chararray, dimensionC_key:chararray,
> activity:chararray, ...);
> a = group source by (dimensionA_key, dimensionB_key);
> b = foreach a {
>     userId1 = udf.newUserIdForCategory1(userId, activity);
>     -- this udf returns the original user id if the activity should be
> mapped to Category1 and None otherwise
>     userId2 = udf.newUserIdForCategory2(userId, activity);
>     userId3 = udf.newUserIdForCategory3(userId, activity);
>     ...
>     userId10 = udf.newUserIdForCategory10(userId, activity);
>     generate FLATTEN(group), COUNT(userId1), COUNT(userId2),
> COUNT(userId3), ..., COUNT(userId10);
> }
> store b ...;
>
> Thanks.
>
> T.E.
>
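
Two more notes on the problems you describe. The repeated scanning usually
goes away if all the pipelines and their STORE statements live in a single
script, because Pig's multi-query execution then shares one pass over the
LOAD. As for the stuck reducer: the nested UDF calls in your FOREACH prevent
the combiner from kicking in, so each reducer has to buffer entire bags per
group. If you pre-deduplicate (dimensions, userId) pairs before grouping,
COUNT stays algebraic and the combiner does most of the work. A minimal
sketch, assuming your schema (the PARALLEL value is a placeholder):

    pairs   = FOREACH source GENERATE dimensionA_key, dimensionB_key, userId;
    uniq    = DISTINCT pairs PARALLEL 50;  -- spread the dedup across reducers
    grouped = GROUP uniq BY (dimensionA_key, dimensionB_key);
    counts  = FOREACH grouped GENERATE FLATTEN(group), COUNT(uniq) AS unique_users;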
Replies:
  Alan Gates 2013-05-06, 15:12
  Thomas Edison 2013-05-06, 15:41
  Thomas Edison 2013-05-06, 17:03
  Prasanth J 2013-05-09, 03:49