Pig >> mail # user >> Problem with GROUP_BY and Java heap space


Problem with GROUP_BY and Java heap space
Hello,

I'm running into OutOfMemoryError exceptions in this Pig script snippet:

grouped_sessions = group sessions_duration by (month, source);
stats = foreach grouped_sessions {
    bounces = filter sessions_duration by duration == 0;
    not_bounces = filter sessions_duration by duration > 0;
    generate flatten(group),
             StreamingQuartile(not_bounces.duration) as quartiles,
             AVG(not_bounces.duration) as avg,
             SQRT(VAR(not_bounces.duration)) as std,
             (double)COUNT(bounces) as n_bounces,
             (double)COUNT(sessions_duration) as n_samples;
};

Here sessions_duration is a bag of tuples (month, source, duration).
The number of tuples for a given (month, source) pair can be huge in
my data, and Pig seems to run out of heap space when it processes such
a group. Since some of the UDFs in the generate are not algebraic (e.g.
StreamingQuartile), is there a workaround? Maybe there is a better way
to express my intent in Pig that I'm not aware of.

Thank you in advance, guys,
Rafael
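
One possible direction for the question above, offered as a rough, untested sketch rather than a confirmed fix: compute the two counts in a separate foreach that contains only algebraic built-ins (SUM, COUNT) and no nested filter, since Pig can generally use the combiner for that kind of statement instead of materializing the whole bag in the reducer. The relation and field names sessions_duration, month, source, and duration come from the script in the message; flagged, is_bounce, grouped_flagged, and counts are hypothetical names introduced only for illustration.

```pig
-- Sketch only. is_bounce, flagged, grouped_flagged and counts are
-- hypothetical names; the rest comes from the original script.

-- Mark bounces before grouping, so no nested filter is needed later.
flagged = foreach sessions_duration generate
              month, source, duration,
              (duration == 0 ? 1L : 0L) as is_bounce;

grouped_flagged = group flagged by (month, source);

-- Only algebraic built-ins appear here, which is the case where Pig
-- is normally able to apply the combiner.
counts = foreach grouped_flagged generate
             flatten(group) as (month, source),
             SUM(flagged.is_bounce) as n_bounces,
             COUNT(flagged) as n_samples;
```

The quartile call, which the message describes as not algebraic, is deliberately left out of this sketch. It would have to run in its own statement over the non-bounce rows and then be joined back to counts on (month, source), and whether that alone avoids the heap error depends on the data and on the UDF.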