Pig >> mail # user >> SpillableMemoryManager - low memory handler called


Re: SpillableMemoryManager - low memory handler called
Okay, we did some filtering on all_urls in the nested FOREACH, and that seems to fix the performance issue. Some mappers still get 8GB of data, but the job's run time went down to 2 hours.

From Dmitriy's reply, it sounds like the low memory handler output in the logs is misleading.
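
For readers following along, a filter inside a nested FOREACH could look roughly like the sketch below. This is not the actual change made to the job: the input path, the schema of all_urls, the 420 threshold, and the output aliases are all assumptions made up for illustration.

```pig
-- Hypothetical input path and schema; the thread does not show how all_urls is loaded.
all_urls = LOAD 'input/all_urls' USING PigStorage('\t')
    AS (site:chararray, url:chararray, cnt:long);

grouped_urls_by_site = GROUP all_urls BY site;

-- Filter each site's bag inside the nested block before aggregating,
-- so the aggregates only see the URLs that pass the condition.
-- The threshold of 420 is illustrative only.
site_summary = FOREACH grouped_urls_by_site {
    busy_urls = FILTER all_urls BY cnt >= 420L;
    GENERATE group AS site,
             COUNT(busy_urls) AS busy_url_cnt,
             SUM(busy_urls.cnt) AS busy_views;
};

DUMP site_summary;
```

When the condition does not depend on the group itself, applying the same FILTER before the GROUP means fewer records are shuffled to each key, which may help further with a heavily skewed site.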

On May 6, 2010, at 3:30 PM, Corbin Hoenes wrote:

> Wondering if when we do a group like this:
>
> grouped_urls_by_site = GROUP all_urls BY site;
>
> If a certain site has a lot of URLs, would they all have to be processed by the same mapper (i.e., a single key)? Could this account for why we have 8GB in one map and not much in the others?
>
> On May 6, 2010, at 3:24 PM, Olga Natkovich wrote:
>
>> Looks like attachments are not coming through. Here is the script from
>> Corbin inline.
>>
>> One thing you might want to try is to switch your cogroups to skewed
>> join and see if that solves the issue:
>>
>> http://hadoop.apache.org/pig/docs/r0.6.0/piglatin_ref1.html#Skewed+Joins
>>
>> Olga
>>
>> --------------------------------------------topurl.pig-------------------------------------------------------------
>> set job.name 'Generate topurl reports for $out_file1'
>>
>> %default dir_prefix '../..'
>> %default storage 'BinStorage()'
>> %default tynt_udfs 'tynt-udfs.jar'
>> %default topN '20'
>> /* default to 30 days time period so that alltime report will get
>> 14*30=420 min page views*/
>> %default timeperiod '30'
>> %default min_page_views_per_day '14'
>>
>> register $dir_prefix/udfs/target/$tynt_udfs
>> register $dir_prefix/udfs/lib/piggybank.jar
>>
>> ---------------------summarize address bar stats-----------------------------------
>> addbar_stats = LOAD '$in_file1/addbarstats' USING $storage AS
>> (site:chararray, url:chararray, guid:chararray, cnt:long);
>> grouped_addbar_by_url = GROUP addbar_stats BY (site, url) PARALLEL 180;
>> addbar_stats_by_url = FOREACH grouped_addbar_by_url GENERATE
>> FLATTEN(group) AS (site, url), COUNT(addbar_stats) AS addbarcnt,
>> SUM(addbar_stats.cnt) AS addbarvisits;
>> STORE addbar_stats_by_url INTO '$out_file1/addbarstatsbyurl' USING
>> $storage;
>>
>> grouped_addbar_stats_by_site = GROUP addbar_stats_by_url BY site
>> PARALLEL 180;
>> addbar_stats_by_site = FOREACH grouped_addbar_stats_by_site GENERATE
>> group AS site, SUM(addbar_stats_by_url.addbarcnt) AS addbarcnt,
>> SUM(addbar_stats_by_url.addbarvisits) AS addbarvisits;
>> STORE addbar_stats_by_site INTO '$out_file1/addbarstatsbysite' USING
>> $storage;
>>
>> ----------------------calculate ratio------------------------------------------
>> clickstatsbyurl = LOAD '$in_file1/clickstatsbyurl' USING $storage AS
>> (site:chararray, url:chararray, cnt:long, tracecnt:long, tcnt:long,
>> pcnt:long, wcnt:long, utracecnt:long, utcnt:long, upcnt:long,
>> uwcnt:long);
>> viewstatsbyurl = LOAD '$in_file1/viewstatsbyurl' USING $storage AS
>> (site:chararray, url:chararray, title:chararray, cnt:long, etcnt:long,
>> et1cnt:long, et2cnt:long, et3cnt:long, et6cnt:long, et7cnt:long);
>>
>> light_clickstatsbyurl = FOREACH clickstatsbyurl GENERATE site, url, cnt;
>> light_viewstatsbyurl_noisy = FOREACH viewstatsbyurl GENERATE site, url,
>> title, cnt, etcnt;
>>
>> light_viewstatsbyurl = FILTER light_viewstatsbyurl_noisy BY url != '-';
>>
>> --light_addbarstatsbyurl = FOREACH addbar_stats_by_url GENERATE site, url, addbarvisits;
>> --joined_stats_for_ratio = COGROUP light_viewstatsbyurl BY (site, url) INNER,
>> --    light_clickstatsbyurl BY (site, url) OUTER,
>> --    light_addbarstatsbyurl BY (site, url) OUTER;
>> --flattened_stats_for_ratio = FOREACH joined_stats_for_ratio GENERATE
>> --    FLATTEN(light_viewstatsbyurl) AS (site, url, title, cnt, etcnt),
>> --    (IsEmpty(light_clickstatsbyurl)?0:MAX(light_clickstatsbyurl.cnt)) as clickcnt,
>> --    (IsEmpty(light_addbarstatsbyurl)?0:MAX(light_addbarstatsbyurl.addbarvisits)) as addbarcnt;
>>
>> joined_stats_for_ratio = COGROUP light_viewstatsbyurl BY (site, url)
>> INNER, light_clickstatsbyurl BY (site, url) OUTER;