Pig, mail # user - java.lang.OutOfMemoryError while running Pig Job


Re: java.lang.OutOfMemoryError while running Pig Job
sonia gehlot 2011-05-23, 21:17
Hi Shawn,

I tried using SUBSTRING in my script with different combinations, but I am
still getting OOM errors.

Is there any other alternative for doing a distinct count against a very
large set of data?

Thanks,
Sonia

On Fri, May 20, 2011 at 1:54 PM, Xiaomeng Wan <[EMAIL PROTECTED]> wrote:

> It serves two purposes:
> 1. divide the group into smaller subgroups
> 2. make sure distinct in subgroup => distinct in group (equal values share
>    the same prefix, so they always land in the same subgroup)
>
> Shawn
>
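To make the two purposes above concrete, here is a minimal sketch of the counting logic in plain Python rather than Pig. The function names and the sample rows are hypothetical, made up for illustration. It only shows why summing distinct counts over (key, prefix) subgroups gives the same answer as a direct distinct count per key; it does not model Pig's memory behaviour, since here all sets live in memory at once.

```python
from collections import defaultdict


def count_distinct_direct(rows):
    """Distinct values per key, using one set per key."""
    seen = defaultdict(set)
    for key, value in rows:
        seen[key].add(value)
    return {key: len(values) for key, values in seen.items()}


def count_distinct_by_prefix(rows, prefix_len=2):
    """Distinct values per key, counted per (key, prefix) subgroup and summed."""
    subgroups = defaultdict(set)
    for key, value in rows:
        # Equal values share a prefix, so duplicates always meet in one subgroup.
        subgroups[(key, value[:prefix_len])].add(value)
    totals = defaultdict(int)
    for (key, _prefix), values in subgroups.items():
        totals[key] += len(values)
    return dict(totals)


# Hypothetical sample data: (key, field_to_be_distinct)
rows = [
    ("a", "user01"),
    ("a", "user02"),
    ("a", "user01"),
    ("a", "visitor9"),
    ("b", "user01"),
]

direct = count_distinct_direct(rows)
by_prefix = count_distinct_by_prefix(rows)
print(direct == by_prefix)
```

A longer prefix produces more, smaller subgroups, which is the reason for lengthening the substring when a single subgroup is still too large.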
> On Fri, May 20, 2011 at 2:20 PM, sonia gehlot <[EMAIL PROTECTED]>
> wrote:
> > Hey, I am sorry but I didn't get how SUBSTRING will help in this?
> >
> > On Fri, May 20, 2011 at 1:08 PM, Xiaomeng Wan <[EMAIL PROTECTED]>
> wrote:
> >
> >> You can try some divide and conquer, like this:
> >>
> >> a = group data by (key, SUBSTRING(the_field_to_be_distinct, 0, 2));
> >> b = foreach a { x = distinct data.the_field_to_be_distinct; generate
> >> group.key as key, COUNT(x) as cnt; }
> >> c = group b by key;
> >> d = foreach c generate group as key, SUM(b.cnt) as cnt;
> >>
> >> Use a longer substring if you still run into OOM.
> >>
> >> Regards,
> >> Shawn
> >>
> >> On Fri, May 20, 2011 at 1:11 PM, sonia gehlot <[EMAIL PROTECTED]>
> >> wrote:
> >> > Hey Thejas,
> >> >
> >> > I tried setting the property pig.cachedbag.memusage to 0.1 and also tried
> >> > computing the distinct count for each type separately, but I am still
> >> > getting errors like
> >> >
> >> > Error: java.lang.OutOfMemoryError: Java heap space
> >> > Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
> >> > java.io.IOException: Spill failed
> >> >
> >> > Is there some other way to do a distinct count that you may suggest?
> >> >
> >> > Thanks for your help.
> >> >
> >> > Sonia
> >> >
> >> > On Fri, May 13, 2011 at 4:46 PM, Thejas M Nair <[EMAIL PROTECTED]>
> >> wrote:
> >> >
> >> >> The stack trace shows that the OOM error is happening when the distinct is
> >> >> being applied. It looks like in some record(s) of the relation group_it,
> >> >> one or more of the following bags is very large: logic.c_users,
> >> >> logic.nc_users or logic.registered_users.
> >> >>
> >> >> Try setting the property pig.cachedbag.memusage to 0.1 or lower
> >> >> (-Dpig.cachedbag.memusage=0.1 on the java command line). It controls the
> >> >> memory used by Pig's internal bags, including those used by distinct.
> >> >>
> >> >> If that does not work, you can try computing count-distinct for each type
> >> >> of user separately and then combining the results.
> >> >>
> >> >> You might want to have a look at this way of optimizing count-distinct
> >> >> queries where skew can be a problem:
> >> >> https://issues.apache.org/jira/browse/PIG-1846
> >> >>
> >> >> -thejas
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> On 5/12/11 10:43 AM, "sonia gehlot" <[EMAIL PROTECTED]> wrote:
> >> >>
> >> >> > Hi Guys,
> >> >> >
> >> >> > I am running the following Pig script in Pig 0.8:
> >> >> >
> >> >> > page_events = LOAD '/user/sgehlot/day=2011-05-10' as
> >> >> > (event_dt_ht:chararray,event_dt_ut:chararray,event_rec_num:int,event_type:int,
> >> >> > client_ip_addr:long,hub_id:int,is_cookied_user:int,local_ontology_node_id:int,
> >> >> > page_type_id:int,content_id:int,product_id:int,referrer_edition_id:int,
> >> >> > page_number:int,is_iab_robot:int,browser_id:int,os_id:int,dw_pubsys_id:int,
> >> >> > refresh:int,asset_id:int,asset_type_id:int,content_type_id:int,
> >> >> > product_type_id:int,outbound_email_id:long,gbal_clc:int,mtype:int,
> >> >> > user_action_id:int,referring_partner_id:int,ontology_node_id:int,
> >> >> > content_namespace_id:int,product_namespace_id:int,
> >> >> > transparent_edition_id:int,default_edition_id:int,event_seq_num:int,
> >> >> > is_last_page:int,is_new_user:int,page_duration:int,page_seq_num:int,
> >> >> > session_id:long,time
> >> >> >
> >> >>