Pig, mail # user - java.lang.OutOfMemoryError while running Pig Job


Re: java.lang.OutOfMemoryError while running Pig Job
Thejas M Nair 2011-05-13, 23:46
The stack trace shows that the OOM error happens when the distinct is
being applied. It looks like in some record(s) of the relation group_it, one
or more of the following bags is very large: logic.c_users, logic.nc_users, or
logic.registered_users.

Try setting the property pig.cachedbag.memusage to 0.1 or lower
(-Dpig.cachedbag.memusage=0.1 on the java command line). It controls the memory
used by Pig's internal bags, including those used by distinct.

If that does not work, you can try computing count-distinct for each type of
user separately and then combining the results.
You might also want to look at this approach to optimizing count-distinct
queries where skew can be a problem:
https://issues.apache.org/jira/browse/PIG-1846
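
As a rough illustration of computing count-distinct for each type of user separately, something along these lines might work. This is only a sketch: the full script is not shown in this thread, so grp_key is a hypothetical stand-in for whatever logic is grouped by, and every relation name other than logic is made up for the example.

```pig
-- Sketch only. `grp_key` is a hypothetical placeholder for the real grouping
-- key; c_users, nc_users and registered_users are the fields named above.

-- A top-level DISTINCT runs as its own MapReduce job, rather than as a nested
-- DISTINCT over one potentially very large bag per group.
c_pairs   = FOREACH logic GENERATE grp_key, c_users;
c_dist    = DISTINCT c_pairs;
c_grp     = GROUP c_dist BY grp_key;
-- COUNT skips tuples whose first field is null, so null users are not counted.
c_cnt     = FOREACH c_grp GENERATE group AS grp_key,
                COUNT(c_dist.c_users) AS c_users_cnt;

nc_pairs  = FOREACH logic GENERATE grp_key, nc_users;
nc_dist   = DISTINCT nc_pairs;
nc_grp    = GROUP nc_dist BY grp_key;
nc_cnt    = FOREACH nc_grp GENERATE group AS grp_key,
                COUNT(nc_dist.nc_users) AS nc_users_cnt;

reg_pairs = FOREACH logic GENERATE grp_key, registered_users;
reg_dist  = DISTINCT reg_pairs;
reg_grp   = GROUP reg_dist BY grp_key;
reg_cnt   = FOREACH reg_grp GENERATE group AS grp_key,
                COUNT(reg_dist.registered_users) AS reg_users_cnt;

-- Combine the three per-key counts. Every key in `logic` appears in all three
-- relations, so an inner join keeps all keys (null keys excepted).
combined  = JOIN c_cnt BY grp_key, nc_cnt BY grp_key, reg_cnt BY grp_key;
result    = FOREACH combined GENERATE c_cnt::grp_key AS grp_key,
                c_users_cnt, nc_users_cnt, reg_users_cnt;
```

This trades one job with nested distincts for several smaller jobs, so it will likely run longer, but no single reducer has to hold all the users for one group in memory at once.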

-thejas

On 5/12/11 10:43 AM, "sonia gehlot" <[EMAIL PROTECTED]> wrote:

> Hi Guys,
>
> I am running the following Pig script on Pig 0.8:
>
> page_events = LOAD '/user/sgehlot/day=2011-05-10' as
> (event_dt_ht:chararray,event_dt_ut:chararray,event_rec_num:int,event_type:int,
> client_ip_addr:long,hub_id:int,is_cookied_user:int,local_ontology_node_id:int,
> page_type_id:int,content_id:int,product_id:int,referrer_edition_id:int,
> page_number:int,is_iab_robot:int,browser_id:int,os_id:int,dw_pubsys_id:int,
> refresh:int,asset_id:int,asset_type_id:int,content_type_id:int,
> product_type_id:int,outbound_email_id:long,gbal_clc:int,mtype:int,
> user_action_id:int,referring_partner_id:int,ontology_node_id:int,
> content_namespace_id:int,product_namespace_id:int,transparent_edition_id:int,
> default_edition_id:int,event_seq_num:int,is_last_page:int,is_new_user:int,
> page_duration:int,page_seq_num:int,session_id:long,time_since_sess_start:int,
> reg_cookie:chararray,urs_app_id:int,is_reg_user:int,edition_id:int,
> user_agent_id:int,page_type_key:int,referrer_id:int,channel_id:int,
> level2_id:int,level3_id:int,brand_id:int,content_key:int,product_key:int,
> edition_key:int,partner_key:int,business_unit_id:int,anon_cookie:chararray,
> machine_name:chararray,pagehost:chararray,filenameextension:chararray,
> referrerpath:chararray,referrerhost:chararray,referring_oid:chararray,
> referring_legacy_oid:chararray,ctype:chararray,cval:chararray,
> link_tag:chararray,link_type:chararray,sticky_tag:chararray,
> page_url:chararray,search_category:chararray,partner_subject:chararray,
> referring_partner_name:chararray,robot_pattern:chararray,browser:chararray,
> browser_major_version:chararray,browser_minor_version:chararray,os:chararray,
> os_family:chararray,ttag:chararray,dest_oid:chararray,global_id:chararray,
> hostname:chararray,path:chararray,filename:chararray,extension:chararray,
> query:chararray,user_agent:chararray,xrq:chararray,xref:chararray,
> page_guid:chararray,test_name:chararray,test_group:chararray,
> test_version:chararray,page_version:chararray,o_sticky_tag:chararray,
> new_referring_oid:chararray,day:chararray,network_ip:int,site_id:int,
> search_phrase:chararray,search_attributes:chararray,
> web_search_phrase:chararray,ip_address:chararray,is_pattern_match_robot:int,
> protocol:chararray,skc_title:chararray,skc_url:chararray,
> has_site_search_phrase:int,has_site_search_attribs:int,
> has_web_search_phrase:int,title_id:chararray,url_id:chararray,network_rev:int);
>
> referrer_group_map = LOAD '/user/sgehlot/oozie/db_data/referrer_group_map'
> as
> (referrer_id:int, has_web_search_phrase:int, hostname:chararray,
> referral_type_id:int,
> referral_type_name:chararray,
> referrer_group_id:int,referrer_group_name:chararray,referrer_group_cat_id:int,
> referrer_group_cat:chararray);
>
> filter_pe = FILTER page_events BY is_iab_robot == 0 AND
> is_pattern_match_robot == 0 AND day == '2011-05-10';
>
> select_pe_col = FOREACH filter_pe GENERATE day, is_cookied_user,
> anon_cookie, reg_cookie, referrer_id, has_web_search_phrase,
> business_unit_id;
>
> select_ref_col = FOREACH referrer_group_map GENERATE referrer_id,
> has_web_search_phrase, referral_type_id;
>
> jn = JOIN select_ref_col BY (referrer_id, has_web_search_phrase),
> [...]
> org.apache.pig.data.BinSedesTupleFactory.newTuple(BinSedesTupleFactory.java:33)