Pig, mail # user - About SpillableMemoryManager


Re: About SpillableMemoryManager
Dmitriy Ryaboy 2012-11-02, 16:07
mapred.child.java.opts should be in the gigabytes; 200M is way too low.
Check this Stack Overflow thread for comments on how to ensure your setting
actually takes effect -- it's possible you are not propagating it to the
job. If you change it in the Hadoop config files, you need to restart the
MR daemons (JT and TTs).
http://stackoverflow.com/questions/8464048/out-of-memory-error-in-hadoop

I'll take a look at your script next time I have a few minutes, but try
this first -- 200M is definitely too low to get much done in Hadoop.
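
To check whether a new -Xmx value actually reaches a JVM, and to see roughly where the "70% of the largest memory pool" threshold mentioned further down the thread would land, a small standalone sketch like the one below can help. It is not Pig's code: the class name HeapThresholdCheck and both helper methods are made up for illustration, and the 70% figure is taken from this thread rather than from the Pig source, so treat the result as an approximation.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of Pig or Hadoop.
class HeapThresholdCheck {

    /** Largest "max" size among the JVM's heap memory pools, or -1 if none reports one. */
    static long largestHeapPoolMax() {
        long biggest = -1;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip non-heap pools such as the code cache
            }
            MemoryUsage usage = pool.getUsage();
            if (usage != null && usage.getMax() > biggest) {
                biggest = usage.getMax();
            }
        }
        return biggest;
    }

    /** 70% of the given pool size (assumed fraction, as quoted in the thread). */
    static long spillThreshold(long poolMax) {
        return poolMax * 7 / 10; // integer math avoids floating-point rounding
    }

    public static void main(String[] args) {
        long heapMax = Runtime.getRuntime().maxMemory();
        long poolMax = largestHeapPoolMax();
        System.out.println("JVM max heap (bytes):          " + heapMax);
        System.out.println("Largest heap pool max (bytes): " + poolMax);
        if (poolMax > 0) {
            System.out.println("Approx. 70% threshold (bytes): " + spillThreshold(poolMax));
        }
    }
}
```

Running it with different settings, for example `java -Xmx200m HeapThresholdCheck` and then `java -Xmx1024m HeapThresholdCheck`, should print visibly different numbers. If the "max" value in the task log does not change in the same way after editing mapred.child.java.opts, the setting is probably not reaching the child JVMs.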
On Fri, Nov 2, 2012 at 3:17 AM, W W <[EMAIL PROTECTED]> wrote:

> hi Dmitriy
> Thanks for your explanation!
> I think splitting on $2 is not easy, because what I am actually doing is
> rolling up a table, which means the results cannot be obtained by a join.
> Here is the whole script with its schema, although I omitted many FLATTENs.
>
> IDF_VALID= LOAD '/user/hadoop/idf.dat'
> USING PigStorage('^A') AS (
>
>   ast_id : int,
>   value : chararray,
>   pro_id : int,
>   pag_id : int,
>   bgr_id : int
>
> );
>
> grouped_recs= GROUP IDF_VALID BY ast_id PARALLEL 40;
>
> rollup= FOREACH grouped_recs {
>
>         bombay_code= FILTER IDF_VALID BY $2 == 76 ;
>         singapore_code= FILTER IDF_VALID BY $2 == 90 ;
>
> GENERATE
>
>         FLATTEN(group) as nda_id,
>         FLATTEN((IsEmpty(bombay_code)?null:bombay_code.$1)) AS bombay_code,
>         FLATTEN((IsEmpty(singapore_code)?null:singapore_code.$1)) AS singapore_code;
>
> }
>
> STORE rollup INTO 'idf-out-full' USING PigStorage('^A');
>
>
>
> Besides, how can I "increase the amount of available heap"? I've changed
> mapred.child.java.opts from -Xmx200m to -Xmx1024m. It doesn't seem to
> help, and that threshold value is still the same.
> When I monitor the Java process with the top command, the setting of
> mapred.child.java.opts seems to have NO influence on either VIRT or RES.
> It looks as if mapred.child.java.opts has been overridden by Pig.
> Do you have any idea about that?
>
> Thanks and Regards
> Xingbang
>
>
>
> 2012/11/2 Dmitriy Ryaboy <[EMAIL PROTECTED]>
>
> > Rather than increase memory, rewrite the script so it does not need so much
> > ram to begin with.
> > You can split on $2, group and generate what you need, then join things
> > back.
> > Hard to tell what exactly you are going for without schemas and expected
> > inputs/outputs.
> >
> > If the hadoop configs are the same, the fact that it's the powerful machine
> > that fails doesn't mean anything -- you are running out of RAM, and you
> > gave all machines the same amount of RAM for the reduce processes. It just
> > happens to be the one that a big group is hashing to.
> >
> > The threshold you are asking about is the threshold after which Pig will
> > try to spill what it can, since GC is imminent. It's defined as 70% of the
> > largest memory pool found on the jvm. This threshold itself is not what you
> > want to increase -- you want to increase the amount of available heap if
> > possible.
> >
> > You can set pig.spill.gc.activation.size (invoke GC if we managed to spill
> > at least this much) and pig.spill.size.threshold (how big a spill must be
> > before it makes sense to spill anything) if you want.
> >
> > D
> >
> >
> >
> >
> > On Thu, Nov 1, 2012 at 2:59 AM, W W <[EMAIL PROTECTED]> wrote:
> >
> > > hello
> > >
> > > I have just come across a problem with SpillableMemoryManager.
> > > I've searched many discussions containing this keyword, but they are all
> > > different from my problem.
> > >
> > > The problem is
> > >
> > > When I run a Pig script, it takes longer to finish the same task on the
> > > powerful machine. The part of the task node's log that is not clear to
> > > me is
> > >
> > > Weak Node:
> > >
> > > 2001-06-28 04:04:39,356 INFO
> > > org.apache.pig.impl.util.SpillableMemoryManager: first memory handler
> > > call - Collection threshold init = 86048768(84032K) used =
> > > 86048752(84031K) committed = 125304832(122368K) max =
> > > 139853824(136576K)
> > > 2001-06-28 04:04:39,940 INFO