Pig >> mail # user >> About SpillableMemoryManager


About SpillableMemoryManager
Hello,

I have just come across a problem with SpillableMemoryManager.
I've searched many discussions containing this keyword, but they are all
different from my problem.

The problem is:

When I run a Pig script, the same task takes longer to finish on the more
powerful machines. The part of the task node's log that is not clear to me
is below.

Weak Node:

2001-06-28 04:04:39,356 INFO org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call - Collection threshold init = 86048768(84032K) used = 86048752(84031K) committed = 125304832(122368K) max = 139853824(136576K)
2001-06-28 04:04:39,940 INFO org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call - Usage threshold init = 86048768(84032K) used = 98041880(95744K) committed = 125304832(122368K) max = 139853824(136576K)
2001-06-28 04:06:10,048 INFO org.apache.hadoop.mapred.Task: Task:attempt_201211010504_0007_r_000018_0 is done. And is in the process of commiting
Powerful Node:

2012-11-01 06:12:56,801 INFO org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call - Usage threshold init = 139853824(136576K) used = 99240424(96914K) committed = 139853824(136576K) max = 139853824(136576K)
2012-11-01 06:13:22,733 INFO org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call - Collection threshold init = 139853824(136576K) used = 77466824(75651K) committed = 139853824(136576K) max = 139853824(136576K)
2012-11-01 06:15:41,178 INFO org.apache.hadoop.mapred.Task: Task:attempt_201211010504_0007_r_000014_0 is done. And is in the process of commiting
My question is: how can I control the numbers reported after "Usage
threshold init", "Collection threshold init", and so on? It seems I can't
set them in the config files. Do they default to values derived from the
hardware?
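As far as I can tell from Pig's source, SpillableMemoryManager does not read these numbers from a config file at all: it registers a listener on the JVM's tenured (old-generation) memory pool via java.lang.management and logs that pool's init/used/committed/max figures, so they are determined by the task JVM's heap settings. A minimal sketch (the class name is mine) that prints the same pool on any HotSpot JVM:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;

public class ShowTenuredPool {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // SpillableMemoryManager picks the heap pool that supports usage
            // thresholds (the tenured/old generation) and sets its usage and
            // collection thresholds as fractions of the pool's max size.
            if (pool.getType() == MemoryType.HEAP && pool.isUsageThresholdSupported()) {
                System.out.println(pool.getName()
                        + " max = " + pool.getUsage().getMax()
                        + " committed = " + pool.getUsage().getCommitted());
            }
        }
    }
}
```

Running this with different -Xmx values shows the "max" figure move with the heap size, which would explain why the numbers look hardware-independent but config-dependent.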
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The description of the cluster:

I have a heterogeneous cluster with
 6 virtual machines, each with 4 cores and 8 GB of memory.
 4 physical machines, each with 24 cores and 32 GB of memory.

The Hadoop configs are the same for all nodes (I assigned the same number of
M/R slots to the powerful machines, even though that wastes some capacity).
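For context, the task-JVM heap that those thresholds are derived from is set per job via mapred.child.java.opts (Hadoop 1.x property name). This is a sketch of the mapred-site.xml entry, assuming the Hadoop default of -Xmx200m; under the default generational collector the tenured generation is roughly two-thirds of the heap, which would match the ~136576K "max" in the logs above:

```xml
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx200m</value>
</property>
```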
The Pig script that causes the problem:

grouped_recs = GROUP IDF_VALID BY ast_id PARALLEL 40;

rollup = FOREACH grouped_recs {
        bombay_code    = FILTER IDF_VALID BY $2 == 76;
        singapore_code = FILTER IDF_VALID BY $2 == 90;
        GENERATE
                FLATTEN(group) AS nda_id,
                FLATTEN((IsEmpty(bombay_code) ? null : bombay_code.$1)) AS bombay_code,
                FLATTEN((IsEmpty(singapore_code) ? null : singapore_code.$1)) AS singapore_code;
}
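For reference, the only spill-related knobs I could find documented for pig.properties control Pig's proactive spilling, not the memory-handler thresholds in the log. A sketch with what appear to be the documented defaults (shown for illustration, not as a recommendation):

```properties
# Minimum size (bytes) a spillable bag must reach before Pig's
# proactive spill will write it to disk.
pig.spill.size.threshold=5000000
# Proactive spilling only activates once the JVM has allocated
# at least this much memory (bytes).
pig.spill.gc.activation.size=40000000
```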

Thanks & Regards,
Xingbang