W W 2012-11-01, 09:59
Dmitriy Ryaboy 2012-11-02, 00:00
W W 2012-11-02, 10:17
Dmitriy Ryaboy 2012-11-02, 16:07
W W 2012-11-07, 10:06
Re: About SpillableMemoryManager
W W 2012-11-18, 16:32
The problem has been solved. It was related to bug PIG-2923 (PIG-2917,
PIG-2918) (refer to [EMAIL PROTECTED]). Dmitriy actually fixed it 2 months
ago. When I use pig-0.11, my problem is gone, and the GC time falls from
80s to 0.5s.

Thanks for your effort, Dmitriy.

Xingbang.Wang

2012/11/7 W W <[EMAIL PROTECTED]>

> Thanks for your help Dmitriy!
>
> I've found why the powerful machine is slower than the weak machine.
>
> The heap size is not the answer to the problem of the powerful machine
> being slower than the weak one.
>
> It's because the GC time on the powerful machine is more than twice
> that on the weak ones.
> In my case, the JVM by default assigns the powerful machine 18 GC threads
> (there are 24 cores on one node), while the weak machine gets only 4 GC
> threads (only 4 cores on the node). The memory is the same, so the
> overhead of GC on the powerful machine dominates.
>
> I think that's the main reason for my problem.
>
> Besides, I think the SurvivorRatio of the Java heap also contributes to
> that. My guess is that for Pig, most of the data in the flow will be
> garbage collected sooner or later, so if the Survivor area is too big
> (given that the New Generation in the JVM is constant), the Eden area is
> smaller, and then more GC is needed. There should be a pivotal point for
> the SurvivorRatio.
>
> My solution is to add the following to mapred-site.xml.
>
> <property>
>   <name>mapred.child.java.opts</name>
>   <value>-XX:ParallelGCThreads=4 -XX:SurvivorRatio=20</value>
> </property>
>
> Thanks
> Regards
> Xingbang Wang
>
> 2012/11/3 Dmitriy Ryaboy <[EMAIL PROTECTED]>
>
>> mapred.child.java.opts should be in the gigabytes, 200M is way too low.
>> Check this stack overflow thread for comments on how to ensure your
>> setting actually takes effect -- it's possible you are not propagating
>> it to the job. If you change it in the hadoop config files, you need to
>> restart the MR daemons (JT and TTs).
>> http://stackoverflow.com/questions/8464048/out-of-memory-error-in-hadoop
>>
>> I'll take a look at your script next time I have a few minutes, but try
>> this first -- 200M is definitely too low to get much done in Hadoop.
>>
>> On Fri, Nov 2, 2012 at 3:17 AM, W W <[EMAIL PROTECTED]> wrote:
>>
>> > hi Dmitriy
>> > Thanks for your explanation!
>> > I think splitting on $2 is not easy because what I am doing is
>> > actually rolling up a table, which means the results cannot be
>> > obtained by a join.
>> > Here is the whole script with schema, although I omitted many
>> > FLATTENs.
>> >
>> > IDF_VALID = LOAD '/user/hadoop/idf.dat'
>> >     USING PigStorage('^A') AS (
>> >         ast_id : int,
>> >         value  : chararray,
>> >         pro_id : int,
>> >         pag_id : int,
>> >         bgr_id : int
>> >     );
>> >
>> > grouped_recs = GROUP IDF_VALID BY ast_id PARALLEL 40;
>> >
>> > rollup = FOREACH grouped_recs {
>> >     bombay_code = FILTER IDF_VALID BY $2 == 76;
>> >     singapore_code = FILTER IDF_VALID BY $2 == 90;
>> >
>> >     GENERATE
>> >         FLATTEN(group) AS nda_id,
>> >         FLATTEN((IsEmpty(bombay_code)?null:bombay_code.$1)) AS bombay_code,
>> >         FLATTEN((IsEmpty(singapore_code)?null:singapore_code.$1)) AS singapore_code;
>> > }
>> >
>> > STORE rollup INTO 'idf-out-full' USING PigStorage('^A');
>> >
>> > Besides, how can I "increase the amount of available heap"? I've
>> > changed mapred.child.java.opts from -Xmx200m to -Xmx1024m. It doesn't
>> > seem to help, and that threshold value is still the same.
>> > When I monitor the java process with the top command, the setting of
>> > mapred.child.java.opts seems to have NO influence on either VIRT or
>> > RES. It seems mapred.child.java.opts has been overridden by Pig.
>> > Do you have any idea about that?
>> >
>> > Thanks and Regards
>> > Xingbang
>> >
>> > 2012/11/2 Dmitriy Ryaboy <[EMAIL PROTECTED]>
>> >
>> > > Rather than increase memory, rewrite the script so it does not need so