-
how to control the number of mappers? Yang 2012-01-12, 02:12
I have a Pig script that is basically a map-only job:

    raw = LOAD 'input.txt';
    processed = FOREACH raw GENERATE convert_somehow($1,$2...);
    STORE processed INTO 'output.txt';

I have many nodes on my cluster, so I want Pig to process the input with more mappers, but it generates only 2 part-m-xxxxx files, i.e. it uses 2 mappers.

In a Hadoop job it's possible to pass the mapper count and -Dmapred.min.split.size= . Would this also work for Pig? The PARALLEL keyword only works for reducers.

Thanks
Yang
-
Re: how to control the number of mappers? Prashant Kommireddi 2012-01-12, 02:21
Hi Yang,

You cannot really control the number of mappers directly (it depends on the input splits), but you can spawn more mappers in various ways, such as reducing the block size or setting pig.splitCombination to false (this *might* create more maps).

The level of parallelization depends on how much data the 2 mappers are handling. You would not want a lot of maps handling too little data. For example, if your input data set is only a few MB, it would not be a good idea to have more than 1 or 2 maps.

Thanks,
Prashant
-
Re: how to control the number of mappers? Yang 2012-01-12, 02:27
Prashant:

thanks. By "reducing the block size", do you mean the split size? The block size is fixed on a Hadoop HDFS.

My application is not really data heavy; each line of input takes a long while to process. As a result, the input size is small, but the total processing time is long and the potential parallelism is high.

Yang
-
Re: how to control the number of mappers? Prashant Kommireddi 2012-01-12, 03:46
By block size I mean the actual HDFS block size. Based on your requirement, it seems like the input files are extremely small, so reducing the block size is not an option.

Specifying "mapred.min.split.size" would not work in either Hadoop/Java MR or Pig. Hadoop only picks the maximum of (minSplitSize, blockSize).

Your job is more CPU intensive than I/O intensive. I can think of splitting your files into multiple input files (equal to the number of map tasks on your cluster) and turning off split combination (pig.splitCombination=false). Though this is generally a terrible MR practice!

Another thing you could try is to give more memory to your map tasks by increasing "mapred.child.java.opts" to a higher value.

Thanks,
Prashant
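The point about the minimum split size can be illustrated with a little arithmetic. The following is only a rough sketch, assuming the split-size rule used by Hadoop's FileInputFormat in its newer API, max(minSize, min(maxSize, blockSize)); with no upper cap this reduces to the max(minSplitSize, blockSize) quoted above. The function names and the example sizes are hypothetical, and the estimate ignores the slack Hadoop allows on the last split as well as Pig's split combination.

```python
def compute_split_size(block_size, min_size, max_size):
    """Split size rule assumed here: max(min_size, min(max_size, block_size))."""
    return max(min_size, min(max_size, block_size))


def estimate_num_splits(file_size, block_size, min_size=1, max_size=2**63 - 1):
    """Rough number of splits (and so map tasks) for one splittable file."""
    if file_size <= 0:
        return 0
    split_size = compute_split_size(block_size, min_size, max_size)
    # Integer ceiling division: a partial final split still needs a mapper.
    return (file_size + split_size - 1) // split_size


MB = 1024 * 1024

# Hypothetical 100 MB input on a cluster with 64 MB blocks.
print(estimate_num_splits(100 * MB, 64 * MB))                    # 2
# A small minimum split size changes nothing: the block size still wins.
print(estimate_num_splits(100 * MB, 64 * MB, min_size=10000))    # 2
# Only an upper cap below the block size yields more splits.
print(estimate_num_splits(100 * MB, 64 * MB, max_size=10 * MB))  # 10
```

Under these assumptions, lowering the minimum split size cannot increase the mapper count; only a smaller effective block size or an upper cap on the split size can.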
-
Re: how to control the number of mappers? Dmitriy Ryaboy 2012-01-12, 05:52
Yes, you can use the "set" keyword to set such properties in the script.
-
Re: how to control the number of mappers? Yang 2012-01-17, 20:46
thanks, but from http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#set
it looks like the params that can be 'set' are very limited, and they do not include the min split size and mapper count that I want.
-
Re: how to control the number of mappers? Yang 2012-01-17, 20:53
Prashant:

I tried splitting the input files. Yes, that worked, and multiple mappers were indeed created.

But then I would have to create a separate stage simply to split the input files, so that is a bit cumbersome. It would be nice if there were some control to directly limit the map input file size etc.

Thanks
Yang
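The separate splitting stage described above could look roughly like the following. This is a minimal sketch, not the script actually used in this thread: the function name split_lines, the part-NNNNN.txt naming, and the round-robin distribution of lines are all assumptions made for illustration. It works on a local text file; the resulting directory would still have to be copied to HDFS and loaded by Pig as a directory, with pig.splitCombination set to false as suggested earlier.

```python
from pathlib import Path


def split_lines(input_path, output_dir, num_parts):
    """Distribute the lines of a text file round-robin over num_parts files.

    Returns the list of part file paths. Round-robin keeps the parts within
    one line of each other in size, which suits jobs whose cost is per line.
    """
    if num_parts < 1:
        raise ValueError("num_parts must be at least 1")

    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical naming scheme; any set of distinct file names would do.
    paths = [out_dir / f"part-{i:05d}.txt" for i in range(num_parts)]
    handles = [p.open("w", encoding="utf-8") for p in paths]
    try:
        with open(input_path, encoding="utf-8") as src:
            for line_no, line in enumerate(src):
                handles[line_no % num_parts].write(line)
    finally:
        for handle in handles:
            handle.close()
    return paths
```

For example, split_lines('input.txt', 'input_parts', 10) would write ten part files into a local input_parts directory, one per intended map task.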
-
Re: how to control the number of mappers? Dmitriy Ryaboy 2012-01-17, 21:15
http://pig.apache.org/docs/r0.9.1/cmds.html#set
"All Pig and Hadoop properties can be set, either in the Pig script or via the Grunt command line."
-
Re: how to control the number of mappers? Yang 2012-01-17, 21:20
weird

I tried:

    # head a.pg
    set job.name 'blah';
    SET mapred.map.tasks.speculative.execution false;
    set mapred.min.split.size 10000;
    set mapred.tasktracker.map.tasks.maximum 10000;

    [root@]# pig a.pg
    2012-01-17 16:19:18,407 [main] INFO org.apache.pig.Main - Logging error messages to: /mnt/pig_1326835158407.log
    2012-01-17 16:19:18,564 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://ec2-107-22-118-169.compute-1.amazonaws.com:8020/
    2012-01-17 16:19:18,749 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: ec2-107-22-118-169.compute-1.amazonaws.com:8021
    2012-01-17 16:19:18,858 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Unrecognized set key: mapred.map.tasks.speculative.execution
    Details at logfile: /mnt/pig_1326835158407.log

    Pig Stack Trace
    ---------------
    ERROR 1000: Error during parsing. Unrecognized set key: mapred.map.tasks.speculative.execution

    org.apache.pig.tools.pigscript.parser.ParseException: Unrecognized set key: mapred.map.tasks.speculative.execution
        at org.apache.pig.tools.grunt.GruntParser.processSet(GruntParser.java:459)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:429)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
        at org.apache.pig.Main.main(Main.java:397)
    ===============================================================================

So the job.name param is accepted, but the next one, mapred.map......, was unrecognized, even though that is the one I pasted from the docs page.
-
Re: how to control the number of mappers? Yang 2012-01-17, 21:28
ok, I see, I was using Pig 0.5.
I tried 0.9 and it works now. thanks!