Re: how to control the number of mappers?
Weird.

I tried

# head a.pg

set job.name 'blah';
SET mapred.map.tasks.speculative.execution false;
set mapred.min.split.size 10000;

set mapred.tasktracker.map.tasks.maximum 10000;
[root@]# pig a.pg
2012-01-17 16:19:18,407 [main] INFO  org.apache.pig.Main - Logging error messages to: /mnt/pig_1326835158407.log
2012-01-17 16:19:18,564 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://ec2-107-22-118-169.compute-1.amazonaws.com:8020/
2012-01-17 16:19:18,749 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: ec2-107-22-118-169.compute-1.amazonaws.com:8021
2012-01-17 16:19:18,858 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Unrecognized set key: mapred.map.tasks.speculative.execution
Details at logfile: /mnt/pig_1326835158407.log
Pig Stack Trace
---------------
ERROR 1000: Error during parsing. Unrecognized set key: mapred.map.tasks.speculative.execution

org.apache.pig.tools.pigscript.parser.ParseException: Unrecognized set key: mapred.map.tasks.speculative.execution
        at org.apache.pig.tools.grunt.GruntParser.processSet(GruntParser.java:459)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:429)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
        at org.apache.pig.Main.main(Main.java:397)
===============================================================================

So the job.name param is accepted, but the next one, mapred.map.tasks.speculative.execution, was
unrecognized, even though that is exactly the key I pasted from the docs page.
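
A commonly suggested workaround when SET rejects a key like this, on Pig versions whose SET command only recognizes a short fixed list of keys, is to pass the Hadoop property outside the script. The snippet below is only a sketch: it assumes the pig launcher script forwards -D arguments to the JVM and that this Pig build copies JVM system properties into the job configuration, which can differ by version.

# sketch only: pass the Hadoop properties as JVM system properties instead of SET;
# the -D options have to come before the script name so the pig wrapper forwards them
pig -Dmapred.map.tasks.speculative.execution=false \
    -Dmapred.min.split.size=10000 \
    a.pg

The same keys can usually also go into conf/pig.properties if they should apply to every script.
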
On Tue, Jan 17, 2012 at 1:15 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:

> http://pig.apache.org/docs/r0.9.1/cmds.html#set
>
> "All Pig and Hadoop properties can be set, either in the Pig script or via
> the Grunt command line."
>
> On Tue, Jan 17, 2012 at 12:53 PM, Yang <[EMAIL PROTECTED]> wrote:
>
> > Prashant:
> >
> > I tried splitting the input files, yes that worked, and multiple mappers
> > were indeed created.
> >
> > but then I would have to create a separate stage simply to split the input
> > files, so that is a bit cumbersome. It would be nice if there were some
> > control to directly limit the input size per map task, etc.
> >
> > Thanks
> > Yang
> >
> > On Wed, Jan 11, 2012 at 7:46 PM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:
> >
> > > By block size I mean the actual HDFS block size. Based on your requirement
> > > it seems like the input files are extremely small and reducing the block
> > > size is not an option.
> > >
> > > Specifying "mapred.min.split.size" would not help here, whether you use
> > > Hadoop/Java MR or Pig. Hadoop only picks the maximum of (minSplitSize, blockSize).
> > >
> > > Your job is more CPU-intensive than I/O-intensive. I can think of splitting
> > > your files into multiple input files (equal to the # of map tasks on your
> > > cluster) and turning off split combination (pig.splitCombination=false),
> > > though this is generally a terrible MR practice!
> > >
> > > Another thing you could try is to give more memory to your map tasks by
> > > increasing "mapred.child.java.opts" to a higher value.
> > >
> > > Thanks,
> > > Prashant
> > >
> > >
> > > On Wed, Jan 11, 2012 at 6:27 PM, Yang <[EMAIL PROTECTED]> wrote:
> > >
> > > > Prashant:
> > > >
> > > > thanks.
> > > >
> > > > by "reducing the block size", do you mean split size ? ---- block
> size
> > > > is fixed on a hadoop hdfs.
> > > >
> > > > my application is not really data-heavy; each line of input takes a
> > > > long while to process. As a result, the input size is small, but the total
> > > > processing time is long, and the potential parallelism is high.
> > > >
> > > > Yang
> > > >
> > > > On Wed, Jan 11, 2012 at 6:21 PM, Prashant Kommireddi
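
Pulling the quoted suggestions together, a minimal sketch of getting roughly one map task per pre-split input file: turn off Pig's split combination and, optionally, give each map task more heap. The property names come from the thread above; the heap size is only an illustrative value, and the properties are passed with -D on the assumption that this Pig version's SET command would reject them, as it did above.

# sketch only: the input has already been split by hand into many small files,
# one per desired map task; pig.splitCombination=false stops Pig from merging
# the small splits back into a single map, and the -Xmx value is just an
# example of raising mapred.child.java.opts
pig -Dpig.splitCombination=false \
    -Dmapred.child.java.opts=-Xmx1024m \
    a.pg

With split combination off, each small input file should end up in its own map task, which matches what Yang saw when he split the files manually.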