Re: limit map tasks for load function
Thanks for your answer! How can I set the mapred.tasktracker.map.tasks.maximum
value only for this specific job? For example, the Pig script creates 8 jobs,
and I only want to modify this value for the first job. I think there is no
option in Pig Latin to influence this value?
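
For reference, Pig Latin does have a generic SET statement for passing
configuration properties to the jobs a script launches, but SET applies to the
whole script rather than to a single job, and mapred.tasktracker.map.tasks.maximum
is normally read by each TaskTracker daemon from its own configuration at
startup, so overriding it per script or per job is unlikely to have an effect.
A minimal sketch of the syntax (property names and values here are only examples):

    -- Sketch only: SET applies to every job this script launches, and the
    -- TaskTracker-side slot limit may ignore a per-script override.
    SET mapred.tasktracker.map.tasks.maximum '1';
    SET job.name 'load-and-join';   -- other Hadoop/Pig properties are passed the same way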

kind regards
2013/11/4 Pradeep Gollakota <[EMAIL PROTECTED]>

> I think you’re misunderstanding how HBaseStorage works. HBaseStorage uses
> the HBaseInputFormat underneath the hood. The number of map tasks that are
> spawned is dependent on the number of regions you have. The map tasks are
> spawned such that the tasks are local to the regions they’re reading from.
> You will typically not have to worry about problems such as this with
> MapReduce. If you do have some performance concerns, you can set the
> mapred.tasktracker.map.tasks.maximum setting in the job conf and it will
> not affect all the other jobs.
>
>
> On Sun, Nov 3, 2013 at 3:04 PM, John <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > is it possible to limit the number of map slots used for the load
> > function? For example, I have 5 nodes with 10 map slots (each node has
> > 2 slots, one per CPU) and I want only one map task per node. Is there a
> > way to set this only for the load function? I know there is an option
> > called "mapred.tasktracker.map.tasks.maximum", but this would influence
> > every MapReduce job. I want to influence the number only for this
> > specific job.
> >
> > My use case is the following: I'm using a modified version of the
> > HBaseStorage function. I try to load, for example, 10 different rowkeys
> > with very big column sizes and join them afterwards. Since the columns
> > all have the same column family, every row can be stored on a different
> > server. For example, rowkeys 1-5 are stored on node1 and the other
> > rowkeys on the other nodes. So if I create a Pig script to load the 10
> > keys and join them afterwards, this ends up in 1 MapReduce job with 10
> > map tasks and some reduce tasks (depending on the parallel factor). The
> > problem is that 2 map tasks will be created on node1, because there are
> > 2 slots available. This means every task is simultaneously reading a
> > large number of columns from the local hard drive. Maybe I'm wrong, but
> > this should be a performance issue?! It should be faster to read each
> > rowkey one after another!?
> >
> > kind regards
> >
>
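
To make the point about regions above concrete: with the stock HBaseStorage,
a LOAD is split into one map task per HBase region of the table, and the
tasks are scheduled for locality, so the number and placement of map tasks
follows the region layout rather than any Pig setting. A rough sketch of
such a script (the table names, column family, join keys, and output path
are placeholders, and this is not the poster's modified loader):

    -- Sketch only; 'mytable', 'othertable', 'cf' and '/tmp/joined' are placeholders.
    -- Each LOAD becomes one map task per region of its table.
    SET default_parallel 5;   -- default number of reducers ("parallel factor")

    a = LOAD 'hbase://mytable'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:*', '-loadKey true')
        AS (rowkey:bytearray, columns:map[]);

    b = LOAD 'hbase://othertable'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:*', '-loadKey true')
        AS (rowkey:bytearray, columns:map[]);

    -- PARALLEL sets the reduce count for this join only, overriding default_parallel.
    joined = JOIN a BY rowkey, b BY rowkey PARALLEL 5;

    STORE joined INTO '/tmp/joined' USING PigStorage();

If the goal is really one map task per node for a particular load, that is
determined by the InputFormat's split generation (the regions, in the HBase
case), not by slot settings, which is essentially the point made above.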