Re: limit map tasks for load function
Pradeep Gollakota 2013-11-03, 23:58
I think you’re misunderstanding how HBaseStorage works. HBaseStorage uses
HBaseInputFormat under the hood. The number of map tasks that are spawned
depends on the number of regions you have, and the tasks are spawned so
that each one is local to the region it reads from. You typically don't
have to worry about problems like this with MapReduce. If you do have
performance concerns, you can set the
mapred.tasktracker.map.tasks.maximum setting in the job conf and it will
not affect other jobs.
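
A rough sketch of how that could look in a Pig script (the table name
'mytable' and column family 'cf' are placeholders, and whether the
TaskTracker honours a job-level override of this property depends on your
Hadoop version and scheduler):

  -- cap the map slots used per node for this job only
  -- (assumption: the override is picked up from the job conf)
  set mapred.tasktracker.map.tasks.maximum 1;

  -- load the whole column family, with the rowkey as the first field
  raw = LOAD 'hbase://mytable'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:*', '-loadKey true')
        AS (rowkey:chararray, columns:map[]);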
On Sun, Nov 3, 2013 at 3:04 PM, John <[EMAIL PROTECTED]> wrote:

> Hi,
>
> Is it possible to limit the number of map slots used for the load function?
> For example, I have 5 nodes with 10 map slots (each node has 2 slots per
> CPU). I want only one map task per node. Is there a way to set this only
> for the load function? I know there is an option called
> "mapred.tasktracker.map.tasks.maximum",
> but this would influence every MapReduce job. I want to influence the
> number only for this specific job.
>
> My use case is the following: I'm using a modified version of the
> HBaseStorage function. I try to load, for example, 10 different rowkeys
> with very big column sizes and join them afterwards. Since the columns all
> have the same column family, every row can be stored on a different
> server. For example, rowkeys 1-5 are stored on node1 and the other rowkeys
> on the other nodes. So if I create a Pig script to load the 10 keys and
> join them afterwards, this ends up as 1 MapReduce job with 10 map tasks
> and some reduce tasks (depending on the parallel factor). The problem is
> that 2 map tasks will be created on node1, because there are 2 slots
> available. This means both tasks are simultaneously reading a large number
> of columns from the local hard drive. Maybe I'm wrong, but this should be
> a performance issue?! It should be faster to read each rowkey one after
> another!?
>
> kind regards
>
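
For reference, a rough Pig Latin sketch of the kind of load-and-join
described above (the table name 'mytable', column family 'cf', and the key
ranges are assumptions for illustration; the modified HBaseStorage John
mentions would take the place of the stock one, and the exact row-key range
option names should be checked against your HBaseStorage version):

  -- load two key ranges from the same table
  a = LOAD 'hbase://mytable'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:*', '-loadKey true -gte row01 -lt row06')
      AS (rowkey:chararray, columns:map[]);
  b = LOAD 'hbase://mytable'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:*', '-loadKey true -gte row06 -lt row11')
      AS (rowkey:chararray, columns:map[]);

  -- join the two relations; PARALLEL controls the number of reduce tasks
  joined = JOIN a BY rowkey, b BY rowkey PARALLEL 5;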