limit map tasks for load function
John 2013-11-03, 23:04
Is it possible to limit the number of map slots used for the load function?
For example, I have 5 nodes with 10 map slots (each node has 2 slots for
every CPU). I want only one map task on every node. Is there a way to set
this only for the load function? I know there is an option called
but this would influence every MapReduce job. I want to influence the
number only for this specific job.
My use case is the following: I'm using a modified version of the
HBaseStorage function. I try to load, for example, 10 different rowkeys with
very big column sizes and join them afterwards. Since the columns all have
the same column family, every row can be stored on a different server. For
example, rowkeys 1-5 are stored on node1 and the other rowkeys on the
other nodes. So if I create a Pig script to load the 10 keys and join them
afterwards, this will end up in 1 MapReduce job with 10 map tasks and some
reduce tasks (depending on the parallel factor). The problem is that
2 map tasks will be created on node1, because there are 2 slots available.
This means these tasks are simultaneously reading a large number of columns
from the local hard drive. Maybe I'm wrong, but this should be a
performance issue?! It should be faster to read each rowkey one after