i have a matrix that i am performing operations on. it is 10,000 rows by
5,000 columns. the total size of the file is just under 30 MB. my HDFS
block size is set to 64 MB. from what i understand, the number of mappers
is roughly equal to the number of HDFS blocks used in the input. i.e. if my
input data spans 1 block, then only 1 mapper is created, if my data spans 2
blocks, then 2 mappers will be created, etc...
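
to make my assumption concrete, here is a rough sketch of the arithmetic as i understand it. the function name `estimated_mappers` is my own (it is not a Hadoop API), and it assumes exactly one mapper per HDFS block spanned by the input, which may not be how Hadoop really computes its splits.

```python
import math

def estimated_mappers(file_size_mb, block_size_mb=64):
    # my assumption: one mapper per HDFS block that the input file spans
    # (hypothetical helper for illustration, not part of Hadoop)
    return max(1, math.ceil(file_size_mb / block_size_mb))

if __name__ == "__main__":
    for size_mb in (30, 65, 130):
        print(size_mb, "MB ->", estimated_mappers(size_mb), "mapper(s)")
```

under this assumption, a file of just under 30 MB with a 64 MB block size comes out at a single mapper.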
so, with my 1 matrix file of just under 30 MB, this won't fill up a block of data,
and as such, only 1 mapper will be called upon the data. is this correct?
if so, what i want to happen is for more than one mapper (let's say 10) to
work on the data, even though it remains on 1 block. my analysis (or
map/reduce job) is such that more than 1 mapper can work on different parts of
the matrix. for example, mapper 1 can work on the first 500 rows, mapper 2 can
work on the next 500 rows, etc... how can i set up multiple mappers (more than
1 mapper) to work on a file that resides on only one block (or a file whose size
is smaller than the HDFS block size)?
can i split the matrix into (let's say) 10 files? that would mean 30 MB / 10
= 3 MB per file. then put each 3 MB file onto HDFS? will this increase the
chance of having multiple mappers work simultaneously on the data/matrix?
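
in case it helps to see what i mean by splitting, here is a minimal sketch of how i would cut the file up locally before uploading. it assumes the matrix is stored as plain text with one row per line (an assumption; i have not described my real format here), and the function name `split_matrix_file` and the `part-NNNNN.txt` file names are just made up for illustration.

```python
import os

def split_matrix_file(src_path, out_dir, num_parts=10):
    """Split a text matrix (assumed: one row per line) into files of contiguous rows."""
    # the whole file is read into memory, which should be fine for ~30 MB
    with open(src_path) as src:
        rows = src.readlines()
    os.makedirs(out_dir, exist_ok=True)
    rows_per_part = -(-len(rows) // num_parts)  # ceiling division
    paths = []
    for part in range(num_parts):
        chunk = rows[part * rows_per_part:(part + 1) * rows_per_part]
        if not chunk:
            break  # fewer rows than parts: stop once the rows run out
        path = os.path.join(out_dir, f"part-{part:05d}.txt")  # hypothetical naming
        with open(path, "w") as out:
            out.writelines(chunk)
        paths.append(path)
    return paths
```

with 10,000 rows and `num_parts=10` this would give 10 files of 1,000 rows each, which i would then copy into a single HDFS input directory (e.g. with `hadoop fs -put`) and point the job at that directory.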
if i can increase the number of mappers, i think (pretty sure) my
implementation will improve in speed linearly.
any help is appreciated.