-
how can i increase the number of mappers?
Jane Wayne 2012-03-21, 06:07
i have a matrix that i am performing operations on. it is 10,000 rows by 5,000 columns. the total size of the file is just under 30 MB. my HDFS block size is set to 64 MB. from what i understand, the number of mappers is roughly equal to the number of HDFS blocks used in the input. i.e. if my input data spans 1 block, then only 1 mapper is created, if my data spans 2 blocks, then 2 mappers will be created, etc...
so, with my 1 matrix file of just under 30 MB, this won't fill up a block of data, and as such, only 1 mapper will be called upon the data. is this understanding correct?
if so, what i want to happen is for more than one mapper (let's say 10) to work on the data, even though it remains on 1 block. my analysis (or map/reduce job) is such that multiple mappers can work on different parts of the matrix. for example, mapper 1 can work on the first 500 rows, mapper 2 can work on the next 500 rows, etc... how can i set up multiple mappers to work on a file that resides on only one block (or a file whose size is smaller than the HDFS block size)?
can i split the matrix into (let's say) 10 files? that will mean 30 MB / 10 = 3 MB per file. then put each 3 MB file onto HDFS ? will this increase the chance of having multiple mappers work simultaneously on the data/matrix? if i can increase the number of mappers, i think (pretty sure) my implementation will improve in speed linearly.
any help is appreciated.
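For reference, the arithmetic behind this question can be sketched in plain Java. This is a simplified model, not Hadoop's code: the class and method names are hypothetical, the file is rounded to exactly 30 MB, and the formula (split size = max(minSize, min(maxSize, blockSize)), with the last split allowed to run about 10% over) is assumed to match the new-API FileInputFormat in your release.

```java
// Simplified sketch of how a FileInputFormat-style job sizes its splits.
// Class and method names are hypothetical; only the formula mirrors Hadoop.
final class SplitMath {
    static final long MB = 1024L * 1024L;
    // Assumption: the last split may run up to 10% over the target size.
    static final double SPLIT_SLOP = 1.1;

    // splitSize = max(minSize, min(maxSize, blockSize))
    static long splitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits (and so map tasks) for one splittable file.
    static int countSplits(long fileSize, long splitSize) {
        int splits = 0;
        long remaining = fileSize;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining > 0) {
            splits++; // whatever is left becomes the final split
        }
        return splits;
    }

    public static void main(String[] args) {
        long fileSize = 30 * MB;  // the matrix file, rounded to 30 MB
        long blockSize = 64 * MB; // HDFS block size from the post

        // No effective maximum: the block size wins, so the file is one split.
        long defaultSize = splitSize(1, Long.MAX_VALUE, blockSize);
        System.out.println("default splits: " + countSplits(fileSize, defaultSize));

        // Capping the maximum split size at 3 MB yields ten splits.
        long cappedSize = splitSize(1, 3 * MB, blockSize);
        System.out.println("capped splits: " + countSplits(fileSize, cappedSize));
    }
}
```

Under these assumptions a single sub-block file gets one mapper by default, and lowering the maximum split size is one way to get more without cutting the file into pieces. Whether your Hadoop version exposes that cap (for example through FileInputFormat.setMaxInputSplitSize) is something to verify against its API docs.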
-
Re: how can i increase the number of mappers?
Anil Gupta 2012-03-21, 06:37
Have a look at the NLineInputFormat class in Hadoop. That class should serve your purpose.
Best Regards, Anil
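A rough sketch of why NLineInputFormat helps here: it hands each mapper a fixed number of input lines rather than a whole block, so a matrix stored one row per line is divided by rows. The plain-Java code below only models that grouping; the class and method names are hypothetical, and it is not Hadoop's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Models the grouping an N-lines-per-split input format performs.
// Names are hypothetical; this is not Hadoop's implementation.
final class LineSplits {
    // Returns {startRow, endRowExclusive} pairs, one per split (one per mapper).
    static List<int[]> rowRanges(int totalRows, int linesPerSplit) {
        if (linesPerSplit <= 0) {
            throw new IllegalArgumentException("linesPerSplit must be positive");
        }
        List<int[]> ranges = new ArrayList<>();
        for (int start = 0; start < totalRows; start += linesPerSplit) {
            // The last range may be shorter when totalRows is not a multiple of N.
            int end = Math.min(start + linesPerSplit, totalRows);
            ranges.add(new int[] {start, end});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 10,000 matrix rows at 1,000 rows per split gives 10 mappers.
        List<int[]> ranges = rowRanges(10_000, 1_000);
        System.out.println(ranges.size() + " mappers");
        for (int[] r : ranges) {
            System.out.println("rows " + r[0] + " to " + (r[1] - 1));
        }
    }
}
```

In a real job the lines-per-split value is set on the job configuration (newer releases offer NLineInputFormat.setNumLinesPerSplit for this, which is worth confirming for your version). Choosing 1,000 gives the 10 mappers asked about; 500 would give 20.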
-
Re: how can i increase the number of mappers?
Jane Wayne 2012-03-21, 07:33
as i understand, that class does not exist for the new API in hadoop v0.20.2 (which is what i am using). if i am mistaken, where is it?
i am looking at hadoop v1.0.1, and there is an NLineInputFormat class for the new API. i wonder if i can simply copy/paste this into my project.
-
Re: how can i increase the number of mappers?
Jane Wayne 2012-03-21, 16:10
if anyone is facing the same problem, here's what i did. i took anil's advice to use NLineInputFormat (because that approach would scale out my mappers).
however, i am using the new mapreduce package/API in hadoop v0.20.2. i notice that you cannot use the NLineInputFormat from the old package/API (mapred) with the new API.
when i took a look at hadoop v1.0.1, there is an NLineInputFormat class for the new API. i simply copied and pasted this file into my project. i got 4 errors associated with import statements and annotations. when i removed the 2 import statements and the corresponding 2 annotations, the class compiled successfully. after this modification, running the v1.0.1 NLineInputFormat on a cluster based on v0.20.2 works.
one mini-problem solved, many more to go.
thanks for the help.
-
Re: how can i increase the number of mappers?
Wei Shung Chung 2012-03-21, 17:12
Great info :)