Hadoop >> mail # user >> how can i increase the number of mappers?


Jane Wayne 2012-03-21, 06:07
Anil Gupta 2012-03-21, 06:37
Jane Wayne 2012-03-21, 07:33

Re: how can i increase the number of mappers?
if anyone is facing the same problem, here's what i did. i took anil's
advice to use NLineInputFormat (because that approach would scale out my
mappers).

however, i am using the new mapreduce package/API in hadoop v0.20.2, and
you cannot use the NLineInputFormat from the old package/API (mapred)
with it.

when i took a look at hadoop v1.0.1, there is an NLineInputFormat class
for the new API. i simply copied and pasted this file into my project. i
got 4 compile errors associated with import statements and annotations.
when i removed the 2 import statements and the corresponding 2
annotations, the class compiled successfully. after this modification,
the v1.0.1 NLineInputFormat runs fine on a v0.20.2 cluster.
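
a minimal sketch of the driver setup described above (assuming the class
copied from v1.0.1 keeps its setNumLinesPerSplit API; MatrixDriver,
MatrixMapper, and the paths are illustrative placeholders, not something
from this thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat; // the file copied from v1.0.1
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MatrixDriver {

    // illustrative mapper: each map() call receives one matrix row as a text line
    public static class MatrixMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text row, Context context)
                throws java.io.IOException, InterruptedException {
            // ... operate on a single row of the matrix here ...
            context.write(offset, row);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "matrix-ops");
        job.setJarByClass(MatrixDriver.class);

        // split every 500 input lines: a 10,000-row matrix then gets
        // ~20 mappers even though the whole file fits in one 64 MB block
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 500);

        job.setMapperClass(MatrixMapper.class);
        job.setNumReduceTasks(0); // map-only, for illustration
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        NLineInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}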

one mini-problem solved, many more to go.

thanks for the help.

On Wed, Mar 21, 2012 at 3:33 AM, Jane Wayne <[EMAIL PROTECTED]> wrote:

> as i understand, that class does not exist for the new API in hadoop v0.20.2
> (which is what i am using). if i am mistaken, where is it?
>
> i am looking at hadoop v1.0.1, and there is an NLineInputFormat class. i
> wonder if i can simply copy/paste it into my project.
>
>
> On Wed, Mar 21, 2012 at 2:37 AM, Anil Gupta <[EMAIL PROTECTED]> wrote:
>
>> Have a look at the NLineInputFormat class in Hadoop. That class will serve
>> your purpose.
>>
>> Best Regards,
>> Anil
>>
>> On Mar 20, 2012, at 11:07 PM, Jane Wayne <[EMAIL PROTECTED]>
>> wrote:
>>
>> > i have a matrix that i am performing operations on. it is 10,000 rows by
>> > 5,000 columns. the total size of the file is just under 30 MB. my HDFS
>> > block size is set to 64 MB. from what i understand, the number of mappers
>> > is roughly equal to the number of HDFS blocks used in the input. i.e. if
>> > my input data spans 1 block, then only 1 mapper is created; if my data
>> > spans 2 blocks, then 2 mappers will be created, etc...
>> >
>> > so, my 1 matrix file of just under 30 MB won't fill up a block of data,
>> > and being as such, only 1 mapper will be called upon the data. is this
>> > understanding correct?
>> >
>> > if so, what i want to happen is for more than one mapper (let's say 10)
>> > to work on the data, even though it resides on 1 block. my analysis (or
>> > map/reduce job) is such that multiple mappers can work on different
>> > parts of the matrix. for example, mapper 1 can work on the first 500
>> > rows, mapper 2 can work on the next 500 rows, etc... how can i set up
>> > multiple mappers to work on a file that resides on only one block (or a
>> > file whose size is smaller than the HDFS block size)?
>> >
>> > can i split the matrix into (let's say) 10 files? that will mean 30 MB /
>> > 10 = 3 MB per file. then put each 3 MB file onto HDFS? will this
>> > increase the chance of having multiple mappers work simultaneously on
>> > the data/matrix? if i can increase the number of mappers, i think
>> > (pretty sure) my implementation will improve in speed linearly.
>> >
>> > any help is appreciated.
>>
>
>
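
as background on the split math in the question above: the new-API
FileInputFormat computes the split size as max(minSize, min(maxSize,
blockSize)), so a ~30 MB file under a 64 MB block yields one split and
hence one mapper, which is exactly what NLineInputFormat works around. a
rough sketch of that arithmetic (all constants illustrative, taken from
the numbers in this thread):

public class SplitMath {
    public static void main(String[] args) {
        long fileSize  = 30L << 20;       // ~30 MB matrix file
        long blockSize = 64L << 20;       // HDFS block size
        long minSize   = 1L;              // default min split size
        long maxSize   = Long.MAX_VALUE;  // default max split size

        // new-API FileInputFormat rule: max(minSize, min(maxSize, blockSize))
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
        long numSplits = (fileSize + splitSize - 1) / splitSize;
        System.out.println("default splits (and mappers): " + numSplits); // 1

        // NLineInputFormat splits by line count instead of byte ranges:
        int rows = 10000, linesPerSplit = 500;
        System.out.println("NLineInputFormat mappers: " + (rows / linesPerSplit)); // 20
    }
}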
Wei Shung Chung 2012-03-21, 17:12