Hadoop mail # user: Help me with architecture of a somewhat non-trivial mapreduce implementation


Re: Help me with architecture of a somewhat non-trivial mapreduce implementation
You could also use the NLineInputFormat which will launch 1 mapper for every N (configurable) lines of input.
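A minimal sketch of that NLineInputFormat setup (old "mapred" API, Hadoop 0.20/1.x era; the driver and mapper class names are placeholders, not from this thread). With linespermap set to 10, a 4000-line URL file yields roughly 400 map tasks that the framework spreads over every available slot:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class ManifestDriver {

  // One record per line of the URL file; fetch and process the manifest here.
  public static class ManifestMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      out.collect(line, new Text("processed"));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ManifestDriver.class);
    conf.setJobName("manifest-fetch");
    conf.setInputFormat(NLineInputFormat.class);
    conf.setInt("mapred.line.input.format.linespermap", 10); // N lines per map task
    conf.setMapperClass(ManifestMapper.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}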
On 4/20/12 9:48 AM, "Sky" <[EMAIL PROTECTED]> wrote:

Thanks! That helped!

-----Original Message-----
From: Michael Segel
Sent: Thursday, April 19, 2012 9:38 PM
To: [EMAIL PROTECTED]
Subject: Re: Help me with architecture of a somewhat non-trivial mapreduce
implementation

If the file is small enough you could read it into a Java object like a
list and write your own input format that takes a list object as its input
and then lets you specify the number of mappers.
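A rough sketch of that idea (old "mapred" API): the driver stashes the small list in the JobConf, and a custom InputFormat carves it into however many splits are requested, so the job is not tied to HDFS block boundaries. Class and property names here are illustrative, not from the thread:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class ListInputFormat implements InputFormat<LongWritable, Text> {

  // A split is just an index range [start, end) into the in-memory list.
  public static class ListSplit implements InputSplit {
    int start, end;
    public ListSplit() {}                        // needed for deserialization
    ListSplit(int start, int end) { this.start = start; this.end = end; }
    public long getLength() { return end - start; }
    public String[] getLocations() { return new String[0]; }  // no data locality
    public void write(DataOutput out) throws IOException {
      out.writeInt(start); out.writeInt(end);
    }
    public void readFields(DataInput in) throws IOException {
      start = in.readInt(); end = in.readInt();
    }
  }

  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    String[] items = job.get("list.input.items", "").split("\n");
    int perSplit = (int) Math.ceil((double) items.length / Math.max(1, numSplits));
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (int i = 0; i < items.length; i += perSplit) {
      splits.add(new ListSplit(i, Math.min(i + perSplit, items.length)));
    }
    return splits.toArray(new InputSplit[splits.size()]);
  }

  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    final ListSplit s = (ListSplit) split;
    final String[] items = job.get("list.input.items", "").split("\n");
    return new RecordReader<LongWritable, Text>() {
      private int pos = s.start;
      public boolean next(LongWritable key, Text value) {
        if (pos >= s.end) return false;
        key.set(pos);
        value.set(items[pos++]);
        return true;
      }
      public LongWritable createKey() { return new LongWritable(); }
      public Text createValue() { return new Text(); }
      public long getPos() { return pos - s.start; }
      public float getProgress() {
        return s.end == s.start ? 1f : (float) (pos - s.start) / (s.end - s.start);
      }
      public void close() {}
    };
  }
}

Driver side (also a sketch): conf.set("list.input.items", newlineJoinedUrls); conf.setInputFormat(ListInputFormat.class); conf.setNumMapTasks(15); the numSplits hint passed to getSplits comes from that last call.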

On Apr 19, 2012, at 11:34 PM, Sky wrote:

> My file for the input to mapper is very small - as all it has is urls to
> list of manifests. The task for mappers is to fetch each manifest, and
> then fetch files using urls from the manifests and then process them.
> Besides passing around lists of files, I am not really accessing the disk.
> It should be RAM, network, and CPU (unzip, parse XML, extract attributes).
>
> So is my only choice to break the input file and submit multiple files (if
> I have 15 cores, should I split the file with urls into 15 files? also how
> does it look in code?)? The two drawbacks are - some cores might finish
> early and stay idle, and I don't know how to deal with dynamically
> increasing/decreasing cores.
>
> Thx
> - Sky
>
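For the "how does it look in code" part of the question above, manually dealing the URL file out into one piece per core would look roughly like this (paths and the count of 15 are illustrative). The NLineInputFormat approach at the top of the thread gets the same effect without the manual split:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SplitUrlFile {
  public static void main(String[] args) throws Exception {
    int pieces = 15;                              // one piece per core
    FileSystem fs = FileSystem.get(new Configuration());
    BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(new Path("input/manifests.txt"))));
    PrintWriter[] out = new PrintWriter[pieces];
    for (int i = 0; i < pieces; i++) {
      out[i] = new PrintWriter(fs.create(new Path("input-split/part-" + i)));
    }
    String line;
    int n = 0;
    while ((line = in.readLine()) != null) {
      out[n++ % pieces].println(line);            // round-robin keeps pieces even
    }
    in.close();
    for (PrintWriter w : out) w.close();
  }
}

The drawback mentioned above still applies: a fixed split count can leave some cores idle near the end, which is why per-line splits or a custom InputFormat are the usual answers.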
> -----Original Message----- From: Michael Segel
> Sent: Thursday, April 19, 2012 8:49 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Help me with architecture of a somewhat non-trivial mapreduce
> implementation
>
> How 'large' or rather in this case small is your file?
>
> If you're on a default system, the block size is 64MB. So if your file is
> ~64MB or smaller, you end up with 1 block, and you will only have 1 mapper.
>
>
> On Apr 19, 2012, at 10:10 PM, Sky wrote:
>
>> Thanks for your reply.  After I sent my email, I found a fundamental
>> defect - in my understanding of how MR is distributed. I discovered that
>> even though I was firing off 15 COREs, the map job - which is the most
>> expensive part of my processing - was run only on 1 core.
>>
>> To start my map job, I was creating a single file with following data:
>> 1 storage:/root/1.manif.txt
>> 2 storage:/root/2.manif.txt
>> 3 storage:/root/3.manif.txt
>> ...
>> 4000 storage:/root/4000.manif.txt
>>
>> I thought that each of the available COREs will be assigned a map job
>> from top down from the same file one at a time, and as soon as one CORE
>> is done, it would get the next map job. However, it looks like I need to
>> split the file into multiple pieces. Now while that's clearly trivial
>> to do, I am not sure how I can detect at runtime how many splits I need
>> to make, and also how to deal with adding new CORES at runtime. Any
>> suggestions? (it doesn't have to be a file, it can be a list, etc).
>>
>> This all would be much easier to debug, if somehow I could get my log4j
>> logs for my mappers and reducers. I can see log4j for my main launcher,
>> but not sure how to enable it for mappers and reducers.
>>
>> Thx
>> - Akash
>>
>>
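On the log4j question above: mapper- and reducer-side log4j output does not show up in the launcher's console; each task attempt writes its own stdout/stderr/syslog files, viewable by drilling into a task attempt from the JobTracker web UI or, typically, under ${hadoop.log.dir}/userlogs/ on the worker node that ran it. A minimal sketch of mapper-side logging (old "mapred" API, class name is a placeholder):

import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LoggingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private static final Log LOG = LogFactory.getLog(LoggingMapper.class);

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    LOG.info("processing manifest url: " + value);  // ends up in the task's syslog
    reporter.setStatus("fetching " + value);        // also visible in the web UI
    out.collect(value, new Text("ok"));
  }
}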
>> -----Original Message----- From: Robert Evans
>> Sent: Thursday, April 19, 2012 2:08 PM
>> To: [EMAIL PROTECTED]
>> Subject: Re: Help me with architecture of a somewhat non-trivial
>> mapreduce implementation
>>
>> From what I can see your implementation seems OK, especially from a
>> performance perspective. Depending on what storage: is, it is likely to be
>> your bottleneck, not the hadoop computations.
>>
>> Because you are writing files directly instead of relying on Hadoop to do
>> it for you, you may need to deal with error cases that Hadoop will
>> normally hide from you, and you will not be able to turn on speculative
>> execution. Just be aware that a map or reduce task may have problems in
>> the middle, and be relaunched.  So when you are writing out your updated
>> manifest be careful to not replace the old one until the new one is
>> completely ready and will not fail, or you may lose data.  You may also
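A sketch of the "don't replace the old manifest until the new one is completely ready" advice above, using a write-to-temp-then-rename pattern on the Hadoop FileSystem API. Paths and the helper name are illustrative; a failed attempt leaves the old manifest untouched, though its temp file should be cleaned up:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ManifestWriter {
  public static void writeManifest(Configuration conf, Path finalPath, byte[] contents)
      throws IOException {
    FileSystem fs = finalPath.getFileSystem(conf);
    // Write to a task-private temp file first (e.g. suffix it with the attempt id).
    Path tmp = new Path(finalPath.getParent(), finalPath.getName() + ".tmp");
    FSDataOutputStream out = fs.create(tmp, true);
    try {
      out.write(contents);
    } finally {
      out.close();
    }
    // Only swap it in once the new file is fully written and closed.
    fs.delete(finalPath, false);
    if (!fs.rename(tmp, finalPath)) {
      throw new IOException("could not move " + tmp + " to " + finalPath);
    }
  }
}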