Re: Help me with architecture of a somewhat non-trivial mapreduce implementation
Michael Segel 2012-04-20, 03:49
How 'large', or rather in this case how small, is your file?
On a default configuration the HDFS block size is 64MB. So if your file is <= ~64MB, it occupies a single block, and you will only get 1 mapper.
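If you need more map tasks than blocks, one common workaround is NLineInputFormat, which splits the input every N lines instead of every block. A minimal driver sketch, assuming your Hadoop version ships the new-API org.apache.hadoop.mapreduce.lib.input.NLineInputFormat (the class name and paths here are made up; the old-API org.apache.hadoop.mapred.lib.NLineInputFormat works the same way):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ManifestDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "manifest-fetch");
            job.setJarByClass(ManifestDriver.class);
            // Split the input every N lines instead of every 64MB block,
            // so even a tiny file fans out across many map tasks.
            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.setNumLinesPerSplit(job, 100);
            NLineInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

With 4000 manifest lines and 100 lines per split, that's 40 map tasks, which is plenty to keep 15 cores busy.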
On Apr 19, 2012, at 10:10 PM, Sky wrote:
> Thanks for your reply. After I sent my email, I found a fundamental defect - in my understanding of how MR work is distributed. I discovered that even though I was firing off 15 COREs, the map job - the most expensive part of my processing - was running on only 1 core.
> To start my map job, I was creating a single file with the following data:
> 1 storage:/root/1.manif.txt
> 2 storage:/root/2.manif.txt
> 3 storage:/root/3.manif.txt
> ...
> 4000 storage:/root/4000.manif.txt
> I thought that each of the available COREs would be assigned a map task from the top of that file, one line at a time, and that as soon as a CORE finished, it would get the next map task. However, it looks like I need to split the file myself. While that's clearly trivial to do, I am not sure how I can detect at runtime how many splits I need, or how to handle COREs being added at runtime. Any suggestions? (It doesn't have to be a file; it can be a list, etc.)
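> A sketch of probing the cluster for its current map capacity at run time, assuming the old org.apache.hadoop.mapred JobClient API that was current in Hadoop 1.x (the class name is made up):
>
>     import org.apache.hadoop.mapred.ClusterStatus;
>     import org.apache.hadoop.mapred.JobClient;
>     import org.apache.hadoop.mapred.JobConf;
>
>     public class SlotProbe {
>         public static void main(String[] args) throws Exception {
>             JobClient client = new JobClient(new JobConf(SlotProbe.class));
>             ClusterStatus status = client.getClusterStatus();
>             // Total map slots across all live task trackers; reflects
>             // nodes that were added or removed after cluster start.
>             System.out.println("task trackers = " + status.getTaskTrackers());
>             System.out.println("map slots     = " + status.getMaxMapTasks());
>         }
>     }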
> This would all be much easier to debug if I could somehow get the log4j output from my mappers and reducers. I can see the log4j output of my main launcher, but I'm not sure how to enable it for the mappers and reducers.
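> For reference, task-side logging goes through the normal Commons Logging/log4j path; the catch is where the output ends up. A minimal mapper sketch (the class name is made up):
>
>     import java.io.IOException;
>     import org.apache.commons.logging.Log;
>     import org.apache.commons.logging.LogFactory;
>     import org.apache.hadoop.io.LongWritable;
>     import org.apache.hadoop.io.Text;
>     import org.apache.hadoop.mapreduce.Mapper;
>
>     public class ManifestMapper extends Mapper<LongWritable, Text, Text, Text> {
>         private static final Log LOG = LogFactory.getLog(ManifestMapper.class);
>
>         @Override
>         protected void map(LongWritable key, Text value, Context context)
>                 throws IOException, InterruptedException {
>             // This lands in the per-task logs under
>             // ${hadoop.log.dir}/userlogs/<task-attempt-id>/, viewable from
>             // the JobTracker web UI - not in the launcher's console output.
>             LOG.info("processing manifest line: " + value);
>             context.write(new Text("manifest"), value);
>         }
>     }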
> - Akash
> -----Original Message----- From: Robert Evans
> Sent: Thursday, April 19, 2012 2:08 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Help me with architecture of a somewhat non-trivial mapreduce implementation
> From what I can see, your implementation seems OK, especially from a performance perspective. Depending on what storage: is, it is likely to be your bottleneck, not the Hadoop computation.
> Because you are writing files directly instead of relying on Hadoop to do it for you, you may need to deal with error cases that Hadoop would normally hide from you, and you will not be able to turn on speculative execution. Just be aware that a map or reduce task may fail partway through and be relaunched. So when you write out your updated manifest, be careful not to replace the old one until the new one is completely written and can no longer fail, or you may lose data. You may also need to be careful in your reduce if you are writing directly to a file there too, but since that is a plain write rather than a read-modify-write, it is not as critical.
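> A sketch of that write-then-swap pattern (the helper class is made up; it assumes HDFS FileSystem semantics, where a rename within one filesystem is cheap):
>
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.FSDataOutputStream;
>     import org.apache.hadoop.fs.FileSystem;
>     import org.apache.hadoop.fs.Path;
>
>     public class ManifestWriter {
>         /** Write the new manifest to a temp file, then swap it in. */
>         public static void replaceManifest(Configuration conf, Path dest,
>                                            byte[] contents) throws Exception {
>             FileSystem fs = dest.getFileSystem(conf);
>             Path tmp = new Path(dest.getParent(), dest.getName() + ".tmp");
>             FSDataOutputStream out = fs.create(tmp, true); // clobber stale tmp
>             try {
>                 out.write(contents); // if the attempt dies here, dest is untouched
>             } finally {
>                 out.close();
>             }
>             fs.delete(dest, false); // drop the old copy only once tmp is complete
>             fs.rename(tmp, dest);
>         }
>     }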
> --Bobby Evans
> On 4/18/12 4:56 PM, "Sky USC" <[EMAIL PROTECTED]> wrote:
> Please help me with the architecture of my first significant MR task beyond "word count". My program works well, but I am trying to optimize performance to make the most of the available computing resources. I have 3 questions at the bottom.
> Project description in an abstract sense (written in java):
> * I have MM MANIFEST files, stored at storage:/root/1.manif.txt through storage:/root/4000.manif.txt.
> * Each MANIFEST in turn contains a variable number "EE" of URLs to EBOOKS (roughly 10,000 - 50,000 ebook URLs per MANIFEST), stored at paths like storage:/root/1.manif/1223.folder/5443.Ebook.ebk.
> So we are talking about millions of ebooks.
> My task is to:
> 1. Fetch each ebook and extract a set of 3 attributes per ebook (example: publisher, year, ebook-version).
> 2. Update each EBOOK entry in the manifest with the 3 attributes (e.g. ebook 1334 -> publisher=aaa, year=bbb, ebook-version=2.01).
> 3. Create output files such that a file named "<publisher>_<year>_<ebook-version>" contains the list of all ebook URLs that match those attribute values (see the sketch below).
> File "storage:/root/summary/RANDOMHOUSE_1999_2.01.txt" contains:
> and File "storage:/root/summary/PENGUIN_2001_3.12.txt" contains:
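> For step 3, one approach that avoids writing the summary files by hand is MultipleOutputs in the reducer: the mapper emits key = "<publisher>_<year>_<ebook-version>" and value = the ebook URL, and the reducer routes each group to its own file. A sketch, assuming your Hadoop version ships the new-API org.apache.hadoop.mapreduce.lib.output.MultipleOutputs (the class name is made up):
>
>     import java.io.IOException;
>     import org.apache.hadoop.io.NullWritable;
>     import org.apache.hadoop.io.Text;
>     import org.apache.hadoop.mapreduce.Reducer;
>     import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
>
>     public class SummaryReducer extends Reducer<Text, Text, NullWritable, Text> {
>         private MultipleOutputs<NullWritable, Text> out;
>
>         @Override
>         protected void setup(Context context) {
>             out = new MultipleOutputs<NullWritable, Text>(context);
>         }
>
>         @Override
>         protected void reduce(Text key, Iterable<Text> urls, Context context)
>                 throws IOException, InterruptedException {
>             for (Text url : urls) {
>                 // The key becomes the base of the output file name, e.g.
>                 // RANDOMHOUSE_1999_2.01-r-00000 under the job's output dir.
>                 out.write(NullWritable.get(), url, key.toString());
>             }
>         }
>
>         @Override
>         protected void cleanup(Context context)
>                 throws IOException, InterruptedException {
>             out.close();
>         }
>     }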