So the thing that just doesn't click for me yet is this:
On a typical computer, if I try to read two huge files off disk
simultaneously it'll just kill the disk performance. This seems like a risk.
What's preventing such disk contention in Hadoop? Is HDFS smart enough to
serialize major disk access?
From: Michael Segel [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, October 24, 2012 6:51 PM
To: [EMAIL PROTECTED]
Subject: Re: How do map tasks get assigned efficiently?
Data locality only works when you actually have data on the cluster itself.
Otherwise, how can the data be local?
Assuming 3X replication, and you're not doing a custom split and your input
file is splittable...
You will split along the block boundaries. So if your input file has 5
blocks, you will have 5 mappers.
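As a rough illustration of the block-to-mapper arithmetic above, here is a minimal sketch in Python. It assumes one split per HDFS block and a 64 MB block size; the function name and the block size are made up for the example. Hadoop's actual split computation has more knobs (minimum and maximum split size, handling of a small trailing remainder), so treat the result as an estimate.

```python
def estimate_map_tasks(file_size_bytes: int, block_size_bytes: int) -> int:
    """Estimate map tasks for one splittable file, assuming one split per block."""
    if block_size_bytes <= 0:
        raise ValueError("block size must be positive")
    if file_size_bytes <= 0:
        return 0
    # Integer ceiling division: a partially filled last block still needs a mapper.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


BLOCK = 64 * 1024 * 1024  # assumed block size (64 MB), not taken from the thread

print(estimate_map_tasks(5 * BLOCK, BLOCK))               # 5 full blocks -> 5
print(estimate_map_tasks(4 * BLOCK + BLOCK // 2, BLOCK))  # 4.5 blocks -> 5
```

A non-splittable input (for example a gzip-compressed file) would not follow this arithmetic, since the whole file goes to a single mapper.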
Since there are 3 copies of the block, it's possible for that map task
to run on a DN which has a copy of that block.
So it's pretty straightforward, up to a point.
When your cluster starts to get a lot of jobs and a slot opens up, your job
may not be data local.
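The behaviour described above (prefer a node that holds a replica, fall back to a non-local task when a slot opens elsewhere) can be sketched as a toy model. This is not the JobTracker's actual algorithm; real schedulers also weigh rack locality and fairness between jobs. The function name, node names, and split ids are all hypothetical.

```python
from typing import Dict, Optional, Set, Tuple


def pick_split(free_node: str,
               pending: Dict[str, Set[str]]) -> Optional[Tuple[str, bool]]:
    """Choose a pending split for a node that has a free map slot.

    pending maps a split id to the set of nodes holding a replica of its block.
    Returns (split_id, is_data_local), or None if nothing is pending.
    The chosen split is removed from pending.
    """
    if not pending:
        return None
    # First choice: a split with a replica on the node that has the free slot.
    local = next((s for s, nodes in pending.items() if free_node in nodes), None)
    if local is not None:
        del pending[local]
        return local, True
    # Otherwise take the oldest pending split; its data is read over the network.
    split_id = next(iter(pending))
    del pending[split_id]
    return split_id, False


# 3x replication: each split lists three hypothetical data nodes.
pending = {
    "split-1": {"dn1", "dn2", "dn3"},
    "split-2": {"dn2", "dn3", "dn4"},
}
print(pick_split("dn4", pending))  # ('split-2', True)
print(pick_split("dn9", pending))  # ('split-1', False)
```

The second call shows the busy-cluster case: the slot that opened up is on a node with no replica, so the task runs there anyway and is not data local.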
With HBase... YMMV
With S3 the data isn't local so it doesn't matter which Data Node gets the
On Oct 24, 2012, at 1:10 AM, David Parks <[EMAIL PROTECTED]> wrote:
Even after reading O'Reilly's book on Hadoop I don't feel like I have a clear
vision of how the map tasks get assigned.
They depend on splits, right?
But I have 3 jobs running. And splits will come from various sources: HDFS,
S3, and slow HTTP sources.
So I've got some concern as to how the map tasks will be distributed to
handle the data acquisition.
Can I do anything to ensure that I don't let the cluster go idle processing
slow HTTP downloads when the boxes could simultaneously be doing HTTP
downloads for one job and reading large files off HDFS for another job?
I'm imagining a scenario where the only map tasks that are running are all
blocking on splits requiring HTTP downloads and the splits coming from HDFS
are all queuing up behind it, when they'd run more efficiently in parallel