MapReduce >> mail # user >> Re: How do map tasks get assigned efficiently?


Re: How do map tasks get assigned efficiently?
Hi David,

Two things help avoid this, I think:

1. Blocks are small: typically 128 MB, and almost always under a GB.
Reading this contiguous but bounded chunk of data per process doesn't
take too much time (and when it does, the disk isn't always the one to
blame).

2. DNs support multiple disks (and we recommend using JBOD configs),
via the dfs.datanode.data.dir config property, and use round-robin
block placement to store blocks (when writing) across these disks. So
although it is possible to have several tasks reading from the same
disk, that is rare in practice.
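For example (with hypothetical mount points), a JBOD DataNode's hdfs-site.xml might list one local directory per physical disk:

```xml
<!-- hdfs-site.xml: one directory per physical disk (example paths) -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/disk1/hdfs/data,/disk2/hdfs/data,/disk3/hdfs/data</value>
</property>
```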

Even if you store a huge file, you still end up
reading it efficiently as the blocks are well distributed across the
cluster and across disks in each machine.
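As a rough calculation (a sketch only; Hadoop's real split sizing also honors the configured min/max split sizes), one split per block means a 10 GiB file stored in 128 MiB blocks fans out into about 80 map tasks:

```java
// Back-of-the-envelope split count: roughly one map task per HDFS block
// for a splittable file. (Real FileInputFormat sizing also considers
// min/max split size settings; this is the simplified default case.)
public class SplitCount {
    public static long splits(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;  // ceiling division
    }

    public static void main(String[] args) {
        long gib = 1024L * 1024 * 1024;
        long mib = 1024L * 1024;
        // A 10 GiB file with 128 MiB blocks -> 80 blocks -> ~80 map tasks,
        // spread across the disks and nodes holding its replicas.
        System.out.println(splits(10 * gib, 128 * mib)); // 80
    }
}
```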

Regarding the original query on how splits really work: for HDFS, the
NN provides the MR framework with a list of hostnames to use when it
wants access to a specific block (an offset and length within a file).
This lets MR schedule with a sense of data locality.

The data is shipped from the NN to the MR framework in the form of
InputSplit objects, which expose an InputSplit.getLocations() API. If
you have a non-HDFS source and still need locality hints (remember -
these are mere hints, not enforcers), you can write your own
InputFormat class and return tweaked InputSplit objects carrying the
desired location hostnames, via InputFormat#getSplits, which the
framework calls on the client side. Hope this helps!
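To make the hint mechanism concrete, here is a toy, self-contained model (hypothetical class and node names - this is not Hadoop's actual scheduler code) of how location hints get used: given a split's preferred hosts and the set of nodes with a free map slot, a scheduler prefers a node holding a replica and otherwise falls back to any free node.

```java
import java.util.List;
import java.util.Set;

// Toy model of locality-aware task assignment. splitLocations plays the
// role of InputSplit.getLocations(): preferred hostnames, not guarantees.
public class LocalityHint {
    public static String assign(List<String> splitLocations, Set<String> freeNodes) {
        for (String host : splitLocations) {
            if (freeNodes.contains(host)) {
                return host;  // data-local: a replica lives on this free node
            }
        }
        // No replica host has a free slot: run anywhere and read the
        // block over the network (the non-local case Mike describes below).
        return freeNodes.iterator().next();
    }

    public static void main(String[] args) {
        // Block replicated on node1..node3; node2 and node9 have free slots.
        String chosen = assign(java.util.Arrays.asList("node1", "node2", "node3"),
                new java.util.TreeSet<>(java.util.Arrays.asList("node2", "node9")));
        System.out.println(chosen); // node2: the local candidate wins
    }
}
```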

On Thu, Oct 25, 2012 at 8:19 AM, David Parks <[EMAIL PROTECTED]> wrote:
> So the thing that just doesn’t click for me yet is this:
>
>
>
> On a typical computer, if I try to read two huge files off disk
> simultaneously it’ll just kill the disk performance. This seems like a risk.
>
>
>
> What’s preventing such disk contention in Hadoop?  Is HDFS smart enough to
> serialize major disk access?
>
>
>
>
>
> From: Michael Segel [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, October 24, 2012 6:51 PM
> To: [EMAIL PROTECTED]
> Subject: Re: How do map tasks get assigned efficiently?
>
>
>
> So...
>
>
>
> Data locality only works when you actually have data on the cluster itself.
> Otherwise, how can the data be local?
>
>
>
> Assuming 3X replication, and you're not doing a custom split and your input
> file is splittable...
>
>
>
> You will split along the block boundaries.  So if your input file has 5
> blocks, you will have 5 mappers.
>
>
>
> Since there are 3 copies of the block, it's possible for that map task
> to run on a DN which has a copy of that block.
>
>
>
> So it's pretty straightforward, up to a point.
>
>
>
> When your cluster starts to get a lot of jobs and a slot opens up, your job
> may not be data local.
>
>
>
> With HBase... YMMV
>
> With S3 the data isn't local so it doesn't matter which Data Node gets the
> job.
>
>
>
> HTH
>
>
>
> -Mike
>
>
>
> On Oct 24, 2012, at 1:10 AM, David Parks <[EMAIL PROTECTED]> wrote:
>
>
>
> Even after reading O'Reilly's book on Hadoop, I don't feel like I have a clear
> vision of how the map tasks get assigned.
>
>
>
> They depend on splits right?
>
>
>
> But I have 3 jobs running. And splits will come from various sources: HDFS,
> S3, and slow HTTP sources.
>
>
>
> So I’ve got some concern as to how the map tasks will be distributed to
> handle the data acquisition.
>
>
>
> Can I do anything to ensure that I don’t let the cluster go idle processing
> slow HTTP downloads when the boxes could simultaneously be doing HTTP
> downloads for one job and reading large files off HDFS for another job?
>
>
>
> I’m imagining a scenario where the only map tasks that are running are all
> blocking on splits requiring HTTP downloads and the splits coming from HDFS
> are all queuing up behind it, when they’d run more efficiently in parallel
> per node.
>
>
>
>
>
>

--
Harsh J