Re: How MapReduce selects data blocks for processing user request
Hi Mehal,

> I am confused about how MapReduce tasks select data blocks when processing user requests.

I suggest reading chapter 6 of Tom White's Hadoop: The Definitive
Guide, titled "How MapReduce Works". It explains almost everything
you need to know in very clear language, and reading it (or other
such good books) will help you more generally as well.

> As data block replication places a single data block on multiple datanodes, how are data blocks uniquely selected for processing user requests during a job?

The first point to clear up is that MapReduce is not hard-tied to
HDFS. It generates splits on any FS, and the splits are unique for
your given input path. Each split maps to exactly one task, so the
task's input is fixed at submit-time itself. Each split is defined
by its file path, its start offset into the file, and the length
after that offset to be processed - "uniquely" defining itself.
Replication doesn't change this: every replica of a block holds the
same bytes, so a split names the same data no matter which replica
ends up serving the read.
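
To make that concrete, here's a rough sketch of the split-generation
logic in Java. It is only a simplification of what
FileInputFormat.getSplits() actually does (the real one also consults
block locations and the configured min/max split sizes), and the
FileSplit record below is a stand-in for illustration, not the real
Hadoop class:

import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for Hadoop's FileSplit: (path, start, length).
record FileSplit(String path, long start, long length) {}

public class SplitSketch {

    // Mimics the core of FileInputFormat.getSplits(): carve one
    // file into non-overlapping (offset, length) ranges, each at
    // most splitSize bytes long.
    static List<FileSplit> getSplits(String path, long fileLen,
                                     long splitSize) {
        List<FileSplit> splits = new ArrayList<>();
        for (long start = 0; start < fileLen; start += splitSize) {
            long length = Math.min(splitSize, fileLen - start);
            splits.add(new FileSplit(path, start, length));
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 300 MB file with a 128 MB split size yields three
        // splits: (0, 128M), (128M, 128M), (256M, 44M) - no two
        // overlap, and together they cover the file exactly once.
        long mb = 1024L * 1024L;
        getSplits("/data/input.txt", 300 * mb, 128 * mb)
            .forEach(System.out::println);
    }
}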

> How does it guarantee that no block gets chosen twice or thrice for different mapper tasks?

See above - each "block" (a "split" in MR terms) is defined by its
path, start offset and length. No two splits generated for a single
file can overlap, because that is how we generate them - so the
case you're worried about cannot arise.
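
Replication only comes into play at read time: the scheduler tries
to run each map task on a node that holds a replica of its split's
block, and the HDFS client reading the split just picks a nearby
replica to stream from. Since all replicas are identical, which one
serves the bytes makes no difference to the job's output.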

On Sat, Feb 9, 2013 at 6:10 AM, Mehal Patel <[EMAIL PROTECTED]> wrote:
> Hello All,
>
> I am confused about how MapReduce tasks select data blocks when
> processing user requests.
>
> As data block replication places a single data block on multiple
> datanodes, how are data blocks uniquely selected for processing
> user requests during a job? How does it guarantee that no block
> gets chosen twice or thrice for different mapper tasks?
>
>
> Thank you
>
> -Mehal

--
Harsh J