HDFS >> mail # user >> Hadoop-Yarn-MR reading InputSplits and processing them by the RecordReader, architecture/design question.


Re: Hadoop-Yarn-MR reading InputSplits and processing them by the RecordReader, architecture/design question.
You got that mostly right, and it doesn't differ much in Hadoop 1.* either.
Apart from the MR AM doing the work that was earlier done by the JobTracker,
the JobClient and the task side don't change much.

FileInputFormat.getSplits() is called by the client itself, so you should
look for the logs on the client machine.
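As a rough illustration of what the client computes (not Hadoop's actual code, but the same per-file arithmetic FileInputFormat applies; the class and method names here are illustrative):

```java
// Sketch of the client-side split computation that FileInputFormat.getSplits()
// performs per input file. The real method builds InputSplit objects carrying
// host information; here we only model the sizing arithmetic.
public class SplitSketch {
    // splitSize = max(minSize, min(maxSize, blockSize)):
    // the split size is clamped around the HDFS block size.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits for a file of the given length (ceiling division).
    static int countSplits(long fileLength, long splitSize) {
        if (fileLength == 0) return 0;
        return (int) ((fileLength + splitSize - 1) / splitSize);
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // 128 MB block
        long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        // A 300 MB file yields 3 splits with the default settings.
        System.out.println(countSplits(300L * 1024 * 1024, splitSize));
    }
}
```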

Each filesystem overrides getFileBlockLocations() and provides the correct
locations - for example, DFS internally uses the getBlockLocations() API on
the Namenode. What you are seeing is the default implementation for the local FS.
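That default behaviour can be sketched like this (simplified; the record below is a hypothetical stand-in for Hadoop's BlockLocation type):

```java
// Simplified sketch of FileSystem.getFileBlockLocations()'s default behaviour:
// the base class has no real block map, so it reports the whole requested
// range as a single "block" hosted on "localhost". Subclasses such as
// DistributedFileSystem override this and ask the Namenode for the real
// block-to-datanode mapping.
public class LocationSketch {
    // Stand-in for Hadoop's BlockLocation (hosts, offset, length).
    record BlockLocation(String[] hosts, long offset, long length) {}

    static BlockLocation[] getFileBlockLocations(long start, long len) {
        if (len <= 0) return new BlockLocation[0];
        // One fabricated block covering [start, start + len) on "localhost".
        return new BlockLocation[] {
            new BlockLocation(new String[] {"localhost"}, start, len)
        };
    }

    public static void main(String[] args) {
        BlockLocation[] locs = getFileBlockLocations(0, 1024);
        System.out.println(locs[0].hosts()[0]);
    }
}
```

This is why the hostnames only become meaningful once a real filesystem (like HDFS) supplies them; the scheduler can't place tasks for data-locality from the default "localhost" answer.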

HTH,
+Vinod

On Fri, Feb 1, 2013 at 6:24 AM, blah blah <[EMAIL PROTECTED]> wrote:

> Hi
>
> (I am using Yarn Hadoop-3.0.0.SNAPSHOT, revision 1437315M)
>
> I have a question regarding my assumptions on the Yarn-MR design,
> especially the InputSplit processing. Can someone confirm or point out the
> mistakes in my MR-Yarn design assumptions?
>
> These are my assumptions regarding design.
> 1. JobClient submits Job
> Create AppMaster etc.
> 2. Get number of splits // MR-AM, especially their hosts, so that a Task
> can be started on the same node, use InputFormat.getSplits() { ...;
> FileSystem.getFileBlockLocations(); ...;}
> 3. Start N tasks // MR-AM
> 4. Each Task processes its (single) split (unless splitsNr >> tasksNr)
> with the use of InputFormat/RecordReader // MR-Task, from HERE InputFormat
> operates only on a single Split
> 5. Start RecordReader and process Split // MR-Task
> 6. MAP() // MR-Task
> 7. Do the rest of MR // MR-Task
> 8. Dump to HDFS or other storage. // MR-Task
> 9. Report FINISH, free resources // MR-AM
>
> 2 quick bonus questions
>
> I have added an additional log entry in FileInputFormat.getSplits(),
> however I cannot see it in the log files. I am using the WordCount example
> and INFO level. What might be the problem?
> In FileSystem.getFileBlockLocations() the hostname is hard-coded as
> "localhost"; where is this mapped to the actual host name, so that the AM
> will know which nodes to request?
>
> Thanks for the reply
>

--
+Vinod
Hortonworks Inc.
http://hortonworks.com/