HDFS, mail # user - Hadoop-Yarn-MR reading InputSplits and processing them by the RecordReader, architecture/design question.


blah blah 2013-02-01, 14:24
Hi

(I am using Yarn Hadoop-3.0.0.SNAPSHOT, revision 1437315M)

I have a question regarding my assumptions about the Yarn-MR design, especially
the InputSplit processing. Can someone confirm them, or point out the mistakes
in my MR-Yarn design assumptions?

These are my assumptions regarding design.
1. JobClient submits the Job
Create the AppMaster etc.
2. Get the number of splits // MR-AM, especially their hosts, so that a Task can
be started on the same node, using *InputFormat.getSplits() { ...;
FileSystem.getFileBlockLocations(); ...;}
3. Start N tasks // MR-AM
4. Each Task processes its (single) split (unless splitsNr >> tasksNr) with
the use of InputFormat/RecordReader // MR-Task, from HERE on the InputFormat
operates only on a single Split
5. Start the RecordReader and process the Split // MR-Task
6. MAP() // MR-Task
7. Do the rest of MR // MR-Task
8. Dump to HDFS or other storage. // MR-Task
9. Report FINISH, free resources // MR-AM
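
To make the split computation and per-split record reading in the list above concrete, here is a minimal, self-contained sketch in plain Java. It does not use any Hadoop classes: `Split`, `computeSplits`, `readRecords` and `runTask` are hypothetical stand-ins for InputSplit, getSplits(), a line-oriented RecordReader and a map task. The line-boundary rule (skip the first partial line unless the split starts at offset 0, and read one line past the split end) is my assumption about how a line-based reader avoids losing or duplicating records, not something confirmed in this thread.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class SplitFlowSketch {

    /** Hypothetical stand-in for an InputSplit: a byte range plus the hosts that store it. */
    static final class Split {
        final long start;
        final long length;
        final List<String> hosts;

        Split(long start, long length, List<String> hosts) {
            this.start = start;
            this.length = length;
            this.hosts = hosts;
        }
    }

    /** Models getSplits(): one split per block; blockHosts.get(i) lists the hosts of block i. */
    static List<Split> computeSplits(long fileLength, long blockSize, List<List<String>> blockHosts) {
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += blockSize) {
            long length = Math.min(blockSize, fileLength - start);
            splits.add(new Split(start, length, blockHosts.get((int) (start / blockSize))));
        }
        return splits;
    }

    /** Models a line-oriented RecordReader that works on exactly one split. */
    static List<String> readRecords(byte[] data, Split split) {
        int pos = (int) split.start;
        int end = (int) (split.start + split.length);
        if (split.start != 0) {
            // Assumed rule: the previous split's reader owns the line that crosses
            // (or starts at) this boundary, so discard everything up to the next newline.
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline itself
        }
        List<String> records = new ArrayList<>();
        // Read every line that starts at or before the split end.
        while (pos <= end && pos < data.length) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            records.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return records;
    }

    /** Models one map task: read the records of a single split and apply map() to each. */
    static List<String> runTask(byte[] data, Split split, Function<String, String> map) {
        List<String> out = new ArrayList<>();
        for (String record : readRecords(data, split)) {
            out.add(map.apply(record));
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = "aa\nbbbb\ncc\ndddd\n".getBytes(StandardCharsets.UTF_8);
        List<List<String>> blockHosts = Arrays.asList(
                Arrays.asList("node1"), Arrays.asList("node2"), Arrays.asList("node3"));
        // One "task" per split; the hosts are what a scheduler could use for locality.
        for (Split split : computeSplits(data.length, 6, blockHosts)) {
            System.out.println("split@" + split.start + " on " + split.hosts + " -> "
                    + runTask(data, split, String::toUpperCase));
        }
    }
}
```

The point of the sketch is only that the splits are computed once, up front, together with their hosts, while each task afterwards sees a single split and its own reader. With a block size that cuts lines in the middle, every line is still read exactly once across all tasks.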

Two quick bonus questions:

1. I have added an additional log entry in FileInputFormat.getSplits(),
but I cannot see it in the log files. I am using the WordCount example and
the INFO level. What might be the problem?
2. In FileSystem.getFileBlockLocations() the hostname is hard-coded as
"localhost". Where is this mapped to the actual host name, so that the AM
knows which nodes to request?

Thanks for any replies.
Vinod Kumar Vavilapalli 2013-02-01, 20:58