HDFS >> mail # user >> Hadoop-Yarn-MR reading InputSplits and processing them by the RecordReader, architecture/design question.


Hadoop-Yarn-MR reading InputSplits and processing them by the RecordReader, architecture/design question.
Hi

(I am using Yarn Hadoop-3.0.0.SNAPSHOT, revision 1437315M)

I have a question about my assumptions on the YARN MR design, especially
the InputSplit processing. Can someone confirm them or point out where
they are wrong?

These are my assumptions regarding the design. The comment after // names
the component that I assume does the work.
1. JobClient submits the job; the AppMaster etc. are created.
2. Get the number of splits, and especially their hosts, so that a task can
be started on the same node. This uses *InputFormat.getSplits() { ...;
FileSystem.getFileBlockLocations(); ...;} // MR-AM
3. Start N tasks // MR-AM
4. Each task processes its (single) split (unless splitsNr >> tasksNr)
using InputFormat/RecordReader; from HERE on, InputFormat operates only on
a single split // MR-Task
5. Start the RecordReader and process the split // MR-Task
6. MAP() // MR-Task
7. Do the rest of MR // MR-Task
8. Dump to HDFS or other storage // MR-Task
9. Report FINISH, free resources // MR-AM
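
To make step 2 concrete, here is a minimal sketch of how a file could be cut into splits. It is a plain-Java model with no Hadoop dependency: the class SplitSketch and the method computeSplits are hypothetical names, not Hadoop APIs. The split-size formula and the 1.1 slop factor are meant to mirror the logic in FileInputFormat, but treat that as an assumption to check against the source of your revision. Host lookup via FileSystem.getFileBlockLocations() is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone model of split calculation; not part of Hadoop.
class SplitSketch {
    // Assumed tolerance: a tail smaller than 10% of a split is merged
    // into the last split instead of becoming a split of its own.
    static final double SPLIT_SLOP = 1.1;

    // Clamp the block size between the configured minimum and maximum.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns one {offset, length} pair per split of a single file.
    static List<long[]> computeSplits(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {fileLength - remaining, splitSize});
            remaining -= splitSize;
        }
        if (remaining != 0) {
            // Last split takes whatever is left, including a small tail.
            splits.add(new long[] {fileLength - remaining, remaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long splitSize = computeSplitSize(128 * mb, 1L, Long.MAX_VALUE);
        for (long[] s : computeSplits(300 * mb, splitSize)) {
            System.out.println("offset=" + s[0] + " length=" + s[1]);
        }
    }
}
```

Under these assumptions a 300 MB file with 128 MB blocks yields three splits (128 MB, 128 MB, 44 MB), so N in step 3 would be 3 for that file, while a 140 MB file stays a single split because its tail falls inside the slop.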

Two quick bonus questions:

1. I have added an additional log entry in FileInputFormat.getSplits(),
but I cannot see it in the log files. I am using the WordCount example and
INFO level. What might be the problem?
2. In FileSystem.getFileBlockLocations() the hostname is hard-coded as
"localhost". Where is this mapped to the actual host name, so that the AM
knows which nodes to request?

Thanks for any replies.