MapReduce >> mail # user >> Re: Hadoop-Yarn-MR reading InputSplits and processing them by the RecordReader, architecture/design question.


Re: Hadoop-Yarn-MR reading InputSplits and processing them by the RecordReader, architecture/design question.
Thank you very much for answering my question. Are there any publicly
available Hadoop-MR-YARN UML diagrams (class, activity, etc.), or any more
in-depth documentation beyond what is on the official site? I am interested
in implementation details/documentation of the MR AM and the MR containers
(the old TaskTracker).

regards
blah

2013/2/1 Vinod Kumar Vavilapalli <[EMAIL PROTECTED]>

> You got that mostly right. And it doesn't differ much from Hadoop 1.*
> either: with the MR AM doing the work that was earlier done by the
> JobTracker, the JobClient and the task side don't change much.
>
> FileInputFormat.getSplits() is called by the client itself, so you should
> look for logs on the client machine.
>
> Each filesystem overrides getFileBlockLocations() and provides the correct
> locations - e.g. DFS internally uses the getBlockLocations() API on the
> NameNode. What you are seeing is the default implementation for the local FS.
>
> HTH,
> +Vinod
>
>
>
> On Fri, Feb 1, 2013 at 6:24 AM, blah blah <[EMAIL PROTECTED]> wrote:
>
>> Hi
>>
>> (I am using Yarn Hadoop-3.0.0.SNAPSHOT, revision 1437315M)
>>
>> I have a question regarding my assumptions about the YARN-MR design,
>> specifically the InputSplit processing. Can someone confirm or point out
>> the mistakes in my MR-YARN design assumptions?
>>
>> These are my assumptions regarding the design:
>> 1. JobClient submits the job // create the AppMaster, etc.
>> 2. Get the number of splits // MR-AM; specifically their hosts, so that a
>> task can be started on the same node, using InputFormat.getSplits() { ...;
>> FileSystem.getFileBlockLocations(); ...; }
>> 3. Start N tasks // MR-AM
>> 4. Each task processes its (single) split (unless splitsNr >> tasksNr)
>> with the use of InputFormat/RecordReader // MR-Task; from HERE InputFormat
>> operates only on a single split
>> 5. Start the RecordReader and process the split // MR-Task
>> 6. map() // MR-Task
>> 7. Do the rest of MR // MR-Task
>> 8. Dump to HDFS or other storage // MR-Task
>> 9. Report FINISH, free resources // MR-AM
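The split computation in step 2 can be sketched with the formula `FileInputFormat` uses, `splitSize = max(minSize, min(maxSize, blockSize))`, plus a slack factor so the last split is not tiny. The code below is a simplified, self-contained model of that logic, not the actual Hadoop implementation: it ignores multiple files, compression, and the per-split host lists that the real `getSplits()` attaches from `getFileBlockLocations()`.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of FileInputFormat.getSplits() for a single file.
// Note: in Hadoop this runs on the CLIENT at job submission, before
// the MR AM even starts (see Vinod's reply above).
public class SplitModel {
    static final double SPLIT_SLOP = 1.1; // same slack factor Hadoop uses

    // splitSize = max(minSize, min(maxSize, blockSize))
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns {offset, length} pairs the file would be split into.
    static List<long[]> getSplits(long fileLen, long blockSize,
                                  long minSize, long maxSize) {
        long splitSize = splitSize(blockSize, minSize, maxSize);
        List<long[]> splits = new ArrayList<>();
        long remaining = fileLen;
        // Keep cutting full-size splits while more than 1.1 splits remain.
        while (((double) remaining) / splitSize > SPLIT_SLOP) {
            splits.add(new long[] { fileLen - remaining, splitSize });
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits.add(new long[] { fileLen - remaining, remaining });
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 300 MB file with a 128 MB block size yields 3 splits:
        // (0,128MB), (128MB,128MB), (256MB,44MB).
        List<long[]> splits = getSplits(300 * mb, 128 * mb, 1, Long.MAX_VALUE);
        for (long[] s : splits) {
            System.out.println(s[0] / mb + " " + s[1] / mb);
        }
    }
}
```

Each resulting split then becomes one map task (step 4 above), with the split's hosts used by the AM for locality-aware container requests.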
>>
>> Two quick bonus questions:
>>
>> I have added an additional log entry in FileInputFormat.getSplits(), but
>> I cannot see it in the log files. I am using the WordCount example and
>> INFO level. What might be the problem?
>> In FileSystem.getFileBlockLocations() the hostname is hard-coded as
>> "localhost"; where is this mapped to the actual host name, so that the AM
>> will know which nodes to request?
>>
>> Thanks for the reply
>>
>
>
>
> --
> +Vinod
> Hortonworks Inc.
> http://hortonworks.com/
>
Marcos Ortiz 2013-02-04, 13:37