Re: Hadoop-Yarn-MR reading InputSplits and processing them by the RecordReader, architecture/design question.
Thank you very much for answering my question. Are there any publicly
available Hadoop-MR-Yarn UML diagrams (class, activity, etc.) or more
in-depth documentation beyond what is on the official site? I am interested
in implementation details/documentation of the MR AM and the MR containers
(the old TaskTracker).

regards
blah

2013/2/1 Vinod Kumar Vavilapalli <[EMAIL PROTECTED]>

> You got that mostly right. And it doesn't differ much from Hadoop 1.*
> either: with the MR AM doing the work that was earlier done by the
> JobTracker, the JobClient and the task side don't change much.
>
> FileInputFormat.getSplits() is called by the client itself, so you should
> look for its logs on the client machine.
>
> Each filesystem overrides getFileBlockLocations() and provides the correct
> locations - for example, DFS internally uses the getBlockLocations() API on
> the Namenode. What you are seeing is the default implementation for the
> local FS.
>
> HTH,
> +Vinod
>
>
>
> On Fri, Feb 1, 2013 at 6:24 AM, blah blah <[EMAIL PROTECTED]> wrote:
>
>> Hi
>>
>> (I am using Yarn Hadoop-3.0.0.SNAPSHOT, revision 1437315M)
>>
>> I have a question regarding my assumptions about the Yarn-MR design,
>> specifically the InputSplit processing. Can someone confirm them or point
>> out the mistakes in my MR-Yarn design assumptions?
>>
>> These are my assumptions regarding the design:
>> 1. JobClient submits the Job // create the AppMaster, etc.
>> 2. Get the number of splits, especially their hosts, so that a Task can
>> be started on the same node, using InputFormat.getSplits() { ...;
>> FileSystem.getFileBlockLocations(); ...; } // MR-AM
>> 3. Start N tasks // MR-AM
>> 4. Each Task processes its (single) split (unless splitsNr >> tasksNr)
>> with the use of InputFormat/RecordReader; from HERE the InputFormat
>> operates only on a single Split // MR-Task
>> 5. Start the RecordReader and process the Split // MR-Task
>> 6. MAP() // MR-Task
>> 7. Do the rest of MR // MR-Task
>> 8. Dump to HDFS or other storage // MR-Task
>> 9. Report FINISH, free resources // MR-AM
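[The per-task part of the steps above (RecordReader feeding map()) can be sketched as below. This is a simplified stand-in for the Hadoop InputFormat/RecordReader API, not the real classes: one task, one split, a read loop that calls map() per record, counting words as in the WordCount example.]

```java
import java.util.Map;
import java.util.TreeMap;

// Simplified stand-in for a RecordReader: iterates the records of ONE split.
class LineRecordReader {
    private final String[] lines;  // the single split's contents
    private int pos = -1;

    LineRecordReader(String splitContents) {
        this.lines = splitContents.split("\n");
    }

    boolean nextKeyValue() { return ++pos < lines.length; }
    Integer getCurrentKey()  { return pos; }          // line number as key
    String  getCurrentValue() { return lines[pos]; }  // the line itself
}

public class TaskLoopSketch {
    public static void main(String[] args) {
        // The per-task loop: read every record of this task's split, map() each.
        LineRecordReader reader = new LineRecordReader("hello world\nhello yarn");
        Map<String, Integer> counts = new TreeMap<>();  // sorted for stable output
        while (reader.nextKeyValue()) {
            // map(): emit (word, 1) for each word, WordCount-style
            for (String word : reader.getCurrentValue().split("\\s+")) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        System.out.println(counts);  // prints {hello=2, world=1, yarn=1}
    }
}
```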
>>
>> 2 quick bonus questions
>>
>> I have added an additional log entry in FileInputFormat.getSplits(), but
>> I cannot see it in the log files. I am using the WordCount example and the
>> INFO level. What might be the problem?
>> In FileSystem.getFileBlockLocations() the hostname is hard-coded as
>> "localhost"; where is this mapped to the actual host name, so that the AM
>> will know which nodes to request?
>>
>> Thanks for the reply
>>
>
>
>
> --
> +Vinod
> Hortonworks Inc.
> http://hortonworks.com/
>