MapReduce, mail # user - Re: Hadoop-Yarn-MR reading InputSplits and processing them by the RecordReader, architecture/design question.


blah blah 2013-02-04, 09:35
Marcos Ortiz 2013-02-04, 13:37
Hi blah,
You can start with these links:
MAPREDUCE-279: https://issues.apache.org/jira/browse/MAPREDUCE-279
MapReduce Next Gen:
http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen
You can also read Cloudera's blog posts about YARN:
http://www.cloudera.com/blog/2011/11/building-and-deploying-mr2/
http://www.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/

There is also a great document about the architecture of YARN, written by
Arun, Owen, and others, but I don't have it at hand right now.
Best wishes
On 02/04/2013 04:35 AM, blah blah wrote:
> Thank you very much for answering my question. Are there any publicly
> available Hadoop-MR-Yarn UML diagrams (class, activity, etc.) or more
> in-depth documentation beyond what is on the official site? I am
> interested in implementation details/documentation of the MR AM and MR
> containers (the old TaskTracker).
>
> regards
> blah
>
> 2013/2/1 Vinod Kumar Vavilapalli <[EMAIL PROTECTED]
> <mailto:[EMAIL PROTECTED]>>
>
>     You got that mostly right, and it doesn't differ much in Hadoop
>     1.* either. With the MR AM doing the work that was earlier done in
>     the JobTracker, the JobClient and the task side don't change much.
>
>     FileInputFormat.getSplits() is called by the client itself, so you
>     should look for logs on the client machine.
>
>     Each filesystem overrides getFileBlockLocations() and provides the
>     correct locations. For example, DFS internally uses the
>     getBlockLocations() API on the NameNode. What you are seeing is the
>     default implementation for the local FS.
>
>     HTH,
>     +Vinod
>
>
>
>     On Fri, Feb 1, 2013 at 6:24 AM, blah blah <[EMAIL PROTECTED]
>     <mailto:[EMAIL PROTECTED]>> wrote:
>
>         Hi
>
>         (I am using Yarn Hadoop-3.0.0.SNAPSHOT, revision 1437315M)
>
>         I have a question regarding my assumptions about the Yarn-MR
>         design, especially the InputSplit processing. Can someone
>         confirm or point out mistakes in my MR-Yarn design assumptions?
>
>         These are my assumptions regarding the design:
>         1. JobClient submits the Job
>         Create AppMaster etc.
>         2. Get the number of splits, and especially their hosts, so
>         that a Task can be started on the same node // MR-AM; uses
>         *InputFormat.getSplits() { ...;
>         FileSystem.getFileBlockLocations(); ...;}
>         3. Start N tasks // MR-AM
>         4. Each Task processes its (single) split (unless splitsNr >>
>         tasksNr) with the use of InputFormat/RecordReader // MR-Task,
>         from HERE InputFormat operates only on a single Split
>         5. Start RecordReader and process Split // MR-Task
>         6. MAP() // MR-Task
>         7. Do the rest of MR // MR-Task
>         8. Dump to HDFS or other storage. // MR-Task
>         9. Report FINISH, free resources // MR-AM
>
>         Two quick bonus questions:
>
>         I have added an additional log entry in
>         FileInputFormat.getSplits(), but I cannot see it in the log
>         files. I am using the WordCount example and the INFO level.
>         What might be the problem?
>         In FileSystem.getFileBlockLocations() the hostname is
>         hard-coded as "localhost". Where is this mapped to the actual
>         host name, so that the AM will know which nodes to request?
>
>         Thanks for your reply
>
>
>
>
>     --
>     +Vinod
>     Hortonworks Inc.
>     http://hortonworks.com/
>
>
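
The two points in Vinod's reply quoted above (split computation runs on the client, and each file system overrides the block-location lookup) can be illustrated with a minimal Java sketch. This is a simplified stand-in model, not the real Hadoop code: the class names SplitLocalitySketch, BaseFs, FakeDfs and Split, the host names, and the sizes are all hypothetical. Only the method names getSplits() and getFileBlockLocations() echo the thread, and the sketch assumes the split size equals the block size.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in model; these are NOT the real Hadoop classes.
class SplitLocalitySketch {

    /** Hypothetical base file system with a local-FS style default lookup. */
    static class BaseFs {
        // Default behaviour: every byte range is reported as living on "localhost".
        String[] getFileBlockLocations(long start, long length) {
            return new String[] {"localhost"};
        }
    }

    /** Hypothetical distributed file system that overrides the default. */
    static class FakeDfs extends BaseFs {
        private final long blockSize;
        private final String[][] blockHosts; // replica hosts for each block

        FakeDfs(long blockSize, String[][] blockHosts) {
            this.blockSize = blockSize;
            this.blockHosts = blockHosts;
        }

        @Override
        String[] getFileBlockLocations(long start, long length) {
            // A real DFS would ask its metadata service; here we index a table.
            return blockHosts[(int) (start / blockSize)];
        }
    }

    /** A byte range of the input plus the hosts that hold it. */
    static class Split {
        final long start;
        final long length;
        final String[] hosts;

        Split(long start, long length, String[] hosts) {
            this.start = start;
            this.length = length;
            this.hosts = hosts;
        }
    }

    // In this model the method runs in the submitting (client) process,
    // so anything it logs would appear in the client's output.
    static List<Split> getSplits(BaseFs fs, long fileLength, long splitSize) {
        List<Split> splits = new ArrayList<>();
        for (long off = 0; off < fileLength; off += splitSize) {
            long len = Math.min(splitSize, fileLength - off);
            splits.add(new Split(off, len, fs.getFileBlockLocations(off, len)));
        }
        return splits;
    }

    public static void main(String[] args) {
        String[][] hosts = {{"node1", "node2"}, {"node2", "node3"}, {"node1", "node3"}};
        BaseFs[] fileSystems = {new BaseFs(), new FakeDfs(100, hosts)};
        for (BaseFs fs : fileSystems) {
            System.out.println(fs.getClass().getSimpleName() + ":");
            for (Split s : getSplits(fs, 250, 100)) {
                System.out.println("  " + s.start + "+" + s.length
                        + " -> " + String.join(",", s.hosts));
            }
        }
    }
}
```

With BaseFs every split reports only "localhost", which matches what the original poster saw in the default implementation. With the overriding FakeDfs each split carries its own host list, which is the kind of information a scheduler could use to place a task near its data.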

--
Marcos Ortiz Valmaseda,
Product Manager && Data Scientist at UCI
Blog: http://marcosluis2186.posterous.com
Twitter: @marcosluis2186 <http://twitter.com/marcosluis2186>