MapReduce >> mail # user >> Re: Hadoop-Yarn-MR reading InputSplits and processing them by the RecordReader, architecture/design question.


blah blah 2013-02-04, 09:35
Re: Hadoop-Yarn-MR reading InputSplits and processing them by the RecordReader, architecture/design question.
Regards, blah.
You can use these links:
MAPREDUCE-279: https://issues.apache.org/jira/browse/MAPREDUCE-279
MapReduce Next Gen:
http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen
You can also read Cloudera's blog posts about YARN:
http://www.cloudera.com/blog/2011/11/building-and-deploying-mr2/
http://www.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/

There is a great document about the architecture of YARN written by
Arun, Owen and others, but I don't have it here right now.
Best wishes
On 02/04/2013 04:35 AM, blah blah wrote:
> Thank you very much for answering my question. Are there any publicly
> available Hadoop-MR-Yarn UML diagrams (class, activity, etc.), or some
> more in-depth documentation, other than what is on the official site?
> I am interested in implementation details/documentation of the MR AM
> and MR containers (the old TaskTracker).
>
> regards
> blah
>
> 2013/2/1 Vinod Kumar Vavilapalli <[EMAIL PROTECTED]
> <mailto:[EMAIL PROTECTED]>>
>
>     You got that mostly right. And it doesn't differ much in Hadoop
>     1.* either. With the MR AM doing the work that was earlier done in
>     the JobTracker, the JobClient and the task side don't change much.
>
>     FileInputFormat.getSplits() is called by the client itself, so you
>     should look for logs on the client machine.
>
>     Each filesystem overrides getFileBlockLocations() and provides the
>     correct locations. For example, DFS internally uses the
>     getBlockLocations() API on the NameNode. What you are seeing is the
>     default implementation for the local FS.
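
To make the override point concrete, here is a minimal sketch in plain
Java. It does not use the Hadoop libraries: BaseFileSystem, DistributedFs,
the reduced BlockLocation class, the simplified method signature and the
node names are all assumptions made up for illustration. It only shows the
pattern described above, where the base class answers "localhost" and a
distributed file system overrides that answer with real host names.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for illustration only. These are NOT the real
// org.apache.hadoop.fs classes, and the signatures are hypothetical.
public class BlockLocationSketch {

    /** One block of a file: its byte offset and the hosts that store it. */
    static final class BlockLocation {
        final long offset;
        final List<String> hosts;

        BlockLocation(long offset, List<String> hosts) {
            this.offset = offset;
            this.hosts = hosts;
        }
    }

    /** Base class: the default answer is a single block on "localhost". */
    static class BaseFileSystem {
        List<BlockLocation> getFileBlockLocations(long start, long len) {
            List<BlockLocation> result = new ArrayList<>();
            result.add(new BlockLocation(start, List.of("localhost")));
            return result;
        }
    }

    /** Distributed file system: overrides the default with real hosts. */
    static class DistributedFs extends BaseFileSystem {
        private final long blockSize;
        // Hypothetical block-index -> hosts table. In HDFS this knowledge
        // lives on the NameNode. All blocks are assumed to be full-sized.
        private final List<List<String>> hostsPerBlock;

        DistributedFs(long blockSize, List<List<String>> hostsPerBlock) {
            this.blockSize = blockSize;
            this.hostsPerBlock = hostsPerBlock;
        }

        @Override
        List<BlockLocation> getFileBlockLocations(long start, long len) {
            List<BlockLocation> result = new ArrayList<>();
            if (len <= 0) {
                return result;
            }
            long first = start / blockSize;            // first block touched
            long last = (start + len - 1) / blockSize; // last block touched
            for (long b = first; b <= last && b < hostsPerBlock.size(); b++) {
                result.add(new BlockLocation(b * blockSize,
                                             hostsPerBlock.get((int) b)));
            }
            return result;
        }
    }

    public static void main(String[] args) {
        BaseFileSystem local = new BaseFileSystem();
        BaseFileSystem dfs = new DistributedFs(
                128, List.of(List.of("node1", "node2"), List.of("node3")));

        System.out.println(local.getFileBlockLocations(0, 200).get(0).hosts);
        for (BlockLocation loc : dfs.getFileBlockLocations(0, 200)) {
            System.out.println(loc.offset + " -> " + loc.hosts);
        }
    }
}
```

The caller only ever sees the base type, so whoever computes the splits
gets real host names or "localhost" depending on which file system the
input path resolves to.
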
>
>     HTH,
>     +Vinod
>
>
>
>     On Fri, Feb 1, 2013 at 6:24 AM, blah blah <[EMAIL PROTECTED]
>     <mailto:[EMAIL PROTECTED]>> wrote:
>
>         Hi
>
>         (I am using Yarn Hadoop-3.0.0.SNAPSHOT, revision 1437315M)
>
>         I have a question regarding my assumptions on the Yarn-MR
>         design, especially the InputSplit processing. Can someone
>         confirm or point out the mistakes in my MR-Yarn design assumptions?
>
>         These are my assumptions regarding design.
>         1. JobClient submits Job
>         Create AppMaster etc.
>         2. Get number of splits // MR-AM, especially their hosts, so
>         that a Task can be started on the same node, use
>         *InputFormat.getSplits() { ...;
>         FileSystem.getFileBlockLocations(); ...;}
>         3. Start N tasks // MR-AM
>         4. Each Task processes its (single) split (unless splitsNr >>
>         tasksNr) with the use of InputFormat/RecordReader // MR-Task,
>         from HERE InputFormat operates only on a single Split
>         5. Start RecordReader and process Split // MR-Task
>         6. MAP() // MR-Task
>         7. Do the rest of MR // MR-Task
>         8. Dump to HDFS or other storage. // MR-Task
>         9. Report FINISH, free resources // MR-AM
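
The split and record-reader steps in the list above can be sketched as a
toy model in plain Java. This is not Hadoop code: the class and method
names only echo the Hadoop ones, the input is an in-memory byte array, and
the boundary rule (a reader skips the first line of any split that does not
start at offset 0, and finishes the line that crosses its end) is an
assumption of this sketch, chosen so that every line is read exactly once.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Toy model of the flow. Names are hypothetical and only echo the
// Hadoop ones; nothing here calls Hadoop.
public class SplitFlowSketch {

    /** A byte range of the input, like a very reduced InputSplit. */
    static final class Split {
        final int start;
        final int length;

        Split(int start, int length) {
            this.start = start;
            this.length = length;
        }
    }

    /** Cut [0, fileLength) into ranges of at most splitSize bytes. */
    static List<Split> getSplits(int fileLength, int splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Split> splits = new ArrayList<>();
        for (int off = 0; off < fileLength; off += splitSize) {
            splits.add(new Split(off, Math.min(splitSize, fileLength - off)));
        }
        return splits;
    }

    /** Index just past the next '\n' at or after 'from' (or end of data). */
    static int nextLineStart(byte[] data, int from) {
        int i = from;
        while (i < data.length && data[i] != '\n') {
            i++;
        }
        return Math.min(i + 1, data.length);
    }

    /** The "record reader": turns one split into line records. */
    static List<String> readRecords(byte[] data, Split split) {
        int pos = split.start;
        int end = split.start + split.length;
        if (pos != 0) {
            // Not the first split: skip to the next line start. The line
            // skipped here is emitted by the reader of the previous split.
            pos = nextLineStart(data, pos);
        }
        List<String> records = new ArrayList<>();
        // A line that starts at or before 'end' belongs to this split,
        // even if it runs past 'end'.
        while (pos <= end && pos < data.length) {
            int next = nextLineStart(data, pos);
            int lineEnd = (data[next - 1] == '\n') ? next - 1 : next;
            records.add(new String(data, pos, lineEnd - pos,
                                   StandardCharsets.UTF_8));
            pos = next;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "aa\nbb\ncc\n".getBytes(StandardCharsets.UTF_8);
        // One "task" per split; each task sees only its own split.
        for (Split split : getSplits(data.length, 4)) {
            System.out.println("split@" + split.start + " -> "
                    + readRecords(data, split));
        }
    }
}
```

Splits are cut by byte offset without looking at the data, so line
boundaries have to be resolved later by whoever reads the split. That is
why the boundary rule sits in readRecords() and not in getSplits().
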
>
>         2 quick bonus questions
>
>         I have added an additional log entry in
>         FileInputFormat.getSplits(), but I cannot see it in the log
>         files. I am using the WordCount example and INFO level. What
>         might be the problem?
>         In FileSystem.getFileBlockLocations() the hostname is
>         hard-coded as "localhost". Where is this mapped to the actual
>         host name, so that the AM will know which nodes to request?
>
>         Thanks for any reply
>
>
>
>
>     --
>     +Vinod
>     Hortonworks Inc.
>     http://hortonworks.com/
>
>

--
Marcos Ortiz Valmaseda,
Product Manager && Data Scientist at UCI
Blog: http://marcosluis2186.posterous.com
Twitter: @marcosluis2186 <http://twitter.com/marcosluis2186>