blah blah 2013-02-04, 09:35
Re: Hadoop-Yarn-MR reading InputSplits and processing them by the RecordReader, architecture/design question.
Marcos Ortiz 2013-02-04, 13:37
Regards, blah.
You can use these links:

MAPREDUCE-279: https://issues.apache.org/jira/browse/MAPREDUCE-279
MapReduce Next Gen: http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen

You can also read Cloudera's blog posts about YARN:

http://www.cloudera.com/blog/2011/11/building-and-deploying-mr2/
http://www.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/

There is a great document written by Arun, Owen, and others about the
architecture of YARN, but I don't have it at hand right now.

Best wishes

On 02/04/2013 04:35 AM, blah blah wrote:
> Thank you very much for answering my question. Are there any publicly
> available Hadoop-MR-Yarn UML diagrams (class, activity, etc.), or any
> more in-depth documentation besides what is on the official site? I am
> interested in implementation details/documentation of the MR AM and MR
> containers (the old TaskTracker).
>
> regards
> blah
>
> 2013/2/1 Vinod Kumar Vavilapalli <[EMAIL PROTECTED]>
>
>     You got that mostly right. And it doesn't differ much in Hadoop
>     1.* either. With the MR AM doing the work that was earlier done in
>     the JobTracker, the JobClient and the task side don't change much.
>
>     FileInputFormat.getSplits() is called by the client itself, so you
>     should look for logs on the client machine.
>
>     Each filesystem overrides getFileBlockLocations() and provides the
>     correct locations; for example, DFS internally uses the
>     getBlockLocations() API on the Namenode. What you are seeing is the
>     default implementation for the local FS.
>
>     HTH,
>     +Vinod
>
>     On Fri, Feb 1, 2013 at 6:24 AM, blah blah <[EMAIL PROTECTED]> wrote:
>
>         Hi
>
>         (I am using Yarn Hadoop-3.0.0.SNAPSHOT, revision 1437315M)
>
>         I have a question regarding my assumptions about the Yarn-MR
>         design, especially the InputSplit processing. Can someone
>         confirm my MR-Yarn design assumptions or point out my mistakes?
>
>         These are my assumptions regarding the design.
>         1. JobClient submits Job
>            Create AppMaster etc.
>         2. Get number of splits // MR-AM, especially their hosts, so
>            that a Task can be started on the same node, use
>            *InputFormat.getSplits() { ...;
>            FileSystem.getFileBlockLocations(); ...;}
>         3. Start N tasks // MR-AM
>         4. Each Task processes its (single) split (unless splitsNr >>
>            tasksNr) with the use of InputFormat/RecordReader // MR-Task,
>            from HERE InputFormat operates only on a single Split
>         5. Start RecordReader and process Split // MR-Task
>         6. MAP() // MR-Task
>         7. Do rest MR // MR-Task
>         8. Dump to HDFS/or other storage. // MR-Task
>         9. Report FINISH, free resources // MR-AM
>
>         2 quick bonus questions
>
>         I have added an additional log entry in
>         FileInputFormat.getSplits(), however I cannot see it in the log
>         files. I am using the WordCount example and INFO level. What
>         might be the problem?
>
>         In FileSystem.getFileBlockLocations() the hostname is
>         hard-coded as "localhost". Where is this mapped to the actual
>         host name, so that the AM will know which nodes to request?
>
>         Thanks for reply
>
>     --
>     +Vinod
>     Hortonworks Inc.
>     http://hortonworks.com/

--
Marcos Ortiz Valmaseda,
Product Manager && Data Scientist at UCI
Blog: http://marcosluis2186.posterous.com
Twitter: @marcosluis2186 <http://twitter.com/marcosluis2186>
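
The two points in Vinod's reply (splits are computed wherever getSplits() is called, which is the client, and the hosts on each split come from whichever filesystem implementation answers the block-location lookup) can be illustrated with a small toy model. This is only a sketch: SplitSketch, ToyFileSystem, ToyDfs, ToySplit, hostsFor, computeSplits, and the node names are all hypothetical names made up for illustration. It does not use Hadoop's classes or signatures and is not how Hadoop implements either method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitSketch {

    /** Toy stand-in for a filesystem. NOT Hadoop's FileSystem class. */
    static class ToyFileSystem {
        // Default lookup: every byte range is reported as living on
        // "localhost", like the default behaviour described in the thread.
        List<String> hostsFor(long offset, long length) {
            return Arrays.asList("localhost");
        }
    }

    /** Toy distributed filesystem: overrides the lookup with real hosts. */
    static class ToyDfs extends ToyFileSystem {
        private final long blockSize;
        private final List<List<String>> hostsPerBlock; // hypothetical block table

        ToyDfs(long blockSize, List<List<String>> hostsPerBlock) {
            this.blockSize = blockSize;
            this.hostsPerBlock = hostsPerBlock;
        }

        @Override
        List<String> hostsFor(long offset, long length) {
            // Report the hosts of the block that contains the start offset.
            int block = (int) (offset / blockSize);
            return hostsPerBlock.get(block);
        }
    }

    /** Toy split: a byte range plus the hosts that hold it. */
    static class ToySplit {
        final long start;
        final long length;
        final List<String> hosts;

        ToySplit(long start, long length, List<String> hosts) {
            this.start = start;
            this.length = length;
            this.hosts = hosts;
        }
    }

    /** Cuts a file into fixed-size splits and tags each one with hosts. */
    static List<ToySplit> computeSplits(ToyFileSystem fs, long fileLength, long splitSize) {
        List<ToySplit> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            long len = Math.min(splitSize, fileLength - start);
            splits.add(new ToySplit(start, len, fs.hostsFor(start, len)));
        }
        return splits;
    }

    public static void main(String[] args) {
        ToyFileSystem dfs = new ToyDfs(100L, Arrays.asList(
                Arrays.asList("node1", "node2"),
                Arrays.asList("node2", "node3"),
                Arrays.asList("node3", "node1")));
        // Anything logged here appears in the output of the process that
        // calls computeSplits, which in this sketch is the client.
        for (ToySplit s : computeSplits(dfs, 250L, 100L)) {
            System.out.println(s.start + "+" + s.length + " on " + s.hosts);
        }
    }
}
```

Passing a plain ToyFileSystem instead of ToyDfs makes every split report "localhost", which mirrors what the original poster observed. A scheduler that wants locality would read the hosts list of each split, which is why the override matters.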