Re: How to add another file system in Hadoop
Ling Kun 2013-02-22, 07:56
Dear Nikhil and all,
Your question is a bit complex to answer, and since I am not yet a Hadoop expert, the following answer may contain errors; any suggestions are welcome.

1. Your MR command is issued by the client, which submits a job to the JobTracker of the Hadoop cluster.
2. The input file is split (usually according to the block size of the underlying DFS), and the JobTracker then has a number of map tasks and reduce tasks. Usually each map task consumes one block and writes some intermediate data.
3. The JobTracker schedules these tasks to different TaskTrackers according to the location in the DFS of the block that each map task will consume. This location information is what *getFileBlockLocations* provides. If, unfortunately, a map task cannot be assigned to a TaskTracker that stores its block, the block's data is transferred to the node where the task runs (this is done inside the underlying DFS object, and the MR framework will not notice it).
4. So, you see, your client will not collect all the remote data locally. It only submits a job and tells the JobTracker how to split the input file, how to do the map, how to combine the intermediate data, how to do the reduce, where the input file is in the DFS, and where to write the output in the DFS.

Maybe you should search for some blog posts, or refer to "Hadoop: The Definitive Guide" by Tom White, for a more authoritative answer.

yours,
Ling Kun

On Fri, Feb 22, 2013 at 1:05 PM, Agarwal, Nikhil <[EMAIL PROTECTED]> wrote:

> Hi All,
>
> Thanks a lot for taking the time to answer my question.
>
> Ling, thank you for directing me to glusterfs. I can surely take a lot of
> help from that, but what I wanted to know is that in README.txt it is
> mentioned:
>
> >> # ./bin/start-mapred.sh
> If the map/reduce job/task trackers are up, all I/O will be done to
> GlusterFS.
>
> So, suppose my input files are scattered across different nodes (glusterfs
> servers), how do I (a hadoop client having glusterfs plugged in) issue a
> MapReduce command?
>
> Moreover, after issuing a MapReduce command, would my hadoop client fetch
> all the data from the different servers to my local machine and then do a
> MapReduce, or would it start the TaskTracker daemons on the machine(s) where
> the input file(s) are located and perform a MapReduce there?
>
> Please correct me if I am wrong, but I suppose that the location of the input
> files is returned to MapReduce by the function
> getFileBlockLocations(FileStatus file, long start, long len).
>
> Thank you very much for your time and for helping me out.
>
> Regards,
> Nikhil
>
> From: Agarwal, Nikhil
> Sent: Thursday, February 21, 2013 4:19 PM
> To: '[EMAIL PROTECTED]'
> Subject: How to add another file system in Hadoop
>
> Hi,
>
> I am planning to add a file system called CDMI under org.apache.hadoop.fs
> in Hadoop, something similar to KFS or S3, which are already there under
> org.apache.hadoop.fs. Say I write my file system for CDMI and add the
> package under fs: how do I then tell core-site.xml or the other
> configuration files to use the CDMI file system? Where do I need to make
> changes to make the CDMI file system a part of Hadoop?
>
> Thanks a lot in advance.
>
> Regards,
> Nikhil

--
http://www.lingcc.com
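
P.S. To illustrate steps 2 and 3 of the reply above, here is a minimal Java sketch of block-based splitting and locality-aware task placement. It is only a toy model under simplifying assumptions, not Hadoop's actual scheduler: the class LocalitySketch, the helpers numSplits and assign, the node names, and the slot counts are all hypothetical. The per-block host lists stand in for the kind of information that getFileBlockLocations reports.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of locality-aware map task placement; NOT Hadoop's real scheduler. */
public class LocalitySketch {

    /** Number of block-sized splits needed to cover a file (hypothetical helper). */
    public static long numSplits(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize; // ceiling division
    }

    /**
     * Places one map task per block.
     * blockHosts.get(i) lists the nodes that store block i.
     * freeSlots maps each TaskTracker node to its free map slots (updated in place).
     * Returns, per block, the chosen node and whether the read is local or remote.
     */
    public static List<String> assign(List<List<String>> blockHosts,
                                      Map<String, Integer> freeSlots) {
        List<String> placement = new ArrayList<>();
        for (List<String> hosts : blockHosts) {
            String chosen = null;
            boolean local = false;
            // Prefer a TaskTracker that already stores the block.
            for (String host : hosts) {
                if (freeSlots.getOrDefault(host, 0) > 0) {
                    chosen = host;
                    local = true;
                    break;
                }
            }
            // Otherwise take any tracker with a free slot; the DFS would then
            // ship the block's data to that node.
            if (chosen == null) {
                for (Map.Entry<String, Integer> e : freeSlots.entrySet()) {
                    if (e.getValue() > 0) {
                        chosen = e.getKey();
                        break;
                    }
                }
            }
            if (chosen == null) {
                placement.add("unscheduled");
                continue;
            }
            freeSlots.put(chosen, freeSlots.get(chosen) - 1);
            placement.add(chosen + (local ? " (local)" : " (remote read)"));
        }
        return placement;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;  // assumed 64 MB block size
        long fileSize = 200L * 1024 * 1024;  // assumed 200 MB input file
        System.out.println("splits: " + numSplits(fileSize, blockSize)); // splits: 4

        List<List<String>> blockHosts = List.of(
                List.of("node1"), List.of("node1"), List.of("node2"), List.of("node3"));
        Map<String, Integer> freeSlots = new LinkedHashMap<>();
        freeSlots.put("node1", 1);
        freeSlots.put("node2", 2);
        freeSlots.put("node3", 1);
        System.out.println(assign(blockHosts, freeSlots));
    }
}
```

In this example node1 has only one free slot, so the second block stored on node1 is handed to node2 and read remotely, while the client itself never pulls any of the data to its own machine.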