Re: How to add another file system in Hadoop
Dear Nikhil and all,
    Your question is a bit complex to answer, and since I am not an expert
on Hadoop at the moment, the following answer may contain some errors; any
suggestions are welcome.

1. Your MR command is issued by the client, which submits a job to the
JobTracker of the Hadoop cluster.
2. The JobTracker will split the input file (usually according to the
block size of the underlying DFS), and then the JobTracker will have a
number of map tasks and reduce tasks; usually each map task will process
one block and write out some intermediate data.
3. The JobTracker will schedule these tasks to different TaskTrackers
according to the block locations in the DFS (the block being the one the
map task will process). If, unfortunately, a map task cannot be assigned
to a TaskTracker which has the block stored locally, the data of the block
will be transferred to the node where the task will run (this is done in
the underlying DFS object, and this is where *getFileBlockLocations* takes
effect; the MR framework will not even notice it; see the sketch after
this list).

4. So, you see, your client will not collect all the remote data locally;
it only submits a job and tells the JobTracker how to split the input file,
how to do the map, how to combine the intermediate data, how to do the
reduce, where the input file is in the DFS, and where to write the output
in the DFS.
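
As a small illustration, here is a minimal sketch (the input path and class
name are made-up placeholders, not something from your setup) of roughly
the call the MR framework makes, via FileInputFormat.getSplits(), to find
out which hosts hold the blocks of an input file:

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical input path; point it at a file that exists in your DFS.
        Path input = new Path("/user/nikhil/input/data.txt");
        FileSystem fs = input.getFileSystem(conf);
        FileStatus status = fs.getFileStatus(input);

        // This is the lookup the JobTracker relies on to schedule map tasks
        // close to the data instead of pulling the data to the client.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
          System.out.println("offset " + b.getOffset()
              + ", length " + b.getLength()
              + ", hosts " + Arrays.toString(b.getHosts()));
        }
      }
    }
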
Maybe you should search for some blog posts, or refer to <Hadoop: The
Definitive Guide> written by Tom White for a more authoritative answer.

yours,
Ling Kun
On Fri, Feb 22, 2013 at 1:05 PM, Agarwal, Nikhil
<[EMAIL PROTECTED]> wrote:

>  Hi All,
>
> Thanks a lot for taking out your time to answer my question.
>
> Ling, thank you for directing me to glusterfs. I can surely take a lot of
> help from that, but what I wanted to know is that in README.txt it is
> mentioned:
>
> >> # ./bin/start-mapred.sh
>
>   If the map/reduce job/task trackers are up, all I/O will be done to
> GlusterFS.
>
> So, suppose my input files are scattered across different nodes (glusterfs
> servers), how do I (a hadoop client having glusterfs plugged in) issue a
> MapReduce command?
>
> Moreover, after issuing a MapReduce command, would my hadoop client fetch
> all the data from the different servers to my local machine and then do the
> MapReduce, or would it start the TaskTracker daemons on the machine(s) where
> the input file(s) are located and perform the MapReduce there?
>
> Please rectify me if I am wrong, but I suppose that the location of the
> input files to MapReduce is being returned by the function
> *getFileBlockLocations(FileStatus file, long start, long len)*.
>
> Thank you very much for your time and helping me out.
>
> Regards,
>
> Nikhil
>
> *From:* Agarwal, Nikhil
> *Sent:* Thursday, February 21, 2013 4:19 PM
> *To:* '[EMAIL PROTECTED]'
> *Subject:* How to add another file system in Hadoop
>
> Hi,
>
> I am planning to add a file system called CDMI under org.apache.hadoop.fs
> in Hadoop, something similar to KFS or S3 which are already there under
> org.apache.hadoop.fs. I wanted to ask: say I write my file system for CDMI
> and add the package under fs, but then how do I tell core-site.xml or the
> other configuration files to use the CDMI file system? Where all do I need
> to make changes to enable the CDMI file system to become a part of Hadoop?
>
> Thanks a lot in advance.
>
> Regards,
>
> Nikhil
>
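
P.S. On the core-site.xml part of your original question: as far as I know,
Hadoop picks the FileSystem class for a URI scheme from the fs.<scheme>.impl
property, so besides writing your FileSystem subclass you mainly need a
configuration entry. A rough sketch, where the "cdmi" scheme and the class
name org.apache.hadoop.fs.cdmi.CDMIFileSystem are only placeholders for
whatever you actually implement:

    <!-- core-site.xml (sketch): register a hypothetical CDMI file system.
         CDMIFileSystem would be your subclass of org.apache.hadoop.fs.FileSystem. -->
    <property>
      <name>fs.cdmi.impl</name>
      <value>org.apache.hadoop.fs.cdmi.CDMIFileSystem</value>
    </property>

    <!-- Optional: make it the default file system, so paths without a
         scheme are resolved against CDMI. -->
    <property>
      <name>fs.default.name</name>
      <value>cdmi://your-cdmi-host:port/</value>
    </property>

I believe the existing S3 and KFS file systems are registered in the same
way in core-default.xml, so they are a good pattern to copy.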

--
http://www.lingcc.com