Re: Need help on accessing datanodes local filesystem using hadoop map reduce framework
On Sat, Oct 23, 2010 at 1:44 AM, Burhan Uddin <[EMAIL PROTECTED]> wrote:
> I am a beginner with the hadoop framework. I am trying to create a distributed
> crawling application. I have googled a lot, but the resources are too few.
> Can anyone please help me on the following topics?
I suppose you know already (since you mention Lucene), but Nutch
(nutch.apache.org) is a crawler built on Hadoop.
> 1. I want to access the local file system of a datanode. Suppose I have
> crawled sites a and b. Is it somehow possible, using the hadoop api, to control
> which datanode will be used to store them? Like, I want to store site a on
> datanode 1 and site b on datanode 2, or just the way I wish.
No. HDFS decides where to place data based on factors such as
balancing the storage load across the cluster, the run-time health of
the datanodes, etc. IMHO, this is a good thing. If you wanted to
control data placement yourself, you'd have to worry about all of
these difficult cases too. To turn the question around slightly,
though: why do you need to control where the data is placed?
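To give a feel for the kind of policy HDFS applies on your behalf, here is a toy sketch (not Hadoop's actual code, and the function and node names are made up) of the default replica placement idea: first replica on the writer's node, second on a node in a different rack, third on another node in that same remote rack.

```python
# Toy model of rack-aware replica placement; a simplified sketch,
# not the real HDFS BlockPlacementPolicy implementation.
import random

def place_replicas(writer_node, nodes_by_rack, replication=3):
    """nodes_by_rack: dict mapping rack id -> list of node names."""
    writer_rack = next(r for r, ns in nodes_by_rack.items()
                       if writer_node in ns)
    placement = [writer_node]  # first replica: local to the writer
    # Second replica: a node on a different rack, for fault tolerance.
    other_racks = [r for r in nodes_by_rack if r != writer_rack]
    remote_rack = random.choice(other_racks)
    second = random.choice(nodes_by_rack[remote_rack])
    placement.append(second)
    # Third replica: another node on that same remote rack
    # (cheaper than crossing racks a second time).
    candidates = [n for n in nodes_by_rack[remote_rack] if n != second]
    if candidates:
        placement.append(random.choice(candidates))
    return placement[:replication]
```

Notice that nothing in this policy cares about *which site's pages* are in the block; it only cares about failure domains and load, which is exactly why per-site placement isn't exposed to you.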
> 2. When I create a map reduce job for lucene indexing, if a map process on
> datanode 1 requires data from datanode 2, will all of it come through the master
> node? Since I need to access the data with hdfs://master:port, does it mean it
> will exchange all data through the master node?
Nope. Only metadata, such as which nodes store which blocks, goes
through the master. The actual data I/O happens directly between the
client and the datanodes that hold the blocks.
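A toy model of that read path may make it concrete (the class and method names here are illustrative, not the real Hadoop client API): the namenode answers one metadata question, and the bytes come straight from the datanodes.

```python
# Simplified sketch of an HDFS read, assuming a one-replica-read
# and ignoring checksums, pipelining, and failure handling.
class NameNode:
    """Holds only metadata: which datanodes store each block of a file."""
    def __init__(self, block_map):
        # block_map: path -> list (per block) of datanode names
        self.block_map = block_map

    def get_block_locations(self, path):
        return self.block_map[path]

class DataNode:
    """Holds the actual bytes of the blocks assigned to it."""
    def __init__(self, name, blocks):
        self.name = name
        self.blocks = blocks  # (path, block_index) -> bytes

    def read_block(self, path, idx):
        return self.blocks[(path, idx)]

def hdfs_read(namenode, datanodes, path):
    """Client-side read: one metadata RPC, then direct block reads."""
    data = b""
    for idx, replicas in enumerate(namenode.get_block_locations(path)):
        dn = datanodes[replicas[0]]       # pick the first replica
        data += dn.read_block(path, idx)  # bytes bypass the master
    return data
```

The master's bandwidth is therefore not a bottleneck for bulk data transfer; it only serves small location lookups.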
> 3. How can I make sure that a map process (like lucene indexing on
> crawled data) is running right on the datanode that contains the data?
> (Maybe I could not explain it well. It's like I really don't want datanode
> 2 (storing site b) to be indexing site a (which is stored on datanode 1),
> since that consumes a lot of network traffic.)
This is referred to as data locality, and Hadoop handles it in the
framework. When a map task is scheduled, Hadoop's schedulers try to
place it on a node where the data it needs is already available.
> Please, anyone, reply to me as early as possible.
I'd recommend you go over the documentation on HDFS and Map/Reduce
available off hadoop.apache.org, which covers these concepts in more
detail, as do several other sources, like Tom White's book on Hadoop.