Re: running map tasks in remote node
Harsh J 2013-08-25, 15:06
In multi-node mode, MR requires a distributed filesystem (such as
HDFS) to be able to run.
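To illustrate why, here is a minimal sketch in plain Java (no Hadoop
classes). It assumes that an input path without a scheme is interpreted
against the cluster's configured default filesystem, while a path with an
explicit scheme keeps it. The class name, the qualify() helper, and the
namenode address are hypothetical and are not part of any Hadoop API.

```java
import java.net.URI;

public class InputPathSketch {
    // Hypothetical helper, not a Hadoop API: resolves a job path against an
    // assumed default filesystem URI, the way a scheme-less path would be read.
    static URI qualify(URI defaultFs, String path) {
        // URI.resolve returns the argument unchanged when it already has a scheme.
        return defaultFs.resolve(path);
    }

    public static void main(String[] args) {
        // Assumed default filesystem; a real cluster sets this in its config.
        URI defaultFs = URI.create("hdfs://namenode:8020/");

        // No scheme: the path ends up on the shared default filesystem.
        System.out.println(qualify(defaultFs, "/data/input"));

        // Explicit file: scheme: every task opens this path on its own node.
        System.out.println(qualify(defaultFs, "file:///share_data/myfolder"));
    }
}
```

Either way, each map task opens the same URI from whichever node it runs
on, so the data behind that URI has to be visible from every node.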
On Sun, Aug 25, 2013 at 7:59 PM, rab ra <[EMAIL PROTECTED]> wrote:
> Dear Yong,
> Thanks for your elaborate answer. Your answer really makes sense, and I am
> ending up with something close to it, except for shared storage.
> In my use case, I am not allowed to use any shared storage system. The reason
> is that the slave nodes may not be safe for hosting sensitive data
> (they could belong to a different enterprise, or maybe to a cloud). I do
> agree that we still need this data on the slave node while processing,
> and hence need to transfer the data from the enterprise node to the
> processing nodes. But that's OK, as this is better than using the slave nodes
> for storage. If I could use shared storage, then I could use HDFS itself. I
> wrote simple example code with a 2-node cluster setup and tested various
> input formats such as WholeFileInputFormat, NLineInputFormat, and
> TextInputFormat. I faced issues when I did not want to use shared storage, as
> I explained in my last email. I was thinking that having the input file on
> the master node (job tracker) would be sufficient and that it would send a
> portion of the input file to the map process on the second node (slave). But
> this was not the case, as the method setInputPath() (and the MapReduce system)
> expects this path to be a shared one. All of these observations lead to a straightforward
> question: "Does the MapReduce system expect a shared storage system? And do
> the input directories need to be present in that shared system? Is there a
> workaround for this issue?" In fact, I am prepared to use HDFS just to
> convince the MapReduce system and feed input to it. And for the actual
> processing I shall end up transferring the required data files to the slave nodes.
> I do note that I cannot enjoy the advantages that come with HDFS, such as
> data replication, data locality awareness, etc.
> with thanks and regards
> On Fri, Aug 23, 2013 at 7:41 PM, java8964 java8964 <[EMAIL PROTECTED]>
>> It is possible to do what you are trying to do, but it only makes sense if
>> your MR job is very CPU intensive and you want to use the CPU resources in
>> your cluster rather than the I/O.
>> You may want to do some research on HDFS's role in Hadoop.
>> First and foremost, it provides central storage for all the files that will be
>> processed by MR jobs. If you don't want to use HDFS, you need to
>> identify a shared storage system that all the nodes in your cluster can access.
>> HDFS is NOT required, but shared storage is required in the cluster.
>> To simplify your question, let's just use NFS to replace HDFS. That is
>> feasible for a POC and will help you understand how to set it up.
>> Assume you have a cluster with 3 nodes (one NN and two DNs, with the JT
>> running on the NN and a TT running on each DN). Instead of using HDFS, you
>> can try NFS this way:
>> 1) Mount /share_data on both of your data nodes. They need to have the
>> same mount, so /share_data on each data node points to the same NFS location.
>> It doesn't matter where you host this NFS share; just make sure each
>> data node mounts it as the same /share_data.
>> 2) Create a folder under /share_data and put all your data into that folder.
>> 3) When you kick off your MR jobs, you need to give a full URL for the input
>> path, like 'file:///share_data/myfolder', and a full URL for the output
>> path, like 'file:///share_data/output'. This way, each mapper will
>> understand that it should read the data from the local file system
>> instead of HDFS. That's the reason you want to make sure each task node has
>> the same mount path, so that 'file:///share_data/myfolder' works for
>> each task node. Check that /share_data/myfolder
>> points to the same location on each of your task nodes.
>> 4) You want each mapper to process one file, so instead of using the
>> default 'TextInputFormat', use a 'WholeFileInputFormat', this will make sure