Re: running map tasks in remote node
In multi-node mode, MR requires a distributed filesystem (such as
HDFS) to be able to run.
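
A minimal illustration of what that means in practice, assuming the Hadoop 2.x
API (the path below is only a hypothetical example): an unqualified input path
is resolved against whatever filesystem the configuration points at, which is
why every node in a multi-node cluster must be able to reach that filesystem.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowDefaultFs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // In a multi-node cluster this is typically hdfs://<namenode>:8020,
        // set in core-site.xml; the built-in default is file:///.
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));

        // An unqualified path is resolved against that default filesystem,
        // so every task node must be able to reach it.
        Path input = new Path("/user/rab/input");   // hypothetical path
        FileSystem fs = input.getFileSystem(conf);
        System.out.println("resolves to: " + fs.makeQualified(input));
    }
}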

On Sun, Aug 25, 2013 at 7:59 PM, rab ra <[EMAIL PROTECTED]> wrote:
> Dear Yong,
>
> Thanks for your elaborate answer. It really makes sense, and I am ending up
> with something close to it, except for the shared storage.
>
> In my use case, I am not allowed to use any shared storage system. The
> reason is that the slave nodes may not be safe for hosting sensitive data
> (because they could belong to a different enterprise, perhaps in the
> cloud). I do agree that we still need this data on the slave nodes while
> processing, and hence need to transfer the data from the enterprise node
> to the processing nodes. But that's OK, as it is better than using the
> slave nodes for storage. If I could use shared storage, then I could use
> HDFS itself. I wrote simple example code on a 2-node cluster setup and
> was testing various input formats such as WholeFileInputFormat,
> NLineInputFormat, and TextInputFormat. I faced issues when I did not want
> to use shared storage, as I explained in my last email. I was thinking
> that having the input file on the master node (job tracker) would be
> sufficient and that it would send a portion of the input file to the map
> process on the second node (slave). But this was not the case, as the
> method setInputPath() (and the MapReduce system) expects this path to be
> a shared one. All these observations lead to the straightforward
> questions: "Does the MapReduce system expect a shared storage system? Do
> the input directories need to be present on that shared system? Is there
> a workaround for this issue?" In fact, I am prepared to use HDFS just to
> satisfy the MapReduce system and feed it input; for the actual
> processing, I shall end up transferring the required data files to the
> slave nodes.
>
> I do note that I cannot enjoy the advantages that come with HDFS, such as
> data replication, data-locality awareness, etc.
>
>
> with thanks and regards
> rabmdu
>
>
>
>
>
>
>
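
The approach rab describes above (keep only a small manifest of work items on
the job's filesystem and let each map task transfer and process its own data
file) could be sketched roughly as follows, assuming the Hadoop 2.x mapreduce
API. The class names ManifestJob and FetchMapper and the paths are
hypothetical, and the actual transfer and processing inside map() is left out.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ManifestJob {

    // Each call to map() receives one line of the manifest, i.e. one
    // reference to a data file; the mapper is responsible for fetching that
    // file itself (scp, http, ...) and processing it locally.
    public static class FetchMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text fileRef, Context ctx)
                throws IOException, InterruptedException {
            // ... fetch fileRef from the enterprise node and process it ...
            ctx.write(fileRef, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "manifest-driven job");
        job.setJarByClass(ManifestJob.class);

        job.setMapperClass(FetchMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // One manifest line per map task, so each slave handles one file.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1);
        FileInputFormat.addInputPath(job, new Path("/user/rab/manifest.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/user/rab/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}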
> On Fri, Aug 23, 2013 at 7:41 PM, java8964 java8964 <[EMAIL PROTECTED]>
> wrote:
>>
>> It is possible to do what you are trying to do, but it only makes sense
>> if your MR job is very CPU intensive and you want to use the CPU
>> resources in your cluster rather than the IO.
>>
>> You may want to do some research about HDFS's role in Hadoop. First and
>> foremost, it provides central storage for all the files that will be
>> processed by MR jobs. If you don't want to use HDFS, then you need to
>> identify a shared storage system available to all the nodes in your
>> cluster. HDFS is NOT required, but shared storage is required in the
>> cluster.
>>
>> To keep your question simple, let's just use NFS to replace HDFS. That
>> is enough for a POC and will help you understand how to set it up.
>>
>> Assume you have a cluster with 3 nodes (one NN and two DNs, with the JT
>> running on the NN and a TT on each DN). Instead of using HDFS, you can
>> try to use NFS this way:
>>
>> 1) Mount /share_data on both of your data nodes. They need to have the
>> same mount point, so that /share_data on each data node points to the
>> same NFS location. It doesn't matter where you host this NFS share; just
>> make sure each data node mounts it as the same /share_data.
>> 2) Create a folder under /share_data, put all your data into that folder.
>> 3) When you kick off your MR job, you need to give a full URL for the
>> input path, like 'file:///share_data/myfolder', and a full URL for the
>> output path, like 'file:///share_data/output' (see the driver sketch at
>> the end of this message). This way, each mapper will understand that it
>> will in fact read the data from the local file system instead of HDFS.
>> That's the reason you want to make sure each task node has the same
>> mount path: 'file:///share_data/myfolder' should work the same way on
>> every task node. Check this and make sure that /share_data/myfolder
>> points to the same location on each of your task nodes.
>> 4) You want each mapper to process one file, so instead of using the
>> default 'TextInputFormat', use a 'WholeFileInputFormat'; this will make
>> sure each mapper receives one whole file as its split.

Harsh J
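
A minimal driver sketch for steps 3) and 4) above, assuming the Hadoop 2.x
mapreduce API: the fully qualified file:// URLs point at the common NFS mount,
so no HDFS is involved. WholeFileInputFormat is not a class that ships with
Hadoop (a possible implementation follows further down), and the built-in
identity Mapper stands in for the real processing.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NfsShareDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "process NFS share");
        job.setJarByClass(NfsShareDriver.class);

        // Step 4): one whole file per mapper.
        job.setInputFormatClass(WholeFileInputFormat.class);
        job.setMapperClass(Mapper.class);   // identity mapper as placeholder
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(BytesWritable.class);

        // Step 3): fully qualified local-filesystem URLs on the shared
        // mount, identical on every task node.
        FileInputFormat.addInputPath(job, new Path("file:///share_data/myfolder"));
        FileOutputFormat.setOutputPath(job, new Path("file:///share_data/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

And a possible WholeFileInputFormat, in the spirit of the class of the same
name from "Hadoop: The Definitive Guide" (it is not part of Hadoop itself): it
marks files as non-splittable and hands each mapper the complete file contents
as a single <NullWritable, BytesWritable> record.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileInputFormat
        extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // never split: one file == one map task
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    // Reads the entire file backing the split as one record.
    public static class WholeFileRecordReader
            extends RecordReader<NullWritable, BytesWritable> {
        private FileSplit fileSplit;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.fileSplit = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            byte[] contents = new byte[(int) fileSplit.getLength()];
            Path file = fileSplit.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }

        @Override
        public NullWritable getCurrentKey() { return NullWritable.get(); }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { }
    }
}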