HDFS, mail # user - RE: running map tasks in remote node


RE: running map tasks in remote node
java8964 java8964 2013-08-22, 11:10
If you don't plan to use HDFS, what kind of shared file system are you going to use between the cluster nodes? NFS? For what you want to do, even though it doesn't make too much sense, you first need to solve that problem: the shared file system.
Second, if you want to process the files file by file, instead of block by block as in HDFS, then you need to use a WholeFileInputFormat (google how to write one). Then you don't need a file listing all the files to be processed; just put them into one folder on the shared file system and pass that folder to your MR job. That way, as long as each node can access the folder through some file system URL, each file will be processed in its own mapper.
Yong
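
A minimal sketch of such a WholeFileInputFormat, written against the newer org.apache.hadoop.mapreduce API, might look roughly like the following (the class and its record reader are illustrative, not a stock Hadoop class):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Emits each whole file as a single record: key = nothing, value = the file's bytes.
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;  // never split: one file == one split == one map task
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    // Reads the entire file backing the split into one BytesWritable value.
    static class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
        private FileSplit fileSplit;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.fileSplit = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            byte[] contents = new byte[(int) fileSplit.getLength()];
            Path file = fileSplit.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }

        @Override
        public NullWritable getCurrentKey() { return NullWritable.get(); }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { /* nothing to close */ }
    }
}

In the job driver you would then set it with job.setInputFormatClass(WholeFileInputFormat.class) and point FileInputFormat.addInputPath(job, ...) at the shared folder.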

Date: Wed, 21 Aug 2013 17:39:10 +0530
Subject: running map tasks in remote node
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]

Hello,
Here is the newbie question of the day. For one of my use cases, I want to use Hadoop MapReduce without HDFS. I will have a text file containing a list of file names to process. Assume that I have 10 lines (10 files to process) in the input text file and I wish to generate 10 map tasks and execute them in parallel on 10 nodes. I started with the basic Hadoop tutorial, set up a single-node Hadoop cluster and successfully tested the wordcount code.
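
For the one-map-task-per-line requirement, one possible approach (a sketch only, assuming a Hadoop release that ships the new-API org.apache.hadoop.mapreduce.lib.input.NLineInputFormat; FileNameMapper below is a hypothetical mapper, not existing code) would be a driver like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FileListDriver {

    // Hypothetical mapper: each map() call receives one line of the list file,
    // i.e. one file name to process.
    public static class FileNameMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text fileName, Context context)
                throws java.io.IOException, InterruptedException {
            // open and process the file named on this line here ...
            context.write(fileName, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "process file list");
        job.setJarByClass(FileListDriver.class);

        // One input line (one file name) per split => one map task per file name.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1);

        job.setMapperClass(FileNameMapper.class);
        job.setNumReduceTasks(0);               // map-only job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // the list file
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Whether those 10 tasks then land on 10 different nodes depends on which TaskTrackers have free map slots at the time.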
Now I have taken two machines, A (master) and B (slave), and applied the configuration below to set up a two-node cluster.
hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/tmp/hadoop-bala/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/tmp/hadoop-bala/dfs/data</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>A:9001</value>
  </property>
</configuration>

mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>A:9001</value>
  </property>
  <property>
    <name>mapreduce.tasktracker.map.tasks.maximum</name>
    <value>1</value>
  </property>
</configuration>
core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://A:9000</value>
  </property>
</configuration>
On both A and B I have a file named ‘slaves’ with a single entry ‘B’ in it, and another file called ‘masters’ with an entry ‘A’.
I have kept my input file on A. I can see the map method processing the input file line by line, but all of that processing happens on A. Ideally, I would expect the processing to take place on B.
 Can anyone highlight where I am going wrong?
regards
rab