-
Mapper runs only on one machine
praveen.peddi@... 2010-11-16, 17:24
Hi all, I have been trying to figure out why all mappers run on only one machine when I have a 4-node cluster. The reduce part runs correctly on all 4 nodes. I am using 0.20.2. My input is a single large file (10GB).
Here is my config in mapred-site.xml. I set mapred.map.tasks to 30, but I only see one map task, and that on only one machine. Are there any other parameters I need to set to get map tasks distributed evenly across the nodes?
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master-hadoop:54311</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx4096m</value>
    <description>map heap size for child task</description>
  </property>
  <property>
    <name>mapred.reduce.parallel.copies</name>
    <value>5</value>
    <description></description>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>30</value>
    <description></description>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>6</value>
    <description></description>
  </property>
</configuration>
-
Re: Mapper runs only on one machine
Steve Lewis 2010-11-16, 17:33
Are you sure your input file is splittable? Many formats (gzip, for example) are not, and such a file must be processed by a single map task on a single machine.
-- Steven M. Lewis PhD 4221 105th Ave Ne Kirkland, WA 98033 206-384-1340 (cell) Institute for Systems Biology Seattle WA
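As a rough illustration of the splittability point, here is a minimal Python sketch of how the number of map tasks follows from the input file rather than from mapred.map.tasks. It is not Hadoop's actual split logic. The function name `estimate_map_tasks`, the 64 MB block size, and the suffix-based check are assumptions made for the example; the thread does not state the cluster's block size.

```python
import math

# Assumption: 64 MB HDFS block size. The thread does not state the real value.
BLOCK_SIZE = 64 * 1024 * 1024

# Assumption: a simple suffix check stands in for real codec detection.
NON_SPLITTABLE_SUFFIXES = (".gz",)


def estimate_map_tasks(filename, size_bytes, block_size=BLOCK_SIZE):
    """Roughly estimate how many map tasks a single input file yields."""
    if filename.endswith(NON_SPLITTABLE_SUFFIXES):
        # A non-splittable file is read start to finish by one mapper.
        return 1
    # A splittable file yields roughly one split per block.
    return max(1, math.ceil(size_bytes / block_size))


ten_gb = 10 * 1024 ** 3
print(estimate_map_tasks("input.csv.gz", ten_gb))  # 1
print(estimate_map_tasks("input.csv", ten_gb))     # 160
```

Under these assumptions, a 10GB gzip file gives one map task no matter what mapred.map.tasks says, while the same amount of plain text gives about 160.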
-
RE: Mapper runs only on one machine
praveen.peddi@... 2010-11-16, 18:59
That's a good point. I was indeed using a gzip file containing a csv file. I uncompressed it and used the csv file directly, and now I can see many mappers running concurrently.
Thanks for the suggestion. This is an important piece of information many people will miss since compressed format is a more logical way of passing the data. Not sure if this is documented on Hadoop but I could not find it.
Praveen
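The workaround described here (uncompress the gzip file and feed the plain csv to the job) can be sketched in Python as below. This is only an illustrative sketch: the helper name `gunzip_file` and the file names in the usage note are hypothetical, and the thread does not say how the file was actually uncompressed.

```python
import gzip
import shutil


def gunzip_file(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress a gzip file to dst_path without loading it into memory."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Copy in fixed-size chunks so a multi-gigabyte input stays manageable.
        shutil.copyfileobj(src, dst, chunk_size)
```

For example, `gunzip_file("input.csv.gz", "input.csv")` would produce a plain csv file that can then be copied into HDFS and split across mappers.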
-
Re: Mapper runs only on one machine
Harsh J 2010-11-16, 19:42
Hi,
On Wed, Nov 17, 2010 at 12:29 AM, <[EMAIL PROTECTED]> wrote:
> Thanks for the suggestion. This is an important piece of information many
> people will miss since compressed format is a more logical way of passing
> the data. Not sure if this is documented on Hadoop but I could not find it.
The problem is with the gzip algorithm itself. gzip cannot decompress starting from an arbitrary point in a file (it's not block-compressed, unlike lzo).
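To illustrate, here is a minimal Python sketch using only the standard library. It has nothing to do with Hadoop's own code. The helper `can_decompress_from` and the sample data are made up for the example. It only shows a gzip decoder rejecting input that starts mid-stream; the underlying reason is that the compressed stream depends on state built up from the bytes before it.

```python
import gzip
import zlib


def can_decompress_from(data, offset):
    """Return True if a gzip decoder accepts `data` starting at `offset`."""
    try:
        # 16 + MAX_WBITS tells zlib to expect a gzip header.
        zlib.decompressobj(16 + zlib.MAX_WBITS).decompress(data[offset:])
        return True
    except zlib.error:
        return False


# Hypothetical csv-like sample data, compressed as a single gzip stream.
payload = "".join(f"row {i},value {i * i}\n" for i in range(10000)).encode()
blob = gzip.compress(payload)

print(can_decompress_from(blob, 0))               # True
print(can_decompress_from(blob, len(blob) // 2))  # False
```

A mapper handed the second half of such a file would be in the same position as the second call: it has no valid place to start reading.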
There was some work on enabling gzip splits too, much like how lzo splitting is done via indexing, but it has not been active for a while now. See MAPREDUCE-491 and HADOOP-6153 for the patches.
-- Harsh J www.harshj.com