Re: Problems with MR Job running really slowly
1) I am varying both the number of mappers and reducers, trying to
determine what options I need so that mappers and reducers:
     - are not killed with "GC overhead limit exceeded"
     - minimize execution time for the cluster
I use a custom Splitter and can adjust the block size to get anywhere from
1 mapper to hundreds of mappers.
For an 8-node cluster I am trying 8, 16 and 24 reducers.
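For reference, a minimal sketch of how these two knobs are commonly set with
the 0.20-era mapred API; MyJob is a placeholder driver, and the split-size
hint only matters if the custom splitter actually reads it:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Hypothetical driver, not the poster's code.
    public class MyJob {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(MyJob.class);
            conf.setNumReduceTasks(16);  // one of the 8/16/24 values tried here
            // Only takes effect if the custom InputFormat honors the standard hint:
            conf.setLong("mapred.max.split.size", 64L * 1024 * 1024);
            // ... mapper/reducer classes, input/output paths, formats ...
            JobClient.runJob(conf);
        }
    }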

2) I have been playing with
io.sort.factor - using 100 for now
io.sort.mb is 400 - use much higher values and the job will not run

3) I set mapred.child.java.opts to -Xmx3000m (getting similar results to
using 1300)
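Inside the same hypothetical driver sketched above, these three knobs would
look like this (property names are the 0.20-era ones used in this thread):

    conf.setInt("io.sort.factor", 100);               // spill streams merged at once
    conf.setInt("io.sort.mb", 400);                   // map-side sort buffer, in MB
    conf.set("mapred.child.java.opts", "-Xmx3000m");  // heap for each child task JVM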

4) My mapper reads a single file about 1 GB in size. Each item the splitter
delivers (about 1 KB) generates tens of thousands of key/value pairs
(<100 bytes per value); a sketch of this fan-out appears at the end of this
message. I can do all the work of generating the output on one machine (but
not the shuffle and sort) in about an hour on one box, but my job runs for
many hours without completing. I also got a lot of the following error,
after seeing it in other tasks:

Lost task tracker:
tracker_glados4.systemsbiology.net:localhost.localdomain/127.0.0.1:32790

Caused by: java.lang.NullPointerException
at org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionOutputStream.write(BZip2Codec.java:200)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:41)
at java.io.DataOutputStream.writeByte(DataOutputStream.java:136)
at org.apache.hadoop.io.WritableUtils.writeVLong(WritableUtils.java:263)
at org.apache.hadoop.io.WritableUtils.writeVInt(WritableUtils.java:243)
at org.apache.hadoop.mapred.IFile$Writer.close(IFile.java:126)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1242)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)
The same job runs well with a smaller data set. Most of the reason for
moving to Hadoop is to allow solutions to scale, and I am very concerned at
how badly my larger cases are doing. The documentation says nothing about
how to tune parameters for my larger jobs without running into swap hell or
"GC overhead limit exceeded".
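As promised above, a hypothetical sketch of a mapper with this fan-out
shape, using the 0.20-era mapred API; TagFanOutMapper and the token-based
parse are illustrative stand-ins, not the actual job:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class TagFanOutMapper extends MapReduceBase
            implements Mapper<Text, Text, Text, Text> {

        private final Text outKey = new Text();
        private final Text outValue = new Text();

        public void map(Text tagName, Text tagBody,
                        OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            // Stand-in for the real XML-tag parsing: each token becomes one
            // small pair; in the job described above, a ~1 KB record emits
            // tens of thousands of such pairs (<100 bytes per value).
            StringTokenizer tok = new StringTokenizer(tagBody.toString());
            while (tok.hasMoreTokens()) {
                outKey.set(tagName);
                outValue.set(tok.nextToken());
                out.collect(outKey, outValue);
            }
        }
    }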

On Sun, Nov 6, 2011 at 7:11 AM, Florin P <[EMAIL PROTECTED]> wrote:

> Hello!
>
>   How many reducers are you using?
>   Regarding the performance parameters, first you can increase the size of
> the io.sort.mb parameter.
>  It seems that you are sending a large amount of data to the reducer. By
> increasing the value of this parameter, the framework will not be forced
> to spill data to disk during the shuffle phase, which could be one reason
> the process is slow.
>  If you are using one reducer, then all of the data is sent over HTTP to
> that single reducer - another thing to think about.
>  Just out of curiosity, also try increasing dfs.block.size to 128 MB. It
> seems that you are using the default 64 MB; you'll get fewer mapper tasks.
>  Also, depending on your machine configuration and how many CPU cores you
> have, you can increase mapred.tasktracker.{map|reduce}.tasks.maximum - the
> maximum number of map/reduce tasks run simultaneously on a given
> TaskTracker. It defaults to 2 (2 maps and 2 reduces), but vary it
> depending on your hardware (see the configuration sketch below).
>  You can have a look at
> http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html.
>  A good book for understanding the tuning parameters is Hadoop: The
> Definitive Guide by Tom White.
>
>  Hope that the above helps.
>  Regards,
>  Florin
>
>
>
>
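A minimal sketch of the knobs Florin suggests, continuing the hypothetical
driver from the sketches above; note that the TaskTracker maximums are
cluster-wide settings, not per-job ones:

    // Per-job settings, inside the same hypothetical driver as above:
    conf.setInt("io.sort.mb", 400);                      // bigger buffer -> fewer map-side spills
    conf.setLong("dfs.block.size", 128L * 1024 * 1024);  // applies to files this job writes;
                                                         // the input's block size was fixed
                                                         // when it was loaded into HDFS
    // Cluster-side knobs - set in mapred-site.xml on each TaskTracker node,
    // not in the job (defaults: 2 maps and 2 reduces per tracker):
    //   mapred.tasktracker.map.tasks.maximum
    //   mapred.tasktracker.reduce.tasks.maximum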
> --- On Thu, 11/3/11, Steve Lewis <[EMAIL PROTECTED]> wrote:
>
> From: Steve Lewis <[EMAIL PROTECTED]>
> Subject: Problems with MR Job running really slowly
> To: "mapreduce-user" <[EMAIL PROTECTED]>
> Date: Thursday, November 3, 2011, 11:07 PM
>
> I have a job which takes an XML file - the splitter breaks the file into
> tags, the mapper parses each tag and sends the data to the reducer. I am
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com