MapReduce >> mail # user >> Problems with MR Job running really slowly

Problems with MR Job running really slowly
I have a job which takes an XML file. The splitter breaks the file into
tags, the mapper parses each tag and sends the data to the reducer. I am
using a custom splitter which reads the file looking for start and end
tags.
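For reference, the start/end-tag scan the splitter performs can be sketched self-contained like this (the tag name "scan" and the class name TagScanner are placeholders, not names from the actual job; the real splitter works over an HDFS byte stream rather than an in-memory String):

```java
import java.util.ArrayList;
import java.util.List;

public class TagScanner {
    // Returns each startTag..endTag span (inclusive) as one record,
    // scanning forward through the document exactly once.
    public static List<String> split(String xml, String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        int pos = 0;
        while (true) {
            int s = xml.indexOf(startTag, pos);
            if (s < 0) break;                       // no more start tags
            int e = xml.indexOf(endTag, s + startTag.length());
            if (e < 0) break;                       // unterminated tag: stop
            records.add(xml.substring(s, e + endTag.length()));
            pos = e + endTag.length();              // resume after this record
        }
        return records;
    }

    public static void main(String[] args) {
        String doc = "<data><scan>a</scan><scan>b</scan></data>";
        for (String r : split(doc, "<scan>", "</scan>")) {
            System.out.println(r);
        }
    }
}
```

Each record handed to the mapper is then one complete start..end span, which matches the timing numbers below (the per-tag parse itself is cheap).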

When I run the splitter and mapper code (generating separate tags and
parsing them) on my local system, I can read a file of about 500 MB
containing 12,000 tags in 23 seconds.

When I read the same file from HDFS on a local cluster, I can read and
parse it in 38 seconds.

When I run the same code on an eight-node cluster I get 7 map tasks. The
mappers are taking 190 seconds to handle 100 tags, of which 200 ms is
parsing; almost all of the rest of the time is in context.write. A mapper
handling 1,600 tags takes about 3 hours. These are the statistics for one
map task. It is true that one tag will be sent to about 300 keys, but
3 hours to write 1.5 million records and 5 GB still seems way excessive:

FILE_BYTES_READ            816,935,457
HDFS_BYTES_READ            439,554,860
FILE_BYTES_WRITTEN       1,667,745,197
TotalScoredScans                 1,660

Map-Reduce Framework
Map input records                6,134
Map output records             571,475
Map output bytes         5,517,423,780
Spilled Records              1,690,063
Combine input records                0
Combine output records               0

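From the counters above, Spilled Records (1,690,063) is roughly three times Map output records (571,475), so each record is apparently being spilled and re-merged more than once, and the ~5.5 GB of map output goes to local disk uncompressed. Map-output compression and a larger sort buffer are the usual first knobs for this pattern; a sketch using Hadoop 0.20/1.x property names on a JobConf/Configuration named conf (names may differ in newer versions):

```java
conf.setBoolean("mapred.compress.map.output", true);    // compress map output before spill/shuffle
conf.set("mapred.map.output.compression.codec",
         "org.apache.hadoop.io.compress.DefaultCodec"); // or SnappyCodec where available
conf.setInt("io.sort.mb", 400);                         // larger in-memory sort buffer (default 100)
conf.setFloat("io.sort.spill.percent", 0.90f);          // fill the buffer further before spilling
```

A combiner would also help if the per-key values can be pre-aggregated; the counters show combine input/output records are currently 0.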
Anyone want to offer suggestions on how to tune the job better?

Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com