

Is perfect control over mapper num AND split distribution possible?
I am running a job that takes no input from the mapper-input key/value interface.  Each mapper reads the same small file from the distributed cache and processes it independently (to generate a Monte Carlo sampling of the problem space).  I am using MR purely to parallelize the otherwise redundant and separate sampling processes.  To maximize parallelism, I want to set the number of mappers explicitly, such that 10 samples run in exactly 1X time by being distributed perfectly over 10 mappers.

I am accomplishing this by generating a dummy MR input file of placeholder data.  Every row is identical, so I know the exact length of each row.  I then simply set the split size to the row length, with the intention that Hadoop will assign exactly the intended number of mappers.  This approach mostly works.  However, I get a few extraneous empty mappers.  Since they get no input, they do no work and exit almost immediately, so they aren't a serious drain on cluster resources, but I'm confused about why I get extra mappers in the first place.
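The split arithmetic behind this setup can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the row count (10) and row length (20 bytes, excluding the end-line) are hypothetical values, each row is assumed to occupy its length plus one end-line byte on disk, and the split count is assumed to be a simple ceiling division of file size by split size. The real framework may apply some slack to the final split, so treat the numbers as approximate.

```java
public class SplitCount {
    // Number of byte-range splits, assuming plain ceiling division
    // of the file size by the split size (an assumption, see above).
    static long countSplits(long fileBytes, long splitBytes) {
        return (fileBytes + splitBytes - 1) / splitBytes;
    }

    public static void main(String[] args) {
        long rows = 10;        // hypothetical number of samples
        long rowLength = 20;   // hypothetical row length, excluding the end-line byte
        long fileBytes = rows * (rowLength + 1); // each row plus one end-line byte

        // Split size equal to the row length without the end-line.
        System.out.println(countSplits(fileBytes, rowLength));     // prints 11
        // Split size equal to the row length plus the end-line.
        System.out.println(countSplits(fileBytes, rowLength + 1)); // prints 10
    }
}
```

Under these assumptions, a split size that ignores the end-line byte yields one more split than there are rows, which is at least consistent with seeing extra mappers.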

My working theory was that the end-lines of the input file must be accounted for when calculating split sizes (so my splits were too small and I got a few extra splits hanging off the end of the input file).  I attempted to fix this by adding one to the calculated split size (one greater than the actual row length now).  This works perfectly, generating exactly the intended number of mappers, exactly the same number as there are rows in the input file.  However, the labor distribution is not perfect.  Almost every single run produces one mapper which receives no input (and ends immediately) and another mapper which receives two inputs, thus triggering two "processing sessions" on that particular mapper such that it takes twice as long to complete as the other mappers.  Obviously, this wrecks the potential parallelism by literally doubling the overall job time.
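One possible way to reason about the uneven distribution is to simulate which split each row lands in. The sketch below is not Hadoop code and does not claim to reproduce any particular Hadoop version. It assumes a boundary rule that some line-oriented record readers use: every split except the first discards the line that begins exactly at its start offset, and every split reads one line past its end offset, so a split [start, end) owns the lines whose first byte p satisfies start < p <= end. The class name, method name, and the values 10 and 21 are hypothetical.

```java
import java.util.Arrays;

public class LineAssignment {
    // Counts how many rows each split would process under the ASSUMED rule:
    // split k owns a line starting at byte p when k*S < p <= (k+1)*S,
    // and the first split additionally owns the line starting at byte 0.
    static int[] linesPerSplit(int rows, int bytesPerRow, int splitBytes) {
        int fileBytes = rows * bytesPerRow;
        int numSplits = (fileBytes + splitBytes - 1) / splitBytes; // ceiling division
        int[] counts = new int[numSplits];
        for (int r = 0; r < rows; r++) {
            int lineStart = r * bytesPerRow;
            int k = (lineStart == 0) ? 0 : (lineStart - 1) / splitBytes;
            counts[k]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // 10 rows of 21 bytes each (20 bytes of data plus one end-line byte),
        // with the split size set to the full 21 bytes.
        System.out.println(Arrays.toString(linesPerSplit(10, 21, 21)));
        // prints [2, 1, 1, 1, 1, 1, 1, 1, 1, 0]
    }
}
```

If the reader in use follows a rule like this, rows that begin exactly on a split boundary are shifted to the preceding split, leaving one mapper with two rows and one with none, which matches the pattern described above. Whether this is what actually happens depends on the record reader and Hadoop version in use.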

Which split size is correct: row length without end-line or row length with end-line?  The former yields extra empty mappers while the latter yields exactly the right number.  However, if the latter is correct, why is the task distribution uneven (albeit NEARLY even) and what (if anything) can be done about it?

Thanks.

________________________________________________________________________________
Keith Wiley     [EMAIL PROTECTED]     keithwiley.com    music.keithwiley.com

"The easy confidence with which I know another man's religion is folly teaches
me to suspect that my own is also."
                                           --  Mark Twain
________________________________________________________________________________