Re: Multiple various streaming questions
On Feb 4, 2011, at 07:46 , Keith Wiley wrote:

> On Feb 3, 2011, at 6:41 PM, Allen Wittenauer wrote:
>> If all the task slots are in use, why would you care if they are queueing up?  Also keep in mind that if a node fails, that work will need to get re-done anyway.
> Because not all slots are in use.  It's a very large cluster and it's excruciating that Hadoop partially serializes a job by piling multiple map tasks onto a single node in a queue even when the cluster is massively underutilized.  This occurs when the input records are significantly smaller than the block size (6MB vs 64MB in my case, giving me about a 32x serialization cost!!!).  To put it differently, if I let Hadoop do it its own stupid way, the job takes 32 times longer than it would if it distributed the map tasks evenly across the nodes.  Packing the input files into larger sequence files does not help with this problem.  The input splits are calculated from the individual files and thus I still get this undesirable packing effect.
Having reread my last paragraph, I am now reconsidering its tone.  I apologize.  I am entirely open to the possibility that there are smarter ways to achieve my goal of minimum job-turnaround time (maximum parallelism), perhaps via configuration parameters which I have not learned how to use properly.  Furthermore, I am willing to admit that the frustrating and seemingly illogical partial serialization that I witnessed in my jobs under Hadoop's default configuration was not necessarily Hadoop's fault, but may have originated from some ineptitude on my part w.r.t. configuring, programming, and using Hadoop properly.

In other words, I am perfectly willing to admit I might just not be using Hadoop correctly and that this problem is therefore basically my fault.
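
For anyone following the arithmetic, the 32x figure quoted above can be read as a ratio of "waves" of map tasks: tasks queued behind one slot run one after another, while tasks spread over many slots run together. The sketch below is only an illustration under assumptions that are not in the original message: every map task takes the same time, and the counts (32 tasks, 1 slot used, 32 slots available) are hypothetical numbers chosen to reproduce the quoted factor. The function names are made up for this example and are not part of Hadoop.

```python
import math


def map_waves(num_tasks: int, slots: int) -> int:
    """Number of sequential rounds needed to run num_tasks equal-length
    map tasks when `slots` of them can run at the same time."""
    if num_tasks < 1 or slots < 1:
        raise ValueError("num_tasks and slots must both be at least 1")
    return math.ceil(num_tasks / slots)


def serialization_factor(num_tasks: int, slots_used: int, slots_available: int) -> float:
    """How many times longer the map phase takes when only slots_used
    slots receive work, compared with using all slots_available slots.
    Assumes equal-length tasks and ignores scheduling overhead."""
    return map_waves(num_tasks, slots_used) / map_waves(num_tasks, slots_available)


if __name__ == "__main__":
    # Hypothetical figures: 32 small map tasks queued on one slot,
    # versus the same 32 tasks spread over 32 idle slots.
    print(serialization_factor(num_tasks=32, slots_used=1, slots_available=32))
```

Under these assumptions the queued layout needs 32 waves and the spread-out layout needs 1, which is where a factor of 32 would come from. Real jobs will differ, since task lengths vary and data locality affects where tasks are placed.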


Keith Wiley               [EMAIL PROTECTED]               www.keithwiley.com

"You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can
itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't
scratch. All together this implies: He scratched the itch from the scratch that
itched but would never itch the scratch from the itch that scratched."
  -- Keith Wiley