Re: Which proposed distro of Hadoop, 0.20.206 or 0.22, will be better for HBase?
Steve Loughran 2011-10-07, 09:17
On 06/10/2011 17:49, [EMAIL PROTECTED] wrote:
> Steve,
>
>> Summary: I'm not sure that HDFS is the right FS in this world, as it
>> contains a lot of assumptions about system stability and HDD persistence
>> that aren't valid any more. With the ability to plug in new placers you
>> could do tricks like ensure 1 replica lives in a persistent blockstore
>> (and rely on it always being there), and add other replicas in transient
>> storage if the data is about to be needed in jobs.
>
> Can you please shed more light on the statement "... as it
> contains a lot of assumptions about system stability and HDD persistence
> that aren't valid any more..." ?
>
> I know that you were doing some analysis of disk failure modes sometime
> ago. Is this the result of that research ? I am very interested.
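
(A quick aside on the "plug in new placers" line quoted above: the shape would be roughly the sketch below. This is illustration only - the node model and interface are made up, not the real pluggable block placement API (HDFS-385, if I remember the JIRA number right), whose signatures vary by branch.)

import java.util.ArrayList;
import java.util.List;

// Hypothetical node model: what kind of storage a candidate datanode sits on.
interface Node {
    boolean isPersistentStore();   // backed by a durable blockstore that outlives the VMs
    boolean isTransient();         // short-lived VM storage, close to the compute
}

// Hypothetical placer: pin one replica to durable storage, put the rest
// on transient nodes near where the data is about to be needed.
class PersistentFirstPlacement {
    List<Node> chooseTargets(int replicas, List<Node> candidates) {
        List<Node> chosen = new ArrayList<Node>();
        for (Node n : candidates) {               // replica #1: always persistent
            if (n.isPersistentStore()) {
                chosen.add(n);
                break;
            }
        }
        for (Node n : candidates) {               // remaining replicas: transient is fine
            if (chosen.size() >= replicas) {
                break;
            }
            if (n.isTransient() && !chosen.contains(n)) {
                chosen.add(n);
            }
        }
        return chosen;
    }
}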

No, it's unrelated - it comes from experience hosting virtual Hadoop
infrastructures, which is how my short-lived clusters exist today:

-You don't know the hostnames of the master nodes until they are allocated,
so you need to allocate them first and then dynamically push out configs to
the workers (see the config sketch after this list).

-The DataNodes spin forever when the NameNode goes down, rather than
checking somewhere to see if it has moved. HDFS HA may fix that.

-It's dangerously easy to have >1 DN on the same physical host, losing
independence of that replica.

-It's possible for the entire cluster to go down without warning.
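
What the config push looks like in practice, roughly - a sketch only, with made-up hostnames, ports and output path; the 0.20-era keys are fs.default.name and mapred.job.tracker:

// Sketch: generate a worker-side site config once the master VMs are allocated.
import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;

public class PushWorkerConfig {
    public static void main(String[] args) throws Exception {
        // Only known after the masters come up; hostnames and ports here are examples.
        String namenode = args[0];     // e.g. "ip-10-1-2-3.internal"
        String jobtracker = args[1];

        Configuration conf = new Configuration(false);  // start empty, no default resources
        conf.set("fs.default.name", "hdfs://" + namenode + ":8020/");
        conf.set("mapred.job.tracker", jobtracker + ":8021");

        // This XML then gets pushed out to every worker's conf directory.
        OutputStream out = new FileOutputStream("generated-site.xml");
        conf.writeXml(out);
        out.close();
    }
}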

MR-layer issues

-Again, the TaskTrackers spin when the JobTracker goes down, rather than
looking to see if it has moved.

-Blacklisting isn't the right way to deal with TaskTracker failures:
terminating the VM is (see the sketch below).

-If the TTs are idle, VM termination may be the best action.
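
Roughly what "terminate, don't blacklist" means in a virtual cluster - a pure sketch, the CloudProvider and status types are imaginary and the thresholds are arbitrary:

// Sketch only: none of these types exist in the Hadoop codebase.
interface CloudProvider {
    void terminateInstance(String instanceId);
    void requestNewWorker();
}

class TrackerStatus {
    String instanceId;
    int recentTaskFailures;
    long idleMillis;
}

class VirtualClusterManager {
    private final CloudProvider cloud;

    VirtualClusterManager(CloudProvider cloud) {
        this.cloud = cloud;
    }

    void review(TrackerStatus tt, boolean clusterHasBacklog) {
        if (tt.recentTaskFailures > 3) {
            // A bad node gets its VM destroyed and replaced,
            // rather than being blacklisted and left running.
            cloud.terminateInstance(tt.instanceId);
            cloud.requestNewWorker();
        } else if (tt.idleMillis > 15 * 60 * 1000L && !clusterHasBacklog) {
            // An idle node in a pay-per-hour world is simply released.
            cloud.terminateInstance(tt.instanceId);
        }
    }
}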

Hadoop is optimised for large physical clusters. If you look at the
Stratosphere work at TU Berlin, they've designed a system that includes
VM allocation in the execution plan.

You can improve Hadoop to make it more agile; my now-defunct Hadoop
lifecycle branch did a lot of that, but you need everyone else using
Hadoop to be willing to let the changes go in - and those changes
mustn't impose a cost or risk on the physical cluster model.