Hadoop, mail # general - Which proposed distro of Hadoop, 0.20.206 or 0.22, will be better for HBase?


Re: Which proposed distro of Hadoop, 0.20.206 or 0.22, will be better for HBase?
Steve Loughran 2011-10-06, 09:48
On 06/10/11 03:00, Jagane Sundar wrote:
> Thanks for your input, Milind. It's very useful and interesting.
>
> In the interest of brevity, I have truncated most of it except for the point
> regarding 'cloud friendly'. I have done some research into this, and want to
> get some more community feedback.
>
>> 2. Built in support for the cloud. (Whirr is interesting. Ambari more so,
>>> but both fall short.)
>>
>> Not very sure. If by "support for the cloud" means ability to provision
>> atop a hypervisor, adding or removing instances etc, I think there are
>> other approaches proven in the industry.
>>
I've just started the wiki page on this topic:
http://wiki.apache.org/hadoop/Virtual%20Hadoop
> There are two aspects to cloud friendliness - deployment
> technologies/automation, and storage.

- agility to handle the failure modes of cloud infrastructure
- security in a shared infrastructure
- flexibility based on demand

> As far as deployment automation is concerned, I am eager to know what other
> approaches you are familiar with. Chef/Puppet et al. are not interesting to
> me. I want this to have end user self-serve service characteristics, not
> 'end users file ticket, sysadmin runs [chef|puppet|other] script'.

I've done this with a web UI: ask for the number of machines, bring up the
NN/JT/single-DN master node, and once that is up, bring up the workers with a
config that includes the hostname of the master node.
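That two-phase ordering can be sketched roughly as below. Everything here is illustrative: `provision()` and `wait_until_up()` are hypothetical stand-ins for whatever cloud API and health-check polling actually do the work, not the real web UI's code.

```python
def provision(role, config):
    """Hypothetical stand-in for a cloud provisioning call.
    Returns a fake hostname; in reality this would talk to the
    infrastructure API and block until the instance exists."""
    return "%s-host" % role

def wait_until_up(hostname):
    """Stand-in for polling the NN/JT web ports until they answer."""
    return True  # assume the master eventually comes up

def bring_up_cluster(num_workers):
    # Phase 1: one master node running NameNode, JobTracker and a single DataNode.
    master = provision("master", {"roles": ["nn", "jt", "dn"]})
    wait_until_up(master)
    # Phase 2: workers are only created once the master's hostname is known,
    # so their config can point straight at it -- the late-binding step.
    workers = [
        provision("worker%d" % i,
                  {"fs.default.name": "hdfs://%s:8020" % master})
        for i in range(num_workers)
    ]
    return master, workers

master, workers = bring_up_cluster(3)
```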

Also deployed was a web UI for long-haul job submission:
http://www.slideshare.net/steve_l/long-haul-hadoop

That was originally all deployed with the SmartFrog framework and a
modified version of the Hadoop codebase for tighter integration; these
days I just untar and reconfigure a 0.20.20x .tar.gz file on the target
machines, patching in the late-binding information.
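The "patch in the late-binding information" step can be as simple as templating the config files in the unpacked tarball. A sketch, assuming the standard 0.20 `core-site.xml` property name; the templating approach itself is just one way to do it (sed over the unpacked tree works equally well):

```python
# Minimal late-binding patch: fill the master hostname into core-site.xml.
# fs.default.name is the standard 0.20-era property; the template mechanism
# is illustrative, not any particular tool's.
CORE_SITE_TEMPLATE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://{master}:8020</value>
  </property>
</configuration>
"""

def render_core_site(master_hostname):
    """Return the core-site.xml text bound to the given master host."""
    return CORE_SITE_TEMPLATE.format(master=master_hostname)

print(render_core_site("nn.internal"))
```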

>
> Storage is very interesting. My own thoughts, from analyzing EC2 and EMR, are
> as follows. (A lot of the following is speculation and educated guesswork,
> so I may be totally off, but here it is anyway):
>
> Amazon's philosophy is totally 'on-demand bring up when needed and tear down
> when done'. I like this philosophy a lot. However it does not work well for
> storage. Storage needs to be always up and available. Hence, they took
> Hadoop, stripped off HDFS and built a shim to S3, their object storage
> service. There is no POSIX there. MapReduce jobs run in VMs that are
> brought up on demand, and access the S3-hosted files using the protocol s3n
> (n stands for native - that's native to S3, not native to Hadoop). When this
> turned out to be slow as sh**, they seem to have hacked the HDFS layer some
> more, in order to actually have a NameNode for metadata, but to use S3 for
> storing blocks. They have a protocol s3 to access this. Both of these
> approaches have one severe failing - they do not support Append and Hflush.
> ergo - no HBase on EMR. I am sure they are working furiously to address this
> shortcoming and add append/hflush support to s3n or s3, in order to make it
> possible to run HBase on EMR. In the meantime, anecdotal evidence suggests
> that at least half of Amazon's customers are opting to use Apache Hadoop on
> EC2 VMs with EBS storage (completely bypassing the EMR offering).

More expensive, but more flexible in terms of what you can run
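The s3n/s3 split described above is just Hadoop's usual scheme-to-filesystem binding (`fs.<scheme>.impl`). The class names below are the real ones from the Hadoop codebase; the resolver itself is only a toy model of the lookup, not Hadoop's FileSystem cache:

```python
from urllib.parse import urlparse

# Toy model of Hadoop's fs.<scheme>.impl lookup. The implementation class
# names are the real ones; the resolver is illustrative.
FS_IMPLS = {
    "hdfs": "org.apache.hadoop.hdfs.DistributedFileSystem",
    "s3n":  "org.apache.hadoop.fs.s3native.NativeS3FileSystem",  # files stored as-is in S3
    "s3":   "org.apache.hadoop.fs.s3.S3FileSystem",              # S3 used as a block store
}

def impl_for(uri):
    """Return the filesystem implementation class bound to a URI's scheme."""
    return FS_IMPLS[urlparse(uri).scheme]

print(impl_for("s3n://mybucket/logs/part-00000"))
```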

 > EBS itself
> is an interesting storage technology. It is block storage offered over the
> ethernet network, from an occasionally sync'd local disk elsewhere. EBS has
> some storage resiliency built in, so the question of how many replicas when
> HDFS is built on top of this is very interesting.
>
>
> This problem of offering a cost effective Hadoop as an on-demand self
> service offering in the cloud is very interesting. This is a nut I want to
> crack....
>
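The replica-count question comes down to simple arithmetic: every HDFS replica on EBS multiplies a volume that already carries its own redundancy. A toy cost comparison; the per-GB price is a made-up placeholder, not Amazon's actual pricing:

```python
def monthly_storage_cost(data_gb, hdfs_replication, price_per_gb_month):
    # Raw capacity consumed = logical data size * HDFS replication factor.
    return data_gb * hdfs_replication * price_per_gb_month

EBS_PRICE = 0.10  # $/GB-month -- placeholder figure, not real pricing

# Classic HDFS assumes unreliable local disks, hence replication=3; on EBS,
# which is already resilient, a lower dfs.replication may be defensible.
cost_r3 = monthly_storage_cost(1000, 3, EBS_PRICE)
cost_r2 = monthly_storage_cost(1000, 2, EBS_PRICE)
print(cost_r3, cost_r2)
```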

Summary: I'm not sure that HDFS is the right FS in this world, as it
contains a lot of assumptions about system stability and HDD persistence
that aren't valid any more. With the ability to plug in new block
placement policies you could do tricks like ensuring one replica lives in
a persistent blockstore (and rely on it always being there), and adding
other replicas in transient storage if the data is about to be needed in
jobs.
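That placement trick can be stated concretely. A toy chooser along those lines -- one replica pinned to persistent storage, the rest on transient nodes -- where the `(hostname, is_persistent)` labelling is an invented representation, not Hadoop's actual block placement API:

```python
def choose_targets(nodes, replication):
    """Pick `replication` target nodes, guaranteeing the first replica
    lands on a node backed by persistent storage.
    `nodes` is a list of (hostname, is_persistent) pairs -- an invented
    representation for this sketch."""
    persistent = [name for name, durable in nodes if durable]
    transient = [name for name, durable in nodes if not durable]
    if not persistent:
        raise ValueError("need at least one persistent node")
    # Replica 1: the durable blockstore we can always fall back on.
    targets = [persistent[0]]
    # Remaining replicas: prefer transient storage near where jobs run,
    # topping up from spare persistent nodes if needed.
    targets += (transient + persistent[1:])[:replication - 1]
    return targets

nodes = [("dn1", True), ("dn2", False), ("dn3", False)]
print(choose_targets(nodes, 3))
```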