

Re: Which proposed distro of Hadoop, 0.20.206 or 0.22, will be better for HBase?
On 06/10/11 03:00, Jagane Sundar wrote:
> Thanks for your input, Milind. It's very useful and interesting.
>
> In the interest of brevity, I have truncated most of it except for the point
> regarding 'cloud friendly'. I have done some research into this, and want to
> get some more community feedback.
>
>> 2. Built in support for the cloud. (Whirr is interesting. Ambari more so,
>>> but both fall short.)
>>
>> Not very sure. If by "support for the cloud" means ability to provision
>> atop a hypervisor, adding or removing instances etc, I think there are
>> other approaches proven in the industry.
>>
I've just started the wiki page on this topic:
http://wiki.apache.org/hadoop/Virtual%20Hadoop
> There are two aspects to cloud friendliness - deployment
> technologies/automation, and storage.

-agility to handle the failure modes of cloud infrastructure
-security in a shared infrastructure
-flexibility based on demand

> As far as deployment automation is concerned, I am eager to know what other
> approaches you are familiar with. Chef/Puppet et al. are not interesting to
> me. I want this to have end user self-serve service characteristics, not
> 'end users file ticket, sysadmin runs [chef|puppet|other] script'.

I've done this with a web UI: ask for the number of machines, bring up a
master node running the NN, the JT and a single DN and, once that is up,
bring up the workers with a config that includes the hostname of the
master node.
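
The bring-up order described above can be sketched roughly as follows. This is only an illustration, not the actual web UI or SmartFrog code: `Node`, `bring_up_cluster`, `launch` and `wait_until_up` are hypothetical names, the worker roles (DataNode plus TaskTracker) are an assumption, and the calls that really talk to the infrastructure are passed in by the caller.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Node:
    """A machine that has been requested from the infrastructure (hypothetical type)."""
    hostname: str
    roles: List[str]
    config: Dict[str, str]


def bring_up_cluster(
    worker_count: int,
    launch: Callable[[List[str], Dict[str, str]], Node],
    wait_until_up: Callable[[Node], None],
) -> Tuple[Node, List[Node]]:
    """Start the master first, then the workers once the master is reachable.

    `launch` and `wait_until_up` are supplied by the caller; they stand in
    for whatever API the hosting infrastructure actually offers.
    """
    # One master node runs the NameNode, the JobTracker and a single DataNode.
    master = launch(["namenode", "jobtracker", "datanode"], {})
    wait_until_up(master)

    # Workers only start after the master is up, because their config
    # has to carry the master's hostname.
    worker_config = {"master_hostname": master.hostname}
    workers = [
        launch(["datanode", "tasktracker"], dict(worker_config))
        for _ in range(worker_count)
    ]
    return master, workers
```

Passing the launcher in as a function keeps the ordering logic independent of any particular cloud API, which also makes it easy to exercise with a fake launcher.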

Also deployed was a web UI for long haul job submission
http://www.slideshare.net/steve_l/long-haul-hadoop

That was originally all deployed with the SmartFrog framework and a
modified version of the Hadoop codebase for tighter integration; these
days I just untar and reconfigure a 0.20.20x .tar.gz file on the target
machines, patching in the late-binding information.
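
As a rough illustration of what "patching in the late binding information" might look like, here is a minimal sketch that renders the two site files once the master's hostname is known. It assumes the late-bound values are the 0.20-era `fs.default.name` and `mapred.job.tracker` properties; the function names, the example hostname and the default ports (8020 and 8021) are assumptions, not what was actually deployed.

```python
from xml.sax.saxutils import escape


def render_site_xml(properties):
    """Render a dict of name/value pairs as a Hadoop-style *-site.xml document."""
    lines = ['<?xml version="1.0"?>', "<configuration>"]
    for name, value in sorted(properties.items()):
        lines.append("  <property>")
        lines.append("    <name>%s</name>" % escape(name))
        lines.append("    <value>%s</value>" % escape(value))
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines) + "\n"


def late_binding_configs(master_hostname, nn_port=8020, jt_port=8021):
    """Build the config files that can only be written once the master exists.

    The ports are assumed defaults; use whatever the cluster really listens on.
    """
    return {
        "core-site.xml": render_site_xml(
            {"fs.default.name": "hdfs://%s:%d" % (master_hostname, nn_port)}
        ),
        "mapred-site.xml": render_site_xml(
            {"mapred.job.tracker": "%s:%d" % (master_hostname, jt_port)}
        ),
    }


if __name__ == "__main__":
    # Hypothetical hostname, for illustration only.
    for filename, body in late_binding_configs("master.example.internal").items():
        print("---", filename)
        print(body)
```

The resulting files would then be written into the `conf/` directory of the untarred distribution on each worker before its daemons are started.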

>
> Storage is very interesting. My own thoughts, from analyzing EC2 and EMR are
> as follows. (A lot of the following is speculation and educated guesswork,
> so I may be totally off, but here it is anyway):
>
> Amazon's philosophy is totally 'on-demand bring up when needed and tear down
> when done'. I like this philosophy a lot. However it does not work well for
> storage. Storage needs to be always up and available. Hence, they took
> Hadoop, stripped off HDFS and built a shim to S3, their object storage
> service. There is no POSIX there. Map Reduce jobs run in VMs that are
> brought up on demand, and access the S3 hosted files using the protocol s3n
> (n stands for native - that's native to s3 not native to Hadoop). When this
> turned out to be slow as sh**, they seem to have hacked the HDFS layer some
> more, in order to actually have a NameNode for metadata, but to use S3 for
> storing blocks. They have a protocol s3 to access this. Both of these
> approaches have one severe failing - they do not support append and hflush.
> Ergo: no HBase on EMR. I am sure they are working furiously to address this
> shortcoming and add append/hflush support to s3n or s3, in order to make it
> possible to run HBase on EMR. In the meantime, anecdotal evidence suggests
> that at least half of Amazon's customers are opting to use Apache Hadoop on
> EC2 VMs with EBS storage (completely bypassing the EMR offering).

More expensive, but more flexible in terms of what you can run

> EBS itself
> is an interesting storage technology. It is block storage offered over the
> ethernet network, from an occasionally sync'd local disk elsewhere. EBS has
> some storage resiliency built in, so the question of how many replicas when
> HDFS is built on top of this is very interesting.
>
>
> This problem of offering a cost effective Hadoop as an on-demand self
> service offering in the cloud is very interesting. This is a nut I want to
> crack....
>

Summary: I'm not sure that HDFS is the right FS in this world, as it
contains a lot of assumptions about system stability and HDD persistence
that aren't valid any more. With the ability to plug in new block
placers you could do tricks like ensuring one replica lives in a
persistent blockstore (and relying on it always being there), and adding
other replicas in transient storage if the data is about to be needed in
jobs.
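
To make the placement idea concrete, here is a toy sketch of such a policy. It is not Hadoop's placement API: `StorageNode`, `choose_replicas` and the `needed_soon` flag are hypothetical, and how a real placer would learn that data is "about to be needed" is left open.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageNode:
    """A place a block replica can live (hypothetical type)."""
    name: str
    persistent: bool  # True for a durable blockstore, False for VM-local disk


def choose_replicas(nodes, replication, needed_soon, rng=random):
    """Pick targets for one block.

    Exactly one replica always goes to persistent storage. Extra replicas
    are placed on transient nodes only when a job is about to read the data.
    """
    persistent = [n for n in nodes if n.persistent]
    if not persistent:
        raise ValueError("no persistent store available for the durable replica")
    chosen = [rng.choice(persistent)]

    if needed_soon:
        transient = [n for n in nodes if not n.persistent]
        # Never ask for more transient replicas than there are transient nodes.
        extra = max(0, min(replication - 1, len(transient)))
        chosen.extend(rng.sample(transient, extra))
    return chosen
```

With one persistent store and two transient VMs, a block that is not about to be read gets a single durable replica, while a block needed by an upcoming job with `replication=3` gets the durable replica plus two transient copies.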