Bradford Stephens
2009-05-05, 02:44
Andrew Purtell
2009-05-05, 07:37
Steve Loughran
2009-05-05, 14:00
Bradford Stephens
2009-05-05, 16:53
Ricky Ho
2009-05-05, 17:46
Edward Capriolo
2009-05-05, 17:58
Steve Loughran
2009-05-06, 10:00
-
What do we call Hadoop+HBase+Lucene+Zookeeper+etc....Bradford Stephens 2009-05-05, 02:44
Hey all,
I'm going to be speaking at OSCON about my company's experiences with Hadoop and Friends, but I'm having a hard time coming up with a name for the entire software ecosystem. I'm thinking of calling it the "Apache CloudStack". Does this sound legit to you all? :) Is there something more 'official'?

Cheers,
Bradford
-
Re: What do we call Hadoop+HBase+Lucene+Zookeeper+etc....Andrew Purtell 2009-05-05, 07:37
Hi Bradford,

Your mail reminds me of something I recently came across:

http://svn.apache.org/repos/asf/labs/clouds/apache_cloud_computing_edition.pdf

If you have slides accompanying your talk, you might consider making them publicly available. I for one would love to see them.

Best regards,
- Andy
-
Re: What do we call Hadoop+HBase+Lucene+Zookeeper+etc....Steve Loughran 2009-05-05, 14:00
Bradford Stephens wrote:
> Hey all,
>
> I'm going to be speaking at OSCON about my company's experiences with
> Hadoop and Friends, but I'm having a hard time coming up with a name
> for the entire software ecosystem. I'm thinking of calling it the
> "Apache CloudStack". Does this sound legit to you all? :) Is there
> something more 'official'?

We've been using "Apache Cloud Computing Edition" for this, to emphasise that this is the successor to Java Enterprise Edition, and that it is cross-language and being built at Apache. If you use the same term, even if you put up a different stack outline than ours, it gives the idea more legitimacy.

The slides that Andrew linked to are all in SVN under http://svn.apache.org/repos/asf/labs/clouds/

We have a space in the Apache Labs for "Apache Clouds", where we want to do more work integrating things and bringing the idea of deploy-and-test on someone else's infrastructure into the mainstream across all the Apache products. We would welcome your involvement, and if you send a draft of your slides out, we will happily review them.

-steve
-
Re: What do we call Hadoop+HBase+Lucene+Zookeeper+etc....Bradford Stephens 2009-05-05, 16:53
I read through the deck and sent it around the company. Good stuff!
It's going to be a big help for getting the .NET Enterprise people to wrap their heads around web-scale data.

I must admit "Apache Cloud Computing Edition" is sort of unwieldy to say out loud, and frankly "Java Enterprise Edition" is a taboo phrase at a lot of projects I've been on. Guilt by association.

I think I'll call it "Apache Cloud Stack", and reference "Apache Cloud Computing Edition" in my deck. When I think "Stack", I think of a suite of software that provides all the pieces I need to solve my problem :)
-
RE: What do we call Hadoop+HBase+Lucene+Zookeeper+etc....Ricky Ho 2009-05-05, 17:46
The slide deck talks about possible bundling of various existing Apache technologies in distributed systems as well as some Java API to access Amazon cloud services.
What hasn't been discussed is the difference between a "traditional distributed architecture" and "the cloud". They are close, but not close enough to be treated as the same. In my opinion, some of the distributed technologies in Apache need to be enhanced in order to fit into the cloud more effectively. Let me focus on some cloud characteristics that our existing Apache distributed technologies haven't been paying attention to: extreme elasticity, trust boundaries, and cost awareness.

Extreme elasticity
==================
Most distributed technologies treat machine shutdown/startup as a relatively infrequent operation and haven't tried hard to minimize the cost of handling it. Take Hadoop as an example: although it handles machine crashes gracefully, it doesn't handle the cloud-bursting scenario well (i.e. when a lot of machines are added to a Hadoop cluster). You need to run a data redistribution task in the background, which slows down your existing jobs. Another example is that many scripts in Hadoop rely on config files that specify each cluster member's IP address. In a cloud environment IP addresses are unstable, so we need a discovery mechanism and the scripts need to be reworked.

Trust boundary
==============
Most distributed technologies assume a homogeneous environment (every member has the same degree of trust), which is not the case in a cloud environment. Additional processing (cryptographic operations for data transfer and storage) may be necessary when dealing with machines running in the cloud.

Cost awareness
==============
For the same reason (the assumption of a homogeneous environment), the scheduler is not aware of the cost involved when data moves across the cloud boundary, where bandwidth is relatively expensive. The Hadoop MapReduce scheduler needs to be more sophisticated when deciding where to start the Mappers and Reducers. Similarly, when making replica placement decisions, HDFS needs to be aware of which machine is located in which cloud.

That said, I am not discounting the existing Apache technology. In fact, we have already made a good start. We just need to go further.

Rgds,
Ricky
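
A minimal sketch of the cost-aware scheduling idea from this message, assuming each node carries a label for the cloud it runs in and that cross-cloud transfer has a flat per-gigabyte price. The names `Node`, `transfer_cost`, `pick_node` and the price of 0.10 per GB are hypothetical and are not part of Hadoop's scheduler API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    cloud: str       # hypothetical label: which cloud/datacentre hosts the node
    free_slots: int  # task slots currently available on the node


def transfer_cost(src_cloud, dst_cloud, gigabytes, price_per_gb=0.10):
    """Cost of moving data: free inside one cloud, priced across the boundary."""
    if src_cloud == dst_cloud:
        return 0.0
    return gigabytes * price_per_gb


def pick_node(nodes, data_cloud, gigabytes):
    """Pick the node with a free slot that is cheapest to ship the input to.

    Ties on cost are broken in favour of the node with more free slots.
    Returns None when no node has a free slot.
    """
    candidates = [n for n in nodes if n.free_slots > 0]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda n: (transfer_cost(data_cloud, n.cloud, gigabytes), -n.free_slots),
    )


nodes = [
    Node("a1", "private", 0),  # same cloud as the data, but busy
    Node("a2", "private", 1),  # same cloud as the data, one free slot
    Node("b1", "public", 4),   # plenty of capacity, but across the boundary
]
chosen = pick_node(nodes, data_cloud="private", gigabytes=50)
print(chosen.name)
```

The same cost function could, under the same assumptions, rank candidate machines for replica placement, so that copies cross the cloud boundary only when that is a deliberate choice.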
-
Re: What do we call Hadoop+HBase+Lucene+Zookeeper+etc....Edward Capriolo 2009-05-05, 17:58
'Cloud computing' is a hot term. According to the definition provided by Wikipedia (http://en.wikipedia.org/wiki/Cloud_computing), Hadoop+HBase+Lucene+Zookeeper fits some of the criteria, but not well.

Hadoop is scalable; with HOD it is dynamically scalable.

I do not think (Hadoop+HBase+Lucene+Zookeeper) can be used for 'utility computing', as managing the stack and getting started is quite a complex process.

Also, this stack runs best on a LAN with high-speed interlinks. Historically the "Cloud" is composed of WAN links. An implication of cloud computing is that different services would be running in different geographical locations, which is not how Hadoop is normally deployed.

I believe 'Apache Grid Stack' would be more fitting.

http://en.wikipedia.org/wiki/Grid_computing

Grid computing (or the use of computational grids) is the application of several computers to a single problem at the same time — usually to a scientific or technical problem that requires a great number of computer processing cycles or access to large amounts of data.

Grid computing, by the Wikipedia definition, describes exactly what Hadoop does. Without Amazon S3 and EC2, Hadoop does not fit well into 'cloud computing', IMHO.
-
Re: What do we call Hadoop+HBase+Lucene+Zookeeper+etc....Steve Loughran 2009-05-06, 10:00
Edward Capriolo wrote:
> 'cloud computing' is a hot term. According to the definition provided
> by wikipedia http://en.wikipedia.org/wiki/Cloud_computing,
> Hadoop+HBase+Lucene+Zookeeper, fits some of the criteria but not well.
>
> Hadoop is scalable, with HOD it is dynamically scalable.
>
> I do not think (Hadoop+HBase+Lucene+Zookeeper) can be used for
> 'utility computing'. as managing the stack and getting started is
> quite a complex process.

Exactly. Which is why the Apache Clouds proposal emphasises

- Lightweight front end: low-wattage, stateless nodes for the web GUI, bonded to the back end
- Instrumentation for liveness and load monitoring. Hadoop has a lot of this, I'm trying to add more, but we want it everywhere.
- Resource management: bringing up and tearing down nodes by asking the infrastructure. Some Apache projects have done this, but only for EC2 and only for their layer of the stack. You need something that keeps track of everything and acts in your interests, not those of the datacentre provider.
- Packaging for fully automated install/deploy on Linux systems (= rpm and deb)
- A development process in which the tools push the code out to a targeted infrastructure, even for test runs

Hadoop and friends are part of this. They are a very interesting foundation, but they are only part of the story.

> Also this stack is best running on LAN network with high speed
> interlinks. Historically the "Cloud" is composed of WAN links. An
> implication of Cloud Computing is that different services would be
> running in different geographical locations which is not how hadoop is
> normally deployed.
>
> I believe 'Apache Grid Stack' would be a more fitting.
>
> http://en.wikipedia.org/wiki/Grid_computing
>
> Grid computing (or the use of computational grids) is the application
> of several computers to a single problem at the same time — usually to
> a scientific or technical problem that requires a great number of
> computer processing cycles or access to large amounts of data.

Classic Grid computing (OGSI/OGSA) is something I want to steer clear of. Historically, you end up in WS-* and computer management politics. Furthermore, OGSA never had a good use case except "rewrite your apps for the cloud and they will be better". They (let's be fair, we) also focused too much on CPU scheduling, not on storage.

> Grid computing via the Wikipedia definition describes exactly what
> hadoop does. Without amazon S3 and EC2 hadoop does not fit well into a
> 'cloud computing' IMHO

To be precise: without a dynamic infrastructure provider, and that is more than just AWS. It could be Sun/Oracle, IBM/Google, HP/Intel/Yahoo!; it could be your ops team and Eucalyptus. The other hardware/service vendors are working on this infrastructure. Apache doesn't work at that level, but if we provide the code to run on all of them, we give users independence from any particular infrastructure provider.
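
A minimal sketch of the resource-management point in this reply: cluster code that asks "the infrastructure" for nodes through one interface, so the same logic runs against any provider. The `InfrastructureProvider` interface, the `InMemoryProvider` stand-in and the `resize` helper are all hypothetical names for illustration; they are not taken from the Apache Clouds proposal or from any vendor API.

```python
from abc import ABC, abstractmethod


class InfrastructureProvider(ABC):
    """Hypothetical provider-neutral interface for node lifecycle."""

    @abstractmethod
    def start_node(self) -> str:
        """Bring up one node and return its id."""

    @abstractmethod
    def stop_node(self, node_id: str) -> None:
        """Tear down the node with the given id."""


class InMemoryProvider(InfrastructureProvider):
    """Stand-in for a real provider; only tracks which ids are running."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.running = set()
        self._counter = 0

    def start_node(self):
        self._counter += 1
        node_id = f"{self.prefix}-{self._counter}"
        self.running.add(node_id)
        return node_id

    def stop_node(self, node_id):
        self.running.discard(node_id)


def resize(provider, current, target):
    """Grow or shrink a list of node ids to the target size."""
    nodes = list(current)
    while len(nodes) < target:
        nodes.append(provider.start_node())
    while len(nodes) > target:
        provider.stop_node(nodes.pop())  # most recently started node goes first
    return nodes


provider = InMemoryProvider("local")
cluster = resize(provider, [], 3)       # burst up to three nodes
cluster = resize(provider, cluster, 1)  # shrink back to one
print(cluster)
```

Because `resize` only sees the abstract interface, swapping the stand-in for an EC2-backed or Eucalyptus-backed implementation would not change the caller, which is the independence the reply argues for.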