-
HadoopDB and similar stuff
CubicDesign 2009-09-14, 21:03
Hi.

Does anybody have experience with a DB that can handle large amounts of data on top of Hadoop? HBase and Hive are nice, but they also lack some features. HadoopDB seems to bring some equilibrium. However, it seems to be still an infant project.

Any thoughts?
-
Re: HadoopDB and similar stuff
Amandeep Khurana 2009-09-14, 21:06
HadoopDB is not a DB on top of Hadoop. It's more like doing map reduce over database instances rather than HDFS. HBase is the most stable structured storage layer available over Hadoop.

What kind of features are you looking for?
-
Re: HadoopDB and similar stuff
CubicDesign 2009-09-15, 09:27
> What kind of features are you looking for?

Hi.

We want to use Hadoop (Streaming) to run some tools to process over 1 million entries per job. Each tool will output one string, so we will have 1 million outputs as well. Each string (probably 5 KB to 50 KB long) will be parsed, and the parsing will yield about 25-30 columns. There may be several jobs per day.

We need to collect the output of these tools and store it somewhere for later analysis. The results of one job need to stay together, as in one table. So we need a DB that can store over one million rows (hmm... or columns?) per table and support some nice SQL queries.

A Hadoop-oriented DB would be nice because it can store data safely (fault tolerant), and because it is distributed we won't have bottlenecks like the ones we have with the current MySQL DB.
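The workload described above (one string per entry, parsed into a fixed set of columns) could be sketched as a Hadoop Streaming mapper along the following lines. At 1 million strings of 5 KB to 50 KB each, the raw output is on the order of 5 GB to 50 GB per job. This is only an illustration under stated assumptions: the `key=value;key=value` input format, the column names in `COLUMNS`, and the helper names `parse_tool_output`, `to_row`, and `run` are all hypothetical, since the thread does not say what the tool output looks like.

```python
import sys

# Hypothetical column order; the thread only says "about 25-30 columns".
COLUMNS = ["name", "score", "status"]


def parse_tool_output(raw):
    """Parse 'key=value;key=value' text into a dict (the format is an assumption)."""
    fields = {}
    for pair in raw.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields


def to_row(entry_id, raw):
    """Turn one tool output string into a tab-separated row of fixed width."""
    fields = parse_tool_output(raw)
    # Missing columns become empty strings so every row has the same width.
    values = [fields.get(col, "") for col in COLUMNS]
    return "\t".join([entry_id] + values)


def run(lines):
    """Yield one output row per non-empty 'entry_id<TAB>raw_output' input line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        entry_id, _, raw = line.partition("\t")
        yield to_row(entry_id, raw)


# As a Streaming mapper, the script body would be:
#     for row in run(sys.stdin):
#         print(row)
```

Emitting one tab-separated row per entry keeps the results of a job together as a set of delimited text files, which a system such as Hive can then expose as a table.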
-
RE: HadoopDB and similar stuff
Omer Trajman 2009-09-14, 21:22
The closest thing that's stable may be DBInputFormat, which allows you to Map/Reduce on data that's in a database and also query the same database via the native SQL interface. In this case the DB sits under or next to Hadoop.

[shameless-plug] Vertica has an optimized VerticaInput/OutputFormat based on DBInputFormat that can handle large amounts of data [/shameless-plug]
-
Re: HadoopDB and similar stuff
Ted Dunning 2009-09-14, 22:40
You don't really say what you want here.

Do you want a database that lives in Hadoop's file storage? HBase is the closest for that.

Do you want to be able to import or export data between Hadoop and a database? There is a DB input/output format (or three) that could help with that. Cloudera has their Sqoop software as well.

Do you want to tightly integrate SQL and map-reduce? Asterdata has a product that might help you.

Did you mean something else entirely?

Ask again with more details about what you really want and I am sure somebody will help you out.

--
Ted Dunning, CTO
DeepDyve
-
Re: HadoopDB and similar stuff
Jeff Hammerbacher 2009-09-15, 06:21
> Do you want to tightly integrate SQL and map-reduce? Asterdata has a
> product that might help you.

As does Greenplum. You could also get this functionality from Pig or Hive, which are Apache 2.0-licensed subprojects of Hadoop.
-
Re: HadoopDB and similar stuff
Ted Dunning 2009-09-15, 07:42
Uhhh... neither Pig nor Hive is really SQL. They offer a higher level of abstraction than pure MR, but not SQL.

You are right to include Greenplum, though. They slipped my mind, probably because they don't have a Google ad running every 30 seconds like Aster does.

--
Ted Dunning, CTO
DeepDyve
-
Re: HadoopDB and similar stuff
Jeff Hammerbacher 2009-09-15, 14:28
Hey Ted,

I don't want to derail this thread, but I would like to correct any misperceptions which may exist in the community.

1) HiveQL intends to include SQL as a subset of its syntax: see the VLDB paper for more (http://www.slideshare.net/namit_jain/hive-demo-paper-at-vldb-2009). As it stands today, a reasonable subset of SQL is already supported, and most users of MySQL, Oracle, or PostgreSQL will be able to work comfortably in Hive today.

2) There's a patch for SQL support in Pig: http://issues.apache.org/jira/browse/PIG-824.

Every database implements a different dialect of SQL (e.g. express a Top K query in your favorite database and compare to the rest), and the Pig and HiveQL dialects are as valid as any other. If you disagree, I'd love to hear your perspective on why these languages are "not SQL".

Regards,
Jeff
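The Top K remark above can be made concrete with a small comparison. The sketch below lists how the same "top 3 by score" query is commonly written in a few dialects and runs only the `LIMIT` form, against an in-memory SQLite database. The table name `results`, its columns, the sample rows, and the names `TOP_K_BY_DIALECT` and `top_k_sqlite` are made up for illustration; the other two query strings are shown for comparison and are not executed here.

```python
import sqlite3

# The same Top K (K = 3) query as commonly written in different SQL dialects.
TOP_K_BY_DIALECT = {
    # MySQL, PostgreSQL and SQLite use a trailing LIMIT clause.
    "limit": "SELECT name, score FROM results ORDER BY score DESC LIMIT 3",
    # SQL Server puts TOP right after SELECT.
    "top": "SELECT TOP 3 name, score FROM results ORDER BY score DESC",
    # Classic Oracle filters on ROWNUM around an ordered subquery.
    "rownum": (
        "SELECT name, score FROM "
        "(SELECT name, score FROM results ORDER BY score DESC) "
        "WHERE ROWNUM <= 3"
    ),
}


def top_k_sqlite():
    """Run the LIMIT variant against a throwaway in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE results (name TEXT, score INTEGER)")
    conn.executemany(
        "INSERT INTO results VALUES (?, ?)",
        [("a", 10), ("b", 40), ("c", 30), ("d", 20)],
    )
    rows = conn.execute(TOP_K_BY_DIALECT["limit"]).fetchall()
    conn.close()
    return rows
```

All three strings express the same intent, yet none of them is portable across all of the engines mentioned, which is the sense in which every database has its own dialect.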
-
Re: HadoopDB and similar stuff
Edward Capriolo 2009-09-15, 15:34
I notice we have mentioned Greenplum and Aster. I will say up front that I have never used either product, but I have spoken over the years to some sales reps, who were very helpful, I might add.

* Caveat: I am not saying that my price information is accurate or current.

But the major deal breaker at my old places of employment was always cost. Per-TB pricing was a major deal breaker for us. We wanted to keep our data indefinitely, but most reporting is month-over-month. So having to keep all our data (which we don't really use much after two months) in a system that charges by the TB was expensive and would become more expensive as our data set grew.

In this solution space you get a lot of bang for your buck with (Hadoop + Hive) vs. (Teradata, Greenplum, Aster), since the price of Hadoop + Hive is (0 + 0) plus hardware.

Hive is not 100% SQL, but I would say join the Hive user list and be amazed. New types of joins, theta-joins, etc. have been added by user request. Most of the time, if you can't do something you would expect to do in SQL, there is a workaround.

The flip side is true as well: Hive has specific support that other databases don't :)
-
Re: HadoopDB and similar stuff
Ted Dunning 2009-09-15, 16:23
I don't need to be amazed. I am a strong proponent of map-reduce. People forget, but I bought the beer at the first Hadoop summit at Gordon Biersch.

I just don't think that selling Hive or Pig as SQL is fair to the buyer or the seller. They aren't the same and have very different virtues.

On Tue, Sep 15, 2009 at 8:34 AM, Edward Capriolo <[EMAIL PROTECTED]> wrote:
> Hive is not 100% SQL, but I would say join the Hive user list and be
> amazed.

--
Ted Dunning, CTO
DeepDyve
-
Re: HadoopDB and similar stuff
Edward Capriolo 2009-09-15, 19:01
On Tue, Sep 15, 2009 at 12:23 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> I don't need to be amazed. I am a strong proponent of map-reduce.

Ted,

I meant that I am more amazed by it, not that you should be amazed by it :) There have been several tickets opened like "Let Hive do theta join", and then sometimes, within days, Hive trunk supports it. That is pretty impressive to me.

As for

> I just don't think that selling Hive or Pig as SQL is fair to the buyer or
> the seller. They aren't the same and have very different virtues.

I agree. I think I qualified that as my opinion. However, I will say that I believe I have received some unsolicited [email] from Aster. Likewise, I notice Aster will reply to blogs about Hadoop and plug away, without really referencing the topic of the blog in any specific way. So I would argue the precedent is set. Maybe it is just my pet peeve.

The philosophical "What is SQL?" is hard to answer. Different SQL standards, like SQL-92 (http://en.wikipedia.org/wiki/SQL-92) or SQL-89, may be only partially supported by a particular vendor. Every implementation adds or subtracts features.

I often describe the Hive query language in this way:

"If you know SQL I can teach you Hive-QL rather quickly."
-
Re: HadoopDB and similar stuff
Ted Dunning 2009-09-15, 19:03
Great description.

On Tue, Sep 15, 2009 at 12:01 PM, Edward Capriolo <[EMAIL PROTECTED]> wrote:
> I often describe Hive query language in this way:
>
> "If you know SQL I can teach you Hive-QL rather quickly."

--
Ted Dunning, CTO
DeepDyve
-
Re: HadoopDB and similar stuff
Ted Dunning 2009-09-15, 16:19
On Tue, Sep 15, 2009 at 7:28 AM, Jeff Hammerbacher <[EMAIL PROTECTED]> wrote:
> ... I would like to correct any misperceptions which may exist in the community.
>
> 1) HiveQL intends to include SQL as a subset of its syntax: see the VLDB
> paper for more (http://www.slideshare.net/namit_jain/hive-demo-paper-at-vldb-2009).
> As it stands today, a reasonable subset of SQL is already supported, and most
> users of MySQL, Oracle, or PostgreSQL will be able to work comfortably in
> Hive today.

Note the key word "intends". That indicates future tense.

As you say, it is a reasonable subset. I don't know the [...] I am sure that there are wide swaths of SQL semantics that are not implemented. Transactions, rollback, fancy outer joins, exactly correct syntax for null, and row updates and deletions are areas where I would expect deficiencies relative to SQL. Conversely, I doubt that there are many Hive programs that could run without major alterations on conventional SQL engines.

The result is that HiveQL != SQL. It is more correct to say HiveQL =kinda SQL.

> 2) There's a patch for SQL support in Pig:
> http://issues.apache.org/jira/browse/PIG-824.

More future tense. This is hardly part of Pig at this point. I expect that this will come closer to SQL than the current HiveQL, but it is likely to lack key semantic properties due to the properties of the substrate, and also to have some important additions.

> Every database implements a different dialect of SQL (e.g. express a Top K
> query in your favorite database and compare to the rest), and the Pig and
> HiveQL dialects are as valid as any other.

This level of cultural relativism is a bit disingenuous. My point is that you are setting up unreasonable expectations. MR-based systems are inherently very different from traditional databases (which is, of course, the POINT of having MR). SQL is very strongly tied to the underlying row update and transactional semantics of traditional databases.

I am NOT saying that Hive and Pig are not useful. For many things, I prefer them to SQL-based systems. I am just saying that they are different animals.

I am also NOT saying that Hive and Pig aren't a good way for SQL-based programmers to transition to map-reduce. I am just saying that you should tell people that Hive and Pig are similar to SQL so you don't have their heads explode when they realize that it isn't really SQL.

Remember that many, many people claimed that MyISAM tables are not really SQL. Hive is a darned sight further from SQL than that.
-
Re: HadoopDB and similar stuff
CubicDesign 2009-09-15, 17:16
Ted Dunning wrote:
> I am just saying that you should tell people that Hive and Pig are similar to SQL so you don't have their heads explode when they realize that it isn't really SQL.

Probably for SOME people who are starting a new project (like me) this is less relevant. I don't need to convert existing code or a framework to make it work with Hive/Pig/HBase. I need to build a new one from scratch.
-
Re: HadoopDB and similar stuff
Amr Awadallah 2009-09-16, 00:43
Ted,

Just out of curiosity, did you use Asterdata or Greenplum before? Is their SQL 100% compliant with SQL92 (not to mention SQL2008)?

-- amr

Ted Dunning wrote:
> The result is that HiveQL != SQL. It is more correct to say HiveQL =kinda SQL.