HBase, mail # user - Stargate+hbase

Re: Stargate+hbase
Weishung Chung 2011-03-25, 17:38
+1 Thank you David for the great explanation. It's complicated.
I am pretty new to this BigData space, find it really interesting, and
always want to learn more about it.  I will definitely look into OpenTSDB as
suggested. Thanks again :D

On Fri, Mar 25, 2011 at 12:18 PM, Buttler, David <[EMAIL PROTECTED]> wrote:

> Hmmm.... maybe my mental model is deficient.  How do you propose building a
> secondary index without a transaction?
> The reason indexes work is that they store the data in a different way than
> the primary table.  That implies a second, independent data storage.
>  Without a transaction you can't be guaranteed that the second data
> structure is always updated in sync with the primary table.
> I suppose you could roll the multiple data writes into the initial data
> write -- that would work if you have write-once data.  But if you partially
> update the data then you have the issue that you may not have enough
> information in the update to correctly write the key for the secondary data
> stores.  This would mean (in general) that you would have to read an entire
> row before you update any part of it so that you can maintain the secondary
> structures.  Do you see the performance problem here? (or that you are
> introducing a limited transactional / eventually consistent state into the
> data store)
> There may be optimizations where you could skip that part of the code if
> there were no indexes.  But now you are talking about greatly increasing the
> complexity of the codebase for a use case that is somewhat specialized.
>  Hence, you see that people who really care about secondary indexes /
> transactional hbase have separate packages.  They probably don't do the job as
> well as is ideally possible by rolling the code into hbase proper, but on
> the other hand, neither do they increase the complexity of the main code
> branch (hence they don't slow down the core development work).
> I will stand by my point that there are engineering trade-offs to be made.
>  Take the unix philosophy: small components, loosely coupled. If you need
> indexes, build them on top of HBase, not inside of HBase.  Using things like
> co-processors allows you to extend the capabilities of HBase in a way that
> does not impact the core product and hurt all of the other users. An example
> of this is OpenTSDB.  It is a time-series database that uses hbase under the
> covers, but it doesn't ask that hbase support its needs in some special way.
>  It is very instructive to see how it was constructed.
> Dave
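
The read-before-write cost described above can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: `IndexedTable`, `put`, and `lookup` are hypothetical names, plain dictionaries stand in for the primary table and the index table, and nothing here is HBase API.

```python
class IndexedTable:
    """Toy model: a primary table plus a separately stored index on one column.

    Hypothetical illustration only; dictionaries stand in for two tables.
    """

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.primary = {}  # row_key -> {column: value}
        self.index = {}    # (indexed_value, row_key) -> None
        self.reads = 0     # counts the extra reads forced by index maintenance

    def put(self, row_key, columns):
        # A partial update may not contain the indexed column, so the old row
        # has to be read first to learn which index entry is now stale.
        self.reads += 1
        old_row = self.primary.get(row_key, {})
        old_value = old_row.get(self.indexed_column)

        new_row = {**old_row, **columns}
        new_value = new_row.get(self.indexed_column)

        # Write 1: the primary table.
        self.primary[row_key] = new_row

        # Writes 2 and 3: the index. These are separate steps, so without a
        # transaction a failure here would leave the index out of sync.
        if old_value != new_value:
            if old_value is not None:
                self.index.pop((old_value, row_key), None)
            if new_value is not None:
                self.index[(new_value, row_key)] = None

    def lookup(self, value):
        # Return the row keys whose indexed column currently equals value.
        return sorted(key for (v, key) in self.index if v == value)
```

Every `put` pays one read even when the indexed column is untouched, which is the performance concern raised in the message; the gap between the primary write and the index writes is where the "eventually consistent" state appears.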
> -----Original Message-----
> From: Weishung Chung [mailto:[EMAIL PROTECTED]]
> Sent: Friday, March 25, 2011 9:27 AM
> Subject: Re: Stargate+hbase
> Thank you so much for the informative info. It really helps me out.
> For a secondary index, even without transactions, I would think one could
> still build a secondary index on another key, especially if we have row level
> locking. Correct me if I am wrong.
> Also, I have read about the clustered B-Tree used in InnoDB to implement
> secondary indexes, but I know that the B-Tree is the primary limitation when it
> comes to scalability and the main reason why NoSQL systems have discarded it. But it
> would be super nice to be able to build the secondary index without using
> another secondary table in HBase.
> I am not complaining but I would love to see HBase continue to be the top
> NoSQL solution out there :D
> Way to go HBase !
> On Fri, Mar 25, 2011 at 10:39 AM, Buttler, David <[EMAIL PROTECTED]> wrote:
> > Do you know what it means to make secondary indexing a feature?  There are
> > two reasonable outcomes:
> > 1) adding ACID semantics (and thus killing scalability)
> > 2) allowing the secondary index to be out of date (leading to every naïve
> > user claiming that there is a serious bug that must be fixed).
> >
> > Secondary indexes are basically another way of storing (part of) the data.
> >  E.g. another table, sorted on the field(s) that you want to search on.
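
As a rough sketch of "another table, sorted on the field(s) that you want to search on", the index can be modeled as a sorted list of (field value, row key) pairs that supports range scans. The helper names `build_index` and `scan_index` are hypothetical, and the sketch assumes all indexed values are of one comparable type; it says nothing about how HBase itself stores data.

```python
import bisect


def build_index(rows, field):
    """Build the 'second table': (field_value, row_key) pairs in sorted order.

    Rows that lack the field are simply not indexed.
    """
    return sorted((cols[field], key) for key, cols in rows.items() if field in cols)


def scan_index(index, low, high):
    """Return row keys whose indexed value lies in the half-open range [low, high)."""
    # A 1-tuple (v,) sorts before every (v, row_key) pair, so bisect_left
    # lands on the first entry whose value is >= v.
    start = bisect.bisect_left(index, (low,))
    end = bisect.bisect_left(index, (high,))
    return [key for _, key in index[start:end]]
```

For example, `scan_index(build_index(rows, "age"), 25, 41)` would return the keys of rows with `age` from 25 up to but not including 41, in order of `age`, without touching the primary table.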