HBase, mail # dev - A general question on maxVersion handling when we have Secondary index tables


Re: A general question on maxVersion handling when we have Secondary index tables
Jonathan Hsieh 2012-08-29, 18:18
We should have an HBase dev meeting and bring this up as one of the topics
of discussion.  I'll start another thread on that.

On Wed, Aug 29, 2012 at 10:03 AM, Jesse Yates <[EMAIL PROTECTED]> wrote:

> Client library style stuff is _nice_ but one of the things everyone asks of
> a database is that we provide an index (cassandra has it, riak has it, mysql
> has it... hbase doesn't? Yes, different systems, etc., etc., but the point is
> we could do it). Further, if we build it as a part of hbase, we can make it
> faster... though don't ask me the _how_ on that yet ;)
>
> My main concern is that there are many possible ways to have
an implementation that is good for one usecase / workload but will
be absolutely terrible for others.
> From talking with Lars: providing a lot of the indexing infrastructure, but
> leaving the actual indexing (convert row|cf|cq|ts|value to an index value
> and vice-versa) to a client library, gives us a lot of the flexibility that
> people would need. And I take it that most people already have some form of
> indexing (be it consistent or not), so we can do it 'the right way'
> in terms of queries, etc. and provide pluggable infrastructure (with a
> decent default) so people can roll in their own implementations.
>
> That said, I think we can do secondary indexing without too many changes to
> HBase (region co-location/pinning that Ted suggests would just be sweet
> overall), arguing for a client library. However, if we decide this is one of
> the things we want to support going forward as a project, then it makes
> more sense to do it as part of HBase, rather than pointing people to some
> guy/gal's website with the information (which may or may not be up to date)
> for how to munge indexing in. Instead, it would be so much nicer to just flip
> a couple switches, maybe plug in a couple of classes and have indexing
> _just work_.
>
> Isn't that the rationale for coprocessors?  (just add something to a
config, start hbase?)

Also, with secondary indices, we'll potentially be adding new user-exposed
APIs.  I think these should be definable in a way that can work across
many algorithms.  We should figure out what they are so that, when there are
different implementations, users can pick and choose between the
implementations that are good for them.
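
To make the idea of an algorithm-independent API concrete, here is a minimal sketch in Python of the kind of pluggable conversion Jesse describes (row|cf|cq|ts|value to an index value and back). It is not HBase code: `Cell`, `IndexCodec`, `ValueIndexCodec`, and the `\x00` separator are all hypothetical names chosen for illustration, and the "default" shown is only one of many possible encodings.

```python
from abc import ABC, abstractmethod
from typing import NamedTuple

SEPARATOR = b"\x00"  # hypothetical delimiter between indexed value and row key


class Cell(NamedTuple):
    """One stored cell: row | cf | cq | ts | value."""
    row: bytes
    family: bytes
    qualifier: bytes
    timestamp: int
    value: bytes


class IndexCodec(ABC):
    """Hypothetical user-facing contract; any index algorithm could sit behind it."""

    @abstractmethod
    def to_index_key(self, cell: Cell) -> bytes:
        """Map a primary-table cell to a row key in the index table."""

    @abstractmethod
    def to_primary_row(self, index_key: bytes) -> bytes:
        """Recover the primary-table row key from an index row key."""


class ValueIndexCodec(IndexCodec):
    """A possible default: index key = value + separator + primary row."""

    def to_index_key(self, cell: Cell) -> bytes:
        if SEPARATOR in cell.value:
            raise ValueError("indexed value must not contain the separator")
        return cell.value + SEPARATOR + cell.row

    def to_primary_row(self, index_key: bytes) -> bytes:
        # Split at the first separator; the row key may itself contain one.
        _value, sep, row = index_key.partition(SEPARATOR)
        if not sep:
            raise ValueError("not a key produced by this codec")
        return row
```

With a contract like this, the infrastructure only ever calls the two methods, so a user could swap in a different codec without the rest of the indexing path changing.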
> Just my $0.02
>
> -Jesse
> -------------------
> Jesse Yates
> @jesse_yates
> jyates.github.com
>
>
> On Wed, Aug 29, 2012 at 9:19 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
>
> > For the example of a secondary index based on the state portion of the
> > address, I wonder if we can achieve comparable performance using a scan
> > with a proper filter.
> >
> > Cheers
> >
> > On Wed, Aug 29, 2012 at 9:11 AM, Jonathan Hsieh <[EMAIL PROTECTED]> wrote:
> >
> > > Ted,
> > >
> > > Ram summarizes the concern succinctly -- to answer the specific question,
> > > it isn't for versions -- it is for the case where a secondary index can
> > > point to many many primary rows.  (Let's say we have a rowkey userid and
> > > we want to have a 2ndary index based on the state portion of their
> > > address --- we'll end up pointing to many many primary rows.)
> > >
> > > Jon.
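
The fan-out concern above, and the scan-with-filter alternative Ted raises, can be illustrated with a toy in-memory model in Python. Nothing here is HBase code: the four regions, the server names `rs1`..`rs4`, the 100 users, and the round-robin assignment of states are all assumptions made up for the illustration.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical layout: primary table keyed by userid, split into 4 regions,
# each hosted on its own region server.
REGION_START_KEYS = ["", "user25", "user50", "user75"]  # sorted start keys
REGION_SERVERS = ["rs1", "rs2", "rs3", "rs4"]
STATES = ["CA", "NY", "TX", "WA"]

# Primary table: userid -> state portion of the address (made-up data).
primary = {f"user{i:02d}": STATES[i % len(STATES)] for i in range(100)}

# Secondary index: state -> list of primary row keys.
index = defaultdict(list)
for row, state in primary.items():
    index[state].append(row)


def server_for(row):
    """Find the server hosting the region whose key range contains row."""
    return REGION_SERVERS[bisect_right(REGION_START_KEYS, row) - 1]


def servers_for_index_lookup(state):
    """Servers that must be contacted to fetch every row the index points to."""
    return {server_for(row) for row in index[state]}


def scan_with_filter(state):
    """Full scan that filters on state; returns (matches, rows examined)."""
    examined = 0
    matches = []
    for row in sorted(primary):
        examined += 1
        if primary[row] == state:
            matches.append(row)
    return matches, examined
```

In this model one index entry such as "CA" points to 25 primary rows spread over all four servers, so the lookup still touches every server, while the filtered scan returns the same rows but examines all 100. Which approach wins in a real cluster depends on selectivity and data layout, which this sketch does not attempt to measure.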
> > >
> > >
> > >
> > > On Wed, Aug 29, 2012 at 7:15 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
> > >
> > > > Thanks for the detailed response, Jon.
> > > >
> > > > bq. it would mean that a query based on secondary index would
> > > > potentially have to hit every region server that has a region in the
> > > > primary table.
> > > >
> > > > Can you elaborate on the above a little bit ?
> > > > Is this because secondary index would point us to more than one
> > > > region in the data table because several versions are saved for the
> > > > same row ?
> > > >
> > > > My thinking was to ease management of simultaneous (data and index)
> > > > region split through region colocation.
> > > >
> > > > Cheers
> > > >
> > > > On Wed, Aug 29, 2012 at 6:47 AM, Jonathan Hsieh <[EMAIL PROTECTED]> wrote:
> > > >
> > >
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [EMAIL PROTECTED]