HBase >> mail # dev >> HBase Developer's Pow-wow.


Re: HBase Developer's Pow-wow.
See below

On Mon, Sep 10, 2012 at 10:51 AM, Ted Yu <[EMAIL PROTECTED]> wrote:

> Jacques:
> Thanks for sharing.
>
> bq. row-level sharding as opposed to term
>
> Please elaborate on the above a little more: what is term sharding ?
>

If an index is basically a value (or term) pointing back to a row, there
are two main ways you can slice up the data to scale it.  Let's say you
have ten nodes and you want to index a column that stores values between 1
and 100.  This column's values are likely distributed throughout all the
regions.  The two options would look like:

Option 1 (term sharding): Each node/region holds all pointers for a single
range of values.  E.g. node A holds 1-10, B holds 11-20, C holds 21-30, etc.
(A variation of this is hashing the values to avoid distribution problems.)
The strength of this approach is that if you know you only want values 1-5,
you don't have to have all the nodes evaluate their index.  The downsides
are: you need some kind of cross-node/region data approach, and consistency
is hard.  You also have problems as your data scales: a large index can
take a while to iterate through, and that work bottlenecks on a single
machine.

Option 2 (row sharding): Each node/region holds all pointers for the
rows that are on that node.  In this case, you have to consult all the
nodes to get all the values.  This is more complicated at query time, but
it scales without limit and has simpler consistency problems.
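The routing difference between the two options can be sketched in a few lines of plain Java (this is illustrative only, not HBase code; the ten-node, values-1-to-100 setup is the example from above):

```java
import java.util.ArrayList;
import java.util.List;

public class ShardingSketch {
    static final int NODES = 10;

    // Term sharding: values 1-100 are range-partitioned across the ten
    // nodes, so a lookup for a given value touches exactly one node.
    static int termShardFor(int value) {
        return (value - 1) / 10;   // node 0 holds 1-10, node 1 holds 11-20, ...
    }

    // Row sharding: each node indexes only its own rows, so a value
    // lookup has to consult every node.
    static List<Integer> rowShardsFor(int value) {
        List<Integer> all = new ArrayList<>();
        for (int n = 0; n < NODES; n++) {
            all.add(n);
        }
        return all;
    }
}
```

A query for values 1-5 under term sharding resolves to node 0 alone, while under row sharding the same query fans out to all ten nodes and the results are merged afterwards, which is the query-time cost traded for the simpler consistency story.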
>
> bq. for what I will call a "local shadow family"
>
> I like this idea. A user may request more than one index. Currently HBase
> is not so good at serving a high number of families, so we may need to
> watch out.
>
Yeah.  A simple approach could utilize two families, one in-memory and one
not.  No reason a family can't hold multiple indexes; we'd just need to get
a little more clever about how we use things like qualifiers.  It also
makes index dropping more convoluted.
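One hypothetical way to pack multiple indexes into a single shadow family, as a sketch (the encoding, separator byte, and names here are assumptions, not anything from the thread): prefix the qualifier with the index name, put the indexed value after it, and store the original row key as the cell value.

```java
import java.nio.charset.StandardCharsets;

public class IndexQualifier {
    // Separator byte assumed to appear in neither index names nor values.
    static final byte SEP = 0x00;

    // Qualifier layout: <index name> 0x00 <indexed value>
    static byte[] encode(String indexName, String indexedValue) {
        byte[] n = indexName.getBytes(StandardCharsets.UTF_8);
        byte[] v = indexedValue.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[n.length + 1 + v.length];
        System.arraycopy(n, 0, out, 0, n.length);
        out[n.length] = SEP;
        System.arraycopy(v, 0, out, n.length + 1, v.length);
        return out;
    }

    // Split the qualifier back into {index name, indexed value}.
    static String[] decode(byte[] qualifier) {
        int sep = 0;
        while (qualifier[sep] != SEP) sep++;
        return new String[] {
            new String(qualifier, 0, sep, StandardCharsets.UTF_8),
            new String(qualifier, sep + 1, qualifier.length - sep - 1,
                       StandardCharsets.UTF_8)
        };
    }
}
```

This also shows why dropping one index becomes convoluted: instead of simply dropping a family, you'd have to scan the shadow family and delete every cell whose qualifier carries that index's prefix.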

> bq. GroupingScanner (a Scanner that does intersection and/or union of
> scanners for multi criteria queries)
>
> Do you think the following enhancement is related to your proposal above ?
> HBASE-5416 Improve performance of scans with some kind of filters
>

At first glance, I don't think this is really related.  A grouping scanner
would be used to take the secondary-index scanners and merge them into a
single filter scanner to be used when the primary scan is done.
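The intersection half of that merge step can be sketched as a plain sorted-merge over the row keys the index scanners emit (a minimal sketch, assuming both scanners yield row keys in sorted order, as index scans would; the union case is analogous):

```java
import java.util.ArrayList;
import java.util.List;

public class GroupingScannerSketch {
    // Intersect two sorted streams of row keys from secondary-index
    // scans; the result feeds the primary scan as a filter.
    static List<String> intersect(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) {            // row matches both criteria
                out.add(a.get(i));
                i++;
                j++;
            } else if (cmp < 0) {      // advance whichever stream is behind
                i++;
            } else {
                j++;
            }
        }
        return out;
    }
}
```

Each input is consumed once, so the merge is linear in the size of the index scan results regardless of how large the primary table is.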
>
> bq. and then "patch" the values to block locations as we
> flushed the primary family that we were indexing (ugh).
>
> Yeah. We also need to consider the effect of compaction.
>
Yeah... painful...
>
> bq. my intuition is it would be fine as long as we didn't fix HBASE-3149
>
> I was actually expecting someone to pick up the work of HBASE-3149 :-)
>

:P