HBase >> mail # dev >> HBase Developer's Pow-wow.

Re: HBase Developer's Pow-wow.
One sparse use case for us is rate limit detection.  We store user events
in an Event table whose primary key is a unique timestamp (sharded to avoid
hotspotting) and which has eventType and ipAddress columns.  We manually
keep a separate table (the index, also sharded) called EventByDateIpType
with row format [year/month/date/ipAddress/eventType/eventId].  Background
jobs are constantly scanning the index to count combinations of
ipAddress+eventType to hunt down the people that are doing things like
adding spam to the site.  Then we might dig up all the events for a suspect
ipAddress, where the absolute busiest ipAddress might account for 0.1% of
the events in a day, so pretty sparse.  A per-table index is a must-have here.
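
To make the index layout concrete, here is a minimal sketch in plain Python (not the HBase client API) of building an EventByDateIpType-style row key and counting ipAddress+eventType combinations for one day. The function names, the "/" separator, and the sample data are assumptions for illustration; the shard prefix is left out, and components are assumed not to contain "/".

```python
from collections import Counter

SEP = "/"  # assumed separator; the original only gives the component order


def index_row_key(year, month, day, ip_address, event_type, event_id):
    """Build a key shaped like year/month/date/ipAddress/eventType/eventId."""
    return SEP.join(
        [f"{year:04d}", f"{month:02d}", f"{day:02d}", ip_address, event_type, str(event_id)]
    )


def count_ip_type(index_keys, year, month, day):
    """Count (ipAddress, eventType) pairs among the index keys for one day."""
    prefix = SEP.join([f"{year:04d}", f"{month:02d}", f"{day:02d}"]) + SEP
    counts = Counter()
    for key in sorted(index_keys):  # sorted to mimic scan order
        if key.startswith(prefix):
            _, _, _, ip, event_type, _ = key.split(SEP)
            counts[(ip, event_type)] += 1
    return counts


# Hypothetical sample index rows.
keys = [
    index_row_key(2012, 9, 10, "10.0.0.1", "comment", 101),
    index_row_key(2012, 9, 10, "10.0.0.1", "comment", 102),
    index_row_key(2012, 9, 10, "10.0.0.2", "login", 103),
    index_row_key(2012, 9, 11, "10.0.0.1", "comment", 104),
]
suspects = count_ip_type(keys, 2012, 9, 10)
```

Because ipAddress comes right after the date, all index rows for one suspect address on a given day are contiguous, which is what makes the follow-up lookup cheap.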

For this same Event table, there are also dense indexes like
EventByDateType whose row key is [year/month/date/eventType/eventId].
There are only about 200 eventTypes.  If we have 1 million of a certain
eventType on a given day where we need to access the primary rows, we do a
scan on the EventByDateType index table and pull the rows out of the Event
table in batches.  One nice aspect of this is that we are getting the rows
in globally sorted order.  Either per-table or per-region indexes would
work here, but I guess I'm failing to see the read-time benefit of the
per-region index.
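
The scan-then-batch pattern described above could be sketched roughly like this, using an in-memory sorted list as a stand-in for the EventByDateType index and a dict as a stand-in for the Event table. All names, the batch size, and the sample rows are hypothetical; a real client would issue a range scan and multi-gets instead.

```python
from bisect import bisect_left


def scan_prefix(sorted_keys, prefix):
    """Yield keys that start with prefix, in sorted order (stand-in for a range scan)."""
    i = bisect_left(sorted_keys, prefix)
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        yield sorted_keys[i]
        i += 1


def fetch_events(index_keys, event_table, prefix, batch_size=2):
    """Scan the index and pull primary rows in batches, preserving index order."""
    batch = []
    for key in scan_prefix(index_keys, prefix):
        batch.append(key.rsplit("/", 1)[1])  # trailing component is the eventId
        if len(batch) == batch_size:
            yield [event_table[event_id] for event_id in batch]  # stand-in for a multi-get
            batch = []
    if batch:  # flush the final partial batch
        yield [event_table[event_id] for event_id in batch]


# Hypothetical Event rows keyed by eventId, plus the matching index rows.
event_table = {
    "101": {"eventType": "comment", "ipAddress": "10.0.0.1"},
    "102": {"eventType": "comment", "ipAddress": "10.0.0.2"},
    "103": {"eventType": "comment", "ipAddress": "10.0.0.1"},
    "104": {"eventType": "login", "ipAddress": "10.0.0.3"},
}
index = sorted(
    [
        "2012/09/10/comment/101",
        "2012/09/10/comment/102",
        "2012/09/10/comment/103",
        "2012/09/10/login/104",
    ]
)
batches = list(fetch_events(index, event_table, "2012/09/10/comment/"))
```

The rows come back in index order because the index scan drives the reads; the sketch assumes eventIds are fixed-width so that string order matches numeric order.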

Seems like there are 3 categories of sparseness:
1) sparse indexes (like ipAddress) where a per-table approach is more
efficient for reads
2) dense indexes (like eventType) where there are likely values of every
index key on each region
3) very dense indexes (like male/female) where you should just be doing a
table scan anyway
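
As a rough way to reason about these categories, here is a toy model (my own simplification, not a measurement) that counts how many regions a single indexed read has to contact. It assumes matching rows are spread uniformly at random across regions, that a per-region index forces a visit to every region, and that a per-table index costs one index-region scan plus the regions holding the matching rows.

```python
def expected_regions_touched(num_regions, matching_rows):
    """Expected number of distinct regions holding at least one of the matching rows."""
    return num_regions * (1 - (1 - 1 / num_regions) ** matching_rows)


def per_table_read_cost(num_regions, matching_rows, index_regions_scanned=1):
    """Regions contacted with a per-table index: the index scan plus the primary rows."""
    return index_regions_scanned + expected_regions_touched(num_regions, matching_rows)


def per_region_read_cost(num_regions):
    """Regions contacted with a per-region index: every region must be asked."""
    return num_regions


# Hypothetical table with 1000 regions.
sparse = per_table_read_cost(1000, 10)        # e.g. one ipAddress
dense = per_table_read_cost(1000, 1_000_000)  # e.g. one eventType
everywhere = per_region_read_cost(1000)
```

Under these assumptions the sparse lookup touches only a handful of regions with a per-table index, while the dense lookup ends up touching essentially all of them either way, which is why the difference matters most for category 1.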

Jacques, you say "If we're talking about a gender column on a user profile
table, you really want that to be spread out among all regions".  Can you
expand on that more?  I guess I don't understand your read pattern.  If you
have 5 million of each gender, you are probably not doing a single select of
all males.  You will probably have to iterate through them in small batches.
Why is the per-region approach more beneficial than the per-table?  Is it
because it's easier to plug into HBase's existing per-region MapReduce
splitter?  If so, could you just as easily feed the separate per-table index
into MapReduce?

Thanks for starting the important discussion.
On Mon, Sep 10, 2012 at 4:40 PM, Jacques <[EMAIL PROTECTED]> wrote:

> > All of my use-cases would require per-table indexes.  Per-region is easier
> > to keep consistent at write-time, but it seems useless to me for the large
> > tables that HBase is designed for (because you have to hit every region for
> > each read).
>
> Can you expound on use cases?  The pros and cons are heavily dependent on
> the sparseness of the indexed values and the particular use case.  If we're
> talking about a gender column on a user profile table, you really want that
> to be spread out among all regions.  If we're talking about an email
> address... not so much.