HBase, mail # user - Schema design question - Hot Key concerns


Suraj Varma 2011-11-18, 17:33
I have an HBase schema design question that I wanted to discuss with the list.

Let's say we have a "wide" table design: a table with one
column family containing, say, "show bookings".

RowKey: SHOW_ID
Columns: SEATS_AVAILABLE, BOOKING_<#1>, BOOKING_<#2>, BOOKING_<#3>, etc
Values: <remaining available seats>, <seats booked>, <seats booked>,
<seats booked>, etc

Each "SHOW_ID" will have a variable number of columns.

Usage Pattern:
1) Multiple clients / threads are constantly
creating/updating/deleting "bookings", and this results in a column
being added to, updated in, or deleted from the row.
2) The SEATS_AVAILABLE column needs to be atomically updated whenever
a corresponding BOOKING_<#> is added, updated or deleted.
3) Clients update their own unique BOOKING columns (i.e. each client's
BOOKING_<#> columns are mutually exclusive with every other client's).
4) Clients can concurrently update the SEATS_AVAILABLE column.
5) Some SHOW_IDs will be hit harder than others.
6) A TTL on the BOOKING columns will be set to expire them after some set time.
7) We want to leverage the row-level atomicity that HBase
provides to update the related columns together.
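
To make the intended semantics concrete, here is a minimal toy model in Python. It is not HBase client code: the ShowRow class, its method names, and the rule that a booking is rejected when too few seats remain are all assumptions made up for illustration. It only shows a BOOKING_<#> column and SEATS_AVAILABLE changing together under one per-row lock.

```python
import threading


class ShowRow:
    """Toy in-memory stand-in for one wide row keyed by SHOW_ID (hypothetical, not HBase)."""

    def __init__(self, seats_available):
        self._lock = threading.Lock()  # models a single per-row lock
        self.columns = {"SEATS_AVAILABLE": seats_available}

    def put_booking(self, booking_id, seats):
        """Create or update BOOKING_<id> and adjust SEATS_AVAILABLE as one unit."""
        col = f"BOOKING_{booking_id}"
        with self._lock:
            # Only the change relative to any existing booking affects availability.
            delta = seats - self.columns.get(col, 0)
            if delta > self.columns["SEATS_AVAILABLE"]:
                return False  # assumed rule: reject if not enough seats remain
            self.columns[col] = seats
            self.columns["SEATS_AVAILABLE"] -= delta
            return True

    def delete_booking(self, booking_id):
        """Remove BOOKING_<id> (if present) and return its seats to the pool."""
        col = f"BOOKING_{booking_id}"
        with self._lock:
            self.columns["SEATS_AVAILABLE"] += self.columns.pop(col, 0)


# 20 clients each write their own BOOKING_<#> column for 2 seats.
row = ShowRow(seats_available=100)
threads = [threading.Thread(target=row.put_booking, args=(i, 2)) for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
row.delete_booking(3)  # one booking is cancelled afterwards
```

In this model every writer to the same SHOW_ID serializes on the one lock, even though the BOOKING columns themselves never collide. That serialization is the hot-row concern raised in the questions below.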

So - we are visualizing this as sort of an "equalizer" graphic on a
stereo where each row is constantly varying in terms of columns added
& removed. The SEATS_AVAILABLE value goes up & down correspondingly.

Questions / Notes:
1) Could this lead to a hot key / hot row scenario? The columns being
updated are mutually exclusive except for the SEATS_AVAILABLE. Or
would this be very low overhead given that only one column is really
being "updated" by multiple client threads?

2) The alternative we had explored was a tall table where each BOOKING
is a separate row (SHOW_ID-BOOKING-<#> would be the key) ... however,
in this case, we won't be able to atomically update the
SEATS_AVAILABLE column at the same time.
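
As a rough sketch of the tall layout (plain Python, not HBase code): the zero-padded booking number, the example show IDs, and the helper names booking_key and scan_prefix are assumptions for illustration only. Because row keys sort lexicographically, all bookings for one show sit next to each other and can be read with a prefix scan, but each booking is its own row, separate from wherever SEATS_AVAILABLE is kept.

```python
from bisect import bisect_left


def booking_key(show_id, booking_id):
    """Build a tall-table row key of the form SHOW_ID-BOOKING-<#> (padding is assumed)."""
    return f"{show_id}-BOOKING-{booking_id:06d}"


def scan_prefix(sorted_keys, prefix):
    """Return all keys starting with prefix, relying on the keys being sorted."""
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order means no later key can match
        matches.append(key)
    return matches


# Rows are kept sorted by key, as in a sorted key-value store.
keys = sorted([
    booking_key("SHOW7", 1),
    booking_key("SHOW42", 2),
    booking_key("SHOW42", 1),
    "SHOW42",  # hypothetical separate row that would hold SEATS_AVAILABLE
])

show42_bookings = scan_prefix(keys, "SHOW42-BOOKING-")
```

Writing a booking here touches the booking row and the row holding SEATS_AVAILABLE, i.e. two different rows, which is why a single row-level atomic update no longer covers both.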

3) In terms of "row locking", what is the granularity? i.e. when is
the row-level lock engaged to make the update atomic: are the column
updates made on the side and "swapped" in under the row-level lock, or
is the row-level lock held for the full duration of the update?

4) I think the concern is whether this design is scalable as the
number of clients keeps increasing over time ...

5) Any other suggestions on how the hot row key scenario (if real) can be
sidestepped?

Thanks,
--Suraj