On Thu, Feb 2, 2012 at 4:47 PM, Bryan Beaudreault
<[EMAIL PROTECTED]> wrote:
> I'd love to hear from an expert on the pros and cons of big tables vs many
> tables, when access patterns and simplicity are not a concern. I
> haven't found much information regarding it, but I'd imagine the only
> benefit to many tables is the ability to configure each differently if that
> is helpful for the use case.

HBase doesn't offer many per-table configuration knobs. Most tables I
come across have the same configuration: a single column family, LZO
compression, some form of Bloom filter, and maybe VERSIONS => 1.

If you need different configs, you can also consider using multiple
column families in a single table, since settings such as compression,
Bloom filters and VERSIONS are set per column family anyway.

If you have somewhat related data and you're on the fence about
whether to store everything in a single table, I generally recommend
sticking to a single table. From an operational standpoint, it's easier
to manage one table per application than several. You also generally
end up with fewer, bigger regions, which is almost always better. This
means your RegionServers write more data to fewer WALs, which leads to
more sequential writes across the board. You'll end up with fewer
HLogs, which is also a

As others said, with a single table design, you can control data
locality, but as soon as you write to and read from multiple tables,
all bets are off.
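
One way to read the locality point: within a single table, rows are kept
sorted by key, so a shared key prefix keeps related rows next to each
other. Below is a rough sketch in plain Java that uses a TreeMap as a
stand-in for a sorted table, so it runs without HBase. The key scheme
(entityId/recordType/suffix), the class name and the sample rows are all
made up for illustration; they are not from this thread or from the
HBase API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class SingleTableLayout {
    // Stand-in for one table: rows kept sorted by key. HBase sorts keys
    // byte-wise; for the ASCII keys used here String order is equivalent.
    static final TreeMap<String, String> table = new TreeMap<>();

    // Hypothetical key scheme: <entityId>/<recordType>/<suffix>
    static String rowKey(String entityId, String recordType, String suffix) {
        return entityId + "/" + recordType + "/" + suffix;
    }

    // Returns every row key that starts with the prefix, in sorted order.
    static List<String> scanPrefix(String prefix) {
        // '\uffff' sorts after any character used in these keys.
        return new ArrayList<>(table.subMap(prefix, prefix + '\uffff').keySet());
    }

    static {
        // Made-up sample rows of different record types in the same table.
        table.put(rowKey("user42", "profile", "-"), "name=...");
        table.put(rowKey("user42", "event", "0001"), "login");
        table.put(rowKey("user42", "event", "0002"), "click");
        table.put(rowKey("user7", "profile", "-"), "name=...");
    }

    public static void main(String[] args) {
        // All of user42's rows come back from one contiguous key range.
        System.out.println(scanPrefix("user42/"));
        // prints [user42/event/0001, user42/event/0002, user42/profile/-]
    }
}
```

With separate tables for profiles and events, the same read would touch
two unrelated key ranges, and nothing ties where they live to each other.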

If you use HBase's own client (most likely the case, since the only
alternative is asynchbase), beware that you need to create one HTable
instance per table per thread in your application code, because HTable
is not thread-safe. If you build an application with many tables, this
rapidly becomes unwieldy.
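
To put a number on "unwieldy", here is a minimal sketch of the
one-handle-per-table-per-thread pattern in plain Java. TableHandle is a
made-up stand-in for HTable so the example runs without HBase on the
classpath, and the table names and thread counts are arbitrary. It only
shows how the number of handles grows as threads times tables; it is not
meant as the way to wire up the real client.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

class PerThreadTables {
    // Hypothetical stand-in for a non-thread-safe table handle such as HTable.
    static final class TableHandle {
        final String tableName;
        TableHandle(String tableName) { this.tableName = tableName; }
    }

    // Counts every handle created, across all threads.
    static final AtomicInteger created = new AtomicInteger();

    // Each thread gets its own private map of table name -> handle.
    private static final ThreadLocal<Map<String, TableHandle>> HANDLES =
        ThreadLocal.withInitial(HashMap::new);

    // Returns the calling thread's handle for the table, creating it once.
    static TableHandle handleFor(String tableName) {
        return HANDLES.get().computeIfAbsent(tableName, name -> {
            created.incrementAndGet();
            return new TableHandle(name);
        });
    }

    // Starts `threads` workers that each touch every table once and
    // returns how many handles were created in total.
    static int simulate(int threads, String... tables) {
        created.set(0);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (String t : tables) handleFor(t);
            });
            workers[i].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return created.get();
    }

    public static void main(String[] args) {
        System.out.println(simulate(8, "users", "events", "sessions")); // prints 24
        System.out.println(simulate(8, "app"));                         // prints 8
    }
}
```

With eight worker threads, three tables cost 24 handles to create and
keep track of, while a single table costs eight.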
If you use asynchbase you don't have this problem because it uses a
single HBaseClient object for your entire cluster, and it's

Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com