HBase >> mail # dev >> Scan performance on a big table as combination of multiple logic tables

Re: Scan performance on a big table as combination of multiple logic tables
You should be fine having multiple tables with high region counts.  I would
avoid making thousands of tables.  However, if you have three separate
business needs, make three different tables.

You seem to be starting from the assumption that there would be some kind of
issue with multiple tables.  Why do you think that is the case?  You said
"Otherwise, runtime tuning seems to add quite amount of operational cost."
I'm not sure what you are thinking here and where your thoughts are coming
from.  Additionally, if you have separate tables, then you can modify them
differently (e.g. setting them to different region sizes if it makes
sense-- for example, some of our tables have smaller region sizes so we'll
have more maps rather than fewer when we run map reduce jobs).
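
To make the region-size trade-off concrete, here is a minimal sketch in
Python. It assumes the default TableInputFormat behaviour of one map task
per region, and it treats table size divided by maximum region size as a
floor on the region count (real tables usually have more regions, because a
split leaves each daughter region only partly full). The function names and
the 1 TiB figure are illustrative, not taken from any real cluster.

```python
GIB = 1024 ** 3


def estimated_regions(table_bytes: int, max_region_bytes: int) -> int:
    """Floor on the region count for a table of the given size."""
    if max_region_bytes <= 0:
        raise ValueError("max_region_bytes must be positive")
    if table_bytes <= 0:
        return 1  # even an empty table has one region
    # Integer ceiling division, avoids floating-point rounding.
    return -(-table_bytes // max_region_bytes)


def estimated_map_tasks(table_bytes: int, max_region_bytes: int) -> int:
    """Assumption: one map task per region for a full-table MapReduce scan."""
    return estimated_regions(table_bytes, max_region_bytes)


if __name__ == "__main__":
    table = 1024 * GIB  # hypothetical 1 TiB table
    for region_gib in (1, 4, 10):
        tasks = estimated_map_tasks(table, region_gib * GIB)
        print(f"{region_gib:>2} GiB regions -> at least {tasks} map tasks")
```

Under these assumptions the same 1 TiB table yields at least 1024 map tasks
with 1 GiB regions but only 103 with 10 GiB regions, which is why a smaller
region size can make sense for a table that is mostly read by map reduce.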

Regarding region size: the HFile v1 format in 0.90 and below suffered from
taking a long time to transition as individual regions got too big.  With
0.92 and HFile v2 that isn't as much of a problem as I understand it.  If I
recall correctly, there are numerous organizations using 10 GB regions with
success (among others, I believe this is what Yahoo reported they were using
for their web crawl tables on their thousand node cluster).  While I
haven't run any stats, I believe that there is negligible scan performance
impact as region size grows.  There is definitely no exponential negative
performance impact.

On Fri, Feb 17, 2012 at 10:55 AM, Pan, Thomas <[EMAIL PROTECTED]> wrote:

> Vladimir and Jacques, Thanks for the information! Unless HBase handles
> multiple big tables (each with a relatively high region count) well in one
> cluster, it seems to me that one big table is the way to go. Otherwise,
> runtime tuning seems to add quite amount of operational cost. That leads
> to another question. Do we see big region size as an issue? If so, what's
> the pivot point beyond which, as region size grows further, scan
> performance starts to degrade exponentially?
> On 2/15/12 4:11 PM, "Vladimir Rodionov" <[EMAIL PROTECTED]> wrote:
> >10 tables are fine. 1000 are not, especially when one does table
> >pre-splitting to increase write perf.
> >
> >Too many regions kill HBase.
> >
> >Best regards,
> >Vladimir Rodionov
> >Principal Platform Engineer
> >Carrier IQ, www.carrieriq.com
> >e-mail: [EMAIL PROTECTED]
> >
> >________________________________________
> >From: Jacques [[EMAIL PROTECTED]]
> >Sent: Wednesday, February 15, 2012 3:45 PM
> >Subject: Re: Scan performance on a big table as combination of multiple
> >logic tables
> >
> >Out of curiosity,  what do you perceive as the benefit to having only one
> >table?  Are there reasons that you think one table would perform better
> >than a few?
> >
> >If you're splitting data within a table because you'd otherwise have
> >millions of tables, I understand that and would concur with Vladimir's
> >approach below.  However, if you're really looking at 10 tables versus one
> >table, it seems like HBase is built exactly to make that work well (rather
> >than having to make all sorts of application level code to do what HBase
> >already does).
> >
> >thanks,
> >Jacques
> >
> >On Wed, Feb 15, 2012 at 1:57 PM, Pan, Thomas <[EMAIL PROTECTED]> wrote:
> >
> >>
> >> Since Hbase is tailored to handle one table very well, we are thinking to
> >> put multiple tables into one big table but on different column family sets.
> >> Our use case is full table scan against single column value filters. As
> >> records from different "logical tables" are at different column families,
> >> could we speed up the scan performance by simply checking the column family
> >> referenced by these single column value filters first before really going
> >> through all the underlying K-V pairs? It would be great if the Hbase code
> >> is already coded that way.
> >>
> >>
> >> $0.02,
> >> Thomas
> >>
> >>
> >