HBase, mail # user - column count guidelines


Re: column count guidelines
Marcos Ortiz 2013-02-08, 05:38
My recommendation is to stay current with the latest HBase release and to
wait for 0.96, which has a lot of improvements in almost every area. I
talked about this in a blog post. [1]

I think Coprocessors can be very helpful in your use case. In Lars
George's "HBase: The Definitive Guide", Chapter 4 explains how to use
Counters and Coprocessors; you should read it.
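
To give a flavor of the Counters API, here is a minimal sketch against the
0.92-era Java client; the table, family, and qualifier names are made up
for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CounterSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "mytable"); // hypothetical table
            // Atomically bump a counter cell on the region server; no
            // client-side read-modify-write round trip is needed.
            long n = table.incrementColumnValue(Bytes.toBytes("row1"),
                    Bytes.toBytes("cf"), Bytes.toBytes("hits"), 1L);
            System.out.println("counter = " + n);
            table.close();
        }
    }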

A great introduction to Coprocessors was posted on the HBase blog, [2] and
a great example of HBase performance tuning, including the use of
Coprocessors, was posted by Hari Kumar of Ericsson Research on its Data and
Knowledge blog. [3]
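
The shape of an observer coprocessor from that introduction is roughly as
follows; this is a sketch against the 0.92/0.94 coprocessor API, and the
class name is hypothetical:

    import java.io.IOException;
    import java.util.List;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;

    // Runs on the region server; hooks fire around client operations.
    public class ExampleObserver extends BaseRegionObserver {
        @Override
        public void preGet(ObserverContext<RegionCoprocessorEnvironment> ctx,
                Get get, List<KeyValue> results) throws IOException {
            // Called before each Get is executed; a real observer could
            // audit the request or pre-populate 'results' here.
        }
    }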

Best wishes

[1] http://marcosluis2186.posterous.com/some-upcoming-features-in-hbase-096
[2] https://blogs.apache.org/hbase/entry/coprocessor_introduction
[3] http://labs.ericsson.com/blog/hbase-performance-tuners

On 02/07/2013 11:34 PM, Michael Ellery wrote:
> thanks for reminding me of the HBase version in CDH4 - that's something we'll definitely take into consideration.
>
> -Mike
>
> On Feb 7, 2013, at 5:09 PM, Ted Yu wrote:
>
>> Thanks Michael for this information.
>>
>> FYI CDH4 (as of now) is based on HBase 0.92.x which doesn't have the two
>> features I cited below.
>>
>> On Thu, Feb 7, 2013 at 5:02 PM, Michael Ellery <[EMAIL PROTECTED]> wrote:
>>
>>> There is only one CF in this schema.
>>>
>>> Yes, we are looking at upgrading to CDH4, but it is not trivial since we
>>> cannot have cluster downtime. Our current upgrade plan involves additional
>>> hardware with side-by-side clusters until everything is exported/imported.
>>>
>>> Thanks,
>>> Mike
>>>
>>> On Feb 7, 2013, at 4:34 PM, Ted Yu wrote:
>>>
>>>> How many column families are involved?
>>>>
>>>> Have you considered upgrading to 0.94.4 where you would be able to
>>>> benefit from lazy seek, Data Block Encoding, etc.?
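
As an aside, Data Block Encoding in 0.94 is set per column family; a
minimal sketch of turning it on (table and family names are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;

    public class EnableEncodingSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            // FAST_DIFF compresses repeated key prefixes, which helps when
            // rows carry many long qualifiers with empty values.
            HColumnDescriptor hcd = new HColumnDescriptor("cf");
            hcd.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);
            admin.disableTable("mytable");
            admin.modifyColumn("mytable", hcd);
            admin.enableTable("mytable");
            admin.close();
        }
    }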
>>>>
>>>> Thanks
>>>>
>>>> On Thu, Feb 7, 2013 at 3:47 PM, Michael Ellery <[EMAIL PROTECTED]> wrote:
>>>>> I'm looking for some advice about per-row CQ (column qualifier) count
>>>>> guidelines. Our current schema design means we have a HIGHLY variable CQ
>>>>> count per row -- some rows have one or two CQs and some rows have upwards
>>>>> of 1 million. Each CQ is on the order of 100 bytes (for round numbers) and
>>>>> the cell values are null. We see highly variable and too often
>>>>> unacceptable read performance using this schema. I don't know for a fact
>>>>> that the CQ count variability is the source of our problems, but I am
>>>>> suspicious.
>>>>>
>>>>> I'm curious about others' experience with CQ counts per row -- are there
>>>>> some best practices/guidelines about how to optimally size the number of
>>>>> CQs per row? The other obvious solution will involve breaking this data
>>>>> into finer-grained rows, which means shifting from GETs to SCANs -- are
>>>>> there performance trade-offs in such a change?
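
For reference, the usual pattern when one wide row is split into many
narrow rows is a bounded prefix scan; a sketch with hypothetical row keys
and table name:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.PrefixFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PrefixScanSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "mytable"); // hypothetical table
            // If wide row "entity42" were split into rows keyed
            // "entity42#<bucket>", one GET becomes a bounded SCAN:
            byte[] prefix = Bytes.toBytes("entity42#");
            Scan scan = new Scan(prefix);             // start at the prefix
            scan.setFilter(new PrefixFilter(prefix)); // stop once past it
            scan.setCaching(100);                     // rows fetched per RPC
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result r : scanner) {
                    // reassemble the entity from its narrow rows here
                }
            } finally {
                scanner.close();
            }
            table.close();
        }
    }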
>>>>>
>>>>> We are currently using CDH3u4, if that is relevant. All of our loading is
>>>>> done via HFile loading (bulk), so we have not had to tune write performance
>>>>> beyond using bulk loads. Any advice is appreciated, including what metrics
>>>>> we should be looking at to further diagnose our read performance
>>>>> challenges.
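
For completeness, the client-side half of HFile bulk loading is the
completebulkload step; a minimal sketch (the HFile directory and table
name are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

    public class BulkLoadSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "mytable");
            // Moves HFiles produced by HFileOutputFormat into the regions
            // that own their key ranges, bypassing the write path entirely.
            LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
            loader.doBulkLoad(new Path("/tmp/hfiles"), table);
            table.close();
        }
    }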
>>>>> Thanks,
>>>>> Mike Ellery
>>>

--
Marcos Ortiz Valmaseda,
Product Manager && Data Scientist at UCI
Blog: http://marcosluis2186.posterous.com
Twitter: @marcosluis2186