HBase >> mail # user >> column count guidelines


Re: column count guidelines
Mike,

CDH4.2 will be out shortly, will be based on HBase 0.94, and will include
both of the features that Ted mentioned and more.

- Dave

On Thu, Feb 7, 2013 at 8:34 PM, Michael Ellery <[EMAIL PROTECTED]> wrote:

>
> thanks for reminding me of the HBase version in CDH4 - that's something
> we'll definitely take into consideration.
>
> -Mike
>
> On Feb 7, 2013, at 5:09 PM, Ted Yu wrote:
>
> > Thanks Michael for this information.
> >
> > FYI CDH4 (as of now) is based on HBase 0.92.x which doesn't have the two
> > features I cited below.
> >
> > On Thu, Feb 7, 2013 at 5:02 PM, Michael Ellery <[EMAIL PROTECTED]>
> > wrote:
> >
> >> There is only one CF in this schema.
> >>
> >> Yes, we are looking at upgrading to CDH4, but it is not trivial since
> >> we cannot have cluster downtime. Our current upgrade plan involves
> >> additional hardware with side-by-side clusters until everything is
> >> exported/imported.
> >>
> >> Thanks,
> >> Mike
> >>
> >> On Feb 7, 2013, at 4:34 PM, Ted Yu wrote:
> >>
> >>> How many column families are involved?
> >>>
> >>> Have you considered upgrading to 0.94.4, where you would be able to
> >>> benefit from lazy seek, Data Block Encoding, etc.?
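
[Editor's note: for reference, on 0.94+ the Data Block Encoding that Ted
mentions can be switched on per column family from the HBase shell. The
table and family names below are placeholders, and on 0.94 the table
typically has to be disabled first unless online schema change is enabled:

```
disable 'mytable'
alter 'mytable', {NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST_DIFF'}
enable 'mytable'
```

FAST_DIFF tends to help schemas like this one, where many qualifiers in a
row share long common prefixes.]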
> >>>
> >>> Thanks
> >>>
> >>> On Thu, Feb 7, 2013 at 3:47 PM, Michael Ellery <[EMAIL PROTECTED]>
> >>> wrote:
> >>>
> >>>> I'm looking for some advice about per-row CQ (column qualifier)
> >>>> count guidelines. Our current schema design means we have a HIGHLY
> >>>> variable CQ count per row -- some rows have one or two CQs and some
> >>>> rows have upwards of 1 million. Each CQ is on the order of 100 bytes
> >>>> (for round numbers) and the cell values are null. We see highly
> >>>> variable and too often unacceptable read performance using this
> >>>> schema. I don't know for a fact that the CQ count variability is the
> >>>> source of our problems, but I am suspicious.
> >>>>
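
[Editor's note: the suspicion is well founded at this scale. A quick
back-of-envelope with the numbers from the message (per-KeyValue storage
overhead deliberately ignored, so the real figure is larger):

```java
public class RowSizeEstimate {
    public static void main(String[] args) {
        long qualifiers = 1_000_000L;  // worst-case CQ count from the message
        long bytesPerCQ = 100L;        // ~100 bytes per qualifier, null values
        long rowBytes = qualifiers * bytesPerCQ;
        // ~95 MiB in a single row, before per-KeyValue overhead. HBase never
        // splits a region in the middle of a row, so one such row dominates
        // its region, and a Get has to seek through it unless specific
        // qualifiers are requested.
        System.out.println(rowBytes / (1024 * 1024) + " MiB");
    }
}
```
]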
> >>>> I'm curious about others' experience with CQ counts per row -- are
> >>>> there best practices/guidelines for how to optimally size the number
> >>>> of CQs per row? The other obvious solution would involve breaking
> >>>> this data into finer-grained rows, which means shifting from GETs to
> >>>> SCANs -- are there performance trade-offs in such a change?
> >>>>
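
[Editor's note: one common pattern for the split Mike describes is to hash
each qualifier into a fixed number of bucket rows that share the original
key as a prefix, so a prefix Scan replaces the single wide Get. A minimal
sketch in plain Java -- the class name, `#` separator, and bucket count are
illustrative, not from the thread:

```java
public class BucketedKeys {
    static final int BUCKETS = 64; // assumed bucket count; tune per workload

    // Map a (logical row, qualifier) pair to one of BUCKETS narrower rows.
    // All buckets share the "row#" prefix, so a Scan over "row#00".."row#63"
    // covers the whole logical row where a single Get used to.
    static String bucketedKey(String row, String qualifier) {
        int bucket = (qualifier.hashCode() & 0x7fffffff) % BUCKETS;
        return String.format("%s#%02d", row, bucket);
    }

    public static void main(String[] args) {
        System.out.println(bucketedKey("user123", "eventA"));
    }
}
```

The trade-off asked about is real: a Scan pays per-RPC and seek overhead
that a single Get avoids, but each bucket row stays small enough to read
predictably, and scanner caching can amortize the RPC cost.]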
> >>>> We are currently using CDH3u4, if that is relevant. All of our
> >>>> loading is done via HFile loading (bulk), so we have not had to tune
> >>>> write performance beyond using bulk loads. Any advice is appreciated,
> >>>> including what metrics we should be looking at to further diagnose
> >>>> our read performance challenges.
> >>>>
> >>>> Thanks,
> >>>> Mike Ellery
> >>
> >>
>
>