HBase user mailing list: column count guidelines


Re: column count guidelines
Mike,

CDH4.2 will be out shortly, will be based on HBase 0.94, and will include
both of the features that Ted mentioned and more.

- Dave

On Thu, Feb 7, 2013 at 8:34 PM, Michael Ellery <[EMAIL PROTECTED]> wrote:

>
> thanks for reminding me of the HBase version in CDH4 - that's something
> we'll definitely take into consideration.
>
> -Mike
>
> On Feb 7, 2013, at 5:09 PM, Ted Yu wrote:
>
> > Thanks Michael for this information.
> >
> > FYI CDH4 (as of now) is based on HBase 0.92.x which doesn't have the two
> > features I cited below.
> >
> > On Thu, Feb 7, 2013 at 5:02 PM, Michael Ellery <[EMAIL PROTECTED]>
> wrote:
> >
> >> There is only one CF in this schema.
> >>
> >> Yes, we are looking at upgrading to CDH4, but it is not trivial since we
> >> cannot have cluster downtime. Our current upgrade plan involves additional
> >> hardware with side-by-side clusters until everything is exported/imported.
> >>
> >> Thanks,
> >> Mike
> >>
> >> On Feb 7, 2013, at 4:34 PM, Ted Yu wrote:
> >>
> >>> How many column families are involved?
> >>>
> >>> Have you considered upgrading to 0.94.4, where you would be able to
> >>> benefit from lazy seek, Data Block Encoding, etc.?
> >>>
> >>> Thanks
> >>>
> >>> On Thu, Feb 7, 2013 at 3:47 PM, Michael Ellery <[EMAIL PROTECTED]>
> >> wrote:
> >>>
> >>>> I'm looking for some advice about per-row CQ (column qualifier) count
> >>>> guidelines. Our current schema design means we have a HIGHLY variable
> >>>> CQ count per row -- some rows have one or two CQs and some rows have
> >>>> upwards of 1 million. Each CQ is on the order of 100 bytes (for round
> >>>> numbers) and the cell values are null. We see highly variable and too
> >>>> often unacceptable read performance using this schema. I don't know for
> >>>> a fact that the CQ count variability is the source of our problems, but
> >>>> I am suspicious.
> >>>>
> >>>> I'm curious about others' experience with CQ counts per row -- are
> >>>> there some best practices/guidelines about how to optimally size the
> >>>> number of CQs per row? The other obvious solution will involve breaking
> >>>> this data into finer-grained rows, which means shifting from GETs to
> >>>> SCANs -- are there performance trade-offs in such a change?
> >>>>
> >>>> We are currently using CDH3u4, if that is relevant. All of our loading
> >>>> is done via HFile loading (bulk), so we have not had to tune write
> >>>> performance beyond using bulk loads. Any advice appreciated, including
> >>>> what metrics we should be looking at to further diagnose our read
> >>>> performance challenges.
> >>>>
> >>>> Thanks,
> >>>> Mike Ellery
> >>
> >>
>
>
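
A rough sketch of the row-splitting idea Mike asks about above, using the 0.9x-era HBase Java client: instead of one wide row holding up to a million qualifiers, each qualifier becomes its own narrow row under a shared key prefix, and the single GET becomes a prefix SCAN. The table names, the "item42|" key scheme, and the caching value are hypothetical, not from the thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class WideVsNarrow {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Wide-row layout: one GET pulls back the whole row, including rows
        // carrying on the order of a million ~100-byte qualifiers.
        HTable wide = new HTable(conf, "items_wide");       // hypothetical table
        Get get = new Get(Bytes.toBytes("item42"));
        Result wholeRow = wide.get(get);                    // entire row in one call

        // Narrow-row layout: row key = original key + "|" + qualifier, so one
        // logical entity spans many small rows read with a prefix SCAN.
        HTable narrow = new HTable(conf, "items_narrow");   // hypothetical table
        Scan scan = new Scan(Bytes.toBytes("item42|"),      // start of the prefix
                             Bytes.toBytes("item42~"));     // just past the prefix
        scan.setCaching(500);                               // rows fetched per RPC
        ResultScanner scanner = narrow.getScanner(scan);
        for (Result r : scanner) {
            // each Result here corresponds to one former column qualifier
        }
        scanner.close();
        wide.close();
        narrow.close();
    }
}

The trade-off Mike anticipates is real in general terms: a prefix SCAN makes more round trips than a single GET, but scanner caching batches rows per RPC, and each individual row stays small enough to read predictably.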
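And a sketch of turning on the Data Block Encoding Ted mentions above (available from HBase 0.94), via the HBaseAdmin API. The table and family names are again hypothetical, and FAST_DIFF is just one of the available encodings; it is a per-column-family setting, and existing data picks it up as compactions rewrite the HFiles.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
import org.apache.hadoop.hbase.util.Bytes;

public class EnableEncoding {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        byte[] table = Bytes.toBytes("items_wide");  // hypothetical table name
        byte[] family = Bytes.toBytes("d");          // hypothetical column family

        // Fetch the current descriptor, switch the family to FAST_DIFF
        // encoding, and push the change back to the cluster.
        HTableDescriptor desc = admin.getTableDescriptor(table);
        HColumnDescriptor col = desc.getFamily(family);
        col.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);

        admin.disableTable(table);
        admin.modifyColumn(table, col);
        admin.enableTable(table);
        admin.close();
    }
}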