Michael Ellery
2013-02-07, 23:47
Ted Yu
2013-02-08, 00:34
Michael Ellery
2013-02-08, 01:02
Marcos Ortiz Valmaseda
2013-02-08, 01:08
Ted Yu
2013-02-08, 01:09
Michael Ellery
2013-02-08, 04:34
Marcos Ortiz
2013-02-08, 05:38
Asaf Mesika
2013-02-08, 16:25
Dave Wang
2013-02-08, 16:58
Ted Yu
2013-02-08, 17:50
-
column count guidelines
Michael Ellery, 2013-02-07, 23:47
I'm looking for some advice about per row CQ (column qualifier) count guidelines. Our current schema design means we have a HIGHLY variable CQ count per row -- some rows have one or two CQs and some rows have upwards of 1 million. Each CQ is on the order of 100 bytes (for round numbers) and the cell values are null. We see highly variable and too often unacceptable read performance using this schema. I don't know for a fact that the CQ count variability is the source of our problems, but I am suspicious.
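To put rough numbers on the row sizes described above, here is a back-of-the-envelope sketch in Python. It assumes the classic HBase KeyValue layout (4-byte key length, 4-byte value length, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte key type) and the default 64 KB HFile block size. The 16-byte row key and 1-byte family name are illustrative guesses, not figures from the post, and compression or block encoding would change the on-disk numbers.

```python
import math

# Fixed per-cell overhead in the classic HBase KeyValue layout, in bytes:
# key length (4) + value length (4) + row length (2) + family length (1)
# + timestamp (8) + key type (1).
KV_FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1

# Default HFile block size; real blocks only approximate this size.
DEFAULT_BLOCK_SIZE = 64 * 1024


def cell_size(row_len, family_len, qualifier_len, value_len=0):
    """Approximate uncompressed size of one cell (KeyValue) in bytes."""
    return KV_FIXED_OVERHEAD + row_len + family_len + qualifier_len + value_len


def row_footprint(num_qualifiers, row_len=16, family_len=1,
                  qualifier_len=100, block_size=DEFAULT_BLOCK_SIZE):
    """Return (total_bytes, approx_blocks) for a row with null cell values.

    row_len and family_len are assumed values chosen for illustration.
    """
    total = num_qualifiers * cell_size(row_len, family_len, qualifier_len)
    blocks = math.ceil(total / block_size)
    return total, blocks


if __name__ == "__main__":
    for n in (2, 1_000_000):
        total, blocks = row_footprint(n)
        print(f"{n} qualifiers: ~{total} bytes, ~{blocks} block(s)")
```

Under these assumptions a two-qualifier row fits in a fraction of one block, while a million-qualifier row spans on the order of a couple of thousand blocks, which is at least consistent with read times that vary widely from row to row.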
I'm curious about others' experience with CQ counts per row. Are there best practices or guidelines for how to optimally size the number of CQs per row? The other obvious solution would involve breaking this data into finer-grained rows, which means shifting from GETs to SCANs. Are there performance trade-offs in such a change?

We are currently using CDH3u4, if that is relevant. All of our loading is done via HFile bulk loading, so we have not had to tune write performance beyond using bulk loads. Any advice is appreciated, including which metrics we should be looking at to further diagnose our read performance challenges.

Thanks,
Mike Ellery
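One way to picture the "finer-grained rows" option is to fold a bucket derived from the qualifier into the row key, so that a single wide row becomes many narrower rows sharing a prefix, and the former GET becomes a SCAN bounded by that prefix. The Python below is only a sketch: `bucketed_row_key`, `scan_bounds`, the `0x00` separator, and the 2-byte bucket id are all hypothetical choices, and the code produces byte strings rather than calling any HBase API.

```python
import zlib

# Assumption: this byte never occurs inside the original row keys.
SEPARATOR = b"\x00"

# Hypothetical bucket count; must fit in the 2-byte bucket id used below.
NUM_BUCKETS = 1024


def bucketed_row_key(row_key: bytes, qualifier: bytes,
                     num_buckets: int = NUM_BUCKETS) -> bytes:
    """Map (row_key, qualifier) to a narrower row: row_key + 0x00 + bucket id."""
    bucket = zlib.crc32(qualifier) % num_buckets
    return row_key + SEPARATOR + bucket.to_bytes(2, "big")


def scan_bounds(row_key: bytes):
    """Start (inclusive) and stop (exclusive) keys covering every bucket of row_key."""
    start = row_key + SEPARATOR
    stop = row_key + bytes([SEPARATOR[0] + 1])
    return start, stop


if __name__ == "__main__":
    key = bucketed_row_key(b"user42", b"some-100-byte-qualifier")
    start, stop = scan_bounds(b"user42")
    print(key, start, stop)
```

With hash bucketing, a lookup of one known qualifier can still compute the exact row key and use a GET; only "give me everything for this logical row" needs the bounded SCAN. The cost is that qualifiers are no longer stored in one sorted run, so bucketing on a qualifier prefix instead of a hash may be the better fit if ordered range reads matter.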
-
Re: column count guidelines
Ted Yu, 2013-02-08, 00:34
How many column families are involved?

Have you considered upgrading to 0.94.4, where you would be able to benefit from lazy seek, Data Block Encoding, etc.?

Thanks
-
Re: column count guidelines
Michael Ellery, 2013-02-08, 01:02
There is only one CF in this schema.

Yes, we are looking at upgrading to CDH4, but it is not trivial since we cannot have cluster downtime. Our current upgrade plan involves additional hardware with side-by-side clusters until everything is exported/imported.

Thanks,
Mike
-
Re: column count guidelines
Marcos Ortiz Valmaseda, 2013-02-08, 01:08
I have the same advice as Ted Yu: you should upgrade to 0.94.4. It has a lot of good things that can be very beneficial for your use case.

--
Marcos Ortiz Valmaseda, Product Manager && Data Scientist at UCI
Blog: http://marcosluis2186.posterous.com
LinkedIn: http://www.linkedin.com/in/marcosluis2186
Twitter: @marcosluis2186
-
Re: column count guidelines
Ted Yu, 2013-02-08, 01:09
Thanks Michael for this information.

FYI, CDH4 (as of now) is based on HBase 0.92.x, which doesn't have the two features I cited earlier (lazy seek and Data Block Encoding).
-
Re: column count guidelines
Michael Ellery, 2013-02-08, 04:34
Thanks for reminding me of the HBase version in CDH4. That's something we'll definitely take into consideration.

-Mike
-
Re: column count guidelines
Marcos Ortiz, 2013-02-08, 05:38
My recommendation is to keep up to date with the latest HBase release and to wait for 0.96, which has a lot of improvements in almost every area. I talked about this in a blog post [1].

I think Coprocessors can be very helpful in your use case. In Chapter 4 of his book "HBase: The Definitive Guide", Lars explains how to use Counters and Coprocessors; you should read it. A great introduction to Coprocessors was posted on the HBase blog [2], and a great example of HBase performance tuning, including the use of Coprocessors, was posted by Hari Kumar from Ericsson Research on its Data and Knowledge blog [3].

Best wishes

[1] http://marcosluis2186.posterous.com/some-upcoming-features-in-hbase-096
[2] https://blogs.apache.org/hbase/entry/coprocessor_introduction
[3] http://labs.ericsson.com/blog/hbase-performance-tuners
-
Re: column count guidelines
Asaf Mesika, 2013-02-08, 16:25
Can you elaborate more on those features? I thought 0.94.4 was just for bug fixes.
-
Re: column count guidelines
Dave Wang, 2013-02-08, 16:58
Mike,

CDH4.2 will be out shortly, will be based on HBase 0.94, and will include both of the features that Ted mentioned and more.

- Dave
-
Re: column count guidelines
Ted Yu, 2013-02-08, 17:50
The reason I mentioned 0.94.4 was that it is the most recent 0.94 release.

For the features, you can refer to the following JIRAs:

HBASE-4465 Lazy-seek optimization for StoreFile scanners
HBASE-4218 Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

Cheers
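As a rough illustration of why Data Block Encoding (HBASE-4218) can matter for a schema like Mike's, with long qualifiers and empty values, here is a toy prefix encoder in Python. It is not HBase's actual encoder or on-disk format, and the function names and sample keys are invented for the example. For each key in sorted order it stores only the length of the prefix shared with the previous key, plus the remaining suffix.

```python
def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_encode(sorted_keys):
    """Encode each key as (shared_prefix_length, suffix) relative to the previous key."""
    encoded = []
    prev = b""
    for key in sorted_keys:
        shared = common_prefix_len(prev, key)
        encoded.append((shared, key[shared:]))
        prev = key
    return encoded


def prefix_decode(encoded):
    """Rebuild the original keys from (shared_prefix_length, suffix) pairs."""
    keys = []
    prev = b""
    for shared, suffix in encoded:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys


if __name__ == "__main__":
    # Made-up keys: one row and family, many similar qualifiers.
    keys = sorted(b"row-0001/cf:" + b"event-type-%06d" % i for i in range(1000))
    encoded = prefix_encode(keys)
    raw = sum(len(k) for k in keys)
    packed = sum(len(suffix) for _, suffix in encoded)
    print(f"raw key bytes: {raw}, suffix bytes after prefix encoding: {packed}")
```

Because every cell in a wide row repeats the same row key and family, and neighbouring qualifiers often share a long prefix, most of each key is redundant with the one before it. That is the redundancy this kind of encoding is designed to remove, though the real savings depend on the actual key distribution.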