Jonathan Hsieh
2012-08-29, 18:30
Ted Yu
2012-08-29, 18:40
Devaraj Das
2012-08-29, 20:06
Jimmy Xiang
2012-08-29, 20:11
Andrew Purtell
2012-08-29, 20:15
Stack
2012-08-29, 20:32
Devaraj Das
2012-08-29, 21:43
Jonathan Hsieh
2012-08-29, 23:12
Ramkrishna.S.Vasudevan
2012-08-30, 04:21
Stack
2012-08-30, 04:25
Ramkrishna.S.Vasudevan
2012-08-30, 04:35
Stack
2012-08-30, 04:56
Devaraj Das
2012-08-30, 06:12
Andrew Purtell
2012-08-30, 06:58
Ramkrishna.S.Vasudevan
2012-08-30, 07:05
N Keywal
2012-08-30, 07:38
Ted Yu
2012-08-30, 17:20
Lars George
2012-08-30, 22:04
Devaraj Das
2012-08-30, 22:36
Stack
2012-08-30, 22:42
Stack
2012-08-31, 22:59
Stack
2012-09-03, 15:40
Ramkrishna.S.Vasudevan
2012-09-05, 04:18
Stack
2012-09-05, 04:36
Stack
2012-09-09, 22:08
Jesse Yates
2012-09-09, 22:11
Stack
2012-09-09, 22:21
Jesse Yates
2012-09-09, 22:25
Stack
2012-09-09, 22:44
Jacques
2012-09-10, 03:03
Stack
2012-09-10, 04:41
Jacques
2012-09-10, 07:03
Ted Yu
2012-09-10, 17:51
Andrew Purtell
2012-09-10, 17:58
Andrew Purtell
2012-09-10, 18:09
Matt Corgan
2012-09-10, 19:13
Jacques
2012-09-10, 20:45
Jacques
2012-09-10, 20:50
lars hofhansl
2012-09-10, 22:46
Jacques
2012-09-10, 23:40
Devaraj Das
2012-09-11, 00:21
Matt Corgan
2012-09-11, 01:20
Jacques
2012-09-11, 04:04
Andrew Purtell
2012-09-11, 04:22
Ramkrishna.S.Vasudevan
2012-09-11, 04:47
Matt Corgan
2012-09-11, 05:59
-
HBase Developer's Pow-wow.Jonathan Hsieh 2012-08-29, 18:30
There are a couple of discussions brewing and major changes being made, so it would be
good to have a face-to-face pow-wow to demo, to discuss designs, and to talk about project goals and policies. This would be mostly focused on project internals and would maybe last half a day.

Here are some suggestions for agenda items.

* Jimmy on Major Assignment Manager refactor
* Enis on integration testing infrastructure
* Process change ideas:
  - Revisit check-in policies for trunk and sustaining branches.
  - Strategies for keeping Jenkins blue? Have a flaky test list that avoids running flaky tests? File a JIRA automatically on failure?
  - Holistic reviews of some subsystem's code (there are some convoluted evolved portions of code that could use some intelligent redesign)
* Major features like Secondary Indexes - core, coproc, or external?

The last few meetups and hackathons like this were at Salesforce, eBay, Stumble and Cloudera. I can look into having Cloudera host either in its new SF "penthouse" office or in its PA office.

Thoughts?
Jon.

--
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [EMAIL PROTECTED]
-
Re: HBase Developer's Pow-wow.Ted Yu 2012-08-29, 18:40
I would vote for Cloudera PA office.
Thanks Jon for this initiative.
-
Re: HBase Developer's Pow-wow.Devaraj Das 2012-08-29, 20:06
We could look at hosting here at Hortonworks, Sunnyvale. Thoughts?
-
Re: HBase Developer's Pow-wow.Jimmy Xiang 2012-08-29, 20:11
+1
-
Re: HBase Developer's Pow-wow.Andrew Purtell 2012-08-29, 20:15
+1
I can be up week of the 10th if that's convenient.
- Andy
-
Re: HBase Developer's Pow-wow.Stack 2012-08-29, 20:32
On Wed, Aug 29, 2012 at 1:06 PM, Devaraj Das <[EMAIL PROTECTED]> wrote:
> We could look at hosting here at Hortonworks, Sunnyvale. Thoughts?

Let's do it. I meant to say that Devaraj suggested a while back that HW could host the next one. Seems fine by me (as long as the beer is as good as it was at SF, Devaraj!).

Week of the 10th, so Andrew is included, sounds good to me too.
St.Ack
-
Re: HBase Developer's Pow-wow.Devaraj Das 2012-08-29, 21:43
Cool! I'll make sure the beer is good :-)
-
Re: HBase Developer's Pow-wow.Jonathan Hsieh 2012-08-29, 23:12
Great!
If we are going down to Sunnyvale, I'd prefer a Tuesday. DD, could we schedule for a Tuesday 9/11 afternoon?

Jon.

--
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [EMAIL PROTECTED]
-
RE: HBase Developer's Pow-wow.Ramkrishna.S.Vasudevan 2012-08-30, 04:21
I would be interested in this, maybe for the Secondary Index related
discussion. Can I attend over the phone? Or some other way?

Regards
Ram
-
Re: HBase Developer's Pow-wow.Stack 2012-08-30, 04:25
On Wed, Aug 29, 2012 at 9:21 PM, Ramkrishna.S.Vasudevan
<[EMAIL PROTECTED]> wrote:
> I would be interested on this may be for the Secondary index related
> discussion.
>
> Can I attend it over phone? Or someother way ?

We would love to have you Ram. 5PM our time is 5AM your time? That's kinda early Ram.
St.Ack
-
RE: HBase Developer's Pow-wow.Ramkrishna.S.Vasudevan 2012-08-30, 04:35
It should be ok, at least for one day :).

Regards
Ram
-
Re: HBase Developer's Pow-wow.Stack 2012-08-30, 04:56
On Wed, Aug 29, 2012 at 9:35 PM, Ramkrishna.S.Vasudevan
<[EMAIL PROTECTED]> wrote:
> It should be ok, atleast for one day :).

Yeah, for the next one, we promise that we'll all get up at 4AM so you can dial in at a reasonable 5PM (smile).
St.Ack
-
Re: HBase Developer's Pow-wow.Devaraj Das 2012-08-30, 06:12
Sure, Jon (assuming it works for everyone else). I will start the ball
rolling on the logistics.
-
Re: HBase Developer's Pow-wow.Andrew Purtell 2012-08-30, 06:58
Thanks for being accommodating, guys. See you on the 11th. I'll have to
bring some elephant swag for the HW office.
- Andy
-
RE: HBase Developer's Pow-wow.Ramkrishna.S.Vasudevan 2012-08-30, 07:05
Hi
Latest improvement, Tuesday should be best for me. +1 for Tuesday.

Regards
Ram
-
Re: HBase Developer's Pow-wow.N Keywal 2012-08-30, 07:38
Hi,
I won't attend, but would it be possible to share the agenda in an editable format (Google Doc or similar)? I could then add some comments on the points you will be working on...

Thanks in advance,
Nicolas
-
Re: HBase Developer's Pow-wow.Ted Yu 2012-08-30, 17:20
For 'Strategies for keeping Jenkins blue', we can take this opportunity to
fix a few known flaky tests.

Cheers
-
Re: HBase Developer's Pow-wow.Lars George 2012-08-30, 22:04
Bummer, I will be in PA the week after, i.e. 9/17. It would have been great to see you all again.
Lars
-
Re: HBase Developer's Pow-wow.Devaraj Das 2012-08-30, 22:36
Should we move it to that week to accommodate Lars?
-
Re: HBase Developer's Pow-wow.Stack 2012-08-30, 22:42
On Thu, Aug 30, 2012 at 3:36 PM, Devaraj Das <[EMAIL PROTECTED]> wrote:
> Should we move it to that week to accommodate Lars?
>

We could. We had it set for the week of 10th so Andrew could come. Andrew could you come the following week?

St.Ack
-
Re: HBase Developer's Pow-wow.Stack 2012-08-31, 22:59
On Thu, Aug 30, 2012 at 3:42 PM, Stack <[EMAIL PROTECTED]> wrote:
> On Thu, Aug 30, 2012 at 3:36 PM, Devaraj Das <[EMAIL PROTECTED]> wrote:
>> Should we move it to that week to accommodate Lars?
>>
>
> We could. We had it set for the week of 10th so Andrew could come.
> Andrew could you come the following week?

An off-list exchange has it that Andrew can't make the following week so I'd say, because LarsG showed up on the thread later, let's stick w/ the original proposal of 9/11. What time would suit? 6pm? Max 20? 30?

I'll put a post up on meetup.com for bay area hbase.

St.Ack
-
Re: HBase Developer's Pow-wow.Stack 2012-09-03, 15:40
On Fri, Aug 31, 2012 at 3:59 PM, Stack <[EMAIL PROTECTED]> wrote:
> I'll put a post up on meetup.com for bay area hbase.

I put the meetup up here: http://www.meetup.com/hbaseusergroup/events/80621872/ (2pm at HWX). Let me know if any of the details are off (Thanks to Jon for the bulk of the text).

St.Ack
-
RE: HBase Developer's Pow-wow.Ramkrishna.S.Vasudevan 2012-09-05, 04:18
Stack, I may not be able to join seeing the time 2pm which is 2AM over here.
Anyway I can share my thoughts after the discussions are drafted in a writeup.

Regards
Ram

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Stack
> Sent: Monday, September 03, 2012 9:11 PM
> To: [EMAIL PROTECTED]
> Subject: Re: HBase Developer's Pow-wow.
>
> On Fri, Aug 31, 2012 at 3:59 PM, Stack <[EMAIL PROTECTED]> wrote:
> > I'll put a post up on meetup.com for bay area hbase.
>
> I put the meetup up here:
> http://www.meetup.com/hbaseusergroup/events/80621872/ (2pm at HWX).
> Let me know if any of the details are off (Thanks to Jon for the bulk
> of the text).
> St.Ack
-
Re: HBase Developer's Pow-wow.Stack 2012-09-05, 04:36
On Tue, Sep 4, 2012 at 9:18 PM, Ramkrishna.S.Vasudevan <[EMAIL PROTECTED]> wrote:
> Stack, I may not be able to join seeing the time 2pm which is 2AM over here.
> Anyway I can share my thoughts after the discussions are drafted in a
> writeup.
>

Understood (Pardon our insensitivity arriving at a 2AM, for you, start time Ram).

St.Ack
-
Re: HBase Developer's Pow-wow.Stack 2012-09-09, 22:08
On Mon, Sep 3, 2012 at 8:40 AM, Stack <[EMAIL PROTECTED]> wrote:
> On Fri, Aug 31, 2012 at 3:59 PM, Stack <[EMAIL PROTECTED]> wrote:
>> I'll put a post up on meetup.com for bay area hbase.
>
> I put the meetup up here:
> http://www.meetup.com/hbaseusergroup/events/80621872/ (2pm at HWX).
> Let me know if any of the details are off (Thanks to Jon for the bulk
> of the text).

Regarding Tuesday's meetup:

+ We have our Jimmy Xiang to do an overview on recent AssignmentManager changes and discussion of what we should do in AM-land over the near future
+ Mighty Enis will talk up his fat Integration Tests addition + ChaosMonkey messer that is about to be committed and how we can now check in a new class of tests.

We are missing fellas to lead a chat on process change ideas (How to have it so Jenkins is more blue than red; How do we enforce more rigor around what gets committed, etc.). Anyone want to volunteer? I'd volunteer LarsH since he was last to float these eternally recurring notions but I believe he will be up on Half Dome looking down on us when the meeting goes off. Anyone else want to lead the discussion (Jon? Andrew?)?

Anyone want to lead a discussion on what's next? Post 0.96?

Anything else that folks want to talk about?

(I'll post above on the meetup too).

St.Ack
-
Re: HBase Developer's Pow-wow.Jesse Yates 2012-09-09, 22:11
> We are missing fellas to lead a chat on process change ideas (How to
> have it so Jenkins is more blue than red; How do we enforce more rigor
> around what gets committed, etc.). Anyone want to volunteer? I'd
> volunteer LarsH since he was last to float these eternally recurring
> notions but I believe he will be up on Half Dome looking down on us
> when the meeting goes off. Anyone else want to lead the discussion
> (Jon? Andrew?)?
>

I thought Lars would be back by the meetup, but let's get a second talker on it too :)

> Anyone want to lead a discussion on what's next? Post 0.96?
>
> Anything else that folks want to talk about?
>

I think we talked about wanting to do secondary indexing as well, at least what that means for HBase (and maybe some of the _how_ it would work too).

-Jesse
-------------------
Jesse Yates
@jesse_yates
jyates.github.com
-
Re: HBase Developer's Pow-wow.Stack 2012-09-09, 22:21
On Sun, Sep 9, 2012 at 3:11 PM, Jesse Yates <[EMAIL PROTECTED]> wrote:
> I think we talked about wanting to do secondary indexing as well, at least
> what that means for HBase (and maybe some of the _how_ it would work too).
>

Mind leading it Jesse? You have the necessary qualifications (smile). Would suggest you include a rehearsal of points made by Andrew Purtell and LarsH in the most recent thread on 2ndary indexes.

(Hopefully LarsH is back by Tuesday. Unless someone else volunteers meantime, let's volunteer him to lead the process section).

St.Ack
-
Re: HBase Developer's Pow-wow.Jesse Yates 2012-09-09, 22:25
On Sun, Sep 9, 2012 at 3:21 PM, Stack <[EMAIL PROTECTED]> wrote:
> On Sun, Sep 9, 2012 at 3:11 PM, Jesse Yates <[EMAIL PROTECTED]> wrote:
> > I think we talked about wanting to do secondary indexing as well, at least
> > what that means for HBase (and maybe some of the _how_ it would work too).
> >
>
> Mind leading it Jesse? You have the necessary qualifications (smile).
> Would suggest you include a rehearsal of points made by Andrew
> Purtell and LarsH in the most recent thread on 2ndary indexes.
>

....ok, I can do that :)

-------------------
Jesse Yates
@jesse_yates
jyates.github.com

> (Hopefully LarsH is back by Tuesday. Unless someone else volunteers
> meantime, let's volunteer him to lead the process section).
> St.Ack
>
-
Re: HBase Developer's Pow-wow.Stack 2012-09-09, 22:44
On Sun, Sep 9, 2012 at 3:25 PM, Jesse Yates <[EMAIL PROTECTED]> wrote:
> On Sun, Sep 9, 2012 at 3:21 PM, Stack <[EMAIL PROTECTED]> wrote:
>
>> On Sun, Sep 9, 2012 at 3:11 PM, Jesse Yates <[EMAIL PROTECTED]> wrote:
>> > I think we talked about wanting to do secondary indexing as well, at least
>> > what that means for HBase (and maybe some of the _how_ it would work too).
>> >
>>
>> Mind leading it Jesse? You have the necessary qualifications (smile).
>> Would suggest you include a rehearsal of points made by Andrew
>> Purtell and LarsH in the most recent thread on 2ndary indexes.
>>
>>
> ....ok, I can do that :)

Adding you to the list... Thanks J,

St.Ack
-
Re: HBase Developer's Pow-wow.Jacques 2012-09-10, 03:03
Some random thoughts/questions bubbling around in my mind regarding secondary indexes/indices.

What are the top 5 use cases people are trying to solve?

What solves more of these needs: synchronous 'transactional' or asynchronous best-effort (or delayed durable) index commit?

Does family level indexing make sense or is the real need for qualifier level indexing?

What are ideas for a client interface and how transparent is index usage? (E.g. if you set a filter on a qualifier... )

How important is supporting multiple simultaneous criteria or would 90% of use cases be captured with single criteria support?

How important is value multi-parsing (e.g. a single value can be indexed to multiple index values: e.g. free text indexing)?

What were the challenges and issues with the proof of concept TrendMicro approach that ultimately made it untenable? (was an eventually consistent approach)

What are people's thoughts regarding region-level alternative structure, secondary table structure, etc?

Is it important to colocate/duplicate indexed values and/or additional portions of data in secondary indices to minimize disk seeks (almost making HBase optionally more columnar in nature)?

How important are multi-qualifier indexes? (e.g. when you want to do a query for all users who are male engineers that have kids)

How important is partial index matching/ range matching (e.g. startswith and/or between)?

How important is ordering of returned values? (e.g. if you support startswith or range matching and you do indexing at the region-level, you'll be able to get back two rows with the same value that are interspersed with rows of different values)

These were partially in response to:
http://wiki.apache.org/hadoop/Hbase/SecondaryIndexing
http://apache-hbase.679495.n3.nabble.com/what-s-the-roadmap-of-secondary-index-of-hbase-td2573618.html
https://issues.apache.org/jira/browse/HBASE-3529
https://issues.apache.org/jira/browse/HBASE-2038
https://issues.apache.org/jira/browse/HBASE-3340
https://github.com/jyates/culvert

On Sun, Sep 9, 2012 at 3:44 PM, Stack <[EMAIL PROTECTED]> wrote:

> On Sun, Sep 9, 2012 at 3:25 PM, Jesse Yates <[EMAIL PROTECTED]> wrote:
> > On Sun, Sep 9, 2012 at 3:21 PM, Stack <[EMAIL PROTECTED]> wrote:
> >
> >> On Sun, Sep 9, 2012 at 3:11 PM, Jesse Yates <[EMAIL PROTECTED]> wrote:
> >> > I think we talked about wanting to do secondary indexing as well, at least
> >> > what that means for HBase (and maybe some of the _how_ it would work too).
> >> >
> >>
> >> Mind leading it Jesse? You have the necessary qualifications (smile).
> >> Would suggest you include a rehearsal of points made by Andrew
> >> Purtell and LarsH in the most recent thread on 2ndary indexes.
> >>
> >>
> > ....ok, I can do that :)
>
> Adding you to the list... Thanks J,
> St.Ack
>
-
Re: HBase Developer's Pow-wow.Stack 2012-09-10, 04:41
On Sun, Sep 9, 2012 at 8:03 PM, Jacques <[EMAIL PROTECTED]> wrote:
> Some random thoughts/questions bubbling around in my mind regarding
> secondary indexes/indices.
>

Nice list Jacques.

(Jesse, here is your chance to look real good. You are getting the questions in advance! When Jacques stands up to start asking Tuesday, you can look real intelligent as you bang out the answers)

St.Ack
-
Re: HBase Developer's Pow-wow.Jacques 2012-09-10, 07:03
more food for thought on secondary indexing...

*Additional questions*:

- How important is indexing column qualifiers themselves (similar to Cassandra where people frequently utilize column qualifiers as "values" with no actual values stored)?
- How important is indexing cell timestamps?

*More thoughts/my answers on some of the questions I posed:*

- From my experience, indexes should be at the region level (e.g. row-level sharding as opposed to term). Other sharding approaches will likely have scale and consistency problems.
- In general it seems like there is tension between the main low level approaches of (1) leverage as much HBase infrastructure as possible (e.g. secondary tables) and (2) leverage an efficient indexing library e.g. Lucene.

*Approach Thoughts*

Trying to leverage HBase as much as possible is hard if we want to utilize the approach above and have consistent indexing. However, I think we can do it if we add support for what I will call a "local shadow family". These are additional, internally managed families for a table. However, they have the special characteristic that they belong to the region despite their primary keys being outside the range of the region's. Otherwise they look like a typical family. On splits, they are regenerated (somehow). If we take advantage of Lars' HBASE-5229<https://issues.apache.org/jira/browse/HBASE-5229>, we then have the opportunity to consistently insert one or more rows into these local shadow families for the purpose of secondary indexing. The structure of these secondary families could use row keys as the indexed values, qualifiers for specific store files and the value of each being a list of originating keys (using read-append or HBASE-5993<https://issues.apache.org/jira/browse/HBASE-5993>). By leveraging the existing family infrastructure, we get things like optional in-memory indexes and basic scanners for free and don't have to swallow a big chunk of external indexing code.

The simplest approach for integration of these for queries would internally be a ScannerBasedFilter (a filter that is based on a scanner) and a GroupingScanner (a Scanner that does intersection and/or union of scanners for multi criteria queries). Implementation of these scanners could happen at one of two levels:

- StoreScanner level: A more efficient approach using the store file qualifier approach above (this allows easier maintenance of index deletions)
- RegionScanner level: A simpler implementation with less violation of existing encapsulation. We'd store row keys in qualifiers instead of values to ensure ordering that works iteratively with RegionScanner. The weaknesses of this approach are less efficient scanning and figuring out how to manage primary value deletes.

In general, the best way to deal with deletes is probably to age them out per storefile and just filter "near misses" as a secondary filter that works with ScannerBasedFilter. The client side would be TBD but would probably offer some kind of criteria filters that on server side had all the lower level ramifications.

*Future Optimizations*

In a perfect world, we'd actually use StoreFile block start locations as the index pointer values in the secondary families. This would make things much more compact and efficient, especially if we used a smarter block codec that took advantage of this nature. However, this requires quite a bit more work since we'd need to actually use the primary keys in the secondary memstore and then "patch" the values to block locations as we flushed the primary family that we were indexing (ugh).

Assuming that the primary limiter of peak write throughput for HBase is typically WAL writing and since indexes have no "real" data, we could consider disabling WAL for local shadow families and simply regenerate this data upon primary WAL playback. I haven't spent enough time in that code to know what kind of consistency pain this would cause (my intuition is it would be fine as long as we didn't fix HBASE-3149<https://issues.apache.org/jira/browse/HBASE-3149>). If consistency isn't a problem, this would be a nice option since it means that indexing would have minimal impact on peak write throughput.

*I haven't thought at all about...*

- How/whether this makes sense to be implemented as a coprocessor.
- Weird timestamp impacts/considerations here.
- Version handling/impacts.

On Sun, Sep 9, 2012 at 8:03 PM, Jacques <[EMAIL PROTECTED]> wrote:
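
The GroupingScanner idea in the message above (intersecting and/or unioning per-criterion index scanners) can be sketched in plain Java. This is only a minimal sketch, not HBase code: `GroupingScannerSketch`, `intersect`, and `union` are hypothetical names, and it assumes each index scanner yields originating row keys as sorted, duplicate-free strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch; none of these names exist in HBase. */
class GroupingScannerSketch {

    /** Returns the next row key, or null once the scanner is exhausted. */
    private static String next(Iterator<String> it) {
        return it.hasNext() ? it.next() : null;
    }

    /** AND: row keys present in both sorted, duplicate-free inputs. */
    static List<String> intersect(Iterator<String> a, Iterator<String> b) {
        List<String> out = new ArrayList<>();
        String x = next(a);
        String y = next(b);
        while (x != null && y != null) {
            int c = x.compareTo(y);
            if (c == 0) {          // both criteria match this row
                out.add(x);
                x = next(a);
                y = next(b);
            } else if (c < 0) {    // advance whichever side is behind
                x = next(a);
            } else {
                y = next(b);
            }
        }
        return out;
    }

    /** OR: row keys present in either input, still sorted, no duplicates. */
    static List<String> union(Iterator<String> a, Iterator<String> b) {
        List<String> out = new ArrayList<>();
        String x = next(a);
        String y = next(b);
        while (x != null || y != null) {
            if (y == null || (x != null && x.compareTo(y) < 0)) {
                out.add(x);
                x = next(a);
            } else if (x == null || y.compareTo(x) < 0) {
                out.add(y);
                y = next(b);
            } else {               // same row on both sides: emit once
                out.add(x);
                x = next(a);
                y = next(b);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed index contents: row keys matching each single criterion.
        List<String> male = Arrays.asList("row1", "row3", "row5");
        List<String> engineer = Arrays.asList("row2", "row3", "row5", "row7");
        System.out.println(intersect(male.iterator(), engineer.iterator())); // [row3, row5]
        System.out.println(union(male.iterator(), engineer.iterator()));
    }
}
```

Because both inputs are consumed in order, the merged result could in principle be streamed into the primary scan rather than materialized as a list; the lists here only keep the sketch short.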
-
Re: HBase Developer's Pow-wow.Ted Yu 2012-09-10, 17:51
Jacques:
Thanks for your sharing.

bq. row-level sharding as opposed to term

Please elaborate on the above a little more: what is term sharding?

bq. for what I will call a "local shadow family"

I like this idea. User may request more than one index. Currently HBase is not so good at serving high number of families. So we may need to watch out.

bq. GroupingScanner (a Scanner that does intersection and/or union of scanners for multi criteria queries)

Do you think the following enhancement is related to your proposal above?
HBASE-5416 Improve performance of scans with some kind of filters

bq. and then "patch" the values to block locations as we flushed the primary family that we were indexing (ugh).

Yeah. We also need to consider the effect of compaction.

bq. my intuition is it would be fine as long as we didn't fix HBASE-3149

I was actually expecting someone to pick up the work of HBASE-3149 :-)

Cheers

On Mon, Sep 10, 2012 at 12:03 AM, Jacques <[EMAIL PROTECTED]> wrote: > more food for thought on secondary indexing... > > *Additional questions*: > > - How important is indexing column qualifiers themselves (similar to > Cassandra where people frequently utilize column qualifiers as "values" > with no actual values stored)? > - How important is indexing cell timestamps? > > > *More thoughts/my answers on some of the questions I posed:* > > - From my experience, indexes should be at the region level (e.g. > row-level sharding as opposed to term). Other sharding approaches will > likely have scale and consistency problems. > - In general it seems like there is tension between the main low level > approaches of (1) leverage as much HBase infrastructure as possible > (e.g. > secondary tables) and (2) leverage an efficient indexing library e.g. > Lucene. > > *Approach Thoughts* > Trying to leverage HBase as much as possible is hard if we want to utilize > the approach above and have consistent indexing.
However, I think we can > do it if we add support for what I will call a "local shadow family". > These are additional, internally managed families for a table. However, > they have the special characteristic that they belong to the region despite > their primary keys being outside the range of the region's. Otherwise they > look like a typical family. On splits, they are regenerated (somehow). If > we take advantage of Lars' > HBASE-5229<https://issues.apache.org/jira/browse/HBASE-5229>, > we then have the opportunity to consistently insert one or more rows into > these local shadow families for the purpose of secondary indexing. The > structure of these secondary families could use row keys as the indexed > values, qualifiers for specific store files and the value of each being a > list of originating keys (using read-append or > HBASE-5993<https://issues.apache.org/jira/browse/HBASE-5993>). > By leveraging the existing family infrastructure, we get things like > optional in-memory indexes and basic scanners for free and don't have to > swallow a big chunk of external indexing code. > > The simplest approach for integration of these for queries would be > internally be a ScannerBasedFilter (a filter that is based on a scanner) > and a GroupingScanner (a Scanner that does intersection and/or union of > scanners for multi criteria queries). Implementation of these scanners > could happen at one of two levels: > > - StoreScanner level: A more efficient approach using the store file > qualifier approach above (this allows easier maintenance of index > deletions) > - RegionScanner level: A simpler implementation with less violation of > existing encapsulation. We'd store row keys in qualifiers instead of > values to ensure ordering that works iteratively with RegionScanner. > The > weaknesses of this approach are less efficient scanning and figuring out > how to manage primary value deletes. > > In general, the best way to deal with deletes is probably to age them out
-
Re: HBase Developer's Pow-wow.Andrew Purtell 2012-09-10, 17:58
Hi Jacques,

> Does family level indexing make sense or is the real need for qualifier
> level indexing?

The use cases considered, at least over here at TM, all come down to range scanning over values (e.g. WHERE INTEGER($value) < 50). So we need a mapping such that a scan over the index returns either lists of pointers to row:family:qualifier, or the value itself embedded in the index, following the natural order of values in the primary table as given by a comparator. And a number of projections like this. A set of default comparators for interpreting values as integers, longs, floating point, and complex JSON or AVRO records, would be useful.

> What are ideas for a client interface and how transparent is index usage?
> (E.g. if you set a filter on a qualifier... )

It would be nice if the existing client API can handle it somehow. Get, Put, Increment, Scan, all of these API objects can transmit arbitrary attributes from the client to the server. It would be low friction for a user to modify their use of these existing API objects, rather than using a completely different interface like coprocessor Endpoint invocations. (Or, at least a client library should hide that, in that case.)

> What were the challenges and issues with the proof of concept TrendMicro
> approach that ultimately made it untenable? (was an eventually consistent
> approach)

This was simply a prototype implementation quality issue, nothing wrong about an eventually consistent approach per se.

> Is it important to colocate/duplicate indexed values and/or additional
> portions of data in secondary indices to minimize disk seeks (almost making
> HBase optionally more columnar in nature)?

I do think we want to offer the Megastore-like option for storing value data into indexes, and also not. Then we can manage this tradeoff of minimizing seeks and round trips versus increased storage utilization on a per-index basis according to the needs of the use case.

Best regards,

- Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
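
The range-scan requirement in the message above (index entries that follow the natural order of the values) depends on how a value is turned into index key bytes, because HBase orders row keys as unsigned bytes, lexicographically. A minimal sketch for 32-bit integers follows, assuming the index key is just the encoded value; `OrderedIntKeySketch` and its methods are hypothetical names, not part of HBase. The trick is to write the integer big-endian with the sign bit flipped, so negative values sort before positive ones.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical sketch of an order-preserving int encoding for index keys. */
class OrderedIntKeySketch {

    /** Big-endian bytes with the sign bit flipped: byte order == numeric order. */
    static byte[] encode(int value) {
        return ByteBuffer.allocate(4).putInt(value ^ Integer.MIN_VALUE).array();
    }

    /** Inverse of encode. */
    static int decode(byte[] key) {
        return ByteBuffer.wrap(key).getInt() ^ Integer.MIN_VALUE;
    }

    /** Unsigned lexicographic comparison, the ordering assumed for row keys. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        int[] values = {50, -7, 0, 49, Integer.MIN_VALUE, Integer.MAX_VALUE};
        byte[][] keys = new byte[values.length][];
        for (int i = 0; i < values.length; i++) {
            keys[i] = encode(values[i]);
        }
        // Sort by raw bytes only; the decoded values come out in ascending numeric order.
        Arrays.sort(keys, OrderedIntKeySketch::compareBytes);
        for (byte[] key : keys) {
            System.out.println(decode(key));
        }
    }
}
```

With keys encoded this way, a predicate such as `value < 50` becomes a scan over the index from the smallest key up to (but excluding) `encode(50)`. Other value types (longs, floats, JSON or Avro records) would each need their own encoding or comparator, which is the point of the default comparator set suggested above.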
-
Re: HBase Developer's Pow-wow.Andrew Purtell 2012-09-10, 18:09
On Mon, Sep 10, 2012 at 12:03 AM, Jacques <[EMAIL PROTECTED]> wrote:
> - How important is indexing column qualifiers themselves (similar to
> Cassandra where people frequently utilize column qualifiers as "values"
> with no actual values stored)?

It would be good to have a secondary indexing option that can build an index from some transform of family+qualifier.

> - In general it seems like there is tension between the main low level
> approaches of (1) leverage as much HBase infrastructure as possible (e.g.
> secondary tables) and (2) leverage an efficient indexing library e.g.
> Lucene.

Regarding option #2, Jason Rutherglen's experiences may be of interest: https://issues.apache.org/jira/browse/HBASE-3529 . The new Codec and CodecProvider classes of Lucene 4 could conceivably support storage of postings in HBase proper now (http://wiki.apache.org/lucene-java/FlexibleIndexing) so HDFS hacks for bringing indexes local for mmapping may not be necessary, though this is a huge hand-wave.

The remainder of your mail is focused on option #1, I have no comment to add there, lots of food for thought.

> *Approach Thoughts* > Trying to leverage HBase as much as possible is hard if we want to utilize > the approach above and have consistent indexing. However, I think we can > do it if we add support for what I will call a "local shadow family". > These are additional, internally managed families for a table. However, > they have the special characteristic that they belong to the region despite > their primary keys being outside the range of the region's. Otherwise they > look like a typical family. On splits, they are regenerated (somehow). If > we take advantage of Lars' > HBASE-5229<https://issues.apache.org/jira/browse/HBASE-5229>, > we then have the opportunity to consistently insert one or more rows into > these local shadow families for the purpose of secondary indexing.
The > structure of these secondary families could use row keys as the indexed > values, qualifiers for specific store files and the value of each being a > list of originating keys (using read-append or > HBASE-5993<https://issues.apache.org/jira/browse/HBASE-5993>). > By leveraging the existing family infrastructure, we get things like > optional in-memory indexes and basic scanners for free and don't have to > swallow a big chunk of external indexing code. > > The simplest approach for integration of these for queries would be > internally be a ScannerBasedFilter (a filter that is based on a scanner) > and a GroupingScanner (a Scanner that does intersection and/or union of > scanners for multi criteria queries). Implementation of these scanners > could happen at one of two levels: > > - StoreScanner level: A more efficient approach using the store file > qualifier approach above (this allows easier maintenance of index > deletions) > - RegionScanner level: A simpler implementation with less violation of > existing encapsulation. We'd store row keys in qualifiers instead of > values to ensure ordering that works iteratively with RegionScanner. The > weaknesses of this approach are less efficient scanning and figuring out > how to manage primary value deletes. > > In general, the best way to deal with deletes is probably to age them out > per storefile and just filter "near misses" as a secondary filter that > works with ScannerBasedFilter. The client side would be TBD but would > probably offer some kind of criteria filters that on server side had all > the lower level ramifications. > > *Future Optimizations* > In a perfect world, we'd actually use StoreFile block start locations as > the index pointer values in the secondary families. This would make things > much more compact and efficient. Especially if we used a smarter block > codec that took advantage of this nature. 
However, this requires quite a > bit more work since we'd need to actually use the primary keys in the > secondary memstore and then "patch" the values to block locations as we Best regards, - Andy Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
-
Re: HBase Developer's Pow-wow.Matt Corgan 2012-09-10, 19:13
Can indexing be boiled down to these questions to start?

1) Per-region or Per-table
2) Sync or Async
3) Client-managed or Server-managed
4) Schema or Schema-less

Definitions:

1)
- Per-region: the index entries are stored on the same machine as the primary rows
- Per-table: each index is stored in a separate table, requiring cross-server consistency

2)
- Sync: the client blocks until all index entries exist
- Async: the client returns when the primary row has been inserted, but indexes are guaranteed to be created eventually

3)
- Client-managed: client pushes index entries directly to regions, possibly utilizing some server-side locks or id generators
- Server-managed: client pushes index entries to the same server as the primary row, letting the server push the index entries on to the destination regions

4)
- Schema: (complex to even define) client and/or server have info about column names, value formats, etc. (Taking this route opens a world of follow-on questions)
- Schema-less: client provides the index entries which are rows with opaque row/family/qualifier/timestamp like in normal hbase

Personal opinions:

All of my use-cases would require Per-table indexes. Per-region is easier to keep consistent at write-time, but it seems useless to me for the large tables that hbase is designed for (because you have to hit every region for each read).

I think Synchronous writes is important for high-consistency (OLTP style) use cases while Async is important for high-throughput (OLAP style). I'd say sync is a more desirable feature because it's easier to roll your own async. I would love to see the difference reduced to a per-index-entry flag on the Put object.

Client-managed vs Server-managed isn't tremendously important. Client-managed seems admirable for the sync case, but server-managed is better for async. Therefore, probably better to keep the api simple and do server-managed for both cases with a flag for sync/async.

The notion of adding a schema to hbase for secondary indexing scares me a little. Many of us already have ORM-type layers above hbase that do all sorts of custom serializations. It would be more flexible to let the client generate arbitrary index entries and ship them to the server inside the Put object.

Anyway - my abbreviated 2 cents on a big topic.

Matt

On Mon, Sep 10, 2012 at 11:09 AM, Andrew Purtell <[EMAIL PROTECTED]> wrote:

> On Mon, Sep 10, 2012 at 12:03 AM, Jacques <[EMAIL PROTECTED]> wrote:
> > - How important is indexing column qualifiers themselves (similar to
> > Cassandra where people frequently utilize column qualifiers as "values"
> > with no actual values stored)?
>
> It would be good to have a secondary indexing option that can build an
> index from some transform of family+qualifier.
>
> > - In general it seems like there is tension between the main low level
> > approaches of (1) leverage as much HBase infrastructure as possible (e.g.
> > secondary tables) and (2) leverage an efficient indexing library e.g.
> > Lucene.
>
> Regarding option #2, Jason Rutherglen's experiences may be of
> interest: https://issues.apache.org/jira/browse/HBASE-3529 . The new
> Codec and CodecProvider classes of Lucene 4 could conceivably support
> storage of postings in HBase proper now
> (http://wiki.apache.org/lucene-java/FlexibleIndexing) so HDFS hacks
> for bringing indexes local for mmapping may not be necessary, though
> this is a huge hand-wave.
>
> The remainder of your mail is focused on option #1, I have no comment
> to add there, lots of food for thought.
>
> > *Approach Thoughts*
> > Trying to leverage HBase as much as possible is hard if we want to utilize
> > the approach above and have consistent indexing. However, I think we can
> > do it if we add support for what I will call a "local shadow family".
> > These are additional, internally managed families for a table. However,
> > they have the special characteristic that they belong to the region
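
Matt's suggestion above (schema-less, client-generated index entries shipped with the write, with sync versus async reduced to a per-index-entry flag) could look roughly like the following. This is only an illustrative sketch in plain Java with in-memory maps standing in for tables; `IndexedPutSketch`, `IndexEntry`, `put`, and `drainPending` are hypothetical names and nothing here reflects an actual HBase API.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical sketch of server-managed index writes with a per-entry sync flag. */
class IndexedPutSketch {

    /** One opaque, client-generated index entry. */
    static final class IndexEntry {
        final String indexRow;  // index row key chosen by the client
        final boolean sync;     // true: must exist before the put returns

        IndexEntry(String indexRow, boolean sync) {
            this.indexRow = indexRow;
            this.sync = sync;
        }
    }

    final SortedMap<String, String> primary = new TreeMap<>();  // stand-in primary table
    final SortedMap<String, String> index = new TreeMap<>();    // stand-in index table
    final Deque<String[]> pending = new ArrayDeque<>();         // deferred async entries

    /** Writes the primary row, applies sync entries now, and queues async ones. */
    void put(String row, String value, List<IndexEntry> entries) {
        primary.put(row, value);
        for (IndexEntry e : entries) {
            if (e.sync) {
                index.put(e.indexRow, row);
            } else {
                pending.add(new String[] {e.indexRow, row});
            }
        }
    }

    /** Stand-in for the background work that makes async entries visible eventually. */
    void drainPending() {
        String[] p;
        while ((p = pending.poll()) != null) {
            index.put(p[0], p[1]);
        }
    }

    public static void main(String[] args) {
        IndexedPutSketch server = new IndexedPutSketch();
        server.put("user42", "some serialized record", Arrays.asList(
                new IndexEntry("email:a@example.com", true),
                new IndexEntry("city:SF", false)));
        System.out.println(server.index.keySet());  // only the sync entry so far
        server.drainPending();
        System.out.println(server.index.keySet());  // both entries
    }
}
```

The sketch ignores the hard part named in the definitions above: with per-table indexes the index row usually lives on another server, so the sync path needs cross-server consistency and the async path needs a durable queue rather than an in-memory one.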
-
Re: HBase Developer's Pow-wow.Jacques 2012-09-10, 20:45
See below

On Mon, Sep 10, 2012 at 10:51 AM, Ted Yu <[EMAIL PROTECTED]> wrote:

> Jacques:
> Thanks for your sharing.
>
> bq. row-level sharding as opposed to term
>
> Please elaborate on the above a little more: what is term sharding ?
>

If an index is basically a value (or term) pointing back to a row, there are two main ways that you can slice up the data to scale it. Let's say you have ten nodes and you want to index a column that stores values between 1 and 100. This column's values are likely distributed throughout all the regions. The two options would look like:

Option 1 (term sharding): All pointers for a given value live on a single node/region. E.g. Node A holds 1-10, B: 11-20, C: 21-30, etc. (A variation of this is hashing the values to avoid distribution problems.) The strength of this approach is that if you know you only want values 1-5, you don't have to have all the nodes evaluate their index. The downsides are: you have to have some kind of cross node/region data approach and consistency is hard. You also have problems as your data scales: on a massive scale, an index can take a while to iterate through, and once it gets large you'll bottleneck this problem to a single machine.

Option 2 (row-sharding): Each node/region holds all pointers for all the rows that are on that node. In this case, you have to consult all the nodes before you get all the values. More complicated at query time but limitless scale and simpler consistency problems.

>
> bq. for what I will call a "local shadow family"
>
> I like this idea. User may request more than one index. Currently HBase is
> not so good at serving high number of families. So we may need to watch
> out.
>

Yeah. A simple approach could utilize two families, one in-memory and one not. No reason a family can't hold multiple indexes. Just need to get a little more tricky about how we use things like qualifiers. Also makes index dropping more convoluted.

> bq. GroupingScanner (a Scanner that does intersection and/or union of
> scanners for multi criteria queries)
>
> Do you think the following enhancement is related to your proposal above ?
> HBASE-5416 Improve performance of scans with some kind of filters
>

On first glance, I don't think this is really related. A grouping scanner would be used to take the secondary index scanners and merge them into a single filter scanner to then be used when the primary scan is done.

>
> bq. and then "patch" the values to block locations as we
> flushed the primary family that we were indexing (ugh).
>
> Yeah. We also need to consider the effect of compaction.
>

Yeah... painful...

>
> bq. my intuition is it would be fine as long as we didn't fix HBASE-3149
>
> I was actually expecting someone to pick up the work of HBASE-3149 :-)
>

:P
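
The trade-off between the two sharding options described in this message can be made concrete with the same numbers (ten nodes, indexed values 1 to 100). The sketch below is a toy model only: `IndexShardingSketch` and its methods are hypothetical names, and the row layout (row ids 0 to 999 split into ten contiguous ranges) is an assumption made for illustration.

```java
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical toy model comparing term sharding with row sharding. */
class IndexShardingSketch {
    static final int NODES = 10;

    /** Term sharding: the node is chosen by indexed value (1-10 on node 0, 11-20 on node 1, ...). */
    static int termShard(int value) {
        return (value - 1) / 10;
    }

    /** Row sharding: the index entry lives with its primary row (assumed 100 row ids per node). */
    static int rowShard(int rowId) {
        return rowId / 100;
    }

    /** Nodes consulted for a value-range query under term sharding. */
    static SortedSet<Integer> nodesForTermQuery(int lo, int hi) {
        SortedSet<Integer> nodes = new TreeSet<>();
        for (int v = lo; v <= hi; v++) {
            nodes.add(termShard(v));
        }
        return nodes;
    }

    /** Under row sharding any node may hold matching rows, so every node is consulted. */
    static SortedSet<Integer> nodesForRowQuery() {
        SortedSet<Integer> nodes = new TreeSet<>();
        for (int n = 0; n < NODES; n++) {
            nodes.add(n);
        }
        return nodes;
    }

    /** Nodes touched by one write: the primary row's node plus the index entry's node. */
    static SortedSet<Integer> nodesForWrite(int rowId, int value, boolean termSharded) {
        SortedSet<Integer> nodes = new TreeSet<>();
        nodes.add(rowShard(rowId));
        nodes.add(termSharded ? termShard(value) : rowShard(rowId));
        return nodes;
    }

    public static void main(String[] args) {
        System.out.println("term query 1..5 -> nodes " + nodesForTermQuery(1, 5));
        System.out.println("row query 1..5  -> nodes " + nodesForRowQuery());
        System.out.println("term write      -> nodes " + nodesForWrite(950, 3, true));
        System.out.println("row write       -> nodes " + nodesForWrite(950, 3, false));
    }
}
```

In this model a query for values 1 to 5 touches one node under term sharding and all ten under row sharding, while a write of row 950 with value 3 touches two nodes under term sharding and one under row sharding, which is where the cross-node consistency concern comes from.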
-
Re: HBase Developer's Pow-wow.Jacques 2012-09-10, 20:50
>
> The use cases considered, at least over here at TM, all come down to
> range scanning over values (e.g. WHERE INTEGER($value) < 50). So we
> need a mapping such that a scan over the index returns either lists of
> pointers to row:family:qualifier, or the value itself embedded in the
> index, following the natural order of values in the primary table as
> given by a comparator. And a number of projections like this.

I was thinking that exact criteria queries were higher priority than range queries. Interesting that you have a lot of needs for range queries. Performant range queries definitely make it more likely that we would store values next to the index, and in general they call for a more compact storage format than is easily achievable with the shadow family idea.

> A set of
> default comparators for interpreting values as integers, longs,
> floating point, and complex JSON or AVRO records, would be useful.
>

Agreed. Once a framework is in place, I see these being fairly straightforward.
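One way to picture the "comparator for integers" point is an order-preserving key encoding: if index keys are compared as raw bytes, signed integers need to be encoded so that byte order matches numeric order. The sketch below is an assumption-laden illustration in plain Python, not an HBase API; `encode_long`, `build_index`, and `scan_less_than` are hypothetical names, and the sorted list stands in for a sorted index.

```python
import struct
from bisect import bisect_left


def encode_long(value: int) -> bytes:
    """Encode a signed 64-bit integer so that unsigned byte order matches numeric order."""
    # Adding 2^63 maps [-2^63, 2^63) onto [0, 2^64) monotonically
    # (equivalent to flipping the sign bit of the two's-complement form).
    return struct.pack(">Q", (value + (1 << 63)) & 0xFFFFFFFFFFFFFFFF)


def build_index(rows):
    """Index entry = 8-byte encoded value followed by the primary row key."""
    return sorted(encode_long(value) + key.encode() for key, value in rows)


def scan_less_than(index, bound):
    """Range scan for value < bound: read from the start up to the first entry >= bound."""
    stop = bisect_left(index, encode_long(bound))
    return [entry[8:].decode() for entry in index[:stop]]


# Hypothetical primary rows: (row key, integer value).
rows = [("r1", 75), ("r2", -3), ("r3", 49), ("r4", 50), ("r5", 7)]
index = build_index(rows)
matches = scan_less_than(index, 50)
print(matches)
```

The scan returns row keys in value order, which is the "natural order of values" property asked for above. Floating point or JSON/Avro values would need their own encodings or comparators; those are not sketched here.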
-
Re: HBase Developer's Pow-wow.lars hofhansl 2012-09-10, 22:46
I'm back from the woods (and yes, I'm already reading the dev list, sigh) :)
I'll be back at work tomorrow, but I might have to tie some other knots first. Let's see.

I'd also be interested to join the talk about 2ndary indexing. In addition I can talk a bit about
- the profiling I did, and maybe mention some (just 1 or 2 really) gotchas to avoid in the future
- the additions I made to the coprocessor framework
- thoughts about backups (?)
- using iterator trees instead of scanners (although the relational DB world apparently has become a bit skeptical) (?)

Let me know. I won't have time to prepare much for this, though. So it would be an ad hoc discussion, maybe with some white boarding.

-- Lars

----- Original Message -----
From: Jesse Yates <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc:
Sent: Sunday, September 9, 2012 3:11 PM
Subject: Re: HBase Developer's Pow-wow.

>
> We are missing fellas to lead a chat on process change ideas (How to
> have it so Jenkins is more blue than red; How do we enforce more rigor
> around what gets committed, etc.). Anyone want to volunteer? I'd
> volunteer LarsH since he was last to float these eternally recurring
> notions but I believe he will be up on Half Dome looking down on us
> when the meeting goes off. Anyone else want to lead the discussion
> (Jon? Andrew?)?
>

I thought Lars would be back by the meetup, but let's get a second talker on it too :)

Anyone want to lead a discussion on what's next? Post 0.96?

>
> Anything else that folks want to talk about?
>

I think we talked about wanting to do secondary indexing as well, at least what that means for HBase (and maybe some of the _how_ it would work too).

-Jesse
-------------------
Jesse Yates
@jesse_yates
jyates.github.com
-
Re: HBase Developer's Pow-wow.Jacques 2012-09-10, 23:40
>
> All of my use-cases would require Per-table indexes. Per-region is easier
> to keep consistent at write-time, but it seems useless to me for the large
> tables that hbase is designed for (because you have to hit every region for
> each read).
>

Can you expound on use cases? The pros and cons are heavily dependent on the sparseness of the indexed values and the particular use case. If we're talking about a gender column on a user profile table, you really want that to be spread out among all regions. If we're talking about an email address... not so much.
-
Re: HBase Developer's Pow-wow.Devaraj Das 2012-09-11, 00:21
Guys, if you want to join the Pow-Wow over phone, here are the details:
Phone: 1 (605) 475-6700 Access code: 232-8385 See you all at Hortonworks. On Wed, Aug 29, 2012 at 9:21 PM, Ramkrishna.S.Vasudevan <[EMAIL PROTECTED]> wrote: > > I would be interested on this may be for the Secondary index related > discussion. > > Can I attend it over phone? Or someother way ? > > Regards > Ram > > > -----Original Message----- > > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of > > Stack > > Sent: Thursday, August 30, 2012 2:03 AM > > To: [EMAIL PROTECTED] > > Subject: Re: HBase Developer's Pow-wow. > > > > On Wed, Aug 29, 2012 at 1:06 PM, Devaraj Das <[EMAIL PROTECTED]> > > wrote: > > > We could look at hosting here at Hortonworks, Sunnyvale. Thoughts? > > > > > > On Aug 29, 2012, at 11:31 AM, Jonathan Hsieh <[EMAIL PROTECTED]> > > wrote: > > > > > >> There are a couple discussions brewing and major changes being it > > would be > > >> good to have a face-to-face pow-wow to demo, to discuss designs, and > > to > > >> talk about project goals and policies. This would be mostly focused > > on > > >> project internals and maybe last half a day. > > >> > > >> Here are some suggestions for agenda items. > > >> > > >> * Jimmy on Major Assignment Manager refactor > > >> * Enis on integration testing infrastructure > > >> * Process change ideas: > > >> - Revisit check-in policies for trunk and sustaining branches. > > >> - Strategies for keeping Jenkins blue? Have a flaky test list that > > >> avoids running flaky tests? File a JIRA automatically on failure? > > >> - Holistic reviews of some subsystem's code (there are some > > convoluted > > >> evolved portions of code that could use some intelligent redesign) > > >> * Major features like Secondary Indexes - core, coproc, or external? > > >> > > >> The last few meetups and hackathons like this were at > > >> Salesforce, eBay, Stumble and Cloudera. I can look into having > > Cloudera to > > >> host either in its new SF "penthouse" office or in its PA office. > > >> > > > > Lets do it. 
I meant to say that Deveraj suggested a while back that > > HW could host next one. Seems fine by me (as long as the beer is as > > good as it was at SF Deveraj!). > > > > Weeks of 10th so Andrew is included sounds good to me too. > > St.Ack >
-
Re: HBase Developer's Pow-wow.Matt Corgan 2012-09-11, 01:20
One sparse use case for us is rate limit detection. We store user events in an Event table whose primary key is a unique timestamp (sharded to avoid hotspotting) and which has eventType and ipAddress columns. We manually keep a separate table (the index, also sharded) called EventByDateIpType with row format [year/month/date/ipAddress/eventType/eventId]. Background jobs are constantly scanning the index to count combinations of ipAddress+eventType to hunt down the people that are doing things like adding spam to the site. Then we might dig up all the events for a suspect ipAddress, where the absolute busiest ipAddress might account for .1% of the events in a day, so pretty sparse. A per-table index is a must-have here.

For this same Event table, there are also dense indexes like EventByDateType whose row key is [year/month/date/eventType/eventId]. There are only about 200 eventTypes. If we have 1 million of a certain eventType on a given day where we need to access the primary rows, we do a scan on the EventByDateType index table and pull the rows out of the Event table in batches. One nice aspect of this is that we are getting the rows in globally sorted order. Either per-table or per-region indexes would work here, but i guess i'm failing to see the read-time benefit of the per-region index.

Seems like there are 3 categories of sparseness:
1) sparse indexes (like ipAddress) where a per-table approach is more efficient for reads
2) dense indexes (like eventType) where there are likely values of every index key on each region
3) very dense indexes (like male/female) where you should just be doing a table scan anyway

Jacques, you say "If we're talking about a gender column on a user profile table, you really want that to be spread out among all regions". Can you expand on that more? I guess i don't understand your read pattern. If you have 5 million users of each gender, you are probably not doing a single select of all males. You will probably have to iterate through them in small batches. Why is the per-region approach more beneficial than the per-table? Is it because it's easier to plug into hbase's existing per-region MapReduce splitter? If so, could you just as easily feed the separate per-table index into MapReduce?

Thanks for starting the important discussion.

On Mon, Sep 10, 2012 at 4:40 PM, Jacques <[EMAIL PROTECTED]> wrote:

> >
> > All of my use-cases would require Per-table indexes. Per-region is
> easier
> > to keep consistent at write-time, but is seems useless to me for the
> large
> > tables that hbase is designed for (because you have to hit every region
> for
> > each read).
> >
>
> Can you expound on use cases? The pros and cons are heavily dependent on
> the sparseness of the indexed values and the particular use case. If we're
> talking about a gender column on a user profile table, you really want that
> to be spread out among all regions. If we're talking about an email
> address... not so much.
>
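The manually maintained index table described in this message can be sketched in a few lines. This is a toy, in-memory stand-in rather than HBase client code: the `events` dictionary plays the Event table, a sorted list plays the EventByDateIpType index, and `index_key`, `scan_prefix`, and `fetch_in_batches` are hypothetical helper names. Sharding of the keys is left out for brevity.

```python
from bisect import bisect_left
from itertools import islice

# Stand-in for the primary Event table: eventId -> columns (made-up sample data).
events = {
    "e1": {"date": "2012/09/10", "ip": "10.0.0.1", "type": "post"},
    "e2": {"date": "2012/09/10", "ip": "10.0.0.1", "type": "post"},
    "e3": {"date": "2012/09/10", "ip": "10.0.0.2", "type": "login"},
    "e4": {"date": "2012/09/10", "ip": "10.0.0.1", "type": "login"},
    "e5": {"date": "2012/09/11", "ip": "10.0.0.1", "type": "post"},
}


def index_key(event_id, event):
    """Row format [year/month/date/ipAddress/eventType/eventId]."""
    return "/".join([event["date"], event["ip"], event["type"], event_id])


# Stand-in for the index table: keys kept in sorted order, as a sorted store would.
index = sorted(index_key(eid, e) for eid, e in events.items())


def scan_prefix(sorted_keys, prefix):
    """Seek to the first key >= prefix, then read until the prefix no longer matches."""
    start = bisect_left(sorted_keys, prefix)
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        yield key


def fetch_in_batches(index_keys, batch_size):
    """Pull primary rows in batches; the dict lookups stand in for a multi-get."""
    it = iter(index_keys)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        ids = [key.rsplit("/", 1)[1] for key in batch]
        yield [(eid, events[eid]) for eid in ids]


# All events for one suspect ipAddress on one day, fetched two at a time.
suspect = list(scan_prefix(index, "2012/09/10/10.0.0.1/"))
batches = list(fetch_in_batches(suspect, batch_size=2))
print([[eid for eid, _ in batch] for batch in batches])
```

The trailing "/" in the scan prefix matters: without it, a prefix of "10.0.0.1" would also match "10.0.0.10". The index scan is one contiguous read, and the primary rows then come back in the index's sort order, which is the "globally sorted" property mentioned above.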
-
Re: HBase Developer's Pow-wow.Jacques 2012-09-11, 04:04
On Mon, Sep 10, 2012 at 6:20 PM, Matt Corgan <[EMAIL PROTECTED]> wrote:
> ... snipping lots of helpful use cases...

It seems like portions of what you discussed would probably be nominally impacted by indexes while others would be very impacted. Also seems like compound-qualifier indexing would potentially be of interest to you... (although I'm not sure how much it would buy you). Are you going to be at the powwow tomorrow?

> Seems like there are 3 categories of sparseness:
> 1) sparse indexes (like ipAddress) where a per-table approach is more
> efficient for reads
> 2) dense indexes (like eventType) where there are likely values of every
> index key on each region
> 3) very dense indexes (like male/female) where you should just be doing a
> table scan anyway
>

Yes. I probably shouldn't have used the male/female example since you're right that a table scan is probably the best option in that case. For category one, I was imagining a situation of more extreme sparseness, such as one target row in a large number of regions. This is the place where the all-region checking of the region-based approach is the most egregious. I'd probably put anything that was present in only a small percentage of regions in the second case. (I also wonder if, in the single row scenario, a judicious use of bloomfilters might provide satisfactory performance even if you do need to hit all regions-- one of the things we've used as a guiding principle for our search stuff is that if you're trying to hit realtime, you can actually eat the most latency on the smallest scan since you have so little data to move around...depends on allowable memory usage I suppose.)

> Why is the per-region
> approach more beneficial than the per-table? Is it because it's easier to
> plug into hbase's existing per-region MapReduce splitter?
>

Part of it has to do with a bunch of non-HBase work I've been doing over the past few years. That's why I really hope people share as many use cases as possible... so that the conclusions that come out of our work are representative of everyone's needs (as much as possible).

What makes me lean towards region-level for a lot of use cases are the following: (I hadn't even really thought about the existing MR splitter.)
- How to maintain consistency (maybe this is unimportant?)
- How to avoid network bottleneck as the cluster expands (in the case of a per-table approach, you're going to have to pass primary keys around constantly except in the case that the only value you want is the indexed value and you saved that entire value in the index table.)
- How to maximize scale. (In the per-table case, a particular set of indexed values will probably be colocated among a fraction of all nodes. Any kind of parallel/MR job will then be constrained by these nodes.)
- How to minimize long term storage cost of indexes. (If we have region-level relationships, we can get more tightly coupled over time and use more efficient compact approaches like the store file position approach I tossed out in one of my other emails.)

I spent some time in the Cassandra community doing a review of various indexing use cases. I should go take another look to see what they do and how it works for them...

>> Thanks for starting the important discussion.

Lots to talk about. Lots to potentially do. It will be interesting to see who has time to put against this as that will probably substantially constrain all of our great ideas :)

Jacques
-
Re: HBase Developer's Pow-wow.Andrew Purtell 2012-09-11, 04:22
Regarding this:
On Mon, Sep 10, 2012 at 12:13 PM, Matt Corgan <[EMAIL PROTECTED]> wrote:

> 1) Per-region or Per-table
[...]
> 1)
> - Per-region: the index entries are stored on the same machine as the
> primary rows
> - Per-table: each index is stored in a separate table, requiring
> cross-server consistency

LarsH and I were discussing this a bit. This doesn't have to be a choice. It could be possible to have both: a separate table for index storage, and colocation of the index table regions and primary table regions on the same regionserver, so cross-region consistency issues can be dealt with through low latency in-memory channels. (With fallback to a cross-server consistency mechanism when placement can't be ideal because the cluster is out of steady state due to failure/churn.) The master might assign primary and index regions out together as a group.

Best regards,

- Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
-
RE: HBase Developer's Pow-wow.Ramkrishna.S.Vasudevan 2012-09-11, 04:47
Hi
Yes: a separate index table alongside the main table, with the master ensuring that the regions of both tables are collocated during assignment.

The regions in the index table can be the same as those of the main table, in the sense that both should have the same start and end keys.

Different indices can be grouped within these regions.

In the case of sparse data, creating the index is definitely going to be beneficial. In the case of dense data, the indices may be an overhead in some cases.

In one of the wiki pages of Cassandra I also read that they suggest having at least one EQUALS condition in a query that tries to use indices. This helps confine the results to a specific set, over which the range queries can then be applied.

So maybe as a first step we can see what gain we get with an EQUALS condition, but either way the framework can be generic enough to handle both range queries and EQUALS-condition queries.

After the meetup is over, I can go through the discussion topics and provide our experiences as well.

Regards
Ram

> -----Original Message-----
> From: Andrew Purtell [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, September 11, 2012 9:52 AM
> To: [EMAIL PROTECTED]
> Subject: Re: HBase Developer's Pow-wow.
>
> Regarding this:
>
> On Mon, Sep 10, 2012 at 12:13 PM, Matt Corgan <[EMAIL PROTECTED]>
> wrote:
> > 1) Per-region or Per-table
> [...]
> > 1)
> > - Per-region: the index entries are stored on the same machine as the
> > primary rows
> > - Per-table: each index is stored in a separate table, requiring
> > cross-server consistency
>
> LarsH and I were discussing this a bit. This doesn't have to be a
> choice, it could be possible to have both, a separate table for index
> storage, and colocation of the index table regions and primary table
> regions on the same regionserver so cross-region consistency issues
> can be dealt with through low latency in-memory channels. (With
> fallback to cross-server consistency mechanism when placement can't be
> ideal when the cluster is out of steady state due to failure/churn.)
> The master might assign primary and index regions out together as a
> group.
>
> Best regards,
>
> - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet
> Hein (via Tom White)
-
Re: HBase Developer's Pow-wow.Matt Corgan 2012-09-11, 05:59
Jacques - i'll be there tomorrow. Look forward to talking. Some comments before then:

> - How to maintain consistency (maybe this is unimportant?)

Not unimportant at all. In fact, I picture the whole secondary index conversation as a lower level goal of supporting consistent cross-region updates. I'm hesitant on some of the region co-location ideas because they look like optimizations on the end goal of consistency across servers. All of the optimizations are nice, but the real meat of the problem is how to bake cross-region consistency in at the ground level as opposed to patching it on as the failure case where an index region gets separated from its parent. It would be better to get the cross-server stuff working first, and then optimize the same-server scenario. That is, like you say, if anybody has time =)

> - How to avoid network bottleneck as the cluster expands (in the case of
> a per-table approach, you're going to have pass primary keys
> around constantly except in the case that the only value you want is the
> indexed value and you saved that entire value in the index table.)

In my use cases, i typically scan batches of ~1000 index entries from the index table (~1 RPC / ~1 data block), and then i issue a multiGet to fetch the primary rows. Because the index is sorted by the primary rows, they all go to the first region in the table which again equates to ~1 RPC. So maybe it's 2 RPC's instead of 1 which doesn't seem too bad.

> - How to maximize scale. (In the per table case, a particular set of indexed
> values will probably be colocated among a fraction of all nodes.

Writes will definitely be slightly faster in the per-region case, but at the huge expense of reads having to go to multiple servers. In terms of number of regions (R), the additional write expense is O(1) whereas the read expense is on average O(R/2). If you have 100 regions of users and want to look up a userId by email, you have to jump through 50 regions on average to find the user.

> I spent some time in the Cassandra community doing a review of various indexing
> use cases. I should go take another look to see what they do and how it
> works for them...

HBase has a lot of similarities to Cassandra but i would say it is a different beast when it comes to indexing. The biggest difference (even bigger than the tunable consistency) is the fact that hbase stores all rows in a sorted order, and that tables are automatically split into regions and evenly distributed. Cassandra is not designed to host unpredictably growing sorted tables (like secondary index tables tend to be), so it makes some concessions in index design. Instead of storing each index entry as a separate row in a rapidly growing table, which hbase deals with nicely because it can split/balance the index table, cassandra stores all of the index entries for an index value as columns (qualifiers) in the same row. For low cardinality indexes this can create several huge rows which become hotspots. Said differently, cassandra is forced to create indexes using wide tables, where hbase has the luxury of using tall tables. My cassandra knowledge is dated, so please correct me if that's wrong.

On Mon, Sep 10, 2012 at 9:47 PM, Ramkrishna.S.Vasudevan <[EMAIL PROTECTED]> wrote:

> Hi
>
> Yes, a separate index table along with the main table and the master should
> ensure that the regions of both tables are collocated during assignments.
>
> The regions in index table can be same as that of the main table in the
> sense that both should have the same start and endkeys.
>
> Different indices can be grouped within these regions.
>
> In case of sparse data definitely the index creation is going to be a
> beneficial one.
> In case of dense data may be the indices may be an overhead in some cases.
>
> In one of the wiki pages of Cassandra I also read that they suggest to have
> at least one EQUALS condition in the query that tries to use indices.
This > will help in confining the results to a specific set and over which the |