Simple statistics per region
lars hofhansl 2013-02-23, 06:40

This topic comes up now and then (see the recent discussion about translating multi Gets into Scan+Filter).

It's not that hard to keep statistics as part of compactions. I envision two knobs:
1. Max number of distinct values to track directly. If a column has fewer than this number of values, keep track of their occurrences explicitly.
2. Number of (equal-width) histogram partitions to maintain.

Statistics would be kept per store (i.e. per region per column family) and stored in an HBase table (one row per store). Initially we could just support major compactions that atomically insert a new version of the statistics for the store.

A simple implementation (not knowing ahead of time how many values it will see during the compaction) could start by keeping track of individual values for columns. If it gets past the max number of distinct values to track, it would switch to equal-width histograms (using the distinct values picked up so far to estimate an initial partition width). If the number of partitions gets larger than what was configured, it would increase the width and merge the previous counts into the new width (which means the new partition width must be a multiple of the previous size).

There's probably a lot of other fanciness that could be used here (I haven't spent a lot of time thinking about details).

Is this something that should be in core HBase, or should it rather be implemented as a coprocessor?

-- Lars
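The scheme described in this message can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not HBase code: the class name `ColumnStats` and its parameters `max_distinct` and `max_buckets` are hypothetical stand-ins for the two knobs, column values are assumed to be non-negative integers, and buckets are anchored at zero so that doubling the width merges neighbouring buckets cleanly.

```python
from collections import Counter


class ColumnStats:
    """Exact value counts first, then an equal-width histogram (hypothetical sketch)."""

    def __init__(self, max_distinct=4, max_buckets=4):
        self.max_distinct = max_distinct  # knob 1: distinct values tracked directly
        self.max_buckets = max_buckets    # knob 2: histogram partitions to maintain
        self.exact = Counter()            # value -> occurrences (exact mode)
        self.width = None                 # None while still in exact mode
        self.buckets = Counter()          # bucket index -> count (histogram mode)

    def add(self, value):
        if self.width is None:
            self.exact[value] += 1
            if len(self.exact) > self.max_distinct:
                self._switch_to_histogram()
        else:
            self._add_to_histogram(value, 1)

    def _switch_to_histogram(self):
        lo, hi = min(self.exact), max(self.exact)
        # Estimate an initial width from the values seen so far (ceiling division).
        self.width = max(1, -(-(hi - lo + 1) // self.max_buckets))
        seen, self.exact = self.exact, Counter()
        for value, count in seen.items():
            self._add_to_histogram(value, count)

    def _add_to_histogram(self, value, count):
        self.buckets[value // self.width] += count
        # Too many partitions: widen until we are back under the limit.
        while len(self.buckets) > self.max_buckets:
            self._double_width()

    def _double_width(self):
        # The new width is a multiple of the old one, so old buckets 2k and 2k+1
        # fold exactly into new bucket k and no count has to be split.
        self.width *= 2
        merged = Counter()
        for index, count in self.buckets.items():
            merged[index // 2] += count
        self.buckets = merged
```

With `max_distinct=3` and `max_buckets=4`, feeding the values 1, 2, 2, 3, 10, 100, 50 switches to histogram mode at the fourth distinct value (initial width 3) and ends with a width of 6 after one doubling. Doubling is just the simplest multiple to pick; any integer multiple of the previous width would satisfy the constraint in the message.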
-
Re: Simple statistics per region
Andrew Purtell 2013-02-23, 17:41

> Statistics would be kept per store (i.e. per region per column family)
> and stored into an HBase table (one row per store). Initially we could just
> support major compactions that atomically insert a new version of that
> statistics for the store.

Will we drop updates to the statistics table if regions of it are in transition? (I think that would be ok.) Should we have a lightweight RPC for server-to-server communication that does not block or retry?

The above two considerations would avoid a repeat of the region historian trouble... ancient history.

Can we expect, pretty quickly, a desire for more than just statistics on data contributed after major compactions? That would be fine for characterizing the data within, but doesn't provide any information about access patterns to the data, like I mentioned in the other email.

--
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
-
Re: Simple statistics per region
lars hofhansl 2013-02-23, 20:39

Equal-width histograms lend themselves relatively nicely to incremental updates, so we can extend that to in-place updates later.

As for the lightweight RPC, yeah, or we could just do a Put without retries. I think if we fail to update the statistics it should not be considered a failure. We could keep track of the last statistics update.

We might also want a facility to update the stats without any compaction (maybe an M/R job).

In addition to histograms, it might also be nice to keep track of the region min/max values for each column and for keys.

Maybe we also have to indicate (somehow) which columns we want to track in this way.

-- Lars
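The "Put without retries" idea amounts to a best-effort write: try once, treat a failure as a dropped update rather than an error, and remember when the last update succeeded. A minimal sketch follows, assuming a hypothetical single-attempt writer `write_fn(store_key, stats)`; it stands in for whatever client call would actually be used and is not an HBase API.

```python
import time


class StatsPublisher:
    """Best-effort statistics writer (illustrative; all names are hypothetical)."""

    def __init__(self, write_fn, clock=time.time):
        self.write_fn = write_fn   # single-attempt writer, assumed not to retry
        self.clock = clock         # injectable so the behaviour is easy to test
        self.last_success = None   # time of the last successful statistics update
        self.dropped = 0           # updates that were silently dropped

    def publish(self, store_key, stats):
        try:
            self.write_fn(store_key, stats)  # exactly one attempt
        except Exception:
            # A failed statistics update is not a failure of the compaction.
            self.dropped += 1
            return False
        self.last_success = self.clock()
        return True
```

A caller can compare `last_success` against the time of the last major compaction to decide whether the stored statistics are stale enough to warrant a separate refresh (for example the M/R job mentioned above).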
-
Re: Simple statistics per region
Andrew Purtell 2013-02-23, 17:18

I like this idea. For the kind of decisions I would like to make based on CF statistics (characterizing access patterns), histograms or approximations would be fine.

+1 on a core HBase facility. Otherwise, let's consider how statistics kept by such a CP would be shared with other CPs or clients.

The coprocessor framework is (deliberately) not a module system. There are no provisions for exporting functions from one coprocessor and importing them in another, or for managing the dependencies between them. If we were to do so, it really might be appropriate to rebase CPs on an OSGi runtime rather than incrementally reinvent that wheel and end up with something close anyway, but with more bugs. We have avoided anything like this so far, of course. We had a guy from Apache Felix come by and suggest that once. To see what's involved with that, go here: http://felix.apache.org/

--
Best regards,

   - Andy
-
Re: Simple statistics per region
Stack 2013-02-26, 22:08

On Fri, Feb 22, 2013 at 10:40 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> Statistics would be kept per store (i.e. per region per column family) and
> stored into an HBase table (one row per store). Initially we could just
> support major compactions that atomically insert a new version of that
> statistics for the store.

Sounds great.

In .META., add columns for each cf on each region row? Or another table?

What kind of stats would you keep? Would they be useful for operators? Or just for stuff like, say, Phoenix making decisions?

> Is this something that should be in core HBase or rather be implemented as
> coprocessor?

I think it could go in core if it generated pretty pictures.

St.Ack
-
Re: Simple statistics per region
Jesse Yates 2013-02-26, 22:31

TL;DR Making it part of the UI and ensuring that you don't load things the wrong way seem to be the only reasons for making this part of core - certainly not bad reasons. They are fairly easy to handle as a CP though, so maybe it's not necessary immediately.

I ended up writing a simple stats framework last week (ok, it's like 6 classes) that makes it easy to create your own stats for a table. It's all coprocessor based and, as Lars suggested, hooks into major compactions to let you build per-column-per-region stats, which it writes to a 'system' table named "_stats_".

With the framework you could easily write your own custom stats, from simple things like min/max keys to things like fixed-width or fixed-depth histograms, or even more complicated ones. There has been some internal discussion around how to make this available to the community (as part of Phoenix, core in HBase, an independent github project, ...?).

The biggest issue with having it all CP based is that you need to be really careful to ensure that it comes _after_ all the other compaction coprocessors. This way you know exactly what keys come out and have correct statistics (for that point in time). Not a huge issue - you just need to be careful. Baking the stats framework into HBase is really nice in that we can be sure we never mess this up.

Building it into the core of HBase isn't going to get us per-region statistics without a whole bunch of pain - compactions per store make this a pain to actualize; there isn't a real advantage here, as I'd like to keep it per CF, if only not to change all the things.

Further, this would be a great first use-case for real system tables. Mixing this data with .META. is going to be a bit of a mess, especially for doing clean scans, etc. to read the stats. Also, I'd be gravely concerned about mucking with such important state, especially if we make a 'statistic' a pluggable element (so people can easily add their own).

And sure, we could make it make pretty graphs on the UI, no harm in it and very little overhead :)

-------------------
Jesse Yates
@jesse_yates
jyates.github.com
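The ordering concern raised in this message can be shown with a toy pipeline. The sketch below is not the framework described here and uses no HBase API: `drop_expired` is a made-up stand-in for some other compaction coprocessor that removes cells, and `KeyRangeStat` is a hypothetical min/max-key statistic that passes cells through while observing them.

```python
def drop_expired(cells):
    """Stand-in for another compaction hook that removes cells."""
    return (c for c in cells if not c["expired"])


class KeyRangeStat:
    """Counts cells and tracks min/max keys while passing cells through."""

    def __init__(self):
        self.count = 0
        self.min_key = None
        self.max_key = None

    def observe(self, cells):
        for c in cells:
            key = c["key"]
            self.count += 1
            self.min_key = key if self.min_key is None else min(self.min_key, key)
            self.max_key = key if self.max_key is None else max(self.max_key, key)
            yield c


def run_compaction(cells, hooks):
    """Apply the hooks in order; each one wraps the output of the previous."""
    for hook in hooks:
        cells = hook(cells)
    return list(cells)


cells = [
    {"key": "a", "expired": False},
    {"key": "b", "expired": True},
    {"key": "c", "expired": False},
    {"key": "z", "expired": True},
]

last = KeyRangeStat()   # statistics hook placed after the filtering hook
run_compaction(cells, [drop_expired, last.observe])

first = KeyRangeStat()  # statistics hook placed before the filtering hook
run_compaction(cells, [first.observe, drop_expired])
```

Here `last` sees only the two cells that survive the compaction (keys "a" to "c"), while `first` counts all four and reports "z" as the max key even though that cell is never written out. That mismatch is the reason the statistics hook has to come last.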
-
Re: Simple statistics per region
Andrew Purtell 2013-02-26, 23:27

If this is going to be a CP, then other CPs need an easy way to use the output stats. And if a subsequent proposal for core requires statistics from this CP, does that then mandate that it must itself be a CP? What if that can't work?

Putting the stats into a table addresses the first concern.

As for the second, it is an issue that comes up, I think, whenever a generally useful shared function is built as a CP. Please consider inserting my earlier comments about OSGi here, in that we trend toward a real module system if we're not careful (unless that is the aim).

--
Best regards,

   - Andy
-
Re: Simple statistics per region
Enis Söztutar 2013-02-27, 00:15

+1 for core. I can see that histograms might help us in automatic splits and merges as well.
-
Re: Simple statistics per region
lars hofhansl 2013-02-27, 00:27

Just had a discussion with the Phoenix folks (my cubicle neighbors :) ).

Turns out that the types of problems we're trying to solve for Phoenix would need equal-depth histograms, whereas for decisions such as picking a secondary index, equal-width histograms are often used.

So a key piece in this is a proper framework through which stats can be hooked up and calculated. OSGi for coprocessors would be nice, but may also be overkill for this. Maybe something like the chores framework would work.

In either case, there will be core stats (that would allow HBase to decide between a scan and a multi get), and user-defined stats to help higher layers such as Phoenix or an indexing library.

-- Lars
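For readers less familiar with the two histogram types contrasted in this message, here is a small generic sketch (textbook definitions, not Phoenix or HBase code; the function names are made up for illustration). An equal-width histogram fixes the bucket boundaries and records how many values fall into each bucket; an equal-depth histogram fixes the number of values per bucket and records where the boundaries end up.

```python
def equal_width(values, buckets):
    """Counts per bucket when the value range is cut into equally wide slices."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / buckets or 1.0   # avoid a zero width when all values are equal
    counts = [0] * buckets
    for v in values:
        index = min(int((v - lo) / width), buckets - 1)  # max value goes in the last bucket
        counts[index] += 1
    return counts


def equal_depth(values, buckets):
    """Upper boundary of each bucket when every bucket holds about the same count."""
    ordered = sorted(values)
    n = len(ordered)
    # Position of the last element of bucket i, using ceiling division.
    return [ordered[(i * n + buckets - 1) // buckets - 1] for i in range(1, buckets + 1)]


data = [1, 2, 3, 4, 5, 6, 7, 100]
print(equal_width(data, 4))   # [7, 0, 0, 1]
print(equal_depth(data, 4))   # [2, 4, 6, 100]
```

On skewed data such as this, the equal-width histogram puts almost everything into one bucket, while the equal-depth boundaries split the data into evenly sized chunks. The trade-off is that equal-depth boundaries depend on the sorted data, which makes them harder to maintain incrementally than equal-width counts.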
-
Re: Simple stastics per regionJesse Yates 2013-02-27, 00:31
The more I think about it, the more I'd like it in core. OSGi is something
I'd like to avoid as long as we can, and baking this in makes (I think) more sense overall. This is especially true for how to deal with displaying the histograms in the UI - dependent CPs make me twitch. The things we would need to make this happen cleanly (IMO) would be: - system tables - basically metadata in the table descriptor that would hide it from the usual user queries like list_tables, etc. and expose something like deleteSystemTable - An extra 'stat' scanner that goes on top of the store scanner used for compaction that writes to the stats system table - CPs could still muck with this, but as always, that's at their own peril - Some pretty UI graphs on the master for the stats The debateable piece is then: pluggable? If so, to what degree? Something Lars just mentioned which would be nice is to have a Chore-like mechanism that lets people easily change the stats they want to keep track of. Probably along the lines of dynamic config, but since we can just push the changes into a waiting state element/queue-thingy and then let the next round of major compaction grab it without race concerns. Shall I file a JIRA (and sub-jiras) to get this into core; we can also take discussion there? ------------------- Jesse Yates @jesse_yates jyates.github.com On Tue, Feb 26, 2013 at 4:27 PM, lars hofhansl <[EMAIL PROTECTED]> wrote: > Just had a discussion with the Phoenix folks (my cubicle neighbors :) ). > Turns out that the types of problem we're trying to solve for Phoenix > would need equal-depth histograms, whereas for decisions such as picking a > 2ndary index equal-width histograms are often used. > So a key in this is a proper framework through, which, stats can hooked up > and calculated. OSGi for coprocessors would be nice, but may also be > overkill for this. > Maybe something like the chores framework would work. 
> > In either case, there will be core stats (that would allow HBase to decide > between a scan and a multi get), and user defined stats to help higher > layers such as Phoenix, or an indexing library. > > > -- Lars > > > > ________________________________ > From: Enis Söztutar <[EMAIL PROTECTED]> > To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> > Sent: Tuesday, February 26, 2013 4:15 PM > Subject: Re: Simple stastics per region > > +1 for core. I can see that histograms might help us in automatic splits > and merges as well. > > > On Tue, Feb 26, 2013 at 3:27 PM, Andrew Purtell <[EMAIL PROTECTED]> > wrote: > > > If this is going to be a CP then other CPs need an easy way to use the > > output stats. If a subsequent proposal from core requires statistics from > > this CP does that then mandate it itself must be a CP? What if that can't > > work? > > > > Putting the stats into a table addresses the first concern. > > > > For the second, it is an issue that comes up I think when building a > > generally useful shared function as a CP. Please consider inserting my > > earlier comments about OSGi here, in that we trend toward a real module > > system if we're not careful (unless that is the aim). > > > > > > On Tue, Feb 26, 2013 at 2:31 PM, Jesse Yates <[EMAIL PROTECTED] > > >wrote: > > > > > TL;DR Making it part of the UI and ensuring that you don't load things > > the > > > wrong way seem to be the only reasons for making this part of core - > > > certainly not bad reasons. They are fairly easy to handle as a CP > though, > > > so maybe its not necessary immediately. > > > > > > I ended up writing a simple stats framework last week (ok, its like 6 > > > classes) that makes it easy to create your own stats for a table. Its > all > > > coprocessor based, and as Lars suggested, hooks up to the major > > compactions > > > to let you build per-column-per-region stats and writes it to a > 'system' > > > table = "_stats_". 
> > >
> > > With the framework you could easily write your own custom stats, from
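To make the equal-width versus equal-depth distinction from Lars's quoted message concrete, here is a minimal Python sketch. It is not HBase or Phoenix code: the function names are hypothetical, and it assumes plain numeric values, whereas real HBase cells are byte arrays that would need some ordering-preserving mapping first.

```python
from bisect import bisect_left


def equal_width_histogram(values, num_buckets):
    """Count values in num_buckets ranges that each span the same width."""
    lo, hi = min(values), max(values)
    # Fall back to width 1 when every value is identical (hi == lo).
    width = (hi - lo) / num_buckets or 1
    counts = [0] * num_buckets
    for v in values:
        # Clamp so the maximum value lands in the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return counts


def equal_depth_boundaries(values, num_buckets):
    """Return upper bounds chosen so each bucket holds about the same count."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[max(0, (i * n) // num_buckets - 1)]
            for i in range(1, num_buckets + 1)]


def counts_per_bucket(values, boundaries):
    """Count values per bucket, given inclusive upper bounds.

    Assumes the last boundary is >= every value, as it is when the
    boundaries come from equal_depth_boundaries on the same values.
    """
    counts = [0] * len(boundaries)
    for v in values:
        counts[bisect_left(boundaries, v)] += 1
    return counts


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # skewed sample
print(equal_width_histogram(values, 3))
bounds = equal_depth_boundaries(values, 3)
print(bounds, counts_per_bucket(values, bounds))
```

On this skewed sample the equal-width buckets come out as [9, 0, 1], while the equal-depth bounds [3, 6, 100] give counts of [3, 3, 4]. That difference is why the two use cases in the thread may want different histogram types.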
-
Re: Simple stastics per region
Jesse Yates 2013-02-28, 01:52
I filed HBASE-7958 <https://issues.apache.org/jira/browse/HBASE-7958> to follow up on this. It includes a summary of the discussion so far.

-------------------
Jesse Yates
@jesse_yates
jyates.github.com
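The 'stat' scanner and the queue of pending stat changes discussed in this thread can be sketched roughly as follows. This is an illustrative Python model only, not the HBase API or the design in HBASE-7958: `StatScanner`, `major_compact`, the `(column, value)` cell tuples, and the dict standing in for the stats system table are all assumptions made for the example.

```python
from collections import Counter
from queue import Empty, Queue


class StatScanner:
    """Wraps a cell iterator and counts values per tracked column.

    Cells pass through unchanged, so the compaction output is unaffected.
    """

    def __init__(self, inner, tracked_columns):
        self.inner = inner
        self.counts = {c: Counter() for c in tracked_columns}

    def __iter__(self):
        for column, value in self.inner:
            if column in self.counts:
                self.counts[column][value] += 1
            yield column, value


def major_compact(cells, pending_config, tracked_columns, stats_table, store_id):
    """Run one (simulated) major compaction and publish the store's stats."""
    # Pick up config changes queued since the last compaction; the latest
    # wins. Only the compaction reads the queue, so there is no race with
    # the code that pushes changes.
    while True:
        try:
            tracked_columns = pending_config.get_nowait()
        except Empty:
            break
    scanner = StatScanner(iter(cells), tracked_columns)
    compacted = list(scanner)  # stands in for writing the new store file
    # One row per store, replaced as a whole on each major compaction.
    stats_table[store_id] = {c: dict(n) for c, n in scanner.counts.items()}
    return compacted, tracked_columns
```

Changing the tracked stats is then just a `pending_config.put([...])` call; the new setting takes effect at the next major compaction of each store.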