

median aggregate Was: AggregateProtocol Help
Ted Yu 2012-01-06, 03:31
Royston:
For the median aggregate, is the following what you're looking for?
Find the median among the values of all the KeyValues for the cf:qualifier column.

There is a well-known distributed method of computing the median that involves multiple round trips (to the region servers).

Just want to confirm the use case.

Thanks

On Wed, Jan 4, 2012 at 10:57 AM, Royston Sellman <[EMAIL PROTECTED]> wrote:

> Great ideas. Thanks.
>
> w.r.t. 5123: I'll think about it for a day or two then make some comments.
>
> 5122 is very desirable.
>
> Best Regards,
> Royston
>
> On 4 Jan 2012, at 15:01, Ted Yu wrote:
>
> > I can see room for improvement w.r.t. ColumnInterpreters
> > I logged two JIRAs:
> > https://issues.apache.org/jira/browse/HBASE-5122 is for loading
> > ColumnInterpreters dynamically
> >
> > https://issues.apache.org/jira/browse/HBASE-5123 is for adding more
> > aggregation functions.
> >
> > Royston:
> > Feel free to elaborate on 5123 and explain what Mult aggregate should do.
> >
> > Cheers
> >
> > On Wed, Jan 4, 2012 at 3:43 AM, Royston Sellman <[EMAIL PROTECTED]> wrote:
> >
> >> Ted, Himanshu and Gary,
> >>
> >> It works now! I recreated my HBase table to contain Bytes.toBytes(Long)
> >> values and that fixed it.
> >>
> >> For the time being we can convert everything to Longs and work with that,
> >> but we will probably write our own ColumnInterpreters soon for our data
> >> types, so thanks for the pointer to HBASE-4946. There are also Functions we
> >> need (e.g. Median, Weighted Median, Mult) which might best be placed in the
> >> Aggregations Protocol. We'll be sure to discuss this with you when we start.
> >>
> >> Meanwhile, thanks again for all your help!
> >>
> >> Royston
> >>
> >> On 3 Jan 2012, at 18:58, Ted Yu wrote:
> >>
> >>> I like long messages :) because they provide more clues.
> >>>
> >>> For part 1, you don't have to call Bytes.toxxx as long as the interpreter
> >>> uses a method consistent with the way you write values into HBase tables.
> >>>
> >>> For part 2, HBASE-4946 is related.
> >>> Basically you need to place the jar containing your coprocessor and
> >>> interpreter code on HDFS so that you can load it into your HBase table.
> >>> Look at this for details:
> >>> https://issues.apache.org/jira/browse/HBASE-4554
> >>>
> >>> Cheers
> >>>
> >>> On Tue, Jan 3, 2012 at 10:42 AM, Royston Sellman <[EMAIL PROTECTED]> wrote:
> >>>
> >>>> Hi Ted,
> >>>>
> >>>> PART 1
> >>>> ======
> >>>> Thanks for the hint. I think maybe you have given me some inspiration!
> >>>>
> >>>> It looks like getValue will return null if the table value is not the
> >>>> length of a long. When we created our table (batch loading CSVs using the
> >>>> SampleUploader example) we simply have this as our put():
> >>>> put.add(family, Bytes.toBytes("advanceKWh"), advanceKWh);
> >>>> [note we do no Bytes.toxxx casts to the advanceKWh value. The values look
> >>>> OK from HBase shell though :)]
> >>>>
> >>>> but I looked at TestAggregateProtocol.java again and I see there puts like:
> >>>> p2.add(TEST_FAMILY, Bytes.add(TEST_MULTI_CQ, Bytes.toBytes(l)),
> >>>> Bytes.toBytes(l * 10));
> >>>>
> >>>> So my hypothesis is that we need to do something like:
> >>>> Long l = new Long(1);
> >>>> put.add(family, Bytes.toBytes("advanceKWh"), Bytes.toBytes(l * advanceKWh));
> >>>> when we create the table.
> >>>>
> >>>> Do you think my hypothesis is correct? Did we build our table incorrectly
> >>>> for reading longs from it?
> >>>>
> >>>> PART 2
> >>>> ======
> >>>> Anyway we will obviously need to make our own interpreters, but we have
> >>>> failed at this task so far:
> >>>> In order to implement our own ColumnInterpreter, we first attempted simply
> >>>> extending the LongColumnInterpreter and passing that as a parameter to
> >>>> aClient.sum().
> >>>> import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
> >>>>
> >>>> public class LCI extends LongColumnInterpreter {
> >>>> [...]
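The null result described in PART 1 of the quoted message comes down to the length of the stored cell. Below is a minimal sketch of that effect, under two assumptions that are not confirmed in this thread: that Bytes.toBytes(long) stores the value as 8 big-endian bytes, and that the interpreter ignores any cell that is not exactly 8 bytes long. LongCellSketch, encode and decode are hypothetical names built on java.nio only; they are not HBase API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class LongCellSketch {
    // Assumption: mirrors the 8-byte big-endian layout of a long cell value.
    static byte[] encode(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    // Returns null unless the cell is exactly 8 bytes long, which is the
    // behaviour the thread attributes to the interpreter's getValue.
    static Long decode(byte[] cell) {
        if (cell == null || cell.length != Long.BYTES) {
            return null;
        }
        return ByteBuffer.wrap(cell).getLong();
    }

    public static void main(String[] args) {
        // A value loaded from CSV as text: 4 bytes, so it is not readable as a long.
        byte[] asText = "1234".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(asText));        // prints null
        // The same value written as a long: 8 bytes, so it round-trips.
        System.out.println(decode(encode(1234L))); // prints 1234
    }
}
```

Under those assumptions, a value that looks fine as text in the HBase shell can still be unreadable as a long, which would match Royston's later report that recreating the table with Bytes.toBytes(Long) values fixed the problem.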

Re: median aggregate Was: AggregateProtocol Help
Royston Sellman 2012-01-06, 14:22
Hi Ted,

Yes, that is the use case I am thinking of.

Re: 5123 I have also had some time to think about other aggregation functions (Please be aware that I am new to HBase, Coprocessors, and the Aggregation Protocol and I have little knowledge of distributed numerical algorithms!). It seems to me the pattern in AP is to return a SINGLE value from a SINGLE column (CF:CQ) of a table. In future one might wish to extend AP to return MULTIPLE values from MULTIPLE columns, so it is good to keep this in mind for the SINGLE value/SINGLE column (SVSC) case.

So, common SVSC aggregation functions (AP supported first):
min
max
sum
count
avg (arithmetic mean)
std
median
mode
quantile/ntile
mult/product

for column values of all numeric types, returning values of that type.

Some thoughts on the future possibilities:
An example of a future SINGLE value MULTIPLE column use case could be weighted versions of the above functions i.e. a column of weights applied to the column of values then the new aggregation derived.
(note: there is a very good description of Weighted Median in the R language documentation:
http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/R.basic/html/weighted.median.html)

An example of future MULTIPLE value SINGLE column could be range: return all rows with a column value between two values. Maybe this is a bad example because there could be better HBase ways to do it with filters/scans at a higher level. Perhaps binning is a better example? i.e. return an array containing values derived from applying one of the SVSC functions to a binned column e.g:
int bins = 100;
aClient.sum(table, ci, scan, bins); => {12.3, 14.5...}
Another example (common in several programming languages) is to map an arbitrary function over a column and return the new vector. Of course, again this may be a bad example in the case of long HBase columns but it seems like an appropriate thing to do with coprocessors.

MULTIPLE value MULTIPLE column examples are common in spatial data processing but I see there has been a lot of spatial/GIS discussion around HBase which I have not read yet. So I'll keep quiet for now.

I hope these thoughts strike a balance between my (special interest) use case of statistical/spatial functions on tables and general purpose (but coprocessor-enabled/regionserver-distributed) HBase.

Best regards,
Royston
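For the median use case confirmed in this message, one possible multi-round-trip scheme is a binary search over the value range: the client first collects min, max and count, then repeatedly asks every region how many values are at or below a candidate. This is only a sketch of that idea. It is not necessarily the method Ted has in mind, and it is not HBase code. Regions are simulated as in-memory long arrays, and DistributedMedianSketch, countAtMost and median are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

final class DistributedMedianSketch {
    // Stand-in for a per-region scan: how many values in this region are <= v.
    static long countAtMost(long[] region, long v) {
        long c = 0;
        for (long x : region) {
            if (x <= v) c++;
        }
        return c;
    }

    // Lower median: the smallest value v such that at least ceil(n/2) values are <= v.
    static long median(List<long[]> regions) {
        long lo = Long.MAX_VALUE, hi = Long.MIN_VALUE, n = 0;
        for (long[] r : regions) {           // round trip 1: min, max and count
            for (long x : r) {
                lo = Math.min(lo, x);
                hi = Math.max(hi, x);
            }
            n += r.length;
        }
        if (n == 0) throw new IllegalArgumentException("no values");
        long k = (n + 1) / 2;
        while (lo < hi) {                    // each iteration is one more round trip
            long mid = lo + ((hi - lo) >>> 1); // overflow-safe midpoint
            long c = 0;
            for (long[] r : regions) c += countAtMost(r, mid);
            if (c >= k) hi = mid; else lo = mid + 1;
        }
        return lo;
    }

    public static void main(String[] args) {
        List<long[]> regions = Arrays.asList(
                new long[]{5, 1, 9}, new long[]{3, 7}, new long[]{2});
        System.out.println(median(regions)); // prints 3
    }
}
```

The number of round trips grows with the logarithm of the value range rather than with the number of rows, and each region only ever returns a single count. The sketch assumes the data does not change between round trips.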

Re: median aggregate Was: AggregateProtocol Help
yuzhihong@... 2012-01-06, 14:37
This is a good summary.

Do you mind putting what you wrote below on HBASE-5123?

Thanks

On Jan 6, 2012, at 6:22 AM, Royston Sellman <[EMAIL PROTECTED]> wrote:

> Hi Ted,
>
> Yes, that is the use case I am thinking of.
> [...]

Re: median aggregate Was: AggregateProtocol Help
Royston Sellman 2012-01-06, 15:09
Done.

Thanks,
Royston

> Do you mind putting what you wrote below on HBASE-5123?
>
> Thanks

Re: median aggregate Was: AggregateProtocol Help
Ted Yu 2012-01-06, 19:29
Royston:
I need to brush up my math knowledge so bear with me for a few questions.

For binning, you gave 100 as the number of bins. If the computation is initiated on each region server simultaneously, how would each region know where the bin boundaries are? If the boundaries are naturally aligned with region boundaries, that would be easier.

I logged HBASE-5139 for weighted median, please comment there.

If you or other people feel there is a plausible implementation for any new aggregate, please create a subtask so that the original JIRA can host general discussions.

Cheers
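For readers unfamiliar with the weighted median that HBASE-5139 covers, here is a single-machine sketch. It assumes one common "lower" definition: the smallest value at which the cumulative weight reaches half of the total weight. Other definitions interpolate between neighbouring values, and the thread does not say which one is intended. The sketch also assumes non-negative weights. It is not the HBASE-5139 patch, and WeightedMedianSketch and weightedMedian are hypothetical names.

```java
import java.util.Arrays;
import java.util.Comparator;

final class WeightedMedianSketch {
    // values[i] carries weights[i]; weights are assumed to be non-negative.
    static double weightedMedian(double[] values, double[] weights) {
        if (values.length == 0 || values.length != weights.length) {
            throw new IllegalArgumentException("need equally sized, non-empty columns");
        }
        // Sort row indices by value so each weight stays attached to its value.
        Integer[] order = new Integer[values.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> values[i]));

        double total = 0;
        for (double w : weights) total += w;

        // Walk the values in ascending order until half the weight is covered.
        double running = 0;
        for (int i : order) {
            running += weights[i];
            if (running >= total / 2) return values[i];
        }
        return values[order[order.length - 1]];
    }

    public static void main(String[] args) {
        // Equal weights reduce to the ordinary (lower) median.
        System.out.println(weightedMedian(
                new double[]{1, 2, 3, 4}, new double[]{1, 1, 1, 1})); // prints 2.0
    }
}
```

A distributed version would need the value and weight columns to be read together on each region, which is the SINGLE value MULTIPLE column case Royston describes.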

Re: median aggregate Was: AggregateProtocol Help
Royston Sellman 2012-01-06, 22:00
I will have to think about this properly next week as I am travelling this weekend but...

I was using binning only as an example. I have worked with R in the past and there is a neat R function called hist which generates histograms from arrays of values and the number of "breaks" (=bins) is a parameter to hist. The generated histogram is an object so you can examine it: hist()$counts returns a vector containing the frequencies in each bin ("$" in R is like "." in Java). The discussion is here: http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/base/html/hist.html

I am not trying to turn HBase into R ;) but binning is in my experience a useful aggregation. I have no idea how to efficiently implement it across the regionservers though. I think it is *me* who needs to brush up my knowledge of HBase internal machinery. But I think it will be a similar problem to crack for quantile/ntile. The start of the boundaries will be the ntiles. Maybe if ntile is done first then it will help with binning, maybe even make it trivial.

HBASE-5139 looks good, thanks. I will get colleagues to look at it and comment.

Cheers,
Royston

RE: median aggregate Was: AggregateProtocol Help
Tom Wilcox 2012-01-07, 11:32
Forgive me if this is stating the obvious (I just want to understand this better), but a naive approach to hist would surely just be a 2-pass algorithm where the first pass gathers statistics such as the range. Those statistics could be cached for subsequent requests that are also "range-dependent" such as ntiles.

Are 2-pass algorithms out of the question or too inefficient to consider?

Cheers,
Tom
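The 2-pass idea can be sketched as follows: the first pass merges each region's min and max into a global range, and the second pass sends that range and the bin count to every region so that all regions count into identical, equal-width bins. This is only an illustration with regions simulated as in-memory long arrays. It is not HBase coprocessor code, TwoPassHistogramSketch, binOf and histogram are hypothetical names, and it assumes the data does not change between the two passes.

```java
import java.util.Arrays;
import java.util.List;

final class TwoPassHistogramSketch {
    // Maps a value to one of `bins` equal-width bins over [min, max].
    static int binOf(long v, long min, long max, int bins) {
        if (max == min) return 0;
        // Double arithmetic avoids long overflow when the range is very wide.
        int b = (int) (((double) v - min) / ((double) max - min) * bins);
        return Math.min(b, bins - 1); // the maximum itself goes into the last bin
    }

    static long[] histogram(List<long[]> regions, int bins) {
        // Pass 1: every region reports its min and max; the client merges them.
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
        for (long[] r : regions) {
            for (long x : r) {
                min = Math.min(min, x);
                max = Math.max(max, x);
            }
        }
        // Pass 2: every region counts into the same bins; the client adds the arrays.
        long[] total = new long[bins];
        for (long[] r : regions) {
            long[] partial = new long[bins]; // what one region would send back
            for (long x : r) partial[binOf(x, min, max, bins)]++;
            for (int i = 0; i < bins; i++) total[i] += partial[i];
        }
        return total;
    }

    public static void main(String[] args) {
        List<long[]> regions = Arrays.asList(
                new long[]{0, 1, 2, 3}, new long[]{4, 5, 6, 7, 8, 9, 10});
        System.out.println(Arrays.toString(histogram(regions, 2))); // prints [5, 6]
    }
}
```

Because the bin boundaries depend only on the global min, max and bin count, no region needs to know anything about the others, which is one answer to the question of how each region would learn where the boundaries are.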

Re: median aggregate Was: AggregateProtocol Help
yuzhihong@... 2012-01-07, 11:45
Tom:
A two-pass algorithm is fine. See HBASE-5139.

But we have to consider that the underlying data might change between the two passes.

Feel free to log subtasks for HBASE-5123 for each aggregate that you think should be supported.

Cheers
yuzhihong@... 20120107, 11:45

Re: median aggregate Was: AggregateProtocol Help
Ted Yu 20120110, 04:04
Tom / Royston:
I attached first version of patch to HBASE-5139. I need to handle weighted median and add more tests.

javadoc is available for methods. More javadoc is needed inside median() method.

It took longer than I expected due to the generic parameters.

Comments are welcome.
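For readers unfamiliar with the term, a weighted median can be illustrated on a single machine as follows. This is not the HBASE-5139 patch and says nothing about how it distributes the work; it is a minimal sketch assuming non-negative weights, using the "lower" weighted median (the smallest value whose cumulative weight reaches half of the total). The class name `WeightedMedianSketch` is hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Single-machine illustration of a (lower) weighted median. */
class WeightedMedianSketch {

    /**
     * Returns the smallest value whose cumulative weight, taken in ascending
     * value order, is at least half of the total weight.
     * Assumes weights are non-negative and not all zero.
     */
    static double weightedMedian(double[] values, double[] weights) {
        if (values.length == 0 || values.length != weights.length) {
            throw new IllegalArgumentException("values and weights must be non-empty and of equal length");
        }
        // Sort indices by value so that values and weights stay paired.
        Integer[] order = new Integer[values.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble(i -> values[i]));

        double total = 0;
        for (double w : weights) {
            total += w;
        }

        double cumulative = 0;
        for (int idx : order) {
            cumulative += weights[idx];
            if (cumulative >= total / 2) {
                return values[idx];
            }
        }
        // Only reachable through floating-point rounding; fall back to the largest value.
        return values[order[order.length - 1]];
    }

    public static void main(String[] args) {
        // The heavy weight on 4 pulls the weighted median up to 4.
        System.out.println(weightedMedian(new double[] {1, 2, 3, 4}, new double[] {1, 1, 1, 5}));
        // With equal weights the result is the ordinary median, 2.
        System.out.println(weightedMedian(new double[] {3, 1, 2}, new double[] {1, 1, 1}));
    }
}
```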

RE: median aggregate Was: AggregateProtocol Help
Royston Sellman 20120110, 13:01
Hi Ted,
Great! Thanks for your work. I see you posted another comment saying you now support weighted median. You're very fast!

We have to spend time getting ready for a presentation this week but we will try to make time to test the patch.

The code is quite hard for me to read due to the generics but it looks like you have made ColumnInterpreter more ready to take types other than Longs BUT you have not provided column interpreter implementations for types other than Longs. So there is more work to do for other types, am I correct? If so, should we start a new JIRA for more types? I am thinking the type we most need is double.

Cheers,
Royston
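The question about interpreters for types other than Long can be made concrete with a small sketch. The `ValueInterpreter` interface below is hypothetical and deliberately much smaller than HBase's ColumnInterpreter; it is not that API. It only shows the general idea of keeping the aggregation generic and plugging in one decoder per stored type, assuming cell values were written as 8-byte big-endian longs or doubles.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

/** Hypothetical interface for illustration only; NOT HBase's ColumnInterpreter. */
interface ValueInterpreter<T> {
    T decode(byte[] raw); // turn stored bytes into a typed value
    T add(T a, T b);      // combine two values
    T zero();             // identity element for add
}

/** Decodes 8-byte big-endian longs. */
class LongValueInterpreter implements ValueInterpreter<Long> {
    public Long decode(byte[] raw) { return ByteBuffer.wrap(raw).getLong(); }
    public Long add(Long a, Long b) { return a + b; }
    public Long zero() { return 0L; }
}

/** Decodes 8-byte big-endian IEEE 754 doubles. */
class DoubleValueInterpreter implements ValueInterpreter<Double> {
    public Double decode(byte[] raw) { return ByteBuffer.wrap(raw).getDouble(); }
    public Double add(Double a, Double b) { return a + b; }
    public Double zero() { return 0.0; }
}

class InterpreterDemo {

    /** Generic sum: the aggregation logic is written once, independent of the value type. */
    static <T> T sum(List<byte[]> cells, ValueInterpreter<T> interpreter) {
        T total = interpreter.zero();
        for (byte[] cell : cells) {
            total = interpreter.add(total, interpreter.decode(cell));
        }
        return total;
    }

    // Helpers standing in for however the values were written to the table.
    static byte[] encode(long v) { return ByteBuffer.allocate(Long.BYTES).putLong(v).array(); }
    static byte[] encode(double v) { return ByteBuffer.allocate(Double.BYTES).putDouble(v).array(); }

    public static void main(String[] args) {
        List<byte[]> doubles = Arrays.asList(encode(1.5), encode(2.25));
        System.out.println(sum(doubles, new DoubleValueInterpreter())); // prints 3.75

        List<byte[]> longs = Arrays.asList(encode(2L), encode(3L));
        System.out.println(sum(longs, new LongValueInterpreter())); // prints 5
    }
}
```

The point of the design is that supporting a new type means writing one more small interpreter class, while `sum` and any other aggregate stay unchanged.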
