median aggregate Was: AggregateProtocol Help
Royston:
For the median aggregate, is the following what you're looking for? Find the median among the values of all the KeyValues for the cf:qualifier column. There is a well-known distributed method of computing the median that involves multiple round trips (to the region servers). Just want to confirm the use case. Thanks
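As an illustration of what "multiple round trips" can mean, here is a minimal Python sketch of one generic technique: bisecting on the value and asking every region how many of its values are at or below the pivot. It is not necessarily the method Ted has in mind, and it is not coprocessor code. The function names (`region_count_le`, `distributed_kth`, `distributed_median`) are hypothetical, regions are modelled as in-memory lists, and the values are assumed to be integers.

```python
def region_count_le(region, pivot):
    """Work done on one region: count the values <= pivot."""
    return sum(1 for v in region if v <= pivot)


def distributed_kth(regions, k):
    """Return the k-th smallest value (1-based) across all regions.

    Each loop iteration stands for one round trip to every region.
    Assumes integer values, so the bisection terminates.
    """
    non_empty = [r for r in regions if r]
    if not non_empty:
        raise ValueError("no values")
    # First round trip: the global range.
    lo = min(min(r) for r in non_empty)
    hi = max(max(r) for r in non_empty)
    while lo < hi:
        mid = (lo + hi) // 2
        # One round trip: every region reports its count for this pivot.
        if sum(region_count_le(r, mid) for r in regions) >= k:
            hi = mid
        else:
            lo = mid + 1
    return lo


def distributed_median(regions):
    """Lower median: for an even count, the smaller of the two middle values."""
    n = sum(len(r) for r in regions)
    if n == 0:
        raise ValueError("no values")
    return distributed_kth(regions, (n + 1) // 2)
```

The number of round trips grows with the logarithm of the value range, not with the number of rows, and only counts travel back to the client.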
Re: median aggregate Was: AggregateProtocol Help
Hi Ted,
Yes, that is the use case I am thinking of.

Re: 5123 I have also had some time to think about other aggregation functions (Please be aware that I am new to HBase, Coprocessors, and the Aggregation Protocol and I have little knowledge of distributed numerical algorithms!).

It seems to me the pattern in AP is to return a SINGLE value from a SINGLE column (CF:CQ) of a table. In future one might wish to extend AP to return MULTIPLE values from MULTIPLE columns, so it is good to keep this in mind for the SINGLE value/SINGLE column (SVSC) case.

So, common SVSC aggregation functions (AP supported first):

min
max
sum
count
avg (arithmetic mean)
std
median
mode
quantile/ntile
mult/product

for column values of all numeric types, returning values of that type.

Some thoughts on the future possibilities:

An example of a future SINGLE value MULTIPLE column use case could be weighted versions of the above functions i.e. a column of weights applied to the column of values then the new aggregation derived. (note: there is a very good description of Weighted Median in the R language documentation: http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/R.basic/html/weighted.median.html)

An example of future MULTIPLE value SINGLE column could be range: return all rows with a column value between two values. Maybe this is a bad example because there could be better HBase ways to do it with filters/scans at a higher level. Perhaps binning is a better example? i.e. return an array containing values derived from applying one of the SVSC functions to a binned column e.g:

    int bins = 100;
    aClient.sum(table, ci, scan, bins); => {12.3, 14.5...}

Another example (common in several programming languages) is to map an arbitrary function over a column and return the new vector. Of course, again this may be a bad example in the case of long HBase columns but it seems like an appropriate thing to do with coprocessors.

MULTIPLE value MULTIPLE column examples are common in spatial data processing but I see there has been a lot of spatial/GIS discussion around HBase which I have not read yet. So I'll keep quiet for now.

I hope these thoughts strike a balance between my (special interest) use case of statistical/spatial functions on tables and general purpose (but coprocessor enabled/regionserver distributed) HBase.

Best regards,
Royston
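To make the weighted case concrete, here is a small single-machine Python sketch of one common definition of the weighted median: the smallest value at which the cumulative weight reaches half of the total weight. The function name `weighted_median` is only illustrative, the sketch makes no attempt to reproduce the behaviour of the R function linked above, and it says nothing about how the work would be split across region servers.

```python
def weighted_median(values, weights):
    """Lower weighted median of values, one non-negative weight per value."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("total weight must be positive")
    cumulative = 0
    # Walk the values in ascending order, accumulating their weights.
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if 2 * cumulative >= total:
            return value
```

With equal weights this reduces to the ordinary (lower) median; a single heavy weight pulls the result towards its value.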
Re: median aggregate Was: AggregateProtocol Help
This is a good summary.
Do you mind putting what you wrote below on HBASE-5123? Thanks
Re: median aggregate Was: AggregateProtocol Help
Done.
Thanks, Royston
Re: median aggregate Was: AggregateProtocol Help
Royston:
I need to brush up my math knowledge so bear with me for a few questions. For binning, you gave 100 as the number of bins. If the computation is initiated on each region server simultaneously, how would each region know where the bin boundaries are? If the boundaries are naturally aligned with region boundaries, that would be easier. I logged HBASE-5139 for weighted median, please comment there. If you or other people feel there is a plausible implementation for any new aggregate, please create a subtask so that the original JIRA can host general discussions. Cheers
Re: median aggregate Was: AggregateProtocol Help
I will have to think about this properly next week as I am travelling this weekend but...
I was using binning only as an example. I have worked with R in the past and there is a neat R function called hist which generates histograms from arrays of values, and the number of "breaks" (=bins) is a parameter to hist. The generated histogram is an object so you can examine it: hist()$counts returns a vector containing the frequencies in each bin ("$" in R is like "." in Java). The discussion is here: http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/base/html/hist.html

I am not trying to turn HBase into R ;) but binning is in my experience a useful aggregation. I have no idea how to efficiently implement it across the regionservers though. I think it is *me* who needs to brush up my knowledge of HBase internal machinery. But I think it will be a similar problem to crack for quantile/ntile. The start of the boundaries will be the ntiles. Maybe if ntile is done first then it will help with binning, maybe even make it trivial.

HBASE-5139 looks good, thanks. I will get colleagues to look at it and comment.

Cheers,
Royston
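The link between binning and ntiles can also be read in the other direction: once equal-width bin counts are known, the bin that contains a given quantile follows from a cumulative sum. The Python sketch below assumes the counts have already been merged from all regions and that the bins evenly divide the range from `lo` to `hi`; the function name `quantile_bin` is hypothetical, and the result is only a bracket around the quantile, not its exact value.

```python
import math


def quantile_bin(counts, lo, hi, q):
    """Return (start, end) of the equal-width bin holding the q-quantile."""
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    n = sum(counts)
    if n == 0:
        raise ValueError("empty histogram")
    # Rank (1-based) of the element we are looking for.
    target = max(1, math.ceil(q * n))
    width = (hi - lo) / len(counts)
    cumulative = 0
    for i, count in enumerate(counts):
        cumulative += count
        if cumulative >= target:
            return (lo + i * width, lo + (i + 1) * width)
    raise AssertionError("unreachable: cumulative count always reaches target")
```

A second scan restricted to that bin would then be enough to pin down the exact quantile.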
RE: median aggregate Was: AggregateProtocol Help
Forgive me if this is stating the obvious (I just want to understand this better), but a naive approach to hist would surely just be a 2-pass algorithm where the first pass gathers statistics such as the range. Those statistics could be cached for subsequent requests that are also "range-dependent" such as ntiles.
Are 2-pass algorithms out of the question or too inefficient to consider? Cheers, Tom
Re: median aggregate Was: AggregateProtocol Help
Tom:
A two-pass algorithm is fine. See HBASE-5139. But we have to consider that the underlying data might change between the two passes. Feel free to log subtasks for HBASE-5123 for each aggregate that you think should be supported. Cheers
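The two-pass idea for hist can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: regions are modelled as in-memory lists, the helper names (`region_range`, `region_bin_counts`, `two_pass_histogram`) are hypothetical, bins are equal-width, and the data is assumed not to change between the passes, which is exactly the caveat raised above.

```python
def region_range(region):
    """Pass 1, per region: local (min, max), or None for an empty region."""
    return (min(region), max(region)) if region else None


def region_bin_counts(region, lo, hi, bins):
    """Pass 2, per region: counts for bins whose boundaries are shared by all."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in region:
        if width == 0:
            index = 0  # every value is identical
        else:
            # The last bin is closed on the right so that hi is included.
            index = min(bins - 1, int((v - lo) / width))
        counts[index] += 1
    return counts


def two_pass_histogram(regions, bins):
    """Client side: merge ranges, broadcast boundaries, then sum the counts."""
    ranges = [r for r in map(region_range, regions) if r is not None]
    if not ranges:
        raise ValueError("no values")
    lo = min(r[0] for r in ranges)
    hi = max(r[1] for r in ranges)
    totals = [0] * bins
    for region in regions:
        for i, c in enumerate(region_bin_counts(region, lo, hi, bins)):
            totals[i] += c
    return lo, hi, totals
```

Because every region receives the same `lo`, `hi` and `bins`, each one can derive the bin boundaries on its own, and only small count vectors are sent back for merging.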
Re: median aggregate Was: AggregateProtocol Help
Tom / Royston:
I attached the first version of the patch to HBASE-5139. I still need to handle weighted median and add more tests. Javadoc is available for the methods. More Javadoc is needed inside the median() method. It took longer than I expected due to the generic parameters. Comments are welcome.
RE: median aggregate Was: AggregateProtocol Help
Hi Ted,
Great! Thanks for your work. I see you posted another comment saying you now support weighted median. You're very fast! We have to spend time getting ready for a presentation this week but we will try to make time to test the patch. The code is quite hard for me to read due to the generics but it looks like you have made ColumnInterpreter more ready to take types other than Longs BUT you have not provided column interpreter implementations for types other than Longs. So there is more work to do for other types, am I correct? If so, should we start a new JIRA for more types? I am thinking the type we most need is double. Cheers, Royston
