HBase >> mail # user >> median aggregate Was: AggregateProtocol Help

Re: median aggregate Was: AggregateProtocol Help
A two-pass algorithm is fine. See HBASE-5139.

But we have to consider that the underlying data might change between the two passes.

Feel free to log subtasks under HBASE-5123 for each aggregate that you think should be supported.
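The two-pass shape discussed here can be sketched outside HBase. In this toy version, plain sorted arrays stand in for regions: pass 1 ships back only per-partition counts, and pass 2 reads a single value at the global median rank. The global-sort assumption is the big caveat, since real regions only order row keys, not column values, so treat this as an illustration of the communication pattern, not AggregateProtocol code.

```java
import java.util.List;

// Sketch of a two-pass median over pre-sorted partitions (stand-ins for
// regions). Pass 1 collects only the per-partition counts; pass 2 reads a
// single value from the one partition that holds the global median rank.
// Assumes values are globally sorted across partitions, which real HBase
// regions only guarantee for row keys, not for column values.
public class TwoPassMedian {
    public static long median(List<long[]> partitions) {
        // Pass 1: per-partition counts (the only data shipped back).
        long total = 0;
        for (long[] p : partitions) total += p.length;
        long rank = (total - 1) / 2;           // 0-based rank of the (lower) median

        // Pass 2: locate the partition containing `rank` and read one value.
        for (long[] p : partitions) {
            if (rank < p.length) return p[(int) rank];
            rank -= p.length;
        }
        throw new IllegalStateException("empty input");
    }
}
```

Because only counts travel in pass 1, the data can change before pass 2 runs, which is exactly the consistency caveat above.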


On Jan 7, 2012, at 3:32 AM, Tom Wilcox <[EMAIL PROTECTED]> wrote:

> Forgive me if this is stating the obvious (I just want to understand this better), but a naive approach to hist would surely just be a 2-pass algorithm where the first pass gathers statistics such as the range. Those statistics could be cached for subsequent requests that are also "range-dependent" such as n-tiles.
> Are 2-pass algorithms out of the question or too inefficient to consider?
> Cheers,
> Tom
> ________________________________________
> From: Royston Sellman [[EMAIL PROTECTED]]
> Sent: 06 January 2012 22:00
> Subject: Re: median aggregate Was: AggregateProtocol Help
> I will have to think about this properly next week as I am travelling this weekend but...
> I was using binning only as an example. I have worked with R in the past and there is a neat R function called hist which generates histograms from arrays of values; the number of "breaks" (= bins) is a parameter to hist. The generated histogram is an object, so you can examine it: hist()$counts returns a vector containing the frequencies in each bin ("$" in R is like "." in Java). The discussion is here: http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/base/html/hist.html
> I am not trying to turn HBase into R ;) but binning is, in my experience, a useful aggregation. I have no idea how to implement it efficiently across the region servers though. I think it is *me* who needs to brush up my knowledge of HBase internal machinery. But I think it will be a similar problem to crack as quantile/ntile: the bin boundaries would start at the ntiles. Maybe if ntile is done first then it will help with binning, maybe even make it trivial.
> HBASE-5139 looks good, thanks. I will get colleagues to look at it and comment.
> Cheers,
> Royston
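The binning idea above reduces to an aggregation that merges cleanly across region servers once the bin boundaries are fixed (say, from a prior min/max pass, as Tom's range-gathering suggestion implies). The sketch below is plain Java with no HBase dependencies: localCounts plays the per-region coprocessor role, merge is the client-side combine, and the merged array mirrors R's hist()$counts. All names are illustrative, not existing AggregateProtocol API.

```java
import java.util.List;

// Sketch of distributed binning: each partition (region server stand-in)
// computes local bin counts against shared boundaries, and the client merges
// them by element-wise addition. Bin boundaries (min, max, nBins) are assumed
// to be known up front, e.g. from a prior range-gathering pass.
public class DistributedHist {
    public static long[] localCounts(double[] values, double min, double max, int nBins) {
        long[] counts = new long[nBins];
        double width = (max - min) / nBins;
        for (double v : values) {
            int bin = (int) ((v - min) / width);
            if (bin == nBins) bin--;           // put v == max into the last bin
            counts[bin]++;
        }
        return counts;
    }

    public static long[] merge(List<long[]> partials) {
        long[] out = new long[partials.get(0).length];
        for (long[] p : partials)
            for (int i = 0; i < out.length; i++) out[i] += p[i];
        return out;
    }
}
```

Because merging is just element-wise addition, the combine step is associative and order-independent, which is what makes it safe to run the local counts on each region simultaneously.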
> On 6 Jan 2012, at 19:29, Ted Yu wrote:
>> Royston:
>> I need to brush up my math knowledge so bear with me for a few questions.
>> For binning, you gave 100 as the number of bins. If the computation is
>> initiated on each region server simultaneously, how would each region know
>> where the bin boundaries are? If the boundaries are naturally aligned with
>> region boundaries, that would be easier.
>> I logged HBASE-5139 for weighted median, please comment there.
>> If you or other people feel there is a plausible implementation for any new
>> aggregate, please create a subtask so that the original JIRA can host general
>> discussions.
>> Cheers
>> On Fri, Jan 6, 2012 at 6:22 AM, Royston Sellman <
>> [EMAIL PROTECTED]> wrote:
>>> Hi Ted,
>>> Yes, that is the use case I am thinking of.
>>> Re: 5123 I have also had some time to think about other aggregation
>>> functions (Please be aware that I am new to HBase, Coprocessors, and the
>>> Aggregation Protocol and I have little knowledge of distributed numerical
>>> algorithms!). It seems to me the pattern in AP is to return a SINGLE value
>>> from a SINGLE column (CF:CQ) of a table. In future one might wish to extend
>>> AP to return MULTIPLE values from MULTIPLE columns, so it is good to keep
>>> this in mind for the SINGLE value/SINGLE column (SVSC) case.
>>> So, common SVSC aggregation functions (AP supported first):
>>> min
>>> max
>>> sum
>>> count
>>> avg (arithmetic mean)
>>> std
>>> median
>>> mode
>>> quantile/ntile
>>> mult/product
>>> for column values of all numeric types, returning values of that type.
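Several of the functions in that list (min, max, sum, count, avg, std, though not median, mode, or ntile) can share one small partial record per region that merges associatively on the client. A rough sketch of that pattern, with illustrative names rather than AggregateProtocol API:

```java
// Sketch: one partial record per region — (count, sum, sum of squares,
// min, max) — accumulated server-side and combined client-side. avg and
// std fall out of the merged record; order-dependent statistics such as
// median do not, which is why they need the separate two-pass treatment.
public class PartialStats {
    public long count;
    public double sum, sumSq;
    public double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;

    public void add(double v) {                // per-region accumulation
        count++; sum += v; sumSq += v * v;
        min = Math.min(min, v); max = Math.max(max, v);
    }

    public PartialStats merge(PartialStats o) { // client-side combine
        PartialStats r = new PartialStats();
        r.count = count + o.count;
        r.sum = sum + o.sum;
        r.sumSq = sumSq + o.sumSq;
        r.min = Math.min(min, o.min);
        r.max = Math.max(max, o.max);
        return r;
    }

    public double avg() { return sum / count; }
    public double std() { return Math.sqrt(sumSq / count - avg() * avg()); }
}
```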
>>> Some thoughts on the future possibilities:
>>> An example of a future SINGLE value MULTIPLE column use case could be
>>> weighted versions of the above functions i.e. a column of weights applied
>>> to the column of values then the new aggregation derived.
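The weighted median of HBASE-5139 is a concrete instance of that SINGLE value / MULTIPLE column shape: a weight column applied to a value column. A client-side sketch, with plain arrays standing in for the two CF:CQ columns (a real coprocessor implementation would need to distribute or avoid the sort):

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of a weighted median: sort by value, then walk up the cumulative
// weights and return the first value at which they reach half of the total.
// Arrays stand in for the value and weight columns; nothing here is HBase API.
public class WeightedMedian {
    public static double weightedMedian(double[] values, double[] weights) {
        Integer[] idx = new Integer[values.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> values[i]));

        double total = 0;
        for (double w : weights) total += w;

        double cum = 0;                        // walk values in sorted order
        for (int i : idx) {
            cum += weights[i];
            if (cum >= total / 2) return values[i];
        }
        throw new IllegalArgumentException("weights sum to zero");
    }
}
```

With all weights equal this degenerates to the ordinary median, which is one way to sanity-check any coprocessor version against the existing single-column path.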