Re: median aggregate Was: AggregateProtocol Help
Tom:
A two-pass algorithm is fine. See HBASE-5139.

But we have to consider that the underlying data might change between the two passes.
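A minimal sketch of the two-pass idea in plain client-side Java (class and method names are hypothetical, not the AggregateProtocol API). Against HBase each pass would be a separate full scan, which is exactly where the caveat about the data changing between passes comes in:

// Hypothetical two-pass histogram over an in-memory array; against HBase the
// two loops would each be a table scan.
public final class TwoPassHistogram {

    public static long[] histogram(double[] values, int numBins) {
        // Pass 1: find the range of the data.
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }

        // Pass 2: count values per fixed-width bin derived from the range.
        long[] counts = new long[numBins];
        double width = (max - min) / numBins;
        for (double v : values) {
            int bin = (width == 0) ? 0 : (int) ((v - min) / width);
            if (bin == numBins) bin = numBins - 1; // v == max falls in the last bin
            counts[bin]++;
        }
        return counts;
    }
}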

Feel free to log subtasks under HBASE-5123 for each aggregate that you think should be supported.

Cheers

On Jan 7, 2012, at 3:32 AM, Tom Wilcox <[EMAIL PROTECTED]> wrote:

> Forgive me if this is stating the obvious (I just want to understand this better), but a naive approach to hist would surely just be a 2-pass algorithm where the first pass gathers statistics such as the range. Those statistics could be cached for subsequent requests that are also "range-dependent" such as n-tiles.
>
> Are 2-pass algorithms out of the question or too inefficient to consider?
>
> Cheers,
> Tom
> ________________________________________
> From: Royston Sellman [[EMAIL PROTECTED]]
> Sent: 06 January 2012 22:00
> To: [EMAIL PROTECTED]
> Subject: Re: median aggregate Was: AggregateProtocol Help
>
> I will have to think about this properly next week as I am travelling this weekend but...
>
> I was using binning only as an example. I have worked with R in the past and there is a neat R function called hist which generates histograms from arrays of values and the number of "breaks" (=bins) is a parameter to hist. The generated histogram is an object so you can examine it: hist()$counts returns a vector containing the frequencies in each bin ("$" in R is like "." in Java). The discussion is here: http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/base/html/hist.html
>
> I am not trying to turn HBase into R ;) but binning is in my experience a useful aggregation. I have no idea how to efficiently implement it across the regionservers though. I think it is *me* who needs to brush up my knowledge of HBase internal machinery. But I think it will be a similar problem to crack as quantile/ntile: the bin boundaries would essentially be the n-tiles. Maybe if ntile is done first then it will help with binning, maybe even make it trivial.
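A hedged illustration of the "do ntile first, then binning is trivial" point above: if the n-tile cut points are already known, they can serve directly as bin boundaries, and binning reduces to counting values between consecutive cut points. Plain Java, hypothetical names:

// Hypothetical helper: count values falling between consecutive, sorted
// boundaries (e.g. n-tile cut points). Returns counts for the
// (boundaries.length - 1) intervals [b[i], b[i+1]); values above the top
// boundary are clamped into the last bin, values below the bottom are skipped.
public final class NtileBins {

    public static long[] countBetween(double[] values, double[] boundaries) {
        long[] counts = new long[boundaries.length - 1];
        for (double v : values) {
            int pos = java.util.Arrays.binarySearch(boundaries, v);
            int bin = (pos >= 0) ? pos : -pos - 2;            // interval containing v
            if (bin >= counts.length) bin = counts.length - 1; // v at or above top boundary
            if (bin >= 0) counts[bin]++;                       // skip v below bottom boundary
        }
        return counts;
    }
}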
>
> HBASE-5139 looks good, thanks. I will get colleagues to look at it and comment.
>
> Cheers,
> Royston
>
> On 6 Jan 2012, at 19:29, Ted Yu wrote:
>
>> Royston:
>> I need to brush up my math knowledge so bear with me for a few questions.
>>
>> For binning, you gave 100 as the number of bins. If the computation is
>> initiated on each region server simultaneously, how would each region know
>> where the bin boundaries are? If the boundaries are naturally aligned with
>> region boundaries, that would be easier.
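One hedged sketch of how that could fit together: if the bin boundaries are fixed up front (from a first pass, or from precomputed n-tiles), each region server only needs to return a partial count array for those shared boundaries, and the client merges them by element-wise addition, so regions never need to see each other's data. Plain Java, hypothetical names:

import java.util.List;

// Hypothetical client-side merge of per-region partial histograms computed
// against the same, pre-agreed bin boundaries.
public final class BinMerge {

    public static long[] merge(List<long[]> partialCounts, int numBins) {
        long[] total = new long[numBins];
        for (long[] partial : partialCounts) {
            for (int i = 0; i < numBins; i++) {
                total[i] += partial[i];
            }
        }
        return total;
    }
}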
>>
>> I logged HBASE-5139 for weighted median; please comment there.
>>
>> If you or other people feel there is a plausible implementation for any new
>> aggregate, please create a subtask so that the original JIRA can host general
>> discussions.
>>
>> Cheers
>>
>> On Fri, Jan 6, 2012 at 6:22 AM, Royston Sellman <
>> [EMAIL PROTECTED]> wrote:
>>
>>> Hi Ted,
>>>
>>> Yes, that is the use case I am thinking of.
>>>
>>> Re: 5123 I have also had some time to think about other aggregation
>>> functions (Please be aware that I am new to HBase, Coprocessors, and the
>>> Aggregation Protocol and I have little knowledge of distributed numerical
>>> algorithms!). It seems to me the pattern in AP is to return a SINGLE value
>>> from a SINGLE column (CF:CQ) of a table. In future one might wish to extend
>>> AP to return MULTIPLE values from MULTIPLE columns, so it is good to keep
>>> this in mind for the SINGLE value/SINGLE column (SVSC) case.
>>>
>>> So, common SVSC aggregation functions (AP supported first):
>>> min
>>> max
>>> sum
>>> count
>>> avg (arithmetic mean)
>>> std
>>> median
>>> mode
>>> quantile/ntile
>>> mult/product
>>>
>>> for column values of all numeric types, returning values of that type.
>>>
>>> Some thoughts on the future possibilities:
>>> An example of a future SINGLE value MULTIPLE column use case could be
>>> weighted versions of the above functions, i.e. a column of weights applied
>>> to the column of values and then the new aggregation derived.
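As a hedged illustration of the value-plus-weight case: a weighted median can be computed by sorting on the value column and walking the cumulative weight until half of the total weight is reached. Plain Java sketch with made-up names; it is not the HBASE-5139 implementation:

import java.util.Arrays;
import java.util.Comparator;

// Hypothetical weighted median over two parallel "columns": values and weights.
public final class WeightedMedian {

    public static double weightedMedian(double[] values, double[] weights) {
        // Sort indices by value so the two arrays stay aligned.
        Integer[] order = new Integer[values.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> values[i]));

        double totalWeight = 0;
        for (double w : weights) totalWeight += w;

        // Walk values in ascending order until half the total weight is covered.
        double cumulative = 0;
        for (int idx : order) {
            cumulative += weights[idx];
            if (cumulative >= totalWeight / 2) {
                return values[idx];
            }
        }
        throw new IllegalArgumentException("empty input");
    }
}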