Re: median aggregate Was: AggregateProtocol Help
This is a good summary.

Do you mind putting what you wrote below on HBASE-5123?


On Jan 6, 2012, at 6:22 AM, Royston Sellman <[EMAIL PROTECTED]> wrote:

> Hi Ted,
> Yes, that is the use case I am thinking of.
> Re: 5123 I have also had some time to think about other aggregation functions (Please be aware that I am new to HBase, Coprocessors, and the Aggregation Protocol and I have little knowledge of distributed numerical algorithms!). It seems to me the pattern in AP is to return a SINGLE value from a SINGLE column (CF:CQ) of a table. In future one might wish to extend AP to return MULTIPLE values from MULTIPLE columns, so it is good to keep this in mind for the SINGLE value/SINGLE column (SVSC) case.
> So, common SVSC aggregation functions (AP supported first):
> min
> max
> sum
> count
> avg (arithmetic mean)
> std
> median
> mode
> quantile/ntile
> mult/product
> for column values of all numeric types, returning values of that type.
> Some thoughts on the future possibilities:
> An example of a future SINGLE value MULTIPLE column use case could be weighted versions of the above functions, i.e. a column of weights is applied to the column of values and the aggregate is then derived from the weighted values.
> (note: there is a very good description of Weighted Median in the R language documentation:
> http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/R.basic/html/weighted.median.html)
> An example of future MULTIPLE value SINGLE column could be range: return all rows with a column value between two values. Maybe this is a bad example because there could be better HBase ways to do it with filters/scans at a higher level. Perhaps binning is a better example? i.e. return an array containing values derived from applying one of the SVSC functions to a binned column e.g:
> int bins = 100;
> aClient.sum(table, ci, scan, bins); => {12.3, 14.5...}
> Another example (common in several programming languages) is to map an arbitrary function over a column and return the new vector. Of course, again this may be a bad example in the case of long HBase columns but it seems like an appropriate thing to do with coprocessors.
> MULTIPLE value MULTIPLE column examples are common in spatial data processing but I see there has been a lot of spatial/GIS discussion around HBase which I have not read yet. So I'll keep quiet for now.
> I hope these thoughts strike a balance between my (special interest) use case of statistical/spatial functions on tables and general purpose (but coprocessor enabled/regionserver distributed) HBase.
> Best regards,
> Royston
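The binning idea in the message above can be prototyped outside HBase. The following is a minimal sketch only, assuming that "bins" means equal-width intervals over a value range [min, max] known in advance (the message does not say how bins are defined). The names BinnedSumSketch, partialSums and merge are hypothetical and are not part of AggregationClient. Each region would compute one partial array, and the client would add the arrays element by element.

```java
/** Sketch only: per-bin sums that can be computed per region and merged on the client. */
final class BinnedSumSketch {

    /**
     * Sums the values falling into each of `bins` equal-width intervals over [min, max].
     * Stands in for the work one region would do; values outside the range are skipped.
     */
    static double[] partialSums(double[] values, double min, double max, int bins) {
        if (bins <= 0 || max < min) {
            throw new IllegalArgumentException("need bins > 0 and max >= min");
        }
        double[] sums = new double[bins];
        double width = (max - min) / bins;
        for (double v : values) {
            if (v < min || v > max) {
                continue; // outside the binned range
            }
            int i = (width == 0) ? 0 : (int) ((v - min) / width);
            if (i >= bins) {
                i = bins - 1; // v == max belongs to the last bin
            }
            sums[i] += v;
        }
        return sums;
    }

    /** Client-side merge: element-wise addition of two partial results. */
    static double[] merge(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("bin counts differ");
        }
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] + b[i];
        }
        return out;
    }
}
```

Because per-bin sums are additive, the merge step needs no further round trips; the same shape would work for count, min and max, but not directly for median.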
> On 6 Jan 2012, at 03:31, Ted Yu wrote:
>> Royston:
>> For the median aggregate, is the following what you're looking for?
>> Find the median among the values of all the KeyValues for the cf:qualifier
>> column.
>> There is a well known distributed method of computing median that involves
>> multiple roundtrips (to the region servers).
>> Just want to confirm the use case.
>> Thanks
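The message above does not say which multi-round-trip method is meant. One possible approach, sketched here as an assumption rather than as the method Ted has in mind, is bisection over the value range: the client picks a pivot, asks every region how many values are at most the pivot, sums the counts, and narrows the range. Regions are simulated as plain long arrays, and DistributedMedianSketch, countAtMost and lowerMedian are hypothetical names, not HBase APIs.

```java
import java.util.List;

/** Sketch only: median by bisection, one simulated round trip per loop iteration. */
final class DistributedMedianSketch {

    /** Stand-in for a region-side call: number of values in this region that are <= pivot. */
    static long countAtMost(long[] regionValues, long pivot) {
        long n = 0;
        for (long v : regionValues) {
            if (v <= pivot) {
                n++;
            }
        }
        return n;
    }

    /** Returns the lower median, i.e. the k-th smallest value with k = ceil(total / 2). */
    static long lowerMedian(List<long[]> regions) {
        long total = 0;
        long lo = Long.MAX_VALUE;
        long hi = Long.MIN_VALUE;
        // First pass: global count, min and max (these aggregates are already distributable).
        for (long[] r : regions) {
            total += r.length;
            for (long v : r) {
                lo = Math.min(lo, v);
                hi = Math.max(hi, v);
            }
        }
        if (total == 0) {
            throw new IllegalArgumentException("no values");
        }
        long k = (total + 1) / 2;
        // Invariant: the answer lies in [lo, hi].
        while (lo < hi) {
            long mid = (lo & hi) + ((lo ^ hi) >> 1); // floor of the average, overflow-safe
            long atMost = 0;
            for (long[] r : regions) {
                atMost += countAtMost(r, mid); // one request per region in a real deployment
            }
            if (atMost >= k) {
                hi = mid;      // k-th smallest is <= mid
            } else {
                lo = mid + 1;  // k-th smallest is > mid
            }
        }
        return lo;
    }
}
```

For 64-bit values this needs at most about 64 rounds regardless of table size, and each region returns only a single count per round, which is the trade-off against shipping all values to the client.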
>> On Wed, Jan 4, 2012 at 10:57 AM, Royston Sellman <
>> [EMAIL PROTECTED]> wrote:
>>> Great ideas. Thanks.
>>> w.r.t. 5123: I'll think about it for a day or two then make some comments.
>>> 5122 is very desirable.
>>> Best Regards,
>>> Royston
>>> On 4 Jan 2012, at 15:01, Ted Yu wrote:
>>>> I can see room for improvement w.r.t. ColumnInterpreters
>>>> I logged two JIRAs:
>>>> https://issues.apache.org/jira/browse/HBASE-5122 is for loading
>>>> ColumnInterpreters dynamically
>>>> https://issues.apache.org/jira/browse/HBASE-5123 is for adding more
>>>> aggregation functions.
>>>> Royston:
>>>> Feel free to elaborate on 5123 and explain what Mult aggregate should do.
>>>> Cheers
>>>> On Wed, Jan 4, 2012 at 3:43 AM, Royston Sellman <
>>>> [EMAIL PROTECTED]> wrote:
>>>>> Ted, Himanshu and Gary,
>>>>> It works now! I re-created my HBase table to contain Bytes.toBytes(Long)
>>>>> values and that fixed it.
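The message does not say how the values were stored before the fix. As an illustration only, the sketch below uses the JDK's ByteBuffer to show the 8-byte big-endian layout that Bytes.toBytes(long) produces, next to a text encoding of the same number, which has a different length and cannot be read back as a long. LongEncodingSketch and its methods are hypothetical helper names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Sketch only: fixed-width long encoding versus a text encoding of the same number. */
final class LongEncodingSketch {

    /** 8-byte big-endian encoding (ByteBuffer's default byte order is big-endian). */
    static byte[] encodeLong(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    /** Decodes an 8-byte big-endian value; rejects anything of a different length. */
    static long decodeLong(byte[] b) {
        if (b.length != Long.BYTES) {
            throw new IllegalArgumentException("expected 8 bytes, got " + b.length);
        }
        return ByteBuffer.wrap(b).getLong();
    }

    /** The same number stored as decimal text, e.g. 42 becomes the two bytes '4' and '2'. */
    static byte[] encodeAsText(long v) {
        return Long.toString(v).getBytes(StandardCharsets.UTF_8);
    }
}
```

Under that assumption, a reader expecting 8-byte values would reject the text form outright, which is consistent with re-creating the table being the fix.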