

Re: median aggregate Was: AggregateProtocol Help
This is a good summary.

Do you mind putting what you wrote below on HBASE-5123?

Thanks

On Jan 6, 2012, at 6:22 AM, Royston Sellman <[EMAIL PROTECTED]> wrote:

> Hi Ted,
>
> Yes, that is the use case I am thinking of.
>
> Re: 5123: I have also had some time to think about other aggregation functions. (Please be aware that I am new to HBase, Coprocessors, and the Aggregation Protocol, and I have little knowledge of distributed numerical algorithms!) It seems to me the pattern in AP is to return a SINGLE value from a SINGLE column (CF:CQ) of a table. In future one might wish to extend AP to return MULTIPLE values from MULTIPLE columns, so it is good to keep this in mind for the SINGLE value/SINGLE column (SVSC) case.
>
> So, common SVSC aggregation functions (those AP already supports listed first):
> min
> max
> sum
> count
> avg (arithmetic mean)
> std
> median
> mode
> quantile/ntile
> mult/product
>
> for column values of all numeric types, returning values of that type.
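As a rough sketch of why the first few functions in this list fit the single-value pattern: each region can return a small partial result, and the client merges the partials. The class and method names below are hypothetical and are not part of AggregationProtocol or AggregationClient; plain arrays stand in for the regions of a table.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: this is not the HBase aggregation API.
final class PartialAggregate {
    long count;
    double sum;
    double sumOfSquares;
    double min = Double.POSITIVE_INFINITY;
    double max = Double.NEGATIVE_INFINITY;

    // What a single region would compute over its own slice of the column.
    static PartialAggregate ofRegion(double[] values) {
        PartialAggregate p = new PartialAggregate();
        for (double v : values) {
            p.count++;
            p.sum += v;
            p.sumOfSquares += v * v;
            p.min = Math.min(p.min, v);
            p.max = Math.max(p.max, v);
        }
        return p;
    }

    // What the client would do with the partial results from all regions.
    static PartialAggregate merge(List<PartialAggregate> parts) {
        PartialAggregate total = new PartialAggregate();
        for (PartialAggregate p : parts) {
            total.count += p.count;
            total.sum += p.sum;
            total.sumOfSquares += p.sumOfSquares;
            total.min = Math.min(total.min, p.min);
            total.max = Math.max(total.max, p.max);
        }
        return total;
    }

    double avg() {
        return sum / count;
    }

    // Population standard deviation derived from the merged sums.
    double std() {
        double mean = avg();
        return Math.sqrt(Math.max(0.0, sumOfSquares / count - mean * mean));
    }

    public static void main(String[] args) {
        PartialAggregate total = merge(Arrays.asList(
                ofRegion(new double[] {1, 2, 3}),
                ofRegion(new double[] {4, 5})));
        System.out.println("count=" + total.count + " avg=" + total.avg()
                + " std=" + total.std());
    }
}
```

Median, mode and quantiles cannot be merged from fixed-size per-region partials like this, which is one reason they would need a different approach.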
>
> Some thoughts on the future possibilities:
> An example of a future SINGLE value MULTIPLE column use case could be weighted versions of the above functions, i.e. a column of weights is applied to the column of values and the new aggregate is derived from the result.
> (note: there is a very good description of Weighted Median in the R language documentation:
> http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/R.basic/html/weighted.median.html)
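To make the weighted case concrete, here is a minimal single-machine sketch of one simple definition of the weighted median: sort by value and return the first value at which the cumulative weight reaches half of the total weight. This is an assumption for illustration; the R documentation linked above also describes interpolation and tie handling that this sketch ignores. The class name is hypothetical and weights are assumed to be non-negative.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration: "lower" weighted median without interpolation.
final class WeightedMedianSketch {
    static double weightedMedian(double[] values, double[] weights) {
        if (values.length == 0 || values.length != weights.length) {
            throw new IllegalArgumentException("need equally sized, non-empty columns");
        }
        // Sort row indices by value so each weight stays paired with its value.
        Integer[] order = new Integer[values.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble(i -> values[i]));

        double totalWeight = 0.0;
        for (double w : weights) {
            totalWeight += w;
        }
        // Walk the values in ascending order until half the weight is covered.
        double cumulative = 0.0;
        for (int idx : order) {
            cumulative += weights[idx];
            if (cumulative >= totalWeight / 2.0) {
                return values[idx];
            }
        }
        return values[order[order.length - 1]];
    }

    public static void main(String[] args) {
        double[] values = {10, 20, 30};
        double[] weights = {1, 1, 10};
        System.out.println(weightedMedian(values, weights));
    }
}
```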
>
> An example of future MULTIPLE value SINGLE column could be range: return all rows with a column value between two values. Maybe this is a bad example because there could be better HBase ways to do it with filters/scans at a higher level. Perhaps binning is a better example? i.e. return an array containing values derived from applying one of the SVSC functions to a binned column, e.g.:
> int bins = 100;
> aClient.sum(table, ci, scan, bins); => {12.3, 14.5...}
> Another example (common in several programming languages) is to map an arbitrary function over a column and return the new vector. Of course, again this may be a bad example in the case of long HBase columns but it seems like an appropriate thing to do with coprocessors.
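The binning idea could look roughly like the sketch below. The message does not say how a column would be binned, so this assumes the scanned values are split, in scan order, into `bins` consecutive groups of near-equal size, with a sum returned per group. The class and method are hypothetical and a plain array stands in for the scanned column; this is not an existing AggregationClient method.

```java
import java.util.Arrays;

// Hypothetical illustration: per-bin sums over values taken in scan order.
final class BinnedSumSketch {
    static double[] binnedSum(double[] columnValues, int bins) {
        if (bins <= 0) {
            throw new IllegalArgumentException("bins must be positive");
        }
        double[] sums = new double[bins];
        int n = columnValues.length;
        for (int i = 0; i < n; i++) {
            // Map position i in [0, n) to a bin index in [0, bins).
            int bin = (int) ((long) i * bins / n);
            sums[bin] += columnValues[i];
        }
        return sums;
    }

    public static void main(String[] args) {
        double[] column = {1, 2, 3, 4, 5, 6};
        System.out.println(Arrays.toString(binnedSum(column, 3)));
    }
}
```

Binning by value range instead of by position would be an equally plausible reading, and would need the column minimum and maximum before the sums could be assigned to bins.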
>
> MULTIPLE value MULTIPLE column examples are common in spatial data processing but I see there has been a lot of spatial/GIS discussion around HBase which I have not read yet. So I'll keep quiet for now.
>
> I hope these thoughts strike a balance between my (special interest) use case of statistical/spatial functions on tables and general purpose (but coprocessor enabled/regionserver distributed) HBase.
>
> Best regards,
> Royston
>
>
> On 6 Jan 2012, at 03:31, Ted Yu wrote:
>
>> Royston:
>> For the median aggregate, is the following what you're looking for?
>> Find the median among the values of all the KeyValues for the
>> cf:qualifier column.
>>
>> There is a well known distributed method of computing median that involves
>> multiple roundtrips (to the region servers).
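The message does not say which method is meant. One possible multi-roundtrip scheme, sketched below for long values, is a binary search on the value: each roundtrip asks every region how many of its values are at most a pivot, and the client narrows the pivot until it reaches the value of the desired rank. All names are hypothetical, and nested arrays stand in for region servers.

```java
// Hypothetical illustration: median by binary search over the value range.
final class DistributedMedianSketch {
    // Stand-in for one roundtrip: each "region" counts its values <= pivot.
    static long countAtMost(long[][] regions, long pivot) {
        long count = 0;
        for (long[] region : regions) {
            for (long v : region) {
                if (v <= pivot) {
                    count++;
                }
            }
        }
        return count;
    }

    // Returns the lower median: the value of rank ceil(n / 2) in sorted order.
    static long lowerMedian(long[][] regions) {
        // First roundtrip: total count plus global min and max.
        long total = 0;
        long lo = Long.MAX_VALUE;
        long hi = Long.MIN_VALUE;
        for (long[] region : regions) {
            for (long v : region) {
                total++;
                lo = Math.min(lo, v);
                hi = Math.max(hi, v);
            }
        }
        if (total == 0) {
            throw new IllegalArgumentException("no values");
        }
        long rank = (total + 1) / 2;
        // Find the smallest value whose "count at most" reaches the rank.
        while (lo < hi) {
            // Floor of the average of lo and hi, computed without overflow.
            long mid = (lo & hi) + ((lo ^ hi) >> 1);
            if (countAtMost(regions, mid) >= rank) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        long[][] regions = {{5, 1, 9}, {3, 7}};
        System.out.println(lowerMedian(regions));
    }
}
```

With 64-bit values this needs at most about 64 counting roundtrips regardless of table size, and each region only ever returns a single number per roundtrip.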
>>
>> Just want to confirm the use case.
>>
>> Thanks
>>
>> On Wed, Jan 4, 2012 at 10:57 AM, Royston Sellman <
>> [EMAIL PROTECTED]> wrote:
>>
>>> Great ideas. Thanks.
>>>
>>> w.r.t. 5123: I'll think about it for a day or two then make some comments.
>>>
>>> 5122 is very desirable.
>>>
>>> Best Regards,
>>> Royston
>>>
>>> On 4 Jan 2012, at 15:01, Ted Yu wrote:
>>>
>>>> I can see room for improvement w.r.t. ColumnInterpreters.
>>>> I logged two JIRAs:
>>>> https://issues.apache.org/jira/browse/HBASE-5122 is for loading
>>>> ColumnInterpreters dynamically
>>>>
>>>> https://issues.apache.org/jira/browse/HBASE-5123 is for adding more
>>>> aggregation functions.
>>>>
>>>> Royston:
>>>> Feel free to elaborate on 5123 and explain what the Mult aggregate should do.
>>>>
>>>> Cheers
>>>>
>>>> On Wed, Jan 4, 2012 at 3:43 AM, Royston Sellman <
>>>> [EMAIL PROTECTED]> wrote:
>>>>
>>>>> Ted, Himanshu and Gary,
>>>>>
>>>>> It works now! I re-created my HBase table to contain Bytes.toBytes(Long)
>>>>> values and that fixed it.
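For context, `Bytes.toBytes(long)` stores a long as 8 big-endian bytes, whereas the same number written as text has a different length and layout. The sketch below reproduces that encoding with only the JDK so it runs without HBase on the classpath. The class name is hypothetical, and the assumption that the long-based aggregation path expects exactly 8-byte cell values is inferred from the fix described here, not shown in this message.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration using the JDK only, no HBase classes.
final class LongCellEncodingSketch {
    // Same layout as an 8-byte big-endian long cell value.
    static byte[] encodeLong(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    // Rejects cells that are not exactly 8 bytes, e.g. numbers stored as text.
    static long decodeLong(byte[] cell) {
        if (cell.length != Long.BYTES) {
            throw new IllegalArgumentException("expected 8 bytes, got " + cell.length);
        }
        return ByteBuffer.wrap(cell).getLong();
    }

    public static void main(String[] args) {
        byte[] asLong = encodeLong(42L);
        byte[] asText = "42".getBytes(StandardCharsets.UTF_8);
        System.out.println("long cell bytes: " + asLong.length);
        System.out.println("text cell bytes: " + asText.length);
        System.out.println("decoded: " + decodeLong(asLong));
    }
}
```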