HBase >> mail # user >> aggregation performance


Pere Ferrera 2012-05-03, 12:08
Tom Brown 2012-05-03, 15:01
James Taylor 2012-05-03, 17:02
Re: aggregation performance
I ran some experiments that compare the scan, coprocessor, and MapReduce
approaches in an EC2 environment.
You may find it interesting:
http://hbase-coprocessor-experiments.blogspot.com/2011/05/extending.html

Thanks,
Himanshu

On Thu, May 3, 2012 at 11:02 AM, James Taylor <[EMAIL PROTECTED]> wrote:
> We've seen reasonable performance, with the caveat that you need to
> parallelize the scan doing the aggregation. In our benchmarking, we have the
> client scan each region in parallel and have a coprocessor aggregate the row
> count and return a single row back (with the client then totaling the counts
> it gets back). Here are the numbers we've seen when aggregating 1 million
> rows, using a slightly older HBase version (~0.92):
>
> Schema: 50 columns x 50 bytes, compressible data
> Regions   RowCount     RowCount with single binary filter
>           Time (sec)   Time (sec)
>  1        11.3         19.0
>  4         3.5          5.6
> 16         1.8          2.6
> 32         1.2          1.8
>
> Schema: 1 column x 2500 bytes, compressible data
> Regions   RowCount     RowCount with single binary filter
>           Time (sec)   Time (sec)
>  1         7.0          7.0
>  4         1.2          1.2
> 16         0.7          0.7
> 32         0.3          0.3
>
> This was run on a four-machine cluster, each machine with a 4 GB heap, and
> with the servers warmed up (data cached).
>
> Hope this helps.
>
>    James
>
>
>
> On 05/03/2012 08:01 AM, Tom Brown wrote:
>>
>> For our solution we are doing some aggregation on the server via
>> coprocessors. In general, for each row there are 8 columns: 7 columns
>> that contain numbers (for summation) and 1 column that contains a
>> HyperLogLog counter (about 700 bytes). Functionally, this solution
>> works well and ought to scale with the number of region servers.
>> However, the individual request performance leaves a little to be
>> desired. What we've seen is that to scan 40000 rows (aggregated into
>> 3000 rows) takes about 4 seconds.
>>
>> Our code is in its early stages (unoptimized), so we hope to see some
>> significant performance improvements when we run our coprocessor under
>> a profiler. Our benchmarks were also run on underpowered machines (only
>> 2 GB of RAM).
>>
>> Hope this helps!
>>
>> --Tom
>>
>> On Thu, May 3, 2012 at 6:08 AM, Pere Ferrera <[EMAIL PROTECTED]> wrote:
>>>
>>> Hi,
>>>
>>> Is anybody benchmarking the performance of server-side aggregations
>>> through co-processors in HBase? I am interested to know if HBase could
>>> potentially be used to calculate real-time SQL-like aggregations at a
>>> good level of performance (q < 200 ms in a high-load, big-dataset
>>> scenario). Just curious to know before I implement my own benchmarks.
>>>
>>> Pere.
>
>
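
The fan-out pattern James describes above (the client queries every region in parallel, each region returns a single partial count, and the client totals them) can be sketched roughly as follows. This is only an illustration of the control flow, not HBase code: regions are simulated as in-memory lists, and the names `split_into_regions`, `count_region`, and `parallel_row_count` are hypothetical and do not come from the thread or from any HBase API. The optional `predicate` stands in for the "single binary filter" column of the benchmark.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_regions(row_keys, num_regions):
    """Split row keys into contiguous, roughly equal ranges (simulated regions)."""
    size, extra = divmod(len(row_keys), num_regions)
    regions, start = [], 0
    for i in range(num_regions):
        # The first `extra` regions take one additional row each.
        end = start + size + (1 if i < extra else 0)
        regions.append(row_keys[start:end])
        start = end
    return regions


def count_region(region, predicate=None):
    """Stand-in for the server-side step: one partial count per region."""
    if predicate is None:
        return len(region)
    return sum(1 for key in region if predicate(key))


def parallel_row_count(regions, predicate=None):
    """Client side: query all regions concurrently, then total the partial counts."""
    with ThreadPoolExecutor(max_workers=max(1, len(regions))) as pool:
        partials = list(pool.map(lambda r: count_region(r, predicate), regions))
    return sum(partials)
```

For example, `parallel_row_count(split_into_regions(keys, 16))` returns the same total as `len(keys)`, assembled from 16 partial counts. The threads here only mirror the structure of the approach; they say nothing about the timings quoted in the thread, which depend on the region servers doing the counting.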