HBase >> mail # user >> aggregation performance


Re: aggregation performance
We've seen reasonable performance, with the caveat that you need to
parallelize the scan doing the aggregation. In our benchmarking, the
client scans each region in parallel, and a coprocessor aggregates the
row count for its region and returns a single row (the client then
totals the counts it gets back). Here are the numbers we've seen when
aggregating 1 million rows, on a slightly older HBase version
(~0.92):

Schema: 50 columns x 50 bytes, compressible data
Regions   RowCount     RowCount with single binary filter
          time (sec)   time (sec)
      1     11.3         19.0
      4      3.5          5.6
     16      1.8          2.6
     32      1.2          1.8

Schema: 1 column x 2500 bytes, compressible data
Regions   RowCount     RowCount with single binary filter
          time (sec)   time (sec)
      1      7.0          7.0
      4      1.2          1.2
     16      0.7          0.7
     32      0.3          0.3
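
To make the scaling easier to read, the unfiltered RowCount times can be turned into speedups relative to the single-region run (for example, 11.3 / 1.2 is roughly 9.4x at 32 regions for the 50-column schema, and 7.0 / 0.3 is roughly 23x for the single-column schema). The snippet below is just that arithmetic; the class and method names are made up for illustration and are not part of HBase.

```java
// Illustrative arithmetic only; RegionSpeedup is a hypothetical helper class.
class RegionSpeedup {

    // Speedup relative to the single-region run: t(1 region) / t(n regions).
    static double speedup(double singleRegionSeconds, double seconds) {
        return singleRegionSeconds / seconds;
    }

    public static void main(String[] args) {
        int[] regions = {1, 4, 16, 32};
        // Unfiltered RowCount times in seconds, copied from the tables above.
        double[] wideSchema = {11.3, 3.5, 1.8, 1.2};   // 50 columns x 50 bytes
        double[] narrowSchema = {7.0, 1.2, 0.7, 0.3};  // 1 column x 2500 bytes

        for (int i = 0; i < regions.length; i++) {
            System.out.printf("%2d regions: %.1fx (50 columns), %.1fx (1 column)%n",
                    regions[i],
                    speedup(wideSchema[0], wideSchema[i]),
                    speedup(narrowSchema[0], narrowSchema[i]));
        }
    }
}
```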

These numbers are from a four-machine cluster, each machine with a 4 GB
heap, with the servers warmed up (data cached).
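
For anyone who wants the shape of the approach, here is a minimal sketch of the pattern described above: one counting task per region, run in parallel, with the client totaling the partial counts. It is a plain-Java simulation, not our benchmark code and not the HBase API. The "regions" are in-memory lists, and `countRegion`, `parallelRowCount`, and `splitIntoRegions` are hypothetical names standing in for the coprocessor call and the client logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Simulation only: each "region" is an in-memory slice of row keys.
class ParallelRegionCount {

    // Hypothetical stand-in for the per-region coprocessor: it returns a
    // single number (the row count) instead of shipping rows to the client.
    static long countRegion(List<String> regionRows) {
        long count = 0;
        for (String ignored : regionRows) {
            count++;
        }
        return count;
    }

    // Client side: submit one counting task per region, then total the
    // partial counts that come back.
    static long parallelRowCount(List<List<String>> regions) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, regions.size()));
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (List<String> region : regions) {
                Callable<Long> task = () -> countRegion(region);
                partials.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("region count failed", e);
        } finally {
            pool.shutdown();
        }
    }

    // Split totalRows synthetic row keys into regionCount contiguous slices.
    static List<List<String>> splitIntoRegions(int totalRows, int regionCount) {
        List<List<String>> regions = new ArrayList<>();
        for (int r = 0; r < regionCount; r++) {
            regions.add(new ArrayList<>());
        }
        for (int i = 0; i < totalRows; i++) {
            int r = (int) ((long) i * regionCount / totalRows);
            regions.get(r).add(String.format("row-%07d", i));
        }
        return regions;
    }

    public static void main(String[] args) {
        List<List<String>> regions = splitIntoRegions(100_000, 16);
        System.out.println(parallelRowCount(regions));
    }
}
```

The point of the design is that each region sends back one small value, so the client's work is a handful of additions regardless of how many rows were scanned.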

Hope this helps.

     James
On 05/03/2012 08:01 AM, Tom Brown wrote:
> For our solution we are doing some aggregation on the server via
> coprocessors. In general, for each row there are 8 columns: 7 columns
> that contain numbers (for summation) and 1 column that contains a
> HyperLogLog counter (about 700 bytes). Functionally, this solution
> works well and ought to scale with the number of region servers.
> However, the individual request performance leaves a little to be
> desired. What we've seen is that scanning 40,000 rows (aggregated into
> 3,000 rows) takes about 4 seconds.
>
> Our code is in its early stages (unoptimized), so we hope to see some
> significant performance improvements when we run our coprocessor under
> a profiler. Our benchmarks were also on underpowered machines (only
> 2 GB RAM).
>
> Hope this helps!
>
> --Tom
>
> On Thu, May 3, 2012 at 6:08 AM, Pere Ferrera <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> Is anybody benchmarking the performance of server-side aggregations through
>> coprocessors in HBase? I am interested to know if HBase could potentially
>> be used to calculate real-time SQL-like aggregations at a good level of
>> performance (q < 200 ms in a high-load, big-dataset scenario). Just curious
>> to know before I implement my own benchmarks.
>>
>> Pere.