HBase, mail # user - aggregation performance

Pere Ferrera 2012-05-03, 12:08
Tom Brown 2012-05-03, 15:01
Re: aggregation performance
James Taylor 2012-05-03, 17:02
We've seen reasonable performance, with the caveat that you need to
parallelize the scan doing the aggregation. In our benchmarking, we have
the client scan each region in parallel and have a coprocessor aggregate
the row count and return a single row back (with the client then
totaling the counts it gets back). Here are the numbers we've seen when
aggregating 1 million rows, on a slightly older HBase version:

Schema: 50 columns x 50 bytes, with compressible data
Regions     RowCount         RowCount with single binary filter
             Time (sec)       Time (sec)
   1          11.3             19.0
   4           3.5              5.6
  16           1.8              2.6
  32           1.2              1.8

Schema: 1 column x 2500 bytes, with compressible data
Regions     RowCount         RowCount with single binary filter
             Time (sec)       Time (sec)
   1           7.0              7.0
   4           1.2              1.2
  16           0.7              0.7
  32           0.3              0.3

This was run on a four-machine cluster, each machine having a 4 GB heap,
with the servers warmed up (data cached).
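
The scatter-gather pattern described above (one scan per region, a single
partial count returned per region, and the client totaling the partials) can
be sketched roughly as follows. This is a minimal simulation in plain Python,
not HBase client or coprocessor code: each "region" is just an in-memory list,
and the names count_region, parallel_row_count and row_filter are hypothetical,
chosen for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def count_region(rows, row_filter=None):
    """Stand-in for the per-region aggregation: return one partial count.

    Assumption: in a real deployment this work would run server-side,
    next to the data; here it is an ordinary function over a list.
    """
    if row_filter is None:
        return len(rows)
    return sum(1 for row in rows if row_filter(row))


def parallel_row_count(regions, row_filter=None):
    """Count every region concurrently, then total the partial counts."""
    if not regions:
        return 0
    # One worker per region, mirroring "scan each region in parallel".
    with ThreadPoolExecutor(max_workers=len(regions)) as pool:
        partials = pool.map(lambda rows: count_region(rows, row_filter), regions)
        return sum(partials)


# Hypothetical data: 4 "regions" of 250 rows each, with integer row ids.
regions = [list(range(start, start + 250)) for start in range(0, 1000, 250)]
total = parallel_row_count(regions)
even_only = parallel_row_count(regions, row_filter=lambda row: row % 2 == 0)
```

Only one small value per region crosses back to the caller, which is the
point of the approach: the amount of data returned stays constant per region
no matter how many rows each region holds.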

Hope this helps.

On 05/03/2012 08:01 AM, Tom Brown wrote:
> For our solution we are doing some aggregation on the server via
> coprocessors. In general, for each row there are 8 columns: 7 columns
> that contain numbers (for summation) and 1 column that contains a
> hyperloglog counter (about 700bytes). Functionally, this solution
> works well and ought to scale with the number of region servers.
> However, the individual request performance leaves a little to be
> desired. What we've seen is that to scan 40000 rows (aggregated into
> 3000 rows) takes about 4 seconds.
> Our code is in its early stages (unoptimized) so we hope to see some
> significant performance improvements when we run our coprocessor under
> a profiler. Our benchmarks were on underpowered machines (only 2 GB
> RAM) as well.
> Hope this helps!
> --Tom
> On Thu, May 3, 2012 at 6:08 AM, Pere Ferrera <[EMAIL PROTECTED]> wrote:
>> Hi,
>> Is anybody benchmarking the performance of server-side aggregations through
>> co-processors in HBase? I am interested to know if HBase could potentially
>> be used to calculate real-time SQL-like aggregations at a good level of
>> performance (q < 200 ms on high-load, big dataset scenario). Just curious to
>> know before I implement my own benchmarks.
>> Pere.
Himanshu Vashishtha 2012-05-03, 17:08