HBase >> mail # user >> aggregation performance


Re: aggregation performance
I ran some experiments comparing the scan, coprocessor, and MapReduce
approaches in an EC2 environment.
You may find it interesting:
http://hbase-coprocessor-experiments.blogspot.com/2011/05/extending.html

Thanks,
Himanshu

On Thu, May 3, 2012 at 11:02 AM, James Taylor <[EMAIL PROTECTED]> wrote:
> We've seen reasonable performance, with the caveat that you need to
> parallelize the scan doing the aggregation. In our benchmarking, the client
> scans each region in parallel and a coprocessor aggregates the row count and
> returns a single row (the client then totals the counts it gets back). Here
> are the numbers we've seen when aggregating 1 million rows, on a slightly
> older HBase version (~0.92):
>
> Schema: 50 columns x 50 bytes, compressible data
> Regions    RowCount      RowCount with single binary filter
>            time (sec)    time (sec)
>       1    11.3          19.0
>       4     3.5           5.6
>      16     1.8           2.6
>      32     1.2           1.8
>
> Schema: 1 column x 2500 bytes, compressible data
> Regions    RowCount      RowCount with single binary filter
>            time (sec)    time (sec)
>       1     7.0           7.0
>       4     1.2           1.2
>      16     0.7           0.7
>      32     0.3           0.3
>
> This was run on a four-machine cluster, each machine with a 4 GB heap, and
> with the servers warmed up (data cached).
>
> Hope this helps.
>
>    James
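
The pattern James describes (one scan per region in parallel, a per-region count computed next to the data, and the client totaling the partial counts) can be sketched without any HBase dependency. This is only a minimal simulation: `split_into_regions`, `count_region`, and `parallel_row_count` are hypothetical names, regions are plain Python lists, and the optional `row_filter` merely stands in for the "single binary filter" case. It is not the HBase coprocessor API.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_regions(row_keys, num_regions):
    """Partition sorted row keys into contiguous, roughly equal 'regions'."""
    keys = sorted(row_keys)
    size, extra = divmod(len(keys), num_regions)
    regions, start = [], 0
    for i in range(num_regions):
        end = start + size + (1 if i < extra else 0)
        regions.append(keys[start:end])
        start = end
    return regions


def count_region(region_rows, row_filter=None):
    """Stand-in for the server-side step: reduce one region to a single count."""
    if row_filter is None:
        return len(region_rows)
    return sum(1 for key in region_rows if row_filter(key))


def parallel_row_count(regions, row_filter=None):
    """Client side: one task per region, then total the partial counts."""
    if not regions:
        return 0
    with ThreadPoolExecutor(max_workers=len(regions)) as pool:
        partials = list(pool.map(lambda r: count_region(r, row_filter), regions))
    return sum(partials)


if __name__ == "__main__":
    keys = [f"row-{i:07d}" for i in range(100_000)]  # hypothetical row keys
    regions = split_into_regions(keys, 16)
    print(parallel_row_count(regions))
    print(parallel_row_count(regions, lambda k: k.endswith("0")))
```

Only one number per region crosses back to the client, which is why the approach scales with the number of regions scanned in parallel rather than with the number of rows returned.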
>
>
>
> On 05/03/2012 08:01 AM, Tom Brown wrote:
>>
>> For our solution we are doing some aggregation on the server via
>> coprocessors. In general, for each row there are 8 columns: 7 columns
>> that contain numbers (for summation) and 1 column that contains a
>> HyperLogLog counter (about 700 bytes). Functionally, this solution
>> works well and ought to scale with the number of region servers.
>> However, the individual request performance leaves a little to be
>> desired. What we've seen is that scanning 40,000 rows (aggregated into
>> 3,000 rows) takes about 4 seconds.
>>
>> Our code is in its early stages (unoptimized), so we hope to see some
>> significant performance improvements when we run our coprocessor under
>> a profiler. Our benchmarks were also run on underpowered machines (only
>> 2 GB of RAM).
>>
>> Hope this helps!
>>
>> --Tom
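
The kind of aggregation Tom describes (seven numeric columns summed, plus one HyperLogLog counter per row) can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration: the row layout `(key, numbers, registers)`, the `group_key` function, and the representation of the counter as a list of register values. The only standard fact relied on is that two HyperLogLog counters with the same register count are unioned by taking the register-wise maximum. Tom's actual coprocessor code is not shown in the thread.

```python
from collections import defaultdict

NUM_SUM_COLUMNS = 7  # from the thread: 7 numeric columns per row


def merge_hll(a, b):
    """Union two HyperLogLog register arrays by taking the register-wise max."""
    if len(a) != len(b):
        raise ValueError("register arrays must have the same length")
    return [max(x, y) for x, y in zip(a, b)]


def aggregate(rows, group_key):
    """Fold (key, numbers, registers) rows into one output row per group."""
    sums = defaultdict(lambda: [0] * NUM_SUM_COLUMNS)
    counters = {}
    for key, numbers, registers in rows:
        if len(numbers) != NUM_SUM_COLUMNS:
            raise ValueError("expected 7 numeric columns per row")
        group = group_key(key)  # hypothetical grouping rule
        totals = sums[group]
        for i, value in enumerate(numbers):
            totals[i] += value
        if group in counters:
            counters[group] = merge_hll(counters[group], registers)
        else:
            counters[group] = list(registers)
    return {group: (sums[group], counters[group]) for group in sums}


if __name__ == "__main__":
    rows = [
        ("a|1", [1] * 7, [0, 2, 1]),
        ("a|2", [2] * 7, [3, 0, 1]),
        ("b|1", [5] * 7, [1, 1, 1]),
    ]
    for group, (totals, registers) in aggregate(rows, lambda k: k.split("|")[0]).items():
        print(group, totals, registers)
```

Both operations are associative, so partial results computed per region can be combined again on the client in the same way.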
>>
>> On Thu, May 3, 2012 at 6:08 AM, Pere Ferrera <[EMAIL PROTECTED]> wrote:
>>>
>>> Hi,
>>>
>>> Is anybody benchmarking the performance of server-side aggregations
>>> through coprocessors in HBase? I am interested to know whether HBase
>>> could be used to calculate real-time, SQL-like aggregations at a good
>>> level of performance (q < 200 ms in a high-load, big-dataset scenario).
>>> Just curious to know before I implement my own benchmarks.
>>>
>>> Pere.
>
>