HBase, mail # user - endpoint coprocessor performance


Re: endpoint coprocessor performance
Gary Helmling 2013-03-05, 01:42
>
> I'm running some experiments to understand where to use coprocessors. One
> interesting scenario is computing distinct values. I ran performance tests
> with two distinct value implementations: one using endpoint coprocessors,
> and one using just scans (computing distinct values client side only). I
> noticed that the endpoint coprocessor implementation averaged 80 ms slower
> than the scan implementation. Details of that are below for anyone
> interested.
>
> To drill into the performance, I instrumented the code and ultimately
> deployed a no-op endpoint coprocessor, to look at the overhead of simply
> calling it. I'm measuring around 100ms for calling my empty, no-op endpoint
> coprocessor.
>
>
100ms to do a single no-op coprocessor call seems very high.  Do you have
more details of where you see the code spending time?  Or even better, can
you post sample code somewhere?  Also, which version of HBase are you
testing with?
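
One way to narrow down where the time goes is to time the call repeatedly after a few warm-up invocations, since the first call may include one-time costs such as connection setup. The following is only a minimal sketch of such a harness: the class name `LatencySketch`, the method `measureNanos`, and the `noOpCall` placeholder are hypothetical, and the placeholder would need to be replaced with the actual client-side endpoint invocation.

```java
import java.util.Arrays;

final class LatencySketch {

    // Runs `call` a few times without timing it (warm-up), then times
    // `samples` further invocations and returns the elapsed times in
    // nanoseconds, sorted ascending.
    static long[] measureNanos(Runnable call, int warmups, int samples) {
        for (int i = 0; i < warmups; i++) {
            call.run();
        }
        long[] elapsed = new long[samples];
        for (int i = 0; i < samples; i++) {
            long start = System.nanoTime();
            call.run();
            elapsed[i] = System.nanoTime() - start;
        }
        Arrays.sort(elapsed);
        return elapsed;
    }

    public static void main(String[] args) {
        // Placeholder (assumption): substitute the real endpoint call here.
        Runnable noOpCall = () -> { };

        long[] sorted = measureNanos(noOpCall, 5, 50);
        System.out.printf("min=%.3f ms, median=%.3f ms, max=%.3f ms%n",
                sorted[0] / 1e6,
                sorted[sorted.length / 2] / 1e6,
                sorted[sorted.length - 1] / 1e6);
    }
}
```

Comparing the minimum and median against the first, untimed call should indicate whether the 100ms is a steady-state cost or mostly start-up overhead.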

> I need to do more tests, but I believe my tests are leading me to similar
> conclusions drawn here:
> http://hbase-coprocessor-experiments.blogspot.com/2011/05/extending.html
>
> I.e. if the query/scan is selective enough (I'll go out on a limb and
> estimate 50-100 rows), then it's better to just perform a scan and compute
> client side. Endpoint coprocessors will make sense for larger result sets
> and/or scans that hit multiple regions.
>
>
I would certainly agree with this.  Coprocessor endpoints are not a
replacement for the regular HBase client APIs.  They're really meant to
allow you to extend HBase with new capabilities.  Coprocessor endpoints
will allow you to parallelize operations across multiple regions, which can
be a powerful capability if you need it, or will allow you to maintain some
pre-computed state server-side and then easily retrieve it from the client.
If you're scanning larger amounts of data and computing a much smaller
result, endpoints will also avoid transferring the full data set over the
network back to the client, but you'll still need to scan through the data
server-side.  In your case, are you applying the same scan options in the
coprocessor (start/end row, any filtering)?
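
To make the trade-off concrete, here is a small self-contained model of the two distinct-value strategies. It deliberately does not use the HBase API: regions are modelled as in-memory lists, and the class and method names (`DistinctSketch`, `clientSideDistinct`, `endpointStyleDistinct`, and the two `rowsShipped...` helpers) are hypothetical. It is a sketch of the idea only, assuming each region can deduplicate its own rows independently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class DistinctSketch {

    // Scan-style: every value from every region reaches the client,
    // which then deduplicates.
    static Set<String> clientSideDistinct(List<List<String>> regions) {
        Set<String> distinct = new TreeSet<>();
        for (List<String> region : regions) {
            distinct.addAll(region);
        }
        return distinct;
    }

    // Endpoint-style: each region deduplicates locally and in parallel;
    // the client only merges the per-region partial results.
    static Set<String> endpointStyleDistinct(List<List<String>> regions) {
        if (regions.isEmpty()) {
            return new TreeSet<>();
        }
        ExecutorService pool = Executors.newFixedThreadPool(regions.size());
        try {
            List<Future<Set<String>>> partials = new ArrayList<>();
            for (List<String> region : regions) {
                Callable<Set<String>> perRegion = () -> new TreeSet<>(region);
                partials.add(pool.submit(perRegion));
            }
            Set<String> merged = new TreeSet<>();
            for (Future<Set<String>> partial : partials) {
                merged.addAll(partial.get());
            }
            return merged;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    // Number of values sent to the client when scanning everything.
    static int rowsShippedClientSide(List<List<String>> regions) {
        int total = 0;
        for (List<String> region : regions) {
            total += region.size();
        }
        return total;
    }

    // Number of values sent to the client when each region returns
    // only its own distinct set.
    static int rowsShippedEndpointStyle(List<List<String>> regions) {
        int total = 0;
        for (List<String> region : regions) {
            total += new TreeSet<>(region).size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<String>> regions = Arrays.asList(
                Arrays.asList("a", "b", "a", "c"),
                Arrays.asList("b", "b", "d"),
                Arrays.asList("a", "d", "d", "d"));
        System.out.println(clientSideDistinct(regions));
        System.out.println(endpointStyleDistinct(regions));
        System.out.println(rowsShippedClientSide(regions)
                + " vs " + rowsShippedEndpointStyle(regions));
    }
}
```

With this sample data both strategies produce the same four distinct values, but the scan-style path ships 11 values to the client while the endpoint-style path ships 7. For a small, selective scan that saving is unlikely to outweigh the fixed cost of the extra call, which fits the observation above.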

> Before going too far, I wanted to check if anyone in this group has
> suggestions. I.e. perhaps there are just some configuration options I've
> not uncovered. Does this 100ms latency sound correct?
>

It would help to have more details of what your code is actually doing.
Can you post an extract of what's running in the coprocessor?
--gh