Re: Coprocessors vs MapReduce?
Andrew Purtell 2012-07-24, 19:05
On Tue, Jul 24, 2012 at 7:59 AM, Bertrand Dechoux <[EMAIL PROTECTED]> wrote:
> First, I thought coprocessors needed a restart but it seems a shell can be
> used to add/remove them without requiring a restart. However, at the moment
> the coprocessors are defined within a jar and cannot be dynamically created.
> Could you confirm that?
You can dynamically load new coprocessors by deploying a jarfile to
HDFS, using the shell to disable the table, add the coprocessor, and
then enable the table.
To remove a coprocessor from a table, you can use the shell to disable
the table, remove the coprocessor, and then enable the table again.
However, whatever was loaded by the JVM will remain resident until the
regionserver process is restarted.
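For illustration: as far as I know, the table attribute the shell sets
when you add a coprocessor is a single '|'-separated string made of the
jar path, the class name, a priority, and optional key=value arguments.
A minimal plain-Java sketch that only assembles such a string follows.
The helper, the jar path, the class name, and the argument are all
hypothetical; none of this is HBase API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class CoprocessorSpec {
    // Builds "jarPath|className|priority|k1=v1,k2=v2".
    // Hypothetical helper for illustration, not part of HBase.
    static String format(String jarPath, String className, int priority,
                         Map<String, String> args) {
        StringJoiner kv = new StringJoiner(",");
        for (Map.Entry<String, String> e : args.entrySet()) {
            kv.add(e.getKey() + "=" + e.getValue());
        }
        return jarPath + "|" + className + "|" + priority + "|" + kv;
    }

    public static void main(String[] unused) {
        // LinkedHashMap keeps the arguments in insertion order.
        Map<String, String> args = new LinkedHashMap<>();
        args.put("threshold", "10"); // hypothetical argument
        System.out.println(format("hdfs:///cp/example.jar",
                "com.example.ExampleObserver", 1001, args));
    }
}
```

The resulting string is what would go into the table attribute between
the disable and enable steps described above.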
> (I am thinking about the Cascading way of creating
> the implementation which will then be serialized, sent and executed.)
... as a MapReduce job.
Each MR job in Hadoop is really an individual submission of
application code to run on the cluster, each and every time.
In contrast, HBase coprocessors can be thought of like Linux loadable
kernel modules. You add them to your infrastructure. HBase becomes
more like an application deployment platform where the details of data
colocation with the application code at scale is handled for you
automatically, as is client side dispatch to the appropriate
An early design of coprocessors considered code shipping at request
time, but that doesn't fit the extension model above well.
But also consider that HBase is a short-request system. The latency of
processing each individual RPC is important and expected to be as short
as possible. For a table where you want to extend server side
function, imagine the overhead if that extension is shipped in every
request. Each RPC would be what, 10x or 100x larger? And there would be
the client side latency of figuring out the transitive closure of classes
to send up, and then server side latency of installing the bytecode
for execution and then removing it for GC.
> Second, I didn't see any way to give parameters to coprocessors. Is that
> really the case? If not, how would the parameters be handled?
A coprocessor can be an Observer, linked in to server side function.
Parameters are handed to your installed extension via upcall from
Or, a coprocessor can be an Endpoint. This is a dynamic RPC endpoint.
You can send up any parameter to an endpoint via Exec as long as HBase
RPC can serialize it.
For more information see:
> Third, I assume coprocessors are using the process/thread of the region
> server. Does that mean that, if multiple blocks need to be processed,
> MapReduce should be more efficient? Are there other ways to know whether
> coprocessors or MapReduce should be chosen?
Coprocessors operate on requests (RPCs), not blocks.
If you address a coprocessor request to the whole table, whatever
happens will happen on all regionservers in parallel. This is as far
as the similarity to MapReduce goes.
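To make the shape of that parallelism concrete, here is a rough
plain-Java analogy, not HBase client code. Each "region" is just a
hypothetical in-memory array, one task per region computes a partial
sum in parallel, and the caller folds each partial result in as it
completes. All names are made up for the sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ScatterGatherSketch {
    // Stand-in for server side work on one region: sum its values.
    static long regionSum(long[] regionValues) {
        long sum = 0;
        for (long v : regionValues) {
            sum += v;
        }
        return sum;
    }

    // Fan out one task per region, then combine partial results
    // in completion order rather than submission order.
    static long sumAcrossRegions(List<long[]> regions) throws Exception {
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.max(1, regions.size()));
        try {
            CompletionService<Long> cs = new ExecutorCompletionService<>(pool);
            for (long[] region : regions) {
                cs.submit(() -> regionSum(region));
            }
            long total = 0;
            for (int i = 0; i < regions.size(); i++) {
                total += cs.take().get(); // blocks until the next partial is ready
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<long[]> regions = Arrays.asList(
                new long[]{1, 2, 3}, new long[]{10, 20}, new long[]{100});
        System.out.println(sumAcrossRegions(regions)); // prints 136
    }
}
```

Unlike a MapReduce job, nothing here is scheduled or shuffled; it is
one request fanned out and the answers combined by the caller.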
Conceivably you could implement a map() and reduce() interface on top
of HBase using Coprocessors, but CPs themselves are a lower level
> Fourth, I know this is a really broad question but how would you compare
> coprocessors to YARN? I have yet to know more about both subjects but I
> feel that the concepts are not totally unrelated.
Coprocessors are a low level extension framework; YARN is a general
purpose high level cluster resource manager. Not in the same
> Lastly, this is an implementation detail but how does the client side wait for
> the results? Is it possible to perform early aggregation or does the client
> need to receive all the information before doing anything else?
> Ps : My two sources for that subject are for HBase 0.92 :
> * https://blogs.apache.org/hbase/entry/coprocessor_introduction
> * HBase The Definitive Guide.
Problems worthy of attack prove their worth by hitting back. - Piet
Hein (via Tom White)