Re: Coprocessors vs MapReduce?
Andrew Purtell 2012-07-25, 17:28
Answers inline below.
On Wed, Jul 25, 2012 at 1:09 AM, Bertrand Dechoux <[EMAIL PROTECTED]> wrote:
> As Andrew pointed out, Cascading is indeed for MapReduce. I know the use
> case was discussed, I wanted to know what was the state now. (The blog
> entry is from 2010.) The use case is simple. I am doing log analysis and
> would like to perform fast aggregations. These aggregations are common
> (count/sum/average) but what is exactly aggregated will depend on the type
> of logs. With Apache, HTTP error codes may be interesting. With other logs,
> request durations could be. I (well, any user) would like to be able to
> reuse common code and write the coprocessors in a concise (domain
> specific) language.
Implementing DSLs on top of Java is at best problematic. However, if
someone were willing and able to do the work, the coprocessor
environments could be plumbed to Clojure, Scala, or any other
language that targets the JVM, can efficiently translate between
its native types and Java types, and is better suited to building DSLs.
> So I was checking not to miss any new
> projects/development on that idea. From what I understand of your answers,
> implementing a kind of Cascading for coprocessors may be possible but not
> done and may not really be pertinent/safe/efficient with the current
> architecture of coprocessors.
Actually we have considered creating a "Cascading for Coprocessors":
The difference is how code shipping up to the cluster would work. It
would not be like MapReduce where each job is a one-shot code
deployment. That doesn't mean that you cannot install coprocessors and
then map flows over them (via Exec).
> I forgot that the shell still requires the table to be offline. Thanks for
> pointing that out. So, coprocessors are not meant to be loaded that often.
Correct. However, given ongoing work like online schema changes,
the introduction of a ServiceLoader (in HBASE-4050), and separating
classloaders (HBASE-6308), a more dynamic loading scheme for
Coprocessors could happen once the supporting pieces are in place.
> I am not sure I understand your answers. I have read about the Bigtable/HBase
> architecture but I may also not have expressed my problem correctly. The
> way I see it, coprocessors would allow me to aggregate information from
> recent logs. The problem I have with vanilla MapReduce is that if the logs
> do not fill a full HDFS block then MapReduce is a bit overkill. I thought
> that for those cases, coprocessors would be more appropriate. Is that the
> right way to see it? If so, is there any rule of thumb for knowing when to
> select MapReduce versus Coprocessors? On the other side of the scale, I
> also assume that if I had 1 TeraByte of data, MapReduce would be faster
> because it allows more parallelism. Well... I hope my concern is clearer
If you receive a lot of bulk data and need to transform it for later
storing into HBase, then a MapReduce process is the efficient option.
Even with an identity transform it is more efficient to drop all of
the new data into place in one transaction than to use a transaction
for each item; this is the rationale for HBase bulk loading. On the
other hand if the data arrives in a streaming fashion, then
Coprocessors make it possible to conveniently transform it inline as
it is persisted, via Observers.
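
To make the Observer idea concrete, here is a minimal sketch in plain
Java. It is a toy model only: ToyTable, WriteObserver and preWrite are
hypothetical names invented for illustration and are not the HBase
Observer API. It only shows the shape of the idea, a hook that can
transform each value inline, just before it is persisted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: these types are hypothetical and are NOT the HBase Observer API.
public class ObserverSketch {

    /** Hook invoked before a value is persisted; returns the value to store. */
    public interface WriteObserver {
        String preWrite(String rowKey, String value);
    }

    /** Minimal in-memory "table" that runs registered observers on every write. */
    public static class ToyTable {
        private final Map<String, String> rows = new TreeMap<>();
        private final List<WriteObserver> observers = new ArrayList<>();

        public void register(WriteObserver observer) {
            observers.add(observer);
        }

        public void put(String rowKey, String value) {
            String v = value;
            for (WriteObserver o : observers) {
                v = o.preWrite(rowKey, v);   // transform inline, before persisting
            }
            rows.put(rowKey, v);
        }

        public String get(String rowKey) {
            return rows.get(rowKey);
        }
    }

    public static void main(String[] args) {
        ToyTable table = new ToyTable();
        // Example transform (assumed log layout): keep only the second-to-last
        // field of a space-separated access-log line, here the HTTP status code.
        table.register((row, line) -> {
            String[] fields = line.split(" ");
            return fields[fields.length - 2];
        });
        table.put("log-0001", "GET /index.html 404 512");
        System.out.println(table.get("log-0001"));
    }
}
```

The point of the sketch is that the transform runs once per write, as
the data streams in, with no separate batch job.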
Observers may need to be reconfigured at runtime or may need a side
channel for communication. So, we designed Endpoints (i.e. Exec) to
enable registration of dynamic/user RPC protocols at runtime.
Endpoints have also been used for running aggregation functions over
the region data on demand, see AggregationProtocol. Simple functions
which return quickly make sense, but this is not a replacement for a
generalized framework like MapReduce. Long-running computations on the
server side can interact with leases and client-side RPC management in
problematic ways. However, those issues can be addressed by client and
server side changes layered on Coprocessors, which could be
incorporated into the framework. Hence, HBASE-3131.
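
As a rough illustration of why "simple functions which return quickly"
fit this model, the sketch below computes count/sum/average the way an
aggregation endpoint could: each region produces a small partial result
next to its data, and the client merges the partials. This is plain
Java with hypothetical names (Partial, aggregateRegion, mergeAll); it
does not use AggregationProtocol or any HBase API.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: hypothetical names, not AggregationProtocol or any HBase API.
public class PartialAggregateSketch {

    /** Small partial result computed next to the data of one region. */
    public static final class Partial {
        public final long count;
        public final double sum;

        public Partial(long count, double sum) {
            this.count = count;
            this.sum = sum;
        }

        public Partial merge(Partial other) {
            return new Partial(count + other.count, sum + other.sum);
        }

        public double average() {
            return count == 0 ? 0.0 : sum / count;
        }
    }

    /** "Server side": one pass over a region's values, returning a small result. */
    public static Partial aggregateRegion(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return new Partial(values.length, sum);
    }

    /** "Client side": merge the per-region partials into a global result. */
    public static Partial mergeAll(List<Partial> partials) {
        Partial total = new Partial(0, 0);
        for (Partial p : partials) {
            total = total.merge(p);
        }
        return total;
    }

    public static void main(String[] args) {
        // Request durations (ms) held by two hypothetical regions.
        Partial a = aggregateRegion(new double[] {10, 20, 30});
        Partial b = aggregateRegion(new double[] {100});
        Partial total = mergeAll(Arrays.asList(a, b));
        System.out.println(total.count + " " + total.sum + " " + total.average());
    }
}
```

Shipping (count, sum) rather than a per-region average matters: in this
example the per-region averages are 20 and 100, and averaging those
would give 60, while the true average over all four values is 40.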
See also the Exec method that takes a callback. The callback will be
invoked as results are returned from each individual RegionServer. You
don't need to wait for all results to be gathered into a Map if you do
not want that.
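
The callback style can be sketched as follows. Again this is a toy
model in plain Java, not the HBase client: execOnRegions and
RegionCallback are hypothetical names, and the "regions" are just
in-memory arrays. It shows only the pattern of handling each region's
result as it arrives instead of first collecting everything into a Map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

// Toy model only: hypothetical names; this does not use the HBase client API.
public class CallbackSketch {

    /** Invoked once per region, as soon as that region's result is available. */
    public interface RegionCallback<R> {
        void update(String regionName, R result);
    }

    /** Runs `call` against every region in parallel and streams results to `callback`. */
    public static <R> void execOnRegions(Map<String, long[]> regions,
                                         Function<long[], R> call,
                                         RegionCallback<R> callback) {
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        for (Map.Entry<String, long[]> e : regions.entrySet()) {
            pending.add(CompletableFuture
                .supplyAsync(() -> call.apply(e.getValue()))
                .thenAccept(r -> callback.update(e.getKey(), r)));
        }
        // Block until every region has reported; all callbacks have run by then.
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
    }

    public static void main(String[] args) {
        Map<String, long[]> regions = new LinkedHashMap<>();
        regions.put("region-a", new long[] {1, 2, 3});
        regions.put("region-b", new long[] {10, 20});

        // Callbacks may run on different threads, so use a thread-safe accumulator.
        AtomicLong runningTotal = new AtomicLong();
        CallbackSketch.<Long>execOnRegions(regions,
            values -> {
                long s = 0;
                for (long v : values) {
                    s += v;
                }
                return s;
            },
            (region, sum) -> runningTotal.addAndGet(sum));
        System.out.println(runningTotal.get());
    }
}
```

Because the callback folds each partial sum into a running total, the
client never holds more than one region's result at a time.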
Problems worthy of attack prove their worth by hitting back. - Piet
Hein (via Tom White)