HBase >> mail # user >> Coprocessor end point vs MapReduce?
Re: Coprocessor end point vs MapReduce?
Hi JM:

There was a thread discussing M/R bulk delete vs. Coprocessor bulk delete.
The thread subject is "Bulk Delete".
The poster in that thread suggested writing an HFile that contains all the
delete markers and then using the incremental bulk load facility to move
all the delete markers into the regions at once. This strategy works for my
use case too, because my M/R job generates a lot of version delete markers.

You might take a look at that thread for additional ways to delete data
from HBase.
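
To illustrate the idea of shipping delete markers as a file, here is a
minimal sketch in plain Python. It is a toy model, not HBase code: the names
Cell, Region, build_marker_file and bulk_load are all hypothetical, and the
ordering is simplified (real HBase sorts newer timestamps first).

```python
from typing import NamedTuple


class Cell(NamedTuple):
    row: str
    ts: int
    kind: str          # "put" or "delete_version" (hypothetical labels)
    value: str = ""


def build_marker_file(targets):
    """Turn (row, ts) pairs into version delete markers, sorted by key,
    since a store file's entries have to be written in key order."""
    return sorted(Cell(row, ts, "delete_version") for row, ts in targets)


class Region:
    """Toy region: a list of sorted 'files', each a list of cells."""

    def __init__(self, cells):
        self.files = [sorted(cells)]

    def bulk_load(self, marker_file):
        # The whole file is adopted in one step; no per-row delete call.
        self.files.append(marker_file)

    def read(self):
        cells = [c for f in self.files for c in f]
        masked = {(c.row, c.ts) for c in cells if c.kind == "delete_version"}
        # A version delete marker hides only the put with the same row and ts.
        return [c for c in cells
                if c.kind == "put" and (c.row, c.ts) not in masked]
```

For example, loading a marker file built from [("r1", 2)] into a region
holding r1@1, r1@2 and r2@1 leaves only r1@1 and r2@1 visible to readers.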

Best Regards,

On Thu, Oct 25, 2012 at 1:13 PM, Anoop John <[EMAIL PROTECTED]> wrote:

> >What I still don’t understand is, since CP and MR are both
> >running on the region side, why is the MR better than the CP?
> For bulk delete alone, a CP (Endpoint) will certainly be better than MR.
> People were suggesting MR because of your overall need: you also have to
> scan and move some data into another table.
> Do both MR and CP run on the region side? Well, there is a difference. The
> CP runs within the RS process itself, which is why bulk delete using an
> Endpoint is efficient: it is a local read and delete, with no network calls
> involved at all. With MR, even if the mappers run on the same machine as
> the region, it is still inter-process communication.
> I hope that explains the difference.
> -Anoop-
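
The difference described above can be sketched with a toy model in plain
Python. This is only an illustration under stated assumptions: Region,
delete_from_client and delete_via_endpoint are hypothetical names, not HBase
APIs, and "shipped" simply counts row keys that would leave the region
server process.

```python
class Region:
    """Toy region: maps row key -> timestamp."""

    def __init__(self, rows):
        self.rows = dict(rows)

    def scan(self, predicate):
        return [key for key, ts in self.rows.items() if predicate(ts)]

    def delete(self, keys):
        for key in keys:
            self.rows.pop(key, None)

    def bulk_delete_endpoint(self, predicate):
        # Endpoint style: read and delete inside the region's own process,
        # then report only a count to the caller.
        doomed = self.scan(predicate)
        self.delete(doomed)
        return len(doomed)


def delete_from_client(regions, predicate):
    """External process (client or mapper): keys cross the process boundary
    once when scanned out and once more when sent back as deletes."""
    deleted = shipped = 0
    for region in regions:
        keys = region.scan(predicate)
        shipped += len(keys)      # scan results returned to the caller
        region.delete(keys)
        shipped += len(keys)      # delete requests sent to the region
        deleted += len(keys)
    return deleted, shipped


def delete_via_endpoint(regions, predicate):
    """One request per region; no row keys leave the region process."""
    deleted = sum(r.bulk_delete_endpoint(predicate) for r in regions)
    return deleted, 0
```

Both functions delete the same rows; only the amount of data that has to
leave the region process differs.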
> On Thu, Oct 25, 2012 at 6:31 PM, Jean-Marc Spaggiari <
> > Hi all,
> >
> > First, sorry about my slowness to reply to this thread, but it went to
> > my spam folder and I lost sight of it.
> >
> > I don’t have good knowledge of RDBMS, and so I don’t have good
> > knowledge of triggers either. That’s why I looked at the endpoints too,
> > because they are pretty new to me.
> >
> > First, I can’t really use multiple tables. I have one process writing
> > to this table in near real time. Another one is deleting from this
> > table too. But some rows are never deleted. They time out, and
> > need to be moved by the process I’m building here.
> >
> > I was not aware that it is possible to set the priority for an MR job
> > (any link showing how?). That’s something I will dig into. I was a bit
> > worried about the network load if I do the deletes line by line and
> > not in bulk.
> >
> > What I still don’t understand is, since CP and MR are both
> > running on the region side, why is MR better than the CP? Is it because
> > the Hadoop framework takes care of it and guarantees that it
> > will run on all the regions?
> >
> > Also, are there some sort of “pre” and “post” methods I can override
> > for MR jobs to initially build a list of puts/deletes and submit them at
> > the end? Or should I do that one by one in the map method?
> >
> > Thanks,
> >
> > JM
> >
> >
> > 2012/10/18, lohit <[EMAIL PROTECTED]>:
> > > I might be a little off here. If rows are moved to another table on a
> > > weekly or daily basis, why not create a table per week or per day?
> > > That way you do not need to copy and delete. Of course, it will not work
> > > if you are selectively filtering between timestamps, and clients have to
> > > have a notion of multiple tables.
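
The per-day table idea can be sketched as a small naming scheme in plain
Python. The "events" prefix, table_for and expired_tables are hypothetical
names chosen for illustration; nothing here calls HBase.

```python
from datetime import date, timedelta


def table_for(day, prefix="events"):
    """Name of the table that holds rows written on `day`."""
    return f"{prefix}_{day:%Y%m%d}"


def expired_tables(today, keep_days, existing, prefix="events"):
    """Tables older than the retention window. Each one can be handled as a
    whole instead of deleting its rows one by one."""
    cutoff = table_for(today - timedelta(days=keep_days), prefix)
    # Zero-padded dates sort correctly as strings, so a plain comparison works.
    return sorted(t for t in existing
                  if t.startswith(prefix + "_") and t < cutoff)
```

Writers and readers both have to derive the table name from the date, which
is the "notion of multiple tables" mentioned above.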
> > >
> > > 2012/10/18 Anoop Sam John <[EMAIL PROTECTED]>
> > >
> > >> CPs and Endpoints operate at the region level. Any operation within one
> > >> region can be performed using them. I have seen that in the use case
> > >> below, along with the delete, there was also a need to insert data into
> > >> some other table, and that this was a periodic action. I really doubt
> > >> that endpoints alone can be used here. I also lean towards MR.
> > >>
> > >>   The idea behind the bulk delete CP is simple. We have a use case of
> > >> deleting rows in bulk, and this needs to be an online delete. I have also
> > >> seen many people on the mailing list ask questions about that. In all
> > >> cases, people were scanning to get the row keys to the client side and
> > >> then doing the deletes. Yes, most of the time the complaint was the
> > >> slowness.