

Re: bulk deletes
Hi Anoop:

In my use case, I make extensive use of the version delete marker because I
need to delete a specific version of a cell (row key, CF, qualifier,
timestamp). I have a mapreduce job that runs across some regions and, based
on some business rules, deletes some of the cells in the table using the
version delete marker. The business rules for deletion are scoped to one
column family at a time, so there is no logical dependency between deletions
across column families.
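
For anyone less familiar with version delete markers, a minimal sketch of
this kind of targeted delete against the 0.90/0.94-era client API looks like
the following (the table, row key, family, qualifier and timestamp are
hypothetical):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Delete;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.util.Bytes;

  public class VersionDeleteExample {
    public static void main(String[] args) throws Exception {
      // Hypothetical table/row/CF/qualifier/timestamp, for illustration only.
      HTable table = new HTable(HBaseConfiguration.create(), "events");
      Delete d = new Delete(Bytes.toBytes("row-0042"));
      // deleteColumn(family, qualifier, ts) places a version delete marker
      // that masks exactly this one version of the cell.
      d.deleteColumn(Bytes.toBytes("cf1"), Bytes.toBytes("q1"), 1349740800000L);
      table.delete(d);
      table.close();
    }
  }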

I also posted the above use case in HBASE-6942.

Best Regards,

Jerry

On Thu, Oct 11, 2012 at 12:04 AM, Anoop Sam John <[EMAIL PROTECTED]> wrote:

> You are right, Jerry.
> In your use case, do you want to delete full rows or only some CFs/columns?
> Please feel free to look at HBASE-6942 and give your valuable comments.
> Here I am trying to delete whole rows [this is our use case].
>
> -Anoop-
> ________________________________________
> From: Jerry Lam [[EMAIL PROTECTED]]
> Sent: Wednesday, October 10, 2012 8:37 PM
> To: [EMAIL PROTECTED]
> Subject: Re: bulk deletes
>
> Hi guys:
>
> The bulk delete approaches described in this thread are helpful in my case
> as well. If I understood correctly, Paul's approach is useful for offline
> bulk deletes (i.e., via MapReduce), whereas Anoop's approach is useful for
> online/real-time bulk deletes (i.e., via a coprocessor)?
>
> Best Regards,
>
> Jerry
>
> On Mon, Oct 8, 2012 at 7:45 AM, Paul Mackles <[EMAIL PROTECTED]> wrote:
>
> > Very cool Anoop. I can definitely see how that would be useful.
> >
> > Lars - the bulk deletes do appear to work. I just wasn't sure if there
> > was something I might be missing since I haven't seen this documented
> > elsewhere.
> >
> > Coprocessors do seem a better fit for this in the long term.
> >
> > Thanks everyone.
> >
> > On 10/7/12 11:55 PM, "Anoop Sam John" <[EMAIL PROTECTED]> wrote:
> >
> > >We also did an implementation using compaction-time deletes (avoiding
> > >KVs). This works very well for us....
> > >As this would delay the deletes until the next major compaction, we are
> > >working on an implementation to do real-time bulk deletes. [We have such
> > >a use case]
> > >Here I am using an endpoint implementation to do the scan and delete at
> > >the server side only. Just raised an issue for this [HBASE-6942]. I will
> > >post a patch based on the 0.94 model there...Pls have a look....  I have
> > >noticed a big performance improvement over the normal way of scan() +
> > >delete(List<Delete>) as this avoids several network calls and traffic...
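> > >
> > >For comparison, here is a rough sketch of that normal client-side
> > >scan() + delete(List<Delete>) pattern; the table name, filter criterion
> > >and batch size are hypothetical:
> > >
> > >  import java.util.ArrayList;
> > >  import java.util.List;
> > >  import org.apache.hadoop.hbase.HBaseConfiguration;
> > >  import org.apache.hadoop.hbase.client.Delete;
> > >  import org.apache.hadoop.hbase.client.HTable;
> > >  import org.apache.hadoop.hbase.client.Result;
> > >  import org.apache.hadoop.hbase.client.ResultScanner;
> > >  import org.apache.hadoop.hbase.client.Scan;
> > >  import org.apache.hadoop.hbase.filter.PrefixFilter;
> > >  import org.apache.hadoop.hbase.util.Bytes;
> > >
> > >  public class ClientSideBulkDelete {
> > >    public static void main(String[] args) throws Exception {
> > >      HTable table = new HTable(HBaseConfiguration.create(), "events");
> > >      Scan scan = new Scan();
> > >      // hypothetical deletion criterion
> > >      scan.setFilter(new PrefixFilter(Bytes.toBytes("expired-")));
> > >      ResultScanner scanner = table.getScanner(scan);
> > >      List<Delete> batch = new ArrayList<Delete>();
> > >      for (Result r : scanner) {
> > >        batch.add(new Delete(r.getRow()));   // whole-row delete
> > >        if (batch.size() >= 1000) {          // send in chunks to bound memory
> > >          table.delete(batch);               // each chunk is another round trip
> > >          batch.clear();
> > >        }
> > >      }
> > >      if (!batch.isEmpty()) {
> > >        table.delete(batch);
> > >      }
> > >      scanner.close();
> > >      table.close();
> > >    }
> > >  }
> > >
> > >With the endpoint, the scan and the deletes stay inside the region
> > >server, so these round trips disappear.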
> > >
> > >-Anoop-
> > >________________________________________
> > >From: lars hofhansl [[EMAIL PROTECTED]]
> > >Sent: Saturday, October 06, 2012 1:09 AM
> > >To: [EMAIL PROTECTED]
> > >Subject: Re: bulk deletes
> > >
> > >Does it work? :)
> > >
> > >How did you do the deletes before? I assume you used the
> > >HTable.delete(List<Delete>) API?
> > >
> > >(Doesn't really help you, but) In 0.92+ you could hook up a coprocessor
> > >into the compactions and simply filter out any KVs you want to have
> > >removed.
> > >
> > >
> > >-- Lars
> > >
> > >
> > >
> > >________________________________
> > > From: Paul Mackles <[EMAIL PROTECTED]>
> > >To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> > >Sent: Friday, October 5, 2012 11:17 AM
> > >Subject: bulk deletes
> > >
> > >We need to do deletes pretty regularly and sometimes we could have
> > >hundreds of millions of cells to delete. TTLs won't work for us because
> > >we have a fair amount of bizlogic around the deletes.
> > >
> > >Given their current implementation (we are on 0.90.4), this delete
> > >process can take a really long time (half a day or more with 100 or so
> > >concurrent threads). From everything I can tell, the performance issues
> > >come down to each delete being an individual RPC call (even when using
> > >the batch API). In other words, I don't see any thrashing on HBase while
> > >this process is running -- just lots of waiting for the RPC calls to
> > >return.