NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
HBase >> mail # user >> bulk deletes


Paul Mackles 2012-10-05, 18:17
lars hofhansl 2012-10-05, 19:39
Anoop Sam John 2012-10-08, 03:55
Paul Mackles 2012-10-08, 11:45
Jerry Lam 2012-10-10, 15:07
Anoop Sam John 2012-10-11, 04:04
Jerry Lam 2012-10-12, 21:41
Re: bulk deletes
While I didn't spend a lot of time with your code, I believe your approach
is sound.

Depending on your consistency requirements, I would suggest you consider
using a coprocessor to handle the deletes.  Coprocessors can intercept
compaction scans, so you could move your delete logic into an additional
filter that is applied at compaction time.  This should add less load and
complexity than the bulk load approach.  Depending on the complexity and
frequency of the criteria, you could potentially add an endpoint for
setting these batch deletes.

I was considering a generic version of this but haven't spent much time on
it...
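
A minimal sketch of the compaction-time idea described above, written as
a toy model in plain Java. It does not use the HBase coprocessor API: the
Cell class, the compact() method, and the timestamp-based rule are all
hypothetical stand-ins, chosen only to show the delete criteria acting as
a filter while store files are rewritten.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Toy model only: Cell and compact() are hypothetical stand-ins,
// not HBase's KeyValue or coprocessor classes.
class CompactionDeleteSketch {

    static final class Cell {
        final String row;
        final String qualifier;
        final long timestamp;

        Cell(String row, String qualifier, long timestamp) {
            this.row = row;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }
    }

    // A compaction rewrites store files; cells that match the delete
    // criteria are simply not copied into the rewritten output.
    static List<Cell> compact(List<Cell> input, Predicate<Cell> deleteCriteria) {
        List<Cell> output = new ArrayList<>();
        for (Cell cell : input) {
            if (!deleteCriteria.test(cell)) {
                output.add(cell);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<Cell> storeFile = Arrays.asList(
            new Cell("user1", "event", 100L),
            new Cell("user2", "event", 200L),
            new Cell("user3", "event", 300L));

        // Hypothetical business rule: drop everything older than timestamp 250.
        Predicate<Cell> expired = cell -> cell.timestamp < 250L;

        for (Cell kept : compact(storeFile, expired)) {
            System.out.println(kept.row + " " + kept.timestamp);
        }
    }
}
```

In a model like this, matching cells stay readable until the next
compaction rewrites the file, which is presumably why the suggestion is
qualified by consistency requirements.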

Jacques

On Fri, Oct 5, 2012 at 11:17 AM, Paul Mackles <[EMAIL PROTECTED]> wrote:

> We need to do deletes pretty regularly and sometimes we could have
> hundreds of millions of cells to delete. TTLs won't work for us because we
> have a fair amount of bizlogic around the deletes.
>
> Given their current implementation (we are on 0.90.4), this delete process
> can take a really long time (half a day or more with 100 or so concurrent
> threads). From everything I can tell, the performance issues come down to
> each delete being an individual RPC call (even when using the batch API).
> In other words, I don't see any thrashing on hbase while this process is
> running – just lots of waiting for the RPC calls to return.
>
> The alternative we came up with is to use the standard bulk load
> facilities to handle the deletes. The code turned out to be surprisingly
> simple and appears to work in the small-scale tests we have tried so far.
> Is anyone else doing deletes in this fashion? Are there drawbacks that I
> might be missing? Here is a link to the code:
>
> https://gist.github.com/3841437
>
> Pretty simple, eh? I haven't seen much mention of this technique which is
> why I am a tad paranoid about it.
>
> Thanks,
> Paul
>
>
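
The bulk load technique in the quoted message can work because HBase
stores a delete as a marker cell that masks older versions of the data
until a major compaction removes both. The sketch below is a toy model of
that masking rule in plain Java. It does not reproduce the linked gist or
use any HBase classes: Entry and visibleCells() are hypothetical names,
and the model tracks only a row key and a timestamp.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: Entry and visibleCells() are hypothetical, not HBase classes.
class DeleteMarkerSketch {

    static final class Entry {
        final String row;
        final long timestamp;
        final boolean deleteMarker; // true for a tombstone, false for a normal put

        Entry(String row, long timestamp, boolean deleteMarker) {
            this.row = row;
            this.timestamp = timestamp;
            this.deleteMarker = deleteMarker;
        }
    }

    // A put is visible unless some delete marker for the same row has a
    // timestamp greater than or equal to the put's timestamp.
    static List<Entry> visibleCells(List<Entry> allEntries) {
        List<Entry> visible = new ArrayList<>();
        for (Entry candidate : allEntries) {
            if (candidate.deleteMarker) {
                continue; // markers themselves are never returned to readers
            }
            boolean masked = false;
            for (Entry marker : allEntries) {
                if (marker.deleteMarker
                        && marker.row.equals(candidate.row)
                        && marker.timestamp >= candidate.timestamp) {
                    masked = true;
                    break;
                }
            }
            if (!masked) {
                visible.add(candidate);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
            new Entry("row1", 100L, false),  // existing data
            new Entry("row2", 100L, false),  // existing data
            new Entry("row1", 200L, true),   // marker added in bulk
            new Entry("row1", 300L, false)); // data written after the marker

        for (Entry e : visibleCells(entries)) {
            System.out.println(e.row + " " + e.timestamp);
        }
    }
}
```

Because the marker is just another entry, it does not matter in this
model whether it arrived through an RPC or through a bulk-loaded file;
only its row and timestamp decide what it masks.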