Paul Mackles
2012-10-05, 18:17
Jacques
2012-10-05, 19:37
lars hofhansl
2012-10-05, 19:39
Anoop Sam John
2012-10-08, 03:55
Paul Mackles
2012-10-08, 11:45
Jerry Lam
2012-10-10, 15:07
Anoop Sam John
2012-10-11, 04:04
Jerry Lam
2012-10-12, 21:41
-
bulk deletes
Paul Mackles 2012-10-05, 18:17
We need to do deletes pretty regularly, and sometimes we could have hundreds of millions of cells to delete. TTLs won't work for us because we have a fair amount of bizlogic around the deletes.

Given the current implementation of deletes (we are on 0.90.4), this delete process can take a really long time (half a day or more with 100 or so concurrent threads). From everything I can tell, the performance issues come down to each delete being an individual RPC call (even when using the batch API). In other words, I don't see any thrashing on HBase while this process is running, just lots of waiting for the RPC calls to return.

The alternative we came up with is to use the standard bulk load facilities to handle the deletes. The code turned out to be surprisingly simple and appears to work in the small-scale tests we have tried so far. Is anyone else doing deletes in this fashion? Are there drawbacks that I might be missing? Here is a link to the code:

https://gist.github.com/3841437

Pretty simple, eh? I haven't seen much mention of this technique, which is why I am a tad paranoid about it.

Thanks,
Paul
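For readers who want a feel for the bulk-load idea, here is a rough sketch in plain Python. It is not the code from the gist and it does not use the HBase API. It only models the core step: turning the cells that the business logic selected into delete markers and emitting them in the sorted order a store file needs. The names `Cell`, `store_file_order` and `build_delete_markers` are hypothetical, and the type codes and the "markers sort before puts" rule are assumptions made for illustration.

```python
from typing import Iterable, List, NamedTuple

# Illustrative type codes (assumption). A delete marker is assumed to sort
# ahead of a put that has the same coordinates and timestamp.
PUT = 4
DELETE = 8  # "version" delete marker: targets exactly one timestamp


class Cell(NamedTuple):
    row: bytes
    family: bytes
    qualifier: bytes
    timestamp: int
    type: int


def store_file_order(cell: Cell):
    """Sort key: coordinates ascending, newest timestamp first, markers before puts."""
    return (cell.row, cell.family, cell.qualifier, -cell.timestamp, -cell.type)


def build_delete_markers(doomed: Iterable[Cell]) -> List[Cell]:
    """Turn the selected cells into sorted, de-duplicated delete markers."""
    markers = {cell._replace(type=DELETE) for cell in doomed}
    return sorted(markers, key=store_file_order)


if __name__ == "__main__":
    selected = [
        Cell(b"row2", b"cf", b"q", 5, PUT),
        Cell(b"row1", b"cf", b"q", 5, PUT),
        Cell(b"row1", b"cf", b"q", 9, PUT),
    ]
    for marker in build_delete_markers(selected):
        print(marker)
```

In a real job the sorted markers would be written out as store files and handed to the bulk load tooling, so no per-delete RPC is issued. The sketch stops before that step.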
-
Re: bulk deletes
Jacques 2012-10-05, 19:37
While I didn't spend a lot of time with your code, I believe your approach is sound.

Depending on your consistency requirements, I would suggest you consider using a coprocessor to handle the deletes. Coprocessors can intercept compaction scans, so you could shift your delete logic into an additional filter that is applied at compaction time. This should be less load and complexity than the bulk load.

Depending on the complexity and frequency of the criteria, you could potentially add an endpoint to set these batch deletes. I was considering a generic version of this but haven't spent much time on it...

Jacques
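The compaction-time idea can be pictured with a small toy model. This is plain Python, not the coprocessor API: a compaction is modelled as a merge of already-sorted store files, and the delete logic is a predicate consulted for every cell on its way into the new file. The function name `compact`, the `should_drop` predicate and the tuple layout are all hypothetical.

```python
import heapq
from typing import Callable, Iterable, List, Tuple

# (row, family, qualifier, timestamp, value); a simplified stand-in for a KV
Cell = Tuple[bytes, bytes, bytes, int, bytes]


def cell_order(cell: Cell):
    """Coordinates ascending, newest timestamp first."""
    return (cell[0], cell[1], cell[2], -cell[3])


def compact(store_files: Iterable[List[Cell]],
            should_drop: Callable[[Cell], bool]) -> List[Cell]:
    """Merge sorted store files, leaving out every cell the predicate rejects."""
    merged = heapq.merge(*store_files, key=cell_order)
    return [cell for cell in merged if not should_drop(cell)]


if __name__ == "__main__":
    file_a = [(b"a", b"cf", b"q", 3, b"new"), (b"c", b"cf", b"q", 1, b"stale")]
    file_b = [(b"a", b"cf", b"q", 1, b"stale"), (b"b", b"cf", b"q", 2, b"ok")]
    # Example business rule (made up): drop everything older than timestamp 2.
    for cell in compact([file_a, file_b], lambda c: c[3] < 2):
        print(cell)
```

The trade-off this illustrates is the consistency one mentioned above: cells only disappear when a compaction rewrites the files, so readers can still see them until then.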
-
Re: bulk deletes
lars hofhansl 2012-10-05, 19:39
Does it work? :)

How did you do the deletes before? I assume you used the HTable.delete(List<Delete>) API?

(Doesn't really help you, but) In 0.92+ you could hook up a coprocessor into the compactions and simply filter out any KVs you want to have removed.

-- Lars
-
RE: bulk deletes
Anoop Sam John 2012-10-08, 03:55
We have also done an implementation using compaction-time deletes (skipping the unwanted KVs during compaction). This works very well for us.

As this delays the deletes until the next major compaction, we also have an implementation that does the bulk delete in real time. [We have such a use case.] Here I am using an endpoint implementation to do the scan and delete entirely on the server side. I just raised an issue for this [HBASE-6942] and will post a patch based on the 0.94 model there. Please have a look. I have noticed a big performance improvement over the normal way of scan() + delete(List<Delete>), as this avoids several network calls and the associated traffic.

-Anoop-
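To make the difference between the two call patterns concrete, here is a toy model in plain Python. It is not the HBASE-6942 patch and not the HBase client API. `ToyRegion` is a hypothetical in-memory stand-in for a region that counts how many client calls it receives. In a real cluster a scan is many calls, so the client-side count here is, if anything, optimistic.

```python
from typing import Callable, Dict, List, Tuple

Predicate = Callable[[bytes, bytes], bool]


class ToyRegion:
    """In-memory stand-in for a region: row key -> value, plus a call counter."""

    def __init__(self, rows: Dict[bytes, bytes]) -> None:
        self.rows = dict(rows)
        self.calls = 0  # client -> server calls received

    def scan(self) -> List[Tuple[bytes, bytes]]:
        self.calls += 1
        return sorted(self.rows.items())  # every row travels to the client

    def delete(self, keys: List[bytes]) -> None:
        self.calls += 1
        for key in keys:
            self.rows.pop(key, None)

    def bulk_delete(self, predicate: Predicate) -> int:
        """Endpoint-style call: scan and delete locally, return only a count."""
        self.calls += 1
        doomed = [k for k, v in self.rows.items() if predicate(k, v)]
        for key in doomed:
            del self.rows[key]
        return len(doomed)


def client_side_delete(region: ToyRegion, predicate: Predicate) -> int:
    """scan() + delete(list): rows come to the client, deletes go back."""
    doomed = [k for k, v in region.scan() if predicate(k, v)]
    region.delete(doomed)
    return len(doomed)


def endpoint_delete(regions: List[ToyRegion], predicate: Predicate) -> int:
    """One call per region; only the per-region counts come back."""
    return sum(region.bulk_delete(predicate) for region in regions)
```

Both paths leave the same rows behind. The endpoint path differs only in where the predicate runs and in how much data crosses the network.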
-
Re: bulk deletes
Paul Mackles 2012-10-08, 11:45
Very cool Anoop. I can definitely see how that would be useful.

Lars - the bulk deletes do appear to work. I just wasn't sure if there was something I might be missing since I haven't seen this documented elsewhere.

Coprocessors do seem a better fit for this in the long term.

Thanks everyone.
-
Re: bulk deletes
Jerry Lam 2012-10-10, 15:07
Hi guys:

The bulk delete approaches described in this thread are helpful in my case as well. If I understood correctly, Paul's approach is useful for offline bulk deletes (i.e. via MapReduce), whereas Anoop's approach is useful for online/real-time bulk deletes (i.e. via a coprocessor)?

Best Regards,

Jerry
-
RE: bulk deletes
Anoop Sam John 2012-10-11, 04:04
You are right, Jerry.

In your use case, do you want to delete full rows, or only some CFs/columns? Please feel free to look at the issue HBASE-6942 and add your comments. There I am trying to delete whole rows. [This is our use case.]

-Anoop-
-
Re: bulk deletes
Jerry Lam 2012-10-12, 21:41
Hi Anoop:

In my use case, I make extensive use of the version delete marker because I need to delete a specific version of a cell (row key, CF, qualifier, timestamp). I have a MapReduce job that runs across some regions and, based on some business rules, deletes some of the cells in the table using the version delete marker. The business rules for deletion are scoped to one column family at a time. Therefore, there are no logical dependencies between deletions in different column families. I also posted this use case in HBASE-6942.

Best Regards,

Jerry
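The version-delete use case can be sketched as a toy model in plain Python (not the HBase API). A version delete marker is modelled as the exact coordinates (row, CF, qualifier, timestamp) of the one cell version it hides, and each column family gets its own independent rule. `CellKey`, `version_markers` and `visible` are hypothetical names, and the sample rule is made up.

```python
from typing import Callable, Dict, NamedTuple, Set


class CellKey(NamedTuple):
    row: bytes
    family: bytes
    qualifier: bytes
    timestamp: int


Rule = Callable[[CellKey, bytes], bool]


def version_markers(cells: Dict[CellKey, bytes],
                    rules: Dict[bytes, Rule]) -> Set[CellKey]:
    """Pick the exact cell versions to delete, one rule per column family."""
    return {
        key
        for key, value in cells.items()
        if key.family in rules and rules[key.family](key, value)
    }


def visible(cells: Dict[CellKey, bytes],
            markers: Set[CellKey]) -> Dict[CellKey, bytes]:
    """A marker hides only the version with the identical timestamp."""
    return {key: value for key, value in cells.items() if key not in markers}


if __name__ == "__main__":
    table = {
        CellKey(b"r1", b"a", b"q", 1): b"ok",
        CellKey(b"r1", b"a", b"q", 2): b"bad",
        CellKey(b"r1", b"b", b"q", 2): b"bad",  # family b has no rule
    }
    markers = version_markers(table, {b"a": lambda key, value: value == b"bad"})
    print(visible(table, markers))
```

Because each rule sees only its own column family, the families can be processed separately, which matches the "no dependencies between column families" point above.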