Jean-Marc Spaggiari
2012-10-18, 00:11
Michael Segel
2012-10-18, 00:27
Jean-Marc Spaggiari
2012-10-18, 01:19
Michael Segel
2012-10-18, 01:31
Jean-Marc Spaggiari
2012-10-18, 01:44
Michael Segel
2012-10-18, 01:50
Anoop Sam John
2012-10-18, 04:20
Doug Meil
2012-10-18, 12:36
Michael Segel
2012-10-18, 18:01
Doug Meil
2012-10-18, 19:18
Anoop Sam John
2012-10-19, 03:33
lohit
2012-10-19, 03:58
Jean-Marc Spaggiari
2012-10-25, 13:01
Anoop John
2012-10-25, 17:13
Jerry Lam
2012-10-25, 20:43
-
Coprocessor end point vs MapReduce? Jean-Marc Spaggiari 2012-10-18, 00:11
Hi,
Can someone please help me understand the pros and cons of these two options for the following use case?

I need to transfer all the rows between two timestamps to another table.

My first idea was to run a MapReduce job to map the rows and store them in another table, and then delete them using an endpoint coprocessor. But the more I look into it, the more I think MapReduce is not a good idea and I should use a coprocessor instead.

BUT... The MapReduce framework guarantees that it will run against all the regions. I tried stopping a regionserver while the job was running. The region moved, and MapReduce restarted the job from the new location. Will the coprocessor do the same thing?

Also, I found the web console for MapReduce with the number of jobs, the status, etc. Is there the same thing for coprocessors?

Do all coprocessors run at the same time on all regions, which means we could have 100 of them running on a regionserver at a time? Or do they run like MapReduce jobs, based on some configured values?

Thanks,

JM
-
Re: Coprocessor end point vs MapReduce? Michael Segel 2012-10-18, 00:27
Hi,
I'm a firm believer in KISS (Keep It Simple, Stupid).

The Map/Reduce (map-only job) is the simplest and least prone to failure.

Not sure why you would want to do this using coprocessors.

How often are you running this job? It sounds like it's going to be sporadic.

-Mike
-
Re: Coprocessor end point vs MapReduce? Jean-Marc Spaggiari 2012-10-18, 01:19
Hi Mike,
I'm expecting to run the job weekly. I initially thought about using endpoints because I found HBASE-6942, which was a good example for my needs.

I'm fine with the Put part for the Map/Reduce, but I'm not sure about the delete. That's why I looked at coprocessors. Then I figured that I can also do the Put on the coprocessor side.

In an M/R job, can I delete the row I'm dealing with based on some criteria like timestamp? If I do that, I will not do bulk deletes, but will delete the rows one by one, right? That might be very slow.

If in the future I want to run the job daily, might that be an issue?

Or should I go with the initial idea of doing the Put with the M/R job and the delete with HBASE-6942?

Thanks,

JM
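To make the "Put then Delete" question concrete, here is a minimal in-memory sketch of what a map-only move could do per row, with the deletes buffered and flushed in batches instead of being sent one by one. It is only an illustration under stated assumptions: `MoveRowsSketch`, `Row`, `moveRows` and `batchSize` are hypothetical names, the tables are plain Java maps, and no HBase or MapReduce classes are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory stand-in for a map-only "move rows" job; no HBase classes are used. */
class MoveRowsSketch {

    /** A row reduced to one timestamp and one value (hypothetical model). */
    static final class Row {
        final long timestamp;
        final String value;

        Row(long timestamp, String value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /**
     * Copies every row with minTs <= timestamp < maxTs into target and removes
     * it from source. Deletes are buffered and applied in groups of batchSize
     * rather than one call per row. Returns the number of delete batches.
     */
    static int moveRows(Map<String, Row> source, Map<String, Row> target,
                        long minTs, long maxTs, int batchSize) {
        List<String> pendingDeletes = new ArrayList<>();
        int batches = 0;
        // Iterate over a snapshot so that removing from source is safe.
        Map<String, Row> snapshot = new TreeMap<>(source);
        for (Map.Entry<String, Row> entry : snapshot.entrySet()) {
            Row row = entry.getValue();
            if (row.timestamp < minTs || row.timestamp >= maxTs) {
                continue; // outside the time range: leave the row alone
            }
            target.put(entry.getKey(), row);    // the "Put" half
            pendingDeletes.add(entry.getKey()); // the "Delete" half, deferred
            if (pendingDeletes.size() == batchSize) {
                source.keySet().removeAll(pendingDeletes);
                pendingDeletes.clear();
                batches++;
            }
        }
        if (!pendingDeletes.isEmpty()) {        // flush the last partial batch
            source.keySet().removeAll(pendingDeletes);
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        Map<String, Row> source = new TreeMap<>();
        Map<String, Row> target = new TreeMap<>();
        for (int i = 0; i < 10; i++) {
            source.put("row" + i, new Row(100 + i, "v" + i));
        }
        int batches = moveRows(source, target, 103, 108, 2);
        System.out.println(target.size() + " rows moved in " + batches + " delete batches");
    }
}
```

In a real job the two maps would be the source and destination tables and each batch would be one round trip, so the batch size is the knob that trades memory in the mapper against the number of delete calls.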
-
Re: Coprocessor end point vs MapReduce? Michael Segel 2012-10-18, 01:31
If you're going to be running this weekly, I would suggest that you stick with the M/R job.
Is there any reason why you need to be worried about the time it takes to do the deletes?
-
Re: Coprocessor end point vs MapReduce? Jean-Marc Spaggiari 2012-10-18, 01:44
I don't have any concern about the time it's taking. It's more about the load it's putting on the cluster. I have other jobs that I need to run (secondary index, data processing, etc.), so the more time this new job takes, the less CPU the others will have.

I tried the M/R and I really liked the way it's done. So my only real concern is the performance of the delete part.

That's why I'm wondering what the best practice is for moving a row to another table.
-
Re: Coprocessor end point vs MapReduce? Michael Segel 2012-10-18, 01:50
Run your weekly job in a low priority fair scheduler/capacity scheduler queue.
Maybe it's just me, but I look at coprocessors as a similar structure to RDBMS triggers and stored procedures. You need to show restraint and use them sparingly, otherwise you end up creating performance issues.

Just IMHO.

-Mike
-
RE: Coprocessor end point vs MapReduce? Anoop Sam John 2012-10-18, 04:20
Hi Jean,

>> Are all coprocessors running at the same time on all regions

Yes, it will try to run them all in parallel. It submits one callable for each of the regions involved. It uses the executor pool available with the HTable, though, so the degree of parallelism depends on the slots available in that pool and on the total region count.

>> The MapReduce framework guarantee me that it will run against all the regions. I tried to stop a regionserver while the job was running. The region moved, and the MapReduce restarted the job from the new location. Will the coprocessor do the same thing

Yes, it will. Every call to a region is retried (at most 10 times by default). One other point that comes to mind now: what happens if a region splits in between? How will MR handle this case? Sorry, I don't know; I need to look at the code.

Regarding your use case, Jean: you want to put some data into another table, right? How do you plan to use a CP for this Put? (I wonder.)

As for the bulk delete: as you said, if you use MR, it amounts to scanning to the client side and deleting rows one by one (though with many parallel clients, your mappers). So, as you expect, it will be very slow compared to the new approach we are working on in HBASE-6942.

Hope I have answered your questions. :)

-Anoop-
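The behaviour described above (one callable per region, a shared executor pool that caps the parallelism, and a bounded number of retries per call) can be sketched with plain `java.util.concurrent`. This is a simulation under stated assumptions, not the HBase client code: `RegionFanOutSketch`, `RegionCall` and `runOnAllRegions` are hypothetical names, and the limit of 10 attempts is simply taken from the message above.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

/** Simulated fan-out of one call per region; no HBase classes are used. */
class RegionFanOutSketch {

    /** Hypothetical per-region call; stands in for an endpoint invocation. */
    interface RegionCall {
        long call(String region, int attempt) throws Exception;
    }

    /** Submits one task per region to a fixed pool; each task retries up to maxAttempts times. */
    static Map<String, Long> runOnAllRegions(List<String> regions, int poolSize,
                                             int maxAttempts, RegionCall call) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        // The pool size, not the region count, bounds how many calls run at once.
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            Map<String, Future<Long>> futures = new LinkedHashMap<>();
            for (String region : regions) {
                Callable<Long> task = () -> {
                    Exception last = null;
                    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                        try {
                            return call.call(region, attempt);
                        } catch (Exception failure) {
                            last = failure; // e.g. the region moved; try again
                        }
                    }
                    throw last; // every attempt failed
                };
                futures.put(region, pool.submit(task));
            }
            Map<String, Long> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<Long>> entry : futures.entrySet()) {
                results.put(entry.getKey(), entry.getValue().get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for regions", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a region call failed after all retries", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        AtomicInteger attempts = new AtomicInteger();
        Map<String, Long> rows = runOnAllRegions(Arrays.asList("r1", "r2", "r3"), 2, 10,
            (region, attempt) -> {
                attempts.incrementAndGet();
                if (region.equals("r2") && attempt < 3) {
                    throw new IOException("region moved"); // simulated transient failure
                }
                return 100L; // pretend 100 rows were processed
            });
        System.out.println(rows + " after " + attempts.get() + " attempts");
    }
}
```

With a pool of 2 and three regions, at most two calls are in flight at any moment, which is the "available slots versus region count" point made above. The sketch does not model a region split, which is the open question raised in the message.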
-
Re: Coprocessor end point vs MapReduce? Doug Meil 2012-10-18, 12:36
To echo what Mike said about KISS, would you use triggers for a large time-sensitive batch job in an RDBMS? It's possible, but probably not. Then you might want to think twice about using coprocessors for such a purpose with HBase.
-
Re: Coprocessor end point vs MapReduce? Michael Segel 2012-10-18, 18:01
Doug,
One thing that concerns me is that a lot of folks are gravitating to coprocessors and may be using them for the wrong thing. Has anyone done any sort of research into the limitations and negative impacts of using coprocessors?

While I haven't really toyed with the idea of bulk deletes, periodic deletes are probably not a good use of coprocessors... however, using them to synchronize tables would be a valid use case.

Thx

-Mike
-
Re: Coprocessor end point vs MapReduce? Doug Meil 2012-10-18, 19:18
I agree with the concern, and there isn't a ton of guidance in this area yet.
-
RE: Coprocessor end point vs MapReduce? Anoop Sam John 2012-10-19, 03:33
CPs and endpoints operate at the region level; any operation within one region can be performed using them. In this use case, I saw that along with the delete there was also a need to insert data into some other table, and that this was a kind of periodic action. I really doubt that endpoints alone can be used here. I also lean towards MR.
The idea behind the bulk delete CP is simple. We have a use case of deleting a large number of rows, and this needs to be an online delete. I have also seen many people on the mailing list ask about this. In every case, people were using scans to fetch the row keys to the client side and then issuing the deletes, and most of the time the complaint was slowness. One bulk delete performance improvement was done in HBASE-6284. Still, we thought we could do the whole operation (scan + delete) on the server side by making use of endpoints. This will be much faster and can be used for online bulk deletes.

-Anoop-
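To make the region-level idea concrete, here is a minimal, library-free model in plain Java. It is not the HBASE-6942 code and it uses no HBase API: `Region`, `bulkDeleteLocal`, and `bulkDelete` are hypothetical names chosen for illustration. Each region scans and deletes its own rows whose timestamp falls in the range and returns only a count; the caller sums the counts, so row keys never travel to the client.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical model of one region: row key -> timestamp of the row. */
class Region {
    final TreeMap<String, Long> rows = new TreeMap<>();

    /**
     * Models the endpoint body: scan and delete locally, return only a count.
     * The range is [minTs, maxTs), i.e. maxTs is exclusive.
     */
    long bulkDeleteLocal(long minTs, long maxTs) {
        long deleted = 0;
        Iterator<Map.Entry<String, Long>> it = rows.entrySet().iterator();
        while (it.hasNext()) {
            long ts = it.next().getValue();
            if (ts >= minTs && ts < maxTs) {
                it.remove(); // local delete, no row key leaves the region
                deleted++;
            }
        }
        return deleted;
    }
}

class BulkDeleteSketch {
    /** Models the client call: invoke every region, add up the counts. */
    static long bulkDelete(List<Region> regions, long minTs, long maxTs) {
        long total = 0;
        for (Region region : regions) {
            total += region.bulkDeleteLocal(minTs, maxTs);
        }
        return total;
    }

    public static void main(String[] args) {
        Region a = new Region();
        a.rows.put("row-a1", 100L);
        a.rows.put("row-a2", 250L);
        Region b = new Region();
        b.rows.put("row-b1", 180L);
        b.rows.put("row-b2", 400L);

        List<Region> regions = new ArrayList<>();
        regions.add(a);
        regions.add(b);

        long deleted = bulkDelete(regions, 100L, 300L);
        System.out.println(deleted + " rows deleted"); // prints "3 rows deleted"
    }
}
```

The model also shows the limit raised above: each call only sees the rows of its own region, so copying the rows to a different table has to happen as a separate step.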
-
Re: Coprocessor end point vs MapReduce?lohit 2012-10-19, 03:58
I might be a little off here. If rows are moved to another table on a weekly or daily basis, why not create a table per week or per day? That way you would not need to copy and delete. Of course, it will not work if you are selectively filtering between timestamps, and clients would have to be aware of the multiple tables.

Have a Nice Day!
Lohit
-
Re: Coprocessor end point vs MapReduce?Jean-Marc Spaggiari 2012-10-25, 13:01
Hi all,
First, sorry for being slow to reply to this thread; it went to my spam folder and I lost sight of it.

I don’t have good knowledge of RDBMS, so I don’t have good knowledge of triggers either. That’s why I looked at the endpoints too, because they are pretty new to me.

I can’t really use multiple tables. I have one process writing to this table in near real time. Another one is deleting from this table too. But some rows are never deleted. They are timing out, and need to be moved by the process I’m building here.

I was not aware of the possibility of setting the priority for an MR job (any link to show how?). That’s something I will dig into. I was a bit scared about the network load if I’m doing the deletes row by row and not in bulk.

What I still don’t understand is, since both CP and MR are running on the region side, why is the MR better than the CP? Because the Hadoop framework is taking care of it and will guarantee that it will run on all the regions?

Also, are there some sort of “pre” and “post” methods I can override for MR jobs, to build up a list of puts/deletes and submit them at the end? Or should I do that one by one in the map method?

Thanks,

JM
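On the “pre” and “post” question: the newer `org.apache.hadoop.mapreduce.Mapper` API does provide `setup()` and `cleanup()` hooks that run once per task, before the first and after the last `map()` call, so buffering mutations and sending them in batches is a common pattern. Below is a minimal, library-free sketch of that pattern in plain Java. `Sink`, `RecordingSink`, `BatchingMover`, and the batch size are hypothetical names for illustration, not Hadoop or HBase API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a table client that accepts batches of row keys. */
interface Sink {
    void write(List<String> batch);
}

/** Records every batch it receives, so the batching behaviour can be inspected. */
class RecordingSink implements Sink {
    final List<List<String>> batches = new ArrayList<>();

    @Override
    public void write(List<String> batch) {
        batches.add(new ArrayList<>(batch)); // copy: the caller reuses its buffer
    }
}

/**
 * Models the life cycle of one map task: setup() once, map() per row,
 * cleanup() once. Rows are buffered and sent in batches, not one by one.
 */
class BatchingMover {
    private final Sink sink;
    private final int batchSize;
    private List<String> buffer;

    BatchingMover(Sink sink, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.sink = sink;
        this.batchSize = batchSize;
    }

    /** "pre" hook: allocate the buffer before the first row arrives. */
    void setup() {
        buffer = new ArrayList<>(batchSize);
    }

    /** Called once per row; flushes whenever the buffer is full. */
    void map(String rowKey) {
        buffer.add(rowKey);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** "post" hook: send whatever is left after the last row. */
    void cleanup() {
        flush();
    }

    private void flush() {
        if (!buffer.isEmpty()) {
            sink.write(buffer);
            buffer.clear();
        }
    }
}

class BatchingMoverDemo {
    public static void main(String[] args) {
        RecordingSink sink = new RecordingSink();
        BatchingMover mover = new BatchingMover(sink, 2);
        mover.setup();
        for (String row : new String[] {"r1", "r2", "r3", "r4", "r5"}) {
            mover.map(row);
        }
        mover.cleanup();
        System.out.println(sink.batches); // prints [[r1, r2], [r3, r4], [r5]]
    }
}
```

In a real job the flush would issue the puts to the target table and the deletes to the source table. Because a failed task can be re-run by the framework, it is safer if replaying the same rows gives the same result.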
-
Re: Coprocessor end point vs MapReduce?Anoop John 2012-10-25, 17:13
> What I still don’t understand is, since both CP and MR are
> running on the region side, why is the MR better than the CP?

For bulk delete alone, a CP (Endpoint) will be better than MR for sure. Considering your overall need, people were suggesting MR as the better fit: you need to scan and also move some data into another table.

Do both MR and CP run on the region side? Well, there is a difference. The CP runs within the RS process itself, which is why bulk delete using an Endpoint is efficient: it is a local read and delete, with no network calls involved at all. In the MR case, even if the mappers run on the same machine as the region, it is still inter-process communication.

I hope that explains the difference.

-Anoop-
-
Re: Coprocessor end point vs MapReduce?Jerry Lam 2012-10-25, 20:43
Hi JM:
There was a thread discussing M/R bulk delete vs. coprocessor bulk delete. The thread subject is "Bulk Delete". The poster there suggested writing an HFile that contains all the delete markers and then using the incremental bulk load facility to move all the delete markers to the regions at once. This strategy works for my use case too, because my M/R job generates a lot of version delete markers.

You might take a look at that thread for additional ways to delete data from HBase.

Best Regards,

Jerry
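The delete-marker approach works because a delete in HBase is itself a write: a marker cell is stored next to the data and masks it at read time. The sketch below is a toy model of that idea in plain Java, not HBase code. `Cell`, `MarkerStore`, and `bulkLoad` are hypothetical names, and only version delete markers (which mask exactly one timestamp) are modelled.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical cell: a put or a version delete marker for (row, timestamp). */
class Cell {
    final String row;
    final long ts;
    final boolean deleteMarker;

    Cell(String row, long ts, boolean deleteMarker) {
        this.row = row;
        this.ts = ts;
        this.deleteMarker = deleteMarker;
    }
}

/** Hypothetical store made of immutable "files", each a list of cells. */
class MarkerStore {
    private final List<List<Cell>> files = new ArrayList<>();

    /** Models a bulk load: the whole file is attached in one step. */
    void bulkLoad(List<Cell> file) {
        files.add(new ArrayList<>(file));
    }

    /** Timestamps of the versions of a row that no marker masks. */
    List<Long> visibleVersions(String row) {
        // First pass: collect the timestamps masked by delete markers.
        Set<Long> masked = new HashSet<>();
        for (List<Cell> file : files) {
            for (Cell c : file) {
                if (c.deleteMarker && c.row.equals(row)) {
                    masked.add(c.ts);
                }
            }
        }
        // Second pass: keep only the puts that are not masked.
        List<Long> visible = new ArrayList<>();
        for (List<Cell> file : files) {
            for (Cell c : file) {
                if (!c.deleteMarker && c.row.equals(row) && !masked.contains(c.ts)) {
                    visible.add(c.ts);
                }
            }
        }
        return visible;
    }
}

class MarkerStoreDemo {
    public static void main(String[] args) {
        MarkerStore store = new MarkerStore();

        List<Cell> data = new ArrayList<>();
        data.add(new Cell("row1", 10L, false));
        data.add(new Cell("row1", 20L, false));
        data.add(new Cell("row2", 10L, false));
        store.bulkLoad(data);

        // The "delete job" only writes markers; it never touches the data file.
        List<Cell> markers = new ArrayList<>();
        markers.add(new Cell("row1", 10L, true));
        markers.add(new Cell("row2", 10L, true));
        store.bulkLoad(markers);

        System.out.println(store.visibleVersions("row1")); // prints [20]
        System.out.println(store.visibleVersions("row2")); // prints []
    }
}
```

In this model the job that decides what to delete produces one file of markers, and the per-row delete calls disappear, which is the appeal of the strategy described above.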