fding hbase
2012-05-14, 09:35
Michel Segel
2012-05-14, 12:17
fding hbase
2012-05-14, 13:20
Michel Segel
2012-05-15, 12:23
fding hbase
2012-05-16, 06:12
Michael Segel
2012-05-16, 18:03
Andrew Purtell
2012-05-16, 18:17
Michael Segel
2012-05-16, 19:07
Dave Revell
2012-05-16, 21:40
Andrew Purtell
2012-05-16, 22:28
fding hbase
2012-05-17, 01:43
Andrew Purtell
2012-05-17, 01:49
Michael Segel
2012-05-17, 17:39
fding hbase
2012-05-18, 00:38
Michael Segel
2012-05-18, 10:40
Michael Segel
2012-05-16, 22:16
-
EndPoint Coprocessor could be deadlocked?
fding hbase 2012-05-14, 09:35
Hi all,

Is it possible to use a table scanner (on a table other than the host region's table), or to execute a coprocessor of another table, from within an endpoint coprocessor? It looks like chaining coprocessors, but I found a possible deadlock. Can anyone help me with this?

In my testing environment I deployed the 0.92.0 version from CDH. I wrote an Endpoint coprocessor to do composite secondary index queries. The index is stored in another table, and index updates are maintained by the client through an extended HTable. A single index query works fine through scanners on the index table, but we soon realized that we need to query multiple indexes at the same time. At first we tried to pull all the row keys returned by each index table and do the merge (just a set intersection) on the client, but that overruns the network bandwidth. So I proposed trying an endpoint coprocessor.

The idea is to use coprocessors, one on the master table (the indexed table) and one on each index table region. Each master table region coprocessor instance invokes the index table coprocessor instances with its region info (the startKey and endKey) and the scan; the index table region coprocessor instance scans and returns the row keys within the range of startKey and endKey passed in.

The cluster sometimes blocks when invoking the index table coprocessor. I traced into the code and found that when HConnection locates regions, it issues an RPC to the same region server.

(After a while I found the index table coprocessor is equivalent to a plain scan with a filter, so I switched to scanners with a filter, but the problem remains.)
-
Re: EndPoint Coprocessor could be deadlocked?
Michel Segel 2012-05-14, 12:17
Need a little clarification...

You said that you need to do multi-index queries.

Did you mean multiple people running queries at the same time, or did you mean multi-key indexes, where the key is made up of multiple parts?

Or did you mean that you really want to use multiple indexes at the same time on a single query?

If it's the latter, that's not really a good idea...
How do you handle the intersection of the two sets? (3 sets or more?)
Can you assume that the indexes are in sort order?

What happens when the results from the indexes exceed the amount of allocated memory?

What I am suggesting is that you set aside the underpinnings of HBase and look at the problem you are trying to solve in general terms. Not an easy one...

Sent from a remote device. Please excuse any typos...

Mike Segel
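To make the sort-order question concrete: if each index can return its row keys in sorted order, two result lists can be intersected with a single merge pass, without building a set of either list. The sketch below is only an illustration under assumptions that are not in the thread: row keys are modeled as Java strings (HBase itself compares raw bytes), each list is already sorted ascending with no duplicates, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HBase: intersects two sorted row-key lists.
class SortedIntersection {
    // Assumes both inputs are sorted ascending and contain no duplicates.
    static List<String> intersect(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) {          // key present in both index results
                out.add(a.get(i));
                i++;
                j++;
            } else if (cmp < 0) {    // list a is behind, advance it
                i++;
            } else {                 // list b is behind, advance it
                j++;
            }
        }
        return out;
    }
}
```

With three or more indexes the same pass can be applied repeatedly, feeding each intermediate result into the next intersection. If the inputs are not sorted, this approach does not apply and a counting or set-based intersection is needed instead.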
-
Re: EndPoint Coprocessor could be deadlocked?
fding hbase 2012-05-14, 13:20
Hi Michel,

I indexed each column within a column family of a table, so we can query a row by a specific column value. By multi-index I mean using multiple indexes at the same time on a single query. That looks like a SQL select with two *where* conditions on two indexed columns.

The row key of the index table is made up of the column value and the row key of the indexed table. For set intersection I used the utility method CollectionUtils.intersection() from the Apache commons-collections package. There's no assumption about the sort order of the indices. A scan of the index table with the column value as startKey and column value+1 as endKey returns all rows in the indexed table with that column value.

For multi-index queries, I previously tried a scan for each index column and intersected those result sets to get the rows that I want. But the query time is too long. So I decided to move the computation of the intersection to the server side and reduce the amount of data transferred.

Do you have any better idea?

--
Best Regards!
Fei Ding [EMAIL PROTECTED]
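The index layout described above (index row key = column value followed by the indexed table's row key, scanned from the column value up to "column value+1") can be sketched in plain Java as follows. This is a minimal illustration, not the poster's code: the class and method names are hypothetical, and no HBase classes are used so that the byte arithmetic stands on its own.

```java
import java.util.Arrays;

// Hypothetical helper, not HBase API: builds index row keys and scan bounds.
class IndexKeys {
    // Index row key = column value bytes followed by the indexed table's row key.
    static byte[] indexRowKey(byte[] columnValue, byte[] masterRowKey) {
        byte[] key = Arrays.copyOf(columnValue, columnValue.length + masterRowKey.length);
        System.arraycopy(masterRowKey, 0, key, columnValue.length, masterRowKey.length);
        return key;
    }

    // Exclusive stop key for a prefix scan: the prefix "plus one".
    // Trailing 0xFF bytes are dropped before incrementing; an empty array
    // is returned when every byte is 0xFF (no finite upper bound exists).
    static byte[] stopKeyForPrefix(byte[] prefix) {
        int end = prefix.length;
        while (end > 0 && prefix[end - 1] == (byte) 0xFF) {
            end--;
        }
        if (end == 0) {
            return new byte[0];
        }
        byte[] stop = Arrays.copyOf(prefix, end);
        stop[end - 1]++; // wraps 0x7F to 0x80, which is still +1 in unsigned order
        return stop;
    }
}
```

One caveat with this layout: if column values have variable length and no separator, a scan for the value "ab" also matches index rows for the value "abc". A fixed-width encoding or a delimiter byte between the value and the row key would avoid that; the thread does not say which, if either, was used.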
-
Re: EndPoint Coprocessor could be deadlocked?
Michel Segel 2012-05-15, 12:23
Sorry for the delay... Had a full day yesterday...

In a nutshell... Tough nut to crack. I can give you a solution which you can probably enhance...

To start, ignore coprocessors for now...

So what you end up doing is the following.

General solution... N indexes..
Create a temp table in HBase. (1 column, foo)

Assuming that you have a simple K,V index, you just need to do a simple get() against the index to get the list of rows...

For each index, fetch the rows.
For each row, write the rowid and then auto-increment a counter in column foo.

Then scan the table where foo's counter >= N. Note that it should == N, but just in case...

Now you have found the rows that appear in all of the indexes.

Having said that...
Again assuming your indexes are simple K,V pairs where V is a set of row ids...

Create a hash map of <rowid, count>
For each index:
    Get() row based on key
    For each rowid in row:
        If map.fetch(rowid) is null then add (rowid, 1)
        Else increment the value in count;
    ;
;
For each rowid in map(rowid, count):
    If count == number of indexes N
        Then add rowid to result set.
;

Now just return the rows whose rowid is in the result set.

That you can do in a coprocessor... but you may have a memory issue, depending on the number of rowids in your index.

Does that help?

Sent from a remote device. Please excuse any typos...

Mike Segel
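The hash-map pseudocode above translates fairly directly into Java. The following is a minimal sketch under stated assumptions: each index lookup has already been resolved into a collection of row ids (the get() against the index is not modeled), row ids are plain strings, and the class name is hypothetical. A per-index set is added so that a row id repeated within one index is only counted once, which the pseudocode leaves implicit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not HBase API: count-based intersection of N index results.
class CountingIntersection {
    // Returns the row ids that occur in every one of the given index results.
    static List<String> intersect(List<? extends Iterable<String>> indexResults) {
        Map<String, Integer> counts = new HashMap<>();
        for (Iterable<String> rowIds : indexResults) {
            Set<String> seen = new HashSet<>(); // count a row id once per index
            for (String rowId : rowIds) {
                if (seen.add(rowId)) {
                    counts.merge(rowId, 1, Integer::sum);
                }
            }
        }
        int n = indexResults.size();
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() == n) { // present in all N indexes
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

As the message warns, the map holds every distinct row id seen in any index, so memory use grows with the union of the index results rather than with their intersection. The order of the returned list is unspecified.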
-
Re: EndPoint Coprocessor could be deadlocked?
fding hbase 2012-05-16, 06:12
Hi Michel,

Thanks for your reply. I believe your idea works both in theory and in practice. But the problem I worry about lies not in memory usage but in network performance. If I query all the indexed rows from the index tables, pull all of them to the client, and push them to the temp table, then the client network overhead is heavy. If I can move the calculation to the server side, the result will be much smaller after the intersection.

But sadly, HBase IPC doesn't allow a coprocessor chaining mechanism... Someone mentioned at http://grokbase.com/t/hbase/user/116hrhhf8m/coprocessor-failure-question-and-examples :

If a RegionObserver issues RPC to another table from any of the hooks that are called out of RPC handlers (for Gets, Puts, Deletes, etc.), you risk deadlock. Whatever activity you want to check should be in the same region as account data to avoid that. (Or HBase RPC needs to change.)

So that means the deadlock is inevitable under the current circumstances. The coprocessors are still limited.

What I'm seeking is possible extensions of coprocessors, or a workaround for situations where an extra RPC is needed in the RPC handlers.

By the way, the idea you described looks like what Apache commons-collections CollectionUtils.intersection() does.

--
Best Regards!
Fei Ding [EMAIL PROTECTED]
-
Re: EndPoint Coprocessor could be deadlocked?
Michael Segel 2012-05-16, 18:03
Ok...

I think you need to step away from your solution and look at the problem from a different perspective.

From my limited understanding of coprocessors, this doesn't fit well with what you want to do. I don't believe that you want to run an M/R query within a coprocessor.

In short, if I understood your problem, your goal is to pull data efficiently from a table by using the intersection of 2 or more indexes.

Note: Most people create composite indexes, but it's possible that you want to index data against a column value along with a different type of index... like geospatial.

So here you need to capture the intersection of the index lists and then use that resulting subset as input to an M/R job to return the underlying data. (Note: you can do this in a single child too.)

If you use an M/R job to fetch and process the result set, you would need to put your intersection into a Java object, like an ordered list, which you can then split and pass off to each node.
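The last step, splitting the ordered list of intersected row keys so that each node receives one piece, might look like the sketch below. It is an illustration only: how the slices would actually be handed to map tasks is not described in the thread and is not modeled here, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Hadoop or HBase API: partitions a sorted row-key list.
class RowKeySplitter {
    // Splits a sorted list into at most `parts` contiguous, near-equal slices.
    static List<List<String>> split(List<String> sortedRowKeys, int parts) {
        if (parts <= 0) {
            throw new IllegalArgumentException("parts must be positive");
        }
        List<List<String>> slices = new ArrayList<>();
        int size = sortedRowKeys.size();
        int base = size / parts;
        int extra = size % parts; // the first `extra` slices get one more key
        int from = 0;
        for (int p = 0; p < parts && from < size; p++) {
            int to = from + base + (p < extra ? 1 : 0);
            slices.add(new ArrayList<>(sortedRowKeys.subList(from, to)));
            from = to;
        }
        return slices;
    }
}
```

Keeping each slice contiguous preserves the sort order within a slice, so a task that issues get()s for its slice tends to touch neighboring regions rather than jumping across the whole table.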
-
Re: EndPoint Coprocessor could be deadlocked?
Andrew Purtell 2012-05-16, 18:17
> On May 16, 2012, at 1:12 AM, fding hbase wrote:
>> But sadly, HBase ipc doesn't allow coprocessor chaining mechanism...
>> Someone mentioned on
>> http://grokbase.com/t/hbase/user/116hrhhf8m/coprocessor-failure-question-and-examples
>> :
>>
>> If a RegionObserver issues RPC to another table from any of the hooks that
>> are called out of RPC handlers (for Gets, Puts, Deletes, etc.), you risk
>> deadlock. Whatever activity you want to check should be in the same
>> region as account data to avoid that.
>> (Or HBase RPC needs to change.)
>>
>> So, that means, the deadlock is inevitable under current circumstance. The
>> coprocessors are still limited.
>>
>> What I'm seeking is possible extensions of coprocessors or workaround for
>> such situations that extra RPC is needed in the RPC handlers.

This isn't a limitation, this is a design choice. Such extensions of coprocessors most likely won't happen. What a RegionObserver allows you to do is exactly this: intercept and potentially modify lifecycle or user operations on that single region alone. If it helps, think of each region as its own independent database.

If you need to take cross-region actions according to some user action, then you should be looking first at extending the client, not the server.

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
-
Re: EndPoint Coprocessor could be deadlocked?
Michael Segel 2012-05-16, 19:07
I think we need to look at the base problem that is trying to be solved.

The discussion has focused on the RPC mechanism, but the problem that the OP is trying to solve is how to use multiple indexes in a 'query'.

Note: I put ' ' around query because it's an M/R job or a single thread where the user is trying to get a result set that is a significantly smaller subset, using more than 1 index.

So the idea is to do a quick get() against each index; the result would be a list of row keys. The next step is to get the intersection(s) quickly (which I proposed), and then you would just need to do a quick series of get()s to pull back the list of rows.

If I understand the OP's problem, it's not a coprocessor type of problem.

It's one where you submit an M/R job. Within your ToolRunner, you would actually do the fetches against the indexes and then build the ultimate result set. Then you just need a map job to take your result set as an input.

Drawback... if the list of rows is very, very long, you may run out of memory. So you need to resolve that... (Which is why I was suggesting using a temp table; then you can use the rows in the temp table as input to your fetch...)

While not something I would use for 'real time', it's something where you can really shrink the number of rows you have to fetch for further processing. So if your full table scan takes an hour, we can instead do N get()s to get the rows in the indexes, find the intersection I, and then do I.size() get()s to fetch the data. This should take much less time.

Again, I don't see this as a coprocessor-based solution; however, the N get()s and the intersection could be done at the start of the job, or could be part of a map-only job.

Kind of an interesting problem... but if anyone has a large set of data and some time to play, you will end up solving a problem that you can't do in an RDBMS easily.
-
Re: EndPoint Coprocessor could be deadlocked?
Dave Revell 2012-05-16, 21:40
Many people will probably try to use coprocessors as a way of implementing app logic on top of HBase without the headaches of writing a daemon. Sometimes client-side approaches are inadvisable; for example, there may be several client languages/runtimes and the app logic should not be reimplemented in each.

It's understandable that people wouldn't want to deal with setting up a daemon and RPC mechanism if they can piggyback on the existing HBase coprocessor mechanism.

Are HBase coprocessors explicitly wrong for this use case if the app logic needs to access multiple regions in a single call?

Cheers,
Dave
-
Re: EndPoint Coprocessor could be deadlocked?
Andrew Purtell 2012-05-16, 22:28
On Wed, May 16, 2012 at 2:40 PM, Dave Revell <[EMAIL PROTECTED]> wrote:
> Many people will probably try to use coprocessors as a way of implementing
> app logic on top of HBase without the headaches of writing a daemon.
> Sometimes client-side approaches are inadvisable; for example, there may be
> several client languages/runtimes and the app logic should not be
> reimplemented in each.

No, but abstracting to a common client library seems reasonable for many cases, or building a DAO, which may happen anyway if you want to hedge.

> It's understandable that people wouldn't want to deal with setting up a
> daemon and RPC mechanism if they can piggyback on the existing HBase
> coprocessor mechanism.

Which they certainly can if the scope of access within the RegionObserver is the region.

> Are HBase coprocessors explicitly wrong for this use case if the app logic
> needs to access multiple regions in a single call?

Not coprocessors in general. The client side support for Endpoints (Exec, etc.) gives the developer the fiction of addressing the cluster as a range of rows, and will parallelize per-region Endpoint invocations, collect the responses, and return them all to the caller as "a single call". However, for RegionObservers, if you want to do something cross-region, and therefore issue one or more RPCs which must complete *before you can complete the RPC* you are currently processing, then this is inherently problematic and deadlock prone. If on the other hand you schedule the cross-region work with an Executor or similar and return on the current RPC, that would be ok.

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
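The failure mode described above, a handler that cannot finish until a second request is served by the same bounded set of handlers, can be reproduced in miniature with a plain Java thread pool. The sketch below is a toy model and not HBase code: a fixed thread pool stands in for a region server's RPC handlers, the pool size of one is chosen only to make the stall deterministic, and all names are hypothetical. A real cluster has many handlers, so the stall appears only when enough of them are waiting at once.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Toy model, not HBase code: a fixed thread pool stands in for the RPC handlers.
class HandlerPoolDemo {
    // Runs an "outer" request on `handlers` that needs an "inner" request,
    // executed on `innerPool`, before it can finish. Returns true if the
    // outer request completes within the timeout.
    static boolean outerCompletes(ExecutorService handlers, ExecutorService innerPool,
                                  long timeoutMillis) {
        Future<String> outer = handlers.submit(() -> {
            Future<String> inner = innerPool.submit(() -> "rows");
            return inner.get(); // blocks this handler thread until inner finishes
        });
        try {
            outer.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false; // the handler is stuck waiting on work that cannot start
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    // Inner request queued behind the only handler, which is waiting for it.
    static boolean sameSingleHandlerPool() {
        ExecutorService handlers = Executors.newFixedThreadPool(1);
        try {
            return outerCompletes(handlers, handlers, 500);
        } finally {
            handlers.shutdownNow(); // interrupt the stuck handler thread
        }
    }

    // Inner request served by a different pool, so the handler is released.
    static boolean separatePool() {
        ExecutorService handlers = Executors.newFixedThreadPool(1);
        ExecutorService other = Executors.newFixedThreadPool(1);
        try {
            return outerCompletes(handlers, other, 500);
        } finally {
            handlers.shutdownNow();
            other.shutdownNow();
        }
    }
}
```

In this model `sameSingleHandlerPool()` times out while `separatePool()` completes. Note that the second case still blocks a handler thread while it waits; the suggestion in the message goes further and has the handler return on the current RPC immediately, leaving the cross-region work to the background executor.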
-
Re: EndPoint Coprocessor could be deadlocked?
fding hbase 2012-05-17, 01:43
On Thu, May 17, 2012 at 6:28 AM, Andrew Purtell <[EMAIL PROTECTED]> wrote:
> > Are HBase coprocessors explicitly wrong for this use case if the app logic
> > needs to access multiple regions in a single call?
>
> Not coprocessors in general. The client side support for Endpoints
> (Exec, etc.) gives the developer the fiction of addressing the cluster
> as a range of rows, and will parallelize per-region Endpoint
> invocations, and collect the responses, and can return them all to the
> caller as "a single call".

But on the deadlock problem, an Endpoint behaves the same way as an Observer. Endpoints are also executed by the RPC handlers of the RegionServer.

> However for RegionObservers, if you want to
> do something cross-region, so therefore issue one or more RPCs which
> must complete *before you can complete the RPC* you are currently
> processing, then this is inherently problematic and deadlock prone. If
> on the other hand you schedule the cross-region work with an Executor
> or similar and return on the current RPC, that would be ok.

This means that once the RPC handlers are blocked, the cluster can be considered dead, because coprocessors are written by users and any kind of code may appear on the server side.

If the Executor approach is also feasible for an Endpoint, then how does it return the results the client is waiting for? Maybe the client needs an extra loop that issues RPCs to poll for the results. It also means that the Endpoint has to keep the results on the server, so the Endpoint has to be stateful. This raises another question I have: should coprocessors be stateful or stateless? What if the client dies before it can retrieve the results? Should a lease be created for those results, just as the RegionServer does for scanners?

It looks messy, but it is possible anyway.

--
Best Regards!
Fei Ding [EMAIL PROTECTED]
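The stateful variant Fei Ding describes (return at once, keep the result on the server under a lease, let the client poll) could look roughly like the sketch below. Everything here is hypothetical: `LeasedResultsSketch`, the ticket numbers and the explicit `nowMs` clock parameter are illustration only and are not part of HBase. The clock is passed in so that the behaviour is easy to test.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class LeasedResultsSketch {
    /** A finished result plus the time after which it may be discarded. */
    private static final class Lease {
        final List<String> rowKeys;
        final long expiresAtMs;

        Lease(List<String> rowKeys, long expiresAtMs) {
            this.rowKeys = rowKeys;
            this.expiresAtMs = expiresAtMs;
        }
    }

    private final Map<Long, Lease> results = new ConcurrentHashMap<>();
    private final AtomicLong nextTicket = new AtomicLong();
    private final long leaseMs;

    LeasedResultsSketch(long leaseMs) {
        this.leaseMs = leaseMs;
    }

    /** Called when the request arrives: hands back a ticket immediately. */
    long open() {
        return nextTicket.incrementAndGet();
    }

    /** Called by the background job when the cross-region work is done. */
    void publish(long ticket, List<String> rowKeys, long nowMs) {
        results.put(ticket, new Lease(rowKeys, nowMs + leaseMs));
    }

    /**
     * Called by the polling client. Returns the result once and forgets it,
     * or null if nothing is ready yet or the lease has already run out.
     */
    List<String> poll(long ticket, long nowMs) {
        Lease lease = results.remove(ticket);
        if (lease == null || lease.expiresAtMs <= nowMs) {
            return null;
        }
        return lease.rowKeys;
    }

    /** Drops results whose client never came back; returns how many were dropped. */
    int expire(long nowMs) {
        int removed = 0;
        for (Map.Entry<Long, Lease> e : results.entrySet()) {
            if (e.getValue().expiresAtMs <= nowMs && results.remove(e.getKey(), e.getValue())) {
                removed++;
            }
        }
        return removed;
    }
}
```

A client that dies simply never polls, and a periodic `expire` call reclaims its result, which mirrors the scanner-lease idea raised in the message.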
-
Re: EndPoint Coprocessor could be deadlocked? Andrew Purtell 2012-05-17, 01:49
On Wed, May 16, 2012 at 6:43 PM, fding hbase <[EMAIL PROTECTED]> wrote:
>> Not coprocessors in general. The client side support for Endpoints
>> (Exec, etc.) gives the developer the fiction of addressing the cluster
>> as a range of rows, and will parallelize per-region Endpoint
>> invocations, and collect the responses, and can return them all to the
>> caller as "a single call".
>
> But on the deadlock problem the Endpoint behaves the same way as Observer.
> Endpoints are also executed via RPC handlers of RegionServer.

Reread what I wrote. I'm not talking about the server side above.

Regarding the RPC issues, yes the behavior is the same. My other point was there is no RPC deadlock if you schedule your additional work (which issues RPCs) in some background thread or Executor and return to the client immediately. But that is not what you have claimed you want to do: you want to do some distributed indexed join, if I understood it correctly, *first* (via RPC) and *then* return to the client. That is how you would get deadlocks.

> the coprocessors are written by users and any kind of
> code may appear on the server side.

You should not let just any user run coprocessors on the server. That's madness.

Best regards,

- Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
-
Re: EndPoint Coprocessor could be deadlocked? Michael Segel 2012-05-17, 17:39
> You should not let just any user run coprocessors on the server. That's madness.
>
> Best regards,
>
> - Andy

Fei Ding,

I'm a little confused.
Are you trying to solve the problem of querying data efficiently from a table, or are you trying to find an example of where and when to use co-processors?

You actually have an interesting problem that isn't easily solved in relational databases, but I don't think it's an appropriate problem if you want to stress the use of coprocessors.

Yes, with indexes you want to use coprocessors as a way to keep the index in sync with the underlying table.

However beyond that... the solution is really best run as an M/R job.

Consider that HBase has two different access methods: one is as part of M/R jobs, the other is a client/server model. If you wanted to, you could create a service/engine/app that would allow you to efficiently query and return result sets from your database, as well as manage indexes.
In part, coprocessors make this a lot easier.

If you consider the general flow of my solution earlier in this thread, you now have a really great way to implement this.

Note: we're really talking about allowing someone to query data from a table using multiple indexes and index types. Think alternate table (key/value pair), Lucene/SOLR, and GeoSpatial.

You could even benchmark it against an Oracle implementation, and probably smoke it.
You could also do efficient joins between tables.

So yeah, I would encourage you to work on your initial problem... ;-)

Just Saying... ;-)

-Mike
-
Re: EndPoint Coprocessor could be deadlocked? fding hbase 2012-05-18, 00:38
Hi Michel,

On Fri, May 18, 2012 at 1:39 AM, Michael Segel <[EMAIL PROTECTED]> wrote:
> > You should not let just any user run coprocessors on the server. That's
> > madness.
> >
> > Best regards,
> >
> > - Andy
>
> Fei Ding,
>
> I'm a little confused.
> Are you trying to solve the problem of querying data efficiently from a
> table, or are you trying to find an example of where and when to use
> co-processors?

I'm trying to solve the problem of querying data efficiently. Coprocessors are one of the possible solutions that I've tried.

> You actually have an interesting problem that isn't easily solved in
> relational databases, but I don't think it's an appropriate problem if you
> want to stress the use of coprocessors.
>
> Yes, with indexes you want to use coprocessors as a way to keep the index
> in sync with the underlying table.
>
> However beyond that... the solution is really best run as an M/R job.
>
> Consider that HBase has two different access methods: one is as part of
> M/R jobs, the other is a client/server model. If you wanted to, you could
> create a service/engine/app that would allow you to efficiently query and
> return result sets from your database, as well as manage indexes.
> In part, coprocessors make this a lot easier.

I'm not using coprocessors to maintain the index tables; an extended client does this.

> If you consider the general flow of my solution earlier in this thread,
> you now have a really great way to implement this.
>
> Note: we're really talking about allowing someone to query data from a
> table using multiple indexes and index types. Think alternate table
> (key/value pair), Lucene/SOLR, and GeoSpatial.
>
> You could even benchmark it against an Oracle implementation, and
> probably smoke it.
> You could also do efficient joins between tables.
>
> So yeah, I would encourage you to work on your initial problem... ;-)

An alternate table is also one of the possible solutions; however, it's not that easy either. I'm still working on it. ;-)

--
Best Regards!
Fei Ding
[EMAIL PROTECTED]
-
Re: EndPoint Coprocessor could be deadlocked? Michael Segel 2012-05-18, 10:40
Fei Ding,

I think you're making the solution harder than it should be.

To start with, the only thing you need to do is use co-processors to keep the indexes in sync with the underlying table. The code called from the co-processor will depend on the type of action and the type of index you are using.

Then you only need to focus on how you use the index and then on how you implement the intersection of the result sets.

One idea I had was to invert the intersection table so that you would have N rows where each row would contain the result set. Then you fetch one row to get your row keys. So if you have 3 indexes whose intersection you want to find, fetching the row with key 3 would yield the intersection, rather than scanning the key values and fetching the intersection count. (This could work, but you may have issues with very large result sets. How many columns can you have?)

The point is that if you place your focus first on the problem and then secondly on the mechanics, you will have an easier time solving the problem. The only catch is that you have to be able to work in the abstract.

HTH

-Mike

PS. This really is an interesting problem which, when solved, will help with the evolution of HBase more as a database than as a persistent object store.
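One way to read Mike's "inverted" table is as a map from hit count to the row keys that were returned by exactly that many indexes, so that looking up the key equal to the number of indexes yields the intersection. The in-memory sketch below only illustrates that reading; `InvertedIntersectionSketch` and `invert` are hypothetical names, and in the thread the structure would live in an HBase table rather than in a Java map. It assumes each index returns a given row key at most once.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIntersectionSketch {
    /**
     * Takes one set of row keys per index and returns a map from
     * "number of indexes that matched" to the row keys with that count.
     */
    static Map<Integer, Set<String>> invert(List<Set<String>> indexResults) {
        // First pass: count how many indexes returned each row key.
        Map<String, Integer> hits = new HashMap<>();
        for (Set<String> rowKeys : indexResults) {
            for (String rowKey : rowKeys) {
                hits.merge(rowKey, 1, Integer::sum);
            }
        }
        // Second pass: invert, so the count becomes the key.
        Map<Integer, Set<String>> byCount = new TreeMap<>();
        for (Map.Entry<String, Integer> e : hits.entrySet()) {
            byCount.computeIfAbsent(e.getValue(), c -> new TreeSet<>()).add(e.getKey());
        }
        return byCount;
    }
}
```

With three indexes, `invert(results).get(3)` is the set of row keys present in all of them; it is `null` when the intersection is empty. As the message notes, a single entry can grow very large, which is the same concern as a very wide row.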
-
Re: EndPoint Coprocessor could be deadlocked? Michael Segel 2012-05-16, 22:16
David,

It's not a question of a daemon, it's a question of the problem you are trying to solve.

Using this as an example: you are not always going to select data from a given table using the same query. So you will not always want to use the index on column A and then the index on column D. If you were, then you'd save yourself a lot of headaches by just using a composite index.

Again, what I am suggesting is that you step away from the mechanics of the OP's attempt at solving a problem, and focus on his problem. He wants to use two secondary indexes to further filter the resulting data set. An excellent example is if you want to filter your data set using two orthogonal indexes on the underlying data set. Think about doing an index on one field that is a string, and a second field that is geo-spatial data.

Does this belong inside a co-processor? Maybe, maybe not. I would think that in terms of coprocessor use, one would want to use them to keep the indexes in sync, not use them for queries. Does that make sense?

BTW, would you consider making a call to an external system from within a coprocessor? I mean, would you want your coprocessor calling something like an external Lucene index? I don't think it would be a good idea. But that's a different conversation.

With respect to the OP's initial problem, I really don't think you want to do this as a co-processor problem.

On May 16, 2012, at 4:40 PM, Dave Revell wrote:
> Many people will probably try to use coprocessors as a way of implementing
> app logic on top of HBase without the headaches of writing a daemon.
> Sometimes client-side approaches are inadvisable; for example, there may be
> several client languages/runtimes and the app logic should not be
> reimplemented in each.
>
> It's understandable that people wouldn't want to deal with setting up a
> daemon and RPC mechanism if they can piggyback on the existing HBase
> coprocessor mechanism.
>
> Are HBase coprocessors explicitly wrong for this use case if the app logic
> needs to access multiple regions in a single call?
>
> Cheers,
> Dave
>
> On Wed, May 16, 2012 at 12:07 PM, Michael Segel <[EMAIL PROTECTED]> wrote:
>
>> I think we need to look at the base problem that is trying to be solved.
>>
>> I mean the discussion on the RPC mechanism. but the problem that the OP is
>> trying to solve is how to use multiple indexes in a 'query'.
>>
>> Note: I put ' ' around query because its a m/r job or a single thread
>> where the user is trying to get a result set which is a significantly
>> smaller subset, using more than 1 index.
>>
>> So the idea is to do a quick get() against each index and the result would
>> be a list of row keys. The next step is to get the intersection(s) quickly
>> (which I proposed), and then you would just need to do a quick series of
>> get()s to pull back the list of rows.
>>
>> If I understand the OP's problem, its not a co-processor type of problem.
>>
>> Its one of where you submit a m/r job. Within your toolRunner, you would
>> actually do the fetches against the indexes and then build the ultimate
>> result set. then you just need a map job to take your result set as an
>> input.
>>
>> Drawback... if the list of rows is very, very long, you may run out of
>> memory. So you need to resolve that...
>> (Which is why I was suggesting on using a temp table and then you can use
>> the rows in the temp table as input in to your fetch...
>>
>> While not something I would use for 'real time' its something where I can
>> really shrink the number of rows you have to fetch for further processing.
>> So if your full table scan takes an hour, but we can do N get()s to get
>> the rows in the Index, find the intersection I and then do I.size() get()s
>> to fetch the data. This should take much less time.
>>
>> Again, I don't see this in a coprocessor based solution, however, the N
>> get()s and intersection could be done at the start of the job, or could be
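The flow quoted above (N index lookups, intersect the row-key lists, then I.size() gets) has a purely in-memory middle step that can be sketched as follows. This is an assumption-laden illustration, not the OP's code: `IndexIntersectionSketch` and `intersect` are hypothetical names, the index lookups and the final gets against HBase are left out, and the row keys are assumed to fit in memory, which is exactly the drawback the message raises for very long lists.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class IndexIntersectionSketch {
    /**
     * Intersects the row keys returned by each index. Starting from the
     * smallest set keeps the working set as small as possible.
     */
    static Set<String> intersect(List<Set<String>> indexResults) {
        if (indexResults.isEmpty()) {
            return Collections.emptySet();
        }
        List<Set<String>> bySize = new ArrayList<>(indexResults);
        bySize.sort((a, b) -> Integer.compare(a.size(), b.size()));

        Set<String> result = new TreeSet<>(bySize.get(0));
        for (Set<String> other : bySize.subList(1, bySize.size())) {
            result.retainAll(other);
            if (result.isEmpty()) {
                break; // nothing can survive further intersections
            }
        }
        return result;
    }
}
```

The returned set is the list of row keys that would then be fetched one by one from the base table, so its size is the I.size() in the message.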
|