HBase, mail # user - EndPoint Coprocessor could be dealocked?


Re: EndPoint Coprocessor could be dealocked?
Michael Segel 2012-05-16, 22:16
David,

It's not a question of a daemon; it's a question of the problem you are trying to solve.
Using this as an example: you are not always going to select data from a given table with the same query, so you will not always want to use the index on column A and then the index on column D.

If you were, then you'd save yourself a lot of headaches by just using a composite index.
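
As a rough illustration of the composite-index case (plain Python, not HBase API code): the index key concatenates the values of columns A and D with the base row key, so a single prefix scan answers the fixed "A then D" query. The separator, the sample values, and the helper names are all hypothetical.

```python
import bisect

SEP = "\x00"  # hypothetical separator; assumed never to occur in the values


def composite_key(a, d, row_key):
    """Build an index entry of the form A-value, D-value, base row key."""
    return SEP.join((a, d, row_key))


# Hypothetical index contents, kept sorted the way row keys in a table would be.
index = sorted([
    composite_key("red", "large", "row1"),
    composite_key("red", "small", "row2"),
    composite_key("blue", "large", "row3"),
    composite_key("red", "large", "row4"),
])


def scan_prefix(sorted_index, a, d):
    """Return base row keys whose entries start with the (A, D) prefix."""
    prefix = a + SEP + d + SEP
    start = bisect.bisect_left(sorted_index, prefix)
    rows = []
    for entry in sorted_index[start:]:
        if not entry.startswith(prefix):
            break  # sorted order: past the prefix range, nothing more to find
        rows.append(entry[len(prefix):])
    return rows
```

This only helps when the query always constrains A and then D; it does nothing for a query that needs D alone.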

Again, what I am suggesting is that you step away from the mechanics of the OP's attempt at solving a problem, and focus on his problem.

He wants to use two secondary indexes to further filter the resulting data set.

An excellent example is filtering your data set using two orthogonal indexes on the underlying data. Think about an index on one field that is a string, and a second index on a field that holds geo-spatial data.
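
A minimal sketch of that idea, using plain Python dictionaries as stand-ins for the two index tables and the base table (this is not HBase API code, and every name and value below is hypothetical): look up each index, intersect the resulting row keys, then fetch only the rows that survive.

```python
# Hypothetical in-memory stand-ins for two secondary index tables.
# Each maps an indexed value to the set of base-table row keys.
name_index = {
    "smith": {"row1", "row3", "row7"},
    "jones": {"row2"},
}
geo_index = {
    "cell-42": {"row3", "row7", "row9"},
    "cell-43": {"row1"},
}

# Hypothetical base table: row key -> row data.
base_table = {
    "row1": {"name": "smith", "cell": "cell-43"},
    "row2": {"name": "jones", "cell": "cell-44"},
    "row3": {"name": "smith", "cell": "cell-42"},
    "row7": {"name": "smith", "cell": "cell-42"},
    "row9": {"name": "brown", "cell": "cell-42"},
}


def lookup(index, value):
    """One get() against an index: returns the row keys for a value."""
    return index.get(value, set())


def query(name, cell):
    """Intersect the two index results, then fetch only the matching rows."""
    candidates = lookup(name_index, name) & lookup(geo_index, cell)
    return {key: base_table[key] for key in sorted(candidates)}
```

The work is two index lookups plus one fetch per surviving row key, instead of a scan over the whole base table. The intersection here is held in memory, which is only reasonable when the candidate lists are small.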

Does this belong inside a coprocessor? Maybe, maybe not.

I would think that in terms of coprocessor use, one would want to use them to keep the indexes in sync, not to run queries.

Does that make sense?

BTW, would you consider making a call to an external system from within a coprocessor? I mean, would you want your coprocessor calling something like an external Lucene index? I don't think it would be a good idea. But that's a different conversation.

With respect to the OP's initial problem, I really don't think you want to treat this as a coprocessor problem.

On May 16, 2012, at 4:40 PM, Dave Revell wrote:

> Many people will probably try to use coprocessors as a way of implementing
> app logic on top of HBase without the headaches of writing a daemon.
> Sometimes client-side approaches are inadvisable; for example, there may be
> several client languages/runtimes and the app logic should not be
> reimplemented in each.
>
> It's understandable that people wouldn't want to deal with setting up a
> daemon and RPC mechanism if they can piggyback on the existing HBase
> coprocessor mechanism.
>
> Are HBase coprocessors explicitly wrong for this use case if the app logic
> needs to access multiple regions in a single call?
>
> Cheers,
> Dave
>
> On Wed, May 16, 2012 at 12:07 PM, Michael Segel
> <[EMAIL PROTECTED]> wrote:
>
>>
>> I think we need to look at the base problem that is trying to be solved.
>>
>> I mean, the discussion is on the RPC mechanism, but the problem that the OP is
>> trying to solve is how to use multiple indexes in a 'query'.
>>
>> Note: I put ' ' around query because it's an m/r job or a single thread
>> where the user is trying to get a result set which is a significantly
>> smaller subset, using more than 1 index.
>>
>> So the idea is to do a quick get() against each index and the result would
>> be a list of row keys. The next step is to get the intersection(s) quickly
>> (which I proposed), and then you would just need to do a quick series of
>> get()s to pull back the list of rows.
>>
>> If I understand the OP's problem, it's not a coprocessor type of problem.
>>
>> It's one where you submit an m/r job. Within your toolRunner, you would
>> actually do the fetches against the indexes and then build the ultimate
>> result set. Then you just need a map job to take your result set as an
>> input.
>>
>> Drawback... if the list of rows is very, very long, you may run out of
>> memory. So you need to resolve that...
>> (Which is why I was suggesting using a temp table, so that you can use
>> the rows in the temp table as input to your fetch.)
>>
>> While not something I would use for 'real time', it's something that can
>> really shrink the number of rows you have to fetch for further processing.
>> So if your full table scan takes an hour, we can instead do N get()s to get
>> the rows in the index, find the intersection I, and then do I.size() get()s
>> to fetch the data. This should take much less time.
>>
>>
>> Again, I don't see this in a coprocessor based solution, however, the N
>> get()s and intersection could be done at the start of the job, or could be