
HBase >> mail # user >> Coprocessors


Re: Coprocessors
Thanks for the additional info, Sudarshan. This would fit well with the
implementation of Phoenix's skip scan.

CREATE TABLE t (
     object_id INTEGER NOT NULL,
     field_type INTEGER NOT NULL,
     attrib_id INTEGER NOT NULL,
     value BIGINT,
     CONSTRAINT pk PRIMARY KEY (object_id, field_type, attrib_id));

SELECT count(value), sum(value), avg(value) FROM t
WHERE object_id IN (?,?,?) AND field_type IN (?,?,?)
AND attrib_id IN (?,?,?);

and then your client would do whatever additional computation it needed
on the results it got back.
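The "additional computation" on the client stays cheap if each server hands back mergeable partial aggregates (a count and a sum, from which the average falls out). A minimal sketch of that merge step; the `Partial` class and the numbers are illustrative, not Phoenix's actual result format:

```java
import java.util.List;

public class MergeAggregates {
    // One server's partial result: row count and running sum of `value`.
    static final class Partial {
        final long count;
        final long sum;
        Partial(long count, long sum) { this.count = count; this.sum = sum; }
    }

    // Combine partials: counts and sums add; avg is derived at the end,
    // never averaged across servers (that would weight regions unevenly).
    static double[] merge(List<Partial> partials) {
        long count = 0, sum = 0;
        for (Partial p : partials) {
            count += p.count;
            sum += p.sum;
        }
        double avg = count == 0 ? 0.0 : (double) sum / count;
        return new double[] { count, sum, avg };
    }

    public static void main(String[] args) {
        // Pretend two region servers each returned a partial aggregate.
        List<Partial> fromServers = List.of(new Partial(3, 30), new Partial(2, 10));
        double[] g = merge(fromServers);
        System.out.printf("count=%.0f sum=%.0f avg=%.1f%n", g[0], g[1], g[2]);
        // count=5 sum=40 avg=8.0
    }
}
```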

Would that fit with what you're trying to do?

     James

On 04/25/2013 03:36 PM, Sudarshan Kadambi (BLOOMBERG/ 731 LEXIN) wrote:
> Michael: Fair enough. Let me see what relevant information I can add to what I've already said:
>
> 1. To Lars' point, my 250K keys are unlikely to fall into fewer than 250K sub-ranges.
> 2. Here's a bit more about my schema:
>   2.1 My rowkeys are composed of 2 entities - let's call them object-id and field-type. An object (O1) has 100s of field types (F1,F2,F3...). Each object-id - field-type pair has 100s of attributes (A1,A2,A3).
>   2.2 My rowkeys are O1-F1, O1-F2, O1-F3, etc.
>   2.3 My primary application (not the one my original post was about) accesses by these rowkeys.
>   2.4 My application that does aggregation is given a bunch of objects <O1, O2, O3>, a field-type <F1>, a bunch of attributes <A1,A2> and some computation to perform.
>   2.5 As you can see, scans are unlikely to be useful when fetching O1-F1, O2-F1, O3-F1 etc.
>
> Viral: How do I tackle aggregation using observers? Let's say I override the postGet method. I do a multi-get from my client and my method gets called on each region server for each row. What is the next step with this approach?
>
>
> ----- Original Message -----
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED], [EMAIL PROTECTED]
> Cc: Sudarshan Kadambi (BLOOMBERG/ 731 LEXIN)
> At: Apr 25 2013 18:12:46
>
> I don't think Phoenix will solve his problem.
>
> He also needs to explain more about his problem before we can start to think about a solution.
>
>
> On Apr 25, 2013, at 4:54 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>
>> You might want to have a look at Phoenix (https://github.com/forcedotcom/phoenix), which does that and more, and gives a SQL/JDBC interface.
>>
>> -- Lars
>>
>>
>>
>> ________________________________
>> From: Sudarshan Kadambi (BLOOMBERG/ 731 LEXIN) <[EMAIL PROTECTED]>
>> To: [EMAIL PROTECTED]
>> Sent: Thursday, April 25, 2013 2:44 PM
>> Subject: Coprocessors
>>
>>
>> Folks:
>>
>> This is my first post on the HBase user mailing list.
>>
>> I have the following scenario:
>> I have an HBase table of up to a billion keys. I'm looking to support an application where, on some user action, I'd need to fetch multiple columns for up to 250K keys and do some sort of aggregation on them. Fetching all that data and doing the aggregation in my application takes about a minute.
>>
>> I'm looking to co-locate the aggregation logic with the region servers to
>> a. Distribute the aggregation
>> b. Avoid having to fetch large amounts of data over the network (this could potentially be cross-datacenter)
>>
>> Neither observers nor aggregation endpoints work for this use case. Observers don't return data back to the client, while aggregation endpoints work in the context of scans, not a multi-get (are these correct assumptions?).
>>
>> I'm looking to write a service that runs alongside the region servers and acts as a proxy between my application and the region servers.
>>
>> I plan to use the logic in the HBase client's HConnectionManager to segment my request of 1M rowkeys into sub-requests per region server. These are sent over to the proxy, which fetches the data from the region server, aggregates locally, and sends the results back. Does this sound reasonable, or even a useful thing to pursue?
>>
>> Regards,
>> -sudarshan
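The segmentation step described in the first message above boils down to a floor lookup against a sorted map of region start keys. A rough standalone sketch of that idea; the region boundaries and server names are made up, and in real HBase 0.94 the client's HConnectionManager performs this lookup against the META table for you:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupByRegion {
    // Bucket row keys by the server hosting their region. The TreeMap maps
    // each region's start key to its server (stand-in for META metadata).
    static Map<String, List<String>> group(
            TreeMap<String, String> regionStartToServer, List<String> rowKeys) {
        Map<String, List<String>> perServer = new HashMap<>();
        for (String key : rowKeys) {
            // The hosting region is the one with the greatest start key <= row key.
            String server = regionStartToServer.floorEntry(key).getValue();
            perServer.computeIfAbsent(server, s -> new ArrayList<>()).add(key);
        }
        return perServer;
    }

    public static void main(String[] args) {
        TreeMap<String, String> regions = new TreeMap<>();
        regions.put("", "rs1.example.com");   // first region starts at the empty key
        regions.put("O2-", "rs2.example.com"); // split point between O1-* and O2-*
        Map<String, List<String>> batches =
            group(regions, List.of("O1-F1", "O2-F1", "O3-F1"));
        System.out.println(batches);
    }
}
```

Each per-server batch would then become one multi-get sent to that server's proxy, which aggregates locally before replying.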