HBase user mailing list: Re: Coprocessors


Thread:
  Sudarshan Kadambi 2013-04-25, 21:57
  lars hofhansl 2013-04-25, 22:06
  Sudarshan Kadambi 2013-04-25, 21:44
  lars hofhansl 2013-04-25, 21:54
  Michael Segel 2013-04-25, 22:12
  Viral Bajaria 2013-04-25, 22:28
  Gary Helmling 2013-04-25, 22:35
  James Taylor 2013-04-25, 22:44
  Sudarshan Kadambi 2013-04-25, 22:36
  Michael Segel 2013-04-26, 02:43
  James Taylor 2013-04-25, 23:00
  Sudarshan Kadambi 2013-04-25, 23:19

Re: Coprocessors
Our performance engineer, Mujtaba Chohan, has agreed to put together a
benchmark for you. We only have a four-node cluster of pretty average
boxes, but it should give you an idea.

There's no performance impact from attrib_id not being part of the PK,
since you're not filtering on it (if I understand things correctly).

A few more questions for you:
- How many rows should we use? 1B?
- How many rows would be filtered by object_id and field_type?
- Any particular key distribution or is random fine?
- What's the minimum key size we should use for object_id and
field_type? 2 bytes each?
- Any particular kind of aggregation? count(attrib1)? sum(attrib1)? A
sample query would be helpful.
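
For illustration, a query of roughly this shape is what the questions above are getting at (object_id, field_type, and attrib1 are names from this thread; the table name and literal values below are placeholders, not anything specified in the original messages):

      SELECT object_id, SUM(attrib1)        -- or COUNT(attrib1)
      FROM t1                               -- placeholder table name
      WHERE object_id IN (101, 102, 103)    -- the set of keys to aggregate over
        AND field_type = 'f1'               -- placeholder filter value
      GROUP BY object_id;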

Since you're upgrading, use the latest on the 0.94 branch, 0.94.7.

Thanks,

James

On 04/25/2013 04:19 PM, Sudarshan Kadambi (BLOOMBERG/ 731 LEXIN) wrote:
> James: First of all, this looks quite promising.
>
> The table schema outlined in your other message is correct except that attrib_id will not be in the primary key. Will that be a problem with respect to the skip-scan filter's performance? (it doesn't seem like it...)
>
> Could you share any sort of benchmark numbers? I want to try this out right away, but I have to wait for my cluster administrator to upgrade us from HBase 0.92 first!
>
> ----- Original Message -----
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
> At: Apr 25 2013 18:45:14
>
> On 04/25/2013 03:35 PM, Gary Helmling wrote:
>>> I'm looking to write a service that runs alongside the region servers and
>>> acts as a proxy between my application and the region servers.
>>>
>>> I plan to use the logic in the HBase client's HConnectionManager to segment
>>> my request of 1M rowkeys into sub-requests per region server. These are
>>> sent over to the proxy, which fetches the data from the region server,
>>> aggregates locally, and sends the data back. Does this sound reasonable or even
>>> a useful thing to pursue?
>>>
>>>
>> This is essentially what coprocessor endpoints (called through
>> HTable.coprocessorExec()) do.  (One difference is that there is a
>> parallel request per region, not per region server, though that is a
>> potential optimization that could be made as well).
>>
>> The tricky part I see for the case you describe is splitting your full set
>> of row keys up correctly per region.  You could send the full set of row
>> keys to each endpoint invocation, and have the endpoint implementation
>> filter down to only those keys present in the current region.  But that
>> would be a lot of overhead on the request side.  You could split the row
>> keys into per-region sets on the client side, but I'm not sure we provide
>> sufficient context for the Batch.Call instance you provide to
>> coprocessorExec() to determine which region it is being invoked against.
> Sudarshan,
> In our head branch of Phoenix (we're targeting this for a 1.2 release in
> two weeks), we've implemented a skip scan filter that functions similarly
> to a batched get, except:
> 1) it's more flexible in that it can jump not only from a single key to
> another single key, but also from range to range
> 2) it's faster, about 3-4x.
> 3) you can use it in combination with aggregation, since it's a filter
>
> The scan is chunked up by region and only the keys in each region are
> sent, along the lines you and Gary have described. Then the results
> are merged together by the client automatically.
>
> How would you decompose your row key into columns? Is there a time
> component? Let me walk you through an example where you might have a
> LONG id value plus perhaps a timestamp (it would work equally well if you only
> had a single column in your PK). If you provide a bit more info on your
> use case, I can tailor it more exactly.
>
> Create a schema:
>       CREATE TABLE t (key BIGINT NOT NULL, ts DATE NOT NULL, data VARCHAR
> CONSTRAINT pk PRIMARY KEY (key, ts));
>
> Populate your data using our UPSERT statement.
>
> Aggregate over a set of keys like this:
>
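As a rough sketch of what such statements could look like against the schema above (the dates, literal values, and the choice of COUNT here are illustrative placeholders, not taken from the original message):

      -- Illustrative only: load a couple of rows with UPSERT.
      -- TO_DATE is used here for readability; adjust to however you load DATE values.
      UPSERT INTO t (key, ts, data) VALUES (1, TO_DATE('2013-04-25 00:00:00'), 'a');
      UPSERT INTO t (key, ts, data) VALUES (2, TO_DATE('2013-04-25 00:00:00'), 'b');

      -- Aggregate over a set of keys; the IN list on the leading PK column
      -- (optionally combined with a range on ts) is the kind of predicate the
      -- skip scan filter turns into key-to-key and range-to-range jumps.
      SELECT key, COUNT(data)
      FROM t
      WHERE key IN (1, 2, 3)
        AND ts >= TO_DATE('2013-04-01 00:00:00')
      GROUP BY key;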

Thread continues:
  James Taylor 2013-05-02, 00:01