HBase >> mail # user >> Coprocessors


Re: Coprocessors
Sudarshan,
Below are the results that Mujtaba put together. He set up two
versions of your schema: one with the ATTRIBID as part of the row key
and one with it as a key value. He also benchmarked the query time both
when all of the data was in the cache and when all of the data was
read off of disk.

Let us know if you have any questions/follow up.

Thanks,

James (& Mujtaba)

          Compute Average over 250K random rows in 1B row table

                                  ATTRIBID in row key
                      Data from HBase cache       Data loaded from disk
Phoenix Skip Scan          1.4 sec                     31 sec
HBase Batched Gets         3.8 sec                     58 sec
HBase Range Scan            -                          10+ min

                                  ATTRIBID as key value
                      Data from HBase cache       Data loaded from disk
Phoenix Skip Scan          1.7 sec                     37 sec
HBase Batched Gets         4.0 sec                     82 sec
HBase Range Scan            -                          10+ min
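
To put the two tables in relative terms, here is a minimal Python sketch that divides the HBase Batched Gets time by the Phoenix Skip Scan time for each scenario. The timings are copied from the tables above; the names `TIMINGS` and `speedups` are illustrative only. The Range Scan rows are left out because "10+ min" is only a lower bound.

```python
# Timings in seconds, copied from the benchmark tables above.
TIMINGS = {
    ("in row key", "cache"): {"skip_scan": 1.4, "batched_gets": 3.8},
    ("in row key", "disk"): {"skip_scan": 31.0, "batched_gets": 58.0},
    ("as key value", "cache"): {"skip_scan": 1.7, "batched_gets": 4.0},
    ("as key value", "disk"): {"skip_scan": 37.0, "batched_gets": 82.0},
}


def speedups(timings):
    """Return batched-gets time divided by skip-scan time, per scenario."""
    return {
        scenario: round(t["batched_gets"] / t["skip_scan"], 1)
        for scenario, t in timings.items()
    }


if __name__ == "__main__":
    for (schema, source), ratio in speedups(TIMINGS).items():
        print(f"ATTRIBID {schema}, data from {source}: {ratio}x")
```

On these numbers the skip scan comes out roughly 1.9x to 2.7x faster than batched gets, with the larger ratios in the cached runs.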

Details
-------
HBase 0.94.7, Hadoop 1.0.4
Total number of regions: 30 spread on 4 Region Servers (6 core W3680 Xeon 3.3GHz) with 8GB heap.

Data:
20 FIELDTYPE values, 50M OBJECTIDs for each FIELDTYPE, 10 ATTRIBID values. VAL is a random integer.

Query:
SELECT AVG(VAL) FROM T1
WHERE OBJECTID IN (250K RANDOM OBJECTIDs) AND FIELDTYPE = 'F1' AND ATTRIBID = '1'
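
The "(250K RANDOM OBJECTIDs)" placeholder stands for a concrete list of ids. As a rough sketch of how such a query string might be generated for a test run, the Python below draws distinct random ids and renders the statement. The helper names, the id range of 1..max_id, and the seed are assumptions for illustration, not part of the benchmark setup.

```python
import random


def random_object_ids(count, max_id, seed=None):
    """Draw `count` distinct ids from 1..max_id (the id range is an assumption)."""
    rng = random.Random(seed)
    return rng.sample(range(1, max_id + 1), count)


def build_avg_query(object_ids, table="T1", field_type="F1", attrib_id="1"):
    """Render the benchmark query for a concrete list of OBJECTIDs."""
    if not object_ids:
        raise ValueError("need at least one OBJECTID")
    # OBJECTID is an INTEGER column, so the ids are rendered unquoted.
    id_list = ", ".join(str(int(oid)) for oid in object_ids)
    return (
        f"SELECT AVG(VAL) FROM {table} "
        f"WHERE OBJECTID IN ({id_list}) "
        f"AND FIELDTYPE = '{field_type}' AND ATTRIBID = '{attrib_id}'"
    )


if __name__ == "__main__":
    # A small sample; the benchmark above used 250K ids.
    print(build_avg_query(random_object_ids(5, 1000, seed=42)))
```

Plain string formatting is acceptable here only because the ids are generated locally and coerced to integers; anything taking external input should use bind parameters instead.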

Create table DDL:

1. CREATE TABLE IF NOT EXISTS T1 (
        OBJECTID INTEGER NOT NULL,
        FIELDTYPE CHAR(2) NOT NULL,
        ATTRIBID INTEGER NOT NULL,
        CF.VAL INTEGER
        CONSTRAINT PK PRIMARY KEY (OBJECTID,FIELDTYPE,ATTRIBID))
    COMPRESSION='GZ', BLOCKSIZE='4096'

2. CREATE TABLE IF NOT EXISTS T2 (
        OBJECTID INTEGER NOT NULL,
        FIELDTYPE CHAR(2) NOT NULL,
        CF.ATTRIBID INTEGER,
        CF.VAL INTEGER
        CONSTRAINT PK PRIMARY KEY (OBJECTID,FIELDTYPE))
    COMPRESSION='GZ', BLOCKSIZE='4096'

On 04/25/2013 04:19 PM, Sudarshan Kadambi (BLOOMBERG/ 731 LEXIN) wrote:

> James: First of all, this looks quite promising.
>
> The table schema outlined in your other message is correct except that attrib_id will not be in the primary key. Will that be a problem with respect to the skip-scan filter's performance? (it doesn't seem like it...)
>
> Could you share any sort of benchmark numbers? I want to try this out right away, but I've to wait for my cluster administrator to upgrade us from HBase 0.92 first!
>
> ----- Original Message -----
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
> At: Apr 25 2013 18:45:14
>
> On 04/25/2013 03:35 PM, Gary Helmling wrote:
>>> I'm looking to write a service that runs alongside the region servers and
>>> acts as a proxy between my application and the region servers.
>>>
>>> I plan to use the logic in HBase client's HConnectionManager, to segment
>>> my request of 1M rowkeys into sub-requests per region-server. These are
>>> sent over to the proxy which fetches the data from the region server,
>>> aggregates locally and sends data back. Does this sound reasonable or even
>>> a useful thing to pursue?
>>>
>>>
>> This is essentially what coprocessor endpoints (called through
>> HTable.coprocessorExec()) do.  (One difference is that there is a
>> parallel request per-region, not per-region server, though that is a
>> potential optimization that could be made as well).
>>
>> The tricky part I see for the case you describe is splitting your full set
>> of row keys up correctly per region.  You could send the full set of row
>> keys to each endpoint invocation, and have the endpoint implementation
>> filter down to only those keys present in the current region.  But that
>> would be a lot of overhead on the request side.  You could split the row
>> keys into per-region sets on the client side, but I'm not sure we provide
>> sufficient context for the Batch.Callable instance you provide to
>> coprocessorExec() to determine which region it is being invoked against.
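
The client-side option Gary describes, splitting the row keys into per-region sets, can be sketched in a few lines of Python. This is an illustration of the grouping step only, assuming the sorted region start keys have already been fetched from the cluster; `group_keys_by_region` is a hypothetical helper, not an HBase API. Python compares `bytes` values in unsigned lexicographic order, which matches how HBase orders row keys.

```python
from bisect import bisect_right
from collections import defaultdict


def group_keys_by_region(row_keys, region_start_keys):
    """Map each region's start key to the row keys that fall inside that region.

    A region covers [start_key, next_start_key); the first region starts at
    the empty key and the last region is unbounded above.
    """
    starts = sorted(region_start_keys)
    if not starts or starts[0] != b"":
        raise ValueError("the first region must start at the empty key")
    groups = defaultdict(list)
    for key in row_keys:
        # Index of the last start key that is <= key.
        idx = bisect_right(starts, key) - 1
        groups[starts[idx]].append(key)
    return dict(groups)


if __name__ == "__main__":
    starts = [b"", b"g", b"p"]
    keys = [b"apple", b"grape", b"g", b"zebra", b"pear"]
    for start, members in sorted(group_keys_by_region(keys, starts).items()):
        print(start, members)
```

Each resulting group could then be sent as its own sub-request. Region boundaries can change between the lookup and the call (splits, moves), so a real client would need to handle stale boundaries and retry.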