HBase >> mail # user >> Copy some records from Huge hbase table to another table


Re: Copy some records from Huge hbase table to another table
Hi Riyaz,
You can do this with a single SQL command using Apache Phoenix, a SQL
engine on top of HBase, and you'll get better performance than if you hand
coded it using the HBase client APIs. Depending on your current schema, you
may be able to run this command with no change to your data. Let's assume
you have an MD5 hash of the website and the date/time in the row key with
your website and counts in key values. That schema could be modeled like
this in Phoenix:

CREATE VIEW WEBSITE_STATS (
    WEBSITE_MD5 BINARY(16) NOT NULL,
    DATE_COLLECTED UNSIGNED_DATE NOT NULL,
    WEBSITE_URL VARCHAR,
    HIT_COUNT UNSIGNED_LONG,
    CONSTRAINT PK PRIMARY KEY (WEBSITE_MD5, DATE_COLLECTED));

You could issue this create view statement and map directly to your HBase
table. I used the UNSIGNED types above as they match the serialization you
get when you use the HBase Bytes utility methods. Phoenix normalizes column
names by upper casing them, so if your column qualifiers are lower case,
you'd want to put the column names above in double quotes.
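As a concrete illustration of that serialization, here is a minimal Python sketch of the row-key layout the view above assumes: a 16-byte MD5 digest of the URL followed by an 8-byte big-endian epoch-millis timestamp, which is the byte layout HBase's Bytes.toBytes(long) produces and Phoenix's UNSIGNED_DATE expects. The helper name make_row_key is hypothetical, not part of Phoenix or HBase:

```python
import hashlib
import struct

def make_row_key(website_url: str, epoch_millis: int) -> bytes:
    """Build a row key matching the WEBSITE_STATS view's primary key:
    16-byte MD5 of the URL, then an 8-byte big-endian timestamp
    (the same bytes HBase's Bytes.toBytes(long) would write)."""
    md5 = hashlib.md5(website_url.encode("utf-8")).digest()  # 16 bytes
    ts = struct.pack(">q", epoch_millis)                     # 8 bytes, big-endian
    return md5 + ts

key = make_row_key("http://example.com", 1400889600000)
print(len(key))  # 24 = 16 (MD5) + 8 (timestamp)
```

If your writers serialized the key differently (e.g., a different hash width or little-endian times), the Phoenix view's column types would need to change to match.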

Next, you'd create a table to hold the top-10 information:

CREATE TABLE WEBSITE_TOP10 (
    WEBSITE_URL VARCHAR PRIMARY KEY,
    TOTAL_HIT_COUNT BIGINT);

Then you'd run an UPSERT SELECT command like this:

UPSERT INTO WEBSITE_TOP10
SELECT WEBSITE_URL, SUM(HIT_COUNT) FROM WEBSITE_STATS
GROUP BY WEBSITE_URL
ORDER BY SUM(HIT_COUNT) DESC
LIMIT 10;

Phoenix will run the SELECT part of this in parallel on the client and use
a coprocessor on the server side to aggregate over the WEBSITE_URLs,
returning the distinct set of URLs per region, with a final merge sort
happening on the client to compute the total sums. The client then holds
on to the top 10 rows it sees and upserts them into the WEBSITE_TOP10 table.
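The client-side merge described above can be sketched in Python; the per-region lists below are hypothetical stand-ins for the partial (url, count) aggregates each region's coprocessor would return:

```python
from collections import Counter
from itertools import chain

# Hypothetical partial aggregates returned per region.
region_a = [("a.com", 10), ("b.com", 3)]
region_b = [("a.com", 5), ("c.com", 7)]
region_c = [("b.com", 4), ("c.com", 1), ("d.com", 2)]

# Client-side merge: sum the partial counts per URL across regions,
# then keep only the 10 URLs with the highest totals.
totals = Counter()
for url, count in chain(region_a, region_b, region_c):
    totals[url] += count

top10 = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top10)  # [('a.com', 15), ('c.com', 8), ('b.com', 7), ('d.com', 2)]
```

This is only a model of the dataflow; in practice Phoenix handles the region scans, partial aggregation, and merge for you when it executes the UPSERT SELECT.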

HTH,
James
On Fri, May 23, 2014 at 5:14 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
 