Re: Row Key Design in time based application
How does this avoid hogging memory on the region server when multiple queries
with GROUP BY are executed, given that the aggregation runs inside the HBase
JVM? I know that HBase does not behave well when the heap is set beyond 12G,
and with compactions happening concurrently with queries that creates memory
competition. Is there any way to control this in Phoenix? Is it scalable in
terms of the number of concurrent queries running in a cluster?

On Monday, February 18, 2013, James Taylor wrote:

> Michael is right - Phoenix wouldn't automatically solve these issues for
> you - it would just a) decrease the amount of code you need to write while
> still giving you coprocessor-speed performance, and b) give you an industry
> standard API to read/write your data.
>
> However, since the date is not the leading part of the key, it wouldn't be
> a problem for it to be monotonically increasing. If project_id is, then you
> could reverse the bytes on the way in and on the way out to prevent hot
> spotting on writes (basically taking the same approach as when you'd use
> the HBase native APIs). If you wanted to do it in SQL, you could add your
> own built-in function to Phoenix. I'll blog about how to do this soon.
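>
> For illustration, a minimal byte-reversal helper in plain Java might look
> like the sketch below (reverseBytes is a hypothetical name, not a Phoenix
> API; it assumes "reversing" means swapping the byte order of the encoded
> project_id). You'd apply it both on the way in and on the way out:
>
> public static byte[] reverseBytes(byte[] key) {
>     // Put the fastest-changing (least significant) byte first so that
>     // sequentially assigned ids spread across regions instead of all
>     // landing on the last one. The transform is its own inverse.
>     byte[] out = new byte[key.length];
>     for (int i = 0; i < key.length; i++) {
>         out[i] = key[key.length - 1 - i];
>     }
>     return out;
> }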
>
> James
> http://phoenix-hbase.blogspot.com/
>
> On 02/17/2013 03:18 PM, Michael Segel wrote:
>
>> I'm not sure how a SQL interface above HBase will solve some of the
>> issues with regional hot spotting when using time as the key. Or the
>> problem with always adding data to the right of the last row.
>>
>> The same would apply with the project id, assuming that it too is a
>> number that grows incrementally with each project.
>> On Feb 17, 2013, at 4:50 PM, James Taylor <[EMAIL PROTECTED]> wrote:
>>
>>> Hello,
>>>
>>> Have you considered using Phoenix (https://github.com/forcedotcom/phoenix)
>>> for this use case? Phoenix is a SQL layer on top of HBase. For this use case, you'd
>>> connect to your cluster like this:
>>>
>>> Class.forName("com.salesforce.phoenix.jdbc.PhoenixDriver"); // register driver
>>> Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost"); // connect to local HBase
>>>
>>> Create a table like this (adding additional columns that you want to
>>> measure, like txn_count below):
>>>
>>> conn.createStatement().execute(
>>>     "CREATE TABLE event_log (\n" +
>>>     "    project_id INTEGER NOT NULL,\n" +
>>>     "    time DATE NOT NULL,\n" +
>>>     "    txn_count LONG,\n" +
>>>     "    CONSTRAINT pk PRIMARY KEY (project_id, time))");
>>>
>>> Then to insert data you'd do this:
>>>
>>> PreparedStatement preparedStmt = conn.prepareStatement(
>>>     "UPSERT INTO event_log VALUES(?,?,0)");
>>>
>>> and you'd bind the values in JDBC like this:
>>>
>>> preparedStmt.setInt(1, projectId);
>>> preparedStmt.setDate(2, time);
>>> preparedStmt.execute();
>>>
>>> conn.commit(); // If upserting many values, you'd want to commit after
>>>                // upserting maybe 1000-10000 rows
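>>>
>>> For example, a simple batching loop might look like the sketch below
>>> (Event, events, getProjectId() and getTime() are hypothetical placeholders
>>> for however you hold the incoming rows):
>>>
>>> int batchSize = 1000;
>>> int count = 0;
>>> for (Event e : events) {
>>>     preparedStmt.setInt(1, e.getProjectId());
>>>     preparedStmt.setDate(2, e.getTime());
>>>     preparedStmt.execute();  // buffered until the next commit
>>>     if (++count % batchSize == 0) {
>>>         conn.commit();       // flush a batch of upserts to HBase
>>>     }
>>> }
>>> conn.commit();               // commit any remaining rows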
>>>
>>> Then at query time, assuming you want to report on this data by grouping
>>> it into different "time buckets", you could do as shown below. Phoenix
>>> stores your date values at millisecond granularity, and you can decide at
>>> query time how you'd like to roll it up:
>>>
>>> // Query with time bucket at the hour granularity
>>> conn.createStatement().execute(
>>>    "SELECT\n" +
>>>    "    project_id, TRUNC(time,'HOUR') AS time_bucket,\n" +
>>>    "    MIN(txn_count), MAX(txn_count), AVG(txn_count) FROM event_log\n" +
>>>    "GROUP BY project_id, TRUNC(time,'HOUR')");
>>>
>>> // Query with time bucket at the day granularity
>>> conn.createStatement().execute(
>>>     "SELECT\n" +
>>>     "    project_id, TRUNC(time,'DAY') AS time_bucket,\n" +
>>>     "    MIN(txn_count), MAX(txn_count), AVG(txn_count) FROM event_log\n" +
>>>     "GROUP BY project_id, TRUNC(time,'DAY')");
>>>
>>> You could, of course, include a WHERE clause in the query to filter based