Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
HBase, mail # user - Rowkey design and presplit table


Copy link to this message
-
Re: Rowkey design and presplit table
James Taylor 2013-03-07, 08:42
Another possible solution for you: use Phoenix:
https://github.com/forcedotcom/phoenix

Phoenix would allow you to model your scenario using SQL through JDBC,
like this:

Connection conn = DriverManager.connect("jdbc:phoenix:<your zookeeper
quorum>");
Statement stmt = conn.createStatement(
     "CREATE TABLE article_discussion (" +
     "    category_id BIGINT not null," +
     "    article_id BIGINT not null," +
     "    comment_id BIGINT not null
     "    CONSTRAINT pk PRIMARY KEY (category_id, article_id,
comment_id))");
stmt.execute();

You could, of course, add other non primary key columns to the above.  
You could also pre-split the table and/or pass through any HBase
configuration parameters necessary.

Then, you populate your table like this:

PrepareStatement stmt = conn.prepareStatement(
     "UPSERT INTO article_discussion VALUES(?,?,?)");
stmt.setLong("category_id", categoryId);
stmt.setLong("article_id", articleId);
stmt.setLong("comment_id", commentId);

And to query for all comments in a given category, you'd do this:

PrepareStatement stmt1 = conn.prepareStatement(
     "SELECT * FROM article_discussion WHERE category_id = ?");
stmt1.setLong("category_id", categoryId);
ResultSet rs = stmt1.executeQuery();
while (rs.next()) {
     System.out.println(
         "category_id=" + rs.getLong("category_id") +
         ",article_id=" + rs.getLong("article_id") +
         ",comment_id=" + rs.getLong("comment_id"));
}
and for all comments for a given category and article:

PrepareStatement stmt1 = conn.prepareStatement(
     "SELECT * FROM article_discussion WHERE category_id = ? AND
article_id = ?");
stmt1.setLong("category_id", categoryId);
stmt1.setLong("article_id", articleId);

you could use arbitrary where clauses like this:

SELECT * FROM article_discussion
WHERE (article_name LIKE '% baseball %' OR article_name LIKE '% food %')
AND category_id IN (?, ?, ?)
AND published_date >= to_date('2013-01-01 00:00:00')
AND published_date <= to_date('2013-01-31 00:00:00')

or do aggregation like this:

SELECT category_id, count(article_id), count(comment_id)
FROM article_discussion
WHERE published_date > ?
GROUP BY category_id
HAVING count(comment_id) > 100

Regards,

     James

On 03/06/2013 11:42 PM, Asaf Mesika wrote:
> I would convert each id to long and then use Bytes.toBytes to convert this
> long to a byte array. If it is an int then even better.
> Now, write all 3 longs one after another to one array which will be your
> rowkey.
> This gives you:
> * fixed size
> * small row key - 3*8 bytes if you use long and 3*4 for int.
>
> Why do you need to use prefix split policy?
>
> On Monday, March 4, 2013, Lukáš Drbal wrote:
>
>> Hi,
>>
>> i have one question about rowkey design and presplit table.
>>
>> My usecase:
>> I need store a lot of comments where each comment are for one article and
>> this article has one category.
>>
>> What i need:
>> 1) read one comment by id (where i know commentId, articleId and
>> categoryId)
>> 2) read all coments for article (i know categoryId and articleId)
>> 3) read all comments for category (i know categoryId)
>>
>>  From this read pattern i see one good rowkey:
>> <categoryId>_<articleId>_<commentId>
>>
>> But here i don't have fixed size of rowkey, so i don't know how to define
>> split pattern. How can be this solved?
>> This id's come from external system and grow very fast, so add some like
>> "padding" for each part are hard.
>>
>> Maybe i can use hash function for each part
>> md5(<categoryId>_md5(<articleId>)_md5(<commentId>), but this rowkey is very
>> long (3*32+2 bytes), i don't have experience with this long rowkeys.
>>
>> Can someone give me a suggestions please?
>>
>> Regards
>>
>> Lukas Drbal
>>