HBase, mail # user - performance of Get from MR Job


Re: performance of Get from MR Job
Michael Segel 2012-06-21, 12:33
I think the version issue is the killer factor here.

Usually a simple get() that fetches the latest version of the data on a row/cell completes in roughly constant time k. That cost is independent of cluster size, so throughput should scale near-linearly as you add nodes.

As JD C points out, if you're storing temporal data, you should make time part of your schema.
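
A minimal sketch of what "make time part of your schema" can look like, using only the JDK; the helper name and key layout are my own, not from this thread. The entity id goes first and the big-endian timestamp is appended, so all rows for one entity sort chronologically:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class RowKeys {
    // Composite row key: fixed entity-id bytes followed by the big-endian
    // timestamp, so rows for one entity sort in time order.
    static byte[] compose(String entityId, long timestamp) {
        byte[] id = entityId.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(id.length + Long.BYTES)
                .put(id)
                .putLong(timestamp)
                .array();
    }
}
```

With this layout a time-range query for one entity becomes a short scan between compose(id, start) and compose(id, end), instead of a versioned get.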

On Jun 20, 2012, at 12:36 PM, Jean-Daniel Cryans wrote:

> Yeah I've overlooked the versions issue.
>
> What I usually recommend is that if the timestamp is part of your data
> model, it should be in the row key, a qualifier or a value. Since you
> seem to rely on the timestamp for querying, it should definitely be
> part of the row key but not at the beginning like you proposed. See
> http://hbase.apache.org/book.html#rowkey.design
>
> J-D
>
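
One pattern from the rowkey-design chapter J-D links above is to place the timestamp after a fixed key prefix, often reversed as Long.MAX_VALUE - timestamp so the newest rows sort first. A stdlib-only sketch (the method name is illustrative, not an HBase API):

```java
import java.nio.ByteBuffer;

public class ReverseTs {
    // Newest-first ordering: larger timestamps produce lexicographically
    // smaller key suffixes, so a scan from the prefix hits recent rows first.
    static byte[] suffix(long timestamp) {
        return ByteBuffer.allocate(Long.BYTES)
                .putLong(Long.MAX_VALUE - timestamp)
                .array();
    }
}
```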
> On Tue, Jun 19, 2012 at 11:35 PM, Marcin Cylke <[EMAIL PROTECTED]> wrote:
>> On 19/06/12 19:31, Jean-Daniel Cryans wrote:
>>> This is a common but hard problem. I do not have a good answer.
>>
>> Thanks for your write-up. You've given a few suggestions that I will
>> surely follow.
>>
>> But what is bothering me is my use of timestamps. As mentioned before,
>> my column family allows 2147483646 versions. I store data there
>> using those timestamps - several rows with the same key but different
>> timestamps. Preparing GETs with a TimeRange of {0, timestamp},
>> my performance is sluggish (~130/sec). But using something like
>> {timestamp-10000, timestamp} results in a great speed improvement (~400/sec).
>>
>> Despite {timestamp-10000, timestamp} being unrealistic in my
>> situation, the whole issue seems strange, and thus related in some way
>> to the use of timestamps.
>>
>> Would you recommend trying composite keys - built from timestamp + my
>> current key? Or shouldn't this change that much?
>>
>>
>>> Finally kind of like Paul said, if you can emit your rows and somehow
>>> batch them reducer-side in order to either do short scans or multi-get
>>> (see HTable.get(List<Get>)) it could be faster.
>>
>> I'll try this solution, but I'm not that optimistic about it. I'll let
>> you know whether it helped or not.
>>
>> Regards
>> Marcin
>>
>
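
The reducer-side batching idea from the thread - collecting row keys and issuing one multi-get per batch via HTable.get(List&lt;Get&gt;) - needs a live cluster to demonstrate, but the grouping step can be sketched in plain Java (method and class names are mine):

```java
import java.util.ArrayList;
import java.util.List;

public class Batching {
    // Group row keys into fixed-size batches so the reducer issues one
    // multi-get RPC per batch instead of one RPC per key.
    static List<List<String>> batches(List<String> keys, int size) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < keys.size(); i += size) {
            out.add(new ArrayList<>(
                    keys.subList(i, Math.min(i + size, keys.size()))));
        }
        return out;
    }
}
```

Each batch would then be mapped to a List of Get objects and handed to the table in one call, amortizing the per-RPC overhead across the batch.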