HBase >> mail # user >> performance of Get from MR Job


Re: performance of Get from MR Job
I think the version issue is the killer factor here.

Usually, a simple get() that fetches only the latest version of the data in the row/cell completes in some constant time k. That time stays constant regardless of the size of the cluster, so throughput should scale near-linearly as the cluster grows.

As JD C points out, if you're storing temporal data, you should make time part of your schema.
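
To make that concrete, here is a minimal sketch of a row key that carries the timestamp after the entity key rather than at the beginning. Everything in it is an assumption for illustration: the class name TimeRowKey, the "entity key + reversed timestamp" layout, and the choice of Long.MAX_VALUE - timestamp so that newer entries sort first. It also assumes entity keys of fixed length (or otherwise delimited) and non-negative timestamps. It uses only the JDK, not the HBase client.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HBase: builds "<entity key><reversed timestamp>" row keys.
final class TimeRowKey {
    private TimeRowKey() {}

    // Entity key first, so all rows for one entity stay adjacent; timestamp last.
    // Storing Long.MAX_VALUE - ts (big-endian) makes newer entries sort first
    // under unsigned lexicographic byte ordering, for non-negative timestamps.
    static byte[] build(String entityKey, long timestampMillis) {
        byte[] prefix = entityKey.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(prefix.length + Long.BYTES)
                .put(prefix)
                .putLong(Long.MAX_VALUE - timestampMillis)
                .array();
    }

    // Recover the original timestamp from the last 8 bytes of the key.
    static long timestampOf(byte[] rowKey) {
        long reversed = ByteBuffer
                .wrap(rowKey, rowKey.length - Long.BYTES, Long.BYTES)
                .getLong();
        return Long.MAX_VALUE - reversed;
    }

    // Unsigned lexicographic comparison of two byte arrays, i.e. the
    // byte-wise ordering in which row keys are sorted.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }
}
```

With keys like these, "the latest entry at or before time T" for one entity becomes a short scan starting at build(entity, T), and each row only needs a single version.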

On Jun 20, 2012, at 12:36 PM, Jean-Daniel Cryans wrote:

> Yeah I've overlooked the versions issue.
>
> What I usually recommend is that if the timestamp is part of your data
> model, it should be in the row key, a qualifier or a value. Since you
> seem to rely on the timestamp for querying, it should definitely be
> part of the row key but not at the beginning like you proposed. See
> http://hbase.apache.org/book.html#rowkey.design
>
> J-D
>
> On Tue, Jun 19, 2012 at 11:35 PM, Marcin Cylke <[EMAIL PROTECTED]> wrote:
>> On 19/06/12 19:31, Jean-Daniel Cryans wrote:
>>> This is a common but hard problem. I do not have a good answer.
>>
>> Thanks for Your writeup. You've given a few suggestions that I will
>> surely follow.
>>
>> But what is bothering me, is my use of timestamps. As mentioned before,
>> my column family has 2147483646 versions allowed. I store data there
>> using those timestamps - a few rows with the same key but different
>> timestamp. Preparing GETs with timestamp, for TimeRange {0, Timestamp}
>> my performance is sloppy (~130/sec). But using something like
>> {timestamp-10000, timestamp} results in a great speed improvement (~400/sec).
>>
>> Despite the {timestamp-10000, timestamp} being unrealistic in my
>> situation, the whole issue seems strange, and thus related in some way
>> to the use of timestamps.
>>
>> Would You recommend trying composite keys, built from timestamp + my
>> current key? Or would this not change that much?
>>
>>
>>> Finally kind of like Paul said, if you can emit your rows and somehow
>>> batch them reducer-side in order to either do short scans or multi-get
>>> (see HTable.get(List<Get>)) it could be faster.
>>
>> I'll try this solution, but I'm not that optimistic about it. I'll let
>> You know whether this helped or not.
>>
>> Regards
>> Marcin
>>
>
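
The reducer-side batching suggested in the quoted text (collect keys, then issue one multi-get per group via HTable.get(List<Get>)) could be sketched roughly as follows. This is only the grouping step, written against the plain JDK: the class name GetBatcher and the batchSize parameter are hypothetical, and a good batch size would have to be found by measurement.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HBase: groups keys into fixed-size batches
// so that each batch can be sent as one multi-get instead of many single gets.
final class GetBatcher {
    private GetBatcher() {}

    // Splits keys into consecutive batches of at most batchSize elements,
    // preserving the input order. The last batch may be smaller.
    static <T> List<List<T>> batches(List<T> keys, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> out = new ArrayList<>();
        for (int start = 0; start < keys.size(); start += batchSize) {
            int end = Math.min(start + batchSize, keys.size());
            // Copy the view so each batch is independent of the input list.
            out.add(new ArrayList<>(keys.subList(start, end)));
        }
        return out;
    }
}
```

Each batch would then be turned into a List<Get> and passed to HTable.get(List<Get>) in a single call. Since keys reach a reducer in sorted order, consecutive batches tend to cover neighboring rows.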