Re: [ hbase ] performance of Get from MR Job
I'm not sure as to what you are attempting to do with your data.

There are a couple of things to look at.

Looking at the issue, you have a (K,V) pair. That's Key, Value.
But the value isn't necessarily a single element; it could be a set of elements.

You have to consider whether to store versions of a cell, using the timestamp to indicate the revision of the data (there are some design issues with this concept), or to incorporate the timestamp in your column name.

While a get() is really a scan() that returns one row, it should still be faster than what you are experiencing.
Schema design is a bit tricky to master because it's going to depend on your data as well as your use case.
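Roughly, the two options look like this. This is only a sketch against the 0.94-era client API; the row, family, and qualifier names here are made up:

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

byte[] row = Bytes.toBytes("sensor-42");  // made-up row key
byte[] cf  = Bytes.toBytes("d");          // made-up column family
long ts    = System.currentTimeMillis();

// Option 1: one qualifier, HBase keeps revisions as cell versions.
Put versioned = new Put(row);
versioned.add(cf, Bytes.toBytes("reading"), ts, Bytes.toBytes(42L));

// Option 2: timestamp in the column name, so each revision is its own column.
Put byQualifier = new Put(row);
byQualifier.add(cf, Bytes.toBytes("reading:" + ts), Bytes.toBytes(42L));
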
On Jun 25, 2012, at 2:32 AM, Marcin Cylke wrote:

> On 21/06/12 14:33, Michael Segel wrote:
>> I think the version issue is the killer factor here.
>> Usually performing a simple get(), where you are getting the latest version
>> of the data on the row/cell, occurs in some constant time k. This is constant
>> regardless of the size of the cluster, so throughput should scale near-linearly.
>>
>> As JD C points out, if you're storing temporal data, you should make time part of your schema.
>
> I've rewritten my job to load the data without setting individual timestamps
> for columns, instead adding the timestamp to the rowkey. Now it looks like this:
>
> [previous key][Long.MAX_VALUE-timestamp]
> (without the brackets)
>
> My keys look like this now:
>
> 488892772259223372035596613844
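For what it's worth, a key like that can be built along these lines. This is an illustrative sketch, not the actual job code (Bytes is org.apache.hadoop.hbase.util.Bytes):

// Composite key [previous key][Long.MAX_VALUE - timestamp]. For epoch-millis
// timestamps the subtraction always yields 19 digits, so plain concatenation
// keeps lexicographic order and newer rows sort first within the same prefix.
String previousKey = "488892772259";   // the prefix from the example above
long ts = System.currentTimeMillis();
String rowKey = previousKey + (Long.MAX_VALUE - ts);
byte[] rowKeyBytes = Bytes.toBytes(rowKey);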
>
> and I'm issuing a scan like this:
>
> Scan scan = new Scan(Bytes.toBytes("488892772259"));
> scan.setMaxVersions(1);
>
> So I'm searching for my key without the timestamp part appended. What I'm
> getting back is all the rows that start with "488892772259".
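Note that a Scan given only a start row runs to the end of the table unless you bound it; a stop row keeps it to the prefix. A minimal sketch, again against the 0.94-era API:

import java.util.Arrays;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

byte[] prefix = Bytes.toBytes("488892772259");
Scan scan = new Scan(prefix);   // start row = the prefix
scan.setMaxVersions(1);

// Stop row: the prefix with its last byte incremented ('9' becomes ':'),
// so the scan ends right after the prefixed rows. Fine for digit
// prefixes; a trailing 0xFF byte would need more care.
byte[] stopRow = Arrays.copyOf(prefix, prefix.length);
stopRow[stopRow.length - 1]++;
scan.setStopRow(stopRow);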
>
> Now the performance is even worse than before (with versioned data).
>
> What I'm also observing is how huge my tables get and the influence of
> compression on performance:
>
> My initial data - stored in a Hive table - is ~1.5GB. When I load it into
> HBase it takes ~8GB. Compressing my ColumnFamily with LZO gets the size
> down to ~1.5GB, but it also dramatically reduces performance.
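For reference, compression is a per-column-family setting. A sketch with the 0.94-era admin API, using an illustrative family name; LZO trades CPU on every read for the on-disk savings, which can show up in scan throughput:

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.io.hfile.Compression;

// Data is compressed in the HFiles on disk and decompressed on each read.
HColumnDescriptor cf = new HColumnDescriptor("cf");
cf.setCompressionType(Compression.Algorithm.LZO);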
>
> To sum up, here are the rough execution times and request rates I've been
> observing (for each option I list GET/SCAN throughput and rough execution
> time):
>
> - versioned data (uncompressed table)
>    - with misses (asking for non-existent key) - ~400 gets/sec - ~1h
>    - with hits (asking for existing keys) - ~150 gets/sec - ~20h
> - single version (with complex key)
>    - uncompressed - ~30 scans/sec - ~25h
>    - compressed with LZO - ~15 scans/sec - ~30h
>
> If necessary, I could provide complete data, including the time
> distribution of the number of gets/scans.
>
> These performance issues are very strange to me - do you have any
> suggestions as to what's causing such a big increase in execution time?
>
> Regards
> Marcin
>