Re: [ hbase ] performance of Get from MR Job
Michael Segel 2012-06-27, 15:48
I'm not sure what you are attempting to do with your data.

There are a couple of things to look at.

Looking at the issue, you have a (K,V) pair. That's Key, Value.
But the value isn't necessarily a single element; it could be a set of elements.

You have to consider whether to store versions of a cell, using the timestamp to indicate the revision of the data (there are some design issues with this concept), or to incorporate the timestamp in your column name.
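A minimal sketch of the second option, assuming the 0.92/0.94-era Java client API; the table, family, qualifier, and key names are hypothetical:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TimestampedQualifier {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");        // hypothetical table
        long ts = System.currentTimeMillis();
        Put put = new Put(Bytes.toBytes("myRowKey"));      // hypothetical row key
        // Embed the timestamp in the column qualifier instead of
        // relying on multiple cell versions under one qualifier.
        put.add(Bytes.toBytes("cf"),                       // hypothetical family
                Bytes.toBytes("reading_" + ts),            // qualifier carries the timestamp
                Bytes.toBytes("someValue"));
        table.put(put);
        table.close();
      }
    }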

While a get() is really a scan() that returns one row, it should be faster than what you are experiencing.
Schema design is a bit tricky to master because it's going to be data-dependent along with your use case.
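To make the get()-as-scan() point concrete, a minimal sketch of the two side by side, assuming the same era's client API (table and key names hypothetical):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class GetVsScan {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");       // hypothetical table
        byte[] row = Bytes.toBytes("myRowKey");           // hypothetical key
        Result viaGet = table.get(new Get(row));
        // Roughly the same work expressed as a scan bounded to one row:
        // start at the row, stop at the key immediately after it.
        Scan scan = new Scan(row, Bytes.add(row, new byte[] { 0 }));
        ResultScanner scanner = table.getScanner(scan);
        Result viaScan = scanner.next();
        scanner.close();
        table.close();
      }
    }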
On Jun 25, 2012, at 2:32 AM, Marcin Cylke wrote:

> On 21/06/12 14:33, Michael Segel wrote:
>> I think the version issue is the killer factor here.
>> Usually performing a simple get(), where you are getting the latest version of the data on the row/cell, occurs in some constant time k. This is constant regardless of the size of the cluster and should scale in a near-linear curve.
>>
>> As JD C points out, if you're storing temporal data, you should make time part of your schema.
>
> I've rewritten my job to load data without filling in individual timestamps
> for columns, instead adding the timestamp to the rowkey. Now it looks like this:
>
> [previous key][Long.MAX_VALUE-timestamp]
> (without the brackets)
>
> My keys look like this now:
>
> 488892772259223372035596613844
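For reference, a minimal sketch of one way to build such a composite key, assuming the reversed timestamp is appended as a fixed-width decimal string (the prefix below is just the example key above):

    public class ReversedTimestampKey {
      public static void main(String[] args) {
        long ts = System.currentTimeMillis();
        // Long.MAX_VALUE has 19 decimal digits; zero-pad to 19 so the
        // keys sort lexicographically with the newest timestamp first.
        String rowKey = "488892772259" + String.format("%019d", Long.MAX_VALUE - ts);
        System.out.println(rowKey);
      }
    }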
>
> and I'm issuing a scan like this:
>
> Scan scan = new Scan(Bytes.toBytes("488892772259"));
> scan.setMaxVersions(1);
>
> So I'm searching for my key without the timestamp part added. What I'm
> getting back is all the rows that start with "488892772259".
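One way to keep such a prefix scan from running past the matching rows is an explicit stop row; a minimal sketch, assuming the same era's API (incrementing the last byte is safe here because the prefix ends in an ASCII digit; the table name is hypothetical):

    import java.io.IOException;
    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BoundedPrefixScan {
      public static void main(String[] args) throws IOException {
        byte[] prefix = Bytes.toBytes("488892772259");
        byte[] stop = Arrays.copyOf(prefix, prefix.length);
        stop[stop.length - 1]++;              // first key after the prefix range
        Scan scan = new Scan(prefix, stop);
        scan.setMaxVersions(1);
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");   // hypothetical table
        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
          // process r ...
        }
        scanner.close();
        table.close();
      }
    }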
>
> Now the performance is even worse than before (with versioned data).
>
> What I'm also observing is the huge size of my tables and the influence of
> compression on performance:
>
> My initial data, stored in a Hive table, is ~1.5GB. When I load it into
> HBase it takes ~8GB. Compressing my ColumnFamily with LZO gets the size
> down to ~1.5GB, but it also dramatically reduces performance.
>
> To sum up, here are the rough execution times and request rates that
> I've been observing (for each option I've listed GET/SCAN throughput and
> rough execution time):
>
> - versioned data (uncompressed table)
>    - with misses (asking for non-existent key) - ~400 gets/sec - ~1h
>    - with hits (asking for existing keys) - ~150 gets/sec - ~20h
> - single version (with complex key)
>    - uncompressed - ~30 scans/sec - ~25h
>    - compressed with LZO - ~15 scans/sec - ~30h
>
> If necessary, I could provide complete data, with the time
> distribution of the number of gets/scans.
>
> These performance issues are very strange to me - do you have any
> suggestions as to what's causing such a big increase in execution time?
>
> Regards
> Marcin
>