Re: [hbase] performance of Get from MR Job
Michael Segel 2012-06-27, 15:48
I'm not sure as to what you are attempting to do with your data.
There are a couple of things to look at.
Looking at the issue, you have (K,V) pair. That's Key, Value.
But the value isn't necessarily a single element. It could be a set of elements.
You have to consider whether to store versions of a cell, using the timestamp to indicate the revision of the data (there are some design issues with that approach), or to incorporate the timestamp in your column name.
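The column-name alternative can be sketched in plain Java (hypothetical column name, no HBase client calls): encoding Long.MAX_VALUE - timestamp as a fixed-width, zero-padded suffix makes the newest entry sort first within the row.

```java
public class TimestampedQualifier {
    // Hypothetical helper: append a fixed-width, inverted timestamp to the
    // column name so qualifiers sort newest-first lexicographically.
    // Long.MAX_VALUE has 19 decimal digits, hence the %019d padding.
    public static String qualifier(String column, long timestamp) {
        return String.format("%s_%019d", column, Long.MAX_VALUE - timestamp);
    }

    public static void main(String[] args) {
        String newer = qualifier("price", 2000L);
        String older = qualifier("price", 1000L);
        // The newer entry compares lexicographically before the older one.
        System.out.println(newer.compareTo(older) < 0);
    }
}
```

In a real table these qualifier bytes would go into the Put/Get calls; this only illustrates the ordering trick.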
While a get() is really a scan() that returns one row, it should be faster than what you are experiencing.
Schema design is a bit tricky to master because it's going to depend on your data as well as your use case.
On Jun 25, 2012, at 2:32 AM, Marcin Cylke wrote:
> On 21/06/12 14:33, Michael Segel wrote:
>> I think the version issue is the killer factor here.
>> Usually performing a simple get() where you are getting the latest version of the data on the row/cell occurs in some constant time k. This is constant regardless of the size of the cluster and should scale in a near linear curve.
>> As JD C points out, if you're storing temporal data, you should make time part of your schema.
> I've rewritten my job so that loading no longer fills in individual
> timestamps for columns, but instead adds the timestamp to the rowkey.
> My keys now look like this (without the brackets):
> [previous key][Long.MAX_VALUE - timestamp]
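For illustration, a composite rowkey of that shape can be built in plain Java (no HBase dependencies; HBase's own `Bytes` utility would normally do the encoding). Big-endian encoding of Long.MAX_VALUE - timestamp means newer rows sort before older ones for the same logical key.

```java
import java.nio.ByteBuffer;

public class InvertedTimestampKey {
    // Build [prefix][Long.MAX_VALUE - timestamp] as raw bytes.
    // ByteBuffer writes the long big-endian by default, which matches
    // HBase's unsigned lexicographic row ordering.
    public static byte[] rowKey(byte[] prefix, long timestamp) {
        ByteBuffer buf = ByteBuffer.allocate(prefix.length + 8);
        buf.put(prefix);
        buf.putLong(Long.MAX_VALUE - timestamp);
        return buf.array();
    }

    // Unsigned lexicographic compare, the order HBase uses for row keys.
    public static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] prefix = "488892772259".getBytes();
        // For the same prefix, the newer timestamp yields the smaller key.
        System.out.println(compare(rowKey(prefix, 2000L), rowKey(prefix, 1000L)) < 0);
    }
}
```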
> and I'm issuing a scan like this:
> Scan scan = new Scan(Bytes.toBytes("488892772259"));
> So I'm searching for my key without timestamp part added. What I'm
> getting back is all the rows that start with "488892772259".
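One thing worth checking here: a Scan given only a start row runs on until something stops it, so a prefix lookup should also set a stop row just past the prefix. A plain-Java sketch of computing that stop row (no HBase dependencies; whether this explains the slowdown is only a guess):

```java
import java.util.Arrays;

public class PrefixStopRow {
    // Smallest row key strictly greater than every key with the given
    // prefix: increment the last byte that is not 0xFF and truncate there.
    public static byte[] stopRow(byte[] prefix) {
        byte[] stop = Arrays.copyOf(prefix, prefix.length);
        for (int i = stop.length - 1; i >= 0; i--) {
            if ((stop[i] & 0xff) != 0xff) {
                stop[i]++;
                return Arrays.copyOf(stop, i + 1);
            }
            // 0xFF would roll over, so carry into the preceding byte.
        }
        return new byte[0]; // all 0xFF: no upper bound, scan to end of table
    }

    public static void main(String[] args) {
        // '9' + 1 == ':', so prefix "488892772259" stops at "48889277225:".
        System.out.println(new String(stopRow("488892772259".getBytes())));
    }
}
```

The result would then be passed as the Scan's stop row (e.g. via the two-argument Scan constructor or setStopRow()).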
> Now the performance is even worse than before (with versioned data).
> What I'm also observing is the "hugeness" of my tables and influence of
> compression on the performance:
> My initial data - stored in Hive table - is ~ 1.5GB. When I load it into
> HBase it takes ~8GB. Compressing my ColumnFamily with LZO gets the size
> down to ~1.5GB, but it also dramatically reduces performance.
> To sum up, here are rough times of execution and rates of requests that
> I've been observing (for each option I've added GET/SCAN throughput and
> rough execution time):
> - versioned data (uncompressed table)
> - with misses (asking for non-existent key) - ~400 gets/sec - ~1h
> - with hits (asking for existing keys) - ~150 gets/sec - ~20h
> - single version (with complex key)
> - uncompressed - ~30 scans/sec - ~25h
> - compressed with LZO - ~15 scans/sec - ~30h
> If necessary, I could provide complete data, including the time
> distribution of the number of gets/scans.
> These performance issues are very strange to me - do you have any
> suggestions as to what's causing such a big increase in execution time?