Re: [ hbase ] performance of Get from MR Job
Michael Segel 2012-06-27, 15:48
I'm not sure as to what you are attempting to do with your data.
There are a couple of things to look at.

Looking at the issue, you have a (K,V) pair. That's Key, Value. But the value isn't necessarily a single element; it could be a set of elements. You have to consider that option rather than storing versions of a cell and using the timestamp to indicate the revision of the data. (There are some design issues with that concept.) Or you could incorporate the timestamp in your column name.

While a get() is really a scan() that returns one row, it should be faster than what you are experiencing.

Schema design is a bit tricky to master because it's going to be data dependent along with your use case.

On Jun 25, 2012, at 2:32 AM, Marcin Cylke wrote:

> On 21/06/12 14:33, Michael Segel wrote:
>> I think the version issue is the killer factor here.
>> Usually performing a simple get() where you are getting the latest version of the data on the row/cell occurs in some constant time k. This is constant regardless of the size of the cluster and should scale in a near linear curve.
>>
>> As JD C points out, if your storing temporal data, you should make time part of your schema.
>
> I've rewritten my job to load data and not fill individual timestamps
> for columns, but rather add timestamp to rowkey. Now it looks like this
>
> [previous key][Long.MAX_VALUE-timestamp]
> (without braces)
>
> My keys look like this now:
>
> 488892772259223372035596613844
>
> and I'm issuing a scan like this:
>
> Scan scan = new Scan("488892772259");
> scan.setMaxVersions(1);
>
> So I'm searching for my key without timestamp part added. What I'm
> getting back is all the rows that start with "488892772259".
>
> Now the performance is even worse than before (with versioned data).
>
> What I'm also observing is the "hugeness" of my tables and influence of
> compression on the performance:
>
> My initial data - stored in Hive table - is ~ 1.5GB. When I load it into
> HBase it takes ~8GB. Compressing my ColumnFamily with LZO gets the size
> down to ~1.5GB, but it also dramatically reduces performance.
>
> To sum up, here are rough times of execution and rates of requests that
> I've been observing (for each option I've added GET/SCAN throughput and
> rough execution time):
>
> - versioned data (uncompressed table)
>   - with misses (asking for non-existent key) - ~400 gets/sec - ~1h
>   - with hits (asking for existing keys) - ~150gets/sec - ~20h
> - single version (with complex key)
>   - uncompressed - ~30 scans/sec - ~25h
>   - compressed with LZO - ~15 scans/sec - ~30h
>
> If that would be necessary I could provide complete data - with time
> distribution of the number of gets/scans.
>
> This performance issues are very strange to me - do You have any
> suggestions as to what's causing so big increase in the time of execution?
>
> Regards
> Marcin
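
For reference, the key layout described in the quoted message, [previous key][Long.MAX_VALUE-timestamp], can be sketched in plain Java with no HBase dependency. This is only an illustration under stated assumptions: the class and method names are hypothetical, the timestamp is assumed to be non-negative, and the split of the example key into the prefix "48889277225" and the timestamp 1258161963 is a guess, chosen only because it reproduces the key shown in the thread. The stop-row helper is an addition that is not in the thread. A scan built from a start row alone has no upper bound, so a stop row is one common way to confine it to a prefix; whether that has anything to do with the timings reported here is not established.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReverseTimestampKey {

    // Builds "[previous key][Long.MAX_VALUE - timestamp]" so that, within one
    // prefix, newer timestamps sort before older ones.
    // Assumes timestamp >= 0; the suffix is zero-padded to 19 digits (the width
    // of Long.MAX_VALUE) so that text order matches numeric order.
    static String rowKey(String previousKey, long timestamp) {
        long reversed = Long.MAX_VALUE - timestamp;
        return previousKey + String.format("%019d", reversed);
    }

    // Returns the smallest key that sorts after every key starting with
    // `prefix`: drop trailing 0xFF bytes, then increment the last byte left.
    // An empty result means "no upper bound".
    static byte[] stopRowForPrefix(byte[] prefix) {
        int end = prefix.length;
        while (end > 0 && prefix[end - 1] == (byte) 0xFF) {
            end--;
        }
        if (end == 0) {
            return new byte[0];
        }
        byte[] stop = Arrays.copyOf(prefix, end);
        stop[end - 1]++;
        return stop;
    }

    public static void main(String[] args) {
        // Hypothetical split of the example key from the thread.
        String prefix = "48889277225";
        System.out.println(rowKey(prefix, 1258161963L));
        // prints 488892772259223372035596613844

        byte[] startRow = prefix.getBytes(StandardCharsets.UTF_8);
        byte[] stopRow = stopRowForPrefix(startRow);
        System.out.println(new String(stopRow, StandardCharsets.UTF_8));
        // prints 48889277226
    }
}
```

With the HBase client, the two byte arrays would be passed as the start and stop rows of the Scan, so that the region server stops reading at the end of the prefix instead of leaving that to the client.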