HBase, mail # user - Essential column family performance


Re: Essential column family performance
Ted Yu 2013-04-10, 01:21
bq. with only 10000 rows that would all fit in the memstore.

This aspect should be enhanced in the test.

Cheers

On Tue, Apr 9, 2013 at 6:17 PM, Lars Hofhansl <[EMAIL PROTECTED]> wrote:

> Also the unittest tests with only 10000 rows that would all fit in the
> memstore. Seek vs reseek should make little difference for the memstore.
>
> We tested with 1m and 10m rows, and flushed the memstore and compacted
> the store.
>
> Will do some more verification later tonight.
>
> -- Lars
>
>
> Lars H <[EMAIL PROTECTED]> wrote:
>
> >Your slow scanner performance seems to vary as well. How come? Slow is
> >with the feature off.
> >
> >I don't see how reseek can be slower than seek in any scenario.
> >
> >-- Lars
> >
> >Ted Yu <[EMAIL PROTECTED]> schrieb:
> >
> >>I tried using reseek() as suggested, along with my patch from HBASE-8306
> >>(30% selection rate, random distribution and FAST_DIFF encoding on both
> >>column families).
> >>I got uneven results:
> >>
> >>2013-04-09 16:59:01,324 INFO  [main] regionserver.TestJoinedScanners(167):
> >>Slow scanner finished in 7.529083 seconds, got 1546 rows
> >>
> >>2013-04-09 16:59:06,760 INFO  [main] regionserver.TestJoinedScanners(167):
> >>Joined scanner finished in 5.43579 seconds, got 1546 rows
> >>...
> >>2013-04-09 16:59:12,711 INFO  [main] regionserver.TestJoinedScanners(167):
> >>Slow scanner finished in 5.95016 seconds, got 1546 rows
> >>
> >>2013-04-09 16:59:20,240 INFO  [main] regionserver.TestJoinedScanners(167):
> >>Joined scanner finished in 7.529044 seconds, got 1546 rows
> >>
> >>FYI
> >>
> >>On Tue, Apr 9, 2013 at 4:47 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
> >>
> >>> We did some tests here.
> >>> I ran this through the profiler against a local RegionServer and found
> >>> the part that causes the slowdown is a seek called here:
> >>>     boolean mayHaveData =
> >>>         (nextJoinedKv != null
> >>>             && nextJoinedKv.matchingRow(currentRow, offset, length))
> >>>         ||
> >>>         (this.joinedHeap.seek(
> >>>                 KeyValue.createFirstOnRow(currentRow, offset, length))
> >>>             && joinedHeap.peek() != null
> >>>             && joinedHeap.peek().matchingRow(currentRow, offset, length));
> >>>
> >>> Looking at the code, this is needed because the joinedHeap can fall
> >>> behind, and hence we have to catch it up.
> >>> The key observation, though, is that the joined heap can only ever be
> >>> behind, and hence we do not need a seek, but only a reseek.
> >>>
> >>> Deploying a RegionServer with the seek replaced with reseek we see an
> >>> improvement in *all* cases.
> >>>
> >>> I'll file a jira with a fix later.
> >>>
> >>> -- Lars
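
The seek/reseek distinction described above can be illustrated with a toy cursor over sorted row keys. This is only a sketch under loud assumptions: `ReseekSketch`, its `steps` counter and its methods are hypothetical names, not HBase's KeyValueHeap, and seek is modeled as a restart from the first row purely to make the difference visible (a real seek uses the block index rather than a linear walk). The point it shows is that when the cursor can only ever be behind the target, continuing forward from the current position is enough.

```java
/** Toy forward-only cursor over sorted row keys; hypothetical, not HBase code. */
class ReseekSketch {
    private final long[] rows; // row keys, sorted ascending
    private int pos = 0;       // index of the current row
    long steps = 0;            // rows skipped so far, for illustration only

    ReseekSketch(long[] sortedRows) {
        this.rows = sortedRows;
    }

    /** General seek: the target may be anywhere, so restart from the first row. */
    boolean seek(long target) {
        pos = 0;
        return advanceTo(target);
    }

    /** Reseek: the caller guarantees target >= current row, so continue from pos. */
    boolean reseek(long target) {
        return advanceTo(target);
    }

    /** Moves forward to the first row >= target; returns false if none is left. */
    private boolean advanceTo(long target) {
        while (pos < rows.length && rows[pos] < target) {
            pos++;
            steps++;
        }
        return pos < rows.length;
    }

    /** Current row, or null when the cursor is exhausted. */
    Long peek() {
        return pos < rows.length ? rows[pos] : null;
    }
}
```

In this toy model, catching up to the next row with reseek() costs only the rows between the current position and the target, while seek() walks again from the start each time.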
> >>>
> >>>
> >>>
> >>> ________________________________
> >>>  From: James Taylor <[EMAIL PROTECTED]>
> >>> To: [EMAIL PROTECTED]
> >>> Sent: Monday, April 8, 2013 6:53 PM
> >>> Subject: Re: Essential column family performance
> >>>
> >>> Good idea, Sergey. We'll rerun with larger non-essential column family
> >>> values and see if there's a crossover point. One other difference for
> >>> us is that we're using FAST_DIFF encoding. We'll try with no encoding
> >>> too. Our table has 20 million rows across four region servers.
> >>>
> >>> Regarding the parallelization we do, we run multiple scans in parallel
> >>> instead of one single scan over the table. We use the region boundaries
> >>> of the table to divide up the work evenly, adding a start/stop key for
> >>> each scan that corresponds to the region boundaries. Our client then
> >>> does a final merge/aggregation step (i.e. adding up the count it gets
> >>> back from the scan for each region).
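
The parallelization described above (one scan per region range, then a client-side merge of the counts) might be sketched roughly as follows. Everything here is an assumption for illustration: the table is simulated by an in-memory sorted set instead of real HBase scans, and `ParallelRegionCount`, `countRange` and `parallelCount` are hypothetical names. A `null` start or stop key stands in for an open-ended range.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: one counting task per region range, summed on the client. */
class ParallelRegionCount {

    /** Counts rows in [start, stop); a null bound means the range is open on that side. */
    static long countRange(NavigableSet<String> table, String start, String stop) {
        NavigableSet<String> view = table;
        if (start != null) {
            view = view.tailSet(start, true);  // start key is inclusive
        }
        if (stop != null) {
            view = view.headSet(stop, false);  // stop key is exclusive
        }
        return view.size();
    }

    /** splitKeys are the sorted region boundaries; n boundaries give n + 1 ranges. */
    static long parallelCount(NavigableSet<String> table, List<String> splitKeys) {
        ExecutorService pool = Executors.newFixedThreadPool(splitKeys.size() + 1);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            for (int i = 0; i <= splitKeys.size(); i++) {
                final String start = (i == 0) ? null : splitKeys.get(i - 1);
                final String stop = (i == splitKeys.size()) ? null : splitKeys.get(i);
                Callable<Long> task = () -> countRange(table, start, stop);
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get();           // final merge/aggregation step
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("range count failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the ranges are disjoint and together cover the whole key space, summing the per-range counts gives the same total as a single scan over the table.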
> >>>
> >>> On 04/08/2013 01:34 PM, Sergey Shelukhin wrote:
> >>> > IntegrationTestLazyCfLoading uses randomly distributed keys with the
> >>> > following condition for filtering:
> >>> > 1 == (Long.parseLong(Bytes.toString(rowKey, 0, 4), 16) & 1); where
> >>> > rowKey is the hex string of the MD5 key.
> >>> > Then, there are 2 "lazy" CFs, each of which has a value of 4-64k.
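
The filtering condition quoted above can be tried in isolation with a small sketch. As an assumption made only to keep the example free of HBase dependencies, `Bytes.toString(rowKey, 0, 4)` is replaced by a plain `new String(...)` call, and the class and method names are hypothetical. The condition parses the first four hex characters of the row key and keeps the row when the resulting number is odd, so for uniformly distributed MD5 keys roughly half of the rows pass.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-alone version of the row filter condition quoted above. */
class LazyCfFilterCondition {

    /** Returns true when the first four hex characters of rowKey encode an odd number. */
    static boolean accept(byte[] rowKey) {
        // Stand-in for Bytes.toString(rowKey, 0, 4); assumes the key is ASCII hex text.
        String prefix = new String(rowKey, 0, 4, StandardCharsets.UTF_8);
        return 1 == (Long.parseLong(prefix, 16) & 1);
    }
}
```

Only the fourth hex character decides the outcome, since it carries the lowest bit of the parsed value.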