HBase, mail # dev - Poor HBase random read performance


Varun Sharma 2013-06-29, 19:13
lars hofhansl 2013-06-29, 22:09
lars hofhansl 2013-06-29, 22:24
Varun Sharma 2013-06-29, 22:39
Varun Sharma 2013-06-29, 23:10
Vladimir Rodionov 2013-07-01, 18:08
lars hofhansl 2013-07-01, 19:05
lars hofhansl 2013-07-01, 19:10
Re: Poor HBase random read performance
Varun Sharma 2013-07-01, 23:10
Going back to leveldb vs. hbase, I am not sure we can come up with a clean
way to identify the HFiles containing the most recent data in the wake of
compactions.

I do wonder, though, whether this works with minor compactions. Let's say you
compact a really old file with a new file. Since the resulting file's most
recent timestamp is very recent (because of the new file), you look into this
file, but then retrieve something from the "old" portion of it. So you end up
with older data.

I guess one way would be to simply order the files by time range. Files with
non-intersecting time ranges can be ordered in reverse time order;
intersecting files can be seeked together.

File1
|-----------------|
                     File2
                     |---------------|
                             File3
                             |-----------------------------|
                                                               File4
                                                               |--------------------|

So in this case, we seek

[File1], [File2, File3], [File4]

I think for random single key-value lookups ((row, col) -> key), this could
yield good savings for time-ordered clients (which are quite common). Unless
File1 and File4 get compacted together, in which case we always need to seek
into both.
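As a sketch of that grouping, here is one way to build the seek groups. This is not HBase's actual API: FileRange and group() are hypothetical, assuming each file exposes the min/max timestamps of its KVs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: files whose [minTs, maxTs] ranges intersect fall into
// one "seek group"; the groups themselves are totally ordered by time, so a
// time-ordered client can consult them newest-first and stop early.
public class SeekGroups {

    // Stand-in for an HFile's time-range metadata (not a real HBase class).
    record FileRange(String name, long minTs, long maxTs) {}

    static List<List<FileRange>> group(List<FileRange> files) {
        List<FileRange> sorted = new ArrayList<>(files);
        sorted.sort(Comparator.comparingLong(FileRange::minTs));
        List<List<FileRange>> groups = new ArrayList<>();
        long groupMaxTs = Long.MIN_VALUE;
        for (FileRange f : sorted) {
            if (groups.isEmpty() || f.minTs() > groupMaxTs) {
                // No overlap with the current group: start a new seek group.
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(f);
            groupMaxTs = Math.max(groupMaxTs, f.maxTs());
        }
        return groups;
    }

    public static void main(String[] args) {
        // The four files from the diagram above.
        List<FileRange> files = List.of(
                new FileRange("File1", 0, 10),
                new FileRange("File2", 12, 20),
                new FileRange("File3", 15, 30),
                new FileRange("File4", 32, 45));
        for (List<FileRange> seekGroup : group(files)) {
            System.out.println(seekGroup.stream().map(FileRange::name).toList());
        }
        // prints [File1], then [File2, File3], then [File4]
    }
}
```

For the four files in the diagram this yields [File1], [File2, File3], [File4], matching the grouping described above.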

On Mon, Jul 1, 2013 at 12:10 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> Sorry. Hit enter too early.
>
> Some discussion here:
> http://apache-hbase.679495.n3.nabble.com/keyvalue-cache-td3882628.html
> but no actionable outcome.
>
> -- Lars
> ________________________________
> From: lars hofhansl <[EMAIL PROTECTED]>
> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Sent: Monday, July 1, 2013 12:05 PM
> Subject: Re: Poor HBase random read performance
>
>
> This came up a few times before.
>
>
>
> ________________________________
> From: Vladimir Rodionov <[EMAIL PROTECTED]>
> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; lars hofhansl <
> [EMAIL PROTECTED]>
> Sent: Monday, July 1, 2013 11:08 AM
> Subject: RE: Poor HBase random read performance
>
>
> I would like to remind everyone that the original BigTable design includes a
> scan cache to take care of random reads, and this
> important feature is still missing in HBase.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: [EMAIL PROTECTED]
>
> ________________________________________
> From: lars hofhansl [[EMAIL PROTECTED]]
> Sent: Saturday, June 29, 2013 3:24 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Poor HBase random read performance
>
> Should also say that random reads this way are somewhat of a worst case
> scenario.
>
> If the working set is much larger than the block cache and the reads are
> random, then each read will likely have to bring in an entirely new block
> from the OS cache,
> even when the KVs are much smaller than a block.
>
> So in order to read a (say) 1k KV HBase needs to bring 64k (default block
> size) from the OS cache.
> As long as the dataset fits into the block cache this difference in size
> has no performance impact, but as soon as the dataset does not fit, we have
> to bring much more data from the OS cache than we're actually interested in.
>
> Indeed in my test I found that HBase brings in about 60x the data size
> from the OS cache (used PE with ~1k KVs). This can be improved with smaller
> block sizes; and with a more efficient way to instantiate HFile blocks in
> Java (which we need to work on).
>
>
> -- Lars
>
> ________________________________
> From: lars hofhansl <[EMAIL PROTECTED]>
> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Sent: Saturday, June 29, 2013 3:09 PM
> Subject: Re: Poor HBase random read performance
>
>
> I've seen the same bad performance behavior when I tested this on a real
> cluster. (I think it was in 0.94.6)
>
>
> Instead of en/disabling the blockcache, I tested sequential and random
> reads on a data set that does not fit into the (aggregate) block cache.
> Sequential reads were drastically faster than Random reads (7 vs 34
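A back-of-the-envelope check of the block-size arithmetic quoted above (assuming the 64 KB default block size and the ~1 KB KVs from the PE run; illustrative only):

```java
public class ReadAmplification {
    // Bytes fetched from the OS cache per read vs. bytes actually wanted,
    // for cold random reads where every lookup touches a new block.
    static long amplification(long blockSizeBytes, long kvSizeBytes) {
        return blockSizeBytes / kvSizeBytes;
    }

    public static void main(String[] args) {
        // 64 KB default HFile block size, ~1 KB KV: the ideal worst case is
        // 64x, consistent with the ~60x lars measured.
        System.out.println(amplification(64 * 1024, 1024) + "x");  // prints 64x
    }
}
```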
Vladimir Rodionov 2013-07-01, 23:57
Vladimir Rodionov 2013-07-02, 00:09
Ted Yu 2013-07-01, 23:27
Jean-Daniel Cryans 2013-07-01, 16:55
Varun Sharma 2013-07-01, 17:50
Lars Hofhansl 2013-06-30, 07:45
Vladimir Rodionov 2013-07-01, 18:26
Varun Sharma 2013-07-01, 18:30