
Re: Performance issues of prepending a table
I would guess that this approach would be susceptible to the same kind of "hot spotting" as inserting sequential keys; if you're prepending globally (i.e. there's one global "first" row), then all activity will be taking place on the same region server, so you wouldn't be taking advantage of the natural parallelism of a clustered system like HBase.
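To make the hot-spotting concern concrete, here is a toy model in plain Java, not HBase code. The region boundaries and the server names (`server-A` and so on) are made up for illustration; the only assumption taken from HBase is that a region serves a contiguous, sorted range of row keys.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy model: each region owns a contiguous key range and lives on one server. */
public class HotSpotSketch {
    // Region start key -> hypothetical server name (all values are illustrative).
    private final TreeMap<Long, String> regions = new TreeMap<>();

    public HotSpotSketch() {
        regions.put(Long.MIN_VALUE, "server-A"); // first region: no lower bound
        regions.put(1_000L, "server-B");
        regions.put(2_000L, "server-C");
    }

    /** The server whose region range contains rowKey. */
    public String serverFor(long rowKey) {
        return regions.floorEntry(rowKey).getValue();
    }

    /** Counts how many of the given writes land on each server. */
    public Map<String, Integer> countWrites(long[] keys) {
        Map<String, Integer> counts = new TreeMap<>();
        for (long key : keys) {
            counts.merge(serverFor(key), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        HotSpotSketch cluster = new HotSpotSketch();

        // Prepending: every new key is just below the previous first key.
        long[] prepends = new long[100];
        for (int i = 0; i < prepends.length; i++) {
            prepends[i] = 500L - i;
        }
        System.out.println("prepend writes: " + cluster.countWrites(prepends));

        // Keys spread across the key space touch every server.
        long[] spread = {500L, 1_500L, 2_500L};
        System.out.println("spread writes:  " + cluster.countWrites(spread));
    }
}
```

In this model every prepend lands on whichever server holds the first region, while keys spread over the key space are shared among all three.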

That aside, I can't think of anything architectural about HBase that would make it perform poorly when you continually insert rows that sort before other rows. I think the log-structured merge trees that HBase uses for storage will handle any kind of insert activity more or less identically, and writes to the WAL and the memstore should be equally fast regardless of row key position. (Flushes to store files on disk follow the sorted arrangement in memory, which has already been established by that point.) There may be some smaller-order differences in the speed of inserting into the memstore, depending on position, but that'd be something you'd have to benchmark, and my guess is you'd get nothing discernible. But as always, the best way to know is to try it. :)
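A rough sketch of that write path, as a toy model rather than HBase's actual implementation: the class and method names are hypothetical, the "WAL" is just an append-only list, and the "memstore" is a sorted map. It only illustrates why the order in which keys arrive should not change what a flush produces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

/** Toy write path: append to a log, insert into a sorted in-memory map, flush in key order. */
public class MemstoreSketch {
    private final List<String> wal = new ArrayList<>();
    private final ConcurrentSkipListMap<Long, String> memstore = new ConcurrentSkipListMap<>();

    /** A write is a log append plus a sorted-map insert, wherever the key sorts. */
    public void put(long rowKey, String value) {
        wal.add(rowKey + "=" + value);   // append-only, independent of key order
        memstore.put(rowKey, value);     // skip-list insert, O(log n) for any key
    }

    /** Flushing iterates the map in key order; the sorting already happened on insert. */
    public List<Long> flush() {
        List<Long> storeFileKeys = new ArrayList<>(memstore.keySet());
        memstore.clear();
        return storeFileKeys;
    }

    public int walSize() {
        return wal.size();
    }

    public static void main(String[] args) {
        MemstoreSketch store = new MemstoreSketch();
        // Prepend-style writes: each key sorts before the previous one.
        store.put(3L, "c");
        store.put(2L, "b");
        store.put(1L, "a");
        System.out.println(store.flush()); // keys come out in ascending order
    }
}
```

Whether the real memstore shows a measurable difference for inserts at the front of the key range is still something only a benchmark can answer.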


On Apr 18, 2012, at 8:59 AM, de Souza Medeiros Andre wrote:

Hi all,

For a specific reason, I have an HBase table that needs to be prepended to frequently. The row keys in this table are long integers (converted to bytes, of course). "Prepend" is an operation that does the following:
1. Scans the table just for the purpose of getting the row key X of the first row, then stops the scan.
2. CheckAndSet on X-1, checking if row X-1 is null and putting data at row key X-1.
3. If the CAS failed, try CAS on X-2, etc.
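The three steps above can be sketched against an in-memory stand-in for the table. This is not HBase client code: `PrependSketch`, `prepend`, and `keyIfEmpty` are hypothetical names, and `putIfAbsent` on a sorted map plays the role of the CAS. In the HBase client, `checkAndPut` with a null expected value is, as far as I know, the closest equivalent, but treat that as an assumption to verify against your version.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

/** In-memory stand-in for a table sorted by long row key. */
public class PrependSketch {
    private final ConcurrentSkipListMap<Long, byte[]> table = new ConcurrentSkipListMap<>();

    /**
     * Inserts value before the current first row and returns the key used.
     * keyIfEmpty is a hypothetical parameter: the key to start from when the table is empty.
     */
    public long prepend(byte[] value, long keyIfEmpty) {
        // Step 1: look only at the first row to learn its key X.
        Map.Entry<Long, byte[]> first = table.firstEntry();
        long candidate = (first == null) ? keyIfEmpty : Math.subtractExact(first.getKey(), 1L);

        while (true) {
            // Step 2: atomic "put only if the row is absent" (stand-in for the CAS).
            if (table.putIfAbsent(candidate, value) == null) {
                return candidate;
            }
            // Step 3: another writer took this key; try the next lower one.
            // subtractExact throws ArithmeticException instead of wrapping at Long.MIN_VALUE.
            candidate = Math.subtractExact(candidate, 1L);
        }
    }

    /** Returns the current first row key, or null if the table is empty. */
    public Long firstKey() {
        Map.Entry<Long, byte[]> first = table.firstEntry();
        return (first == null) ? null : first.getKey();
    }

    public static void main(String[] args) {
        PrependSketch t = new PrependSketch();
        System.out.println(t.prepend(new byte[] {1}, 100L));
        System.out.println(t.prepend(new byte[] {2}, 100L));
    }
}
```

One caveat this sketch hides: it compares keys as signed longs, whereas HBase compares row keys as raw bytes. With the usual big-endian encoding, a negative long would sort after the positive ones, so the sketch only mirrors the table's ordering while keys stay non-negative.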

I'd like to know if there are any obvious performance drawbacks with this approach, compared to inserting rows randomly in the table. By "obvious performance drawbacks" I mean something that doesn't need to be benchmarked to know its effects.

I am aware that scanning plus CAS will be slower than a simple Put, but I'd like to know if prepending has any negative effect regarding region management and misc.

Thank you,
Andre Medeiros