HBase, mail # user - talk list table

Kireet Reddy 2013-04-15, 13:09
Re: talk list table
Ted Yu 2013-04-15, 17:28
I assume you would select HBase (the latest release) for this

For #1, write performance would be lower if you choose to use Append (vs.
using Put).

bq. Can appends be batched by the client or do they execute immediately?
This depends on your use case. Take a look at the following method in
HTable where you can send a list of actions (Appends):

  public void batch(final List<? extends Row> actions, final Object[] results)
      throws InterruptedException, IOException

For #2
bq. The other would be to prefix the timestamp row key with a random
leading byte.

This technique has been used elsewhere and is better than the first one.
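
To make the random-prefix idea concrete, here is a minimal sketch of how the keys might be built and how a consumer could work out what to scan. It uses only the JDK, not the HBase client. The class name, the method names, the bucket count of 16 and the 8-byte big-endian timestamp layout are all assumptions made for illustration; none of them come from this thread.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper, for illustration only.
class SaltedTaskKeys {
    // Assumed number of salt values; more buckets spread writes further
    // but mean more scans per read.
    public static final int BUCKETS = 16;

    // Row key layout assumed here: 1 salt byte, then the timestamp as
    // 8 big-endian bytes so that keys within one bucket sort by time.
    public static byte[] rowKey(byte salt, long timestampSeconds) {
        return ByteBuffer.allocate(1 + Long.BYTES)
                .put(salt)
                .putLong(timestampSeconds)
                .array();
    }

    // Writer side: pick a random bucket for each new task row.
    public static byte[] randomRowKey(long timestampSeconds) {
        byte salt = (byte) ThreadLocalRandom.current().nextInt(BUCKETS);
        return rowKey(salt, timestampSeconds);
    }

    // Reader side: one {startKey, stopKey} pair per bucket, covering
    // fromSeconds (inclusive) up to toSeconds (exclusive).
    public static List<byte[][]> scanRanges(long fromSeconds, long toSeconds) {
        List<byte[][]> ranges = new ArrayList<>();
        for (int salt = 0; salt < BUCKETS; salt++) {
            ranges.add(new byte[][] {
                rowKey((byte) salt, fromSeconds),
                rowKey((byte) salt, toSeconds)
            });
        }
        return ranges;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis() / 1000L;
        byte[] key = randomRowKey(now + 3600); // task rescheduled for t + 1 hour
        System.out.println("key length = " + key.length + ", salt = " + key[0]);
        System.out.println("ranges to scan = " + scanRanges(now, now + 1).size());
    }
}
```

Each start/stop pair would then be handed to a separate scan, so a consumer issues one scan per bucket for the time window it wants. That is the extra read-side work mentioned in the original question.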


On Mon, Apr 15, 2013 at 6:09 AM, Kireet Reddy <[EMAIL PROTECTED]> wrote:

> I am planning to create a "scheduled task list" table in our hbase
> cluster. Essentially we will define a table with key timestamp and then the
> row contents will be all the tasks that need to be processed within that
> second (or whatever time period). I am trying to do the "reasonably wide
> rows" design mentioned in the hbasecon opentsdb talk. A couple of questions:
> 1. Should we use append or put to create tasks? Since these rows will not
> live forever, storage space is not a concern, read/write performance is
> more important. As concurrency increases I would guess the row lock may
> become an issue in append? Can appends be batched by the client or do they
> execute immediately?
> 2. I am a little worried about hotspots. This basic design may cause
> issues in terms of the table's performance. Many tasks will execute and
> reschedule themselves using the same interval, t + 1 hour for example. So
> many of the writes may all go to the same block. Also, we have a lot of other
> data so I am worried it may impact performance of unrelated data if the
> region server gets too busy servicing the task list table. I can think of 2
> strategies to avoid this. One would be to create N different tables and
> read/write tasks to them randomly. This may spread load across servers, but
> there is no guarantee hbase will place the tables on different region
> servers, correct? The other would be to prefix the timestamp row key with a
> random leading byte. Then when reading from the task list table, consumers
> could scan from any/all possible values of the random byte + current
> timestamp to obtain tasks. Both strategies seem like they could spread out
> load, but at the cost of more work/complexity to read tasks from the table.
> Does either of those approaches make sense?
> On the read side, it seems like a similar problem exists in that all
> consumers will be reading rows based on the current timestamp. Is this good
> because the block will very likely be cached or bad because the region
> server may become overloaded? I have a feeling the answer is going to be
> "it depends". :)
> I did see the previous posts on queues and the tips there - use zookeeper
> for coordination, schedule major compactions, etc. Sorry if these questions
> are basic, I am pretty new to hbase. Thanks!
Kireet 2013-04-15, 18:15
Ted Yu 2013-04-15, 20:18
Amit Sela 2013-04-20, 15:24
Otis Gospodnetic 2013-04-20, 23:10