In my opinion, it really depends on your queries.
The first one achieves data locality. There is no additional data transfer
between different nodes. But this strategy sacrifices parallelism, and the
node which stores A will be a hot node if too many applications try to
The second approach gives you parallelism, but you somehow need to merge the
data together to generate the final results. So you can see there is a
trade-off between data locality and parallelism. The performance of a
query will be influenced by the following factors:
1. data size;
2. data access frequency;
3. data access pattern (full scan or index scan);
4. network bandwidth.
So the best solution for one situation may not fit the others.
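
To make the trade-off concrete, here is a rough sketch in Python (not HBase
code) that estimates how the work is spread over the nodes for the two key
designs in the example below. The node layout, the even split of a user's
requests over the nodes holding that user, and all names in the code are my
own assumptions for illustration; only the 3:1:1:1 fetch ratio and the 10
nodes come from the original mail.

```python
from fractions import Fraction

# Relative fetch frequency per user: A is fetched 3x as often as each other user.
FETCH_WEIGHT = {"A": 3, "B": 1, "C": 1, "D": 1}

# Assumed layout for key design #1 (node index -> user stored there):
# node 0 holds A, node 1 holds B, node 2 holds C, nodes 3..9 hold D.
LAYOUT_DESIGN_1 = ["A", "B", "C"] + ["D"] * 7


def traffic_share_design_1(fetch_weight, layout):
    """Share of all requests landing on each node when rows are grouped by user."""
    total_weight = sum(fetch_weight.values())
    shares = []
    for user in layout:
        nodes_for_user = layout.count(user)
        # Assumption: a user's requests split evenly over the nodes holding that user.
        shares.append(Fraction(fetch_weight[user], total_weight * nodes_for_user))
    return shares


def traffic_share_design_2(num_nodes):
    """Share of the scan work per node when keys are random and every request scans all nodes."""
    return [Fraction(1, num_nodes)] * num_nodes


design_1 = traffic_share_design_1(FETCH_WEIGHT, LAYOUT_DESIGN_1)
design_2 = traffic_share_design_2(len(LAYOUT_DESIGN_1))

print("design #1:", [str(s) for s in design_1])
print("design #2:", [str(s) for s in design_2])
```

Under these assumptions the node holding A takes half of all requests in
design #1, while design #2 spreads the work evenly at one tenth per node.
The price of design #2 is that every request has to touch the whole table
and the partial results have to be merged.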
On Tue, Dec 17, 2013 at 3:56 AM, Tao Xiao <[EMAIL PROTECTED]> wrote:
> Sometimes row key design is a trade-off between load balancing and
> querying: if you design the row key so that queries are fast and
> convenient, the records may not be spread evenly across the nodes; if
> you design the row key so that the records are spread evenly across the
> nodes, it may be inconvenient to query or impossible to get the record
> through row key directly (say you have a random number as the row key's
> You can have a look at secondary indexes. A secondary index is very helpful.
> 2013/12/16 Wilm Schumacher <[EMAIL PROTECTED]>
> > Hi,
> > I'm a newbie to HBase and have a question on rowkey design, and I
> > hope this question isn't too newbie-like for this list. I have a question
> > which cannot be answered by knowledge of code but by experience with
> > large databases, thus this mail.
> > For the sake of explanation I'll create a small example. Suppose you want
> > to design a small "blogging" platform. You just want to store the name
> > of the user and a small text. And of course you want to get all postings
> > of one user.
> > Furthermore we have 4 users, let's call them A, B, C, D (and you can trust
> > that the length of the username is fixed). Now let's say A, B and C
> > each have N postings, and D has 6*N postings. BUT: the data of A is fetched
> > 3 times more often than the data of each of the other users!
> > If you create an HBase cluster with 10 nodes, every node holds N
> > postings (of course I know that the data is held redundantly, but this
> > is not so important for the question).
> > Rowkey design #1:
> > the i-th posting of user X would have the rowkey: "$X$i", e.g. "A003".
> > The table just would be: "create 'postings' , 'text'"
> > For this rowkey design the first node would hold the data of A, the
> > second of B, the third of C and the fourth to the tenth node the data of
> > Fetching of data would be very easy, but half of the traffic would hit
> > the first node.
> > Rowkey design #2:
> > the rowkey would be random, e.g. a UUID. The table design would now be:
> > "create 'postings' , 'user' , 'text'"
> > Fetching the data would be a "real" map-reduce job, checking for
> > the user, emitting, etc.
> > So, if a fetch takes place I have to do more computation cycles and
> > IO. But in this scenario the traffic would be spread across all 10 servers.
> > If N (the number of postings) is large enough that disk
> > space is critical, I'm also not able to adjust the key regions in a way
> > that e.g. the data of D is only on the last server and the key space of
> > A would span the first 5 nodes. Nor can I make replication very broad
> > (e.g. 10 times in this case).
> > So basically the question is: What's the better plan? Trying to avoid
> > computation cycles of map reducing and get the key design straight, or
> > trying to scale the computation, but doing more IO?
> > I hope that the small example helped to make the question more vivid.
> > Best wishes
> > Wilm