-
Help needed! Performance related questions
William Kang 2010-10-14, 17:53
Hi all, I am in a hurry to finish a report on whether or not we should host our data in HBase. After a lot of reading and digging, there are still some questions I cannot find answers to. Sorry for bringing them up again if you have seen them before. :) If you could answer any of the following questions, I would be very grateful.
1. For cell size, why should it not be larger than 20 MB in general?
2. What is the block size if the cell is 20 MB? Can a cell cover multiple blocks?
3. Does a single-cell column family (one that has only one cell) share the same size limit as a cell? In other words, should a single column family be smaller than 20 MB?
4. Is there any advantage to putting rows close together in HBase if they have a high chance of being queried together?
5. Is there any general rule for row size?
6. Where does the HRegion keep the row keys: in the HFile or in other files?
Many thanks! Your answers would be highly appreciated. William
-
Re: Help needed! Performance related questions
Amandeep Khurana 2010-10-14, 18:14
> 4. Is there any advantage to putting rows close together in HBase if they have a high chance of being queried together?
Yes. Rows are stored contiguously, sorted by RowID+ColFam+ColQual+Timestamp, so reads are faster when you access contiguous rows (you avoid disk seeks). You can scan a set of rows and retrieve them.
> 5. Is there any general rule for row size?
If a row is bigger than the max region size you have configured, the region won't split. In other words, rows don't span regions.
> 6. Where does the HRegion keep the row keys: in the HFile or in other files?
They are in the HFile.
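To illustrate why contiguous rows are cheap to scan, here is a minimal Python sketch of the sort order described above. It is a toy model, not HBase code: the tuple layout, the `scan` helper, and the sample data are all hypothetical, and negating the timestamp is an assumption used here to make newer versions sort first.

```python
import bisect

# Hypothetical in-memory model of a sorted store; not the HBase API.
# Each key is (row, family, qualifier, -timestamp). Negating the timestamp
# makes newer versions sort first (an assumption of this sketch).
cells = sorted([
    (("user#0002", "info", "name", -100), "bob"),
    (("user#0001", "info", "name", -100), "alice"),
    (("user#0001", "info", "name", -200), "alice v2"),
    (("user#0003", "info", "name", -100), "carol"),
    (("video#0001", "meta", "title", -100), "intro"),
])
keys = [k for k, _ in cells]

def scan(start_row, stop_row):
    """Return cells with start_row <= row < stop_row as one contiguous slice."""
    # A 1-tuple sorts before every full key that starts with the same row,
    # so bisect_left lands on the first cell of that row.
    lo = bisect.bisect_left(keys, (start_row,))
    hi = bisect.bisect_left(keys, (stop_row,))
    return cells[lo:hi]

result = scan("user#0001", "user#0003")
```

Because the keys are kept in sorted order, the scan is two binary searches plus one contiguous slice. Rows whose keys share a prefix end up next to each other, which is the property a row-key design can exploit.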
-
Re: Help needed! Performance related questions
Jean-Daniel Cryans 2010-10-14, 18:16
> If you could answer any of the following questions, I would be very grateful.
People usually give me beer in exchange for quick help, let me know if that works for you ;)
> 1. For cell size, why should it not be larger than 20 MB in general?
General answer: it pokes HBase in all the corner cases. You have to change a lot of default configs in order to keep some sort of efficiency.
> 2. What is the block size if the cell is 20 MB? Can a cell cover multiple blocks?
No, one HFile block per cell (KeyValue) in this case. It basically gives you a perfect index.
> 3. Does a single-cell column family (one that has only one cell) share the same size limit as a cell? In other words, should a single column family be smaller than 20 MB?
It's the same to me.
> 4. Is there any advantage to putting rows close together in HBase if they have a high chance of being queried together?
If you do Scans, then you want your rows together, right?
> 5. Is there any general rule for row size?
Try not to go into the MBs; HBase is currently missing some optimizations that would make this use case work perfectly.
> 6. Where does the HRegion keep the row keys: in the HFile or in other files?
In the block index of the HFile. Not all the row keys are there if a single block fits more than one row.
J-D
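A small Python sketch may help picture the block-index answer. It is a simplified model, not HBase code: `blocks`, `block_index`, and `get` are hypothetical names, the data is made up, and plain row keys stand in for the fuller keys HBase stores.

```python
import bisect

# Hypothetical model of one file: a list of blocks, each a sorted list of
# (row_key, value) pairs. Plain row keys keep the sketch short.
blocks = [
    [("row-a", 1), ("row-b", 2), ("row-c", 3)],
    [("row-d", 4), ("row-e", 5)],
    [("row-f", 6)],
]

# The index holds only the first key of each block, so row-b, row-c and
# row-e never appear in it.
block_index = [block[0][0] for block in blocks]

def get(row_key):
    """Find the candidate block via the index, then scan it sequentially."""
    pos = bisect.bisect_right(block_index, row_key) - 1
    if pos < 0:
        return None  # key sorts before the first block
    for key, value in blocks[pos]:
        if key == row_key:
            return value
    return None
```

Here `get("row-e")` consults the three-entry index, picks the second block, and walks it until the key matches. Only when every block holds a single row does the index contain every row key.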
-
Re: Help needed! Performance related questions
William Kang 2010-10-14, 18:42
Hi guys, thanks so much for answering my questions. I really appreciate it. They help a lot!
I have a few more follow-up questions though.
1. About the row lookup mechanism: I understand how HBase locates the region where a row resides, but I am confused about what happens after that. I will write down what I understand so far; please correct me if it's wrong.
a. The HRegion Store identifies which HFile the row is in.
b. A block index in the HFile identifies which block the row resides in.
c. If the row is smaller than the block size (which means a block holds multiple rows), HBase has to traverse that block to locate the row matching the key. The traversal is sequential.
2. If the row is larger than the block size, what happens? Does the block index in the HFile point to multiple blocks that contain different cells of that row?
3. Does a column family have to reside inside one block, which would mean a column family cannot be larger than a block?
Many thanks! William
-
Re: Help needed! Performance related questions
Jean-Daniel Cryans 2010-10-14, 18:51
> 1. About the row lookup mechanism: I understand how HBase locates the region where a row resides, but I am confused about what happens after that. I will write down what I understand so far; please correct me if it's wrong.
> a. The HRegion Store identifies which HFile the row is in.
> b. A block index in the HFile identifies which block the row resides in.
> c. If the row is smaller than the block size (which means a block holds multiple rows), HBase has to traverse that block to locate the row matching the key. The traversal is sequential.
More or less.
> 2. If the row is larger than the block size, what happens? Does the block index in the HFile point to multiple blocks that contain different cells of that row?
The block index stores full keys, row+family+qualifier+timestamp, so it's not talking in terms of total row size. A single row can have multiple blocks (in multiple files) with possibly as many entries in the block index. If a single cell is larger than the block size, then the size of that block will be the size of that cell.
> 3. Does a column family have to reside inside one block, which would mean a column family cannot be larger than a block?
My previous answer covers this.
J-D
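The point about oversized cells can be sketched in Python as well. This is a toy model under stated assumptions, not the HFile writer: `pack_blocks` and the sample keys are hypothetical, the 64 KB block size is only an illustrative value, and the closing rule (close a block once it holds at least `block_size` bytes, never split a cell) is an assumption of the sketch.

```python
BLOCK_SIZE = 64 * 1024  # 64 KB, used here as an illustrative block size

def pack_blocks(cells, block_size=BLOCK_SIZE):
    """Group (key, size_in_bytes) cells into blocks.

    Assumed rule for this sketch: a block is closed as soon as it holds at
    least block_size bytes, and a cell is never split across blocks.
    """
    blocks, current, current_size = [], [], 0
    for key, size in cells:
        current.append(key)
        current_size += size
        if current_size >= block_size:
            blocks.append((current, current_size))
            current, current_size = [], 0
    if current:
        blocks.append((current, current_size))
    return blocks

cells = [
    ("row1/cf:small/ts", 20 * 1024),
    ("row1/cf:small2/ts", 50 * 1024),
    ("row2/cf:big/ts", 20 * 1024 * 1024),  # one 20 MB cell
    ("row3/cf:small/ts", 10 * 1024),
]
blocks = pack_blocks(cells)

# One index entry per block: the full key of the block's first cell.
block_index = [keys[0] for keys, _ in blocks]
```

In this run the 20 MB cell ends up alone in a 20 MB block with its own index entry, while the two small cells of `row1` share a block. The index is keyed by full cell keys, so a large row simply contributes several entries.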
-
Re: Help needed! Performance related questions
William Kang 2010-10-14, 19:44
Hey J-D, Thanks a lot! That has cleared up a lot of my confusion. :) I really appreciate it. William