-
Number of column families vs Number of column family qualifiers
weliam.cloud@... 2010-10-10, 01:42
Hi folks,
I have a question about schema design for HBase. In general, should I prefer more column families with fewer qualifiers each, or fewer column families with more qualifiers each?

For example, I could have one column family with four qualifiers, or four column families with one qualifier each. Which should I use?

I understand that each column family is stored in its own store. So my understanding is: performance-wise, it would be reasonable to choose one column family with four qualifiers; considering sparse storage, it would be reasonable to choose four column families with one qualifier each. Is this correct?

Many thanks.

William
-
Re: Number of column families vs Number of column family qualifiers
Himanshu Vashishtha 2010-10-10, 02:27
Doesn't it depend on your app's data access pattern? Are you reading all those columns for a given primary key (row key) at the same time or not? That would help in deciding which way to go. :)

Himanshu.
-
Re: Number of column families vs Number of column family qualifiers
Ryan Rawson 2010-10-10, 05:44
It also depends on value size. For large values (1-10 KB and beyond) I'd consider separate families, since that lets you scan different families without a performance hit. If the values are small or always fetched together, then just use one family.
-
Re: Number of column families vs Number of column family qualifiers
William Kang 2010-10-11, 03:36
Hi Ryan,
Can you tell me why value size would be an issue for performance? Is it because of the optimized limit on cell size?

Thanks.

William
-
Re: Number of column families vs Number of column family qualifiers
Andrey Stepachev 2010-10-11, 08:20
Hi.
One additional issue with column families: the number of memstores. Each family uses one memstore on insert. If you write to several families at once, you get more memstores, and more memory is used by your region server. Especially with random inserts, you can easily get GC timeouts or an OOME.
-
Re: Number of column families vs Number of column family qualifiers
Jean-Daniel Cryans 2010-10-11, 15:32
On Mon, Oct 11, 2010 at 4:20 AM, Andrey Stepachev <[EMAIL PROTECTED]> wrote:
> Especially with random inserts you can easily get
> GC timeouts or OOME.

Very unlikely to get an OOME here, since there's a limit on the combined size of all the memstores inside a single region server (default is 40% of configured heap). But you don't really want to hit it, since it blocks all inserts until enough room has been cleared.

But the "number of memstores" argument also implies that, since regions flush on the total size of their memstores, filling up a few of them at the same time is very inefficient. The worst case is filling up one family with really big cells while also inserting much smaller cells into other families. In one case on a troublesome cluster I saw regions flushing one ~58MB file along with 5 ~100KB-1MB files.

Flushing individual families instead of whole regions would be a fix in this case, but it has other side effects.

I personally don't recommend using multiple families unless they are used separately almost all the time.

J-D
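To make the flushing point above concrete, here is a toy Java model of a region that flushes every family as soon as the combined size of its memstores reaches a threshold. It is only a sketch and not HBase code: the class name, the family names "blob" and "meta", the per-row byte counts, and the 64 MB threshold are all assumptions chosen for illustration, and key overhead is ignored.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model (not HBase code): a region flushes ALL families when their combined memstore size hits a threshold. */
class RegionFlushSketch {
    // Assumed flush threshold for illustration only; the real value is configurable.
    static final long FLUSH_THRESHOLD_BYTES = 64L * 1024 * 1024;

    /**
     * bytesPerRow maps a (hypothetical) family name to the bytes written to it per row.
     * Returns family name -> size of the file that family produces at the first flush.
     */
    static Map<String, Long> firstFlush(Map<String, Long> bytesPerRow) {
        long perRowTotal = 0;
        for (long b : bytesPerRow.values()) {
            perRowTotal += b;
        }
        if (perRowTotal <= 0) {
            throw new IllegalArgumentException("at least one family must receive data");
        }
        Map<String, Long> memstores = new LinkedHashMap<>();
        long total = 0;
        // Keep inserting whole rows until the region-wide total crosses the threshold.
        while (total < FLUSH_THRESHOLD_BYTES) {
            for (Map.Entry<String, Long> e : bytesPerRow.entrySet()) {
                memstores.merge(e.getKey(), e.getValue(), Long::sum);
            }
            total += perRowTotal;
        }
        // Every family is flushed together, however little data it holds.
        return memstores;
    }

    public static void main(String[] args) {
        Map<String, Long> bytesPerRow = new LinkedHashMap<>();
        bytesPerRow.put("blob", 10_000L); // hypothetical family with large cells
        bytesPerRow.put("meta", 100L);    // hypothetical family with small cells
        firstFlush(bytesPerRow).forEach((family, bytes) ->
                System.out.println(family + ": " + (bytes / 1024) + " KB flushed"));
    }
}
```

In this model the small family's file is one hundredth the size of the large one at every flush, which is the same shape as the one-big-file-plus-several-tiny-files pattern reported above.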
-
Re: Number of column families vs Number of column family qualifiers
Andrey Stepachev 2010-10-11, 16:27
2010/10/11 Jean-Daniel Cryans <[EMAIL PROTECTED]>:
Yes, I agree: an OOME is unlikely. I misinterpreted my current problem. I found that this (GC timeout) occurs on my 0.89-stumbleupon HBase only if writeToWAL=false. My RS eats all available memory (5GB) but doesn't get an OOME. I'm trying to figure out what is going on.

> But the "number of memstores" argument also implies that since regions
> flush on the total size of their memstores, filling up a few of them
> at the same time is very inefficient. The worst case is filling up a
> family with really big cells while also inserting much smaller cells
> into other families. In one case on a troublesome cluster I saw
> regions flushing one ~58MB file along with 5 ~100KB-1MB files.

This is my case. My design flaw was to use a separate family for each entity (of which I now have 9), and I got exactly what you describe.

> Flushing individual families instead of whole regions would be a fix
> in this case, but it has other side effects.

Hm. How can I flush a family from the client side? I don't see any API for it in 0.20.x. Is it a 0.89 API change? (I haven't dug into 0.89 yet.)

> I personally don't recommend using multiple families unless they are
> used separately almost all the time.

Totally agree, because I stepped on this rake.

Sorry for the wrong information.

Andrey.
-
Re: Number of column families vs Number of column family qualifiers
Jean-Daniel Cryans 2010-10-11, 16:32
> Yes. I agree. OOME unlikely. I misinterpreted my current problem.
> I found, that this (gc timeout) on my 0.89-stumbleupon hbase occurs
> only if writeToWAL=false. My RS eats all available memory (5GB), but
> don't get OOME. I try to figure out what is going on.

Long GC pauses happen for many different reasons. First make sure that your IO, CPU, and RAM aren't overcommitted and that there's no swapping.

> Hm.. How I can flush family from client side? I don't see any api in 0.20.x.
> Is it 0.89 api changes? (don't dig into 0.89 yet).

You can't; I was talking about a possible fix in the code.

> Sorry for wrong information.

No problem :)

J-D
-
Re: Number of column families vs Number of column family qualifiers
Ryan Rawson 2010-10-11, 16:43
The reason I talk about value size is that one area where multiple families are good is when you have really large values in one column and smaller values in different columns. If you want to read just the small values without scanning through the big values, you can use separate column families.

-ryan
-
Re: Number of column families vs Number of column family qualifiers
Sean Bigdatafun 2010-10-11, 21:00
I think this is a good suggestion too.

HBase linearly scans through the 64KB block that is brought into memory. If a big data payload (unused in a given query/scan) is mixed with a small data payload, scanning will be rather inefficient, I think?
-
Re: Number of column families vs Number of column family qualifiers
Ryan Rawson 2010-10-12, 05:12
Yes, this is spot on. When HBase scans, it reads a block, iterates through the keys in the block, then goes to the next block. We try to be as efficient as possible, but the inescapable fact remains that we must read all the intervening data. We can do tricks (in 0.90) to use the block index to skip some blocks, but it is not always possible.
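As a back-of-the-envelope illustration of why mixing large and small values hurts scans, the sketch below counts how many 64KB blocks a full scan has to read from one store file. It is a simplification and not HBase code: the class and method names are made up, the row count and value sizes are invented, every row is assumed to be the same size, and key overhead, compression, and block-index skipping are all ignored.

```java
/** Rough estimate (not HBase code) of the number of blocks a full scan reads from one store file. */
class ScanCostSketch {
    // Block size taken from the 64KB figure mentioned in the thread.
    static final long BLOCK_BYTES = 64L * 1024;

    /** Blocks needed to hold `rows` rows of `bytesPerRow` bytes each, rounded up. */
    static long blocksToScan(long rows, long bytesPerRow) {
        long totalBytes = rows * bytesPerRow;
        return (totalBytes + BLOCK_BYTES - 1) / BLOCK_BYTES; // ceiling division
    }

    public static void main(String[] args) {
        long rows = 1_000_000L;   // invented row count
        long smallValue = 100L;   // invented small-value size in bytes
        long bigValue = 10_000L;  // invented large-value size in bytes

        // One family: small and big values share the same blocks, so both get read.
        long mixed = blocksToScan(rows, smallValue + bigValue);
        // Separate families: a scan of the small family never touches the big values.
        long separate = blocksToScan(rows, smallValue);

        System.out.println("blocks read, single family:    " + mixed);
        System.out.println("blocks read, small family only: " + separate);
    }
}
```

With these invented sizes, a scan that only needs the small values reads roughly a hundred times fewer blocks when the large values live in their own family.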
-
Re: Number of column families vs Number of column family qualifiers
William Kang 2010-10-13, 08:03
Hi Ryan,
Thanks for your reply. So, even if I use get.addColumn(byte[] family, byte[] qualifier) for a certain cell, HBase will have to traverse from the beginning of the column family to the qualifier I specified? Is it because HBase has to traverse all the blocks in the HFile to find the row key or the qualifier?

I am confused here: in the key-value pairs in the data block, does the key refer to the row key or to the qualifier? Where is the row key and where is the qualifier?

This has bothered me for a while. It would be nice to figure it out.

Many thanks.

William
-
Re: Number of column families vs Number of column family qualifiers
Jean-Daniel Cryans 2010-10-14, 17:55
> So, even if I use get.addColumn(byte[] family, byte[] qualifier) for a
> certain cell, the HBase will have to traverse from the beginning of
> the column family to the qualifier I defined? Is it because HBase has
> to traverse all the blocks in the HFile to find the row key or the
> qualifier?

The answer is different for 0.20 and 0.90, but the short version is: sometimes yes, and sometimes not all of the KVs will be read. HBase is getting better at this, but there's still work to do.

> I am confused here, in the keyvalue pairs in the data block, does the
> key refer to the row key or it refer to qualifier? Where is the row
> key and where is the qualifier?

Down in the HBase internals we use KeyValue, where the key is basically row + family + qualifier + timestamp. See http://hbase.apache.org/docs/r0.89.20100924/apidocs/org/apache/hadoop/hbase/KeyValue.html

J-D
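The composite key described above can be pictured with a small Java sketch. This is an illustration only, not the real KeyValue class: CellKeySketch and its fields are hypothetical, strings stand in for the byte arrays HBase actually compares, and the real key carries further detail (such as a type marker) that is left out here. It assumes cells sort by row, then family, then qualifier, with newer timestamps first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for a cell key: row + family + qualifier + timestamp. */
class CellKeySketch {
    final String row;
    final String family;
    final String qualifier;
    final long timestamp;

    CellKeySketch(String row, String family, String qualifier, long timestamp) {
        this.row = row;
        this.family = family;
        this.qualifier = qualifier;
        this.timestamp = timestamp;
    }

    /** Row, then family, then qualifier ascending; timestamp descending so the newest version sorts first. */
    static final Comparator<CellKeySketch> ORDER = (a, b) -> {
        int c = a.row.compareTo(b.row);
        if (c != 0) return c;
        c = a.family.compareTo(b.family);
        if (c != 0) return c;
        c = a.qualifier.compareTo(b.qualifier);
        if (c != 0) return c;
        return Long.compare(b.timestamp, a.timestamp);
    };

    @Override
    public String toString() {
        return row + "/" + family + ":" + qualifier + "@" + timestamp;
    }

    public static void main(String[] args) {
        List<CellKeySketch> keys = new ArrayList<>();
        keys.add(new CellKeySketch("row2", "f", "a", 1L));
        keys.add(new CellKeySketch("row1", "f", "b", 1L));
        keys.add(new CellKeySketch("row1", "f", "a", 1L));
        keys.add(new CellKeySketch("row1", "f", "a", 5L));
        keys.sort(ORDER);
        keys.forEach(System.out::println);
    }
}
```

Because the row comes first in the key, all cells of one row within a family sit next to each other in the file, and a lookup for one qualifier may still pass over that row's earlier qualifiers on the way.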