Disk Seeks and Column families
Praveen Sripati 2012-01-21, 07:08
Hi,

1) According to this URL (1), HBase performs well with two or three column families. Why is that?

2) A dump of an HFile looks like the one below. The contents of a row stay together, as in a regular row-oriented database. If the column family has 100 column qualifiers and is dense, then the data for any one column qualifier is spread widely across the file. If I want to do an aggregation on a particular column qualifier, the disk seeks don't seem to be much better than in a regular row-oriented database.

Please correct me if I am wrong.

K: row-550/colfam1:50/1309813948188/Put/vlen=2 V: 50
K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
K: row-551/colfam1:51/1309813948222/Put/vlen=2 V: 51
K: row-551/colfam1:51/1309812287200/Put/vlen=2 V: 51
K: row-552/colfam1:52/1309813948256/Put/vlen=2 V: 52

(1) - http://hbase.apache.org/book/number.of.cfs.html

Thanks,
Praveen
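To make the ordering in the dump above concrete, here is a minimal Python sketch of how cells are sorted within a store file: by row, then column family, then qualifier (all ascending), then by timestamp descending, so the newest version of a cell comes first. The tuple layout and the `sort_key` helper are illustrative assumptions, not HBase's own classes.

```python
# Illustrative cell tuples: (row, family, qualifier, timestamp, value).
# The values are copied from the HFile dump in the message above.
cells = [
    ("row-551", "colfam1", "51", 1309812287200, "51"),
    ("row-550", "colfam1", "50", 1309812287166, "50"),
    ("row-552", "colfam1", "52", 1309813948256, "52"),
    ("row-550", "colfam1", "50", 1309813948188, "50"),
    ("row-551", "colfam1", "51", 1309813948222, "51"),
]


def sort_key(cell):
    """Hypothetical helper mirroring the store-file cell ordering."""
    row, family, qualifier, timestamp, _value = cell
    # Negating the timestamp puts the newest version of a cell first.
    return (row, family, qualifier, -timestamp)


ordered = sorted(cells, key=sort_key)
for row, family, qualifier, timestamp, value in ordered:
    print(f"K: {row}/{family}:{qualifier}/{timestamp} V: {value}")
```

All cells of one row sit next to each other, which is why reading a single qualifier across many rows means passing over every other qualifier stored in the same column family.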
Re: Disk Seeks and Column families
Andrey Stepachev 2012-01-21, 12:47
2012/1/21 Praveen Sripati <[EMAIL PROTECTED]>:
> 1) According to this URL (1), HBase performs well for two or three
> column families. Why is it so?

First, each column family is stored in a separate location, so, as stated in '6.2.1. Cardinality of ColumnFamilies', such a schema design can lead to many small pieces for a small column family, and aggregation should perform slowly.
Second, when a region splits, all of its column families split too, which can be inefficient when there are many of them.
Third, there is the number of memstores. Each column family has its own memstore, so with more families you are more likely to hit forced flushes and blocked writes.

> 2) Dump of a HFile, looks like below. The contents of a row stay together
> like a regular row-oriented database. If the column family has 100 column
> family qualifiers and is dense then the data for a particular column family
> qualifier is spread wide. If I want to do an aggregation on a particular
> column identifier, the disk seeks doesn't seems to be much better than a
> regular row-oriented database.

You don't need a seek for each column; HBase reads blocks and filters out the unneeded data. Most of the performance is gained from collocated keys and compression.
BTW, HBase is not so good with wide tables; HBase prefers tall tables.

--
Andrey.
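The first and third points can be illustrated with a toy simulation. It assumes, as the linked book section describes for HBase at the time of this thread, that flushes happen per region, so every family is flushed when the region's total memstore size crosses a threshold. The threshold, the write rates and the `simulate` function are hypothetical numbers and names chosen only for illustration.

```python
FLUSH_SIZE = 64  # hypothetical region flush threshold, arbitrary units


def simulate(writes_per_tick, ticks):
    """Return {family: [flushed file sizes]} for a region flushed as a whole."""
    memstores = {family: 0 for family in writes_per_tick}
    files = {family: [] for family in writes_per_tick}
    for _ in range(ticks):
        for family, amount in writes_per_tick.items():
            memstores[family] += amount
        if sum(memstores.values()) >= FLUSH_SIZE:
            # Every family is flushed together, even the nearly empty ones.
            for family, size in memstores.items():
                if size:
                    files[family].append(size)
                memstores[family] = 0
    return files


# One busy family and one quiet family sharing a region.
files = simulate({"big": 15, "small": 1}, ticks=40)
print(files)
```

In this toy run the quiet family ends up with as many store files as the busy one, each of them tiny, which is the "many small pieces" effect; more families multiply the number of such files and the compaction work that follows.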
Re: Disk Seeks and Column families
Doug Meil 2012-01-21, 13:52
Also, for #2, HBase supports large-scale aggregation through MapReduce.
Re: Disk Seeks and Column families
Doug Meil 2012-01-21, 15:16
One other "big picture" comment: HBase scales by having lots of servers, and servers with multiple drives. While single-read performance is obviously important, there is more to HBase than a single-server RDBMS drag-race comparison. It's a distributed architecture (as with MapReduce).

re: "hbase is not so good in case of wide tables, hbase prefers tall tables"

Per http://hbase.apache.org/book.html#schema.smackdown this is absolutely true in the extreme cases described in the book, but I wouldn't consider hundreds or thousands of attributes to be in that category, as the definition of "wide" tends to be subjective.
Re: Disk Seeks and Column families
Andrey Stepachev 2012-01-21, 18:58
On 21 January 2012 at 19:16, Doug Meil <[EMAIL PROTECTED]> wrote:
> re: "hbase is not so good in case of wide tables, hbase prefers tall
> tables"
>
> Per... http://hbase.apache.org/book.html#schema.smackdown this is
> absolutely true in the extreme cases as described in the book, but I
> wouldn't consider hundreds or thousands of attributes to be in that
> category as the definition of "wide" tends to be subjective.

That statement mostly relates to schemas where the column name is a subkey, for example a time series for some attribute. Such a layout does not scale well and is not handled well by HBase, because splits are performed on row boundaries.

--
Andrey.
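A small sketch of the difference, using plain Python dictionaries as stand-ins for tables. The metric name, the key format and the `possible_split_points` helper are made up for illustration; the only property taken from the thread is that a region can be split only between two rows.

```python
# Ten samples of one time series: (metric, timestamp, value).
samples = [("cpu", ts, ts % 7) for ts in range(1000, 1010)]

# Wide layout: one row per metric, the timestamp is the column qualifier.
wide = {}
for metric, ts, value in samples:
    wide.setdefault(metric, {})[str(ts)] = value

# Tall layout: the timestamp is part of the row key.
tall = {f"{metric}/{ts}": {"v": value} for metric, ts, value in samples}


def possible_split_points(table):
    """A split can only fall between two adjacent rows."""
    return max(len(table) - 1, 0)


print("wide rows:", len(wide), "split points:", possible_split_points(wide))
print("tall rows:", len(tall), "split points:", possible_split_points(tall))
```

The wide table holds the whole series in a single ever-growing row that cannot be split, while the tall table offers a split point between every pair of samples.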
Re: Disk Seeks and Column families
yuzhihong@... 2012-01-21, 15:33
Have you considered using AggregationProtocol to perform aggregation?

Thanks
Re: Disk Seeks and Column families
Praveen Sripati 2012-01-21, 17:49
Thanks for the response.

> The contents of a row stay together like a regular row-oriented database.

> K: row-550/colfam1:50/1309813948188/Put/vlen=2 V: 50
> K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
> K: row-551/colfam1:51/1309813948222/Put/vlen=2 V: 51
> K: row-551/colfam1:51/1309812287200/Put/vlen=2 V: 51
> K: row-552/colfam1:52/1309813948256/Put/vlen=2 V: 52

Is the above statement true for an HFile?

Also, in the above example, the data for a given column qualifier is not adjacent, so it cannot take advantage of compression (http://en.wikipedia.org/wiki/Column-oriented_DBMS#Compression). Is this a proper statement?

Regards,
Praveen
Re: Disk Seeks and Column families
Doug Meil 2012-01-21, 18:06
Compression is at the block level within the StoreFile (HFile), so yes, the data can take advantage of compression.
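As a rough illustration of block-level compression, the sketch below builds one block of text lines shaped like the HFile dump earlier in the thread and compresses it as a unit. It uses Python's `zlib` (DEFLATE) as a stand-in for whichever codec a table is configured with; the block contents and sizes are invented, so the printed ratio says nothing about real tables.

```python
import zlib

# Build one "block" of row-ordered keys in the format of the dump:
# 100 rows with 10 qualifiers each (made-up data).
lines = [
    f"row-{row}/colfam1:{qualifier}/1309813948188/Put/vlen=2 V: {qualifier}"
    for row in range(550, 650)
    for qualifier in range(50, 60)
]
block = "\n".join(lines).encode("utf-8")

# The whole block is compressed at once, so repeated row keys, family
# names and timestamps are shared across neighbouring cells.
compressed = zlib.compress(block)

print(len(block), "bytes raw")
print(len(compressed), "bytes compressed")
```

Even though values of one qualifier are not adjacent, the keys themselves carry a lot of redundancy that a block-level codec can exploit.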
Re: Disk Seeks and Column families
M. C. Srivas 2012-01-22, 06:32
Praveen,

basically you are correct on all counts. If there are too many columns, HBase will have to issue more disk seeks to extract only the particular columns you need, and since the data is laid out horizontally there are fewer common substrings in a single HBase block, so compression quality starts to degrade due to reduced redundancy.
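A back-of-envelope sketch of the read cost, assuming a full scan of one column family, uncompressed cells of equal size and a 64 KB block size. The row count, cell size and the `blocks_to_scan` helper are hypothetical; real numbers depend on compression, caching and key sizes.

```python
BLOCK_SIZE = 64 * 1024  # assumed block size in bytes


def blocks_to_scan(rows, qualifiers_in_family, cell_bytes):
    """Blocks read when scanning every cell of one column family."""
    total_bytes = rows * qualifiers_in_family * cell_bytes
    return -(-total_bytes // BLOCK_SIZE)  # integer ceiling division


rows, cell_bytes = 1_000_000, 64

# The wanted qualifier shares a family with 99 other dense qualifiers.
shared = blocks_to_scan(rows, 100, cell_bytes)
# The wanted qualifier lives alone in its own family.
alone = blocks_to_scan(rows, 1, cell_bytes)

print("blocks scanned, 100 qualifiers in the family:", shared)
print("blocks scanned, 1 qualifier in the family:", alone)
```

Under these assumptions, aggregating one qualifier out of a dense family of 100 reads about a hundred times more blocks than it would if that qualifier had a family to itself.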
Re: Disk Seeks and Column families
Praveen Sripati 2012-01-24, 06:15
Thanks for the response. I am just getting started with HBase. Before getting into the code/API-level details, I am trying to understand the problem area HBase is trying to address through its architecture and design.

1) So, what are the recommendations for having many columns with dense data? Is HBase not the right tool?

2) Also, if the data for a column is spread widely across blocks and maybe even across nodes, how will HBase help in aggregation?

3) Also, about storing data using an incremental row key: initially there will be a hot spot, with all the data going to a single region. Even after a split of the region into two, the first one won't be getting any data (with an incremental row key) and the second one will be hammered. One approach to alleviate this is not to insert incremental row keys from the client and to have the row keys scattered for better load balancing. But this approach is not efficient if I want to get events in time sequence, in which case I have to use some filters to scan the entire data.

4) It is still not clear why I can't have 10 column families in HBase, and why only 2 or 3 of them according to this link (1).

(1) - http://hbase.apache.org/book/number.of.cfs.html

Praveen
Re: Disk Seeks and Column families
Andrey Stepachev 2012-01-24, 06:51
2012/1/24 Praveen Sripati <[EMAIL PROTECTED]>:
> 1) So, what are the recommendations for having many columns and with dense
> data? Is HBase not the right tool?

Split them by prefixing keys (i.e. key -> a, b, c becomes a_key, b_key, c_key) and aggregate them as independent values, if possible.

> 2) Also, if the data for a column is spread wide across blocks and maybe
> even across nodes how will HBase help in aggregation?

Think about your tasks and optimize your data layout for them. HBase is not an RDBMS. You should plan your schema in the way that best suits your queries.

> 3) Also, about storing data using an incremental row key, initially there
> will be a hot spot with the data getting to a single region. Even after a
> split of the region into two, the first one won't be getting any data (in
> incremental row key) and the second one will be hammered.
>
> One of the approach to alleviate this is not to insert incremental row keys
> from the client and have the row keys scattered for better load balancing.
> But, this approach is not efficient if I want to get events in a time
> sequence, in which case I have to use some filters to scan the entire data.

a) As in 1), add something to the key, for example every 5 minutes. Later you can issue 16 queries and merge them (for real-time use).
b) If this data is for MapReduce, you can use a key of day + md5(time); later the MR task collects all the data in the right place for aggregation. (As usual, you must trade off write speed against query speed.)
c) Split your incoming data by another field, for example host or metric.
You can look at the data model of http://opentsdb.net/

> 4) Still not clear why I can't have 10 column families in HBase and why
> only 2 or 3 of them according to this link (1)?

You can. But:
a) you should tune a bunch of parameters (hbase.hregion.memstore.block.multiplier, hbase.hstore.blockingStoreFiles and others) to get it to work under high write load. Given the architecture of memstores and splits, fewer families perform better.
b) you can write a small benchmark and see that 2 families are significantly faster than 10.

Andrey.
Re: Disk Seeks and Column families
Andrey Stepachev 2012-01-24, 06:52
2012/1/24 Andrey Stepachev <[EMAIL PROTECTED]>:
> a) As in 1), add something to key. For example each 5 minutes. Later your
> can issue 16 queries and merge them (for realtime)

Yeah... 3 minutes :)

--
Andrey.
Re: Disk Seeks and Column families
Jason Frantz 2012-01-24, 09:30
On Tue, Jan 24, 2012 at 11:45 AM, Praveen Sripati <[EMAIL PROTECTED]> wrote:
> 1) So, what are the recommendations for having many columns and with dense
> data? Is HBase not the right tool?

HBase's data model works great if your set of columns can be split into separate column families that are only accessed together. If you often randomly access individual columns, then it might make sense to put your column qualifiers inside your key.

> 2) Also, if the data for a column is spread wide across blocks and maybe
> even across nodes how will HBase help in aggregation?

If a column family doesn't contain the columns your aggregation wants, then HBase doesn't need to look at the files for that column family. If you want to run the aggregation on a subset of your key range, then HBase doesn't need to look at nodes that only have data outside that range.

In addition, aggregation can often be done locally at each node using endpoint coprocessors. For example, if I want to count all the rows in my table, a coprocessor can count the rows on each node in parallel, and those counts are the only thing sent back to the node running the query. To get the total count, I just need to sum the per-node counts.

http://ofps.oreilly.com/titles/9781449396107/clientapisadv.html

> 3) Also, about storing data using an incremental row key, initially there
> will be a hot spot with the data getting to a single region. Even after a
> split of the region into two, the first one won't be getting any data (in
> incremental row key) and the second one will be hammered.

Can you split your incremental row key into a hash component and a range component? Here's a DynamoDB post explaining a use case:

http://aws.typepad.com/aws/2012/01/amazon-dynamodb-internet-scale-data-storage-the-nosql-way.html

This does mean that a range scan is only efficient when it stays within a hash prefix, though.

> 4) Still not clear why I can't have 10 column families in HBase and why
> only 2 or 3 of them according to this link (1)?
>
> (1) - http://hbase.apache.org/book/number.of.cfs.html

See HBASE-3149, for starters. There are probably other JIRAs out there.

-Jason
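The row-count example in answer 2 can be sketched as follows. This is not the HBase coprocessor API; it only mimics the shape of the computation, with a thread pool standing in for the region servers. The `regions` dictionary, the row keys and the `count_rows` function are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table split into three regions by row-key range.
regions = {
    "region-a": [f"row-{i:03d}" for i in range(0, 400)],
    "region-b": [f"row-{i:03d}" for i in range(400, 650)],
    "region-c": [f"row-{i:03d}" for i in range(650, 1000)],
}


def count_rows(rows):
    """Runs next to the data; only a single number leaves the 'node'."""
    return len(rows)


# Each region is counted in parallel, as an endpoint would do locally.
with ThreadPoolExecutor() as pool:
    partial_counts = list(pool.map(count_rows, regions.values()))

# The client only has to add up the per-region results.
total = sum(partial_counts)
print(partial_counts, total)
```

The same pattern works for any aggregate that can be combined from partial results, such as sum, min or max; an average would need each region to return both a sum and a count.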