HBase creates an HFile per column family. Having 130 column families is really not recommended; it will increase the number of file pointers (the open file count) underneath.
If you are sure which columns are "frequently" accessed by users, you could consider putting them in one column family, and the "non-frequently" accessed ones in another. BTW, a ~5MB column value is something to consider. We should wait for some expert advice here!! Thanks, Alok
On Tue, Aug 5, 2014 at 4:50 PM, innowireless TaeYun Kim < [EMAIL PROTECTED]> wrote:
I can decrease the size of the column value if it's not good for HBase. BTW, the values are for points on grid cells on a map. 250000 is 500x500, and 500x500 is somewhat related to the size of the client screen that displays the values on a map. Normally a client requests the values for the area that is displayed on the screen.
Plus, since most of the time a client will display an area that does not fit in 500x500, Scan operations are required (Get is not enough). So I'm worried that on scanning, many irrelevant columns (those that have the same rowkey, which is the position on the grid) would be read into the block cache, unless the columns are separated into individual column families.
You could narrow the number of rows to scan by using filters. I don't think you can optimize down to column-level I/O.
The block cache holds actual data read from HDFS, per column family. If your scan fetches random (all) columns, then you are going to hit all the column families' blocks anyway and put "irrelevant" data in the block cache!! You could limit or set the columns you want to fetch on the client side; that will save network I/O.
Do you have a row size of 130 * 5MB = 650MB?
On Tue, Aug 5, 2014 at 5:17 PM, innowireless TaeYun Kim < [EMAIL PROTECTED]> wrote:
I think you need to go back a bit further from the problem and ask yourself when you would want to have the same row key used for disjoint data. That is, data that refers to the same object, yet the data in each column family is never or rarely used with data from another column family.
To give you a concrete example... one that I've used in a class... An order entry system.
Think of the life cycle of your order.
You enter the order, the company then generates pick slips from the warehouse(s), then the warehouse(s) issue shipping slips, then as the product ships, invoices are issued and the billing process occurs.
In each part of the process, information that could be shared is copied, so that when you make an inquiry into the order, you would see what was done and when; but in each process, like managing the pick slip, you don't need to bring up the entire order.
Does that make sense?
In that example, you have 4 column families.
There are other examples, but that should help you put column families in perspective.
On Aug 5, 2014, at 11:52 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
The opinions expressed here are mine; while they may reflect a cognitive thought, that is purely accidental. Use at your own risk.
Michael Segel, michael_segel (AT) hotmail.com
One way to model the data would be to use a composite key made up of the RDBMS primary_key + "." + field_name, and then have a single column that contains the value of the field. Individual field lookups become a simple Get, and to get all fields of a record, you would do a scan with startrow => primary_key + ".!", endrow => primary_key + ".~"
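The composite-key layout above can be sketched in plain Java (no HBase cluster needed); the key names and the "order42" record are hypothetical. The scan boundaries work because '!' (0x21) sorts before, and '~' (0x7E) after, every printable field-name character in byte-wise order:

```java
// Sketch of the suggested composite row key: primaryKey + "." + fieldName.
// A scan from primaryKey + ".!" to primaryKey + ".~" covers all fields of
// one record, since '!' < [field name chars] < '~' lexicographically.
public class CompositeKeys {
    static String rowKey(String primaryKey, String fieldName) {
        return primaryKey + "." + fieldName;
    }
    static String startRow(String primaryKey) { return primaryKey + ".!"; }
    static String stopRow(String primaryKey)  { return primaryKey + ".~"; }

    public static void main(String[] args) {
        String key = CompositeKeys.rowKey("order42", "status");
        // Every field key of "order42" falls strictly between the bounds,
        // so one range scan returns the whole record.
        System.out.println(CompositeKeys.startRow("order42").compareTo(key) < 0); // true
        System.out.println(key.compareTo(CompositeKeys.stopRow("order42")) < 0);  // true
    }
}
```

In a real client these strings would be converted with Bytes.toBytes() and passed to Scan's start/stop row.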
- Having 130 column families is too much. Don't do that.
- While scanning, an entire row will be read for filtering, unless the HBASE-5416 technique is applied, which loads only the relevant column family. (But it seems that one still can't load just the needed column while scanning.)
- A big row size is maybe not good.
Currently it seems appropriate to follow the one-column solution that Alok Singh suggested, in part because there is currently no reasonable grouping of the fields.
Here is my current thinking:
- One column family, one column. The field name will be included in the rowkey.
- Eliminate filtering altogether (in most cases) by properly ordering the rowkey components.
- If filtering is absolutely needed, add a 'dummy' column family and apply the HBASE-5416 technique to minimize disk reads, since the field value can be large (~5MB). (This dummy-column idea may not be right, I'm not sure, since I have not yet read the filtering section of the book I'm reading.)
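The "ordering rowkey components" point can be illustrated with a small self-contained sketch (the grid-coordinate layout is an assumption based on the 500x500 description earlier in the thread). Encoding each component as a fixed-width big-endian unsigned integer makes the byte-wise row key order match numeric order, so a contiguous range scan can replace filtering:

```java
// Sketch: grid coordinates encoded as fixed-width big-endian components,
// so lexicographic (byte-wise) key order equals numeric (x, then y) order.
public class GridRowKey {
    // 4-byte big-endian encoding of a non-negative int.
    static byte[] be32(int v) {
        return new byte[] { (byte)(v >>> 24), (byte)(v >>> 16), (byte)(v >>> 8), (byte)v };
    }
    // Row key = x component followed by y component, both fixed width.
    static byte[] rowKey(int x, int y) {
        byte[] k = new byte[8];
        System.arraycopy(be32(x), 0, k, 0, 4);
        System.arraycopy(be32(y), 0, k, 4, 4);
        return k;
    }
    // Unsigned lexicographic comparison, like HBase compares row keys.
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }
    public static void main(String[] args) {
        // (3, 500) sorts before (4, 0): x dominates, then y, as intended.
        System.out.println(GridRowKey.compare(GridRowKey.rowKey(3, 500), GridRowKey.rowKey(4, 0)) < 0);
    }
}
```

With variable-width (e.g. string) encodings this property would break ("10" sorts before "9"), which is why fixed-width components matter for eliminating filters.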
Hope that I am not missing or misunderstanding something... (I'm a total newbie; I started reading an HBase book last week...)
The 'dummy' column will always hold the value '1' (or even an empty string); it only signifies that the row exists (the real value is in the other 'big' column family). The value is irrelevant, since with the current schema the filtering will be done by rowkey components alone; no column value is needed. (I will begin reading the filtering section shortly - it is only 6 pages ahead. So sorry for my premature thoughts.)
Now I have finished reading the filtering section and the source code of TestJoinedScanners (0.94).
- While scanning, an entire row will be read even for rowkey filtering. (Since a rowkey is not a physically separate entity and is stored in the KeyValue object, that's natural. Am I right?)
- The key API for essential column family support is setLoadColumnFamiliesOnDemand().
So, now I have questions:
On rowkey filtering, which column family's KeyValue object is read? If HBase just reads a KeyValue from a randomly selected (or just the first) column family, how does setLoadColumnFamiliesOnDemand() come into play? Can HBase intelligently select a smaller column family?
If setLoadColumnFamiliesOnDemand() can be applied to rowkey filtering, a 'dummy' column family can be used to minimize the scan cost.
Hi, the description of HBASE-5416 states why it was introduced. If you only have one CF, a dummy CF does not help; it is helpful for the multi-CF case, e.g. "putting them in one column family. And "Non frequently" ones in another."
bq. "Field name will be included in rowkey." Please read chapter 9, "Advanced Usage", in the book "HBase: The Definitive Guide" about how HBase stores data on disk and how to design a rowkey for a specific scenario. (The rowkey is the only index you can use, so take care.)
bq. "The table is read-only. It is bulk-loaded once. When new data is ready, a new table is created and the old table is deleted." That scenario is quite different, as HBase is designed for random read/write. The limitation described at http://hbase.apache.org/book/number.of.cfs.html considers the write case (flush & compaction), so perhaps you could try 140 CFs, as long as you can presplit your regions well? After that, since there are no writes, there will be no flush/compaction... Anyway, any idea is better tested with your real data. On Wed, Aug 6, 2014 at 7:00 PM, innowireless TaeYun Kim < [EMAIL PROTECTED]> wrote:
bq. While scanning, an entire row will be read even for a rowkey filtering
If you specify an essential column family in your filter, the above would not be true: only the essential column family would be loaded into memory first. Once the filter passes, the other families would be loaded.
Cheers On Wed, Aug 6, 2014 at 4:00 AM, innowireless TaeYun Kim < [EMAIL PROTECTED]> wrote:
But the RowFilter class has no method that can be used to set which column family is essential. (Actually, no built-in filter class provides such a method.)
So, if I (ever) want to apply the 'dummy' column family technique(?), it seems that I must do as follows:
- Write my own filter that is a subclass of RowFilter.
- In that filter class, override the isFamilyEssential() method to return true only when the name of the 'dummy' column family is passed as an argument.
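The override described in the two steps above boils down to a byte-array comparison on the family name. A minimal sketch of that decision logic, in plain Java so it can run without HBase on the classpath (in the real filter this method would be the override of Filter.isFamilyEssential(byte[]); the family names "d" and "big" are assumptions for illustration):

```java
import java.util.Arrays;

// Sketch of the family-essential check: only the small 'dummy' family
// ("d") is declared essential, so the large value family ("big") would
// be loaded lazily, only for rows that pass the filter.
public class EssentialFamilyCheck {
    static final byte[] DUMMY_FAMILY = "d".getBytes(); // assumed family name

    // Mirrors the contract of Filter.isFamilyEssential(byte[] name):
    // return true only for the family the filter actually needs to read.
    static boolean isFamilyEssential(byte[] name) {
        return Arrays.equals(name, DUMMY_FAMILY);
    }

    public static void main(String[] args) {
        System.out.println(EssentialFamilyCheck.isFamilyEssential("d".getBytes()));   // true
        System.out.println(EssentialFamilyCheck.isFamilyEssential("big".getBytes())); // false
    }
}
```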
Now, HBase calls the isFamilyEssential() method of my filter object for all the column families, including the 'dummy' column family, and as a result loads only the 'dummy' column family and happily filters rowkeys using the KeyValue objects from the 'dummy' column family's HFile(s).
Am I right?
BTW, it would be nice to have a method like 'setEssentialColumnFamilies(byte[][] names)' to set the essential families manually, since no built-in filter intelligently determines which column family is essential, except for SingleColumnValueFilter.
1. Regarding HBASE-5416, I think its purpose is simple:
"Avoid loading column families that are irrelevant to filtering while scanning." So it can be applied to my 'dummy CF' case. That is, a dummy CF can act as a 'relevant' CF for filtering, provided that HBase can select it while applying a rowkey filter, since a dummy CF has the rowkey data in its 'dummy' KeyValue objects.
2. About rowkey.
What I meant is, I would include the field name as a component when the byte array for a rowkey is constructed.
3. About read-only-ness and the number of CF.
Thank you for your suggestion. But since the MemStore and BlockCache are managed separately for each column family, I'm a little concerned about the memory footprint.
Hi TaeYun, thanks for the explanation. On Thu, Aug 7, 2014 at 12:50 PM, innowireless TaeYun Kim < [EMAIL PROTECTED]> wrote: