HBase, mail # user - storing custom bloomfilter/BitSet


storing custom bloomfilter/BitSet
John 2013-09-19, 22:36
Hi,

Is there a way to store a custom BitSet for every row and add new bits to
it while importing? I can't use the built-in bloom filter because every
column name contains two elements.

Here is my scenario:
My table looks like this:
rowKey1 -> cf:<data1,data2>, cf:<data3,data4>, ...
rowKey2 -> cf:<data234,data5>, ...

Each column name contains two elements, e.g. data1 and data2.

This setup works for me now, but I am trying to improve it. I'm using the
BulkLoad feature. First I import a CSV file that looks like this:
ROWKEY     COLUMNFAMILY     COLUMNAME         HASH_INDEX_1     HASH_INDEX_2
rowKey1    cf               <data1,data2>     5                12
rowKey1    cf               <data3,data4>     8                5

For every hash in HASH_INDEX_1/2 I create a new column with the index as
its name, in the column family "bloomfilter1" or "bloomfilter2". I store
the column name as a 4-byte integer. For the first line of the example
above I would store bloomfilter1:5 and bloomfilter2:12. This method works
fine, but the export and the conversion back to a BitSet become very slow
if the bloom filter is too big (> 1 million). So a better solution would
be to store only the BitSet itself instead of a 4-byte integer for every
index.
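
To make the idea concrete, here is a minimal sketch of storing the bits as
a single value instead of one column per index. It only uses
java.util.BitSet; the class name BitSetCell and the methods encode/decode
are made up for illustration, and writing the resulting byte[] into an
HBase cell is not shown.

```java
import java.util.BitSet;

// Hypothetical helper: packs hash indexes into one compact byte[] value.
class BitSetCell {
    // Set one bit per hash index and serialize the result (one bit per index).
    public static byte[] encode(int[] hashIndexes) {
        BitSet bits = new BitSet();
        for (int idx : hashIndexes) {
            bits.set(idx);
        }
        return bits.toByteArray();
    }

    // Rebuild the BitSet from the stored bytes in a single call.
    public static BitSet decode(byte[] cellValue) {
        return BitSet.valueOf(cellValue);
    }

    public static void main(String[] args) {
        // HASH_INDEX_1 values of rowKey1 from the CSV example: 5 and 8.
        byte[] value = encode(new int[] {5, 8});
        BitSet restored = decode(value);
        System.out.println(restored + " stored in " + value.length + " byte(s)");
        // prints: {5, 8} stored in 2 byte(s)
    }
}
```

With this layout, reading the filter back is one decode call per row rather
than a scan over one column per set bit.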

Does anyone know if it is possible to create this filter while importing
the data?
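
As a rough sketch of what "creating the filter while importing" could look
like: collect the hash indexes of all lines that share a row key into two
BitSets and emit one serialized value per filter. The class RowBloomBuilder
and its methods are hypothetical, everything is kept in memory, and the
wiring into the actual bulk load job (wherever all lines of one row are seen
together) is an assumption and is left out.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical accumulator: one pair of BitSets per row key.
class RowBloomBuilder {
    // rowKey -> { bits from HASH_INDEX_1, bits from HASH_INDEX_2 }
    private final Map<String, BitSet[]> perRow = new LinkedHashMap<>();

    // Record the two hash indexes of one imported CSV line.
    public void add(String rowKey, int hashIndex1, int hashIndex2) {
        BitSet[] filters = perRow.computeIfAbsent(
                rowKey, k -> new BitSet[] {new BitSet(), new BitSet()});
        filters[0].set(hashIndex1);
        filters[1].set(hashIndex2);
    }

    // Serialized filter of one row; which = 0 (bloomfilter1) or 1 (bloomfilter2).
    public byte[] cellValue(String rowKey, int which) {
        return perRow.get(rowKey)[which].toByteArray();
    }

    public static void main(String[] args) {
        RowBloomBuilder builder = new RowBloomBuilder();
        // The two CSV lines for rowKey1 from the example above.
        builder.add("rowKey1", 5, 12);
        builder.add("rowKey1", 8, 5);
        System.out.println(BitSet.valueOf(builder.cellValue("rowKey1", 0))); // {5, 8}
        System.out.println(BitSet.valueOf(builder.cellValue("rowKey1", 1))); // {5, 12}
    }
}
```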

thanks