-storing custom bloomfilter/BitSet
John 2013-09-19, 22:36
Is there a way to store a custom BitSet for every row and add new bits
while importing? I can't use the built-in bloomfilter because
every column name contains 2 elements.
Here is my scenario:
My table looks like this:
rowKey1 -> cf:<data1,data2>, cf:<data3,data4>, ...
rowKey2 -> cf:<data234,data5>, ...
The column name includes both elements, e.g. data1 and data2.
This setup works for me now, but I am trying to improve it. I'm using the
BulkLoad feature. At first I import a CSV file that looks like this:
ROWKEY COLUMNFAMILY COLUMNAME HASH_INDEX_1 HASH_INDEX_2
rowKey1 cf <data1,data2> 5
rowKey1 cf <data3,data4> 8
For every hash in HASH_INDEX_1/2 I create a new column with the index as its
name, in the column family "bloomfilter1" or "bloomfilter2". I store the
column name as a 4-byte Integer String. For the example above I would store
this: bloomfilter1:5 and bloomfilter2:12. This method works fine, but the
export and the transformation back to a BitSet become very slow if the
bloomfilter is too big (> 1 million). So a better solution would be to store
only the BitSet instead of a 4-byte Integer for every index.
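To illustrate what "store only the BitSet" could look like, here is a minimal sketch using only java.util.BitSet (Java 7+). The class name BitSetCodec and its methods are made up for this example and are not part of HBase. The idea is that the packed byte[] would be written as one cell value per row instead of one column per index. Whether the merge step can be done during a bulk load is exactly the open question.

```java
import java.util.BitSet;

// Hypothetical helper, not an HBase API: packs hash indexes into a BitSet
// and converts it to/from the byte[] that would be stored as a cell value.
final class BitSetCodec {

    // Build a BitSet from the hash indexes of one row (e.g. 5 and 12).
    static BitSet fromIndices(int[] indices) {
        BitSet bits = new BitSet();
        for (int index : indices) {
            bits.set(index);
        }
        return bits;
    }

    // Serialize to a compact little-endian byte array (1 bit per index).
    static byte[] toBytes(BitSet bits) {
        return bits.toByteArray();
    }

    // Restore the BitSet from the stored bytes.
    static BitSet fromBytes(byte[] raw) {
        return BitSet.valueOf(raw);
    }

    // Add new indexes to a previously stored filter; 'stored' may be null
    // if the row has no filter yet.
    static byte[] merge(byte[] stored, int[] newIndices) {
        BitSet bits = (stored == null) ? new BitSet() : BitSet.valueOf(stored);
        for (int index : newIndices) {
            bits.set(index);
        }
        return bits.toByteArray();
    }
}
```

With this layout a filter covering 1,000,000 bit positions takes at most 125,000 bytes in a single value, and reading it back is one BitSet.valueOf call rather than a scan over one column per set bit.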
Does anyone know if it is possible to create this filter while importing the