Re: Fastest way to find is a row exist?
OK. I have activated them on two of my main tables; I will re-run the
job and see.

2 other questions then ;)

1) I activated them this way: alter 'work_proposed', NAME => '@',
BLOOMFILTER => 'ROW'. How can I remove them?
2) Should I run major_compact to make sure all the hashes are stored?

Thanks,

JM

2013/1/4, Adrien Mogenet <[EMAIL PROTECTED]>:
> On every Get, the Bloom filter acts as a filter (!) on top of each HFile
> and lets HBase check whether a key is absent from that HFile. So yes, you
> will benefit from these filters.
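
To make the "absent from the HFile" point concrete, here is a minimal toy Bloom filter in plain Java. It is only a sketch of the general technique, not HBase's actual implementation: the class name `BloomSketch`, the bit-array size, and the two hash functions are all assumptions chosen for illustration. The property it shows is the one discussed in this thread: a `false` answer means the key was definitely never added, while `true` only means "maybe".

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

/** Toy Bloom filter (illustrative only, not HBase's implementation). */
final class BloomSketch {
    private final BitSet bits;
    private final int size;      // number of bits in the filter
    private final int hashCount; // number of probe positions per key

    BloomSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** FNV-1a, used here as a second, independent-ish hash. */
    private static int fnv1a(byte[] key) {
        int h = 0x811c9dc5;
        for (byte b : key) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }

    /** i-th probe position, derived from two hashes (double hashing). */
    private int position(byte[] key, int i) {
        int h1 = Arrays.hashCode(key);
        int h2 = fnv1a(key);
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(byte[] key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(key, i));
        }
    }

    /** false: key was never added. true: key may have been added. */
    boolean mightContain(byte[] key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomSketch filter = new BloomSketch(1 << 16, 3);
        byte[] present = "row-1".getBytes(StandardCharsets.UTF_8);
        byte[] other = "row-2".getBytes(StandardCharsets.UTF_8);
        filter.add(present);
        System.out.println("row-1 might exist: " + filter.mightContain(present));
        System.out.println("row-2 might exist: " + filter.mightContain(other));
    }
}
```

A key that was added always reports `true`; a key that was not added usually reports `false`, but can occasionally report `true` when its probe positions collide with other keys. In that case the store file still has to be read to be sure.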
>
>
> On Fri, Jan 4, 2013 at 8:58 PM, Jean-Marc Spaggiari <[EMAIL PROTECTED]> wrote:
>
>> Is KeyOnlyFilter using the BloomFilters too?
>>
>> Here is, with more details, what I'm doing.
>>
>> A few questions:
>> - Can I create one single KeyOnlyFilter and give the same filter to
>> all the Gets?
>> - Will Bloom filters help in such a scenario? My key is small, say
>> 128 bytes on average.
>>
>> The goal here is to check about 500 entries at a time to validate
>> whether they already exist or not.
>>
>> In my MR job, I start when I have more than 100K lines to handle, and
>> each line can have up to 1K entries. So it can result in up to 100M
>> Gets... The job initially took 500 minutes to complete. I have added a
>> few pretty good nodes and it's now taking less than 300 minutes. But I
>> would like to get under 100 minutes if I can...
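
For a sense of scale, the figures above can be turned into a required request rate. The small Java sketch below only redoes that arithmetic; the class and method names are made up for illustration, and it assumes the worst case of 100K lines with 1K entries each.

```java
/** Back-of-the-envelope throughput for the numbers quoted in this message. */
final class GetThroughput {

    /** Gets per second needed to finish totalGets within the given minutes. */
    static double getsPerSecond(long totalGets, long minutes) {
        return (double) totalGets / (minutes * 60.0);
    }

    public static void main(String[] args) {
        // Worst case from the message: 100K lines, up to 1K entries per line.
        long totalGets = 100_000L * 1_000L;
        for (long minutes : new long[] {500, 300, 100}) {
            System.out.printf("%d min -> %.0f gets/s%n",
                    minutes, getsPerSecond(totalGets, minutes));
        }
    }
}
```

In round numbers that is about 3,300 Gets per second at 500 minutes, about 5,600 at 300 minutes, and about 16,700 to reach the 100-minute target, so the goal is roughly a threefold increase over the current rate.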
>>
>> Thanks,
>>
>> JM
>>
>>         // Build one Get per entry; KeyOnlyFilter strips the values
>>         // from the results.
>>         Vector<Get> gets_entry_exist = new Vector<Get>();
>>         for (Entry entry : entries.getEntries())
>>         {
>>                 Get entry_exist = new Get(entry.toKey());
>>                 entry_exist.setFilter(new KeyOnlyFilter());
>>                 gets_entry_exist.add(entry_exist);
>>         }
>>
>>         // One batched call; results come back in the same order as
>>         // the Gets.
>>         Result[] result_entry_exist = table_entry.get(gets_entry_exist);
>>
>>         int index = 0;
>>         for (Entry entry : entries.getEntries())
>>         {
>>                 // An empty Result means the row does not exist yet.
>>                 boolean isEmpty = result_entry_exist[index++].isEmpty();
>>                 if (isEmpty)
>>                 {
>>                         // Process here
>>                 }
>>         }
>>
>>
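
The batching pattern in the snippet above (collect keys, issue one batched lookup, then pair results with keys by index) can be sketched without an HBase cluster. In the Java example below, `BatchLookup` is a hypothetical stand-in for the batched `get` call and an in-memory set plays the role of the table; neither is part of the HBase API, and the batch size of 500 simply mirrors the figure mentioned in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class BatchExistence {

    /** Hypothetical stand-in for a batched get: one flag per key, in request order. */
    interface BatchLookup {
        boolean[] exists(List<String> keys);
    }

    /** Returns the keys that the lookup reports as missing, checked batchSize at a time. */
    static List<String> findMissing(List<String> keys, int batchSize, BatchLookup lookup) {
        List<String> missing = new ArrayList<>();
        for (int start = 0; start < keys.size(); start += batchSize) {
            int end = Math.min(start + batchSize, keys.size());
            List<String> batch = keys.subList(start, end);
            boolean[] found = lookup.exists(batch);
            // Results are matched to keys purely by position.
            for (int i = 0; i < batch.size(); i++) {
                if (!found[i]) {
                    missing.add(batch.get(i));
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // In-memory set standing in for the rows already stored in the table.
        Set<String> stored = new HashSet<>(Arrays.asList("a", "c"));
        BatchLookup lookup = batch -> {
            boolean[] flags = new boolean[batch.size()];
            for (int i = 0; i < flags.length; i++) {
                flags[i] = stored.contains(batch.get(i));
            }
            return flags;
        };
        System.out.println(findMissing(Arrays.asList("a", "b", "c", "d"), 500, lookup));
    }
}
```

The sketch relies on the lookup returning exactly one result per requested key, in request order, which is the same assumption the index-based loop in the original snippet makes.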
>> 2013/1/4, Damien Hardy <[EMAIL PROTECTED]>:
>> > Hello Jean-Marc,
>> >
>> > Bloom filters are designed for exactly that.
>> >
>> > But they can only tell you that a row doesn't exist, based on a hash of
>> > the key (not the opposite: two row keys could have the same hash result).
>> >
>> > If you want to be sure the row key exists, you have to search for it in
>> > the HFile (the whole mechanism is transparent with get()).
>> >
>> > There is also a KeyOnlyFilter
>> > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/KeyOnlyFilter.html
>> > which prevents the whole columns of the existing key from being
>> > returned (which could be heavy).
>> >
>> > Cheers,
>> >
>> > --
>> > Damien
>> >
>> >
>> > 2013/1/4 Jean-Marc Spaggiari <[EMAIL PROTECTED]>
>> >
>> >> Hi,
>> >>
>> >> What's the fastest way to know if a row exists?
>> >>
>> >> Today I'm doing this:
>> >>
>> >> Get get_entry_exist = new Get(key).addColumn(CF_DATA, C_DATA);
>> >> Result entry_exist = table_entry.get(get_entry_exist);
>> >>
>> >> But should this be faster?
>> >> Get get_entry_exist = new Get(key);
>> >> Result entry_exist = table_entry.get(get_entry_exist);
>> >>
>> >> There is only one column family (CF) and one column (C) in my table.
>> >>
>> >> Or is there an even faster way?
>> >>
>> >> Also, is there a way to make that even faster? I think BloomFilters
>> >> can help, right?
>> >>
>> >> Thanks,
>> >>
>> >> JM
>> >>
>> >
>>
>
>
>
> --
> Adrien Mogenet
> 06.59.16.64.22
> http://www.mogenet.me
>