HBase >> mail # user >> Fastest way to find is a row exist?


Re: Fastest way to find is a row exist?
Finally, I looked at how exists(Get) is implemented and built
exists(List<Get>)... (HBASE-7503)

I will run some benchmarks to compare which is faster: batch(List<Get>) or
exists(List<Get>)... I built it for 0.94 too and will deploy the
updated build on my cluster...
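
As a rough illustration of why a batched call could beat one-by-one checks, here is a toy model in plain Java. It is not HBase client code: the region layout, the class name RegionBatchingSketch and the helper groupByRegion are all hypothetical. It only mimics the idea, mentioned in the quoted reply below, that the client splits a list of keys by row key so each group goes to the right region.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model (not HBase code): per-key round trips vs. per-region batches. */
class RegionBatchingSketch {

    /**
     * Groups row keys by the region whose start key is the greatest
     * start key less than or equal to the row key.
     */
    static Map<String, List<String>> groupByRegion(
            TreeMap<String, String> regionsByStartKey, List<String> rowKeys) {
        Map<String, List<String>> batches = new TreeMap<>();
        for (String key : rowKeys) {
            // floorEntry picks the region that would hold this key.
            Map.Entry<String, String> region = regionsByStartKey.floorEntry(key);
            batches.computeIfAbsent(region.getValue(), r -> new ArrayList<>()).add(key);
        }
        return batches;
    }

    public static void main(String[] args) {
        // Hypothetical layout: two regions, split at "m".
        TreeMap<String, String> regions = new TreeMap<>();
        regions.put("", "region-1"); // the first region starts at the empty key
        regions.put("m", "region-2");

        List<String> keys = Arrays.asList("apple", "melon", "zebra", "banana");
        Map<String, List<String>> batches = groupByRegion(regions, keys);

        System.out.println("one-by-one round trips: " + keys.size());
        System.out.println("batched round trips:    " + batches.size());
        System.out.println(batches);
    }
}
```

In this model, 500 keys checked one by one cost 500 round trips, while a batched call costs at most one per region touched. Whether the real gain looks like that depends on the cluster, which is what the benchmark should show.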

2013/1/6, Asaf Mesika <[EMAIL PROTECTED]>:
> Why not write your own filter class, which you can initialize with a
> set of keys to search for?
> The HTable on the client side will split the keys based on row keys so
> they will be sent to the right regions. There, your filter can use the
> SEEK_NEXT_USING_HINT return code to seek efficiently to that set of
> keys.
> This will ensure you do this search in one RPC call.
> Your filter can also transform the KeyValue so that only the row keys
> are returned.
>
> Sent from my iPad
>
> On 6 Jan 2013, at 05:46, Mohamed Ibrahim <[EMAIL PROTECTED]> wrote:
>
>> Sorry, I didn't notice your email about packing 500 operations before.
>>
>> You might actually benefit from checking with a batch of Gets vs
>> individual
>> exists.
>>
>> Best,
>> Mohamed
>>
>>
>> On Sat, Jan 5, 2013 at 8:29 AM, Jean-Marc Spaggiari
>> <[EMAIL PROTECTED]> wrote:
>>
>>> Hmm, very interesting!
>>>
>>> Now, what's the best option? An array of Gets, which will retrieve more
>>> information? Or multiple HTable.exists calls, one by one?
>>>
>>> The best would have been to pass an array of Gets to
>>> exists... I will see how big a change it is to add that...
>>>
>>> JM
>>>
>>> 2013/1/4, Mohamed Ibrahim <[EMAIL PROTECTED]>:
>>>> What about HTable.exists?
>>>> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#exists(org.apache.hadoop.hbase.client.Get)
>>>>
>>>> I think that should work if the Get has only the row key.
>>>>
>>>> Mohamed
>>>>
>>>>
>>>> On Fri, Jan 4, 2013 at 3:17 PM, Adrien Mogenet
>>>> <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> On every Get, the Bloom filter acts as a filter (!) on top of each
>>>>> HFile and makes it possible to check whether a key is absent from
>>>>> that HFile. So yes, you will benefit from these filters.
>>>>>
>>>>>
>>>>> On Fri, Jan 4, 2013 at 8:58 PM, Jean-Marc Spaggiari
>>>>> <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Is KeyOnlyFilter using the BloomFilters too?
>>>>>>
>>>>>> Here is, with more details, what I'm doing.
>>>>>>
>>>>>> A few questions.
>>>>>> - Can I create one single KeyOnlyFilter and give the same filter to
>>>>>> all the Gets?
>>>>>> - Will Bloom filters help in such a scenario? My key is small, let's
>>>>>> say 128 bytes on average.
>>>>>>
>>>>>> The goal here is to check about 500 entries at a time to validate
>>>>>> whether they already exist or not.
>>>>>>
>>>>>> In my MR job, I start when I have more than 100K lines to handle,
>>>>>> and each line can have up to 1K entries. So it can result in up to
>>>>>> 100M gets... The job initially took 500 minutes to complete. I have
>>>>>> added a few pretty good nodes and it's now taking less than 300
>>>>>> minutes. But I would like to get under 100 minutes if I can...
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> JM
>>>>>>
>>>>>>        // Build one Get per entry. KeyOnlyFilter drops the values, so
>>>>>>        // only the keys travel back to the client.
>>>>>>        Vector<Get> gets_entry_exist = new Vector<Get>();
>>>>>>        for (Entry entry : entries.getEntries())
>>>>>>        {
>>>>>>                Get entry_exist = new Get(entry.toKey());
>>>>>>                entry_exist.setFilter(new KeyOnlyFilter());
>>>>>>                gets_entry_exist.add(entry_exist);
>>>>>>        }
>>>>>>
>>>>>>        // Results come back in the same order as the Gets.
>>>>>>        Result[] result_entry_exist = table_entry.get(gets_entry_exist);
>>>>>>
>>>>>>        int index = 0;
>>>>>>        for (Entry entry : entries.getEntries())
>>>>>>        {
>>>>>>                // An empty Result means the row does not exist yet.
>>>>>>                boolean isEmpty = result_entry_exist[index++].isEmpty();
>>>>>>                if (isEmpty)
>>>>>>                {
>>>>>>                        // Process here
>>>>>>                }
>>>>>>        }
>>>>>>
>>>>>>
>>>>>> 2013/1/4, Damien Hardy <[EMAIL PROTECTED]>:
>>>>>>> Hello Jean-Marc,
>>>>>>>
>>>>>>> BloomFilters are just designed for that.
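
The Bloom filter point made in the thread (a filter in front of each HFile that can tell when a key is absent) can be illustrated with a toy filter in plain Java. This is a minimal sketch, not HBase's implementation: the class name ToyBloomFilter, the bit-array size and the hash functions are all assumptions chosen for brevity.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

/** Toy Bloom filter (illustration only; not how HBase implements it). */
class ToyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    ToyBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** Derives the i-th bit position from two simple hashes (double hashing). */
    private int position(byte[] key, int i) {
        int h1 = Arrays.hashCode(key);
        int h2 = 0;
        for (byte b : key) {
            h2 = h2 * 131 + (b & 0xff);
        }
        return Math.floorMod(h1 + i * h2, size);
    }

    /** Records a row key by setting hashCount bits. */
    void add(String rowKey) {
        byte[] key = rowKey.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(key, i));
        }
    }

    /** false means "definitely absent"; true only means "possibly present". */
    boolean mightContain(String rowKey) {
        byte[] key = rowKey.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ToyBloomFilter filter = new ToyBloomFilter(1 << 16, 3);
        filter.add("row-1");
        filter.add("row-2");
        System.out.println("row-1 possibly present: " + filter.mightContain("row-1"));
        System.out.println("row-9 possibly present: " + filter.mightContain("row-9"));
    }
}
```

A key that was added always reports "possibly present", so a "definitely absent" answer is safe to act on without reading the file. A key that was never added may occasionally report "possibly present" as well, in which case the file still has to be checked.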