HBase, mail # user - Fastest way to find is a row exist?


Jean-Marc Spaggiari 2013-01-04, 15:24
Anton Lyska 2013-01-04, 15:32
Damien Hardy 2013-01-04, 15:35
Jean-Marc Spaggiari 2013-01-04, 19:58
Adrien Mogenet 2013-01-04, 20:17
Mohamed Ibrahim 2013-01-04, 21:04
Jean-Marc Spaggiari 2013-01-05, 13:29
Mohamed Ibrahim 2013-01-05, 14:07
Asaf Mesika 2013-01-06, 20:27

Re: Fastest way to find is a row exist?
Jean-Marc Spaggiari 2013-01-07, 02:14
Finally, I looked at how exists(Get) is implemented and built
exists(List<Get>)... (HBASE-7503)

I will run some benchmarks to compare which is faster: batch(List<Get>) or
exists(List<Get>)... I built it for 0.94 too and will deploy the
updated build on my cluster...

2013/1/6, Asaf Mesika <[EMAIL PROTECTED]>:
> Why not write your own filter class, which you can initialize with a
> set of keys to search for?
> The HTable on the client side will split the keys based on row keys so
> they will be sent to the right regions. There your filter can use the
> SEEK_NEXT_USING_HINT return code to seek efficiently over that set of
> key values.
> This will ensure you do this search in one RPC call.
> Your filter can also transform the KeyValue so that only the row keys
> are returned.
>
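To make the seek-hint idea above concrete, here is a minimal sketch that simulates it with plain Java collections. It does not use the HBase Filter API: the class name SeekHintSketch, the use of String row keys, and the TreeSet standing in for a region's sorted rows are all assumptions for illustration. In a real filter, the "jump" would be expressed by returning SEEK_NEXT_USING_HINT and supplying the next wanted key as the hint.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical stand-in for a server-side filter: walks sorted stored rows
// and jumps ahead to the next wanted key instead of visiting every row.
final class SeekHintSketch {
    /** Number of stored rows the last call actually examined. */
    int rowsExamined = 0;

    List<String> findExisting(NavigableSet<String> storedRows,
                              NavigableSet<String> wantedKeys) {
        List<String> found = new ArrayList<>();
        rowsExamined = 0;
        if (storedRows.isEmpty() || wantedKeys.isEmpty()) {
            return found;
        }
        String current = storedRows.first();
        while (current != null) {
            rowsExamined++;
            // Smallest wanted key that is >= the row we are looking at.
            String hint = wantedKeys.ceiling(current);
            if (hint == null) {
                break; // no wanted keys remain past this point
            }
            if (hint.equals(current)) {
                found.add(current);                      // row exists
                current = storedRows.higher(current);    // move to next row
            } else {
                // "Seek using hint": skip straight to the first stored
                // row that is >= the next wanted key.
                current = storedRows.ceiling(hint);
            }
        }
        return found;
    }

    /** Convenience helper to build a sorted key set. */
    static NavigableSet<String> sortedSetOf(String... keys) {
        NavigableSet<String> set = new TreeSet<>();
        for (String k : keys) {
            set.add(k);
        }
        return set;
    }
}
```

With 1,000 stored rows and three wanted keys, the walk only touches a handful of rows rather than all of them, which is the point of the hint.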
> Sent from my iPad
>
> On 6 Jan 2013, at 05:46, Mohamed Ibrahim <[EMAIL PROTECTED]> wrote:
>
>> Sorry, I didn't notice your email about packing 500 operations before.
>>
>> You might actually benefit from checking with a batch of Gets vs
>> individual exists.
>>
>> Best,
>> Mohamed
>>
>>
>> On Sat, Jan 5, 2013 at 8:29 AM, Jean-Marc Spaggiari
>> <[EMAIL PROTECTED]> wrote:
>>
>>> Hum, very interesting!
>>>
>>> Now, what's the best option? An array of Gets, which will retrieve more
>>> information? Or multiple HTable.exists calls, one by one?
>>>
>>> The best would have been to have an array of Gets passed to
>>> exists... I will see how big a change it is to add that...
>>>
>>> JM
>>>
>>> 2013/1/4, Mohamed Ibrahim <[EMAIL PROTECTED]>:
>>>> What about HTable.exists ??
>>>> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#exists(org.apache.hadoop.hbase.client.Get)
>>>>
>>>> I think that should work if the Get has only the row key.
>>>>
>>>> Mohamed
>>>>
>>>>
>>>> On Fri, Jan 4, 2013 at 3:17 PM, Adrien Mogenet
>>>> <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> On every Get, the Bloom filter acts as a filter (!) on top of each
>>>>> HFile and makes it possible to check whether a key is absent from
>>>>> that HFile. So yes, you will benefit from these filters.
>>>>>
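For readers unfamiliar with why a Bloom filter helps here, the following is a minimal sketch of the data structure itself, not HBase's implementation. The class name SimpleBloomFilter, the seeded FNV-1a style hash, and the bit-array size and hash count are all assumptions chosen for illustration. The key property is the one described above: a "no" answer is definitive (the key is absent, so the file can be skipped), while a "yes" only means "maybe".

```java
import java.util.BitSet;

// Hypothetical, simplified Bloom filter: answers "definitely absent"
// or "possibly present" for a row key.
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    SimpleBloomFilter(int size, int hashCount) {
        if (size <= 0 || hashCount <= 0) {
            throw new IllegalArgumentException("size and hashCount must be positive");
        }
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Seeded FNV-1a style hash mapped into [0, size); illustrative only.
    private int indexFor(byte[] key, int seed) {
        int h = 0x811C9DC5 ^ (seed * 0x9E3779B9);
        for (byte b : key) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return Math.floorMod(h, size);
    }

    /** Records a key by setting one bit per hash function. */
    void add(byte[] key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(indexFor(key, i));
        }
    }

    /**
     * Returns false only if the key was never added (no false negatives).
     * A true result may be a false positive.
     */
    boolean mightContain(byte[] key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(indexFor(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A store file whose filter answers false for a row key does not need to be read at all for that Get, which is where the saving comes from when most of the checked keys do not exist yet.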
>>>>>
>>>>> On Fri, Jan 4, 2013 at 8:58 PM, Jean-Marc Spaggiari
>>>>> <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Is KeyOnlyFilter using the BloomFilters too?
>>>>>>
>>>>>> Here is, with more details, what I'm doing.
>>>>>>
>>>>>> A few questions.
>>>>>> - Can I create one single KeyOnlyFilter and give the same filter to
>>>>>> all the gets?
>>>>>> - Will bloom filters help in such a scenario? My key is small. Let's
>>>>>> say 128 bytes on average.
>>>>>>
>>>>>> The goal here is to check about 500 entries at a time to validate
>>>>>> whether they already exist or not.
>>>>>>
>>>>>> In my MR job, I start when I have more than 100K lines to handle,
>>>>>> and each line can have up to 1K entries. So it can result in up to
>>>>>> 100M gets... The job initially took 500 minutes to complete. I have
>>>>>> added a few pretty good nodes and it's now taking less than 300
>>>>>> minutes. But I would like to get under 100 minutes if I can...
>>>>>>
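As a rough illustration of the numbers above: 100K lines times up to 1K entries is up to 100M existence checks, and grouping them 500 at a time brings that down to about 200K batched calls. The sketch below shows the batching and the "keep only the missing keys" step in plain Java. It assumes a hypothetical existsBatch function that returns one boolean per key, in the same order as the input; the class and method names are made up for illustration and nothing here calls HBase.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper: split keys into fixed-size batches and collect
// the keys that the (caller-supplied) existence check reports as missing.
final class BatchedExistenceCheck {

    /** Splits items into consecutive batches of at most batchSize. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    /**
     * Runs existsBatch once per batch and returns the keys it reports
     * as absent. existsBatch is assumed to return one flag per key,
     * in input order.
     */
    static <T> List<T> findMissing(List<T> keys, int batchSize,
                                   Function<List<T>, boolean[]> existsBatch) {
        List<T> missing = new ArrayList<>();
        for (List<T> batch : partition(keys, batchSize)) {
            boolean[] exists = existsBatch.apply(batch);
            if (exists.length != batch.size()) {
                throw new IllegalStateException("one result per key expected");
            }
            for (int i = 0; i < batch.size(); i++) {
                if (!exists[i]) {
                    missing.add(batch.get(i));
                }
            }
        }
        return missing;
    }
}
```

In practice the existsBatch argument would wrap whatever bulk call is being benchmarked, for example a list of Gets sent in one request; the sketch only fixes the bookkeeping around it.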
>>>>>> Thanks,
>>>>>>
>>>>>> JM
>>>>>>
>>>>>>        // Build one Get per entry. KeyOnlyFilter drops the values,
>>>>>>        // so only the keys travel back to the client.
>>>>>>        Vector<Get> gets_entry_exist = new Vector<Get>();
>>>>>>        for (Entry entry : entries.getEntries())
>>>>>>        {
>>>>>>                Get entry_exist = new Get(entry.toKey());
>>>>>>                entry_exist.setFilter(new KeyOnlyFilter());
>>>>>>                gets_entry_exist.add(entry_exist);
>>>>>>        }
>>>>>>
>>>>>>        // One batched call; results come back in the same order
>>>>>>        // as the Gets.
>>>>>>        Result[] result_entry_exist = table_entry.get(gets_entry_exist);
>>>>>>
>>>>>>        int index = 0;
>>>>>>        for (Entry entry : entries.getEntries())
>>>>>>        {
>>>>>>                // An empty Result means the row does not exist yet.
>>>>>>                boolean isEmpty = result_entry_exist[index++].isEmpty();
>>>>>>                if (isEmpty)
>>>>>>                {
>>>>>>                        // Process here
>>>>>>                }
>>>>>>        }
>>>>>>
>>>>>>
>>>>>> 2013/1/4, Damien Hardy <[EMAIL PROTECTED]>:
>>>>>>> Hello Jean-Marc,
>>>>>>>
>>>>>>> BloomFilters are just designed for that.
Jean-Marc Spaggiari 2013-01-04, 20:28
Bryan Beaudreault 2013-01-04, 20:45
Jean-Marc Spaggiari 2013-01-04, 20:54