HBase, mail # user - Fastest way to find is a row exist?


Re: Fastest way to find is a row exist?
Adrien Mogenet 2013-01-04, 20:17
On every Get, the Bloom filter acts as a filter (!) on top of each HFile
and makes it possible to check whether a key is absent from that HFile.
So yes, you will benefit from these filters.
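
To illustrate the idea, here is a toy sketch in plain Java. It is not HBase's actual implementation: the names ToyBloomFilter and ToyStoreFile, the bit-array size and the hashing scheme are all assumptions made up for the example. It only shows the behaviour described above: the filter can say "definitely absent" and skip the search, but a "maybe present" answer still has to be confirmed by looking for the key.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;
import java.util.TreeSet;

/** Toy Bloom filter: answers "definitely absent" or "maybe present". */
class ToyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    ToyBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    private int position(byte[] key, int i) {
        int h1 = Arrays.hashCode(key);
        int h2 = (h1 >>> 16) | 1; // force odd so the step is never zero
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(byte[] key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(key, i));
        }
    }

    /** false means the key was never added; true only means it might have been. */
    boolean mightContain(byte[] key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }
}

/** Hypothetical stand-in for one store file: sorted keys guarded by a Bloom filter. */
class ToyStoreFile {
    private final TreeSet<String> keys = new TreeSet<>();
    private final ToyBloomFilter bloom = new ToyBloomFilter(1 << 16, 3);
    int lookups = 0; // counts how often the (expensive) key search actually ran

    void put(String rowKey) {
        keys.add(rowKey);
        bloom.add(rowKey.getBytes(StandardCharsets.UTF_8));
    }

    boolean exists(String rowKey) {
        if (!bloom.mightContain(rowKey.getBytes(StandardCharsets.UTF_8))) {
            return false; // definitely absent: skip the search entirely
        }
        lookups++;
        return keys.contains(rowKey); // confirm, since the filter can give false positives
    }
}
```

A key that was never put always comes back as not existing, either because the filter rejects it or because the confirming search fails. A key that was put is always found, since this kind of filter has no false negatives.
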
On Fri, Jan 4, 2013 at 8:58 PM, Jean-Marc Spaggiari <[EMAIL PROTECTED]> wrote:

> Is KeyOnlyFilter using the BloomFilters too?
>
> Here is, with more details, what I'm doing.
>
> A few questions.
> - Can I create one single KeyOnlyFilter and give the same filter to
> all the gets?
> - Will bloom filters benefit in such scenario? My key is small. Let's
> say average 128 bytes.
>
> The goal here is to check about 500 entries at a time to validate if
> they already exist or not.
>
> In my MR, I'm starting when I have more than 100K lines to handle, and
> each line can have up to 1K entries. So it can result in up to 100M
> gets... Job took initially 500 minutes to complete. I have added a few
> pretty good nodes and it's now taking less than 300 minutes. But I
> would like to get under 100 minutes if I can...
>
> Thanks,
>
> JM
>
>         // Build one Get per entry; KeyOnlyFilter strips the values from the response.
>         Vector<Get> gets_entry_exist = new Vector<Get>();
>         for (Entry entry : entries.getEntries())
>         {
>                 Get entry_exist = new Get(entry.toKey());
>                 entry_exist.setFilter(new KeyOnlyFilter());
>                 gets_entry_exist.add(entry_exist);
>         }
>
>         // Send all the Gets in one batched call; results come back in request order.
>         Result[] result_entry_exist = table_entry.get(gets_entry_exist);
>
>         int index = 0;
>         for (Entry entry : entries.getEntries())
>         {
>                 // An empty Result means the row does not exist yet.
>                 boolean isEmpty = result_entry_exist[index++].isEmpty();
>                 if (isEmpty)
>                 {
>                         // Process here
>                 }
>         }
>
>
> 2013/1/4, Damien Hardy <[EMAIL PROTECTED]>:
> > Hello Jean-Marc,
> >
> > BloomFilters are just designed for that.
> >
> > But they can only tell you that a row doesn't exist, based on a hash of
> > the key (not the opposite: 2 rowkeys could have the same hash result).
> >
> > If you want to be sure the rowkey exists you have to search for it in the
> > HFile ( the whole mechanism is transparent with the get() ).
> >
> > There is also a KeyOnlyFilter
> > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/KeyOnlyFilter.html
> > that avoids returning the values of all the columns for an existing key
> > (which could be heavy).
> >
> > Cheers,
> >
> > --
> > Damien
> >
> >
> > 2013/1/4 Jean-Marc Spaggiari <[EMAIL PROTECTED]>
> >
> >> Hi,
> >>
> >> What's the fastest way to know if a row exists?
> >>
> >> Today I'm doing that:
> >>
> >> Get get_entry_exist = new Get(key).addColumn(CF_DATA, C_DATA);
> >> Result entry_exist = table_entry.get(get_entry_exist);
> >>
> >> But should this be faster?
> >> Get get_entry_exist = new Get(key);
> >> Result entry_exist = table_entry.get(get_entry_exist);
> >>
> >> There is only one CF and one C on my table.
> >>
> >> Or is there an even faster way?
> >>
> >> Also, is there a way to make that even faster? I think BloomFilters
> >> can help, right?
> >>
> >> Thanks,
> >>
> >> JM
> >>
> >
>
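
For a rough sense of scale, the figures quoted above can be turned into a target throughput. This is only back-of-the-envelope arithmetic under the stated upper bound (100K lines times up to 1K entries, so up to 100M gets, sent about 500 at a time). The class and method names are made up for the example.

```java
/** Back-of-the-envelope numbers for the job described in the quoted message. */
class GetThroughput {
    /** Gets per second needed to finish totalGets within the given number of minutes. */
    static double getsPerSecond(long totalGets, long minutes) {
        return totalGets / (minutes * 60.0);
    }

    /** Number of batched calls when gets are sent batchSize at a time (rounded up). */
    static long batches(long totalGets, long batchSize) {
        return (totalGets + batchSize - 1) / batchSize;
    }

    public static void main(String[] args) {
        long totalGets = 100_000L * 1_000L; // 100K lines x up to 1K entries = 100M gets
        System.out.printf("batches of 500: %d%n", batches(totalGets, 500));
        System.out.printf("at 300 min: %.0f gets/s%n", getsPerSecond(totalGets, 300));
        System.out.printf("at 100 min: %.0f gets/s%n", getsPerSecond(totalGets, 100));
    }
}
```

With these inputs that is 200,000 batched calls, about 5,556 gets per second at 300 minutes, and about 16,667 gets per second to get down to 100 minutes, so the cluster-wide rate has to roughly triple.
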

--
Adrien Mogenet
06.59.16.64.22
http://www.mogenet.me