HBase >> mail # user >> Fastest way to find is a row exist?


Re: Fastest way to find is a row exist?
What about HTable.exists()?
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#exists(org.apache.hadoop.hbase.client.Get)

I think that should work if the Get has only the row key.

Mohamed
On Fri, Jan 4, 2013 at 3:17 PM, Adrien Mogenet <[EMAIL PROTECTED]> wrote:

> On every Get, the Bloom filter acts as a filter (!) on top of each HFile
> and lets HBase check whether a key is absent from that HFile. So yes, you
> will benefit from these filters.
>
>
> On Fri, Jan 4, 2013 at 8:58 PM, Jean-Marc Spaggiari <[EMAIL PROTECTED]> wrote:
>
> > Is KeyOnlyFilter using the BloomFilters too?
> >
> > Here is, with more details, what I'm doing.
> >
> > A few questions:
> > - Can I create one single KeyOnlyFilter and give the same filter to
> > all the gets?
> > - Will Bloom filters help in such a scenario? My key is small. Let's
> > say 128 bytes on average.
> >
> > The goal here is to check about 500 entries at a time to validate if
> > they already exist or not.
> >
> > In my MR job, I start when I have more than 100K lines to handle, and
> > each line can have up to 1K entries. So it can result in up to 100M
> > gets... The job initially took 500 minutes to complete. I have added a few
> > pretty good nodes and it's now taking less than 300 minutes. But I
> > would like to get under 100 minutes if I can...
> >
> > Thanks,
> >
> > JM
> >
> >         // Build one Get per entry. KeyOnlyFilter strips the values,
> >         // so only the keys are sent back to the client.
> >         Vector<Get> gets_entry_exist = new Vector<Get>();
> >         for (Entry entry : entries.getEntries())
> >         {
> >                 Get entry_exist = new Get(entry.toKey());
> >                 entry_exist.setFilter(new KeyOnlyFilter());
> >                 gets_entry_exist.add(entry_exist);
> >         }
> >
> >         // Send all the gets as a single batch.
> >         Result[] result_entry_exist = table_entry.get(gets_entry_exist);
> >
> >         // This relies on the results being in the same order as the gets.
> >         int index = 0;
> >         for (Entry entry : entries.getEntries())
> >         {
> >                 boolean isEmpty = result_entry_exist[index++].isEmpty();
> >                 if (isEmpty)
> >                 {
> >                         // Process here
> >                 }
> >         }
> >
> >
> > 2013/1/4, Damien Hardy <[EMAIL PROTECTED]>:
> > > Hello Jean-Marc,
> > >
> > > BloomFilters are just designed for that.
> > >
> > > But they only tell you that a row doesn't exist, based on a hash of
> > > the key (not the opposite: two row keys could have the same hash
> > > result).
> > >
> > > If you want to be sure the row key exists, you have to search for it
> > > in the HFile (the whole mechanism is transparent with the get()).
> > >
> > > There is also a KeyOnlyFilter
> > > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/KeyOnlyFilter.html
> > > which prevents the column values of an existing key from being
> > > returned (which could be heavy).
> > >
> > > Cheers,
> > >
> > > --
> > > Damien
> > >
> > >
> > > 2013/1/4 Jean-Marc Spaggiari <[EMAIL PROTECTED]>
> > >
> > >> Hi,
> > >>
> > >> What's the fastest way to know if a row exists?
> > >>
> > >> Today I'm doing that:
> > >>
> > >> Get get_entry_exist = new Get(key).addColumn(CF_DATA, C_DATA);
> > >> Result entry_exist = table_entry.get(get_entry_exist);
> > >>
> > >> But should this be faster?
> > >> Get get_entry_exist = new Get(key);
> > >> Result entry_exist = table_entry.get(get_entry_exist);
> > >>
> > >> There is only one CF and one C on my table.
> > >>
> > >> Or is there an even faster way?
> > >>
> > >> Also, is there a way to make that even faster? I think BloomFilters
> > >> can help, right?
> > >>
> > >> Thanks,
> > >>
> > >> JM
> > >>
> > >
> >
>
>
>
> --
> Adrien Mogenet
> 06.59.16.64.22
> http://www.mogenet.me
>
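
To illustrate the Bloom filter behaviour Damien and Adrien describe above (a "no" is definitive, a "yes" still has to be confirmed by reading the HFile), here is a minimal, self-contained Java sketch. It is a toy filter and not HBase's implementation: the class name BloomSketch, the bit-array size, the number of hashes and the two hash functions are all assumptions chosen for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

/** Toy Bloom filter for illustration only; NOT how HBase implements it. */
final class BloomSketch {
    private final BitSet bits;
    private final int size;       // number of bits (assumed value, not an HBase default)
    private final int hashCount;  // number of bit positions set per key (assumed value)

    BloomSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** FNV-1a, used here as a second, independent-looking hash. */
    private static int fnv1a(byte[] data) {
        int hash = 0x811c9dc5;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return hash;
    }

    /** Derive the i-th bit position from two base hashes (double hashing). */
    private int position(byte[] key, int i) {
        int h1 = Arrays.hashCode(key);
        int h2 = fnv1a(key);
        return Math.floorMod(h1 + i * h2, size);
    }

    /** Record a row key by setting its bit positions. */
    void add(String rowKey) {
        byte[] key = rowKey.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(key, i));
        }
    }

    /**
     * false: the key was definitely never added.
     * true:  the key MAY have been added; two keys can share the same bits,
     *        so the real lookup is still needed to be sure.
     */
    boolean mightContain(String rowKey) {
        byte[] key = rowKey.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomSketch filter = new BloomSketch(1 << 16, 3);
        filter.add("row-0001");
        System.out.println("row-0001 might exist: " + filter.mightContain("row-0001"));
        System.out.println("row-9999 might exist: " + filter.mightContain("row-9999"));
    }
}
```

A key that was added always answers true, and an empty filter answers false for every key. A true answer for a key that was never added is a false positive, which is why a Get still has to look in the HFile when the filter does not rule the key out.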