HBase, mail # user - How to query by rowKey-infix


Re: How to query by rowKey-infix
Matt Corgan 2012-08-02, 23:09
Also Christian, don't forget you can read all the rows back to the client
and do the filtering there using whatever logic you like.  HBase Filters
can be thought of as an optimization (predicate push-down) over client-side
filtering.  Pulling all the rows over the network will be slower, but I
don't think we know enough about your data or speed requirements to rule it
out.
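The client-side approach Matt describes might look like the sketch below. The row-key layout `userId_dateInMillis_sessionId` and the `_` delimiter are assumptions taken from later in this thread; the HBase scan itself is only shown in comments so the range check stays self-contained.

```java
// Minimal client-side filtering sketch. Row-key layout assumed:
// userId_dateInMillis_sessionId (delimiter and positions from this thread).
class ClientSideFilter {

    // Returns true if the date infix of the row key falls in [startMillis, endMillis).
    static boolean inRange(String rowKey, long startMillis, long endMillis) {
        String[] parts = rowKey.split("_", 3);   // userId, dateInMillis, sessionId
        long date = Long.parseLong(parts[1]);
        return date >= startMillis && date < endMillis;
    }

    // In a real client you would iterate a plain Scan over the table and
    // keep only rows for which inRange(...) is true, e.g.:
    //
    //   for (Result r : htable.getScanner(new Scan())) {
    //       if (inRange(Bytes.toString(r.getRow()), start, end)) { ... }
    //   }
}
```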
On Thu, Aug 2, 2012 at 3:57 PM, Alex Baranau <[EMAIL PROTECTED]> wrote:

> Hi Christian!
>
> Setting aside secondary indexes and assuming you are going with "heavy
> scans", here are two things you can try to make them much faster, if they
> fit your situation.
>
> 1.
>
> > Is there a more elegant way to collect rows within time range X?
> > (Unfortunately, the date attribute is not equal to the timestamp that is
> stored by hbase automatically.)
>
> Can you set the timestamp of the Puts to the one you have in the row key,
> instead of relying on the one HBase sets automatically (the current time)?
> If you can, this will improve reading speed a lot, because you can set a
> time range on the scanner. It depends on how you write your data, of
> course, but I assume you mostly write it in a "time-increasing" manner.
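A sketch of that idea: reuse the time that is already in the row key as the cell timestamp. The key layout is an assumption from this thread, and the HBase calls in the comments use the 0.90.x API mentioned later in the thread (`put.add(family, qualifier, ts, value)` and `scan.setTimeRange(...)`).

```java
// Sketch: derive the cell timestamp from the row key so reads can
// prune by time range. Key layout assumed: userId_dateInMillis_sessionId.
class KeyTimestamp {

    static long timestampFromKey(String rowKey) {
        return Long.parseLong(rowKey.split("_", 3)[1]);
    }

    // Writing (0.90.x-era API):
    //   long ts = timestampFromKey(rowKey);
    //   Put put = new Put(Bytes.toBytes(rowKey));
    //   put.add(family, qualifier, ts, value);     // explicit cell timestamp
    //
    // Reading:
    //   Scan scan = new Scan();
    //   scan.setTimeRange(startMillis, endMillis); // lets HBase skip data
    //                                              // outside the range
}
```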
>
> 2.
>
> If your userId has fixed length, or you can change it so that it has fixed
> length, then you can actually use something like a "wildcard" in the row
> key. There's a way in a Filter implementation to fast-forward to the record
> with a specific row key, skipping many records. This can be used as follows:
> * suppose your userId is 5 characters in length
> * suppose you are scanning for records with time between 2012-08-01
> and 2012-08-08
> * while scanning, when you encounter e.g. key
> "aaaaa_2012-08-09_3jh345j345kjh", where "aaaaa" is the user id, you can
> tell the scanner from your filter to fast-forward to key "aaaab_2012-08-01",
> because you know that all remaining records of user "aaaaa" fall outside
> the interval you need (the time for its records will be >= 2012-08-09).
>
> As of now, I believe you will have to implement a custom filter to do
> that.
> Pointer:
> org.apache.hadoop.hbase.filter.Filter.ReturnCode.SEEK_NEXT_USING_HINT
> I believe I implemented a similar thing some time ago. If this idea works
> for you, I could look for the implementation and share it if that helps. Or
> maybe even simply add it to the HBase codebase.
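The hint-key computation behind that fast-forward might look like this. It is a sketch under the assumptions above: fixed 5-character user ids, `_`-delimited keys, and no carry handling when the id's last character overflows. In a custom filter, this key would be the one returned as the seek hint after signaling `ReturnCode.SEEK_NEXT_USING_HINT`.

```java
// Sketch of the "wildcard" fast-forward: when the current row's date is
// past the range end, jump to the next userId at the range start.
// Assumes fixed-length user ids and '_' delimiters (per this thread);
// does not handle carry when the id's last character overflows.
class SeekHint {

    static String nextHint(String currentKey, String rangeStartDate, int userIdLen) {
        String userId = currentKey.substring(0, userIdLen);
        char[] next = userId.toCharArray();
        next[userIdLen - 1]++;                    // "aaaaa" -> "aaaab"
        return new String(next) + "_" + rangeStartDate;
    }

    // In a custom filter, this key would be returned as the seek hint
    // after the filter returns ReturnCode.SEEK_NEXT_USING_HINT.
}
```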
>
> Hope this helps,
>
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
>
>
> On Thu, Aug 2, 2012 at 8:40 AM, Christian Schäfer <[EMAIL PROTECTED]>
> wrote:
>
> >
> >
> > Excuse my double posting.
> > Here is the complete mail:
> >
> >
> > OK,
> >
> > at first I will try the scans.
> >
> > If that's too slow I will have to upgrade hbase (currently 0.90.4-cdh3u2)
> > to be able to use coprocessors.
> >
> >
> > Currently I'm stuck on the scans because they require two steps
> > (so maybe some kind of filter chaining is required)
> >
> >
> > The key:  userId-dateInMillis-sessionId
> >
> > At first I need to extract dateInMillis with a regex or substring (using
> > special delimiters for the date)
> >
> > Second, the extracted value must be parsed to a Long and set on a
> > RowFilter comparator like this:
> >
> > scan.setFilter(new RowFilter(CompareOp.GREATER_OR_EQUAL, new
> > BinaryComparator(Bytes.toBytes((Long)dateInMillis))));
> >
> > How to chain that?
> > Do I have to write a custom filter?
> > (Would like to avoid that due to deployment)
> >
> > regards
> > Chris
> >
> > ----- Original Message -----
> > From: Michael Segel <[EMAIL PROTECTED]>
> > To: [EMAIL PROTECTED]
> > CC:
> > Sent: Wednesday, August 1, 2012, 13:52
> > Subject: Re: How to query by rowKey-infix
> >
> > Actually with coprocessors you can create a secondary index in short
> > order. Then your cost is going to be 2 fetches. Trying to do a partial
> > table scan will be more expensive.
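A secondary index like the one Michael describes typically just inverts the key order so the date leads. The sketch below shows only the index-key construction; the layout is an assumption based on this thread, and in a coprocessor deployment (available from HBase 0.92) the index row would be written from a RegionObserver's prePut() hook.

```java
// Sketch of a secondary-index row key: invert the primary key so the
// date infix leads, turning time-range queries into a plain range scan
// on the index table. Layout is an assumption based on this thread.
class IndexKey {

    // primary:  userId_dateInMillis_sessionId
    // index:    dateInMillis_userId_sessionId  -> points back to primary row
    static String indexKey(String primaryKey) {
        String[] p = primaryKey.split("_", 3);
        return p[1] + "_" + p[0] + "_" + p[2];
    }

    // With coprocessors (0.92+), a RegionObserver prePut() hook could write
    // this index row on every Put; a query is then two fetches: scan the
    // index table by date range, then Get the matching primary rows.
}
```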
> >
> > On Jul 31, 2012, at 12:41 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: