HBase, mail # user - question about merge-join (or AND operator between columns)


Re: question about merge-join (or AND operator between columns)
Andrey Stepachev 2011-01-10, 22:46
2011/1/9 Jack Levin <[EMAIL PROTECTED]>

> Future wise we plan to have millions of rows, probably across multiple
> regions, even if IO is not a problem, doing millions of filter operations
> does not make much sense.
>

It depends on the selectivity of your photo column. If photos are rare (say,
1% of rows have photos), it is wiser to scan only the photo family and then
fetch the other families for the matching rows. If selectivity is high, you
will have only a small number of mismatches anyway.

But I agree that HBase doesn't have a feature like "first check this family,
and if it has a value, proceed to the others", and in some cases that could be
very useful (for in-place indexing).
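The trade-off described above can be sketched as a plain in-memory simulation (not the HBase client API; the table layout, row keys, and family names are illustrative): when the photo family is sparse, scanning it alone touches far fewer cells than scanning every row and filtering.

```python
# In-memory stand-in for an HBase table: row key -> {family: value}.
# This only simulates the access pattern, not real HBase I/O.

def scan_photo_family_first(table):
    """Pass 1: scan only the sparse 'photo' family.
    Pass 2: fetch the remaining families just for the matching row keys."""
    matching = [key for key, fams in table.items() if "photo" in fams]
    return {key: table[key] for key in matching}

def scan_all_with_filter(table):
    """Scan every row and filter client-side, discarding mismatches."""
    return {key: fams for key, fams in table.items() if "photo" in fams}

# ~1% selectivity: only 1 row out of 100 carries a photo.
table = {f"{i:02d}days:uid": {"text": f"text_{i}"} for i in range(100)}
table["10days:uid"]["photo"] = "http://example.com/img.jpg"

# Both strategies agree on the result; they differ in how much they read.
assert scan_photo_family_first(table) == scan_all_with_filter(table)
```

With 1% selectivity, the first strategy reads roughly a hundredth of the cells the filtering scan does; as selectivity rises, the extra per-match fetch pass stops paying off.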
>
> -Jack
>
> On Sat, Jan 8, 2011 at 2:54 PM, Andrey Stepachev <[EMAIL PROTECTED]> wrote:
>
> > Ok. Understood.
> >
> > But did you check whether it is really an issue? I think it is only 1 IO
> > here (especially if compression is used). Do you have big rows?
> >
> >
> >
> > 2011/1/9 Jack Levin <[EMAIL PROTECTED]>
> >
> > > Sorting is not the issue; the location of the data can be at the
> > > beginning, middle, or end, or any combination thereof.  I only gave the
> > > worst-case scenario as an example.  I understand that filtering will
> > > produce the results we want, but at the cost of examining every row and
> > > offloading the AND/join logic to the application.
> > >
> > > -Jack
> > >
> > > On Sat, Jan 8, 2011 at 1:59 PM, Andrey Stepachev <[EMAIL PROTECTED]>
> > wrote:
> > >
> > > > More details on binary sorting can be found at:
> > > >
> > > > http://brunodumon.wordpress.com/2010/02/17/building-indexes-using-hbase-mapping-strings-numbers-and-dates-onto-bytes/
> > > >
> > > > 2011/1/8 Jack Levin <[EMAIL PROTECTED]>
> > > >
> > > > > Basic problem described:
> > > > >
> > > > > A user uploads 1 image and creates some text 10 days ago, then
> > > > > creates 1000 text messages between 9 days ago and today:
> > > > >
> > > > > row key        | fm:type --> value
> > > > >
> > > > > 00days:uid     | type:text --> text_id
> > > > > .
> > > > > .
> > > > > 09days:uid     | type:text --> text_id
> > > > >
> > > > > 10days:uid     | type:photo --> URL
> > > > >                | type:text --> text_id
> > > > >
> > > > > We want to skip all the way to the 10days:uid row, without reading
> > > > > the 00days:uid - 09days:uid rows.  Ideally we do not want to read
> > > > > all 1000 entries that have _only_ text.  We want to get to the last
> > > > > entry in the most efficient way possible.
> > > > >
> > > > > -Jack
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Sat, Jan 8, 2011 at 11:43 AM, Stack <[EMAIL PROTECTED]> wrote:
> > > > > > Strike that.  This is a Scan, so we can't do blooms + filter.
> > > > > > Sorry.  Sounds like a coprocessor then.  You'd have your query
> > > > > > 'lean' on the column that you know has the fewer items, and then
> > > > > > per item you'd do a get inside the coprocessor against the column
> > > > > > with many entries.  The get would go via blooms.
> > > > > >
> > > > > > St.Ack
> > > > > >
> > > > > >
> > > > > > On Sat, Jan 8, 2011 at 11:39 AM, Stack <[EMAIL PROTECTED]> wrote:
> > > > > >> On Sat, Jan 8, 2011 at 11:35 AM, Jack Levin <[EMAIL PROTECTED]>
> > > > wrote:
> > > > > >>> Yes, we thought about using filters.  The issue is, if one
> > > > > >>> column family has 1M values, and a second column family has 10
> > > > > >>> values at the bottom, we would end up scanning and filtering
> > > > > >>> 999,990 records and throwing them away, which seems inefficient.
> > > > > >>
> > > > > >> Blooms+filters?
> > > > > >> St.Ack
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>
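The coprocessor strategy St.Ack outlines in the quoted thread (iterate the column with fewer entries, and for each key do a point get against the larger column, guarded by a Bloom-filter membership check) can be sketched as a plain-Python simulation. The data, the `set`-based "bloom", and all names here are stand-ins, not real HBase structures; a real Bloom filter also admits false positives, which a `set` does not.

```python
# Stand-in data: the sparse family (photos) and the dense family (texts).
small = {"10days:uid": "URL"}                                  # type:photo
large = {f"{i:02d}days:uid": f"text_{i}" for i in range(11)}   # type:text

# A set used as a Bloom-filter stand-in: cheap membership checks that
# let us skip a disk seek for keys that definitely are not present.
bloom = set(large)

result = {}
for key in small:          # 'lean' on the lesser-populated column
    if key in bloom:       # bloom check before the expensive point get
        result[key] = (small[key], large[key])
```

The work done is proportional to the size of the smaller column rather than the larger one, which is exactly why it avoids the "filter and throw away 999,990 rows" cost discussed above.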