Re: question about merge-join (or AND operator between columns)
2011/1/9 Jack Levin <[EMAIL PROTECTED]>

> Future wise we plan to have millions of rows, probably across multiple
> regions, even if IO is not a problem, doing millions of filter operations
> does not make much sense.
>

It depends on the selectivity of your photo column. If photos are rare (say
1% of rows have photos), it is wiser to scan only the photo family and then
get the other families for the matching rows. If most rows have photos, a
filtered scan will hit only a small number of mismatches.
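
To make the trade-off concrete, here is a rough sketch in Python. It is a toy in-memory model, not the HBase client API: each column family is a plain dict keyed by row key, and the names build_table, filter_scan and two_phase are made up for illustration. It only counts how many reads each approach does for the example from this thread (1000 text-only rows, one row with both photo and text).

```python
# Toy model only: each column family is a dict {row_key: value}.
# None of these names come from HBase; they are hypothetical helpers.

def build_table(n_text_only):
    """n_text_only rows with text only, plus one last row with photo + text."""
    text = {f"{i:04d}:uid": f"text_{i}" for i in range(n_text_only + 1)}
    photo = {f"{n_text_only:04d}:uid": "URL"}
    return {"text": text, "photo": photo}

def filter_scan(table):
    """Visit every row and keep those that have both families."""
    all_keys = sorted(set(table["text"]) | set(table["photo"]))
    hits, reads = [], 0
    for key in all_keys:
        reads += 1  # one row examined
        if key in table["text"] and key in table["photo"]:
            hits.append(key)
    return hits, reads

def two_phase(table):
    """Scan only the sparse photo family, then do one get per match."""
    hits, reads = [], 0
    for key in sorted(table["photo"]):
        reads += 1  # one photo cell scanned
        reads += 1  # one point get against the text family
        if key in table["text"]:
            hits.append(key)
    return hits, reads

table = build_table(1000)
print(filter_scan(table))
print(two_phase(table))
```

Both approaches return the same row, but in this model the filtered scan examines every row while the two-phase approach touches only the photo cells plus one get per match. Real costs in HBase depend on block layout, caching and compression, so treat the counts as illustrative only.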

But I agree that HBase doesn't have a feature like "first check this family,
and if it has a value, proceed to the others", and in some cases it could be
very useful (for in-place indexing).
>
> -Jack
>
> On Sat, Jan 8, 2011 at 2:54 PM, Andrey Stepachev <[EMAIL PROTECTED]> wrote:
>
> > OK, understood.
> >
> > But did you check whether it is really an issue? I think it is only 1 IO
> > here (especially if compression is used). Do you have big rows?
> >
> >
> >
> > 2011/1/9 Jack Levin <[EMAIL PROTECTED]>
> >
> > > Sorting is not the issue; the data can be located at the beginning,
> > > middle, or end, or any combination thereof.  I only gave the worst-case
> > > scenario as an example. I understand that filtering will produce the
> > > results we want, but at the cost of examining every row and offloading
> > > the AND/join logic to the application.
> > >
> > > -Jack
> > >
> > > On Sat, Jan 8, 2011 at 1:59 PM, Andrey Stepachev <[EMAIL PROTECTED]> wrote:
> > >
> > > > You can read more details on binary sorting here:
> > > >
> > > > http://brunodumon.wordpress.com/2010/02/17/building-indexes-using-hbase-mapping-strings-numbers-and-dates-onto-bytes/
> > > >
> > > > 2011/1/8 Jack Levin <[EMAIL PROTECTED]>
> > > >
> > > > > Basic problem described:
> > > > >
> > > > > A user uploads 1 image and creates some text 10 days ago, then creates
> > > > > 1000 text messages between 9 days ago and today:
> > > > >
> > > > >
> > > > > row key          | fm:type --> value
> > > > >
> > > > >
> > > > > 00days:uid     | type:text --> text_id
> > > > >
> > > > > .
> > > > >
> > > > > .
> > > > >
> > > > > 09days:uid | type:text --> text_id
> > > > >
> > > > >
> > > > > 10days:uid     | type:photo --> URL
> > > > >
> > > > >          | type:text --> text_id
> > > > >
> > > > >
> > > > > We want to skip all the way to the 10days:uid row without reading the
> > > > > 00days:uid - 09days:uid rows. Ideally we do not want to read all 1000
> > > > > entries that have _only_ text. We want to get to the last entry in the
> > > > > most efficient way possible.
> > > > >
> > > > >
> > > > > -Jack
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Sat, Jan 8, 2011 at 11:43 AM, Stack <[EMAIL PROTECTED]> wrote:
> > > > > > Strike that.  This is a Scan, so can't do blooms + filter.  Sorry.
> > > > > > Sounds like a coprocessor then.  You'd have your query 'lean' on the
> > > > > > column that you know has the fewer items and then per item, you'd do
> > > > > > a get inside the coprocessor against the column of many entries.  The
> > > > > > get would go via blooms.
> > > > > >
> > > > > > St.Ack
> > > > > >
> > > > > >
> > > > > > On Sat, Jan 8, 2011 at 11:39 AM, Stack <[EMAIL PROTECTED]> wrote:
> > > > > >> On Sat, Jan 8, 2011 at 11:35 AM, Jack Levin <[EMAIL PROTECTED]> wrote:
> > > > > >>> Yes, we thought about using filters, the issue is, if one family
> > > > > >>> column has 1ml values, and second family column has 10 values at the
> > > > > >>> bottom, we would end up scanning and filtering 99990 records and
> > > > > >>> throwing them away, which seems inefficient.
> > > > > >>
> > > > > >> Blooms+filters?
> > > > > >> St.Ack
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>