HBase >> mail # user >> question about merge-join (or AND operator between columns)


Re: question about merge-join (or AND operator between columns)
OK, understood.

But have you checked whether it is really an issue? I think there is only 1 IO
here (especially if compression is used). Do you have big rows?

2011/1/9 Jack Levin <[EMAIL PROTECTED]>

> Sorting is not the issue; the location of the data can be at the beginning,
> middle, or end, or any combination thereof.  I only gave the worst-case
> scenario as an example. I understand that filtering will produce the results
> we want, but at the cost of examining every row and offloading the AND/join
> logic to the application.
>
> -Jack
>
> On Sat, Jan 8, 2011 at 1:59 PM, Andrey Stepachev <[EMAIL PROTECTED]> wrote:
>
> > More details on binary sorting you can read
> >
> >
> http://brunodumon.wordpress.com/2010/02/17/building-indexes-using-hbase-mapping-strings-numbers-and-dates-onto-bytes/
> >
> > 2011/1/8 Jack Levin <[EMAIL PROTECTED]>
> >
> > > Basic problem described:
> > >
> > > A user uploads 1 image and creates some text 10 days ago, then creates
> > > 1000 text messages between 9 days ago and today:
> > >
> > > row key      | fm:type --> value
> > > 00days:uid   | type:text --> text_id
> > > .
> > > .
> > > 09days:uid   | type:text --> text_id
> > > 10days:uid   | type:photo --> URL
> > >              | type:text --> text_id
> > >
> > > We want to skip all the way to the 10days:uid row, without reading the
> > > 00days:uid - 09days:uid rows.  Ideally we do not want to read all 1000
> > > entries that have _only_ text.  We want to get to the last entry in the
> > > most efficient way possible.
> > >
> > >
> > > -Jack
> > >
> > >
> > >
> > >
> > > On Sat, Jan 8, 2011 at 11:43 AM, Stack <[EMAIL PROTECTED]> wrote:
> > > > Strike that.  This is a Scan, so can't do blooms + filter.  Sorry.
> > > > Sounds like a coprocessor then.  You'd have your query 'lean' on the
> > > > column that you know has the lesser items and then per item, you'd do
> > > > a get inside the coprocessor against the column of many entries.  The
> > > > get would go via blooms.
> > > >
> > > > St.Ack
> > > >
> > > >
> > > > On Sat, Jan 8, 2011 at 11:39 AM, Stack <[EMAIL PROTECTED]> wrote:
> > > >> On Sat, Jan 8, 2011 at 11:35 AM, Jack Levin <[EMAIL PROTECTED]> wrote:
> > > >>> Yes, we thought about using filters. The issue is: if one column
> > > >>> family has 1M values, and the second column family has 10 values at
> > > >>> the bottom, we would end up scanning and filtering 999,990 records
> > > >>> and throwing them away, which seems inefficient.
> > > >>
> > > >> Blooms+filters?
> > > >> St.Ack
> > > >>
> > > >
> > >
> >
>