HBase, mail # user - intersection of row ids


Re: intersection of row ids
Dave Latham 2011-03-11, 17:23
If the row ids are ordered the same way in both tables and the tables are
within the same order of magnitude in size, I would recommend opening a
scanner on each table, comparing the current row in each scanner, and
advancing whichever scanner is behind.  Whenever you hit a match, output it
and advance both scanners.
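
A rough sketch of that merge-style walk, in Python rather than against the
HBase client API. The two sorted lists are only stand-ins for the scanners,
and the name intersect_sorted is made up for illustration. Python compares
bytes values byte by byte, which I am assuming matches the row ordering in
your tables.

```python
def intersect_sorted(rows_a, rows_b):
    """Yield row ids present in both sorted iterables (merge-style walk)."""
    it_a, it_b = iter(rows_a), iter(rows_b)
    a = next(it_a, None)
    b = next(it_b, None)
    while a is not None and b is not None:
        if a == b:
            # Match: output it and advance both "scanners".
            yield a
            a = next(it_a, None)
            b = next(it_b, None)
        elif a < b:
            # Scanner A is behind, so advance it.
            a = next(it_a, None)
        else:
            # Scanner B is behind, so advance it.
            b = next(it_b, None)


# Hypothetical data standing in for two tables' row ids, already sorted.
table_a = [b"row1", b"row2", b"row3"]
table_b = [b"row2", b"row3", b"row4"]
common = list(intersect_sorted(table_a, table_b))
```

Each table is read exactly once, so the cost is one full scan of each.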

If you need it to run faster, you can move the same approach into a MR job,
where you use TableInputFormat for one scanner and open the other one
manually in each Mapper.
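
To show the shape of the parallel version, here is a small Python simulation,
not real MapReduce code. It assumes each mapper knows the start and stop row
of its split and opens the second scanner over that same key range. The
helpers scan and map_split, and the split boundaries, are invented for the
example.

```python
from bisect import bisect_left


def scan(table, start, stop):
    """Stand-in for a range scan: sorted row ids in [start, stop)."""
    lo = 0 if start is None else bisect_left(table, start)
    hi = len(table) if stop is None else bisect_left(table, stop)
    return table[lo:hi]


def map_split(split_rows, other_table, start, stop):
    """Work of one mapper: merge its split with the same range of the other table."""
    other = scan(other_table, start, stop)  # the manually opened scanner
    matches, i, j = [], 0, 0
    while i < len(split_rows) and j < len(other):
        if split_rows[i] == other[j]:
            matches.append(split_rows[i])
            i += 1
            j += 1
        elif split_rows[i] < other[j]:
            i += 1
        else:
            j += 1
    return matches


# Hypothetical sorted row ids and two key-range splits.
table_a = [b"a1", b"a2", b"m1", b"m2", b"z1"]
table_b = [b"a2", b"m1", b"m3", b"z1"]
splits = [(None, b"m"), (b"m", None)]

result = []
for start, stop in splits:  # in a real job, each split runs in its own mapper
    result.extend(map_split(scan(table_a, start, stop), table_b, start, stop))
```

Because the splits cover disjoint key ranges, the mappers never need to see
each other's rows.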

If one table is orders of magnitude smaller than the other, or the row ids
are formatted differently and not ordered the same way in each table, then
scan the smaller table and issue a get against the larger table for each row.
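
In sketch form (Python again, with a set standing in for the larger table and
a membership test standing in for the get; intersect_by_lookup and to_large_key
are hypothetical names, the latter being where you would translate between the
two row id formats if they differ):

```python
def intersect_by_lookup(small_rows, exists_in_large, to_large_key=lambda row: row):
    """Scan the small table and keep each row whose key is found in the large one."""
    matches = []
    for row in small_rows:
        # One point lookup (a "get") per row of the smaller table.
        if exists_in_large(to_large_key(row)):
            matches.append(row)
    return matches


# Hypothetical data: a large table of 1000 row ids and a small table of 3.
large_table = {b"user:%04d" % i for i in range(1000)}
small_table = [b"user:0007", b"user:0999", b"user:5000"]

common = intersect_by_lookup(small_table, large_table.__contains__)
```

The cost is one scan of the small table plus one lookup per row in it, so the
size and ordering of the larger table do not matter.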

Dave

On Thu, Mar 10, 2011 at 8:08 PM, Vishal Kapoor
<[EMAIL PROTECTED]> wrote:

> Friends,
> how do I best achieve intersection of sets of row ids
> suppose I have two tables with similar row ids
> how can I get the row ids present in one and not in the other?
> does things get better if I have row ids as values in some qualifier/
> qualifier itself?
> I hope the question is not too confusing...
>
> intersection of {1, 2, 3} and {2, 3, 4} is {2, 3}.
> while {1,2,3} are row ids from a table, {2,3,4} may come from other table
> as qualifiers in some row.
>
> thanks,
> Vishal
>