HBase >> mail # user >> How to efficiently join HBase tables?


Re: How to efficiently join HBase tables?
Thanks everyone for all the helpful insights!

-eran

On Wed, Jun 1, 2011 at 03:41, Jason Rutherglen
<[EMAIL PROTECTED]> wrote:

> > I'd imagine that join operations do not require realtime-ness, and so
> > faster batch jobs using Hive -> frozen HBase files in HDFS could be
> > the optimal way to go?
>
> That would also lessen the load on the (possibly live) RegionServer.
> There's no JIRA for this; I'm tempted to open one.
>
> On Tue, May 31, 2011 at 5:18 PM, Jason Rutherglen
> <[EMAIL PROTECTED]> wrote:
> >> The Hive-HBase integration allows you to create Hive tables that are
> >> backed by HBase
> >
> > In addition, HBase could be made faster for MapReduce jobs if the
> > HFiles could be read directly from HDFS, rather than proxying through
> > the RegionServer.
> >
> > I'd imagine that join operations do not require realtime-ness, and so
> > faster batch jobs using Hive -> frozen HBase files in HDFS could be
> > the optimal way to go?
> >
> > On Tue, May 31, 2011 at 1:41 PM, Patrick Angeles <[EMAIL PROTECTED]> wrote:
> >> On Tue, May 31, 2011 at 3:19 PM, Eran Kutner <[EMAIL PROTECTED]> wrote:
> >>
> >>> For my purposes I don't really need the general case, but even if I did,
> >>> I think it can probably be done more simply.
> >>> The main problem is getting the data from both tables into the same MR
> >>> job without resorting to lookups. So without the theoretical
> >>> MultiTableInputFormat, I could just copy all the data from both tables
> >>> into a temp table, appending the source table name to the row keys to
> >>> make sure there are no conflicts. Once all the data from both tables is
> >>> in the same temp table, run an MR job. For each row, the mapper should
> >>> emit a key composed of all the values of the join fields in that row
> >>> (the value can be emitted as is). This will cause all the rows from both
> >>> tables with the same join-field values to arrive at the reducer
> >>> together. The reducer could then iterate over them and produce the
> >>> Cartesian product as needed.
> >>>
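The reduce-side join described above can be sketched in plain Python. This is only an in-memory illustration of the data flow, not Hadoop or HBase API code; the names `map_row` and `reduce_side_join` and the sample tables are hypothetical. Tagging each row with its source table plays the role of appending the table name to the row key.

```python
from collections import defaultdict
from itertools import product

def map_row(table, row, join_fields):
    """Mapper step: emit (join key, row tagged with its source table)."""
    key = tuple(row[f] for f in join_fields)
    return key, (table, row)

def reduce_side_join(left, right, join_fields):
    # "Shuffle": group the tagged rows of both tables by join key.
    groups = defaultdict(lambda: {"left": [], "right": []})
    for table, rows in (("left", left), ("right", right)):
        for row in rows:
            key, (tag, value) = map_row(table, row, join_fields)
            groups[key][tag].append(value)
    # "Reduce": for each key, emit the Cartesian product of the two sides.
    joined = []
    for key in sorted(groups):
        sides = groups[key]
        for l_row, r_row in product(sides["left"], sides["right"]):
            joined.append({**l_row, **r_row})
    return joined

# Hypothetical sample data: two "tables" joined on uid.
users = [{"uid": 1, "name": "ann"}, {"uid": 2, "name": "bob"}]
orders = [{"uid": 1, "item": "disk"}, {"uid": 1, "item": "ram"},
          {"uid": 3, "item": "cpu"}]
result = reduce_side_join(users, orders, ["uid"])
```

Keys that appear in only one table produce no output, so this behaves like an inner join.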
> >>> I still don't like having to copy all the data into a temp table just
> >>> because I can't feed two tables into the MR job.
> >>>
> >>
> >> Loading the smaller table in memory is called a map join, versus a
> >> reduce-side join (a.k.a. common join). One reason to prefer a map join
> >> is that you avoid the shuffle phase, which potentially involves several
> >> trips to disk for the intermediate records due to spills, plus one trip
> >> over the network to get each intermediate KV pair to the right reducer.
> >> With a map join, everything is local, except for the part where you load
> >> the small table.
> >>
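A map join along these lines might look like the following sketch, again in plain Python rather than real MapReduce code. The function names and sample data are assumptions made for illustration. The small table is loaded into a dictionary once, and the large table is streamed past it with no shuffle step.

```python
def build_lookup(small_table, join_field):
    """Load the small table into memory, indexed by the join field."""
    lookup = {}
    for row in small_table:
        lookup.setdefault(row[join_field], []).append(row)
    return lookup

def map_join(big_table, small_table, join_field):
    # Built once per mapper; this is the only non-local part of the job.
    lookup = build_lookup(small_table, join_field)
    # Stream the big table; each row is joined locally, with no shuffle.
    for row in big_table:
        for match in lookup.get(row[join_field], []):
            yield {**match, **row}

# Hypothetical sample data: a small dimension table and a larger fact table.
countries = [{"cc": "us", "country": "United States"},
             {"cc": "il", "country": "Israel"}]
events = [{"cc": "il", "event": "click"}, {"cc": "us", "event": "view"},
          {"cc": "fr", "event": "click"}]
joined = list(map_join(events, countries, "cc"))
```

The approach only works when the small table fits in each mapper's memory, which is why it is not a general replacement for the reduce-side join.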
> >>
> >>>
> >>> As Jason Rutherglen mentioned above, Hive can do joins. I don't know if
> >>> it can do them for HBase, and it will not suit my needs, but it would be
> >>> interesting to know how it does them, if anyone knows.
> >>>
> >>
> >> The Hive-HBase integration allows you to create Hive tables that are
> >> backed by HBase. You can do joins on those tables (and also with standard
> >> Hive tables). It might be worth trying out in your case, as it lets you
> >> easily see the load characteristics and the job runtime without much
> >> coding investment.
> >>
> >> There are probably some specific optimizations that can be applied to
> >> your situation, but it's hard to say without knowing your use case.
> >>
> >> Regards,
> >>
> >> - Patrick
> >>
> >>
> >>> -eran
> >>>
> >>>
> >>>
> >>> On Tue, May 31, 2011 at 22:02, Ted Dunning <[EMAIL PROTECTED]> wrote:
> >>>
> >>> > The Cartesian product often makes an honest-to-god join not such a
> >>> > good idea on large data.  The common alternative is co-group,
> >>> > which is basically like doing the hard work of the join, but involves
> >>> > stopping just before emitting the Cartesian product.  This allows
> >>> > you to inject whatever cleverness you need at this point.
> >>> >
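To make the co-group idea concrete, here is a small in-memory sketch. The `cogroup` function and the sample rows are hypothetical and not taken from any particular framework. It groups both inputs by key and hands the two row lists to the caller, which decides what to do instead of expanding the product.

```python
from collections import defaultdict

def cogroup(left, right, key_fn):
    """Yield (key, left_rows, right_rows) without forming the product."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[key_fn(row)][0].append(row)
    for row in right:
        groups[key_fn(row)][1].append(row)
    for key in sorted(groups):
        left_rows, right_rows = groups[key]
        yield key, left_rows, right_rows

# Hypothetical sample data with a skewed key "a".
left = [{"k": "a", "v": 1}, {"k": "a", "v": 2}, {"k": "b", "v": 3}]
right = [{"k": "a", "w": 10}, {"k": "a", "w": 20}, {"k": "a", "w": 30}]

# One possible use: measure how large the product would be per key
# before deciding whether to materialize it at all.
product_sizes = {
    key: len(l_rows) * len(r_rows)
    for key, l_rows, r_rows in cogroup(left, right, lambda r: r["k"])
}
```

Here the caller only counts how many joined rows each key would produce; any other per-key logic could be substituted at the same point.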
> >>> > Common kinds of cleverness include down-sampling of problematically