HBase >> mail # user >> How to efficiently join HBase tables?


Re: How to efficiently join HBase tables?
Thanks everyone for all the helpful insights!

-eran

On Wed, Jun 1, 2011 at 03:41, Jason Rutherglen <[EMAIL PROTECTED]> wrote:

> > I'd imagine that join operations do not require realtime-ness, and so
> > faster batch jobs using Hive -> frozen HBase files in HDFS could be
> > the optimal way to go?
>
> In addition to lessening the load on the possibly live RegionServer.
> There's no Jira for this; I'm tempted to open one.
>
> On Tue, May 31, 2011 at 5:18 PM, Jason Rutherglen
> <[EMAIL PROTECTED]> wrote:
> >> The Hive-HBase integration allows you to create Hive tables that are
> >> backed by HBase
> >
> > In addition, HBase can be made to go faster for MapReduce jobs if the
> > HFiles could be used directly in HDFS, rather than proxying through
> > the RegionServer.
> >
> > I'd imagine that join operations do not require realtime-ness, and so
> > faster batch jobs using Hive -> frozen HBase files in HDFS could be
> > the optimal way to go?
> >
> > On Tue, May 31, 2011 at 1:41 PM, Patrick Angeles <[EMAIL PROTECTED]> wrote:
> >> On Tue, May 31, 2011 at 3:19 PM, Eran Kutner <[EMAIL PROTECTED]> wrote:
> >>
> >>> For my need I don't really need the general case, but even if I did I
> >>> think it can probably be done more simply.
> >>> The main problem is getting the data from both tables into the same MR
> >>> job without resorting to lookups. So without the theoretical
> >>> MultiTableInputFormat, I could just copy all the data from both tables
> >>> into a temp table, appending the source table name to the row keys to
> >>> make sure there are no conflicts. When all the data from both tables
> >>> is in the same temp table, run an MR job. For each row the mapper
> >>> should emit a key composed of all the values of the join fields in
> >>> that row (the value can be emitted as is). This will cause all the
> >>> rows from both tables with the same join field values to arrive at the
> >>> reducer together. The reducer could then iterate over them and produce
> >>> the Cartesian product as needed.
> >>>
> >>> I still don't like having to copy all the data into a temp table just
> >>> because I can't feed two tables into the MR job.
> >>>
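[A minimal sketch of the reduce-side join Eran describes, simulating the MR map/shuffle/reduce phases in plain Python. The table names, join field, and rows below are invented for illustration:]

```python
from itertools import product

# Rows from two hypothetical source tables, keyed by row key.
users  = {"u1": {"city": "NYC", "name": "Ann"},
          "u2": {"city": "SF",  "name": "Bob"}}
orders = {"o1": {"city": "NYC", "item": "book"},
          "o2": {"city": "NYC", "item": "pen"}}

# Map phase: emit (join-key, (source-table, row)) pairs. Tagging each value
# with its table name plays the role of the row-key prefix in the
# temp-table trick, keeping the two sides distinguishable.
emitted = []
for table, rows in (("users", users), ("orders", orders)):
    for rowkey, row in rows.items():
        emitted.append((row["city"], (table, row)))

# Shuffle phase: group all values by join key, as the framework would.
groups = {}
for key, value in emitted:
    groups.setdefault(key, []).append(value)

# Reduce phase: for each join key, emit the Cartesian product of the rows
# that arrived from each side.
joined = []
for key, values in groups.items():
    left  = [row for table, row in values if table == "users"]
    right = [row for table, row in values if table == "orders"]
    for l, r in product(left, right):
        joined.append((key, l["name"], r["item"]))

print(sorted(joined))  # "SF" has no orders, so it joins to nothing
```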
> >>
> >> Loading the smaller table in memory is called a map join, versus a
> >> reduce-side join (a.k.a. common join). One reason to prefer a map join
> >> is that you avoid the shuffle phase, which potentially involves several
> >> trips to disk for the intermediate records due to spills, and also one
> >> trip through the network to get each intermediate KV pair to the right
> >> reducer. With a map join, everything is local, except for the part
> >> where you load the small table.
> >>
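[The map join Patrick describes can be sketched like this; a pure-Python simulation in which each "mapper" joins its input rows against a small table held entirely in memory. The tables and field names are made up:]

```python
# Hypothetical small table, loaded entirely into memory by every mapper.
small = {"NYC": {"region": "east"},
         "SF":  {"region": "west"}}

# Hypothetical large table, streamed one row at a time through the mapper.
large = [{"city": "NYC", "item": "book"},
         {"city": "SF",  "item": "mug"},
         {"city": "LA",  "item": "pen"}]   # no match in the small table

# Map-side join: each row is joined locally against the in-memory table,
# so no shuffle or reduce phase is needed.
joined = []
for row in large:
    match = small.get(row["city"])
    if match is not None:                  # inner join: drop unmatched rows
        joined.append((row["city"], row["item"], match["region"]))

print(joined)
```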
> >>
> >>>
> >>> As Jason Rutherglen mentioned above, Hive can do joins. I don't know
> >>> if it can do them for HBase, and it will not suit my needs, but it
> >>> would be interesting to know how it does them, if anyone knows.
> >>>
> >>
> >> The Hive-HBase integration allows you to create Hive tables that are
> >> backed by HBase. You can do joins on those tables (and also with
> >> standard Hive tables). It might be worth trying out in your case, as
> >> it lets you easily see the load characteristics and the job runtime
> >> without much coding investment.
> >>
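[A rough illustration of the Hive-HBase integration Patrick mentions, using the HBaseStorageHandler syntax from the Hive wiki. The table names, column families, and join here are invented for illustration:]

```sql
-- Expose a (hypothetical) HBase table "users" as an external Hive table.
CREATE EXTERNAL TABLE hbase_users (rowkey STRING, city STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:city")
TBLPROPERTIES ("hbase.table.name" = "users");

-- Same for a second HBase table "orders".
CREATE EXTERNAL TABLE hbase_orders (rowkey STRING, city STRING, item STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:city,info:item")
TBLPROPERTIES ("hbase.table.name" = "orders");

-- Hive compiles this join into MapReduce jobs over the two HBase tables.
SELECT u.rowkey, o.item
FROM hbase_users u JOIN hbase_orders o ON (u.city = o.city);
```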
> >> There are probably some specific optimizations that can be applied to
> >> your situation, but it's hard to say without knowing your use-case.
> >>
> >> Regards,
> >>
> >> - Patrick
> >>
> >>
> >>> -eran
> >>>
> >>>
> >>>
> >>> On Tue, May 31, 2011 at 22:02, Ted Dunning <[EMAIL PROTECTED]> wrote:
> >>>
> >>> > The Cartesian product often makes an honest-to-god join not such a
> >>> > good idea on large data. The common alternative is co-group, which
> >>> > is basically like doing the hard work of the join but stopping just
> >>> > before emitting the Cartesian product. This allows you to inject
> >>> > whatever cleverness you need at this point.
> >>> >
> >>> > Common kinds of cleverness include down-sampling of problematically
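[The co-group pattern Ted describes can be sketched as follows: group both sides by key, then inject custom logic (here an arbitrary per-key cap, standing in for down-sampling of skewed keys) instead of blindly emitting the full Cartesian product. The data and the cap are invented for illustration:]

```python
from itertools import islice, product

# Two hypothetical inputs, already keyed by the join field.
left  = {"a": [1, 2, 3], "b": [4]}
right = {"a": [10, 20],  "c": [30]}

# Co-group: for each key, collect the bag of values from each side, but do
# NOT expand the Cartesian product yet.
cogrouped = {k: (left.get(k, []), right.get(k, []))
             for k in set(left) | set(right)}

# The caller's "cleverness" goes here -- e.g. capping a skewed key rather
# than emitting all len(l) * len(r) pairs for it.
MAX_PAIRS_PER_KEY = 4          # arbitrary illustrative cap
joined = []
for key, (lvals, rvals) in sorted(cogrouped.items()):
    for l, r in islice(product(lvals, rvals), MAX_PAIRS_PER_KEY):
        joined.append((key, l, r))

print(joined)  # "b" and "c" have an empty side, so they join to nothing
```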