Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
Hive >> mail # user >> Quering RDBMS table in a Hive query

Ruslan Al-Fakikh 2012-06-13, 21:36
Jan Dolinár 2012-06-14, 06:03
Ruslan Al-Fakikh 2012-06-15, 11:57
Jan Dolinár 2012-06-15, 12:35
Ruslan Al-Fakikh 2012-06-15, 17:28
Esteban Gutierrez 2012-06-15, 17:43
Ruslan Al-Fakikh 2012-06-16, 00:40
Copy link to this message
Re: Quering RDBMS table in a Hive query
Hi Ruslan

The solution Esteban pointed out was
1. Import look up data from RDBMS to hdfs/hive (you can fire any adhoc query here). If the data is just a few mbs one or two maps/connections are enough.

2. A look up on this smaller data can be achieved in terms of joining that with larger table

Now since the look up table is small, enable map joins so that the look up table is in the distributed cache and that data is used by map tasks for join.

The two sequential steps mentioned above can be scheduled using a workflow manager as oozie.

In simple terms you can place these steps in order in a shell script and just execute the script.

Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Ruslan Al-Fakikh <[EMAIL PROTECTED]>
Date: Sat, 16 Jun 2012 04:40:36
Subject: Re: Quering RDBMS table in a Hive query

Hi Esteban,

Your solution is what I am trying to avoid, having to keep the hdfs
data up-to-date. I know I can easily schedule a dependency between the
Sqoop import job and the hive query job and currently we have a
scheduling tool (opswise) for such things. But what if I just want to
run an ad hoc query and forget to re-import the lookup data, etc?
Maybe there is a way to put the Sqoop import as a hook for a
particular hive table making it run before every query?
But I understand the problem of having too many connections. I would
like to have it only once and distribute it over all the mappers in a
distributed cache or something like it. Isn't there a way for it?


On Fri, Jun 15, 2012 at 9:43 PM, Esteban Gutierrez <[EMAIL PROTECTED]> wrote:
> Hi Ruslan,
> Jan's approach sounds like a good workaround only if you can use the output
> in a mapjoin, but I don't think it will scale nicely if you have a very
> large number of  tasks since that will translate as  DB connections to
> MySQL. I think a more scalable and reliable way is just to schedule an Oozie
> workflow to transfer the data from MySQL to HDFS using Sqoop and trigger the
> Hive query once the transfer was done.
> cheers!
> esteban.
> --
> Cloudera, Inc.
> On Fri, Jun 15, 2012 at 10:28 AM, Ruslan Al-Fakikh <[EMAIL PROTECTED]>
> wrote:
>> Thanks Jan
>> On Fri, Jun 15, 2012 at 4:35 PM, Jan Dolinár <[EMAIL PROTECTED]> wrote:
>> > On 6/15/12, Ruslan Al-Fakikh <[EMAIL PROTECTED]> wrote:
>> >> I didn't know InputFormat and LineReader could help, though I didn't
>> >> look at them closely. I was thinking about implementing a
>> >> Table-Generating Function (UDTF) if there is no an already implemented
>> >> solution.
>> >
>> > Both is possible, InputFormat and/or UD(T)F. It all depends on what
>> > you need. I actually use both - in Input format I load lists of
>> > allowed values to check the data and in UDF I query some other
>> > database for values necessary only in some queries. Generally, I'd use
>> >  InputFormat for situations where all jobs over given table would
>> > require the additional data from RDBMS. Oppositely, in situations
>> > where only few jobs out of many requires the RDBMS connection, I would
>> > use UDF.
>> >
>> > I think that the difference in performance between the two is rather
>> > small, if any. Also UDF is easier to write, so it might be the "weapon
>> > of choice", at least if you don't already use custom InputFormat.
>> >
>> > Jan
Ruslan Al-Fakikh 2012-06-18, 10:51