-Re: Quering RDBMS table in a Hive query
Esteban Gutierrez 2012-06-15, 17:43
Jan's approach sounds like a good workaround only if you can use the output
in a mapjoin, but I don't think it will scale nicely if you have a very
large number of tasks since that will translate as DB connections to
MySQL. I think a more scalable and reliable way is just to schedule an
Oozie workflow to transfer the data from MySQL to HDFS using Sqoop and
trigger the Hive query once the transfer was done.
On Fri, Jun 15, 2012 at 10:28 AM, Ruslan Al-Fakikh <[EMAIL PROTECTED]>wrote:
> Thanks Jan
> On Fri, Jun 15, 2012 at 4:35 PM, Jan Dolinár <[EMAIL PROTECTED]> wrote:
> > On 6/15/12, Ruslan Al-Fakikh <[EMAIL PROTECTED]> wrote:
> >> I didn't know InputFormat and LineReader could help, though I didn't
> >> look at them closely. I was thinking about implementing a
> >> Table-Generating Function (UDTF) if there is no an already implemented
> >> solution.
> > Both is possible, InputFormat and/or UD(T)F. It all depends on what
> > you need. I actually use both - in Input format I load lists of
> > allowed values to check the data and in UDF I query some other
> > database for values necessary only in some queries. Generally, I'd use
> > InputFormat for situations where all jobs over given table would
> > require the additional data from RDBMS. Oppositely, in situations
> > where only few jobs out of many requires the RDBMS connection, I would
> > use UDF.
> > I think that the difference in performance between the two is rather
> > small, if any. Also UDF is easier to write, so it might be the "weapon
> > of choice", at least if you don't already use custom InputFormat.
> > Jan