-Re: Quering RDBMS table in a Hive query
Ruslan Al-Fakikh 2012-06-16, 00:40
Your solution is what I am trying to avoid, having to keep the hdfs
data up-to-date. I know I can easily schedule a dependency between the
Sqoop import job and the hive query job and currently we have a
scheduling tool (opswise) for such things. But what if I just want to
run an ad hoc query and forget to re-import the lookup data, etc?
Maybe there is a way to put the Sqoop import as a hook for a
particular hive table making it run before every query?
But I understand the problem of having too many connections. I would
like to have it only once and distribute it over all the mappers in a
distributed cache or something like it. Isn't there a way for it?
On Fri, Jun 15, 2012 at 9:43 PM, Esteban Gutierrez <[EMAIL PROTECTED]> wrote:
> Hi Ruslan,
> Jan's approach sounds like a good workaround only if you can use the output
> in a mapjoin, but I don't think it will scale nicely if you have a very
> large number of tasks since that will translate as DB connections to
> MySQL. I think a more scalable and reliable way is just to schedule an Oozie
> workflow to transfer the data from MySQL to HDFS using Sqoop and trigger the
> Hive query once the transfer was done.
> Cloudera, Inc.
> On Fri, Jun 15, 2012 at 10:28 AM, Ruslan Al-Fakikh <[EMAIL PROTECTED]>
>> Thanks Jan
>> On Fri, Jun 15, 2012 at 4:35 PM, Jan Dolinár <[EMAIL PROTECTED]> wrote:
>> > On 6/15/12, Ruslan Al-Fakikh <[EMAIL PROTECTED]> wrote:
>> >> I didn't know InputFormat and LineReader could help, though I didn't
>> >> look at them closely. I was thinking about implementing a
>> >> Table-Generating Function (UDTF) if there is no an already implemented
>> >> solution.
>> > Both is possible, InputFormat and/or UD(T)F. It all depends on what
>> > you need. I actually use both - in Input format I load lists of
>> > allowed values to check the data and in UDF I query some other
>> > database for values necessary only in some queries. Generally, I'd use
>> > InputFormat for situations where all jobs over given table would
>> > require the additional data from RDBMS. Oppositely, in situations
>> > where only few jobs out of many requires the RDBMS connection, I would
>> > use UDF.
>> > I think that the difference in performance between the two is rather
>> > small, if any. Also UDF is easier to write, so it might be the "weapon
>> > of choice", at least if you don't already use custom InputFormat.
>> > Jan