Again, I do understand those two steps, and I do understand that I
have a lot of options of making them run in sequence, but from the
very beginning my point was to avoid having two steps. I want to have
a dataset in the hive warehouse that I could query at any time with
just a hive query without any preliminary imports/queries. So
implementing a custom UDF/InputFormat looks best for now except for
having too many rdbms connections (one connection per mapper as far as
On Sat, Jun 16, 2012 at 6:04 AM, Bejoy KS <[EMAIL PROTECTED]> wrote:
> Hi Ruslan
> The solution Esteban pointed out was
> 1. Import look up data from RDBMS to hdfs/hive (you can fire any adhoc query here). If the data is just a few mbs one or two maps/connections are enough.
> 2. A look up on this smaller data can be achieved in terms of joining that with larger table
> Now since the look up table is small, enable map joins so that the look up table is in the distributed cache and that data is used by map tasks for join.
> The two sequential steps mentioned above can be scheduled using a workflow manager as oozie.
> In simple terms you can place these steps in order in a shell script and just execute the script.
> Bejoy KS
> Sent from handheld, please excuse typos.
> -----Original Message-----
> From: Ruslan Al-Fakikh <[EMAIL PROTECTED]>
> Date: Sat, 16 Jun 2012 04:40:36
> To: <[EMAIL PROTECTED]>
> Reply-To: [EMAIL PROTECTED]
> Subject: Re: Quering RDBMS table in a Hive query
> Hi Esteban,
> Your solution is what I am trying to avoid, having to keep the hdfs
> data up-to-date. I know I can easily schedule a dependency between the
> Sqoop import job and the hive query job and currently we have a
> scheduling tool (opswise) for such things. But what if I just want to
> run an ad hoc query and forget to re-import the lookup data, etc?
> Maybe there is a way to put the Sqoop import as a hook for a
> particular hive table making it run before every query?
> But I understand the problem of having too many connections. I would
> like to have it only once and distribute it over all the mappers in a
> distributed cache or something like it. Isn't there a way for it?
> On Fri, Jun 15, 2012 at 9:43 PM, Esteban Gutierrez <[EMAIL PROTECTED]> wrote:
>> Hi Ruslan,
>> Jan's approach sounds like a good workaround only if you can use the output
>> in a mapjoin, but I don't think it will scale nicely if you have a very
>> large number of tasks since that will translate as DB connections to
>> MySQL. I think a more scalable and reliable way is just to schedule an Oozie
>> workflow to transfer the data from MySQL to HDFS using Sqoop and trigger the
>> Hive query once the transfer was done.
>> Cloudera, Inc.
>> On Fri, Jun 15, 2012 at 10:28 AM, Ruslan Al-Fakikh <[EMAIL PROTECTED]>
>>> Thanks Jan
>>> On Fri, Jun 15, 2012 at 4:35 PM, Jan Dolinár <[EMAIL PROTECTED]> wrote:
>>> > On 6/15/12, Ruslan Al-Fakikh <[EMAIL PROTECTED]> wrote:
>>> >> I didn't know InputFormat and LineReader could help, though I didn't
>>> >> look at them closely. I was thinking about implementing a
>>> >> Table-Generating Function (UDTF) if there is no an already implemented
>>> >> solution.
>>> > Both is possible, InputFormat and/or UD(T)F. It all depends on what
>>> > you need. I actually use both - in Input format I load lists of
>>> > allowed values to check the data and in UDF I query some other
>>> > database for values necessary only in some queries. Generally, I'd use
>>> > InputFormat for situations where all jobs over given table would
>>> > require the additional data from RDBMS. Oppositely, in situations
>>> > where only few jobs out of many requires the RDBMS connection, I would
>>> > use UDF.
>>> > I think that the difference in performance between the two is rather
>>> > small, if any. Also UDF is easier to write, so it might be the "weapon