Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive, mail # user - Reflect MySQL updates into Hive


Copy link to this message
-
Re: Reflect MySQL updates into Hive
Ibrahim Yakti 2012-12-24, 15:34
Thanks Dean for the great reply, setting incremental import should be easy,
if I partitioned my data how hive will get me the updated rows only
considering that the row may have multiple fields that will be updated over
time? and how will I manage the tables that based on multiple sources? and
do you recommend to import the data to HDFS instead of Hive directly? Won't
we have a lot of duplicated records then?

Regarding automation we were thinking to use sqoop-job command or crons as
you suggested.

So, the suggested flow as follows:

MySQL ---(Extract / Load)---> HDFS (Table/Year/Month/Day) ---> Load in Hive
as External Table ---(Transform Data & Join Tables)--> Save it in Hive
tables for reporting.
Correct?

Appreciated.
--
Ibrahim
On Mon, Dec 24, 2012 at 5:51 PM, Dean Wampler <
[EMAIL PROTECTED]> wrote:

> This is not as hard as it sounds. The hardest part is setting up the
> incremental query against your MySQL database. Then you can write the
> results to new files in the HDFS directory for the table and Hive will see
> them immediately. Yes, even though Hive doesn't support updates, it doesn't
> care how many files are in the directory. The trick is to avoid lots of
> little files.
>
> As others have suggested, you should consider partitioning the data,
> perhaps by time. Say you import about a few HDFS blocks-worth of data each
> day, then use year/month/day partitioning to speed up your Hive queries.
> You'll need to add the partitions to the table as you go, but actually, you
> can add those once a month, for example, for all partitions. Hive doesn't
> care if the partition directories don't exist yet or the directories are
> empty. I also recommend using an external table, which gives you more
> flexibility on directory layout, etc.
>
> Sqoop might be the easiest tool for importing the data, as it will even
> generate a Hive table schema from the original MySQL table. However, that
> feature may not be useful in this case, as you already have the table.
>
> I think Oozie is horribly complex to use and overkill for this purpose. A
> simple bash script triggered periodically by cron is all you need. If you
> aren't using a partitioned table, you have a single sqoop command to run.
> If you have partitioned data, you'll also need a hive statement in the
> script to create the partition, unless you do those in batch once a month,
> etc., etc.
>
> Hope this helps,
> dean
>
> On Mon, Dec 24, 2012 at 7:08 AM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote:
>
>> Hi All,
>>
>> We are new to hadoop and hive, we are trying to use hive to
>> run analytical queries and we are using sqoop to import data into hive, in
>> our RDBMS the data updated very frequently and this needs to be reflected
>> to hive. Hive does not support update/delete but there are many workarounds
>> to do this task.
>>
>> What's in our mind is importing all the tables into hive as is, then we
>> build the required tables for reporting.
>>
>> My questions are:
>>
>>    1. What is the best way to reflect MySQL updates into Hive with
>>    minimal resources?
>>    2. Is sqoop the right tool to do the ETL?
>>    3. Is Hive the right tool to do this kind of queries or we should
>>    search for alternatives?
>>
>> Any hint will be useful, thanks in advanced.
>>
>> --
>> Ibrahim
>>
>
>
>
> --
> *Dean Wampler, Ph.D.*
> thinkbiganalytics.com
> +1-312-339-1330
>
>