Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> Reflect MySQL updates into Hive

Copy link to this message
Re: Reflect MySQL updates into Hive
My problem is in eliminating the duplicates and only keep the correct data,
any advise please?
On Dec 24, 2012 9:13 PM, "Dean Wampler" <[EMAIL PROTECTED]>

> Looks good, but a few suggestions. If you can eliminate duplicates, etc.
> as you ingest the data into HDFS, that would eliminate a cleansing step.
> Note that if the target directory in HDFS IS the specified location for an
> external Hive table/partition, then there will be no separate step to "load
> in Hive as External Table". It's already there!
> Your "transform data..." is a common pattern; stage "raw" data into a
> location, then use Hive (or Pig) to transform it into the final form and
> INSERT INTO the final Hive table.
> dean
> On Mon, Dec 24, 2012 at 9:34 AM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote:
>> Thanks Dean for the great reply, setting incremental import should be
>> easy, if I partitioned my data how hive will get me the updated rows only
>> considering that the row may have multiple fields that will be updated over
>> time? and how will I manage the tables that based on multiple sources? and
>> do you recommend to import the data to HDFS instead of Hive directly? Won't
>> we have a lot of duplicated records then?
>> Regarding automation we were thinking to use sqoop-job command or crons
>> as you suggested.
>> So, the suggested flow as follows:
>> MySQL ---(Extract / Load)---> HDFS (Table/Year/Month/Day) ---> Load in
>> Hive as External Table ---(Transform Data & Join Tables)--> Save it in Hive
>> tables for reporting.
>> Correct?
>> Appreciated.
>> --
>> Ibrahim
>> On Mon, Dec 24, 2012 at 5:51 PM, Dean Wampler <
>> [EMAIL PROTECTED]> wrote:
>>> This is not as hard as it sounds. The hardest part is setting up the
>>> incremental query against your MySQL database. Then you can write the
>>> results to new files in the HDFS directory for the table and Hive will see
>>> them immediately. Yes, even though Hive doesn't support updates, it doesn't
>>> care how many files are in the directory. The trick is to avoid lots of
>>> little files.
>>> As others have suggested, you should consider partitioning the data,
>>> perhaps by time. Say you import about a few HDFS blocks-worth of data each
>>> day, then use year/month/day partitioning to speed up your Hive queries.
>>> You'll need to add the partitions to the table as you go, but actually, you
>>> can add those once a month, for example, for all partitions. Hive doesn't
>>> care if the partition directories don't exist yet or the directories are
>>> empty. I also recommend using an external table, which gives you more
>>> flexibility on directory layout, etc.
>>> Sqoop might be the easiest tool for importing the data, as it will even
>>> generate a Hive table schema from the original MySQL table. However, that
>>> feature may not be useful in this case, as you already have the table.
>>> I think Oozie is horribly complex to use and overkill for this purpose.
>>> A simple bash script triggered periodically by cron is all you need. If you
>>> aren't using a partitioned table, you have a single sqoop command to run.
>>> If you have partitioned data, you'll also need a hive statement in the
>>> script to create the partition, unless you do those in batch once a month,
>>> etc., etc.
>>> Hope this helps,
>>> dean
>>> On Mon, Dec 24, 2012 at 7:08 AM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote:
>>>> Hi All,
>>>> We are new to hadoop and hive, we are trying to use hive to
>>>> run analytical queries and we are using sqoop to import data into hive, in
>>>> our RDBMS the data updated very frequently and this needs to be reflected
>>>> to hive. Hive does not support update/delete but there are many workarounds
>>>> to do this task.
>>>> What's in our mind is importing all the tables into hive as is, then we
>>>> build the required tables for reporting.
>>>> My questions are:
>>>>    1. What is the best way to reflect MySQL updates into Hive with