Sqoop, mail # user - Using Sqoop to merge/union databases


Re: Using Sqoop to merge/union databases
Kathleen Ting 2013-08-05, 19:09
Hi Shengjie, in addition to what Abe mentioned, it sounds like you have a
perfect use case for the lastmodified incremental mode.

Internally, a lastmodified import consists of two standalone
MapReduce jobs. The first job imports the delta of changed data,
much as a normal import does, and saves it in a temporary directory
on HDFS. The second job takes both the old and the new data and
merges them into the final output, preserving only the most recently
updated value for each row.
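
To make the merge step concrete, here is a minimal sketch in plain shell. It is not Sqoop's implementation, only an illustration of "keep the last updated value for each row". It assumes both datasets are CSV files laid out as id,last_update_date,payload with timestamps that sort lexicographically; the function name merge_latest and that layout are hypothetical.

```shell
# Hypothetical helper: merge an old snapshot with a delta of changed rows,
# keeping only the most recently updated row for each id.
# Assumed row layout: id,last_update_date,payload
merge_latest() {
  # $1 = old snapshot, $2 = delta of changed rows
  cat "$1" "$2" \
    | LC_ALL=C sort -t, -k1,1 -k2,2r \
    | awk -F, '!seen[$1]++'   # first row per id is the newest after the sort
}
```

Rows are grouped by id and ordered newest first within each group, so keeping the first row per id drops every older version.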

Here's a sample command [1]:
sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table visits \
  --incremental lastmodified \
  --check-column last_update_date \
  --last-value "2013-05-22 01:01:01"

[1] http://shop.oreilly.com/product/0636920029519.do
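
The sample command above hard-codes --last-value, so each later run needs a newer value. One possible way to carry it between runs is sketched below, assuming a local state file; STATE_FILE, next_import_cmd and record_run_start are hypothetical names, and the 1970 default for the first run is also an assumption. The sketch only prints the command rather than running it.

```shell
# Hypothetical state file holding the --last-value for the next run.
STATE_FILE="${STATE_FILE:-./visits.last_value}"

# Print (do not run) the import command for the next incremental run.
next_import_cmd() {
  last_value="1970-01-01 00:00:00"   # assumed default when no run is recorded
  if [ -f "$STATE_FILE" ]; then
    last_value=$(cat "$STATE_FILE")
  fi
  printf 'sqoop import --connect %s --username sqoop --password sqoop --table visits --incremental lastmodified --check-column last_update_date --last-value "%s"\n' \
    "jdbc:mysql://mysql.example.com/sqoop" "$last_value"
}

# Record the time at which a run starts, to be used by the following run.
record_run_start() {
  date '+%Y-%m-%d %H:%M:%S' > "$STATE_FILE"
}
```

This uses the local clock, which may differ from the database server's clock, so treat the recorded value as approximate. The Sqoop user guide also describes saved jobs (sqoop job --create) as a way to have the last value tracked for you.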

Hope this helps,
Kathleen

On Mon, Aug 5, 2013 at 8:40 AM, Abraham Elmahrek <[EMAIL PROTECTED]> wrote:
> Hey There,
>
> Sqoop is capable of performing incremental updates
> (http://sqoop.apache.org/docs/1.4.4/SqoopUserGuide.html#_incremental_imports).
> You can also import into HBase
> (http://sqoop.apache.org/docs/1.4.4/SqoopUserGuide.html#_importing_data_into_hbase).
>
> Sqoop should be able to update a single table for all three databases, but
> you'll need to make sure that the row keys Sqoop generates don't overlap.
> Also, you'll likely have to manage '--last-value' yourself.
>
> I highly recommend testing such a setup first and reporting back with your
> findings!
>
> -Abe
>
>
> On Sat, Aug 3, 2013 at 2:14 PM, shengjie min <[EMAIL PROTECTED]> wrote:
>>
>> Hi All,
>>
>> I asked this question on the HBase mailing list, and people suggested I'd be
>> better off asking it here :) so here I am. I am new to Sqoop and have a use
>> case where a few applications run in house independently, let's say
>> applications A, B, C. Each has its own DB. I want to create an
>> aggregated view over all the databases so that I don't have to jump into
>> different DBs to find the info I need. A simple example: all three
>> applications have a table called "users", the tables are very similar, and
>> I want to union the "users" tables.
>>
>> I've had a look at Sqoop, and it looks like it allows me to move data from
>> databases A, B, C to a single centralised place, e.g. HBase?
>>
>> Ideally, the solution I am looking for needs to do the following:
>>
>> 1. The centralised storage stays up to date reasonably quickly as the
>> original DBs (A, B, C) get updated. I am not looking for a one-time bulk
>> import; I want incremental updates after the initial import.
>> 2. As long as I provide a schema mapping, can A, B, C be imported into a
>> single place, e.g. a single HBase table?
>>
>> Now, my question is:
>>
>> Is Sqoop a suitable tool for this? I was originally considering using
>> MongoDB and writing the periodic/parallel import piece myself. But for now,
>> I am leaning more towards Sqoop since we already have Hadoop running in house.
>> Any advice is highly appreciated!
>>
>> Thanks,
>> Shengjie
>
>