Incremental lastmodified with external tables
Devin Suiter RDX 2014-03-06, 16:54
Hello all,

I am working on setting up incremental imports for a table, using its
last-modified date column as the check column.

In my testing, I have created an external table, and imported sample tables
with the job created thus:

sqoop job --create import_testing_users -- import --connect
"jdbc:sqlserver://11.111.11.11;database=Testing;username=blah;password=blahblah"
[...] --incremental lastmodified --check-column Last_Modified

And this seems to be picking up the updates properly.

However, when I update the timestamps in the check column, I end up with
new rows containing the new data alongside old rows that still contain the
old data, so my SELECT statements pick up multiple rows per record. I'm
sure I could SELECT WHERE last_modified = MAX(last_modified), but that's
not ideal.
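
For illustration, the kind of query I have in mind looks roughly like the
following, assuming a hypothetical Hive table "users_ext" keyed on "id"
(table, key, and column names are just placeholders for the real schema):

  # keep only the newest row per key from the external table
  hive -e "
  SELECT u.*
  FROM users_ext u
  JOIN (SELECT id, MAX(last_modified) AS max_lm
        FROM users_ext GROUP BY id) m
    ON u.id = m.id AND u.last_modified = m.max_lm;"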

I've found mention of this as a problem, and have not found a solution
other than to put the new values into a side directory and run sqoop merge
on the data to flatten it.
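
The flow I'm picturing is roughly the following, where the directory names,
merge key, and record class are placeholders (the jar would come from
sqoop codegen against the real table):

  # run the saved incremental job; new rows land in the side directory
  sqoop job --exec import_testing_users

  # flatten old + new data into one directory, keeping the newest row per key
  sqoop merge \
    --new-data /user/devin/testing_users_incr \
    --onto /user/devin/testing_users \
    --target-dir /user/devin/testing_users_merged \
    --jar-file testing_users.jar \
    --class-name testing_users \
    --merge-key id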

I'd like to automate this in a shell script, and was wondering if there is
some better way than to run a merge operation in the script.

Is the merge still the best solution for this situation?

Also, does importing into an external table have an impact on this operation?

Incidentally, I found it necessary to add "--warehouse-dir
/path/to/file --append" to this job, whereas the jobs I set up for a
MySQL import (they are "--incremental append" jobs) did not need it. I
was just wondering whether that is a driver difference or a job
configuration difference?
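
For comparison, the MySQL jobs look roughly like this (connection details,
table, and column names here are placeholders, not the real values):

  sqoop job --create import_mysql_users -- import \
    --connect "jdbc:mysql://host/Testing" --username blah --password blahblah \
    --table users \
    --incremental append --check-column id --last-value 0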

Thanks,
*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com