I am working on setting up incremental imports for a table, using a
Last Modified Date timestamp column as the check column.
In my testing, I created an external table and imported sample data
with a job created like this:
sqoop job --create import_testing_users -- import --connect
This seems to be picking up the updates properly.
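For reference, the full job definition looks roughly like the sketch below; the connect string, credentials, table, column, and directory names are placeholders rather than my real values:

```shell
# Sketch of a lastmodified incremental import job.
# All connection details and names below are placeholders.
sqoop job --create import_testing_users -- import \
  --connect jdbc:oracle:thin:@dbhost:1521/mydb \
  --username myuser \
  --password-file /user/me/.password \
  --table USERS \
  --target-dir /data/users \
  --incremental lastmodified \
  --check-column LAST_MODIFIED
```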
However, when I update the timestamps in the check column, I end up with
new rows containing the new data alongside old rows that still hold the
old data, so my SELECT statements return multiple rows per key. I'm sure
I could filter with something like last_modified = max(last_modified),
but that's not ideal.
I've found this mentioned as a known problem, but the only solution I've
seen is to import the new values into a side directory and run sqoop
merge on the data to flatten it.
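The flattening step I have in mind looks roughly like this; the directories, jar, class name, and merge key below are placeholders for illustration:

```shell
# Sketch of flattening an incremental import onto the base data set.
# Paths, jar file, class name, and key column are placeholders.
sqoop merge \
  --new-data /data/users_incr \
  --onto /data/users \
  --target-dir /data/users_merged \
  --jar-file users.jar \
  --class-name USERS \
  --merge-key ID
```

I've also seen that sqoop import can take --merge-key directly alongside --incremental lastmodified, which (if I understand correctly) performs this merge automatically; I'm not certain that applies in my setup.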
I'd like to automate this in a shell script, and was wondering whether
there is a better way than running a merge operation in the script.
Is the merge still the best solution for this situation?
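What I have in mind for the script is roughly the sketch below; the directories, jar, class name, and key column are placeholders, and the DRY_RUN switch (which defaults to printing the commands) is just there so the flow can be inspected without a cluster:

```shell
#!/bin/sh
# Hypothetical automation sketch: run the saved incremental job,
# then flatten the result with sqoop merge. All names are placeholders.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode; execute it otherwise.
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi
}

# 1. Pull new and changed rows with the saved job.
run sqoop job --exec import_testing_users

# 2. Flatten the new rows onto the existing data set.
run sqoop merge \
    --new-data /data/users_incr \
    --onto /data/users \
    --target-dir /data/users_merged \
    --jar-file users.jar \
    --class-name USERS \
    --merge-key ID
```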
Also, does importing into an external table have an impact on this operation?
Incidentally, I found it necessary to add the "--warehouse-dir
\path\to\file\ --append" arguments to this job, whereas the jobs I set
up for a MySQL import (they are "--incremental append" jobs) did not
need them. I was just wondering whether that is a driver difference or a
job-configuration difference.
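For comparison, my MySQL append-mode jobs look roughly like this sketch; again, the connect string, table, and column names are placeholders:

```shell
# Sketch of an append-mode incremental job against MySQL.
# Connection details and names are placeholders.
sqoop job --create import_mysql_users -- import \
  --connect jdbc:mysql://dbhost/mydb \
  --username myuser \
  --password-file /user/me/.password \
  --table users \
  --incremental append \
  --check-column id \
  --last-value 0
```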