Hello all,

I am working on setting up incremental imports on a table, using a Last
Modified Date column as the check column to detect updates.

In my testing, I have created an external table, and imported sample tables
with the job created thus:

sqoop job --create import_testing_users -- import --connect

And this seems to be picking up the updates properly.
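For reference, here is a sketch of what a full lastmodified-mode job definition of that shape typically looks like. The JDBC URL, username, table, column names, and target directory below are all placeholders, not the values from my actual job:

```shell
# Hypothetical lastmodified-mode incremental import job.
# Connection details and identifiers are placeholders.
sqoop job --create import_testing_users -- import \
  --connect jdbc:oracle:thin:@//db-host:1521/ORCL \
  --username myuser \
  --table USERS \
  --incremental lastmodified \
  --check-column LAST_MODIFIED \
  --target-dir /user/hive/warehouse/users
```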

However, when I update the timestamps in the check column, I end up with
new rows containing the new data alongside old rows that still have the old
data, so my SELECT statements pick up multiple rows per key. I'm sure I
could SELECT WHERE last_modified = MAX(last_modified), but that's not ideal.
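To illustrate the query-side workaround I mean, something like the following picks the latest row per key. The table name (users) and key column (id) are assumptions for the example:

```shell
# Hedged sketch: deduplicate in Hive by joining each row to the
# latest timestamp for its key. Table/column names are placeholders.
hive -e "
SELECT u.*
FROM users u
JOIN (SELECT id, MAX(last_modified) AS max_mod
      FROM users
      GROUP BY id) latest
  ON u.id = latest.id AND u.last_modified = latest.max_mod;
"
```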

I've found mention of this as a known problem, and have not found a
solution other than to import the new values into a side directory and run
sqoop merge on the data to flatten it.

I'd like to automate this in a shell script, and was wondering if there is
some better way than to run a merge operation in the script.
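For concreteness, here is a sketch of the import-then-merge script I have in mind. The directories, jar file, class name, and merge key are assumptions for illustration, not taken from my actual job:

```shell
#!/bin/bash
# Sketch: run the saved incremental job, then flatten old + new
# data with sqoop merge. All paths and identifiers are placeholders.
set -e

NEW_DIR=/user/hive/staging/users_delta
BASE_DIR=/user/hive/warehouse/users
MERGED_DIR=/user/hive/warehouse/users_merged

# Run the saved incremental job; new rows land in the side directory.
sqoop job --exec import_testing_users

# Merge the delta onto the base data, keeping the newest row per key.
sqoop merge \
  --new-data "$NEW_DIR" \
  --onto "$BASE_DIR" \
  --target-dir "$MERGED_DIR" \
  --jar-file users.jar \
  --class-name USERS \
  --merge-key ID

# Swap the merged output in place of the old base data.
hdfs dfs -rm -r "$BASE_DIR"
hdfs dfs -mv "$MERGED_DIR" "$BASE_DIR"
```

If I understand the docs correctly, passing --merge-key directly on a lastmodified import is supposed to trigger this merge automatically, which might remove the need for an explicit merge step at all. Can anyone confirm that?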

Is the merge still the best solution for this situation?

Also, does importing into an external table have an impact on this operation?

Incidentally, I found it necessary to add the "--warehouse-dir
/path/to/file/ --append" arguments on this job, whereas the jobs I set up
for a MySQL import (they are "--incremental append" jobs) did not need
them. I was just wondering if that was a driver difference or a job
configuration difference.

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com
