Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Sqoop >> mail # user >> Incremental lastmodified with external tables


Copy link to this message
-
Incremental lastmodified with external tables
Hello all,

I am working on setting up incremental imports on a table that will update
on a Last Modified Date column as the check column.

In my testing, I have created an external table, and imported sample tables
with the job created thus:

sqoop job --create import_testing_users -- import --connect
"jdbc:sqlserver://11.111.11.11;database=Testing;username=blah;password=blahblah"
Last_Modified

And this seems to be picking up the updates properly.

However, when I update the timestamps in the check column, I end up with
new rows that have the new data, and old rows that still have the old data,
and my SELECT statements are picking up multiple rows. I'm sure I could
select where last_modified = max(last_modified) but it's not ideal.

I've found mention of this as a problem, and have not found a solution
other than to put the new values into a side directory and run sqoop  merge
on the data to flatten it.

I'd like to automate this in a shell script, and was wondering if there is
some better way than to run a merge operation in the script.

Is the merge still the best solution for this situation?

Also, does importing into a external table have an impact on this operation?

Incidentally, I found it necessary to add the "--warehouse-dir
\path\to\file\ --append" argument on this job, whereas the jobs I set up
for a MySQL import (they are --incremental append" jobs) did not need it. I
was just wondering if that was a driver difference, or a job configuration
difference?

Thanks,
*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com

 
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB